I write a lot here on the challenges that research computing and data teams face, and the challenges that their managers and leaders face in particular. And that’s because I want them to have fewer challenges! I hope that by highlighting the challenges they at least don’t face them unawares, and that as they grow their skills they bring along the next generation of managers and leaders under clearer, less stressfull, and more suppported environments than we found ourselves in.
Which is all well and good, but it can get a little gloomy in here sometimes. I don’t think I spend enough time highlighting successes.
Do you have management or leadership wins you want to share, no matter how small? Just with me, or with the community (anonymously or otherwise)? Let me know - just hit reply and send me an email at [email protected] I think it would be good for everyone here to see what some of those wins look like.
(And let me know if you’d like to chat about those wins, or if you have some challenges you’d like to talk out - always happy to chat with readers.)
I had a really good discussion earlier this week about support with stratgy and the importance of having them aligned with your stakeholders. They gently reminded me that it isn’t everywhere as dire as I made it sound. In a lot of parts of our institutions, particularly if you’re closer to reporting to the CIO or VPO of your research institute than the VPR, there probably is a fair amount of clarity about what good looks like, and what’s next! Even on the VPR side, there may be a lot of support and enthusiasm for discussions about priorities, even if it may be up to you to initiate those conversations.
One of the fantastic and exciting things about research computing and data is that it’s getting unimaginably broad. Which is great! But the point is coming - is already here in some parts of our community - where being a generalist isn’t feasible. We who go into this line of work want to help everyone, but we’re past the one-size-fits-all stage of research computing, where we can’t do everything. That means making decisions, and is why I’m always hammering on topics like strategy and specialization. Making sure your team is pointed in the same direction as your leadership not only takes the weight of those decisions somewhat off your shoulders, but also means you’re supporting efforts the institution is trying to encourage.
Anyway, let’s get straight to it - on with the roundup:
A lot of begin our techincal leadership/management career by us growing with a team - we start off as effectively the senior sysadmin/data scientist/software developer on a team of two or three, and then the team grows. While numbers vary a bit based on the work of the team, below is a pretty typical trajectory with team size:
Obviously, this being the research world, no one talks us through any of that, or the different skills needed at each level.
One big reason why this is a problem is that not only do our skills need to be growing during this evolution, so do our team members’; at each step of this path the team is taking on a greater level of responsibility. Even at that 2-3 person team size, they’d benefit from us coaching them to take on new responsibilities as the team grows to the point where we’re full time managers; and then you need to identify and grow team leads, and eventually find a peer manager.
This blog from lighthouse covers things we’ve talked about before, but it’s a good overview of the process above, covering:
Respectful, healthy conflict is fine, even good, for a team - it’s essential, even, in the storming phase of the team formation cycle. But it does have to be managed, or at least monitored, and acknowledged. There are some points here to work through if the conflict becomes A Big Thing, which needn’t be the case. But even for more quotidian cases, there’s a useful breakdown of types of conflict which are worth keeping in mind (especially since conflict can “look” like one when it’s really another)
And the fact that explicitly calling the conflict out is the first step in moving to a resolution.
Not My Job - Silvia Botros
As a junior IC, one has the comfort of knowing exactly what one’s job is. As the scope of one’s responsibility grows, that stops being true - there’s a lot more to potentially be responsible for. But that doesn’t mean we’re responsible for everything and anything! Botros writes to push back against well-intentioned but dangerously un-nuanced points of view like `There is no “that’s my job” anymore’.
She makes an excellent and useful distinction between glue work and “gap filling”. Glue work is something all managers and technical leads genuinely are responsible for; being the connective tissue which stitches together individuals and teams and external stakeholders to make progress to a common goal. But sometimes there are gaps - missing expertise, absent leadership, no one to take on needed work - and we can’t selflessly fling ourselves into every such breach. That way leads to burnout for us, while the underlying problem remains unfixed.
Botros counsels being vigilant to avoid this, and to stay in contact with your manager and leadership to make sure the work you’re doing lines up with what your organization needs most:
Assessing whether what you are doing day to day needs to be an intentional process, something you and your manager re-assess routinely and compare to your goals and the organization goals. Be very aware of being pulled into projects with no measurable milestones. […]. You should use your experience and influence to shed a light on gaps and risks. But you cannot fix them all.
In our line of work I think this article and its recommendations are important not just at the individual level, but for whole teams. We’re doing our work because we want to advance science. There’s a noble but foolhardy tendency for research computing and data teams to want to jump in and solve any researchers problem even tangentially connected to our remit. But we can’t fill every gap. We advance science best by focussing on where we excel, doing our best work there, identifying gaps elsewhere, and then either flagging them to leadership or connecting researchers to other teams elsewhere. We can’t, and shouldn’t, do everything.
June 2022 database release - Nathan Benaich, Spinout.fyi
Interesting database from spinout.fyi of University spinouts, slightly over half of which were software-based, and so may be of interest to this community. Typical time to spinout was 9-12 months, although some Universities seem to have extremely capable departments for this and get it done in 3 months or less. Lots of breakdowns on negotiated royalties, equity, and the like. Generally people who went through the standard University processes were fairly unhappy with the results.
Story in nature about how it’s increasingly difficult to recruit postdocs:
“This year is hard for me to wrestle with: … we received absolutely zero response from our posting,” one wrote. “The number of applications is 10 times less than 2018-2019,” another wrote.
Is this consistent with what you team is seeing with your researcher clients?
A couple of great biology projects to highlight this week:
The Venerable UniProt project maintains a comprehensive protein sequence and function information. There’s lots of linked information there - the sequences, genes, external data sets, functional information, co-expression informations, etc. The results are a graph of knowledge, and the SPARQL query interface now queries 105.6 billion RDF triples, “the largest graph database free to query on the web”.
Elsewhere, there’s an article in the most recent Nature Scientific Data by Nishimura & Yoshizawa describing a catalogue with over 50,000 prokaryotic (bacterial, etc) genomes from various marine environments (including low-oxygen deep-water habitats and polar regions), spanning 8,466 species clusters in 59 phyla. The data is publicly available.
Both of these data resources are going to be highly valuable, and are cutting edge in different ways, while being based on an enormous amount of data collection and curation work. And both of them required data, software development, and systems work. The highest impact projects are going to be those that span software, systems, and data, which is why avoiding the accidental construction of silo walls between expertise in those areas is so important.
One thing I like about this set of guidelines is that it’s not one-size fits all; there’s three categories of software with increasing requirements. Most software has a brief and lonely existence, and that’s likely doubly true for research software. For one-off software, having it be publicly downloadable somewhere with a clear license is likely more than enough. Ramping up the requirements of software management plans as it becomes more vital to an effort - and, crucially, starts being used by others - makes sense. If anything here I’d say that ramp-up isn’t sharp enough.
Common DB schema change mistakes - Nikolay Samokhvalov, Postgres AI
Samokhvalov, a veteran of over 1,000 data base changes large and small, has an outline (and associated slide deck) of 18 common mistakes. He breaks them into three categories - concurrency-related mistakes when doing the migration, correctness mistakes in applying the changes, and some miscellaneous issues. For Postgres specifically, his company has a tool with an open-source community version to quickly clone Postgres DBs to testing migration steps, which seems like it could be very handy.
Now that exascale is finally done with, what’s next? While research cyberinfrastructure has been sort of hijacked to mean modest numbers of large computers, Reed muses here thoughtfully about the scientific opportunities about edge - lots of intelligent sensors with modest compute. Handling these problems are more technically interesting and scientifically motivated than just building the same systems over and over again but larger.
Intel’s Clear Linux Outpacing Ubuntu 22.04 LTS, Fedora 36 & Other H1’2022 Distros - Michael Larabel, Phoenix
In research computing we tend to recompile a bunch of our stacks anyway, but it’s interesting to see how big a performance difference there still is between linux distros, including on absolutely fundamental things like networking as well as codes we care about (LAMMPS!). Here Larabel runs a number of benchmarks against current out-of-the-box linux distress, and Clear Linux comes tut clearly ahead (interestingly, CentOS Stream also does very well).
Declare early, declare often: why you shouldn’t hesitate to raise an incident - Isaac Seymour, Incident.io
Oops, That Almost Happened - Vanessa Huerta Granda, Jeli
If something hurts, it might be because you’re not doing it often enough. Incident reporting is an important enough skill, and produces useful enough document (for your teams ongoing use and for your clients) that it’s worth going through, even for minor things, as Seymour says. And Huerta Granda even encourages writing up near misses, incidents that didn’t occur.
I think we’ll see more instances of agreements like this - modest sized purchases of locally provided specialized cloud services for very specific purposes - here, for grad students and courses in computer vision and machine learning. The use for course work is particularly interesting, because the teaching part of a University’s mission requires a lot more of a professional IT approach to system stability than the usual research computing requirements can support.
Chip Roadmaps Unfold, Crisscrossing and Interconnecting, at AMD - Timothy Prickett Morgan, The Next Platform
Morgan here goes through the recent AMD roadmap updates and does a typically through job of putting them in context. Between AMD’s existing on-chip fabric, the upcoming Zen x86 cores, the intention to have “APUs” combining x86 cores and GPUs on-die, and the “adaptive silicon” of the FPGA purchase, there’s a lot going on!
The complexity of upcoming CPUs - and we’ve seen this a bit from Intel too - means, I think, we’re reaching the end of one-size-fits-most “general purpose” high-end compute CPUs, and systems. (Exascale was supposed to get us there with “co-design”, but like so much of the exascale project, that was a disappointment). We’ve seen hints of it with some workloads making use of GPUs and some not, and the widely varying I/O requirements of simulation-heavy vs data-heavy workloads; but up to this point, a lot of research computing teams have been able to just buy the current generation of two-socket high-end servers and then just decide about GPUs or no and storage systems.
But even now, AWS has 51 current and immediate past generation instance types for a reason. U Michigan (my go-to example of a relatable university with a well-run research computing and data organization) has a research computing centre has 6 different systems plus cloud systems plus three different storage systems for different use cases. We’re getting to the point of having the mandatory opportunity to specialize thrust upon us, and nimble research computing systems teams are going to do well.
Quantum Algorithm Implementations for Beginners - Abhijith et al, ACM Transactions on Quantum Computing
LANL Publishes Guide to Quantum Computer Programming - HPC Wire
A sizable group from LLNL has released this lovely 90 page introduction to quantum computing, with hands on exercises (based on code samples in Jupyter Notebook for both local simulation and for running on IBM’s real quantum systems) covering 20 different algorithms.
Meta Platforms Hacks CXL Memory Tier into Linux - Timothy Prickett Morgan, Next Platform
TPP: Transparent Page Placement for CXL-Enabled Tiered Memory - Al Maruf et al, arXiv:2206.02878
We’ve talked about CXL here several times, it’s great to see it not just being used but with code upstreamed into the Linux kernel!
Facebook here is continuing long work pooling memory over RDMA using infiniband, to making use of Intel’s CXL protocol 1.0 atop PCIe5. This is going to be much slower than what’s planned for later CXL and PCIe versions, but it’s a great hint at what’s to come.
The idea here is to use CXL memory sitting in a card on a system, available locally (and even remotely, eventually) with a delay like a NUMA hop, but without another CPU around it. The trick then is to decide how to manage that memory without a CPU there - first how to understand what’s being used, and second to decide when and where to migrate memory.
The paper describes a new protocol, TPP, for handling pages motion onto and off of CXL memory, and tests it on a number of workloads. They also report on a tool, Chameleon, for monitoring the memory performance, and run it on some web-services, caching, and data warehouse workloads.
With the hyperscalers taking notice, and with the CXL roadmap fairly concrete for the next couple of years, and with work already being done in the Linux kernel, this is going to be something to keep an eye on in the next couple of years.
HPCWire alerts me to a paper from ISC I hadn’t read about - work, much of which is already integrated into LLVM, for doing OpenMP 5.1’s GPU offloading onto the GPUs of remote nodes, with a proof of concept of farming out tasks to 120 GPUs! I’m a big believer about using standards (OpenMP, standard language parallelism) for making use of accelerators, so this was nice to read about.
A single beaver caused a massive, hours long cell and internet outage in northern British Columbia, Canada.
Great to see that Arm servers have gotten to the point that now we can track down weird multi-core or multi-socket performance bugs on non-x86 systems! Hunting a NUMA performance bug on ARM.
Running Windows NT 4 for MIPS systems using QEMU, for some reason.
Generating true random numbers from bananas, by tracking the radioactive decay of potassium therein.
Animations of arithmetic on elliptic curves, as for cryptography.
I don’t think I know you could group_by, map, or flatten values with jq?
How fast can 1975’s 6502 processor transfer memory? About 57kB/s on the C64, up to 664.2kB/s with a new(!) 6502 motherboard.
Too young for that ARPANET/BITNET/early internet experience? Or just an oldster like me missing it? Welcome to telehack. (Anyone still running any MUDs?)
Decoding weird magic numbers, backwards compatible to to the Burroughs 5700, in old Fortran code.
And that’s it for another week. Let me know what you thought, or if you have anything you’d like to share about the newsletter or management. Just email me or reply to this newsletter if you get it in your inbox.
Have a great weekend, and good luck in the coming week with your research computing team,
Research computing - the intertwined streams of software development, systems, data management and analysis - is much more than technology. It’s teams, it’s communities, it’s product management - it’s people. It’s also one of the most important ways we can be supporting science, scholarship, and R&D today.
So research computing teams are too important to research to be managed poorly. But no one teaches us how to be effective managers and leaders in academia. We have an advantage, though - working in research collaborations have taught us the advanced management skills, but not the basics.
This newsletter focusses on providing new and experienced research computing and data managers the tools they need to be good managers without the stress, and to help their teams achieve great results and grow their careers.