My discussion last week about when research computing and data work is and isn’t a research output in and of itself produced some replies, as the topic often does, that ranged from intrigued to puzzled to grumpy. I’ll write some more next week on this (I even have a diagram sketched out), so people can have some really specific things to be intrigued or puzzled or grumpy with, but I first want to talk about the multi-dimensional aspect of research computing and data efforts.
Here’s the thing - all research computing and data work is advanced research computing.
It might not feel like that. Reading about the cool cutting edge tech things that are being done in industry, or comparing to the huge HPC systems at the DOE labs or at PRACE, it’s easy to feel a bit in a backwater. (I’ve written about this in the past).
But research computing and data efforts are multidimensional objects, with boundaries along each of those dimensions. And trying to be at the frontier, at the cutting edge, of several of those dimensions at once is madness - it’ll never succeed. This is the gist of the choose boring technology talk I keep referring to. You can only spend so many “innovation tokens” at once before you have bought yourself more unknowns, more uncertainty, than the effort can handle. You have to choose a single frontier to push forward at any given moment. We tell undergraduates in their intro labs to change one variable at a time so they can confidently collect results. It’s the same for research computing and data efforts.
For the tech giants, the technology is the frontier they’re aiming to advance, for use on the same web services or data processing problems they’ve been working on for a decade. At the big HPC centres, they’re mostly running the same codes by the same groups solving the same PDEs for the same science problems for the past twenty years, and the focus is figuring out how to handle another factor of a few in computational size for the new system.
There’s nothing wrong with advancing those frontiers - they need advancing! But there’s other dimensions that matter in research than scale and technology. Often the advance is in the form of a novel research question being asked - then you want to assemble super well-understood technology and focus the novelty on the problem itself. Or there’s a novel kind of data that could be used that needs to be handled for a familiar problem. Or a new data source for familiar data type but that now needs to be cleaned in a new way. Or the simulation workflow is just different enough from existing workflows to need careful thought. Or we need to do the same thing as we’ve done before but over federated resources. Or…
In an recent article, “You’re probably on a cutting edge”, Randy Au gives some good indicators that you’re on a cutting edge of something:
One of the above is often - not always, but often - the case in research computing and data work!
But even more importantly, he reminds us that being on the cutting edge is not a goal in and of itself.
Does being on the cutting edge mean anything? In the day-to-day? Absolutely nothing at all. But that’s the whole point. It doesn’t really mean anything.
In our case, the important thing is advancing the research. New technologies, new scale, new code, new hardware, is not an end; at best, it’s a means. At worst, it’s a distraction. And as technologists, it’s a self-indulgent one, at that.
On that note, on to the roundup!
Rubik shares a guideline — a meeting with more than eight people isn’t effective as a decision-making meeting, and so:
When you have more than eight participants, you either need to change the format of the meeting, or you need to restructure the participants (and you usually want to do some deeper work on communication and organizational structure).
There are some good examples of redesigning meetings, either shifting their focus or attendee list. But I think this is an example of the broader point that clarity of what the meeting is for can and should guide everything else about how the meeting is designed. Otherwise it’s very difficult to have the meetings be effective and efficient uses of peoples’ time.
Lucid is a company that sells meeting software and consulting, and they have a pretty nice taxonomy of sixteen meeting types in three broad categories. With the meeting type and purpose in mind, a lot of choices about the meeting (like does it have to be a synchronous meeting) start falling into place. Seeing that you have a lot of people in meeting is one symptom that such clarity is lacking, but it’s not the only one.
Entropy Crushers - Michael Lopp
Lopp talks about the need for project (or, increasingly, “delivery”) managers. I think this is a vital point:
If you’re a lead on a growing team, you have your unique version of this process [project work], and to me, a huge chunk of this work is for the project manager. You already have a project manager and it’s you.
As a team’s efforts grow, the keep-the-ship-moving-forward work grows superlinearly - maybe a bit, maybe a lot, depending on the kind of work and how the team’s organized. At some point there’s roughly a full time employee’s worth of work there - and, if they focus on it and know their craft well, they’ll do it better than you and someone else doing it of the sides of your desks will get it done.
Sometimes, after giving lots of feedback, you still need to have The Talk with a team member about them underperforming and it becoming a serious problem. If you find yourself and your team member in that unpleasant situation, this is a pretty good outline of how to have the conversation with the right balance of firmness and compassion.
Get More Feedback This Week - Laura Tacho
Specific guidance along the lines of what we discussed last week about getting more feedback:
I’m in a new workplace, with a lot going on, and it is super easy to get distracted.
Worse still, I can effortlessly justify to myself being distracted and flitting between different things. “I’m still learning the landscape; it’s important for someone in my role to have a broad view of what’s going on everywhere. Who knows what might come up in that conversation tomorrow?” That’s not even really wrong! But it’s far too easy to let that very comfortable and convenient attitude keep my mind stuck on “scan”, noticing every slack ping and reading every email as it comes in (we have a LOT of emails) and watching every talk being given. All those little dopamine hits are way more entertaining than doing the hard work of buckling down to get something accomplished.
Eyal advocates for an approach that has come up frequently before - timeboxing. Actually blocking off time on your calendar to accomplish something specific. This has the huge advantages of (a) actually setting aside time, and (b) implicitly setting the scope of what you’ll do.
Eyal adds two things, First, it’s easy to start small. Just start with one thing that’s been on your to-do list.
Second, if you doubt the need for this, track what you do through the day in two buckets - reactive work and reflective, pre-planned work. Are you spending as much time in reflective planned work as you thought you were? Maybe you need to start time boxing.
Ten simple rules for building a successful science start-up - Tobias Reichmuth , Collin Y. Ewald, PLOS Comp Bio
Even when there isn’t an actual start-up company involved, research computing software and systems as they mature from proof of concepts and prototypes need to be turned into “products” to have their full impact, and I think we can learn a lot from actual start-up companies coming out of academia to understand that process. Besides - increasingly, real commercial spin-offs that come from research have a research computing and data component. So it’s worth knowing more about how commercialization of academic results work.
Reichmuth and Ewald have been involved in commercialization and productization from multiple points of view, and offer advice that squares with that of many other sources, but in language that’s more academic and less business speak.
If I might paraphrase their 10 rules into six bullet points, I’d summarize them as:
How to Run an Organized Town Hall Meeting - Alexandria Hewko, Fellow
Town Halls are a pretty common format in our discipline, and… they’re often not great. They’re ad-hoc, mostly prepared talks, and so generally not super well-received (and, thus, not generally well-attended). Why bother if you can read the slides and the Q&A afterwards, right? We’re all busy.
Hewko gives some advice for running a town hall which is actually a community event rather than a broadcast from HQ:
How to release 2 years of unfinished code, the “Agile way” - Jonathan Hall
Sometimes you get yourself into a hole and need to find a way out. Hall recommends releasing something that works right away - some teeny change to the last related version - just to practice the (now quite out of date!) release process and give your team a quick win. It could even be a version bump and a change to the docs! Release something, then find a way to maybe fold in a little of the new code, and release again.
I expect the most controversial recommendation is:
Treat the last 2 years worth of code as “warehouse of old parts” That is to say: Don’t try to force the 2 years worth of code into production. Instead, put it to the side, and essentially start over, using an agile approach of building one valuable thing at a time. As you do this, feel free to rummage through the “warehouse of old parts” and pull in bits and pieces that are relevant, of course.
How to Freaking Find Great Developers By Having Them Read Code - Freaking Rectangle
We know code is read more than it’s written, and that debugging, code maintenance, and incremental addition is more common and time consuming than “green field” code development. And yet, the entire software development community tends to vastly over-value writing code from scratch over understanding existing code. That’s true of research software development, too, which famously almost never starts completely from scratch.
Here the article’s author recommends focusing a “coding” interview on reading code. The article describes a process where the candidate reads increasingly complicated code, predicts the outcome of some commented-out line or routine, and then they check their answer with the opportunity given where necessary to “debug” their initial answer.
I don’t think there’s much doubt that this would be a perfectly good tech screening, be closer aligned to actual work than “reverse a linked list” nonsense, and be less stressful than creating code from scratch. I’ve also seen people start to do “reverse code review” where, e.g., the candidate reviews a PR on a sample code base.
Interestingly I find research systems teams by and large are better than research software teams at focussing on understanding existing stuff. You don’t often see people interviewing potential ops staff with mostly questions like “how would you set up a linux server from scratch to do X, Y, Z”, it’s much more about existing systems and debugging.
Maybe relatedly, a lot of people swear by the newly possible practice of watching people code live on, e.g. Twitch streams. Sort of one-way pair programming with a luminary in a field. I get it, in the abstract, but my increasingly aged Gen-X mind just can’t wrap itself around watching people code on Twitch. Has anyone on your team, or one of the trainees you work with, done this and found it useful? What did they find especially helpful?
Here’s a good measure of how important data management is becoming, even in traditional HPC circles - DOE Announces $26M for Data Management and Scientific Data Visualization Research.
Why Atlassian Failed So Hard - Steven J. Vaughan-Nichols, The New Stack
Systems teams have one of two types of reaction to huge embarrasing outages like the (ongoing!) Atlassian fiasco, or the Gitlab one before that. The first, by either novice or wildly overconfident teams, is to scoff and say “they should have just ….”. The second, by teams who have been around the block a few times, is to pause, murmur some thanks under their breath, and to quickly tabletop the scenario on their systems to double check that they could handle it.
Vaughan-Nicols gives a quick overview of the confluence (hah!) of two bad things - an incautiously written script and a lack of partial restore capabilities - which could plausibly lead to heartache at any of a number of places. What’s maybe less understandable is the poor communications and baffling prioritization after the incident.
Why Enzymit Decided to Build its Own On-Prem HPC Infrastructure - Lior Zimmerman
This got cited a lot in the last couple days by people crowing about an on-prem HPC victory, but I have to think that it got linked to more frequently than it got read.
First, this is a story of it being good that they started out on the cloud:
Like many other startups who are short on staff, we made the most of [free cloud credits], but not without architectural blunders that cost us thousands of dollars of credit money.
A flexible environment like the cloud is the right place to make architectural blunders, while you’re sorting out your workloads, and making mistakes there is far less expensive than doing it with servers and networking gear you bought and installed in your machine room. Try some things, adjust, learn, and get things right.
In this case, once they figured out in the cloud how they need to run, they decided they could do it on-prem cheaper. That’s success, and it’s great!
But I think the kicker is the description of the “On-Prem HPC Infratructure” in question:
[…] we decided to purchase three workstations at a total cost of $17k. One is a GPU-based workstation with two RTX 3090s and an Intel i9–12900 CPU, and another two workstations with 16 cores AMD Ryzen 5950X CPUs.
I mean, I am a big fan of cloud for a lot of use cases. But if my compute needs were met by two gamer GPUs and a couple of desktops then I too probably wouldn’t sign a contract with a cloud provider for kit that I could pick up on a quick trip to the neighbourhood Best Buy.
It’s too late to sign up for this round, but you can get on the waitlist for upcoming versions of Ohio Supercomputing Centre’s AI Bootcamp for Cyberinfrastructure Professionals.
These sorts of trainings are needed and getting more so - and it’s not just an AI/data intensive science thing, either. As research computing workflows get more complex - that could be “big data” pipelines of one sort or another, sure, or even increasingly multi-stage simulations that make use of different configurations - they require more of systems and systems teams. I’m seeing a lot of teams being asked to support these more complex use cases and given zero support or guidance.
I hope OSC (which always punches well above its weight!) continues to offer these trainings, and we start seeing more offerings too.
Speaking of, NSF has a call for Training-Based Workflow Development, including for “CI Professionals” (e.g. members of RCD teams) that I hope gets used for similar or related efforts!
Ugh, be careful with your OAuth2 tokens, folks. GitHub has noticed that some of its user tokens issued to two companies who I would have expected to have pretty decent practices, Heroku and Travis-CI, have been used maliciously.
With Aquilla, Google Abandons Ethernet to Outdo Infiniband - Timothy Prickett Morgan, The Next Platform
Aquila: A unified, low-latency fabric for datacenter networks - Dan Gibson et al., USENIX NSDI’22
Really interesting article about a paper and talk given for NSDI by Google.
All of the hyperscalers want high-performance interconnects for AI workloads (which will, as a side effect, benefit HPC use cases). This is Google’s entry, a still-being-prototyped cell(*) network which, as Morgan says, “pods up” units of ~1000 nodes with a specialized layer two networks for both message and RDMA type traffic, with its own dragonfly network, connecting into the larger data centre network at the spine of the CLOS topology (see diagram below). As both Morgan and Glenn Lockwood point out, this is reminiscent of Cray’s 2013 Aries network, which is irritatingly not referred to at all in the paper.
As an aside, I think it’s remarkable how in-character the different approaches to high-speed networking are for the big three cloud providers. Microsoft’s first go is to just provide what customers ask for and buy some Infiniband. AWS does something clever and cheap building on existing standards with their scalable reliable datagrams. And Google says “We’re Google, we’ll invent our own layer two network protocol and build the rest of the stack from scratch”. There’s a lot of pros and cons to each approach!
Since Google famously generally publishes infrastructure work only once it’s been in production for years, this “work in progress” paper makes one wonder if they’re getting tired of being asked when their accelerated networking is coming. Because whenever someone wants to make the point “cloud networking is slow”, they pull up a benchmark run on GCP.
There are several aspects that unite the two custom-made solutions, AWS’s and GCP’s. First is a laser focus not merely on mean latency but on tail latency. If you’re wiring up 1000 nodes, your p99.9 latency matters a lot! One use case Google explicitly calls out is composable infrastructure, and if the latency to a device in your composed ‘node’ varies between 10-500μs, that’s not going to look very local.
The second is to push more smarts into the NIC. In Aquila’s case, they disaggregate the top-of-rack switch, pushing low-radix switches into the NIC hardware itself, combining the two into something they call the TiN (ToR-in-NIC).
I strongly suspect those two trends will continue at the hyperscalers, and we’ll likely be able to benefit from that work in research computing. I’ll follow Google’s Aquila networking with interest.
(*) Google’s pretty consistent about referring to Aquila as cell networking, but I have to say I couldn’t tell you what a distinguishes a cell-based network from a connection-oriented protocol atop a packet-switched network if the cell-based network has variable-size packets and adaptive routing? If someone wants to educate me, I’m all ears!
Oh interesting, a third party hardware maker is planning to sell PCIe cards that would make Optane persistent memory available to non-Intel systems.
A more detailed than usual walk through of automating kickstart for deploying RHEL/CentOS/Rocky/etc updates.
Chrome is a good browser, but watching all its privacy issues laid out at once is a lot.
Julia Evans has a crowd sourced list of new command line tools - replacements and novel.
Ever wanted to use 8 glow in the dark stickers as 1 byte of RAM? Here’s how, treating the refresh cycle of stickers like a very-slow DRAM.
Interval arithmetic with multiranges in Postgres 14.
Using LaTeX for taking linked notes in mathematics.
One of the cool things about teletypes is that they always provided persistent logs of sessions. Here’s a read through of some 1979-1980 transcripts of users logging into The Source, a BBS like service with news and email and stock information, a question forum, and other features.
And that’s it for another week. Let me know what you thought, or if you have anything you’d like to share about the newsletter or management. Just email me or reply to this newsletter if you get it in your inbox.
Have a great weekend, and good luck in the coming week with your research computing team,
Research computing - the intertwined streams of software development, systems, data management and analysis - is much more than technology. It’s teams, it’s communities, it’s product management - it’s people. It’s also one of the most important ways we can be supporting science, scholarship, and R&D today.
So research computing teams are too important to research to be managed poorly. But no one teaches us how to be effective managers and leaders in academia. We have an advantage, though - working in research collaborations have taught us the advanced management skills, but not the basics.
This newsletter focusses on providing new and experienced research computing and data managers the tools they need to be good managers without the stress, and to help their teams achieve great results and grow their careers.