One of the things I wrestle with in this newsletter is covering the range of topics that research computing and data managers need to think about without being overwhelming.
But we have genuinely challenging jobs that span a wide range of concerns. I gave a presentation about this earlier in the year, but I haven’t quite talked about it the same way here.
We have three main areas of concern which we have to make sure we’re paying attention to - the people in our team (individually and collectively), the work outputs/products our team produces, and the processes by which we operate. All of those need careful tending: nurturing growth and development for the future, performing regular maintenance, and dealing with occasional urgent matters. And they’re all essential; people firing on all cylinders but held back by incoherent processes and working on the wrong outputs aren’t going to have the impact they could.
I like a gardening metaphor for this, because it reflects the fact that nurturing growth takes sustained effort, whereas blight and weeds can spread with shocking quickness if allowed to do so (and take much more effort to fight than to prevent). It’s also true that if the teams, processes, and products are healthy and vigorous, they make it hard for a weed to find purchase.
In steady state, we want to be spending most of our time in the top 2/3 of the diagram - nurturing our people and team, maintaining lines of communication, and making sure our work outputs are those that have the most impact. In any given week, we want to make sure we’re spending much of our time on maintenance efforts across all three columns, and putting some effort into targeted nurturing.
If we’re spending a lot of time stamping out weed or blight outbreaks, that’s normally a symptom that there’s not enough maintenance care happening. Sometimes the urgent issues have an external source, of course - we’re being roiled by some externally driven change - but more often they’re internal. And the under-maintenance isn’t necessarily in the column where the outbreak is happening. The process issue may really be an issue of the team not communicating well together, or the problem with product quality may stem from processes.
I spend most of my time writing about the people column, and I think that’s appropriate. People are the area probably least familiar to many of us who were trained in highly technical areas. And strong teams of people can better handle (or fix!) poor processes or work product focus, while excellent processes or products can’t counteract people problems.
In the last year or so I’ve been writing more about products in the sense of strategy - what is the most impactful work we can be doing? What are the most important work products we can be producing, and how do we decide which to drop and which new ones to start on?
But I don’t write nearly enough about processes. They’re important, even though as researchers and technologists we tend to downplay them.
We shouldn’t be creating process for its own sake, of course. But this isn’t about creation so much as ongoing development, just like people and products.
We have countless unwritten processes in our team, and we can’t really improve them in any meaningful way until we document them. This is just like wet lab protocols; if you haven’t recorded the steps you took, how can anyone reproduce them? How can others be onboarded into doing that work? And how can the processes be incrementally made better, or compared to other approaches?
Process is the mechanism by which craft work becomes professionalized, becomes automatable, becomes something that can be handed off. The first few times your team does something, it’ll still be in the mode of searching for a good way of doing things. Once it’s found an initial good way, that’s something that can be usefully turned into a process, and documented with not just the steps taken, but the goal of that process, and why this way works. Checklists work.
On the technical side, we as a community tend to be pretty good about making sure everything’s documented, that there are scripts for automating tasks, etc. But for more people-oriented processes we’re often not great. If we find ourselves saying things like “Francis isn’t here today, and they’re the person who handles that; we’ll have to wait”, that’s a pretty good indication that there are important undocumented processes. Whether it’s a work task or how meetings are run, documented processes bring clarity.
One failure mode is that unless processes are documented alongside why things are done that way and what a good result looks like, they can become ossified.
But when that context is included with the processes, and permission is given to try doing things different ways and see if that’s better, then it becomes a key piece of continuous improvement. There are really mature tools for thinking about, developing, maintaining, and updating processes - many come from the world of project management (or program/portfolio management), and we can learn from or use them.
What do you think - are there areas where too little (or too much) process have hindered your team? Are there project management tools or approaches you use for keeping track of multiple processes “in flight” in your team? How do you handle documenting not just factual knowledge but procedural knowledge in your team? Let me know - just hit reply, or email me at [email protected].
PS: I got a lot of great responses last week about our first RCT interview, with Matthew Smith. Are you interested in being interviewed for RCT? There are lots of particular topics I’d like to hear about - you see them covered in the newsletter - but it’s also valuable for the community to hear from other managers and leaders and learn from their experiences. Don’t hesitate to email me if you’re interested! Just email [email protected].
And now, on to the roundup!
Peer One On Ones: How To Unlock Great Collaboration Across Teams - Lighthouse Blog
One-on-ones are a fundamental tool for developing trusting working relationships with direct reports.
But regular meaningful conversations can always build trust and strengthen lines of communication, regardless of reporting relationship.
Peer one-on-ones are a great and simple way to develop effective professional relationships and communication channels with peers on other teams. Feel free to call them something vague like “check-ins” or “sync-ups” if you prefer.
The Lighthouse blog gives some suggestions - they don’t have to be weekly, for one, even every four weeks or longer is a lot better than nothing. And there’s a great list of possible topics:
(This is a timely article, because I just advised someone who was having some issues with a peer in a different reporting structure, and who wanted to give them feedback. Feedback works much better if there’s already a working relationship and open lines of communication. They will take the feedback more seriously, and you can express the feedback in terms they are more likely to care about, if you’ve already had several working conversations. Since this will be an ongoing working relationship, and the feedback wasn’t time sensitive, I advised starting peer one-on-ones with the colleague, and only after a couple of those, raising the issue.)
Leader as Shock Absorber - Ed Batista
Maybe a couple of decades ago, the idea of a manager or lead as a “sh*t umbrella” became popular; it probably came from a good place, an urge to protect team members from the vagaries of the larger organization.
There’s a pretty widespread understanding now that this just isn’t a good mental model. It’s infantilizing; it is fundamentally built on a lack of transparency; it models the rest of the organization as an uncontrollable, exogenous force that randomly produces excrement to be handed down; and if the manager or lead guesses wrong about what the team needs to know, it leads to bad and preventable surprises.
Batista’s analogy of a shock absorber, I think, captures the well-intentioned pieces of the umbrella analogy but without the problems:
How to stop firefighting and start working proactively - George Sudarkoff
Tying nicely into the management garden discussion, Sudarkoff talks about the problems caused by being in constant fire-fighting mode. He also alludes to one of the causes: it feels good and important to be putting out fires, and fire-extinguishing (since it’s so visible) can often be respected and rewarded in a way that running a quietly effective organization too often isn’t.
But fire fighting is exhausting, and it takes energy away from investing time and energy into growth and sustainable development.
Sudarkoff urges us to spend the time to stop fires from happening:
Neither of those is necessarily easy to do, especially while your energies are still being taken up with conflagrations! But the alternative is to keep spending energy on firefighting.
Addressing Tech Debt - Abi Noda
Noda gives a nice overview of technical debt (along with a taxonomy of ten different kinds of technical debt). More importantly, I think, he talks about signs that the tech debt is causing problems. To my mind, the four most important factors are:
And when it does start to cause one of the above problems, he advises:
If you are a native English speaker who uses Microsoft Teams for internal meetings, I highly recommend turning on Speaker Coach. That feature gives you a report after scheduled calls, flagging repeated or filler words (it turns out I say “you know”, “uh”, and “cool” a lot), talking too much during a meeting (something we managers need to look out for), flat speaking tone, and other speaking issues.
I can’t honestly say it’s been enjoyable to go through Speaker Coach reports, but it has been extremely valuable, and having those nudges and data after every meeting makes improving much easier.
Note that Speaker Coach currently only works in English, and I don’t know how well the word recognition functionality would work for people who have accents in English that come from speaking other languages too. (I imagine the flat-speaking-tone and speaking-too-much functionality would work, however.)
Get straight to the point - James Stanier
Many of us were trained in academia, or have been in academic circles long enough to pick up some habits of thought.
That explicit or implicit academic training serves us well in a lot of ways! But not when it comes to communication.
The very stylized (and verbose…) form of communicating in scholarly journals and research presentations is an active hindrance when we’re trying to get things done communicating with busy people.
I like tools like Hemingway App to help me tighten up my text and make it more readable. But that can’t help with the structure or organization of what I write.
Stanier urges us to get straight to the point, with three clear recommendations:
None of this has to come off as assertive or obnoxious - it’s just how you structure and order the information and request. It’s a matter of kindness to the reader - giving them what they need to provide an answer immediately (and get the email out of their inbox) if that’s possible.
When you’re not making a request but just communicating information, the Minto Pyramid is also a good approach.
GreptimeDB is Now Open Source - Xiaodan Zhuang, Greptime blog
Supporting and processing near-real-time data collection, especially from IoT sensors, is going to become a bigger piece of research computing and data. In many cases, ingesting and processing the data in large batches, as we’ve always done, will work just fine. But in others we’re going to need access to the data nearly as soon as it comes in, and some way of persisting it while analysis is being done. That’s going to require data solutions like time series databases.
Luckily, hyperscalers and even large enterprise deployments have been building time series databases for some time, to handle incoming telemetry data from huge numbers of servers and/or users.
Greptime looks to be part of the next generation of solutions, taking lessons from time-tested approaches like InfluxDB. I’ll be keeping an eye on it.
Is anyone currently using time series databases in your work? Are there any gotchas you want other readers to know about, or solutions you’re really happy with? Let me know.
Touching Grass With SLOs - Reid Savage
The Gordon Bell Special Award is always a nice glimpse into the future of research computing - this year’s winner involved training a 25 billion-element LLM diffusion model on huge stacks of data, fine-tuning the model on a subset of that data, running large molecular dynamics simulations, and then running inference with another large model, OpenFold.
The hardware and software systems we are building to support these varied computational science studies with diverse interlocking computations are increasingly complex!
But our approach to thinking about “downtime” hasn’t always kept up.
For most of RCD history, research computing systems were unambiguously up or down.
We’re already well past the point where that’s true - parallel file systems can be “up” but in a clearly degraded state, queueing systems can be borked while running jobs continue along unaffected, etc.
The broken/working binary is going to get increasingly untenable as more and more systems support broader functionality. Functions as a service, database services, streaming data, workflow managers - these are increasingly key pieces of modern research computing workloads, and they can be up, down, working but with undesirably high latency or individual request failures…
Savage gives another nice overview of Service Level Objectives (SLOs: see also #44, #57, #73, #134). SLOs are targets for the health of pieces of functionality, defined - crucially - from the point of view of the user. These are internal targets, which are alerted on, and which may or may not inform explicit, user-facing commitments (SLAs, service level agreements).
Our research communities deserve computing systems which are not merely “up” in some technical sense but usable. SLOs are a way of defining internal monitoring thresholds to better achieve that.
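To make the idea concrete, here’s a minimal (and entirely hypothetical) sketch of the arithmetic behind an availability SLO and its error budget; the 99.9% target and the request counts are made up for illustration, and real SLO tooling would compute this over rolling time windows:

```python
# Hypothetical example: evaluating a simple availability SLO from request
# counts. Target and numbers are invented for illustration only.

def slo_status(successes: int, total: int, target: float = 0.999):
    """Return measured availability and the fraction of error budget left.

    The error budget is the allowed failure fraction (1 - target); the
    remaining budget is how much of that allowance is still unspent.
    """
    availability = successes / total
    budget = 1.0 - target                  # allowed failure fraction
    burned = (total - successes) / total   # observed failure fraction
    remaining = 1.0 - burned / budget      # share of error budget unspent
    return availability, remaining

availability, remaining = slo_status(successes=99_950, total=100_000)
print(f"availability={availability:.4%}, error budget remaining={remaining:.0%}")
```

A negative “remaining” value means the budget is blown and it’s time to alert - and, in the SRE framing, to prioritize reliability work over new features until the budget recovers.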
Does your team have SLOs? How did you decide what they’d be, and how has monitoring of them worked?
In my day job I see a lot of teams who want to set up dynamic JupyterHub instances running reproducible workflows colocated with large data sets. NASA Openscapes is a cloud infrastructure for geospatial data, along with support for researchers migrating workflows to the cloud.
Lopez gives a quick overview of the 2i2c cloud infrastructure, with JupyterHub running on Kubernetes in a cloud-agnostic fashion, using GitHub for authentication as well as version control, a conda environment, and the ability to run across nodes with Dask.
I’m looking forward to there being standard, opinionated distributions of tooling to spin this kind of environment up from scratch; have you seen this kind of toolkit before? Do you run something like this? I’d love to hear about it.
The Journal of Fluid Mechanics will actively support supplementary materials in the form of runnable Jupyter Notebooks.
A computational biologist just won the Great British Bake-Off, for those who doubt the transferable nature of RCD skills.
A lot of people are moving to Mastodon! I don’t have the bandwidth right now to be an early adopter, but if it pans out I might follow. If you’re thinking of hosting a Mastodon server, though, especially in the US, read this twitter thread on steps to take to protect yourself from DMCA and other legal threats.
Why we call it boilerplate code.
Key chain fobs are ok, and apps for your phone are fine I guess, but wouldn’t you rather have your TOTP multi-factor authentication token generator running on a Commodore 64?
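Part of why a port like that is feasible is that the core of TOTP (RFC 6238) is tiny - an HMAC-SHA1 over a time-step counter, truncated to six digits. As a sketch, here’s essentially the whole algorithm in a few lines of Python (the base32 secret used in the test below is the RFC’s published test key, not a real credential):

```python
import base64
import hmac
import struct
import time

def totp(secret_b32: str, t=None, digits: int = 6, step: int = 30) -> str:
    """RFC 6238 TOTP: HMAC-SHA1 over the time-step counter, then
    dynamic truncation (RFC 4226) down to a short decimal code."""
    key = base64.b32decode(secret_b32, casefold=True)
    # Number of `step`-second intervals since the Unix epoch.
    counter = int(time.time() if t is None else t) // step
    mac = hmac.new(key, struct.pack(">Q", counter), "sha1").digest()
    # Dynamic truncation: low nibble of the last byte picks a 4-byte window.
    offset = mac[-1] & 0x0F
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)
```

Calling `totp(secret)` with no timestamp gives the current code, which is all a hardware fob is doing - no network, just a shared secret and a clock.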
As you know, your loyal correspondent has a soft spot for embedded databases. Kùzu is a new embedded DB specifically for graph data.
Speaking of, SQL Teaching, an interactive SQL tutorial using sqlite in the browser.
Finding bugs in an alternate implementation of an algorithm without writing tests using property testing.
The flux framework is intended to be a next generation toolkit for building schedulers and resource managers for future HPC systems. In an attempt to provide a more converged HPC/cloud approach to jobs, a flux operator is being built to run on top of Kubernetes.
Technology investors are optimistic enough about the future of CXL “composable memory” approaches that companies like Astera are successfully raising money and increasing their valuation even during … all this.
And that’s it for another week. Let me know what you thought, or if you have anything you’d like to share about the newsletter or management. Just email me or reply to this newsletter if you get it in your inbox.
Have a great weekend, and good luck in the coming week with your research computing team,
Research computing and data - the intertwined streams of software development, systems, data management and analysis - is much more than technology. It’s teams, it’s communities, it’s product management - it’s people. It’s also one of the most important ways we can support science, scholarship, and R&D today.
So research computing teams are too important to research to be managed poorly. But no one teaches us how to be effective managers and leaders in academia. We have an advantage, though - working in research collaborations has taught us the advanced management skills, just not the basics.
This newsletter focusses on providing new and experienced research computing and data managers the tools they need to be good managers without the stress, and to help their teams achieve great results and grow their careers.
This week’s new-listing highlights are below in the email edition; the full listing of 190 jobs is, as ever, available on the job board.