It’s been an interesting week online for research computing; we have relevant news ranging from managing people to managing workflows at global scale.
The Subtle Signs Your Team Is in Trouble - Rachel Muenz, Laboratory Manager
An older article that Laboratory Manager sent around on twitter this week. It’s written in the context of laboratories, but carries over very clearly to research computing teams. They mention two opposite signs to look out for:
In both cases the solutions are the same - do lots of constant listening (e.g., through one-on-ones but also just generally) and give lots of feedback (positive and negative; positive for any input even if disagreeing, and positive for doing so constructively while negative for doing so otherwise).
Tracking Toil using SRE principles - Eric Harvieux, Google Cloud Blog
Writing Runbook Documentation when you’re an SRE - Taylor Barnett, Transposit
“Toil is the kind of work that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.”
These two articles came out in the same week, and they work very nicely together. One of the under-appreciated advantages commercial cloud organizations (or other large operations) have at scale isn’t about hardware - it’s that they have better awareness of how much stuff that gets done is repetitive across teams, and they put the resources in to ensuring those things get done smoothly.
Technical teams tend to hate documenting, but identifying these repetitive “toil” tasks and documenting what’s involved in getting them done in a “runbook” has three really important benefits:
Are there things that have to routinely get done in your team that aren’t documented clearly in (kept-up-to-date) runbooks? Would things be easier if there were fewer such tasks?
And related to this; here’s a list of “flight rules” (e.g., very short runbooks) for common nontrivial tasks in git.
I talked last week about Andy Grove’s book High Output Management, which is aimed at management and in particular senior management in technical organizations; this article talks about lessons from the book for those still in the trenches, as technical leads (but not managers).
A Chaos Test Too Far - Kathryn Downes & Arjun Gadhia, Financial Times
I’m a big fan of “chaos testing” (or, arguably more appropriate for a lot of research computing work, Slack’s “disasterpiece theatre”) for testing the hypotheses that “our system is robust to [bad thing]”; the software equivalent is probably fuzzing. We know stuff is going to break, usually at the least opportune time; systems crash, users or operators enter invalid information… best to test it in a controlled manner.
So in fairness I should probably post examples of it going - or nearly going - horribly wrong, like this one. I’d still argue though that the series of events they tested here would almost certainly have happened eventually; far better that it happen with all hands on deck when they were primed to respond rather than towards the end of the day on a lazy Friday afternoon when people were already heading home.
Research computing teams have a lot in common with open source communities - even if you aren’t developers or developing open source software. One of the joys of open source communities is that you’re part of a small, visible team solving problems for your users - and that’s exactly the situation we’re in. But there’s downsides to that, too. Users can be incredibly demanding, and when you’re a small visible team they can be directly demanding to you, personally. Members of huge proprietary teams don’t have that feedback. for good or ill; being one of thousands of devs on MS Word or staff in AWS it can be hard to feel like you’re making a difference, but you’re not getting emails from your users yelling at you because they don’t like that latest change you made.
There’s a lot of talk in FOSS communities these days about dealing with this feedback, including dealing with burnout, and I think this is something that managers should be keeping an eye on.
This was talked about FOSDEM this year, and will be of interest to some teams and/or of their users: it’s a tool for CI testing of data tables. It will perform checks of tabular data on push to github or S3.
There were a few other relevant presentations in the databases session, including on dqlite (simple high-availability distributed database on top of RAFT and sqlite) and LumoSQL, sqlite with faster key-value store and some other advantages. I find it fascinating how a small high-quality (with hundreds and hundreds of tests!) embedded tool like Sqlite ends up being the basis for so many fun experimental projects.
Most of HPC Happens Under the Radar - Michael Feldman, Next Platform
This won’t come as a surprise to our community, but it’s an important point worth emphasizing; most HPC doesn’t happen at big centres whose every procurement gets lots of press.
Workflow manager news: The Pan-Cancer Analysis of Whole Genomes results came out this week (see also: Unprecedented exploration generates most comprehensive map of cancer genomes charted to date; BSC Powers Pan-Cancer Project). Disclaimer: I was somewhat involved in this project.
From a research computing point of view, one of the big accomplishments of this project, particularly given that much of the computational work was done ~five years ago, was coordinating the running of uniform pipelines on ~2800 sets of cancer genomic sequencing data at centres across the globe. This really helped catalyze work in the genomics community around workflow execution - spurring standards like the Workflow Execution Service; repositories of software and workflows like Dockstore and helping accelerate the development of workflow runners like Toil, Cromwell, and Nextflow.
One of the things that fascinates me is the difference in workflow runners like these ones for research computing - which are really about running large complex executables that would otherwise have run with bash scripts - and evolving ML workflow runners like Lyft’s recently-announced Flyte. Are the differences in use cases fundamental, or will the tools start to converge? One of the things I’d like to do at some point if there’s interest is a bit of a deep dive into workflow managers of all types.
In Python, dictionaries are now guaranteed to be ordered by insertion - but sets aren’t. I can absolutely guarantee this is going to cause incredibly hard-to-track-down bugs in research software in the coming couple years.
Jeremy’s Notes on Fast.AI coding style
A bracing reminder (ternary operators! 120-character lines!) that there aren’t “correct” coding styles; the purpose is to make sure teams reduce internal barriers to collaboration by picking one style that works for them and sticking with it.
Are you tired of your desktop’s file system working too well to be able to do good robustness testing of software? Want to mess with a process’s view of system time? Want stderr output always colorized so you can tell the difference from stdout? Check out this list of ld-preload hacks.