#121 - 7 May 2022

Professionalism is process; Onboarding staff; Production bioinformatics; Tracking big-picture items; Why new outside gigs are hard; Sharding test suites; SchedMDs bad communications; TPUs for CFD


The (159-item!) intern process checklist from last week generated a few interesting discussions, and those plus Chris Dwan’s interesting talk discussed below return to a common theme here in the newsletter - of projects versus processes, or learning versus production.

In #91 I alluded to the fact that what distinguishes a hobbyist from a professional is repeatability. The hobbyist is still developing their skill, trying to make their one-off thing come out with high quality. The professional is no longer struggling to make one thing well; they’re trying to ensure that things reliably, repeatedly come out with the same high quality.

The hobbyist is thinking about the mean quality, and trying to get it to trend upward; the professional is concerned with the variance, and reducing outliers. (Mean matters, too, but for a professional, a high mean quality is table stakes). There’s a progression from focussing on skill to focussing on reliability, from craft to process. One of Dwan’s slides says:

Amateurs practice until they can play it right; professionals practice until they can’t play it wrong.

There’s something of that in leadership, too. You’re leading people, but you’re managing and creating processes so that the team can reliably do their best work. Those processes can’t be created de nihilo; they have to come from having learned how do the thing well once, and then you can start building a process around it so you can reliably do it well.

Whether you’re hiring, or adding a node to the cluster, or building and deploying a new release, or adding new data to the public resource, professionals have SOPs that they follow so that the thing gets done well. You don’t have one of those to start with, of course; it’s all figure-it-out-as-you-go at the beginning. Then it gets routine, then it’s documented in an SOP, and then as you find still better ways to do it, that make it even more reliable, you update the SOP. Or automate parts of it! Either way, instead of people spending time thinking “how does this go again?”, they free up time and brain space to improve the process, or to work on something new.

One of the things I like about Dwan’s slide deck is he implicitly (maybe explicitly in the talk itself) this evolution from learning into process into the spectrum from research to development, from discovery to infrastructure, that we discussed in #119. The research side of the spectrum is the learning. It’s very much the bumbling around, “maybe this will work” stage, where hopefully you discover something new. “So that’s how you do it!” That is good and important work, and it generates research outputs - new knowledge.

As you’ve it progresses and matures and solidifies enough to the point that others can also reproducibly do the thing, it’s becoming development, and then production. It’s an input that others can use to do research, rather than a research output. This is also good and important work, but it’s of a different kind.

On to the roundup!

Managing Teams

The Ultimate Guide to Onboarding Software Engineers - Csaba Okrona

Eventually I’ll try to do a similar onboarding process starting-point checklist for full time staff. Okrona’s article touches on what some of the biggest differences will be (and it isn’t particularly limited to software developers)

When you’re hiring someone you hope will be on your team and successful for a long time (at least a couple of years) doing work you don’t even know will happen yet, then there are other things you need to add to the onboarding. Okrona has on the list onboarding into the wider organization - getting them familiar with and maybe introduced to stakeholders, other departments, communities of practice of relevant, etc. Also there is onboarding into the team and organizational cultures - what matters most to those groups.

What I really like about this article though is the part about expectation management - for you, for stakeholders, and of course for the new team member. Okrona strongly suggests having 30, 60, 90, and 150 day milestones, what you expect them to have accomplished or to be able to do at those points. Not only does that give them something to aim towards, and milestones by which they can gauge their progress, it’s also something to use to set expectations for stakeholders or team members (“Frank just got here 3 weeks ago! We don’t expect new team members to be doing code reviews until month three, remember? He’ll get there, don’t worry”).

I’ll add that those milestones are great to share with potential new hires even when they’re just candidates - you can put them in the candidate packet or even job description. A more ambitious version is to have defining those milestones be the first part of the hiring process, before even writing the job ad, and to back out all the candidate requirements from those milestones. Having that kind of clarity adds a coherence to the hiring process, makes it very clear to job seekers if this is a job that they’re interested in, and provides them with some kind of evidence that there’s a plan for their success if they do join the team.

Ditch Your To-Do List and Use These Docs To Make More Impact - Brie Wolfson, First Round Review

We start off in tech with ticket trackers and to-do lists, and tend to carry that through to our first leadership jobs. But they’re inadequate when you become a manager or lead.

As a leader you no longer have the comfort of merely being responsible for set of discrete tasks that can be independently ticked off. You’re probably not even only responsible for individual projects. No, you’re responsible for much more nebulous things like ‘priorities’ and ‘efforts’ and ‘directions’ and ‘staff development’ and ‘stakeholder management’. Stuff that you can’t just cross off the list one day and be done with. How to keep track of stuff like that? They each certainly generate individual to-do tasks with alarming frequency, but you need some higher-level meta tracking too.

There isn’t a single answer, a single thing that you can do. (Isn’t being a leader awesome!) This article offers a number of documents Wolfson finds helpful, broken down into daily, monthly, and quarterly documents.

Ones I think are or might be particularly useful are:

  • An ongoing stack rank of all those nebulous ‘efforts’ and whatnot, with their priorities and broad themes of current work - it’s helpful and clarifying to have a list of all such things in one place
  • Updated lists of things you’re proud of, things you got stuck on, and things you learned
  • Going back periodically (how often will depend on the kind of work, I think) and updating a list of things you or your team got done that had real impact, and what the impact was

Technical Leadership

How to run a Retrospective - Chase Seibert

Siebert writes this in the context of sprints, but this short and solid how-to for running retrospectives applies to any project. (A sprint is just a a mini-project, after all - it has well-defined objectives, along with a beginning, middle, and end).

Siebert probably feels that actually following up on the retrospective is out of scope of an article on how to run the retrospective meeting, which is fair. Don’t take that as a sign that the followup is secondary, though! People will stop contributing if they don’t see good-faith efforts to incorporate the results of the feedback into improving the work. After all, that’s the whole point, right? For a recurring retrospective after a sprint, you can introduce things in the next sprint kickoff, or some other regular part of the cycle. For longer one-off projects, you might write something up distilling the results and circulate it.

I’d also add that these retrospectives, like standups, can work really well if you as the manager don’t run it, and rotate leadership of the meeting through willing members of the team. It helps develop skills, gets everyone more invested, helps you delegate a task, and lets you pay more attention to the content of the meeting (or even participate!) than the operation of the meeting.

Managing Your Own Career

Why Success Is Often Elusive at the Highest Echelons - Cindy Sridharan

It’s good and healthy to move between organizations, to both gain and bring new perspectives. But doing so is way easier in more junior positions, whether as an IC or a manager.

Sridharan walks through the challenges of moving into new organizations in a more senior leadership position, and why so many who do don’t succeed. That doesn’t necessarily mean fail - it can end up meaning just “meh”. But a year or two of “meh” in a senior leadership position can mean a loss of trust. You get longer to prove yourself in a leadership position than a more junior one, but not years.

It’s laid out very clearly in Sridharan’s article: at some point mere competent execution isn’t enough, you need to be both executing and improving things around you. But being able to improve things at leadership levels is a team sport. You need a support structure of colleagues to get anything substantial done. You need deep knowledge of the context to make sure what you think is an improvement actually is. And you need knowledge of the culture to know how to communicate and effect the improvement. Those require trust and connections and communication, which take time.

As high achievers, and as people who probably have done well at hard things in the research world, we disproportionately tend to march into a new situation pretty sure we can handle this. We probably can! Succeeding in a senior leadership position at a new organization certainly isn’t impossible. There are good practices - The First 90 Days is kind of a classic text on the subject, but there’s been a lot since.

But knowing that it’s a real challenge, and understanding the shape of the challenge, is the crucial first step. Sridharan lays out the challenges well.

Preparing to Tell Your Boss “I Quit” - Nihar Chhaya and Dorie Clark, HBR

So knowing that, you’re going to take that new position anyway? Great! You’ve got this. Now you just have to tell your boss.

(And by the way — you have to tell your manager before your peers or your team members. You just do. I’ve seen people try to do it the other way. It’s worse for everyone.)

Like preparing for any awkward conversation at work, there’s really no way forward but through. Think about what you’re going to say, and it’s good to be prepared for a few different kinds of reactions so you can know how you’ll respond rather than playing it by ear in the heat of the moment. Chhaya and Clark lay out the possibilities. But don’t spend hours or days dwelling on it - just do it, the sooner the better, and do it as “face-to-face” as the situation allows.

These conversations tend to be awkward but generally positive in the research computing world, in my experience. Yeah, your manager will be surprised and disappointed, but the research environment is kind of built around trainees (undergrads, grad students, and postdocs) doing their thing, building their skills, and them moving on to the next thing.

Product Management and Working with Research Communities

Production Bioinformatics, emphasis on Production - Chris Dwan, Sema4

I don’t usually include slide decks in the roundup, but this is so nice and ties so clearly into the research output vs research input discussion two weeks back in #119 that I can’t skip it.

Dwan has worked on the research computing and IT side bioinformatics in a number of roles since 2000, three years before the human genome project declared success, so he’s seen a fair bit. Dwan’s worked along the research to development to production infrastructure spectrum. Right now he heads up production clinical bioinformatics in his company, so his version of “production” goes a big step further than we’d normally see on the research side - production pipelines are validated, changing the pipelines is a big deal, there’s 24/7 support, etc.

There’s lots of good stuff here. Besides what I spoke about before, it really speaks to the large number of moving parts - data quality, governance, technology, technical processes, working with collaborators at different stages of the process - that’s involved on the production side of research computing and data. One of the unofficial themes here is that research computing isn’t about the computing, it’s about the research; it’s embedded in a much larger system that we need to both rely on and support. These slides are a nice overview of what that looks like in Dwan’s context.

Slide 8 from Dwan’s presentation, describing how as a process matures from research to R&D to development to production, the requirements and the support models change.  In the diagram, while there’s a linear process from knowledge coming from research into R&D prototyping things, into development of reusable tools, into reliable production infrastructure, it’s also iterative.  The production infrastructure produces data (and expertise, in the form of people) which can be embedded into or offset support to teams working on earlier stages of the model.

ACI-REF Leading Practices of Facilitation - Advanced CyberInfrastructure - Research and Education Facilitators

Speaking of learning and documenting into processes - I guess lots of you already knew about this, but this is the first I’ve seen of it: a document outlining what exactly a research computing and data facilitator does. The role gets called a lot of different things. A million years ago I was a “Technical Analyst”, one of the many completely opaque job titles we give ourselves in this line of work, but it was this role.

When I took on this role, the centre was just starting up and we were trying to figure out what exactly we’d do all day. At first, we just helped out wherever we could - easy because we only had a few friendly users at the time. As the centre grew, we started doing more types of things, and unofficially specialized in different parts of the role, but there was never actually a clear list of responsibilities. This is a good starting point for those starting to map out job responsibilities or even service offerings for research systems support work.

Research Software Development

The Niche Programmer - Asko Nõmm

Whether you’re hiring software developers, or maybe looking to be hired yourself, Nõmm makes a good point about niches within that now very broad field compared to - well, let’s say “mass market” software developers.

There are things we absolutely can learn from the larger software development ecosystem in industry and elsewhere, but other things that are very different. Front end or back end web development - there are approximately a zillion jobs out there, and a zillion job-seekers out there. Lots of needles out there, but also just so much hay. To try to find some successful matchings industry has come up with all these incredibly baroque steps - leetcode or hackerrank and full-day on-sites and…

Nõmm describes his experience moving into a niche - Closure development. Within that much much smaller niche, it was clearer who was an expert, what people knew, and what jobs might be a good match. Demonstrating competence was much more straightforward.

There’s a few lessons here:

Most immediately, if you’re going looking for a job that uses your research software development skills, you’re mostly going to find that they don’t make you jump through the generic software developer hoops that are otherwise standard in industry. Even big tech: if you’re going for jobs in the parts of their orgs that do research computing type work, you’re probably not going to get leetcode problems. They’re not needed to establish competence within the research computing niche.

Second, if you’re hiring for research software developers, don’t do things that are best practices in industry if they don’t make sense for the job you’re hiring for. Evaluate for the skills and behaviours you actually need, and in a sub-discipline like ours, that’s much much easier.

Third - as always, specialization makes matching easier on both sides of the match. It cuts away the noise. Whether you’re a developer or developing service offerings, broad generic mass-market offerings always end up standardized, commoditized, devalued, and going to the lowest bidder. Specializing within a niche short-circuits that.

How we sharded our test suite for 10x faster runs on GitHub Actions - Fantix King

Well that’s cute - I’m sure it’s been done before, but I haven’t seen this described elsewhere.

King describes parallelizing EdgeDB’s test suite to run on more resources in a shorter amount of time, to get feedback to developers quicker. They use GitHub Actions, but it could probably work equally well on Travis or CircleCI or..

King first describes the process of manually dividing up the test suite into roughly equal sized chunks. That’s the tedious but straightforward part.

The cute part is the off-label use of matrix functionality that most CI pipelines support (where you’d test against e.g. multiple versions of your compiler and OS versions). They create a list 1..N of shards, and matrix over that, passing the value to the test suite, and the test suite dutifully runs the shard i/N tests.

Here’s a pretty good article that gives a clearer verison of the usual argument for Python’s performance being good enough for many use cases, even performance-sensitive uses that have specific hotspots. (As the article admits, if the whole code is the hot spot, this doesn’t really apply).

It also introduces a new-to-me tool, mypyc, which makes use of the mypy type annotations to allow compiling of an increasing amount of python into C extensions. This is exactly the sort of thing I was hoping would happen as Python started introducing typing…. I’m excited to see where this goes.

Research Computing Systems

Systems news is 100% vulnerabilities this week.

If you run slurm you’ve hopefully heard about the critical security updates that are necessary because of three CVEs. The first is bad enough, and the other two aren’t great either.

There has been grumbling about the communications around this. Part of the complaint was about paying customers getting a heads up that a notable rush update was coming two weeks prior to the announcement, with early versions of updates available upon request. That sounds terrible, but upon reflection I don’t know what else would be the right thing to do when you’ve identified the vulnerability but don’t have a solid fix.

Much less defensibly, when the public announcement came, it went out on the mailing list but schedmd was otherwise very quiet about it. There’s still nothing on their twitter feed, or on the front page of their website. Given the severity - local users can execute arbitrary commands as root - and how widely slurm is used, that’s completely unacceptable, and it’s going to damage trust.

Bad stuff is going to happen on systems (or software or data) under our responsibility sometimes. We should (and do!) work very hard to reduce how often that happens, and the severity of events that do occur, but bad things will still happen.

Our stakeholders (researchers, institutional decision makers, funders) pay very close attention to how we handle situations where their data or institutional reputation or funds are at stake. People gain or lose trust in us based on how responsibly they see us handling the resources we steward for them in those moments. Communicate with them accordingly.

Cloudflare Pages, part 1: The fellowship of the secret - AssetNote
The Cloudflare Bug Bounty program and Cloudflare Pages - Evan Johnson, Natalie Rogers

It’s not complely comparable, because Cloudflare runs services rather than just providing software. But still, compare how this plays out compared to the Slurm situation.

Full disclusure - Cloudflare Pages is a tool I use for the static front page and searchable archive hosting of this newsletter’s website - but not the mailing list with your email addresses, which is run by buttondown. I’m quite happy with both, for what it’s worth.

Anyway, Cloudflare Pages was catastrophically hacked by the white-hat security researchers at AssetNote. Total pwnage, as the kids said a decade or so ago. But it was done at least in part because Cloudflare (like a lot of tech companies) has a bug bounty program, it’s all been carefully written up, and the tool has been rearchitected. In their public report, Cloudflare claims to have evidence that several of the bugs weren’t previously exploited, but couldn’t provide such evidence for one remaining bug and so directly informed the affected users in addition to this public writeup. And the hackers gave an incredibly detailed (and really fun to read!) walk-through of what they had to do to get in.

As with many such security incidents, it didn’t happen because of one big vulnerability, it was a bunch of little things that, in combination, allowed pretty complete access to… well, everything. A couple unsanitized inputs, overly generous permissions on a key executable and some directories, overly-priviledged containers. Just everyday slip-ups that individually probably wouldn’t have been too bad, but in combination very much were.

I’d encourage you to read the three-part hacking writeup (and enjoy the ingenuity of the team), and read Cloudflare’s official public report. Do you have more, or less, trust that Cloudflare takes security seriously? Perhaps it’s less, but it’s probably not a lot less, given the scope of what AssetNote ended up with access to. Bad stuff happens, what matters is how you deal with it and are seen to deal with it.

I’ve not tried it yet, but vuls, an agentless linux (or FreeBSD, if you’re in to that) vulnerability scanner, looks pretty cool. It does quick or deep scans for a large and updated list of vulnerabilities, and has nice web- and tut-based interfaces.

Emerging Technologies and Practices

Testing out HPC on Google’s TPU Matrix Engines - Timothy Prickett Morgan, The Next Platform
Accelerating Physics Simulations with TPUs: An Inundation Modeling Example - Pierce et al., arXiv:2204.10323

Another datapoint for the hypothesis that AI hardware ends up being surprisingly good for traditional HPC-type calculations. The downside is that you may have to do some things differently..

Google’s TPUs are, like a lot of AI hardware, big matrix math machines, which work really well for a lot of scientific computation needs! In this case, Google researchers are doing shallow-water flood simulations, using a matrix math library written for these devices. The code in a Colaboratory notebook, is open source if you want to take a look at it. They get the requisite multiple orders of magnitude speedups that you can get on dense linear algebra over CPUs:

CPU vs TPU for this shallow-water simulation


Transformers (the deep learning model, not the 1980s toys) for software developers.

If Apple, Google, and Microsoft are all behind it and developing standards, maybe we do finally stand a chance of getting rid of passwords.

What’s that? You need to write YAML or JSON files, but in perl and Fortran? We got this.

This is really nice - a half day workshop (or just self-study materials) where you write a first real, if not production-quality, database - a disk-based log-structured key-value store.

The evolution of *nix cli conventions over time.

Getting a sqlite-backed python web application running entirely in the browser - datasette lite.

Getting gopher up and running on a palm pilot. If you’re too young to know what either of those things are I don’t want to hear about it.

As we now know, computers were a mistake. However, this week I found one valid use for them - a remarkable explanation, with interactive visualizations, of the working of a better class of technologies - the mechanical watch.

Don’t do automated search and replaces in your git repo. This of course is something everyone knows - eventually.

Good quick tutorial on mermaid.js diagrams, which are supported on GitHub now.

Need to burnish your “chaotic neutral” inventory management/software development alignement credentials? Generate your barcode scripts directly in the postscript.

There are very very few truly new ideas. Distributing workloads in heterogeneous distributed systems - in 1969.

Every problem is interesting if you push at it hard enough - doing really fast case conversion on unicode strings with efficient sparse array compression.

Ever wonder why the huge jump from Windows as a DOS GUI (Windows 3.1) to Windows as a real network operating system (Windows for Workgroups 3.11) had such a small version change? And why Windows NT’s first release was version 3.1? It was for legal reasons.

What if google translate, but for programming languages?

What if man pages, but nicely formatted and organized and with the examples first?

What if pandoc, but for http api content types?

That’s it…

And that’s it for another week. Let me know what you thought, or if you have anything you’d like to share about the newsletter or management. Just email me or reply to this newsletter if you get it in your inbox.

Have a great weekend, and good luck in the coming week with your research computing team,


About This Newsletter

Research computing - the intertwined streams of software development, systems, data management and analysis - is much more than technology. It’s teams, it’s communities, it’s product management - it’s people. It’s also one of the most important ways we can be supporting science, scholarship, and R&D today.

So research computing teams are too important to research to be managed poorly. But no one teaches us how to be effective managers and leaders in academia. We have an advantage, though - working in research collaborations have taught us the advanced management skills, but not the basics.

This newsletter focusses on providing new and experienced research computing and data managers the tools they need to be good managers without the stress, and to help their teams achieve great results and grow their careers.