This week the COVID-19 saga is starting to have a big impact in the world of research computing. Conferences are being cancelled or switched to videoconferencing, and institutions are considering work-from-home policies.
Working from home works well for many of us in research computing. But it has management challenges that I’ve talked about before in the context of distributed teams. Without the opportunities to bump into each other in the office, it takes more deliberate effort to keep lines of communication open. Maintaining good working relationships, staying on top of work done, and giving useful constructive feedback all become harder.
That means that having you or your teammates work from home does not absolve you of management tasks like having one-on-ones or providing helpful performance communication. Just the opposite: it makes them more important! Remote one-on-ones work extremely well; when I’m travelling or people are away I quite happily use Zoom or Slack (paid Slack with their steep academic discount is worth every penny, easy to apply for, and their video calls work pretty well). Here’s one post I like on doing remote one-on-ones.
I have a little stockpile of recent articles about remote work that I’ve been saving up for a blog post, but this seems like a good time to list them here and discuss them in this context. So this week will have a focus on the distributed.
Let’s get started!
The State of Remote Work - Buffer and AngelList
This is a summary of a survey of 3,500 distributed workers, largely in tech. By and large people like it — 97% would recommend it to others and only 11% wish they could work remotely less often. But the workers identify problems coming from the communication challenges of working remotely - problems with collaboration, loneliness, and staying motivated.
Their “insight number 1” drives this home: People who don’t recommend remote work are on teams split between offices and remote workers. Unless you really make an effort, the on-site team members are going to get much better communications and support than the off-site team members. Equivalently, if people are working from home when they’re used to being an in-person team, they’re going to feel the effects of significantly worse communications. This is something that can be fixed by routinely checking in on your team mates (preferably with video, but even in text), and keeping up weekly one-on-ones.
The Basecamp Guide to Internal Communication - Basecamp
Basecamp has a very well-thought-out management style for its remote workforce (they’ve even written a book about it). They need effective internal communications, and this short document spells out their principles and practice. They strongly downplay instant messaging (not that they don’t use it, they just don’t prioritize it) in favour of longer-form asynchronous write-ups. This means fewer interruptions and clearer, better-thought-through communication. You may remember from the end of January I shared a post from them saying to break ties in hiring in favour of the better writer. Also interesting are their automated end-of-day “what did you work on today” question, which is aimed at reflection as much as communication, and a start-of-week “what will you be working on” question.
Contrast this with the recent Ars Technica article on how they handle being a remote team (did you know there’s not really an Ars Orbiting HQ?). At Ars, instant messaging is a key tool — but they’re a publication with tight deadlines and dozens of stories in play at any given time, so they function more like a newsroom. I don’t think this approach would help most of our teams.
It’s Time To Kill the Daily Standup Meeting - Claire Lew, Know Your Team
I’m not even sure I agree with this; if you are a team that operates with a daily standup and it’s been working well for you, I’d hesitate to stop doing them. All of our practices, though, should be subject to revisiting every now and then. Are our meetings serving their original purpose? Is that purpose still relevant? Are we getting enough value out of this meeting to be worth the people time and interruption? If so, are there changes that would make them even better?
If you go through a period of having people work from home for a while, either that change or the point when people come back is a great opportunity to do this sort of reflection with the team and brainstorm new and better processes, especially when some of the existing processes won’t work so well with a distributed team. Part of your new process could be to do the same brainstorming again in six months or a year.
This is less about having individual people working from home and more about moving from having a single office to two or more. Now, in research computing we’re never going to “start up a Chicago office” or whatever, but it’s very common for us to have multi-institutional collaborations. This article gives good advice as to how to structure that - I don’t think any of the advice is particularly surprising to us, but it’s good to review in this context. We should talk about distributed teams, not a central team and a bunch of remote teams; alternate time zones as appropriate; have good lines of communication; ensure occasional face-to-face meetings; and make sure the various teams are well-aligned (which can be easier in business than in a multi-institutional collaboration).
Remote Work Insights You’ve Never Heard Before - Sarah Milstein
In this article the author builds on her own experience and research on the surprising success of temporary ad hoc teams to describe how successful distributed teams can actually work better than successful in-person teams. That greater success is exactly because they have to try harder; once those habits of communication are built the team can develop better trust (a key reason for one-on-ones!). With that in place, aligning people’s goals can be easier, having difficult conversations can be easier (even between peers), and drama can be lessened.
Obviously there’s a huge selection effect here - successful teams are successful - but the argument that successful distributed teams are successful differently than in-person teams is an interesting one.
SETI at home, which was an introduction to distributed high-throughput computing for many of us, is shutting down for reasons that sort of warm my ex-astronomer heart — the volunteer computing done worldwide has done so much work that there’s a backlog of results to go through!
SETI at home took off so quickly in 1999 that the developers made the backend more general, and the new Berkeley Open Infrastructure for Network Computing (BOINC) began supporting other projects. Currently a big supported project is Folding@Home, for protein folding; appropriately enough for this week, Folding@Home is doing some work on the coronavirus.
A BOINC paper was just recently published, and it’s interesting to see the architecture and how it’s changed since what I remember, particularly on the client side. It was originally built to handle a much more diverse range of CPUs than was common through the 2000s (remember Alpha and MIPS?), but in an age where GPUs are common and ARM is making inroads, and where mobile devices can have significant capabilities that you might want to take advantage of, that flexibility is paying off again. It supports both native code and VMs behind a shim interface (“wrapper”), which means that, for instance, Docker containers are deployable now. There is also a mechanism for allowing client-side visualization - which has always been really important for engagement of the volunteers. It’s a lot easier to put up with a cooling fan whining and more sluggish interactions if you can see a real-time visualization of the cool work for science your computer is doing!
On the server side, a key component (to avoid result-poisoning problems from malicious clients) has always been result validation, which requires a bit more sophistication in the handling of returned results than you’d need with say HTCondor jobs, and there’s some support for harvesting debugging information from failed jobs as well. Servers typically rely on bog-standard MySQL or MariaDB databases for handling returned results and do their processing from there.
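To make the result-validation idea concrete, here is a minimal sketch of replication with a quorum: the same work unit goes to several clients, and a result is only accepted once enough of them agree. This is an illustration of the general technique, not BOINC’s actual validator code; the function name, the client ids, and the default quorum of 2 are all my own assumptions.

```python
from collections import Counter


def validate_workunit(results, quorum=2):
    """Return the agreed result if at least `quorum` replicas match, else None.

    `results` maps a (hypothetical) client id to the value that client
    returned for the same work unit. Values must be hashable.
    """
    if not results:
        return None
    # Find the most frequently returned value and how many clients sent it.
    value, votes = Counter(results.values()).most_common(1)[0]
    return value if votes >= quorum else None


# Two honest clients outvote one bad result.
accepted = validate_workunit({"client-a": 42, "client-b": 42, "client-c": 7})
# With no agreement, nothing is accepted and the work unit would be reissued.
rejected = validate_workunit({"client-a": 1, "client-b": 2})
```

Exact equality is the simplest possible comparison; for floating-point results computed on different hardware you would likely need a tolerance-based comparison instead.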
Scientifically, technically, and as a public engagement mechanism for citizen science, BOINC has been a really successful project and it’s an effort that’s worth learning from.
Michaela Greiler on Code Reviews - Software Engineering Radio Episode 400
How to do High-Bar Code Review Without Being a Jerk - Andrew King
How to do a Code Review - Google Engineering Practices Documentation
Reading Chelsea Troy’s blog has kind of convinced me that code reviews are a way of doing asynchronous, distributed pair programming. And even if you do them within an in-person team, they require good communication skills to be productive and drama-free, both in the review itself and “out of band”. So it seems worth addressing code reviews in a roundup themed on managing distributed teams.
I don’t normally point to podcast episodes here, but the SE-Radio episode on code reviews with Michaela Greiler is worth listening to. Greiler has worked at Microsoft and helped analyze their literally decades’ worth of internal code reviews, and so has some really important insights:
Writing these things out sounds suspiciously like process, which research computing teams hate; but paraphrasing Michael Lopp, process is (or should be) just documented culture: “this is how we do things here”. It’s only when the whys of the process get forgotten and the process becomes a goal in and of itself that it starts to become onerous.
King’s article makes the same points about clarity of expectations, and goes into more detail about a specific process and where having an actual in-person meeting might be useful. I particularly like the addition of being clear on what “done” looks like for both a code review and the underlying PR.
The Google best practices for reviews are useful too if you want to see how a large organization that writes a lot of quite successful code goes about it.
Syntax highlighting is backwards - Ben Kuhn
This article makes a very interesting argument - that typical syntax highlighting is completely backwards. Boilerplate like language keywords should be faded, and code comments should be highlighted. What do you think?
Loading NumPy arrays from disk: mmap() vs. Zarr/HDF5 - Itamar Turner-Trauring
As the data volumes we’re asked to analyze grow, external memory/out-of-core computing is becoming more common again. I often see research computing tools using mmap as a cheap and easy way to transparently access data on disk as if it is all in memory, making use of modern OSes’ extremely sophisticated file system caching.
This can work great if your data is sitting on a local POSIX file system, and the access pattern lines up with how the data is arranged on disk; but if either of those things isn’t true, you need to turn to something else.
This article is a nice introduction to mmap and where it does and doesn’t work. It briefly touches on HDF5 as a tool that can help, especially if the data is nice big rectangular arrays or data tables. It also discusses Zarr, a sophisticated newer Python tool that is part of the PyData ecosystem; but the basic ideas aren’t ecosystem- or language-specific.
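For a feel of what the mmap approach looks like, here is a minimal sketch using only the Python standard library rather than NumPy. It writes a flat file of float64 values, maps it, and sums a slice without reading the whole file. The file layout and the helper names (`write_doubles`, `sum_slice`, `demo`) are my own assumptions for illustration, not taken from the article.

```python
import mmap
import os
import tempfile
from array import array


def write_doubles(path, n):
    """Write the float64 values 0.0, 1.0, ..., n-1 to a flat binary file."""
    with open(path, "wb") as f:
        array("d", (float(i) for i in range(n))).tofile(f)


def sum_slice(path, start, stop):
    """Sum elements [start, stop) by mapping the file instead of loading it."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # View the mapped bytes as doubles; no data is copied here, and
            # the OS only pages in the parts of the file actually touched.
            with memoryview(mm) as raw, raw.cast("d") as view:
                return sum(view[start:stop])


def demo():
    """Create a small file in a temporary directory and sum part of it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.bin")
        write_doubles(path, 1000)
        return sum_slice(path, 10, 20)
```

The slice here is contiguous on disk, which is the friendly case; a strided access pattern over the same file would touch many more pages, which is where the chunked formats come in.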
A lot of us in research computing started in non-internet-facing systems, or in making open data more widely available, and so considerations of security are a little new to us.
The Software Security course is a new course by Heymann, Kohnfelder, and Miller at UWisc which is still being put together (not all the materials are online yet); it gives a good from-the-basics mental model of how to approach security, from building it in at the beginning through to tools.
The old but still extremely relevant Ops Report Card takes a look more from the operations side. There are some very specific pointed questions here, and a system that had good answers to all of those 32 questions would be well above average in the research computing world.
I’ve been watching AWS Lambda-style functions-as-a-service with interest as to how they were going to be used for research computing for a while now. At first I thought the impact was going to be limited to typical web services applications, where one could stand up endpoints without having to go through a lot of boilerplate.
That changed when I read the demented genius of numpywren: serverless linear algebra (source on GitHub). The authors applied Lambda to absolutely standard large-scale numerical linear algebra operations like matrix multiply, SVD, and Cholesky decomposition, and in some cases achieved computational efficiencies greater than ScaLAPACK, because even many common dense linear algebra operations are more dynamic than we normally appreciate. By using S3 (which has absurdly high bandwidth) as a backing store for external memory approaches, they managed to get ~50% of peak flops on 3000 cores even without maintaining state.
Cloudburst is a new effort out of the same group (the RISE Lab at Berkeley) which aims to make even higher efficiency available by maintaining a cache and load balancer, and allowing “hot” data to migrate to a particular FaaS endpoint. It also supports task DAGs in addition to single functions.
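To illustrate the general shape of stateless tasks working against a shared store, here is a toy tiled matrix multiply. A plain dict stands in for the object store, and each call to `multiply_tile` keeps no state between invocations: it reads the tiles it needs, computes, and writes one output tile. This is only a sketch of the pattern; none of these function names come from numpywren, and it assumes square matrices whose size is a multiple of the tile size.

```python
def split_into_tiles(store, name, matrix, bs):
    """Cut a square matrix into bs x bs tiles and put them in the store."""
    n_blocks = len(matrix) // bs
    for i in range(n_blocks):
        for j in range(n_blocks):
            rows = matrix[i * bs:(i + 1) * bs]
            store[(name, i, j)] = [row[j * bs:(j + 1) * bs] for row in rows]
    return n_blocks


def multiply_tile(store, i, j, n_blocks):
    """Stateless task: compute tile (i, j) of C = A @ B from stored tiles."""
    bs = len(store[("A", i, 0)])
    acc = [[0.0] * bs for _ in range(bs)]
    for k in range(n_blocks):
        a = store[("A", i, k)]
        b = store[("B", k, j)]
        for r in range(bs):
            for c in range(bs):
                acc[r][c] += sum(a[r][t] * b[t][c] for t in range(bs))
    store[("C", i, j)] = acc


def gather_tiles(store, name, n_blocks, bs):
    """Reassemble a full matrix from its tiles."""
    rows = []
    for i in range(n_blocks):
        for r in range(bs):
            row = []
            for j in range(n_blocks):
                row.extend(store[(name, i, j)][r])
            rows.append(row)
    return rows


def tiled_matmul(a, b, bs):
    """Multiply two square matrices tile by tile through a shared store."""
    store = {}  # stand-in for an object store such as S3
    n_blocks = split_into_tiles(store, "A", a, bs)
    split_into_tiles(store, "B", b, bs)
    for i in range(n_blocks):
        for j in range(n_blocks):
            # Each call is independent, so these could run on separate workers.
            multiply_tile(store, i, j, n_blocks)
    return gather_tiles(store, "C", n_blocks, bs)
```

Because every output tile depends only on what is in the store, the loop in `tiled_matmul` could be replaced by any number of independent function invocations, which is the property that makes the approach fit a functions-as-a-service model.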
I’ll be very curious to see if anyone successfully makes use of this for interesting research computing applications.
Bioinformatics with Rootless Podman - Bryan Hepworth
I’ve mentioned podman as sort of a compromise container format supported on new versions of RHEL and CentOS; it can consume Docker images and build from Dockerfiles, but doesn’t require root. This is a walkthrough of using it for a particular bioinformatics tool, WhatsHap.
Maybe I’m just a bad or insecure person, but I always find it very reassuring to read about computing mistakes that other people make. Here’s a list of this year’s leap year bugs.
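As an illustration of the kind of date logic that tends to go wrong (these examples are mine, not drawn from the linked list), here is a short sketch of the full Gregorian leap-year rule and of shifting a date by whole years without tripping over 29 February. The helper names are hypothetical.

```python
from datetime import date


def is_leap(year):
    """Gregorian rule: every 4th year, except centuries not divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def add_years(d, years):
    """Shift a date by whole years, mapping 29 Feb to 28 Feb when needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        # 29 February does not exist in the target year; fall back to the 28th.
        return d.replace(year=d.year + years, day=28)
```

The naive `d.replace(year=d.year + 1)` works on every day of the year except one, which is exactly why this sort of bug survives testing.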
A great handbook for the care and feeding of nginx.
I read somewhere once, and can’t find it, something along the lines of “If you create a tool that reads a string and has behaviour that depends on the contents of that string, you’ve written a parser. Write fewer parsers.” Parsing is hard and error prone, and you shouldn’t build one by hand. There are some old tools you might be familiar with for doing so - here’s an argument about Why you should not use (f)lex, yacc and bison and instead use modern tooling like ANTLR.