Most of us have made so many half-plans about the future after COVID (“In a couple months, when this blows over..”, “By August, we should be able to…”) we’re a little wary of putting time into deciding on next steps.
But planning for the future is part of the job. Earlier this week I wrote up my thoughts on what academic research computing specifically is likely to look like after the pandemic is fully controlled.
There’s a few things I think we can take as givens that are going to shape the near-term future. First, research communities are now pretty comfortable with remote provision of support. Health research has taken on a new urgency, and infrastructure that supports data sharing is now a must-have rather than a nice-to-have. Organizationally, two quite different things have happened: governments and institutions are very over-spent, and the best research computing teams have learned to communicate and operate even better than they were before, while others have just sort of muddled through.
I think those things are going to have some pretty clear consequences in 2021 and beyond. Research budgets will stay flat or drop (except health), and research support budgets in particular are going to shrink. With more health data on more systems, we’re going to have to get better as a community about running research computing systems with something closer to industry best practices around security, monitoring, uptime, and breach reporting.
But there are bigger changes coming. We’re almost certainly never going back to 100% work-from-office, which means ever more virtual support. That plus likely budget cuts makes consolidation of research computing teams - within, but also between institutions - extremely likely:
Midsized teams with vague goals and priorities who can’t communicate the value they bring are going to be called upon to justify their existence.
This is at least a half-year away from even starting, but as I said in the article:
It is not too early to start upping your game when it comes to the adminstration, your researchers, and your team members. For the administration, you’re going to have to ensure that you can justify every budget item in terms the administration recognize and value, and that you have clear and focussed goals and priorities. For researchers, you can start making sure that your systems, processes, and practices are as high-quality and researcher-focussed and -friendly as possible. For your team members, if you’re not regularly communicating with them to make sure they’re happy in their current roles and with their career development, this is the time to start.
I’ve talked a lot and shared a lot of resources in this newsletter about upping our game with team members - that’s vital, because without a capable engaged team nothing else matters. But there’s been a lot less about communicating our value to administrations, or product management working with researchers, partly because there’s precious few resources out there on those topics to share.
Given the lack of other resources, I’m going to spend some time putting together some material on what groups can do to better make our value clear to administrations/funders and to be more indispensable and less interchangeable with researchers. If you do have some resources that have been valuable to you on those topics, share them with the community! Reply to this newsletter or just email me at [email protected] and I’ll make sure our fellow readers get them.
And with that, on to this week’s roundup:
One Rubric Changed Box’s Engineering Performance — Here’s How - First Round Review
A nice description of how the ex-leader of Google Docs now running the technical organization at Box set up performance management. It was guided by five principles:
The first is the most important; without clear expectations about what performance looks like - and that probably means goals and pretty high-level standards about things like how self-directed (say) they are. Those goals are work to set up and the high-level standards are hard to evaluate, but that’s really the only kind of tools at our disposal; we can’t monitor performance on how many widgets per month they produce.
From that they describe creating rubrics for various roles which can be used to evaluate performance, can be used in job descriptions or ads for clarity, and which can be use to guide promotions (probably less something we can take less advantage of in research computing where teams are small and structures are pretty flat).
But without that clarity of expectations around performance there’s not much else you can do.
Brooker, who leads development on AWS’s Lambda product, writes about his approach to getting big things done and done well; his approach is outlined below:
This maps pretty straightforwardly to research computing work too.
Key to Brooker’s approach, I think, is the centrality of writing a document in the early stages, both for communicating the problem and proposed solution to others (and having something concrete for them to give feedback on), but also to clarify his own thoughts and realize where gaps or problems lie.
This lines up even with smaller decisions; we’ve talked about architecture decision records before as a way both to flesh out an idea and communicate it, both in the present and to the future (“here’s why we did it this way - the tradeoffs and constraints we faced”) so that future knows the why’s and so therefore when it would be worth revisiting.
Write Down Your Team’s Unwritten Rules - Liz Fosslien, Mollie West Duffy
It’s very likely that my team is going to grow significantly over the coming months. That’s a huge opportunity but also a challenge - we have a pretty good team culture now and we don’t want to make any unintentional changes to that.
In addition, we’ve had a few discussions recently where it’s been clear that expectations I thought were clear - about learning, contributing to others’ work, working hours - were not at all clear to some team members, because it wasn’t explicitly stated anywhere.
As a result of both of these things, we’ve started writing up a “culture deck” to make explicit what our expectations are, and by extension what we would look for in new team members. So this article by Fosslien and Duffy comes at a good time to check that effort against. The basic idea of the article is summed up nicely in the title; by making team norms and expectations explicit (and having the team members contribute to the document!) you not only communicate them widely but make it much easier to maintain those expectations and norms.
Of course, then you have to make sure both you and the team behave according to those expectations, and provide transparency or feedback when (inevitably) you or a team member slip.
Another reminder that our sudden working from home doesn’t affect all team members in the same way. Keast, who is profoundly deaf, relies on lip reading for verbal communication in person; but that works much more poorly over videoconference where there’s a lot less information.
This article is a good reminder that while many people require various accommodations, and many of those accommodations “work”, they are still a lot harder than not needing the accommodation. In Keast’s case, he’s been relying on live auto captioning, but those tools are still pretty poor. The whole team, to their credit, chipped in and tried tools that might work - Zoom + Otter, which had decent accuracy and kept history but lagged well behind the conversation; and Google Meet, which was after but didn’t keep history.
The whole team tried using the call with sound off and only captioning a few times, and quickly got a very hands-on understanding of the challenges Keast faced with these tools. And team members have changed their behaviour, watching captions of their own speech to make sure it’s getting it “right”, defaulting to Google Meet, and generally being more understanding (and using written communication more).
Own Your Feedback (Part 1): Receive Better Feedback by Asking - Padmini Pyapali
We’ve talked about giving feedback to our team members, but we need feedback, too - from our managers, or researcher’s we’re supporting, or other stakeholders. Pyapali makes some specific recommendations for getting good feedback from others. They all involve asking, and how to ask:
Urgent computing: presentations at SC20 - Rupert Nash, EPCC
As research computing needs grow in complexity, particular “types” of research computing tooling - HPC, web applications, data integration, workflow - become just individual parts of larger-scale applications.
EPCC has two presentations at SC20 in part of the Urgent Computing stream. Nash highlights the work (with links to preprints) in this article, as part of their efforts in the Visual Exploration and Sampling Tooling project. VESTEC is building a toolchain to support real-time data ingestion integrated with simulation, sampling, and visualization to support decision making during times of disaster like wildfires.
EPCC’s work to be presented at SC includes updated workflow management system for large scale and urgent computing, based on RabbitMQ, and an extension of the Common Workflow Language (CWL) to support MPI-based large simulation.
The accelerating adoption of Julia - Lee Phillips, LWN
Julia is plagued by erratic product management, but is a super cool language. It’s lisp-like fundamentals make it very natural to write domain-specific DSLs in, and its approach to types and overloading, while by no means unique, make it easy to compose functionality.
This article shows the power of some of that composability - the DifferentialEquations and package (which is very cool) is written pretty generically, and the Plots package is pretty flexible. By introducing uncertainty in the initial conditions using the Measurements package - setting the initial conditions to be 1.0 ± 0.1 instead of 1.0 - the differential equations and plots pick up and propagate the uncertainty automatically.
This isn’t unique to Julia, but it does show the power of traits and generics, and it’s pretty darn cool. The comments on the blogpost, of course, are a tire fire.
Another thing I learned from this article is the new but still really cool Pluto notebook, specific to Julia. Pluto avoids one of the big issues with notebooks like Jupyter: the somewhat hidden state of the order of execution of cells - by making the execution reactive. Change a value in one cell and it propagates across the notebook.
It still doesn’t solve the other issues - making version control and decent software engineering practices harder - but that approach is worth keeping an eye on.
The Research Software Engineer’s Toolkit: Information and tools to support the RSE community - Jeremy Cohen, Alex Botzki, Jonathan Frawley, Nick May, David Pérez-Suárez
Cohen and all introduce the nascent “RSE Toolkit” project, which aims to provide
A set of documentation, tools and guidance to support Research Software Engineers in developing reliable, sustainable and robust code”.
This intention seems to be a lens of material and best practices specifically for research software development, curating material out there for this community. Multidisciplinary work is a priority, since there tends to already be deep expertise in specific areas.
It’s just getting started, and they’re actively looking for contributions.
Docker build example: how to go from slow to fast docker builds - Geshan Manandhar
Manandhar quickly shows how using DOCKER_BUILDKIT, along with taking advantage of cacheing as much as possible, can speed up docker container builds. Docker buildkit also allows passing secrets into the image without having them in the file, which is very convenient, but then of course you have to be careful about storing the image anywhere.
This is from earlier in the year but I didn’t see it before now. For some use cases, standing up an entire separate web app just to implement one or two simple endpoints seems like lot of developmental boilerplate and operational burden.
OpenFaaS is a a kubernetes framework for supporting these function-as-a-service (“serverless”) apps. It’s a good choice if you want to provide on-prem FaaS as a service itself to a lot of researchers, but if you just have need for a few endpoints, it’s enormous overkill.
FaaSd is a single golang binary that runs in a container and provides much of the functionality of the full-strength OpenFaaS. It’s much easier to start up and play with either as a pilot or as a small production solution (with an HA proxy in front of it).
SORSE call for contributions - 31 Oct, 8pm GMT The monthly SORSE call for talks, demos, and workshops; these are great events and worth participating in. This is likely the last call for events to take place before 2021.
Configuring Sphinx from scratch: making your own documentation and making your documentation your own - 3 Nov 15:00 – 16:00 UTC, Sadie Bartholomew
SORSE demonstration on setting up and configuring Sphinx, a mature documentation generator used widely in the Python community, from scratch.
CANOPIE HPC Workshop @ SC20 - 12 Nov
If you’re thinking of virtually attending SC20, the Workshop on Containers and New Orchestration Paradigms for Isolated Environments in HPC may be of interest. Topics include I/O virtualization, workflow execution, containers and MPI, and orchestration.
Also at SC20, a workshop on research software engineering as a career path and emerging profession, and building teams and groups dedicated to research software engineering.
Christmas-in-the-Cloud workshop - 10-11 Dec, Free
The Bristol RSE group and AWS are putting on a tutorial/hackathon on setting up a cluster in the cloud and building software defined infrastructure applications.
I’ve mentioned before that the increasing sophistication and availability of static analysis tools is great news for software developers everywhere, especially for subtle software development like in research. Semgrep is a nice example.
A nice crash course into ssh_config by the folks at gravitational, including proxy jump and a more disciplined approach to host naming so as to be able to take better advantage of host wildcarding than I have seen before.
Self-rendering markdown and LaTeX pages with texme.
A tutorial for using nix as a MacOS package manager instead of home-brew.
Hypothesis is one of many property-based testing packages, this one for python, including support for e.g. numpy. Does anyone use property-based testing in their research code? It seems like a potentially powerful approach for unit testing in particular but I don’t see a lot of use.
A nice performance debugging story on tracking down linux filesystem performance regressions.