Thanks for following along through the week. I don’t think this first round of through-the-week emails worked as well as I had hoped, so I’ll pause that for a week or two until I find a format that works better - thanks for your patience and, as always, any feedback.
On the other hand, I feel like the link roundups are coming out pretty well, so I’ll certainly be continuing them. This week: exploding rockets, strangling old code, and how to improve your technical skills in just three horrifying steps.
One of the challenges of leading research computing teams that I mentioned this week was having to stay on top of technical and management work, while also understanding the relevant domain-knowledge use cases. Training people up in those two areas at the same time is an issue too; the hiring pool of people who already know both of those areas well is small, and much smaller than the number of people who know one or the other and are willing to learn.
In technology companies it’s becoming more common to split technical manager responsibilities into “people leadership” and “technical leadership” career paths; the technical leads have increasing levels of technical decision-making responsibility, and help mentor and train junior colleagues, but aren’t responsible for managing. A discussion of why and how is given in this blogpost and description of the paths by Camille Fournier, and some companies have developed detailed training/mentoring plans and courses to help their junior technical staff progress along the career path. These two posts by Keavy McMinn describe what it looks like, day to day, to be a technical leader at GitHub.
In research computing we don’t have large teams, so we tend not to make such formal distinctions, but informally there are certainly team members we implicitly or explicitly trust with such technical decision making. Does your team formalize that at all? Is a technical leadership path worth trying to develop, or does it require HR support and increasing pay scales to make it work (the way it happens at tech firms)? The modest promotion possibilities in research computing and the lack of alternate career paths have been pointed out before as a hindrance to hiring and keeping staff. Have you seen any research computing teams go this way?
The Ariane 5 Failure: How a Huge Disaster Paved the Way for Better Coding - Sooraj Shah, Welcome to the Jungle
In 1996, the first Ariane 5 launch ended in flames just 37 seconds after takeoff when the (unmanned) craft rapidly and for no obvious reason flipped 90 degrees and then exploded. It was a software error, resulting from the conversion of a 64-bit floating-point value to a 16-bit signed integer. The code had come in part from the control systems of the Ariane 4, where computational resources had been more scarce.
Although the Ariane 5 project went down in history as a monumental failure, the code was well written and a very good software engineering process had been followed throughout. […] Perhaps the fundamental issue was that the whole failure arose from the requirements that had been set.
Maybe more surprisingly, the catastrophic result wasn’t due to the overflow itself. The fault-detection software charged with checking for errors and potentially switching to secondary systems misinterpreted the overflow and believed it had been the result of hardware failure on the rocket itself.
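To make the conversion problem concrete, here is a minimal Python sketch of narrowing a 64-bit float to a 16-bit signed integer. It is not the flight code; the function names and the input values are hypothetical, chosen only for illustration. The unchecked version silently wraps around, while the checked version fails loudly and leaves it to the caller to decide what an out-of-range value means.

```python
INT16_MIN, INT16_MAX = -2**15, 2**15 - 1


def to_int16_unchecked(value: float) -> int:
    """Truncate toward zero, keep the low 16 bits, reinterpret them as signed."""
    low_bits = int(value) & 0xFFFF
    return low_bits - 0x10000 if low_bits > INT16_MAX else low_bits


def to_int16_checked(value: float) -> int:
    """Truncate toward zero, refusing values a 16-bit signed integer cannot hold."""
    truncated = int(value)
    if not INT16_MIN <= truncated <= INT16_MAX:
        raise OverflowError(f"{value!r} does not fit in a 16-bit signed integer")
    return truncated


# Hypothetical readings: one within the 16-bit range, one outside it.
in_range, out_of_range = 1234.5, 40000.0
print(to_int16_unchecked(in_range), to_int16_unchecked(out_of_range))
```

The point of the checked variant is only that the failure becomes visible; what the surrounding system then does with that failure is a separate design question, and that is where things went wrong here.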
But it shouldn’t be surprising; other studies, more terrestrial and closer to our domains, have seen this too. Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-intensive Systems, Yuan et al. (2014) took a look at almost 200 user-reported major failures of data-intensive services, and found that
…the majority of catastrophic failures could easily have been prevented by performing simple testing on error handling code – the last line of defense – even without an understanding of the software design.
It sort of makes sense - it’s harder to regularly and realistically test error-handling code in the same way we test “happy-path” code; but Yuan et al. found a small number of common patterns which could be tested for. In research computing especially we’re often so focussed on getting the important stuff right that it can be easy to let the edge cases slide - and recovering from errors is really tough.
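As a small illustration of what simple testing of error-handling code can look like, the Python sketch below injects failures into a hypothetical `load_config` function and checks that the handler does something sensible in each case. Nothing here is taken from Yuan et al.; the function, its fallback behaviour, and the file name are assumptions made up for the example.

```python
import json
from unittest import mock

DEFAULTS = {"threads": 1}


def load_config(path: str) -> dict:
    """Read a JSON config file; fall back to defaults only if the file is missing."""
    try:
        with open(path, encoding="utf-8") as handle:
            return {**DEFAULTS, **json.load(handle)}
    except FileNotFoundError:
        # Deliberately narrow: an unreadable or corrupt file should still fail loudly.
        return dict(DEFAULTS)


def test_missing_file_falls_back_to_defaults():
    # Inject the failure instead of relying on the state of the real filesystem.
    with mock.patch("builtins.open", side_effect=FileNotFoundError):
        assert load_config("settings.json") == DEFAULTS


def test_unreadable_file_is_not_swallowed():
    with mock.patch("builtins.open", side_effect=PermissionError):
        try:
            load_config("settings.json")
        except PermissionError:
            return
        raise AssertionError("PermissionError was silently swallowed")
```

The second test is the more interesting one: it guards against the handler quietly growing into a catch-all that hides real problems.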
That’s what I was thinking about when I read The Let It Crash Philosophy Outside Erlang a few weeks ago; Erlang is a language for massively concurrent soft real-time systems developed in the mid-80s for telecoms. Erlang is maybe most famous for “let it crash”; code is made of small units that very much don’t handle errors. If they fail, they crash, and it’s up to the runtime to sort things out - that’s one of the runtime’s main purposes, and handling it in one place can be easier than having every component handle it separately. The blog post points to best practices (like set -euo pipefail) for bash scripts to hard fail as soon as possible; the sort of error handling that permits error recovery should probably wait until the code is nearing the “research infrastructure” phase of development, and even there thought should go into whether to handle errors in components or in the glue between components.
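The same division of labour can be approximated outside Erlang. The sketch below is a toy in Python, not a description of how the Erlang runtime works: a worker that does no error handling of its own, and a hypothetical `supervise` function that restarts it a bounded number of times. All names and the restart limit are assumptions for illustration.

```python
from typing import Callable


def supervise(task: Callable[[], int], max_restarts: int = 3) -> int:
    """Run `task`; if it crashes, restart it, giving up after `max_restarts` restarts."""
    last_error = None
    for _ in range(max_restarts + 1):
        try:
            return task()
        except Exception as error:
            # All recovery policy lives here, in one place, not in the worker.
            last_error = error
    raise RuntimeError(f"gave up after {max_restarts} restarts") from last_error


def make_flaky_worker(failures: int) -> Callable[[], int]:
    """Build a worker that crashes `failures` times and then succeeds."""
    remaining = iter(range(failures))

    def worker() -> int:
        # No try/except here: on a bad state the worker simply crashes.
        if next(remaining, None) is not None:
            raise ValueError("transient failure")
        return 42

    return worker
```

For example, `supervise(make_flaky_worker(2))` survives two crashes and returns the worker's result, while a worker that keeps failing past the limit surfaces as a single `RuntimeError` at the supervisor.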
Source Code has a Brief and Lonely Existence - Derek Jones
Derek Jones has an interesting blog where he takes data-driven looks at software and software development. Here he points out:
The majority of source code has a short lifespan (i.e., a few years), and is only ever modified by one person (i.e., 60%).
I think this is worth coming to terms with, particularly in terms of research computing and tool maturity. Most new ideas, as they get put into code, will stall out at the proof-of-concept or prototype stage. That’s good! It means lots of new ideas are being tried and then being passed up in favour of more promising ones. But it also means that effort going into proper “software engineering” (like non-crash error handling!) too early is effort likely wasted.
This is consistent with a blog post from last year, Software Is about Developing Knowledge More than Writing Code, by Li Hongyi; the idea is that the real value of code is in what is learned (by the team and by those the team disseminates the knowledge to) more than in the code itself. Therefore, it helps to:
(Another good lesson from that one: Software is limited not by the amount of resources put into building it, but by how complex it can get before it breaks down)
Avoid Rewriting a Legacy System from Scratch by Strangling It - Nicolas Carlo
Because the value of the code is what was learned from writing it rather than the code itself, when it comes time to grow from earlier maturity stages to the next, the recommendation from other sectors is to migrate away from the earlier code, not to refactor a proof of concept until it somehow becomes production quality. (See also Keavy McMinn’s recommendation to throw away code from a proof-of-concept “spike” in her second blog post, or this from Will Larson and his book “An Elegant Puzzle”). Here, the author proposes a specific approach, similar to the one Will Larson uses: phase out particular key modules of functionality one piece at a time, replacing them with the new pieces. In some cases that’s easier than others - if your original code is very tightly coupled, the amount of refactoring you might have to do to make that possible might not be worth it.
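One way to picture the piece-at-a-time approach, as a minimal Python sketch rather than anything taken from the article: a thin facade sits in front of the legacy code, and each piece of functionality is switched over to its replacement only once that replacement exists. The class, the function names, and the toy "legacy" routines are all hypothetical.

```python
from typing import Callable, Dict, List


def legacy_mean(values: List[float]) -> float:
    return sum(values) / len(values)


def legacy_normalize(values: List[float]) -> List[float]:
    top = max(values)
    return [v / top for v in values]  # divides by zero on all-zero input


def new_normalize(values: List[float]) -> List[float]:
    """Replacement with the same interface that also copes with all-zero input."""
    top = max(values)
    return [v / top for v in values] if top else list(values)


class StranglerFacade:
    """Route each named operation to new code if migrated, else to legacy code."""

    def __init__(self, legacy: Dict[str, Callable]):
        self._legacy = dict(legacy)
        self._migrated: Dict[str, Callable] = {}

    def migrate(self, name: str, replacement: Callable) -> None:
        if name not in self._legacy:
            raise KeyError(f"unknown operation: {name}")
        self._migrated[name] = replacement

    def call(self, name: str, *args):
        handler = self._migrated.get(name, self._legacy[name])
        return handler(*args)


facade = StranglerFacade({"mean": legacy_mean, "normalize": legacy_normalize})
facade.migrate("normalize", new_normalize)  # "mean" still runs the legacy code
```

Callers only ever talk to the facade, so the legacy routines can be retired one by one without a big-bang switchover. As the text notes, this only works if the old code can be cut along such seams in the first place.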
The Four Pillars of Research Software Engineering - Cohen et al.
In this article, we have presented 4 pillars representing a set of areas and activities that we consider to be vital in offering coordinated and comprehensive support for research software and the people who build it. In turn, we hope this will demonstrate to professional developers and researchers alike that research is a viable, and interesting, environment for a software development career.
In this white paper, the authors present what they see as the key pillars of support for research software development (which I will interpret more broadly to include research computing writ large):
I think this is a good resource for informing discussions with institutional and funding-agency decision makers about what modern research computing requires. The authors also point to a number of emerging national and sub-national efforts around the world to address these pillars, including groups in France, Germany, and the Nordics that I hadn’t known about.
Research computing teams do a lot of training both internally and with their research communities. Software Carpentry has helped a lot in getting researchers comfortable with the basics, and often there’s lots of more advanced material available, but for researchers or even recent CS grads there’s a “missing middle” which is only just starting to get filled in - filling the gaps from “can do some stuff in the terminal” to “actually productive, and using things like GitHub”.
These are two useful resources which address that missing middle ground. The first directly targets new grads with productivity tools around editors and using git and GitHub; the second was a winter school for research trainees, which includes GitHub too but also Python packages and the like. These sets of resources may be useful to point people to for self study, or even to supplement our own training efforts internally.
Distributed HPC Applications with Unprivileged Containers - Felix Abecassis & Jonathan Calmels, NVIDIA
I’m still catching up on FOSDEM; in this talk, the authors from NVIDIA talk about their own internal HPC system and how they developed a new very small and lightweight “container” tool suite called enroot to allow chroot-style packaging of applications with optional process namespacing. It can use Docker images and squashfs, and has a plugin for Slurm.
In cloud environments NVIDIA uses Docker and Singularity containers and even provides Docker containers for their GPUs and software, so it’s interesting to see they rolled their own solution internally. I understand that HPC-style clusters’ security model is too brittle to support Docker, but it’s not clear to me what the main advantages of enroot are over Singularity, other than that the enroot devs plainly don’t like Singularity. From this one GitHub issue where it’s discussed, the one unambiguous advantage I can see is that unprivileged users can build/add to an image without needing setuid/setgid images installed, which... fair enough, but that doesn’t seem like much of a show-stopper if you think (as I do) that developers should have control over their own workstations.
Regardless, both enroot and Singularity are much simpler solutions than previous “Docker for old-school clusters” tools like Charliecloud or Shifter, so maybe between repeated iterations and more open container standards and tooling to build on, we’re slowly converging on small lightweight approaches to containerized workflows on these environments.
In this paper, and accompanying Twitter thread, the authors describe and provide their code for running a standard atmospheric science simulation on AWS on up to 1000 cores. (The first author’s blog is excellent on the details of doing such work.) Beyond the simulations themselves, they do a significant amount of benchmarking, measuring in the literature, for the first time that I’m aware of, the noticeable improvement that AWS’ new Elastic Fabric Adapter (EFA) makes. They also notice that IntelMPI is significantly faster than OpenMPI, and their microbenchmarks suggest some ways that code could be tuned for this platform.
Maybe most interestingly, they find that at modest core counts AWS costs no more than the equivalent use on NASA’s Pleiades cluster; earlier work (pre-EFA) by a NASA team had found factors of 1.5-1.9x cost differences.
Speaking of capabilities of new hardware - following up on last week’s paper about non-volatile memory and using it as disk cache, this new paper describes using Optanes as RAM or as DRAM cache for exemplars of “the seven dwarves” of numerical scientific computing. Depending on the locality of access, many codes perform quite well in this mode compared to using the equivalent amount of RAM. The authors propose specific tuning advice to improve the performance of various access patterns, particularly around write accesses (which are significantly slower).
Relatedly, this paper published a few days earlier (did an embargo just lift?) by researchers at Japan’s National Institute of Advanced Industrial Science and Technology (AIST) did microbenchmarking of Optane systems and found:
… latency of random read-only access was approximately 374 ns. That of random writeback-involving access was 391 ns. The bandwidths of read-only and writeback-involving access for interleaved memory modules were approximately 38 GB/s and 3 GB/s.
For comparison, those are latencies of only about 4x DRAM, and bandwidths of about 37% (read) and 8% (write) of DRAM. Taking numbers from their table 2, and crudely estimating some similar numbers from a recent review of a high-end SSD, we find the Optane fitting interestingly in between the two, which to me suggests there’s room for research computing uses more intricate than “use it as disk” (last week) or “use it as RAM” (this week) as people get more familiar with the technology, although these efforts are extremely important for benchmarking such work.
| | DRAM | Optane | SSD |
|---|---|---|---|
| Read Latency | 94 ns | 374 ns | 1333 ns |
| Write Latency | 96 ns | 391 ns | 17000 ns |
| Read Bandwidth | 101 GB/s | 38 GB/s | 3.3 GB/s |
| Write Bandwidth | 37 GB/s | 3 GB/s | 3 GB/s |
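The DRAM comparison quoted above can be reproduced from these figures with a few lines of Python. This is only the arithmetic; the dictionary keys are labels chosen for this sketch, and the first set of values is taken to be the DRAM figures and the second the Optane figures.

```python
# Figures from the table above: latency in ns, bandwidth in GB/s.
dram = {"read_latency": 94, "write_latency": 96, "read_bw": 101, "write_bw": 37}
optane = {"read_latency": 374, "write_latency": 391, "read_bw": 38, "write_bw": 3}

# Latency: how many times slower than DRAM.
latency_ratio = {
    key: optane[key] / dram[key] for key in ("read_latency", "write_latency")
}

# Bandwidth: what fraction of DRAM bandwidth is achieved.
bandwidth_fraction = {key: optane[key] / dram[key] for key in ("read_bw", "write_bw")}

for key, ratio in latency_ratio.items():
    print(f"{key}: {ratio:.1f}x DRAM")
for key, fraction in bandwidth_fraction.items():
    print(f"{key}: {fraction:.1%} of DRAM")
```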
A new book on Naming Things, and an upcoming blog. The main focus is on software and code but I think we could use better naming of things all the way through research computing…
AWS has put all of its research-computing related materials in a new place - amazon.science. One of the big advantages AWS still has over say Azure is that over the years they (and third parties) have written up so much relevant material; it’s helpful to have more of it available in one place.
As more research computing involves internet of things for sensors and data gathering, privacy starts to become increasingly important. This recent article points out that while privacy is full of hard questions, it’s not difficult to start thinking about data and privacy in a principled way and making sure the choices being made are reasonable.
This Twitter thread makes the argument that if packaging seems clumsy and awkward now - let’s face it, the main reason containers are a thing in research computing is that other software package management doesn’t work - it’s because packaging is in the place that linking and loading were in the mid-80s, before dynamic linking. The difference is that we’re trying to describe views of filesystems, not of memory. I’m skeptical of this - versioning in linking/loading is much coarser - but it’s a fun argument.
Tweet old-school BASIC at this BBC-Micro bot and get screenshots back.
And finally, this post from Michael Malis talks about four exercises he goes through to become a better programmer. The first three are straightforward enough - read a paper, learn a new tool, read part of a book. In the fourth, he records screencasts of his own programming, and later plays them back to himself to take notes. He uses that to learn from mistakes or inefficiencies he’s made, and do better. I am completely confident that if I did this it would be (a) absurdly effective at improving my practice at almost any technical skill, and (b) a wretchedly cringe-inducing experience. Watching video of myself giving a talk is bad enough. Less humiliatingly, Martin Winkel points out the benefits of learning a new language for improving your skills in one you already know - in this case, how Elm helped him become better at Python.