#19 - Link Roundup, 10 Apr 2020

Hi everyone:

The past weeks have enormously increased awareness globally of some issues we in the research computing world face. The vital importance of data, the importance of sharing data, and yet the messiness of data and the fact that columns labeled “cases” or “deaths” in tables from two different sources don’t necessarily mean the same things. The importance of understandable, maintainable software, and yet the limitations of analyses and modelling.

It has also really highlighted importance of research, and research computing has shone brightly. The diversity of research computing brought to bear on COVID-19 is really an amazing demonstration of how broad the field is. You’ve probably read the “supercomputers do big simulations of spike protein” stories more than any other - HPCs gonna HPC, and those big systems have big press offices - but the much wider and honestly much more relevant research computing stories have been about high-throughput model-based screening of possible antibodies, careful pathogenomics of the evolution of the virus combined with incredibly detailed and labour intensive contact tracing, graph-based modelling of disease spread under different regimes, economic modelling, rapid deployment of sample trackers and automating pipelines, secure data sharing platforms..

This has all happened while research computing managers are trying to keep our teams, systems, and software ticking along after a huge disruption. For our team in particular, we’ve just completed our fifth week of working remotely, and we’ve sort of found our groove now. There’s new challenges coming up for us — we’ll be onboarding a new team member completely remotely for the first time — but we’re also using this opportunity to make more explicit our expectations of each other, and to document important processes.

How are things going for you and your team? Are there things you’ve tried that have worked really well, or things that didn’t work at all? Any interesting challenges you’ve faced? Do me a favour and reply to this email and let me know, and let me know if it’s ok to share what you’ve done (anonymized if you prefer) with our newsletter community here.

Meanwhile, on with the week in link roundups…

Managing Teams

What to do when you’re feeling a lack of motivation at work, as a manager? - Claire Lew, Know Your Team

I hope that you and your team, like ours, has been fortunate enough to find a smooth transition to this new normal and that you’ve found your footing. But even if so, this is a trying time and it’s easy as a manager to start getting burned out — maybe even easier now once the adrenaline has slowed down and some sense of routine is re-established.

Lew goes into more details in each of these, but the suggestions in the article are:

Create a wedge - carve out a small chunk of the day to relax and take a break.
The weight is heavy – and you’re allowed to admit it.
Reconnect to the aspects of the job that energize you.
Seek out the people who benefit most from your work.
Step Away.

The times aren’t normal, and it’s ok for things to feel overwhelming. Putting extra effort into helping our team members get through this while still having to get our own work done, handle extra family responsibilities, and deal with what’s going on in the world, is exhausting. Taking some time for ourselves, appreciating the research successes we’ve helped with, and just realizing that we’re stressed and tired can all help a little bit.

Data Engineering Teams Book - Jesse Anderson

I’ve commented before on the analogies between data engineering and research computing as a whole - both on the basic day-to-day of the job, and on how a data engineering team fits into a larger organization.

This ebook talks about building data engineering teams; I’m not sure there are great surprises inside, but it’s short read and you don’t get spammed if you give them your email. It’s cemented my opinion about the relationship between research computing and data engineering (Anderson points out how different data engineering is from the rest of software development by telling more general purpose software developers about unique things in data engineering like “jobs that take 10 hours to run” or “processing terabytes or petabytes of data at a go”; yes, heavens, imagine.)

Anderson’s list of necessary skills for a data engineering team are also a pretty good list of skills to keep in mind when building a research computing team:

Distributed systems
Programming
Analysis
Visual communication
Verbal communication
Project veteran
Schema
Domain knowledge

Obviously depending on the task of the research computing team some of those might be less or more relevant and there may need to be others, but I think it’s a solid starting point. The communications and ‘project veteran’ pieces are I think too often overlooked in favour of the more technical skills.

A system for quality one on ones - Fabian Camargo

In my recent blog post I outlined a guide to getting started with one-on-ones; but once you’re started (with one-on-ones or really any managerial practice) you need a system which (a) works for you, (b) allows you to get the most out the practice, and (c) allows you to look back and improve the process. This is a brief discussion of his system.

Research Software Development

Why Covid-19 has resulted in New Jersey desperately needing COBOL programmers — Steve Mollman, Quartz

We’ve all read about the enormous upturn in unemployment claims causing unprecedented loads on governmental software systems, leading to the New Jersey governor putting out an urgent call for COBOL programmers.

I think there’s a lot to think about here for research software, where we often talk about “software sustainability” and wish for reproducibility(*) of computational results over long timescales.

These governmental systems have often been chugging away with occasional tweaks for 40-50 years — an almost unfathomable success in the world of software.

There’s been a lot of talk about the “obsolescence” of COBOL, which strikes me as besides the point and probably wrong; COBOL is manifestly a perfectly serviceable domain-specific language for the sorts of things COBOL programs are asked do, and if (as now) there’s ways to connect to tools and services written in other languages, the lack of (say) a vibrant and rapidly-evolving ecosystem of third-party packages isn’t necessarily a bad thing. (How innovative and cutting-edge do you want the system that sends out your pension cheques to be, exactly, when the time comes?)

But there is no such thing as infrastructure which does not require maintenance, and the need for that maintenance is often quite bursty. This applies in research, too. Research software which goes into production needs to be written in a maintainable fashion, and there has to be funding support to continue maintaining that software, likely in bursts.

We genuinely still don’t know how to do that kind of funding in our research communities. Our funding models are all built around supporting experiments, observations, or theoretical works - short-term projects which start, proceed, result in publication(s), and are then done. The publications then consumed and then inform new, distinct, experiments, observations, or theoretical works, and so on. If experimental methods work develops new kinds of equipment or reagents which are useful to other researchers, then a vendor starts manufacturing and selling those items to researchers, with money that comes out of their grants - and that’s the sustainability model. We don’t have that for ongoing efforts in software (or shared hardware) yet.

PS - Udemy has cut the price of it’s 9.5 hour COBOL course 90%…

(*) An aside, because this is a running complaint I have: When software sustainability and reproducibility comes up, too often, the discussion takes the a degraded and simplistic view of reproducibility, where it becomes merely repeatability: the ability to just press the “go” button on the machine again and have the same numbers tumble out the slot. This is a github issue “can’t reproduce” meaning of reproducibility, not the scientific journal “can’t reproduce” failure of reproduction of a meaningful scientific result; conflating these two is a pun, not an advancement of scientific rigour. Handing off tools and analyses to others which can be easily re-run is a massive productivity win for research but doesn’t directly imply anything about correctness of the deeper results.

Even Faster Math Functions - Robin Green

In an update from Green’s 2002 Game Developers Conference talk, a longer talk about the state of the art in fast numerical computation, including recent work using FPGAs and fixed-point arithmetic. The talk is in the context of game development; but with high performance demands and often the resource constraint of embedded devices, this area has a lot in common with high performance computing needs. Interesting work.

To Catch a Failure: The Record-and-Replay Approach to Debugging — Robert O’Callahan, Kyle Huey, Devon O’Dell, Terry Coatta

An interesting discussion with the authors of the rr debugger, a C/C++ debugger which allows the capture and replaying of failures (including a chaos mode to try to make intermittent failures more easily captured), covering how it was built, the challengins

Emerging Data & Infrastructure Tools

CloudAcademy - Four free courses until June 1

CloudAcademy is offering four of it’s more basic cloud computing classes for free until June; it’s a good starting point if anyone on your team has been looking to get started with AWS, Azure, or building a simple webapp.

Building Secure and Reliable Systems - Google

Google’s books on software reliability engineering, as well as the older “Datacentre as a Computer”, are great reads to better understand how and why Google and other hyperscalers operate as they do, and one can walk away from them with several useful ideas, but they aren’t chock full of things we can actually use right away in research computing with smaller teams and less onerous SLAs.

This book seems to be a bit more widely applicable in our world — it looks at implementing security as part of the software reliability engineering approach. People who write user-facing software as well as people who operate systems have to juggle both reliability and security, and while I haven’t had a chance to do more than start this book, it seems like it follows the same pragmatic approach as the earlier ones in the series.

FROM: latest - an opinionated Dockerfile linter - Replicated

A new web app and an open source tool which provides correctness and maintainability suggestions for your Dockerfile.

Hyperscalers Set the Pace for 800Gb/s Ethernet - Timothy Prickett Morgan, The Next Platform

I try not to get sucked into FLOPS and bytes and Gb/s in this newsletter, but I think it is interesting to see Ethernet march along, as commodity infrastructure continues to be pushed forward not by the HPC community as it was up to the early 2000s but by the hyperscalers like Google and Amazon (and, increasingly, Azure).

It’s interesting to see whether there can really be a sustainable market for a separate, niche, high-performance low-latency interconnect like Infiniband as ethernet marches towards 1 TB/s. It’s absolutely true that infinband has much lower latency, but:

Predictable low latency is much more important than somewhat unpredictable much lower latency, according to Google and other hyperscalers, …

and it’s often possible to choose to tradeoff latency for bandwidth in research computing applications.

Conferences

US Research Software Engineering Association Virtual Workshop - 22-23 April 2020

A virtual workshop (via Zoom) by the US Research Software Engineering Association on RSE as a whole and some technical talks on what US-RSE associated RSEs are working on.

From the website:

This workshop will be held in three online sessions:

The RSE Landscape - Reports on RSE groups and activities, with discussion (Wednesday 4/22, 12:00-1:30 PM EDT)
Technical talks - Short talks about projects that RSEs are working on (Wednesday 4/22, 3:00-4:30 PM EDT)
Next steps for the US-RSE organization - Breakout sessions to discuss ideas. What do you want from US-RSE? (Thursday 4/23, 12:00-1:30 PM EDT)

FortranCon 2020 - 2-4 July 2020

Keynoted by Steve “Dr Fortran” Lionel, this virtual conference has two days of talks and a half-day of workshops. The call for abstracts is still open.

Random

Research computing teams aren’t the only infrastructure providers feeling the crunch these days - spare a thought too for the folks working at Slack or Zoom, or at Microsoft where they are supporting Teams usage up by 775% in some regions, 3x usage of Virtual Desktop usage worldwide, and 70%+ increase in BI dashboard usage in the space of weeks.

When all this is said and done, a lot of people are going to try using the sudden increase in remote work and virtual teaching as a kind of “natural experiment” to study the effectiveness of each; just a reminder that this “emergency remote work” and “emergency virtual education” as opposed to well planned and designed rollouts, and any such comparisons should be studied with that in mind.

A tool to do for tables what “grammar of graphics” has done for 2-d plots.

Concurrency, I/O, and systems is still hard - Trying to be too (io)nice created a ‘killer directory’.

A quick primer on introductory network debugging with mtr, iperf, and ping for those of us who suddenly find ourselves network admins for our new workplaces.

An article about differential analyzers: mechanical, continuous analog computers for solving differential equations. A quote from the article: ‘[they were] supposed to address what Bush described, in a 1931 paper about the machine, as the contemporary problem of mathematicians who are “continually being hampered by the complexity rather than the profundity of the equations they employ.”’ Replace “mathematicians” and equations they employ” with the more general “researchers”and “systems they study” and that sounds like a good rallying cry for research computing in general.