#82 - 10 July 2021

Hi, everyone!

I hope you’re having a good weekend, and that the past week was good for you and your team.

I’m working (with some help) on tidying up and improving the material from the hiring and performance management series from the newsletter at the start of the year; I’d like to turn those into expanded and more coherent writeups that can help other research computing team leaders (or those interested in becoming one). Other topics that came up when last I asked for suggestions were:

Strategy
Product management
Peer mentorship
Levelling & promotion in research computing

Let me know if there are others you’d like to see, or questions or input you have - just hit reply, or email me at jonathan@researchcomputingteams.org.

And now, the roundup:

Managing Teams

Why “start bringing solutions, not just problems” doesn’t work - Lara Hogan, Wherewithall

We’ve talked often about why it’s best for you and your team members for you to avoid trying to solve problems directly for them, and instead help coach them to find a solution yourself. But how do you do that well, in a way that makes them comfortable coming to you raising issues?

Hogan has some suggestions for questions to ask:

What could we try?
If you could wave a magic wand, what one thing would you change?
What, deep-down, do you truly want?
What do you need?
What’s in the way?
What’s holding you back?

and suggests giving feedback on approaches if that doesn’t work. She also cautions against trying to help the team member come to a solution if they’re just venting about something; venting is natural, we all do it sometimes, it’s even good in a way if they’re comfortable venting with you, so you want to allow it but without necessarily letting it spiral out of control or happen regularly.

Hogan also includes some approaches for if your manager tells you something similar, including looking for other people to raise the issue with.

Good Meetings - Sara Drasner

The purpose of the meeting is clear
There’s an agenda (we’ll dive in to the complexity of this in a moment)
There are the right people in the room. Not too many where communication is overly complicated, not too few where the people you need to move forward aren’t there.
There’s some order. People aren’t dropping in and out, talking over each other, or being generally inconsiderate
There’s a clear decision, outcome, and next steps at the end

Some points Drasner makes that I think are particularly useful in our context include the fact that purpose isn’t the agenda, the agenda is a tool that serves the purpose; that different kinds of meetings need to be run different ways; that too many people in the meeting is a common failure mode; and that respectful conflicts are good, and that people not saying things that need to be said are bad.

She also suggests meetings can be started and finished with sentences that sum up:

The shared purpose
What you’re doing about getting to that outcome
Who is owning what
How
When, what are the timelines

Three Core Ideas to Make Remote Work, Work - Cate Huston

Huston has been working remote for over five years, and for those of us getting ready to continue working remote for real and without a pandemic driving it, suggests three key approaches:

Embrace async.
Enable autonomy.
Build connection.

Crucially, these are all team-enhancers for in person teams, too:

Even for team members all going to the same office, meetings can be hard to schedule, unnecessary meetings are bad, and having written documents outlining how decisions came to be are good and helpful
Autonomous teams where individuals have significant control can respond faster and more robustly, especially where there’s documentation to support that
Office culture helps build connection but even there being intentional about how its done makes the connections stronger

Managing Your Own Career

Research managers are essential to a healthy research culture - Nature Editorial

Ours isn’t the only somewhat undervalued research supporting profession starting to get more organized and advocate for more recognition and resources. Research managers, often found in the sponsored research office under a VPR, are also beginning more consistent advocacy work. Such managers are also a somewhat under-appreciated resource at many institutions, and can help with agreements and planning.

Research Software Development

Software for research communities - UKRI
Ouvrir la science - French Ministry of Research

Some exciting things going on in the funding and recognition of research software in the UK and in France.

In the first item, a modestly scoped initiative but with real money available, UKRI has issued a funding call (totalling £4.5m) for research software quite generally, for projects within the remit of ESPRC: funding can be used for

development and re-engineering of existing software
maintenance activities
activities that widen participation in development and maintenance.

Rather than just being a one-off, this appears to be the result of extensive groundwork by the society for RSE and ESPRC, so hopefully this is the beginning of regular calls (and calls involving other councils like NERC).

In France, a more ambitious program but one so far which is only a document. The Ministry is announcing a significant push towards “Open Science” quite generally; the third of four themes concern software, in particular recommendations exist to:

Showcase and support the dissemination under an open license of source code from publicly funded research programmes.
Highlight the production of source code from higher education, research and innovation.
Define and promote an open software policy.

This includes strongly recommending open source license for funded research software, producing recommendations for funding agencies to better support software development, funding open software prizes, providing greater recognition for researchers who develop software, and building a catalogue of such software.

Julia Evans, who produces consistently excellent software examples covering a wide range of topics for her blog, describes her approach to creating meaningful examples that don’t seem contrived - start with real code and simplify.

Ownership of one’s work is a good thing, but individual “ownership” of a code base in a team can get out of hand, as Britton Broderick points out. It’s easy to see when it’s happening, but avoiding it takes a delicate balance between taking advantage of specialization and what can evolve into gatekeeping/”hands off my stuff” behaviour.

ExCALIBUR, the UK’s exascale effort, released a review on RSE (Research Software Engineering) skills for HPC, and what will be needed for training efforts in the future.

Research Data Management and Analysis

Write a time-series database engine from scratch - Ryo Nakao

You probably shouldn’t, of course, write a time-series database (or any other kind of database) from scratch. But when someone does, to meet their own very specific needs, reading about it is a great opportunity to learn from their experience. For time series databases in particular (which have a lot of applications for research - e.g. for sensor readings or system metrics), if you’re not familiar with them, this is a great way to get a sense of how they work.

Time-series databases are high-volume, append-only, time-indexed data stores where the most common query is to look up values in a window of time - and for many use cases recent windows will be queried much more often than windows further in the past. (More advanced use cases, like anomaly detection, aren’t covered in this post). Because input is typically happening somewhat regularly, and (hopefully!) data is being sampled faster than the data is changing, there are a lot of opportunities for compressing the data stream as well.

In this article, Nakao walks us through the basics of his implementation, tstorage, and how he ensured that data points arriving out of order are handled, the use of a write-ahead log for persistence, metadata handling, and a compressed encoding.

Research Computing Systems

USENIX LISA2021 Computing Performance: On the Horizon - Brendan Gregg

Gregg, who appears pretty often on the newsletter, gave a talk on his predictions for the future of computing performance - and monitoring and improving performance on new architectures (like ARM, new memory technologies like DDR5/HBM/persistent memory, accelerators, interconnects, operating system features (Linux’s kyber scheduler, io_uring), lightweight VMs and observability. The talk in both slides and video recording form are available embedded on the page.

It’s a pretty comprehensive talk for its 41 minutes running length, and probably a pretty good place to get a sense of where one pretty knowledgeable practitioner thinks things are going. Some noteworthy predictions from my point of view:

Extra tiers of memory are coming too late to be useful as new memory and storage technologies remove the huge gap between main memory and storage devices
BPF as a kernel “JIT” for more adaptive performance tuning

OpenZFS 2.1 is out—let’s talk about its brand-new dRAID vdevs - Jim Salter, Ars Technica

Ars’ technical articles are typically detailed deep-dives, well researched and clearly written, and this article is no exception. Salter

Back in #36 we talked about OpenZFS’s vdevs - virtual storage devices consisting of potentially multiple disks - pools of vdevs, and RAID-Z, a more dynamic version of RAID-5 (RAID with parity) except that the parity data is dynamically distributed per-write rather than in a precomputed way across disks, in a way that requires and benefits being integrated with the filesystem.

That’s quite performant with writes, and RAID-Z’s additional checksumming has the advantage of being able to detect silent data corruption; but it makes for extremely slow rebuilding (“resilvering”) of the RAID.

dRAID - integrating logical hot spare “disks” (spare blocks) and moving back to fixed raid stripe widths - has the write-time advantages of RAID-Z but much faster resilver times.

This is a complex beast - OpenZFS has been working on this since 2016 and it just came out now in OpenZFS 2.1 - and is aimed at large (~90 disk) storage systems. Salter doesn’t recommend adopting dRAIDs right away, but it’s worth keeping an eye one.

Emerging Technologies and Practices

Running GATK workflows on AWS: a user-friendly solution - Michael DeRan, Chris Friedline, Jenna Lang, Netsanet Gebremedhin, and Lee Pang, AWS for Industries Blog

The nice thing about being a late-early adopter or early-early-majority of a technology is that a lot of legwork has already been done. We’re arguably on the third generation of “running genomics pipelines in the cloud” attempts; each step is an increase in usability and often a decrease in cost, and with infrastructure-as-code, groups are generously contributing scripts to start up their entire described infrastructure from scratch at the click of a button.

AWS has a reference “run a best-practices GATK genomics pipeline using Cromwell and AWS batch” setup and I can say from experience that it’s incredibly brittle and finicky.

In this post, the authors describe (and provide code for) running short-variant and joint-genotyping workflows on AWS with Nextflow and AWS Batch. They also provide relative costing and runtime for on-demand and spot images, broken down by instance type (c5/m5/r5) and runtime storage type (Elastic Block Store and FSx for Lustre). In either case, data is staged out of S3 to the runtime storage, and then the Nextflow pipeline breaks the data up into shards and runs the GATK pipeline in containers on AWS batch, whence the data is staged back out.

Some fun results I hadn’t expected:

FSx is pricey, but the short variant calling workload is compute-time bound (in terms of dollar cost), so the increase in speed provided by FSx dramatically reduces cost for on-demand instances (though its a wash for spot instances). For joint genotyping, it’s less compute bound so the pricier FSx shows up in the cost.
Even though it’s compute bound, the instance types make less of an effect than I had expected, especially for spot instances

IoT for Beginners - A Curriculum - Jen Fox, Jen Looper, Jim Bennett, Nitya Narasimhan, Microsoft Learn

As readers will know, I think IoT infrastructure will eventually be useful for certain research applications, letting researchers collect sensor data more easily, leaving fewer components that have to be created from scratch by the project’s team members.

Here, Fox et al have a 12-week, 24-lesson course on IoT involving six different projects. You can buy ~$100USD hardware kits to implement the hardware yourself as you follow along, or use provided “virtual hardware” on your regular development machine.

Calls for Submissions

Workshop on Programming and Performance Visualization Tools (ProTools 21), part of SC21, - Submissions due 20 Aug

Some SC21 workshops still coming together:

The Workshop on Programming and Performance Visualization Tools (ProTools) intends to bring together HPC application developers, tool developers, and researchers from the visualization, performance, and program analysis fields for an exchange of new approaches to assist developers in analyzing, understanding, and optimizing programs for extreme-scale platforms.

Events: Conferences, Training

Supercomputing Frontiers Europe 2021, 19-23 July, Free

A pretty wide range of HPC topics from the traditional (weather and materials modelling) to self-service HPC and edge computing.

EuroPython 2021 Trainings - 26 & 27 July, €195 as part of EuroPython 2021

Several interesting sessions here, including Climate data analysis with xarray and cartopy, data analysis with pandas, introduction to property-based testing, and python packaging demystified.

Strange Loop - 30 Sept - 2 Oct, In person in St Louis but $175 for virtual attendees

Strange Loop is a widely known eclectic conference that covers software development pretty broadly including programming languages, databases, distributed systems, machine learning, security, observability, and other topics. It’s never been something I could justify going to in person, but at $175 for virtual sessions may be of interest.

Random

A hello world-type deployment written in kubernetes yaml which is also hello world in AT&T syntax x86 assembler, somehow.

A counterpoint to some copyright and licensing concerns about GitHub Copilot from last issue - if copyright really prohibited Copilot it would also prohibit a lot of things we don’t want prohibited (like data mining the medical literature).

Tuplex is a Dask or Spark type Python based data science platform that does JIT compiling to optimized LLVM byte code(!). The paper in SIGMOD’21 looks pretty interesting.

It turns out that git rebase —onto is surprisingly powerful.

An awk script for generating diagrams visualizing iptables chains.

In praise of the underappreciated architecture which greatly influenced both DOS/Windows and UNIX systems - the Vax.

Astro, a static website generator that feels like writing a modern node-based webpage.

SQL is ubiquitous, powerful, and has seen off dozens if not hundreds of would-be SQL replacements; and yet it’s unambiguously not a great programming language. Jamie Brandon gives the case against SQL.

You’ve almost certainly seen the instructions for using VQGAN+CLIP to generate images on Google Colab using a notebook by Katherine Crowson. Below are results for a couple of variations om “research computing team”. I will refrain from making either the logo for the newsletter.

RCT Newsletter