#73 - May 2021

Hi, all:

I hope you’re enjoying the change of seasons. The end of April here brings an onslaught of deadlines, meetings, and events, but things are settling back down now.

Details of the basecamp fiasco continue to come out. As the raw signal folks point out, discomfort as a leader isn’t inherently bad or a signal of a problem. It’s not our lot as leaders to be comfortable about everything. Listen to your team, even - especially! - when you don’t love what you’re hearing.

As always, let me know if you have questions, suggestions, or any other feedback - I always like to hear from you.

On to the roundup!

Managing Teams

Three Feedback Models - Jacob Kaplan-Moss

A quick overview and comparison of three feedback models, similar to what we covered when we were talking about performance communications in , but includes one I had forgotten, Lara Hogan’s Feedback Equation:

Lara Hogan, an author, public speaker, and coach for managers, has a Feedback Equation that is quite simple:

Observation of behavior, e.g. “In your review of Jane’s pull request, you gave her clear advice on test coverage…” Impact of the behavior, e.g. “… this helped her improve her code, which helps with our team’s goal of better-tested code.” Question or Request, e.g. “thank you; more like that!” (for positive feedback) or “in the future, can you … instead?” (for negative feedback).

This one is a lot like Centre for Creative Leadership’s Situation-Behaviour-Impact, but with the question at the end.

There’s a lot of data to support the Manager-Tools model, but if Hogan’s or the SBI model are easier for you to start with, they are perfectly decent approaches, and much better than not giving feedback. The most important thing is to give lots of feedback, mostly positive, promptly while focusing on behaviour and impact. Everything else can be tweaked over time.

Managing Your Own Career

What Good Leaders Do When Replacing Bad Leaders - Andrew Blum

At some point in your career, you’re going to step into a role as a leader where the previous leader made a hash of it. They weren’t necessarily a bad person or incompetent, but for whatever reason what they were doing wasn’t working. Blum talks about how to manage that transition.

A key point for me is an early sentence:

Good leaders create a separation between the past and the future.

Creating that rupture between “that was then” and “now we’re moving forward” is key. People want things to work well but it’s pretty easy to stay caught up in the dysfunction of what was happening before. Blum outlines three steps:

Acknowledge the contributions of the previous leader
Enable a vision for the future
- What about how we have worked and operated do we want to maintain?
- What do we want to leave behind?
- What do we want to create anew? and
Seek to understand your employees’ experiences [LJD: one-on-ones are great forums for that!]

Product Management and Working with Research Communities

The case of the 50ms Request - Julia Evans
Notes on building debugging puzzles - Julia Evans

We do a lot of excellent training in our community, but as a community when it comes to creating online trainings we’re not super creative - the tendancy is just to put a bunch of videos, slides, and text on a website. Even going as far as to add MOOC-like solution checking for homework is relatively rare.

While not suitable to every kind of training, this is a fun different approach; “The Case of the 50ms Request” is an online debugging puzzle in the form of a Choose-Your-Own-Adventure book, and Evans writes about putting together such a puzzle (with more planned) in “Notes on building debugging puzzles.”

Research Software Development

RSE Administration Tool - RSE-Sheffield

The Research Software Engineering Sheffield team has released their project management and time tracking tool, RSEAdmin - documentation can be read here, and there’s a test instance running you can log in to. It tracks proposed projects through acceptance and funding stages, then tracks staff time on the projects.

Having a Healthy Pull Request Process for Teams - Alex Kitchens

This is a longer read on setting up a pull request process, both for the authors of the PR and the reviewers. Other processes could be healthy too, but any healthy process will have clear and explicit expectations.

Kitchens spells out the responsibilities for an author - they fall under making PRs easier to review:

Make the PR Description Clear and Digestible
Explain Unexpected Changes
Keep the Size of Pull Requests Small (When Possible)
Make Sure the Team Knows the Problem Before Opening a PR to Address It
Listen to and Take in the Feedback Received on PRs

and for reviewers, under making the process easier for and thus unblocking the author:

Give an Explicit Approval/Disapproval with Explicit Reasons
- When Not Approving, Distinguish Which Comments are Blocking
- Express full thoughts rather than shortening sentences
Know When to Reach Out

There are good sidebars there on some possible approaches to “chunking” larger PRs, and examples of what is discussed.

Kitchens suggestion for explicit blocking or nonblocking comments, sounds sort of like Netlify’s boulder-to-pebble spectrum for comments we discussed way back in #15.

Testability = Modularity - Massimo Nazaria

A good reminder that, whether within a code base or across a larger architecture, a good guide towards what is a well-decomposed modular architecture is how testable it is. This also suggests a larger and somewhat deeper point - modularity isn’t some intrinsic property of a set of concepts, but instead depends on how you are going to use combinations of the underlying functionality.

Research Data Management and Analysis

First Steps #5: Pluto.jl - Josh Day, Julia for Data Science

A very quick overview of Pluto.jl, a reactive notebook for Julia. We mentioned Pluto in #48, and it’s continued to mature. It avoids one of the two big problems of Jupyter, hidden state in the order of execution of cells, by being reactive - update a variable in one place and it propagates to the other cells, like a spreadsheet. The maybe larger issue of notebooks (as opposed to environments like RStudio) remains - that snippets are written in an environment with no real off-ramp into version-controlled, tested, productionizable code other than copying-and-pasting relevant bits. For things like that I’m keeping an eye on the development of VSCode for Juypter notebooks.

UNION ALL, pt II - polymorphic data - Alexey Makhotkin

Makhotkin has a new this year newsletter on relational data modelling; he started with a long series on a very simple problem, expressing either/or data (“Do you have any symptoms of COVID-19 from the list below? If no, tick here: [ ], If yes, mark all that apply: [ ] cold symptoms; [ ] coughing; [ ] elevated temperature”) and described modelling the problem in JSON, the inevitable extensions, the storage considerations, and resulting migrations.

In this series he takes a similarly simple construct, the UNION operator, talks in part I about how UNION ALL can be much faster than UNION, and in this part connects it to object oriented-style polymorphism, expressing four narrow table “subclasses” into a single view.

Hosting SQLite databases on Github Pages - phiresky

Really cool JS and WASM work - hosting a (SQLite) database-backed web page completely statically by compiling SQLite to wasm, then implementing a virtual file system to access the statically hosted sqlite file.

Not only is this useful for making trivially hosted web applications for data analysis, it also opens up quite a number of options for creating online tutorials for SQL and data analysis (or visualization).

Research Computing Systems

Balancing Performance, Capacity, and Budget for AI Training - Timothy Prickett Morgan, The Next Platform

Using Lambda Lab’s GPU cluster systems as a launching point, Morgan goes through the current NVIDIA DL/HPC product catalogue of Ampere GPUs and suggests that the A6000 is likely a good compromise for many use cases between capacity (especially on-GPU memory), performance, and budget, and may be significantly cheaper than the Voltas that the cloud providers have previously stocked up on. Obviously the details of particular workloads matters, and some may need the 80GB A100.

This is a useful article for those making immediate-term buying decisions, but it’s a bit infurating that Morgan needed to do so much tea leaf-reading for this, particularly for prices. Unfortunately that’s where we are right now.

Manageable On-Call for Companies without Money Printers - Utsav Shah, Software at Scale

A lot of information out there about running on-call, or more advanced practices like SRE, assume that you’re a large organization with 24/7 uptime targets. These can apply to research computing, but more often don’t.

Teams sometimes respond to the inability to have 5-nines uptime support and 24/7 oncall with a shrug and just keep things vague; “will respond promptly during working hours, with best-effort responses outside of those times”. But that lack of clarity isn’t great for either staff, who don’t know what’s expected of them, or researchers, who don’t know what they can and can’t count on. It’s also a missed opportunity to lay out priorities and define successes. How can you tell if you’re doing well if there’s no goal to reach?

Shaw suggests laying out managable service level objectives (SLOs - internal measures) and slightly less stringent commitments (SLAs - agreements, external measures), and designing on-call processes and policies about that, iterating on the process as needed. For some kinds of research computing, especially availability of compute, SLAs can be quite low and still keep the researchers happy; data availability is likely somewhat higher. Choosing some target services, a manageable SLA, and fixing a manageable on-call process consistent with that will help researchers and staff know what to expect and make priorities explicit.

Backblaze Drive Stats for Q1 2021 - Andy Klein, Backblaze

The quarterly update for drive failures by model backblaze does is always interesting. In this update, they are starting to report overall SSD failure rates along with HDDs; they’re relatively new (2 years) to using SSDs, and its only now that the statistics are starting to be good enough to report. There isn’t a lot of overlap yet in years of service between their SSD fleet and their HDDs, so their results may change, but so far they’re seeing about a 20x different in annual failure rate between HDDs and SSDs used for storage, and 10x when used as boot drives. I guess I’m impressed that the difference is that small; they also report that the AFRs for hard drives continues to fall, pointing to significant increase in reliability of ol’ spinning rust.

Emerging Data & Infrastructure Tools

Object Storage Makes a Push into HPC - Jeffrey Burt, The Next Platform

What HPC POSIX-like file system makes you happiest? Lustre? BeeGFS? GPFS? Isilon? Hahaha of course I was joking. POSIX file systems have lovely semantics for local systems, but those same semantics make it extremely difficult to scale well, resulting in systems that are either ruinously expensive (in $ or in people time), very brittle, or both. Many (most?) HPC use cases don’t inherently need POSIX APIs for running, and none require them for warm- or cold- storage.

Burt gives an overview of Cloudian, a ten-year old company working to make an S3-compatible object storage system attractive for HPC and high performance data analysis, using the same approaches (flash, tiered storage, improved networking) as the parallel posix file systems are using.

Making the Internet more secure one signed container at a time - Priya Wadhwa, Jake Sanders, Google Security Blog

As dependencies multiply and there are so many ways to pull down packages, the security of the “software supply chain” - making sure that the software you’re using is the software it’s supposed to be- is increasingly important

Sigstore is a new Linux Foundation effort to provide verifiable signatures from authoritative sources for a lot of kinds of builds:

We are initially targeting generic release artifacts such as tarballs, compiled binaries and container images. Later we will explore other formats (such as jars) and manifest signing, such as SBOM etc. We also open to collaborate with package managers and ease the adoption of signing for their communities.

Google is building on this with cosign for signing and verifying the signature of container images, and incorporating that into CI but also CD workflows.

Calls for Submissions

Prace Autumn School 2021 - 11-15 October 2021, In Person, Finland, Applications due 9 Aug

For EU participants. There are three tracks one can apply for:

Introduction to deep learning
Introduction to GPU programming with OpenMP
Hackathon: Optimizing GPU applications on LUMI supercomputer.

From the site:

The course consists of lectures and hands-on training on modern, GPU-accelerated high-performance computing: GPU programming and GPU code optimization at scale, as well as understanding and applying machine learning methods. The course includes also scientific case studies about using GPUs in various disciplines. The tutors and lecturers are experts in these fields providing various interesting aspects in the course topics.

Events: Conferences, Training

Automated Fortran–C++ Bindings for Large-Scale Scientific Applications - 12 May, Online, Free

The monthly Best Practices for HPC Software Developers talk series continues, with a look at using SWIG-Fortran to build modern, MPI-aware C++ interfaces to Fortran code.

Random

Formal methods, SAT solvers, etc aren’t widely used in research computing even though they’re useful for making sure all cases of an algorithm are covered or whittling down combinations. Here’s a collab notebook for one such package, Z3, with Python.

A cheat sheet for the still newish Go modules.

Small but huge thing - on Github, absent conflicts, you can now fast-forward commits from an upstream repo into your fork with one click from the web UI.

A summary of and map to a massive blog series on C++ coroutines.

Berkshire Hathaway stock prices, in the usual units of 1/100 of a cent, (almost) exceeded the size of an unsigned 32-bit integer and so broke the stockmarket.

EdgteDB, a Postgres-based database that avoids SQL in favour of ReST and GraphQL APIs, with schema migrations built in from the start.

SQLite running directly on an object store (Ceph).

HATETRIS - tetris that always gives you the worst possible piece. In case management isn’t enough frustration for you.

RCT Newsletter