#84 - 24 July 2021

Hi, all:

Success as a manger is defined by a lot of hard work and tough conversations that will pay off over very long timescales. It’s way less immediately gratifying than deploying a new feature or making the CI/CD dashboard lights all green again.

Success or possible success for me this week week: I think I managed to convince some stakeholders to not make a bad and limiting data-related decision which would have limited the scientific effort of a four-year effort; and I’m coaching some team members to take on increasing planning and coordination responsibilities, with an eye towards gauging their interest and current ability towards being leads themselves, over a process which will likely take months. Small steps, tough conversations (there’ll be a lot of feedback conversations about effectiveness in the new responsibilities - many giving positive feedback, but not all), long term payoff.

If you find yourself longing for the more immediate feedback of a “maker” rather than “multiplier role”, be aware that’s a career path that is absolutely possible and badly under-appreciated, as comes up in the “Managing Your Own Career” section. It’s tougher in Universities than it needs to be, but there’s nothing wrong and a lot to be said for going back and forth between the immediacy and hands-on nature of individual contributor work and the big-picture and coordination of people, project, or product management.

But I’m getting ahead of myself - on to the roundup!

Managing Teams

Guiding critical projects without micromanaging - Camille Fournier

However, as a senior manager, at some point you can make it harder for your managers to succeed when you give them very little structure to work with. It’s tempting to say “I don’t care how you do any of it as long as it gets done.” But that doesn’t help people figure out what is important to you, so they have to guess at what they share, when, and how.

It’s tough to strike a balance between being involved enough and not being too involved in any major effort. The fact is, if a project is important enough to be your radar as a manager (at any level) radar, it probably involves multiple people reporting to you. At that point it is part of your job to make sure the necessary coordination is happening, and that the objectives of the project are being met.

Fournier started having monthly status meetings on a major project:

First of all, this was a chance for discussion. I got to ask hard questions, and the team leadership got to show off. The team was forced to reach some agreement on the status before showing it to me, and my questions could reveal disagreements that they may not have resolved fully [….] And my presence was good for all of us, because it forced a group that didn’t all share reporting lines below me to get on the same page, and gave each the opportunity to highlight disagreements with the others in real-time when they didn’t feel aligned.

While this comes up all the time in projects, she points out that can be useful in other areas too where there are opportunities for misalignment, or changes are happening that require guidance from someone who sees more context and can share it. In those cases, where there’s no project (and thus no clear end), she councils that it’s important to remember to end the meetings when there’s no longer need for them.

Use a candidate packet to improve your interview process - Jade Rubick

Last week we looked at Rubick’s advice on hiring and recruiting, and one part was to provide a packet of information about the job, the work, and the organization to candidates after initial screening. Here Rubick goes into more details on what to consider including:

Details on the interview process
Why the organization is important, what you’re trying to do
How work works - remote if that’s relevant, or offices
Benefits
Salary bands and how salaries work
How career growth and guidance works
Information about inclusion and diversity

It’s a decent amount of work, but it will need updating only occasionally, and will make your organization look considerably more professional, more thoughtful about team members, and thus more trustworthy, than most of the organizations they’re applying to.

Managing Your Own Career

Mitchell’s New Role at HashiCorp - Mitchell Hashimoto
From individual contributor to manager, and back again - Gemma Barlow

The founder, once CEO and now CTO, of HashiCorp is taking on a new role - a regular old individual contributor job.

I think we normally think of our next career steps as managers as managing increasingly large efforts, and then becoming a manager of managers. That’s a good and rewarding career path, and you shouldn’t just pursue it just because it’s the default. The engineer/manager pendulum, going back between manager and individual contributor, is also a rewarding path, and one with its own advantages - you get to develop skills both looking at the big picture and maintaining technical depth and currency. You also get to stay strong in multiple very different ways of influencing the work of others rather than leaning on one or the other set of tools.

Personally, I’ve gone from researcher, to research computing staff, to research computing planning as an interim CTO, back to an individual contributor, learning new skills in bioinformatics, into a manager again of a genomics platform effort, into a sort-of-director (in that I coordinate leads). I don’t know what my next role will be, but I’m leaning strongly towards individual contributor again. Being an IC who has seen the big picture can make you a more valuable IC - and it can be more fun work, too, understanding more clearly how your effort plugs into a bigger picture. Being a manager who’s recently gotten their hands dirty recently doing the real work can make you a more valuable manager, too, understanding more viscerally what’s involved with the work you’re coordinating and being better able to foresee issues.

Trying to be a manager and an IC at the same time is a mistake, but doing them sequentially - or cyclically? It can be a lot of fun, and it’s a possibility you should genuinely consider; especially in research, where there is always so much fun stuff to learn.

(Having said all this, I’m not sure I love the fact that Hashimoto is moving to an individual contributor role in the same company. It’s the company that bears his name, where he’s still very close with the CEO and board, after all. How much pushback do you think he will get from “peers” or his manager when he has a bad idea?)

Product Management and Working with Research Communities

ARM Cloud HPC Hackathon 2021 GitHub Repo - A-HUG Hackathon
Twitter Thread: Winners of Cloud HPC Hackathon - Brendan Bouffler

The output of this Hackathon, which we mentioned in earlier newsletters, is a bunch of performance comparisons and Spack build recipes for HPC codes on ARM including ReFrame scripts for CI/CD so that there can be automated tests to validate that they continue to work and don’t have performance regressions. I suspect in coming weeks we’ll see lots of analysis of the work, especially the x86 to ARM performance comparisons.

But apart from the new-hardware stuff it’s worth looking at this as a research community event, in particular what appears to be an extremely successful distributed hackathon. Teams were allocated “points” and bid on particular codes to work on with those points; there was a single repo with dozens of packages (full applications and mini apps often used for benchmarking), a fully-worked example, and a set of guides. AWS and the ARM Users Group provided mentorship through the week, and AWS donated a bunch of cycles; some combination of the two contributed some M1 MacBooks and iPad Pros. In exchange, over a week, there are 31 HPC codes that previously only built on x86 now working on ARM (AWS’s Graviton2 in particular) with reproducible builds, CI/CD testing, and performance characterizations.

I’m sure none of the steps taken were novel to this hackathon, but it’s nice to see how successful it was; I’d love to hear about how communication and coordination worked (there’s already been some discussion of the compute infrastructure they used, spinning up 61 ParallelClusters).

Research Software Development

Migrating Facebook to MySQL 8.0 - Herman Lee, Pradeep Nayak
Facebook Makes a Big Leap to MySQL 8 - Joab Jackson

Migration stories are always interesting, and it’s interesting to see that major migrations are hard even for companies with what to our teams seem like essentially unlimited resources:

The 8.0 migration has taken a few years so far. [!!!: LJD]

Facebook had stayed with MySQL 5.6, which first went into GA in 2013 - that sounds old, but the next major release was 5.7 in 2015, then jumped 8.0 in 2018 (6.0 was scrapped, 7.0 had been used for something else. Naming things is hard, turns out even numbering things is hard).

The trick was, in the intervening years, Facebook had built 2,300 patches into their customized MySQL 5.6 and related tooling. Even going from 5.5 to 5.6 was a year long process. Not all of the patches needed to be ported - some were obviated by new functionality in 8.0 - that left “just” 1,500.

Their plan is, aided by integration testing and monitoring:

For each replica set, create and add 8.0 secondaries via a logical copy using mysqldump. These secondaries do not serve any application read traffic.
Enable read traffic on the 8.0 secondaries.
Allow the 8.0 instance to be promoted to primary.
Disable the 5.6 instances for read traffic.
Remove all the 5.6 instances.

My understanding form the article is that there are multiple uses of MySQL with different use cases (including use of patches) so that some replica sets are migrated and others are in progress. The long-running migration is causing a lot of maintenance pain - supporting two major versions at a time within a replica set - and they’re uncovering (and contributing fixes to) bugs in 8.0 that other customers haven’t seen.

Anyway, migrations are hard and even at Facebook scale there’s no way through but through.

Research Data Management and Analysis

Rendering 1M+ Particles - Farazh Shaikh
Maps with Django (part 2): GeoDjango, PostGIS and Leaflet - Paolo Melchiorre

Interactive displays of diverse kinds of data is a pretty common need in research computing. Particle data is a classic scientific computing requirement, and in his article, Shaikh walks us through using threejs, a 3D javascript library which makes use of WebGL for local GPU acceleration, points, and shaders, to provide interactive displays of a million moving points.

GIS was a comparatively niche topic in research computing a decade or two ago, even in Earth sciences, but with data easier to collect and disciplines like social sciences increasingly making use of computing, it’s growing rapidly in importance. In his article, Melchiorre walks us through from an empty directory to a working Django app using GeoDjango and the PostGIS extension for Postgres, and the javascript Leaflet package to display the maps, where an admin user can add markers on a map of the world and other users can view them.

No Cost Data Scraping With GitHub Actions And Neo4j Aura - William Lyon

If you’ve been looking for a little project to start playing with graph databases, here Lyon walks us through a simple web-scraping project that can be done using the free tiers of both Neo4j’s hosted Aura service and GitHub Actions. The Flat Data actions that we talked about in #75 that supports data-scraping cron jobs in your github repo, and then another action with a secret to push to Aura. The result is then a graph in Aura that you can visualize and use Cypher to practice writing queries against.

A small piece of the resulting graph, showing users, articles, and tags on the lobst.rs site

Research Computing Systems

New Sequoia bug gives you root access on most Linux systems - Catalin Cimpanu
Sequoia: A deep root in Linux’s filesystem layer (CVE-2021-33909) - Qualsys

By now you’ve probably heard about this surprising, 7-year-old, and nasty local privilege-escalation attack, where by creating, mounting, and deleting a very deep (like, path exceeds 1GB deep) directory structure, an exploit can overflow a 32-bit int defining the sizes of a buffer. Using that and some eBPF magic, it’s possible to write arbitrary kernel memory to become root.

On a lot of the HPC systems I’ve worked on, creating one million directories on those filesystems take some time, but I’m not sure that means it would necessarily be detected.

Most Linux distros have fixes released, so it’s a good idea to update.

VzLinux Is the RHEL-based Linux Operating System You’ve Never Heard of - Jack Wallen, The New Stack

You’re tired of hearing this, but in my opinion long term stable OS releases are a trap and one the community would best escape; but with CentOS’s future uncertain a lot of people are still stuck in the trap and looking for immediate alternatives.

Wallen points us to VzLinux, a five year old ongoing RHEL clone by virtualization company Virtuozzo. VzLinux is intended as a guest OS, there are plans for VM- and container-optimized versions, there’s guidelines tools for managing containers, there are tools for doing dry-run of conversion from CentOS (and then tools for unattended mass conversion), as well as for snapshot creation and roll-back. The community version of VzLinux 8 is available for download.

Don’t Wanna Pay Ransom Gangs? Test Your Backups - Brian Krebs, Krebs on Security

As the old saying goes: backups are useless, restores are what’s valuable. Krebs interviews Fabian Wosar, CTO at cybersecurity firm Emsisoft, who reports that ransomware targets who had on paper perfectly fine backup strategies often end up paying the ransom because:

They hadn’t realized how long it would take to restore entire systems, and they can’t afford to be offline for (say) months
Offsite encrypted backups are safe, but the decryption keys were hit by the ransomeware
The ransomware hit slowly enough that backups are encrypted/corrupted too

It’s really hard to know how safe your backups make you without routinely testing restores, ideally restores from different times in the past.

Emerging Technologies and Practices

Pragmatic Incident Response: 3 Lessons Learned from Failures - Robert Ross
Observe Services; not Servers - Piyush Verma, Last9

Like restoring for backups, the best time to test and develop the practice of running incident responses and retros are before you urgently need them, says Ross.

Further, running proper retros for small incidents helps keep small incidents small and can prevent larger incidents down the road. He suggests:

Declare and run retros for the small incidents. It’s less stressful, and action items become much more actionable.
Decrease the time it takes to analyze an incident. You’ll remember more, and will learn more from the incident.
Alert on pain felt by people — not computers.

Related to the last point, Verma urges us to focus our monitoring (and then alerting) on services, not servers. Many of us came up in the HPC world, where the distinction between “the node f3c09” and “the service of being able to run a user job on the node f3c09” is pretty fine. But with virtualization increasingly important even in HPC, the distinction exists and matters, and making it is the first step in being able to think about, build and focus on robust user-facing services rather than focussing on the technical inputs that the team uses to provide those services.

Using WebAssembly threads from C, C++ and Rust - Ingvar Stepanyan

WebAssembly has support for threads via Web Workers, and shared arrays via Shared Array Buffers - and on top of that enscripten provides an posix threads API, so it’s possible to use threads from C/C++ and Rust in webassembly!

But Workers behave a little differently than you might expect, and expectations are different in the browser too - you expect the current terminal’s command line to hang while a long running job is doing its thing, while having the browser’s UI freeze seems weird. So there are a few things you have to do a little differently. Stepanyan walks us through some examples.

Introducing the Scalable Matrix Extension for the Armv9-A Architecture - Martin Weidmann

Research computing can count itself lucky that AI is such a commercially important workload, as we continue to benefit from the increased demand for many of the same primitives around vector and matrix mathematics.

ARM-9A is building on existing scalable vector extensions (SVE, SVE2) to build matrix extensions, for

matrix tile storage
load/store/insert/extract tile vectors
outer product of vectors
and a streaming mode for vectors (which I don’t think I understand).

If I understand this early announcement right, SVE2 in ARM8.6-A already had support for “4 dot products at a time”, extending SME’s 8-wide vector dot product to handle (2,8)x(8,2) matrix multiplication, which can help tile matrix operations; SME buids on this with more matrix tile memory operations, and a vector outer product which can then be used as a primitive (such as for rank-one updates in a number of linear algebra operations).

Admittedly, I’ve never been a numerical linear algebra expert, I just always called high-level libraries - but outer products wouldn’t have struck me as the obvious next primitive to add; any readers in that space, do you have any thoughts?

Events: Conferences, Training

Research Running on Cloud Compute & Emerging Technologies (RRoCCET 21) - 10-12 Aug, Virtual, $25

Organized by the NSF’s CloudBank, this is three mornings (pacific time) of talks on use cases and case studies of research computing in the cloud, covering CloudBank itself, and AWS, Azure, IBM, and GCP.

HTCondor Workshop Autumn 2021 - 20-24 Sept, afternoons European times, Virtual, Free

HTCondor is pretty venerable, which means it doesn’t necessarily get as much attention as shiny newer tech as an engine for executing high throughput computing jobs; but it’s well tested, well documented, and widely used. Worth registering for if you’re interested in learning from the developers, or those giving talks on their use cases.

Random

Ever wanted to say “I’m in: I have full access to the mainframe now.”? Getting z/OS running on an Ubuntu laptop.

LLVM is the JVM of this generation - a platform that’s enabling a lot of new language development and experimentation. Here’s the first of a series of deep deep dives into LLVM’s representations - the bitcode format.

Another deep dive, this one into how Firecracker VMs work - what “paravirtualization” means, why bootup time is so fast, an the VMs so lightweight.

In Python, when to use and not use namedtuples now that there are dataclasses.

Continues to boggle my mind that for between $3-$22USD/minute, you can schedule antenna time to download data from a satellite. Here’s how to use AWS Ground Station from the command line.

A five-year old MPI tool I hadn’t known about: WI4MPI lets you choose whether to run using OpenMPI or IntelMPI (say) at runtime.

Creating 100M rows in SQLite in 33 seconds.

An argument that for distributed data systems, convergence (a concept about individual objects and how they are merged - replicas that have been delivered the same update eventually get the equivalent state) and confluence (a component gives the same output regardless of the orderings of its inputs) are more meaningful definitions than consistency.

RCT Newsletter