#124 - 4 Jun 2022

Hi!

I wanted to write more this week about the consequences of research computing teams as vendors, and what that means for strategy - I’ll do that next week.

But there was big news in the HPC world this past week - as you almost certainly already know, the Frontier super at ORNL is now the world’s first Exascale system as measured by the HPL benchmark.

This is an enormous technical (and project and product management!) accomplishment, the capstone of work that’s been ongoing since 2006 or so. Colleagues at AMD (which, disclaimer, has long been a worthy competitor of my current employer), HPE, and ORNL deserve all the praise they’ll rightly get. My sympathies to other colleagues at Argonne whose star-crossed Aurora system, facing a series of delays, didn’t make it across the line in time.

And yet, and yet, and yet.

Something can be a massive technical accomplishment and yet not be the solution to the most pressing problems a field has - a distraction, even, from those.

When you ask research institutions what their priorities and needs are, as in the first item in the roundup today, bigger systems don’t top the list. The biggest priorities are for user support, outreach, data engineering, analysis, and visualization expertise, and processes for handling sensitive data. The priorities are people. Us, and our teams.

Even when CaRCC was asking specifically about systems, the top five priorities were data archiving and preservation, interactive computing, commercial cloud, infrastructure automation, and high throughput computing. That’s worth highlighting - at the time exascale was announced, growing HTC was a higher priority for research institutions in the US than growing HPC.

The push to petascale, then exascale, started in the 90s, and is the culmination of a strong vision for the future rooted in the 90s. Everyone involved is to be commended for executing that plan and that vision so well!

But the thing about visions of the future, like 50s sci-fi, or concept videos of personal computing from the 80s or 90s, is that they are extrapolations through the past to the (then-)present. They’re visions of current technology, but more so.

Possible 2020s research computing, as viewed from the 1990s, looked like the same big number crunching tasks, but on a much vaster scale. The same dense linear algebra operations, but much larger. 3D viz caves and grid computing.

But the thing is, over the following thirty years, research computing achieved much greater scope than could be contained within that cramped vision. It met ambitions completely orthogonal to it, even.

The triumph of research computing and data is that it is essential everywhere, in every field. Digital humanities and computational social sciences, precision health and simulations for urban planning, 3D modelling that is completely routine for art students.

And that work is vitally important, is helping drive those fields.

The drive for peta- and then exascale was largely about industrial policy, not scientific policy. It was an attempt to push the technology forward so that the 90s version of 2020s research computing could unfold (preferably while supporting local vendors). It was an enormous push, with a lot of incidental benefits for us and for technology broadly. But it was money spent on an overly narrow vision of the future, a future which thankfully didn’t come to be.

Groups driven by science accomplishments have done things differently. The vast majority of research impact, however you want to measure it (papers, citations, scientific awards and recognitions), happens based on research computing, but at the laptop or workstation or small cluster scale. Every research computing team with readers on this mailing list has a much higher bang for the buck, a much higher scientific-impact-per-dollar-spent, than the massive exascale-ish labs. And quite possibly higher scientific impact, period. There are research projects that need extreme-scale HPC, but it’s an extremely niche use case, and one that research institutions are demonstrably not overly concerned about.

My fondest hope is that with the exascale race now won, the extreme-HPC part of our community can move on to more productive pursuits. No one but Intel seems to believe that the next big thing is the next power of 1,000, “zettascale”. (Intel could not possibly be more obvious in wanting a do-over on exascale.)

There’ll still be an over-emphasis on “big” computing, sadly, since that suits the needs of vendors and those driving industrial policy. I’m hoping that, at the very least, we’ll move to another big benchmark suite, one based on real tools that researchers actually run.

But I also hope that there’s more emphasis on people, and less on stuff. Because research computing is about research, not computing. And research is fundamentally a people-driven activity, propelled by expertise and skill, with equipment added only as necessary. Our teams are awesome, drive research, and we need to support them well.

With that, on to the roundup!

Managing Teams

2021 RCD CM Community Data Report - Patrick Schmitz et al., CaRCC

I’m a big fan of the CaRCC Research Computing and Data Capabilities model - it’s popped up from time to time here since #21. Here they report on the results they collected from 51 US institutions in 2021. The aggregate results are interesting - the capability coverage and level of maturity vary widely. That makes a lot of sense - different institutions will experience different needs in different orders, and so develop strengths and capabilities differently.

The model breaks capabilities down into five “facings” - researcher, data, software, system, and strategy/policy facing. As the figure below shows, software and data facing capabilities are generally somewhat less present or mature than researcher, systems, or strategy and policy facing, but there’s such wide scatter across institutions it’s hard to say anything definitive.

What I found especially interesting was the list of top-10 priorities the submitters had in aggregate; in order, they were:

Researcher facing - introductory user support/training
Strategy and Policy facing - strategic plan
Data facing - data engineering & analysis
Researcher facing - researcher outreach
Researcher facing - can researcher-facing staff effectively advocate for researcher needs
Researcher facing - marketing-type activities (my words, not theirs)
Strategy and Policy facing - sustainability of funding
System facing - data archive / preservation
Data facing - data visualization experts
Data facing - sensitive data

(Do those line up with the priorities at your institution? Drop me a line either way - jonathan@researchcomputingteams.org)

These strike me as pretty mature priorities of fairly sophisticated institutions. As more institutions realize that these teams are vendors and are no longer in an “if you build it, they will come” situation, outreach and marketing-type activities become important. And as soon as you realize you can’t do everything in the shifting and expanding world of supporting research and scholarship with computing and data, strategic planning becomes vital.

But a lot of institutions are much earlier on in their journey, and are looking for a starter version of this assessment - CaRCC now plans to put out an “Essentials” version of the instrument, which is fantastic news.

A couple other notes:

I’d really love to get a sense for how the capabilities are actually provided on the ground at these institutions - the default up to now has been “provide everything that’s provided at all in-house”, but I think that’s no longer tenable.
Even though software-facing capabilities are less mature/present, no software-facing issues made the top-10 list of priorities (!!)

Overview of the capabilities coverage for the Capabilities Maturity Model’s five “facings” - researcher, data, software, system, and strategy/policy facing. While there’s huge scatter for each facing, data- and software- facing capabilities have noticeably less coverage than systems, strategy, or researcher-facing capabilities.

Remote Onboarding of Software Developers - Paige Rodeghero, It will Never Work in Theory, Live!
Please Turn Your Cameras On: Remote Onboarding of Software Developers during a Pandemic - Paige Rodeghero, Thomas Zimmermann, Brian Houck and Denae Ford, arXiv:2011.08130

I’ll highlight some other talks from this workshop in the Research Software Development section below, but this talk is relevant, I think, for all research computing teams - how to onboard technical staff when we’re all working from home.

Getting someone up to speed on the work of a team and on the team itself is always tough, but in this lower-communications-bandwidth world where there’s a lot less in-person interaction, it’s even harder. Doing it well, though, is vital - getting people successful as early as possible is important not only for the team’s effectiveness but for the newcomer’s morale and engagement.

Rodeghero here presents her work based on survey results from 267 new hires at Microsoft in the first year of the pandemic, when processes were still very much in flux. Successful onboarding had several things in common, and from that she makes eight recommendations:

Cameras on should be the default culture
Promote proactive communication (e.g. new team member asking for help)
Schedule 1:1 meetings with all team members
Explain the Org Chart (and other information about the organization)
Assign an onboarding buddy & technical mentor
Support multiple onboarding speeds
Assign a simple first task
Provide up-to-date documentation

I think in our smaller teams, we don’t generally need to distinguish between the supervisor, technical mentor and onboarding buddy - any two of those is likely enough - but everything else is very relevant. One thing that doesn’t come up very much elsewhere is that Rodeghero found new hires really struggled to understand the team dynamics and form bonds with the team when everyone had their cameras off. I think this is easy to forget when our teams have long average tenures and most of us know each other well - we know how to interpret people’s voices. But new people joining will really struggle if everyone is just an avatar in a box on a screen.

Product Management and Working with Research Communities

Following chatter about the Helmholtz AI Conference on Twitter, this tweet led me to the Helmholtz AI unit’s fascinating consulting model - researchers submit proposals for vouchers, which are reviewed (as they come in, it looks like). Those vouchers can be for (<2 week) “exploration” engagements, where the project is fleshed out with experts, or longer (<6 month) “realization” engagements where the project is executed. 169 engagements have been completed so far.

Has anyone worked with Helmholtz AI using this model, or have experience with a similar model? Let me know!

Cool Research Computing Projects

I love computer-powered citizen science projects. This Ars Technica article on the Hubble Asteroid Hunter is a great summary of the possibilities of combining web applications, volunteer labour and machine learning for science.

“That’s why you needed a sample of them detected by humans,” Kruk said. “What took us a year to classify with the citizen scientists—it took only about 10 hours with the [algorithm]. But you do need the training set.”

Volunteers on the Zooniverse platform went through 37,000 Hubble images to find over 1,000 new asteroids, typically smaller than those which were known before.

Research Software Development

It Will Never Work In Theory, Live! - Organized by Greg Wilson

At the end of April, Greg Wilson hosted a series of lightning talks focused on actionable results from software development research. There are many worthwhile talks there, covering issues relevant to our field - I’d like to highlight lightning talks (with slides available) on

Top Ten suggestions from managing a decade of undergraduate software teams (PDF here) - Weiqi Feng, Mark D. LeBlanc, J. Computing Sciences in Colleges (2019)

One of the defining features of academic research computing and data is that many of the contributors are trainees. This isn’t something you read a lot about handling in articles written for the tech industry or startups!

The authors describe a long-term interdisciplinary project, Lexos, which helps digital humanists analyze their favourite corpus of digitized texts. It will clean up, visualize, and analyze bodies of text. It started off as a set of Perl scripts, and is now a hosted or self-hosted web app with a Python backend and a d3.js-powered JavaScript front end, with releases for Windows, Mac, and Linux, hosted on GitHub. With the group based in a college, most of the trainees available to contribute are undergrads doing summer projects or working part-time during the year.

Feng and LeBlanc offer their suggestions for managing such teams:

Recruit actively and iteratively - focus on students who can work independently and can lead tasks
Precede the initial sprint with a bootcamp - this is a really good idea if you’re hiring cohorts like summer students
Use CI tools, Unit Testing, and good modular structure (so students can work confidently on specific chunks of the overall product during their short tenure)
Conduct peer reviews: “Though we do not ask all students to make suggestions to pull requests, we do ask them to read through each and leave comments if they cannot understand what the code does” (great stuff there)
Meet in daily standups
Lead toward success: assign easy tasks early
Improve one tool
Add new functionalities

This approach provides real professional development for students, while advancing the group’s product. Some of that professional development is hard - the authors call out challenges and tensions around peer reviews in particular - but those are skills that are important for the students to learn.

The authors may undersell the level of project management skills they clearly demonstrate here. Maintaining a long list of well-scoped discrete tasks that can be handed out to short-term undergraduates is no small thing!

Research Data Management and Analysis

Time Series and FoundationDB: Millions of writes/s and 10x compression in under 2,000 lines of Go - Richard Artoul

The wide range of open-source databases available now means that it’s increasingly possible to choose something which meets the right tradeoffs for your particular use case. And if not, there’s likely something close you can build on top of.

The README for this repo is basically a blog post on using FoundationDB (Apple’s open-source distributed key-value store) to build a distributed time-series database that met the author’s needs. It’s a great overview of why you can’t just use Postgres, and of how to implement compression and secondary indexing.
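To make "building a time-series database on top of an ordered key-value store" a little more concrete, here is a minimal sketch in Python. It is not the repo's Go code and not FoundationDB's API: `OrderedKV` is a hypothetical in-memory stand-in for any store that keeps keys in byte order, and `encode_key`, `delta_encode` and `delta_decode` are names made up for illustration. The key layout (series name, then a big-endian timestamp) and the simple delta encoding are assumptions chosen to show the general idea, not the scheme the README describes.

```python
import bisect
import struct


def encode_key(series_id: str, timestamp: int) -> bytes:
    """Build a key that sorts by series first, then by time.

    A big-endian unsigned 64-bit timestamp makes byte order match numeric order.
    """
    return series_id.encode("utf-8") + b"\x00" + struct.pack(">Q", timestamp)


class OrderedKV:
    """Hypothetical stand-in for an ordered key-value store (illustration only)."""

    def __init__(self):
        self._keys = []    # keys kept sorted in byte order
        self._values = {}  # key -> value

    def set(self, key: bytes, value) -> None:
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get_range(self, begin: bytes, end: bytes):
        """Return (key, value) pairs with begin <= key < end, in key order."""
        lo = bisect.bisect_left(self._keys, begin)
        hi = bisect.bisect_left(self._keys, end)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


def delta_encode(timestamps):
    """Keep the first timestamp, then only the gaps between neighbours."""
    if not timestamps:
        return []
    deltas = [timestamps[0]]
    for prev, cur in zip(timestamps, timestamps[1:]):
        deltas.append(cur - prev)
    return deltas


def delta_decode(deltas):
    """Invert delta_encode by taking a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out
```

With keys laid out this way, "all points for series `cpu` between two times" becomes a single range scan, e.g. `kv.get_range(encode_key("cpu", t0), encode_key("cpu", t1))`. Regularly sampled timestamps also turn into runs of small, similar deltas, which is the kind of data that compresses well.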

Research Computing Systems

TACC Adds Details to Vision for Leadership-Class Computing Facility - Aaron Dubrow
Atos pushes out HPC cloud services based on Nimbix tech - Dan Robinson

Dubrow’s article gives us details about TACC’s upcoming system and facility, but the part I want to highlight is this:

TACC announced as part of its preliminary design presentation that the LCCF advanced computing system will likely be hosted at a Switch commercial datacenter under construction on the Dell Round Rock campus, 10 miles north of TACC.

The culture of some of research computing, but especially HPC, was formed in the 90s and earlier when research computing was a highly specialized niche resource. But the success, the importance, and maturity of research computing and data is such that it’s now everywhere, it’s necessary for most research, and most needs for it are thankfully technically routine. Complex and important and requiring expertise and onboarding, to be sure, but routine.

Even though the TACC leadership system (“Horizon”) will presumably be quite big and will be able to HPL very hard indeed, its size and scale will be consistent with what commercial data centre operators can handle. And those data centre operators run dozens or more data centres, hosting hundreds or thousands of systems, full time - that is their one job. They are very good at it, with an efficiency and effectiveness that comes from that much experience.

Most research institutes are not located in a place where space and power are cheap. There’s an opportunity cost to using campus or institute-owned space for machine rooms instead of labs or classrooms, in addition to the economic cost. We’ll see much more of this use of commercial datacenters in the future, and it’ll open up yet another front in the (interminable and mostly uninteresting) cloud vs on-prem debate.

Relatedly, European vendor of HPC systems Atos will now, like HPE, also offer the systems as a service.

It’s fascinating to go through Facebook’s logbook for a hero run - in this case a recent large pre-trained text model. Nodes going down, bad commits, people forgetting to change parameters, an ssh server being taken out because it was running on someone’s instance… super interesting, and very relatable! I also really respect the fact that they kept such a careful log and released it with the model.

Life and leaving NERSC - Glenn K. Lockwood

A long-time HPCer and well-known high performance storage expert, Lockwood muses on NERSC (by all accounts a great place to work), the recent history of HPC, and why he’s moving to Microsoft Azure:

A more innovative approach is to start thinking about how to build a system that does more than just run batch jobs. […] Such a “more than just batch jobs” supercomputer actually already exists. It’s called the cloud, and it’s far, far ahead of where state-of-the-art large-scale HPC is today–it pioneered the idea of providing an integrated platform where you can twist the infrastructure and its services to exactly fit what you want to get done. Triggering data analysis based on the arrival of new data has been around for the better part of a decade in the form of serverless computing frameworks like Azure Functions. If you need to run a Jupyter notebook on a server that has a beefy GPU on it, just pop a few quarters into your favorite cloud provider. And if you don’t even want to worry about what infrastructure you need to make your Jupyter-based machine learning workload go fast, the cloud providers all have integrated machine learning development environments that hide all of the underlying infrastructure.

Emerging Technologies and Practices

Some really interesting announcements coming out of ISC this past week:

Red Hat Joins Forces with DOE Laboratories - HPC Wire

For things that (say) Slurm is good at, it’s objectively much better than Kubernetes. But as soon as you don’t fit neatly into the batch scheduler’s model of “N nodes for M hours” requests, it all turns to weeping and gnashing of teeth.

There was always going to be a winning resource manager for more complex workflows. I would not have bet on Kubernetes — it’s fiendishly complex, both over- and under-powered for research computing needs, and still a moving target — yet here we are. It’s standard, widely deployed, and has a fairly stable programmable interface, and most importantly it’s what we have.

This press release announces that Sandia is going to work with Red Hat on Kubernetes (Red Hat has a widely used commercial distribution of k8s, OpenShift) for extreme-scale, more complex workflows, connecting next-gen batch schedulers like Flux to the Kubernetes world. And both Sandia and NERSC will work on improving the rootless container ecosystem.

GigaIO Announces Series of Composability Appliances Powered by AMD - HPC Wire

Composable computing is actually something you can order by the rack now - AMD + GigaIO are selling racks with EPYC chips and MI210 accelerators, where rack resources can be merged into “nodes” for particular jobs while the hardware is live. Very cool!

Random

A cheap and reliable and widely-used German credit card terminal recently reached the point where it would no longer be supported - and then a certificate or key or something expired, producing mild credit-card chaos in Germany.

A computer-music language for your browser - glicol.

How fast can you push data through Linux pipes if you try hard enough? Pretty fast - up to 65 GB/s.

How fast is the typical malloc/free? With a good library median times can be about 25ns, if you’re playing DOOM3 anyway.

How fast can you push a quicksort implementation? Google’s open-sourcing a library making full (and architecture-independent) use of SIMD and multicore that can sort 16 through 128-bit integers at speeds 9-19x faster than std::sort.

Digging into information theory, optimal strategy, and wordle.

Implementing Forth from scratch, where by “from scratch” I mean “first, design an instruction set”.

The first Lisp compiler.

Telefork a process onto another machine.

Unit tests and mocks for bash scripts.

The unreasonable effectiveness of “have you tried turning it off, then back on again”? (Honestly, why do we ever even turn these blasted machines on in the first place?)

That’s it…

And that’s it for another week. Let me know what you thought, or if you have anything you’d like to share about the newsletter or management. Just email me or reply to this newsletter if you get it in your inbox.

Have a great weekend, and good luck in the coming week with your research computing team,

Jonathan

Research computing - the intertwined streams of software development, systems, data management and analysis - is much more than technology. It’s teams, it’s communities, it’s product management - it’s people. It’s also one of the most important ways we can be supporting science, scholarship, and R&D today.

So research computing teams are too important to research to be managed poorly. But no one teaches us how to be effective managers and leaders in academia. We have an advantage, though - working in research collaborations has taught us the advanced management skills, but not the basics.

This newsletter focusses on providing new and experienced research computing and data managers the tools they need to be good managers without the stress, and to help their teams achieve great results and grow their careers.

RCT Newsletter

Exascale - we've reached a future from the past; CaRCC RCD capabilities community report; Remote onboarding; Helmholtz AI's voucher model; Hubble Asteroid Hunter; Managing undergrads for a decade-long software product; TACC goes to a commercial datacenter; Lockwood goes to Microsoft

About This Newsletter