#122 - 14 May 2022

Communication is not collaboration; Opening up onboarding documents; Onboarding developers with tests, docs and containers; MIT's Office of Research Computing and Data; First view of the Milky Way's Supermassive Black Hole; Smarts in the network, LAN and WAN; A first RISC-V Cluster

Hi!

I took part in a hackathon this week, working directly with researchers on improving their code for the first time in — gosh, maybe eight years? Nine?

It’s funny what sticks in the mind and what doesn’t. I kept coding on and off over the past decade, so that was fine, but all the glue stuff that I didn’t think to stay in practice on like “how do you unload all modules?” or “why doesn’t this shell script that sets environment variables work when I run it?” (yeah, that one’s kind of embarrassing) took some time to remember.

There were more data analysis projects, and infinitely more AI/ML projects, than there would have been a decade ago, when it would have been 90% simulation codes. Technology has changed over the past decade — there are new profilers and tooling, systems are faster, and hey, Power systems aren’t big endian any more! But I was mainly struck by how little change there had been in the basic challenges. Maybe the percentage of research groups with extensive in-house research computing and data skills has gone up a bit, but mostly the research groups we saw were very capable, deeply knowledgeable in their domain, with significantly less expertise on the computing, software, and data management side. They needed, and knew they needed, collaborators to work with to achieve their goals. And our responsibility was to do a little bit of hands-on-keyboard work, but mainly to give them the skills and resources they needed so they could continue to advance even after the engagement; to give guidance as trusted experts. And that expertise ends up straddling at least two of software, systems, and data - those aren’t useful in isolation.

The hackathon ended with a bunch of teams all being able to do significantly more computational science (and social science), and tackle bigger problems. It was really nice to be able to contribute hands-on to such work again!

The work we do is important. It’s important that it be funded well, but we can’t individually do much about that. What we can do is make sure the services are provided professionally, and meet the needs of our researchers; that we assemble teams of great experts that function well together; that our team members are treated well, given opportunities to grow and do meaningful work; and that we constantly improve our offerings and build the skills of the teams.

One more note before the roundup - next weekend is Victoria Day weekend here in Canada; I’ll be taking the weekend off, so issue #123 will arrive in your inbox on or around the 28th.

With that, on to the roundup!

Managing Teams

Communication is not Collaboration - Matt Schellhas

Schellhas decries the tendency to assume that if groups aren’t working together, it’s because of a lack of communication, and so the solution is some joint meeting to keep each other informed (this can go even worse, and become… shudder … joint social activities). But communication is not collaboration, and so more communication doesn’t necessarily lead to more collaboration.

To get two teams working together, Schellhas says, is not significantly different than getting one team working together:

  • Build enough psychological safety so that people can try something new, like trusting that other group
  • Set expectations
  • Back up those expectations
  • Provide constant feedback

and those actions don’t have to come from managers (although that does make it easier).


Weecology’s Wiki - Weecology lab, U Florida

Ethan White and his group made their joint lab wiki openly accessible, as a resource to others but also for people considering joining the group as grad students, etc.

We’ve mentioned onboarding materials and candidate packets many times in the past, and making these resources openly available is extremely valuable: for getting candidates (or collaborators, or clients) interested in the work of your group, for building a shared team understanding of the work, its impact, and how to bring people on board, and more. Having it as a wiki, and so readily updatable, is very helpful.

Note that this isn’t a “if you build it, they [the content] will come” sort of deal:

Most @weecology folks will remember hearing @skmorgane and I repeatedly say things like:

“Did you check the wiki?” “What’s missing from the wiki?” “OK, now that we’ve talked through how to do that can you write up that advice on the wiki?” “Nice solution! Add it to the wiki!”

It takes a lot of effort to make putting the material there and updating it the default behaviour of the team. But once it’s there, there are a lot of uses it can be put to, and making it open for candidates and others to see is one of them.


Managing Your Own Career

A love letter to LFTM - Angélique Weger

We’ve had a couple of articles here in the last few weeks about needing more than just a to-do list to keep track of the different things a manager has to stay on top of. The Low Friction Task Manager (lftm) is one such (github-friendly) system, and here Weger confirms the advantages claimed by the system:

  • Answers the question of ‘what do I do next?’, which is the ultimate productivity killer.
  • Keeps my working memory uncluttered.
  • Keeps me from um’ing during my daily standups. I always know what I worked on yesterday.
  • Is a handy record of accomplishments that I can reference when it’s time for my review, I want to ask for a raise, or I’m updating my resume.
  • Provides a reminder that I do, in fact, get things done and that I don’t, in fact, suck at my job.

She then describes her take on it (adding Markdown, and re-ordering things). The system has folders for notes on 1-1s, other meetings, general notes, and projects, each with a template, and then running journals where things like to-do lists are kept and archived, with scripts to keep things up to date.
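To give a flavour of the kind of glue scripting such a system relies on — this is only a sketch with hypothetical paths, template names, and a made-up "carry over unchecked to-dos" convention, not the actual lftm tooling — a daily-journal helper might look something like this:

```python
#!/usr/bin/env python3
"""Create today's journal entry from a template, carrying over unfinished to-dos.
A sketch only -- paths, filenames, and the carry-over convention are hypothetical,
not taken from the lftm repository."""
import datetime
import pathlib
import shutil

JOURNAL_DIR = pathlib.Path("journal")           # hypothetical layout
TEMPLATE = pathlib.Path("templates/daily.md")   # hypothetical template

def todays_entry() -> pathlib.Path:
    today = datetime.date.today().isoformat()
    entry = JOURNAL_DIR / f"{today}.md"
    if entry.exists():
        return entry

    # Start from the template...
    JOURNAL_DIR.mkdir(exist_ok=True)
    shutil.copy(TEMPLATE, entry)

    # ...then carry over any unchecked "- [ ]" to-do items from the most recent entry.
    previous = sorted(p for p in JOURNAL_DIR.glob("*.md") if p != entry)
    if previous:
        carried = [line for line in previous[-1].read_text().splitlines()
                   if line.strip().startswith("- [ ]")]
        if carried:
            with entry.open("a") as f:
                f.write("\n## Carried over\n" + "\n".join(carried) + "\n")
    return entry

if __name__ == "__main__":
    print(todays_entry())
```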

As always, the best system is one that works for you, and that may change over time.


Product Management and Working with Research Communities

MIT to launch Office of Research Computing and Data - Scientific Computing World

As if there was any doubt about the rising importance of research computing and data, MIT is putting together an office directly reporting to the VPR with University-wide RCD as its remit. I don’t love that it’s being led by a professor rather than professional staff — we know that works less well for HPC centres, for instance — but still, it’s a good initiative, and makes MIT one of only a handful of large institutions that I know of that will have a single campus-wide effort rather than services and capabilities scattered all over campus (though I’m sure even MIT will take some time to move from column B to column A).


Cool Research Computing Projects

First Sagittarius A* Event Horizon Telescope Results. I. The Shadow of the Supermassive Black Hole in the Center of the Milky Way - Akiyama et al.
First Sagittarius A* Event Horizon Telescope Results. II. EHT and Multiwavelength Observations, Data Processing, and Calibration - Akiyama et al.

Increasingly, there’s no sharp line between scientific observations and experiments and the research computing and data efforts which enable them. The news this week that you probably already saw is a great example.

It’s way harder to see what’s at the centre of our own Milky Way than to see distant galaxies; looking inwards, half the Galaxy is in our way. [Sagittarius A](https://en.wikipedia.org/wiki/Sagittarius_A) (so called because it’s in the constellation of Sagittarius, and was the brightest radio source in the constellation when such things were being catalogued in the 50s) is not only obscured, but it’s highly variable (the heated gas glowing and orbiting around a black hole understood to be four million times the mass of our own sun varies over a timescale of minutes to hours - imagine the amounts of energy involved!). So you can’t even just take a single long exposure, even if that would otherwise work.

To pull out the signal at the frequencies where Sgr A* is brightest, you need telescopes which collectively have a large collecting area. And to get at the resolutions you need to see spatial structure in this compact object, you need signals from far apart - using very long baseline interferometry, accurately correlating signals from measurements made far apart.

The Event Horizon Telescope is an effort, and an instrument, spanning half the globe. At each site, data is collected at 16-64 Gb/s and saved to disk, and those disks are then collected for correlation. Because the data has to be correlated, keeping the tracks in sync is very important, and atomic clock data is saved along with the time streams. Multiple pipelines run on the data (as described in the second paper). The pipelines are tested by running multiple simulations of what one might expect the area around the black hole to look like, generating synthetic data, and running that through the pipelines to check against expected results.
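To get a feel for the core correlation step — this is a toy illustration only, nothing like the EHT’s actual pipelines — cross-correlating two noisy recordings of the same signal to recover the delay between stations can be sketched in a few lines of numpy:

```python
import numpy as np

# Toy illustration of VLBI-style correlation: two "stations" record the same
# noisy signal, offset by an unknown integer delay; cross-correlation recovers it.
rng = np.random.default_rng(42)
n, true_delay = 4096, 37

signal = rng.standard_normal(n + true_delay)
station_a = signal[:n] + 0.5 * rng.standard_normal(n)                      # reference station
station_b = signal[true_delay:true_delay + n] + 0.5 * rng.standard_normal(n)

# Cross-correlate via FFT and find the lag with the strongest correlation.
spectrum = np.fft.rfft(station_a) * np.conj(np.fft.rfft(station_b))
lags = np.fft.irfft(spectrum, n)
estimated_delay = int(np.argmax(lags))
print(f"true delay {true_delay} samples, estimated {estimated_delay}")
```

The real thing is vastly harder, of course: wide bandwidths, fractional-sample delays, atmospheric and clock drift, and petabytes of recorded data rather than a few thousand samples.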

There are no observations (and subsequent galactic physics papers) of this without the research computing and data effort, and that RCD effort is highly specialized to the observations, with no firm dividing line between the astrophysics, software development, systems, and data management pieces.

A resulting view of the supermassive black hole at the centre of our galaxy


Research Software Development

Trusted Cyberinfrastructure Evaluation, Guidance, and Programs for Assurance of Scientific Software - Pignolio, Miller, and Peisert, Better Scientific Software
Best practices to keep your projects secure on GitHub - Justin Hutchings, GitHub blog

When I started in this field, research software was all numerical simulation - accuracy and testing were concerns, but the only real security issue was making sure the code didn’t get deleted. But with web applications, or infrastructure software, or sensitive data analytics, then integrity of the code really matters.

The Better Scientific Software article summarizes the Trusted CI report that we mentioned in #105, and focuses largely on governance issues - making sure there’s a security point of contact, having tools and testing, having processes for commits, training and knowledge about safe code practices, etc.

The GitHub blog focuses on GitHub tools for monitoring dependencies, dependency review, and turning on Dependabot for security alerts (which you should absolutely do if the language of your project is supported). Both approaches are useful, and work well in tandem.


We talk a lot about containers for reproducibility of scientific workflows, but don’t sleep on reproducibility of development environments also being a huge win. Here’s a blog version of a talk by Avdi Grimm about dev containers — with tooling like VSCode, GitHub Codespaces, or any of a number of similar tool sets — and the benefits of a standard development container for onboarding new developers (hires or community members), and just for day-to-day productivity.


I just threw away a bunch of tests - Jonathan Hall
The evolution of the SciPy developer CLI - Sayantika Banik, Quansight Labs

Related (but not limited) to ease of developer onboarding - I was just having this conversation with a friend. Test suites are code, too, and part of your product - they’re not some weird kind of “meta-code” for which the usual rules don’t apply. As Hall points out, that means keeping them documented, making them easy to run, refactoring them, and discarding some when appropriate.

As an extreme version of that, Banik walks us through the SciPy developer CLI, with tools and documentation for building, testing, static checking, benchmarking, using, and adding/editing documentation or release notes for SciPy packages. Obviously almost no other projects are at SciPy’s scale, but an engaged developer community needs support and documentation, and if you make tests mandatory as part of significant commits (and you should!), then the test suites, and the instructions for running and updating them, should be kept up to date.

The help screen for the SciPy developer CLI, showing options for building, testing, listing, type checking, updating release or API docs, benchmarking, and interactive use of an under-development SciPy package
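The general pattern is easy to replicate for smaller projects. As a rough sketch — this is not SciPy’s actual dev.py, just the shape of the idea, with hypothetical task names and commands — a click-based developer CLI wrapping the usual build/test/lint incantations might look like:

```python
#!/usr/bin/env python3
"""A minimal developer CLI in the spirit of SciPy's dev.py -- a sketch only.
The task names and underlying commands are hypothetical placeholders."""
import subprocess
import sys

import click


def run(cmd):
    """Run a command, echoing it first, and propagate its exit code on failure."""
    click.echo("$ " + " ".join(cmd))
    result = subprocess.run(cmd)
    if result.returncode != 0:
        sys.exit(result.returncode)


@click.group()
def cli():
    """Developer tasks: build, test, lint -- one documented entry point."""


@cli.command()
def build():
    """Build and install the package in editable mode."""
    run([sys.executable, "-m", "pip", "install", "-e", "."])


@cli.command()
@click.option("--coverage", is_flag=True, help="Collect coverage information.")
def test(coverage):
    """Run the test suite (the tests are part of the product, too)."""
    cmd = [sys.executable, "-m", "pytest"]
    if coverage:
        cmd += ["--cov"]   # assumes pytest-cov is installed
    run(cmd)


@cli.command()
def lint():
    """Run static checks."""
    run([sys.executable, "-m", "flake8"])


if __name__ == "__main__":
    cli()
```

The point isn’t the specific tools; it’s that a newcomer has one documented, discoverable place to find out how the project is built and tested.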


The code review pyramid - focus human attention where it is needed, and have automation take care of what it does well.


Research Data Management and Analysis

sqlite-utils: a nice way to import data into SQLite for analysis - Julia Evans

As you know, I’m a big believer that many .csv files, and most directories of .csv files, could usefully be a sqlite file (or some other database). Evans discovers sqlite-utils, part of the suite of tools around Datasette, for quickly importing existing data into sqlite.
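For a directory of CSV files, the Python API (or the equivalent sqlite-utils command line) makes this close to a one-liner per file. A small sketch, with hypothetical file and table names:

```python
"""Load a directory of CSV files into a single SQLite database with sqlite-utils.
File and table names here are hypothetical -- adjust to your own data."""
import csv
import pathlib

import sqlite_utils

db = sqlite_utils.Database("analysis.db")

for csv_path in pathlib.Path("data").glob("*.csv"):
    with csv_path.open(newline="") as f:
        # One table per CSV file, named after the file; columns are created
        # automatically from the CSV header row.
        db[csv_path.stem].insert_all(csv.DictReader(f), alter=True)

# Now the data can be queried with plain SQL instead of ad hoc CSV wrangling.
for row in db.query("SELECT count(*) AS n_tables FROM sqlite_master WHERE type = 'table'"):
    print(row)
```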


I’m also really excited about all of these “SQL atop data just sitting there” tools that are popping up, particularly on object stores and/or in columnar formats - I think they’ll end up being really useful for a lot of data analysis, or even for log analysis for systems. Sneller is another such tool, for schemaless JSON items.


How time-series compression algorithms work, and why they matter.
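One common building block in this family of techniques — a simplification, not any particular library’s on-disk format — is delta-of-delta encoding of timestamps: regularly sampled timestamps collapse to runs of zeros, which then compress extremely well.

```python
# Toy delta-of-delta encoding of timestamps: store the first timestamp, the
# first delta, then the change in delta for each subsequent sample.
def delta_of_delta(timestamps):
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return [timestamps[0], deltas[0]] + [b - a for a, b in zip(deltas, deltas[1:])]

ts = [1652486400 + 10 * i for i in range(8)]   # samples every 10 seconds
print(delta_of_delta(ts))                      # [1652486400, 10, 0, 0, 0, 0, 0, 0]
```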


Research Computing Systems

Videos are available for talks from the Science Gateways Community Institute MiniGateways 2022 spring conference. Talks on data collection from real-time sensor networks and Open OnDemand application development, and a lightning talk in the first session on building an open science data web application with no budget, may be of interest.


Cool - Arm’s HPC compiler suite is now freely available! The Fortran one is coming, with Arm working with LLVM on Flang. There are, ahem, others available as well, but this is good news.


Emerging Technologies and Practices

Intel Unrolls DPU Roadmap, with a Two Year Cadence - Timothy Prickett Morgan, The Next Platform
ZFS without a Server Using the NVIDIA BlueField-2 DPU - Patrick Kennedy, Serve The Home
Platform Week Product Updates - Cloudflare

After years of Software Defined Networking being largely a cloud thing, pushing intelligence into the network - either the data centre fabric or the WAN - is clearly very nearly here, and I’m fascinated to see how it shifts how we’ve thought for quite some time about designing systems.

Without getting into whose DPU is better than whose for which purpose [CoI reminder: I’m at NVIDIA now], the tech bigs clearly see an important future here. AMD has made a big purchase to get into that space, and Intel at their Vision conference announced their DPU roadmap through to 2026 (although to be fair it gets fuzzy past 2023, and they still insist on calling them IPUs). Morgan gives a runthrough of Intel’s announcement, and characteristically gives us some historical perspective - disaggregation is just the pendulum swinging back after years of pushing everything onto the CPU:

In a way, all of the virtualized functions that were added to operating systems and drivers are being consolidated on a dedicated and cheaper subsystem – just like mainframes and proprietary minicomputers did five and four decades ago because mainframe CPUs were also very expensive commodities

In the second article, Kennedy runs a ZFS RAID array server entirely from the compute on two DPUs, without any x86 server in the path - one DPU exposing the block devices, and the second making a RAID array. There are plans to try to use some further offloading to specialized hardware on the DPUs. Storage is one of the areas I’m most excited about the impact of DPUs (the other is tenant isolation/security).

On a same-but-different note, this week Cloudflare announced a number of new or updated products, particularly around their edge Workers, durable objects and distributed edge databases (based on SQLite!), open beta for their R2 object storage, and NAT and distributed API gateways. Again, this is about pushing compute and logic out into the network, but here the WAN. The work announced this week plus their previous work of pushing zero-trust authentication into the network and edge could have real impact for designing science gateways, DMZs, etc., changing how the data is distributed and what “has” to be collocated. Cloudflare of course isn’t the only company working on this, but they’re currently the furthest ahead with a complete portfolio of services.

R2’s cost structure, as I mentioned in #94, is pay per month for the storage, with no per-download bandwidth fee. (There’s a nominal fee per million listing or reading operations, but unless those occur more than a million times a month - unlikely for our community - you won’t even pay that.) That is going to be potentially very useful for research computing groups who want to make large datasets or other downloads available to the community, and for whatever reason can’t host them themselves.


Monte Cimone: Paving the Road for the First Generation of RISC-V High-Performance Computers - Bartolini et al., arXiv:2205.03725

I’m more sceptical about RISC-V than others I know. Maybe it’s a lack of imagination, but I just don’t see how making an ISA open source leads to uptake and innovation. Hardware isn’t software. Even in a world with lots of FPGAs, the ISA is one thing, but most RISC-V core designs are very much not open source — and for going beyond FPGA implementations, making chips remains eye-wateringly expensive. Even setting aside the issues of turning an ISA into physical chips that implement it, one of the superpowers of open source is forking and remixing, and having forked and remixed versions of an ISA isn’t an obvious path to success for an architecture.

But lots of people I respect strongly disagree, and god knows I’ve been wrong before. For areas where standards are important, enough people wanting the standard to succeed can be enough for it to do so. So one keeps an eye on things.

Here Bartolini et al. build an 8-node RISC-V cluster, based on SiFive U740 SoCs. Impressively, enough of the software stack is RISC-V ready for the authors to be able to run Linux, Slurm, NFS, LDAP, ExaMon for performance monitoring, the HPL and STREAM benchmarks, and Quantum ESPRESSO. (One huge and unsung benefit of the work done over the past decade to build the Arm software stack has been to identify and special-case “all the world’s an x86” assumptions that have lurked in code for a long time.)

Performance isn’t brilliant — HPL gives ~2 GFLOP/s per node, the available 1GbE only gets them to ~75% efficiency on 8 nodes, and something is holding them back to 15% of theoretical bandwidth on STREAM — but power consumption (see below) is absolutely tiny; and this even being possible demonstrates that the toolchain and software stack are further along than I had realized.

Table 6 from Bartolini et al, showing power consumption broken down by core, DDR, IO, PCIe, and PLL subsystems during boot, idle, and the HPL, Stream, and quantumESPRESSO tests.  Power draw is in single digit Watts.


The first quantum computing coding competition that I’m aware of.


Fascinating to me how cloud services like AWS Batch are adopting features familiar from HPC queueing systems, like job dependencies.


Random

You know how you’ve wanted Wifi and bluetooth on your Apple II? You’re in luck.

Setting up and exploring gopher in 2022. I’ll remind you all that in 1994 I was mortally certain that the web was a fad and Gopher was where things were going, so do keep that in mind when deciding how much weight to give my opinions on technology.

COLR, a font format for letters that are intrinsically coloured with multiple colours. The last 15 years of the internet has proven that we will always use new technologies wisely and with restraint, so I’m sure this will turn out fine.

Implementing TeX’s paragraph-wrapping in perl, for some reason.

Writing your own uint128 to float64 conversion by hand for some reason, and getting better results than the Rust compiler?

Accessibility in Linux is kind of a mess.

Adding CSV support to (an implementation of) awk, which honestly sounds pretty useful.

Implementing a lock-free queue with C atomics.

Uncertainty propagation for the masses - guesstimate lets you give the range of input values and compute expected output values. This is the sort of thing we know how to do in science and sort of forget to apply in the rest of our lives. The results are more useful than you’d think!
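The underlying idea — not guesstimate’s implementation, just the back-of-envelope Monte Carlo it makes accessible — fits in a few lines; the quantities and ranges below are invented for illustration:

```python
import numpy as np

# Back-of-envelope uncertainty propagation by Monte Carlo, in the spirit of
# guesstimate (but not its implementation). Invented example: how long will a
# data migration take, given uncertain dataset size and sustained throughput?
rng = np.random.default_rng(0)
n = 100_000

dataset_tb = rng.uniform(40, 80, n)       # "somewhere between 40 and 80 TB"
throughput_gbps = rng.uniform(2, 6, n)    # "2 to 6 Gb/s sustained"

hours = dataset_tb * 8_000 / throughput_gbps / 3600   # TB -> Gb, seconds -> hours

lo, mid, hi = np.percentile(hours, [5, 50, 95])
print(f"migration time: {mid:.0f} h (90% interval {lo:.0f}-{hi:.0f} h)")
```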

The history of the Intel Hypercube, one of the first commercial multiprocessing systems, and (in posts coming soon) its influence on MATLAB and technical computing. The first parallel computer I ever used was the Paragon, an immediate descendant of this machine.


That’s it…

And that’s it for another week. Let me know what you thought, or if you have anything you’d like to share about the newsletter or management. Just email me or reply to this newsletter if you get it in your inbox.

Have a great weekend, and good luck in the coming week with your research computing team,

Jonathan

About This Newsletter

Research computing - the intertwined streams of software development, systems, data management and analysis - is much more than technology. It’s teams, it’s communities, it’s product management - it’s people. It’s also one of the most important ways we can be supporting science, scholarship, and R&D today.

So research computing teams are too important to research to be managed poorly. But no one teaches us how to be effective managers and leaders in academia. We have an advantage, though - working in research collaborations has taught us the advanced management skills, but not the basics.

This newsletter focusses on providing new and experienced research computing and data managers the tools they need to be good managers without the stress, and to help their teams achieve great results and grow their careers.