#98 - 29 Oct 2021

The importance of the basics; Hour-long standups?; Getting answers to questions; Coordination models; RSE and tangled messes, part II; CI/CD minimal definitions; JAX for HPC and AI

Hi!

At least one other research group has also taken to providing some interview questions ahead of time. In response to the discussion in the last issue, Titus Brown wrote in on Twitter:

We’ve also started giving out interview questions in advance. Like you, we’ve found it leads to better discussions. I actually posted one set here, and [Rayna Harris] blogged about her hiring experience on the other side of this, here. Don’t remember where the idea came from initially, but we like being open about things so :shrug:

I’m really pleased the newsletter has grown to the point that we can have back-and-forths about topics like this, because I think we need it in this community.

This week I heard of one research software developer leaving a group because they weren’t getting the kind of work they found meaningful - partly because their manager only ever really talked with them about work goals and plans for the future at the institutionally mandated annual review.

Research computing and data is vitally important, and so are the teams doing that work. But no one ever really teaches research computing team leads or managers how to support those team members and lead those teams effectively. It’s bad for research, and bad for our team members - who deserve good support, and who are scandalously underpaid compared to industry and can leave at any time.

The thing is - and stop me if you’ve heard this - in research, we pretty much know the advanced skills of managing and leading - building a multi-institutional collaboration, creating a clear vision of a necessarily somewhat nebulous research project, communicating across different kinds of domain knowledge. But the basics - how do we hire, how do we set performance goals and nudge people towards them, how do we make sure we have open lines of communication with our team members (one-on-ones), when does it make sense to delegate - no one ever tells us this stuff. Hopefully with this newsletter community we can build some of that shared knowledge together.

Let me know (just hit reply, or email [email protected]) if there are any particular basics you’ve struggled with that we can talk about. In the meantime, on to the roundup, and have a happy Hallowe’en:

Managing Teams

Stand-up Meetings are Dead (and what to do instead) - Ben Darfler, Honeycomb Blog

What if daily standups, but for an hour?

Darfler describes how Honeycomb has switched their standups - from the usual short round-table format to a daily hour-long gathering (a “meandering team sync”) that includes social time, and then a collaboratively-edited catch-all agenda of work items. Rather than being formulaic, it becomes the standard place for team-wide discussions (technical or process) and also explicitly includes a social component. Darfler finds that it reduces total meeting time by always providing a venue for discussion of any given topic (much the way regularly-scheduled one-on-ones typically reduce interruptions - for both team member and manager - by providing a bucket for topics to go into). Darfler recommends starting with 30 minutes to see how it goes.

I think almost any meeting format can work for a team as long as there are regular check-ins on whether the format is still working, and opportunities for course correction. I’m not sure our team would switch to this format any time soon, but it’s an interesting idea and I’d be curious to see how it worked.

What sorts of meeting rituals do you use in your team? Do you have anything other than the weekly staff meeting and sprint rituals like standups, sprint planning, and retros? What’s worked well for your team (and alternatively, what did not work?)


A framework to improve performance - TLT21

Long-time readers won’t learn much from this article, but it’s a short read:

The first step to adequately address consistent performance issues (emphasis on the “consistent”) is to ensure that you have a clear and transparent standard for good performance for each role.

It’s “easy” but takes a lot of work to have shared, common expectations about performance. As a manager or lead you can help by sharing your expectations regularly with your team members via feedback, and by helping the team develop its own explicit expectations of each other.

Then comes the work of nudging people towards performance if they’re not there yet - if the issue is one of behaviours, then feedback and coaching on behaviour; if it’s one of skills, then training; and if it’s one of knowledge, then documentation (which often is a whole-of-team effort itself).


Managing Your Own Career

How to get useful answers to your questions - Julia Evans

Evans gives advice for how to get useful answers to questions - the context she uses is technical questions, but honestly the approach works just as well for getting your boss or collaborators to answer questions in email, or anything else.

She offers two pieces of advice for making it easier for the question-answerer to give you the answer:

  • Ask yes/no questions
  • State your current understanding

And three pieces of advice for getting more out of the answer:

  • Be willing to interrupt
  • Don’t accept responses that don’t answer your question
  • Take a minute to think

Product Management and Working with Research Communities

How To Produce a Webinar Series - Osni Marques et al., HPC Best Practices (HPC-BP) Webinar Series

The Exascale Computing Project has hosted 58 roughly monthly webinars on “HPC Best Practices”, so they’ve gotten it down to more or less a science now. In this GitHub repo, the organizers have a checklist, a guidance email to presenters, and a paper from 2019 describing their experiences. This might be a good starting point if your group or community wanted to start organizing such a series.


Better coordination, or better software? - Jessica Joy Kerr
Coordination models - tools for getting groups to work well together - Jade Rubick

Collaboration is good, but in large-enough collaborations it isn’t feasible or even desirable to have everyone working as if they’re on one large team. As Kerr points out, at some point you need to move beyond collaboration to coordination around well-defined interfaces (with occasional, ad-hoc deep collaboration between members of the sub team).

Rubick is slowly writing a pretty impressive compendium of coordination models - within teams but also across them - that he’s seen work, how to make them work, and their tradeoffs. Several of them are extremely relevant to research computing and data.

There was a lot of discussion early on in the newsletter about centralized vs embedded RSE or data science teams; it’s nice to see someone thoughtfully writing up a more detailed overview of the kinds of coordination models, with frank looks at their failure modes.


Research Software Development

Is Research Software A Tangled Mess? - Peter Schmidt and Derek Jones, Code for Thought podcast

Jones writes a blog, and now a book, on empirical data about software engineering. Earlier in the year he wrote a post, Research software code is likely to remain a tangled mess, that we mentioned in #63 and that got a lot of, let’s say, “attention” in the RSE community. His comments about research software written by researchers weren’t really that controversial; the controversy was more about his doubts that the growing RSE effort will make many inroads.

Peter Schmidt interviewed him for the RSE Code for Thought podcast. I tend to be pretty sympathetic towards what he’s saying:

  • There’s very little objective evidence in favour of most software development best practices; that doesn’t mean that what we call best practices are bad, but there’s not a lot of evidence to demonstrate they lead to objectively better outcomes
  • The default for developers is to write tangled messes
  • Researchers aren’t trained to do software development, and don’t do it every day
  • Therefore most researcher-written code is a tangled mess
  • That may not even be especially bad - one of the few statements about software development that there is a lot of data behind is that most software in industry has a very short shelf-life.
  • If that’s the case, not putting effort into software engineering until a piece of code has stood a certain test of time and users isn’t obviously a poor choice

I think it’s likely at least as true in research as in industry that software has a short life, and tends never to be used by anyone other than its authors and their immediate peers in the lab. Most ideas don’t pan out (that’s true in business just as much as in research).

Maybe more controversially, Jones argued:

  • Software “sustainability” doesn’t actually mean anything. Testing does, testing is good, but “sustainability” doesn’t.

Minimum Viable CD - Minimum CD signatories

There’s a lot of disagreement about what CI/CD means, with lots of people using it for fairly disparate things. This is a push for clarity around these terms; this list of signatories has a pretty modest list of requirements for continuous integration, which I think most research software teams probably meet:

  • Trunk-based development
  • Work integrates to the trunk at a minimum daily
  • Work has automated testing before merge to trunk
  • Work is tested with other work automatically on merge
  • All feature work stops when the build is red
  • New work does not break delivered work

and then continuous delivery, which I think most groups are a little further behind on, not least because I think many systems teams don’t support it:

  • Use Continuous integration
  • The application pipeline is the only path to deploy to production.
  • The pipeline decides the releasability of changes, its verdict is definitive
  • Artifacts created by the pipeline always meet the organization’s definition of deployable
  • Immutable artifact. No human changes after commit.
  • All feature work stops when the pipeline is red
  • Production-like test environment
  • Rollback on-demand
  • Application configuration deploys with artifact

Research Computing Systems

Real-World HPC Gets the Benchmark It Deserves - Nicole Hemsoth, Next Platform

Following hot on the heels of the news that China might already have two exascale computers but couldn’t be bothered to submit to the always-dubiously-meaningful Top500 list, Hemsoth reports on an actual real set of HPC benchmarks, put out by the Standard Performance Evaluation Corporation (SPEC), of SPECint, SPECfp, etc. fame (although I notice now that they haven’t been called that since 2006).

SPEC has long had MPI and OpenMP benchmarks; SPEChpc is a suite of 6-9 benchmarks including kernels modelled after weather, astro, HEP, combustion, and solar physics codes that involve combinations of MPI, OpenMP; there are separate accelerator versions too. The SPEChpc suite isn’t open source, but it’s freely available to academic or non-profit organizations.

The best benchmarks, of course, are your real workloads. I don’t know if SPEChpc will take off, or even yet whether it’s any good, but defining some kind of “official” semi-synthetic set of reasonable and easy-to-set-up kernels is a necessary first step in moving us away from the execrable Linpack benchmark used in the Top500. It also lets users take the individual benchmarks more seriously if they’re closer to the communication and computation patterns of their applications, rather than just reporting a single number.

SPEChpc2021 Large runs for Summit, Frontera (in 3 configurations) and Taurus


Emerging Technologies and Practices

Why JAX Could Be The Next Platform for HPC-AI - Nicole Hemsoth, The Next Platform
Machine learning–accelerated computational fluid dynamics - Kochkov et al., PNAS
End-to-end learning of multiple sequence alignments with differentiable Smith-Waterman - Samantha Petti et al, bioRxiv

This is a very odd pairing of papers, with two very different fields I’ve worked in - bioinformatics and fluid dynamics - connected by a python library.

I first mentioned JAX in the newsletter way back in #34 as “a really cool autodifferentiation package for python”. While that is and was a key component of it, even then it was something more than the traditional autodiff tool, emphasizing composability like differentiating through loops, branches, and recursion; it also has primitives for vectorization and can compile code straight to GPU code (or Google’s TPUs). That makes it a useful tool both for traditional numerical computation and for deep learning and AI.
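To make that composability concrete, here’s a minimal sketch - my own toy example, not taken from either paper - of JAX’s grad, jit, and vmap working together on a function with a Python loop and a data-dependent branch:

```python
import jax
import jax.numpy as jnp

def f(x):
    # Toy function with a Python loop and a data-dependent branch.
    total = 0.0
    for k in range(1, 4):
        total = total + jnp.where(x > 0, x ** k, -x)
    return total

df = jax.grad(f)          # autodifferentiation through the loop and branch
fast_df = jax.jit(df)     # compile via XLA for CPU/GPU/TPU
batch_df = jax.vmap(df)   # vectorize over a batch of inputs

print(fast_df(2.0))                       # derivative at a single point
print(batch_df(jnp.arange(1.0, 5.0)))     # derivatives for a whole batch at once
```

The same handful of composable transforms is what gets used whether the function being differentiated is a neural network loss or a numerical solver.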

In the first article, Hemsoth interviews Stephan Hoyer, climate physicist turned Google AI applied scientist, about his team’s work with JAX-CFD, including a paper published in PNAS. JAX-CFD uses JAX for both traditional CFD numerics and AI; in the paper, Kochkov et al. first do high-resolution runs and train AI models for interpolation and correction, building (if I’m understanding this correctly) something like a subgrid/modified advection model. They then demonstrate, on larger domains, decaying turbulence, and higher-turbulence simulations, that incorporating the model allows them to get the same accuracy running at lower resolutions; or, as Hemsoth succinctly summarizes:

even though it used quite a bit more computational power (150X more FLOPS), [it] was only 12X slower at the same resolution but 80X faster for the same accuracy.

Fig 1 of Kochkov et al, demonstrating their training and simulations, the overall performance, and the runtime enhancement allowed by running at 1/8 the resolution
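To give a sense of the general shape of that kind of hybrid solver - and to be clear, this is just my own reading of the approach; every function and parameter name below is hypothetical and none of it is the actual JAX-CFD API - the idea is roughly a cheap coarse-grid step plus a learned correction, with the whole step differentiable so the correction can be trained end-to-end against high-resolution reference runs:

```python
import jax
import jax.numpy as jnp

# Hypothetical names throughout - a sketch of a learned-correction time step,
# not the actual JAX-CFD interfaces.

def coarse_step(u, dt=0.01):
    # Stand-in for a cheap, low-resolution numerical update
    # (here just a toy finite-difference advection-like step).
    return u + dt * (jnp.roll(u, -1) - jnp.roll(u, 1))

def correction_net(params, u):
    # Tiny stand-in for the learned model: one linear layer with a nonlinearity.
    return jnp.tanh(u @ params["w"] + params["b"])

def hybrid_step(params, u):
    # Coarse physics plus learned correction; differentiable end to end.
    return coarse_step(u) + correction_net(params, u)

def loss(params, u0, u_ref):
    # Train the correction so the cheap model tracks a high-resolution
    # reference trajectory (u_ref holds one snapshot per time step).
    u, err = u0, 0.0
    for t in range(u_ref.shape[0]):
        u = hybrid_step(params, u)
        err = err + jnp.mean((u - u_ref[t]) ** 2)
    return err / u_ref.shape[0]

grad_loss = jax.jit(jax.grad(loss))  # gradients w.r.t. the network parameters

# Toy usage, with made-up shapes:
n = 32
params = {"w": jnp.zeros((n, n)), "b": jnp.zeros(n)}
grads = grad_loss(params, jnp.ones(n), jnp.ones((10, n)))
```

Because both the physics step and the network are written in JAX, gradients flow through the solver as well as the model, which is what makes this kind of hybrid training straightforward to express.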

The paper by Petti et al. is a bioRxiv preprint that uses JAX for something very different - the discrete problem of multiple sequence alignment, applied as a proof of concept to protein structure prediction. Smith-Waterman for sequence alignment is a classic dynamic programming problem. First the authors implemented a “smoothed” version of Smith-Waterman, so that the solution can be differentiated (what is the change in the alignment score of the modified algorithm if we make a small change to the inputs?) and readily calculated with JAX. With that in place, the alignment can be optimized jointly with AlphaFold structure predictions, improving predictions of how the protein folds compared to known proteins. The senior author explains how in a Twitter thread.
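As a rough illustration of the smoothing trick - my own sketch of the general idea, not the authors’ code, and the temperature parameter and function names here are made up - the hard max in the Smith-Waterman recurrence is replaced with a logsumexp-based soft max, so the score becomes differentiable with respect to the pairwise similarity matrix, and its gradient behaves something like a soft alignment:

```python
import jax
import jax.numpy as jnp
from jax.scipy.special import logsumexp

def soft_max(values, temp=1.0):
    # Smooth relaxation of max(); recovers the hard max as temp -> 0.
    return temp * logsumexp(jnp.stack(values) / temp)

def smooth_smith_waterman(sim, gap=-1.0, temp=1.0):
    # sim[i, j] is the substitution score for pairing residue i of one
    # sequence with residue j of the other.
    n, m = sim.shape
    H = jnp.zeros((n + 1, m + 1))
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = soft_max([
                jnp.array(0.0),                       # start a new local alignment
                H[i - 1, j - 1] + sim[i - 1, j - 1],  # match / mismatch
                H[i - 1, j] + gap,                    # gap in one sequence
                H[i, j - 1] + gap,                    # gap in the other
            ], temp)
            H = H.at[i, j].set(best)
    return temp * logsumexp(H / temp)                 # smoothed best local score

# The gradient with respect to the similarity matrix acts like a soft
# alignment, which is what allows alignment to be trained jointly with a
# downstream model.
soft_alignment = jax.grad(smooth_smith_waterman)

A, B = "GATTACA", "GCATGCT"
sim = jnp.array([[1.0 if a == b else -1.0 for b in B] for a in A])
score = smooth_smith_waterman(sim)
weights = soft_alignment(sim)   # (len(A), len(B)) soft alignment weights
```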

I’m not sure where deep learning and research computing will be going, but the fact that JAX is being used successfully in very different fields for very different applications makes it a package worth keeping an eye on.


Developing a unique FPGA testbed for UK researchers - Nick Brown, EPCC

FPGAs have been “emerging tech” for as long as I’ve been doing research computing and data (20 years?); back then we were waiting for FPGAs to get easier. Now it’s starting to look like everything else has gotten harder, so FPGAs no longer stand out; and there are specific applications with vendor-sold FPGA solutions showing such eye-watering speedups that it’s building interest in other areas. In this article Brown talks about EPCC’s new FPGA testbed and some of the applications being worked on there.


Calls for Submissions

22nd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGrid 2022) - 16-19 May, Italy, Papers due 24 Nov

Tracks for topics are:

  • Track 1: Future Internet computing systems
  • Track 2: Programming models and runtime systems
  • Track 3: Distributed middleware and network architectures
  • Track 4: Storage and I/O systems
  • Track 5: Security, privacy, trust and resilience
  • Track 6: Performance modeling, scheduling, and analysis
  • Track 7: Sustainable and green computing
  • Track 8: Scientific and industrial applications
  • Track 9: Artificial intelligence, Machine Learning and Deep Learning

2nd International Conference on Image Processing and Vision Engineering - Online, 20-24 April, Papers due 30 Nov

From the site:

IMPROVE is a comprehensive conference of academic and technical nature, focused on image processing and computer vision practical applications. It brings together researchers, engineers and practitioners working either in fundamental areas of image processing, developing new methods and techniques, including innovative machine learning approaches, as well as multimedia communications technology and applications of image processing and artificial vision in diverse areas.


Events: Conferences, Training

ECP SOLLVE - OpenMP Teleconferences - Monthly calls, Fridays (usually last of month), Zoom, Free

There are roughly monthly OpenMP calls by the Exascale Computing Project, featuring an update on ECP’s SOLLVE and a talk. The talk this month - which, unfortunately, this newsletter is going out too late to let you know about in advance - is “OpenMP Tasking, Part 2: Advanced Topics” by Xavier Teruel-García; there will be another in early December, and then the series returns to its regular schedule in the new year.


Workshop on the Science of Scientific-Software Development and Use - 13-15 Dec, Free, Virtual

This workshop, sponsored by the U.S. Department of Energy, builds on reports from 2019 and 2020 on building software better:

With this increasing diversity, we believe the next opportunity for qualitative improvement comes from applying the scientific method to understanding, characterizing, and improving how scientific software is developed and used.

You can submit a position paper here, or just attend; a workshop report will summarize the breakout sessions.


Random

The thing is - and I don’t like it any better than you do - JavaScript is here to stay, everyone already has a development environment for it installed, and WebAssembly has helped push multithreading support. So now there’s a book on multithreaded JavaScript coming out.

Teach concurrency primitives with this adversarial game where you, the player, try to break multithreaded code - Deadlock Empire.

The new M1s look like they have really interesting multithreaded floating point performance.

An intermediate language for APL-type languages.

Prolog, but instead of values being true/false, they have probabilities.

When to use each of the git diff algorithms.

Round-robin your way to your new favourite coding font. And here’s how it was built with low-code tools.

Anyone ever used database service free tiers for little side products? Cockroach labs talks about theirs here.

GitHub now has a beta feature where it will try to find “merge queues” of PRs which can be merged sequentially without conflict.


That’s it…

And that’s it for another week. Let me know what you thought, or if you have anything you’d like to share about the newsletter or management. Just email me or reply to this newsletter if you get it in your inbox.

Have a great weekend, and good luck in the coming week with your research computing team,

Jonathan

About This Newsletter

Research computing - the intertwined streams of software development, systems, data management and analysis - is much more than technology. It’s teams, it’s communities, it’s product management - it’s people. It’s also one of the most important ways we can be supporting science, scholarship, and R&D today.

So research computing teams are too important to research to be managed poorly. But no one teaches us how to be effective managers and leaders in academia. We have an advantage, though - working in research collaborations has taught us the advanced management skills, but not the basics.

This newsletter focusses on providing new and experienced research computing and data managers the tools they need to be good managers without the stress, and to help their teams achieve great results and grow their careers.