#50 - 13 Nov 2020

Task relevant maturity; COVID's unequal effects on researchers; writing papers about scientific software; strategically choosing a stack; research workloads on OpenShift Kubernetes

Hi!

This is email #50, and a good time to think about what’s next. It helps that events of this week - the US election, Pfizer’s vaccine - make thinking about the future much less discouraging.

I wrote last week that the post-pandemic challenges research computing teams will face include:

  1. Having really high-functioning teams, ready to work with new distributed coworkers
  2. Having a very clear focus to communicate to administration and funders
  3. Having a very clear value to communicate to researchers and other stakeholders

The current link-roundup approach has supported readers much more in 1 than in 2 or 3; there’s a lot more material out there every week on managing technical teams than there is on (say) grant writing for research computing, or research product management.

In 2021 I’d like this newsletter to be much less about “links of the week” and more about building up a library of material that helps research computing teams face the challenges we take on in supporting research in an uncertain environment.

To support that, I’m putting together a much-improved webpage which will have a library of resources and make it easier to search past material. I intend that to be up and running early in the new year. I’d also like to change the format a bit, with shorter, more focussed newsletter issues.

I’ll send around a survey before the end of the year, but if you have ideas about what you’d like to see or how you’d like things structured - should everything be in one newsletter, or should you be able to pick and choose topics? - or about new formats - would something like a community chat/zoom call, interviews, or something else be useful to you? - please let me know: email me at [email protected], or just respond to this email.

Managing Teams

We Learn Faster When We Aren’t Told What Choices to Make - Michele Solis, Scientific American

A key responsibility of ours as managers is to make sure our team members are growing in their skills and abilities. Solis’ article summarizes recent work in simple game-like settings where participants learn, for instance, which symbol in the game is worth more. The authors demonstrate that participants learn faster when it’s their own choices driving the learning than when they’re guided through choices that show them the same things.

It’s important to delegate new tasks to team members, and while they’re learning - climbing up the responsibility ladder, or gaining experience managing increasingly complex projects - it’s our responsibility to provide guidance to team members and not leave them adrift. But we do that best by providing guardrails, not by controlling the steering wheel.


End Micromanagement: 6 Signs You’re a Micromanager (And What to Do Instead) - Dara Fontein
The Most Important Management Concept You’re Missing: Task Relevant Maturity - Lighthouse

Relatedly, one of the big questions new managers have is how much managing is too much - you don’t want to micromanage. In research computing the default is to come down on the side of way, way too little managing.

Read together, these two articles help clarify things. Fontein’s article outlines common signs of micromanaging, and I can almost guarantee that if you’re in research computing you don’t tick any of those boxes (except maybe “you hate delegating tasks”, but that’s a skill that just needs practice).

But when working with new people, especially junior staff, you do have to keep an eye on them as they progress. So how does that work?

The Lighthouse article on task relevant maturity (and other resources that have come up before - An Engineering Team where Everyone is a Leader and The Responsibility Ladder) helps clarify thinking around this. The idea isn’t to drop someone new into some big project and let them sink or swim; that’s not good management. But doing the work for them or rewriting their code isn’t good management either.

The idea is to give clear goals about what is to be achieved, along with measures of success, and to let them run with it. Initially you follow up relatively closely - assessing the whats, not the hows - and as they demonstrate “task-relevant maturity” the training wheels slowly come off: more and more latitude, less and less oversight.


Product Management and Working with Research Communities

A web-native approach to open source scientific publishing – Nokome Bentley, OpenSource.com

The first of eLife’s Executable Research Articles (ERAs) are now online, embedding live, interactive Markdown and Jupyter notebooks in published papers.

Readers of an ERA publication can then inspect its code, modify it, and re-execute it directly in their browser, allowing them to better understand how a figure is generated. They can change a plot from one format to another, alter the data range of a specific analysis, and more.

You can see two here:

These papers are powered by Stencila, which appears to be open source with paid hosting.


Unequal effects of the COVID-19 pandemic on scientists - Kyle R. Myers et al., Nature Human Behaviour

A massive survey of 4,535 faculty/PIs looking at what’s happened to their allocation of work time during the pandemic.

Some headline results:

  • Fields are affected differently - equipment- or fieldwork-intensive fields have really taken a hit
  • Researchers with small children have taken a big hit on their research time
  • Women have taken an especially big hit on their research time, across fields and types of research
  • Research time is down by a median of ~20% across the board, while changes in grant writing, teaching, and other tasks are more field-dependent

In research computing, we have the opportunity to step up - moving from service providers to collaborators - and shoulder some of the load that’s fallen on the communities we serve (where it’s possible for us, of course - many of us have had our home and family responsibilities increase, too).


Bryn Roberts: Reflections on Pharma’s Scientific Computing Journey - Stan Gloss and Loralyn Mears, BioIT World

It’s always interesting to see why communities adopt technologies; anything that catches on quickly does so because it addresses a perceived need, and solutions that address those needs can go from curiosities to suddenly being in high demand.

If you follow the job board at all you know that there are a tonne of research data jobs of all sorts in pharmaceuticals - from governance to data platforms to software development. Gloss and Mears’ interview with Roberts describes why: until very recently, data was heavily siloed within pharmaceutical companies, due to inertia and a lack of incentive to change. As easy-to-use analysis tools and platforms became available that could take advantage of integrating those different data types, the siloing became untenable. Now there’s even (limited) data sharing across companies, and a genuine interest in FAIR data standards and common data models and APIs.


Research Software Development

Ten simple rules for writing a paper about scientific software - Joseph D. Romano, Jason H. Moore, PLOS Computational Biology

This may be useful for members of your team or researchers your team works with. Much will be familiar, but it’s worth highlighting a few suggested rules:

  • Publish for users, not developers - or, more generally, know your audience and what you’re trying to accomplish with a paper or any other communication
  • Create a long-term software management plan
  • Maintain consistency between code, documentation, websites, and papers
  • Plan for follow-up publications and update the software accordingly
  • Prioritize visibility and availability - again, know what you’re trying to accomplish and who you’re trying to reach.

Rust vs Go - John Arundel

Both these programming languages are getting interest in different parts of research computing, and the comparison here is fair: Rust as a C++ replacement, essentially a systems programming language, and Go as a more general development language - covering Java-like use cases, but in a language a tiny fraction of the size.

For a lot of low-level methods development Rust is probably an increasingly decent choice (unless everything’s rectangular arrays, in which case Fortran is still hard to beat). Rust’s parallel computing tools are still pretty primitive compared to C++’s or Fortran’s, though rayon looks neat - see the sketch below. I actually think it’s a failure of our community that we require tools like aligners and fluid-dynamics software to be written in a systems programming language, but that’s where we are now. For more general software development, I’ll be curious to see how Go’s new generics handling works out.
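
To give a flavour of rayon, here’s a minimal sketch of mine - not code from the article, and assuming rayon = "1" as the only dependency in Cargo.toml. Turning a serial iterator chain into a parallel one is often a one-line change:

```rust
// Minimal rayon sketch - illustrative only, not code from the article.
// Assumes rayon = "1" as the sole dependency in Cargo.toml.
use rayon::prelude::*;

fn main() {
    let xs: Vec<f64> = (0..10_000_000).map(|i| i as f64 / 1.0e6).collect();

    // Serial reduction over an iterator chain:
    let serial: f64 = xs.iter().map(|x| x.sin().powi(2)).sum();

    // Parallel version: iter() becomes par_iter().  rayon work-steals
    // chunks across a thread pool, and the borrow checker guarantees
    // the closure is safe to run concurrently.
    let parallel: f64 = xs.par_iter().map(|x| x.sin().powi(2)).sum();

    // Parallel reduction reassociates floating-point sums, so compare
    // with a tolerance rather than exactly.
    assert!((serial - parallel).abs() < 1e-9 * serial.abs());
    println!("serial = {}, parallel = {}", serial, parallel);
}
```

That’s shared-memory, loop-level parallelism in the spirit of OpenMP; for distributed-memory work, the MPI ecosystems around C++ and Fortran still have the clear edge.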


How to Recalculate a Spreadsheet - lord.io

Reactive programming frameworks, as they mature and their performance improves, could be quite useful for research computing. Here’s a walkthrough of how they work - using the real history of spreadsheet internals as a way in - that ends by introducing the Rust Salsa package.
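
To make the core trick concrete, here’s a toy sketch of demand-driven recalculation with revision stamps - my own illustration in Rust (since Salsa is a Rust package), emphatically not Salsa’s actual API. A derived cell caches the revisions of the inputs it last read, and recomputes only when one of those revisions has moved:

```rust
use std::collections::HashMap;

// Inputs carry a revision stamp that bumps on every change.
struct Engine {
    inputs: HashMap<&'static str, (f64, u64)>, // name -> (value, revision)
    revision: u64,
}

impl Engine {
    fn new() -> Self {
        Engine { inputs: HashMap::new(), revision: 0 }
    }
    fn set(&mut self, name: &'static str, value: f64) {
        self.revision += 1;
        self.inputs.insert(name, (value, self.revision));
    }
    fn get(&self, name: &str) -> (f64, u64) {
        self.inputs[name] // panics on an unknown cell; fine for a toy
    }
}

// A derived cell remembers the input revisions behind its cached value -
// the "early cutoff" that spreadsheet engines and Salsa-style frameworks
// use to avoid recomputing the world on every change.
struct DerivedCell {
    deps: Vec<&'static str>,
    compute: fn(&[f64]) -> f64,
    cache: Option<(f64, Vec<u64>)>, // (value, input revisions when computed)
}

impl DerivedCell {
    fn eval(&mut self, engine: &Engine) -> f64 {
        let current: Vec<(f64, u64)> =
            self.deps.iter().map(|d| engine.get(d)).collect();
        let revs: Vec<u64> = current.iter().map(|&(_, r)| r).collect();
        if let Some((value, ref cached_revs)) = self.cache {
            if *cached_revs == revs {
                return value; // nothing this cell reads has changed
            }
        }
        let vals: Vec<f64> = current.iter().map(|&(v, _)| v).collect();
        let value = (self.compute)(&vals);
        self.cache = Some((value, revs));
        value
    }
}

fn main() {
    let mut engine = Engine::new();
    engine.set("a", 1.0);
    engine.set("b", 2.0);

    let mut sum = DerivedCell {
        deps: vec!["a", "b"],
        compute: |v| v[0] + v[1],
        cache: None,
    };

    println!("{}", sum.eval(&engine)); // recomputes: 3
    println!("{}", sum.eval(&engine)); // cache hit: no work done
    engine.set("b", 5.0);              // bumps b's revision
    println!("{}", sum.eval(&engine)); // recomputes: 6
}
```

Salsa generalizes this pattern: queries record at runtime which other queries they actually read, so the dependency graph is discovered rather than declared, and unchanged intermediate results cut recomputation off early - the same machinery behind tools like rust-analyzer.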


An open-source, high-performance simulator for large-scale geological carbon dioxide storage - Press Release
Extreme-Scale Scientific Software Stack (E4S) version 1.2 - E4S Collaboration
Engineering strategy every org should write - Will Larson

These are all different stories, but they’re connected by a theme: the power of strategically adopting a software stack, how that makes a development team more capable of creating new applications rapidly, and how important it is to have a strategy for adding things to - and deprecating things from - that stack.

The press release is about GEOSX, a pretty neat portable multi-physics package for simulating geological storage of CO2, put together by a collaboration of LLNL, Total, and Stanford.

GEOSX came together as quickly as it did because of LLNL’s RADIUSS project - Rapid Application Development via an Institutional Universal Software Stack - a toolbox of components for assembling multiphysics simulation software. RADIUSS overlaps quite a bit with the Extreme-Scale Scientific Software Stack (E4S), which just had a new release.

Larson’s article is a list of strategies every development or computing team should have - from the mundane (how do we review, merge, deploy, and release code) to hiring, but also “what are our approved technologies for new projects” and “when and how do we deprecate internal tools”. It’s not enough just to have a preferred stack; there has to be a strategy behind it, and that strategy informs when new tools enter the stack and others leave.

Larson ends his list with a pretty thought-provoking paragraph:

For many of these, you could argue that they are more policy than strategy. That’s a reasonable argument, but for me the distinction is that you turn a policy into a strategy by explaining the rationale behind it: a policy is just an incomplete strategy.


Research Computing Systems

EPCC passes ISO 27001 Information Security and ISO 9001 Quality audits - Anne Whiting, EPCC

EPCC has passed its ISO 27001 (infosec) and ISO 9001 (quality) audits, which is a remarkable achievement for a research computing centre. It’s going to be increasingly important for centres to carve out their niches and be able to explain why they are the best choice for particular research computing needs; EPCC has clearly decided that part of its portfolio is sensitive data with high availability requirements, and has pursued certifications to demonstrate that it is a strong partner for those needs. Really commendable.


Emerging Data & Infrastructure Tools

A Complete Guide for Running SpecFEM Scientific HPC Workload on Red Hat OpenShift - Kevin Pouget, RedHat OpenShift
Demonstrating Performance Capabilities of Red Hat OpenShift for Running Scientific HPC Workloads - David Gray and Kevin Pouget, RedHat OpenShift

Following up on the article in #49, these two blog posts describe the work of RedHat’s Performance and Latency Sensitive Applications (PSAP) team in running Gromacs and SpecFEM3D Globe on OpenShift, RedHat’s Kubernetes distribution.

It’s pretty clear running HPC workloads on k8s isn’t quite ready for mainstream yet - but I think it’s closer than some would think. The tipping point, when we start seeing more adoption, won’t be “when things are exactly as easy and performant on k8s as they are on HPC systems”. Instead, I think there will be two steps:

  • When cloud-style data-intensive workloads are common enough, and enough of a pain to run on HPC systems, there’ll be an uptick of teams running both HPC and cloud-style (Kubernetes/OpenShift) systems. I think we’re clearly in the middle of this transition.
  • When running HPC workloads on those Kubernetes-style systems is less of a pain (along all dimensions) than running two systems, it’ll start being more common to run the workflows there. We’re not there yet, but work by teams like the PSAP group at RedHat is getting us closer.

The first blog post covers SpecFEM, which requires a mesher job followed by a solver job. A golang client was written to fire up the job suite, aided by a Kubernetes datatype API for defining the job type; then the images had to be built. To run the jobs, as described before, Google’s MPI operator was used, and the team had to kludge together a way to send the output of the mesher job into the input of the solver job. (It really seems to me that shared filesystems across jobs, like shared networks between jobs, are at the heart of the differences between cloud-native and HPC-style clusters.)

The second blog post covers the performance results, for microbenchmarks as well as for SpecFEM3D_Globe and Gromacs (the post described last week). Software-defined networking definitely affects latency and bandwidth microbenchmarks, which is the reason for the isolation breakout described before. With that in place, scaling of Gromacs up to 32 nodes looks pretty decent - maybe 10% worse performance than bare metal, but performance also seems a bit more consistent. SpecFEM seems to fare better: strong scaling shows about a 10-11% performance drop as you go up to 32 nodes, but for weak scaling the difference is more modest - 3%.


Ten simple rules for writing Dockerfiles for reproducible data science - Nüst et al, PLOS Computational Biology

This may be a good link to send to researchers and trainees looking to publish docker images for the first time - or even to those who have been using docker for a long time but haven’t updated their workflows. The rules are:

  • Use available tools - reuse existing dockerfiles, for instance; or if tools like containerit, dockta, or repo2docker (for Jupyter notebooks) meet your needs, try those.
  • Build on existing images - a lot of relevant base images can help you avoid building from scratch (and will track multiple dependencies for you).
  • Format for clarity - All code, even dockerfiles, is read more often than it’s written
  • Document within the dockerfile - Ditto
  • Specify versions - Reproducibility is greatly aided by version pinning
  • Use version control
  • Mount datasets at runtime - essential for reuse
  • Make it one-click runnable - Take advantage of container-running tools in RStudio, Jupyter, or VSCode
  • Order the instructions - Speed the building of images by making it easier to cache earlier steps
  • Regularly use and rebuild the containers - Ensure they still work, and keep them maintained.

Events: Conferences, Training

FAIR 4 Research Software - Nov 17 7:00-10:00 UTC, Nov 19 16:00-19:00 UTC, Lamprecht et al, SORSE event

A workshop on applying the FAIR principles to research software.


Plato Elevate Winter Summit - 11 Dec, Free

Plato, which targets technical leaders, advertises this as “A half-day summit to make you a better engineering leader!”. Sessions on organizing your team, hiring and onboarding remotely, and improving alignment and communication seem very on-topic for us. Talks from the previous summit are available.


Random

If you’ve wondered about self-hosting MOOC-style teaching/learning platforms like Open edX, tutor is an easy-to-deploy docker-based distribution.

There’s growing support in the Linux maintainer community for replacing the protocol and behaviour of scp with a tool that keeps most of the same options but uses the sftp protocol underneath. This LWN article explains why.

DBDiagram.io - a slick-looking freemium entity-relationship diagram drawer for relational db schemas.

NoisePage - a really interesting project out of CMU: a postgres-compatible self-driving database system with Arrow-based columnar storage.