It’s been a good week here to remind me of the breadth of skills we need as research computing managers. I’ve had to meet separately with several groups of research stakeholders, and with a legal team on topics of licensing and privacy, helped a team member debug a service, clarified the scope of multiple tasks, put together a budget for a project that might never be, helped configure some authentication, given a presentation, and coached a team member on presentation skills.
One of the things I’d like this newsletter to grow into in year two is a bit more of a community where we can exchange tips, best practices, and knowledge on the dizzying number of skills we need to have at our command. To put it mildly, my training as an astrophysicist did not prepare me for this.
But it was nonetheless a great week! The team continued on without me with zero issues while I was away for two days (an experiment I’ve been needlessly hesitant to try since the pandemic started), progress is outstanding, and efforts with other stakeholders started months ago are starting to pay off. Our skills from our previous careers do carry over, and we can be successful - sometimes it just takes more trial-and-error than it would have if we had a bit more of a community of research computing team managers.
We got some more tips on email-handling from a reader:
Absolutely, Inbox isn’t a ToDo list device.
Using Gmail, I heavily use filters to pre-tag incoming email so that once I read it (and create an action in OmniFocus if warranted), I hit archive and it already has a tag to cut down on search time/effort.
I have used Active Inbox for a while (Chrome extension), but they’ve wandered from the original, and now I basically only use the tags (S/Action, S/Waiting For, S/Read (for long-form items like Schneier’s security mailing or Paul Graham’s work), and S/Someday (which I don’t actually use)). I mostly use Waiting For, and it is duplicated in OmniFocus, but Waiting For is the category of work I’m most likely to forget rather than Action.
When I was a sysadmin, I received ~126k emails in a year, with all the dept’s systems reporting in daily. I got through it all, Inbox Zero, every day. A bit tougher now that I have a lot fewer emails, but many require replies (but I do try to minimize replies).
As always, we’re happy to hear any productivity tips you have - just hit reply.
And now, the roundup!
The Management Flywheel - Camille Fournier
One of the huge challenges for managers - which can be overwhelming for some new managers - is that changing people systems is hard. And all problems are people problems. Fournier points out that managers who try to get out of whatever problem their team is facing by saying “We’ll Just Use Technology X” or “We Just Need The Right People” are probably not going to succeed. In her experience, and mine, there’s no big-bang change of people or technology that is going to get a team or an organization out of a rut.
Changing how teams or orgs work isn’t a single-step process; “team’s working well” and “team isn’t working well” isn’t a big on/off switch. Fournier uses Jim Collins’ analogy - it’s a flywheel. A flywheel can be stopped dead. If you apply some force to it, it’ll spin for a while, but start losing momentum. If you turn your back too long, if you don’t keep strategically applying force, it will slow down dangerously. But if you do keep giving it a push, again and again, there’s no single point at which you can say “before this, the flywheel wasn’t really spinning, now it is”, but over time it will have very clearly picked up momentum - and now maybe there’s a few more people helping keep it going.
This can be enormously frustrating for new managers. There’s no single moment where you can just step back and say, “yup, that push sure did it all right, now it’s fixed”. But once you get that flywheel going, jeez, there’s no telling how long it can go with a little nudge here and there.
You Have A Performance Problem - Adrian Howard
If people kept telling us our team’s code had a performance problem, we’d do something about it. Given that that’s the case, Howard has a short and pretty scathing article asking us why we’re not doing anything about the diversity, equity, and inclusion problems that people keep pointing out.
Online Research: From Funding to Data Collection - Association for Psychological Science Blog
A lot of psychology research, which has traditionally depended on people coming into the department for experiments, has been disrupted by COVID-19. Much of that work is slowly moving online, which arguably helps with recruiting a somewhat broader population than those within easy commute to research University campuses.
The APS blog discusses some options from a webinar by Robert Ariel (Virginia Wesleyan University) and Joshua VanArsdall (Elmhurst University). The newer options include jsPsych, a library of commonly used tasks, and the hosted platform Cognition.run. Another option for those subjects connecting with desktops is to have the experiment running on the researcher’s computer, connect with Zoom, and enable Zoom’s Remote Control feature. The blog post includes connections to resources if anyone in your researcher community is interested in either approach.
A Software Development Life Cycle for Research Software Engineering - Kings Digital Lab
There was a really interesting SORSE talk this past week, Digital Humanities RSE: King’s Digital Lab as experiment and lifecycle by James Smithies and Arianna Ciula. The Digital lab, which hosts and maintains 160+ digital humanities projects, has a very nice lifecycle model for the research software development/hosting/maintenance efforts they get involved in, and they’ve generously made it, and templates for the documents at every step along the cycle, available to the community. If you’re interested in creating or revisiting a lifecycle model for project engagement with your research communities, this is a good place to look.
How to Introduce and Implement Policy in Your Institution and Still Have Friends Afterwards - Danny Kingsley, Sarah Shreeves
These are the very interesting looking course materials for a previously held course on making changes in policies in academic institutions. The context seems to be things like open access or data management policies in Universities but it has applications more widely. It’s pretty common for those of us on the computing side of research to need to shift institutional policies for our work to be effective, and these sorts of tools are extremely helpful.
As with much in the newsletter, the key described in these materials is to be deliberate and professional in your approach - map out key decision makers and stakeholders, work to find out their incentives and needs, discuss the matter with them systematically.
Software and software metadata - Daniel S Katz
On the data FAIRness side, a lot of work has gone into metadata standards in various fields. For research software, there really hasn’t been. You might want to keep track of contributors and papers-to-cite for credit (like it or not, credit is still coin-of-the-realm in research) but also programming language and/or OS requirements, and various URLs (docs, source code, installation scripts, etc). Individual ecosystems have some of that in their own packaging systems, but there’s nothing cross-ecosystem.
A lot of the information is available in human-readable form like READMEs, which is good, it’s not lost for eternity or anything, but it’s not machine readable. That’s a shame, because as dependency-tree tooling is getting better it would be great to be able to see which giants’ shoulders a particular tool is built on top of, etc.
In this article, Katz reviews the state of software metadata, why it’s still hard, and what are some possible next steps for the community.
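To make “machine readable” concrete, here is a minimal sketch of what a cross-ecosystem record might hold, covering only the fields listed above. The class name, field names, and example values are all hypothetical and are not taken from any existing standard or from Katz’s article.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class SoftwareRecord:
    """Illustrative metadata record; field names are made up for this sketch."""
    name: str
    version: str
    contributors: List[str]
    cite_as: List[str]                 # papers to cite for credit
    programming_languages: List[str]
    operating_systems: List[str]
    urls: Dict[str, str] = field(default_factory=dict)  # docs, source, ...

    def to_json(self) -> str:
        # Sorted keys keep the output stable, which helps diffing and tooling.
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Hypothetical example values, not a real package.
record = SoftwareRecord(
    name="example-solver",
    version="1.2.0",
    contributors=["A. Researcher", "B. Developer"],
    cite_as=["Researcher & Developer (2020), Example Journal"],
    programming_languages=["Fortran", "Python"],
    operating_systems=["Linux"],
    urls={"docs": "https://example.org/docs", "source": "https://example.org/src"},
)

print(record.to_json())
```

Once the same information that lives in a README is in a structure like this, dependency tooling could read it without guessing at free text.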
A couple recent cool research computing projects caught my eye, because they’re likely to have significant impact. In my experience impact in research computing comes at the intersection of computing (simulation, analysis) and real-world data collection.
In many areas it’s common for half or even more of rivers to be seasonal, or “intermittent” rivers - the water only flows during rainy periods. They lay a lot of ecologically important sediment and have a huge role to play in groundwater and ecosystems, but they’re inherently more complicated than always-running rivers and less is known about them. This $6 million, four-year NSF EPSCoR Track 2 grant will be, among other things, a huge IoT effort to put sensors down across many states, collecting data over multiple watersheds and jurisdictions and getting a very large-scale view of their behaviour. In addition, there are really interesting broader-impact goals including working with indigenous communities and having undergrads take on significant roles in the research.
Researchers combine CAT scans and advanced computing to fight wildfires - Andrew Myers, Stanford University
Fire researchers creating better, faster models to predict how wildfires burn - Todd Hollingshead, Brigham Young University
Pyrolysis kinetics of live and dead wildland vegetation from the Southern United States - Elham Amini, Mohammad-Saeed Safdari, David R. Weise, Thomas H. Fletcher, Journal of Analytical and Applied Pyrolysis
We’ve seen bigger and more damaging wildfires in recent years as climate change continues to make its impact felt. One of the challenges in modelling wildfires is that fire is not an on/off state - between alight and extinguished, fire can sit smouldering in wood for some time, waiting for the right conditions to start burning again.
In this work, researchers used CAT scans of smouldering wood to build a “sub-grid scale model” of how wood smoulders under a number of conditions, and used that to improve simulations of wildfires - which can then be used to make predictions of how fires will spread, and to inform better planning before fire season on evacuation zones, escape routes, etc.
Simple Anomaly Detection Using Plain SQL - Haki Benita
Because most research software developers or data analysts are extremely capable at one or more programming languages, the tendency when doing any non-trivial analysis is to pull it all into memory in arrays or data frames or what have you and perform the calculation – even when the data is sitting in a database.
But as data volumes grow, that leaves you awkwardly having to manage out-of-core computation, in addition to dealing with the costs of all that data movement. Whereas the database has analysis capabilities, with very sophisticated query optimizers, to work on the data in-place.
Benita walks through a tutorial of doing complex, windowed analysis of data in plain SQL - something any database should be able to handle - using anomaly detection as a use case. (I guess in a way this also ties into the SLO monitoring article below). Although the SQL might not be as natural as Python or R or Fortran, it has the huge advantage of analyzing the data in-place. And of course with SQL engines that support user defined functions (which is most of them) even more is possible.
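As a rough illustration of the idea, the sketch below flags a planted spike using window functions in SQLite, driven from Python’s built-in sqlite3 module. The table, column names, data, window length, and z-score threshold are all made up for the example, and the query is not taken from Benita’s article. It assumes an SQLite build with window function support (3.25 or later).

```python
import sqlite3

# Hypothetical per-minute request counts; the spike at minute 8 is planted.
counts = [100, 102, 98, 101, 99, 103, 100, 97, 400, 101, 99, 102]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE traffic (minute INTEGER PRIMARY KEY, requests REAL)")
conn.executemany("INSERT INTO traffic VALUES (?, ?)", list(enumerate(counts)))

# Compare each row with the mean and variance of the five rows before it.
# Variance is computed as E[x^2] - E[x]^2 and the test is done on squared
# values, so no square root function is needed in the database.
QUERY = """
WITH stats AS (
    SELECT
        minute,
        requests,
        AVG(requests) OVER w            AS mean,
        AVG(requests * requests) OVER w AS mean_sq,
        COUNT(*) OVER w                 AS n
    FROM traffic
    WINDOW w AS (ORDER BY minute ROWS BETWEEN 5 PRECEDING AND 1 PRECEDING)
)
SELECT minute, requests
FROM stats
WHERE n >= 3
  AND (requests - mean) * (requests - mean)
      > :z * :z * (mean_sq - mean * mean)
ORDER BY minute
"""


def find_anomalies(connection, z=3.0):
    """Return (minute, requests) rows more than z standard deviations
    away from the trailing-window mean."""
    return connection.execute(QUERY, {"z": z}).fetchall()


print(find_anomalies(conn))
```

Only the flagged rows leave the database; the raw series never has to be pulled into a data frame.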
Spying on the floating point behavior of existing, unmodified scientific applications - Adrian Colyer, The Morning Paper
Spying on the floating point behavior of existing, unmodified scientific applications - Peter Dinda, Alex Bernat, and Conor Hetland HPDC’20
Colyer’s paper summaries are something I often read, but they’re almost always about hyperscaler-style distributed systems - so I was quite surprised to see one mentioning LAMMPS, GROMACS, WRF, and other familiar names!
The paper, by Dinda et al, was presented at HPDC’20 (you can see the talk here). The paper, which Colyer as always has an excellent summary of, describes the software package FPSpy, which monitors floating point signals during the run of unmodified codes. While it’s not surprising that Inexact is signalled a lot, underflows, overflows, NaNs, and denorms still happen a fair bit, and generally in a small set of code locations. (Colyer and the authors suggest promoting those regions to arbitrary precision calculations using tools like MPFR, which I think misunderstands both the problem and the cost of the solution).
To be clear, I think the particular codes mentioned have vetted their numerics super closely, and I don’t think the results here suggest there’s any significant impact on the results. But I have always been a little surprised by how cavalier people are about ignoring IEEE 754 flags, or setting behaviour to a known good state in case the job ahead of you did configure the (say) rounding behaviour to something you weren’t expecting.
Do any simulation-running readers use tools for monitoring these events? Because this does seem like a tool that would be handy to run from time to time.
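For codes where running a tool like that isn’t an option, a much cruder check is to look at output values after the fact. The sketch below is not how FPSpy works: FPSpy watches events as they are signalled during the run, while this only classifies values that made it into the results, so it misses anything that was overwritten along the way. The function names and sample values are made up for illustration.

```python
import math
import sys
from collections import Counter


def classify(x: float) -> str:
    """Label a float as nan, inf, denormal, or normal (zero counts as normal)."""
    if math.isnan(x):
        return "nan"
    if math.isinf(x):
        return "inf"
    # sys.float_info.min is the smallest positive *normal* double, so any
    # nonzero value with a smaller magnitude is a denormal (subnormal).
    if x != 0.0 and abs(x) < sys.float_info.min:
        return "denormal"
    return "normal"


def summarize(values) -> Counter:
    """Count how many values fall into each category."""
    return Counter(classify(v) for v in values)


# Hypothetical output values; 1e308 * 10 overflows to infinity.
sample = [1.0, 0.0, float("nan"), 1e308 * 10, 5e-324, -2.5]
print(summarize(sample))
```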
Code scanning is now available! - Justin Hutchings, Github Blog
How to properly manage ssh keys for server access - Marc Päpper
Päpper covers similar ground, giving a nice motivation to the problem certificates solve with very hands-on steps covering how it’s done manually, but then adds the additional step of adding role information using “principals” in the signature.
Alerting on SLOs - Mads Hartmann
Another recurring theme in this newsletter is that while research software development takes a lot of guff, it is often much closer to industry best practices than research computing systems management is. While there’s a lot of research software out there with version control, continuous integration testing, documentation, and disciplined release management, it’s much rarer to find research computing systems with crisply defined service level objectives (SLOs). And without SLOs it’s not possible to answer even the most basic questions - “How are we doing?”
SLOs could be measures of node uptime, but they could just as easily be job time in queue, filesystem bandwidth, data availability - anything that matters to the researcher. This is a nice article on the topic, with lots of discussion and resources on service level indicators (SLIs), SLOs, SLO windows, error budgets, and burn rates, and how to alert on SLOs (not by “the SLO has been exceeded” - you alert when the burn rate is high enough that you’d likely exceed the SLO in the SLO window). If you’re thinking about implementing SLOs - and maybe especially if you haven’t before - this is a good read.
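The burn-rate idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method from Hartmann’s article: the class and function names, the 99% target, the 30-day window, and the alert threshold of 2 are all made up. The error budget is the fraction of events allowed to be bad, and the burn rate is the observed bad fraction divided by that budget, so a sustained burn rate of 1 uses up the budget exactly at the end of the window.

```python
from dataclasses import dataclass


@dataclass
class SLO:
    target: float        # e.g. 0.99: 99% of events should be good
    window_hours: float  # length of the SLO window

    @property
    def error_budget(self) -> float:
        # Fraction of events allowed to be bad over the window.
        return 1.0 - self.target


def burn_rate(slo: SLO, bad: int, total: int) -> float:
    """Observed bad fraction over a lookback period, relative to the budget."""
    if total == 0:
        return 0.0
    return (bad / total) / slo.error_budget


def hours_to_exhaustion(slo: SLO, rate: float) -> float:
    """Hours until a full budget is spent if this burn rate continues."""
    return float("inf") if rate <= 0 else slo.window_hours / rate


def should_alert(slo: SLO, bad: int, total: int, threshold: float) -> bool:
    """Alert on the burn rate, not on the SLO already being missed."""
    return burn_rate(slo, bad, total) >= threshold


# Hypothetical SLI: jobs that started within the promised queue time.
queue_slo = SLO(target=0.99, window_hours=30 * 24)
rate = burn_rate(queue_slo, bad=50, total=1000)  # last hour: 5% bad
print(round(rate, 2), round(hours_to_exhaustion(queue_slo, rate), 1))
print(should_alert(queue_slo, bad=50, total=1000, threshold=2.0))
```

In this example the last hour is burning budget about five times faster than the SLO allows, so the alert fires well before the 30-day objective is actually missed.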
Incident Reviews in High-Hazard Industries: Sense Making and Learning Under Ambiguity and Accountability - Thai Wood, Resilience Roundup
Incident Reviews in High-Hazard Industries: Sense Making and Learning Under Ambiguity and Accountability - John S. Carroll, Industrial & Environmental Crisis Quarterly (1995)
This is a recent blog post about a less recent paper, reviewing how incident reviews work in high-hazard industries like nuclear power.
Whether the environment is life-critical or just inconvenient like a research cluster going down, a common incident review failure mechanism is to focus on the fix to the presenting problem. Wood and Carroll argue that instead the focus should be learning and adapting and trying to avoid or improve responses to entire classes of problems. Wood identifies failures of incident reviews as:
oneAPI is a standard, based on Data Parallel C++ (DPC++), which in turn is based on Khronos’ SYCL, which aims for performance portability across CPUs, NVIDIA or AMD GPUs, and FPGAs. We’ve been through this a few times before and I remain skeptical of the “one programming model to rule them all” dream, but it’s interesting that Intel is putting real money behind this to fund work at Heidelberg.
Coiled: Dask for Everyone, Everywhere - Matthew Rocklin
I continue to think the PyData tool Dask is one of the more interesting experiments out there for distributed computations. As it matures and gets more capabilities, it’s getting harder to set up non-trivial Dask clusters - especially if you want to set up authentication, keep images in sync, etc. Coiled aims to make it much easier, and there’s a hosted Coiled Cloud to make it still easier (and to provide a business model).
Build automation for the post-container era - Earthly - The Earthly Team
If your team has been doing anything on hybrid clouds, this special issue may be of interest. From the call, topics of interest include, but are not limited to:
ChaosConf 2020 - 6-8 Oct, Free
Curious about ‘chaos engineering’ - deliberately introducing faults to make sure your system can gracefully handle them? (Way back in issue 5 we talked about Slack’s Disasterpiece theatre, a maybe more research-computing friendly version of same). This is a conference with several hands-on sessions for introducing chaos into everything from simple load balancing with nginx to making it a part of your CI/CD systems.
The next coming event in the SORSE programme is a twofer (you can register for the two separately). The first, by Drs Trisovic and Crosas, talks about using a number of different container formats to make research outputs of all sorts more Findable, Accessible, Interoperable, and Reusable. The second, by your humble newsletter scribe, is a 10 minute crash course on things we’ve covered in this newsletter a number of times - one-on-ones, feedback, and delegation.
High Performance Container Engines Workshop - October 8, 2020, 1:00PM - 5:00PM CEST, Free
Hosted by AWS and following up on an HPC containers workshop at ISC20, this event will focus on how containers can be used in traditional HPC environments.
After a quick refresh on Containers and Container in HPC, the current state of the art in run-times and engines will be presented. Followed by a set of hands-on workshops to get the attendees to run HPC applications on the most prominent HPC Container engines and run-times. Already set is GROMACS on SARUS and Podman, potentially Singularity.
Interesting to see the move away from regularly structured computations like simulation or even deep learning towards general data analysis having an impact on even pretty fundamental bits of infrastructure like memory allocation primitives in NVIDIA GPUs.
Fortran in the browser with WebAssembly, which will allow decently fast linear algebra and other number crunching in a web client - could be lots of applications for speeding up visualizations.
Short functions are definitely not necessarily less buggy.
Make sure your shell sessions always use tmux with tmux_chooser.
A video recording of a talk from 2019 on various sustainability models - or even just sustainability definitions - for open source projects.
Interesting to see bog-standard HPC techniques - keep an eye on memory accesses and copies, improve communications patterns - being applied very skillfully by the deep learning community. See, e.g., “Microsoft’s updated DeepSpeed can train trillion-parameter AI models with fewer GPUs”, source code here.