#129 - 9 Jul 2022

Making Change Happen; RCD Strategic Planning Challenges; Large batches hide bottlenecks; Arm memory model case studies; Maybe your DL model needs a database; Graphs for Code Search; HPC Operational Data Analytics

Hi!

I write a lot here about challenges our teams face, but I don’t think I mention often enough what a huge advantage our background in science is in our work (or in similar work in other environments).

Whether we were trained in science or just picked up that mindset by immersion in the ecosystem, the skills we have are skills that (say) tech companies pay real money to train their staff in. The advanced collaboration and influencing skills we have to develop in scientific collaborations are a huge advantage all the way up the management or technical leadership ranks, for instance. Being willing to think in terms of experiments when making changes (as in Beinbold’s article, below), or having a “growth mindset” (current jargon for believing that people can develop new skills(!!)), or being able to synthesize information - these fundamental skills, skills we don’t even bother mentioning in our environment, are in high demand.

It’s not just people skills, either. This week I saw an article about staff engineer interview skills that included being able to estimate well, something every quantitative scientist does without thinking about it, and another about how tech companies are having to relearn basic performance techniques like array-of-structs vs struct-of-arrays, which many of us had to learn optimizing scientific code.
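If that array-of-structs vs struct-of-arrays distinction is new to you, here’s a minimal sketch in Python with numpy - the particle-like fields are just an invented example, and the effect is the same (and usually larger) in compiled code:

```python
import numpy as np

n = 1_000_000

# Array-of-structs: one record per item, fields interleaved in memory.
aos = np.zeros(n, dtype=[("x", "f8"), ("y", "f8"), ("z", "f8"), ("mass", "f8")])

# Struct-of-arrays: one contiguous array per field.
soa = {f: np.zeros(n) for f in ("x", "y", "z", "mass")}

# A field-wise update touching only x: in the AoS layout, every cache line
# fetched also drags in y, z, and mass (32-byte stride between x values);
# in the SoA layout, the reads and writes are dense and contiguous.
aos["x"] += 0.1
soa["x"] += 0.1
```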

These skills are in real demand everywhere, because they are valuable.

Within research computing and data teams, I think we don’t take full advantage of those skills, because we take them for granted. We’re really good at turning our scientific mindset onto the actual science we’re supporting, but less good at turning it onto how we work. Experimenting with new approaches, documenting processes as protocols, collecting data about how well our team is working - our scientific skills are really powerful tools that can be used in a lot of contexts.

Anyway, I’ll write more about that later - for now, on to the roundup!

Managing Teams

Seven Skills to Save Software - Making Change Happen in Complex Software Environments - Matthew Beinbold

This article is in the context of leading a massive software rearchitecting effort - that’s the ‘complex software environment’ and ‘save software’ bits. But Beinbold describes very clearly what it takes to make change happen in any complex group.

Skip down to about a third of the way through, to the “The seven skills” slide, and go through the rest. This is an extremely lucid overview of what it takes to make change in an organization. The bits about getting people on board - having new identities for participants, having onramps for people to get involved at all levels of commitment, having ways for people to publicly identify with the change - are great.

This scale of engagement is important when trying to make change happen in a large organization, but even in smaller teams - even just your own small team - these are the steps to go through. With a small team a lot of these things can be done collaboratively, especially coming up with a clear vision.

Beinbold’s steps are:

  • Crafting a compelling vision
    • Start at the end - what are you celebrating?
    • Recast obstacles as outcomes, not absences (i.e. as positives, not negatives)
    • The vision must include an articulation of why the current state is not sustainable, an attractive plausible future, a new identity for participants, and a path connecting it together
  • Coalition building
    • Create onramps for people to get involved
    • Start organizing efforts
  • Communication
    • Evoke the reasons for change they already have
    • End with a call to action (and provide schwag!)
  • Work via experiments
    • Small experiments == less resistance
    • Short term wins
  • Celebrate the wins
  • Empower others
  • Terraform the culture - new practices need deep roots

Towards a National Best Practices Resource for Research Computing and Data Strategic Planning - Schmitz, Brunson, Mizumoto, Jennewein, Strachan
Why Most Strategies Lack Clarity - John Cutler

The first link is a summary of a workshop at PEARC21 about strategic planning for RCD work, and a survey of challenges. It’s a great distillation of why strategic planning is so challenging and confusing in our line of work. It ends with a wish list of things that would help:

  • A repository of templates, examples, and models of strategic planning
  • A collection of narratives and use-cases that describe successful programs
  • Examples and practices for communication strategies related to strategic planning
  • A program of mentoring and identifying expertise related to strategic planning

Another request, from a participant quoted elsewhere in the document, is for “common strategies”, which I think reflects some of the confusion about strategy in our line of work. The nature of strategies is that they’re very customized to a particular situation. Strategy examples and models, yes, for sure. Case studies are an absolutely classic way to learn strategy by example. And a list of common tactics is a perfectly sensible thing to want; tactics are applicable in many contexts.

Honestly, I think the terms “strategy” and “tactics” cause more confusion than they’re worth. We’re not on a battlefield. There are some outcomes you want; there are problems in the current context preventing you from reaching those outcomes; there’s a proposed way of solving a problem and achieving an outcome in that immediate context; and there’s a plan to implement the solution. That’s all there is. Using buzzy, ill-defined words to describe those things clouds thinking and leads to people talking past each other.

Also listed are some challenges in executing operationally, I guess as inputs into what’s feasible when planning:

  • Funding
  • Rapid changes in compute needs
  • Vast storage needs
  • Difficulty finding, hiring, and retaining staff

An issue that comes through in several places is that there’s often not a very good overall direction set at the institutional level, or even at the research level, so it’s hard to see what to plug into. Also, research computing and data often crosses several organizational boundaries - the VPR; the Libraries and central IT, which typically report up into something like a VP Operations; and often various schools with their own research computing efforts. At best, those individual service providers have clearly defined directions and focuses which collectively leave gaps or overlaps in the overall effort; in the more typical case, even those individual efforts are fuzzy about what they do. Either way, it takes some combination of bottom-up and top-down work to put together something coherent across an institution.

In the second piece, Cutler gives his opinion as to why there’s a lack of clarity and coherence in so many strategies. His argument is that people think strategies have to provide certainty, and everything is uncertain, so people are unwilling to commit to a strategy. I definitely think that’s part of it! I think another part is that any meaningful strategy - really, any meaningful decision about the future at all - means explicitly closing the door on other perfectly good options, and people are incredibly uncomfortable doing that.


Technical Leadership

Large batches obscure your bottlenecks - Jonathan Hall

Here’s a good technical analogy for technical leaders: batching is often something we do to hide latency. If we work in large batches - lots of work in progress, for instance - it’s hard to see where the bottlenecks, the latency issues, are in a team’s work.

The easy way to find those bottlenecks - the targets for improvement - is just to reduce batch sizes, so you can see more clearly where in a team’s processes the work piles up.
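As a toy illustration of Hall’s point - the pipeline, the stage names, and the timings below are all invented - consider a three-stage process where the middle stage is the bottleneck:

```python
# Toy batch-transfer pipeline: a batch only moves on to the next stage
# once the whole batch has cleared the current one.
STAGES = [("review", 1.0), ("testing", 5.0), ("deploy", 1.5)]

def simulate(n_items, batch_size):
    stage_free = [0.0] * len(STAGES)   # when each stage is next free
    completions = []
    for _ in range(n_items // batch_size):
        t = 0.0                        # all work is ready at t = 0
        for i, (_name, per_item) in enumerate(STAGES):
            t = max(t, stage_free[i]) + per_item * batch_size
            stage_free[i] = t
        completions.append(t)
    return completions

# One big batch: a single completion at t = 150, with no hint from the
# outside that testing accounted for 100 of those 150 time units.
print(simulate(20, 20))
# Single-piece flow: after a short ramp-up, a finished item emerges every
# 5 time units - exactly the pace of the testing stage, the bottleneck.
print(simulate(20, 1))
```

With the big batch, all you can observe is one long wait; with a batch size of one, the completion cadence immediately tells you which stage to go fix.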


Managing Your Own Career

Advice for Engineering Managers who Want to Climb the Ladder - Charity Majors

A good article by Majors on how being a director (in title or in role - a manager of managers) is very different from being a manager. It involves a lot of the same activities, but the function is very different - the broader scope means much more emphasis on working with peers across the organization, with stakeholders, and with upper levels.


Here’s an announcement on Twitter by Dr. Anna Krystalli that she’s started her own freelance R RSE practice. This is fantastic, and I’d like to see more of it. There are advantages to working embedded in an institution, but disadvantages too. After two years of mostly working remotely, researchers have gotten very used to working with research computing and data staff remotely, and there’s no reason (other than slightly easier billing) that such services can’t be offered from outside the institution.


Product Management and Working with Research Communities

Quantum games and interactive tools for quantum technologies outreach and education - Seskir et al., SPIE Optical Engineering
How Effective Is Quantum Education Really? - Kenna Hughes-Castleberry, The Quantum Insider

The amount of educational and training material available for quantum technologies is absolutely remarkable. A lot of that is due to the recent explosion of interest in quantum computing, but quantum materials science and related fields mean a good deal of material predates it. There’s even an EU-funded effort, Quantum Technologies Education for Everyone (QuTE4E). Seskir et al., as part of that effort, investigate the state of a number of games and other tools for outreach and education, and Hughes-Castleberry summarizes. This might be of interest for people looking for community outreach tools for quantum efforts at their own institutions, or for surveying ideas for outreach and education tools in other fields.


Research Software Development

Synchronization Overview and Case Study on Arm Architecture - Ker Liu, Zaiping Bie, Arm

Arm has a different memory model under concurrency than x86 does (there’s a nice overview by Russ Cox discussed in #81), and a lot of software everywhere very much has an “all the world’s an x86” assumption baked in. The white paper described by this blog post gives an overview of Arm’s memory model and synchronization mechanisms, then walks through three case studies (OpenJDK, DPDK, and MySQL) of diagnosing and fixing problems caused by those broken assumptions.


Using Graphs to Search for Code - Nick Gregory

Code static analysis tools and libraries have become so available and powerful that it’s possible to write custom tools to look for particular code idiosyncrasies without too much effort. Here Gregory uses semgrep to search for one pattern across 11,659 Go GitHub repos just by writing a single rule, and then writes a (much faster, but custom) Go program to do the same.
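Gregory works with semgrep and Go; just to illustrate how low the barrier to this kind of one-off structural search has gotten, here’s the same sort of thing over Python files using nothing but the standard library’s ast module (the pattern searched for - bare except clauses - is my own arbitrary example):

```python
import ast
import sys

# A one-off structural code search, in the spirit of a semgrep rule:
# find bare 'except:' clauses, which silently swallow every exception.
def bare_excepts(path):
    with open(path) as f:
        tree = ast.parse(f.read(), filename=path)
    for node in ast.walk(tree):
        # ExceptHandler.type is None exactly when the except is bare
        if isinstance(node, ast.ExceptHandler) and node.type is None:
            yield node.lineno

for path in sys.argv[1:]:
    for lineno in bare_excepts(path):
        print(f"{path}:{lineno}: bare 'except:' clause")
```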


Research Data Management and Analysis

What the F(ederation): Synthesizing the many recent data strategies, white papers, and reviews - Jessica Morley

Morley summarizes and synthesizes a series of nine strategy documents from the past three-ish months(!!) for, or relevant to, the UK National Health Service’s (NHS) plans for data in healthcare. Across the areas of Direct Care, Managing Population Health, Planning NHS Services, and Research, Morley distills key points under the topics of Platforms; Privacy and Security; Information Governance; Ethics, Participation, and Trust; and Workforce and Ways of Working.

If you’re interested in considerations around health research data and health research data policy, this summary (and links therein) is a great resource for understanding how a national government that is probably furthest ahead (along with Australia) in thinking about these topics is coming to understand the needs, opportunities, and challenges.


RETRO Is Blazingly Fast - Mitchell A. Gordon

Gordon writes about a paper a team from Google’s DeepMind published at the end of 2021 describing their RETRO model. It’s a large language model training and inference process that uses a database (an index into a dataset, plus a way of querying it) rather than just flat data, pulling relevant chunks out at both the training and inference stages. It greatly speeds training and allows for a much smaller model (albeit one with an associated database).
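To make the “database” part concrete, here’s a hedged sketch of the retrieval step in numpy - brute force instead of the scalable approximate nearest-neighbour index a real system needs, and with a made-up stand-in for the embedding model:

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(chunks):
    # Stand-in for a frozen pretrained embedding model.
    return rng.normal(size=(len(chunks), 128)).astype(np.float32)

# "Database": embeddings of every chunk in the corpus, normalized so a
# dot product gives cosine similarity.
corpus = [f"chunk {i} of the training corpus" for i in range(10_000)]
keys = embed(corpus)
keys /= np.linalg.norm(keys, axis=1, keepdims=True)

def retrieve(query_vec, k=4):
    q = query_vec / np.linalg.norm(query_vec)
    scores = keys @ q                        # similarity to every chunk
    nearest = np.argsort(scores)[-k:][::-1]  # indices of the k best
    return [corpus[i] for i in nearest]

# At both training and inference time, each input chunk pulls in its k
# nearest neighbours from the database for the model to condition on.
print(retrieve(embed(["some input text chunk"])[0]))
```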


Docker for Data Science - Alex K Gold

A chapter of a larger online “DevOps for Data Science” book, aimed specifically at creating and using reproducible R and Python containers, both for day-to-day data science and for publishing reproducible workflows.


Research Computing Systems

Operational Data Analytics in Practice: Experiences from Design to Deployment in Production HPC Environments - Alessio Netti, Michael Ott, Carla Guillen, Daniele Tafani, Martin Schulz

This is a cool paper by a team from Leibniz, Jülich, and Fujitsu describing near-real-time analytics of HPC clusters’ operational data, for visualization (ODAV) or real-time control (ODAC). At Leibniz, on SuperMUC-NG, they monitor CPU performance, memory activity, and other such metrics (27 in total) through an analysis pipeline, and store the data in a database for near-real-time visualization of jobs as well as for digging back through job history. (Monitoring these metrics resulted in overhead on the compute nodes of well below 1%.)

On Jülich’s hybrid DEEP-EST system, with a similar architecture, they go one step further: the ODAC pipeline is used to perform predictive optimization of the system’s warm-water cooling infrastructure, and has been in active use since October 2020.


The practical guide to incident management - Incident.io

The Incident.io team has a pretty comprehensive guidebook to incident response. Even if most of our teams don’t do on-call, the material on responding to incidents and on improving based on what’s learned is very relevant, as are the playbooks and reading lists.


Random

Measuring memory usage in Python: it’s tricky! Multiprocessing also has its quirks. Finally, CPython 3.11 is 10-60% (average: 25%) faster than Python 3.10, which is good news all around.

NASA Computers in Spaceflight, 1958-1987.

Using GPT-3 to explain code.

A JIT-ting tool for parallelizing shell scripts(!?) [PDF link]

A very sensible approach to monitoring tiny web services in a minimal way.

An utterly deranged approach to implementing a new programming language, by bootstrapping it through seven layers of abstraction using only assembly language and the layers beneath.

Implementing an assembler entirely in C++ templates.

Simple HyperCard-inspired web programming with Anaconda’s PyScript, for those new to coding.

Finally, we can do deep learning inference on our Commodore 64s, with TensorFlow Lite for Commodore 64.

You…. you don’t have a Commodore 64 lying around? Ah, you must have been a TRS-80 sort like me. That’s cool, we’ve got you covered. A C64 emulator for the Raspberry Pi.

There’s a new ghostscript PDF interpreter, which is how I found out that ghostscript is still a thing. (Quick, ask me how much I miss having a Linux workstation as my daily driver, if I can hear you over the sound of my laughter).

The history of Coherent, a cheap ($100) commercial UNIX clone for x86 from the early 90s.

TIL that SQLite will give index recommendations.

Style guide for markdown lists, with a markdownlint configuration, to make the lists and version control diffs as readable as possible.

A simple if hacky approach to having a multi-user AWS ParallelCluster by creating POSIX users as part of the node initialization.

An RSS tool to generate RSS feeds from pages that lack them, RSS Please. As a newsletter author, may I say — please, please, please, put RSS feeds on your websites.


That’s it…

And that’s it for another week. Let me know what you thought, or if you have anything you’d like to share about the newsletter or management. Just email me or reply to this newsletter if you get it in your inbox.

Have a great weekend, and good luck in the coming week with your research computing team,

Jonathan

About This Newsletter

Research computing - the intertwined streams of software development, systems, data management and analysis - is much more than technology. It’s teams, it’s communities, it’s product management - it’s people. It’s also one of the most important ways we can be supporting science, scholarship, and R&D today.

So research computing teams are too important to research to be managed poorly. But no one teaches us how to be effective managers and leaders in academia. We have an advantage, though - working in research collaborations has taught us advanced management skills, but not the basics.

This newsletter focusses on providing new and experienced research computing and data managers the tools they need to be good managers without the stress, and to help their teams achieve great results and grow their careers.