#155 - 11 Feb 2023

Strat plans are like scientific papers (not in a good way); Flexible work arrangement survey; Talking after mass shooting; Write SOPs; Minimize WIP; Organizing inclusive events; HPC Security draft


As you can see, I’m slowly getting back onto the regular newsletter schedule after coming back from my belated holidays trip.

Thanks for you comments on last week’s newsletter about a refactoring. I had several people say that they like having things in one place, although I worry that there’s a bit of a survivor bias to that.

Still, it should be taken seriously - so I’ll tweak my plans. Let me tell you what I’ll be trying over the next couple of months.

For the past year I’ve been hearing from people who aren’t RCD managers, but are running other teams in academia, or who are managers from academia but now working outside, in spinoffs or in company’s data science or AI or biotech teams. I’ve been a bit surprised to hear from them, but the problems they wrestle with (especially around the basics of management or thinking strategically) are similar.

So for that time I’ve been trying to help them, playing up the basic management content (like the just-finished meeting series) and playing down some of the systems or software or strategic plans content for instance.

I think that’s made RCT a little worse for its key audience, to be honest. The alternative is to make it even longer, which probably also would be not great (especially as the focus would still be split).

So my intention was to have a separate management-for-people-from-the-research-world newsletter, and have RCT switch back to being very much RCD/research technology professional focussed. RCT would have somewhat more software and systems and data management content, along with higher-level management issues like strategic planning and running a professional services team. Email #2 would have the management-for-researchers basics content.

Having heard the feedback, I may not make as big a transition as I was initially thinking; I’d include at least some of the management basics here, although my own writing at the beginning probably wouldn’t focus on management basics. When I do write something like that, they’d appear first on the other email, and show up here as blurbs in the link roundup.

Anyway, that’s my current thoughts. I’d love to hear your feedback - let me know what you think! Hit reply or email me at jonathan@researchcomputingteams.org.

Let’s talk super-briefly about RCD strategic planning. I’ve said before (#130) that I don’t think collections of strategic plans are a useful way for groups to develop their own strategies or strategic plans. I’d like to go into a little more detail why here.

Strategic Plans are Like Scientific Papers (And Not in A Good Way)

When talking about graduate-level science education, we often raise the point that reading scientific papers is absolutely not training in how to do scientific projects yourself. Scientific papers are polished final products. They describe a finished set of actions, and conclusions. They provide little or no guidance as to why the researchers chose that problem to focus on, how they made decisions about proceeding and next steps, and where they came up with their ideas about next steps. Indeed, the whole genre has incentives to make the process being described look inevitable: that taking step B after taking step A was in fact the only reasonable choice. Knowing how to tackle a scientific project, and why to choose one project over another, which steps to try after the first, requires mentorship, practice, and having other experienced hands to teach how to find answers to these questions.

Someone else’s strategic plans are the same thing. If anything, they’re a smidge worse. Other team’s strategic plans are like code without comments - we can read it, sure, but it will reveal merely what the code does. We have to supply our own guesses about why it does it, and why it does it that way.

Making your own strategic plan that advances your own team is all about deciding on the why’s and how’s, and deeply understanding the local context which provides the constraints, incentives, and goals. We can’t transplant bits and pieces from other teams’ strategic plans and hope they take root in our context. Even if it “worked”, by definition it wouldn’t add up a to a focussed, coherent solution to the problems your environment has.

Luckily there are parts of our RCD community that understand this and are taking action. I was really heartened to be on a recent CaRCC Strategy and Policy Facing Track call, chaired by Patrick Schmitz & Melissa Cragin, which dug into the local factors that impact RCD strategy. They’re starting the process of discussing these constraints and choices, talking about them in terms of more complete case studies - these are the local factors, and so we’re trying this, because that - that are useful ways to learn and transfer knowledge. Case studies are how business schools, law schools, and even some med schools and social sciences teach complex decision making in ambiguous situations; we don’t get much of it in STEM education. Efforts like this are really encouraging to see!

Resources I like

Strongly recommended books for our line of work:

  • The Nonprofit Strategy Revolution,” La Piana & Campos - our teams operate like nonprofits in that they are mission-driven and can’t just “pivot” to another mission. This book lays out a very clear process for non-profit strategic planning that maps well to our needs.
  • Good Strategy / Bad Strategy”, Richard Rumelt - mainly just to make sure you know what bad and pointless strategy looks like, and the problems good strategy solves. Warning; once you read this you’ll see bad and pointless “strategy” everywhere in research.

With that, on to the roundup!

Managing Teams

Q1 2023 Flex Report - Scoop

At this point a lot of teams have settled into a relatively stable set of working conditions. It’s interesting to compare against organizations broadly.

Scoop (a company that sells software to hybrid-working teams for signing in to come in on certain days) ran a survey of US companies to give an overview of what “hybrid” is turning out in practice to mean.

Some highlights:

  • A lot of the organizations that require full time on-site are either very large or those whose work is inherently on-site (e.g. hospitality, manufacturing)
  • A bare majority have some degree of flexibility
  • … the largest chunk of which (46% of those that have flexibility) have more or less maximal flexibility - it’s fully the employee’s choice when/if to come in
  • Other popular choices, in order of frequency: a minimum number of days per week (2 or 3, typically), fully remote (no office at all), followed by specific days per week (typically mid-week)

My impression is that most of our teams have settled into either having specific in-person team days or being fully flexible; does that mesh with your experience? What has your team chosen, and why? Has anyone completely given up their office space? I’d love to hear - just hit reply or email me at jonathan@researchcomputingteams.org.

The Remote Work Differences No One Talks About - Marissa Goldberg

Goldberg makes some points about the key differences with remote work. They’re points that I think most of us have kind of figured out, but I haven’t seen them expressed so concisely before; this is a nice formulation.

The big differences, Goldberg writes, are about

  • Accountability (between the team member and the manager and/or other stakeholders)
  • Judgement (of the team members)
  • Autonomy (of the team members)

Remote work necessarily increases the autonomy of the team member, and that means the bar for judgement goes up (the team member has to have a more discerning take into what matters, and how to do things) as does the bar for accountability in how team members are held accountable.

In our line of work, in the world of research, we’ve typically granted lots of autonomy and required fairly good judgement (at least about the content of the work, if not the process). Where managers I talk to typically struggle somewhat is on accountability topics (and, maybe, hiring for judgement without requiring a Ph.D.).

Is that consistent with your experience as your team went remote? Anything you’d add or emphasize?

How to Be an Ally to Colleagues After Violence Against Their Community - Melanie Prengler et al, HBR

The number of mega-threats like mass shooting continue to rise, and these events a make members of the targeted groups feel less safe even if they’re nowhere near where the event happened. Indeed, that’s partly their purpose.

Raising the topic can feel very sensitive and high-risk if (like me) you aren’t a member of a group that faces these kinds of threats. Prengler et al try to change this by suggesting small concrete things you can do:

  • Educate yourself about the issues, attend community events, etc., while keeping in mind the limits of your own personal experience
  • Consistently share with colleagues who face bias that they are valued
  • Mega-threats affect people well outside of the location where it happened
  • Don’t pretend nothing’s happened
  • Ask questions like “I’ve been thinking of you in light of what’s happened - how are you doing today?”
  • Ask what concrete things you can do to help - “What’s something I could take off your plate? How could I make this week easier for you?”
  • Attend community events

Project and Technical Leadership

Write standard operating procedures - Ben Cotton

We’ve talked at length about the benefits of runbooks (#30, #77) or SOPs (#116, #121) for teams. Cotton, program manager for Fedora, adds to that by suggesting SOPs for yourself:

Successful program management is the absence of surprises. Don’t interject your own surprises by being inconsistent in how you handle routine tasks.

He suggests a simple format:

  • Description,
  • Trigger, (when the SOP kicks in), and
  • Procedures - the step by step

He suggests not having the “why” of the SOP in this document; I’m not sure I agree with that. Having the purpose in mind, along with constraints it was developed under, makes it much easier to see when it can or should be changed.

Fixing “Too much WIP” - Jason Yip

At some level, we all know that switching between tasks is expensive, and that it’s better to get fewer things done quickly than more things done eventually. But, as Yip points out, too much work in progress / tasks in flight / etc. happens all the time: in his estimation, typically because

  • Lack of strategy - not enough focus on what matters (“incoherent problem-solving”), or
  • Overemphasis on utilization (“make sure there’s always something to for everyone to do”).

The reality is, there’s always more to do, and only some things are important to do right now.

Yip’s possible WIP countermeasures:

  • Have a clear strategy
  • Pay attention to throughput of tasks
  • Explicitly limit WIP: by items per time box, or items total, or queue size of items at each stage.

Product Management and Working with Research Communities

Guide to Organizing Inclusive Scientific Meetings (Updated) - Emily Jack-Scott et al, 500 Women Scientists.org

The 500 women scientists organization released their meetings guide for organizing inclusive scientific meetings. It’s not long, but it’s dense, and very actionable - full of very specific steps. As such it’s hard to summarize, other than to say like anything else important, it’s something that requires attention throughout the process, not something that one just bolts on somehwere.

There are steps here for planning the meeting, during the meeting, and afterwards assessing the meeting, with particular recommendations for in-person, hybrid, and purely virtual events.

Research Software Development

Lessons Learned Transitioning from Experimental to Computational Science - Jared O’Neal

I don’t think the connection between experimental and computational science gets made often enough. Here O’Neal, who comes from the world of experimental condensed matter physics and then astronomical observation, considers the last five or so years of his work in the world of computational science.

O’Neal viewed with some puzzlement how code quality management, documentation and testing were viewed as being outside the “real” work of getting science done, and how that was something for software engineers to do rather than the scientists:

While I sense that this work is seen as necessary, I also sense that some in the community consider such work to be outside of or less important than the scientific work of interest. However, for me these practices are not software engineering, but rather an important and necessary part of working with a scientific instrument. In this sense they are more precisely classified as a subset of low-level, foundational scientific best practices adapted from software engineering. I believe that researchers working on the high-level science need to be in control of and take ownership of the low-level science, as only they know what is needed at a low level to ensure that the high-level science is executed rigorously. I view scientific software as a scientific instrument.

He continues the analogy with codes as instruments (an analogy I agree with!) and talks about analogues to lab notebooks, lab environments, and more.

Much of computational science culture still comes from simulation, which grew out of the world of theory. As more of computational science becomes data-intensive, I hope that systems and software development work will be more strongly influenced by experimental science culture.

The code quality pyramid - Fabian Zeindl

Zeindl makes an argument I haven’t seen elsewhere - that to focus on code quality, one needs to focus one’s efforts initially not with coding standards and the like, but instead with fundamentals like build performance and test performance. The idea is that if building and testing takes forever, it will be unnecessarily hard to make the incremental changes necessary over time that contribute to code quality.

Zeindl’s pyramid categories, in inverse order (most fundamental on the top), are:

  • Build perforance
  • Test performance
  • Testability
  • Component code quality
  • Features
  • Overall performance

Zeindl’s Code Quality Pyramid, with Build Performance at the bottom, supporting the higher layers - test performance, testability, component code quality, features, and finally overall performance

Research Data Management and Analysis

Big Data is Dead - Jordan Tigani, MotherDuck Blog

Well, maybe not dead, but it’s clear that “everything is going to be big data” never came to pass.

Tigani argues that one of the reasons that embedded databases like SQLite and DuckDB (which MotherDuck commercializes) are having a resurgence is that a lot of the need for and technological drivers of “big data” kind of tapered off. Tigani points to two important reasons:

  • Lots of important data sets never got super huge, while
  • Individual compute nodes did get huge (in core count, RAM size, and local storage) so that distributed systems like Hadoop became less necessary

Along that, he adds two points about the broader ecosystem:

  • Important efforts like GDPR and CCPA made some kinds of data a liability
  • Even with unlimited data, most data in a data set is hardly ever touched - it’s uninteresting or old and superceded or…

Research is a bit of an outlier, of course. Many of our datasets do get huge, but they’re at least fairly structured which simplifies things. For the others, fat nodes plus new tools mean we can often process them locally in a much simpler way than relying on Hadoop or firing up Spark.

Research Computing Systems

High-Performance Computing (HPC) Security: Architecture, Threat Analysis, and Security Posture - Guo et al., NIST SP 223 (Draft)

NIST has sent out for comment on a draft document about HPC security. The ambitions for this document are modest - it’s basically to try to get everyone on the same page for a mental model and common language for discussing security needs for HPC cluster operators.

Draft SP 800-223 provides guidance on standardizing and facilitating the sharing of HPC security postures by introducing a zone-based HPC system reference model that captures common features of HPC systems and serves as a foundation for a system lexicon. The draft also discusses HPC system threat analysis, security postures, challenges, and recommendations.

Reading the document, it’s pretty clear that the goal of the document is as much to explain HPC clusters to security professionals as to explain security basics to HPC clusters.

The document isn’t very prescriptive - it does sketch out a simple model for dividing HPC clusters into different security domains, then describing the common threads against each and making broad recommendations for security posture for each zone.

NIST SP 800-223 (draft)’s Figure 1: an HPC centre divided into 4 zones, the external-facing access zone, the management zone (containing scheduler/resource managers, front end/provisioning servers, authN/Z, and other cluster services), the computing zone (compute nodes) and the storage zone (nodes and disk arrays).

Also, can I just say that I think the section about containers misses the key point?

Security is a major concern in deploying containers in HPC environments. Containers posses large attack surfaces due to the different underlying images, each of which can have vulnerabilities. In addition, securing the host is not enough to ensure protection. Container permissions and proper isolation are also necessary. Finally, monitoring containers can be difficult due to its dynamic nature.

Yes, containers have untrusted code which can have vulnerabilities, but the whole purpose of most academic HPC clusters is that they’re platforms for large-scale arbitrary remote code execution as a service. Any of the non-containerized code could have vulnerabilities, too. Similarly with the monitoring issue.

Instead, the issue with containers in HPC clusters is simpler than that. HPC clusters are almost invariably run on trusted bare metal with a very mature but trusting security model, where the user id of a process running on a node is enough to establish authentication and make authorization decisions for storage and other cluster services. It’s easy to understand but brittle, and if you can fool a node into running your process in another user id, you can cause havoc.

Containers come from a world with a newer but less trusting cloud security model, where no one much cares or even necessarily knows what a process’s user id is on its untrusted host. Rather, authentication is performed by exchanging secrets (keys, or certificate signing). You have to have the secret to be able to authenticate into the service, and then the service makes the authorization decision. This is also easy to understand, and outsources some of the security to encrypted communication channels. If you steal a secret that was exchanged insecurely, you can use it and cause havoc.

The issue is that container execution models make it very hard but not impossible to “break out” and run as a different user, because the environments containers were built for don’t care about that. And HPC systems typically don’t have secure ways to exchange secrets between compute nodes or cluster management services, because they historically don’t care about that. So combining the two causes all sorts of fun.

Neither model is necessarily more or less secure in general (although they do differ in how robust they are to specific security threats - I’d argue the POSIX user-id-on-trusted-bare-metal model is more fragile for things we increasingly care about these days, such as dealing with sensitive data and supporting web services). It’s the mismatch between the two that causes security concerns, not the nature of containers or HPC clusters themselves. And resolving the mismatch will take work on both sides of the equation.


An enterprise-grade, 1U blade system for… Raspberry Pis.

GCC Cobol is getting closer to reality, passing the Cobol-85 compliance test. (A benefit of a small language - the Cobol specific pieces are only 50k lines of code…). Although it maybe still has a way to go - there is a COBOL 2014.

Generative AI models as lossy compression algorithms, and as prose-based programming languages that favour people with humanities backgrounds.

The argument that micromanaging is better than being disengaged. I’d add that for managers like us I see under-management occurring way more often than micromanagement.

Trying ChatGPT and other AI tools to write tests for and help with initial refactoring of legacy code. It can’t just do the work for you, but it can help.

Want To Practice Your Management Skills? Try Being a Dungeon Master.

The structure of code is more important to its meaning than the structure of, e.g., web pages. How GitHub’s new code search works.

That’s it…

And that’s it for another week. Let me know what you thought, or if you have anything you’d like to share about the newsletter or management. Just email me or reply to this newsletter if you get it in your inbox.

Have a great weekend, and good luck in the coming week with your research computing team,


About This Newsletter

Research computing - the intertwined streams of software development, systems, data management and analysis - is much more than technology. It’s teams, it’s communities, it’s product management - it’s people. It’s also one of the most important ways we can be supporting science, scholarship, and R&D today.

So research computing teams are too important to research to be managed poorly. But no one teaches us how to be effective managers and leaders in academia. We have an advantage, though - working in research collaborations have taught us the advanced management skills, but not the basics.

This newsletter focusses on providing new and experienced research computing and data managers the tools they need to be good managers without the stress, and to help their teams achieve great results and grow their careers.

Jobs Leading Research Computing Teams

This week’s new-listing highlights are below; the full listing of 169 jobs is, as ever, available on the job board.