#106 - 16 Jan 2022

Priorities and strategy mean saying no; Make yourself obsolete; Decision making; Writing more good; lab notebooks for debugging; code review is feedback; SUSE and CentOS; HPC vs DL GPUs; SSH Bastions

Hi!

As you can see from this issue again being late (but less late!), here at RCT HQ I’m still getting into the rhythm of the new year. I hope you and your team are already firing on all cylinders.

There are several articles in this issue on decision making, priorities, and strategy. I keep coming back to these topics because they’re important.

For research computing teams to succeed, we need priorities and a strategy. Without those, we can’t even tell if we’re succeeding, or on our way towards success.

Everyone agrees with that, but the next part gets uncomfortable. A strategy that doesn’t clearly show which perfectly good opportunities the team would say “no” to isn’t a strategy, no matter how many slide decks it appears on; it’s a motivational poster. A “priority list” which doesn’t explicitly deprioritize some opportunities, projects, clients, and activities is — by definition! — something other than a priority list.

Unfortunately, I see this quite a bit in academic research computing teams, especially around software and systems. New research software development teams, often one of several across campus, typically scramble to get funds in the short term, taking a “catch as catch can” approach to projects. A team led by a new leader will often have a “strategy” of something like “enabling research with high-quality software development” and “connecting domain experts in software development with domain experts in research”, which distinguishes them from literally no research software development team anywhere in the world, much less on campus. (Where are the stand-alone “enabling research with low-to-moderate quality software development” teams?) That vagueness makes it impossible for the team to clearly articulate strengths compared to other teams across campus, attract clients who need those strengths, and build on them. But doing that is uncomfortable, because it means admitting, at least implicitly, that the team has weaknesses - and thus clients it probably can’t (and shouldn’t) help effectively. That’s hard for a new leader, and even for some pretty seasoned and senior ones.

When I see research systems teams fumble the clear articulation of priorities and strategy, it’s usually a little different - a stubborn refusal to admit aloud that they are implicitly prioritizing certain kinds of workflows (and thus clients) over others. Typically it starts with queuing and allocation policies put in place when the systems team was first stood up, aimed at supporting its first big user groups. Then as time goes on, those policies somehow become “decisions made for technical reasons” and “the way things are done here” rather than what they actually were: explicit prioritization choices. At that point, saying aloud “we specialize in supporting users [X]” is scary, even though it’s manifestly true - and would be the first step in deciding whether continuing to specialize in supporting users [X] is in fact the right approach.

Prioritization and strategy, when done in a way that isn’t meaningless, are scary. Unfortunately, prioritization, strategy, and decision making are the core responsibilities of managers and leaders. Every decision for something is a decision against something else. Every high-priority area explicitly deprioritizes all other possible areas. A strategy necessarily implies a long list of paths not taken. If your priorities and strategies aren’t choices that could be wrong, they aren’t priorities or strategies at all.

With that, on to the roundup!

Managing Teams

How to Write a Strategy Statement Your Team Will Actually Remember - Michael Porter, Nobl Academy
Saying “no” - Mike McQuaid

Porter’s article highlights an idea that’s come up a few times in the roundup - a very clear way to write out priorities or strategy is to contrast two things, both positive, and explicitly say that your strategy values one over the other. It’s too easy to write out “motherhood and apple pie” strategies: “we value moving fast and solid code”. But those statements are fluff that don’t mean anything; you might as well be announcing that water is wet. No one disagrees that those are good things with value. Hard decisions come when good things are in conflict. When writing a new chunk of solid code will require moving slowly, which one wins? Those choices are your priorities, the outlines of your strategy.

And once those choices are made, you have to start saying no to things that go against those choices. Last January (#56, #57, #58) we had a series on stopping doing things; the first step is saying “no”.

McQuaid talks about saying “no” as “frontloading disappointment”, and - maybe surprisingly - as a way of building and maintaining trust. He also covers Brené Brown’s BRAVING framework for trust (Boundaries, Reliability, Accountability, Vault, Integrity, Nonjudgement, Generosity). He writes that having a clear and explicit “API” - where it’s understood that if you say yes to something it will get done, and that you say no early so the requester can move on - builds trust in the long term, in exchange for a small bit of friction in the short term.


Make Yourself Obsolete: Your Team Will Thank You - Nathan Broslawsky

Porter’s article above had a simple measure of success for strategy statements - in your absence, would the team be able to make the decisions you would have made based on the stated strategy?

Broslawsky’s article continues and expands on that theme. A good leader and manager is continually growing their team members, giving them broader responsibilities - and that means giving up some of your own tasks. It means coaching rather than problem-solving directly, giving team members opportunities to take on high-visibility efforts, and making your old job description obsolete. That frees you, too, to take on other responsibilities.

If you like this article, a related one from roundup #79, Always Be Quitting, generated several positive comments from RCT readers.


The Decision-Making Pendulum - Candost Dagdeviren
Our Role in Effective Decision Making - Kathy Keating
Everyday Decisions, Done Dirt Cheap - Matt Schellhas

Dagdeviren talks about the decision-making pendulum, the spectrum of decision-making processes ranging from authority (“Because I said so”), through advice and consent, to consensus decision making (“So say we all”). The processes at the extremes aren’t great for either the teams or the decisions that likely result (I don’t agree with everything Manager-Tools says, but I think they’re 100% right about consensus - consensus is a desirable outcome of a decision-making process, but it is a bad choice for a decision-making process). Within the middle, the important thing is that the decision-making process is explicit, input is gathered and seen to be taken seriously, and decisions are communicated clearly.

That input-gathering and discussion-facilitation phase, before and after a decision is made, takes some practice. Keating gives us (from the point of view of an inveterate introvert!) an overview of facilitating discussions - which can be especially tricky in the cross-disciplinary world we work in - so that the full range of ideas and information becomes available and shared.

And Schellhas gives us a bit of breathing room when it comes to decision making. The “best” decision is unknowable beforehand, when you actually need the decision; even in retrospect, determining whether a decision was the best available is usually still impossible. It’s not a power given to us mere mortals to watch all possible timelines unfold.

But there’s often one or more nearly-tied, “good enough” decisions available. Unless a given decision has huge ramifications — which does happen, but isn’t the common case — quickly making a good-enough decision in some consistent and transparent way (maybe, say, given some underlying strategy and priorities!) is probably better for you and the team than agonizing over trying to optimize over the unknowable.


Managing Your Own Career

Becoming a Better Writer - Gergely Orosz

Working with others is getting more and more remote and asynchronous. It’s always been a bit that way for us, with large multi-institutional collaborations a routine part of work. Whether it’s a new or ongoing challenge, we do better with clearer writing. Writing is a fundamental tool for clarifying our thoughts, and for communicating those thoughts to a community of work. That’s true whether it’s a pull request, a changelog, an email to users, or an institutional strategy document. And as we grow in our responsibilities, our community of work becomes broader, with more diverse expertise; writing clearly gets more and more important.

There’s only one way to get better at something - intentional practice with real feedback. Orosz recommends writing a lot, editing critically - our own work and others’ - and actively soliciting feedback. He also recommends a couple of tools, including Hemingway Editor, which I keep coming back to. In academia we write long, rambling sentences; Twitter and Hemingway Editor help me tighten my text. As long-time readers will know, it’s a continuing struggle.


Middle Managers: The Forgotten Heroes of Innovation - Ben M. Bensaou, INSEAD Knowledge

Something to remember as we rise in our careers: middle management has an undeserved poor reputation. It’s glue work, which is largely unsung and hard to even understand the value of from the outside. But middle management is where much of the cross-discipline collaboration across a large organization happens. ICs and front-line managers are typically too focussed on their individual functions, and the view from the executive suite is typically too high-level to see fine-grained opportunities to work together. Large companies like Bayer rely heavily on middle-managers to find novel ways to collaborate across the organization, as Bensaou describes.


Product Management and Working with Research Communities

Former Google CEO invests in computing help for university scientists - Adrian Cho, Science

Schmidt Futures, a philanthropic foundation founded by Eric and Wendy Schmidt, will be investing $40m over 5 years to establish a Virtual Institute for Scientific Software across Georgia Tech, Johns Hopkins, Cambridge, and U Wash. Research software developers will be hired at each site.

So this is unambiguously good news, all great stuff - and yet it’s getting really frustrating that we need wealthy tech patrons like Google’s ex-CEO or Mark frickin’ Zuckerberg (through the Chan Zuckerberg Initiative) to fund, in dribs and drabs, the absolutely fundamental work of research software development and maintenance. And at least software has those programs - other areas of research computing and data don’t even have that.

The article ends:

[UWash Dean of Engineering] Allbritton hopes federal funders might take note of the initiative. “One might imagine,” she says, “that NIH and NSF could pick up on something like this and really amplify it.”

Might one? At this point it’s not clear.


Research Software Development

Code Review is Feedback - Linnea Huxford

A reminder that code review isn’t just about the code in question - it’s feedback. That means it’s an opportunity to give nudges that inform future behaviours (code submissions), it’s an opportunity to give positive as well as negative feedback, and it’s important that all team members provide consistent feedback.


Git Organized: A Better Git Flow - Annie Sexton, Render
Make debugging suck less. Keep a logbook - Conor Lamb

A couple short articles nudging us to do a little better for ourselves when debugging, and for others when committing changes.

Sexton issues a plea for more coherent commits, suggesting a “commit (locally) as you go along” approach, followed by a git reset (which doesn’t change the working tree unless asked to) and then bundling the changed files into coherent commits. Even parts of a file can be added to a commit, with git add --patch or via editor support (in VSCode you can “stage selected ranges”).
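
In shell terms, the workflow looks something like this minimal sketch (the branch and file names here are made up):

    # Hack away, committing checkpoints as you go:
    git commit -am "WIP: checkpoint"

    # When the work is done, un-commit everything back to the branch
    # point; a plain (mixed) git reset leaves the working tree alone:
    git reset origin/main

    # Now rebuild the history as coherent commits:
    git add src/parser.c
    git commit -m "Fix off-by-one in the tokenizer"

    # Or stage just some hunks of a file interactively:
    git add --patch src/main.c
    git commit -m "Add a --verbose flag"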

Even though a lot of us were trained in science, we can lose our scientific approach in the heat of a debugging session. Lamb suggests keeping a logbook; this is a good idea, especially coupled with forming hypotheses and testing them during debugging, rather than my usual approach of just flailing.


Stewardship of Software for Scientific and High-Performance Computing - US DOE
Responses to DOE RFI on Stewardship of Software for Scientific and High-Performance Computing - Various, organized by Todd Gamblin

The DOE put out an RFI on Stewardship of Software for Scientific and High-Performance Computing, describing the problem and the current state of things. There were 39 responses, most of which are online, but the official government website is a pain to review. Gamblin has organized the responses in a GitHub repo - there’s a lot of good stuff there, worth reading. I’ll try to distill some of the common (and uncommon!) responses in a future issue.


Research Data Management and Analysis

Getting Started with R and InfluxDB - Gourav Singh Bais, The New Stack

There are a number of time series databases available these days - databases optimized for ingesting an extremely large number of (timestamp, value) pairs for a potentially large number of variables. Queries are aimed at identifying windows of time that meet some criteria, within one variable or across several time streams, and potentially doing some analysis within those windows. There are often also tools for expiring old data, or keeping it only at coarser and coarser granularity.
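
As a concrete (and entirely hypothetical) example, a windowed aggregation in InfluxDB’s Flux language - here run through the influx CLI, with made-up bucket, measurement, and tag names - looks something like:

    influx query '
      from(bucket: "covid")
        |> range(start: -30d)
        |> filter(fn: (r) => r._measurement == "cases" and r.region == "ontario")
        |> aggregateWindow(every: 1d, fn: sum)
    '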

If you’re curious whether something like that is appropriate for your use case, this InfluxDB-sponsored post on The New Stack is a quick way to play with some example data (COVID-19 cases) in InfluxDB using an R client.


Research Computing Systems

SUSE announces new distro for those who miss the old CentOS: Liberty Linux - Liam Proven, The Register
CentOS 9 Stream Is Now Available but Should You Use It? - Jack Wallen, New Stack

Well, that’s interesting - it looks like SUSE (drags on cigarette - it’s been a long time since I’ve heard that name) will be releasing a “classic CentOS” type distro based on RHEL, but with a much newer kernel.

As you know, I think slow moving, long-term-stable OS images are a trap for research computing systems teams, albeit one that’s hard to dig out of until vendors stop using them as an excuse to update their device drivers as seldom as humanly possible. But if it’s a trap you find yourself in, there’s a new option from an organization a little less fly-by-night than Rocky Linux.

Alternatively, CentOS 9 is the first version of CentOS available only as Stream - the staging area between the faster-moving Fedora and the carved-in-stone RHEL. After a new version of a package has been kicked around in Fedora for a while, it goes into CentOS Stream, which is effectively a rolling preview of the next RHEL. Honestly, I think something like that could be close to a sweet spot for some research computing clusters; the issue, as always, is cheap and lazy vendors updating their drivers only when absolutely forced to. As a community, we should force them to more often.


SSH Bastion Host Best Practices - Sakshyam Shah, Teleport
Restricting SSH agent keys - Jake Edge, LWN.net

A pair of SSH articles this week.

In the first, Shah describes a good set of best practices for setting up a bastion host. None of the individual steps here is groundbreaking - that’s what “best practices” means - but having the collection of them all in one place is useful. Strip down the packages and services on the host OS; lock down networking capabilities; limit user accounts; add OS-level logging and send the logs elsewhere; add a host intrusion detection system like OSSEC or Wazuh; harden OpenSSH by limiting ciphers, regenerating host keys, and the like; add 2FA/MFA; use certificates where possible; and deploy. There are recommended tools, steps, or parameters for each, which is nice.
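
As an aside, on the client side, hopping through a bastion has long been built into OpenSSH (the hostnames below are invented):

    # Jump through the bastion to an internal host (OpenSSH 7.3+):
    ssh -J jump.example.org user@internal.example.org

    # Or set it up once in ~/.ssh/config:
    #   Host internal.example.org
    #       ProxyJump jump.example.org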

In the second article, Edge describes an experimental feature in the upcoming OpenSSH 8.9, where keys can be ssh-add’ed to ssh-agent for limited uses only - e.g. for connecting to certain hosts, or even to certain users at certain hosts.
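
As described in the article, the constraints are applied when the key is added; something like this sketch (hostnames made up - and the feature is experimental, so details may change):

    # Allow this key to be used only for the bastion itself, or for
    # hosts reached by jumping through the bastion:
    ssh-add -h "jump.example.org" \
            -h "jump.example.org>internal.example.org" \
            ~/.ssh/id_ed25519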


Let’s Encrypt for internal hostnames - Julien Savoie

Savoie describes using the dynamic DNS (RFC 2136) support in certbot to temporarily create TXT records for internally-visible-only hostnames, so that Let’s Encrypt can issue certificates for them.
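
If you want to try it, the invocation is roughly this sketch (the paths, hostname, and credentials file here are hypothetical):

    # The credentials file holds the TSIG key that certbot may use to
    # create and delete _acme-challenge TXT records over RFC 2136:
    certbot certonly \
        --dns-rfc2136 \
        --dns-rfc2136-credentials /etc/letsencrypt/rfc2136.ini \
        -d internal-service.example.org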


Emerging Technologies and Practices

AWS Free Courses - AWS

AWS has moved a lot (most? all?) of its self-paced online courses to being free, possibly following Azure’s lead - e.g. here’s an FSx for Lustre Primer. It’s a good move, but I don’t know why it took so long. I genuinely can’t comprehend why some vendors charged or even continue to charge money for this kind of training; if I were in charge of such a company I’d be chasing technologists down in the streets trying to get them to get certified in my various technologies. Yes, good courses cost real money to develop, maintain, and even run if there’s a lot of (e.g.) cloud resources that get spun up for hands-on components. But do you want them using your stuff or not?


Kubernetes at Home With K3s - Bruno Antunes

We’ve spoken before about edge distributions of Kubernetes like k3s (#40, #56). Here Antunes walks us through the next step after poking at Minikube - running a real (if lightweight) Kubernetes with real pods on your system using k3s and VirtualBox, along with auxiliary tools like k9s (think top), Lens (a GUI), and Portainer; then setting up a simple web service with nginx ingress, Let’s Encrypt, and storage; and then more complicated stacks (like using Kompose to translate a docker-compose setup of a wiki, BookStack, into a Helm chart).
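
Getting a node up to play with really is close to a one-liner - this is k3s’s standard install script, best run in a throwaway VM:

    # Install and start a single-node k3s cluster:
    curl -sfL https://get.k3s.io | sh -

    # Then poke at it with the bundled kubectl:
    sudo k3s kubectl get nodes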

Refreshingly, Antunes is less a Kubernetes fan and more in a “well, I guess this is what we need to use given the alternatives” sort of situation, so it doesn’t come off as a sales pitch. And by the time you’ve worked through the post, you will have seen a fair chunk of Kubernetes functionality.


NVIDIA Research Plots A Course to Multiple Multichip GPU Engines - Timothy Prickett Morgan, The Next Platform

There’s some interesting material here riffing on a paper by NVIDIA Research that went online in Dec 2021, describing how NVIDIA might make multichip GPUs - probably not shocking, since AMD is already going there (#100). Maybe more interesting, and certainly catching the attention of HPC twitter, is that the same paper suggests that deep learning and HPC are going to de-converge a bit, with proposed different GPU SKUs specialized for HPC and for AI/DL. The deep learning specialized GPUs will have even larger caches. Under NVIDIA’s simulations, HPC workloads benefited less from high-bandwidth access to large caches, while the AI workloads needed a very large “resident set” of high-bandwidth memory.

[Figure: NVIDIA Research’s imagined specialized GPUs for HPC vs for deep learning]


Calls for Submissions

Still a flurry of new-year CFPs and summer REU opportunities:

Call for May 2022 mentoring communities - Outreachy, Community deadline 25 Feb

Humanitarian open source communities or open science communities can apply for $6,500 to hire an intern from under-represented groups; like Google Summer of Code, communities apply for a place, then mentors submit projects, then candidates apply and are matched.


4th Workshop on Benchmarking in the Data Center: Expanding to the Cloud Workshop (BID 2022) - 2 or 3 Apr, Virtual, Papers due 31 Jan

This and GPGPU are held in conjunction with PPoPP22. From the call:

High performance computing (HPC) is no longer confined to universities and national research laboratories, it is increasingly used in industry and in the cloud. Users need to be able to evaluate what benefits HPC can bring to their companies, what type of computational resources would be best for their workloads and how they can evaluate what they should pay for these resources. Recent general adoption of machine learning has motivated migration of HPC workloads to cloud data centers, and there is a growing interest by the community in performance evaluation in this area, especially for end-to-end workflows. Benchmarking has typically involved running specific workloads that are reflective of typical HPC workloads, yet with the growing diversity of workloads, theoretical performance modeling is also of interest to allow for performance prediction given a minimal set of measurements.

We invite novel, unpublished research paper submissions within the scope of this workshop. Paper submission topics include, but are not limited to, the following areas:

  • Multi-, many-core CPUs, GPUs, hybrid system evaluation
  • Performance, power, efficiency, and cost analysis
  • HPC, data center, and cloud workloads and benchmarks
  • System, workload, and workflow configuration and optimization


The 17th International Workshop on Automatic Performance Tuning - 3 June, Lyon, Papers due 31 Jan

iWAPT (International Workshop on Automatic Performance Tuning) is a series of workshops that focus on research and techniques that address performance sustainability issues. It provides an opportunity for researchers and users of automatic performance tuning (AT) technologies to exchange ideas and experiences while applying such technologies to improve the performance of algorithms, libraries, and applications; in particular, on cutting edge computing platforms.


The 14th Workshop on General Purpose Processing using GPU (GPGPU 2022) - 2 or 3 Apr, Virtual or Seoul, Papers due 8 Feb

From the call:

Massively parallel (GPUs and other data-parallel accelerators) devices are delivering more and more computing powers required by modern society. With the growing popularity of massively parallel devices, users demand better performance, programmability, reliability, and security. The goal of this workshop is to provide a forum to discuss massively parallel applications, environments, platforms, and architectures, as well as infrastructures that facilitate related research. This year, we are no longer limited to GPU applications and architectures. We welcome research related to any highly parallel computing accelerators and devices. Authors are invited to submit original research papers in the general area of massively parallel computing and architectures.


The 5th International Workshop on Edge Systems, Analytics and Networking (EdgeSys’22) - 5 April, Rennes, France (hybrid), Papers due 11 Feb

Co-located with EuroSys 2022, this workshop aims to “discuss the latest research ideas and results on edge systems, analytics and networking, especially those related to novel and emerging technologies and use cases.”


25th Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP 2022) - 3 June, Lyon, Papers due 13 Feb

JSSPP welcomes both regular papers and descriptions of Open Scheduling Problems (OSPs) in large-scale scheduling. Lack of real-world data often substantially hampers the ability of the research community to engage with scheduling problems in a way that has real-world impact. The goal of the OSP venue is to build a bridge between the production and research worlds, in order to facilitate direct collaborations and impact.


HPC-IODC: HPC I/O in the Data Center Workshop - 29 May - 2 June, papers due 24 Feb

This workshop, held with ISC, is looking for papers covering “all aspects of data centre I/O”, including:

  • Application workflows
  • User productivity and costs
  • Performance monitoring
  • Dealing with heterogeneous storage
  • Data management aspects
  • Archiving and long term data management
  • State-of-the-practice (e.g., using or optimising a storage system for data centre workloads)
  • Research that tackles data centre I/O challenges
  • Cloud/Edge storage aspects

Random

The argument in favour of using unsigned ints by default. Relatedly, gotchas from the “usual arithmetic conversions” in C/C++.

The simplest web server in bash, with while + nc. Relatedly, carbon.now.sh for making pretty images of short snippets of source code.
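
For the morbidly curious, the shape of it is something like this sketch (the flags here are for the Debian/GNU-style nc; BSD netcat spells them differently):

    # "Serve" one file, one request at a time, on port 8080:
    while true; do
        { printf 'HTTP/1.1 200 OK\r\n\r\n'; cat index.html; } | nc -l -p 8080 -q 1
    done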

GitHub actions will now suggest possible workflows when you start to create a new workflow.

Five options for creating backups of git repositories.

iOS shortcuts for disabling twitter (or other time-wasting apps) before a certain time or unless you’ve done a certain amount of something else.

A key to being a good decision maker is having a good sense of the likely accuracy of your assumptions. How confident are you in your answers to true/false questions?

An immutable, tamperproof, cryptographically secure, centralized database.

I needed this a few weeks ago, when I had to pass a variable to an awk script (I just did, ok?) - passing runtime data to awk.
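
The standard trick is awk’s -v flag (a tiny sketch, with an invented file and field):

    # Pass a shell variable into an awk script with -v:
    threshold=42
    awk -v limit="$threshold" '$3 > limit { print $1, $3 }' data.txt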

Another fast and nerdy shell prompt - starship.

Embedded and HPC development have in common a need for parsimony with resources - in embedded because the resources are small, in HPC because the need for resources is so large. Trice is a small tracing library for embedded software.

A SQL playground in the browser based on SQLite - SQLime.

Another random one-on-one question generator.

Tuning vector addition and GEMM matrix-matrix multiplication in webassembly.

Fast hash tables for (single-node) parallel programming.

Simultaneously learn DMARC and test your own SPF/DKIM/DMARC setups with the “Learn and Test DMARC” console. If none of those acronyms mean anything to you, you have made better decisions in life and email hosting than I have; count yourself lucky and move on.

Whether you support work in the digital humanities, microscopy, or fluid simulations, being able to rapidly and interactively show users huge images is valuable. Here’s how the Rijksmuseum has optimized large full-screen images.

CryptoLyzer - a cryptographic settings analyzer for TLS, SSH, and HTTP.

Edge computing, including support in tools like fly.io or cloudflare workers, is going to be interesting for large-scale data collection for some kinds of research projects. If you’re interested in playing with fly.io (they’re doing interesting work with networking, firecracker…), they now have 3GB of free storage and even Postgres DBs.

Like cryptography or linear algebra libraries, licenses and other legal documents are not a roll-your-own kind of deal. However one might feel about it, there are real, legally sound “open source for non-commercial purposes” licenses out there - don’t try to cobble together a dual-license “MIT, but not for commercial use” situation; it doesn’t work.

If you’re going to implement a work queue in a database, first, you shouldn’t. Second, you should know about SKIP LOCKED.
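
The pattern, in Postgres terms (a sketch - the table and column names are invented):

    # Claim one pending job without blocking on rows other workers hold:
    psql jobs_db <<'SQL'
    BEGIN;
    SELECT id, payload
      FROM job_queue
     WHERE status = 'pending'
     ORDER BY id
       FOR UPDATE SKIP LOCKED
     LIMIT 1;
    -- ...mark it running and process it within the same transaction...
    COMMIT;
    SQL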


That’s it…

And that’s it for another week. Let me know what you thought, or if you have anything you’d like to share about the newsletter or management. Just email me or reply to this newsletter if you get it in your inbox.

Have a great weekend, and good luck in the coming week with your research computing team,

Jonathan

About This Newsletter

Research computing - the intertwined streams of software development, systems, data management and analysis - is much more than technology. It’s teams, it’s communities, it’s product management - it’s people. It’s also one of the most important ways we can be supporting science, scholarship, and R&D today.

So research computing teams are too important to research to be managed poorly. But no one teaches us how to be effective managers and leaders in academia. We have an advantage, though - working in research collaborations has taught us the advanced management skills, but not the basics.

This newsletter focusses on providing new and experienced research computing and data managers the tools they need to be good managers without the stress, and to help their teams achieve great results and grow their careers.