#92 - 17 Sept 2021

Trying harder to hire; increasing silos; coaching and career development; mmap() vs specialized libraries; writing good documentation; datasheets for datasets; simulate, don't guess; Kubernetes or not; benchmarks using Optane

Hi!

We’ve talked here in several issues about the “great resignation” that many companies are seeing as the pandemic starts to wane a bit. I think this is playing out differently in our world: there are definitely people leaving, or making new work arrangements, but it’s less en masse. Rather than having trouble with people leaving, we seem to have an (even bigger) problem bringing people on.

On the jobs board there have been academic positions open since late winter, some that have been re-posted an uncomfortable number of times. That’s for manager positions; I’m also hearing from a lot of people that hiring ICs is getting almost impossibly difficult.

And it’s not hard to imagine why. In the last couple of weeks I’ve spoken with managers at tech companies (mainly through online communities) who were actively hiring ICs, and the total compensation they described was routinely a comfortable 2.5-3x what I can pay. And my organization pays quite well for academia. These weren’t some horrible soul-rending ad-tech jobs aimed at increasing click-through rates, either, but interesting, challenging, and high-impact infrastructure, software, or operations jobs relevant to research computing.

I mentioned in #90 that teams needed to be ready to compete, and they do - not just for work but for workers. Having a well-defined specialty and being known for something helps. And in #81 I mentioned that too often we underplay our strengths as employers: stability of employment with variety of work; campus-based or remote work, each of which is attractive to a different population; work that we can be proud of.

Collectively, we’re going to need to up our game when it comes to hiring. Some things change only slowly, with time and a lot of collective effort: developing career paths, improving compensation, and the like. But other things are entirely within our control: writing job ads that raise the frankly abysmal bar set in our industry; spending time actively recruiting; being clear on what the benefits of joining our team are and what our team’s goals are; having a well-thought-through, well-communicated, not-made-up-on-the-fly interview process; making sure team members get the support they need, and opportunities to shine in venues they care about. These are things we don’t need anyone’s permission to do, that don’t require HR sign-off, and that by and large don’t require funding approval.

Hiring’s hard and isn’t going to get easier any time soon. There are a lot of exciting and well-paid challenges outside of our industry. We can compete! But we need to try.

On to the roundup:

Managing Teams

The effects of remote work on collaboration among information workers - Longqi Yang et al., Nature Human Behaviour

This is a peer-reviewed paper covering data reported earlier in blog posts: Microsoft looked at the internal messaging and calendaring traffic of its US employees over the first six months of 2020. They found that, internally, the size of communication networks fell and groups became more siloed - presumably making it harder to communicate across different divisions of an organization.

This is consistent with what we’ve found on our team. We’re in a slightly unusual situation - we’re part of a multi-site collaboration. At least at our site, I find that the collaboration’s communications are actually much better than before the pandemic - we put a lot of work into that - but there’s no free lunch, and my communication with other teams at my home institution is relatively poor.

By and large, I think this works out OK for many research computing teams; we’re pretty self-contained, and we consult with researchers on a project-by-project basis for focussed periods of time. But it does point to areas where managers need to work: making sure there are bridges built to the other internal teams we need to interact with (IT? VPR?), and putting extra work into communicating messages that we want heard widely, like successes or changes.


How to coach an employee who is performing well - Claire Lew, Know Your Team
Researchers not taking enough time for career development - Sophie Inge, Research Professional News (paywalled)

As managers, our tendency is to focus on problems, and on team members who aren’t performing well in their current tasks. As a result, team members who are capable and already growing tend to suffer from benign neglect; but this isn’t fair to them, and it’s a missed opportunity for the manager. Someone who’s already shown they can grow under conditions of benign neglect can grow faster with some help. What’s more, if you don’t offer that help, they may reasonably start looking for jobs where their growth and career development are actively encouraged - wouldn’t you?

Lew walks us through the steps of making sure someone who’s doing well and is already growing gets the coaching they deserve. As always, coaching isn’t about you telling someone how to do something better, but about giving them the opportunities and resources - and accountability - to grow their skills:

  • Understand their motivations
  • Give choice that aligns with their motivations
  • Connect their work with progress towards their core motivation

As with so much of management, this isn’t hard or complex! The difficulty is in the discipline of performing the practice consistently. In #68 I shared the quarterly goal-setting & review template I use; checking in every three months, and understanding their goals both at this job and in their larger career, makes it much easier to ensure that new opportunities and resources support their growth as well as the work the team needs.

All of this is incredibly important; as Inge points out in a paywalled article, while researchers know that career development is vitally important, they find themselves “too busy” to improve their own skills and thus their productivity. That attitude spreads outward to researcher-adjacent staff. We’d all agree that putting time and effort into growth and development is vital, and yet too often it doesn’t get done. One of our biggest roles as managers is to make sure our team members (and we ourselves) take the time to grow our skills, so we can take on more ambitious efforts and be more satisfied in our work.


Why the best way to hire is incredibly boring - Laszlo Bock, humu

Again - management isn’t complicated, it doesn’t require profound insights into the human condition or the right personality type - it consists of nothing more than the discipline of attending to the details.

Bock despairs of the “one weird interview question” genre of articles that promises to give you that profound insight into a candidate without having to do any real work.

But hiring doesn’t work like that. When hiring a human being to do a job for the team and grow the team, you need to find out if they can do the job, and if they’ll grow the team in a way that the team would benefit from. As Bock points out, the way to do that is exactly what you’d expect:

  • Define Job Attributes
  • Ask for a work sample
  • Ask behavioural questions
  • Average scores and make a decision
  • Constantly check that your hiring process actually works

It’s boring, it’s mundane, it’s a lot of work, and there isn’t another good general option.

We’ve recently gone through (and have nearly completed) hiring a technical project manager - I can’t even tell you how excited I am by that prospect - so this is all fresh in my mind. This is our first TPjM hire, so we spent a lot of time putting together interview materials: determining what we needed, what a new team member would need to do to demonstrate success in the role, and what the criteria are, and then working out how to get that “work sample” - in this case, a “project planning meeting” interview, in addition to the behavioural-questions interview (where we provided the questions, along with the scenario, ahead of time). It was a lot of work, but it greatly clarified within the team what we needed, it greatly clarified for the candidate what they were proposing to get into, and it basically sets up the goals for the new hire’s first six months. It improves our, and their, chances of success.

The one place I’d disagree with Bock is about averaging scores. You don’t want to hire someone that, on average, the team wants to work with.

You absolutely have to weigh all the input, and use that to inform your decision. But I’d suggest two things.

First, any well-justified objection to a candidate based on the agreed-upon criteria should be pretty darn close to disqualifying. Second, the “decision” can’t be made by an artificially dispassionate scoring system, as comfortable as that would be. The decision ultimately has to be your responsibility as the hiring manager - for the very simple reason that if it goes horribly wrong and they have to be fired, that responsibility will fall on you and no one else.


Product Management and Working with Research Communities

Toward Modern Fortran Tooling and a Thriving Developer Community - Milan Curcic et al., arXiv:2109.07382

Back in #56 we pointed out a blog post celebrating one year of the new Fortran-lang community; in this preprint, released this week, Curcic et al. go into more detail about the creation of a community organized around GitHub, the development of a standard library (a short 64 years after the introduction of the language), a package manager, and a community-curated website, where previously there had been only quarterly standards-group meetings.

It’s an encouraging reminder that communities want to form, and that a small number of people willing to put the time in can catalyze a surprising amount of effort (here, over 60 people) focused on improving the state of affairs within their niche.


Research Software Development

Loading NumPy arrays from disk: mmap() vs. Zarr/HDF5 - Itamar Turner-Trauring

This article is written with Python examples (and Zarr in particular is a Python package), but Turner-Trauring’s basic discussion of mmap() vs specialized libraries as tools for accessing large files is quite general.

mmap() isn’t magic - it still involves much of the machinery implemented for file systems. It can be a very quick way to get started, and if you have genuinely random access it may be pretty close to optimal (though there’s still some tuning you can do); but if you know something about the structure with which you’ll be accessing your data, you can often do better - particularly if you can take advantage of compression.
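
As a minimal sketch of the tradeoff (the file names and slice sizes here are made up, not taken from the article):

```python
import numpy as np
import zarr  # pip install zarr

# mmap: near-zero startup cost; the OS pages data in on demand,
# which is close to optimal for genuinely random access.
arr = np.load("big_array.npy", mmap_mode="r")
row = arr[123_456]           # reads only the pages backing this row

# Zarr: you choose chunking (and compression) to match how you'll
# actually access the data - here, large contiguous row blocks.
z = zarr.open("big_array.zarr", mode="r")
block = z[120_000:130_000]   # reads and decompresses whole chunks
```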


What Makes a Good Changelog - Zeno Rocha, Herbert Lui, WorkOS
Writing class documentation - Arnt Gulbrandsen

Here are two very pragmatic articles about documentation for developers. The common theme is that the most important thing about changelogs and class documentation is that they exist. And the easiest way to ensure that happens is to have simple, clear, and non-onerous expectations of what they should look like.

Rocha and Lui, covering changelogs, specify:

  • They should be clear
  • In images, highlight changes
  • Spotlight the people behind the product
  • Consistent formatting of versions and dates, and…
  • Teams should dedicate real technical staff time to them

Gulbrandsen, on writing class documentation, spells out the simple formula he used for Qt:

  • Write a sentence-long blurb with the most important thing you can say about the class
  • Write a few more sentences to give a complete, but rough, overview of what the class is (not does)
  • Make a list of methods, member variables, enum values, and other subordinates of the class, sort the list into aspects, then write about each aspect
  • Finally, make zero or more examples

Good writers will do it differently, but the goal here is to write something good, not to be a good writer. The best is the enemy of the good.
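
Applied to a hypothetical Python class, the formula might come out something like this:

```python
class RingBuffer:
    """A fixed-capacity FIFO buffer that overwrites its oldest entries.

    A RingBuffer is a sequence holding at most `capacity` items;
    appending to a full buffer silently discards the oldest item.
    It is a plain data structure, not a thread-safe queue.

    Writing:  append(), extend()
    Reading:  __getitem__(), __iter__(), oldest(), newest()
    Sizing:   __len__(), capacity, is_full()
    """
```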


Research Data Management and Analysis

The Data Product Manager - Rifat Majumder, Locally Optimistic

I’ve been beating the drum for some time that research computing teams need to start thinking less in terms of projects and more in terms of products. So far, the only place I’ve seen that happen is in areas very close to data, with data products themselves or data-processing pipelines as products - and even so, I’ve seen only 4/224 academic jobs on the job board with that title.

In this article, Majumder describes the benefits of treating curated data collections as a product, including:

  • Deeper understanding of client needs
  • Clearer priorities
  • Clarity around how to make changes

The article is written in the context of internal data products for use within a large company, but with some of the words changed it works just as well for the management of large data resources for research communities.


Datasheets for Datasets Template - Audrey Beard
Datasheets for Datasets - Timnit Gebru et al., arXiv:1803.09010

Beard provides a LaTeX template for Gebru et al.’s suggested “Datasheets for Datasets”: a human-readable, high-level description of a dataset - not a data dictionary, but a description of why the dataset exists, how the data was collected, what preprocessing/cleaning/labeling was done (if any), how or whether maintenance will be done, what uses the dataset has been put to, and more.

[Screenshot of a templated datasheet, showing questions about the motivation for creating the dataset and the composition of the dataset]


Research Computing Systems

Napkin Problem 16: When To Write a Simulator - Simon Eskildsen

Eskildsen suggests that, rather than relying on intuition (or back-of-the-napkin estimates), we should

Simulate anything that involves more than one probability, probabilities over time, or queues.

When we’re thinking about batch queueing system parameters, or sharding, or load balancing, or anything else, our systems’ usefulness depends on what we choose - but we too often choose based on trial and error or rules of thumb. We work in the sciences, and writing simple simulations of our systems to guide our decisions isn’t normally very difficult; it can expose corner cases we mightn’t otherwise consider, and it lets us run simple what-if calculations, as in the sketch below.
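
As a minimal sketch of the kind of simulation Eskildsen means - a toy batch queue with made-up arrival rates, job durations, and node counts - a few dozen lines can answer “what happens to wait times if we add two nodes?” far more reliably than intuition:

```python
import random

def simulate_queue(n_nodes=10, jobs_per_hour=4.0, mean_job_hours=2.0,
                   total_hours=10_000, seed=1):
    """Toy batch-queue model: Poisson arrivals, exponential job
    durations, single-node jobs scheduled FCFS onto identical nodes."""
    rng = random.Random(seed)
    free_at = [0.0] * n_nodes              # when each node is next free
    t, waits = 0.0, []
    while t < total_hours:
        t += rng.expovariate(jobs_per_hour)          # next job arrives
        node = min(range(n_nodes), key=free_at.__getitem__)
        start = max(t, free_at[node])                # wait if all busy
        waits.append(start - t)
        free_at[node] = start + rng.expovariate(1.0 / mean_job_hours)
    waits.sort()
    return waits[len(waits) // 2], waits[int(0.95 * len(waits))]

for nodes in (10, 12):                     # the what-if: add two nodes
    median, p95 = simulate_queue(n_nodes=nodes)
    print(f"{nodes} nodes: median wait {median:.2f}h, p95 {p95:.2f}h")
```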


Assessing the Use Cases of Persistent Memory in High-Performance Scientific Computing - Yehonatan Fridman et al., arXiv:2109.02166

In this paper, Fridman et al. use Optane persistent-memory devices (1) as memory, (2) as swap, or (3) as disks for checkpoint/restart, testing with microbenchmarks and several of the NAS Parallel Benchmarks and PolyBench benchmarks. There are definitely benchmarks for which using the Optane in memory mode costs only a ~50% performance drop while enabling much larger problem sizes, more or less for free, as in the figure below:

[Figure: slowdowns as problem size exceeds physical memory, with Optane used as auxiliary memory; some benchmarks slow down by ~50%, others plummet much further]

But there are benchmarks for which the drop is much worse. Using the Optane as swap gave pretty uniformly poor results; it worked quite well for checkpoint/restart. The code the team used is available on GitHub.

In general, I think we’re going to need to try harder than just using these nonvolatile devices as memory or as a drive to get the sort of performance we’d like from codes that use them. There was a period from the 70s through the 90s of active development of external-memory/out-of-core algorithms (and then, in the 2000s, cache-oblivious algorithms) to make use of multi-stage memory hierarchies. As we move to NVMe, and to RDMA over PCIe busses, I suspect we’re going to have to dust those off. One interesting, very long review is here (PDF).


Emerging Technologies and Practices

If you’re thinking of starting up reliability programs for your systems, nginx has a free (in exchange for your email address) O’Reilly ebook, 97 Things Every SRE Should Know. It covers the basic mental models of tech-industry SRE, how to get started, and then increasingly advanced processes (ranging from a single person being the site reliability engineering “team” up to teams of 10-100).

SRE sounds kind of exotic and mysterious and advanced to those of us in the research computing sector, but it’s not magic, nor is it arcane - it’s a bunch of practices centred around being extremely clear about what “reliability” means in your context, defining metrics so you know if you’re being reliable enough (or even more reliable than you need to be!), and then being deliberate about maintaining reliability while making changes.
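
To make “reliable enough” concrete, the standard SRE move is an error budget; here’s toy arithmetic with a made-up SLO target:

```python
# A 99.9% availability SLO over a 30-day window allows ~43 minutes
# of downtime; anything unspent is budget for risky changes.
slo_target = 0.999
window_minutes = 30 * 24 * 60

budget = (1 - slo_target) * window_minutes
downtime_so_far = 12.0   # minutes of recorded downtime this window

print(f"error budget: {budget:.1f} min/month")   # 43.2
print(f"remaining:    {budget - downtime_so_far:.1f} min")
```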

Many of the 97 (98, really) 1-2 page essays likely won’t be applicable to your particular situation, but it’s a good crash course in the language and models of SRE and will help you get a foothold in the huge SRE literature out there.


Deploying Kubernetes Based HPC Clusters in a Multi-Cloud Environment - Daniel Gruber, Burak Yenier, Wolfgang Gentzsch, HPCWire
Owl Job Scheduler - submit user jobs to a Kubernetes cluster from anywhere - Eduardo Gonzalez
“Let’s use Kubernetes!” Now you have 8 problems - Itamar Turner-Trauring

I’m pretty skeptical of using Kubernetes for running traditional HPC jobs - Kubernetes on the one hand and (say) Slurm or SGE on the other serve two completely different use cases, and a workload that’s well served by one category is not likely to be a good match for the other.

However. HPC schedulers are so well optimized because they solve a deliberately simplified problem - single “rectangular” jobs, N nodes x M hours - and that isn’t a good match for all job types. They have widely varying capabilities (mostly ranging from “meh” to “ugh”) for handling the complex DAGs of tasks needed by modern workflows. Further, despite noble efforts with DRMAA v1 and v2, API-driven programmatic control of schedulers, particularly across systems, is still really poor. Finally, if you are running in an environment with elastic compute resources, the extremely sophisticated HPC schedulers that are used to wring the highest utilization out of fixed, static compute resources aren’t nearly as valuable.

So there’s a lot of interest in targeting Kubernetes as a single uniform interface to resources wherever they may be. Again, I think it’s a weird match for HPC - simultaneously overkill and lacking some key functionality - but it’s about as close as we have to a standard API for compute resources right now, so here we are.
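
For a flavour of what “a standard API for compute resources” means in practice, here’s a minimal sketch of submitting a batch job through the official Kubernetes Python client (the job name and container image here are made up) - contrast this with shelling out to sbatch and parsing its output:

```python
from kubernetes import client, config  # pip install kubernetes

config.load_kube_config()  # reads credentials from ~/.kube/config

# A batch job is just another API object: define it...
job = client.V1Job(
    metadata=client.V1ObjectMeta(name="demo-job"),
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[client.V1Container(
                    name="main",
                    image="python:3.10-slim",
                    command=["python", "-c", "print('hello')"],
                )],
            ),
        ),
    ),
)

# ...and submit it; the same call works against any conformant
# cluster, on-premises or in any cloud.
client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
```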

In the first article, Gruber, Yenier, and Gentzsch describe their significant experience using Kubernetes as part of the UberCloud effort to provide cloud-based HPC for engineering simulations for a variety of customers, culminating in a study of 3,000 simulations on spot instances, each within its own ephemeral Kubernetes cluster (!) on GKE, calling services from AWS and Azure. In their telling, the extra overhead of GKE was more than offset by the savings from gracefully handling spot-instance preemption.

In the second, Gonzalez describes the Owl job scheduler for submitting jobs to Kubernetes - testing locally on Docker Desktop, or running in production on GKE or other Kubernetes clusters. Demonstrated are data-science jobs using Python + Dask, pipelines on single-cell data, and Jupyter notebooks.

On the “con” side, Turner-Trauring gives the conventional take on Kubernetes, and one I share: it’s a huge thing to take on, it’s enormously complex, and you’d do well to make absolutely sure it’s right for your team’s needs before starting with it. A recent article described one team moving off managed Kubernetes (GKE) to Nomad for a single site, specifically because of the ease of handling batch jobs.


Calls for Submissions

Software Sustainability Fellowships - £3,000 / 15 months, primarily for UK-affiliated people, due 31 Oct

From the site:

This funding can be used for any activities that meet both the Fellow’s and the Institute’s goals, such as travel to workshops, running training events such as Software Carpentry, Data Carpentry or Library Carpentry, nurturing or contributing to communities of practice, collaborating with other Fellows, or for any other activities that relate to improving computational practice or policy.


Events: Conferences, Training

SeptembRSE Open Data Workshop - 28 Sept

This AWS/SeptembRSE event is a hackathon for teams of up to five, running reproducible analyses on one of the AWS-hosted open datasets:

Join us for a distributed, hybrid workshop on using Amazon Web Services (AWS) to run reproducible analyses from the registry of open data. Inspired by the reprohack model, participating teams will recreate example workflows, deploy infrastructure, and run analysis.


NVIDIA GTC - 8-11 Nov, Virtual - Free registration for talks, full-day workshops $99

NVIDIA’s huge GTC is open for registration, with talks on methods and applications for genomics, image processing for astronomy, medicine, and other areas; natural language processing; molecular dynamics simulations; DPUs for supercomputing; InfiniBand; the current NVIDIA HPC stack; GPU fluid simulations; and more.


PRACE Winter School 2021 - Converging HPC Infrastructure & Methodologies - 7-9 December, free; registration limited (due by 15 Oct), with priority given to regional attendees

The Inter-University Computation Center (IUCC) in Israel is proud to host the PRACE 2021 Winter School which will feature a rich introductory workshop on how to use PRACE resources as well as trends in future software and data ecosystems for scientific discovery.


Random

Computer games for education have been around since the beginning - here’s the story of 1964’s The Sumerian Game, complete with teletype, slide projector, and narration on cassette. I played the 1978 version of this, typed in from BASIC Computer Games.

It turns out a couple approximate math instructions (approximate single precision reciprocal and square root) vary slightly from AMD to Intel, which is a problem if you’re emulating. Here’s how the team working on the reverse debugging package rr handled it.

Finally, a Fortran web framework.

Not to be outdone, Racket (a variant of Scheme) has a GUI framework too, they want you to know.

A couple of make-golang-go-faster tools: a tutorial on Go profiling, tracing, and observability, and Go assembly-language libraries for a variety of platforms for sorted sets, byte swapping, qsort, etc., with native Go fallbacks.

In praise of awk, a too-often unsung hero.

Using AWS ECS and Fargate to run containers on a Raspberry Pi at home. Here the author uses a tool called inlets to route traffic back through a home NAT with a dynamic IP, but I think you could use something like Tailscale or another WireGuard-based VPN as well.

Functionality of Alpine Linux I didn’t know about - individual service-level network isolation with virtual routing and forwarding (VRFs).

Case studies (from an Oracle employee) of using Oracle’s always-free tiers of services for real services, including a simple data-driven website or development/test instances.

The newish programming language Zig is taking a very different approach from Rust when it comes to working with C/C++ code; as opposed to the “Rewrite It In Rust” mantra, Zig makes interoperability with existing C/C++ code - followed by incremental addition to it, or replacement of it, with Zig code - a priority.

The newish buildkit tool, most often used with Docker, also works with podman.

Distinguishing between OS package managers and e.g. programming language package managers - closed universe vs open universe.


That’s it…

And that’s it for another week. Let me know what you thought, or if you have anything you’d like to share about the newsletter or management. Just email me or reply to this newsletter if you get it in your inbox.

Have a great weekend, and good luck in the coming week with your research computing team,

Jonathan

About This Newsletter

Research computing - the intertwined streams of software development, systems, data management and analysis - is much more than technology. It’s teams, it’s communities, it’s product management - it’s people. It’s also one of the most important ways we can be supporting science, scholarship, and R&D today.

So research computing teams are too important to research to be managed poorly. But no one teaches us how to be effective managers and leaders in academia. We have an advantage, though: working in research collaborations has taught us the advanced management skills, but not the basics.

This newsletter focusses on providing new and experienced research computing and data managers the tools they need to be good managers without the stress, and to help their teams achieve great results and grow their careers.