The good news is that my team, and the larger organization I’m a part of, is going to be growing substantially in the coming year. That’s also the bad news. We have to hire.
Hiring team members is a time-consuming, exhausting job - and probably rightly so, since it’s the most important thing we do. A lot of planning, organizational, and process mistakes we make as managers can be mitigated if we’ve helped assemble a terrific team; on the other hand there’s only so much pulling on those same levers can help us if we’ve made poor hiring choices. Your research computing team members are the people who do the work of supporting research with working code, data curation/analysis/systems, or computing systems. Putting that time and effort into hiring that makes it time consuming and tiring is absolutely appropriate.
Hiring, like anything else in management, is a practice that you can improve on by having a process you go through that you learn from and improve each time through. That means being a lot more deliberate about hiring (or really any other aspect of management) than we usually are in academia-adjacent areas. It’s also, to be honest, more work. But hiring is the most important decision you’ll make as a manager. Decisions you make about new team members will last even after you leave. A good hiring choice will make everyone’s job easier, and a poor hiring choice will make everyone’s job worse.
The larger organization I’m part of is setting up a proper hiring pipeline/funnel, so the whole process has been on my mind this week. And every research computing team leader I’ve spoken to in the past couple of years -including one I just met this week - has described the same issues about hiring . So for the next several weeks I’ll write up how we approach hiring, and what we learn as we proceed with our hiring pipeline. The topics will look like:
The most important thing I’ve learned about hiring after leaving academia I first heard on Manager-Tools - here’s one of the relevant podcasts - but once you see it you see it in all kinds of advice, including in advice on how to reduce hiring bias.
The purpose of the screening and interviewing process is to find a reason to say no; to discover ways in which the candidate and the job would not be a good match
In academia-adjacent fields, we tend to reject this,. We’ve been trained in an environment which emphasizes “meritocracy”, and for students and postdocs and junior faculty everyone one is looking for the “smartest” and “best” candidate with “the most potential”.
But none of that is right for us. It’s not great for academic positions; it’s definitely wrong for research computing positions.
We’re hiring a new person because we have a problem; the team has things that need doing. It doesn’t matter who’s “smartest” or “best”, even if we knew how to reliably assess those things, which we absolutely do not. What matters is how well a job candidate can do the things we need them to do, and how well the tasks we have match what the candidate is looking for. The “smartest” candidate who has skills the team already has in abundance but lacks the things we really need in this new position of can’t help us. The “best” candidate who is looking to do things that are radically different from what we need them to do is going to resent the job and leave it as soon as possible if we do hire them.
In academia we still tend to reject this, either because we’re still looking for “the best potential” like we’ve been trained to do, or for the opposite reason - because we see it as toxic and biased (which it is) and we don’t want to do anything which looks for negatives, which smacks of “winnowing out”, “raising the bar”, “keeping candidate quality high”, etc. But that’s not what’s going on here. We need to think about it as scientists.
Once we’ve selected people for the next step, whatever that step is, the hypothesis is that they are suitable candidates for the job. That must be the hypothesis, or we wouldn’t be wasting their and our time. And so, as scientists, the right way to proceed is to attempt to find evidence that contradicts the hypothesis; evidence that they wouldn’t succeed in the job. We are not searching for evidence that a person would be a good hire; there is nothing that is easier than fooling yourself into confirming a hypothesis! Confirmation bias and the halo effect are the enemy here. They allow all sorts of other biases to sneak in (and not necessarily even nefarious biases, like the classic “looks like/thinks like me”; it could just be unconsciously over/under-weighting certain parts of the job based on experiences with previous hires). We want to find evidence that this person would not succeed in the day-to-day technical work and teamwork we’d be asking of them.
In French class in school, I used to do dictées, where the teacher would read aloud and we’d transcribe into written French. We’d start with 100% and lose points for every wrong answer. That’s the approach we’re taking here. We have a list of must-have skills or approaches - technical but also team-related - and we look for evidence that the candidate does not have what we need them to have. Except they lose all the points when they’ve failed to demonstrate they have a must-have requirement for the job.
You can absolutely still be biased - your bias against a candidate can be so strong as to blind you to the fact that they have in fact demonstrated such-and-such a requirement. But it’s harder - not impossible; harder - for us humans to deny evidence that’s clearly in front of them than it is for us to choose to see evidence that they do have the skill, or at least they’ve shown the potential for it. That’s why the scientific method is the way it is. You find evidence to disprove, not to support, the hypothesis.
Once you’ve gotten used to the idea that the hiring process is one where the purpose is for each side of the potential match to find reasons to say no, a lot of things become clearer — what the process might be, what job ads should say, and what your approach you should take to criteria and questions. We’ll talk about criteria and questions next time.
Make Boring Plans - Camille Fournier
Riffing off of Chose Boring Technology, Fournier advocates for making boring, plodding plans - well thought out, well communicated, based on trying things, making sure they work, and incrementally implementing.
While it’s absolutely true that teams can be very motivated by audacious, ambitious goals, the plans for getting should be nice and boring. This is especially true when one has to incorporate excitement somewhere else; “Novel Technology Deserves Boring Plans”.
4 Lessons Learned in 2020 as an Engineering Manager - Luca Canducci
I’ve been avoiding 2020 retrospective articles - honestly, who really wants to look back - but Canducci’s lessons and descriptions here are good:
Markets, discrimination, and “lowering the bar” - Dan Luu
There’s a common, dumb argument that there can’t be sustained discrimination in a competitive hiring market place, because competition would have gotten rid of any such inefficiencies.
Needless to say trying to negate actual reality with pure thought doesn’t work well, and Luu’s article here puts this article to rest. This is an older article that is extremely relevant to the hiring process text above; how you aren’t trying to hire some “best” candidate out there - and that even if you were, making steps to ensure you have a diverse candidate pool and making a point of hiring candidates who face discrimination is the very opposite of “lowering the bar” and so presumably having worse outcomes.
Questionable Advice: “How do I feel Worthwhile as a Manager when My People are Doing all the Implementing?” - Charity Majors>
The Non-psychopath’s Guide to Managing an Open-source Project - Kode Vicious, ACM Queue
Majors’ article is a good reminder for new managers that it’s really hard to recalibrate job satisfaction or the feeling of accomplishment when you’ve moved into management. All you can do is focus on the big, long timeline stuff while still taking joy in the little moments, and make sure that you’re a source of calm not chaos in a crisis.
Vicious takes on the same topic but from the point of view of a new open source maintainer, which is managing a software development team in hard mode. Most of the same rules apply.
Newcastle University Research Software Engineering 2020 (PDF) - Newcastle Research Software Group
BEAR - Advanced Research Computing Research Software Group 2020 Report (PDF) - Birmingham Research Software Group
These two reports on the 2020 activities of the research software development groups at Newcastle and Birmingham are extremely interesting if you run a research software development core facility-type operation, and very interesting even if you don’t int terms of the clear product and strategy mindset (and communications efforts) behind the groups. In Newcastle’s, we get some interesting overviews of what 2020 held:
Their service offerings are discussed - offering MSc computing projects supervised by faculty and the RSE group, offering coding services with fractional FTEs written into grants They also are pretty transparent about where they’re going; they’ll changing to a simpler and easier to administer model but one with a little less certainty - rather than allocating (say) 40% of a staff person to a project, they’ll be charing facility day rates like a core facility. This will greatly simplify taking on shorter-term projects.
They’re also getting into higher-level services, like architecture/design training and consulting rather than doing the hands on work, and trying to tie into the institutional open data repository.
Birmingham’s is just as interesting, and reflects an organization with a different focus. The team has up to this point been a small number of centrally-provided RSEs plus dedicated RSEs for different colleges/schools, all sitting together; rather than longer-term contract software development they provided free, time-limited advice, coaching, coding, and mentoring. (The coaching is particularly interesting to me, as I hadn’t heard that before; they won’t be hands on keyboard but sitting with and advising as you take on projects - or even reviewing your code).
They’re also responsible for software maintenance on the main Birmingham research computing systems, and perform training.
They have started moving, however, to including grant-funded research software developers for longer-term projects, allowing researchers to “hire” a fractional software developer without having to recruit people with expertise they may not be able to judge, and have that intact whole person sitting in a team of colleagues.
Theres are really interesting documents, and our jobs as managers would be easier if more teams wrote up such descriptions of their operations routinely.
Strengths, weaknesses, opportunities, and threats facing the GNU Autotools - Zachary Weinberg
Another very transparent product-focused assessment; a simple but thorough SWOT analyses of the current GNU Autotools stack, which hasn’t been updated in some time (which itself makes the updates harder since the entire process is “rusty”), and which has enormous legacy baggage, but still has opportunities.
Recognizing the value of software: a software citation guide - The FORCE11 Software Citation Team, F1000 Research
A style guide for citations of research software; following the American Psychological Association style guide, something like this
Coon, E., Berndt, M., Jan, A., Svyatsky, D., Atchley, A., Kikinzon, E., Harp, D., Manzini, G., Shelef, E., Lipnikov, K., Garimella, R., Xu, C., Moulton, D., Karra, S., Painter, S., Jafarov, E., & Molins, S. (2020, March 25). Advanced Terrestrial Simulator (ATS) v0.88 (Version 0.88) [Computer software]. Zenodo. https://doi.org/10.5281/zenodo.3727209
is recommended, with the obvious implications for what software authors should make available. There are variations on the link URL for some software archives.
(I have complex feelings about this, due to my entirely heretical belief that much of research software - and necessarily the most-used research software - isn’t in and of itself a research output.)
Creating a Risk-Adjusted Backlog - Mike Griffiths
Here’s an example of a concept that I think research software development teams probably “get”, if implicitly, more than teams in other environments.
Research software development spends much more time further down the technology readiness ladder; we spend a lot more time asking the question “can this even work” than we do “when will this feature ship”. The risks are higher, because most promising research ideas simply don’t pan out. So we spend a lot of time prototyping, fully aware that the answer to “will this work” could well be “no”.
Griffith’s theme is that even as you march up the technology readiness ladder to more and more production code, you should still be explicitly prioritizing risk-mitigation tasks on the backlog rather than just prioritizing the most valuable feature. That might be code cleanup, it might be doing research on uncertain new steps way earlier than seemingly necessary, it might be documentation - it depends on the risks you’re worried about. In our context it might include going to conferences and giving talks about your tool, if the risk is lack of adoption.
Number Parsing at a Gigabyte per Second - Daniel Lemire
And here’s a reminder that efficient deserialization of floating-point numerical data from text is still an open area of research. Even the 3-4x improvement in performance enabled by the algorithm referred to in this blog post (with links to this paper by Lemire) pulling numerical data off of disk is still dominated by the deserialization, not disk I/O, at least on SSDs.
The Next Gen Database Servers Powering Let’s Encrypt - Josh Aas and James Renken
Let’s encrypt, wihch provides certificates for 235 million websites, is powered by a single MariaDB 64-core server with 2TB of RAM and 24 NVMe SSDs. There’s a tendency in tech to want to use the latest distributed DB for things, or in research computing/HPC to complain “why don’t they just write it with MPI”, but if you can have a simpler solution running on one fat node that does the job, that’s usually going to be the right choice.
Firecracker: start a VM in less than a second - Julia Evans
63-Node EKS Cluster Running on a Single Instance with Firecracker - Álvaro Hernández, OnGres
Firecracker is a lightweight, stripped-down VM engine that uses hardware support for isolation, and it’s intended for use as a host for small or short-lived processes (like Amazon Lambda functions) that don’t require support for a million devices like a virtual desktop would.
Evans has a use case that could well be useful for research-computing applications, for interactive analysis or training, where each user gets their own VM, and walks through the process of creating a Firecracker image and starting it up. It’s also (as if often the case with Evans’ posts) a nice collection of related resources.
Hernández has another use case - spinning up a many “node” kubernetes cluster for devel/testing/messing-around-with purposes on a single actual node. In his blog post (and associated gitlab repo) he shows how to use Ansible, Firecracker, and EKS-D, an open source AWS EKS-compatible distro of Kubernetes, to run the cluster on a single, admittedly very large, AWS instance.
Scaling Kubernetes to 7,500 Nodes - OpenAI
If your workflow is just running batch jobs, there are much simpler tools than Kubernetes. But if you want to schedule some batch jobs while also spinning up and down web services, have batch jobs connected both ways to web services, have dynamic software-defined storage and networking…. then Slurm’s going to be a challenge.
The downside for all of that flexibility is complexity and challenges pushing to new scales. Here OpenAI walks through the changes they had to make to networking, deployment of API servers, monitoring tools, and health checks to handle what in simpler, more uniform HPC clusters would be a healthy but not exceptional number of nodes.
PEARC’21 - Full/Short papers due 9 Mar/13 Apr, virtrual conference 18-22 July
PEARC is a great conference about research computing and providing research computing quite broadly. Tracks include
There are also tutorial and workshop proposals due 9 Feb.
Events & Conferences in 2021 - Research Data - Springer Nature
Rather than a single event this is an updated resource of research data events; some events that look intersting are Open Science conferences in Feb and June, and the RDA plenary in April.
A Compaq AlphaServer emulator running OpenVMS on your linux box.
A beginer’s guide to capture the flag events.
Interested in trying Nix, but not so much that you’re willing to wipe a whole machine? dev-env is a local nix env “on training wheels”.
I’ve been looking for better markdown-presentation tools for a while. Marp looks really promising.
Research computing, and simulation, has a history spaning over a half-century. Here’s video and oral history of the field.
We live in an age of static-analysis wonders; the Enzyme package will autodifferentiate functions at the LLVM-IR level, meaning not only does it work with a large number of languages but it will autodifferentiate optimized code, making for potentially substantially faster results.