#90 - 3 Sept 2021

What's your team's specialty? Yak spotting; getting the most out of an internship program; management as a technology; software development waste; rethinking best practices; snakemake for bash scripts; docker desktop alternatives

Quick: what’s your team’s specialty?

Your team’s specialty is its reputation for what it’s good at. Not what you think your team is good at; what matters is what specific thing your stakeholders (funders, clients, institutional decision makers) think your specialty is. What they recommend you for to peers, what they recommend funding you for to decision makers.

In the post-pandemic world, researchers are used to getting their support remotely from anywhere. To compete, your team will need well-defined specialties; and “HPC” or “research software development” isn’t a specialty.

Read on, or go straight to the roundup.

The pandemic isn’t over, but the end of this phase has begun, and with September (the “academic new year”) here, it’s a good time to think about the future. Last October I wrote about what post-pandemic research computing is going to look like, and it’s holding up pretty well. With researchers now very comfortable getting research computing and data support virtually, and with budgets under pressure, there is going to be a lot more competition for research computing and data teams. Research collaborations are going to be looking elsewhere more and more often - to academic teams at other institutions, or to commercial companies (either commercial cloud vendors for compute, or emerging collaborations between well-known names, like NAG and Azure, for services).

This is an opportunity for well run, focussed teams to grow and prosper. But it’s going to take more planning and forethought than in decades past, when one could count on having a near monopoly - on being the only available seller of services to local researchers. It’s going to take developing and maintaining a strong reputation for a small set of specialties.

“HPC” may sound and feel like a specialty within the community, but to researchers and decision makers it’s incredibly generic and so meaningless. It’s not a technical term but a term of advocacy and marketing, one which has come to mean resources for anything from high-throughput batch services to huge tightly coupled simulations to single-node multi-GPU code runs. Even advocates for the term define it as “anything bigger than what a researcher could provide on their own”, which is so generic as to be necessarily meaningless. How can your team’s specialty be “anything”? Is a team really expecting researchers to recommend them for “anything”? There’s a reason why VPRs would be just as happy contracting it out (e.g. see table 2 here).

“Services and expertise for quickly analyzing public-health bioinformatics data”, “a platform for firing off and monitoring aerospace CFD calculations”, “a centre of excellence for digital humanities data curation and archiving”: these are examples of specialities - products, services - that researchers and institutional decision makers can see the value of and be willing to put money into, services and products and teams that researchers can recommend to each other. They are areas where a team could build a strong reputation - they could be the group that researchers recommend to collaborators when they chat about research needs.

“Research Software Development” at least, to its credit, doesn’t pretend to be a narrow specialty - it’s a broad area which can encompass any area of software development in support of research work. As a result, a team can’t have a specialty in “Research Software Development”; it can have a specialty in “web applications and mobile apps for data collection”, or “GIS analysis tools” or “agent-based simulations for social sciences modelling”. But almost certainly not all three at the same time.

Even so, research software development is too specific in one unhelpful sense. It could be that researchers are just looking for your team to write some software for them, hand it over, and be done. But increasingly, researchers are looking not just to be delivered some software, but for a team to host the software, run it, operate it - and/or collect and curate data to be used with the tool, for tests or otherwise. Focusing solely on research software development, as a separate activity from systems operation or data analysis and management, can be overly limiting.

Ok, so what does all of this have to do with competition?

One of my venial weaknesses is spending too much time on twitter. I’m seeing increasing concern there from research computing teams that cloud vendors or teams using cloud vendors are coming into their institutions and winning or trying to win contracts for projects that “should” have gone to the in-house teams. I’m hearing complaints that the external bids are for amounts of money 2x or more what the in-house team says they could do it for. Incredibly (and almost certainly incorrectly) I’ve even heard 10x.

Reader, as hard as it is to believe, those complaining see this as an affront, and a threat, rather than the enormous opportunity it is. (And affront was taken. There were lots of dark murmurings about slick sales teams trying to fool gullible senior administrators. And, you know, I’m sure it’s comforting for the teams that might lose out on these contracts to think that the vendor mesmerized the simpleton decision makers with their entrancing slide decks, and so hoodwinked them into considering an overpriced contract. But (a) have they never seen a vendor pitch? Being sold at for 50 minutes is just as excruciating for senior decision makers as it is for us, and (b) it’s self-serving twaddle to imagine that just because someone higher up made a decision to work with someone else they must clearly be dumb. If they assume the only reason someone wouldn’t work with their team is that the decision maker is dumb, they’re going to end up making a lot of poor and uninformed decisions.)

If a contract at your institution is won - or even in serious contention - that is 2x what you estimate you could have provided the services for, that’s not evidence that the external contractor is overcharging. It’s evidence that your team is undercharging, that you could have proposed doing more to support that project and the researchers, and that you’re leaving money on the table. It’s also evidence that you haven’t fully convinced the relevant decision makers that you can provide that service; they don’t see it as being part of your specialty.

Clearly your institution found it worthwhile to spend or consider spending that 2x, because they understood that it was worth at least that much to them to have those services. A bid for half that amount having failed or being questioned means that they really didn’t believe the in-house team could do it as well. That’s revealed-preferences data that you can use. (And if I truly believed someone at my institution was seriously considering spending a premium of 10x (1000%!) to work with an outside company rather than work with my team, well, that would occasion some serious soul searching.)

Cloud providers and other external contractors do have advantages. They have a library of reference architectures they can deploy, so they can pitch (say) CFD solutions to the mech eng department, and bioinformatics pipeline solutions to the biology department. They can pull from a library of testimonials to demonstrate that they can do the work.

But so can you. You have access to all the literature to search for how others have deployed such solutions. You have (or should have) testimonials from the people that matter - researchers at that very institution. And you have a network of deep relationships in the institution, relationships based on collaboration on research problems. Those relationships, collaborations, and shared expertise are something the external contractors have no chance of matching.

If you’re in danger of losing out on these sorts of competitions, it’s because you’re not communicating your specialities in a way that matters, in a way that’s convincing, to the people who could pay for your services. They can’t see how your “HPC batch services” connect with “a digital twinning platform for building simulation”. They don’t see “GIS exploration for private social sciences data” as being an obvious part of your “Research Software Development” effort - where’s the data part?

You have specialities - if you don’t know what they are, ask the researchers who keep coming back. How do they describe what you do? What would they say your speciality is, how do they talk about you to their colleagues? What would you have to demonstrate to them to have them recommend their colleagues to you?

Once you have those specialities, you can start playing to your strengths, and communicating them endlessly. You can make a point of reaching out, having your team talk at conferences in the specialties, and at departmental colloquia. You can be well-regarded enough in your institution for those specialties that external contractors pitching work within your speciality never get in the door. You can start more easily hiring people that are interested in that specialty. A specialty builds on itself, snowballs. You can start steering future work towards that specialty to build on it, and start directing work well outside the specialty to somewhere else - where it does fit inside their specialty.

Yeah, that last part is scary. Sticking to this path isn’t easy; it means turning down opportunities that aren’t in or adjacent to your specialities. Especially for new teams, eager to please, that’s a hard discipline to maintain.

But as anywhere in research, your team’s reputation is all that matters. Your team has a reputation, has stuff it does and doesn’t do. Did you choose it, did you shape it, or are you content to just let it happen?

Your team can be extremely strong in, specialize in, develop a reputation in, any of a number of things. But not all of the things. Being a manager or leader means choosing.

And now, the roundup:

Managing Teams

Yak Spotting - Aviv Ben-Yosef

A good way of catching yak-shaving (video example here), for ICs or for managers - “Why are you doing this?”


Creating an Internship Program for Software Engineers - Tom Sommer & Gábor Zöld, Level→Up Podcast
Understanding the mentorship mesh - Daniel Peck, LeadDev

Academic research computing teams take on student interns more commonly than teams in industry do. The industry teams who do implement internship programs, however, build them out to take much fuller advantage of the opportunities of such a program than most academic teams (mine included) do.

Internships are a lot of work for those hosting the interns. But that work can pay off, for large enough organizations. Some advantages of a robust internship program, routinely bringing in several interns at a time, include:

  • Constant hiring, and so continual maintenance and improvement of your hiring process, even at times when you’re not hiring permanent staff;
  • Constant onboarding, and so continual improvement of your onboarding documentation, explicit documentation of implicit knowledge, etc, which makes information easier to find for everyone, not just the new interns;
  • Improved recruitment of known-quantity high achieving candidates from your internship alumni;
  • Constant influx of new tools, new ideas, etc. from the interns (a trivial but telling personal example: it was interns who finally got me using VSCode after 20+ years of vi use).

Level→Up engineering, which is a routinely interesting podcast, has started writing up blog posts from their interviews, which is handy (I’m very hesitant to include un-skimmable resources like audio or video in the roundup unless they are exceptionally relevant). In this episode, Zöld interviews Sommer about Redbubble’s internship program, and what’s worked for them.

Some highlights:

  • Actively working with local organizations to source interns
  • Defining clear expectations
  • Having a lightweight version of the same interview process as for permanent staff
  • Interns arrive in cohorts, so they go through the experience together, even though…
  • They are embedded separately into teams, and may rotate between teams
  • Interns have volunteer mentors separate from their managers
  • Interns’ growth is monitored, for their own sake and for future hiring - rapid growth is a stronger signal for later hiring than a high but constant level of technical competence
  • Interns who are on a path for hiring still go through the permanent-staff hiring pipeline

Peck’s article talks about the next stage of onboarding, for permanent hires. For this longer-term mentoring, having a single mentor isn’t enough; Peck prefers a “mentorship mesh” approach, with a team of people providing different kinds of mentorship:

  • A peer they’ll be working with, on the same career path (but somewhat advanced)
  • A senior person who is still further advanced
  • A domain expert in the area they want to develop expertise in
  • Their team lead/mentor

Management as a Technology - Nicholas Bloom, Raffaella Sadun, John Van Reenen, Harvard Business School Working Paper 16-133

Management is important, and the issues involved are complex, and like a lot of things that are important but complex, it’s the subject of a lot of study. That study looks different from what those of us who came up in the natural sciences are used to - people systems are way harder to examine than, say, fluid systems - but it can be every bit as insightful.

This working paper from a few years ago results from interviews with 11,000(!) manufacturing firms in 34 countries, asking whether the company followed industry best practices, and about their practices around process improvement, performance review and tracking, feedback, clear targets, rewarding high performers, removing low performers, and hiring and retaining staff. (The anonymized data is available by filling out a form at http://worldmanagementsurvey.org/.)

They then analyzed the data in a slightly unusual way - using the existing framework for analyzing the adoption of a new technology across firms and countries, and looking to see whether the new technology improved productivity or not. But here the “technology” is good managerial practices.

The technology of management (literally a body of knowledge and techniques) might be a useful mental model. It’s not magic, it doesn’t require inspiration or a particular personality type - it’s the roll-out of a well understood set of practices. That doesn’t make it easy, but hey, adopting new technology can be challenging.

Bloom et al. find that management “technology adoption” accounts for about 30% of the differences in total factor productivity across entire organizations, and even between countries. If anything, I’d guess that number is understated, since management approaches can be pretty heterogeneous within an organization, so there’s likely an averaging effect. I don’t find it hard to imagine at all that well-managed teams are at least 30% more productive than teams with replacement-level management; just take a look at the software development waste article below.


Managing Your Own Career

Ditching high overhead and running your own research institute - Travis Metcalfe, AstroBetter

A lot of us assume that the only way to work on grant-supported research is in academia or adjacent institutions; but that’s not the case. Metcalfe’s article has numbers that are specific to US granting agencies, but granting councils all over the world have, over the past decades, become much more friendly to working with small for-profit or non-profit corporations. In the article, he describes how he now accepts awards from NASA and NSF through his own nonprofit organization.


Research Software Development

Software Development Waste - Greg Wilson, It Will Never Work in Theory
Software Development Waste - Todd Sedano, Paul Ralph & Cécile Péraire, ICSE 2017

Wilson briefly summarizes a paper by Sedano, Ralph, and Péraire, who looked at eight software development projects at Pivotal, a software development company, over two years and five months, interviewed team members, and analyzed retrospectives. They identified nine broad categories of wasted time and/or effort in the projects:

  • Building the wrong feature or product
  • Mismanaging the backlog
  • Having to re-do work
  • Implementing unnecessarily complex solutions
  • Extraneous cognitive load (from technical debt, large/complex stories, poor tools/code, etc)
  • Psychological stress
  • Waiting
  • Knowledge loss, and
  • Ineffective communication

These will likely sound familiar, and can be a useful list to keep to hand as a caution at our next planning meetings.


Kaitai Struct - the Kaitai project

So on the one hand, I think that in research computing we invent our own bespoke file formats way too often, paying more attention to how the data is laid out on disk than to the APIs to get the data on and off disk.

But sometimes HDF5 or the like is overkill (double check though!) and format-create you must. You can improve matters by (a) making sure the format is binary, so you’re not serializing/deserializing to text all the time, and (b) doing it with some sort of well-defined specification, so that it’s easy to have implementations in a number of languages and to test their conformance against the spec.

Kaitai Struct won’t help the “rush to define a byte layout” problem, but it is a well-established (if new-to-me) tool and compiler for a declarative YAML-based definition of a binary format (for files or for transport), which then “compiles” to readers for a number of languages, including Python, Java, JavaScript, C++, and Go. The spec serves as a form of documentation as well as a source for code generation.
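To give a flavour of the approach, here’s a minimal sketch of a hypothetical .ksy spec - an illustration of the syntax, not an example from the Kaitai documentation - describing a toy record file (a magic number, a count, then that many doubles):

```
# toy_records.ksy - a hypothetical, minimal Kaitai Struct spec
meta:
  id: toy_records
  endian: le
seq:
  - id: magic
    contents: [0x54, 0x4f, 0x59, 0x31]   # the literal bytes "TOY1"
  - id: num_values
    type: u4                             # 32-bit unsigned record count
  - id: values
    type: f8                             # 64-bit floats
    repeat: expr
    repeat-expr: num_values              # read exactly num_values of them
```

Compiling this (e.g. kaitai-struct-compiler -t python toy_records.ksy) generates a ToyRecords class whose from_file() method parses such a file; re-running with -t javascript or -t go gives you matching readers in those languages from the same single spec.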


Research Computing Systems

Rethinking Best Practices - Will Gallego

This is a nice thoughtful article on the role best practices do and should play.

I find that in technical fields, there are two bad and opposite attitudes to best practices: the overly-deferential, unquestioning “that’s what everyone does”, and the just-as-unquestioning, knee-jerk “that would never work here”. Gallego walks us through a third way, with a quote from a 2016 paper by Klein, Woods, Klein, and Perry in J. Cognitive Engineering and Decision Making:

We should regard best practices as provisional, not optimal, as a floor rather than a ceiling.

If you’re starting from scratch, or even rebooting practices in a team, industry best practices are the right starting point. They’re common for a reason, and if you choose a different starting point odds are better than even that you start off further from optimal. They also give you access to a language and a literature to discuss and compare approaches with other practitioners.

But effective teams are constantly experimenting and measuring, and so learning, what works for that group of humans on that set of problems. Very little should be left to received wisdom.


Using snakemake to do simple wildcard operations on many, many, many files - C. Titus Brown

Researchers often have libraries of bash scripts or makefiles for processing data, especially when there are lots of files - but error-handling and logging code is hard (and prone to bugs! - e.g. #11 with the Ariane 5 crash, and #80 with iPhones). So bash scripts are typically not particularly robust, and make’s “the filename is the metadata” approach works really well right up until it doesn’t.

Workflow management tools are starting to get simple enough to set up and get started with that they can be plausible replacements for scripts or make for these routine tasks. They’re also becoming popular in bioinformatics, where abusing the filesystem with fragile bash scripts has been a traditional way of life.

Snakemake is a popular workflow package for those already doing a lot of work with Python. In this short article, Brown shows how to replace a simple bash-loop script with a snakemake file (sketched below), which is a little longer but has several advantages:

  • Extends in complexity gracefully (Python libraries and templating readily available)
  • Automatically gives parallelism, like make, but allows external metadata sources readily
  • Robust error handling
  • Built-in logging available
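For a concrete sense of what that looks like, here’s a minimal sketch of the pattern - hypothetical file names, with a trivial gzip step standing in for a real pipeline, not Brown’s exact example:

```
# Snakefile - minimal sketch with hypothetical paths; not Brown's exact example
# Every inputs/<sample>.txt should end up as a compressed outputs/<sample>.txt.gz
SAMPLES = glob_wildcards("inputs/{sample}.txt").sample

rule all:
    input:
        expand("outputs/{sample}.txt.gz", sample=SAMPLES)

# One wildcard rule replaces the bash for-loop; snakemake handles the looping,
# parallelism (e.g. `snakemake -j 8`), logging, and re-running only what's
# missing or out of date
rule compress:
    input:
        "inputs/{sample}.txt"
    output:
        "outputs/{sample}.txt.gz"
    shell:
        "gzip -c {input} > {output}"
```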

Emerging Technologies and Practices

Docker with IPv6 and Network Isolation - Sumit Khanna

One - maybe overkill - way to avoid the confusion of host and container networking, and to be relatively certain the local container network is isolated, is to have the local container network be over IPv6. Docker handles this natively but the configuration isn’t obvious.

Here Khanna walks us through setting up an IPv6 NAT (to avoid having to have a dual IPv4/6 stack) and then configuring a load balancer and services behind the NAT in its own isolated IPv6 network.
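As a very rough sketch of some of the moving parts - illustrative values only, not Khanna’s exact NAT configuration, which has more to it - enabling IPv6 in the daemon and putting services on their own isolated IPv6 network looks something like:

```
# /etc/docker/daemon.json - enable IPv6 on the default bridge with a
# unique-local (fd00::/8) prefix (illustrative values, not Khanna's setup):
#   { "ipv6": true, "fixed-cidr-v6": "fd00:dead:beef::/64" }

# An isolated, IPv6-enabled user-defined network for the services to share:
docker network create --ipv6 --internal --subnet fd00:dead:beef:1::/64 backend

# Attach a (hypothetical) service container to it:
docker run -d --name api --network backend myorg/api:latest
```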


Migrating from Docker to Podman - Marcus Noble
An Overview of Docker Desktop Alternatives - Matt Rickard

If you use Docker Desktop, you’ve probably heard that while it remains free for personal use and non-commercial open-source development, it will cost money for professional use in other contexts.

We’ll wait to see how that plays out for (say) research use, which is often (but not always, though maybe mostly should be) open source. In the meantime, you might want to check out some alternatives, especially if it’s a tool that only gets used intermittently.

BUT please consider not running for the hills every single time a productivity tool starts charging money.

We’re pretty hypocritical about this in research computing. We (rightly) complain that it is way too hard to get ongoing funding for research computing and data products (software, data resources, etc). And it is! But as soon as a tool we use isn’t free any more, we bail.

This particular tool is going from free to something like $7/mo/team member. $84/year/team member for a tool that the team finds useful is not a lot of money. If this particular tool genuinely isn’t worth that for your team members, because they only use it occasionally, then ok, cool, makes sense. But - stuff costs money. If there’s stuff that makes your team more effective, please consider finding the money.


Calls for Submissions

Gateways 2021 - 500 word abstracts due 22 Sept; conference online, 19-21 October

From the website: Topics include, but are not limited to:

  • Architectures, frameworks, and technologies for science gateways
  • Science gateways sustaining productive, collaborative communities
  • Support for scalability and data-driven methods in science gateways
  • Improving the reproducibility of science in science gateways
  • Science gateway usability, portals, workflows, and tools
  • Software engineering approaches for scientific work
  • Aspects of science gateways, such as security and stability
  • AI and ML for science gateways
  • Social research on science gateways
  • Use cases and lessons learned from science gateways

Events: Conferences, Training

INTERACT for Engineering Leaders - 30 Sept 2021, Online, Free

From the website,

Interact is the community driven conference for engineering team leads, managers, VPs and CTOs looking to improve themselves and their teams.


PMI Kickoff: Free Project Management Training (Review) - Elizabeth Harrin, Girl’s Guide to PM

A review of a short (45 min) free intro to project management training from PMI - worth thinking about if you’re considering having someone on your team play more of a role in managing projects.


Random

A 1981 TRS-80 adventure game found, fixed and put online.

Making homemade silicon chips.

Awk at scale to process 256 M records, or sort/uniq/comm to compare lists.

Learning modern C++ by writing a JSON parser from scratch (editor’s note: do not write a JSON parser from scratch).

We often write blog posts etc. to illustrate research software development topics: Klipse lets you easily include runnable Python, Go, or JavaScript snippets (or Clojure or Scheme if that’s your thing. I don’t judge).

Go is pass by value, and references to slices get… complicated.

The tac command, if you’re not familiar, is cat but prints the lines in reverse. Ever been tired of how low-performance tac is? Finally, using SIMD acceleration in Rust to create the world’s fastest tac.

A unikernel implementation of a VPN using fireguard.

Writing a virtual machine (in the sense of the Java or Erlang VMs, or the original LLVM) for the Little Computer 3, in C and in Rust.

A discussion of Linux’s three “stopwatch clocks”, CLOCK_MONOTONIC_RAW, CLOCK_MONOTONIC, and CLOCK_BOOTTIME: why they exist when Linux already has a wall-clock time (CLOCK_REALTIME), and where they may or may not fail you.

Computing technologies come and go, but deep research computing expertise doesn’t really have an expiry date. Here’s a nice article on an image processing person being introduced for the first time to FFTs when they were looking instead for machine learning magic to improve scans of printed images.

Interesting: a commercial (but initially free for modest use) whole-system profiler for C/C++, Rust, Go, Python, and more - Prodfiler.

An interesting comparison between the “efficiency” and “performance” cores on the M1, which may be relevant to other big-little architectures: for a vector dot product, the efficiency cores can (but normally wouldn’t) perform within a factor of 2 of the performance cores.