#80 - 25 June 2021

Hi there, everyone!

It’s been a busy week here of contacting recommended potential job candidates, building consensus among stakeholders, debugging code, and juggling finances - a normal set of activities in research computing management land, where if we’re very lucky we got some training in maybe one of those things.

I’m starting to see some out-of-office notifications in response to newsletter issues, which genuinely delights me - it’s been a long 16 months and we all deserve a break. I hope that those who haven’t managed to take significant time off in a while get the opportunity over the summer or fall.

I also had a very nice conversation this past week with one reader; we’re looking into one possibility for a community forum (which might make unnecessary a revisit of the “Ask Managers Anything” feature from last year), and toying with the idea of community video chats. I’m pretty excited about how a couple of those things might work.

But for now, on to the roundup, and the weekend:

Managing Teams

How to Re-Onboard Employees Who Started Remotely - Rebecca Zucker

Over the last 16 months or so, many of our teams have hired new people - if you are planning to go back into the office, even part-time, Zucker reminds you that you still have some onboarding tasks for those newer team members! Introduce them to people in person, show them the facilities and where everything is and how things are done, etc.

The Next Great Disruption Is Hybrid Work—Are We Ready? - Microsoft WorkLab

Most of the material here won’t be new to long-term readers, but it’s a data-driven exploration of the fact that for office workers, especially technical staff like us, flexible work - including the huge phase-space of ”hybrid” options - is here to stay, and it’s going to be challenging.

One point that I don’t see addressed enough is their point five about shrinking networks; Microsoft WorkLab has taken a look at anonymized outlook emails and team meeting statistics and found that networks have shrunk - there is more communication between immediate teams and less more broadly.

I think how that plays out will depend on the team - for instance, for us (a multi-institutional collaboration) the silos between sites more or less disintegrated between our team, but there are fewer communications between each sites team and other groups within their site. For instance, our team moved institutions during the pandemic, and we have very little communication with other groups in the new institution.

This can be addressed deliberately, but it takes work.

How to manage former peers as a new manager - Claire Lew

It’s kind of goofy that one of the most awkward situations for new managers to navigate is the most common situation we put people in for their first management role in research computing - taking their old boss’ job and managing their peers.

This is not actually a difficult role - you have a huge advantage over taking over a new team both by knowing the work of the team and the team members, and being a known quantity. Coming in from outside to run a team is challenging too! But for new managers who are uncomfortable getting used to the idea of being the manager, managing their former peers feels awkward. There’s really nothing to be done about that except accept it and move forward anyway - the sooner you stop seeming awkward about things, the sooner your team members will adjust.

Lew’s specific suggestions are to not avoid one-on-ones and to take consistent action.

Managing Your Own Career

Time Management Won’t Save You - Dane Jensen, HBR

Just a reminder that time management will help you be more efficient at getting discrete tasks done, which is all well and good, but that’s far less important than being discerning about what you choose to do.

Product Management and Working with Research Communities

Experiences and lessons learned from two virtual, hands-on microbiome bioinformatics workshops - Dillon et al., PLOS Comp Bio

An overview of the (first ever) virtual QIIME-2 workshops held by this team over the pandemic. Many people on this team had been involved in multiple in-person workshops in the past, but despite requests had never run a virtual one. This gives an overview of not just one but two workshops, and how they learned from the first to improve the second.

In the first, they:

More cleanly separated lecture from hands-on components, so the lectures could be watched asynchronously
Pre-recorded sessions, made them available ahead of time, and streamed them live over YouTube along with live discussions and segues; this was done by an emcee using Open Broadcaster Software and Skype (which integrates with OBS more nicely than Zoom)
Interaction was handled by (paid) slack, and in addition to the all-participants channels there were instructor-only channels and ”pod” channels assigning learners into smaller subgroups
One instructor was in charge of OBS and video streaming, and one in charge of slack question triage, with cross-training incase either became unavailable
As usual, they had cloud instances for participants with pre-loaded data, with easily used supported methods for connecting (SSH app in Google Chrome)

In the second, they made a few changes:

They used cheaper self-hosted Zulip, which also made it easier to bulk-invite participants; users seemed broadly happy with it
They tried to do without the chat-question triage manager instructor, and it didn’t work as well; they’ll return to having a triage manager in the future

Making a PACT (Purpose, Attendees, Community, Tech Tools) for More Engaging Virtual Meetings and Events - Centre for Scientific Collaboration and Community Engagement
Scientific Community Profiles - Centre for Scientific Collaboration and Community Engagement

Relatedly, the CSCCE has released the fourth section of their guidebook to virtual events, and related webinars (seebelow). To my mind, the first section of the book, ”A guide to virtual events to facilitate community building: event formats” is the most valuable - it lists twelve kinds of events and suggests formats and tools for each.

CSCCE also released a collection of thirteen community profiles, to give a sense of what some other major groups are doing for community engagement - what kinds of events they run, what their community is like, the resources needed to manage the community.

Research Software Development

Quick Analysis for the SSID Format String Bug - Zhi

Way back in #11 when talking about the Ariane crash we had a paper describing how error handling or logging code was implicated in surprisingly many big failures. I mean, who ever tests their error-handling or logging code? Well that bug you probably read about where connecting to WiFi APs with SSID of ”%p%s%s%s%s%n” would disable an iPhone’s Wifi? Yeah, it was in logging code.

What Every Programmer Should Know About SSDs - Viktor Leis

A quick overview of SSDs from a software developer’s point of view:

There’s much more parallelism available in reads and writes
On-disk cache can hide it, but writes are 10x slower than reads - to hide that latency at volume you’ll need to use the concurrency
Out-of-place writes and garbage collection means writes can lead to significant write amplification, increasing wear and slowing writes down further

This all means that a lot of thought that has to go into designing write patterns that perform well.

Don’t use raw loops - Thomas Lourseyre, Belay the C++
Subclassing in Python Redux - Hynek Schlawack

Two articles exhorting us to use modern language features and some thought in two languages in common use in research computing - C++ and Python.

In the first, Lourseyre echos and expands on Sean Parent’s earlier call to arms, ”No raw loops”. With C++’s extensive and growing algorithm library, C-style for (i=0; i<n; i++) approach is less explicit about what is happening, less likely to be able to take automatic advantage of parallelism, and avoids common off-by-one errors.

In the second, Schlawack goes on a deeper tour of different reasons and approaches to subclassing in Python, and along the way points out the benefits of Protocols and type checkers like Mypy.

Using pre-commit hooks makes software development life easier - Werner Dijkerman

Dijkerman walks us through setting up git pre-commit hooks and gives us some examples (linters, code style checkers, index builders, etc.) of ways that you can automate certain routine work before commits are even made to make sure that code reviews don’t waste time on these automatable details and can focus on more meaningful review.

Research Data Management and Analysis

Datastation - The Data IDE for Developers- DataStation Project

Datastation is a cute-looking in-memory data exploration environment with an open-source version that supports javascript, python, and SQL all in-browser.

Research Computing Systems

7 Lessons From 10 Outages - Tom Kleinpeter, The Downtime Podcast

In #75 we mentioned the new podcast, ”The Downtime Project”, reviewing high-quality post-mortems of notable failures. They’ve gone through 10 post-mortems in their first season, and reflect on what they’ve learned, finding 7 lessons that each touch on multiple of the failures (big failures never having merely a single cause):

Circular dependencies will break your operational tools - for instance, don’t send your telemetry data to your production database
Dumb automation is more robust than ”smart” automation
Don’t run huge operations on production databases
Avoid ”magic” middleware in data persistence
Backups are useless, restores are everything, test your restores
Roll out in stages
Have tooling (runbooks and switches) in place to prepare for failure

Counterfactuals are not Causality - Michael Nygard

Relatedly, when you’re digging into the (likely multiple) causes of a failure, Nygard reminds us that things that didn’thappen can’t, necessarily, be the cause of something. To steal an example from the post, ”The admin did not configure file purging” is not a cause. It can suggest future mitigations or useful lessons learned, as ”we should ensure that file purging is configured by default”, but looking for things that didn’t happen is a way for blame to sneak in and takes our eyes off of the system that lead to the bad outcome.

CentOS replacement distro Rocky Linux’s first general release is out - Jim Salter, Ars Technica

As you know, I think long-term stable operating systems are a trap for systems teams. But it’s a trap a lot of people are stuck in, and so it may be of interest to know that the first release of Rocky Linux 8.4, a stable plug-in replacement for CentOS 8.4, is out, with migration scripts to aid in moving from CentOS 8.4 as well as others.

Your CPU May Have Slowed Down on Wednesday - Travis Downs

Intel recently updated microcode for skylake and icelake which significantly slows down zero-fill of memory pages, to mitigate yet another side-channel vulnerability.

Emerging Technologies and Practices

In the search for performance, there’s more than one way to build a network - Brendan Bouffler, AWS HPC Blog

One of the few major differentiators between the two biggest commercial cloud providers, AWS and Azure, is how they handle networking for large-scale HPC workloads. Azure has made the much easier-to-explain but harder-to-operate and integrate choice of having resources available that are connected by Infiniband - but because those nodes are special, they’re harder to come by. AWS has gone the engineering route of building a more internally consistent solution, at the cost of making it much harder to explain to consumers - building their own.

In this article, Bouffler argues the AWS side. What matters in real networks even on HPC clusters, he claims, is less median latency than tail latency, and by guaranteeing QoS and using datagrams rather than ordered packets, scalable reliable datagrams (SRD) over the elastic fabric adaptors can provide robust stained performance while still giving access to datacenter-scale compute resources.

I think it’s fair to say that scaling over Infiniband still beats EFA + SRD for a lot of real-world codes, but I also don’t think that it’s a given that things will stay that way.

Events: Conferences, Training

Juliacon 2021 - 28-30 July with workshops in preceding weeks, free but registration required

Julia, a programming language of growing interest in some research software development communities, is holding their annual conference at the end of July, with both talks and a number of workshops.

Scientific Community Engagement Fundamentals - 23 Sept - 4 Nov, two sessions a week, $750

The Center for Scientific Collaboration and Community Engagement - we mentioned a couple of the resources earlier in the newsletter - is running a 7-week course on managing scientific communities:

Scientific Community Engagement Fundamentals is designed to offer new or existing community managers core frameworks and vocabulary to describe their community’s purpose, refine or create strategic engagement programming to match community member goals, and describe their own roles and value as a community manager in STEM. While the content is designed for any level of learner, it should not be thought of as a ”beginner” course. Rather, it is intended to create common ground so that scientific community managers can converse across disciplines, more efficiently learn from one another, and build successful engagement strategies that are grounded in research.

Also relevant are individual webinars:

Making a PACT for engaging virtual meetings and events - 20 July, $150

Event Planning: Selecting tools to supplement your online meetings and events - 3 Aug, $150

Event facilitation: Making decisions during your virtual meetings - 17 Aug, $150

Event facilitation: Supporting information exchange before, during and after virtual meetings and events - 31 Aug, $150

Hybrid events: Making hybrid events the best of both worlds - 14 Sept, $150

Random

There are enough static analysis tools out there for code that early adopters are starting to switch to second-generation tools. Here’s semgrep’s take on GitLab’s switch from Bandit to Semgrep for python code.

It turns out that ”for historical reasons”, you can name bash functions almost anything, including emoji, and have long been able to.

I’m a big fan of ”learn how X works by building one” kinds of tutorials - here is a 10-part series of building a debugger, with the full source code on GitHub.

Cloning a PC into a virtualbox image and running it as a VM.

NCBI has released some training material on “Getting Started with Python and Cloud Computing” for bioinformatics.

Graphana dashboards for postgres metrics.

NVIDIA has a performance tuning and resource selection tool for choosing GPUs to run a multi-GPU workload on in a multi-node environment - NVTAGS.

This is the first I’ve heard about the Modelica language, a DSL for simulations based on … ODEs, maybe it looks like? Maybe more? … that seems to mainly be used in the electrical engineering community. Anyone have any experience with it?

An introduction to locality sensitive hashing for approximate neighbour searches, which is not only a nice description but a lovely example of clear visualizations for explaining.

Pocketlang is a tiny, python/ruby flavoured embedded language for including in tools (think e.g. Lua), with concurrency built in.

Canonical has announced that they have Ubuntu 21.04 working out of the box on some RISC-V development boards.

That’s it…

And that’s it for another week. Let me know what you thought, or if you have anything you’d like to share about the newsletter or management. Just email me or reply to this newsletter if you get it in your inbox.

Have a great weekend, if you’re on the West coast of North America try to stay cool, and good luck in the coming week with your research computing team,

Jonathan

About This Newsletter

Research computing - the intertwined streams of software development, systems, data management and analysis - is much more than technology. It’s teams, it’s communities, it’s product management - it’s people. It’s also one of the most important ways we can be supporting science, scholarship, and R&D today.

So research computing teams are too important to research to be managed poorly. But no one teaches us how to be effective managers and leaders in academia. We have an advantage, though - working in research collaborations have taught us the advanced management skills, but not the basics.

This newsletter focusses on providing new and experienced research computing and data managers the tools they need to be good managers without the stress, and to help their teams achieve great results and grow their careers.

RCT Newsletter