#165 - 23 Apr 2023

Holding ourselves to high standards. Plus: Missteps of a new technical lead; Balancing customization and standard offerings; Comparing embedded analytics databases; RDMA within Azure; DDR4 organization and bandwidth; Checklists for psychological safety

I’m going to close out this discussion of focussing on the ends (outcomes and impact) and not the means, and measuring what matters, with a point that I would hesitate to make in a public talk.

One of the nice things about writing a weekly 3000+ word newsletter on research computing and data teams is the selection effect. With you, reader, I can raise some uncomfortable topics. The fact that you take a significant chunk of time out of your week to read and think about managing and leading research computing and data teams well, that you agree such a thing is important, that it’s worth considering our role more deeply, already means that the RCT newsletter community here isn’t full of typical teams.

So here’s the thing. We’re in funny sorts of roles.

The people we report to in our institutions or regions mostly have a pretty fuzzy understanding of what impact our team has now, much less what impact it could have.

Individual researchers and scholars who depend on us to support them, their trainees, and the work of the group are fighting to get their own work done in a funding environment and highly competitive academic establishment which makes that challenging. They deeply understand the context of their work, and how well we’re supporting them in particular, but don’t have a way of assessing the influence we have on research and scholarship beyond that.

Our funders do see the big picture, and thankfully typically give us some key component of funding which is pretty stable for its duration. That’s terrific. But the funding decision makers, too, mostly have a pretty indirect understanding of what our teams can accomplish in advancing research and scholarship. They honestly can’t tell if we’re doing well or not, and more importantly have no real way of assessing if we’re achieving even close to what is possible.

There’s only one set of people who can make us accountable for having as much impact on research and scholarship in our institutions and communities as possible. Only one group can make us live up to our duty to significantly contribute to the success of research and scholarship. Really only one possible community can hold us to high standards.

It’s up to us.

We have outsized ability to shape the tasks of our teams, to decide how to allocate resources, and to weave together the threads of our work so that they add up to something bigger than individual strands of effort.

We’re the ones who do the investigating and reporting on impact, and the ones who can educate our funders and institutional decision makers on the significance our teams can have for research and scholarship in our communities.

We’re the only ones who can try to achieve more. We’re the only ones who can push back on expectations, who can add some coherence to what we’re asked to do, who can aim for better. We’re the only ones who know to what standards we can hold ourselves and our teams, and who can raise those standards over time as we grow and improve.

And that’s an important responsibility.

Because the opposite of success in our line of work isn’t failure.

Realistically, teams like ours are not going to utterly fail to run a computer system or develop software or analyze data. And because of our funding model, we’re not likely to go out of business.

Researchers and scholars are going to get some kind of support in their work. They’ll get some answers to their questions. Their research support teams will be performing some set of activities.

For teams like ours, the opposite of success is mediocrity.

You’ve probably noticed that not all of your peers are as interested in trying new things as you are. That “we’ve always done things this way” is a phrase that comes up disappointingly often in our world. That challenging current practices with quantitative and qualitative data is not always welcomed. That driving for continuous improvement and professional development and greater ambition aren’t as universal in the RCD world as you had once assumed.

We can’t fix other teams. Even making change within our own spheres happens only slowly and with sustained effort.

But within those spheres, we do have the opportunity to make a real difference.

We can choose to measure what really matters. We can focus on the outcomes and impact that truly make a difference to research and scholarship. We can hold ourselves and our teams accountable to our current level of impact, and slowly raise the bar as we learn what works and get better and better.

It’s not just about showing our team members the impact they can have. It’s not just about getting recognized by our institutional and funding decision makers for that impact, although hopefully that will happen too. It’s not just about modelling what’s possible. It’s about making a meaningful contribution to the wider world of research and scholarship, in a world that very much needs knowledge to keep advancing.

In the end, we know that the work we do is too important to settle for “the way we’ve always done it”. Focussing on how our work has impact, measuring that impact however imperfectly, and having high standards for that impact helps us communicate what’s possible to the people who need to hear it. We are the ones who have the power to make that happen, and to make a real difference.


One note before the roundup - last week I highlighted the fantastic student workforce program at Northwestern Research Computing Services, but I left out a key contributor to the work. My apologies to Colby Witherup Wood - I’ve updated the article. Thanks to all three for the excellent work and for sharing it with the community!

Now on to the roundup!


Managing Individuals and Teams

At Manager, Ph.D. this week I talked about how those of us from the research world have basically trained our whole careers in the advanced management skills required to handle remote and asynchronous teams.

In the roundup, I covered:

  • A Miro report surveying knowledge workers on asynchronous work: people want more asynchronous work, are more comfortable sharing ideas asynchronously, and there are diversity benefits - but we need to do a better job of setting positive expectations and onboarding juniors
  • An article by Molly Graham on Minimum viable process: keep them nimble and changeable, and keep their purpose in mind
  • An article by Daniela Ostovic teaching us that just like we work with team members to find them tasks that stretch their skills in directions they want to grow, we should do the same for ourselves
  • An article by Deb Liu on getting better at reading the room.

Technical Leadership

7 common mistakes when growing from an engineer to a team lead - Gregor Ojstersek

Ojstersek outlines seven mistakes he’s seen when someone technical becomes a team lead; the issues are ones we cover here regularly:

  1. Not letting go of control
  2. Lack of transparency - my own note here is that it is shockingly easy to get out of the habit of sharing information with the team, even unintentionally, when you get very busy. Transparency is something you have to keep making yourself do; it’s not automatic
  3. Too much shielding of the team
  4. Trying to do too much overall - delegate delegate delegate
  5. Doing too much preparation work for team members
  6. Trying to be involved in every detailed small decision
  7. Not trying to get to know the team members a bit more personally (he recommends one-on-ones, and I agree!)

Product Management and Working with Research Communities

Balancing Customization with Business Objectives - Amy Mitchell

I talk with a lot of research computing and data teams - research software development, bioinformatics services, data science services especially, but to a lesser extent research computing systems teams too - who struggle to establish standard offerings. All the requests coming in are custom in some way, and it’s easy to focus on those differences.

But there are advantages to establishing a clear standard offering, making that available, and then having a process for handling customization. By having a standard repeated process for the common pieces, the team can get extremely efficient at it (which is good, because it means we can support more science). We can offer a fixed price for the standard piece, which has a lot of advantages (#140). With the customizations clearly documented, we can start to more clearly categorize those over time and look for commonalities there, too.

Mitchell writes about establishing a balance between customization and standardization. In this article, the context is a software product business, but it applies in our contexts too. (And yes, we have “business objectives”, although we don’t talk about it that way - our business objective is to advance research and scholarship as much as possible, given our resources and priorities. We even have “sales motions”, although we use phrases like “project intake” or “incoming support tickets” or… )

Mitchell lays out a three-step process for establishing a balance between standardization and customization; the steps for us would be:

  • Clearly define what is standard and what is custom
  • Make customization requests a clear part of the discussion with researchers when discussing an engagement with them
  • Communicate the distinctions you’re making, frequently

I asked Mitchell about using the same approach to refactor out a new standard offering from commonalities in bespoke engagements:

Yes it can work for a new offer or to establish a bit of a standard on similar bespoke projects. It is important to start small and celebrate completing a part.


Research Data Management and Analysis

Comparing embedded databases for OLAP workflows - Roman Zaynetdinov

Zaynetdinov compares DuckDB, Polars/Arrow, SQLite, and Apache DataFusion with some simple tests.
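
If you want to try this kind of comparison on your own data, it only takes a few lines. Here’s a minimal sketch (not Zaynetdinov’s benchmark; the parquet file and its columns are hypothetical) running the same aggregation with DuckDB and with Polars’ lazy engine:

```python
import time

import duckdb
import polars as pl

PARQUET = "events.parquet"  # hypothetical file with columns: year, amount

# DuckDB: run SQL directly against the parquet file
t0 = time.perf_counter()
duck = duckdb.connect().execute(
    f"SELECT year, SUM(amount) AS total FROM '{PARQUET}' GROUP BY year"
).fetchall()
print(f"duckdb: {time.perf_counter() - t0:.3f}s, {len(duck)} groups")

# Polars: build a lazy query plan over the same file, then execute it
t0 = time.perf_counter()
pols = (
    pl.scan_parquet(PARQUET)   # lazy scan; nothing is read yet
    .group_by("year")          # called 'groupby' in older Polars releases
    .agg(pl.col("amount").sum().alias("total"))
    .collect()                 # execute the plan
)
print(f"polars: {time.perf_counter() - t0:.3f}s, {pols.height} groups")
```

Adding SQLite or DataFusion to the comparison follows the same shape, though SQLite needs the data loaded into a table first; the interesting differences tend to show up once the data no longer fits comfortably in memory.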


Research Computing Systems

Empowering Azure Storage with RDMA - Bai et al, NSDI 2023

This is a fascinating paper on how Azure heavily uses RDMA (including RoCE) internally in their data centres, especially for storage:

Today, around 70% of traffic in Azure is RDMA and intra-region RDMA is supported in all Azure public regions. RDMA helps us achieve significant disk I/O performance improvements and CPU core savings.

They use it not only within clusters, but across data centres in the same region, over long-haul links as long as tens of kilometres:

…there are large RTT [round-trip time] variations from several microseconds to 2 milliseconds within a region, due to long-haul links between T2 and RH.

They’ve developed their own protocols over RDMA to abstract the software layer: one protocol in kernel space (for the storage front end, where client nodes read from and write to storage) and another in user space (within the storage back end). This allows them to handle things like failover gracefully:

On top of sU-RDMALib [the library for the user-space back-end protocol], we built modules to enable dynamic transitions between TCP and RDMA, which is critical for failover and recovery. The transition process is gradual. We periodically close a small portion of all connections and establish new connections using the desired protocol.
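
To make that pattern concrete, here’s a toy sketch of the gradual transition (entirely hypothetical code, not Azure’s sU-RDMALib; the Connection class and its methods are invented for illustration):

```python
import random
import time

class Connection:
    """Stand-in for a storage connection that can run over 'rdma' or 'tcp'."""
    def __init__(self, transport):
        self.transport = transport

    def close(self):
        pass  # drain in-flight requests and tear down (elided)

    def reopen(self, transport):
        self.transport = transport  # re-establish over the desired transport

def migrate_gradually(connections, desired, fraction=0.05, interval_s=1.0):
    """Move all connections to the desired transport a small batch at a time."""
    while any(c.transport != desired for c in connections):
        stale = [c for c in connections if c.transport != desired]
        batch_size = min(len(stale), max(1, int(fraction * len(connections))))
        for conn in random.sample(stale, batch_size):
            conn.close()
            conn.reopen(transport=desired)
        time.sleep(interval_s)  # spread the connection churn out over time

# e.g. fail over from RDMA to TCP without dropping the whole pool at once:
pool = [Connection("rdma") for _ in range(200)]
migrate_gradually(pool, desired="tcp", interval_s=0.1)
```

The point of the gradual approach, as the paper describes it, is that at no moment is the whole connection pool down, so a transport failover looks to the application like a brief partial slowdown rather than an outage.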

They heavily used SONiC to manage and test their RDMA fabrics, which are highly heterogeneous, with several generations of NICs and switches. The testing is particularly extensive: they developed a suite of 32 software and 50 hardware test cases.

(Figures 7-10 of the paper show lower CPU usage on servers and hosts, faster message completion time, and reduced latency for reads and writes with RDMA vs TCP.)

They outline a very pragmatic set of steps they took to solve some of the problems they found:

  • Wide variations in implementations of Data Center Quantized Congestion Notification (DCQCN)
  • Hidden fences in FMR requests stalling one RDMA send until another had completed
  • Priority-based Flow Control (PFC) storms, and PFC and MACSec interactions
  • “Congestion leaking” - the slowest flow on a NIC would throttle all other flows
  • Issues with loopback RDMA

And they share some lessons learned and recommendations for vendors:

  • RDMA failovers are very expensive, especially since the CPU cycles saved by using RDMA in the first place have already been given over to other uses (e.g. customer VMs) and are no longer available for a TCP fallback
  • The internal network (host network) of fat nodes is increasingly complicated, and should be merged with the outward-facing external network
  • Switch buffering is increasingly important and needs more innovation - it sounds like too-small switch buffers routinely caused problems in the Azure data centres, especially with the wide spread in latencies
  • The lack of clear expected standard behaviour with DCQCN, PFC, MACsec, etc. was a continuing source of problems
  • Testing new network devices at scale is hard

Despite those challenges and the work they had to put in, the charts above show why they did it; at cloud scale, a 20% reduction in CPU usage and a factor of 10 reduction in message completion time add up to very large cost savings.
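
A quick back-of-envelope calculation shows the scale (the fleet size and core count below are made-up illustration numbers; only the roughly 20% CPU reduction comes from the paper’s figures):

```python
# Hypothetical fleet numbers; only the ~20% CPU reduction comes from the paper's charts.
servers = 100_000           # made-up fleet size
cores_per_server = 64       # made-up core count
cpu_freed_fraction = 0.20   # ~20% host CPU reduction with RDMA vs TCP

cores_freed = servers * cores_per_server * cpu_freed_fraction
print(f"Cores freed: {cores_freed:,.0f}")                             # 1,280,000
print(f"Equivalent servers: {cores_freed / cores_per_server:,.0f}")   # 20,000
```

Even with invented numbers like these, freeing a fifth of host CPU across a fleet is tens of thousands of servers’ worth of compute that can go back to running customer workloads.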


Xiaomin Shen on the Cloudflare Blog has a good writeup on how DDR4 memory is organized and the effect on memory bandwidth.
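
The core arithmetic behind the bandwidth discussion is simple enough to sanity-check yourself; here’s a minimal sketch (my numbers, not Shen’s) of the theoretical peak bandwidth for a DDR4 system:

```python
# Theoretical peak DDR4 bandwidth: transfer rate x bus width, per channel.
# A DDR4 channel has a 64-bit (8-byte) data bus; DDR4-3200 runs at 3200 MT/s.
transfer_rate_mts = 3200   # mega-transfers per second (DDR4-3200)
bus_width_bytes = 8        # 64-bit channel
channels = 2               # e.g. a dual-channel desktop system

peak_gb_s = transfer_rate_mts * 1e6 * bus_width_bytes * channels / 1e9
print(f"Theoretical peak: {peak_gb_s:.1f} GB/s")  # 51.2 GB/s for dual-channel DDR4-3200
```

Real workloads fall well short of that peak, of course, which is where the organization details (ranks, banks, row buffers) that Shen walks through come in.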


Psychological Safety: Checklists - Psychological Safety Blog

I talk a lot about the importance of runbooks/playbooks and incident response; this article makes the case that checklists for those situations don’t just improve outcomes, they support psychological safety too:

The introduction of checklists for these situations didn’t just improve our outcomes, but also reduced anxiety and stress, and allowed team members to spend their cognitive capacity on more germane problem solving.


Random

Finally, the ability to boot straight into vi without bothering with an operating system. Why bother with all that other stuff?

You probably read about the excitement about an aperiodic monotile (a single shape that tiles the plane without the pattern ever repeating!); here’s how you can actually generate the aperiodic tilings.

This is fun - distributed systems challenges of increasing difficulty by Fly.io and the person who brought you Jepsen - Gossip Glomers.

Bytebeat music with GNU awk, or I guess you could use a compiled language too, using Linux’s entropy.

It’s hard to even keep track of all the openly available LLM models right now - there’s StableLM, Dolly, Vicuna (in the browser!).

Koala is tuned for chats on academic research; it might be useful for augmenting with RCD team documentation…

Course materials from a CMU class on Multimodal Machine Learning.

New version of OpenVMS available if you miss TECO and DECterms and…

A step-by-step build-your-own-database book, and a very simple lock-free hash table.


That’s it…

And that’s it for another week. Let me know what you thought, or if you have anything you’d like to share about the newsletter or management. Just email me or reply to this newsletter if you get it in your inbox.

Have a great weekend, and good luck in the coming week with your research computing team,

Jonathan

About This Newsletter

Research computing - the intertwined streams of software development, systems, data management and analysis - is much more than technology. It’s teams, it’s communities, it’s product management - it’s people. It’s also one of the most important ways we can be supporting science, scholarship, and R&D today.

So research computing teams are too important to research to be managed poorly. But no one teaches us how to be effective managers and leaders in academia. We have an advantage, though - working in research collaborations has taught us the advanced management skills, but not the basics.

This newsletter focusses on providing new and experienced research computing and data managers the tools they need to be good managers without the stress, and to help their teams achieve great results and grow their careers.