#87 - 14 Aug 2021

Research Computing teams are awesome. How do we help peers?; Communicating firings; Effective oversight; GitHub updates; Efficiency and AMD Zen2; Ransomware is about the basics

Hi, all:

So the last newsletter resonated quite a bit (welcome to new members of the Research Computing Teams community)! It was a bit of a cri de coeur about research computing’s inferiority complex that comes from unfairly comparing ourselves to both research and tech industry computing, and - to my mind - explains why we don’t advocate as well for ourselves as we could, support each other as well as we could, and why too often we don’t hold ourselves to high enough standards. It’s easy enough to see some of the results: poorly supported and run teams, not enough institutional backing, demoralized team members, people leaving or checking out.

Today’s is a bit shorter - the beginnings of a call to action.

Because the fact is that in research computing, our teams have enormous advantages over other kinds of work, which can attract and engage the right team members. We have inherently meaningful work (supporting many researchers pushing forwards human knowledge) like, say, nonprofits, while having high and reliable salaries and the ability to work on many different projects and with many different tools.

Our teams - and we as managers - come with terrific advantages that in industry executives pay large sums of money to either train for or screen hires for: if you seek out jobs advancing human knowledge by supporting research, you have a growth mindset, the expectation that people can grow and develop new skills, “baked in”. You’re comfortable with uncertainty. It’s you and your team vs the problem.

We have the benefits of a startup (defining the problem while working on the solution, everyone gets to do a little of everything, wide scope), the mission-driven focus of a nonprofit, and (at universities, say) the large-institution benefits of stable jobs, salaries, pensions, and tuition benefits.

As a result of all this we attract some of the smartest, most driven, most curious people around, who are all eager to engage with hard problems. We just need to help support them, lead them, give them opportunities to grow, and advocate for them - and ourselves.

We’re getting close to having a critical mass of readers here - almost 150 community members. Some joined last week, some are here from a year and a half ago. We’re almost at the point where we can start helping each other, sharing information within the community from member to member, and reaching out to peers elsewhere to offer them help and guidance.

So I’d like to ask you for your advice. What here has helped you, and how can we help other managers or leads together?

  • What problems have you faced, as an early manager/lead or now, which you think peers probably need help with?
  • What resources might help them? Or you?
  • How can we contact other research computing team managers, leads, or those thinking of becoming one? Should we collaborate on a “Ten Simple Rules” paper? Give talks - where? What kinds of materials would peers want to see?
  • What kind of community communications would you be interested in participating in so you’re hearing from others, not just me? A Slack or Discord? Zoom meetings? Something new like Circle?

Email me - hit reply or email jonathan@researchcomputingteams.org - or set up a 15 minute call with me. I appreciate the advice some team members have already given me, and I’d love to hear yours. I’ll summarize results next week.

For now, on to the roundup!

Managing Teams

Words Matter: Is Your Digital Communication Style Impacting Your Employees? - Samantha Rae Ayoub, Fellow

“We need to talk”. “Fine.” These are all messages or responses that would be very uncomfortable for us to receive from our boss; but when things are busy it’s pretty easy for us to communicate in exactly that way with our team members or peers. Your boss (probably) isn’t a jerk, and neither are you, but when we have a lot of things on our mind it’s easy to not pay attention to how our words might seem.

In this article, Ayoub counsels us to routinely put a tiny bit of effort into written and to some extent even video communication with our team members:

  • Put yourself in their shoes - how will this be received? This is really easy to understand with a moment’s effort: before you hit send just read the message over imagining it was coming from your boss.
  • Mind the punctuation - a little “!” goes a long way to a “Thanks” or other message
  • Always respond - even if just to say that you read the message
  • Practice CATTE - make sure your response gives Context, Answer to the question, Timeline (does this need to be done now or is it for a week from now?), Transparency, and some kind of Emotion
  • Balance digital interactions - with whatever kind of face time is feasible

Of Ayoub’s recommendations, I’m personally the worst at always responding, even (especially) when I don’t have time to do anything about the message at the moment. I need to get better at that. I know for a fact that it makes my team members feel ignored (and why wouldn’t it?).

This stuff is especially important with us all being distant from each other. When we switched to everyone working from home, I started putting a lot more exclamation marks in my emails, started using emojis a lot more in Slack, and added a lot of gushing to positive feedback and compassionate (but still firm) tones to negative feedback.

You know how in silent films, it initially seems to modern eyes like the stars are overacting? Compared to today’s movies, they lacked an entire channel of communication - sound - and so they had to make up for it in expression and body movement. It wasn’t overacting; it was acting the appropriate amount to convey the intended meaning given the limitations of the medium.

It was initially uncomfortable and unnatural for me to emote that way in text. But written communication, unleavened by in-person interactions, is pretty limited. You have to “over” emote to effectively convey your intended meaning. Otherwise, people will imagine all kinds of things lurking behind your opaque words, and they are apt to latch onto the worst possibilities.

Three ways to lead effectively when you fire somebody - Sarah Milstein, LeadDev

In a way, this is related to Ayoub’s article above. If you aren’t communicating effectively with your team, that won’t stop people from thinking and talking about the meaning behind actions; it’ll just encourage that thinking and talking to go somewhere farfetched and ugly, unburdened by facts.

We don’t like to talk about firing, and it doesn’t happen often (enough?) in academia or R&D but it does happen; that relative rarity makes it all the more dramatic. Even someone who chooses to abruptly leave for completely benign reasons causes the same waves. Worse, managers are generally (and by and large correctly) constrained in what they can say about why someone has been fired or left. The combination of drama and relative silence makes for stress and speculation for your team members, even if it’s happening outside of your immediate team, leading to fear, uncertainty, and doubt.

Milstein provides some guidance on what to do when you (or someone else) fires someone in your organization:

  • If you’re the manager on the case, have a communications plan, and communicate immediately
  • If you’re a manager but not directly involved, reach out to your direct reports once the news is public
  • If you’re a leader at the company, tell people that this is normal, and regrettable, and it will happen again

As Milstein points out, even if you are extremely constrained in your environment about what you’re allowed to say (and are you sure you’re not allowed to say anything at all, or are you just assuming?), there are lots of things you can say:

  • You can talk about why you can’t give reasons, and communicate that in a way that (correctly) emphasizes that you and the organization care about team member privacy
  • You can talk about the process, including feedback, and that these things don’t come as a surprise - your team members don’t have to worry about waking up fired some day out of the blue
  • You can recognize that this impacts them both in terms of professional relationships and workload, and can come prepared with a plan for handling work

Managing Your Own Career

How to Give Difficult Feedback to Your Boss (Even When You’re Scared) - Karin Hurt, Let’s Grow Leaders

Giving negative feedback to your team members takes some courage the first times you do it as a new manager; when you’re providing feedback to your own manager that’s a whole other level. Here Hurt provides some steps for how to proceed (paraphrased):

  1. Be clear up-front about your intent and goal, and how you’ll communicate
  2. Set up a time to talk in a private place
  3. Be objective and specific - this is a place where it’s especially important to focus on behaviour and impact
  4. Ask for their perspective (and really listen.)
  5. Look for opportunities to help
  6. Follow up

The biggest difference is that, as with your peers, you can’t simply ask your manager to change their behaviours - all you can really do is point out an impact.

A prerequisite to any of this of course is that you trust your manager; if you haven’t felt comfortable raising issues in the past, this probably isn’t the way to start.

And as a manager or lead, keep this in mind for your own team members - you want them to bring you their feedback, and they’ll agonize over doing so in the same way you would with your manager! You can help them by routinely asking for feedback in your one-on-one sessions and in retrospectives, and taking the feedback seriously when it is given.

Product Management and Working with Research Communities

An example of Governance as a Service - Richard McLean

We often either provide oversight to efforts, or enjoy the oversight of others (often some kind of scientific advisory board - we mentioned in #61 that we can learn from nonprofits and their boards how to report to such bodies).

Like everything else, oversight/governance can make things better or worse. It is not carved in stone that it has to be a waste of time. The more we resign ourselves to unhelpful pro-forma meetings, the more likely we make that outcome.

When we provide oversight we can be prepared and provide useful input that gives the team the confidence they need to keep solving the right problems; when we receive oversight we can make sure those overseeing the work are given the right information ahead of time, asked the right questions, and are encouraged to give constructive, helpful guidance.

Here McLean gives an example of receiving oversight going well:

They went into the meeting with:

  • three clear questions that had just come up
  • a presentation on an element of their plan they were concerned about

And left with:

  • An interesting problem and a challenge to consider how they might solve it
  • A useful connection between their work and another project they hadn’t thought of
  • Backing and encouragement

(It also includes a shout-out to a Camille Fournier article we saw in #84 advocating for structured monthly checkins for critical projects).

Just as we have to shake off the inferiority complex about research computing and can hold ourselves to high standards, we have to have high expectations of other parts of our community too - user groups, scientific advisory boards - and hold them accountable to meet those expectations. When they do, great things can happen.

Research Software Development

Visualizing a codebase - Amelia Wattenberger, GitHub OCTO
GitHub’s Engineering Team has moved to Codespaces - Cory Wilkerson, GitHub Blog

A couple of really interesting pieces of tooling are coming out of GitHub this week. In the first, Wattenberger walks us through what I hope will be the first of many visualization tools for looking at the structure of a codebase. This one shows the overall structure of a repository at a glance, showing the directory structure, programming languages, and relative sizes of files, like this one below for numpy; mousing over the files shows filenames. It’s also released as a GitHub Action, so that you can ensure that you have an up-to-date overview of your repo structure in (say) your README; or you could run it locally. It’s a simple but useful overview and I hope more visualization tools continue to come out - we’ve talked in the past (e.g. #20) about how simple metrics like hotspot analysis, readily calculated if you have the version history, indicate where and where not to focus refactoring effort, for instance.

“Fingerprint” visualization of the numpy/numpy repository, showing the directory structure with nested bubbles, and files with their sizes and coloured by language
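
As a rough illustration of how little is needed for the hotspot idea mentioned above, here is a minimal sketch that counts how often each file shows up in the version history. It looks at change frequency only (fuller hotspot analyses usually also weigh file size or complexity), and it assumes the input is the text produced by `git log --name-only --pretty=format:`. The function names and the sample log are made up for illustration.

```python
from collections import Counter


def change_counts(git_log_text):
    """Count how many commits touched each path.

    Assumes the text comes from:  git log --name-only --pretty=format:
    i.e. one file path per line, with blank lines between commits.
    """
    counts = Counter()
    for line in git_log_text.splitlines():
        path = line.strip()
        if path:  # skip the blank separator lines
            counts[path] += 1
    return counts


def hotspots(git_log_text, top=3):
    """Return the `top` most frequently changed files as (path, count) pairs."""
    return change_counts(git_log_text).most_common(top)


# Made-up history with three commits, standing in for real git output.
sample_log = """\
src/solver.py
README.md

src/solver.py
src/io.py

src/solver.py
src/io.py
"""

print(hotspots(sample_log, top=2))
# [('src/solver.py', 3), ('src/io.py', 2)]
```

Files near the top of that list are where refactoring effort is most likely to pay off; files that never change can usually be left alone.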

In the second, Codespaces (#23, #75) is now available for Teams (and if you’re in an academic organization, you can get Teams for free), and Wilkerson summarizes GitHub’s experience switching to Codespaces for GitHub.com development. Codespaces is a configurable VM with your code and dependencies available, that you can ssh into, use with VS Code remote workspaces, or use directly from the browser. After a trial period, it isn’t going to be cheap - $.18/hr for a 2-core VM, $2.88/hr for a 32-core VM - but having a complete working dev environment with your code that a new hire could get working on immediately, even if it’s just for a week or two while they get the environment set up on their own system, could be really valuable. And it’s not hard to see how something like this could be useful for software development training. I’m not sure if GitHub will necessarily be the vendor that makes this sort of approach catch on, but it’s a model which is going to work really well for some use cases.
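
To put those prices in context, here is a back-of-the-envelope sketch for the new-hire scenario. The two hourly rates are the ones quoted above; the eight hours a day and ten working days are assumptions for illustration only, and actual billing rules may differ.

```python
# Hourly rates in USD, keyed by core count, as quoted in the text above.
HOURLY_RATE_USD = {2: 0.18, 32: 2.88}


def codespace_cost(cores, hours_per_day, days):
    """Estimated cost in USD of running one VM for the given usage."""
    return round(HOURLY_RATE_USD[cores] * hours_per_day * days, 2)


# Assumed usage: a new hire working 8 hours a day for 10 working days.
print(codespace_cost(2, 8, 10))   # 14.4
print(codespace_cost(32, 8, 10))  # 230.4
```

Both quoted rates work out to $0.09 per core-hour, so the small VM for a couple of weeks is cheap next to a day of a new hire’s time, while the large one adds up quickly if left running.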

PS - also the VSCode in the browser for github repos is available for free everywhere; just change a github.com link to github.dev, or type ‘.’ when viewing a github repo, and you’ll be in a browser-based VS code environment. It’s completely suitable for editing tasks; I used this to diagnose a bug just yesterday by searching through a code base (the code is a fork, which you can’t search from the GitHub web interface). It’s also great for interactively digging into a pull request.
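
If you want to generate those editor links in bulk (say, for a list of repos in training material), the rewrite is just a hostname swap. A minimal sketch; the function name is made up, and it only handles plain github.com URLs:

```python
from urllib.parse import urlsplit, urlunsplit


def to_github_dev(url):
    """Rewrite a github.com URL to its github.dev (browser editor) equivalent."""
    parts = urlsplit(url)
    if parts.netloc.lower() != "github.com":
        raise ValueError(f"not a github.com URL: {url}")
    # Only the hostname changes; path, query and fragment are kept as they are.
    return urlunsplit(parts._replace(netloc="github.dev"))


print(to_github_dev("https://github.com/numpy/numpy"))
# https://github.dev/numpy/numpy
```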

Research Data Management and Analysis

A lot of items with possible training applications this week -

How to avoid machine learning pitfalls: a guide for academic researchers - Michael A Lones, arXiv

Researchers love a new tool. A lot of researchers are starting to use ML for the first time, with sometimes field-expanding and sometimes wince-inducing consequences. Lones gives an overview of dos and don’ts for researchers to keep in mind, with 26 different tips, with resources but, probably wisely, without examples to illustrate the problems he is aiming to prevent. The tips are broken up into stages - before building models, reliably building models, robustly evaluating models, comparing models fairly, and reporting results. This would be an excellent resource for including in an ML for researchers training, which might in turn be based on…

An Introduction to Statistical Learning with Applications in R, Second Edition - Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani

ISLRv2 is out, and available as a free PDF along with code and data and figures. This is a terrific resource for learning or teaching modern machine learning approaches rooted in a rigorous statistical background.

A future for SQL on the web - James Long

Long introduces absurd-sql: WebAssembly SQLite running on top of browser-based IndexedDB - which in turn is generally implemented on top of SQLite! It sounds goofy but is quite fast and, like “Hosting SQLite databases on GitHub pages” which we saw in #73, allows for both the delivery of simple user-specific web applications and training on SQL. Here, though, the advantage is that writes persist locally.

Research Computing Systems

Energy Efficiency Aspects of the AMD Zen 2 Architecture - Robert Schöne, Thomas Ilsche, Mario Bielert, Markus Velten, Markus Schmidl, and Daniel Hackenberg, arXiv
High-Efficiency Gurus Grapple with AMD’s RAPL (Running Average Power Limit) - Nicole Hemsoth, The Next Platform

New CPUs are complex and getting weirder. The AMD Zen 2 architecture, underlying recent EPYC processors, has a lot of different control mechanisms for keeping power consumption under some tuneable limit, and there isn’t a lot of information out there on how those mechanisms interact. That means those running systems don’t have a lot of guidance about how to set parameters to get the most performance for their workloads for a given power use.

Schöne et al. describe how they experimentally measured power consumption under different workloads given various P-states (power management states allowing scaling of frequency and voltages of CPUs), C-states (sleep states that idle or turn off various CPU resources), frequency options (including mixed frequency between CPUs), and AMD’s running average power limit (RAPL) functionality.

The paper is quite extensive - Hemsoth gives an overview of the work and their conclusions, including:

  • Energy measurement shouldn’t depend on AMD’s RAPL - RAM isn’t included
  • Don’t disable hardware threads because it can disable C-states
  • Avoid mixing frequency within CCXs (tiles of four CPUs and their caches)
  • Keep an eye on frequency throttling, which can significantly limit SIMD performance

Defending against ransomware is all about the basics - Mike Loukides, O’Reilly Radar

I’m waiting without any enthusiasm to start hearing about a wave of ransomware attacks at research computing facilities. Yes, sure, they’re typically Linux, but let’s face it, their openness is a feature that makes them less than fully-hardened targets.

And detecting them in research computing systems is probably harder than in enterprise IT shops, too. From the article: “detecting a ransomware attack isn’t difficult. If you think about it, this makes a lot of sense: encrypting all your files requires a lot of CPU and filesystem activity, and that’s a red flag.” Um, well, not so great for us in HPC or research computing where high-CPU, high-IO workloads aren’t unusual. And “The way files change is also a giveaway. Most unencrypted files have low entropy” - also not great when our users are hopefully writing in compressed formats. Not to say things are undetectable - constant high-IO and rewriting existing files is still a signature - but ransomware activity certainly would stand out less than in other contexts.
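
To make the entropy point concrete, here is a small sketch that computes the Shannon entropy (in bits per byte) of some plain-text data and of the same data after compression. The synthetic CSV content and the use of zlib as a stand-in for whatever compressed formats users actually write are assumptions for illustration; this is not how any particular detection product works.

```python
import math
import zlib
from collections import Counter


def shannon_entropy(data):
    """Shannon entropy of a bytes object in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Synthetic plain-text output, standing in for an uncompressed results file.
rows = "".join(f"{i},{i * 0.5:.3f},{300 + i % 7}\n" for i in range(5000))
text = ("step,energy,temperature\n" + rows).encode()

# The same content written compressed, as we would hope users are doing.
compressed = zlib.compress(text, 9)

print(f"plain text: {shannon_entropy(text):.2f} bits/byte")
print(f"compressed: {shannon_entropy(compressed):.2f} bits/byte")
```

The plain text lands well below the 8 bits per byte maximum, while the compressed copy sits close to it, which is roughly where encrypted data would land too. An entropy threshold alone can’t tell a well-behaved user writing compressed output from ransomware rewriting files.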

Loukides advises the basics - backups, routinely restoring from the backups (he mentions “chaos-monkey” type tests, about which more in a moment), keeping backups offline except when in use. Keep things patched, keep permissions minimized, have playbooks and plans for attacks.
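
A restore test doesn’t have to be elaborate. As a minimal sketch of the idea: record checksums of the original files, restore the backup somewhere, and compare. Here a plain directory copy stands in for the real restore step, and all of the names and file contents are made up for illustration.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest(root):
    """Map each file's path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_restore(original_manifest, restored_root):
    """List files that are missing or differ in the restored tree."""
    restored = manifest(restored_root)
    problems = []
    for rel, digest in original_manifest.items():
        if rel not in restored:
            problems.append(f"missing: {rel}")
        elif restored[rel] != digest:
            problems.append(f"changed: {rel}")
    return problems


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source"
    restored = Path(tmp) / "restored"
    (source / "results").mkdir(parents=True)
    (source / "results" / "run1.csv").write_text("a,b\n1,2\n")
    (source / "notes.txt").write_text("hello\n")

    recorded = manifest(source)          # taken at backup time
    shutil.copytree(source, restored)    # stand-in for the real restore
    clean_report = verify_restore(recorded, restored)

    # Simulate a bad restore to show what a failure looks like.
    (restored / "notes.txt").write_text("corrupted\n")
    damaged_report = verify_restore(recorded, restored)

print(clean_report)    # []
print(damaged_report)  # ['changed: notes.txt']
```

The manifest has to be stored somewhere the attacker can’t rewrite it, which is the same reasoning as keeping the backups themselves offline.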

None of this is novel - routine tending to the basics is the essence of professionalism! - but too many of our systems teams are not doing these things systematically. Too many research computing systems out there still don’t have a library of playbooks. Too many teams don’t run postmortems or incident responses.

This is harsh to say, but I’m a friend and I’ll say it - on average, research software development teams have evolved way faster than research systems operations teams in the past two decades, and now routinely operate at high professional standards much more often than do research system operations teams. And for standards I don’t mean rigid old-school practices like ITIL or COBIT aimed at IT for payroll systems or email servers. I mean flexible, evolving, agile, research-suitable things. Running postmortems and developing playbooks are not suffocating levels of process any more than using version control is. The counterpart of testing, things like chaos-monkey or disasterpiece theatre planned incidents, are too-often scoffed at, when they are the epitome of science - you have a (positive) hypothesis about how your system will behave under some minor incident, and you test that hypothesis. If you were right - experiment over, good job, well done everyone. If you were wrong - well that’s important to know about, yes? Because that incident is certainly going to happen one day with or without your provocation.

Emerging Technologies and Practices

DevOps Engineer Crash Course - Section 1 - Mathew Duggan

This is a crash course for an existing small team to start building the awareness, documentation, and reproducibility needed so they have the guardrails in place to be able to make changes with confidence.

The changes imagined here are to support the introduction of devops-style processes, but really it could be for any changes - giving you the confidence to add new components to the system, take pieces down, or form hypotheses about incidents. The tools recommended here are for cloud deployments - one advantage of pay-for-use infrastructure is that a complete listing of all your kit is never more than a few API calls away - but the steps would apply just as well to on-prem infrastructure.

  1. Get a copy of the existing stack - or, for on-prem, make sure you have an inventory of all your stuff and how it’s all wired together
  2. Write down how deployments and changes work
  3. Where do logs go and what stores them
  4. How does SSH access work
  5. How do we know the applications/services are running
  6. Run a security audit so you have a baseline state
  7. Make a diagram, including any external pieces and dependencies

Each step comes with an explanation that covers why, and how for the specific case of AWS deployments (including tools, many of which are paid).
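
For on-prem teams without cloud APIs to lean on, steps 1 and 7 can start from something as simple as a hand-maintained inventory. A minimal sketch, assuming a made-up inventory of components and their dependencies: it emits Graphviz DOT text for the diagram and flags dependencies that nobody has written an inventory entry for yet. This isn’t Duggan’s tooling, just an illustration of the idea.

```python
def to_dot(inventory):
    """Render a {component: [dependencies]} mapping as Graphviz DOT text."""
    lines = ["digraph stack {"]
    for name in sorted(inventory):
        lines.append(f'  "{name}";')
    for name in sorted(inventory):
        for dep in sorted(inventory[name]):
            lines.append(f'  "{name}" -> "{dep}";')
    lines.append("}")
    return "\n".join(lines)


def undocumented(inventory):
    """Dependencies that are referenced but have no inventory entry."""
    referenced = {dep for deps in inventory.values() for dep in deps}
    return sorted(referenced - set(inventory))


# Hypothetical inventory for a small cluster.
inventory = {
    "login-node": ["ldap", "shared-fs"],
    "scheduler": ["ldap", "shared-fs", "license-server"],
    "shared-fs": [],
    "ldap": [],
}

print(to_dot(inventory))
print(undocumented(inventory))  # ['license-server']
```

Keeping the inventory as data rather than as a hand-drawn picture means the diagram can be regenerated whenever something changes, and gaps like the missing license server entry show up on their own.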

Calls for Submissions

8th International Workshop on Large-scale HPC Application Modernization (LHAM) - 23-26 Nov, Abstracts due 10 Sept, papers due 15 Sept.

The International Workshop on Large-scale HPC Application Modernization offers an opportunity to share practices and experiences of modernizing practical HPC application codes, and also discuss ideas and future directions for supporting the modernization.

Events: Conferences, Training

CMake Training - 23-26 Aug, Virtual, Free

From the event page:

ECP is partnering with Kitware, ALCF, NERSC and OLCF to offer a 4 day CMake Training class on August 23-26. The training class will be virtual and will use computational resources available at NERSC for the exercises. The tentative agenda for the training is given below. The training is targeted at a deeper understanding of CMake. It seeks to assist ECP developers in learning how to resolve issues outside of their control, in addition to writing a build system generator capable of seamlessly configuring for multiple unique architectures with a variety of compilers.

IEEE Cluster 2021 - 7-10 Sept, Virtual, Registration $85-$140

A wide range of topics are covered, under the areas of

  • Applications, Algorithms, and Libraries
  • Architecture, Networks/Communication, and Management
  • Programming and Systems Software
  • Data, Storage, and Visualization


Great news: you remember that atomic clock and satellite receiver on a PCIe card you wanted? Facebook’s new Open Compute Time Appliance is an open-sourced solution which will stay within 1 microsecond of accurate time for 24 hours even if you lose radio signals to GNSS.

Interesting comparison of AWS instances: bare metal vs those using the Nitro hypervisor (which include more or less all non-metal newer instances). Impressively, there’s less than a 1% difference with GROMACS, WRF, HPL, or OpenFOAM.

In praise of nano as a development editor.

Been meaning to learn something “new”? How about APL.

Building and running your own NeXTstation in a VM. A NeXT was the first “real” computer I was paid to work on; lovely to see that UI again. Or you could emulate a Mac Plus and write a simple THINK C program for it.

Speaking of me being old, I’m used to using gnuplot as a command line plotting tool - clip is a newer tool that has a simple DSL for defining plots.

I’ve grown quite fond of having GitHub Copilot on when noodling around with some simple Go code. It’s like having a naive, overeager, but sometimes surprisingly perceptive junior programmer constantly offering suggestions to you while you type. Ok, it sounds terrible when I say it like that, but I find myself really enjoying the experience. The underlying model, OpenAI Codex, is being made available via an API.

Sysdig, an all-in-one Linux data collection tool that monitors system calls and events from the kernel, now understands containers.

A deep deep dive into C++ exceptions.

I always love a good performance debugging story - why 28 byte writes were 2x slower than 32 byte writes. And here’s a performance mystery in the other direction - how did Clang/LLVM turn an obviously terrible joke implementation of evenness-checking of an integer into essentially optimal code?

Leadership is hard. This is a nice post about the arc of leading a team to a challenging goal.