#57 - 15 Jan 2021

Stop Doing Things Challenge #2; pros and cons of decentralized teams; engineering management books; creating a new virtual conference; 2 reasons for code review; analysis of HPC support tickets

Hi, all:

Most of us are now well and truly back into the swing of things; I hope you and your team are doing well.

Skip to the roundup

Last time we talked about, as a manager, focusing on tasks with leverage (LeadDev also had a good article on leverage this week). Our main jobs are to support our team members and ensure the team works as effectively as possible; we’re principally multipliers, not makers and our tasks should as much as possible reflect that.

What about what the team as a whole is doing? What activities, if any, should the team stop doing? The steps are are straightforward enough, but doing them properly takes some work:

  • Look at where you want the team to be in a year - opportunities you want to take advantage of, threats you want to avoid.
  • Look at what the team is doing that is most valuable to your researchers and other stakeholders (funding, etc.)
  • Start dropping activities that are less valuable and aren’t actively getting the team where you want it to go.

The first, identifying opportunities and threats, is probably something you already have a pretty good idea about. Maybe you’re trying to get your systems team away from being merely a group that tends after colocated individual systems and into running a larger shared system; maybe you’re trying to move your software development team towards a specialty you see as growing like data science/ML or health/private data. Maybe there’s a funding stream whose future is uncertain and you want to aim towards performing work that’s more clearly under a different funding stream. Your boss will also have some ideas about directions to move or more away from, and this should be a conversation you have with them (at least one).

The second, finding out what the team is doing that’s most valuable to your user community, is something you have to go out and talk to your user community about. If you haven’t explicitly asked one of your researchers or other stakeholders the question “what are the things we do that are most important and valuable to you” in the last six months or year, you definitely don’t know the answer to this question. I’ve never seen anyone go through this process with their community without being repeatedly surprised.

It’s generally pretty easy to arrange say quarterly one-on-one meetings with key research users or stakeholders under the (true!) pretence of “status updates”; that gives you the opportunity to tell them what you’re doing that’s relevant to them, but then - more importantly - to ask them questions and advice about what their changing priorities are. For groups you work more closely with you can even do monthly.

If someone won’t take - or have someone else take for them! - a quarterly half-hour meeting with you where you’re telling them what you’re working on for them, that’s a pretty clear signal that the work you’re doing for them is not highly valued, and might not be worth the effort the team is putting in. These are almost certainly activities that can be phased out.

These first two steps are a pretty decent bit of work. They involve talking with a bunch of people - not as a one off, but continuously - and distilling the likely contradictory information down into an understanding of what your most valuable activities are - valuable current activities as seen by your community, or activities that are valuable for the opportunities they will unlock.

But it’s high-leverage work; it’s making sure your teams activities are aligned with needs and activities. The steps above are basically a mini strategic planning process for your team for the next year or so. For longer times and larger organizations it gets bigger and more comprehensive, but the basic approach is recognizable. It’s also a pretty good approach to your first 90 days of a new managerial job, for pretty much the same reasons.

So doing it properly will take some time - setting up and preparing for these meetings; but it’s really important if you want to make sure that what you and your team are putting all that effort into matters as much as possible.

The third thing is by far the hardest. Once you’ve collected all the data (and yes, as our digital humanities colleagues remind us, non-quantitative information is still data), and distilled it down, your team has to stop doing activities that are low value and not “strategic” (getting your team where you want it to go). The steps here are clear and simple, but a little intimidating, because not everyone is going to be happy:

  • Communicate with your manager to make sure they’re happy with the direction, if there’s significant changes
  • Communicate with your team - the why as much as the what
  • Start communicating with your community about what your team will be focusing on
  • Start saying no: phase out existing projects
  • Start preemptively saying no: where possible intervene early in the design stage of new projects to steer them in directions that make sense for your team, or talk about why you won’t be taking this project on.

This isn’t easy, especially this last part; but without frequently saying no to low value tasks that don’t get your team where it needs to go, you won’t have time to say yes to the tasks that really matter, for research and for your team.

Let me know if you have questions - this stuff isn’t hard, but can be intimidating - or feel free to share with the community some success stories about things that worked for you and your team. Just hit reply, or email me at jonathan@researchcomputingteams.org. And now, on to the roundup!

Managing Teams

In Bad Times, Decentralised Firms Outperform Their Rivals - Philippe Aghion and Isabelle Laporte, INSEAD Knowledge
Hold People Accountable (But Keep It Safe) - Dexter Sy, Tech Management Life

Having decision-making centralized can work really well for routine operations - which is almost never the research world - but falls apart in times of rapid change, as the article by Aghion and Lahorte points out, summarizing a paper led by one of the authors including data on companies from ten countries.

Things are always changing rapidly in research - especially in research computing - and we need that same sort of distributed decision making. Our goal is to have the right people on our team, those doing the work hands-on, making as many as possible of the decisions, with us only a “safety check”, making sure those decisions line up with larger-picture priorities and goals (and if not, sending them back to make a new decision, not overriding the decision with one of your own).

The article by Sy talks about how to do that, how to delegate responsibility to team members who have enough task-relevant maturity. This lets them be accountable to you for their decisions and mistakes they make, by creating the environment where they can make mistakes without it being a huge problem, and helps the team members grow in responsibility, confidence, and skills.

Maximizing Developer Effectiveness - Tim Cochran

This is aimed at software developers, but much of it would apply just as easily to those running systems or curating research data. Team members are effective if they’re quickly and frequently getting feedback - did this change work, does this solution meet the requestor’s needs - and not waiting for things or having their day chopped up into little pieces.

That means as managers it’s important to make sure we have the tooling and processes in place to get team members their results quickly, have things ready to show users/stakeholders quickly, and be respectful of their time. Because the pace of research is generally relatively slow and fitful, we tend in research computing not to make this a focus - but team members can only move at the speed of their tools and processes.

Recommended Engineering Management Books - Caitie McCaffrey

McCaffrey has gone from an individual contributor to a director running teams totalling 20 people over the past 3.5 years, and recommends these books, some of which have come up in the newsletter an some which haven’t. McCaffrey covers the books in much more detail, but an overview is:

  • The Manager’s Path, Camille Fournier - Focussing on the transitions between diferent roles and the mindset changed required
  • Thanks for the Feedback - Douglas Stone and Sheila Heen - Understanding how (and why) to receive, and implicitly to give, feedback well
  • The Hard Thing About Hard Things, Ben Horowitz - covers a lot, McCaffrey felt the importance of training your team members really came through
  • Accelerate, Nicole Forsgren, Jez Humble, Gene Kim - A classic DevOps but also high-performance computing team book
  • Dare To Lead, Brene Brown - The importance of daring (and vulnerable) leadership
  • Switch, Chip and Dan Heath - Understanding how to enact change by better understanding people
  • Atomic Habits, James Clear - How to enact change in yourself by building good habits and breaking bad ones.

Product Management and Working with Research Communities

Ten simple rules for creating a brand-new virtual academic meeting (even amid a pandemic) - Scott Rich, Andreea O. Diaconescu, John D. Griffiths and Milad Lankarany, PLOS Comp. Bio

The pandemic has meant that virtual conferences are now accepted as being meaningful ways to disseminate work and gather communities. That acceptance opens up enormous opportunities to arrange workshops and conferences that would otherwise be too niche to have people travel nationally or internationally for. It also allows workshops to be put together start-to-finish on a much faster timescale than we’re used to.

This paper by Rich et. al highlights some of the key points, many of which come down to finding ways of taking advantages of the flexibility that virtual events enable Two key points worth highlighting are pretty broadly important:

  1. Ensure your meeting addresses a need unmet by current conferences and fills this gap in an engaging and creative fashion
  1. Craft a coherent, themed conference itinerary to make the content accessible to as broad an audience as possible

Presenting virtually? Here’s a checklist to make it great - Tamsen Webster, The Red Thread

The downside of lacking in-person social cues of an “on-prem” conference is that virtual events, especially long presentations, can be easy to tune out. So presentations should be shorter. But there are other tricks too!

Here well known speaker and speaking coach Webster gives very concrete steps to take to make your virtual presentation as engaging as possible, whether it’s live or prerecorded. Many of the visual tips we’ve all learned by now (although the tip to stand was new to me), but the tips about making the talk sound engaging, and doing a better job of decoupling visuals from the words in the script, are things most of us in research could do better at, and are especially important when you’re “presenting” in one small window on someone’s screen.

Research Software Development

Two Kinds of Code Review - Aleksey Kladov

This is another good article of a number we’ve seen here on the topic of code review as asynchronous pair programming, a way of sharing knowledge both ways - about the code itself but also about expectations and goals of the team. From the article:

  • “One goal of a review process is good code.”
  • “Another goal of a review is good coders.”

Managing technical quality in a codebase - Will Larson

This article is about the steps in improving code quality over time from an initial messy code base; the idea is marching up a ladder, solving increasingly high-level issues.

This is particularly relevant for research software development. Successful research software marches up a technical readiness/maturity ladder from proof of concept to prototype to community use to production research infrastructure. As code marches up that ladder, the tradeoffs change, and the needs for code quality change with them.

The rungs on the code quality ladder for managers, in Larson’s estimation, are:

  • Hot Spots - Get the bits that are causing immediate problems fixed
  • Best Practices - Update team practices and tools to bring up to best practices, so there are fewer hot spots
  • Leverage Points - Clarify interfaces, data models, and other leverage points within the project to clarify overall design and make the code cleaner - this could be but doesn’t necessarily mean refactoring
  • Technical Vectors - Improve training and strategy to make sure the whole team is aligned on what and why they’re building
  • Measuring technical quality - Start proactively measuring code quality (by whatever metrics are important to your team and project) before it becomes a problem

But it works on my machine - Anton Sergeyev

A nice overview - with references and examples - on the sorts of things that can go wrong with problems that aren’t reproducible between systems, what you can do to diagnose the problems, and what you can do to avoid the problems.

Sergeyev breaks the issues into 6 major categories:

  • Human Errors
  • Unexpected environment
  • Unexpected infrastructure
  • Unexpected users
  • Unexpected state
    • Application memory
    • Databases
    • Cache
  • Poor performance

Research Computing Systems

Analyzing HPC Support Tickets: Experience and Recommendations - Alexandra DeLucia, Elisabeth Moore

This is the first paper I’ve ever seen trying to analyze HPC support tickets. Given that supporting research is what we do for a living, and interacting with researchers and research staff trying to use our systems are such a key part of that effort, I’m surprised that I haven’t seen efforts like this before.

The authors looked at a set of a bit under 70,000 tickets at LANL (scraped using a script from their ticketing system), looked at some patterns, and ran some NLP analysis on the text to both try to categorize tickets and see if they could predict responses.

They found some interesting things:

  • Their ability to analyze the data was greatly hampered by lack of metadata - there’s no summary on closing the tickets, so it’s hard to even see what the underlying question was and what the solution was
  • Only one category is allowed per ticket, which often mislabels tickets because tricky problems often involve interaction between two or more types of things
  • The support staff’s ability to look at other information about the user, or the system, was essentially nonexistant
  • The support staff’s ability to search past tickets by content was extremely limited
  • The number of templates the support staff could use for answers to common questions were very limited
  • The researchers generated (using LDA) a list of ticket clusters, which support staff then named - I’d be very interested to see the full list, only some examples are given

This is a very useful piece fo work for understanding how we handle system support, and frankly it’s not very encouraging.

I’m always surprised about how shoddy the user-facing support tools research computing teams use are. If we cared about supporting users well we’d use existing really good commercial tools with integrated client relationship management service (who is this person, how have we interacted with that group in the past, what questions do they typically have), make it easy to extract data so we can see how we were doing, and spend time writing good scripts, templates, and summaries for tickets on closing. That would help both support staff - by developing a knowledge base which could forestall questions or make them easier to answer - and the users, who would receive better service. Instead we mostly use crappy open source warmed-over perl scripts to do the bare minimum of ticket tracking, mainly so we don’t get yelled at if an email falls between the cracks.

SLO — From Nothing to… Production - Ioannis Georgoulas

We’ve talked about Service Level Indicators/Objectives/Agreements (SLI/SLO/SLA) in the past as ways to focus operations effort in ways that are visible to users. Service Level here often means “availability” under some specific measure (the indicator) but it could just as easily be a wait time (jobs in the queue, emails awaiting responses, waiting list for training), disk space, or almost anything else (time until a new user successfully runs a nontrivial job?). The indicators are the measures you define; the objectives are internal targets for what those indicators should show; and agreements are external-facing agreements with your users what those numbers should be.

Other links in the roundups have focussed on the basics; Georgoulas’ article focusses on developing and advocating for SLOs in your organization, with a number of useful resources linked including some slides you could use was a starting point. This could be used by a manager or an individual contributor to advocate for adoption of internal SLOs in a team.

Emerging Data & Infrastructure Tools

RISC-V Vector Instructions vs ARM and x86 SIMD - Erik Endheim

This is a nice long-read introduction to SIMD instructions like x86 SSE or AVX, as opposed to vector processing as in old supercomputing systems like Crays, or (with some modifications) in GPUs.

Fixed-width SIMD instructions can have the advantage of being specialized, but, well, they’re fixed width, and code needs to be recompiled to take advantage of new wider SIMD instructions. Too, as a practical matter, there’s typically a lot of overhead preparing or moving data to take advantage of the SIMD instructions. Endheim quotes from an 2017 article by Patterson and Watterman, SIMD Instructions Considered Harmful:

Two-thirds to three-fourths of the code for MIPS-32 MSA and IA-32 AVX2 is SIMD overhead, either to prepare the data for the main SIMD loop or to handle the fringe elements when n is not a multiple of the number of floating-point numbers in a SIMD register.

On the other hand, Vector processors provide a much simpler API to user code. Rather than being fixed width, there simply vectors, of whatever known (or computed) length - go do your thing. New version of the CPU can handle vectors better? The API probably doesn’t change much if at all.

Now, vector processing on vectors of arbitrary length isn’t of much use to general purpose computing - but it’s very very useful for scientific and data-intensive computing. And it may be coming back - in particular, upcoming RISC-V processors don’t have SIMD instructions but there is a vector-processing extension. This wold be of significant interest to scientific computing, and to AI/ML workloads.

Calls for Proposals or Papers

HiCOMB 2021 - 17 May, Portland, OR, USA (hybrid); paper deadline 29 Jan

HiCOMB is the IEEE International Workshop on High Performance Computational Biology, the intersection of HPC and Computational Biology. It’s a workshop of the IEEE International workshop on Parallel and Distributed Processing Symposium (IPDPS) and many outstanding methods and applications papers have been published here. Other IPDPS workshops of note are Parallel and Distributed Scientific and Engineering Computing Workshop.

Events: Conferences, Training

SORSE Lightning Talks - 20 Jan, 3pm UTC

A round of lightning talks covering a poster or blogpost. Talks lined up for this session so far include:

  • Development of an Automated High-Throughput Animal Training Platform
  • Code Review Community
  • Embedding a Jupyter Notebook

FOSDEM 2021 - 6-7 Feb, Virtual , Free

The 2021 Free and Open Source Developers’ European Meeting is online, this year, and the schedule is more or less in place.

Some tacks or talks likely of interest - Tools and Concepts for Successfully Open Sourcing Your Project, First Ph.D then Open Source Startup, and dev rooms like Testing and Automation, HPC, Big Data, and Data Science, CI/CD, and others for particular OSes, languages, databases, and tools.


Finally, CMake is good for something - Raytracing in pure CMake.

A nice set of SciPy lecture notes starting from intro to Python to sparse matrices and optimization to scikit-learn and 3D plotting with Mayavi.

TIL: Mutation testing is kind of like monte-carlo methods for test coverage; you change bits of your code and make sure at least one of your tests fail. If they don’t, then obviously the mutated bits of code weren’t covered by your tests. Mutmut is a python package for mutation testing.

ORNL looks ag GPU lifetimes, and find that GPU lifetime is very dependent on heat dissipation.

An open-source tool similar to Roam Research - linked note-taking - based on GitHub and VSCode; foam.

A complete course for Raku/Perl 6, if you’re into that.

You can apparently compile Scala into javascript now, if you’re into that.

A fairly detailed walkthrough of setting up a WireGuard VPN.