#99 - 5 Nov 2021

Shaking off the status quo; Being directive without being a jerk; Stop looking for mentors; Testing with global variables; Software team productivity; 20k simulations with AWS Batch; annual backup checkup

Hi, research computing team managers and leaders:

In our team there’s been a lot of passages lately - a paper for our original work is (finally!) coming out, as our new version is (finally!) coming together; we’re gearing up for a new batch of co-ops as our current co-ops are starting to document and getting ready to present their work; a project manager is joining the team for the first time now that the effort has reached a size and scope that it needs one (well, it needed one a year ago, but here we are).

These passages - and especially the influx of new people, new tasks, new scope - are really important for a team’s well being. Stasis isn’t stable; systems, including systems of people, are either growing or stagnating.

In academia sometimes it’s far too easy for groups to become very comfortable with “the way we do things”, and set in their ways. As Boulanger points out in the first article in the roundup, that can quickly lead to problems not being addressed - or even really noticed any more - and eventually people both within the team and “clients” of the team starting to drift away. In fact, I was talking to a colleague this week about one group’s services becoming ossified to the point where consumers of those services started moving to those of a different and newer group - the first group didn’t take feedback or feature requests seriously, and now there’s a real chance it will simply be disbanded (or, maybe worse, left go on indefinitely with less and less actual purpose).

Stagnation isn’t inevitable, but it takes active and continual effort to avoid. In research, and technology, and certainly at the intersection of research and technology, it should be easy - there’s a constant influx of new ideas available. But ideas don’t adopt themselves, and practices don’t adapt themselves. They’re adopted and adapted in units of teams, and one of our important jobs as managers and leads is balancing real needs for degrees of stability and certainty with equally real needs for change and growth. Balancing openness with continuity is hard, but vital.

And now, the roundup:

Managing Teams

Why The Status Quo Is So Hard To Change In Engineering Teams - Antoine Boulanger

Boulanger here points out a situation that is especially common in academia, with slow-growing teams where individual team members have long tenure. The issue is that a team gets so used to the way things are they don’t even see it any more, and forget that things don’t have to be this way. There can be a sort of learned helplessness to the procedural, technical, and complexity problems within an organization.

Having new people come in regularly - even short term team members like interns - can be very helpful for this, as long as they are comfortable making comments like “why is X? Isn’t that bad?” and the team takes the points they raise seriously.

Boulanger has recommendations for us managers or leads:

  • Regularly get into your team’s shoes
  • Notice when people stop complaining about an issue - this can be a negative, not a problem
  • Create some metrics around known issues so you can see if they’re getting better or if the team is just getting more inured to them

There are a lot of strengths that can come from a long-lived stable team, if you’re careful, but the default outcome is stagnation. The manager, and the team, has to be constantly and actively looking for things to improve and areas in which to grow to prevent the default.


You can be directive without being a jerk - Lara Hogan Being Nice and Effective - Subbu Allamaraj

I think one of the hardest things for new managers - especially those coming from the very hands-off collegial culture of research - is determining the right amount of directiveness appropriate for a given situation. The usual failure modes, in order of the frequency which I see them, is the very common laissez-faire absence of direction and the less common tech-lead-becomes-manager “do this, this, then that, and my way of doing it is exactly like this. In fact, why don’t I just…”

Hogan’s article is a followup to an earlier one, fixing a team going in circles. So here it’s a big topic of setting some direction for an entire team at once. But the approach works for a particular team member, too - being specific about whose job is what, focussing on the important thing (the team’s work) and not about individuals, and firmly but kindly applying direction at the level needed - whether that’s on tasks or goals or somewhere in between.

Allamaraj points out that being effective doesn’t necessarily mean not being nice, and being “nice” isn’t necessarily an end in itself anyway; we want to be kind, and sometimes “nice” gets used as meaning sort of inoffensive. Letting someone trudge aimlessly in circles while just smiling and not saying directive may seem from a distance like ‘nice’ but it’s certainly not kind.


Managing Your Own Career

Stop Looking For Mentors - Stay SaaSy

We could all use a bit more mentorship, but searching for A Mentor may make it harder to get the input we need. This article suggests making it easier on yourself:

Instead of looking for a mentor, just find somebody who can answer some questions you have. Then, if you think they can answer some more, ask them again. In reality, a mentor is mostly just somebody that answers questions more than once. That’s it. It’s not cinematic.


Product Management and Working with Research Communities

Using Amazon Service Workbench for Remote Training - Ann Gledson, Danielle Owen, Anthony Evans, and Peter Crowther, Manchester Research IT blog

So AWS Service Workbench is a free offering I hadn’t heard of before that lets you do what you might have done previously with Cloud Formation or a bunch of home grown scripts - spool up individual environments for researchers or (in this case) for a training course - but with aspinnice self-service UI that lets the IT staff approve requests. (“Free” in the sense of no extra cost - don’t worry, they still charge you for the resources that are being used!)

We’ve all tried having students use their own laptops and requiring them to pre-install packages, and know how challenging that is. The Manchester RIT team used service workbench to support an all-virtual Python course that would normally be done in a computer lab they control. Interestingly, the blog reports the feedback of the course instructor, an RIT team developing service to handle restricted data, a TA for the course, and some feedback from participants.

In this case, the RIT staff liked the control, the instructor and TA liked how they could get started teaching the material right away, and the participants seemed happy.

The downside of this approach of course is that if the students are to continue on using the material on their own, they now still have to go through the install process - but certainly the teaching is easier.

And of course this sort of tooling could be made available for on-prem systems, but in practice it never is; cloud providers have an incentive to make their systems as easy to use for these kinds of use cases as possible, because it means more revenue, while typically fixed on-prem systems generally have different (frankly, the opposite) incentives.


Research Software Development

Bring Legacy Code under tests by handling global variables - Nicolas Carlo

When trying to implement component testing legacy code with global variables, Carlo has a simple suggestion - don’t over think it, just pass the global variables in as parameters. It may look ugly, but it’s not new ugliness, it’s just revealing existing ugliness; and that’s the first, necessary, step in defining refactoring plans.


Well-researched advice on software team productivity - Ari-Pekka Koponen, Swarmia

Management is hard, management of something as complex and ambiguous as software development is especially hard, but that doesn’t mean we don’t know anything. There has been a lot of research on what works for making teams work well, and recently particularly in the area of software development. It doesn’t mean there are cookie-cutter solutions for anything, but we do have good guidelines. Koponen walks us through several well-supported (and in some cases ongoing) reports, many of which RCT readers will have already known about

  • Project Aristotle (#46) - The follow up to Google’s Project Oxygen, going beyond the effect of single managers to team behaviours. They found that five factors were very strongly correlated with high performing teams
    • Psychological safety - team members feeling comfortable speaking up
    • Dependability of other team members
    • Structure and clarity
    • Work having meaning, and
    • Work having impact
  • DevOps Research and Assessment (DORA) metrics - These are aimed for those developing and operating software systems, but many of them are relevant to those writing and releasing software: they find that teams that are performing well have:
    • High deployment frequency - it’s easy to update the software and keep it working
    • Low Mean Lead Time for Changes
    • Low Change Failure Rate for changes to need to be rolled back
    • Low Time to Recovery for the times that errors do occur
  • SPACE (#66) found that focussing on these five areas were important for self-reported high-performing teams:
    • Satisfaction and well-being
    • Performance
    • Activity
    • Communication & Collaboration
    • Efficiency & Flow

And most importantly, Retrospectives - learning and adapting practices based on what is actually happening on your team - allows you to tune.


Embedded malware in NPM package coa - GitHub advisory, RW Overdijk

Another reminder of how vulerable software supply chains are - coa (command-option-argument, a command line argument parser) used in 200 other packages and a gillion repositories, had malicious releases ˜with malware inserted:

The npm package coa had versions published with malicious code. Users of affected versions (2.0.3 and above) should downgrade to 2.0.2 as soon as possible and check their systems for suspicious activity. See this issue for details as they unfold. Any computer that has this package installed or running should be considered fully compromised. All secrets and keys stored on that computer should be rotated immediately from a different computer. The package should be removed, but as full control of the computer may have been given to an outside entity, there is no guarantee that removing the package will remove all malicious software resulting from installing it.

The good news is that it seems to have been spotted quickly if I’m understanding what’s been happening.


Research Data Management and Analysis

Choosing good chunk sizes in Dask - Genevieve Buckley

As with any kind of parallel or distributed computing, choosing granularities over which to calculate is complicated. Too small, and you end up spending too much time on coordination/communication and too little time on computation; too little and you have too little flexibility in scheduling or can even run out of memory. In simulation, it’s usually pretty clear what size over which to run; for data analysis, which is normally a lot less computationally intensive, it’s often less so.

In this article Buckley gives some rough rules of thumb:

  • Rely on previous single-node prototypes for guidance
  • Chunk sizes below 1MB are almost always bad
  • Avoid more than 10k or 100k chunks
  • Have at least as many chunks as worker cores, preferably significantly more
  • Have each chunk take at least seconds to run

and shows how the Dask dashboard can help provide some guidance.


Research Computing Systems

Scaling a read-intensive, low-latency file system to 10M+ IOPs - Randy Seamans, AWS HPC Blog

This is an AWS blog post, but it’s relevant more broadly - it’s a pretty direct use of NVMe-oF, NVMe over fabric.

Here Seamans describes a very high-speed read-nearly-only filesystem, where a gluster file system is replicated onto multiple instances with very high-speed NVMes, and then the NVMe are exposed read-only over NVMe-oF to provide extremely fast read access to files, for use cases like a large number of nodes are doing a read-intensive analysis of a directory full of data.


The yearly backup restore test - Remy van Els

Backups are useless, restores are invaluable. van Elst walks us through his personal annual backup restore test, marked on his calendar, including file integrity checks:

Have you done your backup restore test recently? An untested / unverified backup is the same as no backup, so doing a restore test is a major part in your backup scheme.


Emerging Technologies and Practices

Five-P factors for root cause analysis - Lydia Leong

Rather than “root cause analysis” or “five why’s”, both of which have long since fallen out of favour in areas that take incident analysis seriously like aerospace or health care, Leong suggests that we look at Macneil’s Five P factors from medicine:

  • Presenting problem
  • Precipitating factors - what combination of things triggered the incident?
  • Perpetuating factors - what things kept the incident going, made it worse, or harder to handle?
  • Predisposing factors - what long-standing things made a bad outcome more likely?
  • Protective factors - what helped limit impact and scope?
  • Present factors - what other factors were relevant to the outcome?

Running 20k simulations in 3 days to accelerate early stage drug discovery with AWS Batch - Christian Kniep

Following up on earlier Gromacs benchmarking posts, in this post Kneip describes their final use case - running a large suite of simulations for early stage drug discovery. By choosing their instance types based on the previous work, they could tune turnaround time and cost, and by using spot instances and Batch they could fan out 20k simulations over multiple regions relatively straightforwardly:

Price-performance for a variety of instance types for these single-node high-throughput calculations

For our binding affinity study, we completed 20,000 jobs over the course of three days. By using benchmarks and choosing optimal Spot Instances, we were able to achieve a cost as low as $16 per free energy difference (∆∆G value). As we chose to broaden the set of instances for a shorter time-to-solution, we achieved an average of $40/∆∆G value. With AWS Batch, we were able to create pools of resources in different AWS Regions around the globe and handle orchestration within the region. By the end of this, it was clear that we could achieve both a really fast wall-clock time (and hence time-to-result) as well as a low overall cost.


Calls for Submissions

International Super Computing (ISC22) - 29 May - 2 June, Hamburg, Papers due 29 Nov

SC isn’t even here yet and papers for ISC are coming due. ISC of course covers almost everything in HPC:

  • Architectures, Networks, & Storage
  • HPC Algorithms & Applications
  • Programming Environments & Systems Software
  • Machine Learning, AI, & Emerging Technologies
  • Performance Modeling, Evaluation, & Analysis

EUROSIS Industrial Simulation Conference 2022 (ISC 22) - 1-3 June Dublin, Papers due 21 February

The aim of the conference “is to give a complete overview of this year’s industrial simulation related research and to provide an annual status report on present day industrial simulation research within the European Community and the rest of the world in line with European industrial research projects.” Tracks include:

  • Discrete Event Simulation Methodology, Languages and Tools
  • Artificial Intelligence, IOT AND VR Graphics Applied to Industry
  • Complex Systems Modelling
  • Simulation in Robotics
  • Simulation in Engineering
  • Simulation in Collaborative Engineering
  • Simulation in Manufacturing
  • Simulation in Logistics and Traffic
  • Datamining Business Processes, Geosimulation AND Big Data
  • Simulation in Economics and Business
  • Simulation in Economic and Risk Management
  • Simulation in Automotive Systems
  • Simulation in the Power industry

51st International Conference on Parallel Processing - 29 Aug-1 Sept, Bordeaux, Workshop proposals due 28 Nov, Papers due 14 Apr.

ICPP2022 is interested in “he latest research on all aspects of parallel processing”. Topics of interest include algorithms, applications, architecture, performance, software, and multidisciplinary work.


Events: Conferences, Training

Oak Ridge National Centre for Computational Sciences Virtual Career Fair - 11 Nov

Four hours of talks, tours, and career tables staffed by 11 teams that are hiring.


Random

I’ve you’ve wanted to start messing around with functional languages, OCaml is a reasonable pragmatic choice that does get used in the wild. Here’s a getting-started-with-the-tooling guide to OCaml.

Or you could use a lisp that fits entirely into 512 bytes.

Use bash functions! They make bash scripts less crummy!

The nice thing about the internet is you can find nice resources about incredibly obscure things. Want a good annotated bibliography and sample code for tree edit distances, the minimal number of edits you can make to transform one tree to another? Great news!

A complete embedded USB stack in Ada.

Learn how X window managers work by writing one.

fork() considered harmful.

The anatomy of a terminal emulator.

Lovely explanation of Bezier curves, splines, and smooth surfaces.

Fascinating look at the data infrastructure around python environments that have grown over time in some banks: Bank Python.

Unicode attacks on source code.

Debugging stories are always good! Here’s one deep within the bowels of the Linux TCP stack.

Causing data leaks via maliciously-crafted log messages.


That’s it…

And that’s it for another week. Let me know what you thought, or if you have anything you’d like to share about the newsletter or management. Just email me or reply to this newsletter if you get it in your inbox.

Have a great weekend, and good luck in the coming week with your research computing team,

Jonathan

About This Newsletter

Research computing - the intertwined streams of software development, systems, data management and analysis - is much more than technology. It’s teams, it’s communities, it’s product management - it’s people. It’s also one of the most important ways we can be supporting science, scholarship, and R&D today.

So research computing teams are too important to research to be managed poorly. But no one teaches us how to be effective managers and leaders in academia. We have an advantage, though - working in research collaborations have taught us the advanced management skills, but not the basics.

This newsletter focusses on providing new and experienced research computing and data managers the tools they need to be good managers without the stress, and to help their teams achieve great results and grow their careers.