#43 - Link Roundup, 25 Sept 2020

Hi, everyone:

Sorry again about the slightly irregular time, and the lack of much of a preamble - heavy deadlines these past two weeks, and I’ve been thinking about directions we can steer the newsletter in the coming year. But the deadlines are past now and it’s back to normal.

As always, share your thoughts with me - just reply to this, or email me - about directions you’d like to see this go. There’s so little out there specifically for us research computing team managers that there’s a lot of things we could usefully do! I’m always interested in direction from the community.

In the meantime, on with the link round up!

Managing Teams

Does Research Software Engineering have a diversity crisis, and what can we do? - Neil Chue Hong

This is a talk that Neil Chue Hong gave at the 2020 International RSE Leaders Workshop. The numbers he gave are UK based - he’s with the Software Sustainability Institute in the UK - but they’re pretty grim. UK research software developers (RSEs) have as low or lower percentages of people who identify as women or are black, asian, or other ethnic minorities in the UK than either academia or tech, which themselves are decidedly unrepresentative of society as a whole.

This is posed as an issue with research software development, but I think it’s more than that. One of the themes of this newsletter which I hope comes through is that binning research computing into “software development”, “systems”, “data management”, or “HPC” isn’t helpful - we all have the same challenges, the same goals, interlocking needs, and the boundaries between the bins are super fuzzy. We only solve issues by working together. And this is a research computing problem which the whole discipline needs to address together.

And it is a problem. Witness the fiasco around Numpy. Numpy’s paper, which I celebrated being in Nature last newsletter, had 23 authors - every single one of them men. When that was pointed out, the numpy twitter account started blocking people(!!) and then a number of contributors started trolling and dog piling on critics which … did not dampen the concerns.

We do have this issue, it’s worse than in just academia or tech individually, and we need to start fixing it. As managers, we can make sure our own hiring processes are surfacing excellent talent from all communities.

Use a Pre-Mortem to Identify Project Risks Before They Occur - Mike Cohn

We’ve talked a lot about the importance of psychological safety in teams - making team members comfortable expressing their opinions, including raising issues. Without that, you’re missing important input and potentially running into foreseeable (and foreseen!) problems.

Pre-mortems give explicit encouragement to raise issues. I’ve used these to good effect in some project-kickoff situations - trying to get the team to see obstacles ahead so they can be avoided. With pre-mortems, one step is actually brainstorming ways that things can go wrong. This makes it much easier for team members to chime in with foreseeable issues, and for you to get insights that they might not otherwise be willing to share. And it’s a good way to get people comfortable raising potential problems. Cohn’s article is a good intro to the idea if you haven’t seen it before.

Meeting everyone on a new team - Anna Shipman

Last time we talked about leaving a team, this time an article about doing one specific thing when joining a new team as a manager or director - speaking with every person in your new organization. Shipman describes having 30 minute meetings with each person in her new 50 person organization over the course of several months. Long time readers will recognize it as looking a bit like the first half of a weekly one-on-one; mostly listening, driven by the team member. Shipman made it clear that this was for informational purposes, and that she wasn’t intending to attach the team member’s name to comments, and structured the discussions around five questions:

I’d love you to tell me a bit about yourself – as much or as little as you feel like sharing
What do you think the most important things we should be doing over the next year?
What will get in the way of us doing that?
What’s going well, i.e. what should we make sure we don’t change?
Is there anything you think I should know about?

Managing Your Own Career

How To Handle Email Calmly - Lukáš Linhart

Email is the bane of all managers and so any article on handling email almost always gets a quick look.

Linhart’s first suggestion is something I follow that I only recently learned not everyone does - religiously keep multiple email accounts, and keep them separate:

Private communications - this is my “friends and family” account. This one you get notifications for.
Work/project emails - I usually have two, my persistent email account and whatever institutional email account I have at the time, which I mainly use for administrative tasks at that institution. Very very few contacts here get notifications
Giveaway email - for things likely to incur spam

He also suggests

Official bureaucracy email - like for banks, etc, which I have sitting uneasily in my friends and family email

I’d add that I also have “email addresses” associated with my RSS reader account that I use to sign up for long-form reading like newsletters - bringing them out of my inbox and instead routing them somewhere that I go when I’m actually looking forward to reading the contents.

Other suggestions:

Archive emails once you don’t have anything more to do with them
Inbox is not a todo list - this one took me forever to learn, but it’s super valuable. Mark them as read once you’ve read them, and if there’s still something to do with them, add that task to your real todo list!
Process your emails from oldest-first - Again, just an incredibly valuable tip
Set expectations - to yourself and to others. For me, emails should have an expectation of a next-business-day turnaround. That response might be “Hey, good question, I don’t know, I’ll get back to you”, (and then that task goes on the todo list). But you should respond by then, and it’s unreasonable to have much shorter (hours or minutes) expectations of responses.

I’d also add: Set your email clients to update every hour or so, not every couple of minutes. There’s no possible email that you need to have read within minutes of it arriving in your inbox. I’ve tried other tricks, but those are what have worked for me. How about you?

Improving your RSE application - Ian Cosden

We don’t often get career articles actually written specifically for our discipline, so I wanted to include this short 3-part set of articles by Cosden, the Director of Research Software Engineering for Computational & Data Science at Princeton. It came as a follow up to an RSE community chat about hiring.

The articles are simple but they’re useful to look at both from the point of view of seeing what a fellow hiring manager looks at, and as a reminder of things we need to be keeping in mind ourselves as we look to our next roles.

Cosden’s CV tips:

Use formatting to your advantage - “Your goal should be to make sure that on my quick initial skim I can see your relevant skills/experience/education quickly and clearly”
Include a link to [relevant stuff you’ve created]
Use hyperlinks wisely

and his cover letter tips:

Passion is contagious [not sure I necessarily agree here - LJD]
Highlight one or maybe two key experiences [2-3!]
Do some research
Spend the time to do it well.

At some point I need to do a whole thing on hiring… what issues have you had with hiring, and which are the biggest ones you face? Hit reply or email me (jonathan@researchcomputingteams.org) and let me know if there’s something you’d particularly want to read.

Research Software Development

What Does This Line Do? The Challenge of Writing a Well-Documented Code - Miroslav Stoyanov, Better Scientific Software

In this article, Stoyanov describes how the team behind Tasmanian, a library for high dimensional integration and interpolation, went from PDF-based documentation to a Doxygen-powered web based documentation system. The lack of good internal documentation was a problem for Tasmanian:

Tasmanian has always had a well-documented external API, but not internal documentation, the lack of which is especially problematic when chasing moving targets such as GPU support for multiple vendors. Porting code to GPUs is hard, doubly so when it is undocumented and comes from an external contributor no longer working on the project.

They took what internal documentation that existed and moved it into Doxygen and their regular build system. By moving to web-based internal documentation, this automatically raised the visibility of gaps in the documentation, and made it easier to spend development time filling those gaps - which in turn makes it easier to port to multiple GPUs.

Enhancing software development through project‐based learning and the quality of planning - Marco Antônio Amaral Féris, Keith Goffin, Ofer Zwikael, and Di Fan, R&D Management

We’ve talked about sharing knowledge across the organization before - whether through talks, pair-programming, or other shared experiences (documentation, sure, but not just documentation). In project management, “Project Based Learning” (PBL) is a set of techniques that build into the project planning ways to make sure the things our team members learn from the project become shared knowledge within the team and organization.

This is a paper that shows that PBL works; it’s not just a good career and skills development practice (which it is) and not just a nice-to-have, but it actually measurably improves performance on future projects. The authors looked at 47 software development projects across three multinational organizations and found that the data supported all five hypotheses they tested; project based learning:

is positively associated with the quality of planning in subsequent software development projects.
in software development projects is stronger when uncertainty levels are high.
is stronger when team collaboration is higher.
is stronger when an organisation adopts a project-based structure (as opposed to matrix and functional structures).
is stronger under high time pressure.

Since research software development is generally under time pressure, is always quite collaborative, is normally organized into teams around projects, and is notoriously uncertain, this is quite relevant to us.

Research Computing Systems

Spindle: Scalable Shared Library Loading - Matthew LeGendre, Dong Ahn, Todd Gamblin, Bronis de Supinski, Wolfgang Frings, Felix Wolf

When a large number of nodes start running a task, they often hammer the filesystem as each process independently loads the (often large) set of dynamic libraries (or .py and .pyc files, or..) needed to start the program. From the authors:

We encountered cases where it took over ten hours for a dynamically-linked MPI application running on 16K processes to reach main.

And during that time any other processes trying to access the filesystem are also slowed. The Spindle package, on GitHub, is a (largely) LLNL team’s approach to this problem. On task startup, for each file (dynamically linked library, dlopen()’ed library, configuration file, etc.; this is configurable), one process is chosen to read it and broadcast it to the ramdisk of the other processes, greatly speeding startup and reducing the impact on other users. No recompilation is needed; one just starts the program with, e.g., spindle mpirun -n 128 mpi_hello_world.
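
To get a rough feel for why this helps, here is a toy count of filesystem reads at startup. It is only a sketch, not how Spindle is actually implemented: it assumes every process needs every file, and that in the cooperative case exactly one process reads each file while the rest receive it over the network. The function names, the round-robin choice of reader, and the figure of 500 libraries are all made up for illustration; only the 16K process count comes from the quote above.

```python
def independent_reads(num_processes: int, num_files: int) -> int:
    """Every process reads every file from the shared filesystem itself."""
    return num_processes * num_files


def cooperative_reads(num_processes: int, num_files: int) -> int:
    """One designated process reads each file; the others get it by broadcast."""
    if num_processes == 0:
        return 0
    return num_files


def designated_reader(file_index: int, num_processes: int) -> int:
    """Hypothetical round-robin choice of which process reads a given file."""
    return file_index % num_processes


if __name__ == "__main__":
    procs = 16 * 1024   # "16K processes", as in the authors' example
    libs = 500          # illustrative guess, not a number from the paper
    print("independent:", independent_reads(procs, libs))
    print("cooperative:", cooperative_reads(procs, libs))
```

Under these assumptions the load on the shared filesystem no longer grows with the number of processes at all, which is the point of the approach.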

The Cost of Software-Based Memory Management Without Virtual Memory - Drew Zagieboylo, G. Edward Suh, Andrew C. Myers

I had just been reading an older article on Virtual Memory Tricks, as well as one on mmap tricks, both making the argument that we don’t use virtual memory features often enough to simplify our programming, when this link crossed my inbox. This article makes the argument that virtual memory has outlived its usefulness, and now is a drag on performance and system predictability:

With large memory workloads, virtualized environments, data center computing, and chips with multiple DMA devices, virtual memory can degrade performance and increase power usage.

It’s not hard to understand how virtual memory greatly complicates DMA devices. With virtualization, the argument is about performance: large-memory workloads with complex read patterns, like many data analytics workflows, often incur TLB misses, and with virtualization resolving each miss now requires a nested page table walk.
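
The cost of a nested walk can be put in numbers with the usual counting argument. With an n-level guest page table and an m-level host page table, each of the n guest table pointers has to be translated by the host (m references) before the guest entry itself can be read (1 reference), and the final guest physical address needs one more host translation (m references). That gives n(m+1) + m, or (n+1)(m+1) - 1, memory references in the worst case. The sketch below assumes no page-walk caches or other hardware shortcuts, and the function names are mine rather than anything from the paper.

```python
def native_walk_refs(levels: int) -> int:
    """Worst-case memory references for a TLB miss without virtualization."""
    return levels


def nested_walk_refs(guest_levels: int, host_levels: int) -> int:
    """Worst-case memory references for a TLB miss with nested paging.

    Each guest level costs a full host walk plus the guest entry read,
    and the final guest physical address costs one more host walk.
    """
    return (guest_levels + 1) * (host_levels + 1) - 1


if __name__ == "__main__":
    print("native, 4 levels:", native_walk_refs(4))
    print("nested, 4 guest x 4 host levels:", nested_walk_refs(4, 4))
```

For the common case of four-level tables on both sides, that is 24 references for one miss instead of 4.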

Embedded systems have done without virtual memory for eons, and the authors suggest that it’s time this became more common. We simplify OS issues dramatically without virtual memory, and can increase performance; further and perhaps more controversially, the authors suggest that applications can more easily now handle their memory directly, forgoing the virtual memory system-provided contiguity, migrating, and swapping by doing it at the application level.

The authors try their hand at implementing their recommended approaches with several SPEC and PARSEC benchmarks, and claim that the programming and computational overhead for implementing memory management at the application level is modest, and encourage more work in the area.

What do you think? Most “HPC codes” do this anyway, because memory placement is so crucial, and HPC sysadmins would typically rather see a code crash than swap - but I’m a little more skeptical about its application more broadly.

Emerging Data & Infrastructure Tools

SRE Classroom: exercises for non-abstract large systems design - Google Cloud

Google, which is notoriously close-lipped about technology development in the company, is getting more and more open with their training materials. This is terrific, because Google takes training materials very seriously, and they’re quite good.

In Google’s systems reliability practice, they emphasize large systems design and “back of the envelope” estimation approaches which will seem quite familiar to those of us who were trained in the physical sciences. They teach this approach with quite concrete examples, their so-called “Non-abstract large systems design” (NALSD) examples. This lets them quickly evaluate the feasibility and tradeoffs of different approaches before they start building things. There’s a nice chapter in the SRE book working through a simple example.

They’ve just released a nice workshop on NALSD with a pub-sub worked example. In the package are slides, worksheets for attendees, and a workbook for workshop facilitators. It looks like a nice set of materials for you or a team member to work through if you’re curious about architecting these kinds of systems, or a cool afternoon course to offer within teams or externally.

Oracle Cloud Deepens HPC Embrace with Launch of A100 Instances, Plans for ARM, More - John Russell, HPC Wire
Oracle bulks up high-performance computing services on its cloud - Paul Gillin, SiliconAngle

Oracle, trying to carve a niche for itself in the commercial cloud provider world, has clearly set HPC (and other areas of research computing?) as a strategic target. With bare metal instances, the newest A100 NVIDIA GPUs, and ARM plans, Oracle is clearly betting this can be a specialization that can pay off.

The SiliconAngle article points out something of particular interest for research computing workloads - much more flexible instance types, allowing you to choose your number of cores and amount of memory directly:

“If you want three cores and 15 gigabytes of RAM, you can’t get that from anyone else,” Batta said. “It’s like a slider: You pick cores and memory and we give you an instance on demand.”

Has anyone tried Oracle’s HPC efforts?

Calls for Proposals

SORSE Call for Contributions - 30 Sept

The ongoing Series of Online Research Software Events has their monthly abstract deadline at 8pm BST on Sept 30. They are accepting proposals for talks, panels, software demos, and more.

Were you going to submit something to an RSE conference this year? Do you have a project you’d like to share? Is there a talk/workshop/panel you’d like to see happen? Then you’ve found the right place! We encourage contributions from all time zones and will schedule events on a day and at a time that suits the presenter. We would like to record events where appropriate. The Call for Contribution form asks for your permission to record the event, for you to give permission to provide your uploaded materials under a CC BY licence and be happy to publish them to Zenodo.

GPU Technology Conference - 5-9 Oct, multiple timezones, $99 USD

Nvidia’s GTC has gone all-virtual this year and really embraced it - with sessions, live or recorded, being given in multiple time zones (and some given in multiple languages). As you might expect, there’s a lot of AI and deep learning sessions, but also multiple sessions on topics of interest to research computing such as geospatial data, drug design, HPC, GPU + Infiniband, genomics with both short reads and nanopore sequencing, IoT/edge computing, and technical sessions on pseudo-spectral fluid dynamics methods, RAPIDS (GPU-powered database analytics), GPU + Spark, algebraic multigrid, CUDA programming, and more. Not hard to find $99 worth of talks to attend.

Events: Conferences, Training

Digital Humanities RSE: King’s Digital Lab as experiment and lifecycle - 29 Sept, 15:00 – 16:30 UTC, James Smithies, Arianna Ciula, SORSE talk series

Next up in the Series of Online Research Software Events, a talk about a research software lab at King’s College London:

This SORSE event describes King’s Digital Lab (KDL), a Research Software Engineering lab operating within the Faculty of Arts and Humanities at King’s College London (UK). The KDL team of 18 project managers, analysts, designers, engineers, and systems managers specialise in arts & humanities, cultural heritage, and creative industries research and development. The talk will provide a current state overview of the lab, and describe our RSE HR roles (see https://zenodo.org/record/2564790) and a relatively recent trial initiative that defines the different ways the team can contribute to research.

Hacktoberfest 2020 - 1 Oct - 31 Oct

Not getting much coding in these days as a manager? More time spent in spreadsheets than in editors? Here’s your chance. Sponsored by DigitalOcean, this annual project encourages contribution to open source projects. If you’re one of the first 75,000 participants to complete the challenge by submitting 4 valid non-spammy PRs to any public GitHub repo (many projects will label issues with #Hacktoberfest in addition to good-first-issue or the like), you will be eligible for a prize like a t-shirt. There’s a bunch of research software projects participating in climate science, geo sciences, and of course a zillion COVID-19 projects.

Fluid Numerics Cloud HPC livestreams - Weekly starting 1 Oct

Joe Schoonover of Fluid Numerics is running weekly livestreams on doing HPC fluid dynamics in the cloud.

Random

Graduated beyond Little-Bobby-Tables style SQL-injection mischief in the “name” field of your various web accounts? Worried your service provider is storing passwords in plaintext? Maybe try choosing antivirus test strings as your password.

With the latest update of Windows you can now read Windows Subsystem for Linux files from within Windows … by way of Plan 9.

Relatedly, someone’s put together a DOS system for linux, so you can run your favourite MS-DOS commands from linux.

A nice style guide for SQL from the folks at kickstarter.

Rookie HPC is a beginning set of tutorials and exercises for MPI and OpenMP. It even has a really cute MPI datatype creator/visualizer.

In fact, there’s a lot of really nice coding exercise websites out there. One I just found, exercism.io, has 118 exercises for… Tcl?