I asked last week about onboarding; the people who responded said they weren’t bringing on student interns or new hires at all at this point in the pandemic. It sounded like at least our research computing team managers had succeeded in avoiding layoffs in their teams, sometimes despite an unfavourable - in one case very unfavourable - pre-COVID-19 environment. But several were not bringing on research students either, whether because of the complexities of remote onboarding, because they just didn’t have the time, or because schools had stopped their placement programs.
A question for this week - how are work hours and productivity going in your team right now? There’s some evidence that after an initial dip, productivity has actually risen; but I think it’s more complicated than that, especially for research computing teams. GitHub has seen more hours worked (and more variable hours) but it’s not clear that total number of PRs or commits has gone up - people are working longer but not necessarily getting more done.
Anecdotally, among some teams I’m familiar with, productivity for lots-of-small-tasks work has gone up, but productivity for deep-work, will-have-to-think-about-this-for-a-few-days tasks has dropped noticeably. Has that been your experience, or are you seeing something different? How are you handling it?
The Engineering Manager Event Loop - David Loftesness via Chris Eigner
This isn’t new, but I really like the idea: what a generic tech software development manager should be thinking about daily, weekly, and monthly on people, projects, processes, and themselves. It’s not quite right for research computing - thinking about recruiting and hiring on a daily basis is, to put it mildly, not the regime we’re normally in - but a lot of the other items hold up. What other changes would we have to make?
Attitudes toward Change and Transformational Leadership: A Longitudinal Study - Matthew David Henricks, Michael Young & E. James Kehoe, J. Change Management
There’s a lot of people who spend their careers studying how workplaces work. I don’t think most of the management literature percolates into STEM fields — the papers look funny. They’re full of impenetrable jargon (unlike our normal completely buzzword-free reading) and the text seems impossibly fluffy compared to dense STEM work (but, well, people’s behaviour doesn’t lend itself to formulae and small dense sentences).
This is a longitudinal study that probably aligns with what you already believe, but it’s worth highlighting - you absolutely can change culture or practices in a team or larger organization, but you can’t do it by holding a single event and then putting up some posters. It takes time and sustained effort. People systems are like large ships; it’s hard to get them to change course, but once you actually get them turning they have a lot of momentum.
Practical Skills for The AI Product Manager - Peter Skomoroch, Justin Norman and Mike Loukides
I’m not saying you should leave academic research computing and move into AI product management — in fact, if you care enough about managing research computing teams well to be reading this, please don’t, the community needs you — but we often undervalue how transferable our skills are to other areas. Read some of these practical must-have skills and see if they don’t sound like your day job:
How to expand conflict capacity in times of crisis - Marlene Chism
We’re all tired and stressed right now, and under those circumstances it’s easy for small things to cause conflict where they normally wouldn’t. This is something we can work on, but it takes time and practice. Pay close attention to things that trigger unhelpful reactions; try to be aware of them as they come up and short-circuit them, and failing that, buy some time by saying you’ll get back to the person, and step away.
NVIDIA HPC SDK - NVIDIA
The newest NVIDIA C++ compiler, NVC++, as part of a new HPC SDK, looks pretty cool - it includes GPU-accelerated versions of the C++ STL parallel algorithms. There’ll be a talk about the HPC SDK at the virtual GTC on Tues May 19th.
Five Code Review Anti-Patterns - Trisha Gee, Oracle
We’ve talked before about having clear expectations on code review; here’s five common traps to avoid, and that could be made explicit as part of your team’s CONTRIBUTING.md or similar:
Has someone on your team been meaning to play with WebAssembly to run some C++ code in clients’ browsers? Here’s a template repository for turning C++ code into WebAssembly and deploying it as a node.js package.
Brit research supercomputer ARCHER’s login nodes exploited in cyber-attack, admins reset passwords and SSH keys - Gareth Corfield, The Register
Look, none of us want to read about breaches of our systems in the press, but the fact that this article appeared is a mark of seriousness and professionalism on the part of the ARCHER staff and leadership.
In the private sector the emerging standard, increasingly enforced by law, is full, frank, and fast public disclosure of breaches. That is — that has to be — the future in our community, too. Our researchers deserve at least as much consideration as ecommerce site customers.
Especially since it’s not just our researchers. Research is inherently collaborative and international in nature. People have accounts on lots of different systems. I have heard about breaches in November and February that I’m 95% certain are related to this breach - and if not this one, some less-publicized ones in Germany and the US. The breach almost certainly spread from one system to another via accounts common to two or more clusters.
In research computing we have this culture of secrecy around breaches, so they were kept quiet, not publicized; maybe information was sent in dribs and drabs to a few trusted others through back channels. We are way past the time when that was good enough. These breaches could have been contained, or at the very least slowed down, by earlier public disclosure like the ARCHER team made so that other teams elsewhere knew to be on the lookout. Sure, hold back some signatures of the rootkit so that the attackers don’t make cosmetic changes to avoid detection if you want, but our researchers and our colleagues deserve to know about your breach. We’re in the middle of a pandemic, for pity’s sake! We should understand how irresponsible it is to be walking around infected while acting as if nothing is amiss.
I get it, no one wants their funders or VPRs asking awkward questions. But that’s not a good enough reason for this amateurish practice of hiding the inevitable under the rug. We’re ashamed about these breaches because we know at some level we’re not doing a good enough job on security. But the first step towards having some professionalism is to hold ourselves accountable. As a community we need to do better.
In other security news: no, you probably don’t need to worry about that Thunderbolt Vulnerability if you’re running an up-to-date OS. (“All the evil maid needs to do (with an un-updated system - LJD) is unscrew the backplate, attach a device momentarily, reprogram the firmware, reattach the backplate, and the evil maid gets full access to the laptop” - oh, ok then.) But still, don’t leave your laptop lying around, and be careful what you plug into your computers.
AWS’s new ARM chips, the Graviton2, are now generally available on AWS in some regions. This is a great opportunity to play with them - not only are they substantially cheaper per unit performance than Xeons but they seem to actually have better performance on some database-y type loads than other M-class nodes.
This is particularly well timed, coming alongside the announcement of what’s likely the new top supercomputer in the world, the ARM-based Fugaku.
Azure HPC Cache: File caching for high performance computing - Scott Jeschonek and Scott Hanselman, Azure
The various commercial clouds are developing products expressly around the issues research computing teams face in moving more work to the cloud. HPC Cache caches your on-prem storage (an NFS-style POSIX filesystem, or an object store exported via Blob Storage) into an Azure region and handles transfer back and forth automatically - meaning you can burst workloads more or less transparently. It has some sophisticated features, like being able to specify the type of I/O workload (read-heavy, write-heavy, mixed).
I’d really like to see more solutions like this for caching even in-cloud S3/Blob object stores to a POSIX filesystem, so that the staging of data in and out - between workloads that require an expensive POSIX filesystem and much cheaper object storage - could be handled automatically.
Sort of relatedly, Azure now finally has spot instances.
How I switched from classic hosting to Kubernetes - Anders Borch
A nice walkthrough of moving a side project from Docker Compose to Kubernetes, for those who’ve been meaning to give it a try.
ARM Dev Summit - Proposal deadline June 9th 2020
If you’ve already been playing with ARM or intend to, this October virtual conference is calling for proposals for workshops, demos, or panels. Tracks most relevant to us are:
ALGOCLOUD 2020 - Paper deadline June 24th 2020
This virtual conference in September “is particularly interested in novel algorithms in the context of cloud computing, cloud architectures, as well as experimental work that evaluates contemporary cloud approaches and pertinent applications. ALGOCLOUD also welcomes demonstration manuscripts, which discuss successful system developments, as well as experience/use-case articles and high-quality survey papers.”
Chapel Implementers and Users Workshop 2020 - Friday May 22, 2020 8:30am–4:30pm PDT, Free via Zoom
Chapel seems to have come a long way since last I played with it; they have an all-day pacific time workshop next Friday.
NSF has funded an NCSA/Chicago effort to develop and deploy serverless (function-as-a-service) computing for research computing: the software product is funcX. Anyone know the advantages funcX has over e.g. Apache OpenWhisk?
Programmatically draw architecture diagrams for AWS, Azure, GCP, or on-prem systems by writing Python code. The cool thing, of course, is that your architecture diagrams can now be meaningfully version controlled.
I’ve often said that one of the problems people new to computing at scale have is that laptops and their filesystems work too darn well, so their intuitions about what’s easy and hard actively mislead them on clusters or distributed systems. This article expresses that more clearly, and might be useful for discussions with users who are having trouble making the jump.
You shouldn’t write code you can avoid writing - not even something “trivial” like a CSV parser.
If you haven’t seen this already, there’s a nice survey of deep learning for scientific discovery on arXiv.
A nice-looking Coursera course on research data management.