#64 - 5 March 2021

Hi!

It’s been a big week here - we finally got a long-awaited paper submitted, attended one virtual conference, and are gearing up for another.

I hope you and your team has had successes this week too - are there any you want to share? If so, or if you have any feedback, suggestions, ideas, or questions, let me know - reply, or send me something at jonathan@researchcomputingteams.org.

For now, on to the roundup!

Managing Teams

Mentor, coach, sponsor: a guide to developing engineers - Neha Batra

A good overview of the different ways we can contribute to a team member’s development:

Mentoring: sharing your experience so an engineer can leverage it themselves. Coaching: asking the right questions to help an engineer reach a solution that works for them. Sponsoring: giving stretch projects or opportunities to help an engineer grow.

Of these, mentorship is the least valuable - our experience may or may not transfer over. Working with our team members to help them solve the problem (coaching), and explicitly making room for them to learn new things (sponsoring) ask more from us, but have better results and are higher leverage by helping them grow further, faster.

Batra works through some examples of what the three might look like, from submitting a talk to getting more visibility in the organization.

Ten simple rules for writing compelling recommendation letters - Jennifer H. Kong, Latishya J. Steele, Crystal M. Botham, PLOS Comp Bio

Writing recommendation letters is part of life in research. If your team member trusts you enough, when it’s time for them to move on here are roles where you may be asked to provide a reference. And trainees you or your team members work with may ask you or them for a reference.

Kong et al. offer their ten rules - some that strike me as particularly under-used are:

Respond with enthusiasm or decline. Just don’t write the letter if you can’t give them a really strong reference; you’re not helping anyone.
Provide context for why you are a suitable letter writer
Address the requirements needed to be successful in the new position
Be memorable by adding illustrative anecdotes - concrete stories are enormously more compelling than just asserting a fact.

A Simple Compliment Can Make a Big Difference - Erica Boothby, Xuan Zhao, and Vanessa K. Bohns, HBR

A reminder of something we should know - a complement, or positive feedback, measurably improves peoples mood through the day, and repeated positive comments over consecutive days does not diminish the effect!

As managers we tend to think most about negative feedback, which absolutely has to be given whenever it is needed, but we often forget to give concrete positive feedback or even just more general praise. Don’t do that - call out jobs well done as often as possible. You won’t overdo it, you’ll get more of what you’re looking for, and it makes everyone’s day better.

Product Management and Working with Research Communities

Training Feedback - University of Bristol Advanced Computing Research Centre

Like most of us, Bristol’s training model changed pretty abruptly one year ago. Bristol’s ACRC started having short standardized feedback forms with free text fields starting Jan 2020, which meant that they had three months of data on in-person delivery and now a years worth of data on virtual delivery. With admirable transparency, they’ve posted the results and

Online response rate is lower (probably not surprising, in-person events have higher engagement overall)
The feedback was very positive in both cases, with no significantly different “would you recommend this workshop” positive answer rates between the two (95+% in both cases!)

What’s nice about this is that there’s both quantitative results (so you can relatively confidently ask the binary question, “did things get worse with online teaching”, and get “no”) and get more nuanced qualitative results from the free-text fields. As is often the case, the most useful input comes from the free-form text fields; distilling those takes more time than plugging some numbers into a plotting package but can provide really useful feedback.

Having long-running standardized feedback on your “product” from your “customers” allows you to more confidently experiment and make changes - kind of the same effect that version control has on software development, or backups and documentation for the operation of systems.

Research Software Development

LibBLAStrampoline, a BLAS and LAPACK demuxing library - Elliot Saba et al

This is really cool - a liblas and liblapack interposer that you can compile your linear-algebra-using tool against, and then switch BLAS/LAPACK implementations at runtime using environment variables! Julia 1.7 is already using this to allow the user to select an implementation interactively.

How Was Your Weekend - Software Development Teams Working From Home During COVID-19 - Courtney Miller et al.

The authors surveyed software developers at an unnamed “large software company”, one survey with 2,265 respondents and one with 608. They found a number of changes:

For example, as one of our respondents put it, formerly shallow questions from team members like “How was your weekend?” are now a deep inquiry of well-being and genuine concern. Many developers also discussed how their teams had trouble finding satisfying replacements for the low-effort in-person social activities like lunch or coffee breaks that were used to help maintain camaraderie and social connection on their team.

In a followup survey focussing on team aspects, “57% of respondents felt a decrease in their ability to brainstorm with their colleagues”. Some aspects of communications within teams over the time - a big increase in number of meetings, and a decrease in feeling socially connected to their team.

In good teams, the respondents largely felt that the despite communication challenges, team culture hadn’t changed - which is good, because that would be hard to come back from - and there were definite advantages:

more inclusive communications
better integration of employees that were already remote to a mostly on-site team
more flexibility and understanding of challenges

I would have liked to see some cross-tabs - e.g. did teams that were more purposeful about social communication end up doing better? What tools at a managers disposal seemed like they helped?

What We Expect From Software Developers on Each Level - Dafna Rosenblum

Related to the recent hiring discussion on requirements, here Rosenblum walks through the very concrete requirements Apegroup is using for levelling software developers at a number of levels - from entry level to tech lead. It’s focussed on front end/UI work but the very concrete requirements at the lower rungs are very clarifying. She also links to a post describing the process as they rolled it out.

We’re thinking of a levelling system in our institution for software developers; right now it’s unfortuantely very… IT-y. I’m hoping to use articles like this one to steer the conversation in more useful ways.

Giving Ada a chance - Anthony
Performance analysis and tuning of SPARKNaCl - Roderick Chapman

Rust prides itself on safety, but only one particular aspect - memory usage. That’s a good and useful thing to be safe about, bit it’s only one piece of the safety puzzle - and by itself memory safety does very little for correctness, which for much of research computing is an even more important concern.

People who thought very carefully about safety in the sense of plane-doesn’t-fall-out-of-the-sky or embedded-device-doesn’t-overheat-and-explode, which includes a lot of correctness, put together Ada. Ada is still evolving and is very much in use, which is why I (and Anthony, apparently) find it surprising that other languages haven’t learned from what Ada does well.

Anthony points out how its (then still fairly novel) strong use of types for correctness, and in particular its allowing of strong subtyping of ranges of enumerables (whether enums or integers) is extremely useful for correctness of low-level programming (and indeed some of Ada’s strongest areas of use right now is in embedded systems).

And that doesn’t even touch on other areas more relevant for research computing, like programming-by-contract. Spark, an extension-to-a-subset of Ada, takes that even further with support for formal verification. As Chapman writes, Spark can give comparable-to- or with some work even faster-than-C performance such as SparkNaCl, a cryptographic library modelled after libsodium - all while being formally verifiable.

Even if Ada isn’t a language suitable for research computing work, it would be very valuable if those that are would take advantage of some of what the Ada community has learned.

Research Computing Systems

Fending off Bitcoin Mining HPC Thieves: Idaho National Lab’s Cryptojacking Detector - Inside HPC
Investigation into hacking at University of Oxford lab - Sophie Inge, Research Professional News ($$)

A couple of reminders, if we don’t need it, that security is always an issue.

In academic research computing it was less of a problem when research computing was a more niche activity with large buy-in from only a small number of fields. When I was doing astrophysical fluid simulations - well, who cares, right? Our administrator’s threat modelling could safely rule out advanced persistent threats targeting my galaxy cluster simulations.

But with cluster cycles now themselves potentially valuable, and with research computing becoming increasingly broad (and policy- and commercially-relevant), the landscape has shifted significantly.

From Inside HPC, INL has developed a malware detector that looks for cryptocurrency tools in particular, and is looking to commercialize it.

From Research Professional News, Inge writes about a possible hack of Oxford’s Division of Structural Biology. The details are as you might imagine, but the event being made public is still too rare. I can’t emphasize enough how important it is for all of us in research computing to be transparent and public about such events, so that all of us that are similarly vulnerable - either technically or because of similar targets being present - can know as soon as possible.

Extending no-cost Red Hat Enterprise Linux to open source organizations - Jason Brooks, RedHat

As a small gesture towards those who used to rely on CentOS stable, RedHat is making some self-supported licenses of RHEL for open-source projects:

Under the program’s terms, eligible organizations will be granted access to no-cost RHEL subscriptions for any use within the confines of their infrastructure. This includes build systems, continuous integration (CI) testing and general project requirements (i.e. web servers, mail servers, etc.). These subscriptions will be self-supported by default, which provides full access to the Red Hat customer portal, knowledge base articles and forums, and also include Red Hat Insights, our proactive analytic tooling.

This will help some targetted use cases, but it won’t help those who run CentOS for large compute clusters.

I for one won’t weep for CentOS stable. Long-term stable operating system releases are used as a paper-thin excuse by vendors of multi-bajillion-dollar networking and filesystem kit to update and test their drivers and software stacks as little as humanly possible, leaving researchers stuck on OSes with ancient system libraries, and systems teams scrambling to build and maintain essentially their own parallel custom distributions.

Easybuild and Spack have made things easier for systems teams, but they still hasn’t had the impact they could have because of vendors are tying teams to ancient OS releases.

I might hope that this transition, though jarring and a lot of work, would lead to vendors supporting more frequent OS upgrades, but who are we kidding - they’ll just wait until the community creates a replacement to CentOS for them, or they’ll switch to supporting only Ubuntu LTS or something.

Open Compute Project Foundation announces the OCP future Technologies Initiative - Scientific Computing World

The OCP is growing in importance as hyperscalers and those who are interested in drafting off of the work of hyperscalers are defining standards for commodity computing infrastructure.

The OCP has formed a subcommittee to keep planning 3-5 years in the future to support the work of standardization. It’s interesting to see what OCP has chosen to keep an eye on:

Software Defined Memory
Cloud Service Model
AI HW-SW Design Collaboration

As well as keeping a slot open for “Additional R&D Opportunities area identified by the Community”.

Emerging Data & Infrastructure Tools

How to test github actions locally using Act - Yonatan Kra
Act - Casey Lee et al.

We haven’t started trying Github Actions yet. There’s a number of reasons, but given how long it takes me (admittedly the least technically capable member of our team) to even get a working travis config set up, the idea of having to commit and recommit an embarrasing number of times to update the CI/CD pipeline definitely plays a role.

Once you have docker installed, act allows you to run and test GitHub actions locally. Installation looks relatively simple on Intel macs with home-brew, Windows on chocolatey or scoop, or on Arch Linux or with Nix, or by compilation with any reasonably recent golang toolset otherwise.

How they SRE

This community-curated list on github includes best practies, tools, techniques, and resources for how site/system reliability engineering is practiced in a large number of orgaizations, on topics including

Hiring and Building SRE teams
SRE Culture
DevOps
Monitoring & Observability
Alerting
Incident Response & Post-Mortem
On-Call
Testing in Production
Chaos Engineering
Automation
Performance

Accelerator virtualization - Carlos Reaño, Federico Silla, and Blesson Varghese, Concurrency and Computation, Practice and Experience

We’ve talked a couple of times in the last couple of issues of the possibilities of PCIe 5 and the upcoming 6, for allowing fast local interconnects within a cluster and increased virtualization of resources.

Reaño, Silla, and Varghese write a preview of three papers in the current issue of CCPE on the use of accelerator virtualization in very research-computing relevant areas. One paper uses edge-computing virtualized accelerators to improve performance (within the same amount of time) of de-noising of IoT monitoring systems The second looks at performance using rCUDA (remote CUDA) with Slurm for GPU jobs in a heterogeneous cluster; GPU utilization is improved, cluster throughput is increased, and total energy consumption is decreased. The third uses CUDA offloading for molecular dynamics simulation from low-power notebooks or tablets.

Maelstrom - Kyle Kingsbury and Kit Patella

This really cool package lets you learn distributed systems by implementing algorithms in a simple DSL, then testing the heck out of them with the full power of Jepsen. It comes with with a 6 chapter set of instructions getting you started from yellow world up to consensus with Raft. This honestly seems like great materials for a longer-term graduate student training.

Calls

Essential Open Source Software for Science, Cycle 4 - Chan Zuckerberg Initiative - LOIs due 30 March, Full applications due 19 May

Applicants can request funding between $50k and $200k total costs per year for two years for a total amount between $100k and $400k for the duration of the grant.

1st Workshop on Performance EngineeRing, Modelling, Analysis, and VisualizatiOn STrategy - 25 June, Stockholm Sweden, Submissions due 19 March

The purpose of this meeting, held in conjunction with ACM HPDC 2021:

is to bridge tools developers and end users of performance analysis tools. It is a full day workshop with a keynote in conjunction with HPDC 2021. We are hoping that the stakeholders, which are application developers, domain scientists, analyst, and tools developers can collaborate and build a bridge to fill in the gaps in various topics…

Submissions are 6-8 page papers in ACM format, on topics such as

Performance analysis and modelling on the real world applications
Data visualization in high level performance analysis
Inefficiencies in programming patterns or computing architecture
Patterns, anomaly detection, and performance characterization in HPC applications
Performance engineering strategies and use cases.
Analyzing application performance in Cloud environment and other ubiquitous technology
Machine learning aided performance analysis

Events: Conferences, Training

KubeCon/CloudNativeCon Europe 2021 - 4-7 May, $75

A lot of cloud-native technologies on disply here, with Kubernetes for AI or at the edge, data management, tutorials, deep dives into Kubernetes networking and instrumentation, talks on role-based authorization, workflows, Spark on K8s, GPUs on k8s, confidential computing, open policy agent - there’s a lot here that could be useful.

International Conference on ICT enhanced Social Sciences and Humanities 2021 - 28-30 June, Free but manditory registration by 20 June.

A conference on computation- or data-supported social sciences and humanities work.

The ICTeSSH 2021 online conference aims to bring together leading SSH researchers, computer scientists, informaticians, publishers, librarians, vendors of research ICT tools, SSH decision makers and others, to exchange and share their experiences and research results on all aspects of ICT enhanced Social Sciences and Humanities. In addition, the conference aims to discuss promising new ideas and identify new potential collaborators.

Random

Great performance-improving story of profiling (and reverse-engineering) GTA Online to cut loading times by 70%.

A great list of foundational distributed systems papers.

An argument for having extremely simple “layer cake” architectural diagrams in your talking-about-systems toolbox.

Cloudflare Pages is now in GA, with a free tier that looks like more than enough for personal or departmental static web hosting. Right now I use Travis-CI + AWS S3 + AWS CloudFront + AWS Certificate manager for this use case - which is fine, and absurdly cheap. But it’s a lot of steps and moving parts, and so is tricky to get working from scratch.

A tutorial introduction to the differential dataflow model of computation which as it matures is going to be interesting for a number of different reseach computing applications.

We tend to get sloppy when writing bash, because, hey, it’s just a script, right? But we can write bash scripts professionally using real software development approaches - even unit tests.

It’s fascinating to watch older and newer databases converge in some areas of functionality. Postgres, which has increasingly rich document-store like functionality with JSON fields, also can be set up to trigger events on changes to data.

A recently minted PhD student goes through an ordeal of converting his PhD thesis from LaTeX to HTML. It shouldn’t be this hard. I’ve taken to using markdown to write things up then pandoc to convert to different formats, but that still doesn’t solve the problem that then if you want something highly styled (like a scientific paper or book chapter) you have to have multiple copies of the stylings for the various formats.

Someone studies the ancient, of-historical-interest-only web technology of Server-Side Image Maps.

An introduction to lockless algorithms.

RCT Newsletter