#100 - 12 Nov 2021

Valuing ourselves; Owning the manager's authority; The programmer's brain; NVIDIA and AMD news; NVMe for HDD; JupyterHub and Kubernetes for Pangeo

Hi, everyone:

It’s been a good week here at RCT world headquarters.

First, our team finally published our paper describing our v1 platform at a high level - a mere 29 months after creating the first version’s Google Doc. The effort tied together years of not just software development and technical architecture but stakeholder engagement, privacy considerations, team building, and domain knowledge. Several co-authors were software developers who had never been on a paper before, were pretty new to the whole process, and hadn’t necessarily appreciated the “full stack” of the effort. It was fun to help them be part of the process not just of writing a paper but of creating a piece of the scientific record of humanity. Knowing they’ll be able to walk into many University libraries all over the world, for decades, and find a copy of it in the stacks, with their name on it, with authorship and citation records kept basically in perpetuity, is pretty cool.

Secondly, on a personal note, I spent some time at an Arm HPC hackathon, which was both exciting (new tech! With many different systems to play with!) and surprising (Oracle’s cloud seems… pretty ok?). But more importantly, it was really rewarding to see that after a roughly eight-year hiatus from day-to-day performance tuning of HPC codes, some of the names and tools may have changed, but a basic understanding of the tradeoffs at play - and of the techniques used to balance between them - translates unscathed and can be put to use immediately. These are fundamental skills.

Both of these events drive home to me the breadth and depth of expertise we have in our profession, and how important it is for us to apply it.

And both breadth and depth are needed. Tech just learned a very expensive lesson with the Zillow Offers fiasco that we in research computing have known for a while - it turns out you need domain expertise as well as technical expertise. It’s not enough just to know how to code, run a computing system, or manage a database; that needs to be paired with an understanding of why the software or system is being used, or of what valid data looks like in a given field. And the problems we’re dealing with are subtle - they require deep understanding of the domains we straddle in our work.

What continues to baffle me is that while the nature and importance of the expertise we bring to bear on research problems is being increasingly appreciated in the rest of academia - and elsewhere, as the burgeoning job board indicates - too many of our own teams continue to underplay it. Groups underbid on projects, are timid in proposals, and try to be a little bit of everything to everyone instead of understanding and playing to their strengths. Teams discussing cloud computing in research computing continue to emphasize first and foremost arguments like “we’re cheaper”, and “we don’t pay inflated tech salaries” as if being the bargain-basement discount brand is our natural lot, or as if us scandalously underpaying our staff is a feature instead of a bug.

We’re hitting a bit of a milestone with this issue of the newsletter - it’s not a nice round number like 128, but it’s still pretty notable. So far this newsletter community has helped at least one reader find a new job, helped another couple try new things in managing their teams, and has inspired at least one feature in a software project. We’re just getting started, and there’s a lot more to be done. If you have ideas, or questions, or want to help, just drop me a note at [email protected].

For now, on to the roundup!

Managing Teams

Voice or Veto (Employees’ Role in Hiring Managers) - Ed Batista

A common and avoidable source of frustration when making any high-impact decision - hiring a new team member or manager, but also setting any major technical or strategic direction - comes from not being clear ahead of time about how the decision is being made and by whom. Do the team members get a voice, or a veto? What are the decision criteria?

There are a lot of perfectly good answers to those questions, many of which the team members (or stakeholders, or...) would be ok with; but not making things explicit right at the beginning can leave people feeling like they’ve been fooled or not listened to.

Batista counsels being explicit about how important hiring decisions will be made - and by whom - before soliciting input, and communicating clearly throughout the process.


Owning your power as a manager - Rachel Hands

Related to being clear about decision-making power: one of the common mistakes I see in new research computing managers is an unwillingness to accept the fact that they now hold a position of power. This is especially true when the new manager has been promoted to manage previous peers.

For a lot of people, suddenly having power is uncomfortable, and that’s ok (it’s way better than the other failure mode, of really relishing the newfound power), but you can’t just ignore it. “Ah but I’m still just the same person, you know?” Yes, you are, but now you can fire someone. And even if you choose not to see that power difference, those someones are exquisitely aware of it.

Hands outlines the role power that comes with being a manager, helpfully and correctly distinguishes it from the relationship power that comes with trust, and points out some specific real problems that come if you don’t acknowledge your power (my favourite: your power manifests in ways you didn’t intend) and what happens when you do.


Research Software Development

CFFInit - Generate your citation metadata files with ease - Netherlands eScience Center

If you’ve been meaning to generate a CITATION.cff for your repos, here’s a little browser-based tool that will get you started - enter the name, authors, a message, and any identifiers, and it’ll provide a downloadable file.
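
For reference, here’s a minimal sketch of what a generated file looks like - all the names, the ORCID, and the DOI below are placeholders:

    cff-version: 1.2.0
    message: "If you use this software, please cite it as below."
    title: "My Research Software"
    version: 1.0.0
    date-released: 2021-11-12
    doi: 10.5281/zenodo.1234567      # placeholder DOI
    authors:
      - family-names: "Smith"
        given-names: "Jane"
        orcid: "https://orcid.org/0000-0000-0000-0000"   # placeholder

Drop that in the root of a repo as CITATION.cff and GitHub, among other tools, will pick it up and offer a “cite this repository” option.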


The programmer’s brain in the lands of exploration and production - Vicki Boykis

Boykis talks about some problems she had learning a new programming language for production data work, in the context of some things she’s read in The Programmer’s Brain, a book that covers cognitive science specifically as it applies to programming. (There have been a number of good reviews of The Programmer’s Brain, but I haven’t had a chance to read it yet.)

There’s a lot of different ways that one can be confused by something, including but not limited to:

  • Lack of information - Short-term issue
  • Lack of knowledge - Long-term issue
  • Lack of processing power - Issue in working memory

and writing code (or documentation) that’s less confusing means being clear about which of those (or other) sources of confusion is at play.

What’s more, the middle issue - lack of knowledge - can often be helped by making it easy to explore, make changes, and see the results; but that’s often quite hard in production code while being quite easy in more exploration-friendly environments. The things one is concerned with in production tooling - robustness, logging, correctness checking - are very different from the concerns while exploring (will this even work?) - a distinction that’s extremely relevant to research software development.
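
As a concrete (if contrived) sketch of that difference - the data and the function name here are mine, not Boykis’s:

    import logging

    import pandas as pd

    logger = logging.getLogger(__name__)

    df = pd.DataFrame({"site": ["A", "A", "B"],
                       "temperature": [12.1, None, 9.7]})

    # Exploration mode: optimized for poking around - will this even work?
    print(df.groupby("site").temperature.mean())

    # Production mode: the same computation, now with the validation and
    # logging you'd want before running it unattended.
    def site_mean_temperatures(frame: pd.DataFrame) -> pd.Series:
        """Mean temperature per site, with basic input checks."""
        if frame.empty:
            raise ValueError("received an empty data frame")
        missing = int(frame["temperature"].isna().sum())
        if missing:
            logger.warning("dropping %d rows with missing temperatures", missing)
        clean = frame.dropna(subset=["temperature"])
        return clean.groupby("site")["temperature"].mean()

    print(site_mean_temperatures(df))

The one-liner is vastly better for building understanding; the function is vastly better for running every night. They’re different jobs.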


Research Data Management and Analysis

NVIDIA Announces Availability for cuNumeric Public Alpha - Jay Gould, NVIDIA

Worth flagging here that NVIDIA has released its first public version of a free, drop-in, CUDA/NVIDIA-GPU-enabled replacement for numpy, called - confusingly, for those of us who remember the pre-numpy days - cuNumeric.
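
The drop-in part is the interesting claim: in the ideal case, porting is a one-line change. A sketch based on my reading of the announcement - I haven’t run this myself, and numpy API coverage is still partial in the alpha:

    # import numpy as np         # before
    import cunumeric as np       # after: same interface, NVIDIA GPUs underneath

    # Ordinary numpy-style array code, unchanged:
    a = np.ones((10000, 10000))
    print((2.0 * a + a).sum())

The announcement describes running such scripts through the Legate runtime, which is how the work gets spread across GPUs (and nodes).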


Research Computing Systems

I normally avoid “speeds and feeds” and new product announcements here, but this was a pretty big week for new stuff coming out and I think reflects some upcoming directions that those of us in RCD should be aware of.

NVIDIA Declares that it is a Full-Stack Platform - Jeffrey Burt, Next Platform
NVIDIA Debuts Quantum-2 Networking Platform with NDR InfiniBand and BlueField-3 DPU - John Russell, HPC Wire

NVIDIA GTC was this week, and there were a lot of announcements - like cuNumeric above - but I think these two capture the most interesting points. The first, by Burt, points out that NVIDIA leadership sees itself as building complete systems from hardware, networking, systems software, and SDKs for accelerated computing; and while AI and graphics are clearly specialties, HPC, genomics, data science, digital twins, and other research computing and data mainstays are explicitly called out, as well as emerging areas like quantum computing simulation.

An example of what this will mean in the short term in the data centre comes in the second article, by Russell: new extremely high-bandwidth InfiniBand fabrics are being paired with accelerated computing in the NIC, NVIDIA’s DPUs. That’s going to allow cloud-like network flexibility - like network isolation and encryption - at InfiniBand latencies within HPC clusters, which will hopefully support wider ranges of use cases and more flexible provisioning.


AMD: 96-Core EPYC ‘Genoa’ Due In 2022, 128-Core EPYC In 2023 - Dylan Martin, CRN
AMD Launches Milan-X CPU with 3D V-Cache and Multichip Instinct MI200 GPU - Tiffany Trader, HPC Wire
Vertical L3 Cache Raises the AMD Server Performance Bar - Timothy Prickett Morgan, The Next Platform
Azure HBv3 virtual machines for HPC, now up to 80 percent faster with AMD Milan-X CPUs - Evan Burness, Azure Blog

Slightly stomping on NVIDIA’s news was a set of announcements the day before by AMD, long considered just “x86, but cheaper” but now well and truly taking on Intel and NVIDIA at the high end. For CPUs, the new Milan-X chips introduce a large, fast L3 cache stacked on top of the chiplet cores, as Morgan points out, which has substantial performance implications for the many research computing codes that are bandwidth-limited but have pretty regular access patterns.
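
For intuition on the kind of code that stands to gain, here’s the shape of the pattern (a sketch, not a benchmark): a stencil sweep streams through arrays with perfectly regular, predictable accesses and does very little arithmetic per byte, so its speed is set almost entirely by how fast memory can feed the cores - and a working set that suddenly fits in hundreds of megabytes of L3 can win big.

    import numpy as np

    n = 10_000_000
    a = np.random.rand(n)

    # A 3-point stencil: unit-stride, perfectly predictable access, trivial
    # arithmetic per element - runtime is dominated by streaming the arrays
    # through the memory hierarchy, exactly the profile a huge L3 helps.
    b = 0.25 * a[:-2] + 0.5 * a[1:-1] + 0.25 * a[2:]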

This is available right now, as Burness points out, in Azure - commercial cloud platforms are increasingly the most reliable way to start testing out new systems.

[Figure: performance of Azure’s Milan-X HPC instances over the previous-generation Milan SKUs, with 20-75% performance improvements due, in significant part, to the new cache.]

AMD’s new Instinct MI200 GPUs also look like beasts, and seem to have made different tradeoffs than NVIDIA, going (like Ponte Vecchio?) explicitly after high double-precision FP64 performance. These aren’t yet available to play with, so we’ll have to see how this holds up on real benchmarks.

The range of interesting research computing hardware, with increasing differentiation between offerings, is going to make for a very interesting time, and will finally blow up the “all the world’s an x86” monoculture we’ve had around tooling and assumptions. That’s going to make things harder for systems and software teams, but it’s going to make greater ranges of applications more feasible.


Hooking up a faucet to Niagara Falls: Seagate demos NVMe-accessed disk drives at Open Compute Summit - Chris Mellor, Blocks & Files

This is fun - hard drives are continuing to get faster, even though they’re much slower than SSDs - and so are starting to benefit from faster interfaces, especially when it’s not just one drive but JBODs of them.

Here Mellor reports on Seagate’s demo at the Open Compute Summit of an NVMe-connected JBOD of disks - and on a Seagate blog post on the same - including NVMe support directly in an HDD controller.


How We Saved Millions in SSD Costs by Upgrading Our Filesystem - James Katz, Heap

Katz provides us a reminder that using SSDs for filesystems means changing some tradeoffs that previous filesystem decisions may have implicitly made.

They had used ZFS, a copy-on-write filesystem, for their database cluster. That had a number of advantages for them (higher durability, consistent snapshots, filesystem-level compression); but copy-on-write starts causing problems as the disk fills (it gets harder to find empty blocks for each write), and that interacts badly with SSDs’ natural tendency toward write amplification.

Moving to ZFS 2.x, which supports Zstandard and so offers higher (if slower) compression than lz4, was a substantial win for them. It resulted in fewer blocks written per write, and so better performance overall and when things started to get full - which happened less often, because of the better compression. Other workloads, of course, will experience the higher-but-slower compression tradeoff differently.
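
For what it’s worth, on OpenZFS 2.0 or later the switch itself is a one-line property change - the dataset name here is hypothetical, and existing data only gets recompressed as it’s rewritten:

    zfs set compression=zstd tank/dbdata      # dataset name hypothetical
    zfs set compression=zstd-9 tank/dbdata    # or pick a level, trading CPU for ratio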


Emerging Technologies and Practices

Analyze terabyte-scale geospatial datasets with Dask and Jupyter on AWS - Ethan Fahy and Zac Flamig, AWS Public Sector Blog

The Pangeo community has been doing a lot of great work on large geospatial data, from software (where they’ve pushed Dask forward quite a bit) to tools for array-structured data (Xarray, Iris).

Fahy and Flamig walk us through setting up a JupyterLab environment that uses Dask workers to access a very large climate-simulation intercomparison data set, CMIP6. Here of course they use AWS - and make use of spot instances for the Dask workers - but the basic setup of Dask and JupyterHub on Kubernetes (with Helm charts) would be a pretty common Pangeo setup.
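
The payoff of that stack is how little user code it takes once it’s up - something like the sketch below, which assumes a Dask client is already connected (as the Helm-chart setup provides); the Zarr path is illustrative, not a real dataset address:

    import xarray as xr

    # Lazily open one CMIP6 Zarr store from object storage; only metadata is
    # read here, and later computation is farmed out to the Dask workers.
    ds = xr.open_zarr("s3://cmip6-bucket/some-model/tas.zarr",  # illustrative path
                      consolidated=True,
                      storage_options={"anon": True})

    # e.g. an (area-naive) global-mean surface air temperature time series,
    # computed in parallel across the workers:
    tas_mean = ds["tas"].mean(dim=["lat", "lon"]).compute()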


Calls for Submissions

CCGrid 2022 - 22nd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing - 16-19 May, Taormina, Italy, papers due 24 Nov

Covering future internet computing systems; programming models and runtimes; distributed middleware and network architectures; storage and I/O systems; security, privacy, trust, and resilience; performance modelling, scheduling, and analysis; sustainable and green computing; scientific and industrial applications; and AI/ML/DL.


The ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC) - Minneapolis, 27 June - 1 July, Papers due 27 Jan

Some relevant topics of interest include:

  • Datacenter, HPC, cloud, serverless, and edge/IoT computing platforms
  • Heterogeneous computing accelerators and non-volatile memory systems
  • File and storage systems, I/O, and data management
  • Operating systems and networks
  • System software and middleware for parallel and distributed systems
  • Programming languages and runtime systems
  • Big data stacks and big data ecosystems
  • Scientific applications, algorithms, and workflows
  • Resource management and scheduling
  • Performance modeling, benchmarking, and engineering
  • Fault tolerance, reliability, and availability
  • Operational guarantees, risk assessment, and management

ESSA 2022 : 3rd Workshop on Extreme-Scale Storage and Analysis - 3 June, Lyon, Papers due 1 Feb

Covers topics ranging from storage systems to language and library support for data intensive computing at scale.


Events: Conferences, Training

Open Confidential Computing Conference 2022 - Conference 17 Feb, Virtual, Free; Onsite Hackathon 18-19 Feb

As research computing more and more frequently involves sensitive data, there’s growing interest in confidential computing - keeping data confidential even from the systems teams running the infrastructure. OC3 is a good venue to learn about what’s happening in this space.


Random

10 print “Hello, world!”; 20 goto 10 - now available in the browser. qbasic in Javascript

Seemingly weird redirection behaviour explained, stemming from the fact that >& doesn’t just redirect stdout and stderr, but duplicates one handle onto the other - which is why order matters: cmd >file 2>&1 sends both streams to the file, while cmd 2>&1 >file sends only stdout there.

Learn low-level programming and disassembly through the use case of hacking games.

Container Layer Analyzer, a simple self-hosted web application to explore container image sizes broken down by layer and directory.

10 image formats that didn’t make it.

New HTTP verb (hopefully) - QUERY. I can’t even tell you how much it annoys me to have to use POST to make a complex query just so I can put the detailed query in the body of the request.

Useful sed tricks.

Great discussion of the issues with -ffast-math, and why the “why don’t you just” arguments for dealing with it usually won’t work.

C vs C++ vs Rust vs Cython for Python extensions.

Generative art - Samila.

HPC job scheduler issues come to cloud systems - AWS Batch can now use fair share scheduling. I have opinions!

Source code for a Commodore 64 MMORPG, Habitat.

A way deep dive into data initialization and finalization in C++ and ELF.

Free Operating Systems book, breaking OS design down into three overarching concepts - virtualization, concurrency, and persistence.

xtdb is an interesting looking open source database that keeps and makes searchable all value history.

Lesser known postgres features.

This week I learned about the Kirkpatrick Model of training assessment - that assessments can be considered in layers from the most superficial (reactions) to increasingly meaningful and harder measures (assessing learning; assessing behaviour changes; and assessing overall results or impact).


That’s it…

And that’s it for another week. Let me know what you thought, or if you have anything you’d like to share about the newsletter or management. Just email me or reply to this newsletter if you get it in your inbox.

Have a great weekend, and good luck in the coming week with your research computing team,

Jonathan

About This Newsletter

Research computing - the intertwined streams of software development, systems, data management and analysis - is much more than technology. It’s teams, it’s communities, it’s product management - it’s people. It’s also one of the most important ways we can be supporting science, scholarship, and R&D today.

So research computing teams are too important to research to be managed poorly. But no one teaches us how to be effective managers and leaders in academia. We have an advantage, though - working in research collaborations has taught us the advanced management skills, but not the basics.

This newsletter focusses on providing new and experienced research computing and data managers the tools they need to be good managers without the stress, and to help their teams achieve great results and grow their careers.