#91 - 10 Sept 2021

Technological readiness for research computing and data; attrition is ok, and so is process; demo-driven development; software devs need more ops knowledge; the state of CXL and AWS ParallelCluster

Research computing and data is increasingly a single discipline, a single area of practice. It’s a complex one, too; one which includes the support of first tentative steps which themselves may be research activities, all the way through to production operations of routine infrastructure.

You can read more, or go straight on to the roundup.

There was a good discussion on twitter (it happens) about software as a facility, as a tool for enabling research rather than merely as a research output itself.

I think there’s truth to that, but that’s not the end state for all research computing and data efforts - nor should it be - and, crucially, it’s not all just about software.

I like to think about research computing and data support going from the testing out of early ideas as a proof of concept, to supporting the development of prototypes, then the development of a product and the outreach and community building around it, until finally it matures into the routine operations and maintenance of a piece of well-understood research infrastructure.

(Figure: technological readiness levels for digital research infrastructure - software, data resources, systems, and increasingly combinations thereof - and the question being asked at each level. A tool moves from research output to research input as it becomes more ready for widespread use.)

At each step, the fundamental question being asked is different:

  1. Proof of concept: “Will this even work?”
  2. Prototype: “Will this solve real problems?”
  3. Product: “Will others find this useful?”
  4. Proven: “Will a community be able to depend on this?”

The questions being asked at each level are important to keep in mind, because they set priorities. Issues like “technical debt” are irrelevant distractions at stages 1 or 2, when the question is still whether this is even a useful approach. At stage 3, the biggest issues are building a community, identifying its needs, improving reliability, and navigating a path towards sustainability. Only at stage 4 do things like long-term maintainability and professional operations play a big role.

Not every effort will, or should, make it to the “proven infrastructure” readiness level. Most really innovative proofs of concept won’t pan out - they may just not work, or might be ahead of their time and require other components not ready yet. Learning and disseminating that information is useful. Many successful prototypes or products may be useful for generating new ideas, or useful for individual groups and their research, but be too specific to gain very widespread use; that’s still a success. Our job isn’t to get every tool to the infrastructure level, but to enable as much experimentation in the early stages as possible, and make sure that no unnecessary barriers stand in the way of tools’ evolution to wherever they’re going.

The other thing to notice about all of this is that technological readiness level applies not just to software, but to novel systems, new curated data resources - and increasingly, combinations of all the above. Research computing and data isn’t three silos - research software, research systems, and research data - and the highest impact tools we provide these days are combinations of two or all three. Science gateways, interactive exploration and analysis of provided datasets - these don’t fall into one of the three silos. The highest impact COVID computing efforts combined all of the above - rapid computation and analysis using novel tools of collected COVID data.

And it’s always been that way. Even though data, software and systems are converging ever closer in this software-defined and data-enabled era, it’s not new. Ever since the beginning, those providing research computing systems have also had to provide curated application software stacks, and the means for analyzing generated or input data. Research software developers have always had to consider the systems the code is to run on. Those curating data resources do so in order that the data can be explored and computed on.

For these integrated systems, keeping in mind the readiness level of the product is even more important. The complexity of the effort makes it even more important to keep in mind the key questions at each stage, and to get feedback and input. The wider the effort the easier it is to lose track of the big picture, and focus purely on internal technical criteria. It’s important to remember the question you’re trying to answer.

And now, the roundup…

Managing Teams

Let’s Discuss Attrition - Subbu Allamaraju

We’ve talked about people leaving before, and that post-pandemic such departures will rise. Allamaraju reminds us that this is good and healthy and not to freak out:

  • Attrition is natural
  • Reasons for attrition vary
  • Attrition is an output - you can only control the inputs

On my team we do a relatively good job of celebrating team members’ new opportunities when they move on. This is one of the things I think academia does right - recognizing that people moving on is just part of their career development. We pretty actively stay in touch - so much so that we may get our first boomerang team member this year if we play our cards right!

But Allamaraju makes another good point, and one which I don’t see called out very often:

Every exit gives you, the manager, an opportunity to improve the team structure and dynamics. Attrition loosens the team fabric and gives you a chance to reshape it. Don’t let go of that opportunity.

The departure of even a valued high-performing team member is an opportunity to grow the team in new directions.


The culture of process - Cate Huston

The defining transition between hobbyist and professional - between someone in research who codes a little or does a bit of sysadminning, and a team providing professional research computing and data services - is that you no longer focus only on mean quality but also on variance. You’re no longer trying to get merely good results, but consistently good results. That means, painful though it might be, introducing some process.

Huston has a few ways to think about process as, frankly, a fabric connecting the team members:

  • Process as an agreement about how we work.
  • Process as a way to answer questions.
  • Process as (documented) culture.

and this suggests a starting point for process - documenting how the team does things, ideally after some discussion, so that as new people join they can get up to speed quickly.

But as a team grows or its needs change, process can’t stay rigid - agreements shift, different questions get asked, culture evolves. New processes may need to be created, and old ones retired, ideally after discussion. Huston describes changes to processes in her own teams - growing a team’s hiring process, trying “process experiments” (and tossing out the ones that didn’t work), and building support for change. She also points out that any process change will get a mix of obliging and enthusiastic support, questioning, and out-of-hand disagreement. I don’t know how useful it is to think of that in terms of personality types, but accepting that any process change will provoke a range of reactions is certainly important.

Towards the end, Huston emphasizes that with process - as with technical decision logs! - it’s important to explain the why. That helps build agreement, and helps make it clear when it’s time for a process to be updated or retired.


9 metrics to help you understand (and prioritize) DEI - Kat Boogaard, Culture Amp

Every job posting in academic or academic-adjacent institutions has some boilerplate about the importance of equity, diversity, and inclusion. But research computing and data is worse than both academic research and tech when it comes to any kind of representation.

We’re scientists, and we fix things by collecting data and then performing experiments to see what makes the measurements improve. Boogaard suggests nine metrics to find out where you are and where you’re going, in three broad categories - the employee “lifecycle”, the employee experience while they’re with you, and company makeup.

Employee “lifecycle”

  1. Hiring - diversity of applicant pool and hiring panel
  2. Representation - finding out where you stand now
  3. Retention - are you attracting diverse talent only to preferentially lose everyone who isn’t a white man, because of a lack of support?
  4. Advancement - again, are you attracting diverse talent but only promoting some of the usual suspects?

Employee experience

  1. Job satisfaction and engagement - are there common threads between employees who aren’t happy and engaged?
  2. Employee resource group participation - if you have such resources, are they being underused?
  3. Accessibility - do all employees have the tools and resources they need?

Company makeup

  1. Leadership - does your leadership look like your organization?
  2. Suppliers - are EDI issues hiding in vendor relationships?

How Many People Can Someone Lead? - Pat Kua

Whenever I give my “Help! I’m a Research Computing Manager!” talk, this is a question I get. Kua’s answer here is 5–7 (I often say 7 ± 2), but it depends - and he gives a very helpful list of things it depends on. That includes the obvious (a more experienced team - individually, and working collectively as a team - needs a little less from a manager/lead, so you could lead more of them; similarly, a more experienced lead can lead a few more people).

But there are a couple of nice other points here. The basic idea is that the manager/lead has a finite time budget, and the more of it goes into other things, the fewer people they have the time to effectively manage. Some of the things he points to are:

  • If the level of institutional bureaucracy is high - and for many of us, it is - then more time is spent on paperwork and administrative tasks, and so there is less time available to effectively manage or lead people
  • Leadership scope/leadership roles - if you are responsible not just for the people but also for coordinating project delivery and for technical leadership, (a) that’s an exhausting role and (b) you’re not going to be able to lead as many people effectively. (Personal note: it looks like we’ll be getting a technical project manager shortly for our team and I can’t even tell you how excited I am about it).

Managing Your Own Career

Ok. So, You Can’t Decide - Michael Lopp

Sometimes you just can’t decide. Lopp’s experience is that this is from one of two causes:

  • There’s something you know in the back of your mind that you need to find out first, or:
  • You just need to suck it up, take the plunge, and decide

I think those of us who came up from academia are more often stuck in the second case than the first. There’s always one more thing you can read! But at some point you just need to decide.


Product Management and Working with Research Communities

Making World-class Docs Takes Effort - Daniel Stenberg

Documentation is incredibly important for a product’s adoption and use - whether the product is software, a data resource, a system, or (increasingly) a combination of the three. It takes a lot of work, but that work pays off later with more adoption and less support effort per user.

Stenberg highlights what he’s found to be important for documentation; it should be:

  • Stored with the code, for convenience and so updates stay in sync, but
  • Not generated from the code;
  • Filled with as many examples as possible;
  • Covering every interface;
  • Easy to access and browse; and
  • Easy to contribute to.

Stenberg then describes the work he’s doing to improve the documentation of curl/libcurl, including constantly iterating, adding examples, and building simple automated checks for consistency and for keeping cross-references current.
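As a flavour of what such checks might look like - a hypothetical Python sketch, assuming Markdown docs under a docs/ directory with relative links like [text](other-page.md); this is not curl’s actual tooling:

```python
import re
import sys
from pathlib import Path

DOCS = Path("docs")                                    # assumed docs layout
LINK = re.compile(r"\]\(([^)#]+\.md)(?:#[^)]*)?\)")    # matches [text](page.md#anchor)

broken = []
for page in DOCS.rglob("*.md"):
    for target in LINK.findall(page.read_text(encoding="utf-8")):
        if not (page.parent / target).exists():        # resolve relative to the page
            broken.append(f"{page}: broken cross-reference -> {target}")

print("\n".join(broken) or "all cross-references resolve")
sys.exit(1 if broken else 0)                           # non-zero exit so CI fails on stale links
```

Run in CI on every change, even something this small keeps internal links from silently rotting.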


Research Software Development

Demo-driven development - Jade Rubick

From earlier in the year - Rubick describes his method for introducing stories and planning into software development: start with routine demos, back from that into user stories by structuring the plan for the next demo, and from there move out into routine planning. What’s nice about this is that it keeps the important thing - always be delivering something useful - in the forefront.


Paranoid Scientist - Maxwell Shinn

Research software is subtle, and correctness issues can be fiendishly tricky to track down. I’ve spoken fondly of languages like Ada (e.g. #64), with contracts at function level, and formal methods (e.g. #46) for proofs that a routine correctly implements a function. Having such functionality available - preferably in a way that can be turned on and off and implemented incrementally, focussing on the most subtle and error prone parts of the code - could be extremely useful in research software development.

Shinn’s Paranoid Scientist (GitHub, preprint, tutorial) is a Python package that lets the developer write pre- and post-condition contracts for functions, enable or disable checking them at runtime (at about a 2x runtime penalty when enabled), and run a property-based test suite (e.g. #56) which automatically generates unit test cases against the post-conditions.
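To make the contract idea concrete, here’s a minimal plain-Python sketch of pre-/post-conditions that can be toggled at runtime - it illustrates the concept only, not Paranoid Scientist’s actual decorator API (see the tutorial for that):

```python
import functools

CHECKS_ENABLED = True   # flip to False to drop the runtime cost in production

def contract(pre=None, post=None):
    """Attach a precondition on the arguments and a postcondition on the result."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if CHECKS_ENABLED and pre is not None:
                assert pre(*args, **kwargs), f"precondition failed in {fn.__name__}"
            result = fn(*args, **kwargs)
            if CHECKS_ENABLED and post is not None:
                assert post(result, *args, **kwargs), f"postcondition failed in {fn.__name__}"
            return result
        return wrapper
    return decorate

@contract(pre=lambda xs: len(xs) > 0,
          post=lambda r, xs: min(xs) <= r <= max(xs))   # a mean must lie within the data
def mean(xs):
    return sum(xs) / len(xs)

print(mean([1.0, 2.0, 3.0]))   # 2.0, with both contracts checked
```

A property-based tester can then hammer functions like these with generated inputs and treat any contract violation as a failing test.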


Why Every Software Engineering Interview Should Include Ops Questions - Charity Majors
Are Dockerfiles good enough? - Mathew Duggan

As I mentioned at the start, the line between “software” and “systems” is as fuzzy as it’s ever been. In any area of research software, a developer needs to think about how their software will be deployed, operated, updated, and maintained, even if it’s not hosted software. And if it is going to be hosted software, then operability is at least as important as code quality in terms of its impact on the team.

It’s not unusual to ask scripting/software questions of sysadmin/operations candidates, and Majors pushes us to include operations questions when interviewing software developer candidates. Some are particularly relevant even for non-hosted software:

  • “What are some of your favorite tools for visibility, instrumentation, and debugging?”
  • “What’s your favorite API (or database, or language) and why?” followed up by “… and what are the worst things about it?”

In the second piece, Duggan points out that Dockerfiles - not containers, not even Docker, but Dockerfiles - are written by and for developers, and so tend to make assumptions that make sense to developers but cause operations teams unnecessary pain. Some common issues Duggan calls out are:

  • Reproducibility issues:
    • FROM [whatever]:latest
    • apt-get upgrade
    • Embedded brittle shell scripts
  • Build performance issues:
    • Killing caching by copying/adding in local code before installing dependencies
  • Security issues
    • Secrets in ENV
    • Installing random software from who-knows where
  • Over-optimization:
    • Obsessing over container size rather than using base images the ops team knows
    • Favoring terse, overly clever code over readability

Duggan recommends combining Dockerfiles with reproducible tools like Ansible (rather than bash), using Packer, or at the very least running a Dockerfile linter like hadolint.
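As a rough flavour of what mechanical checking catches - a toy Python sketch (not hadolint, and assuming a Dockerfile in the current directory) that flags a few of the issues above:

```python
import re
from pathlib import Path

dockerfile = Path("Dockerfile").read_text(encoding="utf-8")   # assumed location
warnings = []

for n, line in enumerate(dockerfile.splitlines(), start=1):
    stripped = line.strip()
    if re.match(r"FROM\s+\S+:latest\b", stripped, re.IGNORECASE):
        warnings.append(f"line {n}: unpinned ':latest' base image hurts reproducibility")
    if "apt-get upgrade" in stripped:
        warnings.append(f"line {n}: 'apt-get upgrade' makes the image depend on build date")
    if re.match(r"ENV\s", stripped) and re.search(r"(PASSWORD|SECRET|TOKEN|KEY)", stripped, re.IGNORECASE):
        warnings.append(f"line {n}: possible secret baked into the image via ENV")

print("\n".join(warnings) or "no obvious issues found")
```

A real linter does far more - it understands multi-stage builds, caching order, and so on - which is why reaching for hadolint is the easy win.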


Research Data Management and Analysis

When Less is More: Database Connection Scaling - Richard Yen
10 Cool SQL Optimisations That do not Depend on the Cost Model - Lukas Eder

A few performance issues to keep in mind with relational databases.

In the first, Yen reminds us that allowing large numbers of connections - even if they’re idle, or even if they’re never used! - can hurt PostgreSQL (and many other databases’) performance. The connection table entries take memory, and they have to be monitored periodically, which takes time linear in the number of connections.

(Figure from Yen’s post: “select 1” throughput - transactions per second, higher is better - versus the number of idle connections. Performance drops steadily as idle connections increase, out to at least 15,000 or so.)
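The usual mitigation - my note, not from Yen’s post - is to put a hard bound on connections with pooling, either server-side with something like pgbouncer or in the application. A minimal application-side sketch with psycopg2, using a placeholder DSN:

```python
from psycopg2 import pool

# Hypothetical DSN for illustration; a server-side pooler like pgbouncer
# achieves the same bound across many application instances.
db_pool = pool.ThreadedConnectionPool(
    minconn=1, maxconn=10,      # hard cap on connections held open to the server
    dsn="dbname=research user=analyst host=localhost",
)

conn = db_pool.getconn()
try:
    with conn.cursor() as cur:
        cur.execute("SELECT 1")
        print(cur.fetchone())
finally:
    db_pool.putconn(conn)       # return it to the pool rather than leaving it idle

db_pool.closeall()
```

A handful of pooled connections doing real work will generally beat thousands of mostly-idle ones.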

In the second, Eder gives a master class in SQL and areas where optimizations can always be made:

  • Transitive Closure
  • Impossible Predicates and Unneeded Table Accesses
  • JOIN Elimination
  • Removing “Silly” Predicates
  • Projections in EXISTS Subqueries
  • Predicate Merging
  • Provably Empty Sets
  • CHECK Constraints
  • Unneeded Self JOIN
  • Predicate Pushdown

SQL engines are very good at automatically applying many, but not all, of these optimizations statically - particularly those involving joins, CHECK constraints, or provably empty sets - so knowing which ones your engine handles tells you where rewriting a query by hand is worth the effort. It’s also just a nice deep dive into practical relational algebra.
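As a toy illustration of the first of those (my example, using Python’s built-in sqlite3, not taken from Eder’s post): transitive closure means a join condition plus an equality predicate implies the same predicate on the other table, so the filter can be applied before the join:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders   (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customer VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders   VALUES (10, 1, 99.0), (11, 2, 25.0);
""")

# Only customer.id is constrained here ...
q1 = """SELECT o.id, o.total
        FROM customer c JOIN orders o ON o.customer_id = c.id
        WHERE c.id = 1"""

# ... but o.customer_id = c.id AND c.id = 1 implies o.customer_id = 1,
# so the engine (or you) can filter orders before ever touching the join.
q2 = """SELECT o.id, o.total
        FROM customer c JOIN orders o ON o.customer_id = c.id
        WHERE c.id = 1 AND o.customer_id = 1"""

assert con.execute(q1).fetchall() == con.execute(q2).fetchall()   # same rows either way
```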


Facebook’s Golang Object Relational Mapper Moves to the Linux Foundation - Mike Melanson, the New Stack

New ORMs have been getting steadily more sophisticated. Here is a quick brief on Ent, Facebook’s Golang ORM that has just been contributed to the Linux Foundation. It relies on code generation (though will rely much less on that shortly, with Golang generics becoming standard), but is more notable for having quite sophisticated semantic relationships between entity types, and for having graph concepts built in so that one can implement simple (presumably not too deep) graph analyses on top of relational databases.


Research Computing Systems

The CXL Roadmap Opens Up the Memory Hierarchy - Timothy Prickett Morgan, The Next Platform

Morgan gives a state-of-play of CXL 2.0, a switched-fabric, cache-coherent interconnect running over PCIe 5.0 and later, with sub-protocols for I/O, cache, and memory - cache and memory optimized for latencies comparable to larger caches, and I/O (the closest to native PCIe) targeted more at bandwidth. It will be competing with, amongst other things, IBM’s OpenCAPI.

(Figure: a slide from Debendra Das Sharma’s talk at Hot Interconnects 28, which Morgan describes in his article, giving an overview of CXL’s approach.)

Where I’m most excited about this is in allowing a common interface to a deeper memory hierarchy, including off-node memory and devices with different bandwidth/latency tradeoffs. The disaggregated, composable-infrastructure stuff is cool too, but that’s mainly an efficiency play. I’ve been dusting off out-of-core algorithm papers since NVMe started playing a big role, and I’m more interested in the range of new capabilities this could allow.


Emerging Technologies and Practices

Introducing AWS ParallelCluster 3 - Brendan Bouffler & Rye Robinson
CFD Direct From The Cloud v9 Released - CFD Direct

A couple of interesting HPC-in-the-cloud items:

AWS ParallelCluster 3 is a complete rearchitecting of AWS’s HPC-cluster-as-a-service tool. I haven’t tried it yet, but ParallelCluster 2 had a surprising number of sharp edges, so I’m anxious to try this one. The big outwardly-visible changes are a new API and an updated CLI for much more flexible control; a tool for building custom VM images; much improved network configuration; better permissioning; and runtime customization scripts for compute nodes.

I’ve seen a number of “HPC cluster in the cloud” projects which are Terraform scripts or the like that faithfully implement an on-prem static cluster on more expensive, software-defined infrastructure - and, not to put too fine a point on it, it’s utter lunacy. On-prem clusters are designed that way for a reason; the design has trade-offs, and those trade-offs only make sense when you have a lot of static hardware you have to use and very little automation available to you. We all know - or quickly learn - that having a single massive globally-shared parallel filesystem that everyone uses for both active and near-line storage is a terrible idea, one accepted grudgingly only because it’s all that’s feasible on-prem. Recreating that in the cloud is absurd. Tools like ParallelCluster let you spin up what’s needed per job; you can have a cluster per user, per group, or per larger organization but with filesystems per job class, and with data staged in and out.

CFD Direct’s OpenFOAM environment is really nice to see as a step beyond that: building an application on top of similar infrastructure, with a specialized interactive application that runs directly on the cloud infrastructure over a high-performance remote desktop. Whatever the future of routine (non-hero-calculation) HPC looks like, in the cloud or on-prem, it’s going to look a lot more like this than like Slurm scripts.


Calls for Submissions

Applications Open for the 2022 Better Scientific Software (BSSw) Fellowship Program - Webinar 13 Sept, Application Due 30 Sept

Each 2022 BSSw Fellow will receive up to $25,000 for an activity that promotes better scientific software. Activities can include organizing a workshop, preparing a tutorial, or creating content to engage the scientific software community


SC22 Planning Committee Applications - Applications due 30 Sept

If you’re interested in being part of the planning committee for Supercomputing ’22, apply here by the end of the month.


19th Annual Workshop on Charm++ and Its Applications - Abstracts due 30 Sept, Virtual Workshop 18-19 Oct

From the call:

Authors are invited to submit abstracts or short papers describing research and development in the broad themes of parallel processing emphasized by Charm++, AMPI, and Charm4Py including runtime adaptivity, message-driven execution, automated resource management, and application development.


International Symposium on Computer Architecture (ISCA) 2022 - June 11-15, 2022, New York, NY, Abstracts Due 16 Nov, Papers due 23 Nov

Paper topics of interest to research computing teams include processor, memory, and storage systems architecture; datacenter-scale computing; and evaluation and measurement of real computing systems.


9th European Conference on Service-Oriented and Cloud Computing (ESOCC 2022) - 22-24 March, Lutherstadt Wittenberg, Germany, Papers due 31 Oct (research papers) & 14 Jan (Projects, PhD Symposium, Industrial)

The European Conference on Service-Oriented and Cloud Computing (ESOCC) is the premier conference on advances in the state of the art and practice of service-oriented computing and cloud computing in Europe. The main objectives of this conference are to facilitate the exchange between researchers and practitioners in the areas of service-oriented computing and cloud computing, as well as to explore the new trends in those areas and foster future collaborations in Europe and beyond.


Events: Conferences, Training

SciDataCon - 18-28 October, Virtual, registration (per-session) free

Organized by CODATA and WDS, it includes talks on the following topics:

  • Implementing and Assessing FAIR
  • Data Skills
  • Data Stewardship
  • Data Centres and Repositories
  • Research Infrastructures
  • Global Open Science Clouds, Platforms and Commons
  • Data for the Sustainable Development Goals and Disaster Risk Reduction
  • Data in Cross-Domain Research
  • Data in the Earth Sciences

Random

Running X11 GUI apps in docker containers.

Update on what’s coming in C23, including digit separators (const int a = 1'234'567), else-if preprocessor directives (#elifdef, #elifndef), and better integer types, including a fundamental type for N-bit integers.

This is interesting - a company called BuildJet is offering drop-in replacement GitHub Actions runners on bigger instances at lower cost. I have no idea who they are or whether their offering is any good, but all things being equal, competition there is good for us.

Writing a PC game that fits entirely in the boot sector.

Speeding up builds of Rust projects.

Amazon’s EKS Anywhere is a complete open-source distribution of Kubernetes and management tooling, allowing an AWS-consistent management experience for on-prem Kubernetes deployments.