Even though the newsletter won’t start in earnest until mid-February or so (January is packed with deadlines), I’ve decided to start sending the link roundups. There’s a lot of interesting stuff going on in the world of R&D computing and teams, and neither letting it all go by un-heralded nor sending you one huge link dump full of then-stale information in February seemed right.
So I’ll start the Friday link roundups now; this will give you a chance to send me feedback about what you find interesting and what you don’t, and me a chance to get used to sending the newsletter, before things start in earnest next month. Also, if you don’t mind, I may take the opportunity to experiment on you over the next couple of weeks with a couple of different newsletter emailing services (I’m currently choosing between the stripped-down Buttondown and the less configurable but fancier Revue with nicer authoring options. This one is sent with Buttondown.)
Let’s get started!
Dask is, IMHO, one of the most promising distributed and parallel dataflow-style technical computing frameworks being worked on currently. I find it compelling compared to say, Spark, because it scales down well (letting you do meaningful work on your laptop); that plus its integration with the python scientific computing ecosystem means that it’s relatively straightforward to go from a single core or node prototype to distributed memory computing at some scale. The scale to which you can meaningfully use Dask might grow with its adoption of UCX for communications on high-speed networks.
In the blogpost linked above, Matthew explains the history of the funding of the Dask project and his reasoning for taking Dask independent, rather than having its development continue to be funded by individual corporations. There’s lots of examples of this being quite successful, and some of it being less successful; the details will matter quite a bit. He’s promised followup blog posts going into the details.
Check Out Podman, Red Hat’s daemon-less Docker Alternative, The New Stack
I don’t love CentOS for R&D computing clusters, but a lot of us are stuck with it. RedHat’s purchase of CoreOS two years ago brought the hope of more container tooling in RHEL; and now CentOS 8 has Podman, which is starting to gain attention even in pretty mainstream tech publications like the New Stack. Podman’s daemon-less model may well mean it gets easier for researchers and R&D developers to run containerized tools on clusters run by container-skeptical centres.
This blog post by Andraeas Müller connects two facts that I think most of us in R&D computing are pretty familiar with - one that we talk a lot about and one that we don’t - and extrapolates to a conclusion that I’m not sure I agree with but is certainly worth discussing.
The fact that we talk about regularly is that ongoing maintenance of important research software (and key open-source software in general) is famously underfunded, and this blog post points to some really useful references on this topic that I hadn’t seen before. The fact we don’t is that when grants are available they are often for new software development efforts, and a huge fraction of those result in packages that are almost instantly abandoned when the grant ends.
The argument (which I’m not sure even the author agrees with 100%) is that funding shouldn’t be extended to packages that don’t exist, as it suggests a need so small that no one has “scratched their own itch” to even put together a prototype; and that once such a prototype exists, if it’s any good at all it’ll start attracting other users. This would suggest basically holding off on stand-alone research software development funding until it was roughly half-way through a “R&D software maturity ladder” I wrote about once (shameless plug).
I think there probably is a place for seed funding to get a project started - not every type of R&D software effort is the sort of thing that someone be meaningfully prototyped by a single person meeting their own needs - but we know that R&D software funding needs to change, and we should take this argument seriously as we change it.
(h/t CCC Blog) A wide-ranging new NSF program, covering everything from compilers and systems to algorithms and privacy considerations. “Importantly, as described below, PPoSS specifically seeks to fund projects that span the entire hardware/software stack.” First deadline, March 30, is for $250k/1yr planning grants.
A good recent blog post on the pros and cons of different approaches to software citation by Daniel S. Katz, who’s thought about this a lot. Some key points: any method is going to take extra work by someone; there may not be a one-size-fits all approach; and in the end, code just isn’t the same as a paper (amongst other things, there’s no one point at which it’s done). Daniel ends the post leaning tentatively towards recommending Software Heritage IDs.
A lot of organizations are setting up data science teams - these teams are analogous to other sorts of R&D computing teams, but these efforts are (a) everywhere and (b) involve both a lot more money on the line and a lot less institutional inertia than is the case when we’re setting up or managing an HPC centre.
While the answers some organizations come up with may not be necessarily widely applicable elsewhere, a lot of the questions being asked really are. The second link asks some big picture questions (but doesn’t really get into their consequences, which is a sham) - centralized vs distributed? Where organizationally? Reporting to whom? What’s the role?
The first link, a data project (and team) checklist, gets finer grained. How is work for [R&D computing team members] selected and allocated? What skills exist in the org currently?
What are checklists you use before starting an R&D computing project — what sorts of gotchas should people be looking out for?
A bit of a “files are bad, SQLite is surprisingly good” theme this week:
Files are Fraught with Peril, Dan Luu
In R&D computing we often rely on files in cases where simple databases would be easier and more robust. This is a nice reminder that despite how well file systems generally work, they are full of prickly edge-cases and even today, “POSIX” filesystems are prone to undefined behaviour in the case of crashes (loss of power, or when even when jobs are suddenly killed on a cluster).
Just what it says on the box - trying to coordinate file access between processes is tricky; that’s what databases are for, and even SQLite works really nicely here.
How I teach database design, Daniel Lemire
As microservices are becoming more important, the consequences of database design decisions are becoming more explicitly limited in scope. Instead, APIs are the interface boundaries and their definitions (and changes) can have larger impact. In R&D computing we too often focus on internal data models and schemas when we could more usefully be thinking in term of APIs; or worse, thinking in terms of file formats when we could be thinking in terms of schemas or APIs (I’m looking at you, bioinformatics).