#175 - 4 Feb 2024

Digital Humanities’ LLM Edge? Plus: EOLing a Product; Product vs Project; Directing Contributors Where Needed; Commercial Products and Reproducibility; Federated Analysis Primer; And the Job Board Returns, Maybe?


Happy 2024!

RCT took a bit of an unplanned long holiday break, due to life intervening; all’s well now, and I’m looking forward to the year ahead. The next few issues might be a bit short as I get back into the swing of things; and there’ll be some modest changes (I’ll be slowly bringing back the jobs board, which I’m excited about!)

As always, please email me (hit reply, or email [email protected]) if you have any suggestions, thoughts, or comments. I have a bit of a backlog of emails to reply to, but I read everything and will reply! Or you can always set up a 15-minute chat with me to give me your comments.


Here’s something I’d like to hear from the community about, by email or chat.

This is anecdotal so far, but I’m starting to see a pattern of anecdotes that suggests to me that teams with significant digital humanities skills are getting researchers up to speed faster in using existing LLM tooling for research. Are you seeing this, too?

So when I think about LLMs specifically in research, I’m thinking of three broad categories:

  1. Research in AI or LLM methods
    • Can be super inside-baseball stuff like quantizations, architectures, alternatives to attention…
    • Often but not necessarily driven by novel application areas or data types
  2. Research about AI or LLMs
    • Ethics of data collection, philosophy, how LLMs will affect jobs
  3. Research with AI or LLMs
    • Using existing methods and services to accelerate current research projects or make new projects feasible
    • Analyzing large amounts of text, videos, images
    • Writing, summarizing
    • Speeding code development…

I’ve seen several times now that research computing, research software, or research data groups with digital humanities expertise seem to be doing a better, faster job of onboarding researchers to productive uses of LLMs for their research (things like feeding papers or job postings into LLMs for efficient and rapid analysis).

In particular, I’m thinking about work like this paper, “Measuring remote work using a Large Language Model (LLM)”, which was written up in HBR both for its results (significant inequality in who gets to work remotely - it’s disproportionately higher-paid jobs with higher educational requirements) and for its methodology. Basically, the work used LLMs and humans to categorize the jobs described in 10,000 job ads, and cross-checked the results.
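(For the curious: the pattern itself is simple. Here’s a minimal sketch - my illustration, not the paper’s actual pipeline - assuming the OpenAI Python client; the model name, prompt, and labels are all placeholders.)

```python
# A minimal sketch of the general pattern, not the paper's actual pipeline.
# Assumes the OpenAI Python client; model, prompt, and labels are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify_remote(ad_text: str) -> str:
    """Ask the LLM to label one job ad as remote, hybrid, or onsite."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable chat model would do
        messages=[
            {"role": "system",
             "content": "Classify this job ad as exactly one of: "
                        "remote, hybrid, onsite. Reply with one word."},
            {"role": "user", "content": ad_text},
        ],
        temperature=0,  # keep labels as stable as possible for cross-checking
    )
    return response.choices[0].message.content.strip().lower()

job_ads = [
    "Senior data engineer; fully remote within Canada.",
    "Lab technician; on-site at our Boston facility, M-F.",
]
labels = [classify_remote(ad) for ad in job_ads]
```

The methodologically interesting part is what comes after: cross-checking a sample of the LLM’s labels against human coders before trusting the rest.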

While there’s a lot of emphasis and interest in our community right now on category 1, supporting research in AI or LLM methods (like developing new deep learning methods for physics problems), I haven’t seen as much emphasis on the more prosaic and quotidian use of already-existing (and often free or cheap) LLM services to get research done - the low-hanging fruit of category 3.

But our teams’ mission is to increase scientific impact as much as possible given the resources available, and I get the sense that we should be thinking about this more.

Staff with digital humanities experience seem to be leaping into action for this kind of work more quickly than more traditional scientific computing teams, and in many cases they seem to be really kickstarting some of these efforts. Part of this may just be being able to speak the language of the social sciences and humanities fluently, since that’s where a lot of this work is happening (analyzing textual data is more common there). Part might have something to do with existing experience with, and intuition for, NLP work and web services.

And of course, part might just be an artifact of small number statistics of the sorts of conversations I’m having these days.

I’d be very interested in hearing from you - are your teams supporting category-three LLM-powered efforts right now, and are digital humanities staff involved? Let me know!

With that, on to the roundup.

Managing Teams

Over at Manager, PhD, in the previous issue I covered articles on:

  • The importance of requests being clear, even small ones
  • Making hard decisions easier (and more clearly communicated) by explicitly choosing A even over other good things

Product Management and Working with Research Communities

Discontinuing a Research Software Project - Mark C. Miller, Better Scientific Software

The content of this article is perfectly useful and sensible, well worth reading, but the framing and word choice might as well have been grown in a lab to provoke me into a rant.

So yes, content first. Miller tells us we have to get started on the ending early, so that everything can be closed out as usefully as possible and all the good that can come out of the work actually does. He outlines a good set of steps for closing out a research software product:

  • Make an end-of-project release
  • Open-source the code, if not done already
  • Document final status
  • Gather lessons learned
  • Establish an enduring on-line presence with all this material
  • Present or publish - all that final documentation will help for this
  • Refactor critical dependencies if some pieces of the software product would be independently useful
  • Archive the software

These are all great. In particular, “lessons learned” are really useful to the community, and the idea of pulling out independent pieces for possible separate use (and other options for sustained development) is a good one I don’t see mentioned often.

Ok, now for the other part.

Look, as a community we need to get our stuff together about the very basic concepts that underlie the work we do and how we do it.

Someone’s going to email me saying I’m harping on something that’s “just semantics” - let’s head that right off here. Anyone who tells you that semantics are trivial is someone you shouldn’t listen to. We cannot communicate meaningfully, and we have a hard time even thinking clearly, without well-defined concepts and clear names for them.

Projects are how we organize doing our work. The work we do is to deliver products and services.

This matters.

All projects end. Projects are by definition finite bundles of work with a predetermined end. In a software context, a project might be a sprint or the addition of a new set of functionality, or a rewrite, or even a new set of documentation and examples. We’re going to do XYZ, and then we’ll be finished, and that’s the project. Fin.

The software itself is a product that we offer to the community. Or we offer services around the software, or around the creation of the software. (Or systems, or data analyses, or…) They involve ongoing work and operations, they require discovery and implementation and delivery and feedback loops. There’s no predefined end to a product, or our offering of a service.

Somehow in our community it’s seen as too tawdry, or common, or transactional, to say out loud that we provide products and services to research groups. The words echo discordantly in the hallowed and monastic halls of academia. But products and services are what we do, we’re good at it, and the products and services we provide enormously accelerate scientific progress.

If “product” is some dirty word that we can’t mention even amongst ourselves, we cut ourselves off from lots of perfectly good literature and books and blog posts about, for instance, product life cycles and how to wind down a product. If we refuse to think rigorously about services, thinking it’s too “commercial”, we’re going to struggle to explain what we do or to have the impact on science that our team could have.

Our funders have some stuff to answer for here. Too many of our funding mechanisms are still really set up for funding small collections of experiments, which are inherently projects. So we have to kind of frame the work we do - which is really about programmes and operations and products - as if it were a discrete set of projects.

But look, just because we have to twist words to explain our work to funders in ways they can understand and fund doesn’t mean we have to keep using that warped diction when we’re talking amongst ourselves. We offer products and services, and we do the work necessary to deliver those products and services by bundling tasks into projects.


Research Software Development

Directing new contributors to areas of need - Ben Cotton

It should go without saying, but for those of us who maintain open source code and are open to external contributions, it makes sense to make it as easy as possible for contributors to add the stuff you want added! Cotton suggests going beyond (say) Hacktoberfest recommendations by spelling out directions you want the code to go, non-code contributions you’d like (social media, specific documentation, examples), and keeping track of contributors and their skills so you can ask them for specific contributions.


The low-hanging fruit in computational reproducibility - Konrad Hinsen

Hinsen opens with an eye-catcher:

Here’s a provocative proposition: we can solve computational reproducibility for a big majority of those 92% of researchers by buying them a license for Mathematica.

And I think he puts his finger squarely on why this seems so provocative:

All the activities around software in Open Science are organized by and for people who work in computational science, […] On the other hand, most of the 92% of researchers who depend on software do computer-aided research but not computational science. Their main tools are instruments or mathematical theories. They use computers as auxiliary tools, mostly for routine data analysis tasks. […] The people who contribute to Open Source projects for scientific software have overall the same profile as the participants of software-for-open-science events. [LJD: and so are not representative of most users’ needs]

I’ve written here before about how under-discussed commercial options or fee-for-service semi-public options are in our community, and this is a thought-provoking discussion starter on the topic, specifically in the context of reproducibility.


Some Python news - you may have already seen this, but there’s now an experimental JIT compiler in Python 3.13. The initial impact is modest, but this is an important start.


Research Data Management and Analysis

Federated Analysis for Privacy-Preserving Data Sharing: A Technical and Legal Primer - James Casaletto, Alexander Bernier, Robyn McDougall, and Melissa S. Cline, Annual Review of Genomics and Human Genetics

As more of our teams are being asked to work with sensitive data, federating data becomes an interesting way of gaining the analytics benefits of pooling data without actually pooling data. But it’s not a panacea, and there’s a lot to consider. This is a good review in the context of health data (and genomics in particular) but the basic concepts and concerns are explained well here and are quite broadly applicable.
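To make the core idea concrete, here’s a toy sketch of the pattern - my illustration, not from the paper: each site computes aggregates locally, and only those aggregates, never the raw, sensitive records, cross the site boundary.

```python
# A toy illustration of the federated idea: each site computes a local
# aggregate inside its own security boundary, and only the aggregates
# (never the raw, sensitive records) are sent to the coordinator.
from typing import List, Tuple

def local_summary(values: List[float]) -> Tuple[float, int]:
    """Runs at each data-holding site; exports only (sum, count)."""
    return sum(values), len(values)

def pooled_mean(summaries: List[Tuple[float, int]]) -> float:
    """Runs at the coordinator; combines per-site aggregates."""
    total = sum(s for s, _ in summaries)
    count = sum(n for _, n in summaries)
    return total / count

# Three hospitals, say, each holding its patients' measurements locally:
site_summaries = [
    local_summary([5.1, 4.8, 6.0]),
    local_summary([5.5, 5.2]),
    local_summary([4.9, 5.0, 5.3, 5.1]),
]
print(pooled_mean(site_summaries))  # same result as pooling the raw data
```

Real systems layer a lot on top of this - secure aggregation, differential-privacy noise, governance agreements - not least because even simple aggregates can leak information when counts are small; that’s part of why it’s not a panacea.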


Ten simple rules for organizations to support research data sharing - Champieux et al, PLOS Comp Bio

If you build it - and do nothing else - they are absolutely not going to come. Building technical data sharing infrastructure is by far the easy part. The authors here talk about a range of other considerations to make sure that data sharing infrastructure is used, including:

  • Understand who will be using this, and why
  • Do communications work with those desired users well ahead of time
  • Make it easy by providing workflows and procedures
  • Get governance right from the beginning
  • Incorporate data ethics best practices
  • Lots of education and communication

Emerging Technologies and Practices

There have been a lot of quantum announcements since I last wrote you - the Finnish Quantum Flagship project, and partnerships between Pawsey and QuEra for quantum computing, MILA and Pasqal for quantum AI, and US Geological Survey and Q-CTRL for quantum sensing and quantum computing. Also, STFC Hartree is now a maintainer for a part of IBM’s Qiskit and LANL is collaborating with D-Wave. Besides just the slow but steady rise in interest in quantum computing despite everything else going on, these private-public partnerships seem to be increasingly common for quantum in a way I don’t remember them being for more general computing or even for AI or the Big Data that preceded it.


Random

Yes, you can define PDF pages larger than Germany.

Generating random numbers by hand, if you want to, which you shouldn’t.

Writing a TUI in bash, if you want to, which you shouldn’t.

SSH implemented over HTTP/3, if you want to, which maybe you shouldn’t?

I missed this when it came out - Google created a model for reading data off of graphs, and my current employer has a playground for using it for free (there was a Hugging Face one too, but it doesn’t seem to be working properly as I write this). A lot better than those plot-digitizer draw-dots-by-hand approaches I’m used to.

For traditional numerical computing folks who are thinking of getting into LLMs in some way, I now usually recommend starting from the outside in: working on the tooling needed to use them effectively in (say) information retrieval rather than going straight into neural architectures or something. There have been a couple of really good review papers on vector databases and vector retrieval generally that are worth reading; that area is probably pretty fruitful and accessible for many of us (especially if you’ve worked on particle methods!), and has applications broader than just LLMs.
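To make that concrete, here’s what the core retrieval step boils down to - a deliberately naive sketch of my own, not from any of those review papers:

```python
# The kernel of vector retrieval: brute-force nearest-neighbour search by
# cosine similarity. Real systems replace this exhaustive computation with
# an approximate index (HNSW, IVF, ...), but the interface is essentially this.
import numpy as np

def top_k(query: np.ndarray, corpus: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k corpus vectors most similar to the query."""
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = c @ q  # normalized dot product == cosine similarity
    return np.argsort(scores)[::-1][:k]

# Toy stand-ins for document embeddings from whatever embedding model you use.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 384))  # 1000 documents, 384-dim embeddings
query = rng.normal(size=384)
print(top_k(query, corpus, k=3))
```

If you’ve ever built neighbour lists or tree codes for particle simulations, both this problem and the data structures that accelerate it will feel very familiar.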

Relatedly, here’s a free short course on LLMops, some of the content is also more applicable to MLops generally.

If you do want to tackle LLMs directly, this material is well thought of.

Replacing suid binaries with SSH over unix sockets.

Checkpoint/restore for using AWS spot instances for HPC.

SQL:2023 is finished, with a number of small changes that clean up various pieces of the standard.


That’s it…

And that’s it for another week. Let me know what you thought, or if you have anything you’d like to share about the newsletter or management. Just email me or reply to this newsletter if you get it in your inbox.

Have a great weekend, and good luck in the coming week with your research computing team,

Jonathan

About This Newsletter

Research computing - the intertwined streams of software development, systems, data management and analysis - is much more than technology. It’s teams, it’s communities, it’s product management - it’s people. It’s also one of the most important ways we can be supporting science, scholarship, and R&D today.

So research computing teams are too important to research to be managed poorly. But no one teaches us how to be effective managers and leaders in academia. We have an advantage, though - working in research collaborations has taught us the advanced management skills, but not the basics.

This newsletter focusses on providing new and experienced research computing and data managers the tools they need to be good managers without the stress, and to help their teams achieve great results and grow their careers.


Jobs Leading Research Computing Teams

This week’s new-listing highlights are in the email edition; the full listing of 17 jobs is, as ever, available on the job board.