#166 - 30 Apr 2023

Case Discussion: a key team member leaves. Plus: Making tools easy to use; Using ChatGPT for technical writing; RSE-AUNZ’s strategy process; ACM CHI 2023 highlights; C4DC data sharing agreement library; Arrow and Apache Superset for data dashboards; Front-line data science; Learning from near misses

Here’s the first in what will be a recurring series of case study/tabletop exercises: the team member a manager most relies on announces they’re leaving.

In this series, I’m going to propose situations similar to some I’ve seen (heavily reworked and changed), and ask you for your input: What would you do/have you done? What other things should be considered? And, after the situation plays out, what would you suggest the manager or leader do in the future to make things better?

Cover image - How would you advise your peer? A key data science team member announces she’s taken another job; the manager has two weeks before she’s gone, and comes to you for advice.

Daniel Miller runs a central Research Data Science team at the University of West Northshire (all names and places entirely fictional, the situation only slightly less so). After a long stretch during which potential clients had barely heard of the team, much less engaged with it and brought their problems, they forged collaborations with the stats help desk and the HPC centre to mutually refer promising projects, and now things are getting pleasantly busy. The team hasn’t had to hire past its initial cohort from a couple of years ago, but is now considering growing, with at least some student workers. Currently, Miller spends most of his time building and maintaining relationships with stakeholders, hoping to secure additional funding.

For operations, Miller has leaned heavily on his most experienced team member, Olivia Johnson, who has a breadth of technical and operational knowledge. Johnson acted as an unofficial team lead, doing mentoring and project management in addition to the more subtle pieces of data science work, and was widely respected inside and outside the team for being innovative, collaborative, and reliable.

So Miller was shocked when, at an impromptu check-in she requested, Johnson said that she had accepted an offer from industry to lead a new data science team and would be leaving in two weeks. Miller had no idea that she was unhappy or looking for other opportunities. He felt betrayed and angry, but also worried and anxious. There were a number of projects in the pipeline, and some in progress, that he had been expecting Johnson would be playing an important part in. There were also some training events coming up that were important for visibility for the still-establishing-itself team.

So Miller suddenly has a bunch of urgent new things on his plate:

  • How does he replace Johnson? Can he?
  • What happens with the projects and trainings?
  • What happens to the other team members? Are they going to get mentoring? Are they thinking of leaving?
  • What should he prioritize for Johnson to do in the meantime?

In a panic he’s texted you, his more experienced UWN peer, asking for advice. How would you advise Miller? What steps would you recommend to manage the team, manage the work, and manage relationships with stakeholders? He’s got two weeks - what are the priorities, and what aren’t? And when the dust settles, how would you suggest Miller do things differently to mitigate this risk in the future?

You can assume anything you want about details not in the writeup, as long as you make those assumptions clear. Just hit reply, or email me your thoughts at [email protected]. Let me know if I can share your thoughts, anonymously or attributed to you; otherwise I’ll include it in summarized form. Next week I’ll share the discussion, and give my own recommendations.


Managing Individuals and Teams

At Manager, Ph.D. I talk about the roles collaboration plays on a team - promoting knowledge sharing, doing better work by combining expertise and perspectives, and doing work faster by dividing up tasks. We discuss mutually-reinforcing practices that can help promote collaboration: fostering a culture of question-asking, feedback, and documentation; sharing knowledge within and outside the team with talks and workshops; pairing; and collaboration tools.

The round-up has articles on:

  • Having the hardest conversations requires the deepest caring
  • Delegation
  • Your top performers need your attention, too
  • Making feedback a team habit
  • The solution isn’t interesting, the problem is
  • Securing resources for your projects
  • Teaching yourself to say no

Product Management and Working with Research Communities

Why it’s worth making computational methods easy to use - Jean Fan

Fan, an assistant professor at Johns Hopkins, writes in Nature Careers about how she and her team prioritized making STdeconvolve easy to get started with and to use. STdeconvolve analyses spatial transcriptomics data, helping to estimate how many cell types are present in the pixels and how those cell types are distributed in the tissue.
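
For a sense of the kind of estimation involved: STdeconvolve does reference-free deconvolution using latent Dirichlet allocation. Below is a rough Python/scikit-learn analogy to that idea - a hypothetical sketch of mine, not STdeconvolve itself, which is an R package - treating pixels as “documents” and gene counts as “word” counts, on entirely made-up data:

```python
# A rough analogy (hypothetical; NOT STdeconvolve's R API) to the idea
# behind STdeconvolve's reference-free deconvolution via latent
# Dirichlet allocation: treat each pixel as a "document" and its gene
# counts as "word" counts, then recover latent "cell types".
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
counts = rng.poisson(1.0, size=(200, 500))  # fake pixels x genes count matrix

# Fit K latent "cell types"; in practice K is chosen by model selection
lda = LatentDirichletAllocation(n_components=8, random_state=0)
theta = lda.fit_transform(counts)  # pixels x K: cell-type proportions per pixel
beta = lda.components_             # K x genes: expression profile per "cell type"

print(theta[0].round(2))  # estimated cell-type mix in the first pixel
print(beta.shape)
```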

Fan and her group spent a lot of time simplifying the installation process, and provided vignettes, tutorials, and video lectures to make the tool more accessible to inexperienced researchers; she also shared recordings of conference talks online. Her aim is to increase the impact of the tool by making it more accessible. As is often the case, making the software accessible has also made it easier to maintain and use in her own lab, and it provides students with a space to ask for explanations or publicly contribute their own solutions to others’ questions.

We probably spent as many hours making STdeconvolve accessible as we did in developing it. Some of my colleagues have been surprised by this effort, as those hours won’t lead to new publications. But I think the time is worth it: when science is made accessible, we can verify claims more easily and build on innovations, furthering progress.

This approach is pretty gutsy and risky for faculty members, especially for an assistant professor. Fan’s right that it’s work that won’t produce publications (or not any that funders or tenure committees will care about), although it might well boost citations.

On the other hand, we’re the teams that are well situated to turn research outputs into reliable research inputs (#119); we can be the Development in R&D. This kind of work greatly amplifies our impact, in ways that can more reliably turn into support for our teams than it can for PI-led research groups. In a way, it’s a bit of a failure of funding that it is falling to faculty members and their teams to do this work.


Using ChatGPT as a technical writing assistant - Mike Mason

I want to write about this at greater length later - but outward-facing documentation, explanations, images, and project writeups are consistently the lowest-hanging fruit I see for product management (and to an extent positioning and marketing) for our teams.

Tools like ChatGPT or Bing Chat make it easy to turn a few paragraphs of internal text plus some bullet points into a first draft for an external audience, after a little back-and-forth. Increasingly, these tools can even write some initial explanations of APIs. Even if that first draft is pretty crummy, editing it into something publishable is much easier than creating something ex nihilo. And once you have a good version of the text for one purpose (ideally the most detailed and technical version), the same tools are exceedingly good at rewriting it for different audiences, or turning it into shorter forms for other kinds of media.

In this article on Martin Fowler’s blog, Mason describes writing technical material for his consultancy’s blog using ChatGPT, including the free version. He can quickly turn some very rough notes into a draft suitable for review and rewrites, and then very successfully turn that into different formats.

My own experience is that you need to be somewhat more iterative with these tools to get the best results. (In another post, Fowler describes his colleague Xu Hao’s approach to getting code out of ChatGPT; that much more iterative, back-and-forth, bottom-up approach works well for text, too.) I’d summarize the approach I’ve settled on as follows, with a rough sketch of the workflow after the list:

  • Don’t rely on these tools for answers containing knowledge; give them the facts, use them as writing and rewriting assistants
  • They’re also great for brainstorming, and for trying different text “refactorings”
  • You have to go back and forth a lot, preferably starting with points to cover, through outlines, then to text
  • The more context you give them, the better. Mason’s request for “’blip’ in the style of the Thoughtworks technology radar” didn’t work, probably because ChatGPT didn’t know what that was but soldiered on regardless. I like to give it samples I like, ask it to describe the style of that sample, and then ask for text with that description of style.
  • If you’re starting from long-form text, ChatGPT works well. If you’re not, Bing Chat is strictly better (unless you’re paying for ChatGPT and using things like plugins or GPT-4).
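
To make that concrete, here’s a minimal sketch of that iterative, facts-first workflow in Python, assuming the openai package’s ChatCompletion interface (the pre-1.0 API current as of this writing); the model name, prompts, and “facts” are all illustrative assumptions of mine, not anything from Mason’s article:

```python
# A minimal sketch of the iterative "facts in, drafts out" workflow, using
# the openai Python package's ChatCompletion interface (the pre-1.0 API).
# Model, prompts, and "facts" are all illustrative assumptions.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

def chat(messages):
    """Send the running conversation, append the reply, and return it."""
    resp = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)
    reply = resp.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    return reply

# Step 0: *you* supply the facts - don't rely on the model for knowledge
facts = """\
- Our team built a pipeline that cut preprocessing from 3 days to 2 hours
- It runs on the campus HPC cluster (details entirely hypothetical)
- Audience: researchers who have never used the cluster"""

messages = [
    {"role": "system", "content": "You are a technical writing assistant. "
     "Use only the facts you are given; do not invent details."},
    # Step 1: points to cover -> outline
    {"role": "user", "content": "Here are the facts:\n" + facts +
     "\nPropose an outline for a short blog post announcing this."},
]
print(chat(messages))

# Step 2: outline -> draft (after you've reviewed and corrected the outline)
messages.append({"role": "user", "content":
    "Good. Now write a ~300-word first draft from that outline."})
print(chat(messages))

# Step 3: same material, "refactored" for a different audience
messages.append({"role": "user", "content":
    "Now rewrite that draft as three bullet points for a funder update."})
print(chat(messages))
```

The code itself is trivial; the point is the shape of the conversation - the facts and constraints go in up front, and each round gets reviewed and redirected before asking for the next, more polished form.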

Ethan Mollick is consistently the best source for how to get good results from LLMs like Bing, and how to use them to speed up routine work like writing up project results. Two very useful recent resources from him are:


Strategy and Positioning

Research Software Engineer Australia New Zealand Strategy 2023-2025 - Rowland Mosbergen, Paula Andrea Martinez, Nooriyah Poonawala-Lohani, Manodeep Sinha, Janet Stacey, RSE-AUNZ

We usually see others’ strategy documents only as finished products. That’s a problem! In learning strategy, as in learning to do science, the how and the why that don’t make it into the final document or publication (#155) are what matter. So I was excited to see this document.

The RSE Association of Australia and New Zealand have put together a slide deck proposing their strategy and goals for the next two years. They set out some clear selection criteria for goals, and evaluated proposed goals systematically against those criteria. There’s an appendix that lists a number of good and worthwhile goals which did not make the cut.

This approach of setting strategic priorities and evaluating choices against them goes by various names - David La Piana’s The Nonprofit Strategy Revolution, a book I quite like, calls a related approach a “strategy screen”, for instance. But whatever it’s called, it solves two important problems:

  • It reduces the number of goals - “Do all the good things” isn’t a strategy, it isn’t even a plausible aspiration
  • It means that there’s some underlying coherency to the selected goals - increasingly so with fewer criteria

It’s that second feature that makes a set of goals a strategy, rather than a to-do list of worthy tasks. A coherency to the goals, if the screen is well chosen, makes it possible for the goals to be mutually reinforcing and add up to something greater than the sum of its parts (#163). That same coherence also greatly simplifies communicating the value of what is being done to stakeholders and decision makers.

The appendix of unselected goals is an important feature. There are good things in there, things I would expect to be greatly valuable to the community! But because there are always finite resources, strategic choices are always tradeoffs. It’s not a strategy - it’s not even a feasible to-do list - if good choices aren’t being traded off against each other, if a different organization might not make different tradeoffs. Having a “won’t do now” list is valuable for communicating the choices made and the direction set.

Rowland Mosbergen and Nooriyah Lohani were kind enough to answer questions about how the process has worked, and in particular how selection criteria were themselves selected and what the response has been like. Mosbergen gave permission to use his answers (I failed to ask Lohani in time for the same explicit permission):

Yes, there was a few iterations on the selection criteria, but mainly additions rather than removals or updates, […]. We are lucky that we do have an aligned steering committee.

[T]he community feedback has been positive, from the limited responses we have had. The fact 29 people showed up for the strategy session is actually quite a big thing.

This is great - thanks to RSE-AUNZ for providing such a valuable example of clarity around strategy to the community.


Research Software Development

The future of programming: Research at CHI 2023 - Austin Z. Henley

Henley writes up some highlights (with links to papers and video abstracts/demos) from the ACM CHI conference on human-computer interaction. There are some interesting papers there on coding tools, but also some on developing teaching materials.

In particular I’m taken by this paper, A Literate Programming Approach for Authoring Explorable Multi-Stage Tutorials by Wang et al., where the authors build an extension to VSCode for writing long-form tutorials, with stages that build on each other, in a way that allows more complex and nonlinear work on pieces of the code.

Wang et al.’s Fig. 1 - An overview of Colaroid. Colaroid is implemented as a VS Code extension. The user can open the Colaroid Notebook (B) side by side with their main code editor (A). A Colaroid notebook consists of cells. Each cell captures a history state of the codebase, which contains three components — a text annotation area explaining the rationale behind this state (C), a code editor area displaying the state of the code and highlighting the changes compared to the previous state (D), and an output area rendering the HTML display of the history state (E).


There’s a really interesting-looking Digital Humanities Research Software Engineering Summer School in Cambridge UK at the end of July; part of the program is in-person. Application deadline is 23 May.


Research Data Management and Analysis

Contracts for Data Collaboration (C4DC) Legal Agreement Library - C4DC

This is a fascinating resource which I just learned about - a library of (mostly governmental) data sharing agreements for public perusal, to provide some ideas, examples, and sample language for the kinds of data sharing agreements that our community is increasingly being asked to enter into. Obviously I am not (and likely you are not) a lawyer, so this isn’t advice nor a DIY sort of resource, but it can help clarify our thinking and provide examples for the institutional lawyers who may be drafting these sorts of agreements.

C4DC also has a great framework for thinking about data sharing agreements, and some case studies.


Explore a New Way to Deploy Data Storage and Analytics with Arrow Flight SQL and Apache Superset - Kae Suarez, Voltron Data

Embedded databases, interoperable data formats, and more sophisticated dashboarding tools are making it easier than ever to stand up interactive, dataset-backed websites for researchers. Here Suarez demonstrates using DuckDB, Apache Arrow, Arrow Database Connectivity, and Apache Superset to set up a data exploration environment for a 14GB dataset (the US Census Public Use Microdata Sample).
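
As a taste of the Arrow end of that stack, here’s a minimal sketch (my own, not Suarez’s pipeline) of querying a file in place with DuckDB from Python and getting the results back as an Arrow table, ready to hand to downstream Arrow-aware tools; the file and column names are made up for illustration:

```python
# A minimal sketch (not Suarez's actual setup) of the query-to-Arrow piece
# of this stack: DuckDB queries a Parquet file in place and hands back a
# pyarrow.Table. File and column names below are made up for illustration.
import duckdb

con = duckdb.connect()  # in-memory database; no load step needed

# DuckDB can query Parquet (or CSV) files directly in SQL
tbl = con.execute("""
    SELECT state, count(*) AS n_records
    FROM read_parquet('pums_sample.parquet')
    GROUP BY state
    ORDER BY n_records DESC
""").arrow()  # fetch the result as a pyarrow.Table

print(tbl.schema)       # Arrow schema, usable by any Arrow-aware tool
print(tbl.slice(0, 5))  # first five rows
```

Suarez’s article goes further, serving the data over Arrow Flight SQL so that Superset can connect to it as a dashboarding front end.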

If your research group’s needs are simpler, we’ve covered a number of “host a web interface to a sqlite database” options, like datasette, including running sqlite in the browser; Airsequel is a hosted option (built on fly.io) that’s free for modest use.


Frontline Data Science: Lessons Learned From a Pandemic - Emily A. Beck, Hannah Tavalire, and Jake Searcy, Harvard Data Science Review

Beck et al. describe their front-line interdisciplinary team at the beginning of the pandemic:

We provide a summary of our experiences in building an interdisciplinary team founded on, and structured by, data pipelines and data scientists’ skills. We present ‘lessons learned’ from our experience building a COVID-19 Monitoring and Assessment Program (MAP), which provides SARS-CoV-2 information and testing to our university and large parts of our state. We further present ‘lessons learned’ from expanding disaster responses to traditionally unreached and disproportionately impacted groups. Finally, we provide a set of ‘principles for success’ that we hope will guide future efforts by data scientists wishing to immediately impact our society and provide information to organizations curious as to how data scientists can contribute to emergency response and disaster recovery.

And they offer some lessons learned for “rapid response” data science work in the future:

  • Prioritize flexibility over automation during a ramp-up (because things are so uncertain and fast-changing)
  • Data scientists can reduce data burdens on other staff (indeed probably the easiest way to get buy-in is to make the work being done do double duty — to both help reduce work being done elsewhere and for the purposes of the data science project)
  • Data scientists should be involved at every step of the pipeline (I think we’ve all seen what can happen when this isn’t the case)
  • Interoperability is more important than Robustness: an ode to CSV (Ugh, but in this case they’re probably right)

Research Computing Systems

The Invisible Success of Near Misses - Will Gallego

We’ve covered the topic of incident response and postmortems frequently - but you don’t have to wait for something to fail spectacularly to learn from it!

Hopefully, incidents happen relatively infrequently. You can keep your incident response/postmortem skills in practice, and still learn valuable lessons, by running those processes on “near misses”.

Near misses might be as subtle as noticing a typo in a config change just before deploying, to extrapolating from graphs an upward trend in memory allocation without the corresponding relief, to awareness of an unpatched server with a vulnerability begging to be exploited. What’s the difference between these events and a classic incident? Timing when someone noticed. […] Data from events that would likely produce failure were observed, informed decision making, and instead prevented failure.

We know that it’s just as important to learn from successes as well as failures, in retrospectives and elsewhere; the same applies to running computer systems:

We’ve made headway into expending energy towards learning from incidents. We’ll be even better off when we can regularly make learning from successes our regular work as well.


Random

Tufte CSS, inspired by the Tufte-LaTeX package.

Going deep into UEFI and the PC boot process.

A branchless binary search (if the compiler cooperates).

Anyone tried using Octopus.ac? It looks like a curated, categorized way to register research, including smaller and earlier works than papers.

A Matlab extension for Visual Studio Code. I’m not the world’s biggest Matlab fan, but the Matlab team has been on fire the last couple of years, making the product increasingly relevant and attractive in a very changed computational science world.

A large selection of SQL tutorials.

This is one of my favourite genres of deep dive into super-specific technical topics - “You could have invented futexes”.

Another use of ChatGPT, also on Martin Fowler’s blog, for writing code - this one does a better job of showing the iterative back-and-forth that’s really necessary to get good results out of these tools.

An OpenSSL cookbook.

Miss Kermit and Zmodem? Want to get back into using it? Using Kermit and ZMODEM over ssh. (And so ends the newsletter - or, as we used to say in that same era, ATH0).


That’s it…

And that’s it for another week. Let me know what you thought, or if you have anything you’d like to share about the newsletter or management. Just email me or reply to this newsletter if you get it in your inbox.

Have a great weekend, and good luck in the coming week with your research computing team,

Jonathan

About This Newsletter

Research computing - the intertwined streams of software development, systems, data management and analysis - is much more than technology. It’s teams, it’s communities, it’s product management - it’s people. It’s also one of the most important ways we can be supporting science, scholarship, and R&D today.

So research computing teams are too important to research to be managed poorly. But no one teaches us how to be effective managers and leaders in academia. We have an advantage, though - working in research collaborations has taught us the advanced management skills, but not the basics.

This newsletter focusses on providing new and experienced research computing and data managers the tools they need to be good managers without the stress, and to help their teams achieve great results and grow their careers.