#162 - 1 Apr 2023

Measure what matters, Part I: Kirkpatrick Models. Plus: Role of a Tech Lead; User Interviews 101; Shared Services Organizations; Maintainer Guides; Refactoring; ROI

I’ve had some great conversations following up from the surveys article (#159) - thanks! You’ll likely remember that my argument was that while surveys can, used carefully, be a useful instrument, there are some common failure modes in our line of work:

  • Surveys are too often a way to avoid talking to people
  • It’s too easy to ask questions in a survey without putting in the effort to decide what really matters, and what decisions will be made based on the results. So the results don’t mean anything and nothing changes, but it still feels like something’s been done.

Training came up a lot as an example in the conversations. That’s a great application area, because training has a well-developed practice around evaluation. I’d like to use that as a baseline for talking, in the next issue, about measuring what matters, and about outcomes and impact versus inputs.

So let’s talk about the Kirkpatrick Evaluation Model for training interventions:

The Kirkpatrick model of teaching evaluations - a pyramid with reactions at the lowest level, then a level for learning, then behaviour change, and at the top, results.

This model describes four levels at which one could evaluate the effectiveness of training. They all have value, but at each stage it’s important to understand what’s actually being measured, and why.

Reaction - What Is The Immediate Response To The Training?

When some kind of evaluation is done for short or medium-length trainings in our business, it’s almost always the “how satisfied were you with the training today?” reaction-level evaluation.

These evaluations can surface real issues in the inputs to the training: the in-person venue is too hot, the Zoom audio was too quiet, the scheduling was bad, the speaker wasn’t engaging, the materials were unclear. These problems can interfere with learning, and if the evaluations are done after every session, they give you a clear signal about something that might be fixed before the next one. What’s more, they’re easy to run, there are lots of templates one can use, and they typically get good response rates because attendees expect them. Terrific!

Further, if you do these consistently for a while, you can test to see if a major change has made the inputs to the learning experience better or worse. For instance, in #64, we looked at a report by Bristol’s ACRC on the move from in-person to virtual training delivery; because they had been doing these “would you recommend this workshop” questions for a long time, they could monitor how well the transition went. That’s great! It’s good and useful information.

One needs to be aware of the limitations of what’s being measured here, however. A low score gives some signal about problems which could hurt learning. But learning isn’t being measured at all. A session with a very engaging, charming instructor can easily result in very high evaluations without any actual learning. A boring, droning session might end up being filled with real, actionable nuggets of knowledge which get used enthusiastically. We all know the issues with teaching evaluations and how they measure a lot of stuff that has nothing at all to do with how much education actually happened.

Even people in our business, who learn professionally and constantly as part of their jobs, are absolutely rubbish at knowing in the moment whether or not they’ve successfully learned something, much less whether or not it will be useful. When was the last time you read or watched something, were sure you understood it, and at some point later realized you hadn’t understood it at all? It probably wasn’t decades ago, right?

Our purpose in offering these trainings isn’t to be engaging. It’s not to entertain. It’s not to offer amazing venues. If those things help us achieve our purpose, fantastic, but they’re not the goal. So while “Reaction” evaluation is a useful tool, it’s only one step in measuring what matters.

Learning - Did Education Actually Happen?

We put a lot of effort into these events. If we want to know how good we are at achieving our purpose with them, it’s not enough that the inputs are good, and that people have positive reactions to them; they have to be effective. And we can’t know how to get better at achieving our purpose without measuring how well we’re doing.

The first level on which to measure effectiveness, the first step towards the purpose, is the clearest and easiest to measure - did the attendees actually learn what we set out to teach?

For longer trainings, we’re normally pretty good at this. We have one or more assignments which get evaluated. Paired with a pre-test (which is important; our success at teaching the material is measured not by the final grades but by the difference between what the incoming students knew and what they leave knowing), this gives us actual data on what is being learned and what we still need to work on. If the tests are lightweight enough (e.g. clickers and the like), we can know in near-real-time what we need to spend more time on or revisit in a different way, and when we can move on.

We’re normally less good at doing this for shorter trainings, but it’s just as important. It doesn’t have to be a 20-part test with essay questions, but some kind of pre- and post-assessment is the only way to know whether our intervention had any educational value.
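
As a concrete (and entirely hypothetical) illustration of what that measurement can look like, here’s a minimal sketch of computing per-attendee and class-average normalized gain from pre- and post-assessment scores; the names and scores are made up:

```python
def normalized_gain(pre: float, post: float, max_score: float = 100.0) -> float:
    """Fraction of the possible improvement actually achieved (a Hake-style gain)."""
    if max_score - pre == 0:
        return 0.0  # attendee was already at ceiling; no room left to improve
    return (post - pre) / (max_score - pre)

# Hypothetical pre/post scores (percent correct) for a short workshop.
scores = {
    "attendee_1": (40, 75),
    "attendee_2": (55, 80),
    "attendee_3": (70, 72),
}

gains = {name: normalized_gain(pre, post) for name, (pre, post) in scores.items()}
average_gain = sum(gains.values()) / len(gains)

print(f"Per-attendee normalized gains: {gains}")
print(f"Class average normalized gain: {average_gain:.2f}")
```

The absolute post-test scores matter less than the gain: a session where everyone arrived already knowing the material will look great on the post-test and terrible on the gain.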

Once done, this is something we can compare across sessions, instructors, and experiments, and know whether something is “better” than before in a real, educational sense. Note that unlike the reaction-level evaluation, this won’t necessarily give us clear guidance on what needs to change; but it will tell us which sub-topics need more attention. That, plus experimentation, is enough to guide improvements.

But even that is just a first step.

Behaviour Change - Did The Learning “Stick” and Seem Useful Enough To Apply?

If all we cared about was successfully teaching some knowledge, we could just choose easier knowledge to teach!

Conveying facts and knowledge is well and good, but that isn’t our purpose. In our roles, we teach knowledge and skills with the intent that they get used to advance research and scholarship. People attend our sessions, presumably, so they can apply what we’re offering to teach them. If we taught engagingly, and attendees successfully learned the material, but they then never used it - what was the point?

Thinking about how the learning will be used, right from the beginning of preparation, can greatly and usefully shape what and how we choose to teach. It almost always results in a greater hands-on focus: deep exposition on theory and facts gets deprioritized in favour of teaching some of the basics with immediate, relevant application, and providing deeper theory and knowledge either as resources to be looked up as needed, or in followup sessions.

Measuring behaviour change is tricky, partly because it’s the first level where we’re unlikely to be able to make meaningful quantitative measurements, and partly because by definition it involves contacting people once they’ve already finished the training (and so response rates go down). Surveys can work here, but probably aren’t as good as emails and quick followup chats. These will be much more successful if you let attendees know throughout the course that it’s coming.

Cohort-based teaching can help here, with both enabling and evaluating behaviour change. In this model the cohort has a way of staying in contact for some fixed period of time (3 months is a common period) after the material has been delivered, with follow-up discussions and maybe coaching and Q&A; the idea is to provide a mini “community of practice” where people can share how they’re using the material and learn from each other.

Longer-term, less-intense training (say, an afternoon a week for 8 weeks) can also help, because it provides more opportunity to use the learned material between sessions - and class discussion can be used to follow up.

Finding out how the material was used, or what barriers there are to applying it, can help improve future sessions: teaching the material in a way that’s more usable, and targeting it better to audiences that will successfully use it. It will also start providing useful testimonials that can be used in communicating with future attendees and decision makers: “Six months later I’m still using what was taught in this course, and have learned more skills based on the original material! I’m so glad I took this course!”

Results - Did It Matter? Did We Teach The Right Thing?

Great, so the material was taught well, and people are using it. But to what end?

Our goal is to advance research and scholarship as much as we can, given our constraints and priorities.

This final level of evaluation attempts to answer the questions: Why teach a course? Why teach this course? Does any of it actually matter?

Implicitly or explicitly, we’re teaching these things so that there’s some impact on research and scholarship. We can’t possibly be consistently successful in that end goal - the whole purpose of the exercise - unless we teach with that end result clearly in mind, and repeatedly, constantly, evaluate the impact we’re having.

Maybe our goals with this particular training were:

  • Help grad students be more effective by helping them analyze their data faster and in a more repeatable way, so they can publish faster and better with the same or less effort.
  • Help labs write the software they need to do specialized work and receive more grants.
  • Help promote the economic impact of our institution by helping trainees successfully land post-ac jobs that make use of their skills.
  • Help research groups run larger-scale, more complex multi-physics simulations so their publications can have more impact.

So were we successful or not?

There’s absolutely no chance we can find that information out using surveys. This needs to come from on-going discussions, and long-term followup. And folks, it’s messy.

Attribution will be challenging, causality will be unclear. We certainly won’t end up with any quick fixes we can do for the next session.

But these impacts are the point of our interventions. We’re putting our (and our attendees’) time into these courses to have some impact on research and scholarship in our institutions. We can’t know we’re doing the right things, or doing things well enough, without some attempt to find out the impact of what we’re doing.

The results will be uncomfortably qualitative. Some researcher will be willing to say “We never would have gotten this grant without our students being trained in this material”. Some researcher will seemingly have the same outcome but not feel that way. The self-reported results are unfortunately the best you’re likely to be able to get. You can also look at overall grant success rate or publication rates or…

This unclear data may be unsatisfying to you, but will be hugely influential to decision makers. That “never would have gotten this grant” quote will be something you can use in conversations and slides again and again. Consistently placing grad students in spinouts or other businesses will really matter for your jurisdiction’s staff and politicians. Prioritizing the work that has the most clear impact will mean your team is spending its time on the right things. And aiming for that impact in how the training is chosen, designed, and delivered will mean that work is as valuable as possible.

Application More Broadly

I’ve spent a lot of time on training evaluation here, but these ideas have much wider application.

We choose a training intervention — or other services or products for our community — because, consciously or unconsciously, we have some theory of change. We think our training will have some impact on research and scholarship in our institution, typically through pretty simple mechanisms; mechanisms which are observable, measurable, and improvable.

Crucially, the impact is the entire point. There’s no reason to do the work except for the impact. This isn’t a hobby; we’re professionals, doing badly needed work, with far more things we could usefully do than time or resources to do them.

And yet, we tend to focus on measuring the inputs rather than the impact, because the inputs are easy to measure quantitatively.

That’s a bit longer than I intended. I’ll write more about this next week - now on to a slightly shorter roundup!

Managing Individuals and Teams

Across the way this week over at Manager, Ph.D., I covered:

  • Feedback as a worked example of team expectations, goals, and standards, and how we should seek out these worked examples to learn faster
  • One-on-one questions for underperforming team members
  • Team expectations should be tradeoffs
  • Problems with “the team” may be caused by us
  • Better LinkedIn profiles

Technical Leadership

The Role of the Tech Lead - Rachel Spurrier

For a small enough team, the manager may be the tech lead, but things get a lot easier once the team is large enough to split out the responsibilities.

A technical lead can be a permanent job title, or it can be a per-project role. The key thing, as Spurrier writes, is that there’s clarity about who is responsible for what.

The tech lead generally focusses on the how - how best to implement something, how to manage the project - and the manager focusses on the who (and the what and why, if there’s no separate person doing that like a product manager). The tech lead, Spurrier tells us, is responsible for technical mentorship while the manager focusses on broader people management and career development.

You can have different divisions of responsibilities and have things work, as long as there’s clarity. Spurrier talks about what skills the tech lead needs, and how to introduce someone into the role. That’s particularly useful in teams like ours, where the tech leadership may be a bit fluid depending on what’s being done for whom, and so different people may get the opportunity to practice those responsibilities at different times.

Product Management and Working with Research Communities

User interview 101 - Sophie Aguado

Whether we’re following up on a course to see if material is being used, or getting input on current or possible new services, features, or products for our clients, we need to make sure we can talk to our clients without asking leading questions, and that we get the most out of these time-consuming conversations.

Aguado walks us through what user interviews are good for (qualitative insight about needs) and what they aren’t good for (quantitative results or “do you think this is a good idea” questions), who and when to interview, and a good series of steps.

Aguado’s scripts are more for UX design interviews, but the steps, suggestions, and listed gotchas apply very much to our teams.

Innovating to Improve and Mature Your Shared Services Organization - Karen Hilton, Betsy Curry, Scott Madden Consulting

We might not love thinking about it this way, but our teams are shared services, centralizing certain kinds of expertise and/or equipment across departments, divisions, institutions, or even regions.

There’s actually a lot written about shared services, and even when the shared services are very much non-research functions (finance, HR, IT), the basic story of how we best serve our institutions very much applies, as does the path to maturity:

Figure 1 of the article, with maturity levels for Shared Services being (1) Transform, initiating the centralized and more efficient shared services, resulting in a cost reduction; (2) Stabilize and Optimize, pushing for consistently higher efficiency, lower costs, lower turnover, and higher service levels; and (3) Refining, adapting as executive expectations change, desired capabilities expand, and complexities increase.

I see a lot of teams get stuck after that first phase. It happens: there are centralized, more efficient services. But the drive to keep getting better, to optimize and excel, to hold ourselves accountable to high and publicly measured standards, and to use that excellence to find opportunities to tackle larger and more complicated tasks - that doesn’t happen by default. We’re not trained in how to do it.

(Relatedly, a different article by the same company describes different “service delivery channels” - as an advisor to leadership or divisions, as a transactional service center, or as centers of expertise. I think too often our teams fall into the trap of being largely or entirely transactional service centers, which is valuable but only part of what we can be.)

Research Software Development

Maintainer guides: spending, succession, & more - Sumana Harihareswara

Harihareswara has a great set of guides here, from training she’s developed for maintainers of open source projects, including guides on growing the contributor base and handling conflict.

Refactoring and Program Comprehension - Greg Wilson
Quickly improve code readability with Proximity Refactorings - Nicolas Carlo
Easily identify technical debt in any repository - Daniel Bartholomae

A trifecta of refactoring articles for this issue:

Wilson summarizes a recent paper with quantitative results on what kinds of refactoring genuinely result in improved readability:

  1. Refactorings can help readability directly or indirectly (like by making comments easier to apply to the code being read)
  2. Reorganizing source code to make more cohesive components positively affects readability
  3. Moving features between objects doesn’t clearly have a net positive or negative impact
  4. Reorganizing data within classes doesn’t have a clear impact on readability
  5. Renaming code elements doesn’t always help
  6. Moving logic along the hierarchy structure can help, but superclasses inevitably have a negative impact.
  7. Multiple refactoring operations involving renaming, applied in sequence, can noticeably improve readability.

The paper itself really looks interesting.

Meanwhile, Carlo suggests that refactorings of type 2 - reorganizing code so that relevant pieces are in closer proximity - are a simple and effective way to refactor code as you go along making other changes.
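
To make that concrete, here’s a toy Python example (my own sketch, not from Carlo’s article) of a proximity refactoring: the behaviour is unchanged, but each piece of state is moved next to the code that uses it - exactly the kind of low-risk improvement you can make while already editing a function for other reasons.

```python
# Before: `discount` is computed far from where it's used, so a reader has to
# carry it in their head past unrelated shipping logic.
def order_total(items, is_member, destination):
    discount = 0.10 if is_member else 0.0

    shipping = 15.0 if destination == "international" else 5.0
    print(f"Shipping ({destination}): {shipping:.2f}")

    subtotal = sum(items)
    return subtotal * (1 - discount) + shipping


# After the proximity refactoring: identical behaviour, but each value is
# declared right next to the code that uses it.
def order_total_refactored(items, is_member, destination):
    shipping = 15.0 if destination == "international" else 5.0
    print(f"Shipping ({destination}): {shipping:.2f}")

    subtotal = sum(items)
    discount = 0.10 if is_member else 0.0
    return subtotal * (1 - discount) + shipping


print(order_total([10.0, 20.0], True, "domestic"))             # 32.0
print(order_total_refactored([10.0, 20.0], True, "domestic"))  # 32.0
```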

Finally, Bartholomae makes two arguments:

  • Technical debt and accidental complexity are basically the same thing, differing only in the history of how they came about
  • Accidental complexity is surprisingly well correlated with typical, seemingly-simplistic code complexity measures

And with that he offers a tool based on the new-to-me GitHub Blocks, which lets you build UIs on top of code bases. His block looks for regions of the code with high code complexity measures. Very cool!
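
For a sense of what those “seemingly-simplistic” measures look like, here’s a deliberately crude sketch (a toy of mine, not Bartholomae’s tool) that approximates cyclomatic complexity per function by counting branch points with Python’s standard-library ast module:

```python
import ast

# Node types that add a decision point (a crude proxy for cyclomatic complexity).
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.Try, ast.BoolOp)

def approximate_complexity(source: str) -> dict:
    """Return a rough complexity score for each function defined in `source`."""
    scores = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Start at 1 for a straight-line function, add 1 per branch point.
            branches = sum(isinstance(child, BRANCH_NODES) for child in ast.walk(node))
            scores[node.name] = 1 + branches
    return scores

example = '''
def tidy(x):
    return x.strip()

def messy(x):
    if x:
        for c in x:
            if c.isdigit() or c.isspace():
                x = x.replace(c, "")
    return x
'''

print(approximate_complexity(example))  # {'tidy': 1, 'messy': 5}
```

Real tools count decision points more carefully than this, but even the rough count immediately flags the deeply nested function.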

Research Computing Systems

Use of accounting concepts to study research: return on investment in XSEDE, a US cyberinfrastructure service - Stewart et al, Scientometrics

In #132 we covered a similar article, “Metrics of financial effectiveness: Return On Investment in XSEDE, a national cyberinfrastructure coordination and support organization”. The latest article is a greatly expanded, 31-page treatment of that work, discussing what the costs would have been if the systems had been provided piecemeal, without coordination and common stacks, and if services like extended consulting, software optimization, and training had not been offered.

Using the largest dataset assembled for analysis of ROI for a cyberinfrastructure project, we found a Conservative Estimate of ROI of 1.87, and a Best Available Estimate of ROI of 3.24.

Figure 6 of Stewart et al, with estimated values of services for extended consulting, training, software optimization, and the value to service providers of the coordination and knowledge-sharing.

Random

In praise of awk (and some interesting technical notes about awk - e.g. it was designed, like bash, to not need any garbage collection - I did not know that about either sh or awk).

A ChatGPT client for MS-DOS, because we could.

An international conference on Pascal and Pascal-like languages, because they could.

Forwarding ssh-agent through websockets. They were so preoccupied with whether or not they could, they didn’t stop and think if they should.

Going deep into fsync.

Darklang (a hosted programming language for web services) is going all-in on LLM code generation. Whether you like Darklang and its direction or not, this is the first argument I’ve seen that a future with more AI code generation will mean that what matters for ergonomics in programming languages will change significantly.

Better defaults for GitHub actions.

That’s it…

And that’s it for another week. Let me know what you thought, or if you have anything you’d like to share about the newsletter or management. Just email me or reply to this newsletter if you get it in your inbox.

Have a great weekend, and good luck in the coming week with your research computing team,


About This Newsletter

Research computing - the intertwined streams of software development, systems, data management and analysis - is much more than technology. It’s teams, it’s communities, it’s product management - it’s people. It’s also one of the most important ways we can be supporting science, scholarship, and R&D today.

So research computing teams are too important to research to be managed poorly. But no one teaches us how to be effective managers and leaders in academia. We have an advantage, though - working in research collaborations has taught us the advanced management skills, but not the basics.

This newsletter focusses on providing new and experienced research computing and data managers the tools they need to be good managers without the stress, and to help their teams achieve great results and grow their careers.

Jobs Leading Research Computing Teams

This week’s new-listing highlights are below in the email edition; the full listing of 187 jobs is, as ever, available on the job board.