njs blog

The unreasonable effectiveness of investment in open-source infrastructure

In my last post, I gave a retrospective of my time at the UC Berkeley Institute for Data Science (BIDS), where I held an unusual, almost unique position that allowed me to focus full-time on making the open Python ecosystem work better for scientists. In particular, I described my work in four areas: revitalizing NumPy development, improving Python packaging, the viridis colormap, and the Trio project to make concurrent programming more accessible. Of course you should read and judge for yourself, but personally I feel this was an extraordinary return on investment for BIDS and its funders: 1 headcount × 2 years = 4 different projects that wouldn't have happened otherwise (plus two more candidate projects identified), all with enormously broad impact across research and industry.

Yet curiously, all the problems that I worked on are ones that have been well-known and widely-discussed for years. So why didn't they get addressed before? How can there be so much low-hanging fruit? Why was funding me so unreasonably effective?

I wish I could say that it's because I'm, y'know, just that good... but it's not true. Instead, I'd argue that these successes followed from some specific aspects of the position, and are replicable at other institutions and in other communities. Specifically, I believe that these projects all fell into a category that's mostly inaccessible to current funding models for open (scientific) software. Projects like this accumulate, gathering dust, because there's no-one in a position to tackle them. This is a tragedy, but if we can understand the reason and find a way to fix it, then we'll unlock a tremendous opportunity for high-ROI investments.

The category I'm thinking of is defined by two features: it contains projects that (1) require a modest but non-trivial amount of sustained, focused attention, and (2) have an impact that is large, but broad and diffuse. That combination is currently kryptonite for open source and particularly open science. Consider the types of labor we have available:

Famously, a lot of open-source development is done by volunteers working nights and weekends, grad students playing hooky, industry developers using "20% time" to contribute back: these are similar in that they're all ways of scavenging small bits of time out of people's lives. I think it's a testament to the power of the open-source model that it can take effective advantage of scattershot contributions like this, and these kinds of contributions can add up to make amazing things – which makes it tempting to conclude that this kind of labor is sufficient to solve any problem. But it's not true! There are many problems where forty people each putting in one hour a week are helpless, but that one person working forty hours can easily solve. That's why none of NumPy's many volunteers built consensus on governance or wrote a grant, why dozens of people have tried to get rid of "jet" without success, why Python packaging remains painful despite being used daily by millions of people, and so forth: in each case, no individual contributor could devote enough focused, sustained attention to get any traction.

Another way people contribute to OSS is as a side-effect of some other funded work. For example, work on conda, the open-source package management tool, is subsidized by Anaconda, the commercially supported software distribution. Or in an academic context, an astronomy grad student's thesis work is funded by a grant, and they might contribute the resulting algorithms back to AstroPy. But paradoxically, the projects I described above all have "too much" impact to be funded this way – and in particular, their impact is too broad and diffuse.

Everyone already uses NumPy and nobody owns it, so from a company's point of view, it's very difficult to make a business case for supporting its development. You can make a moral case, and sometimes that can work, but I've had many conversations with companies that ended with "You're right, we should be helping, and I really wish we could, but..." Or for another example, before viridis, probably the most impactful work on the colormap problem was done by engineers at Mathworks, who created the parula colormap and made it the default in MATLAB – but they had to make it proprietary to justify their investment, which sharply limited its impact.

This isn't unique to industry; essentially the same dynamics apply in academia as well. If an astronomer contributes to AstroPy, then other astronomers can appreciate that; it might not be worth as much as writing a proper journal article, but it's worth some disciplinary credit, and anyway most of the work can be justified as a side-effect of publishing a paper, thesis, etc. But NumPy is different: most advisors will look askance at someone who spends a lot of time trying to contribute to NumPy, because that's "not astronomy", and while it produces value, it's not the kind of value that can be captured in discrete papers and reputation within the field. Similarly, Python's community-maintained packaging stack is everyone's problem, so it's no-one's problem. You get the idea.

This raises a natural question: if we can't piggyback on some other funding, why not get a dedicated grant? This is an excellent solution for projects that require a lot of focused attention, but there are two problems. First, many projects only require a modest amount of focused attention – too much for volunteers, but too little to justify a grant – and thus fall through the cracks. It would have taken more effort to get a grant for viridis, or for the packaging improvements described above, than it did to actually do the work. In other cases, like NumPy or (perhaps) my concurrency project, a grant makes sense. But there's a catch-22: the planning and writing required to get a grant is itself a project that requires sustained attention... and without the grant, this attention isn't available!

So how do grants ever work, then? Well, academia has a solution to this that's imperfect in many ways but may nonetheless serve as an inspiration: faculty positions. Faculty have a broad mandate to identify problems where applying their skills will produce impact, the autonomy to follow up on these problems, the stability to spend at least some time on risky projects (especially post-tenure), and an environment that supports this kind of work (e.g., with startup funds, grad students, administrative support, etc.). But unfortunately, universities currently don't like to fund faculty positions outside of specific fields, or where the outcomes are tools rather than papers – regardless of how impactful those tools might be.

Of course we should fund more grants for open scientific software. More and more, scientific research and software development are interwoven and inseparable – from a single cell in a single Jupyter notebook, to a custom data processing pipeline, to whole new packages to disseminate new techniques. And this means that scientific research is increasingly dependent on the ongoing maintenance of the rich, shared ecosystem of open infrastructure software that field-specific and project-specific software builds on.

But grant calls alone will be ineffective unless we also have leaders who can think strategically about issues that cut across the whole software ecosystem, identify the way forward, and write those grants – and those leaders need jobs that let them do this work. Our ecosystem needs gardeners. That's what made my position at BIDS unique, and why I was able to tackle these problems: I was one of the few people in all of science with the mandate, autonomy, stability, and support to do so. Any solution to the sustainability problem needs to find a way to create positions with these properties.

Previous: A farewell to the Berkeley Institute for Data Science