In February 2015, I joined the
UC Berkeley Institute for Data Science
(BIDS) in a very unusual position: I got to focus full-time on making
the open Python ecosystem work better for scientists. My contract is
ending in a bit over a month, so I'm currently thinking about what's
next. But in this post I want to instead look back on what this unique
opportunity allowed me to do, both as a kind of personal post-mortem
and in the hopes that it might be of interest to people and
institutions who are thinking about different models for funding open
source and open science. In particular, there's also a
follow-up post discussing
some implications for software sustainability efforts.
A BIDS retrospective
Might as well start with the worst part: in late 2016 I came down with
a serious illness, and have been on partial disability leave since then. This has been
gradually getting better – cross your fingers for me. But it does mean
that despite the calendar dates, in terms of hours worked I've only
been at BIDS for 2 years and change.
But I'm pretty proud of what I accomplished in that time. There were
four main projects I led while at BIDS, which I'll discuss in
individual sections below. And to be clear, I'm certainly not claiming
exclusive credit for any of these – they all involved lots of other
people, who together did way more than I did! But I think it's fair to
say that these are all projects where I played a critical role in
identifying the issues and finding a way to push the community towards
solving them, and that if BIDS hadn't funded my position then none of
these things would have happened.
Revitalizing NumPy development
NumPy is so central to numerical work in Python, and so widely used in both academia and industry, that many people assume that it must receive substantial funding and support. But it doesn't; in fact for most of its history it's been maintained by a small group of loosely-organized, unpaid volunteers. When I started at BIDS one of my major goals was to change that, ultimately by getting funding – but simply airdropping money into a community-run OSS project doesn't always produce good results.
So the first priority was to get the existing maintainers on the same page about where we wanted to take the project and how funding could be effectively used – basically paying down "social debt" that had accumulated during the years of under-investment. I organized a developer meeting, and based on the discussions there (and with many other stakeholders) we were ultimately able to get consensus around a governance document (latest version) and technical roadmap. Based on this, I was able to secure two grants totaling $1.3 million from the Moore and Sloan foundations, and we've just finished hiring two full-time NumPy developers at BIDS.
I have to pause here to offer special thanks to the rest of the NumPy grant team at BIDS: Jonathan Dugan, Jarrod Millman, Fernando Pérez, Nelle Varoquaux, and Stéfan van der Walt. I didn't actually have any prior experience with writing grant proposals or hiring people, and initially I was on my own figuring this out, which turned out to be, let's say, challenging... especially since I was trying to do this at the same time as navigating my initial diagnosis and treatment. (It turns out not all buses have wheels.) They deserve major credit for stepping in and generously contributing their time and expertise to keep things going.
Improving Python packaging (especially for science)
Software development, like science in general, is an inherently
collaborative activity: we all build on the work of others, and
hopefully contribute back our own work for others to build on in turn.
One of the main mechanisms for this is the use and publication of
software packages. Unfortunately, Python packaging tools have
traditionally been notoriously unfriendly and difficult to work with –
especially for scientific projects that often require complex native
code in C/C++/Fortran – and this has added substantial friction to
this kind of collaboration. While at BIDS, I worked on reducing this
in two ways: one for users, and one for publishers.
On the package user side, conda has done a great deal to relieve the
pain... but only for conda users. For a variety of reasons, many
people still need or prefer to use the official community-maintained
pip/PyPI/wheel stack. And one major limitation of that stack was that
you could distribute pre-compiled packages on Windows and MacOS, but
not on the other major OS: Linux. To solve this, I led the creation
of the "manylinux" project. This
has dramatically improved the user experience around installing Python
packages on Linux servers, especially the core scientific stack. When
I ran the numbers a few weeks ago (2018-05-07), ~388 million manylinux
packages had been downloaded from PyPI, and that
number was growing by ~1 million downloads every day, so we're almost
certainly past 400 million now. And if you
look at those downloads,
scientific software is heavily represented: ~30 million downloads of
NumPy, ~15 million SciPy, ~15 million pandas, ~12 million
scikit-learn, ~8 million matplotlib, ~4 million tensorflow, ... (Fun
fact: a back of the envelope calculation suggests that the
manylinux wheels for SciPy alone have so far prevented ~90 metric tons
of CO2 emissions, equivalent to planting ~2,400 trees.)
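For the curious, here's roughly how that kind of estimate comes together. Apart from the download count quoted above, the inputs below – build time per install, CPU power draw, and grid carbon intensity – are illustrative assumptions of mine, not necessarily the figures behind the original number:

```python
# Back-of-the-envelope CO2 estimate for the SciPy source builds that
# manylinux wheels let users skip. All inputs are rough assumptions.
downloads = 15_000_000       # approx. SciPy manylinux downloads (see above)
build_hours = 10 / 60        # assume ~10 CPU-minutes per avoided source build
cpu_power_kw = 0.1           # assume ~100 W drawn while compiling
kg_co2_per_kwh = 0.4         # assume ~0.4 kg CO2 per kWh of electricity

kwh_saved = downloads * build_hours * cpu_power_kw
tonnes_co2 = kwh_saved * kg_co2_per_kwh / 1000
print(f"~{tonnes_co2:.0f} metric tons of CO2 avoided")  # ~100 with these inputs
```

With these made-up inputs the answer lands around 100 tons – the same ballpark as the ~90-ton figure above.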
So manylinux makes things easier for users. Eventually, users become
developers in their own right, and want to publish their work. And
then they have to learn to use distutils/setuptools, which is...
painful. Distutils/setuptools can work well, especially in simple
cases, but their design has some fundamental limitations that make
them confusing and difficult to extend, and this is especially
problematic for any projects with complex native code dependencies or
that use NumPy's C API, i.e. scientific packages. This isn't exactly
distutils's fault – its design dates back to the last millennium, and
no-one could have anticipated all the ways Python would be used over
the coming decades. And Python's packaging maintainers have done a
heroic job of keeping things working and incrementally improving
things with extremely minimal resources.
But often this has meant piling expedient hacks on top of each other;
it's very difficult to revisit fundamental decisions when you're an
all-volunteer project struggling to maintain critical infrastructure
with millions of stakeholders. And so fighting with
distutils/setuptools has remained a rite of passage for Python
developers. (And conda can't help you here either: for builds, conda
packages rely on distutils/setuptools, just like the rest of us.)
Another of my goals while at BIDS was to chart a path forward out of
this tangle – and, with the help of lots of folks (especially
Thomas Kluyver, whose efforts were truly heroic!), we now have
one. PEP 518 defines the
pyproject.toml file and for the first time makes it possible to
extend distutils/setuptools in a reasonable way (for those who know
setup.py: this is basically
setup_requires, except it works). This
recently shipped in pip 10.
And PEP 517 isn't quite
implemented yet, but soon it will make it easy for projects to abandon
distutils/setuptools entirely in favor of tools that
are easier to use and
better prepared to handle demanding scientific users,
making software publication easier and more accessible to ordinary
developers.
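To make that concrete, here's a minimal sketch of what this looks like for a hypothetical package that needs NumPy's headers available at build time (the requirements list is just an example):

```toml
# pyproject.toml
# PEP 518: the [build-system] table tells pip what it has to install
# *before* it can even start the build -- setup_requires, but working.
[build-system]
requires = ["setuptools", "wheel", "numpy"]

# PEP 517 (once implemented) adds a "build-backend" key here, letting a
# project point at whichever build tool it prefers, for example:
# build-backend = "setuptools.build_meta"
```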
The Viridis colormap
When I started at BIDS, matplotlib still used
the awful "jet" colormap by
default, despite probably dozens of peer-reviewed articles pointing
out how rainbow colormaps like "jet" distort users' understanding of
their data, create barriers to accessibility, and lead to bad
decisions, including (for example)
unnecessary medical diagnostic errors.
So I suggested to Stéfan that we fix this.
This was an interesting challenge, with two parts: first, the
computational challenge of building a set of tools to
visualize and design better colormaps,
and second and more importantly, the social challenge of convincing
people to actually use them. After all, there have been many proposals
for better colormaps over the years. Most of them sank without a
trace, and it was entirely possible that our colormap "viridis" would
do the same.
This required working with the matplotlib community to first find a
socially acceptable way to make any changes at all in their default
styles – here my suggestion of
a style-change-only 2.0 release proved successful (and ultimately led
to a broader style overhaul).
Then we had the problem that there are many perfectly reasonable
colormaps, and we needed to build consensus around a single proposal
without getting derailed by endless discussion – avoiding this was the
goal of the talk I gave at SciPy 2015.
In the end, we succeeded beyond our wildest expectations. As of today,
my talk's been watched >85,000 times, making it the most popular talk
in the history of the SciPy conference. Viridis is now the default
colormap in matplotlib, octave, and parts of ggplot2. Its R package
gets hundreds of thousands of downloads every month, which
puts it comfortably in the top 50 most popular R packages. Its fans
have ported it to essentially every visualization framework known to
humankind. It's been showcased in
Nobel-prize-winning research and
NASA press releases,
and has inspired twitter bots and
follow-ups from other researchers.
On the one hand, it's "just" a colormap. But it feels pretty good to
know that every day millions of people are gaining a little more
understanding, more insight, and making better decisions thanks to our
work, and that we've permanently raised the bar for good data
visualization.
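And if you're stuck on an older matplotlib release where "jet" is still the default, you don't have to wait for an upgrade: viridis has shipped with matplotlib since 1.5, so opting in explicitly is a one-liner. A minimal sketch with made-up data:

```python
import numpy as np
import matplotlib.pyplot as plt

# Some made-up 2D data to visualize.
data = np.random.RandomState(0).rand(20, 20)

# Ask for viridis explicitly instead of relying on the default colormap.
plt.imshow(data, cmap="viridis")
plt.colorbar()
plt.show()
```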
Making concurrent programming more accessible
Here's a common problem: writing a program that does multiple things
concurrently, either for performance or as an intrinsic part of its
functionality – from web servers handling simultaneous users and web
spiders that want to fetch lots of pages in parallel, to Jupyter
notebooks juggling multiple backend kernels and a UI, to complex
simulations running on HPC clusters. But writing correct concurrent
programs is notoriously challenging, even for experts. This is a
challenge across the industry, but felt particularly acutely by
scientists, who generally receive minimal training as software
developers, yet often need to write novel high-performance parallel
code – since by definition, their work involves pushing the boundary
of what's possible. (In
fact Software Carpentry originally
"grew out of [Greg Wilson's] frustration working with scientists who
wanted to parallelize complex programs but didn't know what version
control was.")
Over the last year I've been developing a new paradigm for making
practical concurrent programming more accessible to ordinary
developers, based on a novel analysis of where some of the
difficulties come from, and repurposing some old ideas in language
design. In the course of this work I've produced a practical
implementation in the Python
library Trio, together with a series
of articles, including two discussing the theory behind the core new
ideas.
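To give a flavor of the result, here's a minimal Trio sketch (the task bodies are just placeholders): the nursery block guarantees that every task started inside it has finished – or been cancelled, with any errors propagated – before the program moves on, which is the core idea those articles explore.

```python
import trio

async def worker(name, seconds):
    # Stand-in for real work: an HTTP request, a simulation step, ...
    await trio.sleep(seconds)
    print(name, "finished")

async def main():
    # Every task started in this block is guaranteed to be done
    # (or cancelled, with errors propagated) when the block exits.
    async with trio.open_nursery() as nursery:
        nursery.start_soon(worker, "task A", 1)
        nursery.start_soon(worker, "task B", 2)
    print("all tasks finished")

trio.run(main)
```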
This last project is a bit different than the others – it's more in
the way of basic research, so it will be some time before we know the
full impact. But so far it's attracting quite a bit of interest across
the industry and from language designers,
and I suspect that either Trio or something very like it will become
the de facto standard library for networking and concurrency in
Python.
Some other smaller things I did at BIDS, besides the four major
projects discussed above:
Was elected as an
honorary PSF Fellow, and to
the Python core developer team.
Worked with the BLAS working group on their proposal for a
next-generation BLAS API. BLAS is
the set of core linear algebra routines that essentially all
number-crunching software is built on, and the BLAS working group is
currently developing the first update in almost two decades. In
the past, BLAS has been designed mostly with input from traditional
HPC users running Fortran on dedicated clusters; this is the first
time NumPy/SciPy have been involved in this process.
Provided some assistance with organizing
the MOSS grant that
funded the new PyPI.
Created the h11 HTTP library, and
came up with
a plan for using
it to let urllib3/requests
join the new world of Python async concurrency (see the sketch after this list).
Had a number of discussions with the conda team about how the conda
and pip worlds could cooperate better.
And of course lots of general answering of questions, giving of
advice, fixing of bugs, triaging of bugs, making of connections, etc.
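A quick taste of the h11 plan mentioned above: h11 is a "sans-I/O" HTTP/1.1 implementation – it translates between HTTP events and bytes but never touches a socket, so the same protocol code can sit underneath blocking urllib3, asyncio, or Trio. A minimal client-side sketch:

```python
import h11

# h11 does no I/O itself: we hand it events, it hands us bytes, and
# pushing those bytes over a socket (blocking, asyncio, Trio, ...) is
# entirely the caller's business.
conn = h11.Connection(our_role=h11.CLIENT)

to_send = conn.send(
    h11.Request(method="GET", target="/", headers=[("Host", "example.com")])
)
to_send += conn.send(h11.EndOfMessage())

print(to_send)  # raw HTTP/1.1 request bytes, ready for any transport
```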
...and the ones that got away
And finally, there are the ones that got away: projects where I've
been working on laying the groundwork, but ran out of time before
producing results. I think these are entirely feasible and have
transformative potential – I'm mentioning them here partly in hopes
that someone picks them up:
PyIR: Here's the problem. Libraries like NumPy and pandas are
written in C, which makes them reasonably fast on CPython, but
prevents JIT optimizers like PyPy or Numba from being able to speed
them up further. If we rewrote them in Python, they'd be fast on PyPy
or Numba, but unusably slow on regular CPython. Is there any way to
have our cake and eat it too? Right now, our only solution is to
maintain multiple copies of NumPy and other key libraries (e.g. Numba
and PyPy have both spent significant resources on this), which isn't
scalable or sustainable.
So I organized a workshop
and invited all the JIT developers I could find. I think we came up
with a viable way forward, based around the idea of a Cython-like
language that generates C code for CPython, and a common higher-level
IR for the JITs, and multiple projects were excited about
collaborating on this – but this happened literally the week before I
got sick, and I wasn't able to follow up and get things organized.
It's still doable though, and could unlock a new level of performance
for Python – and as a bonus, in the long run it might provide a way to
escape the "C API trap" that currently blocks many improvements to
CPython (e.g., removing the GIL).
Telemetry: One reason why developing software like NumPy is
challenging is that we actually have very little idea how people use
it. If we remove a deprecated API, how disruptive will that be? Is
anyone actually using that cool new feature we added? Should we put
more resources into optimizing module X or module Y? And what about at
the ecosystem level – how many users do different packages have? Which
ones are used together? Answering these kinds of questions is crucial
to providing responsible stewardship, but right now there's simply no
way to do it.
Of course there are many pitfalls to gathering this sort of data; if
you're going to do it at all, you have to do it right, with
affirmative user consent, clear guidelines for what can be collected
and how it can be used, a neutral non-profit to provide oversight,
shared infrastructure so we can share the effort across many projects,
and so on. But these are all problems that can be solved with the
right investment (about which, see below), and doing so could
radically change the conversations around maintaining and sustaining
open scientific software.
So there you have it: that's what I've been up to for the last few
years. Not everything worked out the way I hoped, but overall I'm
extremely proud of what I was able to accomplish, and grateful to BIDS
and its funders for providing this opportunity.
As mentioned above, I'm currently considering options for what to do
next – if you're interested in discussing
possibilities, get in touch!
Or, if you're interested in the broader question of sustainability for
open scientific software, I wrote a
follow-up post trying
to analyze what it was about this position that allowed it to be so
effective.