njs blog

The unreasonable effectiveness of investment in open-source infrastructure

In my last post, I gave a retrospective of my time at the UC Berkeley Institute for Data Science (BIDS), where I've had an unusual, almost unique, position that allowed me to focus full-time on making the open Python ecosystem work better for scientists, and in particular I described my work in four areas: revitalizing NumPy development, improving Python packaging, the viridis colormap, and the Trio project to make concurrent programming more accessible. Of course you should read and judge for yourself, but personally I feel like this was an extraordinary return-on-investment for BIDS and its funders: 1 headcount × 2 years = 4 different projects that wouldn't have happened otherwise (plus two more candidate projects identified), all with enormously broad impact across research and industry.

Yet curiously, all the problems that I worked on are ones that have been well-known and widely-discussed for years. So why didn't they get addressed before? How can there be so much low-hanging fruit? Why was funding me so unreasonably effective?

I wish I could say that it's because I'm, y'know, just that good... but it's not true. Instead, I'd argue that these successes followed from some specific aspects of the position, and are replicable at other institutions and in other communities. Specifically, I believe that these projects all fell into a category that's mostly inaccessible to current funding models for open (scientific) software. Projects like this accumulate, gathering dust, because there's no-one in a position to tackle them. This is a tragedy, but if we can understand the reason and find a way to fix it, then we'll unlock a tremendous opportunity for high-ROI investments.

The category I'm thinking of is defined by two features: it contains projects that (1) require a modest but non-trivial amount of sustained, focused attention, and (2) have an impact that is large, but broad and diffuse. That combination is currently kryptonite for open source and particularly open science. Consider the types of labor we have available:

Famously, a lot of open-source development is done by volunteers working nights and weekends, grad students playing hooky, industry developers using "20% time" to contribute back: these are similar in that they're all ways of scavenging small bits of time out of people's lives. I think it's a testament to the power of the open-source model that it can take effective advantage of scattershot contributions like this, and these kinds of contributions can add up to make amazing things – which makes it tempting to conclude that this kind of labor is sufficient to solve any problem. But it's not true! There are many problems where forty people each putting in one hour a week are helpless, but that can easily be solved by one person working forty hours. That's why none of NumPy's many volunteers built consensus on governance or wrote a grant, why dozens of people have tried to get rid of "jet" without success, why Python packaging remains painful despite being used daily by millions of people, and so forth – the inability of any individual contributor to devote enough focused, sustained attention to get any traction.

Another way people contribute to OSS is as a side-effect of some other funded work. For example, work on conda the open-source package management tool is subsidized by Anaconda, the commercially supported software distribution. Or in an academic context, an astronomy grad student's thesis work is funded by a grant, and they might contribute the resulting algorithms back to AstroPy. But paradoxically, the projects I described above all have "too much" impact to be funded this way – and in particular, their impact is too broad and diffuse.

Everyone already uses NumPy and nobody owns it, so from a company's point of view, it's very difficult to make a business case for supporting its development. You can make a moral case, and sometimes that can work, but I've had many conversations with companies that ended with "You're right, we should be helping, and I really wish we could, but..." Or for another example, before viridis, probably the most impactful work on the colormap problem was done by engineers at Mathworks, who created the parula colormap and made it the default in MATLAB – but they had to make it proprietary to justify their investment, which sharply limited its impact.

This isn't unique to industry; essentially the same dynamics apply in academia as well. If an astronomer contributes to AstroPy, then other astronomers can appreciate that; it might not be worth as much as writing a proper journal article, but it's worth some disciplinary credit, and anyway most of the work can be justified as a side-effect of publishing a paper, thesis, etc. But NumPy is different: most advisors will look askance on someone who spends a lot of time trying to contribute to NumPy, because that's "not astronomy", and while it produces value, it's not the kind of value that can be captured in discrete papers and reputation within the field. Similarly, Python's community-maintained packaging stack is everyone's problem, so it's no-one's problem. You get the idea.

This raises a natural question: if we can't piggyback on some other funding, why not get a dedicated grant? This is an excellent solution for projects that require a lot of focused attention, but there are two problems. First, many projects only require a modest amount of focused attention – too much for volunteers, but too little to justify a grant – and thus fall through the cracks. It would have taken more effort to get a grant for viridis, or for the packaging improvements described above, than it did to actually do the work. In other cases, like NumPy or (perhaps) my concurrency project, a grant makes sense. But there's a catch-22: the planning and writing required to get a grant is itself a project that requires sustained attention... and without the grant, this attention isn't available!

So how do grants ever work, then? Well, academia has a solution to this, that's imperfect in many ways but nonetheless may serve as an inspiration: they have faculty positions. Faculty have a broad mandate to identify problems where applying their skills will produce impact, the autonomy to follow up on these problems, the stability to spend at least some time on risky projects (especially post-tenure), and an environment that supports this kind of work (e.g., with startup funds, grad students, administrative support, etc.). But unfortunately, universities currently don't like to fund faculty positions outside of specific fields, or where the outcomes are tools rather than papers – regardless of how impactful those tools might be.

Of course we should fund more grants for open scientific software. More and more, scientific research and software development are interwoven and inseparable – from a single cell in a single Jupyter notebook, to a custom data processing pipeline, to whole new packages to disseminate new techniques. And this means that scientific research is increasingly dependent on the ongoing maintenance of the rich, shared ecosystem of open infrastructure software that field-specific and project-specific software builds on.

But grant calls alone will be ineffective unless we also have leaders who can think strategically about issues that cut across the whole software ecosystem, identify the way forward, and write those grants – and those leaders need jobs that let them do this work. Our ecosystem needs gardeners. That's what made my position at BIDS unique, and why I was able to tackle these problems: I was one of the few people in all of science with the mandate, autonomy, stability, and support to do so. Any solution to the sustainability problem needs to find a way to create positions with these properties.

A farewell to the Berkeley Institute for Data Science

In February 2015, I joined the UC Berkeley Institute for Data Science (BIDS) in a very unusual position: I got to focus full-time on making the open Python ecosystem work better for scientists. My contract is ending in a bit over a month, so I'm currently thinking about what's next. But in this post I want to instead look back on what this unique opportunity allowed me to do, both as a kind of personal post-mortem and in the hopes that it might be of interest to people and institutions who are thinking about different models for funding open source and open science. In particular, there's also a follow-up post discussing some implications for software sustainability efforts.

A BIDS retrospective

Might as well start with the worst part: in late 2016 I came down with some serious health issues, and have been on partial disability leave since then. This has been gradually getting better – cross your fingers for me. But it does mean that despite the calendar dates, in terms of hours worked I've only been at BIDS for 2 years and change.

But I'm pretty proud of what I accomplished in that time. There were four main projects I led while at BIDS, that I'll discuss in individual sections below. And to be clear, I'm certainly not claiming exclusive credit for any of these – they all involved lots of other people, who together did way more than I did! But I think it's fair to say that these are all projects where I played a critical role in identifying the issues and finding a way to push the community towards solving them, and that if BIDS hadn't funded my position then none of these things would have happened.

Revitalizing NumPy development

NumPy is so central to numerical work in Python, and so widely used in both academia and industry, that many people assume that it must receive substantial funding and support. But it doesn't; in fact for most of its history it's been maintained by a small group of loosely-organized, unpaid volunteers. When I started at BIDS one of my major goals was to change that, ultimately by getting funding – but simply airdropping money into a community-run OSS project doesn't always produce good results.

So the first priority was to get the existing maintainers on the same page about where we wanted to take the project and how funding could be effectively used – basically paying down "social debt" that had accumulated during the years of under-investment. I organized a developer meeting, and based on the discussions there (and with many other stakeholders) we were ultimately able to get consensus around a governance document (latest version) and technical roadmap. Based on this, I was able to secure two grants totaling $1.3 million from the Moore and Sloan foundations, and we've just finished hiring two full-time NumPy developers at BIDS.

I have to pause here to offer special thanks to the rest of the NumPy grant team at BIDS: Jonathan Dugan, Jarrod Millman, Fernando Pérez, Nelle Varoquaux, and Stéfan van der Walt. I didn't actually have any prior experience with writing grant proposals or hiring people, and initially I was on my own figuring this out, which turned out to be, let's say, challenging... especially since I was trying to do this at the same time as navigating my initial diagnosis and treatment. (It turns out not all buses have wheels.) They deserve major credit for stepping in and generously contributing their time and expertise to keep things going.

Improving Python packaging (especially for science)

Software development, like science in general, is an inherently collaborative activity: we all build on the work of others, and hopefully contribute back our own work for others to build on in turn. One of the main mechanisms for this is the use and publication of software packages. Unfortunately, Python packaging tools have traditionally been notoriously unfriendly and difficult to work with – especially for scientific projects that often require complex native code in C/C++/Fortran – and this has added substantial friction to this kind of collaboration. While at BIDS, I worked on reducing this in two ways: one for users, and one for publishers.

On the package user side, conda has done a great deal to relieve the pain... but only for conda users. For a variety of reasons, many people still need or prefer to use the official community-maintained pip/PyPI/wheel stack. And one major limitation of that stack was that you could distribute pre-compiled packages on Windows and MacOS, but not on the other major OS: Linux. To solve this, I led the creation of the "manylinux" project. This has dramatically improved the user experience around installing Python packages on Linux servers, especially the core scientific stack. When I ran the numbers a few weeks ago (2018-05-07), ~388 million manylinux packages had been downloaded from PyPI, and that number was growing by ~1 million downloads every day, so we're almost certainly past 400 million now. And if you look at those downloads, scientific software is heavily represented: ~30 million downloads of NumPy, ~15 million SciPy, ~15 million pandas, ~12 million scikit-learn, ~8 million matplotlib, ~4 million tensorflow, ... (Fun fact: a back of the envelope calculation1 suggests that the manylinux wheels for SciPy alone have so far prevented ~90 metric tons of CO2 emissions, equivalent to planting ~2,400 trees.)

So manylinux makes things easier for users. Eventually, users become developers in their own right, and want to publish their work. And then they have to learn to use distutils/setuptools, which is... painful. Distutils/setuptools can work well, especially in simple cases, but their design has some fundamental limitations that make them confusing and difficult to extend, and this is especially problematic for any projects with complex native code dependencies or that use NumPy's C API, i.e. scientific packages. This isn't exactly distutils's fault – its design dates back to the last millennium, and no-one could have anticipated all the ways Python would be used over the coming decades. And Python's packaging maintainers have done a heroic job of keeping things working and incrementally improving on extremely minimal resources. But often this has meant piling expedient hacks on top of each other; it's very difficult to revisit fundamental decisions when you're a all-volunteer project struggling to maintain critical infrastructure with millions of stakeholders. And so fighting with distutils/setuptools has remained a rite of passage for Python developers. (And conda can't help you here either: for builds, conda packages rely on distutils/setuptools, just like the rest of us.)

Another of my goals while at BIDS was to chart a path forward out of this tangle – and, with the help of lots of folks at distutils-sig (especially Thomas Kluyver, whose efforts were truly heroic!), we now have one. PEP 518 defines the pyproject.toml file and for the first time makes it possible to extend distutils/setuptools in a reasonable way (for those who know setup.py: this is basically setup_requires, except it works). This recently shipped in pip 10. And PEP 517 isn't quite implemented yet, but soon it will make it easy for projects to abandon distutils/setuptools entirely in favor of tools that are easier to use or better prepared to handle demanding scientific users, making software publication easier and more accessible to ordinary scientists.

The Viridis colormap

When I started at BIDS, matplotlib still used the the awful "jet" colormap by default, despite probably dozens of peer-reviewed articles pointing out how rainbow colormaps like "jet" distort users' understanding of their data, create barriers to accessibility, and lead to bad decisions, including (for example) unnecessary medical diagnostic errors. So I suggested to Stéfan that we fix this. This was an interesting challenge, with two parts: first, the computational challenge of building a set of tools to visualize and design better colormaps, and second and more importantly, the social challenge of convincing people to actually use them. After all, there have been many proposals for better colormaps over the years. Most of them sank without a trace, and it was entirely possible that our colormap "viridis" would do the same.

This required working with the matplotlib community to first find a socially acceptable way to make any changes at all in their default styles – here my suggestion of a style-change-only 2.0 release proved successful (and ultimately led to a much-needed broader style overhaul). Then we had the problem that there are many perfectly reasonable colormaps, and we needed to build consensus around a single proposal without getting derailed by endless discussion – avoiding this was the goal of a talk I gave at SciPy 2015.

In the end, we succeeded beyond our wildest expectations. As of today, my talk's been watched >85,000 times, making it the most popular talk in the history of the SciPy conference. Viridis is now the default colormap in matplotlib, octave, and parts of ggplot2. Its R package receives hundreds of thousands of downloads every month which puts it comfortably in the top 50 most popular R packages. Its fans have ported it to essentially every visualization framework known to humankind. It's been showcased in Nobel-prize winning research and NASA press releases, and inspired stickers and twitter bots and follow-ups from other researchers.

On the one hand, it's "just" a colormap. But it feels pretty good to know that every day millions of people are gaining a little more understanding, more insight, and making better decisions thanks to our work, and that we've permanently raised the bar on good data visualization practice.

Making concurrent programming more accessible

Here's a common problem: writing a program that does multiple things concurrently, either for performance or as an intrinsic part of its functionality – from web servers handling simultaneous users and web spiders that want to fetch lots of pages in parallel, to Jupyter notebooks juggling multiple backend kernels and a UI, to complex simulations running on HPC clusters. But writing correct concurrent programs is notoriously challenging, even for experts. This is a challenge across the industry, but felt particularly acutely by scientists, who generally receive minimal training as software developers, yet often need to write novel high-performance parallel code – since by definition, their work involves pushing the boundary of what's possible. (In fact Software Carpentry originally "grew out of [Greg Wilson's] frustration working with scientists who wanted to parallelize complex programs but didn't know what version control was...".)

Over the last year I've been developing a new paradigm for making practical concurrent programming more accessible to ordinary developers, based on a novel analysis of where some of the difficulties come from, and repurposing some old ideas in language design. In the course of this work I've produced a practical implementation in the Python library Trio, together with a series of articles, including two discussing the theory behind the core new language constructs:

This last project is a bit different than the others – it's more in the way of basic research, so it will be some time before we know the full impact. But so far it's attracting quite a bit of interest across the industry and from language designers (for example) and I suspect that either Trio or something very like it will become the de facto standard library for networking and concurrency in Python.

Other work

Some other smaller things I did at BIDS, besides the four major projects discussed above:

  • Was elected as an honorary PSF Fellow, and to the Python core developer team.

  • Wrote up feedback for the BLAS working group on their proposal for a next generation BLAS API. The BLAS is the set of core linear algebra routines that essentially all number-crunching software is built on, and the BLAS working group is currently developing a the first update in almost two decades. In the past, BLAS has been designed mostly with input from traditional HPC users running Fortran on dedicated clusters; this is the first time NumPy/SciPy have been involved in this process.

  • Provided some assistance with organizing the MOSS grant that funded the new PyPI.

  • Created the h11 HTTP library, and came up with a plan for using it to let urllib3/requests and downstream packages join the new world of Python async concurrency.

  • Had a number of discussions with the conda team about how the conda and pip worlds could cooperate better.

  • And of course lots of general answering of questions, giving of advice, fixing of bugs, triaging of bugs, making of connections, etc.

...and the ones that got away

And finally, there are the ones that got away: projects where I've been working on laying the groundwork, but ran out of time before producing results. I think these are entirely feasible and have transformative potential – I'm mentioning them here partly in hopes that someone picks them up:

PyIR: Here's the problem. Libraries like NumPy and pandas are written in C, which makes them reasonably fast on CPython, but prevents JIT optimizers like PyPy or Numba from being able to speed them up further. If we rewrote them in Python, they'd be fast on PyPy or Numba, but unusably slow on regular CPython. Is there any way to have our cake and eat it too? Right now, our only solution is to maintain multiple copies of NumPy and other key libraries (e.g. Numba and PyPy have both spent significant resources on this), which isn't scalable or sustainable.

So I organized a workshop and invited all the JIT developers I could find. I think we came up with a viable way forward, based around the idea of a Cython-like language that generates C code for CPython, and a common higher-level IR for the JITs, and multiple projects were excited about collaborating on this – but this happened literally the week before I got sick, and I wasn't able to follow up and get things organized. It's still doable though, and could unlock a new level of performance for Python – and as a bonus, in the long run it might provide a way to escape the "C API trap" that currently blocks many improvements to CPython (e.g., removing the GIL).

Telemetry: One reason why developing software like NumPy is challenging is that we actually have very little idea how people use it. If we remove a deprecated API, how disruptive will that be? Is anyone actually using that cool new feature we added? Should we put more resources into optimizing module X or module Y? And what about at the ecosystem level – how many users do different packages have? Which ones are used together? Answering these kinds of questions is crucial to providing responsible stewardship, but right now there's simply no way to do it.

Of course there are many pitfalls to gathering this sort of data; if you're going to do it at all, you have to do it right, with affirmative user consent, clear guidelines for what can be collected and how it can be used, a neutral non-profit to provide oversight, shared infrastructure so we can share the effort across many projects, and so on. But these are all problems that can be solved with the right investment (about which, see below), and doing so could radically change the conversations around maintaining and sustaining open scientific software.

What next?

So there you have it: that's what I've been up to for the last few years. Not everything worked out the way I hoped, but overall I'm extremely proud of what I was able to accomplish, and grateful to BIDS and its funders for providing this opportunity.

As mentioned above, I'm currently considering options for what to do next – if you're interested in discussing possibilities, get in touch!

Or, if you're interested in the broader question of sustainability for open scientific software, I wrote a follow-up post trying to analyze what it was about this position allowed it to be so successful.

  1. Assuming that the people installing SciPy manylinux wheels would instead have built SciPy from source (which is what pip install scipy used to do), that building SciPy takes 10 minutes, and that during that time the computer consumes an extra 50 W of power, then we can calculate 10 minutes * 50 W / 60 minutes/hour / 1000 Wh/kWh * 15,000,000 builds = 125,000 kWh of reduced electricity usage, which I then plugged into this EPA calculator

Companion post for my PyCon 2018 talk on async concurrency using Trio

The talk itself:

(I'm afraid you probably need LibreOffice, and ideally the Montserrat and Deja Vu Sans Mono, to view the slides properly. I haven't posted a PDF because LibreOffice's PDF export makes a mess of slides containing animations.)

Chat: Questions? You're watching the talk months later on youtube? That's cool, we hang out on Gitter chat and you can always ask questions there.

Sprint info: I'll be here at PyCon for the first two days of the sprint – if you want to contribute to Trio, or just play around with it while sitting next to me, then that'd be awesome! Note that we give out commit rights to everyone as soon as their first PR is merged.

Trio's tutorial and reference manual: https://trio.readthedocs.io

Code and issues: https://github.com/python-trio/trio

Articles: For more background on the ideas in Trio:

I really want to follow you on Twitter! I don't really tweet much, but, here you go....

Do you have a blog? Yep. This is it :-).

Control-C handling in Python and Trio

Trio is a new asynchronous I/O library for Python, with a focus on usability and correctness – the goal is to make it easy to get things right.

One thing well-behaved programs should do is exit cleanly when the user hits control-C. In Python this mostly Just Works without developers having to think about it too much, and as part of trio's focus on usability, we'd like to carry that over: there are few things more annoying than a program that refuses to quit when you hit control-C! But preserving correctness in the face of an interrupt that can happen at literally any moment is not at all trivial. This is a place where trio's two goals interact in a surprisingly interesting way! In this post I'll explore some different options for handling control-C, and explain Trio's solution – with a bonus deep dive into signal handling and some rather obscure corners of CPython's guts.

The tl;dr is: if you're writing a program using trio, then control-C should generally Just Work the way you expect from regular Python, i.e., it will raise KeyboardInterrupt somewhere in your code, and this exception then propagates out to unwind your stack, run cleanup handlers, and eventually exit the program. You don't need this article to use trio; you can start with our tutorial and be happy and productive without thinking about control-C ever again. In fact, most developers probably won't even realize that there's anything special happening at all. But if you're curious about how we make the magic go, then read on...

Announcing Trio

As you may recall, I have strong feelings about the design of usable APIs for asynchronous libraries.

So, I wrote my own.

I'm happy to announce the first release of Trio, a new permissively-licensed async library for Python:

Trio is very much inspired by my work with and on Curio, so much credit to Dave Beazley. They don't share any actual code, and at this point there are many small and large divergences all over the stack, but if you're curious the tipping point where I decided I wanted to explore an incompatible approach was here.

Some noteworthy features:

  • Aspires to become production-quality
  • Full support for Windows, Linux, MacOS
  • Full support for both CPython 3.5+ and for PyPy 3.5 pre-releases
  • Flow control is fully async/await-native and easy to reason about: no callbacks, no futures, no implicit concurrency
  • Powerful and composable framework for handling cancellation and timeouts
  • Strong user-centered guarantees around cancel and schedule points make it easier to manage and reason about cooperative concurrency
  • Erlang-influenced interface for task spawning provides a structured system for managing child tasks. If user code raises an exception then it's always propagated until handled, never logged-and-discarded.
  • First-class support for introspection and debugging (example)
  • Powerful built-in testing helpers. For example, you can speed up tests that involve timeouts by using a clock that automatically skips over boring periods
    • As a demonstration of the power of good testing tools, trio's own test suite achieves >98% coverage and runs in ~5 seconds in "slow" mode (~2 seconds in default mode).
  • Interrupting programs with control-C magically just works.
  • A mostly-written tutorial that doesn't assume any familiarity with async/await.
  • A low-level "hazmat" API for when you need to go under the hood. To make sure it's powerful enough, Trio's main synchronization, networking, and threading APIs are implemented using only public interfaces.
  • Exposes a whole laundry list of Python limitations.
  • Lots of missing pieces left for you to help fill in! :-)

I hope you'll check it out!

Why does calloc exist?

[Edit: Welcome Hacker News readers! Before we dive into the neat memory management esoterica, I want to briefly note that as engineers we have an ethical obligation in our work to consider the "safety, health, and welfare of the public", because if we don't, terrible things happen. This is a challenging responsibility that requires we all stay thoughtful and informed – but that's difficult if popular technical news aggregators choose to censor links and discussions about the societal implications of technology. I sympathize with their moderation challenges, but this idea of creating a politics-free safe space is the cowards' way out, quite literally choosing the "absence of tension" over "the presence of justice". I hope the HN moderators find a way to step up to the responsibility their position entails; in the mean time, you might consider also subscribing to to The Recompiler and Model View Culture, and checking out Safety Pin Box, techsolidarity.org, or Fund Club. Anyway, thanks for listening! We now return to our regularly scheduled calloc-related programming, and I hope you enjoy my essay. And if you like this, you might also enjoy Cory Benfield's related post.]

When programming in C, there are two standard ways to allocate some new memory on the heap:

void* buffer1 = malloc(size);
void* buffer2 = calloc(count, size);

malloc allocates an uninitialized array with the given number of bytes, i.e., buffer1 could contain anything. In terms of its public API, calloc is different in two ways: first, it takes two arguments instead of one, and second, it returns memory that is pre-initialized to be all-zeros. So there are lots of books and webpages out there that will claim that the calloc call above is equivalent to calling malloc and then calling memset to fill the memory with zeros:

/* Equivalent to the calloc() call above -- OR IS IT?? */
void* buffer3 = malloc(count * size);
memset(buffer3, 0, count * size);

So... why does calloc exist, if it's equivalent to these 2 lines? The C library is not known for its excessive focus on providing convenient shorthands!

It turns out the answer is less widely known than I had realized! If I were Julia Evans at this point I'd make a neat little comic 😊. But I'm not, so... here's a wall of text.

It turns out there are actually two differences between calling calloc, versus calling malloc + memset.

Difference #1: computers are bad at arithmetic

When calloc multiplies count * size, it checks for overflow, and errors out if the multiplication returns a value that can't fit into a 32- or 64-bit integer (whichever one is relevant for your platform). This is good. If you do the multiplication the naive way I did it above by just writing count * size, then if the values are too large then the multiplication will silently wrap around, and malloc will happily allocate a smaller buffer than we expected. That's bad. "This part of the code thought the buffer was this long but that part of the code thought it was that long" is the beginning of, like, eleventy-billion security advisories every year. (Example)

I wrote a little program to demonstrate. It tries to allocate an buffer containing 263 × 263 = 2126 bytes, first using malloc and then using calloc:

Download source: calloc-overflow-demo.c
 6 int main(int argc, char** argv)
 7 {
 8     size_t huge = INTPTR_MAX;
10     void* buf = malloc(huge * huge);
11     if (!buf) perror("malloc failed");
12     printf("malloc(huge * huge) returned: %p\n", buf);
13     free(buf);
15     buf = calloc(huge, huge);
16     if (!buf) perror("calloc failed");
17     printf("calloc(huge, huge) returned: %p\n", buf);
18     free(buf);
19 }

On my computer, I get:

~$ gcc calloc-overflow-demo.c -o calloc-overflow-demo
~$ ./calloc-overflow-demo
malloc(huge * huge) returned: 0x55c389d94010
calloc failed: Cannot allocate memory
calloc(huge, huge) returned: (nil)

So yeah, apparently malloc successfully allocated a 73786976294838206464 exbiyte array? I'm sure that will work out well. This is a nice thing about calloc: it helps avoid terrible security flaws.

But, it's not that exciting. (I mean, let's be honest: if we really cared about security we wouldn't be writing in C.) It only helps in the particular case where you're deciding how much memory to allocate by multiplying two numbers together. This happens, it's an important case, but there are lots of other cases where we either aren't doing any arithmetic at all, or where we're doing some more complex arithmetic and need a more general solution. Plus, if we wanted to, we could certainly write our own wrapper for malloc that took two arguments and multiplied them together with overflow checking. And in fact if we want an overflow-safe version of realloc, or if we don't want the memory to be zero-initialized, then... we still have to do that. So, it's... nice? But it doesn't really justify calloc's existence.

The other difference, though? Is super, super important.

Difference #2: lies, damned lies, and virtual memory

Here's a little benchmark program that measures how long it takes to calloc a 1 gibibyte buffer versus malloc+memset a 1 gibibyte buffer. (Make sure you compile without optimization, because modern compilers are clever enough to know that free(calloc(...)) is a no-op and optimize it out!) On my laptop I get:

~$ gcc calloc-1GiB-demo.c -o calloc-1GiB-demo
~$ ./calloc-1GiB-demo
calloc+free 1 GiB: 3.44 ms
malloc+memset+free 1 GiB: 365.00 ms

i.e., calloc is more than 100x faster. Our textbooks and manual pages says they're equivalent. What the heck is going on?

The answer, of course, is that calloc is cheating.

For small allocations, calloc literally will just call malloc+memset, so it'll be the same speed. But for larger allocations, most memory allocators will for various reasons make a special request to the operating system to fetch more memory just for this allocation. ("Small" and "large" here are determined by some heuristics inside your memory allocator; for glibc "large" is anything >128 KiB, at least in its default configuration).

When the operating system hands out memory to a process, it always zeros it out first, because otherwise our process would be able to peek at whatever detritus was left in that memory by the last process to use it, which might include, like, crypto keys, or embarrassing fanfiction. So that's the first way that calloc cheats: when you call malloc to allocate a large buffer, then probably the memory will come from the operating system and already be zeroed, so there's no need to call memset. But you don't know that for sure! Memory allocators are pretty inscrutable. So you have to call memset every time just in case. But calloc lives inside the memory allocator, so it knows whether the memory it's returning is fresh from the operating system, and if it is then it skips calling memset. And this is why calloc has to be built into the standard library, and you can't efficiently fake it yourself as a layer on top of malloc.

But this only explains part of the speedup: memset+malloc is actually clearing the memory twice, and calloc is clearing it once, so we might expect calloc to be 2x faster at best. Instead... it's 100x faster. What the heck?

It turns out that the kernel is also cheating! When we ask it for 1 GiB of memory, it doesn't actually go out and find that much RAM and write zeros to it and then hand it to our process. Instead, it fakes it, using virtual memory: it takes a single 4 KiB page of memory that is already full of zeros (which it keeps around for just this purpose), and maps 1 GiB / 4 KiB = 262144 copy-on-write copies of it into our process's address space. So the first time we actually write to each of those 262144 pages, then at that point the kernel has to go and find a real page of RAM, write zeros to it, and then quickly swap it in place of the "virtual" page that was there before. But this happens lazily, on a page-by-page basis.

So in real life, the difference won't be as stark as it looks in our benchmark up above – part of the trick is that calloc is shifting some of the cost of zero'ing out pages until later, while malloc+memset is paying the full price up front. BUT, at least we aren't zero'ing them out twice. And at least we aren't trashing the cache hierarchy up front – if we delay the zero'ing until we were going to write to the pages anyway, then that means both writes happen at the same time, so we only have to pay one set of TLB / L2 cache / etc. misses. And, most importantly, it's possible we might never get around to writing to all of those pages at all, in which case calloc + the kernel's sneaky trickery is a huge win!

Of course, the exact set of optimizations calloc uses will vary depending on your environment. A neat trick that used to be popular was that the kernel would go around and speculatively zero out pages when the system was idle, so that they'd be fresh and ready when needed – but this is out of fashion on current systems. Tiny embedded systems without virtual memory obviously won't use virtual memory trickery. But in general, calloc is never worse than malloc+memset, and on mainstream systems it can do much better.

One real life example is a recent bug in requests, where doing streaming downloads over HTTPS with a large receive block size was chewing up 100% CPU. It turns out that the problem was that when the user said they were willing to handle up to 100 MiB chunks at a time, then requests passed that on to pyopenssl, and then pyopenssl used cffi.new to allocate a 100 MiB buffer to hold the incoming data. But most of the time, there wasn't actually 100 MiB ready to read on the connection; so pyopenssl would allocate this large buffer, but then would only use a small part of it. Except... it turns out that cffi.new emulates calloc by doing malloc+memset, so they were paying to allocate and zero the whole buffer anyway. If cffi.new had used calloc instead, then the bug never would have happened! Hopefully they'll fix that soon.

Or here's another example that comes up in numpy: suppose you want to make a big identity matrix, one with 16384 rows and 16384 columns. That requires allocating a buffer to hold 16384 * 16384 floating point numbers, and each float is 8 bytes, so that comes to 2 GiB of memory total.

Before we create the matrix, our process is using 24 MiB of memory:

>>> import numpy as np
>>> import resource
>>> # this way of fetching memory usage probably only works right on Linux:
>>> def mebibytes_used():
...     return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
>>> mebibytes_used()

Then we allocate a 2 GiB dense matrix:

>>> big_identity_matrix = np.eye(16384)
>>> big_identity_matrix
array([[ 1.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  1.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  1., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  1.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  1.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  1.]])
>>> big_identity_matrix.shape
(16384, 16384)

How much memory is our process using now? The Answer May Surprise You (Learn This One Weird Trick To Slim Down Your Processes Now):

>>> mebibytes_used()

Numpy allocated the array using calloc, and then it wrote 1s in the diagonal... but most of the array is still zeros, so it isn't actually taking up any memory, and our 2 GiB matrix fits into ~60 MiB of actual RAM. Of course there are other ways to accomplish the same thing, like using a real sparse matrix library, but that's not the point. The point is that if you do something like this, calloc will magically make everything more efficient – and it's always at least as fast as the alternative.

So basically, calloc exists because it lets the memory allocator and kernel engage in a sneaky conspiracy to make your code faster and use less memory. You should let it! Don't use malloc+memset!

Changes history:

  • 2016-12-05 14:00 PST: Fix typo: HTTP where I meant HTTPS.
  • 2016-12-05 17:00 PST: Add the HN note.
  • 2016-12-07 01:45 PST: Several changes:
    • Use better error checking style in calloc-overflow-demo.c
    • Clarify that I'm not saying you can't reimplement calloc yourself and get good performance, just that if you want to do that you have to reimplement malloc too – the point is that calloc might look like it can be implemented in terms of malloc, but this is misleading.
    • Add a paragraph noting more explicitly that calloc optimizations vary across systems.

Emerging from the underworld

Github contributions graph showing an empty void from mid-July through mid-October.

SciPy this year was awesome – there's so much wonderful stuff happening in the community, and I had lots of great, productive conversations about how to move forward on different projects that I'm really excited about. ...and then immediately afterwards, I pretty much dropped everything on the floor and disappeared for 3 months, including cancelling several talks and trips. Some people know part of why, but I definitely haven't been as good about keeping my friends and collaborators updated as I'd like to be. So here's an update.

The short version: the week after SciPy I had a minor surgery scheduled, and that went fine – but then afterwards, instead of recovering, I just got sicker and sicker. I couldn't eat – I lost 30 lbs in 60 days – I was feverish and in pain and sleeping 12-16 hours a day, anemic... among other things. If you imagine having food poisoning continuously 2+ months then you won't be too far off.

The good news is that after much medical wrangling (so much medical wrangling), I finally have a diagnosis. The not-so-great news is that it's ulcerative colitis, which is a fairly serious, chronic, auto-immune disorder in which my immune system periodically tries to kill my large intestine. There's a lot of variation in the long-term effects: for some people it's a minor annoyance, and for others it's a permanent severe disability. There's every reason to hope that once I'm out of this massive flare triggered by the surgery etc. then I'll turn out to be on the milder end of the scale (and in retrospect, I'm suspiciously eyeing some past events as signs that I might been having mild symptoms that evaded diagnosis for quite a long time). But realistically, it'll take months or years before I know for sure where I fall.

For now at least, it's good to know what's going on, and I'm responding well to treatment. I'm feeling much, much better. Still not 100%, and I'm trying to ease back into things slowly to avoid over-extending myself and triggering a relapse (hello, spoon theory).

To everyone who's been waiting for a response or followup from me on something: I'm very sorry for disappearing like this. I'm starting to work through my disaster zone of an inbox, and realistically it's going to take some time to get on top of things. I may have to make some tough choices about dropping previous commitments, or asking folks to cover for me – but at least I can do a better job communicating that. I also don't mind if you send me more email or pings or whatever – if anything, it helps to triage. Just FYI though that even though I'm doing somewhat better, I'm still going to have limited bandwidth for a while. Thanks for your understanding and patience! I'm looking forward to getting through this and back to working on more awesome stuff.