njs blog

Announcing Trio

As you may recall, I have strong feelings about the design of usable APIs for asynchronous libraries.

So, I wrote my own.

I'm happy to announce the first release of Trio, a new permissively-licensed async library for Python:

Trio is very much inspired by my work with and on Curio, so much credit to Dave Beazley. They don't share any actual code, and at this point there are many small and large divergences all over the stack, but if you're curious the tipping point where I decided I wanted to explore an incompatible approach was here.

Some noteworthy features:

  • Aspires to become production-quality
  • Full support for Windows, Linux, MacOS
  • Full support for both CPython 3.5+ and for PyPy 3.5 pre-releases
  • Flow control is fully async/await-native and easy to reason about: no callbacks, no futures, no implicit concurrency
  • Powerful and composable framework for handling cancellation and timeouts
  • Strong user-centered guarantees around cancel and schedule points make it easier to manage and reason about cooperative concurrency
  • Erlang-influenced interface for task spawning provides a structured system for managing child tasks. If user code raises an exception then it's always propagated until handled, never logged-and-discarded.
  • First-class support for introspection and debugging (example)
  • Powerful built-in testing helpers. For example, you can speed up tests that involve timeouts by using a clock that automatically skips over boring periods
    • As a demonstration of the power of good testing tools, trio's own test suite achieves >98% coverage and runs in ~5 seconds in "slow" mode (~2 seconds in default mode).
  • Interrupting programs with control-C magically just works.
  • A mostly-written tutorial that doesn't assume any familiarity with async/await.
  • A low-level "hazmat" API for when you need to go under the hood. To make sure it's powerful enough, Trio's main synchronization, networking, and threading APIs are implemented using only public interfaces.
  • Exposes a whole laundry list of Python limitations.
  • Lots of missing pieces left for you to help fill in! :-)

I hope you'll check it out!

Why does calloc exist?

[Edit: Welcome Hacker News readers! Before we dive into the neat memory management esoterica, I want to briefly note that as engineers we have an ethical obligation in our work to consider the "safety, health, and welfare of the public", because if we don't, terrible things happen. This is a challenging responsibility that requires we all stay thoughtful and informed – but that's difficult if popular technical news aggregators choose to censor links and discussions about the societal implications of technology. I sympathize with their moderation challenges, but this idea of creating a politics-free safe space is the cowards' way out, quite literally choosing the "absence of tension" over "the presence of justice". I hope the HN moderators find a way to step up to the responsibility their position entails; in the mean time, you might consider also subscribing to to The Recompiler and Model View Culture, and checking out Safety Pin Box, techsolidarity.org, or Fund Club. Anyway, thanks for listening! We now return to our regularly scheduled calloc-related programming, and I hope you enjoy my essay. And if you like this, you might also enjoy Cory Benfield's related post.]


When programming in C, there are two standard ways to allocate some new memory on the heap:

void* buffer1 = malloc(size);
void* buffer2 = calloc(count, size);

malloc allocates an uninitialized array with the given number of bytes, i.e., buffer1 could contain anything. In terms of its public API, calloc is different in two ways: first, it takes two arguments instead of one, and second, it returns memory that is pre-initialized to be all-zeros. So there are lots of books and webpages out there that will claim that the calloc call above is equivalent to calling malloc and then calling memset to fill the memory with zeros:

/* Equivalent to the calloc() call above -- OR IS IT?? */
void* buffer3 = malloc(count * size);
memset(buffer3, 0, count * size);

So... why does calloc exist, if it's equivalent to these 2 lines? The C library is not known for its excessive focus on providing convenient shorthands!

It turns out the answer is less widely known than I had realized! If I were Julia Evans at this point I'd make a neat little comic 😊. But I'm not, so... here's a wall of text.

It turns out there are actually two differences between calling calloc, versus calling malloc + memset.

Difference #1: computers are bad at arithmetic

When calloc multiplies count * size, it checks for overflow, and errors out if the multiplication returns a value that can't fit into a 32- or 64-bit integer (whichever one is relevant for your platform). This is good. If you do the multiplication the naive way I did it above by just writing count * size, then if the values are too large then the multiplication will silently wrap around, and malloc will happily allocate a smaller buffer than we expected. That's bad. "This part of the code thought the buffer was this long but that part of the code thought it was that long" is the beginning of, like, eleventy-billion security advisories every year. (Example)

I wrote a little program to demonstrate. It tries to allocate an buffer containing 263 × 263 = 2126 bytes, first using malloc and then using calloc:

Download source: calloc-overflow-demo.c
 6 int main(int argc, char** argv)
 7 {
 8     size_t huge = INTPTR_MAX;
 9 
10     void* buf = malloc(huge * huge);
11     if (!buf) perror("malloc failed");
12     printf("malloc(huge * huge) returned: %p\n", buf);
13     free(buf);
14 
15     buf = calloc(huge, huge);
16     if (!buf) perror("calloc failed");
17     printf("calloc(huge, huge) returned: %p\n", buf);
18     free(buf);
19 }

On my computer, I get:

~$ gcc calloc-overflow-demo.c -o calloc-overflow-demo
~$ ./calloc-overflow-demo
malloc(huge * huge) returned: 0x55c389d94010
calloc failed: Cannot allocate memory
calloc(huge, huge) returned: (nil)

So yeah, apparently malloc successfully allocated a 73786976294838206464 exbiyte array? I'm sure that will work out well. This is a nice thing about calloc: it helps avoid terrible security flaws.

But, it's not that exciting. (I mean, let's be honest: if we really cared about security we wouldn't be writing in C.) It only helps in the particular case where you're deciding how much memory to allocate by multiplying two numbers together. This happens, it's an important case, but there are lots of other cases where we either aren't doing any arithmetic at all, or where we're doing some more complex arithmetic and need a more general solution. Plus, if we wanted to, we could certainly write our own wrapper for malloc that took two arguments and multiplied them together with overflow checking. And in fact if we want an overflow-safe version of realloc, or if we don't want the memory to be zero-initialized, then... we still have to do that. So, it's... nice? But it doesn't really justify calloc's existence.

The other difference, though? Is super, super important.

Difference #2: lies, damned lies, and virtual memory

Here's a little benchmark program that measures how long it takes to calloc a 1 gibibyte buffer versus malloc+memset a 1 gibibyte buffer. (Make sure you compile without optimization, because modern compilers are clever enough to know that free(calloc(...)) is a no-op and optimize it out!) On my laptop I get:

~$ gcc calloc-1GiB-demo.c -o calloc-1GiB-demo
~$ ./calloc-1GiB-demo
calloc+free 1 GiB: 3.44 ms
malloc+memset+free 1 GiB: 365.00 ms

i.e., calloc is more than 100x faster. Our textbooks and manual pages says they're equivalent. What the heck is going on?

The answer, of course, is that calloc is cheating.

For small allocations, calloc literally will just call malloc+memset, so it'll be the same speed. But for larger allocations, most memory allocators will for various reasons make a special request to the operating system to fetch more memory just for this allocation. ("Small" and "large" here are determined by some heuristics inside your memory allocator; for glibc "large" is anything >128 KiB, at least in its default configuration).

When the operating system hands out memory to a process, it always zeros it out first, because otherwise our process would be able to peek at whatever detritus was left in that memory by the last process to use it, which might include, like, crypto keys, or embarrassing fanfiction. So that's the first way that calloc cheats: when you call malloc to allocate a large buffer, then probably the memory will come from the operating system and already be zeroed, so there's no need to call memset. But you don't know that for sure! Memory allocators are pretty inscrutable. So you have to call memset every time just in case. But calloc lives inside the memory allocator, so it knows whether the memory it's returning is fresh from the operating system, and if it is then it skips calling memset. And this is why calloc has to be built into the standard library, and you can't efficiently fake it yourself as a layer on top of malloc.

But this only explains part of the speedup: memset+malloc is actually clearing the memory twice, and calloc is clearing it once, so we might expect calloc to be 2x faster at best. Instead... it's 100x faster. What the heck?

It turns out that the kernel is also cheating! When we ask it for 1 GiB of memory, it doesn't actually go out and find that much RAM and write zeros to it and then hand it to our process. Instead, it fakes it, using virtual memory: it takes a single 4 KiB page of memory that is already full of zeros (which it keeps around for just this purpose), and maps 1 GiB / 4 KiB = 262144 copy-on-write copies of it into our process's address space. So the first time we actually write to each of those 262144 pages, then at that point the kernel has to go and find a real page of RAM, write zeros to it, and then quickly swap it in place of the "virtual" page that was there before. But this happens lazily, on a page-by-page basis.

So in real life, the difference won't be as stark as it looks in our benchmark up above – part of the trick is that calloc is shifting some of the cost of zero'ing out pages until later, while malloc+memset is paying the full price up front. BUT, at least we aren't zero'ing them out twice. And at least we aren't trashing the cache hierarchy up front – if we delay the zero'ing until we were going to write to the pages anyway, then that means both writes happen at the same time, so we only have to pay one set of TLB / L2 cache / etc. misses. And, most importantly, it's possible we might never get around to writing to all of those pages at all, in which case calloc + the kernel's sneaky trickery is a huge win!

Of course, the exact set of optimizations calloc uses will vary depending on your environment. A neat trick that used to be popular was that the kernel would go around and speculatively zero out pages when the system was idle, so that they'd be fresh and ready when needed – but this is out of fashion on current systems. Tiny embedded systems without virtual memory obviously won't use virtual memory trickery. But in general, calloc is never worse than malloc+memset, and on mainstream systems it can do much better.

One real life example is a recent bug in requests, where doing streaming downloads over HTTPS with a large receive block size was chewing up 100% CPU. It turns out that the problem was that when the user said they were willing to handle up to 100 MiB chunks at a time, then requests passed that on to pyopenssl, and then pyopenssl used cffi.new to allocate a 100 MiB buffer to hold the incoming data. But most of the time, there wasn't actually 100 MiB ready to read on the connection; so pyopenssl would allocate this large buffer, but then would only use a small part of it. Except... it turns out that cffi.new emulates calloc by doing malloc+memset, so they were paying to allocate and zero the whole buffer anyway. If cffi.new had used calloc instead, then the bug never would have happened! Hopefully they'll fix that soon.

Or here's another example that comes up in numpy: suppose you want to make a big identity matrix, one with 16384 rows and 16384 columns. That requires allocating a buffer to hold 16384 * 16384 floating point numbers, and each float is 8 bytes, so that comes to 2 GiB of memory total.

Before we create the matrix, our process is using 24 MiB of memory:

>>> import numpy as np
>>> import resource
>>> # this way of fetching memory usage probably only works right on Linux:
>>> def mebibytes_used():
...     return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
...
>>> mebibytes_used()
24.35546875

Then we allocate a 2 GiB dense matrix:

>>> big_identity_matrix = np.eye(16384)
>>> big_identity_matrix
array([[ 1.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  1.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  1., ...,  0.,  0.,  0.],
       ...,
       [ 0.,  0.,  0., ...,  1.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  1.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  1.]])
>>> big_identity_matrix.shape
(16384, 16384)

How much memory is our process using now? The Answer May Surprise You (Learn This One Weird Trick To Slim Down Your Processes Now):

>>> mebibytes_used()
88.3515625

Numpy allocated the array using calloc, and then it wrote 1s in the diagonal... but most of the array is still zeros, so it isn't actually taking up any memory, and our 2 GiB matrix fits into ~60 MiB of actual RAM. Of course there are other ways to accomplish the same thing, like using a real sparse matrix library, but that's not the point. The point is that if you do something like this, calloc will magically make everything more efficient – and it's always at least as fast as the alternative.

So basically, calloc exists because it lets the memory allocator and kernel engage in a sneaky conspiracy to make your code faster and use less memory. You should let it! Don't use malloc+memset!


Changes history:

  • 2016-12-05 14:00 PST: Fix typo: HTTP where I meant HTTPS.
  • 2016-12-05 17:00 PST: Add the HN note.
  • 2016-12-07 01:45 PST: Several changes:
    • Use better error checking style in calloc-overflow-demo.c
    • Clarify that I'm not saying you can't reimplement calloc yourself and get good performance, just that if you want to do that you have to reimplement malloc too – the point is that calloc might look like it can be implemented in terms of malloc, but this is misleading.
    • Add a paragraph noting more explicitly that calloc optimizations vary across systems.

Emerging from the underworld

Github contributions graph showing an empty void from mid-July through mid-October.

SciPy this year was awesome – there's so much wonderful stuff happening in the community, and I had lots of great, productive conversations about how to move forward on different projects that I'm really excited about. ...and then immediately afterwards, I pretty much dropped everything on the floor and disappeared for 3 months, including cancelling several talks and trips. Some people know part of why, but I definitely haven't been as good about keeping my friends and collaborators updated as I'd like to be. So here's an update.

The short version: the week after SciPy I had a minor surgery scheduled, and that went fine – but then afterwards, instead of recovering, I just got sicker and sicker. I couldn't eat – I lost 30 lbs in 60 days – I was feverish and in pain and sleeping 12-16 hours a day, anemic... among other things. If you imagine having food poisoning continuously 2+ months then you won't be too far off.

The good news is that after much medical wrangling (so much medical wrangling), I finally have a diagnosis. The not-so-great news is that it's ulcerative colitis, which is a fairly serious, chronic, auto-immune disorder in which my immune system periodically tries to kill my large intestine. There's a lot of variation in the long-term effects: for some people it's a minor annoyance, and for others it's a permanent severe disability. There's every reason to hope that once I'm out of this massive flare triggered by the surgery etc. then I'll turn out to be on the milder end of the scale (and in retrospect, I'm suspiciously eyeing some past events as signs that I might been having mild symptoms that evaded diagnosis for quite a long time). But realistically, it'll take months or years before I know for sure where I fall.

For now at least, it's good to know what's going on, and I'm responding well to treatment. I'm feeling much, much better. Still not 100%, and I'm trying to ease back into things slowly to avoid over-extending myself and triggering a relapse (hello, spoon theory).

To everyone who's been waiting for a response or followup from me on something: I'm very sorry for disappearing like this. I'm starting to work through my disaster zone of an inbox, and realistically it's going to take some time to get on top of things. I may have to make some tough choices about dropping previous commitments, or asking folks to cover for me – but at least I can do a better job communicating that. I also don't mind if you send me more email or pings or whatever – if anything, it helps to triage. Just FYI though that even though I'm doing somewhat better, I'm still going to have limited bandwidth for a while. Thanks for your understanding and patience! I'm looking forward to getting through this and back to working on more awesome stuff.

Stochastic descent

Emacs has a lot of stuff in it. Like, really, a lot. Most of the time it doesn't matter, you can just use whatever bits you know and be happy, but there's always more to learn.

For example, here's a thing I often want to do: starting from whatever file I have open, open the directory containing that file.

So my fingers have memorized the sequence: C-x C-f C-j

(For the uninitiated, that's emacs-ese for "while holding down the control key, press X F J in that order".)

This breaks down as:

  • C-x C-f: this runs the find-file command (or actually ido-find-file in my case), and causes emacs to ask for the name of a file to open.

  • C-j: The prompt opens with the current directory filled in as the initial value, so this says "yes, I want that". It's like hitting enter, except that if you just hit enter without having typed anything then emacs figures you made a mistake and want to cancel out; C-j skips that logic and accepts the default.

    (Allegedly, this is intuitive: C-j often does something similar to hitting enter in emacs, and in other classic terminal tools. This is because in ASCII, the newline character has character code 10, J is the tenth letter of the alphabet, and on old-school terminals the "control" key was a modifier where hitting the Nth letter of the alphabet would send the raw ascii control code N. You see: intuitive. Looking at the ASCII table, then this also explains why your terminal sometimes freaks out when trying to distinguish between C-h and backspace [1], why C-d in unix command-line tools means end-of-file, why C-g is traditionally used for interruptions, and why Windows files on Unix sometimes end up displayed with ^M crud. The more you know 🌠)

Anyway, I've been typing C-x C-f C-j multiple times a day for, uh. 15 years now? Probably more?

Last night I missed a keystroke, and accidentally typed: C-x C-j

And it worked!

It turns out that C-x C-j is dired-jump, which opens a dired buffer for the directory containing the current file, and then puts the cursor on top of that file. So actually it works even better than C-x C-f C-j, which doesn't do that last part. (Also, if you're already in a dired-buffer, C-x C-j takes you up to the parent directory.)

I never knew this existed, even though it's been lurking there forever, just 1 Levenshtein distance away. As far as I know, the similarity is a complete and utter coincidence – one "j" is short for "jump", and the other is short for "the tenth letter of the alphabet".

I've probably used it 20 times since last night.

[Update, 2016-10-29: I'm informed that the C-x C-j key binding isn't present by default, but requires that you have loaded the dired-x package. You can do this by adding some autoload nonsense to your .emacs, or just (require 'dired-x). While you're at it you should turn on dired-omit-mode, which is apparently why I had dired-x in the first place.]

(While we're here, a random bonus emacs tip that also took me wayyyy too long to discover: in dired-mode, try hitting C-x C-q. It makes the buffer editable, so you can go around and change the text so that it looks like it's describing the directory that you wish you had – you can rename files, use search-replace, whatever – and when you hit C-x C-s, emacs will go and rearrange the real files on disk to match your changes. Plus, if any of those files are open in emacs, then the buffers will be automagically redirected to point to the new name, so you avoid the annoying thing where you rename the file in the terminal and then the next time you save emacs puts it back where it was before.)

[1]Do please appreciate the Wikipedian who labored over this page's illustrative figure + caption.