New in Python 3.7: The breakpoint() function

I started out trying to pick through the interesting features in Python 3.7 but ended up leaving it so long that Python 3.8 is already out. Even so, there’s a feature that I genuinely only noticed a week ago, and it’s a small but significant one.

For as long as I’ve known how to write Python, my standard tool when I’m frustrated with a unit test has been to insert, on the offending line:

import pdb; pdb.set_trace()

I never got used to managing breakpoints in my IDE because I’m often running something remotely or in a Docker container, and remote debugging is usually a pain to set up.

Still, it’s always bugged me that this has to be two lines of code.

Luckily someone else has had the same thoughts as me, and more to the point has got round to doing something about it. From Python 3.7 onwards, there’s a built-in function that allows you to do this in one line:

breakpoint()

And that’s all there is to it.

Digging a bit deeper

It turns out that the code:

breakpoint()

does a little more than the old PDB snippet above, as the documentation helpfully explains:

This function drops you into the debugger at the call site. Specifically, it calls sys.breakpointhook(), passing args and kws straight through. By default, sys.breakpointhook() calls pdb.set_trace() expecting no arguments. In this case, it is purely a convenience function so you don’t have to explicitly import pdb or type as much code to enter the debugger. However, sys.breakpointhook() can be set to some other function and breakpoint() will automatically call that, allowing you to drop into the debugger of choice.

So sys.breakpointhook is exposed as a writable value, and you can assign your own choice of function to it. This isn’t something an application developer will often need, but it’s vital if you’re an IDE writer. When you run Python in an IDE, you might want to provide your own suite of debug tools without PDB getting in the way. If someone’s using your IDE to debug their code and they type breakpoint(), they probably want the IDE breakpoint feature and not PDB.
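
Here’s a quick sketch of what that looks like in practice (the hook function is just a stand-in for whatever an IDE would actually install):

import sys

def my_hook(*args, **kwargs):
    # Stand-in for an IDE's own debugger entry point.
    print("custom debugger invoked with", args, kwargs)

sys.breakpointhook = my_hook                  # replace the default hook
breakpoint("hello", level=2)                  # custom debugger invoked with ('hello',) {'level': 2}
sys.breakpointhook = sys.__breakpointhook__   # restore the pdb-based default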

This raises an interesting question of whether this could be used maliciously. Code could write to sys.breakpointhook to make itself more difficult to debug, or it could check the value of it in order to exhibit different behaviour when run in a non-PDB debugger. This is pretty limited though, since anyone having trouble with an IDE debugger will probably fall back to trying PDB pretty quickly and defeat this simple trick.

New in Python 3.7: Context Variables

This is the first in a series of articles that will look at new features introduced in Python 3.7. I don’t know about anyone else, but I tend only to discover new language features by chance when I come across them on StackOverflow or whatever. I figure a more deliberate process of reading the docs might do me some good, and might help other people as well.

The first feature that caught my eye was Context Variables.

Motivation

Sometimes a library has some kind of hidden state. This usually makes things more convenient for the user, e.g. setting precision in Decimals:

from decimal import *
 
getcontext().prec = 6
print(Decimal(1) / Decimal(7))
# Prints '0.142857'
 
getcontext().prec = 28
print(Decimal(1) / Decimal(7))
# Prints '0.1428571428571428571428571429'

Once you’ve written to prec, the precision is remembered until the next time you change it. If you didn’t have this, you’d need some way to specify the precision in the call itself, something like:

print(divide(Decimal(1), Decimal(7), prec=6))  # hypothetical API

Even if you could figure out a nice API for that, your code would have to pass the context around everywhere it was needed. It’s nicer if the library remembers it for you.

The problem with this is what happens if multiple threads are using the library. If you’re not careful, one thread alters the state just before another thread prints a Decimal, which then comes out with the wrong precision. Worse, the result would depend on exactly the order in which the two threads executed, so the behaviour would be non-deterministic.

Of course, no decent library has this problem with threads. The simple way around it is thread-local state: if I call decimal.getcontext() it returns a value that is only used by the calling thread, and if I change it, the change only affects my thread.
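
As a rough sketch of the idea (this isn’t how the decimal module is actually written), thread-local state looks something like this: each thread that touches the threading.local() object sees its own independent attributes.

import threading

_local = threading.local()

def getprec():
    # Each thread gets its own 'prec'; fall back to a default of 28
    # if this thread hasn't set one yet.
    return getattr(_local, 'prec', 28)

def setprec(value):
    # Only visible to the calling thread.
    _local.prec = value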

However, things get more complicated once we are working with asynchronous code. Consider a couple of asynchronous functions:

import asyncio
 
async def db_fetch(stuff):
    # Simulate a slow query...
    await asyncio.sleep(1)
    # Maybe do something with Decimal context here?
    return 42
 
async def cache_fetch(stuff):
    # Also slow...
    await asyncio.sleep(1)
    # Maybe do something with Decimal context here?
    return 43
 
async def combine():
    first = db_fetch('select * from foo')
    second = cache_fetch('cachekey')
    return await asyncio.gather(first, second)
 
# This works in Jupyter. YMMV if you're running it elsewhere without an event
# loop running...
print(await combine())

There are no threads in this code. But because it’s written asynchronously, the parts of db_fetch and cache_fetch might get executed in an interleaved order. If the two coroutine functions were doing real work (not just pretending to), execution might switch back and forth between them several times as they worked, and the exact pattern would depend on how quickly the DB and the cache returned.

So we can no longer rely on thread-local storage, because even though there is only one thread we are still switching between two areas of the code, and they may change the state in ways that affect each other.

The solution

When coroutines are run concurrently by Python, they are internally wrapped into instances of asyncio.Task. A Task is the basic unit at which execution is scheduled: when control passes from one coroutine to another (because one is blocked and the other gets a chance to run) this is actually handled by calling the _step function on the appropriate task.

The Task class is modified to capture a context on creation, and to activate that context each time control returns to that Task. In simplified form (the elided parts, marked ..., include setting up self._loop):

class Task:
    def __init__(self, coro):
        ...
        # Get the current context snapshot.
        self._context = contextvars.copy_context()
        self._loop.call_soon(self._step, context=self._context)
 
    def _step(self, exc=None):
        ...
        # Every advance of the wrapped coroutine is done in
        # the task's context.
        self._loop.call_soon(self._step, context=self._context)

call_soon is an event-loop method that schedules a function to be called later, on a subsequent pass of the event loop; since Python 3.7 it accepts a context argument saying which context the callback should run in.
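
To make that context argument concrete, here’s a small sketch (the names are purely illustrative): a callback scheduled with an explicit context sees the values set in that context, while one scheduled without it runs in a copy of the caller’s current context.

import asyncio
import contextvars

var = contextvars.ContextVar('var', default='outer')

def callback():
    # Prints whatever 'var' is in the context this callback runs in.
    print(var.get())

async def main():
    ctx = contextvars.copy_context()
    ctx.run(var.set, 'inner')               # set 'var' only in the copy
    loop = asyncio.get_running_loop()
    loop.call_soon(callback)                # default context: prints 'outer'
    loop.call_soon(callback, context=ctx)   # explicit context: prints 'inner'
    await asyncio.sleep(0)                  # yield so the callbacks get to run

asyncio.run(main())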

But what’s actually in the context?

You can think of it as a collection of variable states, essentially like a namespace dict, except that the lookup isn’t done by name (which would raise the possibility of name clashes) but by the ContextVar handle objects themselves.

A library that wants to have asynchronous context declares a context variable:

from contextvars import ContextVar

my_state = ContextVar('my_state')
my_state.set('apple')

The my_state variable is now a handle that we can use to look up a value in the context, and get and set the value. The value can be any Python value, so you can put a dict or an object or whatever.

Code that may run in an asynchronous context will read the value of the context variable any time it needs it like this:

my_state.get()

Behind the scenes, this is getting the value of the my_state variable in the currently active context (which was switched to just before asyncio passed control to the task’s _step method). Therefore the library code can safely read and write this value without interfering with other asynchronous tasks that might be using the same library.

Any time a new Task is created, the context is copied so that the task has its own copy of the context.

The mapping in the Context is an immutable dictionary. This means that copying the context once per Task is still cheap. Most of the time the code won’t actually change the context (or at least, won’t change all the variables in the context) so the unchanged variables can continue to be shared between contexts. Only as and when they are written is a cost incurred, and in this case it’s a necessary cost.
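
You can see the effect of this copying directly with contextvars.copy_context(): a write made inside the copy doesn’t leak back into the original context.

import contextvars

var = contextvars.ContextVar('var', default='original')

ctx = contextvars.copy_context()   # cheap snapshot of the current context
ctx.run(var.set, 'changed')        # the write lands only in the copy

print(var.get())   # 'original': the outer context is untouched
print(ctx[var])    # 'changed'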

If your code makes use of several libraries that use context variables, they will all be storing their values in the same context. This is OK, since the libraries will have different handle objects (the object returned from ContextVar()) so they can’t accidentally overwrite each other’s state.
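
Putting it all together, here’s a small sketch of the per-task isolation (request_id is just an illustrative name, not anything from a real library): each task sets the variable in its own copy of the context, so neither one tramples the other.

import asyncio
import contextvars

request_id = contextvars.ContextVar('request_id', default='<unset>')

async def handler(name):
    request_id.set(name)       # lands in this task's own copy of the context
    await asyncio.sleep(0.1)   # let the other task run in between
    print(name, '->', request_id.get())

async def main():
    await asyncio.gather(handler('task-a'), handler('task-b'))

asyncio.run(main())
# task-a -> task-a
# task-b -> task-b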

Conclusion

Context variables are worth knowing about. My rule of thumb: if you’re tempted to use thread-local state, reach for a context variable instead, unless you’re writing internal code that you know will never be used asynchronously and will never be published for other people who might use it asynchronously. In practice that probably means almost all code currently using thread-local state should use context variables instead.

The internals are a bit hairy to think about, but the public interface looks really nice and simple.

How does q work?

In a previous post I talked about what you can do with the debug logging library q. It has some interesting features that don’t look like standard Python, though of course it all is.

Debug anywhere

Start with the simple things. One of the nice things about q is that you can stick a debug call anywhere, without having to change the logic of your code. So if you have a complex expression like:

if something(foo) + other(bar) > limit:
    # Do something

You can just stick a q() call in there to print out the intermediate value that you’re interested in:

if q(something(foo)) + other(bar) > limit:
    # Do something

All we have to do here is have q() return the argument it was given:

def q(arg):
    # Log out the value of arg
    return arg

It’s a little more complicated in practice, because q() supports being called with any number of arguments, and only returns the first. But for the common case, we can assume we have code like the above. The value gets passed into a function, and gets returned back.
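
If you wanted to sketch that shape yourself (this isn’t q’s actual implementation), it would look something like:

def q(*args):
    # Log every argument, but hand back only the first so the call can be
    # dropped into the middle of an expression without changing its value.
    for arg in args:
        print(repr(arg))
    return args[0] if args else None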

Will this always work? Is there ever a case where it matters that the value has been passed through a function before use rather than being used directly? In lower-level languages this kind of thing can matter, because returning a value from a function means moving the value from one place to another. This could involve copying the value, or even constructing a new instance. But in Python, the function just takes a reference to the object passed in and hands it back, so the only effect is that the reference count ticks up and then back down to where it started; the object itself is never copied or modified.

Smooth operator

Next, we can look at q’s nifty operator magic that lets us tag values to be logged without having to wrap them in brackets.

# This logs out the value of thing.flags
if q/thing.flags & (1 << which_flag):
    apply_flag(thing)

This just makes use of Python operator overloading. This is a feature that doesn’t get used all that much: the idea is that you can define the meaning of the built-in operators like +, * and & when applied to instances of the class you are defining.

In theory operator overloading is really neat, because you can define some new class and have it interact with the language’s operators as if it were a built-in type like int or float. This is used extensively in C++ and Haskell. Why not in Python?

I think the reason is that there are really two sorts of cases where you want this:

  • You’re implementing something which really is a variation on a numeric type: perhaps a rational number class, a decimal class or a tensor class
  • You’re implementing something else that doesn’t really have numeric operations, but you want to use operators rather than explicit method calls to make things a bit more concise.

The first of these is pretty rare, because there are sensible numerics built into Python and a couple of fairly standard third-party libraries like NumPy. You’re not likely to be writing your own class that acts numeric.

The second is more likely to occur, but the culture of Python (explicit is better than implicit) tends to lean against it. The only two cases I can think of off the top of my head are Django Q() objects (nothing to do with the debugging library discussed here), and Scapy concatenations. Overloading might be the right choice here, but you should really weigh the cost of confusing the user with non-standard behaviour against the benefit you get in terms of conciseness.

So anyway, q goes ahead and uses operator overloading. What does that look like?

class Q(object):
 
    # ...
 
    def __truediv__(self, arg):  # a tight-binding operator
        """Prints out and returns the argument."""
        info = self.inspect.getframeinfo(self.sys._getframe(1))
        self.show(info.function, [arg])
        return arg
    # Compat for Python 2 without from __future__ import division turned on
    __div__ = __truediv__

All this is doing is declaring a method on the class with the special name __truediv__. When Python comes across an expression involving /, like:

a / b

it actually looks for a __truediv__ method on the left-hand object and calls it with the right-hand object as the argument, roughly a.__truediv__(b). Looked at this way, the operator behaviour of the built-in types like int and float is actually a special case of the more general process of “dividing” any two objects.
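
A toy class (nothing to do with q itself) makes the dispatch visible:

class Tag:
    def __truediv__(self, other):
        # Invoked for tag / other: log the value and pass it straight through.
        print('saw:', repr(other))
        return other

tag = Tag()
result = tag / 42   # effectively Tag.__truediv__(tag, 42); prints "saw: 42"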

The only limit is your imagination. Or rather, the only limit is whether this is a sensible thing to do in your case, and the answer is that it very rarely is. But I think it’s excusable in the case of q, which is going for extreme brevity. Also, q isn’t meant to be something you build into your shipping code, just something you use occasionally in debugging.

Import shenanigans

Something else might strike you after you’ve been peering at q for a while. It’s subtle, but I think it’s the most interesting (or perhaps the most underhanded) thing in the library.

You’ll notice that when you use q, you just have to do:

import q

And you have an object called q in scope. That’s it. There’s no need to qualify the object with the module name (q.q), and no need for the more explicit form:

from q import q

This is a bit reminiscent of export default in JavaScript: the module only has one thing in it, and we get that rather than a whole namespace of stuff. But lots of Python modules have only one useful class in them, and they don't behave like this. How does q manage it?

First, let's ask what a module really is after we've imported it in Python. From our point of view a module is a text file with code in it, but by the time Python can use it, it has been parsed, compiled and executed to produce a module object. These objects end up in sys.modules, which is more or less a dict mapping names to module objects:

>>> import sys
>>> 'statistics' in sys.modules
False
>>> import statistics
>>> sys.modules['statistics']
<module 'statistics' from '/usr/lib/python3.6/statistics.py'>

The module objects in sys.modules have all the entries defined in the module as attributes:

>>> sys.modules['statistics'].harmonic_mean
<function harmonic_mean at 0x7f71f426d6a8>

In other words, the entry in sys.modules is the thing that gets brought into scope when you do import statistics or whatever. Your code gets a new entry in its namespace that is a reference to the entry in the sys.modules dict.

But the thing we get when we do import q doesn't act quite like a module. As we explored above, it acts like an instance of the Q class, not an instance of module. If it wasn't a Q, the operator overloading and callable nature described above wouldn't work.

This works because q has one last trick up its sleeve: just before the module is done, it does the following:

# Install the Q() object in sys.modules so that "import q" gives a callable q.
sys.modules['q'] = Q()

It's overwriting the entry in sys.modules with something else, in this case an instance of Q. This is a bit sneaky: I don't know if anything in Python is going to assume that everything in sys.modules is an instance of module, but if it did you could hardly blame it.
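
You can reproduce the trick in miniature. Suppose (hypothetically) we write this as mini_q.py:

# mini_q.py
import sys

class Q:
    def __call__(self, arg):
        print('mini_q saw:', repr(arg))
        return arg

# Replace this module's own entry in sys.modules with a callable instance,
# so that "import mini_q" binds the name mini_q to that instance.
sys.modules[__name__] = Q()

Now import mini_q followed by mini_q(42) prints the value and hands it back, much as q does.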

As an aside, this makes the class Q completely inaccessible to the programmer using the library. It never gets into the namespace, and because it's not in the object that's in sys.modules you can't get at it even if you want to:

>>> import q
>>> q.Q
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'Q' object has no attribute 'Q'
>>> from q import Q
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name 'Q'