Joel Grus - The case against the Jupyter notebook
Towards Data Science - A podcast by The TDS team
To most data scientists, the Jupyter notebook is a staple tool: it’s where they learned the ropes, and it’s where they go to prototype models or explore their data. Basically, it’s the default arena for all their data science work.
But Joel Grus isn’t like most data scientists: he’s a former hedge fund manager and former Googler, and author of Data Science From Scratch. He currently works as a research engineer at the Allen Institute for Artificial Intelligence, and maintains a very active Twitter account.
Oh, and he thinks you should stop using Jupyter notebooks. Now.
When you ask him why, he’ll provide many reasons, but a handful really stand out:
- Hidden state: let’s say you define a variable like a = 1 in the first cell of your notebook. In a later cell, you assign it a new value, say a = 3. This results in fairly predictable behavior as long as you run your notebook in order, from top to bottom. But if you don’t (or worse still, if you run the a = 3 cell and delete it later), it can be hard, or even impossible, to know from a simple inspection of the notebook what the true state of your variables is.
- Replicability: one of the most important things you can do to keep your data science experiments repeatable is to write robust, modular code. Jupyter notebooks implicitly discourage this, because they’re not designed to be modularized (awkward hacks do allow you to import one notebook into another, but they’re, well, awkward). What’s more, to reproduce another person’s results, you first need to reproduce the environment in which their code was run. Vanilla notebooks don’t give you a good way to do that.
- Bad for teaching: Jupyter notebooks make it very easy to write terrible tutorials — you know, the kind where you mindlessly hit “shift-enter” a whole bunch of times, and make your computer do a bunch of stuff that you don’t actually understand? It leads to a lot of frustrated learners, or even worse, a lot of beginners who think they understand how to code, but actually don’t.
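The hidden-state problem from the first bullet can be sketched in plain Python. This is only an illustration, not how Jupyter is implemented: the `cells` dictionary, the `run_cells` helper, and the third cell that computes `result` are all made up for the example, and a plain dict stands in for the kernel’s namespace.

```python
# Each string stands in for one notebook cell (hypothetical cells for illustration).
cells = {
    "cell_1": "a = 1",
    "cell_2": "a = 3",
    "cell_3": "result = a * 10",
}


def run_cells(order):
    """Execute the named cells in the given order against one shared namespace."""
    namespace = {}  # plays the role of the kernel's state
    for name in order:
        exec(cells[name], namespace)
    return namespace["result"]


# Running top to bottom: a ends up as 3 before cell_3 runs.
top_to_bottom = run_cells(["cell_1", "cell_2", "cell_3"])

# Running cell_2 before cell_1: a ends up as 1 before cell_3 runs.
out_of_order = run_cells(["cell_2", "cell_1", "cell_3"])

print(top_to_bottom)  # 30
print(out_of_order)   # 10
```

The cell contents are identical in both runs; only the execution order differs, and nothing on the page records which order was actually used.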
Overall, Joel’s objections to Jupyter notebooks seem to come in large part from his somewhat philosophical view that data scientists should follow the same set of best practices that any good software engineer would. For instance, Joel stresses the importance of writing unit tests (even for data science code), and is a strong proponent of using type annotations (if you aren’t familiar with them, they are well worth learning about).
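To make those two practices concrete, here is a minimal sketch of a type-annotated function with a couple of unit tests. The `normalize` function and its tests are hypothetical examples, not code from Joel or the episode; the tests are written as plain `assert`-based functions of the kind a test runner such as pytest would typically pick up.

```python
from typing import List


def normalize(values: List[float]) -> List[float]:
    """Scale values linearly so the smallest maps to 0.0 and the largest to 1.0."""
    if not values:
        return []
    low, high = min(values), max(values)
    if high == low:
        # All values are equal, so there is no range to scale by.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def test_normalize_maps_to_unit_interval() -> None:
    assert normalize([2.0, 4.0, 6.0]) == [0.0, 0.5, 1.0]


def test_normalize_handles_edge_cases() -> None:
    assert normalize([]) == []
    assert normalize([5.0, 5.0]) == [0.0, 0.0]
```

The annotations document what goes in and what comes out, and a static checker can flag a call that passes the wrong type. Because the function lives in an ordinary module rather than a notebook cell, the tests can import it and run it the same way every time.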
But even Joel thinks Jupyter notebooks have a place in data science: if you’re poking around at a pandas dataframe to do some basic exploratory data analysis, it’s hard to think of a better way to produce helpful plots on the fly than the trusty ol’ Jupyter notebook.
Whatever side of the Jupyter debate you’re on, it’s hard to deny that Joel makes some compelling points. I’m not personally shutting down my Jupyter kernel just yet, but I’m guessing I’ll be firing up my favorite IDE a bit more often in the future.