Saturday, 3 August 2019

Tracking Jupyter notebooks with Git

Recently, I came across a PLOS Ten Simple Rules paper about Jupyter notebooks. Rule number 6 is about version control and suggests using nbdime to merge notebooks, or saving the notebooks as Python scripts and tracking those with your version control system. I ran into problems when using Git to track changes in Jupyter notebooks a while ago, and have been using the following solution ever since.

For me, the largest nuisance is that figures are stored as very long strings in the output of cells. This makes it almost impossible to manually resolve conflicts while merging two branches. The solution is simple: just clear the output of the cells before committing the notebook. I'm doing this with a simple script (which I found somewhere on the Internet).

#!/usr/bin/env python
from __future__ import print_function
import json
import sys


def main():
    """
    Usage: ./clear_output_ipynb.py filename.ipynb
    Output: filename.ipynb.cln
    Script to clear all output from Jupyter notebooks.
    """
    try:
        inFileName = sys.argv[1]
    except IndexError:
        print("ERROR: no file name entered")
        sys.exit(1)
    outFileName = inFileName + ".cln"
    with open(inFileName) as inFileHandle:
        data = json.load(inFileHandle)
    ## remove output from code cells, set execution count to None
    for cell in data["cells"]:
        if cell["cell_type"] == "code":
            cell["execution_count"] = None
            cell["outputs"] = []
    ## write cleaned data to file
    with open(outFileName, 'w') as outFileHandle:
        json.dump(data, outFileHandle, indent=2, sort_keys=True)


if __name__ == '__main__':
    main()
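
To illustrate what gets stripped, here is a minimal sketch that builds a tiny, made-up notebook with one code cell holding a fake base64 image string, and applies the same clearing step as the script above. The helper name clear_outputs and all of the sample data are assumptions for illustration only; they are not part of the script.

```python
import json


def clear_outputs(notebook):
    """Blank the outputs and execution counts of all code cells (in place)."""
    for cell in notebook["cells"]:
        if cell["cell_type"] == "code":
            cell["execution_count"] = None
            cell["outputs"] = []
    return notebook


# Made-up notebook: one markdown cell and one code cell with an image output.
fake_png = "iVBORw0KGgo" + "A" * 5000  # stand-in for a long base64-encoded figure
notebook = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "execution_count": 7,
            "metadata": {},
            "outputs": [
                {
                    "output_type": "display_data",
                    "data": {"image/png": fake_png},
                    "metadata": {},
                }
            ],
            "source": ["plt.plot(x, y)"],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 2,
}

# Compare the serialized size before and after clearing the outputs.
size_before = len(json.dumps(notebook, indent=2, sort_keys=True))
clear_outputs(notebook)
size_after = len(json.dumps(notebook, indent=2, sort_keys=True))
print("characters before:", size_before, "after:", size_after)
```

The cell sources and the markdown cell are left untouched; only the bulky outputs and the execution counter disappear, which is what keeps the diffs readable.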

The script clear_output_ipynb.py lives in the same folder (called notebooks) as my Jupyter notebooks. I don't track changes in the .ipynb files, but have "clean" copies of the notebooks (with extension .ipynb.cln) that are part of the Git repository. To make life easy, I have two makefiles in my project folder called cln.makefile and nbs.makefile. Before I stage the changes in my notebooks, I first run
$ make -f cln.makefile
which runs the script clear_output_ipynb.py for each notebook in my notebooks folder.

NOTEBOOKS := $(wildcard notebooks/*.ipynb)
CLEAN_NOTEBOOKS := $(NOTEBOOKS:.ipynb=.ipynb.cln)

all: $(CLEAN_NOTEBOOKS)

%.ipynb.cln: %.ipynb
	notebooks/clear_output_ipynb.py $<

After I pull changes from a remote repository, or switch to another branch, I have to copy all .ipynb.cln files to .ipynb files. For this I have another makefile, and so I run
$ make -f nbs.makefile
before using and modifying the notebooks.

CLEAN_NOTEBOOKS := $(wildcard notebooks/*.ipynb.cln)
NOTEBOOKS := $(CLEAN_NOTEBOOKS:.ipynb.cln=.ipynb)

all: $(NOTEBOOKS)

%.ipynb: %.ipynb.cln
	cp $< $@
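
Both makefiles rely on make's timestamp comparison: a target is only rebuilt when it is missing or older than the file it depends on. For readers who want to see that check spelled out, here is a rough Python sketch of it. The function name stale_targets is hypothetical, and the sketch only lists what would be rebuilt; it does not replace the makefiles.

```python
import glob
import os


def stale_targets(folder, src_ext, dst_ext):
    """Return (source, target) pairs whose target is missing or older than its source.

    This mimics the timestamp comparison that make performs for a pattern rule.
    """
    pairs = []
    for src in sorted(glob.glob(os.path.join(folder, "*" + src_ext))):
        dst = src[: -len(src_ext)] + dst_ext
        if not os.path.exists(dst) or os.path.getmtime(dst) < os.path.getmtime(src):
            pairs.append((src, dst))
    return pairs


# Notebooks that need cleaning before staging (the cln.makefile direction).
for src, dst in stale_targets("notebooks", ".ipynb", ".ipynb.cln"):
    print("needs cleaning:", src, "->", dst)

# Clean copies that need to be copied back (the nbs.makefile direction).
for src, dst in stale_targets("notebooks", ".ipynb.cln", ".ipynb"):
    print("needs copying:", src, "->", dst)
```

One consequence of the timestamp rule is worth keeping in mind: after a pull or a branch switch, a .ipynb file that was edited locally more recently than the incoming .ipynb.cln file is not overwritten by nbs.makefile.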

Of course, sometimes I forget to clean the notebooks before committing, or I forget to make the .ipynb files. I've tried to automate the process of cleaning and copying with Git "hooks", but I have not been able to make that work. If somebody knows how, let me know!
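
One possible starting point, offered as an untested sketch rather than a working solution: Git runs an executable file named .git/hooks/pre-commit from the root of the working tree before each commit, and that file can be a Python script. The sketch below assumes the notebooks folder layout described above, re-implements the clearing step instead of calling clear_output_ipynb.py, and stages the refreshed .ipynb.cln files with git add. The function name write_clean_copy and the constant NOTEBOOK_DIR are made up for this example.

```python
#!/usr/bin/env python
"""Hypothetical .git/hooks/pre-commit: refresh and stage the clean notebook copies."""
import glob
import json
import os
import subprocess

NOTEBOOK_DIR = "notebooks"  # assumption: same folder layout as in this post


def write_clean_copy(path):
    """Write path + '.cln' without outputs and execution counts; return its name."""
    with open(path) as handle:
        data = json.load(handle)
    for cell in data["cells"]:
        if cell["cell_type"] == "code":
            cell["execution_count"] = None
            cell["outputs"] = []
    clean_path = path + ".cln"
    with open(clean_path, "w") as handle:
        json.dump(data, handle, indent=2, sort_keys=True)
    return clean_path


def main():
    notebooks = sorted(glob.glob(os.path.join(NOTEBOOK_DIR, "*.ipynb")))
    clean_paths = [write_clean_copy(nb) for nb in notebooks]
    if clean_paths:
        # Stage the refreshed copies so they become part of this commit.
        # A failing git call raises an exception, which aborts the commit.
        subprocess.check_call(["git", "add", "--"] + clean_paths)


if __name__ == "__main__":
    main()
```

The file would have to be made executable (chmod +x .git/hooks/pre-commit), and because hooks are not part of the repository, every clone needs its own copy. Note that this sketch stages every clean copy on each commit, which may not be what you want when committing only part of your changes. The copying step could in principle be handled the same way by post-checkout and post-merge hooks that call make -f nbs.makefile.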