Saturday, 3 August 2019

Tracking Jupyter notebooks with Git

Recently, I came across a PLOS Ten Simple Rules paper about Jupyter notebooks. Rule number 6 is about version control and suggests the use of nbdime to merge notebooks, or saving the notebooks as python scripts and tracking those with your version control system. I experienced problems with using Git to track changes in Jupyter notebooks a while ago, and have been using the following solution.

For me, the largest nuisance is that figures are stored as very long strings in the output of cells. This makes it almost impossible to manually resolve conflicts while merging two branches. The solution is simple: just clear the output of the cells before committing the notebook. I'm doing this with a simple script (which I found somewhere on the Internet).


The script clear_output_ipynb.py lives in the same folder (called notebooks) as my Jupyter notebooks. I don't track changes in the .ipynb files, but have "clean" copies of the notebooks (with extension .ipynb.cln) that are part of the Git repository. To make life easy, I have two makefiles in my project folder called cln.makefile and nbs.makefile. Before I stage the changes in my notebooks, I first run
$ make -f cln.makefile
which runs the script clear_output_ipynb.py for each notebook in my notebooks folder.


After I pull changes from a remote repository, or switch to another branch, I have to copy all .ipynb.cln files to .ipynb files. For this I have another makefile, and so I run
$ make -f nbs.makefile
before using and modifying the notebooks.


Of course, sometimes I forget to clean the notebooks before committing, or I forget to make the .ipynb files. I've tried to automate the process of cleaning and copying with Git "hooks", but I have not been able to make that work. If somebody knows how, let me know!