End-to-end training of deep autoencoders in IPython

While working on my deep learning project, I hacked together a couple of simple support methods for IPython that, at least for me, greatly increased its usefulness for iterative optimization. In this post I will concentrate on IPython, while I defer the discussion of the model and its design to a future post. At the end of this post you'll find a teaser[1], and I have also uploaded the HTML file and the notebook in their current state.

My goal is to train a deep autoencoder[2] on images of multiple digits. This way I hope to generate features that can be used by a second model to recognize the sequence of digits shown in the images. One problem with these kinds of models, and with optimization problems in general, is that the parameters need some tweaking: pick the learning rate too high and the model diverges, pick it too low and you'll wait forever. Similar trade-offs can be observed for the other parameters and, as usual, there are a lot of them. Therefore, being able to experiment with the parameter values and inspect the model while it is being optimized is a great advantage.

To keep IPython responsive while optimizing the model I turned to threads. In contrast to the multiprocessing option, this keeps the model and the notebook in the same process and thereby avoids constantly copying data. Since Python does not provide an API to stop threads, additional logic is required, for example something along the lines of

import threading

class Optimizer(threading.Thread):
    def run(self):
        model = construct_model()
        self.is_running = True

        for iteration in range(100):
            update_and_evaluate(model)

            # Abort cooperatively once the flag has been cleared from the notebook.
            if not self.is_running:
                return

This way the training can easily be aborted by setting the is_running attribute to False. To allow stopping and resuming of the optimization, an additional indirection is required. My idea was to use generators as cooperative coroutines: after each yield the is_running flag is checked, and when the optimization has been stopped, it can easily be resumed by calling the next method of the generator. A stripped-down sketch of this mechanism follows after the usage example below. The final code posted to GitHub implements all of these features, complete with shiny, clickable buttons to control the execution. It can be used as

from parallel_coroutine import ParallelCoroutine

training = ParallelCoroutine()

@training.execute
def optimize(self):
    model = construct_model()
    
    for iteration in range(100):
        yield update_and_evaluate(model)

training.start()
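
To make the underlying idea explicit, here is a minimal sketch of a generator-driven runner. It is independent of the actual ParallelCoroutine code on GitHub: the class name GeneratorRunner is made up for this sketch, and construct_model and update_and_evaluate are again placeholders.

import threading

class GeneratorRunner(object):
    """Sketch: drive a generator in a background thread so that it can be
    stopped after any yield and resumed later."""

    def __init__(self, generator):
        self.generator = generator
        self.is_running = False
        self.thread = None

    def _loop(self):
        # Each iteration calls next() on the generator, resuming it after
        # the last yield; afterwards the stop flag is checked.
        for _ in self.generator:
            if not self.is_running:
                return

    def start(self):
        self.is_running = True
        # A Thread can only be started once, so a fresh one is created on
        # every resume; the generator itself keeps all optimization state.
        self.thread = threading.Thread(target=self._loop)
        self.thread.start()

    def stop(self):
        self.is_running = False
        self.thread.join()

It could then be used much like the ParallelCoroutine above:

def optimize():
    model = construct_model()
    for iteration in range(100):
        yield update_and_evaluate(model)

runner = GeneratorRunner(optimize())
runner.start()  # training runs in the background
runner.stop()   # pauses after the current iteration
runner.start()  # resumes exactly where it left off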

To control the training with graphical buttons, one just has to evaluate the controls attribute of the training object.
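
The controls themselves are part of the code on GitHub. Purely as an illustration of how such buttons can be wired up in a notebook, the sketched runner from above could be hooked to two buttons with the ipywidgets package; the wiring shown here is an assumption, not the actual implementation.

import ipywidgets as widgets
from IPython.display import display

# Two buttons that call the start/stop methods of the sketched runner.
start_button = widgets.Button(description='Start')
stop_button = widgets.Button(description='Stop')
start_button.on_click(lambda _: runner.start())
stop_button.on_click(lambda _: runner.stop())

display(widgets.HBox([start_button, stop_button]))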

Finally, one problem remains that cannot be circumvented: the global interpreter lock (GIL). I implemented the autoencoder using the Theano package developed in the group of Yoshua Bengio. Unfortunately, Theano does not release the GIL. Since only one Python thread can execute at a time, the notebook feels a bit sluggish while the optimization is running. Still, overall this technique greatly simplified my workflow and sped up experimentation quite a lot.


[1] The model is a four-layer denoising autoencoder with tied weights and soft rectified linear units for all layers but the feature layer and the reconstruction layer. The model tries to learn an encoding of the inputs with 1568 units into a feature vector with 400 units; in between, two layers with 2500 units each are used as intermediate stages. The training proceeds over 15 epochs with a learning rate of 0.001 for the initial 10 epochs and of 0.0001 for the last 5 epochs. During training the cross-entropy reconstruction loss is minimized via stochastic gradient descent and back-propagation. In each epoch one million training examples are processed in batches of 20 samples at a time. Additionally, the input is corrupted with 50 % salt-and-pepper noise to train robust features and a dropout regularizer with a rate of 25 % is used to prevent over-fitting.
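
For readers who prefer code, a rough Theano sketch of such an architecture could look as follows. This is not the actual model (which I will describe in a future post): the linear feature layer, the sigmoid reconstruction layer, the initialization, and all variable names are assumptions of this sketch, and the input corruption and dropout are omitted.

import numpy as np
import theano
import theano.tensor as T

rng = np.random.RandomState(0)
layer_sizes = [1568, 2500, 2500, 400]           # input -> 2500 -> 2500 -> features

# One weight matrix per layer; the decoder reuses them transposed (tied weights).
weights = [theano.shared(np.asarray(rng.uniform(-0.01, 0.01, (n_in, n_out)),
                                    dtype=theano.config.floatX))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

x = T.matrix('x')                               # a mini-batch of (corrupted) inputs
target = T.matrix('target')                     # the corresponding clean inputs

# Encoder: softplus ("soft rectified linear") units on the intermediate layers,
# a linear feature layer (an assumption of this sketch).
hidden = x
for W in weights[:-1]:
    hidden = T.nnet.softplus(T.dot(hidden, W))
features = T.dot(hidden, weights[-1])

# Decoder with tied, transposed weights and a sigmoid reconstruction layer,
# matching the cross-entropy loss.
hidden = features
for W in reversed(weights[1:]):
    hidden = T.nnet.softplus(T.dot(hidden, W.T))
reconstruction = T.nnet.sigmoid(T.dot(hidden, weights[0].T))

loss = T.nnet.binary_crossentropy(reconstruction, target).mean()

# Plain stochastic gradient descent on the tied weights.
learning_rate = T.scalar('learning_rate')
gradients = T.grad(loss, weights)
updates = [(W, W - learning_rate * g) for W, g in zip(weights, gradients)]
train_step = theano.function([x, target, learning_rate], loss, updates=updates)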

[2] Well, as I learned, three layers are only weakly deep: in a recent blog post, Ilya Sutskever defined (large) deep neural networks as having 10-20 layers. Actually, I should consider a deeper model, since deeper models become exponentially more powerful. However, without a GPU, larger models are somewhat frightening from a runtime perspective.