Gaussian Visible Units within the pylearn2 DBM package

In planning my initial experiments, I tried to identify tasks that could be useful to me or others as we continue with this speech synthesis project. So, to start, here are a few words about my long-term thinking for the project:

I’m drawn to the convenient generative properties of probabilistic models like the RBM. However, recent results have shown that unsupervised pretraining with RBMs is usually only useful when the number of labeled examples is small, especially if there is unlabeled data to leverage. We’ve also seen that RNNs and their more sophisticated siblings are especially well suited to modeling time series (like audio). I have begun to look into existing work on combining these two approaches, e.g.:

  • Boulanger-Lewandowski, Nicolas, Yoshua Bengio, and Pascal Vincent. 2012. “Modeling Temporal Dependencies in High-Dimensional Sequences: Application to Polyphonic Music Generation and Transcription.” In Proceedings of the 29th International Conference on Machine Learning.
  • Sutskever, Ilya, Geoffrey Hinton, and Graham Taylor. 2008. “The Recurrent Temporal Restricted Boltzmann Machine.” In Advances in Neural Information Processing Systems.

In the context of predicting the next frame/sample from the previous t frames/samples, I want to take advantage of the fact that the next frame and the previous frames are all frames: I’d like to learn a representation of ‘frame’ that is shared across time points (à la RNN) while maintaining the nice generative properties of probabilistic graphical models.

To get my feet wet, I went back to the pylearn2 quickstart tutorial, which trains a GaussianBinaryRBM on the CIFAR10 dataset, and modified it to run on the TIMIT dataset (using the TIMIT class written by Vincent) with 200 input samples and 1000 hidden units. The rest of the parameters were the same as in the tutorial (yaml file here). The model is still training, so figures are forthcoming.
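For concreteness, here is a minimal sketch of the modified model stanza, loaded through pylearn2’s yaml parser. This is not the exact config I ran (that yaml is linked above): the hyperparameters other than nvis and nhid are carried over from the quickstart tutorial’s yaml, and anything I don’t list is left at its default.

    # A sketch, not the exact config: hyperparameters besides
    # nvis/nhid follow the quickstart tutorial's values.
    from pylearn2.config import yaml_parse

    model = yaml_parse.load("""
    !obj:pylearn2.models.rbm.GaussianBinaryRBM {
        nvis: 200,   # one visible unit per raw audio sample in the frame
        nhid: 1000,  # hidden units
        irange: 0.05,
        energy_function_class:
            !obj:pylearn2.energy_functions.rbm_energy.grbm_type_1 {},
        learn_sigma: True,
        init_sigma: 0.4,
    }
    """)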

David WF recommended that I use the newer, more general, and more featureful DBM class, in which an RBM is simply a specific parameterization with only one hidden layer. The DBM tutorial ran successfully. However, as soon as I tried to modify the yaml to work with TIMIT, I discovered that the dbm GaussianVisLayer class was broken. Ian, David, and Vincent helped me identify the source of the problem and its solution. In short, rather than having a separate class for a convolutional Gaussian visible layer, GaussianVisLayer is convolutional when convolution parameters are provided and non-convolutional otherwise. An additional feature had been added to the class, apparently for performance reasons, allowing the axes of the convolution (batch, channels, rows, columns) to be given in arbitrary order. This broke several functions in GaussianVisLayer that need to know the order of those axes. I have since made the necessary changes to those functions and am currently running tests with GaussianVisLayer on TIMIT, both with and without convolution. The current state of the code is here. After more tests I’ll submit a pull request to have it incorporated.

However, expected_energy_term is still only implemented for the hard sampling case (when average=True). This can be improved to reduce the variance of the negative phase by integrating out the even-numbered layers. In the binary case this is easy because the terms are linear, but in the Gaussian case they are quadratic. I would be happy to work on this with someone with a stronger math background than me; let me know if you are interested in discussing it.
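To make the axes issue concrete, here is a sketch (mine, not the actual pylearn2 fix) of the kind of permutation those functions now have to perform. pylearn2’s conventional topological ordering is ('b', 0, 1, 'c'), i.e. batch, rows, columns, channels:

    import numpy as np

    # pylearn2's default topological ordering: batch, rows, columns, channels.
    default_axes = ('b', 0, 1, 'c')

    def to_layer_axes(batch, axes):
        """Permute a batch from the default ordering to the arbitrary
        `axes` ordering the layer was constructed with. Every function in
        GaussianVisLayer that touches topological state needs an analogous
        permutation; this is illustrative, not the actual implementation."""
        perm = [default_axes.index(ax) for ax in axes]
        return batch.transpose(perm)

    # e.g. a layer constructed with axes=('c', 0, 1, 'b') for speed:
    x = np.zeros((128, 32, 32, 3))          # ('b', 0, 1, 'c')
    y = to_layer_axes(x, ('c', 0, 1, 'b'))  # shape (3, 32, 32, 128)

And to spell out why the Gaussian case is harder than the binary one: the binary energy terms are linear in each unit, so replacing samples with their expectations is exact, whereas the Gaussian reconstruction term is quadratic in v and picks up a variance contribution. Writing mu_i and 1/beta_i for the mean and variance of the distribution being integrated over (notation mine):

    % Linear terms: E[v_i] = \mu_i, so plugging in means is exact.
    % The quadratic term also needs the variance:
    \mathbb{E}\!\left[(v_i - b_i)^2\right]
        = (\mu_i - b_i)^2 + \mathrm{Var}[v_i]
        = (\mu_i - b_i)^2 + \tfrac{1}{\beta_i}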

Thinking more toward the future, I would eventually like to work with time-frequency input representations. To still take advantage of the work that has already gone into Vincent’s TIMIT class, I am thinking of writing a preprocessor that performs a short-time Fourier transform (or other useful transforms) within the pylearn2 framework. I’ll probably use scipy.fftpack unless someone has a better idea; a rough sketch of what I have in mind follows.
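A minimal sketch of such a preprocessor, assuming the dataset stores one fixed-length window of raw samples per row. The class name, window choice, and hop size are all mine (nothing like this exists in pylearn2 yet); only the Preprocessor base class and the design-matrix accessors are existing pylearn2 API:

    import numpy as np
    from scipy import fftpack
    from pylearn2.datasets.preprocessing import Preprocessor

    class STFT(Preprocessor):
        """Hypothetical preprocessor: replace each row of raw samples
        with the log magnitude of its windowed short-time Fourier
        transform."""

        def __init__(self, window_size=256, hop=128):
            self.window_size = window_size
            self.hop = hop
            self.window = np.hanning(window_size)

        def apply(self, dataset, can_fit=False):
            X = dataset.get_design_matrix()  # (n_examples, n_samples)
            n_frames = 1 + (X.shape[1] - self.window_size) // self.hop
            out = []
            for row in X:
                spec = []
                for k in range(n_frames):
                    seg = row[k * self.hop : k * self.hop + self.window_size]
                    # Keep only the non-redundant half of the spectrum.
                    mag = np.abs(fftpack.fft(seg * self.window))
                    mag = mag[: self.window_size // 2 + 1]
                    spec.append(np.log(mag + 1e-8))
                out.append(np.concatenate(spec))
            dataset.set_design_matrix(np.asarray(out))

Swapping fftpack.fft for another transform (e.g. a DCT via fftpack.dct) would only change the inner loop.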

After finishing these experiments with RBMs on their own, I plan to look into how to incorporate these probabilistic models within a recurrent neural network.