Incorporating phone info

Vincent published a blog post on how to use the Pylearn2 TIMIT class with multiple sources, specifically combining acoustic samples and phone information. I tried using the example yaml file and model subclasses but got the following error. I made sure that pylearn2, theano, and Vincent’s TIMIT class were all up to date.

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-3-99a022070172> in <module>()
      1 train = yaml_parse.load(train)
----> 2 train.main_loop()

/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pylearn2/pylearn2/train.pyc in main_loop(self, time_budget)
    153                     break
    154         else:
--> 155             self.algorithm.setup(model=self.model, dataset=self.dataset)
    156             self.setup_extensions()
    157             # Model.censor_updates is used by the training algorithm to

/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pylearn2/pylearn2/training_algorithms/sgd.pyc in setup(self, model, dataset)
    228         if getattr(model, "force_batch_size", False) and\
    229            any(dataset.get_design_matrix().shape[0] % self.batch_size != 0 for
--> 230                dataset in self.monitoring_dataset.values()) and \
    231            not has_uniform_batch_size(self.monitor_iteration_mode):
    232 

/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pylearn2/pylearn2/training_algorithms/sgd.pyc in <genexpr>((dataset,))
    228         if getattr(model, "force_batch_size", False) and\
    229            any(dataset.get_design_matrix().shape[0] % self.batch_size != 0 for
--> 230                dataset in self.monitoring_dataset.values()) and \
    231            not has_uniform_batch_size(self.monitor_iteration_mode):
    232 

AttributeError: 'TIMIT' object has no attribute 'get_design_matrix'

As a hack to get past this, I just commented out the offending line in sgd.py… Obviously this isn’t the best solution, but it allowed me to at least run the example from Vincent’s post.
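
A slightly less invasive workaround would be to make the check skip datasets that don’t expose a design matrix, rather than disabling it entirely. A sketch against the sgd.py snippet in the traceback above (untested; the error message is paraphrased):

# In pylearn2/training_algorithms/sgd.py, SGD.setup(): only apply the
# batch-size divisibility check to datasets that have a design matrix.
if getattr(model, "force_batch_size", False) and \
   any(dataset.get_design_matrix().shape[0] % self.batch_size != 0
       for dataset in self.monitoring_dataset.values()
       if hasattr(dataset, "get_design_matrix")) and \
   not has_uniform_batch_size(self.monitor_iteration_mode):
    raise ValueError("Batch size does not evenly divide a monitoring "
                     "dataset.")  # message paraphrased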

I then moved on to trying to adapt his example to work with the DBM class. I made a compositeDBM.yaml file and tried to write the necessary DBM subclasses to allow for specifying input sources, but I ran into some problems. The CompositeLayer class in the DBM module originally extended the HiddenLayer class, whereas in this case I want to use it as a VisibleLayer. I’m not sure whether it makes more sense to have a second class for a visible CompositeLayer, or to have one CompositeLayer class that extends the Layer class and can be used as either a visible or hidden layer. I have begun to implement the former solution but have not yet completed it.
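
For the record, here is roughly the skeleton I have in mind. The class name and the choice of methods to override are my own guesses at this stage, not working code:

from pylearn2.models.dbm import VisibleLayer
from pylearn2.space import CompositeSpace

class CompositeVisibleLayer(VisibleLayer):
    """Hypothetical visible-layer analogue of the hidden CompositeLayer:
    a visible layer that groups several input sources (e.g. acoustic
    samples and phone information)."""

    def __init__(self, components):
        super(CompositeVisibleLayer, self).__init__()
        self.components = components   # one sub-layer per input source

    def get_input_space(self):
        # Expose one space per component so the yaml can route each
        # source (acoustic frames, phones) to its own sub-layer.
        return CompositeSpace([c.get_input_space() for c in self.components])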

Representation learning for studying the brain

I recently presented the following poster at a conference in Leipzig about the potential of using representation learning to study biological neural information processing. The main idea is to use state-of-the-art machine learning methods to generate hypotheses about neural information processing. These hypotheses can then be tested with neuroimaging experiments.

JThompson_LearningAuditoryRepresentations

Sampling from trained RBM

I adapted some code originally written by Ian Goodfellow to generate samples from the RBMs that I trained on acoustic samples. The code can be found in SampleDBM.py (or the corresponding IPython notebook). Unfortunately, the single-layer RBMs that I trained failed to produce anything resembling speech. Deeper models, a time-frequency input representation, and the inclusion of phoneme information will likely improve results.
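
For anyone who wants to reproduce the idea without pylearn2, block Gibbs sampling from a Gaussian-binary RBM is only a few lines. This is a plain numpy sketch (unit-variance visibles assumed), not the actual SampleDBM.py code:

import numpy as np

def gibbs_sample(W, bv, bh, n_steps=1000, rng=np.random):
    """Alternate sampling of h | v and v | h for a Gaussian-binary RBM
    with weights W (n_vis x n_hid), visible biases bv, hidden biases bh."""
    v = rng.randn(W.shape[0])                          # random visible init
    for _ in range(n_steps):
        p_h = 1.0 / (1.0 + np.exp(-(v.dot(W) + bh)))   # P(h_j = 1 | v)
        h = (rng.rand(p_h.shape[0]) < p_h).astype(float)
        v = W.dot(h) + bv + rng.randn(bv.shape[0])     # v ~ N(Wh + bv, I)
    return v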

Randomized phases preserve speech content and identity

I decided to make a quick demonstration of the effect of phase information on speech reconstruction. I took the short-time Fourier transform of one of the TIMIT examples (/TRAIN/DR1/FDAW0/SI1406.wav), extracted and randomized its phase values, and then inverted it back into audio using the randomized phases. You can listen to it here.
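
The procedure is simple enough to sketch. This version uses scipy.signal’s stft/istft, which is not the tooling I actually used, and the framing parameters are placeholders:

import numpy as np
from scipy.io import wavfile
from scipy.signal import stft, istft

rate, x = wavfile.read('SI1406.wav')             # the TIMIT utterance above
_, _, Z = stft(x, fs=rate, nperseg=256)          # complex STFT
phase = np.random.uniform(-np.pi, np.pi, Z.shape)
Z_rand = np.abs(Z) * np.exp(1j * phase)          # keep magnitudes, scramble phase
_, x_rand = istft(Z_rand, fs=rate, nperseg=256)  # invert with random phases
wavfile.write('SI1406_randphase.wav', rate, x_rand.astype(np.int16))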

The result is clearly distorted, but the most important information, namely the content of the speech and the identity of the speaker, is preserved. I think this justifies the use of a magnitude-only time-frequency representation, at least to start with.

Relatedly, I’ve been thinking about the invertibility of the wavelet transform and how that might impact its usefulness as a representation for speech synthesis. I think an interesting experiment would be to do something similar to what João suggested last class: First, set up a CNN to perform a wavelet transform. Then, use the corresponding deconvolutional network (by transposing the kernels) as the inverse transform. We could then train the network to reconstruct the input. If the network were able to make accurate reconstructions, this would show that it is indeed possible to learn the inverse wavelet transform using convolutional neural networks (which would make it a very appropriate input representation for our speech synthesis task).
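
To make the setup concrete, here is a toy 1-D numpy version of the tied analysis/synthesis pair. The kernels would be trained to minimize reconstruction error; with untrained kernels the reconstruction is of course imperfect, which is exactly what the experiment would test:

import numpy as np

def analysis(x, kernels):
    # Forward "CNN" pass: one output channel per kernel.
    return [np.convolve(x, w, mode='full') for w in kernels]

def synthesis(coeffs, kernels, n):
    # Transposed convolution with the *same* kernels: the adjoint of
    # full-mode convolution is valid-mode correlation.
    x_hat = np.zeros(n)
    for c, w in zip(coeffs, kernels):
        x_hat += np.correlate(c, w, mode='valid')
    return x_hat

# "Training" would adjust `kernels` to minimize
# np.sum((x - synthesis(analysis(x, kernels), kernels, len(x))) ** 2).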

 

Noisy samples from Gaussian RBMs

I’ve continued my tests with the DBM class. I’ve trained a few different models but have been unable to sample or reconstruct anything other than noise. This exercise has helped me familiarize myself with pylearn2, but I don’t expect to get very far without incorporating additional information. So I will abandon this line of inquiry and focus instead on using time-frequency input representations and incorporating phone information (thanks to the recent updates to Vincent’s TIMIT class).

Gaussian Visible Units within the pylearn2 DBM package

In planning my initial experiments, I tried to identify tasks that could be useful to me or to others as we continue with this speech synthesis project. So to start, here are a few words about my long-term thinking on the project:

I’m drawn to the convenient generative nature of probabilistic models like the RBM. However, recent results have demonstrated that unsupervised pretraining with RBMs is usually only useful when the number of labeled examples is small, and especially if there is unlabeled data to leverage. We’ve also seen that RNNs and their more sophisticated siblings are especially appropriate for modeling time series (like audio). I have begun to look into some existing work on combining these two approaches, e.g.:

  • Boulanger-Lewandowski, Nicolas, Yoshua Bengio, and Pascal Vincent. 2012. “Modeling Temporal Dependencies in High-Dimensional Sequences: Application to Polyphonic Music Generation and Transcription.” In Proceedings of the 29th International Conference on Machine Learning.
  • Sutskever, Ilya, Geoffrey Hinton, and Graham Taylor. 2008. “The Recurrent Temporal Restricted Boltzmann Machine.” In Advances in Neural Information Processing Systems.

In the context of predicting the next frame/sample from the previous t frames/samples, I want to take advantage of the fact that the next frame and the previous frames are all frames, and learn a representation of ‘frame’ that is shared across time points (à la RNN), while maintaining the nice generative properties of probabilistic graphical models.
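
One concrete recipe in the spirit of the work cited above, sketched in my own notation (K is the history length; the cited papers differ in the details): keep a single RBM over the current frame $v_t$ and hidden units $h_t$, and let the previous frames enter only through dynamic biases,

\[
b_h^{(t)} = b_h + \sum_{k=1}^{K} A_k\, v_{t-k},
\qquad
b_v^{(t)} = b_v + \sum_{k=1}^{K} B_k\, v_{t-k},
\]

so that the weights $W$, $A_k$, $B_k$ are shared across all time points (the shared ‘frame’ representation), while the conditional model over $(v_t, h_t)$ remains an ordinary, tractable RBM.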

To get my feet wet, I went back to the pylearn2 quickstart tutorial, which trains a GaussianBinaryRBM on the CIFAR10 dataset, and modified it to run on the TIMIT dataset (using the TIMIT class written by Vincent) with 200 input samples and 1000 hidden units. The rest of the parameters were the same as in the tutorial (yaml file here). The model is still training, so figures are forthcoming.
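
For reference, I run all of these configs the same way as in the traceback above; the filename below is just a placeholder for my modified yaml:

from pylearn2.config import yaml_parse

# Load the Train object described by the yaml and start training.
with open('grbm_timit.yaml') as f:   # placeholder filename
    train = yaml_parse.load(f.read())
train.main_loop()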

David WF recommended that I use the newer, more general and featureful DBM class, in which an RBM is a specific parameterization with only one hidden layer. The DBM tutorial ran successfully. However, as soon as I tried to modify the yaml to work with TIMIT, I realized that the dbm GaussianVisLayer class was broken. Ian, David, and Vincent helped me identify the source of the problem and its solution.

In short, rather than having a separate class for a convolutional Gaussian visible layer, the GaussianVisLayer class is convolutional if convolution parameters are provided and not otherwise. An additional feature was added to this class, apparently for performance reasons, allowing the axes of the convolution (batch, channels, rows, columns) to be given in any arbitrary order. This broke several functions in GaussianVisLayer which need to know the order of those axes. I have since made the necessary changes to those functions and am currently running tests with GaussianVisLayer on TIMIT, with and without convolution. The current state of the code is here. After more tests I’ll try to submit a pull request to have it incorporated.

However, expected_energy_term is still only implemented for the hard sampling case (when average=True). This could be improved to reduce the variance of the negative phase by integrating out the even-numbered layers. In the binary case this is easy because the terms are linear, but they are quadratic in the Gaussian case. I would be happy to work on this with someone with a stronger math background than mine. Let me know if you are interested in discussing this.
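
To illustrate the axis problem: pylearn2’s convolutional spaces describe their memory layout with an axes tuple such as ('b', 'c', 0, 1) for (batch, channels, rows, columns), and the affected functions essentially have to normalize whatever order they are handed. A minimal numpy sketch of that bookkeeping (illustrative only, not the actual GaussianVisLayer code):

import numpy as np

def to_standard_axes(x, axes, standard=('b', 'c', 0, 1)):
    """Reorder a 4-D tensor from the layout given by `axes`
    to the standard (batch, channels, rows, columns) layout."""
    return x.transpose([axes.index(a) for a in standard])

# e.g. data stored as ('c', 0, 1, 'b'):
x = np.zeros((3, 28, 28, 64))
print(to_standard_axes(x, ('c', 0, 1, 'b')).shape)   # (64, 3, 28, 28)

And a note on why the Gaussian expected_energy_term is harder than the binary one: under a factorial approximation q with mean \mu_i and variance s_i for a Gaussian visible unit, the quadratic term in the energy picks up a variance correction,

\[
\mathbb{E}_q\!\left[\frac{(v_i - b_i)^2}{2\sigma_i^2}\right]
= \frac{(\mu_i - b_i)^2 + s_i}{2\sigma_i^2},
\]

whereas the binary terms are linear in the units, so taking expectations just replaces each unit with its mean activation.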

Thinking more toward the future, I would eventually like to work with time-frequency input representations. In order to still take advantage of the work that has already been put into Vincent’s TIMIT class, I was thinking that I could write a preprocessor to perform a short-time Fourier transform (or other useful transforms) within the pylearn2 framework. I’ll probably use scipy.fftpack unless someone has a better idea. 
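
A rough sketch of what I have in mind, assuming a DenseDesignMatrix-style dataset whose rows are raw waveforms (the frame/hop sizes and the magnitude-only output are placeholder choices):

import numpy as np
from scipy.fftpack import fft
from pylearn2.datasets.preprocessing import Preprocessor

class STFTPreprocessor(Preprocessor):
    """Hypothetical preprocessor: replace each waveform row of the
    design matrix with the magnitudes of its short-time Fourier
    transform."""

    def __init__(self, frame_length=256, hop=128):
        self.frame_length = frame_length
        self.hop = hop

    def apply(self, dataset, can_fit=False):
        X = dataset.get_design_matrix()
        window = np.hanning(self.frame_length)
        n_frames = (X.shape[1] - self.frame_length) // self.hop + 1
        out = []
        for x in X:
            frames = np.array([x[i * self.hop:i * self.hop + self.frame_length]
                               for i in range(n_frames)])
            spec = np.abs(fft(frames * window, axis=1))
            # Keep the non-redundant half of the spectrum, flattened.
            out.append(spec[:, :self.frame_length // 2 + 1].ravel())
        dataset.set_design_matrix(np.array(out))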

After finalizing these experiments with RBMs on their own, I plan to look into how to incorporate these probabilistic models within a recurrent neural network.