I decided to make a quick demonstration of the effect of phase information on speech reconstruction. I took the short-time Fourier transform of one of the TIMIT examples (/TRAIN/DR1/FDAW0/SI1406.wav), extracted and randomized its phase values, and then inverted it back into audio using the randomized phases. You can listen to it here.
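For reference, a minimal sketch of the phase-randomization procedure using `scipy.signal.stft`/`istft`. A synthetic tone stands in for the TIMIT utterance here; in the actual experiment the input was the audio loaded from the file above, and parameters like `nperseg` are illustrative assumptions:

```python
import numpy as np
from scipy.signal import stft, istft

# Synthetic stand-in for the TIMIT utterance; any 1-D signal works.
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 220 * t) * np.hanning(fs)

# Short-time Fourier transform.
f, frames, Z = stft(x, fs=fs, nperseg=512)

# Keep the magnitudes, replace every phase with a uniform random value.
rng = np.random.default_rng(0)
random_phase = rng.uniform(-np.pi, np.pi, size=Z.shape)
Z_rand = np.abs(Z) * np.exp(1j * random_phase)

# Invert back to audio using the randomized phases.
_, x_rand = istft(Z_rand, fs=fs, nperseg=512)
```

Since only the phases are touched, the magnitude spectrogram of `Z_rand` is identical to that of `Z`; all the distortion in `x_rand` comes from the scrambled phase.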
The result is clearly distorted, but the most important information, namely the content of the speech and the identity of the speaker, is preserved. I think this justifies the use of a magnitude-only time-frequency representation, at least to start with.
Relatedly, I’ve been thinking about the invertibility of the wavelet transform and how that might affect its usefulness as a representation for speech synthesis. I think an interesting experiment would be to do something similar to what João suggested last class: first, set up a CNN to perform a wavelet transform. Then, use the corresponding deconvolutional network (obtained by transposing the kernels) as the inverse transform. We could then train the network to reconstruct the input. If the network were able to make accurate reconstructions, this would show that it is indeed possible to learn the inverse wavelet transform using convolutional neural networks (which would make it a very appropriate input representation for our speech synthesis task).
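To make the transposed-kernel idea concrete, here is a toy one-level example with Haar kernels (my choice for illustration, not a proposal for the actual network): the analysis step is a stride-2 convolution, and the synthesis step applies the same kernels as a transposed convolution. Because the Haar kernels are orthogonal, the transposed operator is exactly the inverse and reconstruction is perfect, which is the property the trained network would have to approximate:

```python
import numpy as np

# One-level Haar wavelet transform as a stride-2 convolution, and its
# inverse as the corresponding transposed convolution (same kernels,
# applied in scatter-and-sum fashion).
h = np.array([1.0, 1.0]) / np.sqrt(2)   # lowpass analysis kernel
g = np.array([1.0, -1.0]) / np.sqrt(2)  # highpass analysis kernel

def analysis(x):
    # Stride-2 convolution: each coefficient is the dot product of a
    # kernel with a non-overlapping length-2 window of the input.
    pairs = x.reshape(-1, 2)
    return pairs @ h, pairs @ g  # approximation, detail

def synthesis(a, d):
    # Transposed convolution: scatter each coefficient back through the
    # same kernels and sum the contributions (upsample-and-filter).
    x = np.zeros(2 * len(a))
    x[0::2] = a * h[0] + d * g[0]
    x[1::2] = a * h[1] + d * g[1]
    return x

x = np.random.default_rng(0).standard_normal(16)
a, d = analysis(x)
x_hat = synthesis(a, d)
# With orthogonal kernels, x_hat equals x exactly.
```

In the proposed experiment the kernels would be learned (or initialized from a wavelet family) rather than fixed, and the reconstruction error would tell us how close the transposed network gets to this ideal.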