<p>vak.ai: Sharing ideas and progress on Speech and Language technology (https://vak.ai/feed.xml, generated by Jekyll on 2022-04-22)</p>
<p>TensorFlow 2.0 tf.data.Dataset.from_generator<br />Shree (shreekantha.nadig@iiitb.ac.in), 2019-03-13, https://vak.ai/tensorflow/TensorFlow2.0-dataset</p>
<script type="text/javascript" async="" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/MathJax.js?config=TeX-MML-AM_CHTML">
</script>
<h2 id="introduction">Introduction</h2>
<p>When I saw the TensorFlow Dev Summit 2019, the thing that I wanted to try out the most was the new <a href="https://www.tensorflow.org/api_docs/python/tf/data/Dataset"><code class="language-plaintext highlighter-rouge">tf.data.Dataset API</code></a>. We all know how painful it is to feed data to our models in an efficient way. This is especially true if you’re working with Speech.</p>
<p>For my work with <a href="https://arxiv.org/abs/1506.03134">Pointer-Networks</a>, I was using PyTorch’s DataLoader to feed data to my models. This always left something to be desired (a discussion for another day). I was on the lookout for a different (hopefully better) way to feed data to my models when I heard about tf.data.Dataset.</p>
<p>This post is my exploration of the API, and I will try to keep it updated as I go.
I am exploring the following APIs: <a href="https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/data/Dataset#from_generator">tf.data.Dataset.from_generator</a>, <a href="https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/data/Options">tf.data.Options</a>, <a href="https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/data/TFRecordDataset">tf.data.TFRecordDataset</a> and all the other experimental features!</p>
<h1 id="tfdatadatasetfrom_generator"><a href="https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/data/Dataset#from_generator">tf.data.Dataset.from_generator</a></h1>
<p>This is the function to use if your data pipeline does not fit any of the other ways of building a dataset.
Since I mostly work with speech, I need a way to load my data from disk batch-by-batch. I can’t fit all my data into memory because it’s just too big (typically a couple of hundred GiB).</p>
<p>One way to feed such a dataset to my models is to load the data batch-by-batch from disk instead of loading everything at once and iterating over it. This has always been one of the most difficult parts of my model-building experience in Speech. Having an efficient data pipeline makes my life easier :).</p>
<p><a href="https://github.com/espnet/espnet">ESPnet</a> did just that by using the <a href="http://kaldi-asr.org/doc/io.html">ark file splits generated by kaldi</a> to load the batches and feed them to my models. This is definitely not THE solution to the problem, but it got the job done.</p>
<p>I believe the tf.data.Dataset.from_generator is the way to go for my data pipeline.</p>
<p>Now, let’s say I need to solve the problem of finding <a href="https://en.wikipedia.org/wiki/Convex_hull">ConvexHull</a> points from a sequence of points. This is one of the problems the original Pointer-Networks paper tried to solve. Instead of using the dataset that the authors provided, I want to generate my own dataset (because why not? how difficult could it be to generate a set of points to solve this problem?). By generating my own dataset, I can practically have infinite training examples and full control over what I want to do with it.</p>
<p>For this reason alone, I can’t use the other methods as I will have to store the training examples in memory. I need to generate my examples on-the-go.</p>
<p><code class="language-plaintext highlighter-rouge">tf.data.Dataset.from_generator</code> solves this exact problem.</p>
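<p>To make the "generate examples on-the-go" idea concrete, here is a minimal sketch of such a generator in plain Python. The names <code class="language-plaintext highlighter-rouge">convex_hull</code> and <code class="language-plaintext highlighter-rouge">hull_example_generator</code> are made up for this illustration, and I am assuming Andrew's monotone chain algorithm for the hull; this is not the data format used in the Pointer-Networks paper.</p>

```python
import random


def cross(o, a, b):
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        # Drop points that would make a clockwise (or straight) turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each half is the first point of the other half.
    return lower[:-1] + upper[:-1]


def hull_example_generator(num_points=10, seed=None):
    """Endlessly yield (points, hull) training pairs, generated on the fly."""
    rng = random.Random(seed)
    while True:
        points = [(rng.random(), rng.random()) for _ in range(num_points)]
        yield points, convex_hull(points)


points, hull = next(hull_example_generator(num_points=5, seed=0))
print(len(points), len(hull))
```

<p>Because the generator never stops, nothing has to be stored in memory or on disk; each call to <code class="language-plaintext highlighter-rouge">next</code> produces a fresh example.</p>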
<h2 id="how-to-use-it">How to use it?</h2>
<p>Before we even start feeding data to our model, we need a Python <a href="https://www.programiz.com/python-programming/generator">generator</a> function that generates <strong>one</strong> training pair for our model.</p>
<p>This means the function has a <strong><code class="language-plaintext highlighter-rouge">yield</code></strong> statement instead of a <strong><code class="language-plaintext highlighter-rouge">return</code></strong> statement. That does not rule out <code class="language-plaintext highlighter-rouge">return</code> entirely: a generator function can contain multiple yields and returns, and a <code class="language-plaintext highlighter-rouge">return</code> simply ends the iteration.</p>
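<p>As a small illustration of mixing <code class="language-plaintext highlighter-rouge">yield</code> and <code class="language-plaintext highlighter-rouge">return</code>, here is a toy generator (the function name and the <code class="language-plaintext highlighter-rouge">limit</code> parameter are invented for this example) that stops early once a condition is met.</p>

```python
def first_n_squares(n, limit):
    """Yield the squares of 0..n-1, but stop early once a square exceeds `limit`."""
    for i in range(n):
        square = i * i
        if square > limit:
            # A bare `return` inside a generator ends the iteration;
            # the caller simply sees the generator as exhausted.
            return
        yield square


print(list(first_n_squares(10, limit=20)))  # [0, 1, 4, 9, 16]
```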
<p>Let’s say our dataset has <code class="language-plaintext highlighter-rouge">1000</code> images of size <code class="language-plaintext highlighter-rouge">28x28</code>, each belonging to one of <code class="language-plaintext highlighter-rouge">10</code> classes. Our generator function might look something like this, except that we would be reading the images from disk.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>

<span class="k">def</span> <span class="nf">our_generator</span><span class="p">():</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1000</span><span class="p">):</span>
        <span class="c1"># Random stand-in for a 28x28 image read from disk</span>
        <span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">rand</span><span class="p">(</span><span class="mi">28</span><span class="p">,</span><span class="mi">28</span><span class="p">)</span>
        <span class="c1"># Class label in [0, 10); the upper bound of randint is exclusive</span>
        <span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">10</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
        <span class="k">yield</span> <span class="n">x</span><span class="p">,</span><span class="n">y</span>
</code></pre></div></div>
<p>We could build our <code class="language-plaintext highlighter-rouge">TensorFlow</code> dataset with this generator function.
The <code class="language-plaintext highlighter-rouge">tf.data.Dataset.from_generator</code> function has the following arguments:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">from_generator</span><span class="p">(</span>
    <span class="n">generator</span><span class="p">,</span>
    <span class="n">output_types</span><span class="p">,</span>
    <span class="n">output_shapes</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span>
    <span class="n">args</span><span class="o">=</span><span class="bp">None</span>
<span class="p">)</span>
</code></pre></div></div>
<p>While <strong><code class="language-plaintext highlighter-rouge">output_shapes</code></strong> is optional, we need to specify <strong><code class="language-plaintext highlighter-rouge">output_types</code></strong>. In this particular case the first yielded value is a 2D array of floats and the second is a 1D array of integers. Our dataset object will look something like this:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dataset</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">data</span><span class="p">.</span><span class="n">Dataset</span><span class="p">.</span><span class="n">from_generator</span><span class="p">(</span><span class="n">our_generator</span><span class="p">,</span> <span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">float32</span><span class="p">,</span> <span class="n">tf</span><span class="p">.</span><span class="n">int16</span><span class="p">))</span>
</code></pre></div></div>
<p>To use this dataset in our model training, we can either use <a href="https://www.tensorflow.org/api_docs/python/tf/data/make_one_shot_iterator"><strong><code class="language-plaintext highlighter-rouge">make_one_shot_iterator</code></strong></a> (which is being deprecated) or iterate over the dataset directly in our training loop.</p>
<h3 id="1-using-make_one_shot_iterator">1. Using make_one_shot_iterator</h3>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dataset</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">data</span><span class="p">.</span><span class="n">Dataset</span><span class="p">.</span><span class="n">from_generator</span><span class="p">(</span><span class="n">our_generator</span><span class="p">,</span> <span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">float32</span><span class="p">,</span> <span class="n">tf</span><span class="p">.</span><span class="n">int16</span><span class="p">))</span>
<span class="n">iterator</span> <span class="o">=</span> <span class="n">dataset</span><span class="p">.</span><span class="n">make_one_shot_iterator</span><span class="p">()</span>
<span class="n">x</span><span class="p">,</span><span class="n">y</span> <span class="o">=</span> <span class="n">iterator</span><span class="p">.</span><span class="n">get_next</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span><span class="n">x</span><span class="p">.</span><span class="n">shape</span><span class="p">,</span> <span class="n">y</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
<span class="c1">#(28, 28) (1,)
</span></code></pre></div></div>
<h3 id="2-loop-over-the-dataset-object-in-our-training-loop">2. Loop over the dataset object in our training loop</h3>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">batch</span><span class="p">,</span> <span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="n">y</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">dataset</span><span class="p">):</span>
    <span class="k">pass</span>
<span class="k">print</span><span class="p">(</span><span class="s">"batch: "</span><span class="p">,</span> <span class="n">batch</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Data shape: "</span><span class="p">,</span> <span class="n">x</span><span class="p">.</span><span class="n">shape</span><span class="p">,</span> <span class="n">y</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
<span class="c1">#batch: 999
#Data shape: (28, 28) (1,)
</span></code></pre></div></div>
<h2 id="tfdatadataset-options---batch-repeat-shuffle">tf.data.Dataset options - batch, repeat, shuffle</h2>
<p>tf.data.Dataset comes with a few methods that make our lives easier. In our previous example, we get one example every time we draw from the dataset object. What if we want a batch of examples, want to iterate over the dataset many times, or want to shuffle the dataset after every epoch?</p>
<p>The batch, repeat, and shuffle functions let us do exactly this.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">our_generator</span><span class="p">():</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1000</span><span class="p">):</span>
        <span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">rand</span><span class="p">(</span><span class="mi">28</span><span class="p">,</span><span class="mi">28</span><span class="p">)</span>
        <span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">10</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
        <span class="k">yield</span> <span class="n">x</span><span class="p">,</span><span class="n">y</span>
<span class="n">dataset</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">data</span><span class="p">.</span><span class="n">Dataset</span><span class="p">.</span><span class="n">from_generator</span><span class="p">(</span><span class="n">our_generator</span><span class="p">,</span> <span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">float32</span><span class="p">,</span> <span class="n">tf</span><span class="p">.</span><span class="n">int16</span><span class="p">))</span>
</code></pre></div></div>
<h3 id="batch">batch</h3>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dataset</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">data</span><span class="p">.</span><span class="n">Dataset</span><span class="p">.</span><span class="n">from_generator</span><span class="p">(</span><span class="n">our_generator</span><span class="p">,</span> <span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">float32</span><span class="p">,</span> <span class="n">tf</span><span class="p">.</span><span class="n">int16</span><span class="p">))</span>
<span class="n">dataset</span> <span class="o">=</span> <span class="n">dataset</span><span class="p">.</span><span class="n">batch</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
<span class="n">iterator</span> <span class="o">=</span> <span class="n">dataset</span><span class="p">.</span><span class="n">make_one_shot_iterator</span><span class="p">()</span>
<span class="n">x</span><span class="p">,</span><span class="n">y</span> <span class="o">=</span> <span class="n">iterator</span><span class="p">.</span><span class="n">get_next</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span><span class="n">x</span><span class="p">.</span><span class="n">shape</span><span class="p">,</span> <span class="n">y</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
<span class="c1">#(10, 28, 28) (10, 1)
</span></code></pre></div></div>
<p>Now, every time we draw from the dataset object, 10 elements are pulled from the generator. The batch function combines consecutive elements of the dataset into batches.
If the number of elements is not a multiple of batch_size, the last batch is smaller than the rest; we can pass the argument <strong><code class="language-plaintext highlighter-rouge">drop_remainder=True</code></strong> to drop that final short batch.</p>
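<p>The effect of <code class="language-plaintext highlighter-rouge">drop_remainder</code> is easy to see in a plain-Python imitation of batching. The helper below is a made-up stand-in that only mimics the behaviour described above; it is not how <code class="language-plaintext highlighter-rouge">tf.data</code> is implemented. I use 1005 elements so that the last batch is short.</p>

```python
def batched(iterable, batch_size, drop_remainder=False):
    """Group consecutive elements into lists of `batch_size`."""
    batch = []
    for element in iterable:
        batch.append(element)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # Whatever is left over forms a final, shorter batch unless it is dropped.
    if batch and not drop_remainder:
        yield batch


sizes = [len(b) for b in batched(range(1005), 10)]
print(len(sizes), sizes[-1])  # 101 5

sizes = [len(b) for b in batched(range(1005), 10, drop_remainder=True)]
print(len(sizes), sizes[-1])  # 100 10
```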
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">batch</span><span class="p">,</span> <span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="n">y</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">dataset</span><span class="p">):</span>
    <span class="k">pass</span>
<span class="k">print</span><span class="p">(</span><span class="s">"batch: "</span><span class="p">,</span> <span class="n">batch</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Data shape: "</span><span class="p">,</span> <span class="n">x</span><span class="p">.</span><span class="n">shape</span><span class="p">,</span> <span class="n">y</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
<span class="c1">#batch: 99
#Data shape: (10, 28, 28) (10, 1)
</span></code></pre></div></div>
<h3 id="repeat">repeat</h3>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dataset</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">data</span><span class="p">.</span><span class="n">Dataset</span><span class="p">.</span><span class="n">from_generator</span><span class="p">(</span><span class="n">our_generator</span><span class="p">,</span> <span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">float32</span><span class="p">,</span> <span class="n">tf</span><span class="p">.</span><span class="n">int16</span><span class="p">))</span>
<span class="n">dataset</span> <span class="o">=</span> <span class="n">dataset</span><span class="p">.</span><span class="n">batch</span><span class="p">(</span><span class="n">batch_size</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
<span class="n">dataset</span> <span class="o">=</span> <span class="n">dataset</span><span class="p">.</span><span class="n">repeat</span><span class="p">(</span><span class="n">count</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
<span class="k">for</span> <span class="n">batch</span><span class="p">,</span> <span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="n">y</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">dataset</span><span class="p">):</span>
    <span class="k">pass</span>
<span class="k">print</span><span class="p">(</span><span class="s">"batch: "</span><span class="p">,</span> <span class="n">batch</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Data shape: "</span><span class="p">,</span> <span class="n">x</span><span class="p">.</span><span class="n">shape</span><span class="p">,</span> <span class="n">y</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
<span class="c1">#batch: 199
#Data shape: (10, 28, 28) (10, 1)
</span></code></pre></div></div>
<p>Here, the dataset is looped over 2 times, so we get twice the number of batches for training. If we want to repeat the dataset indefinitely, we can set <strong>count=-1</strong> (or leave <strong>count</strong> at its default of <code class="language-plaintext highlighter-rouge">None</code>).</p>
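<p>The order of batch and repeat matters when the dataset size is not a multiple of the batch size. The sketch below imitates both orders in plain Python; <code class="language-plaintext highlighter-rouge">batched</code>, <code class="language-plaintext highlighter-rouge">repeated</code> and <code class="language-plaintext highlighter-rouge">source</code> are hypothetical helpers written for this illustration, and the 1005-element source is chosen on purpose so that a remainder exists.</p>

```python
from itertools import islice


def batched(iterable, batch_size):
    """Group consecutive elements into lists of at most `batch_size`."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def repeated(make_iterable, count):
    """Iterate `count` times over a fresh iterable from `make_iterable()`."""
    for _ in range(count):
        yield from make_iterable()


def source():
    return range(1005)  # deliberately not a multiple of the batch size


# batch -> repeat: every epoch ends with its own short batch.
batch_then_repeat = list(repeated(lambda: batched(source(), 10), 2))
# repeat -> batch: one batch straddles the epoch boundary.
repeat_then_batch = list(batched(repeated(source, 2), 10))

print(len(batch_then_repeat))  # 202
print(len(repeat_then_batch))  # 201
```

<p>With batch before repeat (as in the snippet above), epoch boundaries stay visible as short batches; with repeat before batch, examples from two epochs can end up in the same batch.</p>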
<h3 id="shuffle">shuffle</h3>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dataset</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">data</span><span class="p">.</span><span class="n">Dataset</span><span class="p">.</span><span class="n">from_generator</span><span class="p">(</span><span class="n">our_generator</span><span class="p">,</span> <span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">float32</span><span class="p">,</span> <span class="n">tf</span><span class="p">.</span><span class="n">int16</span><span class="p">))</span>
<span class="n">dataset</span> <span class="o">=</span> <span class="n">dataset</span><span class="p">.</span><span class="n">batch</span><span class="p">(</span><span class="n">batch_size</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
<span class="n">dataset</span> <span class="o">=</span> <span class="n">dataset</span><span class="p">.</span><span class="n">repeat</span><span class="p">(</span><span class="n">count</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
<span class="n">dataset</span> <span class="o">=</span> <span class="n">dataset</span><span class="p">.</span><span class="n">shuffle</span><span class="p">(</span><span class="n">buffer_size</span><span class="o">=</span><span class="mi">1000</span><span class="p">)</span>
<span class="k">for</span> <span class="n">batch</span><span class="p">,</span> <span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="n">y</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">dataset</span><span class="p">):</span>
    <span class="k">pass</span>
<span class="k">print</span><span class="p">(</span><span class="s">"batch: "</span><span class="p">,</span> <span class="n">batch</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Data shape: "</span><span class="p">,</span> <span class="n">x</span><span class="p">.</span><span class="n">shape</span><span class="p">,</span> <span class="n">y</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
<span class="c1">#batch: 199
#Data shape: (10, 28, 28) (10, 1)
</span></code></pre></div></div>
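<p>Here, the argument <strong>buffer_size=1000</strong> specifies the number of elements from this dataset from which the new dataset will sample. Essentially, shuffle fills a buffer with <strong>buffer_size</strong> elements, then randomly picks elements from this buffer, replacing each picked element with the next one from the input. Note that because shuffle is applied after batch in this example, it is whole batches (not individual examples) that get shuffled.</p>
<p>Use <strong>buffer_size &gt;= dataset_size</strong> for perfect shuffling.</p>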
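<p>To see why a small buffer gives only a "local" shuffle, here is a plain-Python sketch of a buffer-based shuffle. It mimics the fill-then-sample behaviour described above and is not TensorFlow's actual implementation; <code class="language-plaintext highlighter-rouge">buffered_shuffle</code> and its <code class="language-plaintext highlighter-rouge">seed</code> argument are names I made up for the example.</p>

```python
import random


def buffered_shuffle(iterable, buffer_size, seed=None):
    """Shuffle a stream using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for element in iterable:
        if len(buffer) < buffer_size:
            # Still filling the buffer; nothing is emitted yet.
            buffer.append(element)
            continue
        # Emit a random buffered element and put the new one in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = element
    # Input exhausted: drain what is left in random order.
    rng.shuffle(buffer)
    yield from buffer


out = list(buffered_shuffle(range(1000), buffer_size=10, seed=0))
# The k-th output can only come from the first k + buffer_size inputs,
# so no element appears more than buffer_size - 1 positions early.
print(max(value - k for k, value in enumerate(out)) <= 9)  # True
```

<p>With <code class="language-plaintext highlighter-rouge">buffer_size</code> at least as large as the dataset, the whole dataset sits in the buffer before anything is emitted, which is why the shuffle becomes uniform in that case.</p>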
<h3 id="other-options">Other options</h3>
<p>In addition to batch, repeat, and shuffle, the <code class="language-plaintext highlighter-rouge">TensorFlow Dataset</code> API comes with many other functions. I will update this post with options like <code class="language-plaintext highlighter-rouge">map</code>, <code class="language-plaintext highlighter-rouge">reduce</code> and <code class="language-plaintext highlighter-rouge">with_options</code>.</p>
<h2 id="conclusion">Conclusion</h2>
<p><code class="language-plaintext highlighter-rouge">tf.data.Dataset</code> can potentially solve most of my data pipeline woes. Next, I will test how I can use it to feed speech data to my models (using a py_function to do feature extraction) and how to use the map function to augment the dataset (adding noise, combining files, time scaling etc).</p>
<p>You can use this notebook to play around with the functions that I have used.
<a href="https://colab.research.google.com/drive/1XxHNtgwFVZzILlOwEhvsYuov5s1MAy2N">https://colab.research.google.com/drive/1XxHNtgwFVZzILlOwEhvsYuov5s1MAy2N</a></p>
<p>Attention models in ESPnet toolkit for Speech Recognition<br />Shree (shreekantha.nadig@iiitb.ac.in), 2019-01-10, https://vak.ai/attention</p>
<script type="text/javascript" async="" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/MathJax.js?config=TeX-MML-AM_CHTML">
</script>
<p>TL;DR - Different attention mechanisms available in the ESPnet toolkit explained. Have a look at the presentation that I gave at the IIIT-B AI reading group (no math included): <a href="https://github.com/sknadig/attention_presentation/raw/master/Final.pdf">Attention based models in End-to-End ASR</a>.</p>
<p>I’ll directly jump to explaining the different Attention models available in the <a href="https://github.com/espnet/espnet">ESPnet</a> toolkit.
(I won’t be going into the implementation challenges in getting the Encoder-Decoder Attention models to work.)</p>
<p>Please have a look at the <a href="/basics-attention/">previous post</a> for the basics of Attention models in Speech recognition. This post assumes you know the Attention mechanism in general and builds from there.</p>
<ul>
<li>No Attention</li>
<li>Content-based Attention
<ul>
<li>Dot product Attention</li>
<li>Additive Attention</li>
</ul>
</li>
<li>Location-aware Attention
<ul>
<li>Location Aware Attention</li>
<li>2D Location Aware Attention</li>
<li>Location Aware Recurrent Attention</li>
</ul>
</li>
<li>Hybrid Attention
<ul>
<li>Coverage Mechanism Attention</li>
<li>Coverage Mechanism Location Aware Attention</li>
</ul>
</li>
<li>Multi-Head Attention
<ul>
<li>Multi-Head dot product Attention</li>
<li>Multi-Head additive Attention</li>
<li>Multi-Head Location Aware Attention</li>
<li>Multi-Head Multi-Resolution Location Aware Attention</li>
</ul>
</li>
</ul>
<h2 id="attention---recap">Attention - Recap</h2>
<ul>
<li>\(x = (x_{1}, x_{2}, \dots, x_{T})\) - is the input sequence</li>
<li>\(y = (y_{1}, y_{2}, \dots, y_{U})\) - is the target output sequence</li>
<li>\(h = (h_{1}, h_{2}, \dots, h_{T})\) - is the output of the Encoder</li>
<li>\(h_{t} = f(x_{t}, h_{t-1})\) - is the Encoder function</li>
<li>\(C_{i} = \sum_{j=1}^{T} \alpha_{i,j} \cdot h_{j}\) - is the Context vector</li>
<li>\(\alpha_{i,j} = Softmax(e_{i,j}) = \frac{e^{e_{i,j}}}{\sum_{k=1}^{T} e^{e_{i,k}}}\) - are the Attention weights</li>
<li>\(e_{i,j} = a(s_{i-1}, h_j)\) - is the importance parameter for every encoded input</li>
<li>\(\sum_{j=1}^{T} e_{i,j} \neq 1\) - the importance parameter need not sum to 1</li>
<li>\(\sum_{j=1}^{T} \alpha_{i,j} = 1\) - the attention weights sum to 1</li>
</ul>
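<p>The recap above can be sketched numerically in a few lines of plain Python. The scores and annotation vectors below are arbitrary toy values chosen for illustration, and the helper names are my own.</p>

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(weights, annotations):
    """Weighted sum of the annotation vectors h_j."""
    dim = len(annotations[0])
    return [sum(w * h[d] for w, h in zip(weights, annotations)) for d in range(dim)]


e_i = [2.0, 0.5, -1.0]                    # scores e_{i,j}; they need not sum to 1
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # annotation vectors h_j (T = 3)

alpha_i = softmax(e_i)                    # attention weights alpha_{i,j}
c_i = context_vector(alpha_i, h)          # context vector C_i

print(abs(sum(alpha_i) - 1.0) < 1e-9)     # True: the weights sum to 1
```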
<h2 id="types-of-attention">Types of Attention</h2>
<p>Broadly, attention mechanisms can be categorized into 3 distinct categories:</p>
<ul>
<li>Content-based Attention</li>
<li>Location aware Attention</li>
<li>Hybrid Attention</li>
</ul>
<p>Multi-Head Attention mechanisms are a different beast altogether; we will cross that bridge when we get there. For now, let’s concentrate on the 3 broad categories I mentioned.</p>
<h2 id="1-no-attention-equal-attention">1. No Attention (Equal Attention?)</h2>
<p>Here, no attention is used at all. Each of the \(h_{j}\) is given equal importance, and they are simply averaged to get \(C_{i}\):</p>
\[\alpha_{i,j} = \frac{1}{T}\]
\[C_{i} = \sum_{j=1}^{T} \frac{1}{T} h_{j}\]
<h3 id="no-attention---code"><a href="https://github.com/sknadig/espnet/blob/12d2b8181f6e7b1c9f81b002f6096840e928adbf/espnet/nets/pytorch_backend/attentions.py#L11">No attention - code</a></h3>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#Mask = Ones where enc_h is present. Zeros where padding is needed.
</span><span class="n">mask</span> <span class="o">=</span> <span class="mf">1.</span> <span class="o">-</span> <span class="n">make_pad_mask</span><span class="p">(</span><span class="n">enc_hs_len</span><span class="p">).</span><span class="nb">float</span><span class="p">()</span>
<span class="n">att_prev</span> <span class="o">=</span> <span class="n">mask</span> <span class="o">/</span> <span class="n">mask</span><span class="p">.</span><span class="n">new</span><span class="p">(</span><span class="n">enc_hs_len</span><span class="p">).</span><span class="n">unsqueeze</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
<span class="n">att_prev</span> <span class="o">=</span> <span class="n">att_prev</span><span class="p">.</span><span class="n">to</span><span class="p">(</span><span class="n">enc_h</span><span class="p">)</span>
<span class="n">c</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">enc_h</span> <span class="o">*</span> <span class="n">att_prev</span><span class="p">.</span><span class="n">view</span><span class="p">(</span><span class="n">batch</span><span class="p">,</span> <span class="n">h_length</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="n">dim</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
</code></pre></div></div>
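<p>For a single utterance, the masking idea in the snippet above boils down to the following plain-Python sketch. It assumes padded frames sit at the end of the sequence and uses a made-up helper name; it is a simplification for illustration, not the ESPnet code.</p>

```python
def uniform_attention(enc_h, enc_len):
    """Average the first `enc_len` annotation vectors; ignore padded frames."""
    # Weight 1/enc_len for real frames, 0 for padding.
    weights = [1.0 / enc_len if j < enc_len else 0.0 for j in range(len(enc_h))]
    dim = len(enc_h[0])
    context = [sum(w * h[d] for w, h in zip(weights, enc_h)) for d in range(dim)]
    return weights, context


# Four frames, of which the last one is padding.
enc_h = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [0.0, 0.0]]
weights, context = uniform_attention(enc_h, enc_len=3)
print(weights)   # three equal weights followed by 0.0
print(context)   # approximately [3.0, 4.0], the mean of the three real frames
```

<p>Dividing by the true length rather than the padded length is what keeps the weights of the real frames summing to 1.</p>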
<h3 id="no-attention---full-picture">No attention - full picture</h3>
<p><img src="/assets/posts/espnet_attention/noatt.png" alt="image-center" class="align-center" /></p>
<h2 id="content-based-attention">Content-based Attention</h2>
<p>Content-based Attention, as the name suggests, is based on the contents of the vectors \(s_{i-1}\) (Decoder hidden state) and \(h_{j}\) (Annotation vectors from the Encoder). This means our <strong>compatibility function</strong>, or Attention function, depends only on the contents of these vectors, irrespective of their location in the sequence.</p>
<p>What does this mean?
Let’s say what has been spoken in the utterance is <strong>Barb burned paper and leaves in a big bonfire.</strong> with the phonetic sequence as <strong>sil b aa r sil b er n sil p ey sil p er n l iy v z ih n ah sil b ih sil b aa n f ay er sil</strong>. The feature vector of a phoneme, let’s say <strong>b</strong> will be <strong>similar</strong> no matter the location of the phoneme in the sequence <em>sil <strong>b</strong> aa r sil <strong>b</strong> er n sil p ey sil p er n l iy v z ih n ah sil <strong>b</strong> ih sil <strong>b</strong> aa n f ay er sil</em></p>
<p>A purely content-based score would therefore give equal weight to every occurrence of the same phoneme, including occurrences in other words that are not relevant to the current context. A <strong>phonetically similar</strong> phoneme will also get a score close to that of the actual phoneme.</p>
<p>Content-based Attention is computed as:</p>
\[\begin{equation}
e_{i,j} = a(h_{j}, s_{i-1})
\end{equation}\]
<p>Dot product and additive attention are content-based attention mechanisms.</p>
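<p>The position-blindness of content-based attention shows up even in a toy example. In the sketch below, the annotation vectors are invented 2-dimensional values (real ones are much larger), with the two "b" frames deliberately set to the same vector; a plain dot product is assumed as the compatibility function.</p>

```python
import math


def dot(u, v):
    """Dot product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy annotation vectors: frames 1 and 4 carry the "same phoneme".
h = [
    [0.1, 0.9],  # sil
    [0.8, 0.2],  # b (first occurrence)
    [0.3, 0.5],  # aa
    [0.2, 0.4],  # r
    [0.8, 0.2],  # b (second occurrence)
]
s_prev = [1.0, 0.0]  # decoder state s_{i-1}

alpha = softmax([dot(s_prev, h_j) for h_j in h])
print(alpha[1] == alpha[4])  # True: identical content, identical weight
```

<p>Nothing in the score depends on the index of the frame, so the model has no way to prefer the occurrence that is actually next in the utterance.</p>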
<h2 id="2-dot-product-attention">2. Dot product Attention</h2>
<p>In dot product attention, our similarity measure is the dot product between the vectors \(s_{i-1}\) and \(h_{j}\). To generate the Context vector \(C_{i}\), we take the Decoder hidden state \(s_{i-1}\) (the state after generating the previous output symbol \(y_{i-1}\)) and compute its dot product with each \(h_{j}\) to get \(e_{i,j}\) for each of the Annotation vectors.</p>
<p>Conceptually, the dot product signifies how similar two vectors are (it depends on the angle between them and on their magnitudes). The more similar they are, the higher the value.</p>
<p>Here’s an image explaining Dot Product Attention</p>
<p>Here, <strong>dec_z</strong> vector is the Decoder hidden state.</p>
<p><img src="/assets/posts/espnet_attention/02_attdot/attdot.gif" alt="image-center" class="align-center" /></p>
<p>As we discussed in the <a href="http://sknadig.dev/basics-attention/#before-we-start-with-the-different-attention-models">previous post</a>, these representations have different dimensions. So, we learn a transformation that maps them to the same dimension so that we can compare them using a dot product or addition.
This transformation is learnt jointly with the other parameters using backprop.</p>
<h3 id="dot-product-attention---code"><a href="https://github.com/sknadig/espnet/blob/12d2b8181f6e7b1c9f81b002f6096840e928adbf/espnet/nets/pytorch_backend/attentions.py#L57">Dot product attention - code</a></h3>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">mlp_enc</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">eprojs</span><span class="p">,</span> <span class="n">att_dim</span><span class="p">)</span>
<span class="n">mlp_dec</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">dunits</span><span class="p">,</span> <span class="n">att_dim</span><span class="p">)</span>
<span class="n">pre_compute_enc_h</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tanh</span><span class="p">(</span><span class="n">mlp_enc</span><span class="p">(</span><span class="n">enc_h</span><span class="p">))</span>
<span class="n">e</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">pre_compute_enc_h</span> <span class="o">*</span> <span class="n">torch</span><span class="p">.</span><span class="n">tanh</span><span class="p">(</span><span class="n">mlp_dec</span><span class="p">(</span><span class="n">dec_z</span><span class="p">)).</span><span class="n">view</span><span class="p">(</span><span class="n">batch</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">att_dim</span><span class="p">),</span>
<span class="n">dim</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
<span class="n">w</span> <span class="o">=</span> <span class="n">F</span><span class="p">.</span><span class="n">softmax</span><span class="p">(</span><span class="n">scaling</span> <span class="o">*</span> <span class="n">e</span><span class="p">,</span> <span class="n">dim</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">c</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">enc_h</span> <span class="o">*</span> <span class="n">w</span><span class="p">.</span><span class="n">view</span><span class="p">(</span><span class="n">batch</span><span class="p">,</span> <span class="n">h_length</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="n">dim</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
</code></pre></div></div>
<h3 id="dot-product-attention---full-picture">Dot product attention - full picture</h3>
<p><img src="/assets/posts/espnet_attention/02_attdot/10.png" alt="image-center" class="align-center" /></p>
<p>If we compute the attention weights based only on the contents of the Decoder and Encoder vectors, similar Annotation vectors get weighted equally irrespective of their position.
We can see this clearly in the Attention plots from the model. Observe in the following image how the Attention weights are not monotonic and tend to be distributed near positions where the Annotation vectors are similar in the acoustic space.</p>
<p><img src="/assets/posts/espnet_attention/02_attdot/att_ws.png" alt="image-center" class="align-center" /></p>
<p>We could also plot where the model is attending when generating each output symbol. Here, I have added an overlay for each row of the first image just to highlight which output symbol is being generated. The actual attention weights look like the above image.</p>
<p>We could also correlate this with the spectrogram of the utterance, since we know how much sub-sampling was done in the model. I have used a sub-sampling of <strong>1_2_2_1_1</strong>. In our utterance FJSJ0_SX404, if we use a window size of 250ms and a frame shift of 10ms, we get 240 frames of feature vectors. Because of sub-sampling in our model, these features are mapped to 60 feature vectors after the Encoder network.</p>
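<p>The frame arithmetic above can be checked with a few lines of Python. This is only a sketch: it assumes each number in <strong>1_2_2_1_1</strong> is the sub-sampling factor of one Encoder layer and that a layer with factor \(k\) keeps every \(k^{th}\) frame. The helper name <code class="language-plaintext highlighter-rouge">subsampled_length</code> is made up for this example and is not part of ESPnet.</p>

```python
import math


def subsampled_length(num_frames, factors):
    """Apply per-layer sub-sampling factors to a frame count (illustrative helper)."""
    length = num_frames
    for factor in factors:
        # Keeping every `factor`-th frame leaves ceil(length / factor) frames.
        length = math.ceil(length / factor)
    return length


# Assumption: the string lists one sub-sampling factor per Encoder layer.
factors = [int(f) for f in "1_2_2_1_1".split("_")]
encoder_frames = subsampled_length(240, factors)
print(factors, encoder_frames)
```

<p>With these assumptions the overall reduction is a factor of 4, which matches the 240 to 60 mapping described above.</p>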
<p><img src="/assets/posts/espnet_attention/02_attdot/att_dot_plots.gif" alt="image-center" class="align-center" /></p>Shreeshreekantha.nadig@iiitb.ac.inDetailed discussion of Attention models for Speech Recognition in ESPnet toolkit.Introduction to Attention models for Speech Recognition2019-01-02T00:00:00+05:302019-01-02T00:00:00+05:30https://vak.ai/basics-attention<script type="text/javascript" async="" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/MathJax.js?config=TeX-MML-AM_CHTML">
</script>
<p>In the <a href="/speech/attention/encoder-decoder-basics/">previous post</a> we discussed the Encoder-Decoder framework for Speech Recognition.</p>
<p><strong>So, why do we need Attention? What’s wrong with the Encoder-Decoder framework?</strong></p>
<p>As we discussed in the Encoder-Decoder framework:</p>
<ul>
<li>
<p>Let \(x = (x_{1}, x_{2}, .........., x_{T})\) be a length \(T\) input feature vector sequence to the Encoder network.</p>
</li>
<li>
<p>Let \(y = (y_{1}, y_{2}, .........., y_{U})\) be a length \(U\) output symbol sequence the Decoder (also called the Generator) network generates.</p>
</li>
<li>
<p>Let \(h = (h_{1}, h_{2}, .........., h_{T})\) be the Encoder network output which is the encoded hidden vector sequence of length \(T\).</p>
</li>
<li>
<p>Each encoded representation (annotation) \(h_{t}\) contains information about the input sequence with <strong>focus</strong> on the \(t^{th}\) input of the sequence.</p>
</li>
</ul>
<p>In the Encoder-Decoder framework, the Encoder <strong>tries</strong> to summarize the entire input sequence in a fixed dimension vector \(h_{T}\).</p>
<p><img src="/assets/posts/att_basics/enc_dec_post_att.png" alt="image-center" class="align-center" /></p>
<h2 id="potential-issues-with-encoder-decoder">Potential issues with Encoder-Decoder</h2>
<ul>
<li>The neural network needs to be able to <strong>compress</strong> all the necessary information of the input feature vector sequence into a fixed dimension vector.</li>
<li>When the sequence is long, especially when the input sequence at test time is significantly longer than the training ones, the performance of the basic Encoder-Decoder network degrades.</li>
<li>Also, in my opinion, how well the Encoder can summarize the entire feature vector sequence into a fixed dimension vector depends on the size of that vector (the longer the sentence, the longer the vector needed), and we can’t fix this size in advance because the sequence length can vary significantly.</li>
</ul>
<h2 id="attention">Attention!</h2>
<p>One of the solutions to this problem that people have been proposing is the use of Attention. Basically, Attention is an extension to the Encoder-Decoder framework.</p>
<blockquote>
<p class="notice--info">Each time the model needs to generate an output symbol, it (soft-) <strong>searches for a set of positions</strong> in the input feature vector sequence where the most <strong>relevant</strong> information is concentrated.</p>
</blockquote>
<p>We are now concerned with making the model select this <strong>set of positions</strong> in the input sequence <strong>accurately</strong>.</p>
<p>The main difference with the Encoder-Decoder framework is that here we are not trying to summarize the entire input sequence into a fixed dimension vector.</p>
<p>We know from the Encoder-Decoder post that the Encoder is a Recurrent neural network (RNN/LSTM/BLSTM/GRU) and \(h_{t}\) is the Encoder hidden state at time \(t\) which is computed as:</p>
\[\begin{equation}
h_{t} = f(x_{t}, h_{t-1})
\end{equation}\]
<p>Now, instead of feeding only the final hidden representation \(h_{T}\) to the Decoder, let us select the subset of \(h\) that is most relevant to a particular context to help the Decoder network generate the output.</p>
<p>We linearly blend these relevant \(h_{t}\) to get what we refer to as the <strong>Context vector \(C_{i}\)</strong></p>
\[\begin{equation}
C_{i} = q(\{h_{1}, h_{2}, .........., h_{T}\}, \alpha_{i})
\end{equation}\]
<p><strong>Attention:</strong> In a way, the model is <strong>attending</strong> to a subset of the input features which are most relevant to the current context.</p>
<p>In all deep learning techniques, we would like the functions to be differentiable so that we can learn them using backprop. To make this technique of attending to a subset differentiable, we <strong>attend to all the input feature vectors, but with different weights!</strong></p>
<h2 id="differences-with-the-encoder-decoder-network">Differences with the Encoder-Decoder network</h2>
<ul>
<li>
<p>In the Encoder-Decoder network that we discussed in the previous post, the Decoder hidden state is computed as:</p>
\[\begin{equation}
s_{i} = f(s_{i-1}, y_{i-1})
\end{equation}\]
</li>
<li>
<p>In the Attention extension, we take the Context vector in computing the Decoder hidden state:</p>
\[\begin{equation}
s_{i} = f(s_{i-1}, y_{i-1}, C_{i})
\end{equation}\]
</li>
<li>
<p>The Context vector is the summary of only the most relevant input feature vectors. To capture this <em>relevance</em>, let’s consider a variable \(\alpha\) where \(\alpha_{i,j}\) represents the weight of the encoded representation (also referred to as the <strong>annotation</strong>) \(h_{j}\) in the Context vector \(C_{i}\) - for predicting the output at time \(i\). Given this \(\alpha\), we can compute the Context vector as:</p>
\[\begin{equation}
C_{i} = \sum_{j=1}^{T} \alpha_{i,j} \cdot h_{j}
\end{equation}\]
</li>
</ul>
\[\begin{equation}
\sum_{j=1}^{T} \alpha_{i,j} = 1
\end{equation}\]
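<p>As a small numerical illustration of the weighted sum above, here is a dependency-free Python sketch. The annotation values and the weights are made up for the example, and <code class="language-plaintext highlighter-rouge">context_vector</code> is just an illustrative name, not a function from any toolkit.</p>

```python
def context_vector(annotations, weights):
    """Blend annotation vectors h_j using attention weights alpha_{i,j}."""
    if len(annotations) != len(weights):
        raise ValueError("need exactly one weight per annotation vector")
    dim = len(annotations[0])
    context = [0.0] * dim
    for h_j, alpha_ij in zip(annotations, weights):
        for d in range(dim):
            # C_i[d] accumulates alpha_{i,j} * h_j[d] over all positions j.
            context[d] += alpha_ij * h_j[d]
    return context


# Toy values (assumptions): T = 3 annotations of dimension 2.
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alpha_i = [0.5, 0.25, 0.25]  # weights for one output step, summing to 1
c_i = context_vector(h, alpha_i)
print(c_i)
```

<p>Every annotation contributes to \(C_{i}\), but the first one dominates because it carries the largest weight.</p>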
<ul>
<li>
<p>To compute \(\alpha_{i,j}\), we need \(e_{i,j}\) - the importance of the \(j^{th}\) annotation vector for predicting the \(i^{th}\) output symbol. This is what the <strong>compatibility function</strong> produces.</p>
<p>The weight \(\alpha_{i,j}\) of each annotation \(h_{j}\) is computed as:</p>
\[\begin{equation}
\alpha_{i,j} = Softmax(e_{i,j}) = \frac{e^{e_{i,j}}}{\sum_{k=1}^{T} e^{e_{i,k}}}
\end{equation}\]
\[\begin{equation}
\sum_{j=1}^{T} e_{i,j} \neq 1
\end{equation}\]
</li>
<li>
<p>Where \(e_{i,j} = a(s_{i-1}, h_j)\), <strong>\(a\)</strong> is a <strong>compatibility function</strong> which computes the importance of each annotation \(h_j\) given the Decoder hidden state \(s_{i-1}\).</p>
</li>
</ul>
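<p>The two steps above (scores \(e_{i,j}\), then a Softmax over \(j\)) can be sketched in plain Python. The compatibility function used here is a simple dot product, which is only one possible choice of \(a()\), and it assumes \(s_{i-1}\) and \(h_j\) already have the same dimension. All values are toy numbers.</p>

```python
import math


def softmax(scores):
    """Normalize scores e_{i,j} into weights alpha_{i,j} that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot_score(s_prev, h_j):
    """One possible compatibility function a(s_{i-1}, h_j): a dot product."""
    return sum(s * h for s, h in zip(s_prev, h_j))


# Toy values (assumptions): Decoder state and three annotations, all of dimension 2.
s_prev = [1.0, 0.0]
h = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

e_i = [dot_score(s_prev, h_j) for h_j in h]  # unnormalized scores
alpha_i = softmax(e_i)                       # attention weights for step i
print(e_i, alpha_i)
```

<p>The scores themselves do not sum to one, but the weights do, and the annotation that best matches the Decoder state gets the largest weight.</p>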
<blockquote>
<p class="notice--info">In all our Attention models, it is this <strong>function <em>\(a()\)</em></strong> that is going to be different. <br /> <em>\(a()\)</em> defines what type of Attention it is.</p>
</blockquote>
<ul>
<li>This image summarizes the Attention mechanism. Observe each annotation vector is scaled by the attention weight \(\alpha_{i,j}\)</li>
</ul>
<p><img src="/assets/posts/att_basics/att_basic.gif" alt="image-center" class="align-center" /></p>
<ul>
<li>
<p>In the Encoder-Decoder network - Given the Decoder hidden representation \(s_{i-1}\) (from the previous output time) and the output symbol \(y_{i-1}\) (the previous output symbol), we can predict the output symbol at the current time step as:</p>
\[\begin{equation}
p(y_{i} | \{y_1, y_2, .........., y_{i-1}\}) = g(y_{i-1}, s_i)
\end{equation}\]
<p>Where \(g()\) is the entire Decoder function.</p>
</li>
<li>
<p>In the Attention extension - Given the Context vector \(C_{i}\), the Decoder hidden representation \(s_{i-1}\) (from the previous output time) and the output symbol \(y_{i-1}\) (the previous output symbol), we can predict the output symbol at the current time step as:</p>
\[\begin{equation}
p(y_{i} | \{y_1, y_2, .........., y_{i-1}\}, C_{i}) = g(y_{i-1}, s_i, C_{i})
\end{equation}\]
<p>Where \(g()\) is the entire Decoder function.</p>
</li>
<li>
<p>The probability of the full output sequence \(y\) can be computed as:</p>
\[\begin{equation}
p(y) = \prod_{i=1}^{U} p(y_i | \{y_1, y_2, .........., y_{i-1}\}, C_{i})
\end{equation}\]
</li>
</ul>
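<p>In practice the product of per-symbol probabilities is usually accumulated as a sum of logarithms to avoid numerical underflow on long sequences. A minimal sketch, with made-up per-step probabilities standing in for the Decoder outputs:</p>

```python
import math


def sequence_log_prob(step_probs):
    """Return log p(y) as the sum of log p(y_i | y_1..y_{i-1}, C_i)."""
    return sum(math.log(p) for p in step_probs)


# Toy per-step probabilities (assumptions) for an output sequence with U = 3.
step_probs = [0.9, 0.8, 0.5]
log_p = sequence_log_prob(step_probs)
p = math.exp(log_p)  # back to the probability of the whole sequence
print(log_p, p)
```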
<h2 id="attention-weights-visualization">Attention weights visualization</h2>
<p>So, what does Attention even look like?</p>
<p>I trained an Attention model on the TIMIT dataset using the ESPnet toolkit and visualized the weights over 20 epochs of training. This is what they look like for the speaker FJSJ0 and utterance SX404 of TIMIT:</p>
<blockquote>
<p class="notice--info">Word transcript for FJSJ0_SX404 : <strong>Barb burned paper and leaves in a big bonfire.</strong></p>
</blockquote>
<blockquote>
<p class="notice--info">Phoneme transcript for FJSJ0_SX404 : <strong>sil b aa r sil b er n sil p ey sil p er n l iy v z ih n ah sil b ih sil b aa n f ay er sil</strong></p>
</blockquote>
<h3 id="phoneme-decoding---final-weights">Phoneme decoding - final weights</h3>
<p><img src="/assets/posts/att_basics/phn_FJSJ0_SX404.ep.20.png" alt="image-center" class="align-center" /></p>
<h3 id="character-decoding---final-weights">Character decoding - final weights</h3>
<p><img src="/assets/posts/att_basics/char_FJSJ0_SX404.ep.20.png" alt="image-center" class="align-center" /></p>
<p>On the \(x\) axis from left to right is the Encoder index ranging from \(0\) to \(T\), where \(T\) is the length of the input feature vector sequence. On the \(y\) axis from top to bottom is the Decoder index ranging from \(0\) to \(U\), where \(U\) is the length of the output symbol sequence.</p>
<p>Here, you can see that each row corresponds to the weights over the annotation vectors \(h_{t}\) used in producing the Context vector \(C_{i}\) for generating the output symbol \(y_{i}\).</p>
<p>If you look at the Attention weights before the model is trained (at epoch 0), they are all random and hence the Context vector \(C_{i}\) contains unnecessary noise from irrelevant input feature vectors. This degrades the performance of the model. It is fairly evident that a good Attention model produces a better Context vector, which leads to better model performance.</p>
<h3 id="phoneme-decoding---initial-weights">Phoneme decoding - initial weights</h3>
<p><img src="/assets/posts/att_basics/phn_random_FJSJ0_SX404.ep.01.png" alt="image-center" class="align-center" /></p>
<h3 id="character-decoding---initial-weights">Character decoding - initial weights</h3>
<p><img src="/assets/posts/att_basics/char_random_FJSJ0_SX404.ep.01.png" alt="image-center" class="align-center" /></p>
<h3 id="attention-weights-for-single-output-symbol">Attention weights for single output symbol</h3>
<p>I’m working on visualizing the Attention weights over a Spectrogram every time an output symbol is generated. Ideally it should look like a Gaussian distribution with its mean at the most relevant \(h_{t}\) for generating \(y_{i}\) and its variance proportional to the duration of the phoneme utterance. This is proving more involved than I initially thought, requiring changes to the ESPnet code at a deeper level. I will update this post when I have that.</p>
<p><strong>Update:</strong></p>
<p>If we plot the Attention weights over the annotation sequence \(h_{t}\) for generating each \(y_{i}\), we could see how Attention is playing a role in producing the Context vector \(C_{i}\).</p>
<p>Here’s what the Attention weights look like for generating each \(y_{i}\) at epoch 1.</p>
<p><img src="/assets/posts/att_basics/att_single_01_phns_progress.gif" alt="image-center" class="align-center" /></p>
<p>Here’s what the Attention weights look like for generating each \(y_{i}\) at epoch 20.</p>
<p><img src="/assets/posts/att_basics/att_single_20_phns_progress.gif" alt="image-center" class="align-center" /></p>
<p>We could also see how the Attention weights progress over time (epochs) to get a deeper understanding of how the model is learning. I did just that by combining all the Attention weights from each epoch into a gif. Here’s what it looks like:</p>
<h3 id="phoneme-decoding---attention-weights-over-epochs">Phoneme decoding - Attention weights over epochs</h3>
<p><img src="/assets/posts/att_basics/phn_att_progress.gif" alt="image-center" class="align-center" /></p>
<h3 id="character-decoding---attention-weights-over-epochs">Character decoding - Attention weights over epochs</h3>
<p><img src="/assets/posts/att_basics/char_att_progress.gif" alt="image-center" class="align-center" /></p>
<h2 id="before-we-start-with-the-different-attention-models">Before we start with the different Attention models</h2>
<p>In all the subsequent discussions of Attention models, I would like to follow <em>some</em> consistency. For example, anything that is <span style="color:orange">orange</span> in color is related to the Encoder side of the network, <span style="color:blue">blue</span> with Decoder side and <span style="color:green">green</span> with the Attention function itself.</p>
<p>We will often see that the representation (annotation) learnt by the Encoder, the hidden state of the Decoder and the representations learnt by the Attention function are of different dimensions. This means we can’t directly add them or take their dot product.</p>
\[\begin{equation}
dim(h_{t}) \ne dim(s_{i}) \ne dim(f(e))
\end{equation}\]
<p><img src="/assets/posts/att_basics/diff_dim.png" alt="image-center" class="align-center" /></p>
<p>To overcome this issue, we project each of these vectors to a fixed dimension and this <strong>non-linear projection</strong> is learnt along with the other parameters of the network.</p>
<p><img src="/assets/posts/att_basics/dim_match.png" alt="image-center" class="align-center" /></p>
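<p>To make the dimension matching concrete, here is a dependency-free sketch of projecting an Encoder annotation and a Decoder state to a common dimension with a linear layer followed by tanh. The sizes, the random weights and the helper names are all assumptions for illustration. In a real model the projection weights are learnt with backprop rather than drawn at random.</p>

```python
import math
import random


def make_projection(in_dim, out_dim, rng):
    """Create a toy linear layer: an out_dim x in_dim weight matrix and a zero bias."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def project(vec, layer):
    """Non-linear projection: tanh(W @ vec + b)."""
    weights, bias = layer
    return [
        math.tanh(sum(w * v for w, v in zip(row, vec)) + b)
        for row, b in zip(weights, bias)
    ]


rng = random.Random(0)
enc_dim, dec_dim, att_dim = 4, 3, 2  # hypothetical sizes

enc_proj = make_projection(enc_dim, att_dim, rng)
dec_proj = make_projection(dec_dim, att_dim, rng)

h_t = [0.5, -1.0, 0.25, 2.0]  # Encoder annotation, dimension 4
s_i = [1.0, 0.0, -0.5]        # Decoder hidden state, dimension 3

h_proj = project(h_t, enc_proj)
s_proj = project(s_i, dec_proj)

# Both vectors now live in the same space, so a dot product is well defined.
score = sum(a * b for a, b in zip(h_proj, s_proj))
print(len(h_proj), len(s_proj), score)
```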
<p>In the <a href="/attention/">next post</a> we will discuss the different Attention models available in the <a href="https://github.com/espnet/espnet">ESPnet</a> toolkit.</p>Shreeshreekantha.nadig@iiitb.ac.inIntroduction to Attention models and the differences with the Encoder-Decoder frameworkEncoder-Decoder framework for Speech Recognition2019-01-01T00:00:00+05:302019-01-01T00:00:00+05:30https://vak.ai/speech/attention/encoder-decoder-basics<script type="text/javascript" async="" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/MathJax.js?config=TeX-MML-AM_CHTML">
</script>
<p>In most of the problems we are trying to solve with Machine Learning/Deep Learning, we have a set of inputs \(x = (x_{1}, x_{2}, .........., x_{T})\) that we would like to map to a set of outputs \(y = (y_{1}, y_{2}, .........., y_{T})\). Mostly, each input \(x_{i}\) corresponds to an output \(y_{i}\).</p>
<p>We assume there is some function \(f()\) that can map all of these \(x_{i}\) to their corresponding \(y_{i}\):</p>
\[\begin{equation}
y_{i} = f(x_{i}, \theta)
\end{equation}\]
<p>When we have this data (x-y pairing), a Supervised learning algorithm can be used to train a model to approximate this mapping function \(f()\).
And it has been many people’s mission to build algorithms to approximate this function \(f()\) as accurately as possible.</p>
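<p>As a toy illustration of approximating \(f()\) from x-y pairs, the sketch below fits a single parameter \(\theta\) with gradient descent. The linear form \(f(x, \theta) = \theta x\), the squared error loss, the learning rate and the data are all assumptions chosen to keep the example tiny.</p>

```python
def fit_theta(xs, ys, lr=0.01, steps=500):
    """Fit theta in f(x, theta) = theta * x by gradient descent on mean squared error."""
    theta = 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of mean((theta * x - y)^2) with respect to theta.
        grad = sum(2.0 * (theta * x - y) * x for x, y in zip(xs, ys)) / n
        theta -= lr * grad
    return theta


# Toy paired data generated by y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
theta = fit_theta(xs, ys)
print(theta)
```

<p>This works only because every \(x_{i}\) comes with its own \(y_{i}\), which is exactly what we do not have in Speech.</p>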
<h2 id="problem-of-speech">Problem of Speech</h2>
<p>The problem with Speech is that our feature vectors are taken from the <a href="http://practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/">short time spectra</a> of the speech signal, which means we have a feature vector for every 20-25ms or so.</p>
<p>So, what is the problem?</p>
<p>The problem is, <strong>first</strong> : we don’t know where the boundary is between one sound (phoneme) and another. If you open a speech signal in any <a href="https://www.audacityteam.org/">signal processing software</a>, cut out a ~200-300ms chunk and listen to it without any context, it is not possible to tell which sound it is.</p>
<blockquote>
<p class="notice--info"><strong>Context</strong> matters!</p>
</blockquote>
<p><strong>Second</strong> : Intra- and inter-speaker variability - the way one person pronounces a word is different from the way another person does. There are many reasons for this: they may speak different primary languages (L1 effect on L2), their vocal tract characteristics differ, and gender, age, etc. all play a role in how we pronounce words.</p>
<p><strong>Third</strong> : the problem with the language itself. Take English, for example: <strong>phoneme</strong> is pronounced as <strong>ˈfōnēm</strong>. Where is the sound <strong>f</strong> represented in the spelling of the word phoneme?</p>
<p>(There are other practical issues that are not relevant to this discussion)</p>
<p>That being said, let’s consider <strong>only the first problem for now</strong>, where we don’t know what sound it is when given an isolated chunk of 100-300ms of speech signal. When we collect data to train a model to do Speech recognition, we might give people a previously decided transcript and ask them to read it, then record their speech to get speech data. Or we might collect already existing speech-text paired data (Audiobooks, Broadcast news recordings etc.) to train our model.
Observe, in either case, the data we have is the speech signal and the corresponding transcript at the <strong>word level</strong>. That is, we do not have data about where a word (or a phoneme) ends and where another begins.</p>
<p><strong>How do we train a model when we don’t even know the \(x - y\) pair?</strong></p>
<p>This challenging problem of <strong>sequence modeling</strong> has interested the speech community for many decades. There have been many approaches to tackle this problem; two of the recent ones are: <a href="https://distill.pub/2017/ctc/"><strong>Connectionist temporal classification</strong></a> and the <a href="https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf"><strong>Encoder-Decoder</strong></a> approach.</p>
<h2 id="encoder-decoder-network">Encoder-Decoder network</h2>
<ul>
<li>
<p>Let \(x = (x_{1}, x_{2}, .........., x_{T})\) be a length \(T\) input feature vector sequence to the Encoder network.</p>
</li>
<li>
<p>Let \(y = (y_{1}, y_{2}, .........., y_{U})\) be a length \(U\) output symbol sequence the Decoder (also called the Generator) network generates.</p>
</li>
<li>
<p>Let \(h = (h_{1}, h_{2}, .........., h_{T})\) be the Encoder network output which is the encoded hidden vector sequence of length \(T\).</p>
</li>
</ul>
<p><img src="/assets/posts/enc_dec/encoder.gif" alt="image-center" class="align-center" /></p>
<ul>
<li>
<p>Each encoded representation \(h_{t}\) contains information about the input sequence with <strong>focus</strong> on the \(t^{th}\) input of the sequence.</p>
</li>
<li>
\[\begin{equation}
h_{t} = f(x_{t}, h_{t-1})
\end{equation}\]
<p>is the hidden state at time \(t\), where \(f()\) is some function the Encoder is implementing to update its hidden representation.</p>
</li>
</ul>
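<p>The recurrence \(h_{t} = f(x_{t}, h_{t-1})\) can be sketched with a toy scalar RNN cell. The tanh update, the fixed weights and the scalar inputs are assumptions for illustration only. A real Encoder uses learnt weight matrices and LSTM/GRU cells over feature vectors.</p>

```python
import math


def encoder_step(x_t, h_prev, w_x=0.5, w_h=0.8):
    """Toy scalar version of h_t = f(x_t, h_{t-1})."""
    return math.tanh(w_x * x_t + w_h * h_prev)


def encode(xs, h0=0.0):
    """Run the recurrence over the whole input and return every hidden state."""
    h, states = h0, []
    for x_t in xs:
        h = encoder_step(x_t, h)  # h_t depends on x_t and on everything before it
        states.append(h)
    return states


x = [1.0, -0.5, 0.25]  # toy input sequence with T = 3
h = encode(x)
print(h)        # h_1 .. h_T
print(h[-1])    # h_T, the state that has seen the entire input
```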
<p>In the Encoder-Decoder framework, the Encoder tries to <strong>summarize</strong> the entire input sequence in a fixed dimension vector \(h_{T}\). The Encoder itself is a Recurrent neural network (RNN/LSTM/BLSTM/GRU) which takes each input feature vector \(x_{t}\) and updates its internal state to represent (summarize) the sequence till that time inside \(h_{t}\).</p>
<p>We could take \(h_{t}\) at every time step to make a prediction (or not), but we shall wait till the end of the sequence at time \(T\) and take the representation \(h_{T}\) to start generating our output sequence. This is because we don’t know the word/phoneme boundaries and we are <strong>hoping</strong> the Encoder is <em>able</em> to summarize the input sequence entirely inside \(h_{T}\).</p>
<p>We give a <strong>&lt;sos&gt;</strong> - start of the sequence token as input to the Decoder for consistency and to start generating output symbols. The Decoder is another Recurrent neural network (not bidirectional) which updates its internal state every time to predict the output.</p>
<p>At every time step, we feed the output from the previous time step to predict the current output.</p>
\[\begin{equation}
s_{i} = f(s_{i-1}, y_{i-1})
\end{equation}\]
<p>is the Decoder hidden state when predicting the \(i^{th}\) output symbol, where \(f()\) is some function the Decoder LSTM is implementing to update its hidden representation.</p>
<p>Given the Decoder hidden representation \(s_{i-1}\) (from the previous output time) and the output symbol \(y_{i-1}\) (the previous output symbol), we can predict the output symbol at the current time step as:</p>
\[\begin{equation}
p(y_{i} | \{y_1, y_2, .........., y_{i-1}\}) = g(y_{i-1}, s_i)
\end{equation}\]
<p>Where \(g()\) is the entire Decoder function.</p>
<p>The probability of the full output sequence \(y\) can be computed as:</p>
\[\begin{equation}
p(y) = \prod_{i=1}^{U} p(y_i | \{y_1, y_2, .........., y_{i-1}\}, s_{i})
\end{equation}\]
<p>We will stop generating the output symbol sequence when the Decoder generates an <strong>&lt;eos&gt;</strong> - end of sequence token.</p>
<p><img src="/assets/posts/enc_dec/encoder_decoder.gif" alt="image-center" class="align-center" /></p>
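<p>The decoding loop described above (start from &lt;sos&gt;, feed back the previous output, stop at &lt;eos&gt;) can be sketched as follows. The <code class="language-plaintext highlighter-rouge">toy_decoder_step</code> function is a hypothetical stand-in that replays a fixed script. A real Decoder would compute \(s_{i}\) from \(s_{i-1}\) and \(y_{i-1}\), start from a state derived from \(h_{T}\), and pick the most probable symbol from \(g()\).</p>

```python
SOS, EOS = "<sos>", "<eos>"


def toy_decoder_step(state, prev_symbol):
    """Hypothetical stand-in for the Decoder update followed by g().

    It ignores prev_symbol and simply replays a fixed script, using the
    integer state as a position counter.
    """
    script = ["b", "aa", "r", EOS]
    symbol = script[min(state, len(script) - 1)]
    return state + 1, symbol


def greedy_decode(initial_state=0, max_len=10):
    """Generate symbols one at a time until <eos> or max_len is reached."""
    state, prev, output = initial_state, SOS, []
    for _ in range(max_len):
        state, symbol = toy_decoder_step(state, prev)
        if symbol == EOS:
            break
        output.append(symbol)
        prev = symbol  # the current output becomes the next step's input
    return output


decoded = greedy_decode()
print(decoded)
```

<p>The <code class="language-plaintext highlighter-rouge">max_len</code> cap is a common safeguard in case the Decoder never emits the end of sequence token.</p>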
<h2 id="summary">Summary</h2>
<ul>
<li>\(x = (x_{1}, x_{2}, .........., x_{T})\) is the input sequence</li>
<li>Encoder: Summarizes the entire input sequence \(x\) inside \(h_{T}\)</li>
<li>\(h = (h_{1}, h_{2}, .........., h_{T})\) is the hidden vector sequence</li>
<li>\(h_{t}\): Summary of the input sequence till time \(t\)</li>
<li>Decoder: Generates the output sequence given \(h_{T}\)</li>
<li>\(y = (y_{1}, y_{2}, .........., y_{U})\) is the output sequence</li>
</ul>
<p>In the next post we will discuss the basics of the Attention extension to the Encoder-Decoder framework and how it is better.</p>Shreeshreekantha.nadig@iiitb.ac.inIntroduction to Encoder-Decoder framework and its significance