
Suppose we need to build a model for sequential input, where a previous input can affect the next one. Then the output depends on the time step and on the partial input seen so far.

In this post I will introduce the RNN (recurrent neural network), a powerful type of neural network for handling sequential data such as signals, text, and time series.

Sequential Model

Why sequential?

If we treat the data as sequential, the position of each element becomes meaningful.

Sequential Model: The model for dealing with sequential data.

  • Input: Sequence

  • Output: Sequence or Label

Example of sequential data

  • Speech recognition
  • Music generation
  • Named entity recognition
  • Machine translation

The difference between NN and RNN

There are two main problems with using a standard neural network for sequences, and these are what a recurrent neural network addresses:

  1. Inputs and outputs can have different lengths in different examples, which a standard network with fixed layer sizes cannot handle.
  2. A standard network doesn’t share features learned across different positions of the sequence.

(Figure: NN vs. RNN)

Concept of RNN

The idea of a recurrent neural network is that the output at each time step is determined by the previous time step’s output and the current sequence input.

As the image below shows, a folded node works recurrently by feeding its output back to itself.

In the unfolded version, the inputs of the current node are the output of the previous node and the sequence input $x_t$.

(Figure: unfolding an RNN)

Let’s look at the RNN equations in detail.

\[\begin{align} & a^{<0>} = \vec{0} \\ & a^{<1>} = g(W_{aa}a^{<0>} + W_{ax}x^{<1>} + b_a) \\ & \hat y^{<1>} = g(W_{ya}a^{<1>} + b_y) \\ & \vdots \\ & a^{<t>} = g(W_{aa}a^{<t-1>} + W_{ax}x^{<t>} + b_a) \\ & \hat y^{<t>} = g(W_{ya}a^{<t>} + b_y) \end{align}\]

The above equations can be simplified as below.

\[\begin{align} & a^{<t>} = g( \begin{bmatrix} W_{aa} & W_{ax} \end{bmatrix} \cdot \begin{bmatrix} a^{<t-1>} \\ x^{<t>} \end{bmatrix} + b_a ) \\ & \hat y^{<t>} = g(W_{ya} \cdot a^{<t>} + b_y) \end{align}\]

Weights

  • $W_{aa}$ is the shared weight matrix that carries the hidden state from one time step to the next.
  • $W_{ya}$ is the shared weight matrix for the output.

Output

  • In $a^{<t>}$, $g()$ is commonly tanh or ReLU.
  • In $\hat y^{<t>}$, $g()$ is sigmoid or softmax, depending on the label.
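
To make the notation concrete, here is a minimal NumPy sketch of the forward pass above, using tanh for the hidden state and softmax for the output. All dimensions and names are hypothetical, chosen only for illustration.

```python
import numpy as np

def rnn_forward(x_seq, W_aa, W_ax, W_ya, b_a, b_y):
    """Forward pass of the RNN equations above over one input sequence.

    x_seq: array of shape (T, n_x), one row per time step x^<t>.
    Returns all hidden states a^<t> and outputs y_hat^<t>.
    """
    n_a = W_aa.shape[0]
    a = np.zeros(n_a)                                 # a^<0> = 0
    a_seq, y_seq = [], []
    for x_t in x_seq:
        # a^<t> = tanh(W_aa a^<t-1> + W_ax x^<t> + b_a)
        a = np.tanh(W_aa @ a + W_ax @ x_t + b_a)
        # y_hat^<t> = softmax(W_ya a^<t> + b_y)
        z = W_ya @ a + b_y
        y = np.exp(z - z.max())
        y /= y.sum()
        a_seq.append(a)
        y_seq.append(y)
    return np.array(a_seq), np.array(y_seq)

# Hypothetical sizes: 5 time steps, 4-dim input, 8-dim hidden state, 3 classes.
rng = np.random.default_rng(0)
T, n_x, n_a, n_y = 5, 4, 8, 3
a_all, y_all = rnn_forward(
    rng.normal(size=(T, n_x)),
    0.1 * rng.normal(size=(n_a, n_a)), 0.1 * rng.normal(size=(n_a, n_x)),
    0.1 * rng.normal(size=(n_y, n_a)), np.zeros(n_a), np.zeros(n_y))
print(y_all.shape)  # (5, 3): one probability distribution per time step
```

Note that the same matrices $W_{aa}$, $W_{ax}$, and $W_{ya}$ are reused at every time step, which is exactly the weight sharing described above.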

Memory Cells

The output of a recurrent neuron at time step $t$ is a function of all the inputs from previous time steps, so it has a form of memory.

Input and Output Sequences

An RNN can simultaneously take a sequence of inputs and produce a sequence of outputs.

Limitation of RNN

Vanishing/exploding gradient problem

When you train an RNN on long sequences, you may suffer from vanishing or exploding gradients.

Solution

  1. Good parameter initialization
  2. Non-saturating activation functions (e.g. ReLU)
  3. Batch Normalization
  4. Gradient Clipping (see the sketch after this list)
  5. Faster optimizers
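
As a rough illustration of gradient clipping (item 4), the idea is simply to rescale the gradient whenever its norm grows too large. The snippet below is a framework-free NumPy sketch with a made-up threshold, not the API of any particular library.

```python
import numpy as np

def clip_gradient(grad, max_norm=5.0):
    """Rescale grad so its L2 norm never exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([3.0, 4.0, 12.0])            # norm is 13, above the threshold
print(np.linalg.norm(clip_gradient(g)))   # ~5.0 after clipping
```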

LSTM (Long Short-Term Memory)

Even if a long sequence can be trained without vanishing/exploding gradients, the effect of the first input keeps getting smaller toward the last input. This can cause problems such as losing context.

For example, when we train a text model, a plain RNN may learn the grammar well but not the context: the sentences generated by such a model are poor at retaining contextual information.

To solve this problem, the LSTM is designed to remember both long- and short-term memory across the time series.

\[\begin{align} i_{(t)} &= \sigma(W_{xi}^T \cdot x_{(t)} + W_{hi}^T \cdot h_{(t-1)} + b_i) \\ f_{(t)} &= \sigma(W_{xf}^T \cdot x_{(t)} + W_{hf}^T \cdot h_{(t-1)} + b_f) \\ o_{(t)} &= \sigma(W_{xo}^T \cdot x_{(t)} + W_{ho}^T \cdot h_{(t-1)} + b_o) \\ g_{(t)}& = \tanh(W_{xg}^T \cdot x_{(t)} + W_{hg}^T \cdot h_{(t-1)} + b_g)\\ c_{(t)} &= f_{(t)} \otimes c_{(t-1)} + i_{(t)} \otimes g_{(t)}\\ y_{(t)} &= h_{(t)} = o_{(t)} \otimes \tanh(c_{(t)}) \end{align}\]
  • $c_{(t)}$ is the long-term memory state.

    $c_{(t)}$ drops some memories through the forget gate and picks up new ones through the input gate.

  • $h_{(t)}$ is the short-term memory state.

The output of $\tanh$ is between -1 and 1, so that value can be remembered for a short time.

The output of $\sigma$ is between 0 and 1, so the carried value can be decreased as the time steps pass.
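
Putting the equations together, a single LSTM step can be sketched as below. This assumes NumPy, uses elementwise `*` in place of $\otimes$, and all weight names and shapes are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following the equations above.

    W is a dict of weight matrices (W["xi"], W["hi"], ...) and b a dict of
    biases; x_t has shape (n_x,), h_prev and c_prev have shape (n_h,).
    """
    i = sigmoid(W["xi"] @ x_t + W["hi"] @ h_prev + b["i"])   # input gate
    f = sigmoid(W["xf"] @ x_t + W["hf"] @ h_prev + b["f"])   # forget gate
    o = sigmoid(W["xo"] @ x_t + W["ho"] @ h_prev + b["o"])   # output gate
    g = np.tanh(W["xg"] @ x_t + W["hg"] @ h_prev + b["g"])   # candidate memory
    c = f * c_prev + i * g        # long-term state: forget some, add some
    h = o * np.tanh(c)            # short-term state, also the output y_(t)
    return h, c

# Hypothetical sizes: 4-dim input, 6-dim state.
rng = np.random.default_rng(1)
n_x, n_h = 4, 6
W = {k: 0.1 * rng.normal(size=(n_h, n_x if k.startswith("x") else n_h))
     for k in ["xi", "hi", "xf", "hf", "xo", "ho", "xg", "hg"]}
b = {k: np.zeros(n_h) for k in ["i", "f", "o", "g"]}
h, c = lstm_step(rng.normal(size=n_x), np.zeros(n_h), np.zeros(n_h), W, b)
```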

GRU (Gated Recurrent Unit)

The GRU cell is a simplified version of the LSTM cell, and it seems to perform just as well.

\[\begin{align} z_{(t)} &= \sigma(W_{xz}^T \cdot x_{(t)} + W_{hz}^T \cdot h_{(t-1)} + b_z) \\ r_{(t)} &= \sigma(W_{xr}^T \cdot x_{(t)} + W_{hr}^T \cdot h_{(t-1)} + b_r)\\ g_{(t)} &= \tanh(W_{xg}^T \cdot x_{(t)} + W_{hg}^T \cdot (r_{(t)} \otimes h_{(t-1)}) + b_g) \\ h_{(t)} &= (1-z_{(t)}) \otimes h_{(t-1)} + z_{(t)} \otimes g_{(t)} \end{align}\]
  • Both state vectors are merged into a single vector $h_{(t)}$.
  • A single gate controller controls both the forget gate and the input gate. If the gate controller outputs 1, the input gate is open and the forget gate is closed; if it outputs 0, the opposite happens.
  • There is no output gate; the full state vector is output at every time step.
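
For comparison, one GRU step can be sketched in the same style as the LSTM above (NumPy, elementwise `*` for $\otimes$, hypothetical names and shapes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W, b):
    """One GRU step following the equations above (single state vector h)."""
    z = sigmoid(W["xz"] @ x_t + W["hz"] @ h_prev + b["z"])   # update gate
    r = sigmoid(W["xr"] @ x_t + W["hr"] @ h_prev + b["r"])   # reset gate
    g = np.tanh(W["xg"] @ x_t + W["hg"] @ (r * h_prev) + b["g"])
    # z plays both roles: (1 - z) keeps the old state, z lets the candidate in.
    return (1.0 - z) * h_prev + z * g

# Hypothetical sizes: 4-dim input, 6-dim state.
rng = np.random.default_rng(2)
n_x, n_h = 4, 6
W = {k: 0.1 * rng.normal(size=(n_h, n_x if k.startswith("x") else n_h))
     for k in ["xz", "hz", "xr", "hr", "xg", "hg"]}
b = {k: np.zeros(n_h) for k in ["z", "r", "g"]}
h = gru_step(rng.normal(size=n_x), np.zeros(n_h), W, b)
```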