Lecture 13: Recurrent Neural Nets (RNNs) [MY SOLUTIONS]
A major characteristic of the "simple" neural networks covered in the DNN notebook is that they have no memory. What does that mean?
By "no memory" I mean that each input shown to them is processed independently and the nodes keep no "state variable" between inputs. With cross-sectional data, this isn't a big deal but what about when what happens next depends upon what has happened now? For example, we saw earlier in the semester that predicting GDP required creating lagged variables which amounted to tricking the regression to look into the past by creating new variables.
When a DNN has no memory, we have to show the entire sequence to the network at once and hope the algorithm figures out the time-series or sequence order. In the IMDB example, an entire movie review was transformed into a single large vector and processed in one go. Such networks are known as feedforward networks because the information feeds through the network in one direction.
But the real world isn't a snapshot in time. As you're reading the present sentence, you're processing it word-by-word and retaining memories of what came before; this gives you a fluid representation of the meaning conveyed by the sentence. Biological intelligence processes information incrementally while maintaining an internal model of what it is processing. Moreover, intelligence constantly updates as new information comes in. Modeling this requires creating a recurrent neural network (RNN), which is a DNN that retains the order of the sequence embedded in the data. Conceptually, you can think of an RNN as a model which is flexible enough to figure out how many lags to use of different data features. Because the model is a neural net, such "lags" are allowed to be very complicated.
The agenda for today's lecture is as follows:
In the practice section you will learn about:
- Using "dropouts" to reduce overfitting in RNN models.
- "Stacking" RNNs.
- "Bi-directional" RNNs.
1. Deep Neural Nets (DNNs)
In the DNN notebook you learned that a "Deep" Neural Network (DNN) is an ML model which has many inter-connected "layers" of "nodes." Each node may serve a distinct and disparate purpose. How a node takes inputs and creates outputs is generally based on the following linear operator:
$$ \text{output} = W\times \text{input} + b $$
where $W$ is a matrix of weights, "input" is the matrix of input data, and $b$ is a vector of intercepts (the "bias").
With this kind of functional form, each layer could only learn linear (more precisely, affine) transformations of the input data: the hypothesis space of the layer would be the set of all possible linear transformations of the input data into the layer's output space. Such a hypothesis space is too restricted and wouldn't benefit from multiple layers of representations, because a deep stack of linear layers would still implement a linear operation: adding more layers wouldn't extend the hypothesis space.
In order to get access to a much richer hypothesis space that would benefit from deep representations, it turns out we need to add a non-linearity -- what we call an activation function.
Each "layer" will be decomposed into "nodes." Define $z_{ij}^{h-1}$ as the output of node $j$ in layer $h\!-\!1$ for observation $i$. The output of node $k$ in layer $h$ is then $$ z_{ik}^{h}= \sum_{j}\omega_{hj}z_{ij}^{h-1} $$ where $\omega$ are weights (parameters) which we solve for when we train the model (like $\beta$ in the spam email example).
An "activation function" is a functional form decision for the researcher. The most common is the "rectified linear unit" (or relu
in tensorflow language) defined as
$$
z_{ik}^{h} = \max\bigg\{0,\sum_{j}\omega_{jk}^{h}z_{ij}^{h-1}\bigg\}
$$
While relu is the most popular activation function in deep learning, there are many other candidates, which all come with similarly strange names: prelu, elu, and so on.
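To make the algebra concrete, here is a minimal NumPy sketch of one dense layer with a relu activation; the layer sizes and random weights below are made up purely for illustration.
import numpy as np

rng = np.random.default_rng(0)

n_obs, n_inputs, n_nodes = 3, 4, 16          # toy dimensions: 3 observations, 4 features, 16 nodes

X = rng.normal(size=(n_obs, n_inputs))       # input data, one row per observation
W = rng.normal(size=(n_inputs, n_nodes))     # weights: one column of omegas per node
b = np.zeros(n_nodes)                        # intercepts ("biases")

Z = np.maximum(0, X @ W + b)                 # affine step, then relu zeroes out the negatives
print(Z.shape)                               # (3, 16): one output per node per observation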
You also learned of the Universal Approximation Theorem, which shows that any continuous function can be approximated arbitrarily well by a DNN with enough nodes. Put differently, this means a DNN has the flexibility to approximate any continuous process.
2. What is a Recurrent Neural Network (RNN)?
A recurrent neural network (RNN) processes sequences — whether daily stock prices, sentences, or sensor measurements — one element at a time while retaining a memory (called a "state") of what has come previously in the sequence. Recurrent means the output at the current time step becomes the input to the next time step. At each element of the sequence, the model considers not just the current input, but what it remembers about the preceding elements.
In effect, an RNN is a type of neural network that has an internal loop (see figure below). The state of the RNN is reset between processing two different, independent sequences (such as two different IMDB reviews), so you still consider one sequence a single data point: a single input to the network. What changes is that this data point is no longer processed in a single step; rather, the network internally loops over sequence elements.
This memory allows the network to learn long-term dependencies in a sequence, which means it can take the entire context into account when making a prediction, whether that be the next word in a sentence, a sentiment classification, or the next temperature measurement. An RNN is designed to mimic the human way of processing sequences: we consider the entire sentence when forming a response rather than each word by itself.
For example, consider the following sentence:
The concert was boring for the first 15 minutes while the band warmed up but then was terribly exciting.
A machine learning model that considers the words in isolation — such as a bag of words model — would probably conclude this sentence is negative. An RNN by contrast should be able to see the words “but” and “terribly exciting” and realize that the sentence turns from negative to positive because it has looked at the entire sequence. Reading a whole sequence gives us a context for processing its meaning, a concept encoded in recurrent neural networks.
3. A "Simple" RNN¶
Let's look at a "simple" RNN layer to get some intuition.
In the above example, the layer takes as inputs a 2d tensor (timesteps, input_features) and loops of the timesteps. At each timestep, it uses the current state $t$ and the input at $t$ to create an output at $t$:
output_t = activation(dot(W,input_t) + dot(U,state_t) + b)
state_t = output_t
where $W$ and $U$ are weight matrices we must determine when fitting the model and "dot" means (matrix) multiplication. Note that if $U=0$, we have a regular old DNN since the state cannot matter. The output_t travels to the next layer, while state_t is carried forward and used at the next timestep.
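To see the internal loop explicitly, here is a minimal NumPy version of the pseudo-code above. The dimensions and random weights are made up for illustration; Keras's SimpleRNN does the equivalent thing while also training the weights.
import numpy as np

rng = np.random.default_rng(0)

timesteps, input_features, output_features = 100, 32, 64

inputs = rng.normal(size=(timesteps, input_features))    # one sequence: (timesteps, input_features)
state_t = np.zeros(output_features)                      # initial state: all zeros

W = rng.normal(size=(output_features, input_features))   # weights on the current input
U = rng.normal(size=(output_features, output_features))  # weights on the state (set U = 0 and the memory dies)
b = rng.normal(size=(output_features,))

outputs = []
for input_t in inputs:                                    # loop over the timesteps
    output_t = np.tanh(np.dot(W, input_t) + np.dot(U, state_t) + b)
    outputs.append(output_t)
    state_t = output_t                                    # today's output is tomorrow's state

print(np.stack(outputs).shape)                            # (100, 64): one output per timestep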
3.1 Embeddings
Before proceeding we should talk about a limitation of word counting that we've been avoiding. Namely, simply counting up words (aka one-hot encoding) in a text says little about how those words are related to each other. For example, books like "War and Peace" and "Anna Karenina" (both classic novels by Leo Tolstoy) are no closer to one another than "War and Peace" is to "The Hitchhiker’s Guide to the Galaxy" if we use one-hot encoding as the measuring stick. All of these share a lot of the same words. What we want is some way of saying a group of texts is "close" to each other. What we would like to do is use the data to find similarities in the words in an unsupervised kind of way.
This is a bit like Principal Component Analysis -- we want to find out how the words are similar in our setting and thereby reduce the dimensionality from words to some notion of "similarity." In our IMDB example, we could reduce the dimensionality down to similarity in the words used for reviews which are positive (or negative). As a human reader, this amounts to understanding that the words "no" and "good," when put together as the phrase "no good," are likely indicative of a bad review.
How do we do this? We can start our DNN with an embedding layer, which maps each word index to a dense vector that is learned during training; words that play similar roles in good and bad reviews end up with similar vectors, and the dimensionality is reduced in the process. In the next lecture, we'll cover embeddings in more detail.
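As a quick illustration of what an embedding layer does to the data (the vocabulary size, embedding dimension, and toy word indices below are arbitrary), it maps each integer word index to a dense vector that gets learned during training:
import numpy as np
from tensorflow.keras.layers import Embedding

# Map a vocabulary of 10,000 word indices into 32-dimensional vectors
embedding_layer = Embedding(input_dim=10000, output_dim=32)

# A toy batch of 2 "reviews", each padded/truncated to 5 word indices
toy_batch = np.array([[4, 20, 7, 0, 0],
                      [15, 3, 9, 2, 1]])

embedded = embedding_layer(toy_batch)
print(embedded.shape)   # (2, 5, 32): one 32-dimensional vector per word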
3.2 Classifying IMDB Reviews with RNN
Let's try this out with a subset of the IMDB dataset.
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing import sequence
max_features = 10000 # number of words to consider as features
maxlen = 500 # cut texts after this number of words (among top max_features most common words)
batch_size = 32
print('Loading data...')
(input_train, y_train), (input_test, y_test) = imdb.load_data(num_words=max_features)
print(len(input_train), 'train sequences')
print(len(input_test), 'test sequences')
# A subtle point: All neural networks require inputs that have the same shape and size.
# Using texts as inputs means that not all the sentences/samples have the same length. That's a problem.
# The solution is called "padding": we define a maximum number of words for each sentence.
# If a sentence is longer than the maximum, drop the extra words; if it is shorter, pad it with zeros.
# If you look back at the DNN lecture, we did this by construction with the "vectorize_sequences" function,
# which set the length of each sample at 10,000. The code below is much simpler.
print('\nSequences (samples x time)')
input_train = sequence.pad_sequences(input_train, maxlen=maxlen)
input_test = sequence.pad_sequences(input_test, maxlen=maxlen)
# Trim to first "nobs" observations to make fitting the model faster
nobs = 5000
input_train = input_train[:nobs,:]
input_test = input_test[:nobs,:]
y_train = y_train[:nobs]
y_test = y_test[:nobs]
print('input_train shape:', input_train.shape)
print('input_test shape:', input_test.shape)
Loading data... 25000 train sequences 25000 test sequences Sequences (samples x time) input_train shape: (5000, 500) input_test shape: (5000, 500)
Let's train a simple recurrent network using an Embedding layer and a SimpleRNN layer:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, SimpleRNN
model = Sequential()
model.add(Embedding(max_features, 32))
model.add(SimpleRNN(32))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()
Model: "sequential" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= embedding (Embedding) (None, None, 32) 320000 simple_rnn (SimpleRNN) (None, 32) 2080 dense (Dense) (None, 1) 33 ================================================================= Total params: 322,113 Trainable params: 322,113 Non-trainable params: 0 _________________________________________________________________
history = model.fit(input_train, y_train,
epochs=10,
batch_size=128,
validation_split=0.2)
Epoch 1/10 32/32 [==============================] - 5s 105ms/step - loss: 0.7014 - accuracy: 0.5052 - val_loss: 0.6881 - val_accuracy: 0.5360 Epoch 2/10 32/32 [==============================] - 3s 95ms/step - loss: 0.6549 - accuracy: 0.6472 - val_loss: 0.6800 - val_accuracy: 0.5510 Epoch 3/10 32/32 [==============================] - 3s 95ms/step - loss: 0.5728 - accuracy: 0.7912 - val_loss: 0.6849 - val_accuracy: 0.5630 Epoch 4/10 32/32 [==============================] - 3s 94ms/step - loss: 0.4405 - accuracy: 0.8913 - val_loss: 0.7064 - val_accuracy: 0.5530 Epoch 5/10 32/32 [==============================] - 3s 95ms/step - loss: 0.3034 - accuracy: 0.9477 - val_loss: 0.7683 - val_accuracy: 0.5490 Epoch 6/10 32/32 [==============================] - 3s 96ms/step - loss: 0.1935 - accuracy: 0.9808 - val_loss: 0.7737 - val_accuracy: 0.5570 Epoch 7/10 32/32 [==============================] - 3s 95ms/step - loss: 0.1144 - accuracy: 0.9925 - val_loss: 0.8350 - val_accuracy: 0.5500 Epoch 8/10 32/32 [==============================] - 3s 107ms/step - loss: 0.0550 - accuracy: 0.9995 - val_loss: 0.9072 - val_accuracy: 0.5560 Epoch 9/10 32/32 [==============================] - 3s 103ms/step - loss: 0.0382 - accuracy: 0.9992 - val_loss: 0.9731 - val_accuracy: 0.5320 Epoch 10/10 32/32 [==============================] - 3s 100ms/step - loss: 0.0356 - accuracy: 0.9960 - val_loss: 0.9597 - val_accuracy: 0.5510
Let's display the training and validation loss and accuracy:
import matplotlib.pyplot as plt
import seaborn as sns
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)
fig, ax = plt.subplots(1,2,figsize=(15,5))
ax[0].plot(epochs, loss, 'bo', label='Training loss') # "bo" is for "blue dot"
ax[0].plot(epochs, val_loss, 'b', label='Validation loss') # b is for "solid blue line"
ax[0].set_title('Training and validation loss')
ax[0].set_xlabel('Epochs')
ax[0].set_ylabel('Loss')
ax[0].legend(frameon=False)
sns.despine(ax=ax[0])
ax[1].plot(epochs, acc, 'bo', label='Training Accuracy')
ax[1].plot(epochs, val_acc, 'b', label='Validation Accuracy')
ax[1].set_title('Training and Validation Accuracy')
ax[1].set_xlabel('Epochs')
ax[1].set_ylabel('Accuracy')
ax[1].legend(frameon=False)
sns.despine(ax=ax[1])
plt.show()
Last time our DNN got us to about 88% accuracy. Unfortunately, our small recurrent network doesn't perform well at all compared to this baseline: validation accuracy here tops out around 56%. Part of the problem is that our inputs only consider the first 500 words of each review rather than the full sequences, and we trained on only 5,000 reviews -- hence our RNN has access to less information than our earlier baseline model. The remainder of the problem is simply that SimpleRNN isn't very good at processing long sequences, like text. Other types of recurrent layers perform much better. Let's do something fancier...
4. Long Short-Term Memory (LSTM)
At the heart of an RNN is a layer made of "memory cells," which is to say a method for the model to retain memory of previous states. How best to do that is debatable. A popular (useful) methodology is known as Long Short-Term Memory (LSTM), which saves information for later using something called a "carry". Let's look at another diagram:
The "carry" works to modulate both the next output and the next state.
Here's where things go a bit nuts. The carry dataflow is a function of three objects we'll call i, f, and k, and each one of these is computed like a simple RNN, complete with its own weight matrices. For those of you keeping track at home, we've now introduced a lot of weight matrices, and we'll have to solve for all of them when fitting the model -- a computationally expensive task for the flexibility and added functionality we want. More proof that in life there is no free lunch, I guess.
Here's the pseudo-code where the weight matrices are (W,U,V):
output_t = activation(dot(state_t, Uo) + dot(input_t, Wo) + dot(C_t, Vo) + bo)
i_t = activation(dot(state_t, Ui) + dot(input_t, Wi) + bi)
f_t = activation(dot(state_t, Uf) + dot(input_t, Wf) + bf)
k_t = activation(dot(state_t, Uk) + dot(input_t, Wk) + bk)
where the activation function is the hyperbolic tangent (tanh). Why not relu? Well, the interactions we create with the carries already generate nonlinearities (products of terms), so there's less need to introduce more non-linearity via relu.
We solve for the carry workflow as a combination of (i,f,k):
c_t+1 = i_t * k_t + c_t * f_t
At each time step the LSTM considers the current word, the carry, and the cell state.
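Here is the same pseudo-code written out as a single NumPy timestep, just to show how the pieces fit together. The dimensions and random weights are made up for illustration, and note that the real Keras LSTM uses sigmoid activations for its gates rather than tanh everywhere, but the structure is the same.
import numpy as np

rng = np.random.default_rng(0)
n = 8                                   # number of units (arbitrary, for illustration only)

def mat():                              # helper: a random n-by-n weight matrix
    return rng.normal(size=(n, n))

input_t = rng.normal(size=n)            # current input
state_t = np.zeros(n)                   # previous state
c_t = np.zeros(n)                       # previous carry

# Each object (output, i, f, k) gets its own weight matrices, exactly as in the pseudo-code
Wo, Uo, Vo, bo = mat(), mat(), mat(), np.zeros(n)
Wi, Ui, bi = mat(), mat(), np.zeros(n)
Wf, Uf, bf = mat(), mat(), np.zeros(n)
Wk, Uk, bk = mat(), mat(), np.zeros(n)

# One LSTM timestep
output_t = np.tanh(Uo @ state_t + Wo @ input_t + Vo @ c_t + bo)
i_t = np.tanh(Ui @ state_t + Wi @ input_t + bi)
f_t = np.tanh(Uf @ state_t + Wf @ input_t + bf)
k_t = np.tanh(Uk @ state_t + Wk @ input_t + bk)

# Update the carry: forget some of the old carry (f_t) and add new information (i_t * k_t)
c_next = i_t * k_t + c_t * f_t
state_next = output_t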
Discussion
If you want to get philosophical, you can interpret what each of these operations is meant to do.
- Multiplying c_t and f_t is a way to deliberately forget irrelevant information in the carry dataflow.
- i_t and k_t provide information about the present, updating the carry track with new information.
What these operations actually do, however, is determined by the contents of the weights parameterizing them. These weights are learned in an end-to-end fashion, starting over with each training round, making it impossible to credit this or that operation with a specific purpose. The specification of an RNN cell determines the space in which you’ll search for a good model configuration during training—but it doesn’t determine what the cell does; that is up to the cell weights and the cell weights will be determined by the data. If there is memory to be retained in the data, the LSTM RNN is designed to find it.
If all of this sounds uncomfortable, take comfort in the fact that you don't have to design the specific architecture of an LSTM cell; rather, know that the LSTM is designed to allow the sequencing embedded in the data to matter. That is, an LSTM layer is designed to allow "past" information to be reinjected at a later time. Thus, past information retains value for future predictions, just as using lags enabled us to better predict GDP.
5. Fitting an LSTM RNN Model on IMDB Reviews
Let's fit an LSTM RNN model to the IMDB data. There are a lot of weights to choose so sit back and grab a coffee.
from tensorflow.keras.layers import LSTM
model = Sequential()
model.add(Embedding(max_features, 32))
model.add(LSTM(32))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop',
loss='binary_crossentropy',
metrics=['accuracy'])
model.summary()
Model: "sequential_1" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= embedding_1 (Embedding) (None, None, 32) 320000 lstm (LSTM) (None, 32) 8320 dense_1 (Dense) (None, 1) 33 ================================================================= Total params: 328,353 Trainable params: 328,353 Non-trainable params: 0 _________________________________________________________________
history = model.fit(input_train, y_train,
epochs=10,
batch_size=128,
validation_split=0.2)
Epoch 1/10 32/32 [==============================] - 11s 267ms/step - loss: 0.6922 - accuracy: 0.5230 - val_loss: 0.6902 - val_accuracy: 0.5190 Epoch 2/10 32/32 [==============================] - 8s 253ms/step - loss: 0.6821 - accuracy: 0.5888 - val_loss: 0.6430 - val_accuracy: 0.6260 Epoch 3/10 32/32 [==============================] - 7s 233ms/step - loss: 0.5884 - accuracy: 0.7078 - val_loss: 0.6039 - val_accuracy: 0.6520 Epoch 4/10 32/32 [==============================] - 8s 247ms/step - loss: 0.4735 - accuracy: 0.8018 - val_loss: 0.4614 - val_accuracy: 0.7930 Epoch 5/10 32/32 [==============================] - 8s 262ms/step - loss: 0.3876 - accuracy: 0.8490 - val_loss: 0.4180 - val_accuracy: 0.8100 Epoch 6/10 32/32 [==============================] - 8s 259ms/step - loss: 0.3228 - accuracy: 0.8827 - val_loss: 0.4163 - val_accuracy: 0.8080 Epoch 7/10 32/32 [==============================] - 8s 260ms/step - loss: 0.2637 - accuracy: 0.9053 - val_loss: 0.4551 - val_accuracy: 0.8230 Epoch 8/10 32/32 [==============================] - 8s 239ms/step - loss: 0.2358 - accuracy: 0.9172 - val_loss: 0.4270 - val_accuracy: 0.8130 Epoch 9/10 32/32 [==============================] - 7s 233ms/step - loss: 0.1679 - accuracy: 0.9498 - val_loss: 0.4158 - val_accuracy: 0.8130 Epoch 10/10 32/32 [==============================] - 7s 206ms/step - loss: 0.1552 - accuracy: 0.9485 - val_loss: 0.4103 - val_accuracy: 0.8360
As before, let's assess our model by looking at the training versus validation loss:
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)
fig, ax = plt.subplots(1,2,figsize=(15,5))
ax[0].plot(epochs, loss, 'bo', label='Training loss') # "bo" is for "blue dot"
ax[0].plot(epochs, val_loss, 'b', label='Validation loss') # b is for "solid blue line"
ax[0].set_title('Training and validation loss')
ax[0].set_xlabel('Epochs')
ax[0].set_ylabel('Loss')
ax[0].legend(frameon=False)
sns.despine(ax=ax[0])
ax[1].plot(epochs, acc, 'bo', label='Training Accuracy')
ax[1].plot(epochs, val_acc, 'b', label='Validation Accuracy')
ax[1].set_title('Training and Validation Accuracy')
ax[1].set_xlabel('Epochs')
ax[1].set_ylabel('Accuracy')
ax[1].legend(frameon=False)
sns.despine(ax=ax[1])
plt.show()
Practice
1. Try 64 nodes in the hidden layer. Keep everything else the same (Embedding layer + LSTM recurrent layer).
model = Sequential()
model.add(Embedding(max_features, 64))
model.add(LSTM(64))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop',
loss='binary_crossentropy',
metrics=['accuracy'])
model.summary()
history = model.fit(input_train, y_train,
epochs=10,
batch_size=128,
validation_split=0.2)
Model: "sequential_2" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= embedding_2 (Embedding) (None, None, 64) 640000 lstm_1 (LSTM) (None, 64) 33024 dense_2 (Dense) (None, 1) 65 ================================================================= Total params: 673,089 Trainable params: 673,089 Non-trainable params: 0 _________________________________________________________________ Epoch 1/10 32/32 [==============================] - 23s 643ms/step - loss: 0.6922 - accuracy: 0.5303 - val_loss: 0.6893 - val_accuracy: 0.5120 Epoch 2/10 32/32 [==============================] - 20s 631ms/step - loss: 0.6713 - accuracy: 0.6085 - val_loss: 1.0115 - val_accuracy: 0.5120 Epoch 3/10 32/32 [==============================] - 20s 627ms/step - loss: 0.5857 - accuracy: 0.7145 - val_loss: 0.5649 - val_accuracy: 0.7120 Epoch 4/10 32/32 [==============================] - 20s 612ms/step - loss: 0.4667 - accuracy: 0.7935 - val_loss: 0.4482 - val_accuracy: 0.8060 Epoch 5/10 32/32 [==============================] - 20s 617ms/step - loss: 0.4022 - accuracy: 0.8430 - val_loss: 0.4232 - val_accuracy: 0.7990 Epoch 6/10 32/32 [==============================] - 32132s 1037s/step - loss: 0.2866 - accuracy: 0.8923 - val_loss: 0.4792 - val_accuracy: 0.7880 Epoch 7/10 32/32 [==============================] - 26s 801ms/step - loss: 0.2587 - accuracy: 0.9062 - val_loss: 0.5257 - val_accuracy: 0.8000 Epoch 8/10 32/32 [==============================] - 21s 649ms/step - loss: 0.2302 - accuracy: 0.9215 - val_loss: 0.4160 - val_accuracy: 0.8240 Epoch 9/10 32/32 [==============================] - 17s 545ms/step - loss: 0.1690 - accuracy: 0.9448 - val_loss: 0.4932 - val_accuracy: 0.8010 Epoch 10/10 32/32 [==============================] - 17s 528ms/step - loss: 0.1551 - accuracy: 0.9470 - val_loss: 0.5161 - val_accuracy: 0.7780
2. Plot training and validation loss and accuracy as we increase the number of epochs. Compare your results to the case with 32 nodes.
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)
fig, ax = plt.subplots(1,2,figsize=(15,5))
ax[0].plot(epochs, loss, 'bo', label='Training loss') # "bo" is for "blue dot"
ax[0].plot(epochs, val_loss, 'b', label='Validation loss') # b is for "solid blue line"
ax[0].set_title('Training and validation loss')
ax[0].set_xlabel('Epochs')
ax[0].set_ylabel('Loss')
ax[0].legend(frameon=False)
sns.despine(ax=ax[0])
ax[1].plot(epochs, acc, 'bo', label='Training Accuracy')
ax[1].plot(epochs, val_acc, 'b', label='Validation Accuracy')
ax[1].set_title('Training and Validation Accuracy')
ax[1].set_xlabel('Epochs')
ax[1].set_ylabel('Accuracy')
ax[1].legend(frameon=False)
sns.despine(ax=ax[1])
plt.show()
Using "dropout" to fight overfitting¶
It is evident from our training and validation curves that our model is overfitting: the training and validation losses start diverging considerably after a few epochs.
A technique to deal with this is called dropout. It consists of randomly zeroing out input units of a layer during training in order to break up happenstance correlations in the training data that the layer is exposed to.
How to correctly apply dropout in recurrent networks, however, is not a trivial question. It has long been known that applying dropout before a recurrent layer hinders learning rather than helping with regularization. It turns out that the same dropout mask (the same pattern of dropped units) should be applied at every timestep, instead of a dropout mask that varies randomly from timestep to timestep. This mechanism is built directly into Keras recurrent layers. Every recurrent layer in Keras has two dropout-related arguments: dropout, a float specifying the dropout rate for the input units of the layer, and recurrent_dropout, specifying the dropout rate of the recurrent units. Let's add dropout and recurrent dropout to our LSTM layer and see how it impacts overfitting. Because networks being regularized with dropout always take longer to fully converge, we would ideally train our network for twice as many epochs. Ugh.
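A minimal sketch of the masking idea in plain NumPy (this is not how Keras implements it internally, just an illustration): draw one dropout mask per sequence and reuse it at every timestep instead of redrawing it each step.
import numpy as np

rng = np.random.default_rng(0)
timesteps, units, rate = 5, 4, 0.2

# One mask for the whole sequence: each unit is dropped (zeroed) with probability `rate`
mask = (rng.random(units) > rate).astype(float)

recurrent_inputs = rng.normal(size=(timesteps, units))
for t in range(timesteps):
    dropped = recurrent_inputs[t] * mask / (1.0 - rate)   # the SAME units are zeroed at every timestep
    # ... the recurrent update would use `dropped` instead of recurrent_inputs[t] ...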
3. Use 32 nodes but retrain the model by setting dropout and recurrent_dropout to 0.2.
How exactly? Include this when you initialize the model:
model.add(LSTM(32, dropout=0.2, recurrent_dropout=0.2, ...
model = Sequential()
model.add(Embedding(max_features, 32))
model.add(LSTM(32,
dropout=0.2,
recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop',
loss='binary_crossentropy',
metrics=['accuracy'])
model.summary()
history = model.fit(input_train, y_train,
epochs=10,
batch_size=128,
validation_split=0.2)
Model: "sequential_3" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= embedding_3 (Embedding) (None, None, 32) 320000 lstm_2 (LSTM) (None, 32) 8320 dense_3 (Dense) (None, 1) 33 ================================================================= Total params: 328,353 Trainable params: 328,353 Non-trainable params: 0 _________________________________________________________________ Epoch 1/10 32/32 [==============================] - 24s 648ms/step - loss: 0.6928 - accuracy: 0.5077 - val_loss: 0.6916 - val_accuracy: 0.5160 Epoch 2/10 32/32 [==============================] - 21s 663ms/step - loss: 0.6907 - accuracy: 0.5285 - val_loss: 0.6884 - val_accuracy: 0.5480 Epoch 3/10 32/32 [==============================] - 21s 647ms/step - loss: 0.6838 - accuracy: 0.5853 - val_loss: 0.6703 - val_accuracy: 0.6150 Epoch 4/10 32/32 [==============================] - 22s 688ms/step - loss: 0.6211 - accuracy: 0.6830 - val_loss: 0.5797 - val_accuracy: 0.7210 Epoch 5/10 32/32 [==============================] - 29s 908ms/step - loss: 0.5012 - accuracy: 0.7820 - val_loss: 0.4786 - val_accuracy: 0.7970 Epoch 6/10 32/32 [==============================] - 25s 780ms/step - loss: 0.4129 - accuracy: 0.8325 - val_loss: 0.4607 - val_accuracy: 0.7940 Epoch 7/10 32/32 [==============================] - 25s 781ms/step - loss: 0.3284 - accuracy: 0.8773 - val_loss: 0.4805 - val_accuracy: 0.7750 Epoch 8/10 32/32 [==============================] - 25s 784ms/step - loss: 0.2758 - accuracy: 0.9005 - val_loss: 0.5162 - val_accuracy: 0.7760 Epoch 9/10 32/32 [==============================] - 27s 832ms/step - loss: 0.2404 - accuracy: 0.9135 - val_loss: 0.3909 - val_accuracy: 0.8250 Epoch 10/10 32/32 [==============================] - 33s 1s/step - loss: 0.1978 - accuracy: 0.9330 - val_loss: 0.4154 - val_accuracy: 0.8240
4. Plot training and validation loss and accuracy as we increase the number of epochs. Discuss the implications for over-fitting.
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)
fig, ax = plt.subplots(1,2,figsize=(15,5))
ax[0].plot(epochs, loss, 'bo', label='Training loss') # "bo" is for "blue dot"
ax[0].plot(epochs, val_loss, 'b', label='Validation loss') # b is for "solid blue line"
ax[0].set_title('Training and validation loss')
ax[0].set_xlabel('Epochs')
ax[0].set_ylabel('Loss')
ax[0].legend(frameon=False)
sns.despine(ax=ax[0])
ax[1].plot(epochs, acc, 'bo', label='Training Accuracy')
ax[1].plot(epochs, val_acc, 'b', label='Validation Accuracy')
ax[1].set_title('Training and Validation Accuracy')
ax[1].set_xlabel('Epochs')
ax[1].set_ylabel('Accuracy')
ax[1].legend(frameon=False)
sns.despine(ax=ax[1])
plt.show()
Stacking recurrent layers
We can also consider increasing the capacity of our network. If you remember our description of the "universal machine learning workflow," it is generally a good idea to increase the capacity of your network until overfitting becomes your primary obstacle (assuming that you are already taking basic steps to mitigate overfitting, such as using dropout). As long as you are not overfitting too badly, you are likely under capacity.
Increasing network capacity is typically done by increasing the number of units in the layers or adding more layers. Recurrent layer stacking is a classic way to build more powerful recurrent networks: for instance, the model that powered the Google Translate algorithm for several years was a stack of seven large LSTM layers -- that's huge.
To stack recurrent layers on top of each other in Keras, all intermediate layers need to return their full sequence of outputs (a 3D tensor) rather than just their output at the last timestep. This is done by setting return_sequences=True on every recurrent layer except the last one -- here, just the first hidden layer.
5. Retrain the model using stacked LSTM RNNs.
How exactly? Include return_sequences=True inside the first RNN layer when you initialize the model. Note that we're keeping the dropout values.
model.add(LSTM(32, return_sequences=True, dropout=0.2, recurrent_dropout=0.2, ...
model = Sequential()
model.add(Embedding(max_features, 32))
model.add(LSTM(32,
return_sequences=True,
dropout=0.2,
recurrent_dropout=0.2))
model.add(LSTM(32,
dropout=0.2,
recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop',
loss='binary_crossentropy',
metrics=['accuracy'])
model.summary()
history = model.fit(input_train, y_train,
epochs=10,
batch_size=128,
validation_split=0.2)
Model: "sequential_4" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= embedding_4 (Embedding) (None, None, 32) 320000 lstm_3 (LSTM) (None, None, 32) 8320 lstm_4 (LSTM) (None, 32) 8320 dense_4 (Dense) (None, 1) 33 ================================================================= Total params: 336,673 Trainable params: 336,673 Non-trainable params: 0 _________________________________________________________________ Epoch 1/10 32/32 [==============================] - 76s 2s/step - loss: 0.6930 - accuracy: 0.5015 - val_loss: 0.6923 - val_accuracy: 0.5100 Epoch 2/10 32/32 [==============================] - 43s 1s/step - loss: 0.6911 - accuracy: 0.5270 - val_loss: 0.6881 - val_accuracy: 0.5210 Epoch 3/10 32/32 [==============================] - 39s 1s/step - loss: 0.6362 - accuracy: 0.6440 - val_loss: 0.7215 - val_accuracy: 0.5880 Epoch 4/10 32/32 [==============================] - 38s 1s/step - loss: 0.4982 - accuracy: 0.7638 - val_loss: 0.5134 - val_accuracy: 0.7570 Epoch 5/10 32/32 [==============================] - 39s 1s/step - loss: 0.3741 - accuracy: 0.8462 - val_loss: 0.4555 - val_accuracy: 0.8010 Epoch 6/10 32/32 [==============================] - 39s 1s/step - loss: 0.3033 - accuracy: 0.8800 - val_loss: 0.4669 - val_accuracy: 0.7780 Epoch 7/10 32/32 [==============================] - 38s 1s/step - loss: 0.2486 - accuracy: 0.9082 - val_loss: 0.4462 - val_accuracy: 0.8230 Epoch 8/10 32/32 [==============================] - 38s 1s/step - loss: 0.1949 - accuracy: 0.9290 - val_loss: 0.4875 - val_accuracy: 0.8200 Epoch 9/10 32/32 [==============================] - 39s 1s/step - loss: 0.1680 - accuracy: 0.9392 - val_loss: 0.4827 - val_accuracy: 0.8190 Epoch 10/10 32/32 [==============================] - 39s 1s/step - loss: 0.1239 - accuracy: 0.9585 - val_loss: 0.4814 - val_accuracy: 0.8020
6. Plot training and validation loss and accuracy as we increase the number of epochs. Did stacking RNN layers help?
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)
fig, ax = plt.subplots(1,2,figsize=(15,5))
ax[0].plot(epochs, loss, 'bo', label='Training loss') # "bo" is for "blue dot"
ax[0].plot(epochs, val_loss, 'b', label='Validation loss') # b is for "solid blue line"
ax[0].set_title('Training and validation loss')
ax[0].set_xlabel('Epochs')
ax[0].set_ylabel('Loss')
ax[0].legend(frameon=False)
sns.despine(ax=ax[0])
ax[1].plot(epochs, acc, 'bo', label='Training Accuracy')
ax[1].plot(epochs, val_acc, 'b', label='Validation Accuracy')
ax[1].set_title('Training and Validation Accuracy')
ax[1].set_xlabel('Epochs')
ax[1].set_ylabel('Accuracy')
ax[1].legend(frameon=False)
sns.despine(ax=ax[1])
plt.show()
Using bidirectional RNNs
The last technique is called the "bidirectional RNN." A bidirectional RNN is a common RNN variant that can offer higher performance than a regular RNN on certain tasks. It is frequently used in natural language processing -- you could call it the Swiss army knife of deep learning for NLP.
RNNs are notably order-dependent, or time-dependent: they process the timesteps of their input sequences in order, and shuffling or reversing the timesteps can completely change the representations that the RNN extracts from the sequence. This is precisely the reason why they perform well on problems where order is meaningful, such as the GDP forecasting problem we discussed earlier. A bidirectional RNN exploits this order sensitivity: it simply consists of two regular RNNs, such as the LSTM layers you are already familiar with, each processing the input sequence in one direction (chronologically and antichronologically), and then merges their representations. By processing a sequence both ways, a bidirectional RNN obtains potentially richer representations and can catch patterns that may have been overlooked by a one-direction, chronological-order RNN alone.
To instantiate a bidirectional RNN in Keras, you use the Bidirectional layer, which takes a recurrent layer instance as its first argument. Bidirectional creates a second, separate instance of this recurrent layer and uses one instance to process the input sequences in chronological order and the other to process them in reversed order. How very nice of Keras to do that for us.
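To make "two RNNs, one per direction, then merge" concrete, here is a rough functional-API sketch of what the Bidirectional wrapper is doing for us under the hood. The layer sizes are arbitrary, and the wrapper (used in the practice problem below) remains the recommended way to do this.
from tensorflow.keras import Input, Model, layers

inputs = Input(shape=(None,))                       # a batch of integer word sequences
x = layers.Embedding(10000, 32)(inputs)

forward = layers.LSTM(32)(x)                        # reads each review left to right
backward = layers.LSTM(32, go_backwards=True)(x)    # reads each review right to left
merged = layers.concatenate([forward, backward])    # merge the two representations

outputs = layers.Dense(1, activation='sigmoid')(merged)
manual_bidir = Model(inputs, outputs)
manual_bidir.summary()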
7. Train a bi-directional LSTM RNN ML model.
How exactly? Import Bidirectional from Keras and then wrap the RNN layer of your choice (here LSTM) with it:
from tensorflow.keras.layers import Bidirectional
model.add(Bidirectional(LSTM(32, dropout=0.2, recurrent_dropout=0.2)))
from tensorflow.keras.layers import Bidirectional
model = Sequential()
model.add(Embedding(max_features, 32))
model.add(Bidirectional(LSTM(32, dropout=0.2, recurrent_dropout=0.2)))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()
history = model.fit(input_train, y_train, epochs=10, batch_size=128, validation_split=0.2)
Model: "sequential_5" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= embedding_5 (Embedding) (None, None, 32) 320000 bidirectional (Bidirectiona (None, 64) 16640 l) dense_5 (Dense) (None, 1) 65 ================================================================= Total params: 336,705 Trainable params: 336,705 Non-trainable params: 0 _________________________________________________________________ Epoch 1/10 32/32 [==============================] - 66s 2s/step - loss: 0.6925 - accuracy: 0.5190 - val_loss: 0.6912 - val_accuracy: 0.5480 Epoch 2/10 32/32 [==============================] - 56s 2s/step - loss: 0.6889 - accuracy: 0.5663 - val_loss: 0.6814 - val_accuracy: 0.6650 Epoch 3/10 32/32 [==============================] - 56s 2s/step - loss: 0.6454 - accuracy: 0.6587 - val_loss: 0.6173 - val_accuracy: 0.6620 Epoch 4/10 32/32 [==============================] - 84s 3s/step - loss: 0.5473 - accuracy: 0.7495 - val_loss: 0.5062 - val_accuracy: 0.7980 Epoch 5/10 32/32 [==============================] - 156s 5s/step - loss: 0.4625 - accuracy: 0.8158 - val_loss: 0.5371 - val_accuracy: 0.7500 Epoch 6/10 32/32 [==============================] - 184s 6s/step - loss: 0.3778 - accuracy: 0.8675 - val_loss: 0.5584 - val_accuracy: 0.7120 Epoch 7/10 32/32 [==============================] - 369s 12s/step - loss: 0.3203 - accuracy: 0.8830 - val_loss: 0.4129 - val_accuracy: 0.8190 Epoch 8/10 32/32 [==============================] - 56s 2s/step - loss: 0.2832 - accuracy: 0.8957 - val_loss: 0.3863 - val_accuracy: 0.8430 Epoch 9/10 32/32 [==============================] - 45s 1s/step - loss: 0.2229 - accuracy: 0.9205 - val_loss: 0.4900 - val_accuracy: 0.8130 Epoch 10/10 32/32 [==============================] - 44s 1s/step - loss: 0.2022 - accuracy: 0.9287 - val_loss: 0.4983 - val_accuracy: 0.8180
8. Plot training and validation loss and accuracy as we increase the number of epochs. Interpret your results by comparing them to the regular chronological order.
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)
fig, ax = plt.subplots(1,2,figsize=(15,5))
ax[0].plot(epochs, loss, 'bo', label='Training loss') # "bo" is for "blue dot"
ax[0].plot(epochs, val_loss, 'b', label='Validation loss') # b is for "solid blue line"
ax[0].set_title('Training and validation loss')
ax[0].set_xlabel('Epochs')
ax[0].set_ylabel('Loss')
ax[0].legend(frameon=False)
sns.despine(ax=ax[0])
ax[1].plot(epochs, acc, 'bo', label='Training Accuracy')
ax[1].plot(epochs, val_acc, 'b', label='Validation Accuracy')
ax[1].set_title('Training and Validation Accuracy')
ax[1].set_xlabel('Epochs')
ax[1].set_ylabel('Accuracy')
ax[1].legend(frameon=False)
sns.despine(ax=ax[1])
plt.show()
6. Improving Performance
At this stage, there are still many other things you could try in order to improve performance:
- Adjust the learning rate used by our RMSprop optimizer (see the code sketch after this list). This is the most important parameter as it governs convergence as we change the number of epochs. Its value also varies across problems depending on how nonlinear the true DGP is.
- Adjust the number of units in each recurrent layer in the stacked setup. Our current choices are largely arbitrary and thus likely suboptimal.
- Try the Adam optimizer (also shown in the sketch below). Adam optimization is a stochastic gradient descent method that is based on adaptive estimation of first-order and second-order moments. According to Kingma et al. (2014), the method is "computationally efficient, has little memory requirement, invariant to diagonal rescaling of gradients, and is well suited for problems that are large in terms of data/parameters."
- Try using a bigger densely connected classifier on top of the recurrent layers, i.e. a bigger Dense layer or even a stack of Dense layers.
- Don't forget to eventually run the best performing models on the test set. Otherwise, you'll start developing architectures that are overfitting to the validation set.
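As promised above, a quick sketch of how to adjust the learning rate or swap in Adam. The learning-rate values are arbitrary starting points, not recommendations, and model refers to whichever Keras model you are currently tuning.
from tensorflow.keras.optimizers import RMSprop, Adam

# Lower the RMSprop learning rate from its default of 0.001...
model.compile(optimizer=RMSprop(learning_rate=1e-4),
              loss='binary_crossentropy',
              metrics=['accuracy'])

# ...or swap in the Adam optimizer instead
model.compile(optimizer=Adam(learning_rate=1e-3),
              loss='binary_crossentropy',
              metrics=['accuracy'])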
Deep learning is part art and part science. While we can provide guidelines as to what is likely to work or not work on a given problem, ultimately every problem is unique and you will have to try and evaluate different strategies empirically. There is currently no theory that will tell you in advance precisely what you should do to optimally solve a problem. You must try and iterate. For a more systematic way of tuning DNNs, see the supporting lecture notebook not-so-cleverly-titled "TuningDNNs.ipynb".
7. Summary
We covered a lot of stuff in this lecture. Here's what you should take away:
- Try simple models before expensive ones to justify the additional expense. Sometimes a simple model will turn out to be your best option. Plus, the simple model will almost always give you some good intuition as to how the data behaves.
- Recurrent networks are a great fit for data where order matters and, when properly tuned, tend to outperform models that throw away the sequence information.
- Dropout is a useful trick to deal with overfitting in recurrent networks and is built into Keras recurrent layers, so all you have to do is use the dropout and recurrent_dropout arguments of recurrent layers.
- Stacked RNNs provide more representational power than a single RNN layer. They are also much more expensive computationally and may not be worth the hassle. While they offer clear gains on complex problems (e.g. machine translation), they might not always be relevant to smaller, simpler problems.
- Bidirectional RNNs look at a sequence in both directions and are very useful on natural language processing problems. If you already know how the sequence ordering matters, they are likely overkill; instead, build that structure into the data pre-processing step.