Recurrent Neural Networks (RNNs) are a class of Artificial Neural Networks in which connections between units form a directed graph along a sequence. The term “recurrent neural network” is used indiscriminately to refer to two broad classes of networks with a similar general structure, where one is finite impulse and the other is infinite impulse. Both classes of networks exhibit temporal dynamic behaviour. A finite impulse recurrent network is a directed acyclic graph that can be unrolled and replaced with a strictly feedforward neural network, while an infinite impulse recurrent network is a directed cyclic graph that cannot be unrolled.
Both finite impulse and infinite impulse recurrent networks can have additional stored state, and the storage can be under direct control by the neural network. Such controlled states are referred to as gated state or gated memory, and are part of long short-term memory networks (LSTMs) and gated recurrent units (GRUs). Unlike Feed-Forward Neural Networks, as discussed in our last blog article, RNNs can use their internal state (memory) to process a sequence of inputs. This makes them applicable to more complex tasks in our modern-day world, such as unsegmented, connected handwriting recognition, speech recognition and Natural Language Processing, which we shall cover in detail in this article!
What are RNNs?
Humans don’t start their thinking from scratch every second. As you read this essay, you understand each word based on your understanding of previous words. You don’t throw everything away and start thinking from scratch again. Your thoughts have persistence.
Traditional neural networks can’t do this, and it seems like a major shortcoming. For example, imagine you want to classify what kind of event is happening at every point in a movie. It’s unclear how a traditional neural network could use its reasoning about previous events in the film to inform later ones. Recurrent neural networks address this issue. They are networks with loops in them, allowing information to persist.
Let us consider the following example: “working love learning we on deep”. Did this make any sense to you? Not really. Now read this one: “We love working on deep learning”. Made perfect sense! A little jumble in the words made the sentence incoherent. Well, can we expect a neural network to make sense out of it? Not really! If the human brain is confused about what it means, a neural network is going to have a tough time deciphering such text too.
There are multiple such tasks in everyday life which get completely disrupted when their sequence is disturbed. For instance: language, as we saw earlier, where the sequence of words defines their meaning; time series data, where time defines the occurrence of events; and genome sequence data, where every sequence has a different meaning. There are multiple such cases wherein the sequence of information determines the event itself. If we are trying to use such data for any reasonable output, we need a network which has access to some prior knowledge about the data to completely understand it. This is where Recurrent Neural Networks come into play.
Recurrent Neural Networks achieve this by taking as their input not just the current input example they see, but also what they perceived one or more steps back in time. Those previous time steps provide context for the current time step’s data. So RNNs analyse a representation of the current data, as well as a representation of its history, and combine those to make their decision about that data.
Recurrent Neural Networks were created in the 1980s but have only recently gained popularity, thanks to advances in network design and cheaper computational power from graphics processing units (GPUs). They’re especially useful with sequential data because each neuron or unit can use its internal memory to maintain information about the previous input. This matters in language, where “I had washed my house” means something quite different from “I had my house washed”, even though the words are the same. Keeping track of word order allows the network to gain a deeper understanding of the statement. This is important to note because when reading through a sentence, even as a human, you’re picking up the context of each word from the words before it.
Similar to the way human brains pick up words in sequence and understand sentences, an RNN has loops in it that allow information to be carried across neurons while reading in inputs, as shown below:
In these diagrams x_t is some input, A is a part of the RNN and h_t is the output. Essentially, we can feed in words from a sentence, or even characters from a string, as x_t, and the RNN will come up with an h_t for each of them.
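To make the loop concrete, here is a minimal sketch of a plain ("vanilla") recurrent step in pure Python. It assumes a tanh activation and small hand-picked weights; the names `rnn_step`, `run_rnn`, `W_xh`, `W_hh` and `b_h` are illustrative choices of ours and do not come from the diagrams.

```python
import math

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One pass through block A: h_t = tanh(W_xh * x_t + W_hh * h_prev + b_h)."""
    h_t = []
    for i in range(len(b_h)):
        total = b_h[i]
        total += sum(W_xh[i][j] * x_t[j] for j in range(len(x_t)))
        total += sum(W_hh[i][k] * h_prev[k] for k in range(len(h_prev)))
        h_t.append(math.tanh(total))
    return h_t

def run_rnn(inputs, W_xh, W_hh, b_h):
    """Feed the inputs in one at a time, carrying the hidden state forward."""
    h = [0.0] * len(b_h)          # assumed starting state: all zeros
    outputs = []
    for x_t in inputs:
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)
        outputs.append(h)
    return outputs

# Hypothetical weights for 2 input features and 2 hidden units.
W_xh = [[0.5, -0.3], [0.1, 0.8]]
W_hh = [[0.2, 0.0], [-0.4, 0.3]]
b_h = [0.0, 0.0]

a = [1.0, 0.0]
b = [0.0, 1.0]
h_ab = run_rnn([a, b], W_xh, W_hh, b_h)   # sequence "a then b"
h_ba = run_rnn([b, a], W_xh, W_hh, b_h)   # same items, reversed order
print(h_ab[-1], h_ba[-1])
```

The two runs see exactly the same items, yet end in different states, because each step also receives the state left behind by the step before it.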
The goal here is to use h_t as the output and compare it to the target values in your training data, which gives us our error rate. (A test set, usually a small held-out subset of the original data, is kept aside to evaluate the trained network.) With the error rate in hand, you can use a technique called Backpropagation Through Time (BPTT). BPTT works backwards through the network and adjusts the weights based on the error. This adjusts the network and makes it learn to do better. This concept shall be covered in further detail in our next section, where we shall break down how RNNs work.
How do RNNs work?
In order to discuss and differentiate the working mechanism of Recurrent Neural Nets from other forms of traditional Neural Networks, we shall first carry out a comparison with Feed-Forward Neural Nets, which were covered in depth in our last blog article. RNNs and Feed-Forward Neural Networks are both named after the way they channel information.
In a Feed-Forward Neural Network, as discussed earlier, the information only moves in one direction, from the input layer, through the hidden layers, to the output layer. The information moves straight through the network. For this reason, the information never touches any node within the network twice. As we saw earlier, Feed-Forward Neural Networks have no memory of the input they received previously and are therefore bad at predicting what’s coming next. Because a Feed-Forward network only considers the current input, it has no notion of order in time. It simply can’t remember anything about what happened in the past, except its training.
In an RNN, the information cycles through a loop. When it makes a decision, it takes into consideration the current input and also what it has learned from the inputs it received previously. The two images below illustrate the difference in the information flow between an RNN and a Feed-Forward Neural Network.
Therefore, a Recurrent Neural Network has two inputs, the present and the recent past. This is important because the sequence of data contains crucial information about what is coming next, which is why an RNN can do things other algorithms can’t.
A Feed-Forward Neural Network, like all other Deep Learning algorithms, assigns a weight matrix to its inputs and then produces the output. Note that RNNs apply weights to the current and also to the previous input. Furthermore, they tweak both sets of weights through gradient descent and Backpropagation Through Time. Also, we must note that while Feed-Forward Neural Networks map one input to one output, RNNs can map one to many, many to many (translation) and many to one (classifying a voice) as shown below.
Backpropagation Through Time (BPTT)
To understand the concept of Backpropagation Through Time, one must understand the concepts of Forward and Back-Propagation first. In Neural Networks, one needs Forward-Propagation to get the output of a particular model and check whether this output is correct or incorrect, which gives the error.
Next comes Backward-Propagation, which is nothing but going backwards through the Neural Network to find the partial derivatives of the error with respect to the weights. Those derivatives are then used by Gradient Descent, an algorithm that iteratively minimizes a given function: it subtracts a small multiple of each derivative from the corresponding weight, adjusting the weights up or down depending on which decreases the error. That is exactly how a Neural Network learns during the training process. So, with Backpropagation one basically tries to tweak the weights of the model while training.
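As a rough illustration of that cycle (forward pass, error, derivative, weight update), here is a sketch on the smallest possible model: a single weight `w` with prediction `w * x`. The squared-error measure, the learning rate of 0.1 and the data point are assumptions chosen only to keep the arithmetic easy to follow.

```python
def forward(w, x):
    """Forward-Propagation: compute the model's output."""
    return w * x

def squared_error(y_pred, y_true):
    return 0.5 * (y_pred - y_true) ** 2

def grad_wrt_w(w, x, y_true):
    """Backward-Propagation for this model: d/dw of 0.5 * (w*x - y)^2 is (w*x - y) * x."""
    return (forward(w, x) - y_true) * x

w = 0.0
x, y_true = 2.0, 6.0          # the ideal weight for this example is 3.0
learning_rate = 0.1           # assumed step size
errors = []
for _ in range(50):
    errors.append(squared_error(forward(w, x), y_true))
    w -= learning_rate * grad_wrt_w(w, x, y_true)   # Gradient Descent update

print(w, errors[0], errors[-1])
```

Each pass shrinks the error, and the weight settles close to 3.0, the value that maps the input 2.0 onto the target 6.0.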
Backpropagation Through Time (BPTT) is basically just a fancy buzzword for doing Backpropagation on an unrolled Recurrent Neural Network. Unrolling is a visualization and conceptual tool which helps us understand what’s going on within the network. Most of the time, when we implement a Recurrent Neural Network in the common programming frameworks, they automatically take care of the Backpropagation, but you still need to understand how it works so that you can troubleshoot problems that come up during the development process.
We can view an RNN as a sequence of Neural Networks that we can train one after another with Backpropagation. The image below refers to our first image of an unrolled RNN, which we shall use once more to explain Backpropagation visually. On the left, you can see the RNN, which is unrolled after the equal sign. Note that there is no cycle after the equal sign since the different timesteps are visualized and information gets passed from one timestep to the next. This illustration also shows why an RNN can be seen as a sequence of Neural Networks.
Backpropagation Through Time requires the conceptualization of unrolling, since the error of a given timestep depends on the previous timestep. Within BPTT, the error is back-propagated from the last to the first timestep, while unrolling all the timesteps. This allows calculating the error for each timestep, which in turn allows updating the weights. Note that BPTT can be computationally expensive when we have a high number of timesteps.
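The backward sweep from the last to the first timestep can be sketched on a deliberately tiny RNN with a single hidden unit. Everything here is an assumption made for illustration: the scalar model h_t = tanh(w_x * x_t + w_h * h_{t-1}), a squared error at every timestep, and the sample inputs and targets. The finite-difference check at the end is only there to confirm that the backward pass computes the right derivatives.

```python
import math

def forward(xs, w_x, w_h):
    """Unrolled forward pass with h_0 = 0; hs[t + 1] is the state after input t."""
    hs = [0.0]
    for x_t in xs:
        hs.append(math.tanh(w_x * x_t + w_h * hs[-1]))
    return hs

def total_error(xs, ys, w_x, w_h):
    hs = forward(xs, w_x, w_h)
    return sum(0.5 * (h - y) ** 2 for h, y in zip(hs[1:], ys))

def bptt(xs, ys, w_x, w_h):
    """Walk from the last timestep to the first, summing weight gradients."""
    hs = forward(xs, w_x, w_h)
    grad_w_x = grad_w_h = 0.0
    carried = 0.0                      # error flowing back from later timesteps
    for t in reversed(range(len(xs))):
        h_t, h_prev = hs[t + 1], hs[t]
        d_h = (h_t - ys[t]) + carried  # local error plus error from the future
        d_pre = d_h * (1.0 - h_t ** 2) # back through tanh
        grad_w_x += d_pre * xs[t]      # the same weights are shared by every step
        grad_w_h += d_pre * h_prev
        carried = d_pre * w_h          # pass the error on to timestep t - 1
    return grad_w_x, grad_w_h

xs = [0.5, -1.0, 0.25, 0.8]
ys = [0.1, -0.3, 0.0, 0.4]
w_x, w_h = 0.7, -0.4
grad_w_x, grad_w_h = bptt(xs, ys, w_x, w_h)

# Sanity check against central finite differences.
eps = 1e-6
numeric_w_x = (total_error(xs, ys, w_x + eps, w_h)
               - total_error(xs, ys, w_x - eps, w_h)) / (2 * eps)
numeric_w_h = (total_error(xs, ys, w_x, w_h + eps)
               - total_error(xs, ys, w_x, w_h - eps)) / (2 * eps)
print(grad_w_x, numeric_w_x)
print(grad_w_h, numeric_w_h)
```

The loop body runs once per timestep, which is why the cost of BPTT grows with the length of the sequence.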
There are a few major obstacles that RNNs have had to deal with. But in order to understand them, we first need to understand what a gradient is. A gradient is the collection of partial derivatives of a function with respect to its inputs. Let us think of it like this: a gradient measures how much the output of a function changes if we change the inputs a little bit. We can also think of a gradient as the slope of a function. The higher the gradient, the steeper the slope and the faster a model can learn. But if the slope is zero, the model stops learning. In training, the gradient measures the change in error with regard to a change in each of the weights.
There are two major problems that had to be tackled efficiently with respect to Gradient Descent before RNNs could make significant breakthroughs in modern technology. These are namely: Exploding Gradients and Vanishing Gradients.
Let us first discuss the problem of Exploding Gradients. An error gradient, as discussed, is the direction and magnitude calculated during the training of a neural network that is used to update the network weights in the right direction and by the right amount.
In deep networks or specifically in Recurrent Neural Networks, error gradients can accumulate during an update and result in very large gradients. These in turn result in large updates to the network weights, and in turn, an unstable network. At an extreme, the values of weights can become so large as to overflow and result in NaN (Not a Number) values. Thus, the explosion occurs through exponential growth by repeatedly multiplying gradients through the network layers that have values larger than 1.0.
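The exponential growth is easy to reproduce with plain floating-point arithmetic. The sketch below assumes, purely for illustration, that every step multiplies the back-propagated signal by the same factor of 1.5; in a real network the factors vary from step to step.

```python
import math

def backpropagated_scale(factor, steps):
    """Multiply a unit-sized gradient by the same factor once per step."""
    scale = 1.0
    for _ in range(steps):
        scale *= factor
    return scale

for steps in (10, 100, 2000):
    print(steps, backpropagated_scale(1.5, steps))

overflowed = backpropagated_scale(1.5, 2000)   # too large for a float: becomes inf
not_a_number = overflowed - overflowed         # inf - inf is NaN
print(math.isinf(overflowed), math.isnan(not_a_number))
```

Once a value has overflowed to infinity, ordinary arithmetic on it produces NaN, which is how NaN values end up in the weights.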
In Recurrent Neural Networks, Exploding Gradients can result in an unstable network that is unable to learn from training data, or at best a network that cannot learn over long input sequences. This particularly happens when the algorithm assigns an unreasonably high importance to the weights. Fortunately, this problem can be handled fairly easily by truncating or squashing the gradients, a technique commonly known as gradient clipping.
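One common way to squash the gradients is to rescale them whenever their overall length (norm) exceeds a threshold. The helper below is a minimal sketch of that idea; the name `clip_by_norm` and the threshold of 1.0 are our own choices for illustration.

```python
import math

def clip_by_norm(grads, max_norm):
    """Rescale the gradient vector so that its length never exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)             # small gradients pass through untouched
    scale = max_norm / norm
    return [g * scale for g in grads]  # same direction, shorter length

print(clip_by_norm([3.0, 4.0], 1.0))   # length 5 is scaled down to length 1
print(clip_by_norm([0.3, 0.4], 1.0))   # length 0.5 is left as it is
```

Because the whole vector is scaled by one number, the update still points in the same direction; only its size is limited.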
On the other hand, the vanishing gradient problem is a difficulty found in training Artificial Neural Networks with gradient-based learning methods and backpropagation. In such methods, each of the Neural Network’s weights receives an update proportional to the partial derivative of the error function with respect to the current weight in each iteration of training. The problem is that in some cases, the gradient will be vanishingly small, effectively preventing the weight from changing its value. In the worst case, this may completely stop the neural network from further training. As one example of the cause, traditional activation functions such as the hyperbolic tangent function have derivatives in the range (0, 1], and backpropagation computes gradients by the chain rule. This has the effect of multiplying n of these small numbers to compute gradients of the “front” layers in an n-layer network, meaning that the gradient (error signal) decreases exponentially with n and the front layers train very slowly.
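The chain-rule product can be watched shrinking in a small experiment. This sketch assumes a chain of n single-unit tanh layers that all share one weight of 0.9 and start from an input of 0.5; both numbers are hypothetical and were picked only so that the effect is visible.

```python
import math

def front_layer_gradient(n_layers, weight=0.9, x=0.5):
    """Derivative of the last layer's output with respect to the first input."""
    h = x
    gradient = 1.0
    for _ in range(n_layers):
        h = math.tanh(weight * h)
        gradient *= weight * (1.0 - h ** 2)   # chain rule: one factor per layer
    return gradient

for n in (1, 10, 50, 100):
    print(n, front_layer_gradient(n))
```

Every factor in the product is below 1 here, so the gradient reaching the front of the chain gets smaller with each added layer.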
Thus, we speak of Vanishing Gradients when the values of a gradient are too small and the model stops learning or takes way too long because of that. This was a major problem in the 1990s and much harder to solve than exploding gradients. Fortunately, it was addressed through the concept of LSTM (Long Short-Term Memory) by Sepp Hochreiter and Juergen Schmidhuber, which we will discuss further. Hochreiter’s diploma thesis of 1991 formally identified the reason for this failure in the “vanishing gradient problem”, which affects not only many-layered feedforward networks but also Recurrent Neural Networks. RNNs are trained by unfolding them into very deep Feed-Forward networks, where a new layer is created for each time step of an input sequence processed by the network.
Long Short-Term Memory (LSTM)
Long Short-Term Memory (LSTM) networks are an extension of Recurrent Neural Networks which basically extends their memory. Therefore, they are well suited to learn from important experiences that have very long time lags in between. The LSTM is a particular type of Recurrent Neural Network that works slightly better in practice, owing to its more powerful update equation and some appealing backpropagation dynamics.
Adding a Long Short-Term Memory (LSTM) unit inside the Neural Network is like adding a memory unit that can remember context from the very beginning of the input. These little memory units allow RNNs to be much more accurate, and are the reason for the recent popularity of this model, because they allow context to be remembered across inputs. Two kinds of these units are widely used today: LSTMs and Gated Recurrent Units (GRUs), as shown below.
The units of an LSTM are used as building units for the layers of an RNN, which is then often called an LSTM network. LSTMs enable RNNs to remember their inputs over a long period of time. This is because LSTMs keep their information in a memory that is much like the memory of a computer: the LSTM can read, write and delete information from its memory.
This memory can be seen as a gated cell, where gated means that the cell decides whether or not to store or delete information (i.e. whether it opens the gates or not), based on the importance it assigns to the information. The assigning of importance happens through weights, which are also learned by the algorithm. This simply means that it learns over time which information is important and which is not.
An LSTM has three main gates: input gate, forget gate and output gate. These gates determine whether to let the new input in (input gate), delete information because it isn’t important (forget gate) or let it impact the output at the current time step (output gate). This is illustrated in the diagram below.
The gates in an LSTM are analogue, in the form of sigmoids, meaning that they range from 0 to 1. The fact that they are analogue makes them differentiable, which is what allows backpropagation to work through them. LSTMs largely mitigate the vanishing gradient problem because they keep the gradients steep enough, which keeps the training process relatively short while increasing the accuracy and the probability of success of the given task.
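To show how the three gates interact, here is a sketch of a single LSTM step with one memory cell, following the commonly used formulation (sigmoid gates, a tanh candidate value, and a cell state updated as forget * old + input * candidate). The parameter layout and all numbers are hypothetical; the biases are set so that the input gate stays almost shut and the forget gate stays almost fully open.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step for scalar input, hidden state and cell state."""
    def pre(name):
        w_x, w_h, b = params[name]
        return w_x * x_t + w_h * h_prev + b

    i = sigmoid(pre("input"))        # how much new information to write
    f = sigmoid(pre("forget"))       # how much of the old memory to keep
    o = sigmoid(pre("output"))       # how much of the memory to expose
    g = math.tanh(pre("candidate"))  # the new information itself
    c_t = f * c_prev + i * g         # update the memory cell
    h_t = o * math.tanh(c_t)         # output at this time step
    return h_t, c_t, (i, f, o)

# Hypothetical parameters as (weight on x, weight on h, bias).
params = {
    "input": (0.5, 0.5, -10.0),      # gate nearly closed: ignore new input
    "forget": (0.5, 0.5, 10.0),      # gate nearly open: keep the old memory
    "output": (0.5, 0.5, 0.0),
    "candidate": (0.5, 0.5, 0.0),
}

h, c = 0.0, 0.8                      # start with 0.8 stored in the cell
for x_t in [1.0, -1.0, 0.5, -0.5] * 5:
    h, c, gates = lstm_step(x_t, h, c, params)
print(c, gates)
```

After twenty steps of changing input, the cell still holds a value very close to the 0.8 it started with, which is the long-lasting memory described above. With different (learned) weights the same cell would overwrite or erase its contents instead.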
RNNs Powered by LSTMs
Now we shall elaborate on how RNNs work effectively in conjunction with LSTMs to solve problems of Natural Language Processing (NLP) or Optical Character Recognition (OCR), as discussed below. We shall take an in-depth look at what happens when we give an RNN powered by LSTMs a huge chunk of text and ask it to model the probability distribution of the next character in the sequence given a sequence of previous characters. This will then allow us to generate new text one character at a time, covering both the domains of NLP and OCR.
As a working example, suppose we only had a vocabulary of four possible letters “helo”, and wanted to train an RNN on the training sequence “hello”. This training sequence is in fact a source of 4 separate training examples:
- “e” should be likely given the context of “h”
- “l” should be likely in the context of “he”
- “l” should also be likely given the context of “hel”
- “o” should be likely given the context of “hell”
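The four examples listed above can be produced mechanically by pairing every prefix of the training sequence with the character that follows it. The helper name `next_char_examples` is ours and is used only for this sketch.

```python
def next_char_examples(text):
    """Pair each context (prefix) with the character that should come next."""
    return [(text[:i], text[i]) for i in range(1, len(text))]

examples = next_char_examples("hello")
for context, target in examples:
    print(repr(context), "->", repr(target))
```

A sequence of five characters therefore yields four (context, target) pairs, one for each position that has a successor.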
Concretely, we will encode each character into a vector using 1-of-k encoding (i.e. all zero except for a single one at the index of the character in the vocabulary), and feed them into the RNN one at a time with the help of a step function. We will then observe a sequence of 4-dimensional output vectors (one dimension per character), which we interpret as the confidence the RNN currently assigns to each character coming next in the sequence. Here’s a diagram for the above written explanation:
For example, we see that in the first time step, when the RNN saw the character “h”, it assigned a confidence of 1.0 to the next letter being “h”, 2.2 to letter “e”, -3.0 to “l”, and 4.1 to “o”. Since in our training data (the string “hello”) the next correct character is “e”, we would like to increase its confidence (green) and decrease the confidence of all other letters (red). Similarly, we have a desired target character at every one of the 4 time steps that we’d like the network to assign a greater confidence to.
Since the RNN consists entirely of differentiable operations we can run the back-propagation algorithm (this is just a recursive application of the chain rule from calculus) to figure out in what direction we should adjust every one of its weights to increase the scores of the correct targets (green bold numbers).
We can then perform a parameter update, which nudges every weight a tiny amount in this gradient direction. If we were to feed the same inputs to the RNN after the parameter update we would find that the scores of the correct characters (e.g. “e” in the first time step) would be slightly higher (e.g. 2.3 instead of 2.2), and the scores of incorrect characters would be slightly lower.
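The direction of that nudge can be illustrated with the four scores from the first time step. The sketch assumes that the scores are turned into probabilities with a softmax and trained with a cross-entropy loss, which the text does not state explicitly. It also applies the update directly to the scores with a hypothetical step size of 0.1, whereas real training updates the weights that produce them.

```python
import math

VOCAB = "helo"

def one_hot(ch):
    """1-of-k encoding: a single 1.0 at the character's index in the vocabulary."""
    return [1.0 if c == ch else 0.0 for c in VOCAB]

def softmax(scores):
    biggest = max(scores)                        # subtract the max for stability
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [1.0, 2.2, -3.0, 4.1]                   # confidences for h, e, l, o
target = one_hot("e")                            # the correct next character
probs = softmax(scores)

# For softmax with cross-entropy, the gradient on the scores is probs - target.
grad = [p - y for p, y in zip(probs, target)]

step_size = 0.1                                  # assumed, for illustration only
new_scores = [s - step_size * g for s, g in zip(scores, grad)]
for ch, old, new in zip(VOCAB, scores, new_scores):
    print(ch, old, round(new, 4))
```

The gradient is negative only for the correct character, so stepping against it raises the score of “e” and lowers the other three.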
We then repeat this process over and over many times until the network converges and its predictions are eventually consistent with the training data such that correct characters are always predicted next. Notice also that the first time the character “l” is input, the target is “l”, but the second time the target is “o”. The RNN therefore cannot rely on the input alone and must use its recurrent connection to keep track of the context to achieve this task.
So, at the time of testing, we simply feed a character into the RNN and get a distribution over what characters are likely to come next. We sample from this distribution, and feed the sampled character right back in to get the next letter. Repeat this process and voila, we are ready to apply our LSTM-powered RNN to sampling text one character at a time!
So, with that in-depth discussion we wrap up this week’s article, having given our readers a comprehensive understanding of what an RNN is and how it works to enhance the process of deep machine learning and solve problems in our daily lives. We specifically established the concept of RNNs by showing their distinguishing features in contrast with a Feed-Forward Neural Network, while explaining when we should use a Recurrent Neural Network, how Backpropagation and Backpropagation Through Time work, what the most pertinent problems in RNNs are and how RNNs combine with LSTMs to overcome these problems efficiently!
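The sample-and-feed-back loop looks roughly like the sketch below. Since no trained network is available here, `stand_in_model` is a hypothetical hand-written substitute that returns next-character probabilities for the “hello” example; it receives the text generated so far, which plays the role of the RNN’s hidden state.

```python
import random

VOCAB = "helo"

def stand_in_model(context):
    """Hypothetical substitute for a trained RNN: next-character probabilities."""
    favourite = {"h": "e", "he": "l", "hel": "l", "hell": "o"}.get(context, "h")
    return [0.91 if ch == favourite else 0.03 for ch in VOCAB]

def sample_text(seed_char, length, rng):
    text = seed_char
    while len(text) < length:
        probs = stand_in_model(text)                   # distribution over next chars
        next_char = rng.choices(VOCAB, weights=probs, k=1)[0]
        text += next_char                              # feed the sample back in
    return text

rng = random.Random(0)                                 # fixed seed for repeatability
generated = sample_text("h", 5, rng)
print(generated)
```

Because each character is drawn at random from the distribution rather than always taking the most likely one, repeated runs with different seeds can produce different strings.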
However, this is just a glimpse into the world of endless possibilities that opened up with the advent of RNNs and LSTMs! So, stay tuned for our next article where we shall reveal more about this fascinating new world to you!!