Back propagation of error: an example
We
will now show an example of a backprop network as it learns to model the highly
nonlinear data we encountered before.
The
left hand panel shows the data to be modeled. The right hand panel shows a
network with two hidden units, each with a tanh nonlinear activation function.
The output unit computes a linear combination of the two functions
y(x) = a·u1(x) + b·u2(x) + c    (1)
where
u1(x) = tanh(d·x + e)    (2)
and
u2(x) = tanh(f·x + g)    (3)
To
begin with, we set the weights, a..g, to random initial values in the range
[-1,1]. Each hidden unit is thus computing a random tanh function. The next
figure shows the initial two activation functions and the output of the
network, which is their sum plus a negative constant. (If you have difficulty
making out the line types, the top two curves are the tanh functions, the one
at the bottom is the network output).
We
now train the network (learning rate 0.3), updating the weights after each
pattern (online learning). After we have been through the entire dataset 10
times (10 training epochs), the functions computed look like this (the output
is the middle curve):
After 20 epochs, we have (output is
the humpbacked curve):
and after 27 epochs we have a pretty
good fit to the data:
As
the activation functions are stretched, scaled and shifted by the changing
weights, we hope that the error of the model is dropping. In the next figure we
plot the total sum squared error over all 88 patterns of the data as a function
of training epoch. Four training runs are shown, with different weight
initialization each time:
You
can see that the path to the solution differs each time, both because we start
from a different point in weight space, and because the order in which patterns
are presented is random. Nonetheless, all training curves go down
monotonically, and all reach about the same level of overall error.
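For concreteness, here is a minimal runnable sketch of this experiment in Python. The data are a synthetic stand-in for the nonlinear data set (which is not reproduced here), and the assignment of the weights a..g to the two hidden units follows equations (1)-(3) above; only the two-hidden-unit architecture, the online updates, and the learning rate of 0.3 are taken from the text.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the 88-pattern nonlinear data set described above.
X = np.linspace(-3, 3, 88)
T = np.tanh(2 * X + 1) + np.tanh(2 - X) - 0.5 + rng.normal(0, 0.1, size=X.shape)

w = rng.uniform(-1, 1, 7)   # the seven weights a..g, random in [-1, 1]
mu = 0.3                    # learning rate used in the text

for epoch in range(27):
    order = rng.permutation(len(X))          # online learning, random pattern order
    for x, t in zip(X[order], T[order]):
        a, b, c, d, e, f, g = w
        u1 = np.tanh(d * x + e)              # hidden unit 1, eq. (2)
        u2 = np.tanh(f * x + g)              # hidden unit 2, eq. (3)
        y = a * u1 + b * u2 + c              # output, eq. (1)
        err = t - y
        # gradient of 0.5 * err**2 with respect to each of a..g
        grad = np.array([-err * u1,
                         -err * u2,
                         -err,
                         -err * a * (1 - u1**2) * x,
                         -err * a * (1 - u1**2),
                         -err * b * (1 - u2**2) * x,
                         -err * b * (1 - u2**2)])
        w -= mu * grad                       # update after every pattern
    a, b, c, d, e, f, g = w
    sse = np.sum((T - (a * np.tanh(d * X + e) + b * np.tanh(f * X + g) + c)) ** 2)
    print(epoch + 1, sse)                    # total sum squared error per epoch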
Overfitting
In
the previous example we used a network with two hidden units. Just by looking
at the data, it was possible to guess that two tanh functions would do a pretty
good job of fitting the data. In general, however, we may not know how many
hidden units, or equivalently, how many weights, we will need to produce a
reasonable approximation to the data. Furthermore, we usually seek a model of
the data which will give us, on average, the best possible predictions for
novel data. This goal can conflict with the simpler task of modelling a
specific training set well. In this section we will look at some techniques for
preventing our model becoming too powerful (overfitting). In the next, we
address the related question of selecting an appropriate architecture with just
the right amount of trainable parameters.
Bias-Variance trade-off
Consider the two fitted functions
below. The data points (circles) have all been generated from a smooth
function, h(x), with some added noise. Obviously, we want to end up with
a model which approximates h(x), given a specific set of data y(x)
generated as:
y(x) = h(x) + ε    (1)
In
the left hand panel we try to fit the points using a function g(x) which
has too few parameters: a straight line. The model has the virtue of being
simple; there are only two free parameters. However, it does not do a good job
of fitting the data, and would not do well in predicting new data points. We
say that the simpler model has a high bias.
The
right hand panel shows a model which has been fitted using too many free
parameters. It does an excellent job of fitting the data points, as the error
at the data points is close to zero. However it would not do a good job of
predicting h(x) for new values of x. We say that the model has a
high variance. The model does not reflect the structure which we expect
to be present in any data set generated by equation (1) above.
Clearly
what we want is something in between: a model which is powerful enough to
represent the underlying structure of the data (h(x)), but not so
powerful that it faithfully models the noise associated with this particular
data sample.
The
bias-variance trade-off is most likely to become a problem if we have
relatively few data points. In the opposite case, where we have essentially an
infinite number of data points (as in continuous online learning), we are not
usually in danger of overfitting the data, as the noise associated with any
single data point plays a vanishingly small role in our overall fit. The
following techniques therefore apply to situations in which we have a finite
data set, and, typically, where we wish to train in batch mode.
Preventing overfitting
Early stopping
One of the simplest and most widely
used means of avoiding overfitting is to divide the data into two sets: a
training set and a validation set. We train using only the training
data. Every now and then, however, we stop training, and test network
performance on the independent validation set. No weight updates are made
during this test! As the validation data is independent of the training data,
network performance is a good measure of generalization, and as long as the
network is learning the underlying structure of the data (h(x) above),
performance on the validation set will improve with training. Once the network
stops learning things which are expected to be true of any data sample and
learns things which are true only of this sample (epsilon in Eqn 1 above),
performance on the validation set will stop improving, and will typically get
worse. Schematic learning curves showing error on the training and validation
sets are shown below. To avoid overfitting, we simply stop training at time t,
where performance on the validation set is optimal.
One
detail of note when using early stopping: if we wish to test the trained
network on a set of independent data to measure its ability to generalize, we
need a third, independent, test set. This is because we used the validation set
to decide when to stop training, and thus our trained network is no longer
entirely independent of the validation set. The requirements of independent
training, validation and test sets means that early stopping can only be used
in a data-rich situation.
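The early-stopping loop itself is simple. In the sketch below, train_one_epoch and evaluate, and the get_weights/set_weights accessors, are hypothetical placeholders for whatever training step, error measure and network representation you actually use.

def train_with_early_stopping(net, train_set, val_set, max_epochs=1000, patience=10):
    best_error = float("inf")
    best_weights = net.get_weights()              # hypothetical accessor
    epochs_since_best = 0
    for epoch in range(max_epochs):
        train_one_epoch(net, train_set)           # weight updates use training data only
        val_error = evaluate(net, val_set)        # no weight updates on this set!
        if val_error < best_error:
            best_error = val_error
            best_weights = net.get_weights()
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:     # validation error stopped improving
                break
    net.set_weights(best_weights)                 # roll back to the optimum (time t above)
    return net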
Weight decay
The over-fitted function above shows
a high degree of curvature, while the linear function is maximally smooth. Regularization
refers to a set of techniques which help to ensure that the function computed
by the network is no more curved than necessary. This is achieved by adding a
penalty term Ω, weighted by a regularization parameter λ, to the error function E, giving:
E' = E + λ·Ω    (2)
One
possible form of the regularizer comes from the informal observation that an
over-fitted mapping with regions of large curvature requires large weights. We
thus penalize large weights by choosing
Ω = ½ Σij wij²    (3)
Using this modified error function,
the weights are now updated as
Δwij = −µ·∂E'/∂wij = −µ·∂E/∂wij − µ·λ·wij    (4)
where the right hand term causes the
weight to decrease as a function of its own size. In the absence of any input,
all weights will tend to decrease exponentially, hence the term "weight
decay".
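As a small illustration, one possible form of the decayed update of equation (4) is sketched below; lam stands for the regularization parameter, and the numerical values are arbitrary examples.

import numpy as np

def weight_decay_update(w, grad_E, mu=0.1, lam=1e-4):
    """One gradient step on E' = E + lam * 0.5 * sum(w**2), as in eq. (4)."""
    return w - mu * (grad_E + lam * w)    # the second term shrinks each weight

# With no error gradient at all, the weights simply decay towards zero:
w = np.array([1.0, -2.0])
for _ in range(3):
    w = weight_decay_update(w, grad_E=np.zeros_like(w))
print(w)    # each step multiplies the weights by (1 - mu * lam)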
Training with noise
A final method which can often help
to reduce the importance of the specific noise characteristics associated with
a particular data sample is to add an extra small amount of noise (a small
random value with mean value of zero) to each input. Each time a specific input
pattern x is presented, we add a different random number ε, and use x + ε instead.
At
first, this may seem a rather odd thing to do: to deliberately corrupt one's own
data. However, perhaps you can see that it will now be difficult for the
network to approximate any specific data point too closely. In practice,
training with added noise has indeed been shown to reduce overfitting and thus
improve generalization in some situations.
If
we have a finite training set, another way of introducing noise into the
training process is to use online training, that is, updating weights after
every pattern presentation, and to randomly reorder the patterns at the end of
each training epoch. In this manner, each weight update is based on a noisy
estimate of the true gradient.
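Both ideas can be combined in a single online epoch, as sketched below; update_fn is a hypothetical callback performing one backprop weight update for a pattern, and sigma is the (small) standard deviation of the added noise.

import numpy as np

rng = np.random.default_rng(1)

def noisy_online_epoch(X, T, update_fn, sigma=0.05):
    """One online epoch: reshuffle the patterns, then jitter each input."""
    order = rng.permutation(len(X))                 # random order, redone every epoch
    for i in order:
        x_noisy = X[i] + rng.normal(0.0, sigma, size=X[i].shape)   # x plus noise
        update_fn(x_noisy, T[i])                    # ordinary backprop update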
Growing and Pruning Networks
The
neural network modeler is faced with a huge array of models and training
regimes from which to select. This course can only serve to introduce you to
the most common and general models. However, even after deciding, for example,
to train a simple feed forward network, using some specific form of gradient
descent, with tanh nodes in a single hidden layer, an important question to be
addressed is remains: how big a network should we choose? How many hidden
units, or, relatedly, how many weights?
By
way of an example, the nonlinear data which formed our first example can be
fitted very well using 40 tanh functions. Learning with 40 hidden units is
considerably harder than learning with 2, and takes significantly longer. The
resulting fit is no better (as measured by the sum squared error) than the
2-unit model.
The
most usual answer is not necessarily the best: we guess an appropriate number
(as we did above).
Another
common solution is to try out several network sizes, and select the most
promising. Neither of these methods is very principled.
Two
more rigorous classes of methods are available, however. We can either start
with a network which we know to be too small, and iteratively add units and
weights, or we can train an oversized network and remove units/weights from the
final network. We will look briefly at each of these approaches.
Growing networks
The
simplest form of network growing algorithm starts with a small network, say one
with only a single hidden unit. The network is trained until the improvement in
the error over one epoch falls below some threshold. We then add an additional
hidden unit, with weights from inputs and to outputs. We initialize the new
weights randomly and resume training. The process continues until no
significant gain is achieved by adding an extra unit. The process is
illustrated below.
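In code, the growing loop might look roughly as follows; make_net, add_hidden_unit, train_epoch and total_error are hypothetical helpers standing in for your own network and training routine, and net.n_hidden is assumed to report the current number of hidden units.

def grow_network(data, max_hidden=20, eps=1e-3):
    net = make_net(n_hidden=1)                    # start with a single hidden unit
    error_before_growing = float("inf")
    while True:
        # train the current network until the per-epoch improvement stalls
        err = total_error(net, data)
        while True:
            train_epoch(net, data)
            new_err = total_error(net, data)
            if err - new_err < eps:               # improvement below threshold
                break
            err = new_err
        # stop if the last added unit bought no significant gain
        if error_before_growing - err < eps or net.n_hidden >= max_hidden:
            return net
        error_before_growing = err
        add_hidden_unit(net)                      # new random weights from inputs and to outputs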
Cascade correlation
Beyond
simply having too many parameters (danger of overfitting), there is a problem
with large networks which has been called the herd effect. Imagine we
have a task which is essentially decomposable into two sub-tasks A and B.
We have a number of hidden units and randomly weighted connections. If task A
is responsible for most of the error signal arriving at the hidden units, there
will be a tendency for all units to simultaneously try to solve A. Once
the error attributable to A has been reduced, error from subtask B
will predominate, and all units will now try to solve that, leading to an
increase again in the error from A. Eventually, due mainly to the
randomness in the weight initialization, the herd will split and different
units will address different sub-problems, but this may take considerable time.
To
get around this problem, Fahlman (1991) proposed an algorithm called cascade
correlation which begins with a minimal network having just input and
output units. Training a single layer requires no back-propagation of error and
can be done very efficiently. At some point further training will not produce
much improvement. If network performance is satisfactory, training can be
stopped. If not, there must be some remaining error which we wish to reduce
some more. This is done by adding a new hidden unit to the network: the new unit's
input weights are first trained so that its output correlates with the remaining error,
and are then frozen (i.e. they will no longer be changed); finally, all weights to the
output units are retrained. This cycle repeats until performance is satisfactory.
Preconditioning the Network
Ill-Conditioning
In the preceding section on
overfitting, we have seen what can happen when the network learns a given set
of data too well. Unfortunately a far more frequent problem encountered
by backpropagation users is just the opposite: that the network does not learn
well at all! This is usually due to ill-conditioning of the network.
(Fig. 1a)
Recall
that gradient descent requires a reasonable learning rate to work well: if it
is too low (Fig. 1a), convergence will be very slow; set it too high, and the
network will diverge (Fig. 1b).
(Fig. 1b)
Unfortunately
the best learning rate is typically different for each weight in the
network! Sometimes these differences are small enough for a single, global
compromise learning rate to work well - other times not. We call a network
ill-conditioned if it requires learning rates for its weights that differ by so
much that there is no global rate at which the network learns reasonably well.
The error function for such a network is characterized by long, narrow valleys:
(Fig. 2)
(Mathematically,
ill-conditioning is characterized by a high condition number. The
condition number is the ratio between the largest and the smallest eigenvalue
of the network's Hessian. The Hessian is the matrix of second
derivatives of the loss function with respect to the weights. Although it is
possible to calculate the Hessian for a multi-layer network and determine its
condition number explicitly, it is a rather complicated procedure, and rarely
done.)
Ill-conditioning
in neural networks can be caused by the training data, the network's
architecture, and/or its initial weights. Typical problems are: having large
inputs or target values, having both large and small layers in the network,
having more than one hidden layer, and having initial weights that are too
large or too small. This should make it clear that ill-conditioning is a very
common problem indeed! In what follows, we look at each possible source of
ill-conditioning, and describe a simple method to remove the problem. Since
these methods are all used before training of the network begins, we
refer to them as preconditioning techniques.
Normalizing Inputs and Targets
(Fig. 3)
Recall
the simple linear network (Fig. 3) we first used to learn the car data set.
When we presented the best linear fit, we had rescaled both the x (input) and y
(target) axes. Why did we do this? Consider what would happen if we used the
original data directly instead: the input (weight of the car) would be quite
large - over 3000 (pounds) on average. To map such large inputs onto the far
smaller targets, the weight from input to output must become quite small -
about -0.01. Now assume that we are 10% (0.001) away from the optimal value.
This would cause an error of (typically) 3000*0.001 = 3 at the output. At
learning rate µ, the weight change resulting from this error would be µ*3*3000
= 9000 µ. For stable convergence, this should be smaller than the distances to
the weight's optimal value: 9000 µ < 0.001, giving us µ < 10^-7,
a very small learning rate. (And this is for online learning - for batch
learning, where the weight changes for several patterns are added up, the
learning rate would have to be even smaller!)
Why
should such a small learning rate be a problem? Consider that the bias unit has
a constant output of 1. A bias weight that is, say, 0.1 away from its optimal
value would therefore have a gradient of 0.1. At a learning rate of 10^-7,
however, it would take 10 million steps to move the bias weight by this
distance! This is a clear case of ill-conditioning caused by the vastly
different scale of input and bias values. The solution is simple: normalize
the input, so that it has an average of zero and a standard deviation of one.
Normalization is a two-step process: to normalize a variable, first subtract its mean from every value, then divide the result by its standard deviation.
Note
that for our purposes it is not really necessary to calculate the mean and
standard deviation of each input exactly - approximate values are perfectly
sufficient. (In the case of the car data, the "mean" of 3000 and
"standard deviation" of 1000 were simply guessed after looking at the
data plot.) This means that in situations where the training data is not known
in advance, estimates based on either prior knowledge or a small sample of the
data are usually good enough. If the data is a time series x(t), you may also
want to consider using the first differences x(t) - x(t-1) as network inputs
instead; they have zero mean as long as x(t) is stationary. Whichever way you
do it, remember that you should always
- normalize the inputs, and
- normalize the targets.
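A minimal sketch of this preprocessing step is given below; the car-weight and mpg numbers are made up for illustration.

import numpy as np

def normalize(data, mean=None, std=None):
    """Shift to zero mean and scale to unit standard deviation.
    Approximate values (e.g. a mean of 3000 and standard deviation of 1000
    guessed from a plot, as in the car example) work fine too."""
    mean = np.mean(data) if mean is None else mean
    std = np.std(data) if std is None else std
    return (data - mean) / std

weights_lbs = np.array([2130.0, 3100.0, 4020.0, 2890.0])   # made-up car weights
mpg = np.array([32.0, 24.0, 15.0, 26.0])                   # made-up targets
x = normalize(weights_lbs, mean=3000.0, std=1000.0)        # normalized inputs
t = normalize(mpg)                                         # normalized targets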
(Fig. 4)
To
see why the target values should also be normalized, consider the network we've
used to fit a sigmoid to the car data (Fig. 4). If the target values were those
found in the original data, the weight from hidden to output unit would have to
be 10 times larger. The error signal propagated back to the hidden unit would
thus be multiplied by 17 along the way. In order to compensate for this, the
global learning rate would have to be lowered correspondingly, slowing down the
weights that go directly to the output unit. Thus while large inputs cause
ill-conditioning by leading to very small weights, large targets do so by
leading to very large weights.
Finally,
notice that the argument for normalizing the inputs can also be applied to the
hidden units (which after all look like inputs to their posterior nodes).
Ideally, we would like hidden unit activations as well to have a mean of zero
and a standard deviation of one. Since the weights into hidden units keep
changing during training, however, it would be rather hard to predict their
mean and standard deviation accurately! Fortunately we can rely on our tanh
activation function to keep things reasonably well-conditioned: its range from
-1 to +1 implies that the standard deviation cannot exceed 1, while its
symmetry about zero means that the mean will typically be relatively small.
Furthermore, its maximum derivative is also 1, so that backpropagated errors
will be neither magnified nor attenuated more than necessary.
Note: For historic reasons, many people use the logistic sigmoid
f(u) = 1/(1 + e^-u) as activation function for hidden
units. This function is closely related to tanh (in fact, f(u) = tanh(u/2)/2 +
0.5) but has a smaller, asymmetric range (from 0 to 1), and a maximum derivative
of 0.25. We will later encounter a legitimate use for this function, but as
activation function for hidden units it tends to worsen the network's
conditioning. Thus
- do not use the logistic sigmoid f(u) = 1/(1 + e^-u) as activation function for hidden units.
Use tanh instead: your network will
be better conditioned.
Initializing the Weights
Before training, the network weights
are initialized to small random values. The random values are usually drawn
from a uniform distribution over the range [-r,r]. What should r be? If the
initial weights are too small, both activation and error signals will die out
along their way through the network. Conversely, if they are too large, the
tanh function of the hidden units will saturate - be very close to its
asymptotic value of +/-1. This means that its derivative will be close to zero,
blocking any backpropagated error signals from passing through the node; this
is sometimes called paralysis of the node.
To
avoid either extreme, we would initially like the hidden units' net input to be
approximately normalized. We do not know the inputs to the node, but we do know
that they're approximately normalized - that's what we ensured in the previous
section. It seems reasonable then to model the expected inputs as independent,
normalized random variables. This means that their variances add, so we can
write
Var(neti) = Σj∈Ai Var(wij·yj) = Σj∈Ai wij² ≤ |Ai|·r²
since the initial weights are in the
range [-r,r]. To ensure that Var(neti) is at most 1, we can thus set
r to the inverse of the square root of the fan-in |Ai| of the
node - the number of weights coming into it:
- initialize weight wij to a uniformly random value in the range [-ri, ri], where ri = 1/√|Ai|.
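A short sketch of this initialization rule for a small feedforward layer; the layer sizes are arbitrary examples.

import numpy as np

rng = np.random.default_rng(2)

def init_weights(fan_in, fan_out):
    """Uniform initialization in [-r, r] with r = 1/sqrt(fan_in)."""
    r = 1.0 / np.sqrt(fan_in)
    return rng.uniform(-r, r, size=(fan_out, fan_in))

# example: a 10-5-1 network (10 inputs, 5 tanh hidden units, 1 output)
W_hidden = init_weights(fan_in=10, fan_out=5)
W_output = init_weights(fan_in=5, fan_out=1)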
Setting Local Learning Rates
Above we have seen that the architecture
of the network - specifically: the fan-in of its nodes - determines the range
within which its weights should be initialized. The architecture also affects
how the error signal scales up or down as it is backpropagated through the
network. Modelling the error signals as independent random variables, we have
Let
us define a new variable v for each hidden or output node, proportional to the
(estimated) variance of its error signal divided by its fan-in. We can
calculate all the v by a backpropagation procedure:
- for all output nodes o, set
- backpropagate: for all hidden nodes j, calculate
Since the activations in the network
are already normalized, we can expect the gradient for weight wij to
scale with the square root of the corresponding error signal's variance, vi|Ai|.
The resulting weight change, however, should be commensurate with the
characteristic size of the weight, which is given by ri. To achieve
this,
- set the learning rate µi (used for all weights wij into node i) to µi = ri / √(vi·|Ai|).
If
you follow all the points we have made in this section before the start of
training, you should have a reasonably well-conditioned network that can be
trained effectively. It remains to determine a good global learning rate µ.
This must be done by trial and error; a good first guess (on the high side)
would be the inverse of the square root of the batch size (by a similar
argument as we have made above), or 1 for online learning. If this leads to
divergence, reduce µ and try again.
Momentum and Learning Rate Adaptation
Local Minima
In
gradient descent we start at some point on the error function defined over the
weights, and attempt to move to the global minimum of the function. In
the simplified function of Fig 1a the situation is simple. Any step in a
downward direction will take us closer to the global minimum. For real
problems, however, error surfaces are typically complex, and may more resemble
the situation shown in Fig 1b. Here there are numerous local minima, and
the ball is shown trapped in one such minimum. Progress here is only possible
by climbing higher before descending to the global minimum.
(Fig. 1a) (Fig. 1b)
We have already mentioned one way to
escape a local minimum: use online learning. The noise in the stochastic error
surface is likely to bounce the network out of local minima as long as they are
not too severe.
Momentum
Another technique that can help the
network out of local minima is the use of a momentum term. This is
probably the most popular extension of the backprop algorithm; it is hard to
find cases where this is not used. With momentum m, the weight update at a given
time t becomes
Δw(t) = −µ·∂E/∂w(t) + m·Δw(t−1)    (1)
where
0 < m < 1 is a new global parameter which must be determined by trial and
error. Momentum simply adds a fraction m of the previous weight update to the
current one. When the gradient keeps pointing in the same direction, this will
increase the size of the steps taken towards the minimum. It is therefore
often necessary to reduce the global learning rate µ when using a lot of
momentum (m close to 1). If you combine a high learning rate with a lot of
momentum, you will rush past the minimum with huge steps!
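A minimal sketch of the momentum update of equation (1); the values of µ and m below are only examples.

def momentum_step(w, grad, prev_update, mu=0.1, m=0.9):
    """One weight update with momentum: add a fraction m of the previous step."""
    update = -mu * grad + m * prev_update
    return w + update, update

# inside a training loop (grad computed by backprop at each step):
#   w, prev = momentum_step(w, grad, prev, mu=0.1, m=0.9)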
When
the gradient keeps changing direction, momentum will smooth out the variations.
This is particularly useful when the network is not well-conditioned. In such
cases the error surface has substantially different curvature along different
directions, leading to the formation of long narrow valleys. For most points on
the surface, the gradient does not point towards the minimum, and successive
steps of gradient descent can oscillate from one side to the other, progressing
only very slowly to the minimum (Fig. 2a). Fig. 2b shows how the addition of
momentum helps to speed up convergence to the minimum by damping these
oscillations.
(Fig. 2a) (Fig. 2b)
To
illustrate this effect in practice, we trained 20 networks on a simple problem
(4-2-4 encoding), both with and without momentum. The mean training times (in
epochs) were
momentum    training time (epochs)
0           217
0.9         95
Learning Rate Adaptation
In the section on preconditioning, we
have employed simple heuristics to arrive at reasonable guesses for the global
and local learning rates. It is possible to refine these values significantly
once training has commenced, and the network's response to the data can be
observed. We will now introduce a few methods that can do so automatically
by adapting the learning rates during training.
Bold Driver
A
useful batch method for adapting the global learning rate µ is the bold
driver algorithm. Its operation is simple: after each epoch, compare the
network's loss E(t) to its previous value, E(t-1). If the error has decreased,
increase µ by a small proportion (typically 1%-5%). If the error has increased
by more than a tiny proportion (say, 10^-10), however, undo the last
weight change, and decrease µ sharply - typically by 50%. Thus bold driver will
keep growing µ slowly until it finds itself taking a step that has clearly gone
too far up onto the opposite slope of the error function. Since this means that
the network has arrived in a tricky area of the error surface, it makes sense
to reduce the step size quite drastically at this point.
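One bold-driver epoch can be sketched as follows; train_batch_epoch and batch_error are hypothetical helpers for a full-batch weight update and the resulting loss, and the growth and shrink factors are the typical values mentioned above.

def bold_driver_epoch(net, data, mu, prev_error, grow=1.02, shrink=0.5, tol=1e-10):
    saved = net.get_weights()                  # so a bad step can be undone
    train_batch_epoch(net, data, mu)           # one batch update with rate mu
    error = batch_error(net, data)
    if error <= prev_error * (1.0 + tol):      # error decreased (or nearly so)
        mu *= grow                             # grow mu by a small proportion
    else:                                      # clearly overshot onto the opposite slope
        net.set_weights(saved)                 # undo the last weight change
        mu *= shrink                           # and cut mu sharply
        error = prev_error
    return mu, error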
Annealing
Unfortunately
bold driver cannot be used in this form for online learning: the stochastic
fluctuations in E(t) would hopelessly confuse the algorithm. If we keep µ
fixed, however, these same fluctuations prevent the network from ever properly
converging to the minimum - instead we end up randomly dancing around it. In
order to actually reach the minimum, and stay there, we must anneal
(gradually lower) the global learning rate. A simple, non-adaptive annealing
schedule for this purpose is the search-then-converge schedule
µ(t) = µ(0)/(1 + t/T)    (2)
Its
name derives from the fact that it keeps µ nearly constant for the first T
training patterns, allowing the network to find the general location of the
minimum, before annealing it at a (very slow) pace that is known from theory to
guarantee convergence to the minimum. The characteristic time T of this
schedule is a new free parameter that must be determined by trial and error.
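The schedule itself is a one-liner:

def search_then_converge(mu0, t, T):
    """Eq. (2): nearly constant for t << T, then decaying roughly as 1/t."""
    return mu0 / (1.0 + t / T)

# e.g. with mu0 = 0.1 and T = 10000 patterns, mu stays near 0.1 early on
# and has halved by t = T.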
Local Rate Adaptation
If
we are willing to be a little more sophisticated, we can go a lot further than the
above global methods. First let us define an online weight update that uses a
local, time-varying learning rate for each weight:
wij(t+1) = wij(t) − µij(t)·∂E(t)/∂wij(t)    (3)
The
idea is to adapt these local learning rates by gradient descent, while
simultaneously adapting the weights. At time t, we would like to change the
learning rate (before changing the weight) such that the loss E(t+1) at the
next time step is reduced. The gradient we need is
∂E(t+1)/∂µij(t) = ∂E(t+1)/∂wij(t+1) · ∂wij(t+1)/∂µij(t) = −∂E(t+1)/∂wij(t+1) · ∂E(t)/∂wij(t)    (4)
Ordinary gradient descent in µij,
using the meta-learning rate q (a new global parameter), would give
µij(t+1) = µij(t) + q · ∂E(t+1)/∂wij(t+1) · ∂E(t)/∂wij(t)    (5)
We
can already see that this would work in a similar fashion to momentum: increase
the learning rate as long as the gradient keeps pointing in the same direction,
but decrease it when you land on the opposite slope of the loss function.
Problem: µij might become negative! Also, the step size
should be proportional to µij so that it can be adapted over several
orders of magnitude. This can be achieved by performing the gradient descent on
log(µij) instead:
(6)
Exponentiating this gives
µij(t+1) = µij(t)·exp(q·∂E(t+1)/∂wij(t+1)·∂E(t)/∂wij(t)) ≈ µij(t)·max(0.5, 1 + q·∂E(t+1)/∂wij(t+1)·∂E(t)/∂wij(t))    (7)
where
the approximation serves to avoid an expensive exp function call. The
multiplier is limited below by 0.5 to guard against very small (or even negative)
factors.
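A rough sketch of this multiplicative update, in the form reconstructed as equation (7) above; the exact formula in the original notes may differ, and q is only an example value.

import numpy as np

def adapt_local_rates(mu, grad, prev_grad, q=0.01):
    """Multiplicative local learning-rate adaptation (sketch of eq. (7)).
    mu, grad and prev_grad hold one entry per weight; q is the meta-learning
    rate, and max(0.5, ...) replaces the expensive exp() call."""
    return mu * np.maximum(0.5, 1.0 + q * grad * prev_grad)

# grows mu_ij while successive gradients agree in sign, shrinks it otherwise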
Problem: the gradient is noisy; the product of two of them will be
even noisier - the learning rate will bounce around a lot. A popular way to
reduce the stochasticity is to replace the gradient at the previous time step
(t-1) by an exponential average of past gradients. The exponential
average of a time series u(t) is defined as
ū(t) = (1 − m)·u(t) + m·ū(t−1)    (8)
where 0 < m < 1 is a new
global parameter.
Problem: if the gradient is ill-conditioned, the product of two
gradients will be even worse - the condition number is squared. We will need to
normalize the step sizes in some way. A radical solution is to throw away the
magnitude of the step, and just keep the sign, giving
(9)
where r = e^q. This works
fine for batch learning, but...
(Fig. 3)
Problem: Nonlinear normalizers such as the sign function lead to
systematic errors in stochastic gradient descent (Fig. 3): a skewed but
zero-mean gradient distribution (typical for stochastic equilibrium) is mapped
to a normalized distribution with non-zero mean. To avoid the problems this is
causing, we need a linear normalizer for online learning. A good method is to
divide the step by an exponential average of the squared gradient. This gives
(10)
Problem: successive training patterns may be correlated, causing the
product of stochastic gradients to behave strangely. The exponential averaging
does help to get rid of short-term correlations, but it cannot deal with input
that exhibits correlations across long periods of time. If you are iterating
over a fixed training set, make sure you permute (shuffle) it before
each iteration to destroy any correlations. This may not be possible in a true
online learning situation, where training data is received one pattern at a
time.
To
show that all these equations actually do something useful, here is a typical
set of online learning curves for a difficult benchmark
problem, given either uncorrelated training patterns, or patterns with strong
short-term or long-term correlations. In these figures "momentum"
corresponds to using equation (1) above, and "s-ALAP" to equation
(10). "ALAP" is like "s-ALAP" but without the exponential
averaging of past gradients, while "ELK1" and "SMD" are
more advanced methods (developed by one of us).
Classification
Discriminants
Neural networks can also be used to
classify data. Unlike regression problems, where the goal is to produce a
particular output value for a given input, classification problems require us
to label each data point as belonging to one of n classes. Neural
networks can do this by learning a discriminant function which separates
the classes. For example, a network with a single linear output can solve a
two-class problem by learning a discriminant function which is greater than
zero for one class, and less than zero for the other. Fig. 6 shows two such
two-class problems, with filled dots belonging to one class, and unfilled dots
to the other. In each case, a line is drawn where a discriminant function that
separates the two classes is zero.
(Fig. 6)
On
the left side, a straight line can serve as a discriminant: we can place the
line such that all filled dots lie on one side, and all unfilled ones lie on the
other. The classes are said to be linearly separable. Such problems can
be learned by neural networks without any hidden units. On the right side, a
highly non-linear function is required to ensure class separation. This problem
can be solved only by a neural network with hidden units.
Binomial
To use a neural network for
classification, we need to construct an equivalent function approximation
problem by assigning a target value for each class. For a binomial
(two-class) problem we can use a network with a single output y, and binary
target values: 1 for one class, and 0 for the other. We can thus interpret the
network's output as an estimate of the probability that a given pattern belongs
to the '1' class. To classify a new pattern after training, we then employ the maximum
likelihood discriminant, y > 0.5.
A
network with linear output used in this fashion, however, will expend a lot of
its effort on getting the target values exactly right for its training
points - when all we actually care about is the correct positioning of the
discriminant. The solution is to use an activation function at the output that
saturates at the two target values: such a function will be close to the target
value for any net input that is sufficiently large and has the correct sign.
Specifically, we use the logistic sigmoid function y = 1/(1 + e^-net).
Given
the probabilistic interpretation, a network output of, say, 0.01 for a pattern
that is actually in the '1' class is a much more serious error than,
say, 0.1. Unfortunately the sum-squared loss function makes almost no
distinction between these two cases. A loss function that is appropriate for
dealing with probabilities is the cross-entropy error. For the two-class
case, it is given by
E = − Σ [d·log(y) + (1 − d)·log(1 − y)]
where d is the binary (0/1) target, y the network output, and the sum runs over all training patterns.
When
logistic output units and cross-entropy error are used together in
backpropagation learning, the error signal for the output unit becomes just the
difference between target and output: d − y.
In
other words, implementing cross-entropy error for this case amounts to nothing
more than omitting the f'(net) factor that the error signal would otherwise get
multiplied by. This is not an accident, but indicative of a deeper mathematical
connection: cross-entropy error and logistic outputs are the
"correct" combination to use for binomial probabilities, just like linear
outputs and sum-squared error are for scalar values.
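A minimal sketch of a logistic output unit with the cross-entropy error signal described above:

import numpy as np

def logistic(u):
    return 1.0 / (1.0 + np.exp(-u))

def binomial_output_delta(net_input, target):
    """Logistic output with cross-entropy loss: the output error signal is
    simply (target - output); no f'(net) factor appears."""
    y = logistic(net_input)
    return target - y, y

delta, y = binomial_output_delta(net_input=0.3, target=1.0)
label = int(y > 0.5)          # maximum likelihood discriminant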
Multinomial
If we have multiple independent
binary attributes by which to classify the data, we can use a network with
multiple logistic outputs and cross-entropy error. For multinomial
classification problems (1-of-n, where n > 2) we use a network with n
outputs, one corresponding to each class, and target values of 1 for the
correct class, and 0 otherwise. Since these targets are not independent of each
other, however, it is no longer appropriate to use logistic output units. The
correct generalization of the logistic sigmoid to the multinomial case is the softmax
activation function:
yo = exp(neto) / Σo' exp(neto')
where o ranges over the n output
units. The cross-entropy error for such an output layer is given by
E = − Σo do·log(yo)
Since
all the nodes in a softmax output layer interact (the value of each node
depends on the values of all the others), the derivative of the cross-entropy
error is difficult to calculate. Fortunately, it again simplifies to do − yo,
so we don't have to worry about it.
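And the corresponding sketch for a softmax output layer (subtracting the maximum before exponentiating is only for numerical stability):

import numpy as np

def softmax(net):
    e = np.exp(net - np.max(net))
    return e / np.sum(e)

def multinomial_output_delta(net, target_one_hot):
    """Softmax outputs with cross-entropy loss: error signal is target - output."""
    y = softmax(net)
    loss = -np.sum(target_one_hot * np.log(y))
    return target_one_hot - y, loss

delta, loss = multinomial_output_delta(np.array([1.2, -0.3, 0.5]),
                                        np.array([0.0, 0.0, 1.0]))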
Non-Supervised Learning
It
is possible to use neural networks to learn about data that contains neither
target outputs nor class labels. There are many tricks for getting error
signals in such non-supervised settings; here we'll briefly discuss a
few of the most common approaches: autoassociation, time series prediction, and
reinforcement learning.
Autoassociation
Autoassociation is based on a simple
idea: if you have inputs but no targets, just use the inputs as targets. An
autoassociator network thus tries to learn the identity function. This is only
non-trivial if the hidden layer forms an information bottleneck -
contains fewer units than the input (output) layer, so that the network must
perform dimensionality reduction (a form of data compression).
A
linear autoassociator trained with sum-squared error in effect performs principal
component analysis (PCA), a well-known statistical technique. PCA extracts
the subspace (directions) of highest variance from the data. As was the case
with regression, the linear neural network offers no direct advantage over
known statistical methods, but it does suggest an interesting nonlinear
generalization:
This
nonlinear autoassociator includes a hidden layer in both the encoder and
the decoder part of the network. Together with the linear bottleneck layer,
this gives a network with at least 3 hidden layers. Such a deep network should be
preconditioned if it is to learn successfully.
Time Series Prediction
When the input data x forms a
temporal series, an important task is to predict the next point: the weather
tomorrow, the stock market 5 minutes from now, and so on. We can (attempt to)
do this with a feedforward network by using time-delay embedding: at
time t, we give the network x(t), x(t-1), ... x(t-d) as input, and try to
predict x(t+1) at the output. After propagating activity forward to make the
prediction, we wait for the actual value of x(t+1) to come in before
calculating and backpropagating the error. Like all neural network architecture
parameters, the dimension d of the embedding is an important but difficult
choice.
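A small sketch of building such a training set from a scalar series; the sine series and the embedding dimension below are arbitrary examples.

import numpy as np

def embed(series, d):
    """Pairs for one-step prediction: inputs x(t), x(t-1), ..., x(t-d),
    target x(t+1)."""
    X, T = [], []
    for t in range(d, len(series) - 1):
        X.append(series[t - d:t + 1][::-1])    # x(t), x(t-1), ..., x(t-d)
        T.append(series[t + 1])
    return np.array(X), np.array(T)

series = np.sin(np.linspace(0, 20, 200))       # toy time series
X, T = embed(series, d=5)                      # feed X into a feedforward network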
A
more powerful (but also more complicated) way to model a time series is to use recurrent
neural networks.
Reinforcement Learning
Sometimes we are faced with the
problem of delayed reward: rather than being told the correct answer for
each input pattern immediately, we may only occasionally get a positive or
negative reinforcement signal to tell us whether the entire sequence of actions
leading up to this was good or bad. Reinforcement learning provides ways to get
a continuous error signal in such situations.
Q-learning associates an expected utility (the Q-value) with each
action possible in a particular state. If at time t we are in state s(t) and
decide to perform action a(t), the corresponding Q-value is updated as follows:
Q(s(t), a(t)) = r(t) + gamma·maxa Q(s(t+1), a)
where
r(t) is the instantaneous reward resulting from our action, s(t+1) is
the state that it led to, a are all possible actions in that state, and gamma
<= 1 is a discount factor that leads us to prefer instantaneous over
delayed rewards.
A
common way to implement Q-learning for small problems is to maintain a table of
Q-values for all possible state/action pairs. For large problems, however, it
is often impossible to keep such a large table in memory, let alone learn its
entries in reasonable time. In such cases a neural network can provide a
compact approximation of the Q-value function. Such a network takes the state
s(t) as its input, and has an output ya for each possible action. To
learn the Q-value Q(s(t), a(t)), it uses the right-hand side of the above
Q-iteration as a target:
target = r(t) + gamma·maxa ya(t+1)
Note
that since we require the network's outputs at time t+1 in order to calculate
its error signal at time t, we must keep a one-step memory of all input and
hidden node activity, as well as the most recent action. The error signal is
applied only to the output corresponding to that action; all other output nodes
receive no error (they are "don't cares").
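A rough sketch of one such update; q_net and backprop_update are hypothetical placeholders for the network's forward pass (one output per action) and a single backprop step on the chosen output. For simplicity the sketch re-runs the forward pass on the stored previous state instead of caching the node activations.

import numpy as np

def q_update(prev_state, prev_action, reward, state, n_actions, gamma=0.95):
    q_prev = q_net(prev_state)                  # outputs for the state at time t
    q_next = q_net(state)                       # outputs for the state at time t+1
    target = reward + gamma * np.max(q_next)    # right-hand side of the Q-iteration
    deltas = np.zeros(n_actions)                # all other outputs are "don't cares"
    deltas[prev_action] = target - q_prev[prev_action]
    backprop_update(prev_state, deltas)         # error applied to the chosen action only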
TD-learning is a variation that assigns utility values to states alone
rather than state/action pairs. This means that search must be used to
determine the value of the best successor state. TD(λ) replaces the one-step
memory with an exponential average of the network's gradient; this is similar
to momentum, and can help speed the transport of delayed reward signals across
large temporal distances.
One
of the most successful applications of neural networks is TD-Gammon, a
network that used TD(λ) to learn the game of
backgammon from scratch, by playing only against itself. TD-Gammon is now the
world's strongest backgammon program, and plays at the level of human
grandmasters.
Time Sequences
There
are many tasks that require learning a temporal sequence of events. These
problems can be broken into 3 distinct types of tasks:
- Sequence Recognition: Produce a particular output pattern when a specific input sequence is seen. Applications: speech recognition
- Sequence Reproduction: Generate the rest of a sequence when the network sees only part of the sequence. Applications: Time series prediction (stock market, sun spots, etc)
- Temporal Association: Produce a particular output sequence in response to a specific input sequence. Applications: speech generation
Some
of the methods that are used include
- Tapped Delay Lines (time delay networks)
- Context Units (e.g. Elman Nets, Jordan Nets)
- Back propagation through time (BPTT)
- Real Time Recurrent Learning (RTRL)
Tapped Delay Lines / Time Delay Neural Networks
This is one
of the simplest ways of performing sequence recognition: the last several values of the
input sequence are presented to the network simultaneously as a spatial pattern, so
conventional backpropagation algorithms can be used.
Downsides:
Memory is limited by length of tapped delay line. If a large number of input
units are needed then computation can be slow and many examples are needed.
A simple extension to this is to
allow non-uniform sampling:
xi(t) = x(t − wi)
where wi is the integer delay associated with component i. Thus if there are n input
units, the memory is not limited to simply the previous n timesteps.
Another
extension is for each "input" to really be a convolution
of the original input sequence.
In the case of the delay line
memories:
Other variations for c are shown
graphically below:
This
figure is taken from "Neural Net Architectures for Temporal Sequence
Processing", by Michael Mozer.
Recurrent Networks I
Consider the following two networks:
(Fig. 1)
The
network on the left is a simple feed forward network of the kind we have
already met. The right hand network has an additional connection from the
hidden unit to itself. What difference could this seemingly small change
to the network make?
Each
time a pattern is presented, the unit computes its activation just as in a feed
forward network. However its net input now contains a term which reflects the
state of the network (the hidden unit activation) before the pattern was seen.
When we present subsequent patterns, the hidden and output units' states will
be a function of everything the network has seen so far.
The
network behavior is based on its history, and so we must think of pattern
presentation as it happens in time.
Network topology
Once
we allow feedback connections, our network topology becomes very free: we can
connect any unit to any other, even to itself. Two of our basic requirements
for computing activations and errors in the network are now violated. When
computing activations, we required that before computing yi,
we had to know the activations of all units in its anterior set of nodes, Ai.
For computing errors, we required that before computing the error δi, we had to know
the errors of all units in its posterior set of nodes, Pi.
For
an arbitrary unit in a recurrent network, we now define its activation at time t as:
yi(t) = fi(neti(t-1))
At each time step, therefore, activation propagates forward through one layer of connections only. Once some level of activation is present in the network, it will continue to flow around the units, even in the absence of any new input whatsoever. We can now present the network with a time series of inputs, and require that it produce an output based on this series. These networks can be used to model many new kinds of problems, however, these nets also present us with many new difficult issues in training.
Before
we address the new issues in training and operation of recurrent neural
networks, let us first look at some sample tasks which have been attempted (or
solved) by such networks.
- Learning formal grammars
Given
a set of strings S, each composed of a series of symbols, identify the
strings which belong to a language L. A simple example: L = {a^n b^n}
is the language composed of strings of any number of a's, followed by
the same number of b's. Strings belonging to the language include aaabbb,
ab, aaaaaabbbbbb. Strings not belonging to the language include aabbb,
abb, etc. A common benchmark is the language defined by the Reber
grammar. Strings which belong to a language L are said to be grammatical
and are ungrammatical otherwise.
- Speech recognition
In
some of the best speech recognition systems built so far, speech is first
presented as a series of spectral slices to a recurrent network. Each output of
the network represents the probability of a specific phone (speech sound, e.g.
/i/, /p/, etc), given both present and recent input. The probabilities are then
interpreted by a Hidden Markov Model which tries to recognize the whole
utterance.
- Music composition
A
recurrent network can be trained by presenting it with the notes of a musical
score. Its task is to predict the next note. Obviously this is impossible to
do perfectly, but the network learns that some notes are more likely to occur
in one context than another. Training, for example, on a lot of music by J. S.
Bach, we can then seed the network with a musical phrase, let it predict the
next note, feed this back in as input, and repeat, generating new music. Music
generated in this fashion typically sounds fairly convincing at a very local
scale, i.e. within a short phrase. At a larger scale, however, the compositions
wander randomly from key to key, and no global coherence arises. This is an
interesting area for further work.
The Simple Recurrent Network
One
way to meet these requirements is illustrated below in a network known
variously as an Elman network (after
Jeff Elman, the originator), or as a Simple
Recurrent Network. At each time step, a copy of the hidden layer
units is made to a copy layer. Processing is done as follows:
- Copy inputs for time t to the input units
- Compute hidden unit activations using net input from input units and from copy layer
- Compute output unit activations as usual
- Copy new hidden unit activations to copy layer
In
computing the activation, we have eliminated cycles, and so our requirement
that the activations of all anterior nodes be known is met. Likewise, in
computing errors, all trainable weights are feed forward only, so we can apply
the standard backpropagation algorithm as before. The weights from the copy
layer to the hidden layer play a special role in error computation. The error
signal they receive comes from the hidden units, and so depends on the error at
the hidden units at time t. The activations in the hidden units,
however, are just the activation of the hidden units at time t-1. Thus,
in training, we are considering a gradient of an error function which is
determined by the activations at the present and the previous time steps.
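A minimal sketch of the forward/copy step of such a network (biases omitted, forward pass only; the copy-to-hidden weights are trained like any other feedforward weights, as described above). The layer sizes are arbitrary.

import numpy as np

rng = np.random.default_rng(3)

class ElmanNet:
    def __init__(self, n_in, n_hidden, n_out):
        self.W_ih = rng.uniform(-0.5, 0.5, (n_hidden, n_in))      # input  -> hidden
        self.W_ch = rng.uniform(-0.5, 0.5, (n_hidden, n_hidden))  # copy   -> hidden
        self.W_ho = rng.uniform(-0.5, 0.5, (n_out, n_hidden))     # hidden -> output
        self.copy = np.zeros(n_hidden)                            # copy layer (state)

    def step(self, x):
        h = np.tanh(self.W_ih @ x + self.W_ch @ self.copy)  # uses previous hidden state
        y = self.W_ho @ h
        self.copy = h.copy()          # store hidden activations for the next time step
        return y

net = ElmanNet(n_in=2, n_hidden=4, n_out=1)
for x in np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]):
    print(net.step(x))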
A
generalization of this approach is to copy the input and hidden unit
activations for a number of previous timesteps. The more context (copy layers)
we maintain, the more history we are explicitly including in our gradient
computation. This approach has become known as Back
Propagation Through Time. It can be seen as an approximation to the
ideal of computing a gradient which takes into consideration not just the most
recent inputs, but all inputs seen so far by the network. The figure below
illustrates one version of the process:
The
inputs and hidden unit activations at the last three time steps are stored. The
solid arrows show how each set of activations is determined from the input and
hidden unit activations on the previous time step. A backward pass, illustrated
by the dashed arrows, is performed to determine separate values of delta (the
error of a unit with respect to its net input) for each unit and each
time step separately. Because each earlier layer is a copy of the layer one
level up, we introduce the new constraint that the weights at each level be
identical. Then the partial derivative of the negative error with respect to wi,j
is simply the sum of the partials calculated for the copy of wi,j
between each two layers.
Elman networks and their
generalization, Back Propagation Through Time, both seek to approximate the
computation of a gradient based on all past inputs, while retaining the
standard back prop algorithm. BPTT has been used in a number of applications (e.g.
ECG modeling). The main task is to produce particular output sequences in
response to specific input sequences. The downside of BPTT is that it requires
a large amount of storage, computation, and training examples in order to work
well. In the next section we will see how we can compute the true temporal
gradient using a method known as Real Time Recurrent Learning.
Real Time Recurrent Learning
In
deriving a gradient-based update rule for recurrent networks, we now make
network connectivity essentially unconstrained. We simply suppose that we have a
set of input units, I = {xk(t), 0<k<m}, and a
set of other units, U = {yk(t), 0<k<n}, which
can be hidden or output units. To index an arbitrary unit in the network we can
use
zk(t) = xk(t) if k ∈ I;  zk(t) = yk(t) if k ∈ U    (1)
Let
W be the weight matrix with n rows and n+m columns, where wi,j
is the weight to unit i (which is in U) from unit j (which
is in I or U). Units compute their activations in the now
familiar way, by first computing the weighted sum of their inputs:
netk(t) = Σl∈U∪I wkl·zl(t)    (2)
where the only new element in the
formula is the introduction of the temporal index t. Units then compute
some non-linear function of their net input
yk(t+1) = fk(netk(t))    (3)
Usually,
both hidden and output units will have non-linear activation functions. Note
that external input at time t does not influence the output of any unit
until time t+1. The network is thus a discrete dynamical system.
Some
of the units in U are output units, for which a target is defined. A
target may not be defined for every single input however. For example, if we
are presenting a string to the network to be classified as either grammatical or
ungrammatical, we may provide a target only for the last symbol in the string.
In defining an error over the outputs, therefore, we need to make the error
time dependent too, so that it can be undefined (or 0) for an output unit for
which no target exists at present. Let T(t) be the set of indices k
in U for which there exists a target value dk(t) at
time t. We are forced to use the notation dk for the targets instead
of the usual t here, as t now refers to time. Let the error at the output
units be
ek(t) = dk(t) − yk(t) if k ∈ T(t), and 0 otherwise    (4)
and define our error function for a
single time step as
\[ E(t) = \tfrac{1}{2} \sum_{k \in U} \left[ e_k(t) \right]^2 \tag{5} \]
The error function we wish to
minimize is the sum of this error over all past steps of the network
\[ E^{\mathrm{total}}(t_0, t) = \sum_{\tau = t_0 + 1}^{t} E(\tau) \tag{6} \]
Now, because the total error is the sum of all previous errors and the error at
this time step, the gradient of the total error is likewise the sum of the
gradient for this time step and the gradient for all previous steps:
\[ \nabla_W E^{\mathrm{total}}(t_0, t+1) = \nabla_W E^{\mathrm{total}}(t_0, t) + \nabla_W E(t+1) \tag{7} \]
As a time series is presented to the
network, we can accumulate the values of the gradient, or equivalently, of the
weight changes. We thus keep track of the value
\[ \Delta w_{ij}(t) = -\mu\, \frac{\partial E(t)}{\partial w_{ij}} \tag{8} \]
where μ is the learning rate.
After the network has been presented
with the whole series, we alter each weight w_ij by
\[ \Delta w_{ij} = \sum_{t} \Delta w_{ij}(t) \tag{9} \]
We therefore need an algorithm that
computes
\[ -\frac{\partial E(t)}{\partial w_{ij}} = \sum_{k \in U} e_k(t)\, \frac{\partial y_k(t)}{\partial w_{ij}} \tag{10} \]
at each time step t. Since we know e_k(t) at all times (the difference between
our targets and outputs), we only need to find a way to compute the second
factor, ∂y_k(t)/∂w_ij.
IMPORTANT
The key to understanding RTRL is to appreciate what this factor expresses. It
is essentially a measure of the sensitivity of the output of unit k at time t
to a small change in the value of w_ij, taking into account the effect of such
a change in the weight over the entire network trajectory from t_0 to t. Note
that w_ij does not have to be connected to unit k. The algorithm is thus
non-local: we need to consider the effect of a change at one place in the
network on the values computed at an entirely different place. Make sure you
understand this before you dive into the derivation given next.
Derivation of ∂y_k(t)/∂w_ij
This is given here for completeness, for those who wish perhaps to implement
RTRL. Make sure you at least know what role the factor ∂y_k(t)/∂w_ij plays in
computing the gradient.
From Equations 2 and 3, we get
\[ \frac{\partial y_k(t+1)}{\partial w_{ij}} = f_k'(\mathrm{net}_k(t)) \left[ \sum_{l \in U \cup I} w_{kl}\, \frac{\partial z_l(t)}{\partial w_{ij}} + \delta_{ik}\, z_j(t) \right] \tag{11} \]
where δ_ik is the Kronecker delta
\[ \delta_{ik} = \begin{cases} 1 & \text{if } i = k \\ 0 & \text{otherwise} \end{cases} \tag{12} \]
[Exercise: Derive Equation 11 from
Equations 2 and 3]
Because input signals do not depend
on the weights in the network,
\[ \frac{\partial x_l(t)}{\partial w_{ij}} = 0 \tag{13} \]
Equation 11 becomes:
\[ \frac{\partial y_k(t+1)}{\partial w_{ij}} = f_k'(\mathrm{net}_k(t)) \left[ \sum_{l \in U} w_{kl}\, \frac{\partial y_l(t)}{\partial w_{ij}} + \delta_{ik}\, z_j(t) \right] \tag{14} \]
This
is a recursive equation. That is, if we know the value of the left hand side
for time 0, we can compute the value for time 1, and use that value to compute
the value at time 2, etc. Because we assume that our starting state (t = 0)
is independent of the weights, we have
\[ \frac{\partial y_k(t_0)}{\partial w_{ij}} = 0 \tag{15} \]
These equations hold for all k ∈ U, i ∈ U, and j ∈ U ∪ I.
We therefore need to define the
values
\[ p_{ij}^k(t) \equiv \frac{\partial y_k(t)}{\partial w_{ij}} \tag{16} \]
for every time step t and all
appropriate i, j and k. We start with the initial condition
\[ p_{ij}^k(t_0) = 0 \tag{17} \]
and compute at each time step
\[ p_{ij}^k(t+1) = f_k'(\mathrm{net}_k(t)) \left[ \sum_{l \in U} w_{kl}\, p_{ij}^l(t) + \delta_{ik}\, z_j(t) \right] \tag{18} \]
The algorithm then consists of computing, at each time step t, the quantities
p_ij^k(t) using Equations 17 and 18, and then using the differences between
targets and actual outputs to compute the weight changes
\[ \Delta w_{ij}(t) = \mu \sum_{k \in U} e_k(t)\, p_{ij}^k(t) \tag{19} \]
and the overall correction to be
applied to w_ij is given by
\[ \Delta w_{ij} = \sum_{t} \Delta w_{ij}(t) \tag{20} \]
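Putting the pieces together, here is a compact Python sketch of the full RTRL loop, assuming logistic units; the toy target (defined only for unit 0), the sizes and the learning rate are invented purely for illustration.

```python
import numpy as np

# Compact RTRL sketch: maintain p[k, i, j] = dy_k/dw_ij and update it each step.
def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

m, n, mu = 2, 3, 0.1                      # inputs, units, learning rate
rng = np.random.default_rng(2)
W = rng.uniform(-1, 1, (n, n + m))        # columns 0..m-1: inputs, m..m+n-1: units

y = np.zeros(n)                           # y_k(t_0) = 0
p = np.zeros((n, n, n + m))               # Equation 17: p starts at 0
dW_total = np.zeros_like(W)               # accumulated correction (Equations 9/20)

xs = rng.normal(size=(20, m))             # a toy 20-step input sequence
for x in xs:
    z = np.concatenate([x, y])            # z(t): inputs and unit outputs
    net = W @ z                           # Equation 2
    y = logistic(net)                     # Equation 3
    fprime = y * (1.0 - y)                # logistic derivative f'_k(net_k(t))

    # Equation 18: p[k,i,j] <- f'_k(net_k) * (sum_l w_kl p[l,i,j] + delta_ik z_j)
    W_units = W[:, m:]                                  # weights from units only
    kron = np.einsum('ki,j->kij', np.eye(n), z)         # delta_ik * z_j term
    p = fprime[:, None, None] * (np.einsum('kl,lij->kij', W_units, p) + kron)

    # Equation 4: error is defined only where a target exists (here: unit 0,
    # whose toy target is the sum of the current inputs).
    e = np.zeros(n)
    e[0] = np.sum(x) - y[0]

    # Equation 19: weight change for this step, accumulated as in Equation 20.
    dW_total += mu * np.einsum('k,kij->ij', e, p)

W += dW_total                             # apply the correction after the series
```

Note that the array p alone holds n²(n+m) numbers, which is one reason RTRL becomes expensive for large networks.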
Dynamics and RNNs
Consider
the recurrent network illustrated below. A single input unit is connected to
each of the three "hidden" units. Each hidden unit in turn is
connected to itself and the other hidden units. As in the RTRL derivation, we
do not distinguish now between hidden and output units. Any activation which
enters the network through the input node can flow around from one unit to the
other, potentially forever. Weights less than 1.0 will exponentially reduce the
activation, weights larger than 1.0 will cause it to increase. The non-linear
activation functions of the hidden units will hopefully prevent it from growing
without bound.
As
we have three hidden units, their activation at any given time t
describes a point in a 3-dimensional state space. We can visualize the temporal
evolution of the network by plotting the trajectory that this point traces out
in state space.
In
the absence of input, or in the presence of a steady-state input, a network
will usually approach a fixed point attractor. Other behaviors are possible,
however. Networks can be trained to oscillate in regular fashion, and chaotic
behavior has also been observed. The development of architectures and
algorithms to generate specific forms of dynamic behavior is still an active
research area.
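The sketch below records such a state-space trajectory for a three-unit network under a constant input; the weights are arbitrary random draws, and for most such draws the trajectory settles onto a fixed point, as described above.

```python
import numpy as np

# Record the 3-d state of a small recurrent net as it evolves under a steady input.
rng = np.random.default_rng(3)
W_rec = rng.uniform(-1, 1, (3, 3))      # hidden-to-hidden weights (arbitrary)
w_in = rng.uniform(-1, 1, 3)            # input-to-hidden weights (arbitrary)
x = 0.5                                 # constant (steady-state) input

state = np.zeros(3)
trajectory = [state.copy()]
for _ in range(50):
    state = np.tanh(W_rec @ state + w_in * x)
    trajectory.append(state.copy())

trajectory = np.array(trajectory)       # shape (51, 3): a path through state space
# A 3-d plot of these rows shows the transient leading toward the attractor,
# which for most weight draws is a fixed point.
```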
Some limitations of gradient methods and RNNs
The
simple recurrent network computed a gradient based on the present state of the
network and its state one time step ago. Using Back Prop Through Time, we could
compute a gradient based on some finite n time steps of network
operation. RTRL provided a way of computing the true gradient based on the
complete network history from time 0 to the present. Is this perfection?
Unfortunately
not. With feedforward networks which have a large number of layers, the weights
which are closest to the output are the easiest to train. This is no surprise,
as their contribution to the network error is direct and easily measurable.
Every time we back propagate an error one layer further back, however, our
estimate of the contribution of a particular weight to the observed error
becomes more indirect. You can think of error flowing in at the top of the network
in distinct streams. Each back propagation step dilutes the error, mixing up error
from distinct sources, until, far back in the network, it becomes virtually
impossible to tell who is responsible for what. The error signal has become
completely diluted.
With
RTRL and BPTT we face a similar problem. Error is now propagated back in time,
but each time step is exactly equivalent to propagating through an additional
layer of a feed forward network. The result, of course, is that it becomes very
difficult to assess the importance of the network state at times which lie far
back in the past. Typically, gradient based networks cannot reliably use
information which lies more than about 10 time steps in the past. If you now
imagine an attempt to use a recurrent neural network in a real life situation,
e.g. monitoring an industrial process, where data are presented as a time
series at some realistic sampling rate (say 100 Hz), it becomes clear that
these networks are of limited use. The next section shows a recent model which
tries to address this problem.
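A rough numerical illustration of this dilution: each step back in time multiplies the error signal by the recurrent weights and the activation derivative, and the product of many such factors typically shrinks very quickly. The weights and activations below are random stand-ins, not a trained network.

```python
import numpy as np

# Watch the norm of an error signal decay as it is propagated back through time.
rng = np.random.default_rng(4)
n = 3
W_rec = rng.uniform(-1, 1, (n, n))       # stand-in recurrent weights

delta = np.ones(n)                       # error signal arriving at the last step
for step in range(1, 21):
    h = rng.uniform(-0.9, 0.9, n)        # stand-in tanh activations at this step
    delta = W_rec.T @ (delta * (1.0 - h ** 2))   # one step of back propagation
    if step % 5 == 0:
        print(f"{step:2d} steps back: |delta| = {np.linalg.norm(delta):.2e}")
# Typically the norm has collapsed by several orders of magnitude well before
# 20 steps, in line with the practical limit discussed above.
```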
Long Short-Term Memory
In
a recurrent network, information is stored in two distinct ways. The
activations of the units are a function of the recent history of the model, and
so form a short-term memory. The weights too form a memory, as they are
modified based on experience, but the timescale of the weight change is much
slower than that of the activations. The weights therefore form a long-term memory. The Long
Short-Term Memory model [1] is an attempt to allow the unit activations to
retain important information over a much longer period of time than the 10 to
12 time steps which is the limit of RTRL or BPTT models.
The figure below shows a maximally simple LSTM network, with a single input, a
single output, and a single memory block in place of the familiar hidden unit.
Each block has two associated gate units (details below). Each layer may, of
course, have multiple units or blocks. In a typical configuration, the first
layer of weights runs from the input to the blocks and gates. There are then
recurrent connections from one block to other blocks and gates. Finally, there
are weights from the blocks to the outputs. The next figure shows the memory
block in more detail.
The hidden units of a conventional
recurrent neural network have now been replaced by memory blocks, each
of which contains one or more memory cells. At the heart of the cell
is a simple linear unit with a single self-recurrent connection with weight
set to 1.0. In the absence of any other input, this connection serves to
preserve the cell's current state from one moment to the next. In addition to
the self-recurrent connection, cells receive input from input units and other
cells and gates. While the cells are responsible for maintaining information
over long periods of time, the responsibility for deciding what information
to store, and when to apply that information, lies with the input and output
gate units, respectively.
The
input to the cell is passed through a non-linear squashing function (g(x),
typically the logistic function, scaled to lie within [-2,2]), and the result
is then multiplied by the output of the input gating unit. The activation of
the gate ranges over [0,1], so if its activation is near zero, nothing can
enter the cell. Only if the input gate is sufficiently active is the signal
allowed in. Similarly, nothing emerges from the cell unless the output gate is
active. As the internal cell state is maintained in a linear unit, its
activation range is unbounded, and so the cell output is again squashed when it
is released (h(x), typical range [-1,1]). The gates themselves are
nothing more than conventional units with sigmoidal activation functions
ranging over [0,1], and they each receive input from the network input units
and from other cells.
Thus we have:
- Cell output: y^{c_j}(t) = y^{out_j}(t) h(s_{c_j}(t)),
- where y^{out_j}(t) is the activation of the output gate, and the cell state
s_{c_j}(t) is given by s_{c_j}(0) = 0 and
s_{c_j}(t) = s_{c_j}(t-1) + y^{in_j}(t) g(net_{c_j}(t)) for t > 0.
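The sketch below steps a single memory cell of this kind through a short sequence, following the equations above (the original formulation, without a forget gate); for simplicity the gates here see only the external input, and the weights are random stand-ins rather than learned values.

```python
import numpy as np

# One original-style LSTM memory cell (no forget gate) stepping through a sequence.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def g(x):                      # input squashing: logistic scaled to [-2, 2]
    return 4.0 * sigmoid(x) - 2.0

def h(x):                      # output squashing: logistic scaled to [-1, 1]
    return 2.0 * sigmoid(x) - 1.0

rng = np.random.default_rng(5)
n_in = 2
w_cell = rng.uniform(-1, 1, n_in)   # weights into the cell
w_ing  = rng.uniform(-1, 1, n_in)   # weights into the input gate
w_outg = rng.uniform(-1, 1, n_in)   # weights into the output gate

s = 0.0                             # cell state, s_{c_j}(0) = 0
outputs = []
for x in rng.normal(size=(6, n_in)):
    y_in  = sigmoid(w_ing  @ x)     # input gate activation, in [0, 1]
    y_out = sigmoid(w_outg @ x)     # output gate activation, in [0, 1]
    s = s + y_in * g(w_cell @ x)    # state update: self-recurrent weight is 1.0
    outputs.append(y_out * h(s))    # cell output released through the output gate
```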
This
division of responsibility---the input gates decide what to store, the cell
stores information, and the output gate decides when that information is to be
applied---has the effect that salient events can be remembered over arbitrarily
long periods of time. Equipped with several such memory blocks, the network can
effectively attend to events at multiple time scales.
Network
training uses a combination of RTRL and BPTT, and we won't go into the details
here. However, consider an error signal being passed back from the output unit.
If it is allowed into the cell (as determined by the activation of the output
gate), it is now trapped, and it gets passed back through the self-recurrent
connection indefinitely. It can only affect the incoming weights, however, if
it is allowed to pass by the input gate.
On
selected problems, an LSTM network can retain information over arbitrarily long
periods of time; over 1000 time steps in some cases. This gives it a
significant advantage over RTRL and BPTT networks on many problems. For
example, a Simple Recurrent Network can learn the Reber Grammar, but not the Embedded
Reber Grammar. An RTRL network can sometimes, but not always, learn the
Embedded Reber Grammar after about 100 000 training sequences. LSTM always
solves the Embedded problem, usually after about 10 000 sequence presentations.
One
of us is currently training LSTM networks to distinguish between different
spoken languages based on speech prosody (roughly: the melody and rhythm of
speech).
References
[1] Hochreiter, Sepp and Schmidhuber, Jürgen (1997). "Long Short-Term Memory". Neural Computation, 9(8), 1735-1780.