Backpropagation at scale

The method that trains deep networks, backpropagation, was published in 1986. AlexNet's contribution was running that same idea on a far bigger network and far more data.

It is easy to assume AlexNet introduced a new way to train networks. It did not. The learning method it used, backpropagation, was published in 1986, in a Nature paper by David Rumelhart, Geoffrey Hinton, and Ronald Williams.^[1] That paper describes a procedure that repeatedly adjusts the connection weights “so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector.”^[1] As a side effect, the hidden units “come to represent important features of the task domain.”^[1]

In plain terms: the network makes a guess, measures how wrong it is, and passes that error backward through its layers to nudge every weight in the direction that would have reduced the error. Repeat this across millions of examples and the network gradually shapes itself.
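The guess, measure, pass-backward, nudge loop can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the setup of either paper: the tiny network (2 inputs, 2 hidden units, 1 output), the sigmoid units, the squared-error loss, the logical-OR toy data, and every name here (`forward`, `backward`, `sgd_step`, `train`, `OR_DATA`) are hypothetical choices made for readability.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_weights(rng):
    """Small random weights for a 2-input, 2-hidden-unit, 1-output network."""
    return {
        "W1": [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(2)],
        "b1": [0.0, 0.0],
        "W2": [rng.uniform(-1, 1) for _ in range(2)],
        "b2": 0.0,
    }


def forward(w, x):
    """Make a guess: return hidden activations h and output y."""
    h = [sigmoid(w["W1"][j][0] * x[0] + w["W1"][j][1] * x[1] + w["b1"][j])
         for j in range(2)]
    y = sigmoid(w["W2"][0] * h[0] + w["W2"][1] * h[1] + w["b2"])
    return h, y


def loss(w, x, t):
    """Measure how wrong the guess is: half the squared error."""
    _, y = forward(w, x)
    return 0.5 * (y - t) ** 2


def backward(w, x, t):
    """Pass the error backward: gradient of the loss for every weight."""
    h, y = forward(w, x)
    d_out = (y - t) * y * (1.0 - y)  # error signal at the output unit
    grads = {"W1": [], "b1": [],
             "W2": [d_out * h[j] for j in range(2)], "b2": d_out}
    for j in range(2):
        # Chain rule: the output error flows back through W2 to hidden unit j.
        d_hid = d_out * w["W2"][j] * h[j] * (1.0 - h[j])
        grads["W1"].append([d_hid * x[0], d_hid * x[1]])
        grads["b1"].append(d_hid)
    return grads


def sgd_step(w, grads, lr):
    """Nudge every weight a small step against its gradient."""
    for j in range(2):
        for i in range(2):
            w["W1"][j][i] -= lr * grads["W1"][j][i]
        w["b1"][j] -= lr * grads["b1"][j]
        w["W2"][j] -= lr * grads["W2"][j]
    w["b2"] -= lr * grads["b2"]


# Toy data (an assumption for illustration): logical OR of two inputs.
OR_DATA = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0),
           ((1.0, 0.0), 1.0), ((1.0, 1.0), 1.0)]


def train(data, epochs=2000, lr=0.5, seed=0):
    """Repeat guess -> measure -> backpropagate -> nudge over shuffled examples."""
    rng = random.Random(seed)
    w = init_weights(rng)
    before = sum(loss(w, x, t) for x, t in data)
    for _ in range(epochs):
        for x, t in rng.sample(data, len(data)):  # one example at a time
            sgd_step(w, backward(w, x, t), lr)
    after = sum(loss(w, x, t) for x, t in data)
    return w, before, after


if __name__ == "__main__":
    _, before, after = train(OR_DATA)
    print(f"total loss before: {before:.4f}, after: {after:.4f}")
```

Running the script prints the total error over the four examples before and after training. The `backward` function is the part that corresponds to backpropagation: it reuses the output error signal `d_out` to compute the gradient for the hidden-layer weights, rather than differentiating each weight from scratch.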

AlexNet ran exactly this idea. The AlexNet paper describes training with stochastic gradient descent, updating each weight using the derivative of the objective with respect to that weight.^[2] The new part in 2012 was scale, not the recipe. Hinton was an author on both papers, twenty-six years apart. What had changed in between was the amount of data and the speed of the hardware, which finally let an old learning method run on a network big enough to matter.

References

[1] Learning representations by back-propagating errors. Nature, vol. 323 (Rumelhart, Hinton, Williams, 1986)
[2] ImageNet Classification with Deep Convolutional Neural Networks. NeurIPS 2012 (Krizhevsky, Sutskever, Hinton)
