In recent years, Artificial Intelligence has received a lot of coverage. During the last months of my Bachelor’s, reading more and more about its successful applications, I became really curious about the area of AI called Deep Learning. Deep Learning is a sub-field of the more general Machine Learning, the field of Computer Science that “gives computers the ability to learn without being explicitly programmed” (Wikipedia) or, in other words, aims to design programs that can improve their performance by “learning” from examples (a better and simpler definition here). Deep Learning focuses on building Deep Neural Networks (DNNs), complex mathematical models that loosely mimic the behaviour of the human brain. The procedure that makes DNNs “intelligent” simulates the way people (and especially babies) learn: you perceive something, you make sense of it and “change your mind” accordingly. Really fascinating!

Samples of similar images that were used to train AlexNet. They are arranged by the similarity of the abstractions they contain.

Luckily, my university offers a Machine Learning course, so I had the opportunity to study and get hands-on with the things I was interested in. As part of a project, I was asked to implement and train a Convolutional Neural Network (ConvNet), a special type of neural network usually applied to images (they are used in your Prisma filters too!). You can think of this kind of network as the artificial counterpart of our vision system. ConvNets are composed of several layers of artificial neurons that learn various levels of abstraction: for example, when applied to an image of a person, the first layers of neurons get excited when they recognise simple patterns such as lines or edges; the immediately following layers use those edges to compose more complicated abstractions such as eyes, nose, mouth, hands, etc.; then, towards the last layers, the previous patterns are merged into a full abstraction of the person. To learn this “internal representation”, these models are simply shown a lot of images (each with its associated category) many times. Today, ConvNets are the standard in Computer Vision.
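To give an idea of what this looks like in practice, here is a tiny, purely illustrative ConvNet written with TensorFlow’s layers API. It is not the AlexNet from my repo, and the filter counts and kernel sizes are made up just to show the idea of stacking convolutional layers:

import tensorflow as tf

def tiny_convnet(images, num_classes=1000):
    # Purely illustrative layer sizes, not AlexNet's real configuration.
    # Early layers scan the raw pixels for simple patterns (lines, edges)...
    net = tf.layers.conv2d(images, filters=32, kernel_size=3, activation=tf.nn.relu)
    net = tf.layers.max_pooling2d(net, pool_size=2, strides=2)
    # ...deeper layers combine those patterns into more complex abstractions...
    net = tf.layers.conv2d(net, filters=64, kernel_size=3, activation=tf.nn.relu)
    net = tf.layers.max_pooling2d(net, pool_size=2, strides=2)
    # ...and the final layers merge everything into one score per category.
    net = tf.layers.flatten(net)
    return tf.layers.dense(net, num_classes)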

I implemented AlexNet, the first ConvNet to win the ILSVRC competition. I trained it by repeatedly showing it a set of 1.2 million images from 1000 different categories (the famous ImageNet dataset), more than forty times in total. The training took a few days and was stopped once the network reached a good level of generalisation, that is, the ability to predict the category of examples it has never seen. Once trained, given an input image, it outputs the five categories that best classify it. For example, feeding AlexNet the image below

gives the output:

AlexNet saw:
alp - score: 0.575796604156
church, church building - score: 0.0516746938229
valley, vale - score: 0.0432425364852
castle - score: 0.0284509658813
monastery - score: 0.0265731271356

where the categories are ordered by the probability (score) that the image belongs to that class.
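If you are wondering how that top-five list comes out of the network, the idea is simple: the raw scores (logits) produced by the last layer are turned into probabilities with a softmax, and the five highest ones are kept. Here is a toy sketch of that last step; the logits below are random numbers standing in for the real network output, and the repo’s actual code may be organised differently:

import tensorflow as tf

# A toy stand-in for the network's output: raw scores for one image over
# the 1000 ImageNet classes (in the real pipeline they come from AlexNet).
logits = tf.random_normal([1, 1000])

probabilities = tf.nn.softmax(logits)   # turn raw scores into values that sum to 1
top5 = tf.nn.top_k(probabilities, k=5)  # keep the 5 most likely classes

with tf.Session() as sess:
    values, indices = sess.run([top5.values, top5.indices])
    for score, index in zip(values[0], indices[0]):
        # The real code maps each index to its ImageNet label (e.g. "alp")
        print('class {} - score: {}'.format(index, score))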

I created a repository on GitHub with all the source code I used to implement, train and test AlexNet. I wrote it in Python using Google’s TensorFlow framework. If you’d like to dive deeper into the technical details, refer to the repo.

EDIT 20-11-2017: I added to the repository a version of the code that uses TensorFlow’s new imperative style, TensorFlow Eager. Unfortunately, I had neither the time nor the resources to train and test AlexNet with these new scripts, but I expect results similar to the old ones (the model architecture and training pipeline are unchanged). If anyone would like to take this on, it would be really appreciated!
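For the curious, the appeal of the imperative style is that operations run immediately and return concrete values, with no graph to build and no session to manage. A minimal, illustrative snippet (not taken from the repo; the exact call that enables eager mode depends on the TensorFlow version, as it lived under tf.contrib.eager in the releases available at the time):

import tensorflow as tf

tf.enable_eager_execution()  # in late-2017 releases: tf.contrib.eager.enable_eager_execution()

# With eager mode on, operations execute immediately and return concrete
# values, with no tf.Session needed:
x = tf.constant([[1.0, 2.0], [3.0, 4.0]])
print(tf.reduce_sum(x))  # prints a tensor holding 10.0 right away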