20 July 2018

Capsule Neural Networks (CNN) a Better alternative for Convolutional Neural Networks (CNN)





Capsule Neural Networks (CNN) a Better alternative


Geoffrey Hinton and his team published two papers that introduced a completely new type of neural network based on so-called capsules. In addition to that, the team published an algorithm, called dynamic routing between capsules, that allows to train such a network.

Introduction to Capsule Networks (CapsNets)


For everyone in the deep learning community, this is huge news, and for several reasons. First of all, Hinton is one of the founders of deep learning and an inventor of numerous models and algorithms that are widely used today. Secondly, these papers introduce something completely new, and this is very exciting because it will most likely stimulate additional wave of research and very cool applications.


Capsule Neural Networks


What is a CapsNet or Capsule Network?

Introduction to How Faster R-CNN, Fast R-CNN and R-CNN Works


Faster R-CNN Architecture

How RPN (Region Proposal Networks) Works


What is a Capsule Network? What is a Capsule? Is CapsNet better than a Convolutional Neural Network (CNN)? In this article I will talk about all the above questions about CapsNet or Capsule Network released by Hinton.
Note: This article is not about pharmaceutical capsules. It is about Capsules in Neural Networks or Machine Learning world.
There is an expectation from you as a reader. You need to be aware of CNNs. If not, I would like you to go through this article on Hackernoon. Next I will run through a small recap of relevant points of CNN. That way you can easily grab on to the comparison done below. So without further ado lets dive in.

CNN are essentially a system where we stack a lot of neurons together. These networks have been proven to be exceptionally great at handling image classification problems. It would be hard to have a neural network map out all the pixels of an image since it‘s computationally really expensive. So convolutional is a method which helps you simplify the computation to a great extent without losing the essence of the data. Convolution is basically a lot of matrix multiplication and summation of those results.

Capsule Networks Are Shaking up AI — An Introduction


After an image is fed to the network, a set of kernels or filters scan it and perform the convolution operation. This leads to creation of feature maps inside the network. These features next pass via activation layer and pooling layers in succession and then based on the number of layers in the network this continues. Activation layers are required to induce a sense of non linearity in the network (eg: ReLU). Pooling (eg: max pooling) helps in reducing the training time. The idea of pooling is that it creates “summaries” of each sub-region. It also gives you a little bit of positional and translational invariance in object detection. At the end of the network it will pass via a classifier like softmax classifier which will give us a class. Training happens based on back propagation of error matched against some labelled data. Non linearity also helps in solving the vanishing gradient in this step.

What is the problem with CNNs?

CNNs perform exceptionally great when they are classifying images which are very close to the data set. If the images have rotation, tilt or any other different orientation then CNNs have poor performance. This problem was solved by adding different variations of the same image during training. In CNN each layer understands an image at a much more granular level. Lets understand this with an example. If you are trying to classify ships and horses. The innermost layer or the 1st layer understands the small curves and edges. The 2nd layer might understand the straight lines or the smaller shapes, like the mast of a ship or the curvature of the entire tail. Higher up layers start understanding more complex shapes like the entire tail or the ship hull. Final layers try to see a more holistic picture like the entire ship or the entire horse. We use pooling after each layer to make it compute in reasonable time frames. But in essence it also loses out the positional data.

Pooling helps in creating the positional invariance. Otherwise CNNs would fit only for images or data which are very close to the training set. This invariance also leads to triggering false positive for images which have the components of a ship but not in the correct order. So the system can trigger the right to match with the left in the above image. You as an observer clearly see the difference. The pooling layer also adds this sort of invariance.

Depthwise Separable Convolution - A FASTER CONVOLUTION!


This was never the intention of pooling layer. What the pooling was supposed to do is to introduce positional, orientational, proportional invariances. But the method we use to get this uses is very crude. In reality it adds all sorts of positional invariance. Thus leading to the dilemma of detecting right ship in image 2.0 as a correct ship. What we needed was not invariance but equivariance. Invariance makes a CNN tolerant to small changes in the viewpoint. Equivariance makes a CNN understand the rotation or proportion change and adapt itself accordingly so that the spatial positioning inside an image is not lost. A ship will still be a smaller ship but the CNN will reduce its size to detect that. This leads us to the recent advancement of Capsule Networks.


Hinton himself stated that the fact that max pooling is working so well is a big mistake and a disaster:

Hinton: “The pooling operation used in convolutional neural networks is a big mistake and the fact that it works so well is a disaster.”

Of course, you can do away with max pooling and still get good results with traditional CNNs, but they still do not solve the key problem:

Internal data representation of a convolutional neural network does not take into account important spatial hierarchies between simple and complex objects.

In the example of a Dog, a mere presence of 2 eyes, a mouth and a nose in a picture does not mean there is a face, we also need to know how these objects are oriented relative to each other.

What is a Capsule Network?

Every few days there is an advancement in the field of Neural Networks. Some brilliant minds are working on this field. You can pretty much assume every paper on this topic is almost ground breaking or path changing. Sara Sabour, Nicholas Frost and Geoffrey Hinton released a paper titled “Dynamic Routing Between Capsules” 4 days back. Now when one of the Godfathers of Deep Learning “Geoffrey Hinton” is releasing a paper it is bound to be ground breaking. The entire Deep Learning community is going crazy on this paper as you read this article. So this paper talks about Capsules, CapsNet and a run on MNIST. MNIST is a database of tagged handwritten digit images. Results are showing a significant increase in performance in case of overlapped digits. The paper compares to the current state-of-the-art CNNs. In this paper the authors project that human brain have modules called “capsules”. These capsules are particularly good at handling different types of visual stimulus and encoding things like pose (position, size, orientation), deformation, velocity, albedo, hue, texture etc. The brain must have a mechanism for “routing” low level visual information to what it believes is the best capsule for handling it.

Capsule Networks



Capsule is a nested set of neural layers. So in a regular neural network you keep on adding more layers. In CapsNet you would add more layers inside a single layer. Or in other words nest a neural layer inside another. The state of the neurons inside a capsule capture the above properties of one entity inside an image. A capsule outputs a vector to represent the existence of the entity. The orientation of the vector represents the properties of the entity. The vector is sent to all possible parents in the neural network. For each possible parent a capsule can find a prediction vector. Prediction vector is calculated based on multiplying it’s own weight and a weight matrix. Whichever parent has the largest scalar prediction vector product, increases the capsule bond. Rest of the parents decrease their bond. This routing by agreement method is superior than the current mechanism like max-pooling. Max pooling routes based on the strongest feature detected in the lower layer. Apart from dynamic routing, CapsNet talks about adding squashing to a capsule. Squashing is a non-linearity. So instead of adding squashing to each layer like how you do in CNN, you add the squashing to a nested set of layers. So the squashing function gets applied to the vector output of each capsule.

Why Deep Learning Works: Self Regularization in Deep Neural Networks



The paper introduces a new squashing function. You can see it in image 3.1. ReLU or similar non linearity functions work well with single neurons. But the paper found that this squashing function works best with capsules. This tries to squash the length of output vector of a capsule. It squashes to 0 if it is a small vector and tries to limit the output vector to 1 if the vector is long. The dynamic routing adds some extra computation cost. But it definitely gives added advantage.

Now we need to realise that this paper is almost brand new and the concept of capsules is not throughly tested. It works on MNIST data but it still needs to be proven against much larger dataset across a variety of classes. There are already (within 4 days) updates on this paper who raise the following concerns:
1. It uses the length of the pose vector to represent the probability that the entity represented by a capsule is present. To keep the length less than 1 requires an unprincipled non-linearity that prevents there from being any sensible objective function that is minimized by the iterative routing procedure.
2. It uses the cosine of the angle between two pose vectors to measure their agreement for routing. Unlike the log variance of a Gaussian cluster, the cosine is not good at distinguishing between quite good agreement and very good agreement.
3. It uses a vector of length n rather than a matrix with n elements to represent a pose, so its transformation matrices have n 2 parameters rather than just n.

The current implementation of capsules has scope for improvement. But we should also keep in mind that the Hinton paper in the first place only says:

The aim of this paper is not to explore this whole space but to simply show that one fairly straightforward implementation works well and that dynamic routing helps.


Capsule Neural Networks: The Next Neural Networks?  CNNs and their problems.
Convolutional (‘regular’) Neural Networks are the latest hype in machine learning, but they have their flaws. Capsule Neural Networks are the recent development from Hinton which help us solve some of these issues.

Neural Networks may be the hottest field in Machine Learning. In recent years, there were many new developments improving neural networks and building making them more accessible. However, they were mostly incremental, such as adding more layers or slightly improving the activation function, but did not introduce a new type of architecture or topic.

Geoffery Hinton is one of the founding fathers of many highly utilized deep learning algorithms including many developments to Neural Networks — no wonder, for having Neurosciences and Artificial Intelligence background.

Capsule Networks: An Improvement to Convolutional Networks


Neural Networks may be the hottest field in Machine Learning. In recent years, there were many new developments improving neural networks and building making them more accessible. However, they were mostly incremental, such as adding more layers or slightly improving the activation function, but did not introduce a new type of architecture or topic.

Geoffery Hinton is one of the founding fathers of many highly utilized deep learning algorithms including many developments to Neural Networks — no wonder, for having Neurosciences and Artificial Intelligence background.

 At late October 2017, Geoffrey Hinton, Sara Sabour, and Nicholas Frosst Published a research paper under Google Brain named “Dynamic Routing Between Capsules”, introducing a true innovation to Neural Networks. This is exciting, since such development has been long awaited for, will likely spur much more research and progress around it, and is supposed to make neural networks even better than they are now.

Capsule networks: overview

The Baseline: Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are extremely flexible machine learning models which were originally inspired by principles from how our brains are theorized to work.
Neural Networks utilize layers of “neurons” to process raw data into patterns and objects.
The primary building blocks of a Neural Network is a “Convolutional” layer (hence the name). What does it do? It takes raw information from a previous layer, makes sense of patterns in it, and send it onward to the next layer to make sense of a larger picture.

 If you are new to neural networks and want to understand it, I recommend:

  • Watching the animated videos by 3Blue1Brown.
  • For a more detailed textual/visual guide, you can check out this beginner’s blogpost
  • If you can deal with some more math and greater details, you can read instead this guide from CS231 at Stanford. 


 In case you didn’t do any of the above, and plan to continue, here is a hand-wavy brief overview.

The Intuition Behind Convolutional Neural Networks

Let’s start from the beginning.
 The Neural Net receives raw input data. Let’s say it’s a doodle of a dog. When you see a dog, you brain automatically detects it’s a dog. But to the computer, the image is really just an array of numbers representing the colors intensity in the colors channels. Let’s say it’s just a Black&White doodle, so we can represent it with one array where each cell represents the brightness of the pixel from black to white.

Understanding Convolutional Neural Networks.

Convolutional Layers. The first convolutional layer maps the image space to a lower space — summarizing what’s happening in each group of, say 5x5 pixels — is it a vertical line? horizontal line? curve of what shape? This happens with element wise multiplication and then summation of all the values in the filter with the original filter value and summing up to a single number.

This leads to the Neuron, or convolutional filters. Each Filter / Neuron is designed to react to one specific form (a vertical line? a horizontal line? etc…). The groups of pixels from layer 1 reach these neurons, and lights up the neurons that match its structure according to how much this slice is similar to what the neuron looks for.

Activation (usually “ReLU”) Layers — After each convolutional layer, we apply a nonlinear layer (or activation layer), which introduces non-linearity to the system, enabling it to discover also nonlinear relations in the data. ReLU is a very simple one: making any negative input to 0, or if it’s positive — keeping it the same. ReLU(x) = max(0,x).

Pooling Layers. This allows to reduce “unnecessary” information, summarize what we know about a region, and continue to refine information. For example, this might be “MaxPooling” where the computer will just take the highest value of the passed this patch — so that the computer knows “around these 5x5 pixels, the most dominant value is 255. I don’t know exactly in which pixel but the exact location isn’t as important as that it’s around there. → Notice: This is not good. We loose information here. Capsule Networks don’t have this operation here, which is an improvement.

Dropout Layers. This layer “drops out” a random set of activations in that layer by setting them to zero. This makes the network more robust (kind of like you eating dirt builds up your immunity system, the network is more immune to small changes) and reduces overfitting. This is only used when training the network.

Last Fully Connected Layer. For a classification problem, we want each final neuron represents the final class. It looks at the output of the previous layer (which as we remember should represent the activation maps of high level features) and determines which features most correlate to a particular class.

SoftMax — This layer is sometimes added as a another way to represent the outputs per classes that we can later pass on in a loss function. Softmax represents the distribution of probabilities to the various categories.
Usually, there are more layers which provide nonlinearities and preservation of dimensions (like padding with 0’s around the edges) that help to improve the robustness of the network and control overfitting. But these are the basics you need to understand what comes after.

Capsule Network



Usually, there are more layers which provide nonlinearities and preservation of dimensions (like padding with 0’s around the edges) that help to improve the robustness of the network and control overfitting. But these are the basics you need to understand what comes after.

 Now, importantly, these layers are connected only SEQUENTIALLY. This is in contrast to the structure of capsule networks.


What is The Problem With Convolutional Neural Networks?

If this interests you, watch Hinton's lecture explaining exactly what it wrong with them. Below you'll get a couple of key points that are improved by Capsule Networks.

Hinton says that they have too few levels of substructures (nets are composed from layers composed from neurons, that's it); and that we need to group the neurons in each layer into “capsules”, like mini-columns, that do a lot of internal computations, and then output a summary result.

Problems with CNNs and Introduction to capsule neural networks

Problem #1: Pooling looses information

CNN use “pooling” or equivalent methods to “summarize” what's going on in the smaller regions and make sense of larger and larger chunks of the image. This was a solution that made CNNs work well, but it looses valuable information.

 Capsule networks will compute a pose (transnational and rotational) relationship between smaller features to make up a larger feature.
 This loss of information leads to loss of spatial information.

Problem #2: CNNs don't account for the spatial relations between the parts of the image. Therefore, they also are too sensitive to orientation.

Subsampling (and pooling) loses the precise spatial relationships between higher-level parts like a nose and a mouth. The precise spatial relationships are needed for identity recognition.

(Hinton, 2012, in his lecture).

Geoffrey Hinton Capsule theory



 CNNs don't account for spatial relationships between the underlying objects. By having these flat layers of neurons that light up according to which objects they've seen, they recognize the presence of such objects. But then they are passed on to other activation and pooling layers and on to the next layer of neurons (filters), without recognizing what are the relations between these objects we identified in that single layer.
 They just account for their presence.

Hinton: Dynamic Routing Between Capsules


So a (simplistic) Neural network will not hesitate about categorizing both these dogs, Pablo and Picasso, as similarly good representations of “corgi-pit-bull-terrier mix”.

Capsule Networks (CapsNets) – Tutorial

Problem #3: CNNs can't transfer their understanding of geometric relationships to new viewpoints.

This makes them more sensitive to the original image itself in order to classify images as the same category.

 CNNs are great for solving problems with data similar to what they have been trained on. It can classify images or objects within them which are very close to things it has seen before.

 But if the object is slightly rotated, photographed from a slightly different angle, especially in 3D, is tilted or in another orientation than what the CNN has seen - the network won't recognize it well.

 One solution is to artificially create tilted representation of the image or groups and add them to the “training” set. However, this still lacks a fundamentally more robust structure.


What is a Capsule?

Capsule Networks Explained in detail ! (Deep learning)



In order to answer this question, I think it is a good idea to refer to the first paper where capsules were introduced — “Transforming Autoencoders” by Hinton et al. The part that is important to understanding of capsules is provided below:

“Instead of aiming for viewpoint invariance in the activities of “neurons” that use a single scalar output to summarize the activities of a local pool of replicated feature detectors, artificial neural networks should use local “capsules” that perform some quite complicated internal computations on their inputs and then encapsulate the results of these computations into a small vector of highly informative outputs. Each capsule learns to recognize an implicitly defined visual entity over a limited domain of viewing conditions and deformations and it outputs both the probability that the entity is present within its limited domain and a set of “instantiation parameters” that may include the precise pose, lighting and deformation of the visual entity relative to an implicitly defined canonical version of that entity. When the capsule is working properly, the probability of the visual entity being present is locally invariant — it does not change as the entity moves over the manifold of possible appearances within the limited domain covered by the capsule. The instantiation parameters, however, are “equivariant” — as the viewing conditions change and the entity moves over the appearance manifold, the instantiation parameters change by a corresponding amount because they are representing the intrinsic coordinates of the entity on the appearance manifold.”

The paragraph above is very dense, and it took me a while to figure out what it means, sentence by sentence. Below is my version of the above paragraph, as I understand it:

Artificial neurons output a single scalar. In addition, CNNs use convolutional layers that, for each kernel, replicate that same kernel’s weights across the entire input volume and then output a 2D matrix, where each number is the output of that kernel’s convolution with a portion of the input volume. So we can look at that 2D matrix as output of replicated feature detector. Then all kernel’s 2D matrices are stacked on top of each other to produce output of a convolutional layer.

Then, we try to achieve viewpoint invariance in the activities of neurons. We do this by the means of max pooling that consecutively looks at regions in the above described 2D matrix and selects the largest number in each region. As result, we get what we wanted — invariance of activities. Invariance means that by changing the input a little, the output still stays the same. And activity is just the output signal of a neuron. In other words, when in the input image we shift the object that we want to detect by a little bit, networks activities (outputs of neurons) will not change because of max pooling and the network will still detect the object.

Dynamic routing between capsules


The above described mechanism is not very good, because max pooling loses valuable information and also does not encode relative spatial relationships between features. We should use capsules instead, because they will encapsulate all important information about the state of the features they are detecting in a form of a vector (as opposed to a scalar that a neuron outputs).

[PR12] Capsule Networks - Jaejun Yoo



Capsules encapsulate all important information about the state of the feature they are detecting in vector form.

Capsules encode probability of detection of a feature as the length of their output vector. And the state of the detected feature is encoded as the direction in which that vector points to (“instantiation parameters”). So when detected feature moves around the image or its state somehow changes, the probability still stays the same (length of vector does not change), but its orientation changes.

t.1 – Capsules and routing techniques (part 1/2)



Imagine that a capsule detects a face in the image and outputs a 3D vector of length 0.99. Then we start moving the face across the image. The vector will rotate in its space, representing the changing state of the detected face, but its length will remain fixed, because the capsule is still sure it has detected a face. This is what Hinton refers to as activities equivariance: neuronal activities will change when an object “moves over the manifold of possible appearances” in the picture. At the same time, the probabilities of detection remain constant, which is the form of invariance that we should aim at, and not the type offered by CNNs with max pooling.

PR-012: Faster R-CNN : Towards Real-Time Object Detection with Region Proposal Networks


More Information:

“Understanding Dynamic Routing between Capsules (Capsule Networks)”
https://jhui.github.io/2017/11/03/Dynamic-Routing-Between-Capsules/


Understanding Hinton’s Capsule Networks. Part I: Intuition.   https://medium.com/ai³-theory-practice-business/understanding-hintons-capsule-networks-part-i-intuition-b4b559d1159b



Understanding Capsule Networks — AI’s Alluring New Architecture.  https://medium.freecodecamp.org/understanding-capsule-networks-ais-alluring-new-architecture-bdb228173ddc



What is a CapsNet or Capsule Network?   https://hackernoon.com/what-is-a-capsnet-or-capsule-network-2bfbe48769cc


https://en.wikipedia.org/wiki/Capsule_neural_network


https://en.wikipedia.org/wiki/Convolutional_neural_network


A “weird” introduction to Deep Learning.  https://towardsdatascience.com/a-weird-introduction-to-deep-learning-7828803693b0


Faster R-CNN Explained   https://medium.com/@smallfishbigsea/faster-r-cnn-explained-864d4fb7e3f8


A Brief History of CNNs in Image Segmentation: From R-CNN to Mask R-CNN   https://blog.athelas.com/a-brief-history-of-cnns-in-image-segmentation-from-r-cnn-to-mask-r-cnn-34ea83205de4


Convolutional Neural Network (CNN)   https://skymind.ai/wiki/convolutional-network


Capsule Neural Networks: The Next Neural Networks? Part 1: CNNs and their problems  https://towardsdatascience.com/capsule-neural-networks-are-here-to-finally-recognize-spatial-relationships-693b7c99b12


Understanding Hinton’s Capsule Networks. Part II: How Capsules Work    https://medium.com/ai³-theory-practice-business/understanding-hintons-capsule-networks-part-ii-how-capsules-work-153b6ade9f66


Understanding Hinton’s Capsule Networks. Part III: Dynamic Routing Between Capsules   https://medium.com/ai³-theory-practice-business/understanding-hintons-capsule-networks-part-iii-dynamic-routing-between-capsules-349f6d30418


Understanding Hinton’s Capsule Networks. Part IV: CapsNet Architecture   https://medium.com/@pechyonkin/part-iv-capsnet-architecture-6a64422f7dce



Share:

0 reacties:

Post a Comment