{{Short description|Type of artificial neural network}}
A '''capsule neural network''' ('''CapsNet''') is a type of [[artificial neural network]] (ANN) intended to better model hierarchical part-whole relationships. The idea is to add structures called "capsules" to a [[convolutional neural network]] (CNN), and to reuse output from several of those capsules to form more stable (with respect to various perturbations) representations for higher capsules.<ref name=":1"/> The output of a capsule is a vector consisting of the probability of an observation and a pose for that observation.
Among other benefits, capsnets address the "Picasso problem" in image recognition: images that have all the right parts but that are not in the correct spatial relationship (e.g., in a "face", the positions of the mouth and one eye are switched). For image recognition, capsnets exploit the fact that while viewpoint changes have nonlinear effects at the pixel level, they have linear effects at the part/object level.<ref name=":16">{{cite web|url=http://www.cedar.buffalo.edu/~srihari/CSE676/9.12%20CapsuleNets.pdf|title=Capsule Nets|last=Srihari|first=Sargur|publisher=[[University of Buffalo]]|access-date=2017-12-07}}</ref> This can be compared to inverting the rendering of an object of multiple parts.<ref name=":0">{{Cite book|url=http://papers.nips.cc/paper/1710-learning-to-parse-images.pdf|title=Advances in Neural Information Processing Systems 12}}</ref>
{{TOC limit|3}}
== History ==
In 2000, [[Geoffrey Hinton]] et al. described an imaging system that combined segmentation and recognition into a single inference process using parse trees.<ref name=":0" />
A dynamic routing mechanism for capsule networks was introduced by Hinton and his team in 2017. The approach was claimed to reduce error rates on [[MNIST database|MNIST]] and to reduce training set sizes. Results were claimed to be considerably better than a CNN on highly overlapped digits.<ref name=":1"/>
In Hinton's original idea, one minicolumn would represent and detect one multidimensional entity.<ref>{{Citation|last=Meher Vamsi|title=Geoffrey Hinton Capsule theory|date=2017-11-15|url=https://www.youtube.com/watch?v=6S1_WqE55UQ}}</ref>
== Transformations ==
An invariant is an object property that does not change as a result of a transformation; for example, the area of a circle is unchanged if the circle is shifted. An equivariant is a property that changes predictably under a transformation; for example, the center of a circle moves by the same amount as the circle when it is shifted. A nonequivariant is a property whose value does not change predictably under a transformation. For example, transforming a circle into an ellipse means that its perimeter can no longer be computed as π times the diameter.
In computer vision, the class of an object is expected to be invariant over many transformations: a cat is still a cat if it is shifted, turned upside down or shrunk. However, many other properties are instead equivariant; for example, the volume of a cat changes when it is scaled.
Equivariant properties such as a spatial relationship are captured in a ''pose'', data that describes an object's [[Translation (geometry)|translation]], [[Rotation (mathematics)|rotation]], scale and reflection. Translation is a change in ___location in one or more dimensions. Rotation is a change in orientation. Scale is a change in size. Reflection is a mirror image.<ref name=":1" />
[[Unsupervised learning|Unsupervised]] capsnets learn a global [[Affine space|linear manifold]] between an object and its pose as a matrix of weights. In other words, capsnets can identify an object independent of its pose, rather than having to learn to recognize the object while including its spatial relationships as part of the object. In capsnets, the pose can incorporate properties other than spatial relationships, e.g., color (cats can be of various colors).
Multiplying the object's representation by the manifold poses the object in space.<ref>{{Cite web|url=https://kndrck.co/posts/capsule_networks_explained/|title=Capsule Networks Explained|last=Tan|first=Kendrick|date=November 10, 2017|website=kndrck.co|language=en}}</ref>
== Pooling ==
Capsnets reject the [[Pooling layer|pooling]] strategy of conventional CNNs, which reduces the amount of detail to be processed at the next higher layer. Capsnet proponents argue that pooling:
* violates biological shape perception in that it has no intrinsic coordinate frame;
* provides invariance (discarding positional information) instead of equivariance (disentangling that information);
A capsule is a set of neurons that individually activate for various properties of a type of object, such as position, size and hue. Formally, a capsule is a set of neurons that collectively produce an ''activity vector'' with one element for each neuron to hold that neuron's instantiation value (e.g., hue).<ref name=":1"/> Graphics programs use instantiation value to draw an object. Capsnets attempt to derive these from their input. The probability of the entity's presence in a specific input is the vector's length, while the vector's orientation quantifies the capsule's properties.<ref name=":1"/><ref name=":16"/>
[[Artificial neuron]]s traditionally output a scalar, real-valued activation that loosely represents the probability of an observation. Capsnets replace scalar-output feature detectors with vector-output capsules and max-pooling with routing-by-agreement.<ref name=":1"/>
Because capsules are independent, when multiple capsules agree, the probability of correct detection is much higher. A minimal cluster of two capsules considering a six-dimensional entity would agree within 10% by chance only once in a million trials. As the number of dimensions increases, the likelihood of a chance agreement across a larger cluster with higher dimensions decreases exponentially.<ref name=":1"/>
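For example, if two capsules agree within 10% in any single dimension by chance with probability about <math display="inline">0.1</math>, and the six dimensions are independent, the chance of agreement in all of them is roughly
<math display="block">0.1^{6} = 10^{-6},</math>
that is, about once in a million trials.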
The outputs from one capsule (child) are routed to capsules in the next layer (parent) according to the child's ability to predict the parents' outputs. Over the course of a few iterations, each parent's outputs may converge with the predictions of some children and diverge from those of others, meaning that that parent is present in or absent from the scene.<ref name=":1" />
For each possible parent, each child computes a prediction vector by multiplying its output by a weight matrix (trained by [[backpropagation]]).<ref name=":16"/> Next the output of the parent is computed as the [[Dot product|scalar product]] of a prediction with a coefficient representing the probability that this child belongs to that parent. A child whose predictions are relatively close to the resulting output successively increases the coefficient between that parent and child and decreases it for parents that it matches less well. This increases the contribution that that child makes to that parent, thus increasing the scalar product of the child's prediction with the parent's output.
The coefficients' initial logits are the log prior probabilities that a child belongs to a parent. The priors can be trained discriminatively along with the weights. The priors depend on the ___location and type of the child and parent capsules, but not on the current input. At each iteration, the coefficients are adjusted via a "routing" [[Softmax function|softmax]] so that they continue to sum to 1 (to express the probability that a given capsule is the parent of a given child). Softmax amplifies larger values and diminishes smaller values beyond their proportion of the total. Similarly, the probability that a feature is present in the input is exaggerated by a nonlinear "squashing" function that reduces values (smaller ones drastically and larger ones such that they are less than 1).<ref name=":16"/>
The pose vector <math display="inline">\mathbf{u}_{i}</math> is rotated and translated by a matrix <math display="inline">\mathbf{W}_{ij}</math> into a vector <math display="inline">\mathbf{\hat{u}}_{j|i}</math> that predicts the output of the parent capsule.
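In symbols, the prediction ("vote") of child capsule <math display="inline">i</math> for parent capsule <math display="inline">j</math> is
<math display="block">\mathbf{\hat{u}}_{j|i} = \mathbf{W}_{ij}\, \mathbf{u}_{i}.</math>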
Capsules <math display="inline">s_{j}</math> in the next higher level are fed the sum of the predictions from all capsules in the lower layer, each with a coupling coefficient <math display="inline">c_{ij}</math>.
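Written out, the total input to a parent capsule is the weighted sum
<math display="block">\mathbf{s}_{j} = \sum_{i} c_{ij}\, \mathbf{\hat{u}}_{j|i},</math>
and its squashed value <math display="inline">\mathbf{v}_{j} = \operatorname{squash}(\mathbf{s}_{j})</math> becomes the parent's output (see the procedures below).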
==== Procedure softmax ====
The coupling coefficients from a capsule <math display="inline">i</math> in layer <math display="inline">l</math> to all capsules in layer <math display="inline">l+1</math> sum to one, and are defined by a "[[Softmax function|routing softmax]]". The initial [[logit]]s <math display="inline">b_{ij}</math> are the log prior probabilities that capsule <math display="inline">i</math> should couple to capsule <math display="inline">j</math>.
<math display="block">\begin{array}{lcl}
1: \mathbf{procedure}~ \mathrm{softmax} ( \mathbf{b}, i ) \\
2: \quad \triangleright \mbox{argument matrix} \\
3: \quad c_{ij} \gets \dfrac{\exp ( b_{ij} )}{\sum_{k} \exp ( b_{ik} )} \\
4: \quad \mathbf{return}~ \mathbf{c}_{i} \\
\end{array}</math>
==== Procedure squash ====
Because the length of the vectors represents probabilities, they should be between zero and one, and to do that a squashing function is applied:<ref name=":1"/>
<math display="block">\begin{array}{lcl}
1: \mathbf{procedure}~ \mathrm{squash} ( \mathbf{a} ) \\
2: \quad \triangleright \mbox{argument vector} \\
3: \quad \mathbf{return}~ \dfrac{ \| \mathbf{a} \|^{2} }{ 1 + \| \mathbf{a} \|^{2} } \cdot \dfrac{ \mathbf{a} }{ \| \mathbf{a} \| } \\
\end{array}</math>
One approach to routing is the following:<ref name=":1"/>
<math display="block">\begin{array}{lcl}
~~1: \mathbf{procedure}~ \mathrm{routing} ( \mathbf{\hat{u}}_{j|i}, r, l ) \\
~~2: \quad \triangleright \mbox{argument vector} \\
~~3: \quad \triangleright \mbox{argument: number of routing iterations} \\
~~4: \quad \triangleright \mbox{argument: layer number} \\
~~5: \quad \triangleright \mbox{returns: output vectors of the capsules in layer } (l+1) \\
~~6: \quad \mbox{for all capsules } i \mbox{ in layer } l \mbox{ and capsules } j \mbox{ in layer } (l+1): b_{ij} \gets 0 \\
~~7: \quad \mathbf{for}~ r ~\mbox{iterations}~ \mathbf{do} \\
~~8: \qquad \mbox{for all capsules } i \mbox{ in layer } l: \mathbf{c}_{i} \gets \mathrm{softmax} ( \mathbf{b}, i ) \\
~~9: \qquad \mbox{for all capsules } j \mbox{ in layer } (l+1): \mathbf{s}_{j} \gets \textstyle\sum_{i} c_{ij} \mathbf{\hat{u}}_{j|i} \\
10: \qquad \mbox{for all capsules } j \mbox{ in layer } (l+1): \mathbf{v}_{j} \gets \mathrm{squash} ( \mathbf{s}_{j} ) \\
11: \qquad \mbox{for all capsules } i \mbox{ in layer } l \mbox{ and capsules } j \mbox{ in layer } (l+1): b_{ij} \gets b_{ij} + \mathbf{\hat{u}}_{j|i} \cdot \mathbf{v}_{j} \\
12: \quad \mathbf{return}~ \mathbf{v}_{j} \\
\end{array}</math>
At line 8, the softmax function can be replaced by any type of [[Winner-take-all (computing)|winner-take-all]] network. Biologically this somewhat resembles [[chandelier cell]]s.
At line 9, the weight matrix for the coupling coefficients and the hidden prediction matrix are shown. The structure in layer I and II is somewhat similar to the [[cerebral cortex]] if [[stellate cell]]s are assumed to be involved in transposing input vectors. Whether both types of stellate cells have the same function is not clear, as layer I has excitatory spiny cells and layer II has inhibitory aspiny cells. The latter indicates a much different network.
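For illustration, the routing procedure above can be sketched in a few lines of NumPy. This is an illustrative re-implementation rather than the authors' code; the array shapes (children × parents × parent dimension), the zero initialisation of the logits and the helper names are assumptions made for the example.
<syntaxhighlight lang="python">
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    """Shrink a vector to length < 1 while preserving its orientation."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def routing(u_hat, r=3):
    """Dynamic routing-by-agreement.

    u_hat: prediction vectors, shape (n_child, n_parent, dim_parent).
    r: number of routing iterations.
    Returns the parent output vectors v, shape (n_parent, dim_parent).
    """
    n_child, n_parent, _ = u_hat.shape
    b = np.zeros((n_child, n_parent))          # routing logits (log priors), here initialised to zero
    for _ in range(r):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # softmax over parents
        s = np.einsum('ij,ijk->jk', c, u_hat)                 # weighted sum of predictions
        v = squash(s)                                         # parent outputs
        b = b + np.einsum('ijk,jk->ij', u_hat, v)             # agreement raises the logits
    return v

# e.g. 1152 primary capsules voting for 10 class capsules of dimension 16
v = routing(np.random.randn(1152, 10, 16))
</syntaxhighlight>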
=== Margin loss ===
The length of the instantiation vector represents the probability that a capsule's entity is present in the scene. A separate margin loss <math display="inline">L_{k}</math> is computed for each class capsule <math display="inline">k</math>, so that the network can learn to recognize multiple classes in the same image:<ref name=":1"/>
<math display="block">\begin{array}{lcl}
L_{k} & = \underbrace{T_{k} ~ { \max \left ( 0, m^{+} - \| \mathbf{v}_{k} \| \right )}^{2}}_\mbox{class present}
+ \underbrace{\lambda \left ( 1 - T_{k} \right ) ~ { \max \left ( 0, \| \mathbf{v}_{k} \| - m^{-} \right )}^{2}}_\mbox{class not present}
\end{array}</math>
where <math display="inline">T_{k} = 1</math> if and only if class <math display="inline">k</math> is present, and the original paper set the margins and down-weighting factor to <math display="inline">m^{+} = 0.9</math>, <math display="inline">m^{-} = 0.1</math> and <math display="inline">\lambda = 0.5</math>.<ref name=":1"/> The total loss is the sum of the losses of all classes.
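The margin loss is straightforward to compute; a short NumPy sketch (illustrative only, with variable names chosen for the example) is:
<syntaxhighlight lang="python">
import numpy as np

def margin_loss(v, targets, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Margin loss summed over class capsules.

    v: output capsule vectors, shape (n_classes, dim).
    targets: one-hot vector T_k of the classes present in the image.
    The constants m_plus, m_minus and lam follow the values quoted above.
    """
    lengths = np.linalg.norm(v, axis=-1)       # ||v_k||, interpreted as class probabilities
    present = targets * np.maximum(0.0, m_plus - lengths) ** 2
    absent = lam * (1.0 - targets) * np.maximum(0.0, lengths - m_minus) ** 2
    return np.sum(present + absent)
</syntaxhighlight>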
== Example configuration ==<!-- Article states that there are 256 x 81 activations. That seems to be wrong. Isn't it 256 x 20 x 20? And the weight matrix. Isn't it [32x6x6] x 10 instead of 8x16? -->
The first convolutional layers perform feature extraction. For the 28x28 pixel MNIST image test, an initial layer of 256 9x9 [[pixel]] convolutional [[Kernel (statistics)|kernels]] (using stride 1 and [[Rectifier (neural networks)|rectified linear unit]] (ReLU) activation, defining 20x20 [[Receptive field|receptive fields]]) converts the pixel input into 1D feature activations and induces nonlinearity.<ref name=":1"/>
The primary (lowest) capsule layer divides the 256 kernels into 32 capsules of 8 9x9 kernels each (using stride 2, defining 6x6 receptive fields). Capsule activations effectively invert the graphics rendering process, going from pixels to features. A single weight matrix is used by each capsule across all receptive fields. Each primary capsule sees all of the lower-layer outputs whose fields overlap with the center of the field in the primary layer. Each primary capsule output (for a particular field) is an 8-dimensional vector.<ref name=":1"/><ref name=":16"/>
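A minimal sketch of these first two layers in PyTorch, using the shapes described above (illustrative only; the class and variable names are chosen for the example, and the squashing and routing steps are omitted):
<syntaxhighlight lang="python">
import torch
from torch import nn

class PrimaryCapsules(nn.Module):
    """Convolutional feature extraction plus primary capsule layer for 28x28 MNIST input."""
    def __init__(self):
        super().__init__()
        # 256 9x9 kernels, stride 1, ReLU -> 256 feature maps of 20x20
        self.conv1 = nn.Conv2d(1, 256, kernel_size=9, stride=1)
        # 32 capsules x 8 dimensions = 256 channels, 9x9 kernels, stride 2 -> 6x6 grid
        self.conv2 = nn.Conv2d(256, 32 * 8, kernel_size=9, stride=2)

    def forward(self, x):                      # x: (batch, 1, 28, 28)
        x = torch.relu(self.conv1(x))          # (batch, 256, 20, 20)
        x = self.conv2(x)                      # (batch, 256, 6, 6)
        u = x.view(x.size(0), 32 * 6 * 6, 8)   # 1152 primary capsule vectors of length 8
        return u                               # squashing would be applied before routing
</syntaxhighlight>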
Capsnets are hierarchical, in that each lower-level capsule contributes significantly to only one higher-level capsule.<ref name=":1"/>
However, replicating learned knowledge remains valuable. To achieve this, a capsnet's lower layers are [[convolution]]al.<ref name=":1"/>
== Human vision ==
Human vision examines a sequence of focal points (directed by [[saccade]]s), processing only a fraction of the scene at its highest resolution.
Capsnets explore the intuition that the human visual system creates a [[Parse tree|tree]]-like structure for each focal point and coordinates these trees to recognize objects. However, with capsnets each tree is "carved" from a fixed network (by adjusting coefficients) rather than assembled on the fly.<ref name=":1"/>
== Alternatives ==
CapsNets are claimed to have four major conceptual advantages over [[convolutional neural network]]s (CNNs):
* Viewpoint invariance: the use of pose matrices allows capsule networks to recognize objects regardless of the perspective from which they are viewed.
* Fewer parameters: Because capsules group neurons, the connections between layers require fewer parameters.
* Better generalization to new viewpoints: CNNs, when trained to understand rotations, often learn that an object can be viewed similarly from several different rotations. However, capsule networks generalize better to new viewpoints because pose matrices can capture these characteristics as linear transformations.
* Defense against white-box adversarial attacks: the Fast Gradient Sign Method (FGSM) is a typical method for attacking CNNs. It evaluates the gradient of each pixel against the loss of the network, and changes each pixel by at most epsilon (the error term) to maximize the loss (see the sketch below). Although this method can drop the accuracy of CNNs dramatically (e.g. to below 20%), capsule networks maintain accuracy above 70%.
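The FGSM perturbation mentioned above can be sketched as follows (an illustrative PyTorch sketch, not tied to any particular capsule implementation; <code>model</code>, <code>loss_fn</code> and the value of <code>epsilon</code> are placeholders):
<syntaxhighlight lang="python">
import torch

def fgsm_attack(model, loss_fn, x, y, epsilon=0.1):
    """Fast Gradient Sign Method: move each pixel by at most epsilon
    in the direction that increases the loss."""
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x), y)
    loss.backward()
    x_adv = x + epsilon * x.grad.sign()        # per-pixel step of +/- epsilon
    return x_adv.clamp(0.0, 1.0).detach()      # keep pixels in a valid range
</syntaxhighlight>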
Purely convolutional nets cannot generalize to unlearned viewpoints (other than translation). For other [[affine transformation]]s, either feature detectors must be repeated on a grid that grows exponentially with the number of transformation dimensions, or the size of the labelled training set must grow exponentially. These exponential explosions make purely convolutional nets unsuitable for larger problems.
Capsnet's transformation matrices learn the (viewpoint independent) spatial relationship between a part and a whole, allowing the latter to be recognized based on such relationships. However, capsnets assume that each ___location displays at most one instance of a capsule's object. This assumption allows a capsule to use a distributed representation (its activity vector) of an object to represent that object at that ___location.<ref name=":1"/>
==References==
{{reflist|2|refs=
<ref name=":1">{{Cite
}}
== External links ==
* {{Citation|title=A PyTorch implementation of the NIPS 2017 paper "Dynamic Routing Between Capsules"|date=2017-12-08|url=https://github.com/gram-ai/capsule-networks|publisher=Gram.AI}}
* {{Cite web|url=http://www.cedar.buffalo.edu/~srihari/CSE676|title=Deep Learning|website=www.cedar.buffalo.edu|access-date=2017-12-07}}
* {{Citation|last=Guo|first=Xifeng|title=CapsNet-Keras: A Keras implementation of CapsNet in NIPS2017 paper "Dynamic Routing Between Capsules". Now test error = 0.34%.|date=2017-12-08|url=https://github.com/XifengGuo/CapsNet-Keras|access-date=2017-12-08}}
* {{Cite web|url=https://openreview.net/pdf?id=HJWLfGWRb|title=MATRIX CAPSULES WITH EM ROUTING|last1=Hinton|first1=Geoffrey|last2=Sabour|first2=Sara|last3=Frosst|first3=Nicholas|date=November 2017}}
* {{YouTube|Hinton and Google Brain - Capsule Networks |id=x5Vxk9twXlE}}
* {{Citation|last=Liao|first=Huadong|title=CapsNet-Tensorflow: A Tensorflow implementation of CapsNet(Capsules Net) in Hinton's paper Dynamic Routing Between Capsules|date=2017-12-08|url=https://github.com/naturomics/CapsNet-Tensorflow|access-date=2017-12-08}}
*{{Cite web|first=Fangyu|last=Cai|date=2020-12-18|title='We Can Do It' — Geoffrey Hinton and UBC, UT, Google & UVic Team Propose Unsupervised Capsule...|url=https://medium.com/syncedreview/we-can-do-it-geoffrey-hinton-and-ubc-ut-google-uvic-team-propose-unsupervised-capsule-c1f2edb6b1e9|access-date=2021-01-18|website=Medium|language=en}}
* {{cite arXiv|last1=Sun|first1=Weiwei|last2=Tagliasacchi|first2=Andrea|last3=Deng|first3=Boyang|last4=Sabour|first4=Sara|last5=Yazdani|first5=Soroosh|last6=Hinton|first6=Geoffrey|last7=Yi|first7=Kwang Moo|date=2020-12-08|title=Canonical Capsules: Unsupervised Capsules in Canonical Pose|class=cs.CV|eprint=2012.04718}}
[[Category:Artificial neural networks]]