Object recognition: differenze tra le versioni

Contenuto cancellato Contenuto aggiunto
Cifi79 (discussione | contributi)
Cifi79 (discussione | contributi)
Nessun oggetto della modifica
Riga 45:
Il metodo Lowe per la generazione di un immagine caratteristica chiamata '''Scale Invariant Feature Transform''' ([[Scale-invariant_feature_transform|SIFT]]) trasforma l'immagine in una grande collezione di caratteristiche vettoriali, ognuna delle quali è invariante rispetto a traslazione, ridimensionamento, rotazione e, in parte rispetto all'illuminazione. Tale metodo è robusto rispetto a distorsioni geometriche. Queste caratteristiche hanno proprietà simili ai neuroni del [[lobo occipitale]], i quali vengono utilizzati per il riconoscimento di oggetti nei sistema di visione dei primati<ref>Serre, T., Kouh, M., Cadieu, C., Knoblich, U., Kreiman, G., Poggio, T., “A Theory of Object Recognition: Computations and Circuits in the Feedforward Path of the Ventral Stream in Primate Visual Cortex”, Computer Science and Artificial Intelligence Laboratory Technical Report, December 19, 2005 MIT-CSAIL-TR-2005-082.</ref>.Le posizioni dei punti chiave sono definite come massimi e minimi del risultato della differencza delle Gaussiane (vedi [http://en.wikipedia.org/wiki/Difference_of_Gaussians Difference of Gaussians]), di una serie di immagini ottenute col sistema [http://en.wikipedia.org/wiki/Scale-space spazio-scala]. Vengono scartati i punti a basso contrasto e i punti di bordo che si trovano lungo un bordo. Maggiore credibilità viene assegnata ai punti chiave localizzati. Queste fasi garantiscono che i punti chiave siano più stabili durante il riconoscimento.
La solidità del metodo [[Scale-invariant_feature_transform|SIFT]] rispetto alla distorsione è quindi ottenuta considerando i pixel nell'intorno del punto chiave e sfocando e ricampionando l'immagine locale.
 
===Ricerca e indicizzazione===
 
L'indicizzazione è il problema di immagazzinare i punti chiave [[Scale-invariant_feature_transform|SIFT]] e di individuarli in una nuova immagine. Lowe ha usato una modifica dell'algoritmo [[k-d tree]] chiamato metodo del '''Best-bin-first search''' <ref>Beis, J., and Lowe, D.G “Shape indexing using approximate nearest-neighbour search in high-dimensional spaces”, Conference on Computer Vision and Pattern Recognition,Puerto Rico, 1997, pp. 1000–1006.</ref>che può indivuduare il [[nearest neighbor]]s with high probability using only a limited amount of computation. The BBF algorithm uses a modified search ordering for the [[k-d tree]] algorithm so that bins in feature space are searched in the order of their closest distance from the query ___location. This search order requires the use of a [[heap (data structure)]] based [[priority queue]] for efficient determination of the search order. The best candidate match for each keypoint is found by identifying its [[nearest neighbor]] in the database of keypoints from training images. The [[nearest neighbor]]s are defined as the keypoints with minimum [[Euclidean distance]] from the given descriptor vector. The probability that a match is correct can be determined by taking the ratio of distance from the closest neighbor to the distance of the second closest.
 
Lowe<ref name="lowe04" /> rejected all matches in which the distance ratio is greater than 0.8, which eliminates 90% of the false matches while discarding less than 5% of the correct matches. To further improve the efficiency of the best-bin-first algorithm search was cut off after checking the first 200 [[nearest neighbor]] candidates. For a database of 100,000 keypoints, this provides a speedup over exact [[nearest neighbor]] search by about 2 orders of magnitude yet results in less than a 5% loss in the number of correct matches.
 
===Cluster identification by Hough transform voting===
[[Hough Transform]] is used to cluster reliable model hypotheses to search for keys that agree upon a particular model [[pose]]. [[Hough transform]] identifies clusters of features with a consistent interpretation by using each feature to vote for all object [[pose]]s that are consistent with the feature. When clusters of features are found to vote for the same pose of an object, the probability of the interpretation being correct is much higher than for any single feature. An entry in a [[hash table]] is created predicting the model ___location, orientation, and scale from the match hypothesis.The [[hash table]] is searched to identify all clusters of at least 3 entries in a bin, and the bins are sorted into decreasing order of size.
 
Each of the [[Scale-invariant_feature_transform|SIFT]] keypoints specifies 2D ___location, scale, and orientation, and each matched keypoint in the database has a record of the keypoint’s parameters relative to the training image in which it was found. The similarity transform implied by these 4 parameters is only an approximation to the full 6 degree-of-freedom pose space for a 3D object and also does not account for any non-rigid deformations. Therefore, Lowe<ref name="lowe04" /> used broad bin sizes of 30 degrees for orientation, a factor of 2 for scale, and 0.25 times the maximum projected training image dimension (using the predicted scale) for ___location. The [[Scale-invariant_feature_transform|SIFT]] key samples generated at the larger scale are given twice the weight of those at the smaller scale. This means that the larger scale is in effect able to filter the most likely neighbours for checking at the smaller scale. This also improves recognition performance by giving more weight to the least-noisy scale. To avoid the problem of boundary effects in bin assignment, each keypoint match votes for the 2 closest bins in each dimension, giving a total of 16 entries for each hypothesis and further broadening the pose range.
 
===Model verification by linear least squares===
 
Each identified cluster is then subject to a verification procedure in which a [[linear least squares]] solution is performed for the parameters of the [[affine transformation]] relating the model to the image. The [[affine transformation]] of a model point [x y]<sup>T</sup> to an image point [u v]<sup>T</sup> can be written as below
 
 
:<math>
\begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} m1 & m2 \\ m3 & m4 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} + \begin{bmatrix} tx \\ ty \end{bmatrix}
</math>
 
 
where the model translation is [tx ty]<sup>T</sup> and the affine rotation, scale, and stretch are represented by the parameters m1, m2, m3 and m4. To solve for the transformation parameters the equation above can be rewritten to gather the unknowns into a column vector.
 
:<math>
\begin{bmatrix} x & y & 0 & 0 & 1 & 0 \\ 0 & 0 & x & y & 0 & 1 \\ ....\\ ....\end{bmatrix} \begin{bmatrix}m1 \\ m2 \\ m3 \\ m4 \\ tx \\ ty \end{bmatrix} = \begin{bmatrix} u \\ v \\ . \\ . \end{bmatrix}
</math>
 
This equation shows a single match, but any number of further matches can be added, with each match contributing two more rows to the first and last matrix. At least 3 matches are needed to provide a solution.
We can write this linear system as
 
:<math>A\hat{\mathbf{x}} \approx \mathbf{b},</math>
 
where ''A'' is a known ''m''-by-''n'' [[Matrix (mathematics)|matrix]] (usually with ''m'' > ''n''), '''x''' is an unknown ''n''-dimensional parameter [[vector space|vector]], and '''b''' is a known ''m''-dimensional measurement vector.
 
Therefore the minimizing vector <math>\hat{\mathbf{x}}</math> is a solution of the '''normal equation'''
:<math> A^T \! A \hat{\mathbf{x}} = A^T \mathbf{b}. </math>
 
The solution of the system of linear equations is given in terms of the matrix <math>(A^TA)^{-1}A^T</math> , called the [[Moore-Penrose pseudoinverse|pseudoinverse]] of ''A'', by
 
:<math> \hat{\mathbf{x}} = (A^T\!A)^{-1} A^T \mathbf{b}. </math>
 
which minimizes the sum of the squares of the distances from the projected model locations to the corresponding image locations.
 
 
===Outlier detection===
 
[[Outlier]]s can now be removed by checking for agreement between each image feature and the model, given the parameter solution. Given the [[linear least squares]] solution, each match is required to agree within half the error range that was used for the parameters in the [[Hough transform]] bins. As outliers are discarded, the [[linear least squares]] solution is re-solved with the remaining points, and the process iterated. If fewer than 3 points remain after discarding [[outlier]]s, then the match is rejected. In addition, a top-down matching phase is used to add any further matches that agree with the projected model position, which may have been missed from the [[Hough transform]] bin due to the similarity transform approximation or other errors.
 
The final decision to accept or reject a model hypothesis is based on a detailed probabilistic model<ref>Lowe, D.G., Local feature view clustering for 3D object recognition. IEEE Conference on Computer Vision and Pattern Recognition,Kauai, Hawaii, 2001, pp. 682-688.</ref>. This method first computes the expected number of false matches to the model pose, given the projected size of the model, the number of features within the region, and the accuracy of the fit. A [[Bayesian probability]] analysis then gives the probability that the object is present based on the actual number of matching features found. A model is accepted if the final probability for a correct interpretation is greater than 0.98. Lowe's SIFT based object recognition gives excellent results except under wide illumination variations and under non-rigid transformations.
 
==Competing methods for scale invariant object recognition under clutter / partial occlusion==
 
RIFT <ref>Lazebnik, S., Schmid, C., and Ponce, J., Semi-Local Affine Parts for Object Recognition, BMVC, 2004.</ref> is a rotation-invariant generalization of SIFT. The RIFT descriptor is constructed using circular normalized patches divided into concentric rings of equal width and within each ring a gradient orientation histogram is computed. To maintain rotation invariance, the orientation is measured at each point relative to the direction pointing outward from the center.
 
G-RIF<ref>Sungho Kim, Kuk-Jin Yoon, In So Kweon, "Object Recognition Using a Generalized Robust Invariant Feature and Gestalt’s Law of Proximity and Similarity," Conference on Computer Vision and Pattern Recognition Workshop (CVPRW'06), 2006</ref> : Generalized Robust Invariant Feature is a general context descriptor which encodes edge orientation, edge density and hue information in a unified form combining perceptual information with spatial encoding. The object recognition scheme uses neighbouring context based voting to estimate object models.
 
"[[SURF]]<ref>Bay, H., Tuytelaars, T., Gool, L.V., "SURF: Speeded Up Robust Features", Proceedings of the ninth European Conference on Computer Vision, May 2006.</ref> : Speeded Up Robust Features" is a high-performance scale and rotation-invariant interest point detector / descriptor claimed to approximate or even outperform previously proposed schemes with respect to repeatability, distinctiveness, and robustness. [[SURF]] relies on integral images for image convolutions to reduce computation time, builds on the strengths of the leading existing detectors and descriptors (using a fast [[Hessian matrix]]-based measure for the detector and a distribution-based descriptor). It describes a distribution of [[Haar wavelet]] responses within the interest point neighbourhood. Integral images are used for speed and only 64 dimensions are used reducing the time for feature computation and matching. The indexing step is based on the sign of the [[Laplacian]],which increases the matching speed and the robustness of the descriptor.
 
PCA-SIFT <ref>Ke, Y., and Sukthankar, R., PCA-SIFT: A More Distinctive Representation for Local Image DescriptorsComputer Vision and Pattern Recognition, 2004.</ref>and [[GLOH]] <ref>Mikolajczyk, K., and Schmid, C., "A performance evaluation of local descriptors", IEEE Transactions on Pattern Analysis and Machine Intelligence, 10, 27, pp 1615--1630, 2005.</ref> are variants of [[Scale-invariant_feature_transform|SIFT]]. PCA-SIFT descriptor is a vector of image gradients in x and y direction computed within the support region. The gradient region is sampled at 39x39 locations, therefore the vector is of dimension 3042. The dimension is reduced
to 36 with [[PCA]]. Gradient ___location-orientation histogram ([[GLOH]]) is an extension of the [[Scale-invariant_feature_transform|SIFT]] descriptor designed to increase its robustness and distinctiveness. The [[Scale-invariant_feature_transform|SIFT]] descriptor is computed for a log-polar ___location grid with three bins in radial direction (the radius set to 6, 11, and 15) and 8 in angular direction, which results in 17 ___location bins. The central bin is not divided in angular directions. The gradient orientations are quantized in 16 bins resulting in 272 bin histogram. The size of this descriptor is reduced with [[PCA]]. The [[covariance matrix]] for [[PCA]] is estimated on image patches collected from various images. The 128 largest [[eigenvector]]s are used for description.
 
==Applications==
Object recognition methods has the following applications:
* Image panoramas<ref>Brown, M., and Lowe, D.G., "Recognising Panoramas," ICCV, p. 1218, Ninth IEEE International Conference on Computer Vision (ICCV'03) - Volume 2, Nice,France, 2003</ref>
* Image watermarking<ref>Li, L., Guo, B., and Shao, K., " Geometrically robust image watermarking using scale-invariant feature transform and Zernike moments," Chinese Optics Letters, Volume 5, Issue 6, pp. 332-335, 2007.</ref>
* Global robot localization<ref>Se,S., Lowe, D.G., and Little, J.J.,"Vision-based global localization and mapping for mobile robots", IEEE Transactions on Robotics, 21, 3 (2005), pp. 364-375.</ref>
 
== Note ==