{{Short description|Machine learning model family}}
{{Technical|date=August 2020}}
[[File:R-cnn.svg|thumb|272x272px|R-CNN architecture]]
'''Region-based Convolutional Neural Networks (R-CNN)''' are a family of machine learning models for [[computer vision]], and specifically [[object detection]] and localization.<ref name=":0">{{Cite book |last1=Zhang |first1=Aston |title=Dive into deep learning |last2=Lipton |first2=Zachary |last3=Li |first3=Mu |last4=Smola |first4=Alexander J. |date=2024 |publisher=Cambridge University Press |isbn=978-1-009-38943-3 |___location=Cambridge New York Port Melbourne New Delhi Singapore |chapter=14.8. Region-based CNNs (R-CNNs) |chapter-url=https://d2l.ai/chapter_computer-vision/rcnn.html}}</ref> The original goal of R-CNN was to take an input image and produce a set of [[Minimum bounding box|bounding boxes]] as output, where each bounding box contains an object and also the category (e.g. car or pedestrian) of the object. In general, R-CNN architectures perform selective search<ref name=":1">{{Cite journal |last1=Uijlings |first1=J. R. R. |last2=van de Sande |first2=K. E. A. |last3=Gevers |first3=T. |last4=Smeulders |first4=A. W. M. |date=2013-09-01 |title=Selective Search for Object Recognition |url=https://link.springer.com/article/10.1007/s11263-013-0620-5 |journal=International Journal of Computer Vision |volume=104 |issue=2 |pages=154–171 |doi=10.1007/s11263-013-0620-5 |issn=1573-1405|url-access=subscription }}</ref> over feature maps outputted by a CNN.
 
R-CNN has been extended to perform other computer vision tasks, such as tracking objects from a drone-mounted camera,<ref>{{Cite news |last=Nene |first=Vidi |date=Aug 2, 2019 |title=Deep Learning-Based Real-Time Multiple-Object Detection and Tracking via Drone |url=https://dronebelow.com/2019/08/02/deep-learning-based-real-time-multiple-object-detection-and-tracking-via-drone/ |access-date=Mar 28, 2020 |work=Drone Below}}</ref> locating text in an image,<ref>{{Cite news |last=Ray |first=Tiernan |date=Sep 11, 2018 |title=Facebook pumps up character recognition to mine memes |url=https://www.zdnet.com/article/facebook-pumps-up-character-recognition-to-mine-memes/ |access-date=Mar 28, 2020 |publisher=[[ZDNET]]}}</ref> and enabling object detection in [[Google Lens]].<ref>{{Cite news |last=Sagar |first=Ram |date=Sep 9, 2019 |title=These machine learning methods make google lens a success |url=https://analyticsindiamag.com/these-machine-learning-techniques-make-google-lens-a-success/ |access-date=Mar 28, 2020 |work=Analytics India}}</ref>
 
Mask R-CNN is also one of seven tasks in the MLPerf Training Benchmark, which is a competition to speed up the training of neural networks.<ref>{{cite arXiv |eprint=1910.01500v3 |class=math.LG |first=Peter |last=Mattson |title=MLPerf Training Benchmark |date=2019 |display-authors=etal}}</ref>
 
== History ==
 
The original goal of R-CNN was to take an input image and produce a set of bounding boxes as output, where each bounding box contains an object and also the category (e.g. car or pedestrian) of the object. More recently, R-CNN has been extended to perform other computer vision tasks. The following covers some of the versions of R-CNN that have been developed.
 
* November 2013: '''R-CNN'''.<ref name=":2" />
* April 2015: '''Fast R-CNN'''.<ref name=":3">{{Cite book |last=Girshick |first=Ross |chapter=Fast R-CNN |date=7–13 December 2015 |title=2015 IEEE International Conference on Computer Vision (ICCV) |publisher=IEEE |pages=1440–1448 |doi=10.1109/ICCV.2015.169 |isbn=978-1-4673-8391-2}}</ref>
* June 2015: '''Faster R-CNN'''.<ref name=":4">{{Cite journal |last1=Ren |first1=Shaoqing |last2=He |first2=Kaiming |last3=Girshick |first3=Ross |last4=Sun |first4=Jian |date=2017-06-01 |title=Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks |journal=IEEE Transactions on Pattern Analysis and Machine Intelligence |volume=39 |issue=6 |pages=1137–1149 |doi=10.1109/TPAMI.2016.2577031 |pmid=27295650 |issn=0162-8828|arxiv=1506.01497 |bibcode=2017ITPAM..39.1137R }}</ref>
* March 2017: '''Mask R-CNN'''.<ref name=":5">{{Cite book |last1=He |first1=Kaiming |last2=Gkioxari |first2=Georgia |last3=Dollar |first3=Piotr |last4=Girshick |first4=Ross |chapter=Mask R-CNN |date=October 2017 |title=2017 IEEE International Conference on Computer Vision (ICCV) |publisher=IEEE |pages=2980–2988 |doi=10.1109/ICCV.2017.322 |isbn=978-1-5386-1032-9}}</ref>
* December 2017: '''Cascade R-CNN''' is trained with increasing Intersection over Union (IoU, also known as the [[Jaccard index]]) thresholds, making each stage more selective against nearby false positives.<ref>{{Cite journal |last1=Cai |first1=Zhaowei |last2=Vasconcelos |first2=Nuno |date=2017 |title=Cascade R-CNN: Delving into High Quality Object Detection |arxiv=1712.00726 }}</ref>
* June 2019: '''Mesh R-CNN''' adds the ability to generate a 3D mesh from a 2D image.<ref>{{Cite journal |last1=Gkioxari |first1=Georgia |last2=Malik |first2=Jitendra |last3=Johnson |first3=Justin |date=2019 |title=Mesh R-CNN |url=https://openaccess.thecvf.com/content_ICCV_2019/html/Gkioxari_Mesh_R-CNN_ICCV_2019_paper.html |pages=9785–9795|arxiv=1906.02739 }}</ref>
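The Intersection over Union criterion that Cascade R-CNN thresholds at successive stages can be illustrated with a short function. This is a generic sketch, not code from any of the cited papers; boxes are assumed to be axis-aligned `(x1, y1, x2, y2)` tuples:

```python
def iou(box_a, box_b):
    """Intersection over Union (Jaccard index) of two axis-aligned boxes.

    Boxes are (x1, y1, x2, y2) tuples with x1 < x2 and y1 < y2.
    """
    # Coordinates of the intersection rectangle (empty if boxes are disjoint).
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

A training stage with, say, a 0.7 threshold would count a proposal as a positive example only when `iou(proposal, ground_truth) >= 0.7`, which is what makes later stages more selective against nearby false positives.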
 
== Architecture ==
For review articles covering the R-CNN family, see the following.<ref name=":0" /><ref>{{Cite news |last=Weng |first=Lilian |date=December 31, 2017 |title=Object Detection for Dummies Part 3: R-CNN Family |url=https://lilianweng.github.io/lil-log/2017/12/31/object-recognition-for-dummies-part-3.html |access-date=March 12, 2020 |work=Lil'Log}}</ref>
 
=== Selective search ===
Given an image (or an image-like feature map), '''selective search''' (also called Hierarchical Grouping) first segments the image by the algorithm in (Felzenszwalb and Huttenlocher, 2004),<ref>{{Cite journal |last1=Felzenszwalb |first1=Pedro F. |last2=Huttenlocher |first2=Daniel P. |date=2004-09-01 |title=Efficient Graph-Based Image Segmentation |url=https://link.springer.com/article/10.1023/B:VISI.0000022288.19776.77 |journal=International Journal of Computer Vision |language=en |volume=59 |issue=2 |pages=167–181 |doi=10.1023/B:VISI.0000022288.19776.77 |issn=1573-1405|url-access=subscription }}</ref> then performs the following:<ref name=":1" />
 
 '''Input:''' (colour) image
 '''Output:''' Set of object ___location hypotheses L
 Segment image into initial regions R = {r<sub>1</sub>, ..., r<sub>n</sub>} using Felzenszwalb and Huttenlocher (2004)
 Initialise similarity set S = ∅
 '''foreach''' neighbouring region pair (r<sub>i</sub>, r<sub>j</sub>) '''do'''
     Calculate similarity s(r<sub>i</sub>, r<sub>j</sub>)
     S = S ∪ s(r<sub>i</sub>, r<sub>j</sub>)
 '''while''' S ≠ ∅ '''do'''
     Get highest similarity s(r<sub>i</sub>, r<sub>j</sub>) = max(S)
     Merge corresponding regions r<sub>t</sub> = r<sub>i</sub> ∪ r<sub>j</sub>
     Remove similarities regarding r<sub>i</sub>: S = S \ s(r<sub>i</sub>, r<sub>∗</sub>)
     Remove similarities regarding r<sub>j</sub>: S = S \ s(r<sub>∗</sub>, r<sub>j</sub>)
     Calculate similarity set S<sub>t</sub> between r<sub>t</sub> and its neighbours
     S = S ∪ S<sub>t</sub>
     R = R ∪ r<sub>t</sub>
 Extract object ___location boxes L from all regions in R
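The greedy merging loop can be sketched in Python. This is an illustrative simplification, not the reference implementation: the caller supplies the initial over-segmentation and a similarity function, whereas the actual algorithm uses Felzenszwalb–Huttenlocher segmentation and a combination of colour, texture, size, and fill similarities.

```python
def bounding_box(pixels):
    """Axis-aligned bounding box (x1, y1, x2, y2) of a set of (x, y) pixels."""
    xs = [p[0] for p in pixels]
    ys = [p[1] for p in pixels]
    return (min(xs), min(ys), max(xs), max(ys))

def selective_search(regions, neighbours, similarity):
    """Greedy hierarchical grouping over an initial over-segmentation.

    regions:    dict mapping region id -> set of (x, y) pixels
    neighbours: set of frozenset({i, j}) pairs of adjacent region ids
    similarity: function taking two pixel sets, returning a float
    Returns bounding boxes for every region produced during merging.
    """
    regions = {k: set(v) for k, v in regions.items()}
    all_regions = list(regions.values())  # R accumulates every region ever seen
    S = {p: similarity(*[regions[i] for i in p]) for p in neighbours}
    next_id = max(regions) + 1
    while S:
        i, j = max(S, key=S.get)                  # most similar neighbouring pair
        merged = regions[i] | regions[j]
        # Regions adjacent to i or j become neighbours of the merged region.
        adjacent = {k for p in S for k in p if i in p or j in p} - {i, j}
        S = {p: s for p, s in S.items() if i not in p and j not in p}
        del regions[i], regions[j]
        regions[next_id] = merged
        for k in adjacent:
            if k in regions:
                S[frozenset({next_id, k})] = similarity(merged, regions[k])
        all_regions.append(merged)
        next_id += 1
    return [bounding_box(r) for r in all_regions]
```

Because every intermediate region is kept, the output contains proposals at all scales of the merging hierarchy, which is the source of the algorithm's multi-scale behaviour.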
 
=== R-CNN ===
Given an input image, R-CNN begins by applying selective search to extract [[Region of interest|regions of interest]] (ROI), where each ROI is a rectangle that may represent the boundary of an object in the image. Depending on the scenario, there may be as many as {{nobr|two thousand}} ROIs. After that, each ROI is fed through a neural network to produce output features. For each ROI's output features, an ensemble of [[support-vector machine]] classifiers is used to determine what type of object (if any) is contained within the ROI.<ref name=":2">{{Cite journal |last1=Girshick |first1=Ross |last2=Donahue |first2=Jeff |last3=Darrell |first3=Trevor |last4=Malik |first4=Jitendra |date=2016-01-01 |title=Region-Based Convolutional Networks for Accurate Object Detection and Segmentation |journal=IEEE Transactions on Pattern Analysis and Machine Intelligence |volume=38 |issue=1 |pages=142–158 |doi=10.1109/TPAMI.2015.2437384 |pmid=26656583 |bibcode=2016ITPAM..38..142G |issn=0162-8828}}</ref>
{{-}}
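The per-ROI classification step can be sketched as follows. Everything here is a toy stand-in: `extract_features` would be a deep CNN (AlexNet in the original paper), and the hypothetical `classify_rois` helper scores each ROI with one linear SVM per class, keeping detections whose best score clears a threshold.

```python
def extract_features(image, roi):
    """Toy 2-D feature for a cropped ROI; a real system would warp the
    crop to a fixed size and run it through a CNN."""
    x1, y1, x2, y2 = roi
    patch = [row[x1:x2] for row in image[y1:y2]]
    flat = [v for row in patch for v in row]
    return [sum(flat) / len(flat), len(flat)]   # mean intensity, area

def classify_rois(image, rois, svm_weights, svm_bias, threshold=0.0):
    """Score each ROI with one linear SVM per class (one-vs-rest).

    Returns (roi, best_class) for ROIs whose top score exceeds
    `threshold`; otherwise the ROI is treated as background.
    """
    detections = []
    for roi in rois:
        f = extract_features(image, roi)
        scores = {c: sum(w * x for w, x in zip(weights, f)) + svm_bias[c]
                  for c, weights in svm_weights.items()}
        best = max(scores, key=scores.get)
        if scores[best] > threshold:
            detections.append((roi, best))
    return detections
```

With up to two thousand ROIs per image, the expensive `extract_features` call runs once per ROI, which is exactly the cost that Fast R-CNN later eliminated.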
 
=== Fast R-CNN ===
[[File:Fast-rcnn.svg|thumb|Fast R-CNN]]While the original R-CNN independently computed the neural network features on each of as many as two thousand regions of interest, Fast R-CNN runs the neural network once on the whole image.<ref name=":3" />
[[File:RoI_pooling_animated.gif|thumb|268x268px|RoI pooling to size 2x2. In this example, the region proposal (an input parameter) has size 7x5.]]
At the end of the network is a '''ROIPooling''' module, which slices out each ROI from the network's output tensor, reshapes it, and classifies it. As in the original R-CNN, the Fast R-CNN uses selective search to generate its region proposals.
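The pooling step can be sketched for a single feature-map channel. This is an illustrative simplification of the ROIPooling idea, not Fast R-CNN's actual implementation: the ROI is divided into an approximately even grid and each cell is max-pooled.

```python
def roi_pool(feature_map, roi, output_size):
    """Max-pool an ROI of a 2-D feature map to a fixed output size.

    feature_map: list of lists (H rows x W columns)
    roi:         (x1, y1, x2, y2), end-exclusive integer indices
    output_size: (rows, cols) of the pooled output
    """
    x1, y1, x2, y2 = roi
    rows, cols = output_size
    h, w = y2 - y1, x2 - x1
    out = []
    for r in range(rows):
        # Split the ROI into a rows x cols grid of roughly equal cells.
        ys, ye = y1 + r * h // rows, y1 + (r + 1) * h // rows
        out_row = []
        for c in range(cols):
            xs, xe = x1 + c * w // cols, x1 + (c + 1) * w // cols
            cell = [feature_map[y][x]
                    for y in range(ys, max(ye, ys + 1))
                    for x in range(xs, max(xe, xs + 1))]
            out_row.append(max(cell))
        out.append(out_row)
    return out
```

Whatever the ROI's size, the output is always `rows x cols`, which lets proposals of arbitrary shape feed a fixed-size classifier head.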
{{-}}
 
=== Faster R-CNN ===
[[File:Faster-rcnn.svg|thumb|Faster R-CNN]]While Fast R-CNN used selective search to generate ROIs, Faster R-CNN integrates the ROI generation into the neural network itself.<ref name=":4" />
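The learned proposal mechanism, the region proposal network (RPN), scores a fixed set of reference "anchor" boxes at every feature-map ___location. The enumeration of those anchors can be sketched as follows; the exact scale/ratio conventions here are one common choice, not necessarily those of the paper:

```python
def generate_anchors(feat_h, feat_w, stride, scales, ratios):
    """Enumerate anchor boxes for an RPN.

    One anchor per (cell, scale, ratio) combination, centred on the
    image-space centre of each feature-map cell. `stride` is the total
    downsampling factor of the backbone CNN; `ratios` are width/height
    aspect ratios, with each anchor keeping area scale**2.
    """
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for s in scales:
                for r in ratios:
                    w = s * r ** 0.5
                    h = s / r ** 0.5
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return anchors
```

The RPN then predicts, for every anchor, an objectness score and box offsets; high-scoring adjusted anchors become the ROIs that the detection head consumes, so no external proposal algorithm is needed.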
{{-}}
 
=== Mask R-CNN ===
[[File:Mask-rcnn.svg|thumb|Mask R-CNN]]While previous versions of R-CNN focused on object detection, Mask R-CNN adds instance segmentation. Mask R-CNN also replaced ROIPooling with a new method called ROIAlign, which can represent fractions of a pixel.<ref name=":5" />
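ROIAlign's handling of fractional coordinates rests on bilinear interpolation. The sketch below takes one bilinear sample at the centre of each output cell; the actual layer averages several samples per cell, so this is a simplified illustration rather than the paper's exact procedure.

```python
def bilinear(fm, x, y):
    """Bilinearly interpolate a 2-D feature map at fractional (x, y)."""
    x0, y0 = int(x), int(y)
    x1 = min(x0 + 1, len(fm[0]) - 1)
    y1 = min(y0 + 1, len(fm) - 1)
    dx, dy = x - x0, y - y0
    return (fm[y0][x0] * (1 - dx) * (1 - dy) + fm[y0][x1] * dx * (1 - dy)
            + fm[y1][x0] * (1 - dx) * dy + fm[y1][x1] * dx * dy)

def roi_align(fm, roi, output_size):
    """RoIAlign with one bilinear sample per output cell.

    Unlike ROIPooling, the ROI coordinates may be fractional: no
    rounding ("quantization") to pixel boundaries ever occurs.
    """
    x1, y1, x2, y2 = roi
    rows, cols = output_size
    ch, cw = (y2 - y1) / rows, (x2 - x1) / cols
    return [[bilinear(fm, x1 + (c + 0.5) * cw, y1 + (r + 0.5) * ch)
             for c in range(cols)] for r in range(rows)]
```

Avoiding the rounding that ROIPooling performs keeps pixel-accurate alignment between the ROI and the extracted features, which matters for predicting per-pixel instance masks.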
 
== References ==
<references />
 
== Further reading ==
 
* {{Cite web |last=Parthasarathy |first=Dhruv |date=2017-04-27 |title=A Brief History of CNNs in Image Segmentation: From R-CNN to Mask R-CNN |url=https://blog.athelas.com/a-brief-history-of-cnns-in-image-segmentation-from-r-cnn-to-mask-r-cnn-34ea83205de4 |access-date=2024-09-11 |website=Medium |language=en}}
 
[[Category:Object recognition and categorization]]
[[Category:Deep learning]]