Multimodal learning

{{Short description|Machine learning methods using multiple input modalities}}
{{machine learning}}
'''Multimodal learning''', in the context of [[machine learning]], is a type of [[deep learning]] using multiple [[Modality (human–computer interaction)|modalities]] of data, such as text, audio, or images.
In contrast, unimodal models process only one type of data, such as text (typically represented as [[feature vector|feature vectors]]) or images. Multimodal learning also differs from simply combining independently trained unimodal models: it integrates information from the different modalities to make better predictions.<ref>{{Cite web |last=Rosidi |first=Nate |date=March 27, 2023 |title=Multimodal Models Explained |url=https://www.kdnuggets.com/multimodal-models-explained |access-date=2024-06-01 |website=KDnuggets |language=en-US}}</ref>
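
For illustration, the sketch below (not drawn from the cited sources) shows one simple fusion strategy, feature-level fusion: each modality is encoded into a feature vector by its own encoder, the vectors are concatenated, and a single joint classifier operates on the combined representation. The toy encoders and the untrained linear classifier are hypothetical stand-ins chosen for brevity, not a real multimodal architecture.

<syntaxhighlight lang="python">
# Minimal sketch of feature-level multimodal fusion.
# The encoders below are illustrative stand-ins, not real models.
import numpy as np

rng = np.random.default_rng(0)


def encode_text(text: str, dim: int = 8) -> np.ndarray:
    """Toy text encoder: normalised bag-of-characters histogram."""
    vec = np.zeros(dim)
    for ch in text.lower():
        vec[ord(ch) % dim] += 1.0
    return vec / max(vec.sum(), 1.0)


def encode_image(pixels: np.ndarray, dim: int = 8) -> np.ndarray:
    """Toy image encoder: coarse pixel-intensity histogram."""
    hist, _ = np.histogram(pixels, bins=dim, range=(0.0, 1.0))
    return hist / max(hist.sum(), 1)


# Encode each modality separately, then fuse by concatenation.
text_features = encode_text("a photo of a cat")
image_features = encode_image(rng.random((32, 32)))
fused = np.concatenate([text_features, image_features])

# A single joint classifier (here an untrained random linear map)
# sees both modalities at once, so its prediction can depend on
# interactions between the text and image features.
weights = rng.standard_normal((2, fused.size))
scores = weights @ fused
prediction = int(np.argmax(scores))
print(prediction)
</syntaxhighlight>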
 
Large multimodal models, such as [[Google Gemini]] and [[GPT-4o]], have become increasingly popular since 2023, offering greater versatility and a more robust representation of real-world phenomena.<ref>{{Cite web |last=Zia |first=Tehseen |date=January 8, 2024 |title=Unveiling of Large Multimodal Models: Shaping the Landscape of Language Models in 2024 |url=https://www.unite.ai/unveiling-of-large-multimodal-models-shaping-the-landscape-of-language-models-in-2024/ |access-date=2024-06-01 |website=Unite.ai}}</ref>
 
==Motivation==
==References==
{{reflist}}
 
[[Category:Artificial neural networks]]
[[Category:Multimodal interaction]]