{{Short description|Machine learning methods using multiple input modalities}}
{{machine learning}}
'''Multimodal learning''' is a type of [[deep learning]] that integrates and processes multiple types of data, referred to as [[Modality (human–computer interaction)|modalities]], such as text, audio, images, or video. This integration allows for a more holistic understanding of complex data, improving model performance in tasks like visual question answering, cross-modal retrieval,<ref>{{Cite arXiv |last1=Hendriksen |first1=Mariya |last2=Bleeker |first2=Maurits |last3=Vakulenko |first3=Svitlana |last4=van Noord |first4=Nanne |last5=Kuiper |first5=Ernst |last6=de Rijke |first6=Maarten |date=2021 |title=Extending CLIP for Category-to-image Retrieval in E-commerce |class=cs.CV |eprint=2112.11294}}</ref> text-to-image generation,<ref name="stable-diffusion-github">{{cite web |date=17 September 2022 |title=Stable Diffusion Repository on GitHub |url=https://github.com/CompVis/stable-diffusion |url-status=live |archive-url=https://web.archive.org/web/20230118183342/https://github.com/CompVis/stable-diffusion |archive-date=January 18, 2023 |access-date=17 September 2022 |publisher=CompVis - Machine Vision and Learning Research Group, LMU Munich}}</ref> aesthetic ranking,<ref>{{Citation |title=LAION-AI/aesthetic-predictor |date=2024-09-06 |url=https://github.com/LAION-AI/aesthetic-predictor |access-date=2024-09-08 |publisher=LAION AI}}</ref> and image captioning.<ref>{{Cite arXiv |last1=Mokady |first1=Ron |last2=Hertz |first2=Amir |last3=Bermano |first3=Amit H. |date=2021 |title=ClipCap: CLIP Prefix for Image Captioning |class=cs.CV |eprint=2111.09734}}</ref>
 
In contrast, unimodal models process only a single type of data, such as text (typically represented as [[feature vector|feature vectors]]) or images. Multimodal learning also differs from simply combining the outputs of unimodal models trained independently: it fuses information from the different modalities within a single model, so that predictions can draw on all of them at once.<ref>{{Cite web |last=Rosidi |first=Nate |date=March 27, 2023 |title=Multimodal Models Explained |url=https://www.kdnuggets.com/multimodal-models-explained |access-date=2024-06-01 |website=KDnuggets |language=en-US}}</ref>
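
One common way to realize such fusion is to concatenate the feature vectors produced by separate per-modality encoders and feed the joint representation to a shared prediction head. The following is a minimal sketch of this "early fusion" pattern in [[PyTorch]]; the embedding dimensions, class names, and random stand-in encoder outputs are illustrative assumptions, not taken from any particular published model:

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Toy multimodal classifier: concatenates text and image embeddings."""

    def __init__(self, text_dim=512, image_dim=512, num_classes=10):
        super().__init__()
        # A single head operates on the fused (concatenated) representation,
        # so the prediction can draw on both modalities at once.
        self.head = nn.Linear(text_dim + image_dim, num_classes)

    def forward(self, text_emb, image_emb):
        fused = torch.cat([text_emb, image_emb], dim=-1)  # feature-level fusion
        return self.head(fused)

# Stand-ins for the outputs of separate text and image encoders.
text_emb = torch.randn(4, 512)
image_emb = torch.randn(4, 512)
logits = FusionClassifier()(text_emb, image_emb)  # joint prediction over both modalities
</syntaxhighlight>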
 
Large multimodal models, such as [[Google Gemini]] and [[GPT-4o]], have become increasingly popular since 2023, enabling increased versatility and a broader understanding of real-world phenomena.<ref>{{Cite web |last=Zia |first=Tehseen |date=January 8, 2024 |title=Unveiling of Large Multimodal Models: Shaping the Landscape of Language Models in 2024 |url=https://www.unite.ai/unveiling-of-large-multimodal-models-shaping-the-landscape-of-language-models-in-2024/ |access-date=2024-06-01 |website=Unite.ai}}</ref>