{{Short description|Machine learning methods using multiple input modalities}}
{{machine learning}}
'''Multimodal learning''' is a type of [[deep learning]] that integrates and processes multiple types of data, referred to as [[Modality (human–computer interaction)|modalities]], such as text, audio, images, or video. This integration allows for a more holistic understanding of complex data, improving model performance in tasks like visual question answering, cross-modal retrieval,<ref>{{Cite arXiv |last1=Hendriksen |first1=Mariya |last2=Bleeker |first2=Maurits |last3=Vakulenko |first3=Svitlana |last4=van Noord |first4=Nanne |last5=Kuiper |first5=Ernst |last6=de Rijke |first6=Maarten |date=2021 |title=Extending CLIP for Category-to-image Retrieval in E-commerce |class=cs.CV |eprint=2112.11294}}</ref> text-to-image generation,<ref name="stable-diffusion-github">{{cite web |date=17 September 2022 |title=Stable Diffusion Repository on GitHub |url=https://github.com/CompVis/stable-diffusion |url-status=live |archive-url=https://web.archive.org/web/20230118183342/https://github.com/CompVis/stable-diffusion |archive-date=January 18, 2023 |access-date=17 September 2022 |publisher=CompVis - Machine Vision and Learning Research Group, LMU Munich}}</ref> aesthetic ranking,<ref>{{Citation |title=LAION-AI/aesthetic-predictor |date=2024-09-06 |url=https://github.com/LAION-AI/aesthetic-predictor |access-date=2024-09-08 |publisher=LAION AI}}</ref> and image captioning.<ref>{{Cite arXiv |last1=Mokady |first1=Ron |last2=Hertz |first2=Amir |last3=Bermano |first3=Amit H. |date=2021 |title=ClipCap: CLIP Prefix for Image Captioning |class=cs.CV |eprint=2111.09734}}</ref>
 
In contrast, unimodal models process only a single type of data, such as text (typically represented as [[feature vector|feature vectors]]) or images. Multimodal learning also differs from simply combining the outputs of unimodal models trained independently: it fuses information from the different modalities within a single model, so that predictions can draw on all of them at once.<ref>{{Cite web |last=Rosidi |first=Nate |date=March 27, 2023 |title=Multimodal Models Explained |url=https://www.kdnuggets.com/multimodal-models-explained |access-date=2024-06-01 |website=KDnuggets |language=en-US}}</ref>
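
One common way to realize such fusion is to concatenate the feature vectors produced by separate per-modality encoders and feed the joint representation to a shared prediction head. The following is a minimal sketch of this "early fusion" pattern in [[PyTorch]]; the embedding dimensions, class names, and random stand-in encoder outputs are illustrative assumptions, not taken from any particular published model:

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Toy multimodal classifier: concatenates text and image embeddings."""

    def __init__(self, text_dim=512, image_dim=512, num_classes=10):
        super().__init__()
        # A single head operates on the fused (concatenated) representation,
        # so the prediction can draw on both modalities at once.
        self.head = nn.Linear(text_dim + image_dim, num_classes)

    def forward(self, text_emb, image_emb):
        fused = torch.cat([text_emb, image_emb], dim=-1)  # feature-level fusion
        return self.head(fused)

# Stand-ins for the outputs of separate text and image encoders.
text_emb = torch.randn(4, 512)
image_emb = torch.randn(4, 512)
logits = FusionClassifier()(text_emb, image_emb)  # joint prediction over both modalities
</syntaxhighlight>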
 
Large multimodal models, such as [[Google Gemini]] and [[GPT-4o]], have become increasingly popular since 2023, enabling increased versatility and a broader understanding of real-world phenomena.<ref>{{Cite web |last=Zia |first=Tehseen |date=January 8, 2024 |title=Unveiling of Large Multimodal Models: Shaping the Landscape of Language Models in 2024 |url=https://www.unite.ai/unveiling-of-large-multimodal-models-shaping-the-landscape-of-language-models-in-2024/ |access-date=2024-06-01 |website=Unite.ai}}</ref>