Multimodal learning: Difference between revisions

Revision as of 19:11, 10 March 2024 by Alenoach (edit summary: multimodal LLMs are a special case of multimodal transformers; tag: Visual edit)
Latest revision as of 22:40, 1 June 2025 by Alenoach (edit summary: Reformatted the examples of applications into a list; tag: Visual edit)
(13 intermediate revisions by 6 users not shown)
Line 1: {{Short description\|Machine learning methods using multiple input modalities}} ~~{{multiple issues\|~~ ~~{{more footnotes\|date=June 2015}}~~ ~~{{technical\|date=June 2015}}~~ ~~{{tone\|date=June 2015}}~~ }} {{machine learning}} '''Multimodal learning''' is a type of [[deep learning]] that integrates and processes multiple types of data, referred to as [[Modality (human–computer interaction)\|modalities]], such as text, audio, images, or video. This integration allows for a more holistic understanding of complex data, improving model performance in tasks like visual question answering, cross-modal retrieval,<ref>{{Cite arXiv \|last1=Hendriksen \|first1=Mariya \|last2=Bleeker \|first2=Maurits \|last3=Vakulenko \|first3=Svitlana \|last4=van Noord \|first4=Nanne \|last5=Kuiper \|first5=Ernst \|last6=de Rijke \|first6=Maarten \|date=2021 \|title=Extending CLIP for Category-to-image Retrieval in E-commerce \|class=cs.CV \|eprint=2112.11294}}</ref> text-to-image generation,<ref name="stable-diffusion-github">{{cite web \|date=17 September 2022 \|title=Stable Diffusion Repository on GitHub \|url=https://github.com/CompVis/stable-diffusion \|url-status=live \|archive-url=https://web.archive.org/web/20230118183342/https://github.com/CompVis/stable-diffusion \|archive-date=January 18, 2023 \|access-date=17 September 2022 \|publisher=CompVis - Machine Vision and Learning Research Group, LMU Munich}}</ref> aesthetic ranking,<ref>{{Citation \|title=LAION-AI/aesthetic-predictor \|date=2024-09-06 \|url=https://github.com/LAION-AI/aesthetic-predictor \|access-date=2024-09-08 \|publisher=LAION AI}}</ref> and image captioning.<ref>{{Cite arXiv \|last1=Mokady \|first1=Ron \|last2=Hertz \|first2=Amir \|last3=Bermano \|first3=Amit H. \|date=2021 \|title=ClipCap: CLIP Prefix for Image Captioning \|class=cs.CV \|eprint=2111.09734}}</ref> ~~<!--- Don't mess with this line! 
---><!--- Write your article below this line --->~~ '''Multimodal learning''', in the context of [[machine learning]], is a type of [[deep learning]] using a combination of various [[Modality (human–computer interaction)\|modalities]] of data, often arising in real-world applications. An example of multi-modal data is data that combines text (typically represented as a [[feature vector]]) with imaging data consisting of [[pixel]] intensities and annotation tags. As these modalities have fundamentally different statistical properties, combining them is non-trivial, which is why specialized modelling strategies and algorithms are required. The model is then trained to be able to understand and work with multiple forms of data. Large multimodal models, such as [[Google Gemini]] and [[GPT-4o]], have become increasingly popular since 2023, enabling increased versatility and a broader understanding of real-world phenomena.<ref>{{Cite web \|last=Zia \|first=Tehseen \|date=January 8, 2024 \|title=Unveiling of Large Multimodal Models: Shaping the Landscape of Language Models in 2024 \|url=https://www.unite.ai/unveiling-of-large-multimodal-models-shaping-the-landscape-of-language-models-in-2024/ \|access-date=2024-06-01 \|website=Unite.ai}}</ref>
==Motivation==
Many models and algorithms have been implemented to retrieve and classify certain types of data, e.g. image or text (where humans who interact with machines can extract images in the form of pictures and texts that could be any message etc.). ~~However, data~~Data usually ~~come~~comes with different modalities ~~(it is the degree to which a system's components may be separated or combined)~~ which carry different information. For example, it is very common to caption an image to convey the information not presented in the image itself. Similarly, sometimes it is more straightforward to use an image to describe ~~the~~ information which may not be obvious from ~~texts~~text.
As a result, if different words appear in similar images, then these words likely describe the same thing. Conversely, if a word is used to describe seemingly dissimilar images, then these images may represent the same object. Thus, in cases dealing with multi-modal data, it is important to use a model which is able to jointly represent the information such that the model can capture the ~~correlation~~combined ~~structure~~information ~~between~~from different modalities. Moreover, it should also be able to recover missing modalities given observed ones (e.g. predicting a possible image object from a text description). The multimodal deep Boltzmann machine model meets both of these requirements.
== Multimodal transformers ==
Line 18 ⟶ 14:
{{excerpt\|Large language model\|Multimodality}}
== Multimodal deep Boltzmann machines ==
A [[Boltzmann machine]] is a type of [[stochastic neural network]] invented by [[Geoffrey Hinton]] and [[Terry Sejnowski]] in 1985. Boltzmann machines can be seen as the [[stochastic process\|stochastic]], [[generative model\|generative]] counterpart of [[Hopfield net]]s. They are named after the [[Boltzmann distribution]] in statistical mechanics. The units in Boltzmann machines are divided into two groups: visible units and hidden units. Each unit is like a neuron with a binary output that represents whether it is activated or not.<ref>{{Cite web \|last=Dey \|first=Victor \|date=2021-09-03 \|title=Beginners Guide to Boltzmann Machine \|url=https://analyticsindiamag.com/beginners-guide-to-boltzmann-machines/ \|access-date=2024-03-02 \|website=Analytics India Magazine \|language=en-US}}</ref> General Boltzmann machines allow connections between any units. However, learning is impractical using general Boltzmann machines because the computational time is exponential in the size of the machine{{Citation needed\|date=November 2022}}.
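The exponential cost mentioned above can be illustrated with a small sketch. It assumes the standard energy-based formulation of a Boltzmann machine with binary units, symmetric weights and per-unit biases, which the article itself does not spell out; the function names and the toy weights are hypothetical and chosen only for illustration. Computing the exact probability of a state requires a normalizing sum over every binary configuration, so the work grows as 2^n with the number of units n.

```python
import itertools
import math


def energy(state, weights, biases):
    """Energy of one binary configuration (assumed standard form):
    E(s) = -(sum over i<j of w_ij * s_i * s_j + sum over i of b_i * s_i)."""
    n = len(state)
    pairwise = sum(
        weights[i][j] * state[i] * state[j]
        for i in range(n)
        for j in range(i + 1, n)
    )
    bias_term = sum(biases[i] * state[i] for i in range(n))
    return -(pairwise + bias_term)


def partition_function(weights, biases):
    """Sum exp(-E(s)) over all 2**n binary states.

    Returns the sum and the number of states visited; the enumeration
    itself is the source of the exponential running time.
    """
    n = len(biases)
    total = 0.0
    visited = 0
    for state in itertools.product((0, 1), repeat=n):
        total += math.exp(-energy(state, weights, biases))
        visited += 1
    return total, visited


def probability(state, weights, biases):
    """Exact Boltzmann probability of one state (brute force)."""
    z, _ = partition_function(weights, biases)
    return math.exp(-energy(state, weights, biases)) / z


# Hypothetical toy machine with 3 units and symmetric weights.
weights = [
    [0.0, 1.0, -0.5],
    [1.0, 0.0, 0.25],
    [-0.5, 0.25, 0.0],
]
biases = [0.1, -0.2, 0.0]

z, visited = partition_function(weights, biases)
print(visited)  # prints 8, i.e. 2**3 states
```

Each additional unit doubles the number of states visited, which is why practical training relies on restricted connectivity and approximate sampling rather than exact enumeration.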
A more efficient architecture is the [[restricted Boltzmann machine]], described in the next section, in which connections are allowed only between hidden units and visible units. Multimodal deep Boltzmann machines can process and learn from different types of information, such as images and text, simultaneously. This can notably be done by having a separate deep Boltzmann machine for each modality, for example one for images and one for text, joined at an additional top hidden layer.<ref>{{cite web \|year=2014 \|title=Multimodal Learning with Deep Boltzmann Machine \|url=http://www.jmlr.org/papers/volume15/srivastava14b/srivastava14b.pdf \|url-status=live \|archive-url=https://web.archive.org/web/20150621055730/http://jmlr.org/papers/volume15/srivastava14b/srivastava14b.pdf \|archive-date=2015-06-21 \|access-date=2015-06-14}}</ref>
==~~Application~~ Applications ==
Multimodal machine learning has numerous applications across various domains:
Multimodal deep Boltzmann machines are successfully used in classification and missing data retrieval. The classification accuracy of multimodal deep Boltzmann machines exceeds that of [[support vector machine]]s, [[latent Dirichlet allocation]] and [[deep belief network]]s when models are tested on data with both image and text modalities or with a single modality.{{Citation needed\|date=November 2022}} Multimodal deep Boltzmann machines are also able to predict missing modalities given the observed ones with reasonably good precision.{{Citation needed\|date=November 2022}} Self-supervised learning brings a more interesting and powerful model for multimodality. [[OpenAI]] developed CLIP and [[DALL-E]] models that revolutionized multimodality.
* '''Cross-modal retrieval''': cross-modal retrieval allows users to search for data across different modalities (e.g., retrieving images based on text descriptions), improving multimedia search engines and content recommendation systems.
Models like [[Contrastive Language-Image Pre-training\|CLIP]] facilitate efficient, accurate retrieval by embedding data in a shared space, demonstrating strong performance even in zero-shot settings.<ref>{{Cite arXiv \|last1=Hendriksen \|first1=Mariya \|last2=Vakulenko \|first2=Svitlana \|last3=Kuiper \|first3=Ernst \|last4=de Rijke \|first4=Maarten \|date=2023 \|title=Scene-centric vs. Object-centric Image-Text Cross-modal Retrieval: A Reproducibility Study \|class=cs.CV \|eprint=2301.05174}}</ref>
~~Multimodal deep learning is used for [[cancer screening]] – at least one system under development [[Data integration#Medicine and Life Sciences\|integrates]] such different types of data.~~
* '''Classification and missing data retrieval''': multimodal Deep Boltzmann Machines outperform traditional models like [[support vector machine]]s and [[latent Dirichlet allocation]] in classification tasks and can predict missing data in multimodal datasets, such as images and text.
* '''Healthcare diagnostics''': multimodal models integrate medical imaging, genomic data, and patient records to improve diagnostic accuracy and early disease detection, especially in cancer screening.<ref>{{cite news \|last1=Quach \|first1=Katyanna \|title=Harvard boffins build multimodal AI system to predict cancer \|url=https://www.theregister.com/2022/08/09/ai_cancer_multimodal/ \|access-date=16 September 2022 
\|work=The Register \|language=en \|archive-date=20 September 2022 \|archive-url=https://web.archive.org/web/20220920163859/https://www.theregister.com/2022/08/09/ai_cancer_multimodal/ \|url-status=live }}</ref><ref>{{cite journal \|last1=Chen \|first1=Richard J. \|last2=Lu \|first2=Ming Y. \|last3=Williamson \|first3=Drew F. K. \|last4=Chen \|first4=Tiffany Y. \|last5=Lipkova \|first5=Jana \|last6=Noor \|first6=Zahra \|last7=Shaban \|first7=Muhammad \|last8=Shady \|first8=Maha \|last9=Williams \|first9=Mane \|last10=Joo \|first10=Bumjin \|last11=Mahmood \|first11=Faisal \|title=Pan-cancer integrative histology-genomic analysis via multimodal deep learning \|journal=Cancer Cell \|date=8 August 2022 \|volume=40 \|issue=8 \|pages=865–878.e6 \|doi=10.1016/j.ccell.2022.07.004 \|pmid=35944502 \|s2cid=251456162 \|language=English \|issn=1535-6108\|doi-access=free \|pmc=10397370 }} * Teaching hospital press release: {{cite news \|title=New AI technology integrates multiple data types to predict cancer outcomes \|url=https://medicalxpress.com/news/2022-08-ai-technology-multiple-cancer-outcomes.html \|access-date=18 September 2022 \|work=[[Brigham and Women's Hospital]] via medicalxpress.com \|language=en \|archive-date=20 September 2022 \|archive-url=https://web.archive.org/web/20220920172825/https://medicalxpress.com/news/2022-08-ai-technology-multiple-cancer-outcomes.html \|url-status=live }}</ref><ref>{{Cite arXiv \|last1=Shi \|first1=Yuge \|last2=Siddharth \|first2=N. \|last3=Paige \|first3=Brooks \|last4=Torr \|first4=Philip HS \|year=2019 \|title=Variational Mixture-of-Experts Autoencoders for Multi-Modal Deep Generative Models \|eprint=1911.03393 \|class=cs.LG}}</ref>
* '''Content generation''': models like [[DALL·E]] generate images from textual descriptions, benefiting creative industries, while cross-modal retrieval enables dynamic multimedia searches.<ref>{{Cite arXiv \|last1=Shi \|first1=Yuge \|last2=Siddharth \|first2=N. 
\|last3=Paige \|first3=Brooks \|last4=Torr \|first4=Philip HS \|date=2019 \|title=Variational Mixture-of-Experts Autoencoders for Multi-Modal Deep Generative Models \|class=cs.LG \|eprint=1911.03393}}</ref>
* '''Robotics and human-computer interaction''': multimodal learning improves interaction in robotics and AI by integrating sensory inputs like speech, vision, and touch, aiding autonomous systems and [[Human–computer interaction\|human-computer interaction]].
* '''Emotion recognition''': combining visual, audio, and text data, multimodal systems enhance [[sentiment analysis]] and [[emotion recognition]], applied in customer service, social media, and marketing.
==See also==
Line 35 ⟶ 36:
==References==
{{reflist}}
<!--- After listing your sources please cite them using inline citations and place them after the information they cite. Please see http://en.wikipedia.org/wiki/Wikipedia:REFB for instructions on how to add citations. --->
<!--- STOP! Be warned that by using this process instead of Articles for Creation, this article is subject to scrutiny. As an article in "mainspace", it will be DELETED if there are problems, not just declined. If you wish to use AfC, please return to the Wizard and continue from there. --->
[[Category:Artificial neural networks]] [[Category:Multimodal interaction]]
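The shared-space retrieval described in the cross-modal retrieval item under Applications can be sketched as a nearest-neighbour search. This is only an illustration under stated assumptions: the three-dimensional vectors below are hand-written stand-ins, not the output of CLIP or any real encoder, and the function names and file names are hypothetical. The point is that once text and images are embedded in the same space, retrieval reduces to ranking by a similarity measure such as cosine similarity.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_embedding, image_embeddings, top_k=1):
    """Return the names of the top_k images most similar to the query."""
    ranked = sorted(
        image_embeddings.items(),
        key=lambda item: cosine_similarity(query_embedding, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:top_k]]


# Hypothetical, hand-written embeddings standing in for encoder outputs.
image_embeddings = {
    "dog.jpg": [0.9, 0.1, 0.0],
    "cat.jpg": [0.1, 0.9, 0.0],
    "car.jpg": [0.0, 0.1, 0.9],
}
# Stand-in for the embedding of a text query such as "a photo of a dog".
text_query_embedding = [0.8, 0.2, 0.1]

best = retrieve(text_query_embedding, image_embeddings, top_k=2)
print(best)  # prints ['dog.jpg', 'cat.jpg']
```

In a real system the two dictionaries would be filled by separate text and image encoders trained so that matching pairs land close together; the ranking step itself would stay the same.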