Multimodal learning

In contrast, unimodal models can process only one type of data, such as text (typically represented as [[feature vector|feature vectors]]) or images. Multimodal learning differs from combining independently trained unimodal models: it integrates information from different modalities in order to make better predictions.<ref>{{Cite web |last=Rosidi |first=Nate |date=March 27, 2023 |title=Multimodal Models Explained |url=https://www.kdnuggets.com/multimodal-models-explained |access-date=2024-06-01 |website=KDnuggets |language=en-US}}</ref>
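A minimal illustration of this kind of fusion, in Python with PyTorch, projects per-modality feature vectors to a common size and concatenates them before making a joint prediction. The encoders, dimensions and class count below are hypothetical placeholders rather than any particular published architecture:

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Combines per-modality feature vectors into a joint prediction.

    The input dimensions are placeholders: any encoders producing
    fixed-size image and text feature vectors would serve.
    """
    def __init__(self, image_dim=512, text_dim=300, num_classes=10):
        super().__init__()
        self.image_proj = nn.Linear(image_dim, 128)
        self.text_proj = nn.Linear(text_dim, 128)
        self.classifier = nn.Linear(256, num_classes)

    def forward(self, image_feats, text_feats):
        # Project each modality to a shared size, then concatenate.
        img = torch.relu(self.image_proj(image_feats))
        txt = torch.relu(self.text_proj(text_feats))
        fused = torch.cat([img, txt], dim=-1)
        return self.classifier(fused)

model = LateFusionClassifier()
logits = model(torch.randn(4, 512), torch.randn(4, 300))  # batch of 4
</syntaxhighlight>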
 
Large multimodal models, such as [[Google Gemini]] and [[GPT-4o]], have become increasingly popular since 2023, enabling increased versatility and a broader understanding of real-world phenomena.<ref>{{Cite web |last=Zia |first=Tehseen |date=January 8, 2024 |title=Unveiling of Large Multimodal Models: Shaping the Landscape of Language Models in 2024 |url=https://www.unite.ai/unveiling-of-large-multimodal-models-shaping-the-landscape-of-language-models-in-2024/ |access-date=2024-06-01 |website=Unite.ai}}</ref>
 
==Motivation==
Data usually comes with different modalities which carry different information. For example, it is very common to caption an image to convey information not presented in the image itself. Similarly, sometimes it is more straightforward to use an image to describe information which may not be obvious from text. As a result, if different words appear in similar images, then these words likely describe the same thing. Conversely, if a word is used to describe seemingly dissimilar images, then these images may represent the same object. Thus, in cases dealing with multi-modal data, it is important to use a model which is able to jointly represent the information, such that the model can capture the combined information from the different modalities. Moreover, it should also be able to recover missing modalities given observed ones (e.g. predicting a possible image object from a text description).
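One way to make this joint representation concrete is to embed every modality into a single shared vector space, so that an observed modality can retrieve a missing one by nearest-neighbour search. The sketch below uses untrained placeholder encoders purely to illustrate the mechanism, not any specific published model:

<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

# Hypothetical encoders mapping each modality into one shared space.
image_encoder = torch.nn.Linear(512, 128)
text_encoder = torch.nn.Linear(300, 128)

def retrieve_text_for_image(image_feat, candidate_text_feats):
    """Recover a missing modality: return the index of the candidate
    text whose shared-space embedding is closest to the image's."""
    img = F.normalize(image_encoder(image_feat), dim=-1)
    txts = F.normalize(text_encoder(candidate_text_feats), dim=-1)
    sims = txts @ img  # cosine similarities, one per candidate
    return sims.argmax()

best = retrieve_text_for_image(torch.randn(512), torch.randn(20, 300))
</syntaxhighlight>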
 
== Multimodal transformers ==
 
==Application==
Multimodal deep Boltzmann machines have been used successfully for classification and missing-data retrieval. Their classification accuracy outperforms that of [[support vector machine]]s, [[latent Dirichlet allocation]] and [[deep belief network]]s when models are tested on data with both image-text modalities or with a single modality.{{Citation needed|date=November 2022}} Multimodal deep Boltzmann machines are also able to predict missing modalities given the observed ones with reasonably good precision.{{Citation needed|date=November 2022}} [[Self-supervised learning]] provides a more powerful framework for multimodality. [[OpenAI]] developed the [[Contrastive Language-Image Pre-training|CLIP]] and [[DALL-E]] models, which revolutionized multimodal learning.
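CLIP's training objective can be sketched as a symmetric contrastive loss that pulls matched image-text pairs together in a shared embedding space while pushing mismatched pairs apart. The following simplified version illustrates the idea and is not OpenAI's implementation:

<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched
    image-text pairs (simplified from the CLIP approach)."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # pairwise similarities
    targets = torch.arange(len(logits))  # i-th image matches i-th text
    loss_i = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i + loss_t) / 2

loss = clip_style_loss(torch.randn(8, 128), torch.randn(8, 128))
</syntaxhighlight>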
 
Multimodal deep learning is used for [[cancer screening]] – at least one system under development [[Data integration#Medicine and Life Sciences|integrates]] such different types of data as histology images and genomic profiles.<ref>{{cite news |last1=Quach |first1=Katyanna |title=Harvard boffins build multimodal AI system to predict cancer |url=https://www.theregister.com/2022/08/09/ai_cancer_multimodal/ |access-date=16 September 2022 |work=The Register |language=en |archive-date=20 September 2022 |archive-url=https://web.archive.org/web/20220920163859/https://www.theregister.com/2022/08/09/ai_cancer_multimodal/ |url-status=live }}</ref><ref>{{cite journal |last1=Chen |first1=Richard J. |last2=Lu |first2=Ming Y. |last3=Williamson |first3=Drew F. K. |last4=Chen |first4=Tiffany Y. |last5=Lipkova |first5=Jana |last6=Noor |first6=Zahra |last7=Shaban |first7=Muhammad |last8=Shady |first8=Maha |last9=Williams |first9=Mane |last10=Joo |first10=Bumjin |last11=Mahmood |first11=Faisal |title=Pan-cancer integrative histology-genomic analysis via multimodal deep learning |journal=Cancer Cell |date=8 August 2022 |volume=40 |issue=8 |pages=865–878.e6 |doi=10.1016/j.ccell.2022.07.004 |pmid=35944502 |s2cid=251456162 |language=English |issn=1535-6108 |doi-access=free |pmc=10397370 }}</ref>