Multimodal learning

In contrast, unimodal models can process only one type of data, such as text (typically represented as [[feature vector|feature vectors]]) or images. Multimodal learning differs from combining independently trained unimodal models: it integrates information from different modalities in order to make better predictions.<ref>{{Cite web |last=Rosidi |first=Nate |date=March 27, 2023 |title=Multimodal Models Explained |url=https://www.kdnuggets.com/multimodal-models-explained |access-date=2024-06-01 |website=KDnuggets |language=en-US}}</ref>
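A minimal illustration of this kind of fusion, in Python with PyTorch, projects per-modality feature vectors to a common size and concatenates them before making a joint prediction. The encoders, dimensions and class count below are hypothetical placeholders rather than any particular published architecture:

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Combines per-modality feature vectors into a joint prediction.

    The input dimensions are placeholders: any encoders producing
    fixed-size image and text feature vectors would serve.
    """
    def __init__(self, image_dim=512, text_dim=300, num_classes=10):
        super().__init__()
        self.image_proj = nn.Linear(image_dim, 128)
        self.text_proj = nn.Linear(text_dim, 128)
        self.classifier = nn.Linear(256, num_classes)

    def forward(self, image_feats, text_feats):
        # Project each modality to a shared size, then concatenate.
        img = torch.relu(self.image_proj(image_feats))
        txt = torch.relu(self.text_proj(text_feats))
        fused = torch.cat([img, txt], dim=-1)
        return self.classifier(fused)

model = LateFusionClassifier()
logits = model(torch.randn(4, 512), torch.randn(4, 300))  # batch of 4
</syntaxhighlight>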
 
Large multimodal models, such as [[Google Gemini]] and [[GPT-4o]], have become increasingly popular since 2023, enabling increased versatility and a broader understanding of real-world phenomena.<ref>{{Cite web |last=Zia |first=Tehseen |date=January 8, 2024 |title=Unveiling of Large Multimodal Models: Shaping the Landscape of Language Models in 2024 |url=https://www.unite.ai/unveiling-of-large-multimodal-models-shaping-the-landscape-of-language-models-in-2024/ |access-date=2024-06-01 |website=Unite.ai}}</ref>
 
==Motivation==
Data usually comes with different modalities which carry different information. For example, it is very common to caption an image to convey information not presented in the image itself. Similarly, sometimes it is more straightforward to use an image to describe information which may not be obvious from text. As a result, if different words appear in similar images, then these words likely describe the same thing. Conversely, if a word is used to describe seemingly dissimilar images, then these images may represent the same object. Thus, in cases dealing with multi-modal data, it is important to use a model which is able to jointly represent the information, such that the model can capture the combined information from the different modalities. Moreover, it should also be able to recover missing modalities given observed ones (e.g. predicting a possible image object from a text description).
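One way to make this joint representation concrete is to embed every modality into a single shared vector space, so that an observed modality can retrieve a missing one by nearest-neighbour search. The sketch below uses untrained placeholder encoders purely to illustrate the mechanism, not any specific published model:

<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

# Hypothetical encoders mapping each modality into one shared space.
image_encoder = torch.nn.Linear(512, 128)
text_encoder = torch.nn.Linear(300, 128)

def retrieve_text_for_image(image_feat, candidate_text_feats):
    """Recover a missing modality: return the index of the candidate
    text whose shared-space embedding is closest to the image's."""
    img = F.normalize(image_encoder(image_feat), dim=-1)
    txts = F.normalize(text_encoder(candidate_text_feats), dim=-1)
    sims = txts @ img  # cosine similarities, one per candidate
    return sims.argmax()

best = retrieve_text_for_image(torch.randn(512), torch.randn(20, 300))
</syntaxhighlight>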
 
== Multimodal transformers ==
 
==Application==
Multimodal deep Boltzmann machines have been used successfully for classification and missing-data retrieval. Their classification accuracy outperforms that of [[support vector machine]]s, [[latent Dirichlet allocation]] and [[deep belief network]]s when models are tested on data with both image-text modalities or with a single modality.{{Citation needed|date=November 2022}} Multimodal deep Boltzmann machines are also able to predict missing modalities given the observed ones with reasonably good precision.{{Citation needed|date=November 2022}} [[Self-supervised learning]] provides a more powerful framework for multimodality. [[OpenAI]] developed the [[Contrastive Language-Image Pre-training|CLIP]] and [[DALL-E]] models, which revolutionized multimodal learning.
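CLIP's training objective can be sketched as a symmetric contrastive loss that pulls matched image-text pairs together in a shared embedding space while pushing mismatched pairs apart. The following simplified version illustrates the idea and is not OpenAI's implementation:

<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched
    image-text pairs (simplified from the CLIP approach)."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # pairwise similarities
    targets = torch.arange(len(logits))  # i-th image matches i-th text
    loss_i = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i + loss_t) / 2

loss = clip_style_loss(torch.randn(8, 128), torch.randn(8, 128))
</syntaxhighlight>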
 
Multimodal deep learning is used for [[cancer screening]] – at least one system under development [[Data integration#Medicine and Life Sciences|integrates]] such different types of data as histology images and genomic profiles.<ref>{{cite news |last1=Quach |first1=Katyanna |title=Harvard boffins build multimodal AI system to predict cancer |url=https://www.theregister.com/2022/08/09/ai_cancer_multimodal/ |access-date=16 September 2022 |work=The Register |language=en |archive-date=20 September 2022 |archive-url=https://web.archive.org/web/20220920163859/https://www.theregister.com/2022/08/09/ai_cancer_multimodal/ |url-status=live }}</ref><ref>{{cite journal |last1=Chen |first1=Richard J. |last2=Lu |first2=Ming Y. |last3=Williamson |first3=Drew F. K. |last4=Chen |first4=Tiffany Y. |last5=Lipkova |first5=Jana |last6=Noor |first6=Zahra |last7=Shaban |first7=Muhammad |last8=Shady |first8=Maha |last9=Williams |first9=Mane |last10=Joo |first10=Bumjin |last11=Mahmood |first11=Faisal |title=Pan-cancer integrative histology-genomic analysis via multimodal deep learning |journal=Cancer Cell |date=8 August 2022 |volume=40 |issue=8 |pages=865–878.e6 |doi=10.1016/j.ccell.2022.07.004 |pmid=35944502 |s2cid=251456162 |language=English |issn=1535-6108 |doi-access=free |pmc=10397370 }}</ref>