Multimodal deep Boltzmann machines can process and learn from different types of information, such as images and text, simultaneously. One notable approach uses a separate deep Boltzmann machine for each modality, for example one for images and one for text, joined at an additional top hidden layer.<ref>{{cite web |last1=Srivastava |first1=Nitish |last2=Salakhutdinov |first2=Ruslan |year=2014 |title=Multimodal Learning with Deep Boltzmann Machines |url=http://www.jmlr.org/papers/volume15/srivastava14b/srivastava14b.pdf |url-status=live |archive-url=https://web.archive.org/web/20150621055730/http://jmlr.org/papers/volume15/srivastava14b/srivastava14b.pdf |archive-date=2015-06-21 |access-date=2015-06-14}}</ref>
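A minimal sketch of this joint topology, written here as a plain feedforward network in [[PyTorch]] rather than an actual Boltzmann machine, with all layer sizes chosen arbitrarily for illustration:

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class JointMultimodalNet(nn.Module):
    """Two modality-specific pathways joined at a shared top hidden layer.

    Illustrative feedforward analogue of the multimodal DBM topology;
    it is not a Boltzmann machine, and every dimension is arbitrary.
    """
    def __init__(self, image_dim=4096, text_dim=2000,
                 hidden_dim=1024, joint_dim=2048):
        super().__init__()
        # One pathway per modality, e.g. image features and text features.
        self.image_path = nn.Sequential(
            nn.Linear(image_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.text_path = nn.Sequential(
            nn.Linear(text_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # The additional top hidden layer that joins the two pathways.
        self.joint = nn.Linear(2 * hidden_dim, joint_dim)

    def forward(self, image_feats, text_feats):
        h_img = self.image_path(image_feats)
        h_txt = self.text_path(text_feats)
        return torch.relu(self.joint(torch.cat([h_img, h_txt], dim=-1)))
</syntaxhighlight>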
 
==Applications==
Multimodal machine learning has numerous applications across various domains:
 
=== Cross-modal retrieval ===
Cross-modal retrieval allows users to search for data across different modalities (e.g., retrieving images based on text descriptions), improving multimedia search engines and content recommendation systems. Models like [[Contrastive Language-Image Pre-training|CLIP]] facilitate efficient, accurate retrieval by embedding data in a shared space, demonstrating strong performance even in zero-shot settings.<ref>{{Cite arXiv |last1=Hendriksen |first1=Mariya |last2=Vakulenko |first2=Svitlana |last3=Kuiper |first3=Ernst |last4=de Rijke |first4=Maarten |date=2023 |title=Scene-centric vs. Object-centric Image-Text Cross-modal Retrieval: A Reproducibility Study |class=cs.CV |eprint=2301.05174}}</ref>
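A sketch of text-to-image retrieval with the openly released CLIP weights, using the Hugging Face <code>transformers</code> library; the checkpoint name is real, but the three-image corpus and the query are placeholders:

<syntaxhighlight lang="python">
# Text-to-image retrieval in CLIP's shared embedding space.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Toy candidate corpus; replace with a real image collection.
paths = ["cat.jpg", "dog.jpg", "car.jpg"]
images = [Image.open(p) for p in paths]
query = "a photo of a dog"

inputs = processor(text=[query], images=images,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_text holds scaled cosine similarities between the query
# and each candidate image; the argmax is the best match.
best = outputs.logits_per_text.argmax(dim=-1).item()
print("best match:", paths[best])
</syntaxhighlight>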
 
=== Classification and missing data retrieval ===
Multimodal deep Boltzmann machines have outperformed traditional models such as [[support vector machine]]s and [[latent Dirichlet allocation]] on classification tasks, and they can predict missing data in multimodal datasets of images and text, such as inferring a text representation for an image that lacks one.
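As an illustration of missing-data retrieval, a simple learned mapping can predict the features of an absent modality from the one that is present. The sketch below substitutes a ridge regression on synthetic features for the Boltzmann machine's sampling-based inference:

<syntaxhighlight lang="python">
# Predicting a missing modality's features from the observed one.
# Synthetic data; a stand-in for DBM inference, not the paper's method.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
image_feats = rng.normal(size=(500, 64))   # paired training examples
text_feats = (image_feats @ rng.normal(size=(64, 32))
              + 0.1 * rng.normal(size=(500, 32)))

# Learn an image -> text mapping from fully observed pairs.
imputer = Ridge(alpha=1.0).fit(image_feats, text_feats)

new_image = rng.normal(size=(1, 64))       # its text features are missing
predicted_text = imputer.predict(new_image)
print(predicted_text.shape)                # (1, 32)
</syntaxhighlight>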
 
=== Healthcare diagnostics ===
Multimodal models integrate medical imaging, genomic data, and patient records to improve diagnostic accuracy and early disease detection, especially in cancer screening.<ref>{{cite news |last1=Quach |first1=Katyanna |title=Harvard boffins build multimodal AI system to predict cancer |url=https://www.theregister.com/2022/08/09/ai_cancer_multimodal/ |access-date=16 September 2022 |work=The Register |language=en |archive-date=20 September 2022 |archive-url=https://web.archive.org/web/20220920163859/https://www.theregister.com/2022/08/09/ai_cancer_multimodal/ |url-status=live }}</ref><ref>{{cite journal |last1=Chen |first1=Richard J. |last2=Lu |first2=Ming Y. |last3=Williamson |first3=Drew F. K. |last4=Chen |first4=Tiffany Y. |last5=Lipkova |first5=Jana |last6=Noor |first6=Zahra |last7=Shaban |first7=Muhammad |last8=Shady |first8=Maha |last9=Williams |first9=Mane |last10=Joo |first10=Bumjin |last11=Mahmood |first11=Faisal |title=Pan-cancer integrative histology-genomic analysis via multimodal deep learning |journal=Cancer Cell |date=8 August 2022 |volume=40 |issue=8 |pages=865–878.e6 |doi=10.1016/j.ccell.2022.07.004 |pmid=35944502 |pmc=10397370 |s2cid=251456162 |issn=1535-6108 |doi-access=free }} Teaching hospital press release: {{cite news |title=New AI technology integrates multiple data types to predict cancer outcomes |url=https://medicalxpress.com/news/2022-08-ai-technology-multiple-cancer-outcomes.html |access-date=18 September 2022 |work=[[Brigham and Women's Hospital]] via medicalxpress.com |language=en |archive-date=20 September 2022 |archive-url=https://web.archive.org/web/20220920172825/https://medicalxpress.com/news/2022-08-ai-technology-multiple-cancer-outcomes.html |url-status=live }}</ref><ref name="Shi2019">{{Cite arXiv |last1=Shi |first1=Yuge |last2=Siddharth |first2=N. |last3=Paige |first3=Brooks |last4=Torr |first4=Philip HS |year=2019 |title=Variational Mixture-of-Experts Autoencoders for Multi-Modal Deep Generative Models |eprint=1911.03393 |class=cs.LG}}</ref>
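A minimal sketch of the late-fusion pattern such systems commonly use, with synthetic stand-ins for the imaging, genomic, and patient-record features; the dimensions and the scikit-learn classifier are illustrative choices, not the pipelines in the cited work:

<syntaxhighlight lang="python">
# Late fusion of heterogeneous clinical modalities for a diagnostic label.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 400
imaging = rng.normal(size=(n, 128))  # e.g. features from a scan encoder
genomic = rng.normal(size=(n, 64))   # e.g. mutation/expression features
records = rng.normal(size=(n, 16))   # e.g. structured record fields
labels = (imaging[:, 0] + genomic[:, 0] + records[:, 0] > 0).astype(int)

# Fuse by concatenating per-modality features, then classify.
fused = np.concatenate([imaging, genomic, records], axis=1)
X_tr, X_te, y_tr, y_te = train_test_split(fused, labels, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
</syntaxhighlight>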
 
=== Content generation ===
Models like DALL·E generate images from textual descriptions, benefiting creative industries, while cross-modal retrieval enables dynamic multimedia search.<ref name="Shi2019" />
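DALL·E itself is not openly distributed, but a comparable open text-to-image diffusion model can be run in a few lines with the Hugging Face <code>diffusers</code> library; the checkpoint and prompt below are illustrative, and a CUDA-capable GPU is assumed:

<syntaxhighlight lang="python">
# Text-to-image generation with an open diffusion model.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")  # assumes a CUDA-capable GPU

image = pipe("a watercolor painting of a lighthouse at dawn").images[0]
image.save("lighthouse.png")
</syntaxhighlight>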
 
=== Robotics and HCI ===
Multimodal learning improves interaction in robotics and AI by integrating sensory inputs like speech, vision, and touch, aiding autonomous systems and human-computer interaction.
 
=== Emotion recognition ===
By combining visual, audio, and text data, multimodal systems enhance sentiment analysis and emotion recognition, with applications in customer service, social media, and marketing.
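One simple way to combine the modalities is decision-level fusion, where each unimodal classifier emits a sentiment score and the scores are merged by a weighted average; the scores and weights below are synthetic placeholders for real model outputs:

<syntaxhighlight lang="python">
# Decision-level fusion of per-modality sentiment scores.
# All numbers are placeholders for outputs of real unimodal models.
scores = {"vision": 0.72, "audio": 0.55, "text": 0.81}  # P(positive)
weights = {"vision": 0.3, "audio": 0.2, "text": 0.5}    # e.g. tuned on validation data

fused = sum(weights[m] * scores[m] for m in scores)
print("fused positive probability:", round(fused, 3))
print("prediction:", "positive" if fused >= 0.5 else "negative")
</syntaxhighlight>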
 
==See also==