Multimodal representation learning

'''Multimodal representation learning''' is a subfield of [[Feature learning|representation learning]] focused on integrating and interpreting information from different [[Modality (human–computer interaction)|modalities]], such as text, images, audio, or video, by projecting them into a shared latent space. This allows for semantically similar content across modalities to be mapped to nearby points within that space, facilitating a unified understanding of diverse data types.<ref name=":0">{{Cite journal |last1=Guo |first1=Wenzhong |last2=Wang |first2=Jianwen |last3=Wang |first3=Shiping |date=2019 |title=Deep Multimodal Representation Learning: A Survey |journal=IEEE Access |volume=7 |pages=63373–63394 |doi=10.1109/ACCESS.2019.2916887 |issn=2169-3536|doi-access=free |bibcode=2019IEEEA...763373G }}</ref> By automatically learning meaningful features from each modality and capturing their inter-modal relationships, multimodal representation learning enables a unified representation that enhances performance in cross-media analysis tasks such as video classification, event detection, and sentiment analysis. It also supports cross-modal retrieval and translation, including image captioning, video description, and text-to-image synthesis.
 
== Motivation ==
The primary motivations for multimodal representation learning arise from the inherent nature of real-world data and the limitations of unimodal approaches. Since multimodal data offers complementary and supplementary information about an object or event from different perspectives, it is more informative than relying on a single modality.<ref name=":0" /> A key motivation is to narrow the heterogeneity gap that exists between different modalities by projecting their features into a shared semantic subspace. This allows semantically similar content across modalities to be represented by similar vectors, facilitating the understanding of relationships and correlations between them. Multimodal representation learning aims to leverage the unique information provided by each modality to achieve a more comprehensive and accurate understanding of concepts.
 
These unified representations are crucial for improving performance in various cross-media analysis tasks such as video classification, event detection, and sentiment analysis. They also enable cross-modal retrieval, allowing users to search and retrieve content across different modalities.<ref>{{Cite book |last1=Zhang |first1=Su-Fang |last2=Zhai |first2=Jun-Hai |last3=Xie |first3=Bo-Jun |last4=Zhan |first4=Yan |last5=Wang |first5=Xin |chapter=Multimodal Representation Learning: Advances, Trends and Challenges |date=July 2019 |title=2019 International Conference on Machine Learning and Cybernetics (ICMLC) |chapter-url=https://ieeexplore.ieee.org/document/8949228 |publisher=IEEE |pages=1–6 |doi=10.1109/ICMLC48188.2019.8949228 |isbn=978-1-7281-2816-0}}</ref> Additionally, it facilitates cross-modal translation, where information can be converted from one modality to another, as seen in applications like image captioning and text-to-image synthesis. The abundance of ubiquitous multimodal data in real-world applications, including understudied areas like healthcare, finance, and human-computer interaction (HCI), further motivates the development of effective multimodal representation learning techniques.<ref>{{Cite journal |last1=Zhang |first1=Chao |last2=Yang |first2=Zichao |last3=He |first3=Xiaodong |last4=Deng |first4=Li |date=March 2020 |title=Multimodal Intelligence: Representation Learning, Information Fusion, and Applications |url=https://ieeexplore.ieee.org/document/9068414 |journal=IEEE Journal of Selected Topics in Signal Processing |volume=14 |issue=3 |pages=478–493 |doi=10.1109/JSTSP.2020.2987728 |issn=1932-4553|arxiv=1911.03977 |bibcode=2020ISTSP..14..478Z }}</ref>
 
== Approaches and methods ==
 
=== Canonical-correlation analysis based methods ===
[[Canonical correlation|Canonical-correlation analysis]] (CCA) was first introduced in 1936 by [[Harold Hotelling]]<ref>{{Cite journal |last=Hotelling |first=H. |date=1936-12-01 |title=Relations Between Two Sets of Variates |url=https://academic.oup.com/biomet/article-lookup/doi/10.1093/biomet/28.3-4.321 |journal=Biometrika |language=en |volume=28 |issue=3–4 |pages=321–377 |doi=10.1093/biomet/28.3-4.321 |issn=0006-3444|url-access=subscription }}</ref> and is a fundamental approach for multimodal learning. CCA aims to find linear relationships between two sets of variables. Given two data [[Matrix (mathematics)|matrices]] <math>X \in \mathbb{R}^{n \times p}</math> and <math>Y \in \mathbb{R}^{n \times q}</math> representing different modalities, CCA finds projection vectors <math>w_x\in\mathbb{R}^p</math> and <math>w_y\in\mathbb{R}^q</math> that maximize the correlation between the projected variables:

:<math>\rho = \max_{w_x,\,w_y} \frac{w_x^\top \Sigma_{XY} w_y}{\sqrt{w_x^\top \Sigma_{XX} w_x}\,\sqrt{w_y^\top \Sigma_{YY} w_y}}</math>

where <math>\Sigma_{XX}</math> and <math>\Sigma_{YY}</math> are the covariance matrices of the two modalities and <math>\Sigma_{XY}</math> is their cross-covariance matrix.
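The objective above admits a closed-form solution via whitening and a singular value decomposition. The following NumPy sketch is illustrative only (the synthetic data and dimensions are hypothetical), not an implementation from the cited survey:

```python
import numpy as np

def inv_sqrt(M):
    # Inverse matrix square root of a symmetric positive-definite matrix
    vals, vecs = np.linalg.eigh(M)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

def cca(X, Y):
    # Center both modalities, whiten, and take the top singular pair
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx, Syy, Sxy = X.T @ X / n, Y.T @ Y / n, X.T @ Y / n
    T = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    U, s, Vt = np.linalg.svd(T)
    w_x = inv_sqrt(Sxx) @ U[:, 0]   # projection vector for modality X
    w_y = inv_sqrt(Syy) @ Vt[0]     # projection vector for modality Y
    return w_x, w_y, s[0]           # s[0] is the top canonical correlation

rng = np.random.default_rng(0)
z = rng.normal(size=(500, 1))                                    # shared latent signal
X = np.hstack([z, z]) + 0.1 * rng.normal(size=(500, 2))          # toy modality 1 (p = 2)
Y = np.hstack([-z, 2 * z, z]) + 0.1 * rng.normal(size=(500, 3))  # toy modality 2 (q = 3)

w_x, w_y, rho = cca(X, Y)  # rho is close to 1 when the modalities share structure
```

In practice, library implementations such as scikit-learn's <code>cross_decomposition.CCA</code> are typically used instead of hand-rolled solvers.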
 
==== Kernel CCA ====
Kernel canonical-correlation analysis (KCCA) extends CCA to nonlinear relationships by first mapping each modality into a high-dimensional feature space via kernel functions and then applying CCA there. Because the method operates on <math>n \times n</math> kernel matrices, it scales poorly with the number of samples, carrying an <math>O(n^2)</math> memory requirement for storing kernel matrices.
 
KCCA was proposed independently by several researchers.<ref>{{Cite journal |last=Lai |first=P |date=October 2000 |title=Kernel and Nonlinear Canonical Correlation Analysis |url=http://linkinghub.elsevier.com/retrieve/pii/S012906570000034X |journal=International Journal of Neural Systems |volume=10 |issue=5 |pages=365–377 |doi=10.1016/S0129-0657(00)00034-X|pmid=11195936 |url-access=subscription }}</ref><ref>{{Cite web |title=Kernel Independent Component Analysis {{!}} EECS at UC Berkeley |url=https://www2.eecs.berkeley.edu/Pubs/TechRpts/2001/5721.html |access-date=2025-04-16 |website=www2.eecs.berkeley.edu}}</ref><ref>{{Cite book |last1=Dorffner |first1=Georg |title=Artificial Neural Networks -- ICANN 2001: International Conference Vienna, Austria, August 21-25, 2001 Proceedings |last2=Bischof |first2=Horst |last3=Hornik |first3=Kurt |date=2001 |publisher=Springer-Verlag Berlin Heidelberg Springer e-books |isbn=978-3-540-44668-2 |series=Lecture Notes in Computer Science |___location=Berlin, Heidelberg}}</ref><ref>{{Citation |last=Akaho |first=Shotaro |title=A kernel method for canonical correlation analysis |date=2007-02-14 |arxiv=cs/0609071 |id=arXiv:cs/0609071}}</ref>
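A toy NumPy sketch of regularized KCCA, assuming Gaussian (RBF) kernels and the commonly used regularized eigenproblem formulation; the synthetic data, kernel width, and regularization strength are all illustrative choices rather than values from the cited papers:

```python
import numpy as np

def rbf_kernel(X, gamma=0.5):
    # Gaussian affinity between all pairs of samples
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def center(K):
    # Center the kernel matrix in feature space
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def kcca(Kx, Ky, reg=0.1):
    # Regularized KCCA eigenproblem:
    # (Kx + rI)^-1 Ky (Ky + rI)^-1 Kx alpha = lambda^2 alpha
    n = Kx.shape[0]
    I = np.eye(n)
    M = np.linalg.solve(Kx + reg * I, Ky) @ np.linalg.solve(Ky + reg * I, Kx)
    vals, vecs = np.linalg.eig(M)
    alpha = np.real(vecs[:, np.argmax(np.real(vals))])
    beta = np.linalg.solve(Ky + reg * I, Kx @ alpha)
    return alpha, beta

rng = np.random.default_rng(0)
t = rng.uniform(-2, 2, size=(200, 1))  # shared latent variable
X = np.hstack([t, t ** 2]) + 0.05 * rng.normal(size=(200, 2))     # modality 1
Y = np.hstack([np.sin(t), t]) + 0.05 * rng.normal(size=(200, 2))  # modality 2

# Note the two n-by-n kernel matrices: this is the O(n^2) memory cost
Kx, Ky = center(rbf_kernel(X)), center(rbf_kernel(Y))
alpha, beta = kcca(Kx, Ky)
rho = abs(np.corrcoef(Kx @ alpha, Ky @ beta)[0, 1])  # correlation of the projections
```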
 
==== Deep CCA ====
Deep CCA replaces the fixed kernel functions of KCCA with [[deep neural network]]s, training one network per modality so that the correlation between the two networks' outputs is maximized.

=== Graph-based methods ===
Graph-based approaches for multimodal representation learning leverage graph structure to model relationships between entities across different modalities. These methods typically represent each modality as a graph and then learn embeddings that preserve cross-modal similarities, enabling more effective joint representation of heterogeneous data.<ref>{{Cite journal |last1=Ektefaie |first1=Yasha |last2=Dasoulas |first2=George |last3=Noori |first3=Ayush |last4=Farhat |first4=Maha |last5=Zitnik |first5=Marinka |date=2023-04-03 |title=Multimodal learning with graphs |journal=Nature Machine Intelligence |language=en |volume=5 |issue=4 |pages=340–350 |doi=10.1038/s42256-023-00624-6 |issn=2522-5839 |pmc=10704992 |pmid=38076673}}</ref>
 
One such method is the '''cross-modal graph neural network''' (CMGNN), which extends traditional [[graph neural network]]s (GNNs) to handle data from multiple modalities by constructing graphs that capture both intra-modal and inter-modal relationships. These networks model interactions across modalities by representing entities as [[Vertex (graph theory)|nodes]] and their relationships as edges.<ref>{{Cite book |last1=Liu |first1=Shubao |last2=Xie |first2=Yuan |last3=Yuan |first3=Wang |last4=Ma |first4=Lizhuang |chapter=Cross-Modality Graph Neural Network for Few-Shot Learning |date=2021-07-05 |title=2021 IEEE International Conference on Multimedia and Expo (ICME) |chapter-url=https://ieeexplore.ieee.org/document/9428405 |publisher=IEEE |pages=1–6 |doi=10.1109/ICME51207.2021.9428405 |isbn=978-1-6654-3864-3}}</ref>
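As a rough illustration of the idea (not any specific published architecture), the following NumPy sketch performs a single GCN-style propagation step over a toy graph whose adjacency matrix contains both intra-modal and inter-modal edges; the node counts, features, and edges are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy cross-modal graph: nodes 0-2 are "image" entities, nodes 3-5 are
# "text" entities, all embedded in a common 4-dimensional feature space.
H = rng.normal(size=(6, 4))

A = np.zeros((6, 6))
intra_edges = [(0, 1), (1, 2), (3, 4), (4, 5)]  # edges within each modality
inter_edges = [(0, 3), (1, 4), (2, 5)]          # image i paired with text i+3
for i, j in intra_edges + inter_edges:
    A[i, j] = A[j, i] = 1.0
A += np.eye(6)  # self-loops, as in GCN-style propagation

# One propagation step: H' = ReLU(D^{-1/2} A D^{-1/2} H W), so each node
# aggregates features from both same-modality and cross-modality neighbors.
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
W = rng.normal(scale=0.5, size=(4, 2))
H_out = np.maximum(0.0, D_inv_sqrt @ A @ D_inv_sqrt @ H @ W)
```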
 
Other graph-based methods include [[Graphical model|'''probabilistic graphical models''']] (PGMs) such as [[deep belief network]]s (DBN) and deep [[Boltzmann machine]]s (DBM). These models can learn a joint representation across modalities; for instance, a multimodal DBN achieves this by adding a shared restricted Boltzmann machine (RBM) hidden layer on top of modality-specific DBNs.<ref name=":0" /> Additionally, the structure of data in some domains like [[Human–computer interaction|Human-Computer Interaction]] (HCI), such as the view hierarchy of app screens, can potentially be modeled using graph-like structures. The field of graph representation learning is also relevant, with ongoing progress in developing evaluation benchmarks.<ref>{{Cite journal |last1=Chen |first1=Hongruixuan |last2=Yokoya |first2=Naoto |last3=Wu |first3=Chen |last4=Du |first4=Bo |date=2022 |title=Unsupervised Multimodal Change Detection Based on Structural Relationship Graph Representation Learning |url=https://ieeexplore.ieee.org/document/9984688 |journal=IEEE Transactions on Geoscience and Remote Sensing |volume=60 |pages=1–18 |doi=10.1109/TGRS.2022.3229027 |issn=0196-2892|arxiv=2210.00941 |bibcode=2022ITGRS..6029027C }}</ref>
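The shared-RBM construction can be sketched as a single bottom-up inference pass; the layer sizes are hypothetical and random weights stand in for trained modality-specific DBNs:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# Hypothetical top-layer activations of two modality-specific DBNs
h_image = rng.uniform(size=(8, 32))  # batch of 8 samples, 32 image units
h_text = rng.uniform(size=(8, 16))   # 16 text units

# Shared RBM hidden layer: 24 joint units connected to both modality pathways
W_img = rng.normal(scale=0.1, size=(32, 24))
W_txt = rng.normal(scale=0.1, size=(16, 24))
b = np.zeros(24)

# Probability that each joint hidden unit is active given both modalities;
# this vector serves as the joint multimodal representation.
h_joint = sigmoid(h_image @ W_img + h_text @ W_txt + b)
```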
 
=== Diffusion maps ===
Diffusion maps perform nonlinear dimensionality reduction by constructing a Markov transition matrix from pairwise affinities between data points and embedding the data with the leading eigenvectors of the resulting diffusion operator, so that diffusion distances on the underlying data manifold are approximated by Euclidean distances in the embedding.
 
==== Alternating diffusion ====
Alternating diffusion based methods provide another strategy for multimodal representation learning by focusing on extracting the common underlying sources of variability present across multiple views or sensors. These methods aim to filter out sensor-specific or nuisance components, assuming that the phenomenon of interest is captured by two or more sensors. The core idea involves constructing an alternating diffusion operator by sequentially applying diffusion processes derived from each modality, typically through their product or intersection. This process allows the method to capture the structure related to common hidden variables that drive the observed multimodal data.<ref>{{Cite journal |last1=Katz |first1=Ori |last2=Talmon |first2=Ronen |last3=Lo |first3=Yu-Lun |last4=Wu |first4=Hau-Tieng |date=January 2019 |title=Alternating diffusion maps for multimodal data fusion |url=https://linkinghub.elsevier.com/retrieve/pii/S1566253517300192 |journal=Information Fusion |language=en |volume=45 |pages=346–360 |doi=10.1016/j.inffus.2018.01.007|url-access=subscription }}</ref>
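A minimal NumPy sketch of this construction, assuming two toy sensors that each observe a common hidden variable entangled with their own nuisance variable; all parameter choices are illustrative:

```python
import numpy as np

def markov_matrix(X, eps=1.0):
    # Row-stochastic diffusion operator built from a Gaussian affinity kernel
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-d2 / eps)
    return W / W.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n = 300
common = rng.uniform(0, 2 * np.pi, n)  # variable shared by both sensors
nuis1 = rng.uniform(0, 2 * np.pi, n)   # sensor-specific nuisance variables
nuis2 = rng.uniform(0, 2 * np.pi, n)

# Each sensor mixes the common variable with its own nuisance variable
S1 = np.stack([np.cos(common), np.sin(common), np.cos(nuis1), np.sin(nuis1)], axis=1)
S2 = np.stack([np.cos(common), np.sin(common), np.cos(nuis2), np.sin(nuis2)], axis=1)

# Alternating diffusion: apply one sensor's diffusion step after the other's
P = markov_matrix(S2) @ markov_matrix(S1)

# P is still row-stochastic; its leading nontrivial eigenvectors parameterize
# the common variable while the sensor-specific factors are averaged out
vals, vecs = np.linalg.eig(P)
order = np.argsort(-np.real(vals))
embedding = np.real(vecs[:, order[1:3]])  # skip the trivial constant eigenvector
```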
 
== See also ==