== Motivation ==
The primary motivations for multimodal representation learning arise from the inherent nature of real-world data and the limitations of unimodal approaches. Since multimodal data offers complementary and supplementary information about an object or event from different perspectives, it is more informative than relying on a single modality.<ref name=":0" /> A key motivation is to narrow the heterogeneity gap that exists between different modalities by projecting their features into a shared semantic subspace. This allows semantically similar content across modalities to be represented by similar vectors, facilitating the understanding of relationships and correlations between them. Multimodal representation learning aims to leverage the unique information provided by each modality to achieve a more comprehensive and accurate understanding of concepts.
These unified representations are crucial for improving performance in various cross-media analysis tasks such as video classification, event detection, and sentiment analysis. They also enable cross-modal retrieval, allowing users to search and retrieve content across different modalities.<ref>{{Cite journal |last=Zhang |first=Su-Fang |last2=Zhai |first2=Jun-Hai |last3=Xie |first3=Bo-Jun |last4=Zhan |first4=Yan |last5=Wang |first5=Xin |date=July 2019 |title=Multimodal Representation Learning: Advances, Trends and Challenges |url=https://ieeexplore.ieee.org/document/8949228/ |publisher=IEEE |pages=1–6 |doi=10.1109/ICMLC48188.2019.8949228 |isbn=978-1-7281-2816-0}}</ref> Additionally, they facilitate cross-modal translation, where information can be converted from one modality to another, as seen in applications like image captioning and text-to-image synthesis. The abundance of multimodal data in real-world applications, including understudied areas like healthcare, finance, and human-computer interaction (HCI), further motivates the development of effective multimodal representation learning techniques.<ref>{{Cite journal |last=Zhang |first=Chao |last2=Yang |first2=Zichao |last3=He |first3=Xiaodong |last4=Deng |first4=Li |date=March 2020 |title=Multimodal Intelligence: Representation Learning, Information Fusion, and Applications |url=https://ieeexplore.ieee.org/document/9068414/ |journal=IEEE Journal of Selected Topics in Signal Processing |volume=14 |issue=3 |pages=478–493 |doi=10.1109/JSTSP.2020.2987728 |issn=1932-4553}}</ref>
== Approaches and methods ==
=== Canonical-correlation analysis based methods ===
[[Canonical correlation|Canonical-correlation analysis]] (CCA) was first introduced in 1936 by [[Harold Hotelling]]<ref>{{Cite journal |last=Hotelling |first=H. |date=1936-12-01 |title=Relations Between Two Sets of Variates |url=https://academic.oup.com/biomet/article-lookup/doi/10.1093/biomet/28.3-4.321 |journal=Biometrika |language=en |volume=28 |issue=3-4 |pages=321–377 |doi=10.1093/biomet/28.3-4.321 |issn=0006-3444}}</ref> and is a fundamental approach for multimodal learning. CCA aims to find linear relationships between two sets of variables. Given two data [[matrices]] <math>X \in \mathbb{R}^{n \times p}</math> and <math>Y \in \mathbb{R}^{n \times q}</math> representing different modalities, CCA finds projection vectors <math>w_x\in\mathbb{R}^p</math> and <math>w_y\in\mathbb{R}^q</math> that maximize the correlation between the projected variables:

:<math>\rho = \max_{w_x, w_y} \frac{w_x^\top X^\top Y w_y}{\sqrt{\left(w_x^\top X^\top X w_x\right)\left(w_y^\top Y^\top Y w_y\right)}}</math>

assuming the columns of <math>X</math> and <math>Y</math> have been centered.
==== Kernel CCA ====
Kernel canonical correlation analysis (KCCA) extends traditional CCA to capture nonlinear relationships between modalities by implicitly mapping the data into high-dimensional feature spaces using [[Kernel method|kernel functions]]. Given kernel functions <math>\kappa_x</math> and <math>\kappa_y</math> with corresponding [[Gram matrix|Gram matrices]] <math>K_x\in\mathbb{R}^{n \times n}</math> and <math>K_y\in\mathbb{R}^{n \times n}</math>, KCCA seeks coefficient vectors <math>\alpha, \beta \in \mathbb{R}^n</math> that maximize the correlation between the projections <math>K_x \alpha</math> and <math>K_y \beta</math>:

:<math>\rho = \max_{\alpha, \beta} \frac{\alpha^\top K_x K_y \beta}{\sqrt{\left(\alpha^\top K_x^2 \alpha\right)\left(\beta^\top K_y^2 \beta\right)}}</math>

In practice a regularization term is added to the denominator, since the unregularized problem produces degenerate perfect correlations whenever the Gram matrices are invertible.
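As a didactic sketch (not code from the cited literature; the function names are invented), regularized KCCA for the first canonical pair can be reduced to an ordinary eigenproblem on the Gram matrices:

```python
import numpy as np

def rbf_gram(A, gamma=1.0):
    """RBF (Gaussian) Gram matrix of the rows of A."""
    sq = np.sum(A ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * A @ A.T
    return np.exp(-gamma * d2)

def kcca_first_pair(X, Y, gamma=1.0, reg=0.1):
    """First canonical pair of regularized kernel CCA (didactic sketch)."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n  # centering in feature space
    Kx = H @ rbf_gram(X, gamma) @ H
    Ky = H @ rbf_gram(Y, gamma) @ H
    I = np.eye(n)
    # generalized eigenproblem reduced to an ordinary one
    A = np.linalg.solve(Kx @ Kx + reg * I, Kx @ Ky)
    B = np.linalg.solve(Ky @ Ky + reg * I, Ky @ Kx)
    vals, vecs = np.linalg.eig(A @ B)
    i = int(np.argmax(vals.real))
    alpha = vecs[:, i].real
    beta = (B @ alpha).real
    rho = np.sqrt(max(vals[i].real, 0.0))  # canonical correlation
    return alpha, beta, rho
```

The regularization parameter <code>reg</code> plays the role described above: without it the solve is ill-conditioned and the correlation saturates at 1 regardless of the data.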
==== Deep CCA ====
Deep canonical correlation analysis (DCCA), introduced in 2013, employs neural networks to learn nonlinear transformations for maximizing the correlation between modalities.<ref name=":0" /> DCCA uses separate neural networks <math>f_x</math> and <math>f_y</math> for each modality to transform the original data before applying CCA:

:<math>(\theta_x^*, \theta_y^*) = \underset{\theta_x, \theta_y}{\operatorname{argmax}} \; \operatorname{corr}\left(f_x(X; \theta_x), f_y(Y; \theta_y)\right)</math>

where <math>\theta_x, \theta_y</math> are the network parameters and <math>r_x, r_y</math> are the regularization parameters added to the empirical covariance matrices of the network outputs. DCCA overcomes the limitations of linear CCA and kernel CCA by learning complex nonlinear relationships while maintaining computational efficiency for large datasets through mini-batch optimization.<ref>{{Cite journal |last=Andrew |first=Galen |last2=Arora |first2=Raman |last3=Bilmes |first3=Jeff |last4=Livescu |first4=Karen |date=2013-05-26 |title=Deep Canonical Correlation Analysis |url=https://proceedings.mlr.press/v28/andrew13.html |journal=Proceedings of the 30th International Conference on Machine Learning |language=en |publisher=PMLR |pages=1247–1255}}</ref>
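The quantity DCCA maximizes per mini-batch can be computed as the sum of singular values of the whitened cross-covariance of the two network outputs. The NumPy sketch below follows this standard construction but is an illustrative rewrite, not the authors' released code:

```python
import numpy as np

def total_canonical_correlation(Hx, Hy, rx=1e-4, ry=1e-4):
    """Sum of canonical correlations between two views of a mini-batch.

    Hx, Hy: arrays of shape (d, m) -- d features, m samples, i.e. the
    outputs of the two modality networks f_x and f_y.  DCCA trains the
    networks to maximize this quantity."""
    m = Hx.shape[1]
    Hxb = Hx - Hx.mean(axis=1, keepdims=True)  # center each feature
    Hyb = Hy - Hy.mean(axis=1, keepdims=True)
    Sxy = Hxb @ Hyb.T / (m - 1)
    Sxx = Hxb @ Hxb.T / (m - 1) + rx * np.eye(Hx.shape[0])  # regularized
    Syy = Hyb @ Hyb.T / (m - 1) + ry * np.eye(Hy.shape[0])

    def inv_sqrt(S):  # S^{-1/2} via eigendecomposition
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    T = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    # singular values of T are the canonical correlations
    return np.linalg.svd(T, compute_uv=False).sum()
```

For identical views the result approaches the output dimension <math>d</math> (each of the <math>d</math> canonical correlations is 1 up to the small regularization), which is why <math>r_x, r_y</math> keep the whitening step numerically stable.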
=== Graph-based methods ===
Graph-based approaches for multimodal representation learning leverage graph structure to model relationships between entities across different modalities. These methods typically represent each modality as a graph and then learn embeddings that preserve cross-modal similarities, enabling more effective joint representation of heterogeneous data.<ref>{{Cite journal |last=Ektefaie |first=Yasha |last2=Dasoulas |first2=George |last3=Noori |first3=Ayush |last4=Farhat |first4=Maha |last5=Zitnik |first5=Marinka |date=2023-04-03 |title=Multimodal learning with graphs |url=https://www.nature.com/articles/s42256-023-00624-6 |journal=Nature Machine Intelligence |language=en |volume=5 |issue=4 |pages=340–350 |doi=10.1038/s42256-023-00624-6 |issn=2522-5839 |pmc=10704992 |pmid=38076673}}</ref>
Other graph-based methods include [[Graphical model|'''Probabilistic Graphical Models''']] (PGMs) such as [[deep belief network]]s (DBN) and deep [[Boltzmann machine]]s (DBM). These models can learn a joint representation across modalities, for instance, a multimodal DBN achieves this by adding a shared restricted Boltzmann machine (RBM) hidden layer on top of modality-specific DBNs.<ref name=":0" /> Additionally, the structure of data in some domains like [[Human–computer interaction|Human-Computer Interaction]] (HCI), such as the view hierarchy of app screens, can potentially be modeled using graph-like structures. The field of graph representation learning is also relevant, with ongoing progress in developing evaluation benchmarks.<ref>{{Cite journal |last=Chen |first=Hongruixuan |last2=Yokoya |first2=Naoto |last3=Wu |first3=Chen |last4=Du |first4=Bo |date=2022 |title=Unsupervised Multimodal Change Detection Based on Structural Relationship Graph Representation Learning |url=https://ieeexplore.ieee.org/document/9984688/ |journal=IEEE Transactions on Geoscience and Remote Sensing |volume=60 |pages=1–18 |doi=10.1109/TGRS.2022.3229027 |issn=0196-2892}}</ref>
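As a generic illustration of the "one graph per modality" idea (not a method from the cited papers; all names below are invented for the sketch), one can build a k-nearest-neighbour affinity graph for each modality and fuse them naively by averaging their transition matrices:

```python
import numpy as np

def knn_affinity(X, k=5, sigma=1.0):
    """Row-stochastic k-nearest-neighbour affinity graph for one modality."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    W = np.exp(-d2 / (2 * sigma ** 2))  # Gaussian similarities
    np.fill_diagonal(W, 0.0)
    keep = np.argsort(W, axis=1)[:, -k:]  # each node's k strongest edges
    mask = np.zeros_like(W)
    np.put_along_axis(mask, keep, 1.0, axis=1)
    W = np.maximum(W * mask, (W * mask).T)  # symmetrize
    return W / W.sum(axis=1, keepdims=True)  # row-normalize

def fuse_graphs(Wa, Wb):
    """Naive fusion: average the two modality graphs."""
    return 0.5 * (Wa + Wb)
```

Real multimodal graph methods replace the naive average with learned fusion (e.g., graph neural networks over the joint structure), but the fused row-stochastic matrix already supports random-walk style embeddings over both modalities at once.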
=== Diffusion map based methods ===
Another set of methods relevant to multimodal representation learning are based on [[diffusion map]]s and their extensions to handle multiple modalities.
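For reference, a minimal sketch of the classic single-modality diffusion map, which the multimodal extensions build on (the function name is illustrative):

```python
import numpy as np

def diffusion_map(X, n_components=2, sigma=1.0, t=1):
    """Classic diffusion-map embedding of the rows of X (didactic sketch)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    W = np.exp(-d2 / (2 * sigma ** 2))  # Gaussian affinities
    P = W / W.sum(axis=1, keepdims=True)  # Markov transition matrix
    vals, vecs = np.linalg.eig(P)
    order = np.argsort(-vals.real)
    vals, vecs = vals.real[order], vecs.real[:, order]
    # drop the trivial constant eigenvector (eigenvalue 1); scale by
    # eigenvalues^t to emphasize slow diffusion modes at time t
    return (vals[1:n_components + 1] ** t) * vecs[:, 1:n_components + 1]
```

Multimodal variants differ mainly in how the single affinity matrix <code>W</code> is replaced by a combination of per-modality affinities before the eigendecomposition.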
== See also ==
* [[Feature learning|Representation learning]]
* [[Canonical correlation]]