The primary motivations for multimodal representation learning arise from the inherent nature of real-world data and the limitations of unimodal approaches. Because multimodal data offers complementary and supplementary information about an object or event from different perspectives, it is more informative than any single modality alone.<ref name=":0" /> A key motivation is to narrow the heterogeneity gap between modalities by projecting their features into a shared semantic subspace, so that semantically similar content is represented by similar vectors regardless of its modality; this makes relationships and correlations across modalities easier to model. Multimodal representation learning thus aims to leverage the unique information provided by each modality to achieve a more comprehensive and accurate understanding of concepts.
These unified representations are crucial for improving performance in cross-media analysis tasks such as video classification, event detection, and sentiment analysis. They also enable cross-modal retrieval, in which a query expressed in one modality (for example, text) retrieves semantically related content in another (for example, images).<ref>{{Cite journal |last=Zhang |first=Su-Fang |last2=Zhai |first2=Jun-Hai |last3=Xie |first3=Bo-Jun |last4=Zhan |first4=Yan |last5=Wang |first5=Xin |date=July 2019}}</ref>
== Approaches and Methods ==
KCCA scales poorly to large datasets, however, since it incurs an <math>O(n^2)</math> memory requirement for storing the kernel matrices, where <math>n</math> is the number of training samples.
KCCA was proposed independently by several researchers.<ref>{{Cite journal |last=Lai |first=P |date=October 2000}}</ref>
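The following is a minimal NumPy sketch of regularized KCCA for two views, intended only to illustrate the construction; the Gaussian kernel width <code>gamma</code>, the regularization constant <code>reg</code>, and the function names are illustrative choices rather than part of any standard implementation. Note that the two <math>n \times n</math> kernel matrices are exactly what gives rise to the quadratic memory cost mentioned above.

<syntaxhighlight lang="python">
import numpy as np
from scipy.linalg import eigh

def rbf_kernel(X, gamma=1.0):
    # Gaussian (RBF) kernel matrix from pairwise squared Euclidean distances.
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * d2)

def center_kernel(K):
    # Double-centering, equivalent to removing the mean in feature space.
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def kcca(X, Y, gamma=1.0, reg=1e-3, n_components=2):
    # Two n-by-n kernel matrices: the source of the O(n^2) memory cost.
    Kx = center_kernel(rbf_kernel(X, gamma))
    Ky = center_kernel(rbf_kernel(Y, gamma))
    n = Kx.shape[0]
    Z, I = np.zeros((n, n)), np.eye(n)
    # Regularized KCCA as a symmetric generalized eigenproblem:
    # [[0, KxKy], [KyKx, 0]] w = rho [[Kx^2 + reg*I, 0], [0, Ky^2 + reg*I]] w
    A = np.block([[Z, Kx @ Ky], [Ky @ Kx, Z]])
    B = np.block([[Kx @ Kx + reg * I, Z], [Z, Ky @ Ky + reg * I]])
    vals, vecs = eigh(A, B)
    top = vecs[:, np.argsort(vals)[::-1][:n_components]]
    alpha, beta = top[:n], top[n:]
    # Project each view with its dual coefficients into the shared space.
    return Kx @ alpha, Ky @ beta

# Toy usage with random stand-in data for two paired views.
X, Y = np.random.randn(100, 5), np.random.randn(100, 8)
Zx, Zy = kcca(X, Y)
</syntaxhighlight>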
==== Deep CCA ====
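Deep CCA replaces the fixed kernel mappings of KCCA with deep neural networks: each modality is passed through its own network, and the two networks are trained jointly so that their outputs are maximally correlated in the CCA sense. Below is a compact PyTorch sketch of this objective; the layer widths, regularization constant, and random stand-in data are illustrative assumptions, and a practical implementation would add the numerical safeguards described in the Deep CCA literature.

<syntaxhighlight lang="python">
import torch

def inv_sqrtm(S, eps=1e-6):
    # Inverse matrix square root of a symmetric positive-definite matrix.
    w, V = torch.linalg.eigh(S)
    return V @ torch.diag(torch.clamp(w, min=eps).rsqrt()) @ V.T

def dcca_loss(H1, H2, reg=1e-4):
    # Negative total canonical correlation between the two network outputs:
    # the sum of singular values of S11^{-1/2} S12 S22^{-1/2}.
    n = H1.shape[0]
    H1c = H1 - H1.mean(0, keepdim=True)
    H2c = H2 - H2.mean(0, keepdim=True)
    S11 = H1c.T @ H1c / (n - 1) + reg * torch.eye(H1.shape[1])
    S22 = H2c.T @ H2c / (n - 1) + reg * torch.eye(H2.shape[1])
    S12 = H1c.T @ H2c / (n - 1)
    T = inv_sqrtm(S11) @ S12 @ inv_sqrtm(S22)
    return -torch.linalg.svdvals(T).sum()

# One view-specific network per modality; dimensions are illustrative.
f = torch.nn.Sequential(torch.nn.Linear(784, 256), torch.nn.ReLU(),
                        torch.nn.Linear(256, 10))
g = torch.nn.Sequential(torch.nn.Linear(512, 256), torch.nn.ReLU(),
                        torch.nn.Linear(256, 10))
opt = torch.optim.Adam(list(f.parameters()) + list(g.parameters()), lr=1e-3)

x, y = torch.randn(64, 784), torch.randn(64, 512)  # stand-in paired batch
opt.zero_grad()
loss = dcca_loss(f(x), g(y))
loss.backward()
opt.step()
</syntaxhighlight>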
==== Multi-view diffusion maps ====
Multi-view diffusion maps perform multi-view dimensionality reduction by exploiting several views jointly to extract a coherent low-dimensional representation of the data. The core idea is to use both the intrinsic relations within each view and the mutual relations between the different views, defining a cross-view model in which a [[random walk]] process implicitly hops between objects in different views. A multi-view kernel matrix is constructed by combining these relations, which defines a cross-view diffusion process and associated diffusion distances. The [[Eigendecomposition of a matrix|spectral decomposition]] of this kernel yields an embedding that better leverages the information from all views. The method has demonstrated utility in machine learning tasks including classification, clustering, and manifold learning.
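As a concrete illustration, the NumPy sketch below builds such a cross-view diffusion operator for two views of the same <math>n</math> objects. The block structure of the combined kernel (zero diagonal blocks, products of the single-view kernels off the diagonal) follows the cross-view construction described above, while the kernel bandwidth <code>eps</code> and diffusion time <code>t</code> are illustrative parameters.

<syntaxhighlight lang="python">
import numpy as np

def gaussian_kernel(X, eps=1.0):
    # Symmetric affinity matrix from pairwise squared distances in one view.
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / eps)

def multiview_diffusion_embedding(X1, X2, eps=1.0, n_components=2, t=1):
    K1, K2 = gaussian_kernel(X1, eps), gaussian_kernel(X2, eps)
    n = K1.shape[0]
    # Cross-view kernel: zero diagonal blocks force every step of the
    # random walk to hop between the two views via K1@K2 or K2@K1.
    K = np.block([[np.zeros((n, n)), K1 @ K2],
                  [K2 @ K1, np.zeros((n, n))]])
    # Row normalization turns the kernel into a random-walk operator.
    P = K / K.sum(axis=1, keepdims=True)
    vals, vecs = np.linalg.eig(P)
    idx = np.argsort(-vals.real)[1:n_components + 1]  # skip trivial mode
    emb = vecs[:, idx].real * (vals[idx].real ** t)
    # Rows 0..n-1 embed the objects as seen from view 1, rows n..2n-1 view 2.
    return emb[:n], emb[n:]
</syntaxhighlight>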
==== Alternating diffusion ====
Methods based on alternating diffusion provide another strategy for multimodal representation learning, focusing on extracting the common underlying sources of variability present across multiple views or sensors. Assuming the phenomenon of interest is captured by two or more sensors, these methods aim to filter out sensor-specific or nuisance components. The core idea is to construct an alternating-diffusion operator by sequentially applying the diffusion processes derived from each modality, typically through their product or intersection. The composed operator captures the structure related to the common hidden variables that drive the observed multimodal data.<ref>{{Cite journal |last=Katz |first=Ori |last2=Talmon |first2=Ronen |last3=Lo |first3=Yu-Lun |last4=Wu |first4=Hau-Tieng |date=January 2019}}</ref>
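A minimal NumPy sketch of the product construction follows: one diffusion step is taken in each sensor in turn, and the spectral decomposition of the composed operator parametrizes the common variable. The Gaussian bandwidth <code>eps</code> and the use of a plain eigendecomposition (rather than the more careful spectral analysis found in the literature) are simplifying assumptions.

<syntaxhighlight lang="python">
import numpy as np

def diffusion_operator(X, eps=1.0):
    # Row-stochastic Markov matrix from a Gaussian affinity on one sensor.
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-d2 / eps)
    return K / K.sum(axis=1, keepdims=True)

def alternating_diffusion_embedding(X1, X2, eps=1.0, n_components=2):
    # One diffusion step per sensor, composed: the product diffuses only
    # along structure shared by both sensors, damping sensor-specific
    # (nuisance) directions of variability.
    P = diffusion_operator(X2, eps) @ diffusion_operator(X1, eps)
    vals, vecs = np.linalg.eig(P)
    idx = np.argsort(-np.abs(vals))[1:n_components + 1]  # drop trivial mode
    # Eigenvectors of the composed operator parametrize the common variable.
    return (vecs[:, idx] * vals[idx]).real
</syntaxhighlight>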
== See also ==