'''Multimodal learning''' attempts to model the combination of different [[Modality (human–computer interaction)|modalities]] of data, often arising in real-world applications. An example of multi-modal data is data that combines text (typically represented as discrete word count vectors) with imaging data consisting of [[pixel]] intensities and annotation tags. As these modalities have fundamentally different statistical properties, combining them is non-trivial, which is why specialized modelling strategies and algorithms are required.
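The following toy sketch illustrates the two representations just mentioned; the vocabulary, caption and image size are hypothetical and chosen purely for illustration:

<syntaxhighlight lang="python">
# Toy illustration of why the two modalities differ statistically:
# text becomes a sparse vector of non-negative integer word counts,
# while an image becomes a dense vector of real-valued pixel intensities.
import numpy as np

vocabulary = ["cat", "dog", "grass", "sky"]   # hypothetical vocabulary
caption = "cat on grass grass"                # hypothetical caption

# Discrete word-count vector (sparse, integer-valued).
text_vector = np.array([caption.split().count(w) for w in vocabulary])

# Pixel-intensity vector (dense, real-valued): a flattened 8x8 grayscale image.
image_vector = np.random.rand(8 * 8)

print(text_vector)         # [1 0 2 0]
print(image_vector.shape)  # (64,)
</syntaxhighlight>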
 
==Motivation==
Many models and algorithms have been implemented to retrieve and classify a single type of data, e.g. images or text. However, real-world data usually comes in different modalities, each of which carries different information. For example, it is very common to caption an image to convey information not presented in the image itself. Similarly, sometimes it is more straightforward to use an image to describe information which may not be obvious from text. As a result, if different words appear in similar images, these words are likely to describe the same thing; conversely, if a word is used to describe seemingly dissimilar images, these images may represent the same object. Thus, when dealing with multi-modal data, it is important to use a model which can jointly represent the information, such that it captures the correlation structure between the different modalities. Moreover, the model should also be able to recover missing modalities given the observed ones (e.g. predicting a possible image according to a text description). The '''multimodal deep Boltzmann machine''' satisfies both purposes.
 
==Background: Boltzmann machine==
A [[Boltzmann machine]] is a type of stochastic neural network invented by [[Geoffrey Hinton]] and [[Terry Sejnowski]] in 1985. Boltzmann machines can be seen as the [[stochastic process|stochastic]], [[generative model|generative]] counterpart of [[Hopfield net]]s. They are named after the [[Boltzmann distribution]] in statistical mechanics. The units in Boltzmann machines are divided into two groups: visible units and hidden units. A general Boltzmann machine allows connections between any pair of units. However, learning with general Boltzmann machines is impractical because the computational time grows exponentially with the size of the machine{{Citation needed}}. A more efficient architecture, the '''[[restricted Boltzmann machine]]''', allows connections only between a hidden unit and a visible unit; it is described in the next section.
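Concretely (a standard formulation, not specific to any one source), a Boltzmann machine over binary units <math>s_i \in \{0,1\}</math> with symmetric weights <math>w_{ij}</math> and biases <math>\theta_i</math> assigns each configuration <math>\mathbf{s}</math> the energy

: <math>E(\mathbf{s}) = -\sum_{i<j} w_{ij} s_i s_j - \sum_i \theta_i s_i</math>

and samples configurations with probability

: <math>P(\mathbf{s}) = \frac{e^{-E(\mathbf{s})}}{Z}, \qquad Z = \sum_{\mathbf{s}'} e^{-E(\mathbf{s}')}.</math>

The partition function <math>Z</math> ranges over all <math>2^N</math> configurations of <math>N</math> units, which is the source of the exponential cost of exact learning mentioned above.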
 
===Restricted Boltzmann machine===
 
==Application==
Multimodal deep Boltzmann machines have been used successfully in classification and missing-data retrieval. The classification accuracy of the multimodal deep Boltzmann machine has been reported to outperform [[support vector machine]]s, [[latent Dirichlet allocation]] and [[deep belief network]]s when models are tested on data with both image and text modalities, or with a single modality{{Citation needed}}. The multimodal deep Boltzmann machine is also able to predict missing modalities given the observed ones with reasonably good precision{{Citation needed}}.
[[Self-supervised learning]] has brought more flexible and powerful models to multimodality: [[OpenAI]] developed the CLIP and [[DALL-E]] models, which revolutionized the field.
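As an illustration, the following is a minimal sketch of zero-shot image–text matching with CLIP. It assumes the Hugging Face <code>transformers</code> library and a local file <code>cat.jpg</code>, neither of which is specified by the article; it is one possible implementation, not a canonical one.

<syntaxhighlight lang="python">
# Minimal sketch: score an image against candidate captions with CLIP.
# Assumes the Hugging Face "transformers" library; "cat.jpg" is a
# hypothetical local image file.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")
captions = ["a photo of a cat", "a photo of a dog"]

# Encode both modalities into CLIP's shared embedding space.
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Image-caption similarity scores, normalized to a probability over captions.
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(captions, probs[0].tolist())))
</syntaxhighlight>

Because both modalities are embedded in the same space, the same model supports retrieval in either direction: finding captions for an image, or images for a caption.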
 
Multimodal deep learning is used for [[cancer screening]] – at least one system under development [[Data integration#Medicine and Life Sciences|integrates]] such different types of data.<ref>{{cite news |last1=Quach |first1=Katyanna |title=Harvard boffins build multimodal AI system to predict cancer |url=https://www.theregister.com/2022/08/09/ai_cancer_multimodal/ |access-date=16 September 2022 |work=The Register |language=en}}</ref><ref>{{cite journal |last1=Chen |first1=Richard J. |last2=Lu |first2=Ming Y. |last3=Williamson |first3=Drew F. K. |last4=Chen |first4=Tiffany Y. |last5=Lipkova |first5=Jana |last6=Noor |first6=Zahra |last7=Shaban |first7=Muhammad |last8=Shady |first8=Maha |last9=Williams |first9=Mane |last10=Joo |first10=Bumjin |last11=Mahmood |first11=Faisal |title=Pan-cancer integrative histology-genomic analysis via multimodal deep learning |journal=Cancer Cell |date=8 August 2022 |volume=40 |issue=8 |pages=865–878.e6 |doi=10.1016/j.ccell.2022.07.004 |pmid=35944502 |s2cid=251456162 |language=English |issn=1535-6108}}</ref>