{{Short description|Machine learning methods using multiple input modalities}}
{{machine learning}}
'''Multimodal learning''' is a type of [[deep learning]] that integrates and processes multiple types of data, referred to as [[Modality (human–computer interaction)|modalities]], such as text, audio, images, or video. This integration allows for a more holistic understanding of complex data, improving model performance in tasks like visual question answering, cross-modal retrieval,<ref>{{Cite arXiv |last1=Hendriksen |first1=Mariya |last2=Bleeker |first2=Maurits |last3=Vakulenko |first3=Svitlana |last4=van Noord |first4=Nanne |last5=Kuiper |first5=Ernst |last6=de Rijke |first6=Maarten |date=2021 |title=Extending CLIP for Category-to-image Retrieval in E-commerce |class=cs.CV |eprint=2112.11294}}</ref> text-to-image generation,<ref name="stable-diffusion-github">{{cite web |date=17 September 2022 |title=Stable Diffusion Repository on GitHub |url=https://github.com/CompVis/stable-diffusion |url-status=live |archive-url=https://web.archive.org/web/20230118183342/https://github.com/CompVis/stable-diffusion |archive-date=January 18, 2023 |access-date=17 September 2022 |publisher=CompVis - Machine Vision and Learning Research Group, LMU Munich}}</ref> aesthetic ranking,<ref>{{Citation |title=LAION-AI/aesthetic-predictor |date=2024-09-06 |url=https://github.com/LAION-AI/aesthetic-predictor |access-date=2024-09-08 |publisher=LAION AI}}</ref> and image captioning.<ref>{{Cite arXiv |last1=Mokady |first1=Ron |last2=Hertz |first2=Amir |last3=Bermano |first3=Amit H. |date=2021 |title=ClipCap: CLIP Prefix for Image Captioning |class=cs.CV |eprint=2111.09734}}</ref>
Large multimodal models, such as [[Google Gemini]] and [[GPT-4o]], have become increasingly popular since 2023, enabling increased versatility and a broader understanding of real-world phenomena.<ref>{{Cite web |last=Zia |first=Tehseen |date=January 8, 2024 |title=Unveiling of Large Multimodal Models: Shaping the Landscape of Language Models in 2024 |url=https://www.unite.ai/unveiling-of-large-multimodal-models-shaping-the-landscape-of-language-models-in-2024/ |access-date=2024-06-01 |website=Unite.ai}}</ref>
==Motivation==
== Multimodal transformers ==
{{excerpt|Transformer (machine learning model)|Multimodality}}
=== Multimodal large language models ===
{{excerpt|Large language model|Multimodality}}
== Multimodal deep Boltzmann machines ==
A [[Boltzmann machine]] is a type of [[stochastic neural network]] invented by [[Geoffrey Hinton]] and [[Terry Sejnowski]] in 1985. Boltzmann machines can be seen as the [[stochastic process|stochastic]], [[generative model|generative]] counterpart of [[Hopfield net]]s. They are named after the [[Boltzmann distribution]] in statistical mechanics. The units in Boltzmann machines are divided into two groups: visible units and hidden units. Each unit is like a neuron with a binary output that indicates whether or not it is activated.<ref>{{Cite web |last=Dey |first=Victor |date=2021-09-03 |title=Beginners Guide to Boltzmann Machine |url=https://analyticsindiamag.com/beginners-guide-to-boltzmann-machines/ |access-date=2024-03-02 |website=Analytics India Magazine |language=en-US}}</ref> General Boltzmann machines allow connections between any pair of units. However, learning with general Boltzmann machines is impractical because the computational time grows exponentially with the size of the machine{{Citation needed|date=November 2022}}. A more efficient architecture is the [[restricted Boltzmann machine]], in which connections are allowed only between hidden units and visible units.
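The effect of this restriction can be illustrated with a single contrastive-divergence update for a toy restricted Boltzmann machine. The sketch below is illustrative only; the layer sizes, learning rate, and random data are arbitrary choices, and biases are initialized to zero for simplicity.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Restricted Boltzmann machine: connections exist only between the visible
# and hidden layers, so hidden units are conditionally independent given the
# visible units (and vice versa). Sizes are arbitrary, for illustration only.
n_visible, n_hidden = 6, 4
W = rng.normal(0, 0.1, (n_visible, n_hidden))
b_v = np.zeros(n_visible)   # visible biases
b_h = np.zeros(n_hidden)    # hidden biases
lr = 0.1                    # learning rate (arbitrary)

v0 = rng.integers(0, 2, n_visible).astype(float)   # one toy binary training example

# One step of contrastive divergence (CD-1)
p_h0 = sigmoid(v0 @ W + b_h)                        # P(h = 1 | v0)
h0 = (rng.random(n_hidden) < p_h0).astype(float)    # sample binary hidden state
p_v1 = sigmoid(h0 @ W.T + b_v)                      # reconstruct the visible units
p_h1 = sigmoid(p_v1 @ W + b_h)                      # hidden probabilities given the reconstruction

# Update: difference between data-driven and reconstruction-driven statistics
W   += lr * (np.outer(v0, p_h0) - np.outer(p_v1, p_h1))
b_v += lr * (v0 - p_v1)
b_h += lr * (p_h0 - p_h1)
</syntaxhighlight>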
Multimodal deep Boltzmann machines can process and learn from different types of information, such as images and text, simultaneously. One way to achieve this is to build a separate deep Boltzmann machine for each modality, for example one for images and one for text, and join them at an additional top hidden layer that learns a joint representation of both modalities.<ref>{{cite web |year=2014 |title=Multimodal Learning with Deep Boltzmann Machine |url=http://www.jmlr.org/papers/volume15/srivastava14b/srivastava14b.pdf |url-status=live |archive-url=https://web.archive.org/web/20150621055730/http://jmlr.org/papers/volume15/srivastava14b/srivastava14b.pdf |archive-date=2015-06-21 |access-date=2015-06-14}}</ref>
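The following sketch illustrates the idea of joining two modality-specific pathways at a shared top layer. It is a simplified illustration rather than the model described in the cited paper: the layer sizes, random weights, toy inputs, and single up–down pass are arbitrary, and biases are omitted.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p):
    # Binary stochastic units: each unit activates with probability p
    return (rng.random(p.shape) < p).astype(float)

# Illustrative layer sizes (arbitrary, not taken from any published model)
n_img, n_txt, n_hid, n_joint = 64, 32, 16, 8

# Randomly initialized weights for each modality-specific pathway
# and for the shared top ("joint") layer
W_img   = rng.normal(0, 0.1, (n_img, n_hid))
W_txt   = rng.normal(0, 0.1, (n_txt, n_hid))
W_joint = rng.normal(0, 0.1, (2 * n_hid, n_joint))

# Toy binary inputs standing in for image and text features
v_img = sample(np.full(n_img, 0.5))
v_txt = sample(np.full(n_txt, 0.5))

# Bottom-up pass through each modality-specific pathway
h_img = sample(sigmoid(v_img @ W_img))
h_txt = sample(sigmoid(v_txt @ W_txt))

# Join both pathways at the shared top hidden layer
h_joint = sample(sigmoid(np.concatenate([h_img, h_txt]) @ W_joint))

# Top-down pass: infer the text pathway from the joint representation,
# e.g. to fill in a missing text modality given only an image
h_down = sigmoid(h_joint @ W_joint.T)[n_hid:]      # text-side hidden activations
v_txt_reconstructed = sigmoid(h_down @ W_txt.T)    # probabilities over text units
print(v_txt_reconstructed.shape)                   # (32,)
</syntaxhighlight>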
== Applications ==
Multimodal machine learning has numerous applications across various domains:
* '''Cross-modal retrieval''': cross-modal retrieval allows users to search for data across different modalities (e.g., retrieving images based on text descriptions), improving multimedia search engines and content recommendation systems. Models like [[Contrastive Language-Image Pre-training|CLIP]] facilitate efficient, accurate retrieval by embedding data in a shared space, demonstrating strong performance even in zero-shot settings; an illustrative sketch of this shared-embedding retrieval pattern is shown after this list.<ref>{{Cite arXiv |last1=Hendriksen |first1=Mariya |last2=Vakulenko |first2=Svitlana |last3=Kuiper |first3=Ernst |last4=de Rijke |first4=Maarten |date=2023 |title=Scene-centric vs. Object-centric Image-Text Cross-modal Retrieval: A Reproducibility Study |class=cs.CV |eprint=2301.05174}}</ref>
* '''Classification and missing data retrieval''': multimodal deep Boltzmann machines outperform traditional models such as [[support vector machine]]s and [[latent Dirichlet allocation]] in classification tasks and can predict missing data in multimodal datasets, such as images and text.
* '''Healthcare diagnostics''': multimodal models integrate medical imaging, genomic data, and patient records to improve diagnostic accuracy and early disease detection, especially in cancer screening.<ref>{{cite news |last1=Quach |first1=Katyanna |title=Harvard boffins build multimodal AI system to predict cancer |url=https://www.theregister.com/2022/08/09/ai_cancer_multimodal/ |access-date=16 September 2022 |work=The Register |language=en |archive-date=20 September 2022 |archive-url=https://web.archive.org/web/20220920163859/https://www.theregister.com/2022/08/09/ai_cancer_multimodal/ |url-status=live }}</ref><ref>{{cite journal |last1=Chen |first1=Richard J. |last2=Lu |first2=Ming Y. |last3=Williamson |first3=Drew F. K. |last4=Chen |first4=Tiffany Y. |last5=Lipkova |first5=Jana |last6=Noor |first6=Zahra |last7=Shaban |first7=Muhammad |last8=Shady |first8=Maha |last9=Williams |first9=Mane |last10=Joo |first10=Bumjin |last11=Mahmood |first11=Faisal |title=Pan-cancer integrative histology-genomic analysis via multimodal deep learning |journal=Cancer Cell |date=8 August 2022 |volume=40 |issue=8 |pages=865–878.e6 |doi=10.1016/j.ccell.2022.07.004 |pmid=35944502 |s2cid=251456162 |language=English |issn=1535-6108|doi-access=free |pmc=10397370 }}
*Teaching hospital press release: {{cite news |title=New AI technology integrates multiple data types to predict cancer outcomes |url=https://medicalxpress.com/news/2022-08-ai-technology-multiple-cancer-outcomes.html |access-date=18 September 2022 |work=[[Brigham and Women's Hospital]] via medicalxpress.com |language=en |archive-date=20 September 2022 |archive-url=https://web.archive.org/web/20220920172825/https://medicalxpress.com/news/2022-08-ai-technology-multiple-cancer-outcomes.html |url-status=live }}</ref><ref>{{Cite arXiv |last1=Shi |first1=Yuge |last2=Siddharth |first2=N. |last3=Paige |first3=Brooks |last4=Torr |first4=Philip HS |year=2019 |title=Variational Mixture-of-Experts Autoencoders for Multi-Modal Deep Generative Models |eprint=1911.03393 |class=cs.LG}}</ref>
* '''Content generation''': models like [[DALL·E]] generate images from textual descriptions, benefiting creative industries, while cross-modal retrieval enables dynamic multimedia searches.<ref>{{Cite arXiv |last1=Shi |first1=Yuge |last2=Siddharth |first2=N. |last3=Paige |first3=Brooks |last4=Torr |first4=Philip HS |date=2019 |title=Variational Mixture-of-Experts Autoencoders for Multi-Modal Deep Generative Models |class=cs.LG |eprint=1911.03393}}</ref>
* '''Robotics and human–computer interaction''': by integrating sensory inputs such as speech, vision, and touch, multimodal learning supports autonomous systems and [[Human–computer interaction|human–computer interaction]].
* '''Emotion recognition''': combining visual, audio, and text data, multimodal systems enhance [[sentiment analysis]] and [[emotion recognition]], applied in customer service, social media, and marketing.
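The cross-modal retrieval pattern mentioned above can be sketched as follows. The random vectors merely stand in for embeddings that a dual-encoder model such as CLIP would produce; the collection size and embedding dimension are arbitrary, and this is an illustration of the shared-embedding idea rather than any particular system.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for embeddings from a dual-encoder model such as CLIP;
# in practice these come from the image and text encoders, here they are random.
image_embeddings = rng.normal(size=(5, 512))   # 5 images in the collection
query_embedding  = rng.normal(size=(512,))     # embedding of a text query

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Cosine similarity between the text query and every image in the shared space
scores = normalize(image_embeddings) @ normalize(query_embedding)

# Retrieve images ranked by similarity to the text query
ranking = np.argsort(-scores)
print(ranking)   # indices of images, most similar first
</syntaxhighlight>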
==See also==
==References==
{{reflist}}
[[Category:Artificial neural networks]]
[[Category:Multimodal interaction]]