'''Multimodal sentiment analysis''' extends traditional text-based sentiment analysis beyond the analysis of text to include other [[Modality (human–computer interaction)|modalities]] such as audio and visual data.<ref>{{cite journal |last1=Soleymani |first1=Mohammad |last2=Garcia |first2=David |last3=Jou |first3=Brendan |last4=Schuller |first4=Björn |last5=Chang |first5=Shih-Fu |last6=Pantic |first6=Maja |title=A survey of multimodal sentiment analysis |journal=Image and Vision Computing |date=September 2017 |volume=65 |pages=3–14 |doi=10.1016/j.imavis.2017.08.003}}</ref> It can be bimodal, using different combinations of two modalities, or trimodal, incorporating all three.<ref>{{cite journal |last1=Karray |first1=Fakhreddine |last2=Milad |first2=Alemzadeh |last3=Saleh |first3=Jamil Abou |last4=Mo Nours |first4=Arab |title=Human-Computer Interaction: Overview on State of the Art |journal=International Journal on Smart Sensing and Intelligent Systems |date=2008 |url=http://s2is.org/Issues/v1/n1/papers/paper9.pdf}}</ref> With the extensive amount of social media data available online in forms such as videos and images, conventional text-based sentiment analysis has evolved into more complex models of multimodal sentiment analysis,<ref name="s1">{{cite journal |last1=Poria |first1=Soujanya |last2=Cambria |first2=Erik |last3=Bajpai |first3=Rajiv |last4=Hussain |first4=Amir |title=A review of affective computing: From unimodal analysis to multimodal fusion |journal=Information Fusion |date=September 2017 |volume=37 |pages=98–125 |doi=10.1016/j.inffus.2017.02.003}}</ref> which can be applied to the development of [[virtual assistant]]s,<ref name="s5">{{cite web |title=Google AI to make phone calls for you |url=https://www.bbc.com/news/technology-44045424 |website=BBC News |accessdate=12 June 2018 |date=8 May 2018}}</ref> the analysis of YouTube movie reviews,<ref name="s4">{{cite journal |last1=Wollmer |first1=Martin |last2=Weninger |first2=Felix |last3=Knaup |first3=Tobias |last4=Schuller |first4=Bjorn |last5=Sun |first5=Congkai |last6=Sagae |first6=Kenji |last7=Morency |first7=Louis-Philippe |title=YouTube Movie Reviews: Sentiment Analysis in an Audio-Visual Context |journal=IEEE Intelligent Systems |date=May 2013 |volume=28 |issue=3 |pages=46–53 |doi=10.1109/MIS.2013.34}}</ref> the analysis of news videos,<ref>{{cite journal |last1=Pereira |first1=Moisés H. R. |last2=Pádua |first2=Flávio L. C. |last3=Pereira |first3=Adriano C. M. |last4=Benevenuto |first4=Fabrício |last5=Dalip |first5=Daniel H. |title=Fusing Audio, Textual and Visual Features for Sentiment Analysis of News Videos |journal=arXiv:1604.02612 [cs] |date=9 April 2016 |url=http://arxiv.org/abs/1604.02612}}</ref> and [[emotion recognition]] (sometimes known as [[emotion]] detection), including [[depression]] monitoring,<ref name="s6">{{cite journal |last1=Zucco |first1=Chiara |last2=Calabrese |first2=Barbara |last3=Cannataro |first3=Mario |title=Sentiment analysis and affective computing for depression monitoring |journal=2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) |date=November 2017 |pages=1988–1995 |doi=10.1109/bibm.2017.8217966 |url=http://doi.ieeecomputersociety.org/10.1109/BIBM.2017.8217966 |publisher=IEEE |language=English}}</ref> among others.
 
As in traditional sentiment analysis, one of the most basic tasks in multimodal sentiment analysis is [[sentiment]] classification, which classifies sentiments as positive, negative, or neutral.<ref>{{cite book |last1=Pang |first1=Bo |last2=Lee |first2=Lillian |title=Opinion mining and sentiment analysis |date=2008 |publisher=Now Publishers |___location=Hanover, MA |isbn=1601981503}}</ref> The complexity of analyzing text, audio, and visual features to perform such a task requires the application of different fusion techniques, such as feature-level, decision-level, and hybrid fusion.<ref name="s1" /> The performance of these fusion techniques and of the classification algorithms applied is influenced by the type of textual, audio, and visual features employed in the analysis.<ref name="s7" />
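
As a minimal illustration of this three-way classification, the sketch below maps a continuous polarity score onto a positive, negative, or neutral label; the score range and the width of the neutral band are illustrative assumptions rather than values taken from the cited works.

<syntaxhighlight lang="python">
def label_from_polarity(score: float, neutral_band: float = 0.1) -> str:
    """Map a polarity score in [-1, 1] to a three-class sentiment label.

    The width of the neutral band is an illustrative assumption, not a
    value prescribed by the literature cited in this article.
    """
    if score > neutral_band:
        return "positive"
    if score < -neutral_band:
        return "negative"
    return "neutral"


print(label_from_polarity(0.6))    # positive
print(label_from_polarity(-0.4))   # negative
print(label_from_polarity(0.05))   # neutral
</syntaxhighlight>
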
== Fusion techniques ==
 
Unlike traditional text-based sentiment analysis, multimodal sentiment analysis involves a fusion process in which data from the different modalities (text, audio, or visual) are combined and analyzed together.<ref name="s1" /> Existing approaches to [[data fusion]] in multimodal sentiment analysis can be grouped into three main categories: feature-level, decision-level, and hybrid fusion. The performance of the sentiment classification depends on which type of fusion technique is employed.<ref name="s1" />
 
=== Feature-level fusion ===
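
In feature-level (early) fusion, the features extracted from each modality are concatenated into a single feature vector that is fed to one classification algorithm. The following is a minimal sketch of this idea; the feature dimensions, the synthetic data, and the choice of logistic regression are illustrative assumptions rather than details of any cited system.

<syntaxhighlight lang="python">
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200  # number of utterances (synthetic data for illustration)

# Hypothetical per-modality feature matrices; the dimensions are arbitrary.
text_feats = rng.normal(size=(n, 300))   # e.g. averaged word embeddings
audio_feats = rng.normal(size=(n, 74))   # e.g. prosodic/acoustic descriptors
video_feats = rng.normal(size=(n, 35))   # e.g. facial-expression descriptors
labels = rng.integers(0, 3, size=n)      # 0 = negative, 1 = neutral, 2 = positive

# Feature-level (early) fusion: concatenate the modalities into one vector per sample.
fused = np.hstack([text_feats, audio_feats, video_feats])

# A single classifier operates on the fused representation.
clf = LogisticRegression(max_iter=1000).fit(fused, labels)
print(clf.predict(fused[:5]))
</syntaxhighlight>
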
=== Decision-level fusion ===
 
Decision-level fusion (sometimes known as late fusion) feeds data from each modality (text, audio, or visual) independently into its own classification algorithm, and obtains the final sentiment classification by fusing the per-modality results into a single decision vector.<ref name="s3" /> One of the advantages of this fusion technique is that it eliminates the need to fuse heterogeneous data, and each modality can use its most appropriate classification algorithm.<ref name="s1" />
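
A minimal sketch of this late-fusion scheme is shown below: each modality is classified separately, and the per-modality class probabilities are averaged into a single decision vector. The synthetic features, the particular classifiers, and probability averaging as the fusion rule are illustrative assumptions.

<syntaxhighlight lang="python">
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 200  # number of utterances (synthetic data for illustration)
labels = rng.integers(0, 3, size=n)  # 0 = negative, 1 = neutral, 2 = positive

# Hypothetical per-modality feature matrices; the dimensions are arbitrary.
modalities = {
    "text": rng.normal(size=(n, 300)),
    "audio": rng.normal(size=(n, 74)),
    "video": rng.normal(size=(n, 35)),
}

# Each modality uses its own classifier (the specific choices are illustrative).
classifiers = {
    "text": LogisticRegression(max_iter=1000),
    "audio": SVC(probability=True),
    "video": LogisticRegression(max_iter=1000),
}

# Decision-level (late) fusion: average the per-modality class probabilities
# into a single decision vector and pick the most probable class.
probabilities = []
for name, feats in modalities.items():
    classifiers[name].fit(feats, labels)
    probabilities.append(classifiers[name].predict_proba(feats))

decision_vector = np.mean(probabilities, axis=0)  # shape (n, 3)
final_labels = np.argmax(decision_vector, axis=1)
print(final_labels[:5])
</syntaxhighlight>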
 
=== Hybrid fusion ===
 
Hybrid fusion combines feature-level and decision-level fusion, exploiting complementary information from both methods during the classification process.<ref name="s4" /> It usually involves a two-step procedure: feature-level fusion is first performed between two modalities, and decision-level fusion is then applied to fuse the result with the remaining modality.<ref>{{cite conference |last1=Shahla |first1=Shahla |last2=Naghsh-Nilchi |first2=Ahmad Reza |title=Exploiting evidential theory in the fusion of textual, audio, and visual modalities for affective music video retrieval |date=2017 |publisher=IEEE |url=https://ieeexplore.ieee.org/abstract/document/7983051/}}</ref><ref>{{cite journal |last1=Poria |first1=Soujanya |last2=Peng |first2=Haiyun |last3=Hussain |first3=Amir |last4=Howard |first4=Newton |last5=Cambria |first5=Erik |title=Ensemble application of convolutional neural networks and multiple kernel learning for multimodal sentiment analysis |journal=Neurocomputing |date=October 2017 |volume=261 |pages=217–230 |doi=10.1016/j.neucom.2016.09.117}}</ref>
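
A minimal sketch of this two-step procedure is given below: feature-level fusion is first applied to two modalities (here text and audio), and decision-level fusion then combines that result with the remaining modality (video). The synthetic features, the classifiers, and probability averaging as the second-step fusion rule are illustrative assumptions.

<syntaxhighlight lang="python">
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200  # number of utterances (synthetic data for illustration)
labels = rng.integers(0, 3, size=n)  # 0 = negative, 1 = neutral, 2 = positive

# Hypothetical per-modality feature matrices; the dimensions are arbitrary.
text_feats = rng.normal(size=(n, 300))
audio_feats = rng.normal(size=(n, 74))
video_feats = rng.normal(size=(n, 35))

# Step 1: feature-level fusion of two modalities (text + audio).
text_audio = np.hstack([text_feats, audio_feats])
clf_text_audio = LogisticRegression(max_iter=1000).fit(text_audio, labels)

# A separate classifier handles the remaining modality (video).
clf_video = LogisticRegression(max_iter=1000).fit(video_feats, labels)

# Step 2: decision-level fusion of the two sets of class probabilities.
decision_vector = (clf_text_audio.predict_proba(text_audio)
                   + clf_video.predict_proba(video_feats)) / 2
print(np.argmax(decision_vector, axis=1)[:5])
</syntaxhighlight>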
 
== Applications ==