Multimodal sentiment analysis
{{Short description|Technology for sentiment analysis}}
'''Multimodal [[sentiment analysis]]''' is a technology for traditional text-based [[sentiment analysis]], which goes beyond the analysis of texts, and includes other [[Modality (human–computer interaction)|modalities]] such as audio and visual data.<ref>{{cite journal |last1=Soleymani |first1=Mohammad |last2=Garcia |first2=David |last3=Jou |first3=Brendan |last4=Schuller |first4=Björn |last5=Chang |first5=Shih-Fu |last6=Pantic |first6=Maja |title=A survey of multimodal sentiment analysis |journal=Image and Vision Computing |date=September 2017 |volume=65 |pages=3–14 |doi=10.1016/j.imavis.2017.08.003 |s2cid=19491070 |url=https://zenodo.org/record/3449163}}</ref> It can be bimodal, which includes different combinations of two [[Modality (human–computer interaction)|modalities]], or trimodal, which incorporates three [[Modality (human–computer interaction)|modalities]].<ref>{{cite journal |last1=Karray |first1=Fakhreddine |last2=Milad |first2=Alemzadeh |last3=Saleh |first3=Jamil Abou |last4=Mo Nours |first4=Arab |title=Human-Computer Interaction: Overview on State of the Art |journal=International Journal on Smart Sensing and Intelligent Systems |volume=1 |pages=137–159 |date=2008 |url=http://s2is.org/Issues/v1/n1/papers/paper9.pdf |doi=10.21307/ijssis-2017-283 |doi-access=free}}</ref> With the extensive amount of [[social media]] data available online in different forms such as videos and images, the conventional text-based [[sentiment analysis]] has evolved into more complex models of multimodal [[sentiment analysis]],<ref name="s1">{{cite journal |last1=Poria |first1=Soujanya |last2=Cambria |first2=Erik |last3=Bajpai |first3=Rajiv |last4=Hussain |first4=Amir |title=A review of affective computing: From unimodal analysis to multimodal fusion |journal=Information Fusion |date=September 2017 |volume=37 |pages=98–125 |doi=10.1016/j.inffus.2017.02.003 |hdl=1893/25490 |s2cid=205433041 |url=http://researchrepository.napier.ac.uk/Output/1792429 |hdl-access=free}}</ref><ref>{{cite arXiv |last1=Nguyen |first1=Quy Hoang |title=New Benchmark Dataset and Fine-Grained Cross-Modal Fusion Framework for Vietnamese Multimodal Aspect-Category Sentiment Analysis |date=2024-05-01 |eprint=2405.00543 |last2=Nguyen |first2=Minh-Van Truong |last3=Van Nguyen |first3=Kiet |class=cs.CL}}</ref> which can be applied in the development of [[virtual assistant]]s,<ref name="s5">{{cite web |title=Google AI to make phone calls for you |url=https://www.bbc.com/news/technology-44045424 |website=BBC News |access-date=12 June 2018 |date=8 May 2018}}</ref> [[Social media analytics|analysis]] of YouTube movie reviews,<ref name="s4">{{cite journal |last1=Wollmer |first1=Martin |last2=Weninger |first2=Felix |last3=Knaup |first3=Tobias |last4=Schuller |first4=Bjorn |last5=Sun |first5=Congkai |last6=Sagae |first6=Kenji |last7=Morency |first7=Louis-Philippe |title=YouTube Movie Reviews: Sentiment Analysis in an Audio-Visual Context |journal=IEEE Intelligent Systems |date=May 2013 |volume=28 |issue=3 |pages=46–53 |doi=10.1109/MIS.2013.34 |s2cid=12789201 |url=https://opus.bibliothek.uni-augsburg.de/opus4/files/72633/72633.pdf}}</ref> [[Social media analytics|analysis]] of news videos,<ref>{{cite arXiv |last1=Pereira |first1=Moisés H. R. |last2=Pádua |first2=Flávio L. C. |last3=Pereira |first3=Adriano C. M. |last4=Benevenuto |first4=Fabrício |last5=Dalip |first5=Daniel H. |title=Fusing Audio, Textual and Visual Features for Sentiment Analysis of News Videos |date=9 April 2016 |eprint=1604.02612 |class=cs.CL}}</ref> and [[emotion recognition]] (sometimes known as [[emotion]] detection) such as [[depression (mood)|depression]] monitoring,<ref name="s6">{{cite book |last1=Zucco |first1=Chiara |last2=Calabrese |first2=Barbara |last3=Cannataro |first3=Mario |title=2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) |chapter=Sentiment analysis and affective computing for depression monitoring |date=November 2017 |pages=1988–1995 |doi=10.1109/bibm.2017.8217966 |url=http://doi.ieeecomputersociety.org/10.1109/BIBM.2017.8217966 |publisher=IEEE |language=en |isbn=978-1-5090-3050-7 |s2cid=24408937}}</ref> among others.
 
Similar to the traditional [[sentiment analysis]], one of the most basic tasks in multimodal [[sentiment analysis]] is [[Feeling|sentiment]] classification, which classifies different sentiments into categories such as positive, negative, or neutral.<ref>{{cite book |last1=Pang |first1=Bo |last2=Lee |first2=Lillian |title=Opinion mining and sentiment analysis |date=2008 |publisher=Now Publishers |___location=Hanover, MA |isbn=978-1601981509}}</ref> The complexity of [[Social media analytics|analyzing]] text, audio, and visual features to perform such a task requires the application of different fusion techniques, such as feature-level, decision-level, and hybrid fusion.<ref name="s1"/> The performance of these fusion techniques and the [[classification]] [[algorithm]]s applied are influenced by the type of textual, audio, and visual features employed in the analysis.<ref name="s7"/>
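As a toy illustration of the text-only baseline that multimodal systems extend, the three-way positive/negative/neutral classification can be sketched with an invented polarity word list (the lexicon entries and example sentence here are made up for the example, not taken from the cited work):

```python
# Minimal text-only sentiment classifier over an invented polarity lexicon.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "boring", "hate", "awful"}

def classify_text(review: str) -> str:
    """Classify a review as positive, negative, or neutral by word counts."""
    words = review.lower().replace(",", " ").split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(classify_text("A great cast and an excellent script"))  # positive
```

Real classifiers are trained on labeled corpora rather than hand-written word lists, but the output space (positive/negative/neutral) is the same.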
 
== Features ==
[[Feature engineering]], which involves the selection of features that are fed into [[machine learning]] algorithms, plays a key role in the [[sentiment]] classification performance.<ref name="s7">{{cite journal |last1=Sun |first1=Shiliang |last2=Luo |first2=Chen |last3=Chen |first3=Junyu |title=A review of natural language processing techniques for opinion mining systems |journal=Information Fusion |date=July 2017 |volume=36 |pages=10–25 |doi=10.1016/j.inffus.2016.10.004}}</ref> In multimodal [[sentiment analysis]], a combination of different textual, audio, and visual features is employed.<ref name="s1"/>
 
=== Textual features ===
Similar to the conventional text-based [[sentiment analysis]], some of the most commonly used textual features in multimodal [[sentiment analysis]] are [[n-grams|unigrams]] and [[n-gram]]s, which are sequences of words in a given textual document.<ref>{{cite journal |last1=Yadollahi |first1=Ali |last2=Shahraki |first2=Ameneh Gholipour |last3=Zaiane |first3=Osmar R. |title=Current State of Text Sentiment Analysis from Opinion to Emotion Mining |journal=ACM Computing Surveys |date=25 May 2017 |volume=50 |issue=2 |pages=1–33 |doi=10.1145/3057270 |s2cid=5275807}}</ref> These features are applied using [[bag-of-words]] or bag-of-concepts feature representations, in which words or concepts are represented as vectors in a suitable space.<ref name="s2">{{cite journal |last1=Perez Rosas |first1=Veronica |last2=Mihalcea |first2=Rada |last3=Morency |first3=Louis-Philippe |title=Multimodal Sentiment Analysis of Spanish Online Videos |journal=IEEE Intelligent Systems |date=May 2013 |volume=28 |issue=3 |pages=38–45 |doi=10.1109/MIS.2013.9 |s2cid=1132247}}</ref><ref>{{cite journal |last1=Poria |first1=Soujanya |last2=Cambria |first2=Erik |last3=Hussain |first3=Amir |last4=Huang |first4=Guang-Bin |title=Towards an intelligent framework for multimodal affective data analysis |journal=Neural Networks |date=March 2015 |volume=63 |pages=104–116 |doi=10.1016/j.neunet.2014.10.005 |pmid=25523041 |hdl=1893/21310 |s2cid=342649 |hdl-access=free}}</ref>
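A minimal sketch of these two representations, assuming a tiny hand-picked vocabulary (real systems learn the vocabulary from a corpus):

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-token sequences (n=1 gives unigrams)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_words(text, vocabulary):
    """Count vector over a fixed vocabulary; word order is discarded."""
    counts = Counter(text.lower().replace(",", " ").split())
    return [counts[word] for word in vocabulary]

vocab = ["movie", "great", "boring"]
print(bag_of_words("Great movie, a truly great movie", vocab))  # [2, 2, 0]
print(ngrams(["a", "great", "movie"], 2))  # [('a', 'great'), ('great', 'movie')]
```

The resulting count vectors are what downstream classifiers (or the fusion steps described below) consume.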
 
=== Audio features ===
[[Feeling|Sentiment]] and [[emotion]] characteristics are prominent in different [[phonetic]] and [[prosodic]] properties contained in audio features.<ref>{{cite journal |last1=Chung-Hsien Wu |last2=Wei-Bin Liang |title=Emotion Recognition of Affective Speech Based on Multiple Classifiers Using Acoustic-Prosodic Information and Semantic Labels |journal=IEEE Transactions on Affective Computing |date=January 2011 |volume=2 |issue=1 |pages=10–21 |doi=10.1109/T-AFFC.2010.16 |s2cid=52853112}}</ref> Some of the most important audio features employed in multimodal [[sentiment analysis]] are [[mel-frequency cepstrum|mel-frequency cepstrum (MFCC)]], [[spectral centroid]], [[spectral flux]], [[beat]] histogram, [[beat]] sum, strongest [[beat]], pause duration, and [[pitch accent|pitch]].<ref name="s1"/> [[OpenSMILE]]<ref>{{cite book |last1=Eyben |first1=Florian |last2=Wöllmer |first2=Martin |last3=Schuller |first3=Björn |chapter=OpenEAR — Introducing the munich open-source emotion and affect recognition toolkit |title=2009 3rd International Conference on Affective Computing and Intelligent Interaction and Workshops |pages=1 |date=2009 |doi=10.1109/ACII.2009.5349350 |isbn=978-1-4244-4800-5 |s2cid=2081569 |url=https://nbn-resolving.org/urn:nbn:de:bvb:384-opus4-766112}}</ref> and [[Praat]] are popular open-source toolkits for extracting such audio features.<ref>{{cite book |last1=Morency |first1=Louis-Philippe |last2=Mihalcea |first2=Rada |last3=Doshi |first3=Payal |chapter=Towards multimodal sentiment analysis: harvesting opinions from the web |title=Proceedings of the 13th International Conference on Multimodal Interfaces |date=14 November 2011 |pages=169–176 |doi=10.1145/2070481.2070509 |publisher=ACM |isbn=9781450306416 |s2cid=1257599}}</ref>
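Two of the listed features can be computed directly from a signal's spectrum. The following pure-NumPy sketch (not taken from OpenSMILE or Praat; the 440 Hz test tone is an arbitrary choice) shows the idea for spectral centroid and spectral flux:

```python
import numpy as np

def spectral_centroid(frame, sr):
    """Magnitude-weighted mean frequency of one audio frame, in Hz."""
    mags = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    return float(np.sum(freqs * mags) / np.sum(mags))

def spectral_flux(frame_a, frame_b):
    """Euclidean distance between the magnitude spectra of two frames."""
    a = np.abs(np.fft.rfft(frame_a))
    b = np.abs(np.fft.rfft(frame_b))
    return float(np.linalg.norm(b - a))

sr = 16000                          # sample rate in Hz
t = np.arange(sr) / sr              # one second of time stamps
tone = np.sin(2 * np.pi * 440.0 * t)  # a pure 440 Hz tone
print(round(spectral_centroid(tone, sr)))  # 440
```

For a pure tone the centroid sits at the tone's frequency; for speech it tracks where the spectral energy is concentrated, which is one of the prosodic cues the toolkits above extract at scale.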
 
=== Visual features ===
One of the main advantages of analyzing videos with respect to texts alone is the presence of rich [[sentiment]] cues in visual data.<ref>{{cite journal |last1=Poria |first1=Soujanya |last2=Cambria |first2=Erik |last3=Hazarika |first3=Devamanyu |last4=Majumder |first4=Navonil |last5=Zadeh |first5=Amir |last6=Morency |first6=Louis-Philippe |title=Context-Dependent Sentiment Analysis in User-Generated Videos |journal=Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) |pages=873–883 |date=2017 |doi=10.18653/v1/p17-1081 |doi-access=free}}</ref> Visual features include [[facial expression]]s, which are of paramount importance in capturing sentiments and [[emotion]]s, as they are a main channel of forming a person's present state of mind.<ref name="s1"/> Specifically, [[smile]] is considered to be one of the most predictive visual cues in multimodal [[sentiment analysis]].<ref name="s2"/> OpenFace is an open-source facial analysis toolkit available for extracting and understanding such visual features.<ref>{{cite book |title=OpenFace: An open source facial behavior analysis toolkit |date=March 2016 |doi=10.1109/WACV.2016.7477553 |isbn=978-1-5090-0641-0 |s2cid=1919851 |url=https://www.repository.cam.ac.uk/handle/1810/280724}}</ref>
 
== Fusion techniques ==
Unlike the traditional text-based [[sentiment analysis]], multimodal [[sentiment analysis]] undergoes a fusion process in which data from different [[Modality (human–computer interaction)|modalities]] (text, audio, or visual) are fused and analyzed together.<ref name="s1"/> The existing approaches in multimodal [[sentiment analysis]] [[data fusion]] can be grouped into three main categories: feature-level, decision-level, and hybrid fusion, and the performance of the [[sentiment]] classification depends on which type of fusion technique is employed.<ref name="s1"/>
 
=== Feature-level fusion ===
Feature-level fusion (sometimes known as early fusion) gathers all the features from each [[modality (human–computer interaction)|modality]] (text, audio, or visual) and joins them together into a single feature vector, which is eventually fed into a classification algorithm.<ref name="s3">{{cite journal |last1=Poria |first1=Soujanya |last2=Cambria |first2=Erik |last3=Howard |first3=Newton |last4=Huang |first4=Guang-Bin |last5=Hussain |first5=Amir |title=Fusing audio, visual and textual clues for sentiment analysis from multimodal content |journal=Neurocomputing |date=January 2016 |volume=174 |pages=50–59 |doi=10.1016/j.neucom.2015.01.095 |s2cid=15287807}}</ref> One of the difficulties in implementing this technique is the integration of the heterogeneous features.<ref name="s1"/>
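A sketch of the idea with made-up feature values (the feature names and numbers are illustrative, not from the cited work):

```python
import numpy as np

# Illustrative per-modality features for one utterance.
text_feats   = np.array([1.0, 0.0, 2.0])  # e.g. unigram counts
audio_feats  = np.array([210.5, 0.32])    # e.g. pitch in Hz, pause ratio
visual_feats = np.array([0.87])           # e.g. smile intensity

# Early fusion: one joint vector, fed to a single classifier downstream.
fused = np.concatenate([text_feats, audio_feats, visual_feats])

# The components are heterogeneous (counts, Hz, unit-interval scores), which
# is why some normalization is typically required before classification.
fused_scaled = (fused - fused.mean()) / fused.std()
print(fused.shape)  # (6,)
```

The scaling step is one simple answer to the heterogeneity problem the paragraph mentions; in practice each modality's features are often standardized separately before concatenation.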
 
=== Decision-level fusion ===
Decision-level fusion (sometimes known as late fusion) feeds data from each [[modality (human–computer interaction)|modality]] (text, audio, or visual) independently into its own classification algorithm, and obtains the final [[sentiment]] classification results by fusing each result into a single decision vector.<ref name="s3"/> One of the advantages of this fusion technique is that it eliminates the need to fuse heterogeneous data, and each [[modality (human–computer interaction)|modality]] can utilize its most appropriate [[classification]] [[algorithm]].<ref name="s1"/>
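A sketch with assumed classifier outputs (the probability vectors below are invented; in practice each comes from a separately trained per-modality model):

```python
import numpy as np

LABELS = ["positive", "negative", "neutral"]

# Assumed outputs of three independently trained per-modality classifiers,
# each a probability distribution over the three sentiment classes.
p_text   = np.array([0.7, 0.2, 0.1])
p_audio  = np.array([0.5, 0.3, 0.2])
p_visual = np.array([0.6, 0.1, 0.3])

# Late fusion: average the per-modality decisions, then pick the winner.
decision = np.mean([p_text, p_audio, p_visual], axis=0)
print(LABELS[int(np.argmax(decision))])  # positive
```

Averaging is only one possible combiner; majority voting or a learned meta-classifier over the decision vector are common alternatives.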
 
=== Hybrid fusion ===
Hybrid fusion is a combination of feature-level and decision-level fusion techniques, which exploits complementary information from both methods during the classification process.<ref name="s4"/> It usually involves a two-step procedure wherein feature-level fusion is initially performed between two [[Modality (human–computer interaction)|modalities]], and decision-level fusion is then applied as a second step to fuse the initial results from the feature-level fusion with the remaining [[Modality (human–computer interaction)|modality]].<ref>{{cite journal |last1=Shahla |first1=Shahla |last2=Naghsh-Nilchi |first2=Ahmad Reza |title=Exploiting evidential theory in the fusion of textual, audio, and visual modalities for affective music video retrieval |date=2017 |doi=10.1109/PRIA.2017.7983051 |s2cid=24466718}}</ref><ref>{{cite journal |last1=Poria |first1=Soujanya |last2=Peng |first2=Haiyun |last3=Hussain |first3=Amir |last4=Howard |first4=Newton |last5=Cambria |first5=Erik |title=Ensemble application of convolutional neural networks and multiple kernel learning for multimodal sentiment analysis |journal=Neurocomputing |date=October 2017 |volume=261 |pages=217–230 |doi=10.1016/j.neucom.2016.09.117}}</ref>
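The two-step procedure can be sketched as follows; all feature values and classifier outputs are stand-in numbers, since the actual classifiers are not specified here:

```python
import numpy as np

# Step 1 - feature level: concatenate text and audio features and score the
# joint vector with one classifier; its output here is a stand-in vector.
text_feats  = np.array([0.8, 0.1])
audio_feats = np.array([0.4, 0.6])
joint_feats = np.concatenate([text_feats, audio_feats])
p_text_audio = np.array([0.7, 0.3])  # assumed classifier output (pos, neg)

# Step 2 - decision level: fuse that result with the remaining modality's
# own classifier output (visual), here by simple averaging.
p_visual = np.array([0.5, 0.5])      # assumed visual classifier output
final = (p_text_audio + p_visual) / 2
print(final)
```

This mirrors the description above: the tightly coupled modalities are fused early, and the remaining modality contributes only at the decision stage.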
 
== Applications ==
Similar to text-based [[sentiment analysis]], multimodal [[sentiment analysis]] can be applied in the development of different forms of [[recommender system]]s, such as in the analysis of user-generated videos of movie reviews<ref name="s4"/> and general product reviews,<ref>{{cite journal |last1=Pérez-Rosas |first1=Verónica |last2=Mihalcea |first2=Rada |last3=Morency |first3=Louis Philippe |title=Utterance-level multimodal sentiment analysis |journal=Long Papers |date=1 January 2013 |url=https://experts.umich.edu/en/publications/utterance-level-multimodal-sentiment-analysis |publisher=Association for Computational Linguistics (ACL)}}</ref> to predict the sentiments of customers and subsequently create product or service recommendations.<ref>{{cite web |last1=Chui |first1=Michael |last2=Manyika |first2=James |last3=Miremadi |first3=Mehdi |last4=Henke |first4=Nicolaus |last5=Chung |first5=Rita |last6=Nel |first6=Pieter |last7=Malhotra |first7=Sankalp |title=Notes from the AI frontier. Insights from hundreds of use cases |url=https://www.mckinsey.com/mgi/ |website=McKinsey & Company |publisher=McKinsey & Company |access-date=13 June 2018 |language=en}}</ref> Multimodal [[sentiment analysis]] also plays an important role in the advancement of [[virtual assistant]]s through the application of [[natural language processing]] (NLP) and [[machine learning]] techniques.<ref name="s5"/> In the healthcare ___domain, multimodal [[sentiment analysis]] can be utilized to detect certain medical conditions such as [[Psychological stress|stress]], [[anxiety]], or [[Depression (mood)|depression]].<ref name="s6"/> Multimodal [[sentiment analysis]] can also be applied in understanding the sentiments contained in video news programs, which is considered a complicated and challenging ___domain, as sentiments expressed by reporters tend to be less obvious or neutral.<ref>{{cite book |last1=Ellis |first1=Joseph G. |last2=Jou |first2=Brendan |last3=Chang |first3=Shih-Fu |chapter=Why We Watch the News: A Dataset for Exploring Sentiment in Broadcast Video News |title=Proceedings of the 16th International Conference on Multimodal Interaction |date=12 November 2014 |pages=104–111 |doi=10.1145/2663204.2663237 |publisher=ACM |isbn=9781450328852 |s2cid=14112246}}</ref>
 
==References==
{{Reflist}}
 
[[Category:Natural language processing]]
[[Category:Affective computing]]
[[Category:Social media]]
[[Category:Machine learning]]
[[Category:Multimodal interaction]]