Text-to-video model

{{Short description|Machine learning model}}
{{Use dmy dates|date=November 2024}}
[[File:OpenAI Sora in Action- Tokyo Walk.webm|thumb|upright=1.35|A video generated using OpenAI's [[Sora (text-to-video model)|Sora]] text-to-video model, using the prompt: <code>A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about.</code>]]
== Models ==
{{Globalize|section|date=August 2024}}
There are different models, including [[open source]] models. CogVideo, which accepts Chinese-language input,<ref name=":5">{{Cite web |last=Wodecki |first=Ben |date=2023-08-11 |title=Text-to-Video Generative AI Models: The Definitive List |url=https://aibusiness.com/nlp/ai-video-generation-the-supreme-list |access-date=2024-11-18 |website=AI Business |publisher=[[Informa]]}}</ref> is the earliest text-to-video model to be developed, with 9.4 billion parameters; a demo version of its open-source code was first presented on [[GitHub]] in 2022.<ref>{{Citation |title=CogVideo |date=2022-10-12 |url=https://github.com/THUDM/CogVideo |publisher=THUDM |access-date=2022-10-12}}</ref> That year, [[Meta Platforms]] released a partial text-to-video model called "Make-A-Video",<ref>{{Cite web |last=Davies |first=Teli |date=2022-09-29 |title=Make-A-Video: Meta AI's New Model For Text-To-Video Generation |url=https://wandb.ai/telidavies/ml-news/reports/Make-A-Video-Meta-AI-s-New-Model-For-Text-To-Video-Generation--VmlldzoyNzE4Nzcx |access-date=2022-10-12 |website=Weights & Biases |language=en}}</ref><ref name="Monge">{{Cite web |last=Monge |first=Jim Clyde |date=2022-08-03 |title=This AI Can Create Video From Text Prompt |url=https://betterprogramming.pub/this-ai-can-create-video-from-text-prompt-6904439d7aba |access-date=2022-10-12 |website=Medium |language=en}}</ref><ref>{{Cite web |title=Meta's Make-A-Video AI creates videos from text |url=https://www.fonearena.com/blog/375627/meta-make-a-video-ai-create-videos-from-text.html |access-date=2022-10-12 |website=www.fonearena.com}}</ref> and [[Google]]'s [[Google Brain|Brain]] (later [[Google DeepMind]]) introduced Imagen Video, a text-to-video model with a 3D [[U-Net]].<ref>{{Cite news |title=google: Google takes on Meta, introduces own video-generating AI |url=https://m.economictimes.com/tech/technology/google-takes-on-meta-introduces-own-video-generating-ai/articleshow/94681128.cms |access-date=2022-10-12 |website=[[The Economic Times]]| date=6 October 2022 }}</ref><ref name="Monge" /><ref>{{Cite web |title=Nuh-uh, Meta, we can do text-to-video AI, too, says Google |url=https://www.theregister.com/AMP/2022/10/06/google_ai_imagen_video/ |access-date=2022-10-12 |website=[[The Register]]}}</ref><ref>{{Cite web |title=Papers with Code - See, Plan, Predict: Language-guided Cognitive Planning with Video Prediction |url=https://paperswithcode.com/paper/see-plan-predict-language-guided-cognitive |access-date=2022-10-12 |website=paperswithcode.com |language=en}}</ref><ref>{{Cite web |title=Papers with Code - Text-driven Video Prediction |url=https://paperswithcode.com/paper/text-driven-video-prediction |access-date=2022-10-12 |website=paperswithcode.com |language=en}}</ref>
 
In March 2023, a research paper titled "VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation" was published, presenting a novel approach to video generation.<ref>{{Cite arXiv |eprint=2303.08320 |class=cs.CV |first1=Zhengxiong |last1=Luo |first2=Dayou |last2=Chen |title=VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation |date=2023 |last3=Zhang |first3=Yingya |last4=Huang |first4=Yan |last5=Wang |first5=Liang |last6=Shen |first6=Yujun |last7=Zhao |first7=Deli |last8=Zhou |first8=Jingren |last9=Tan |first9=Tieniu}}</ref> The VideoFusion model decomposes the diffusion process into two components: base noise, which is shared across frames to ensure temporal coherence, and residual noise, which varies from frame to frame. By utilizing a pre-trained image diffusion model as a base generator, the model efficiently generated high-quality and coherent videos. Fine-tuning the pre-trained model on video data addressed the ___domain gap between image and video data, enhancing the model's ability to produce realistic and consistent video sequences.<ref>{{Cite arXiv |title=VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation |eprint=2303.08320 |last1=Luo |first1=Zhengxiong |last2=Chen |first2=Dayou |last3=Zhang |first3=Yingya |last4=Huang |first4=Yan |last5=Wang |first5=Liang |last6=Shen |first6=Yujun |last7=Zhao |first7=Deli |last8=Zhou |first8=Jingren |last9=Tan |first9=Tieniu |date=2023 |class=cs.CV }}</ref> In the same month, [[Adobe Inc.|Adobe]] introduced Firefly AI as part of its suite of creative tools.<ref>{{Cite web |date=2024-10-10 |title=Adobe launches Firefly Video model and enhances image, vector and design models. Adobe Newsroom |url=https://news.adobe.com/news/2024/10/101424-adobe-launches-firefly-video-model |access-date=2024-11-18 |publisher=[[Adobe Inc.]]}}</ref>
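
A minimal illustrative sketch of this shared-plus-residual noise construction is shown below in Python (NumPy). The function name, the mixing weight <code>alpha</code>, and the tensor shapes are hypothetical and do not reproduce the paper's exact parameterization; the sketch only conveys the idea of reusing one base noise map across all frames while adding per-frame residual noise.

<syntaxhighlight lang="python">
import numpy as np

def decomposed_video_noise(num_frames, frame_shape, alpha=0.8, rng=None):
    """Mix one shared 'base' noise map (identical for every frame,
    encouraging temporal coherence) with a per-frame 'residual' noise
    map (allowing frame-to-frame variation).

    alpha is an illustrative knob for how much noise is shared, not the
    paper's exact parameterization.
    """
    rng = np.random.default_rng() if rng is None else rng
    base = rng.standard_normal(frame_shape)          # shared across all frames
    noise = np.empty((num_frames, *frame_shape))
    for t in range(num_frames):
        residual = rng.standard_normal(frame_shape)  # unique to frame t
        # Variance-preserving mix keeps each frame's noise roughly N(0, 1).
        noise[t] = np.sqrt(alpha) * base + np.sqrt(1.0 - alpha) * residual
    return noise

# Example: noise for a 16-frame, 64x64, 3-channel clip.
eps = decomposed_video_noise(16, (64, 64, 3))
print(eps.shape)  # (16, 64, 64, 3)
</syntaxhighlight>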
 
In January 2024, [[Google]] announced development of a text-to-video model named Lumiere, which is anticipated to integrate advanced video editing capabilities.<ref>{{Cite web |last=Yirka |first=Bob |date=2024-01-26 |title=Google announces the development of Lumiere, an AI-based next-generation text-to-video generator. |url=https://techxplore.com/news/2024-01-google-lumiere-ai-based-generation.html |access-date=2024-11-18 |website=Tech Xplore}}</ref> [[Matthias Niessner]] and [[Lourdes Agapito]] at AI company [[Synthesia (company)|Synthesia]] work on developing 3D neural rendering techniques that can synthesise realistic video by using 2D and 3D neural representations of shape, appearances, and motion for controllable video synthesis of avatars.<ref>{{Cite web |title=Text to Speech for Videos |url=https://www.synthesia.io/text-to-speech |access-date=2023-10-17 |website=Synthesia.io}}</ref> In June 2024, Luma Labs launched its [[Dream Machine (text-to-video model)|Dream Machine]] video tool.<ref>{{Cite web |last=Nuñez |first=Michael |date=2024-06-12 |title=Luma AI debuts 'Dream Machine' for realistic video generation, heating up AI media race |url=https://venturebeat.com/ai/luma-ai-debuts-dream-machine-for-realistic-video-generation-heating-up-ai-media-race/ |access-date=2024-11-18 |website=VentureBeat |language=en-US}}</ref><ref>{{Cite web |last=Fink |first=Charlie |title=Apple Debuts Intelligence, Mistral Raises $600 Million, New AI Text-To-Video |url=https://www.forbes.com/sites/charliefink/2024/06/13/apple-debuts-intelligence-mistral-raises-600-million-new-ai-text-to-video/ |access-date=2024-11-18 |website=Forbes |language=en}}</ref> That same month, [[Kuaishou]] extended its Kling AI text-to-video model to international users.<ref>{{Cite web |last=Franzen |first=Carl |date=2024-06-12 |title=What you need to know about Kling, the AI video generator rival to Sora that's wowing creators |url=https://venturebeat.com/ai/what-you-need-to-know-about-kling-the-ai-video-generator-rival-to-sora-thats-wowing-creators/ |access-date=2024-11-18 |website=VentureBeat |language=en-US}}</ref> In July 2024, [[TikTok]] owner [[ByteDance]] released Jimeng AI in China, through its subsidiary, Faceu Technology.<ref>{{Cite web |date=2024-08-06 |title=ByteDance joins OpenAI's Sora rivals with AI video app launch |url=https://www.reuters.com/technology/artificial-intelligence/bytedance-joins-openais-sora-rivals-with-ai-video-app-launch-2024-08-06/ |access-date=2024-11-18 |publisher=[[Reuters]]}}</ref> By September 2024, the Chinese AI company [[MiniMax (company)|MiniMax]] had debuted its video-01 model, joining other established AI model companies like [[Zhipu AI]], [[Baichuan]], and [[Moonshot AI]] that contribute to China's involvement in AI technology.<ref>{{Cite web |date=2024-09-02 |title=Chinese ai "tiger" minimax launches text-to-video-generating model to rival OpenAI's sora |url=https://finance.yahoo.com/news/chinese-ai-tiger-minimax-launches-093000322.html |access-date=2024-11-18 |website=Yahoo! Finance}}</ref> In December 2024, [[Lightricks]] launched LTX Video as an open-source model.<ref>{{Cite web |last=Requiroso |first=Kelvene |date=2024-12-15 |title=Lightricks' LTXV Model Breaks Speed Records, Generating 5-Second AI Video Clips in 4 Seconds |url=https://www.eweek.com/news/lightricks-open-source-ai-video-generator/ |access-date=2025-07-24 |website=eWEEK |language=en-US}}</ref>
 
Alternative approaches to text-to-video models include<ref>{{Citation |title=Text2Video-Zero |date=2023-08-12 |url=https://github.com/Picsart-AI-Research/Text2Video-Zero |access-date=2023-08-12 |publisher=Picsart AI Research (PAIR)}}</ref> Google's Phenaki, Hour One, [[Colossyan]],<ref name=":5" /> [[Runway (company)|Runway]]'s Gen-3 Alpha,<ref>{{Cite web |last=Kemper |first=Jonathan |date=2024-07-01 |title=Runway's Sora competitor Gen-3 Alpha now available |url=https://the-decoder.com/runways-sora-competitor-gen-3-alpha-now-available/ |access-date=2024-11-18 |website=THE DECODER |language=en-US}}</ref><ref>{{Cite news |date=2023-03-20 |title=Generative AI's Next Frontier Is Video |url=https://www.bloomberg.com/news/articles/2023-03-20/generative-ai-s-next-frontier-is-video |access-date=2024-11-18 |work=Bloomberg.com |language=en}}</ref> and OpenAI's [[Sora (text-to-video model)|Sora]].<ref>{{Cite web |date=2024-02-15 |title=OpenAI teases 'Sora,' its new text-to-video AI model |url=https://www.nbcnews.com/tech/tech-news/openai-sora-video-artificial-intelligence-unveiled-rcna139065 |access-date=2024-11-18 |website=NBC News |language=en}}</ref><ref>{{Cite web |last=Kelly |first=Chris |date=2024-06-25 |title=Toys R Us creates first brand film to use OpenAI's text-to-video tool |url=https://www.marketingdive.com/news/toys-r-us-openai-sora-gen-ai-first-text-video/719797/ |access-date=2024-11-18 |website=Marketing Dive |publisher=[[Informa]] |language=en-US}}</ref> Several additional text-to-video models, such as Plug-and-Play, Text2LIVE, and TuneAVideo, have emerged.<ref>{{Cite book |last1=Jin |first1=Jiayao |last2=Wu |first2=Jianhang |last3=Xu |first3=Zhoucheng |last4=Zhang |first4=Hang |last5=Wang |first5=Yaxin |last6=Yang |first6=Jielong |chapter=Text to Video: Enhancing Video Generation Using Diffusion Models and Reconstruction Network |date=2023-08-04 |title=2023 2nd International Conference on Computing, Communication, Perception and Quantum Technology (CCPQT) |chapter-url=https://ieeexplore.ieee.org/document/10336607 |publisher=IEEE |pages=108–114 |doi=10.1109/CCPQT60491.2023.00024 |isbn=979-8-3503-4269-7}}</ref> [[FLUX.1]] developer Black Forest Labs has announced a forthcoming state-of-the-art text-to-video model.<ref>{{Cite web |date=2024-08-01 |title=Announcing Black Forest Labs |url=https://blackforestlabs.ai/announcing-black-forest-labs/ |access-date=2024-11-18 |website=Black Forest Labs |language=en-US}}</ref> [[Google]] was preparing to launch a video generation tool named [[Veo (text-to-video model)|Veo]] for [[YouTube Shorts]] in 2025.<ref>{{Cite web |last=Forlini |first=Emily Dreibelbis |date=2024-09-18 |title=Google's veo text-to-video AI generator is coming to YouTube shorts |url=https://www.pcmag.com/news/googles-veo-text-to-video-ai-generator-is-coming-to-youtube-shorts |access-date=2024-11-18 |website=[[PC Magazine]]}}</ref> In May 2025, Google launched Veo 3, the next iteration of the model. It was noted for its audio generation capabilities, which had been a previous limitation of text-to-video models.<ref>{{Cite web |last1=Elias |first1=Jennifer |last2=Subin |first2=Samantha |date=2025-05-20 |title=Google launches Veo 3, an AI video generator that incorporates audio |url=https://www.cnbc.com/2025/05/20/google-ai-video-generator-audio-veo-3.html |access-date=2025-05-22 |website=CNBC |language=en}}</ref> In July 2025, Lightricks released an update to LTX Video capable of generating clips of up to 60 seconds.<ref>{{Cite web |last=Fink |first=Charlie |title=LTX Video Breaks The 60-Second Barrier, Redefining AI Video As A Longform Medium |url=https://www.forbes.com/sites/charliefink/2025/07/16/ltx-video-breaks-the-60-second-barrier-redefining-ai-video-as-a-longform-medium/ |access-date=2025-07-24 |website=Forbes |language=en}}</ref><ref>{{Cite web |date=2025-07-16 |title=Lightricks' latest release lets creators direct long-form AI-generated videos in real time |url=https://siliconangle.com/2025/07/16/lightricks-latest-release-allows-creators-direct-longform-ai-generated-videos-real-time/ |access-date=2025-07-24 |website=SiliconANGLE |language=en-US}}</ref>
 
== Architecture and training ==
Several architectures have been used to create text-to-video models. Similar to [[Text-to-image model|text-to-image]] models, these models can be trained using [[Recurrent neural network|recurrent neural networks]] (RNNs) such as [[long short-term memory]] (LSTM) networks, which have been used for pixel transformation models and stochastic video generation models, aiding in consistency and realism respectively.<ref name=":02">{{Cite book |last1=Bhagwatkar |first1=Rishika |last2=Bachu |first2=Saketh |last3=Fitter |first3=Khurshed |last4=Kulkarni |first4=Akshay |last5=Chiddarwar |first5=Shital |chapter=A Review of Video Generation Approaches |date=2020-12-17 |title=2020 International Conference on Power, Instrumentation, Control and Computing (PICC) |chapter-url=https://ieeexplore.ieee.org/document/9362485 |publisher=IEEE |pages=1–5 |doi=10.1109/PICC51425.2020.9362485 |isbn=978-1-7281-7590-4}}</ref> Transformer models are an alternative. [[Generative adversarial network|Generative adversarial networks]] (GANs), [[Variational autoencoder|variational autoencoders]] (VAEs), which can aid in the prediction of human motion,<ref>{{Cite book |last1=Kim |first1=Taehoon |last2=Kang |first2=ChanHee |last3=Park |first3=JaeHyuk |last4=Jeong |first4=Daun |last5=Yang |first5=ChangHee |last6=Kang |first6=Suk-Ju |last7=Kong |first7=Kyeongbo |chapter=Human Motion Aware Text-to-Video Generation with Explicit Camera Control |date=2024-01-03 |title=2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) |chapter-url=https://ieeexplore.ieee.org/document/10484108 |publisher=IEEE |pages=5069–5078 |doi=10.1109/WACV57701.2024.00500 |isbn=979-8-3503-1892-0}}</ref> and diffusion models have also been used to develop the image generation aspects of the model.<ref name=":12">{{Cite book |last=Singh |first=Aditi |chapter=A Survey of AI Text-to-Image and AI Text-to-Video Generators |date=2023-05-09 |title=2023 4th International Conference on Artificial Intelligence, Robotics and Control (AIRC) |chapter-url=https://ieeexplore.ieee.org/document/10303174 |publisher=IEEE |pages=32–36 |doi=10.1109/AIRC57904.2023.10303174 |isbn=979-8-3503-4824-8|arxiv=2311.06329 }}</ref>
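
As a rough illustration of how such text-conditioned architectures are wired, the toy sketch below (which does not correspond to any specific model named above) uses a recurrent hidden state to carry information across frames while a text embedding conditions every step; the dimensions and randomly initialized weights are hypothetical placeholders for a trained text encoder and video decoder.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions; real models are orders of magnitude larger.
TEXT_DIM, HIDDEN_DIM, FRAME_DIM = 32, 64, 16 * 16 * 3

# Randomly initialized matrices stand in for trained parameters.
W_text = rng.standard_normal((HIDDEN_DIM, TEXT_DIM)) * 0.1
W_hidden = rng.standard_normal((HIDDEN_DIM, HIDDEN_DIM)) * 0.1
W_out = rng.standard_normal((FRAME_DIM, HIDDEN_DIM)) * 0.1

def generate_frames(text_embedding, num_frames):
    """Toy recurrent decoder: the hidden state carries information across
    frames (temporal consistency) while the text embedding conditions
    every step (prompt alignment)."""
    h = np.zeros(HIDDEN_DIM)
    frames = []
    for _ in range(num_frames):
        h = np.tanh(W_text @ text_embedding + W_hidden @ h)
        frames.append(W_out @ h)  # one flattened 16x16 RGB frame
    return np.stack(frames)

prompt_embedding = rng.standard_normal(TEXT_DIM)  # stands in for a text encoder output
clip = generate_frames(prompt_embedding, num_frames=8)
print(clip.shape)  # (8, 768)
</syntaxhighlight>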
 
Text-video datasets used to train models include, but are not limited to, WebVid-10M, HDVILA-100M, CCV, ActivityNet, and Panda-70M.<ref name=":2223">{{cite arXiv |last1=Miao |first1=Yibo |title=T2VSafetyBench: Evaluating the Safety of Text-to-Video Generative Models |date=2024-09-08 |eprint=2407.05965 |last2=Zhu |first2=Yifan |last3=Dong |first3=Yinpeng |last4=Yu |first4=Lijia |last5=Zhu |first5=Jun |last6=Gao |first6=Xiao-Shan|class=cs.CV }}</ref><ref name=":32">{{Cite book |last1=Zhang |first1=Ji |last2=Mei |first2=Kuizhi |last3=Wang |first3=Xiao |last4=Zheng |first4=Yu |last5=Fan |first5=Jianping |chapter=From Text to Video: Exploiting Mid-Level Semantics for Large-Scale Video Classification |date=August 2018 |title=2018 24th International Conference on Pattern Recognition (ICPR) |chapter-url=https://ieeexplore.ieee.org/document/8545513 |publisher=IEEE |pages=1695–1700 |doi=10.1109/ICPR.2018.8545513 |isbn=978-1-5386-3788-3}}</ref> These datasets contain millions of original videos of interest, generated videos, captioned videos, and textual information that help train models for accuracy. Text prompt datasets used to train models include, but are not limited to, PromptSource, DiffusionDB, and VidProM.<ref name=":2223" /><ref name=":32" /> These datasets provide the range of text inputs needed to teach models how to interpret a variety of textual prompts.
 
The video generation process involves synchronizing the text inputs with video frames, ensuring alignment and consistency throughout the sequence.<ref name=":32" /> The quality of this predictive process tends to decline as the length of the video increases, owing to resource limitations.<ref name=":32" />
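
One common mechanism behind this synchronization is cross-attention, in which each frame's features attend over the prompt's token embeddings. The sketch below is a minimal NumPy illustration of that idea under assumed, hypothetical dimensions and random placeholder features; it is not drawn from any particular model's implementation.

<syntaxhighlight lang="python">
import numpy as np

def cross_attention(frame_features, text_tokens):
    """Let every frame feature vector attend over the prompt's token
    embeddings, one common way of keeping each generated frame aligned
    with the text throughout the sequence."""
    # frame_features: (num_frames, d), text_tokens: (num_tokens, d)
    scores = frame_features @ text_tokens.T / np.sqrt(frame_features.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over prompt tokens
    return weights @ text_tokens                   # text-informed frame features

rng = np.random.default_rng(1)
frames = rng.standard_normal((8, 64))  # 8 frames of hypothetical features
tokens = rng.standard_normal((5, 64))  # 5 hypothetical prompt-token embeddings
print(cross_attention(frames, tokens).shape)  # (8, 64)
</syntaxhighlight>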
 
== Limitations ==
Despite the rapid evolution of text-to-video models in their performance, a primary limitation is that they are computationally demanding, which limits their capacity to provide high-quality and lengthy outputs.<ref name=":03">{{Cite book |last1=Bhagwatkar |first1=Rishika |last2=Bachu |first2=Saketh |last3=Fitter |first3=Khurshed |last4=Kulkarni |first4=Akshay |last5=Chiddarwar |first5=Shital |chapter=A Review of Video Generation Approaches |date=2020-12-17 |title=2020 International Conference on Power, Instrumentation, Control and Computing (PICC) |chapter-url=https://ieeexplore.ieee.org/document/9362485 |publisher=IEEE |pages=1–5 |doi=10.1109/PICC51425.2020.9362485 |isbn=978-1-7281-7590-4}}</ref><ref name=":13">{{Cite book |last=Singh |first=Aditi |chapter=A Survey of AI Text-to-Image and AI Text-to-Video Generators |date=2023-05-09 |title=2023 4th International Conference on Artificial Intelligence, Robotics and Control (AIRC) |chapter-url=https://ieeexplore.ieee.org/document/10303174 |publisher=IEEE |pages=32–36 |doi=10.1109/AIRC57904.2023.10303174 |isbn=979-8-3503-4824-8|arxiv=2311.06329 }}</ref> Additionally, these models require a large amount of specific training data to be able to generate high-quality and coherent outputs, which raises issues of accessibility.<ref name=":13" /><ref name=":03" />
 
Moreover, models may misinterpret textual prompts, resulting in video outputs that deviate from the intended meaning. This can occur due to limitations in capturing semantic context embedded in text, which affects the model's ability to align generated video with the user's intended message.<ref name=":13" /><ref name=":32"/> Various models, including Make-A-Video, Imagen Video, Phenaki, CogVideo, GODIVA, and NUWA, are currently being tested and refined to enhance their alignment capabilities and overall performance in text-to-video generation.<ref name=":13" />
 
Another issue with the outputs is that text and fine details in AI-generated videos often appear garbled, a problem that [[Stable Diffusion]] models also struggle with; examples include distorted hands and unreadable text.

== Ethics ==
The deployment of Text-to-Video models raises ethical considerations related to content generation. These models have the potential to create inappropriate or unauthorized content, including explicit material, graphic violence, misinformation, and likenesses of real individuals without consent.<ref name=":23">{{cite arXiv |last1=Miao |first1=Yibo |title=T2VSafetyBench: Evaluating the Safety of Text-to-Video Generative Models |date=2024-09-08 |eprint=2407.05965 |last2=Zhu |first2=Yifan |last3=Dong |first3=Yinpeng |last4=Yu |first4=Lijia |last5=Zhu |first5=Jun |last6=Gao |first6=Xiao-Shan|class=cs.CV }}</ref> Ensuring that AI-generated content complies with established standards for safe and ethical usage is essential, as content generated by these models may not always be easily identified as harmful or misleading. The ability of AI to recognize and filter out NSFW or copyrighted content remains an ongoing challenge, with implications for both creators and audiences.<ref name=":23" />
 
== Impacts and applications ==
Text-to-video models offer a broad range of applications that may benefit various fields, from educational and promotional to creative industries. These models can streamline content creation for training videos, movie previews, gaming assets, and visualizations, making it easier to generate content.<ref name=":14">{{Cite book |last=Singh |first=Aditi |chapter=A Survey of AI Text-to-Image and AI Text-to-Video Generators |date=2023-05-09 |title=2023 4th International Conference on Artificial Intelligence, Robotics and Control (AIRC) |chapter-url=https://ieeexplore.ieee.org/document/10303174 |publisher=IEEE |pages=32–36 |doi=10.1109/AIRC57904.2023.10303174 |isbn=979-8-3503-4824-8|arxiv=2311.06329 }}</ref>
 
During the [[Russo-Ukrainian War|Russo-Ukrainian war]], fake videos made with [[Artificial intelligence|artificial intelligence]] were created as part of a [[Disinformation in the Russian invasion of Ukraine|propaganda war against Ukraine]] and shared on [[social media]]. These included depictions of children in the [[Ukrainian Armed Forces]], fake ads targeting children encouraging them to denounce critics of the [[Ukrainian government]], and fictitious statements by [[Ukrainian President]] [[Volodymyr Zelenskyy]] about the country's surrender, among others.<ref>{{cite web|access-date=2025-06-16 |date=2025-06-09 |first=ალექსი |language=en-US |last=ქურასბედიანი |title=AI-Generated Photo Of Ukrainian Children In Military Uniforms Circulated Online {{!}} Mythdetector.com |url=https://mythdetector.com/en/ai-generated-photo-of-ukrainian-children/}}</ref><ref>{{cite web|access-date=2025-06-16 |date=2025-03-28 |language=en |title=Fake Ukraine ad urges kids to report relatives enjoying Russian music |url=https://www.euronews.com/my-europe/2025/03/28/fake-ukrainian-tv-advert-urges-children-to-report-relatives-listening-to-russian-music |website=euronews}}</ref><ref>{{cite web|access-date=2025-06-16 |date=2024-06-26 |language=en |title=Photos of Ukrainian children generated by artificial intelligence |url=https://behindthenews.ua/en/feiki/inshe/photos-of-ukrainian-children-generated-by-artificial-intelligence-607/ |website=behindthenews.ua}}</ref><ref>{{cite news|access-date=2025-06-16 |date=2022-03-16 |language=en |periodical=NPR |title=Deepfake video of Zelenskyy could be 'tip of the iceberg' in info war, experts warn |url=https://www.npr.org/2022/03/16/1087062648/deepfake-video-zelenskyy-experts-war-manipulation-ukraine-russia}}</ref><ref>{{cite web|access-date=2025-06-16 |language=en |title=Ukraine war: Deepfake video of Zelenskyy telling Ukrainians to 'lay down arms' debunked |url=https://news.sky.com/story/ukraine-war-deepfake-video-of-zelenskyy-telling-ukrainians-to-lay-down-arms-debunked-12567789 |website=Sky News}}</ref>
 
== Comparison of existing models ==
{| class="wikitable sortable"
|+
!''' Model/Product'''
!''' Company'''
!''' Year released'''
!''' Status'''
!class="unsortable" | '''Key features'''
!class="unsortable" | '''Capabilities'''
!class="unsortable" | '''Pricing'''
!class="unsortable" | '''Video length'''
!class="unsortable" | '''Supported languages'''
|-
|Synthesia
|[[Synthesia (company)|Synthesia]]
|2019
|Released
|Varies based on subscription
|60+
|-
|Vexub
|Vexub
|2023
|Released
|Text-to-video from prompt, focus on TikTok and YouTube storytelling formats for social media<ref name=":6">{{cite web |title=Vexub – Text-to-video AI generator |url=https://vexub.com |website=Vexub |access-date=2025-06-25}}</ref>
|Generates AI videos (1–15 mins) from text prompts; includes editing and voice features<ref name=":6" />
|Subscription-based, with various plans
|Up to ~15 minutes
|70+
|-
|InVideo AI
|-
|Runway Gen-2
|[[Runway AI]]
|2023
|Released
|-
|Runway Gen-3 Alpha
|[[Runway AI]]
|2024
|Alpha
|Multiple (not specified)
|-
|[[Google Veo]]
|[[Google]]
|2024
|Released
|[[Google Gemini]] prompting, voice acting, sound effects, background music. Cinema style realistic videos.<ref name="googlev1">{{Cite web |title=Meet Flow, AI-powered filmmaking with Veo 3|url=https://blog.google/technology/ai/google-flow-veo-ai-filmmaking-tool/|access-date=2025-07-06 |website=blogs.google.com |date=20 May 2025 }}</ref>
|Can generate very realistic and detailed character models/scenes/clips, with accommodating and matching voice acting, ambient sounds, and background music. Ability to extend clips with continuity.<ref name="googlev2">{{Cite web |title=Google Veo DeepMind|url=https://deepmind.google/models/veo/|access-date=2025-07-06 |website=google.com}}</ref>
|Varies ($250 Google Pro/Ultra AI subscription, and additional AI credit Top-Ups)
|Eight seconds per clip (clips can be continued or extended as separate clips)
|50+
|-
|[[OpenAI Sora]]
|[[OpenAI]]
|2024
|Alpha