Text-to-video model
{{Short description|Machine learning model}}
{{Use dmy dates|date=November 2024}}
[[File:OpenAI Sora in Action- Tokyo Walk.webm|thumb|upright=1.35|A video generated using OpenAI's [[Sora (text-to-video model)|Sora]] text-to-video model, using the prompt: <code>A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about.</code>]]
A '''text-to-video model''' is a [[machine learning model]] that uses a [[natural language]] description as input to produce a [[video]] relevant to the input text.<ref name="AIIR">{{cite report|url=https://aiindex.stanford.edu/wp-content/uploads/2023/04/HAI_AI-Index-Report_2023.pdf|title=Artificial Intelligence Index Report 2023|publisher=Stanford Institute for Human-Centered Artificial Intelligence|page=98|quote=Multiple high quality text-to-video models, AI systems that can generate video clips from prompted text, were released in 2022.}}</ref> Advancements during the 2020s in the generation of high-quality, text-conditioned videos have largely been driven by the development of video [[diffusion model]]s.<ref>{{cite arXiv |last1=Melnik |first1=Andrew |title=Video Diffusion Models: A Survey |date=2024-05-06 |eprint =2405.03150 |last2=Ljubljanac |first2=Michal |last3=Lu |first3=Cong |last4=Yan |first4=Qi |last5=Ren |first5=Weiming |last6=Ritter |first6=Helge|class=cs.CV }}</ref>
 
In March 2023, a research paper titled "VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation" was published, presenting a novel approach to video generation.<ref>{{Cite arXiv |eprint=2303.08320 |class=cs.CV |first1=Zhengxiong |last1=Luo |first2=Dayou |last2=Chen |title=VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation |date=2023 |last3=Zhang |first3=Yingya |last4=Huang |first4=Yan |last5=Wang |first5=Liang |last6=Shen |first6=Yujun |last7=Zhao |first7=Deli |last8=Zhou |first8=Jingren |last9=Tan |first9=Tieniu}}</ref> The VideoFusion model decomposes the diffusion noise into two components: a base noise, which is shared across frames to ensure temporal coherence, and a residual noise that varies from frame to frame. By utilizing a pre-trained image diffusion model as a base generator, the model efficiently generates high-quality and coherent videos. Fine-tuning the pre-trained model on video data addresses the ___domain gap between image and video data, enhancing the model's ability to produce realistic and consistent video sequences.<ref>{{Cite arXiv |title=VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation |eprint=2303.08320 |last1=Luo |first1=Zhengxiong |last2=Chen |first2=Dayou |last3=Zhang |first3=Yingya |last4=Huang |first4=Yan |last5=Wang |first5=Liang |last6=Shen |first6=Yujun |last7=Zhao |first7=Deli |last8=Zhou |first8=Jingren |last9=Tan |first9=Tieniu |date=2023 |class=cs.CV }}</ref> In the same month, [[Adobe Inc.|Adobe]] introduced its Firefly generative AI suite; a dedicated Firefly Video model followed in October 2024.<ref>{{Cite web |date=2024-10-10 |title=Adobe launches Firefly Video model and enhances image, vector and design models. Adobe Newsroom |url=https://news.adobe.com/news/2024/10/101424-adobe-launches-firefly-video-model |access-date=2024-11-18 |publisher=[[Adobe Inc.]]}}</ref>
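A minimal sketch of this shared-plus-residual noise decomposition is shown below; it is an illustrative approximation only (the tensor layout and the fixed mixing weight <code>lam</code> are assumptions for illustration, not the paper's exact formulation):
<syntaxhighlight lang="python">
import torch

def decomposed_video_noise(batch, frames, channels, height, width, lam=0.5):
    # Base noise: one sample shared by every frame of a video (temporal coherence).
    base = torch.randn(batch, 1, channels, height, width)
    # Residual noise: an independent sample for each frame (per-frame variation).
    residual = torch.randn(batch, frames, channels, height, width)
    # Each frame's noise mixes the shared and per-frame parts; the square-root
    # weights keep unit variance when the two components are independent.
    return lam ** 0.5 * base + (1 - lam) ** 0.5 * residual
</syntaxhighlight>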
 
In January 2024, [[Google]] announced development of a text-to-video model named Lumiere, which was anticipated to integrate advanced video editing capabilities.<ref>{{Cite web |last=Yirka |first=Bob |date=2024-01-26 |title=Google announces the development of Lumiere, an AI-based next-generation text-to-video generator. |url=https://techxplore.com/news/2024-01-google-lumiere-ai-based-generation.html |access-date=2024-11-18 |website=Tech Xplore}}</ref> [[Matthias Niessner]] and [[Lourdes Agapito]] at AI company [[Synthesia (company)|Synthesia]] work on developing 3D neural rendering techniques that can synthesise realistic video using 2D and 3D neural representations of shape, appearance, and motion for controllable video synthesis of avatars.<ref>{{Cite web |title=Text to Speech for Videos |url=https://www.synthesia.io/text-to-speech |access-date=2023-10-17 |website=Synthesia.io}}</ref> In June 2024, Luma Labs launched its [[Dream Machine (text-to-video model)|Dream Machine]] video tool.<ref>{{Cite web |last=Nuñez |first=Michael |date=2024-06-12 |title=Luma AI debuts 'Dream Machine' for realistic video generation, heating up AI media race |url=https://venturebeat.com/ai/luma-ai-debuts-dream-machine-for-realistic-video-generation-heating-up-ai-media-race/ |access-date=2024-11-18 |website=VentureBeat |language=en-US}}</ref><ref>{{Cite web |last=Fink |first=Charlie |title=Apple Debuts Intelligence, Mistral Raises $600 Million, New AI Text-To-Video |url=https://www.forbes.com/sites/charliefink/2024/06/13/apple-debuts-intelligence-mistral-raises-600-million-new-ai-text-to-video/ |access-date=2024-11-18 |website=Forbes |language=en}}</ref> That same month, [[Kuaishou]] extended its Kling AI text-to-video model to international users.<ref>{{Cite web |last=Franzen |first=Carl |date=2024-06-12 |title=What you need to know about Kling, the AI video generator rival to Sora that's wowing creators |url=https://venturebeat.com/ai/what-you-need-to-know-about-kling-the-ai-video-generator-rival-to-sora-thats-wowing-creators/ |access-date=2024-11-18 |website=VentureBeat |language=en-US}}</ref> In July 2024, [[TikTok]] owner [[ByteDance]] released Jimeng AI in China through its subsidiary Faceu Technology.<ref>{{Cite web |date=2024-08-06 |title=ByteDance joins OpenAI's Sora rivals with AI video app launch |url=https://www.reuters.com/technology/artificial-intelligence/bytedance-joins-openais-sora-rivals-with-ai-video-app-launch-2024-08-06/ |access-date=2024-11-18 |publisher=[[Reuters]]}}</ref> By September 2024, the Chinese AI company [[MiniMax (company)|MiniMax]] had debuted its video-01 model, joining other established AI model companies such as [[Zhipu AI]], [[Baichuan]], and [[Moonshot AI]] in China's growing involvement in AI technology.<ref>{{Cite web |date=2024-09-02 |title=Chinese ai "tiger" minimax launches text-to-video-generating model to rival OpenAI's sora |url=https://finance.yahoo.com/news/chinese-ai-tiger-minimax-launches-093000322.html |access-date=2024-11-18 |website=Yahoo! Finance}}</ref> In December 2024, [[Lightricks]] launched LTX Video as an open-source model.<ref>{{Cite web |last=Requiroso |first=Kelvene |date=2024-12-15 |title=Lightricks' LTXV Model Breaks Speed Records, Generating 5-Second AI Video Clips in 4 Seconds |url=https://www.eweek.com/news/lightricks-open-source-ai-video-generator/ |access-date=2025-07-24 |website=eWEEK |language=en-US}}</ref>
 
Alternative approaches to text-to-video models include<ref>{{Citation |title=Text2Video-Zero |date=2023-08-12 |url=https://github.com/Picsart-AI-Research/Text2Video-Zero |access-date=2023-08-12 |publisher=Picsart AI Research (PAIR)}}</ref> Google's Phenaki, Hour One, [[Colossyan]],<ref name=":5" /> [[Runway (company)|Runway]]'s Gen-3 Alpha,<ref>{{Cite web |last=Kemper |first=Jonathan |date=2024-07-01 |title=Runway's Sora competitor Gen-3 Alpha now available |url=https://the-decoder.com/runways-sora-competitor-gen-3-alpha-now-available/ |access-date=2024-11-18 |website=THE DECODER |language=en-US}}</ref><ref>{{Cite news |date=2023-03-20 |title=Generative AI's Next Frontier Is Video |url=https://www.bloomberg.com/news/articles/2023-03-20/generative-ai-s-next-frontier-is-video |access-date=2024-11-18 |work=Bloomberg.com |language=en}}</ref> and OpenAI's [[Sora (text-to-video model)|Sora]].<ref>{{Cite web |date=2024-02-15 |title=OpenAI teases 'Sora,' its new text-to-video AI model |url=https://www.nbcnews.com/tech/tech-news/openai-sora-video-artificial-intelligence-unveiled-rcna139065 |access-date=2024-11-18 |website=NBC News |language=en}}</ref><ref>{{Cite web |last=Kelly |first=Chris |date=2024-06-25 |title=Toys R Us creates first brand film to use OpenAI's text-to-video tool |url=https://www.marketingdive.com/news/toys-r-us-openai-sora-gen-ai-first-text-video/719797/ |access-date=2024-11-18 |website=Marketing Dive |publisher=[[Informa]] |language=en-US}}</ref> Several additional text-to-video models, such as Plug-and-Play, Text2LIVE, and TuneAVideo, have emerged.<ref>{{Cite book |last1=Jin |first1=Jiayao |last2=Wu |first2=Jianhang |last3=Xu |first3=Zhoucheng |last4=Zhang |first4=Hang |last5=Wang |first5=Yaxin |last6=Yang |first6=Jielong |chapter=Text to Video: Enhancing Video Generation Using Diffusion Models and Reconstruction Network |date=2023-08-04 |title=2023 2nd International Conference on Computing, Communication, Perception and Quantum Technology (CCPQT) |chapter-url=https://ieeexplore.ieee.org/document/10336607 |publisher=IEEE |pages=108–114 |doi=10.1109/CCPQT60491.2023.00024 |isbn=979-8-3503-4269-7}}</ref> [[FLUX.1]] developer Black Forest Labs has announced a forthcoming state-of-the-art text-to-video model.<ref>{{Cite web |date=2024-08-01 |title=Announcing Black Forest Labs |url=https://blackforestlabs.ai/announcing-black-forest-labs/ |access-date=2024-11-18 |website=Black Forest Labs |language=en-US}}</ref> [[Google]] was preparing to launch a video generation tool named [[Veo (text-to-video model)|Veo]] for [[YouTube Shorts]] in 2025.<ref>{{Cite web |last=Forlini |first=Emily Dreibelbis |date=2024-09-18 |title=Google's veo text-to-video AI generator is coming to YouTube shorts |url=https://www.pcmag.com/news/googles-veo-text-to-video-ai-generator-is-coming-to-youtube-shorts |access-date=2024-11-18 |website=[[PC Magazine]]}}</ref> In May 2025, Google launched Veo 3, the third iteration of the model. It was noted for its audio generation capabilities, which had previously been a limitation of text-to-video models.<ref>{{Cite web |last1=Elias |first1=Jennifer |last2=Subin |first2=Samantha |date=2025-05-20 |title=Google launches Veo 3, an AI video generator that incorporates audio |url=https://www.cnbc.com/2025/05/20/google-ai-video-generator-audio-veo-3.html |access-date=2025-05-22 |website=CNBC |language=en}}</ref> In July 2025, Lightricks released an update to LTX Video capable of generating clips up to 60 seconds long.<ref>{{Cite web |last=Fink |first=Charlie |title=LTX Video Breaks The 60-Second Barrier, Redefining AI Video As A Longform Medium |url=https://www.forbes.com/sites/charliefink/2025/07/16/ltx-video-breaks-the-60-second-barrier-redefining-ai-video-as-a-longform-medium/ |access-date=2025-07-24 |website=Forbes |language=en}}</ref><ref>{{Cite web |date=2025-07-16 |title=Lightricks' latest release lets creators direct long-form AI-generated videos in real time |url=https://siliconangle.com/2025/07/16/lightricks-latest-release-allows-creators-direct-longform-ai-generated-videos-real-time/ |access-date=2025-07-24 |website=SiliconANGLE |language=en-US}}</ref>
 
== Architecture and training ==
{| class="wikitable sortable"
! Model/Product
! Company
! Year released
! Status
! class="unsortable" | Key features
! class="unsortable" | Capabilities
! class="unsortable" | Pricing
! class="unsortable" | Video length
! class="unsortable" | Supported languages
|-
|Synthesia
|[[Synthesia (company)|Synthesia]]
|
|
|
|
|
|
|
|-
|Vexub
|
|2023
|Released
|Text-to-video from prompt, focus on TikTok and YouTube storytelling formats for social media<ref name=":6">{{cite web |title=Vexub – Text-to-video AI generator |url=https://vexub.com |website=Vexub |access-date=2025-06-25}}</ref>
|Generates AI videos (1–15 mins) from text prompts; includes editing and voice features<ref name=":6" />
|Subscription-based, with various plans
|1–15 minutes
|
|-
|
|
|
|
|
|
|
|Up to 10 seconds per clip, extendable
|Multiple (not specified)
|-
|[[Google Veo]]
|[[Google]]
|2024
|Released
|Prompting via [[Google Gemini]]; voice acting, sound effects, and background music; cinematic, realistic videos.<ref name="googlev1">{{Cite web |title=Meet Flow, AI-powered filmmaking with Veo 3|url=https://blog.google/technology/ai/google-flow-veo-ai-filmmaking-tool/|access-date=2025-07-06 |website=blogs.google.com}}</ref>
|Generates highly realistic and detailed characters, scenes, and clips, with matching voice acting, ambient sound, and background music; clips can be extended with continuity.<ref name="googlev2">{{Cite web |title=Google Veo DeepMind|url=https://deepmind.google/models/veo/|access-date=2025-07-06 |website=google.com}}</ref>
|Varies (Google AI Pro or Ultra subscription, up to about $250 per month, plus additional AI credit top-ups)
|Eight seconds per clip (clips can be continued or extended as separate clips)
|50+
|-
|[[OpenAI Sora]]