Text-to-video model
{{Short description|Machine learning model}}
{{Use dmy dates|date=November 2024}}
[[File:OpenAI Sora in Action- Tokyo Walk.webm|thumb|upright=1.35|A video generated using OpenAI's [[Sora (text-to-video model)|Sora]] text-to-video model, using the prompt: <code>A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about.</code>]]
A '''text-to-video model''' is a [[machine learning model]] that uses a [[natural language]] description as input to produce a [[video]] relevant to the input text.<ref name="AIIR">{{cite report|url=https://aiindex.stanford.edu/wp-content/uploads/2023/04/HAI_AI-Index-Report_2023.pdf|title=Artificial Intelligence Index Report 2023|publisher=Stanford Institute for Human-Centered Artificial Intelligence|page=98|quote=Multiple high quality text-to-video models, AI systems that can generate video clips from prompted text, were released in 2022.}}</ref> Advancements during the 2020s in the generation of high-quality, text-conditioned videos have largely been driven by the development of video [[diffusion model]]s.<ref>{{cite arXiv |last1=Melnik |first1=Andrew |title=Video Diffusion Models: A Survey |date=2024-05-06 |eprint =2405.03150 |last2=Ljubljanac |first2=Michal |last3=Lu |first3=Cong |last4=Yan |first4=Qi |last5=Ren |first5=Weiming |last6=Ritter |first6=Helge|class=cs.CV }}</ref>
 
In March 2023, a research paper titled "VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation" was published, presenting a novel approach to video generation.<ref>{{Cite arXiv |eprint=2303.08320 |class=cs.CV |first1=Zhengxiong |last1=Luo |first2=Dayou |last2=Chen |title=VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation |date=2023 |last3=Zhang |first3=Yingya |last4=Huang |first4=Yan |last5=Wang |first5=Liang |last6=Shen |first6=Yujun |last7=Zhao |first7=Deli |last8=Zhou |first8=Jingren |last9=Tan |first9=Tieniu}}</ref> The VideoFusion model decomposes the diffusion noise into two components: a base noise, which is shared across frames to ensure temporal coherence, and a residual noise that varies from frame to frame. By utilizing a pre-trained image diffusion model as a base generator, the model efficiently generates high-quality and coherent videos. Fine-tuning the pre-trained model on video data addresses the ___domain gap between image and video data, enhancing the model's ability to produce realistic and consistent video sequences.<ref>{{Cite arXiv |title=VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation |eprint=2303.08320 |last1=Luo |first1=Zhengxiong |last2=Chen |first2=Dayou |last3=Zhang |first3=Yingya |last4=Huang |first4=Yan |last5=Wang |first5=Liang |last6=Shen |first6=Yujun |last7=Zhao |first7=Deli |last8=Zhou |first8=Jingren |last9=Tan |first9=Tieniu |date=2023 |class=cs.CV }}</ref> In the same month, [[Adobe Inc.|Adobe]] introduced its Firefly generative AI suite; a dedicated Firefly Video model followed in October 2024.<ref>{{Cite web |date=2024-10-10 |title=Adobe launches Firefly Video model and enhances image, vector and design models. Adobe Newsroom |url=https://news.adobe.com/news/2024/10/101424-adobe-launches-firefly-video-model |access-date=2024-11-18 |publisher=[[Adobe Inc.]]}}</ref>
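A minimal sketch of this shared-plus-residual noise decomposition is shown below; it is an illustrative approximation only (the tensor layout and the fixed mixing weight <code>lam</code> are assumptions for illustration, not the paper's exact formulation):
<syntaxhighlight lang="python">
import torch

def decomposed_video_noise(batch, frames, channels, height, width, lam=0.5):
    # Base noise: one sample shared by every frame of a video (temporal coherence).
    base = torch.randn(batch, 1, channels, height, width)
    # Residual noise: an independent sample for each frame (per-frame variation).
    residual = torch.randn(batch, frames, channels, height, width)
    # Each frame's noise mixes the shared and per-frame parts; the square-root
    # weights keep unit variance when the two components are independent.
    return lam ** 0.5 * base + (1 - lam) ** 0.5 * residual
</syntaxhighlight>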
 
In January 2024, [[Google]] announced development of a text-to-video model named Lumiere, which was anticipated to integrate advanced video editing capabilities.<ref>{{Cite web |last=Yirka |first=Bob |date=2024-01-26 |title=Google announces the development of Lumiere, an AI-based next-generation text-to-video generator. |url=https://techxplore.com/news/2024-01-google-lumiere-ai-based-generation.html |access-date=2024-11-18 |website=Tech Xplore}}</ref> [[Matthias Niessner]] and [[Lourdes Agapito]] at AI company [[Synthesia (company)|Synthesia]] work on developing 3D neural rendering techniques that can synthesise realistic video using 2D and 3D neural representations of shape, appearance, and motion for controllable video synthesis of avatars.<ref>{{Cite web |title=Text to Speech for Videos |url=https://www.synthesia.io/text-to-speech |access-date=2023-10-17 |website=Synthesia.io}}</ref> In June 2024, Luma Labs launched its [[Dream Machine (text-to-video model)|Dream Machine]] video tool.<ref>{{Cite web |last=Nuñez |first=Michael |date=2024-06-12 |title=Luma AI debuts 'Dream Machine' for realistic video generation, heating up AI media race |url=https://venturebeat.com/ai/luma-ai-debuts-dream-machine-for-realistic-video-generation-heating-up-ai-media-race/ |access-date=2024-11-18 |website=VentureBeat |language=en-US}}</ref><ref>{{Cite web |last=Fink |first=Charlie |title=Apple Debuts Intelligence, Mistral Raises $600 Million, New AI Text-To-Video |url=https://www.forbes.com/sites/charliefink/2024/06/13/apple-debuts-intelligence-mistral-raises-600-million-new-ai-text-to-video/ |access-date=2024-11-18 |website=Forbes |language=en}}</ref> That same month, [[Kuaishou]] extended its Kling AI text-to-video model to international users.<ref>{{Cite web |last=Franzen |first=Carl |date=2024-06-12 |title=What you need to know about Kling, the AI video generator rival to Sora that's wowing creators |url=https://venturebeat.com/ai/what-you-need-to-know-about-kling-the-ai-video-generator-rival-to-sora-thats-wowing-creators/ |access-date=2024-11-18 |website=VentureBeat |language=en-US}}</ref> In July 2024, [[TikTok]] owner [[ByteDance]] released Jimeng AI in China through its subsidiary Faceu Technology.<ref>{{Cite web |date=2024-08-06 |title=ByteDance joins OpenAI's Sora rivals with AI video app launch |url=https://www.reuters.com/technology/artificial-intelligence/bytedance-joins-openais-sora-rivals-with-ai-video-app-launch-2024-08-06/ |access-date=2024-11-18 |publisher=[[Reuters]]}}</ref> By September 2024, the Chinese AI company [[MiniMax (company)|MiniMax]] had debuted its video-01 model, joining other established AI model companies such as [[Zhipu AI]], [[Baichuan]], and [[Moonshot AI]] in China's growing involvement in AI technology.<ref>{{Cite web |date=2024-09-02 |title=Chinese ai "tiger" minimax launches text-to-video-generating model to rival OpenAI's sora |url=https://finance.yahoo.com/news/chinese-ai-tiger-minimax-launches-093000322.html |access-date=2024-11-18 |website=Yahoo! Finance}}</ref> In December 2024, [[Lightricks]] launched LTX Video as an open-source model.<ref>{{Cite web |last=Requiroso |first=Kelvene |date=2024-12-15 |title=Lightricks' LTXV Model Breaks Speed Records, Generating 5-Second AI Video Clips in 4 Seconds |url=https://www.eweek.com/news/lightricks-open-source-ai-video-generator/ |access-date=2025-07-24 |website=eWEEK |language=en-US}}</ref>
 
Alternative approaches to text-to-video models include<ref>{{Citation |title=Text2Video-Zero |date=2023-08-12 |url=https://github.com/Picsart-AI-Research/Text2Video-Zero |access-date=2023-08-12 |publisher=Picsart AI Research (PAIR)}}</ref> Google's Phenaki, Hour One, [[Colossyan]],<ref name=":5" /> [[Runway (company)|Runway]]'s Gen-3 Alpha,<ref>{{Cite web |last=Kemper |first=Jonathan |date=2024-07-01 |title=Runway's Sora competitor Gen-3 Alpha now available |url=https://the-decoder.com/runways-sora-competitor-gen-3-alpha-now-available/ |access-date=2024-11-18 |website=THE DECODER |language=en-US}}</ref><ref>{{Cite news |date=2023-03-20 |title=Generative AI's Next Frontier Is Video |url=https://www.bloomberg.com/news/articles/2023-03-20/generative-ai-s-next-frontier-is-video |access-date=2024-11-18 |work=Bloomberg.com |language=en}}</ref> and OpenAI's [[Sora (text-to-video model)|Sora]].<ref>{{Cite web |date=2024-02-15 |title=OpenAI teases 'Sora,' its new text-to-video AI model |url=https://www.nbcnews.com/tech/tech-news/openai-sora-video-artificial-intelligence-unveiled-rcna139065 |access-date=2024-11-18 |website=NBC News |language=en}}</ref><ref>{{Cite web |last=Kelly |first=Chris |date=2024-06-25 |title=Toys R Us creates first brand film to use OpenAI's text-to-video tool |url=https://www.marketingdive.com/news/toys-r-us-openai-sora-gen-ai-first-text-video/719797/ |access-date=2024-11-18 |website=Marketing Dive |publisher=[[Informa]] |language=en-US}}</ref> Several additional text-to-video models, such as Plug-and-Play, Text2LIVE, and TuneAVideo, have emerged.<ref>{{Cite book |last1=Jin |first1=Jiayao |last2=Wu |first2=Jianhang |last3=Xu |first3=Zhoucheng |last4=Zhang |first4=Hang |last5=Wang |first5=Yaxin |last6=Yang |first6=Jielong |chapter=Text to Video: Enhancing Video Generation Using Diffusion Models and Reconstruction Network |date=2023-08-04 |title=2023 2nd International Conference on Computing, Communication, Perception and Quantum Technology (CCPQT) |chapter-url=https://ieeexplore.ieee.org/document/10336607 |publisher=IEEE |pages=108–114 |doi=10.1109/CCPQT60491.2023.00024 |isbn=979-8-3503-4269-7}}</ref> [[FLUX.1]] developer Black Forest Labs has announced a forthcoming state-of-the-art text-to-video model.<ref>{{Cite web |date=2024-08-01 |title=Announcing Black Forest Labs |url=https://blackforestlabs.ai/announcing-black-forest-labs/ |access-date=2024-11-18 |website=Black Forest Labs |language=en-US}}</ref> [[Google]] was preparing to launch a video generation tool named [[Veo (text-to-video model)|Veo]] for [[YouTube Shorts]] in 2025.<ref>{{Cite web |last=Forlini |first=Emily Dreibelbis |date=2024-09-18 |title=Google's veo text-to-video AI generator is coming to YouTube shorts |url=https://www.pcmag.com/news/googles-veo-text-to-video-ai-generator-is-coming-to-youtube-shorts |access-date=2024-11-18 |website=[[PC Magazine]]}}</ref> In May 2025, Google launched Veo 3, the third iteration of the model. It was noted for its audio generation capabilities, which had previously been a limitation of text-to-video models.<ref>{{Cite web |last1=Elias |first1=Jennifer |last2=Subin |first2=Samantha |date=2025-05-20 |title=Google launches Veo 3, an AI video generator that incorporates audio |url=https://www.cnbc.com/2025/05/20/google-ai-video-generator-audio-veo-3.html |access-date=2025-05-22 |website=CNBC |language=en}}</ref> In July 2025, Lightricks released an update to LTX Video capable of generating clips up to 60 seconds long.<ref>{{Cite web |last=Fink |first=Charlie |title=LTX Video Breaks The 60-Second Barrier, Redefining AI Video As A Longform Medium |url=https://www.forbes.com/sites/charliefink/2025/07/16/ltx-video-breaks-the-60-second-barrier-redefining-ai-video-as-a-longform-medium/ |access-date=2025-07-24 |website=Forbes |language=en}}</ref><ref>{{Cite web |date=2025-07-16 |title=Lightricks' latest release lets creators direct long-form AI-generated videos in real time |url=https://siliconangle.com/2025/07/16/lightricks-latest-release-allows-creators-direct-longform-ai-generated-videos-real-time/ |access-date=2025-07-24 |website=SiliconANGLE |language=en-US}}</ref>
 
== Architecture and training ==
{| class="wikitable sortable"
! Model/Product
! Company
! Year released
! Status
! class="unsortable" | Key features
! class="unsortable" | Capabilities
! class="unsortable" | Pricing
! class="unsortable" | Video length
! class="unsortable" | Supported languages
|-
|Synthesia
|[[Synthesia (company)|Synthesia]]
|
|
|
|
|
|
|
|-
|Vexub
|
|2023
|Released
|Text-to-video from prompt, focus on TikTok and YouTube storytelling formats for social media<ref name=":6">{{cite web |title=Vexub – Text-to-video AI generator |url=https://vexub.com |website=Vexub |access-date=2025-06-25}}</ref>
|Generates AI videos (1–15 mins) from text prompts; includes editing and voice features<ref name=":6" />
|Subscription-based, with various plans
|1–15 minutes
|
|-
|
|
|
|
|
|
|
|Up to 10 seconds per clip, extendable
|Multiple (not specified)
|-
|[[Google Veo]]
|[[Google]]
|2024
|Released
|Prompting via [[Google Gemini]]; voice acting, sound effects, and background music; cinematic, realistic videos.<ref name="googlev1">{{Cite web |title=Meet Flow, AI-powered filmmaking with Veo 3|url=https://blog.google/technology/ai/google-flow-veo-ai-filmmaking-tool/|access-date=2025-07-06 |website=blogs.google.com}}</ref>
|Generates highly realistic and detailed characters, scenes, and clips, with matching voice acting, ambient sound, and background music; clips can be extended with continuity.<ref name="googlev2">{{Cite web |title=Google Veo DeepMind|url=https://deepmind.google/models/veo/|access-date=2025-07-06 |website=google.com}}</ref>
|Varies (Google AI Pro or Ultra subscription, up to about $250 per month, plus additional AI credit top-ups)
|Eight seconds per clip (clips can be continued or extended as separate clips)
|50+
|-
|[[OpenAI Sora]]