Text-to-video model
A '''text-to-video model''' is a [[machine learning model]] that uses a [[natural language]] description as input to produce a [[video]] relevant to the input text.<ref name="AIIR">{{cite report|url=https://aiindex.stanford.edu/wp-content/uploads/2023/04/HAI_AI-Index-Report_2023.pdf|title=Artificial Intelligence Index Report 2023|publisher=Stanford Institute for Human-Centered Artificial Intelligence|page=98|quote=Multiple high quality text-to-video models, AI systems that can generate video clips from prompted text, were released in 2022.}}</ref> Advancements during the 2020s in the generation of high-quality, text-conditioned videos have largely been driven by the development of video [[diffusion model]]s.<ref>{{cite arXiv |last1=Melnik |first1=Andrew |title=Video Diffusion Models: A Survey |date=2024-05-06 |eprint =2405.03150 |last2=Ljubljanac |first2=Michal |last3=Lu |first3=Cong |last4=Yan |first4=Qi |last5=Ren |first5=Weiming |last6=Ritter |first6=Helge|class=cs.CV }}</ref>
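The following is a minimal sketch of how such a model is typically invoked in practice. It assumes the Hugging Face <code>diffusers</code> library and the openly released ModelScope checkpoint <code>damo-vilab/text-to-video-ms-1.7b</code>, both chosen purely for illustration and not tied to any specific model discussed in this article; exact argument names and return types vary between library versions.

<syntaxhighlight lang="python">
# Minimal sketch: generating a short clip from a text prompt with a publicly
# available text-to-video diffusion pipeline. The library (diffusers) and the
# checkpoint name are illustrative assumptions; API details may differ
# between versions.
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")  # a CUDA-capable GPU is assumed

prompt = "an astronaut riding a horse on the moon"
result = pipe(prompt, num_inference_steps=25, num_frames=16)

# result.frames holds the generated frames; export them to an .mp4 file.
video_path = export_to_video(result.frames[0])
print(video_path)
</syntaxhighlight>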
 
== Models ==
"A cartoon farmer wearing a red dhoti and turban is happily walking from the city holding a small coconut plant. Bright sunny day, rural background turning to urban as he walks. Cartoon style, vibrant colors."
{{Globalize|section|date=August 2024}}
There are different models, including [[open source]] models. CogVideo, a 9.4-billion-parameter model that accepts Chinese-language input,<ref name=":5">{{Cite web |last=Wodecki |first=Ben |date=2023-08-11 |title=Text-to-Video Generative AI Models: The Definitive List |url=https://aibusiness.com/nlp/ai-video-generation-the-supreme-list |access-date=2024-11-18 |website=AI Business |publisher=[[Informa]]}}</ref> is the earliest text-to-video model to be developed; a demo version of its open-source code was first presented on [[GitHub]] in 2022.<ref>{{Citation |title=CogVideo |date=2022-10-12 |url=https://github.com/THUDM/CogVideo |publisher=THUDM |access-date=2022-10-12}}</ref> That year, [[Meta Platforms]] released a partial text-to-video model called "Make-A-Video",<ref>{{Cite web |last=Davies |first=Teli |date=2022-09-29 |title=Make-A-Video: Meta AI's New Model For Text-To-Video Generation |url=https://wandb.ai/telidavies/ml-news/reports/Make-A-Video-Meta-AI-s-New-Model-For-Text-To-Video-Generation--VmlldzoyNzE4Nzcx |access-date=2022-10-12 |website=Weights & Biases |language=en}}</ref><ref name="Monge">{{Cite web |last=Monge |first=Jim Clyde |date=2022-08-03 |title=This AI Can Create Video From Text Prompt |url=https://betterprogramming.pub/this-ai-can-create-video-from-text-prompt-6904439d7aba |access-date=2022-10-12 |website=Medium |language=en}}</ref><ref>{{Cite web |title=Meta's Make-A-Video AI creates videos from text |url=https://www.fonearena.com/blog/375627/meta-make-a-video-ai-create-videos-from-text.html |access-date=2022-10-12 |website=www.fonearena.com}}</ref> and [[Google]]'s [[Google Brain|Brain]] (later [[Google DeepMind]]) introduced Imagen Video, a text-to-video model built on a 3D [[U-Net]].<ref>{{Cite news |title=google: Google takes on Meta, introduces own video-generating AI |url=https://m.economictimes.com/tech/technology/google-takes-on-meta-introduces-own-video-generating-ai/articleshow/94681128.cms |access-date=2022-10-12 |website=[[The Economic Times]]| date=6 October 2022 }}</ref><ref name="Monge"/><ref>{{Cite web |title=Nuh-uh, Meta, we can do text-to-video AI, too, says Google |url=https://www.theregister.com/AMP/2022/10/06/google_ai_imagen_video/ |access-date=2022-10-12 |website=[[The Register]]}}</ref><ref>{{Cite web |title=Papers with Code - See, Plan, Predict: Language-guided Cognitive Planning with Video Prediction |url=https://paperswithcode.com/paper/see-plan-predict-language-guided-cognitive |access-date=2022-10-12 |website=paperswithcode.com |language=en}}</ref><ref>{{Cite web |title=Papers with Code - Text-driven Video Prediction |url=https://paperswithcode.com/paper/text-driven-video-prediction |access-date=2022-10-12 |website=paperswithcode.com |language=en}}</ref>
"The cartoon farmer digs a pit in the ground using a shovel, then gently places the coconut plant into it. The monkey watches curiously from the side. Rural farm setting, cartoon style."
"The farmer and monkey pour lots of water onto the planted coconut tree using buckets. Water splashes joyfully. The plant looks fresh and happy. Cartoon animation style
 
In March 2023, a research paper titled "VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation" was published, presenting a novel approach to video generation.<ref>{{Cite arXiv |eprint=2303.08320 |class=cs.CV |first1=Zhengxiong |last1=Luo |first2=Dayou |last2=Chen |title=VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation |date=2023 |last3=Zhang |first3=Yingya |last4=Huang |first4=Yan |last5=Wang |first5=Liang |last6=Shen |first6=Yujun |last7=Zhao |first7=Deli |last8=Zhou |first8=Jingren |last9=Tan |first9=Tieniu}}</ref> The VideoFusion model decomposes the per-frame diffusion noise into two components: a base noise that is shared across frames, which promotes temporal coherence, and a frame-specific residual noise. By utilizing a pre-trained image diffusion model as a base generator, the model efficiently generated high-quality and coherent videos. Fine-tuning the pre-trained model on video data addressed the ___domain gap between image and video data, enhancing the model's ability to produce realistic and consistent video sequences.<ref>{{Cite arXiv |title=VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation |eprint=2303.08320 |last1=Luo |first1=Zhengxiong |last2=Chen |first2=Dayou |last3=Zhang |first3=Yingya |last4=Huang |first4=Yan |last5=Wang |first5=Liang |last6=Shen |first6=Yujun |last7=Zhao |first7=Deli |last8=Zhou |first8=Jingren |last9=Tan |first9=Tieniu |date=2023 |class=cs.CV }}</ref> In the same month, [[Adobe Inc.|Adobe]] introduced Firefly AI as part of its creative products.<ref>{{Cite web |date=2024-10-10 |title=Adobe launches Firefly Video model and enhances image, vector and design models. Adobe Newsroom |url=https://news.adobe.com/news/2024/10/101424-adobe-launches-firefly-video-model |access-date=2024-11-18 |publisher=[[Adobe Inc.]]}}</ref>
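The shared/residual noise decomposition can be illustrated with a short, self-contained sketch. This is not the authors' code; the weighting constant and tensor shapes are hypothetical values chosen for illustration.

<syntaxhighlight lang="python">
# Sketch of a VideoFusion-style decomposed noise: each frame's noise mixes a
# base component shared by all frames with a frame-specific residual, so that
# adjacent frames receive strongly correlated noise (aiding temporal coherence).
import numpy as np

rng = np.random.default_rng(0)

num_frames, height, width, channels = 16, 64, 64, 3
lam = 0.8  # hypothetical weight placed on the shared base noise

base_noise = rng.standard_normal((1, height, width, channels))               # shared across frames
residual_noise = rng.standard_normal((num_frames, height, width, channels))  # frame-specific

# Per-frame noise: sqrt(lam) * shared base + sqrt(1 - lam) * residual.
frame_noise = np.sqrt(lam) * base_noise + np.sqrt(1.0 - lam) * residual_noise

# The correlation between the noise of any two frames rises with lam (here ~0.8).
corr = np.corrcoef(frame_noise[0].ravel(), frame_noise[1].ravel())[0, 1]
print(f"inter-frame noise correlation ~ {corr:.2f}")
</syntaxhighlight>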
"The coconut tree has grown tall with many coconuts hanging. The farmer and monkey look up happily. Background: bright day, peaceful field. Cartoon style with cheerful expressions."
 
"The monkey climbs up the coconut tree and throws coconuts down to the farmer, who catches them in a basket. Both laugh joyfully. Cartoon animation with playful energy."
In January 2024, [[Google]] announced development of a text-to-video model named Lumiere, which is anticipated to integrate advanced video editing capabilities.<ref>{{Cite web |last=Yirka |first=Bob |date=2024-01-26 |title=Google announces the development of Lumiere, an AI-based next-generation text-to-video generator. |url=https://techxplore.com/news/2024-01-google-lumiere-ai-based-generation.html |access-date=2024-11-18 |website=Tech Xplore}}</ref> [[Matthias Niessner]] and [[Lourdes Agapito]] at AI company [[Synthesia (company)|Synthesia]] are developing 3D neural rendering techniques that synthesize realistic video from 2D and 3D neural representations of shape, appearance, and motion, enabling controllable video synthesis of avatars.<ref>{{Cite web |title=Text to Speech for Videos |url=https://www.synthesia.io/text-to-speech |access-date=2023-10-17 |website=Synthesia.io}}</ref> In June 2024, Luma Labs launched its [[Dream Machine (text-to-video model)|Dream Machine]] video tool.<ref>{{Cite web |last=Nuñez |first=Michael |date=2024-06-12 |title=Luma AI debuts 'Dream Machine' for realistic video generation, heating up AI media race |url=https://venturebeat.com/ai/luma-ai-debuts-dream-machine-for-realistic-video-generation-heating-up-ai-media-race/ |access-date=2024-11-18 |website=VentureBeat |language=en-US}}</ref><ref>{{Cite web |last=Fink |first=Charlie |title=Apple Debuts Intelligence, Mistral Raises $600 Million, New AI Text-To-Video |url=https://www.forbes.com/sites/charliefink/2024/06/13/apple-debuts-intelligence-mistral-raises-600-million-new-ai-text-to-video/ |access-date=2024-11-18 |website=Forbes |language=en}}</ref> That same month, [[Kuaishou]] extended its Kling AI text-to-video model to international users.<ref>{{Cite web |last=Franzen |first=Carl |date=2024-06-12 |title=What you need to know about Kling, the AI video generator rival to Sora that's wowing creators |url=https://venturebeat.com/ai/what-you-need-to-know-about-kling-the-ai-video-generator-rival-to-sora-thats-wowing-creators/ |access-date=2024-11-18 |website=VentureBeat |language=en-US}}</ref> In July 2024, [[TikTok]] owner [[ByteDance]] released Jimeng AI in China through its subsidiary Faceu Technology.<ref>{{Cite web |date=2024-08-06 |title=ByteDance joins OpenAI's Sora rivals with AI video app launch |url=https://www.reuters.com/technology/artificial-intelligence/bytedance-joins-openais-sora-rivals-with-ai-video-app-launch-2024-08-06/ |access-date=2024-11-18 |publisher=[[Reuters]]}}</ref> In September 2024, the Chinese AI company [[MiniMax (company)|MiniMax]] debuted its video-01 model, joining other established Chinese AI model companies such as [[Zhipu AI]], [[Baichuan]], and [[Moonshot AI]].<ref>{{Cite web |date=2024-09-02 |title=Chinese AI "tiger" MiniMax launches text-to-video-generating model to rival OpenAI's Sora |url=https://finance.yahoo.com/news/chinese-ai-tiger-minimax-launches-093000322.html |access-date=2024-11-18 |website=Yahoo! Finance}}</ref>
"The farmer and monkey sit together on a swing made of rope tied to a tree, drinking coconut juice with straws, smiling and swinging slowly. Beautiful sunny background. Cartoon style, vibrant and happy
 
Alternative approaches to text-to-video models include<ref>{{Citation |title=Text2Video-Zero |date=2023-08-12 |url=https://github.com/Picsart-AI-Research/Text2Video-Zero |access-date=2023-08-12 |publisher=Picsart AI Research (PAIR)}}</ref> Google's Phenaki, Hour One, [[Colossyan]],<ref name=":5" /> [[Runway (company)|Runway]]'s Gen-3 Alpha,<ref>{{Cite web |last=Kemper |first=Jonathan |date=2024-07-01 |title=Runway's Sora competitor Gen-3 Alpha now available |url=https://the-decoder.com/runways-sora-competitor-gen-3-alpha-now-available/ |access-date=2024-11-18 |website=THE DECODER |language=en-US}}</ref><ref>{{Cite news |date=2023-03-20 |title=Generative AI's Next Frontier Is Video |url=https://www.bloomberg.com/news/articles/2023-03-20/generative-ai-s-next-frontier-is-video |access-date=2024-11-18 |work=Bloomberg.com |language=en}}</ref> and OpenAI's [[Sora (text-to-video model)|Sora]].<ref>{{Cite web |date=2024-02-15 |title=OpenAI teases 'Sora,' its new text-to-video AI model |url=https://www.nbcnews.com/tech/tech-news/openai-sora-video-artificial-intelligence-unveiled-rcna139065 |access-date=2024-11-18 |website=NBC News |language=en}}</ref><ref>{{Cite web |last=Kelly |first=Chris |date=2024-06-25 |title=Toys R Us creates first brand film to use OpenAI's text-to-video tool |url=https://www.marketingdive.com/news/toys-r-us-openai-sora-gen-ai-first-text-video/719797/ |access-date=2024-11-18 |website=Marketing Dive |publisher=[[Informa]] |language=en-US}}</ref> Several additional text-to-video models, such as Plug-and-Play, Text2LIVE, and Tune-A-Video, have emerged.<ref>{{Cite book |last1=Jin |first1=Jiayao |last2=Wu |first2=Jianhang |last3=Xu |first3=Zhoucheng |last4=Zhang |first4=Hang |last5=Wang |first5=Yaxin |last6=Yang |first6=Jielong |chapter=Text to Video: Enhancing Video Generation Using Diffusion Models and Reconstruction Network |date=2023-08-04 |title=2023 2nd International Conference on Computing, Communication, Perception and Quantum Technology (CCPQT) |chapter-url=https://ieeexplore.ieee.org/document/10336607 |publisher=IEEE |pages=108–114 |doi=10.1109/CCPQT60491.2023.00024 |isbn=979-8-3503-4269-7}}</ref> [[FLUX.1]] developer Black Forest Labs has announced its text-to-video model SOTA.<ref>{{Cite web |date=2024-08-01 |title=Announcing Black Forest Labs |url=https://blackforestlabs.ai/announcing-black-forest-labs/ |access-date=2024-11-18 |website=Black Forest Labs |language=en-US}}</ref> [[Google]] was preparing to launch a video generation tool named [[Veo (text-to-video model)|Veo]] for [[YouTube Shorts]] in 2025.<ref>{{Cite web |last=Forlini |first=Emily Dreibelbis |date=2024-09-18 |title=Google's veo text-to-video AI generator is coming to YouTube shorts |url=https://www.pcmag.com/news/googles-veo-text-to-video-ai-generator-is-coming-to-youtube-shorts |access-date=2024-11-18 |website=[[PC Magazine]]}}</ref> In May 2025, Google launched Veo 3, the next iteration of the model, which was noted for its audio generation capabilities, addressing a previous limitation of text-to-video models.<ref>{{Cite web |last1=Elias |first1=Jennifer |last2=Subin |first2=Samantha |date=2025-05-20 |title=Google launches Veo 3, an AI video generator that incorporates audio |url=https://www.cnbc.com/2025/05/20/google-ai-video-generator-audio-veo-3.html |access-date=2025-05-22 |website=CNBC |language=en}}</ref>
"The cartoon farmer sits on the ground crying beside the broken coconut tree. The monkey tries to comfort him. Background shows fallen coconuts and snapped tree. Sad cartoon tone."
"Cartoon monkey looks at the camera and says ‘Like aur Subscribe karna mat bhoolna!’ while pointing at a broken signboard with 'Subscribe' written on it. Farmer wipes tears and nods. Fun ending screen, cartoon YouTube style."
 
== Architecture and training ==