Text-to-video model
A '''text-to-video model''' is a [[machine learning model]] that uses a [[natural language]] description as input to produce a [[video]] relevant to the input text.<ref name="AIIR">{{cite report|url=https://aiindex.stanford.edu/wp-content/uploads/2023/04/HAI_AI-Index-Report_2023.pdf|title=Artificial Intelligence Index Report 2023|publisher=Stanford Institute for Human-Centered Artificial Intelligence|page=98|quote=Multiple high quality text-to-video models, AI systems that can generate video clips from prompted text, were released in 2022.}}</ref> Advancements during the 2020s in the generation of high-quality, text-conditioned videos have largely been driven by the development of video [[diffusion model]]s.<ref>{{cite arXiv |last1=Melnik |first1=Andrew |title=Video Diffusion Models: A Survey |date=2024-05-06 |eprint =2405.03150 |last2=Ljubljanac |first2=Michal |last3=Lu |first3=Cong |last4=Yan |first4=Qi |last5=Ren |first5=Weiming |last6=Ritter |first6=Helge|class=cs.CV }}</ref>
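The following is a minimal sketch of how such a model is typically invoked in practice. It assumes the Hugging Face <code>diffusers</code> library and the openly released ModelScope checkpoint <code>damo-vilab/text-to-video-ms-1.7b</code>, both chosen purely for illustration and not tied to any specific model discussed in this article; exact argument names and return types vary between library versions.

<syntaxhighlight lang="python">
# Minimal sketch: generating a short clip from a text prompt with a publicly
# available text-to-video diffusion pipeline. The library (diffusers) and the
# checkpoint name are illustrative assumptions; API details may differ
# between versions.
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")  # a CUDA-capable GPU is assumed

prompt = "an astronaut riding a horse on the moon"
result = pipe(prompt, num_inference_steps=25, num_frames=16)

# result.frames holds the generated frames; export them to an .mp4 file.
video_path = export_to_video(result.frames[0])
print(video_path)
</syntaxhighlight>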
 
== Models ==
"A cartoon farmer wearing a red dhoti and turban is happily walking from the city holding a small coconut plant. Bright sunny day, rural background turning to urban as he walks. Cartoon style, vibrant colors."
{{Globalize|section|date=August 2024}}
There are different models, including [[open source]] models. CogVideo, a 9.4-billion-parameter model that accepts Chinese-language input,<ref name=":5">{{Cite web |last=Wodecki |first=Ben |date=2023-08-11 |title=Text-to-Video Generative AI Models: The Definitive List |url=https://aibusiness.com/nlp/ai-video-generation-the-supreme-list |access-date=2024-11-18 |website=AI Business |publisher=[[Informa]]}}</ref> is the earliest text-to-video model to be developed; a demo version of its open-source code was first presented on [[GitHub]] in 2022.<ref>{{Citation |title=CogVideo |date=2022-10-12 |url=https://github.com/THUDM/CogVideo |publisher=THUDM |access-date=2022-10-12}}</ref> That year, [[Meta Platforms]] released a partial text-to-video model called "Make-A-Video",<ref>{{Cite web |last=Davies |first=Teli |date=2022-09-29 |title=Make-A-Video: Meta AI's New Model For Text-To-Video Generation |url=https://wandb.ai/telidavies/ml-news/reports/Make-A-Video-Meta-AI-s-New-Model-For-Text-To-Video-Generation--VmlldzoyNzE4Nzcx |access-date=2022-10-12 |website=Weights & Biases |language=en}}</ref><ref name="Monge">{{Cite web |last=Monge |first=Jim Clyde |date=2022-08-03 |title=This AI Can Create Video From Text Prompt |url=https://betterprogramming.pub/this-ai-can-create-video-from-text-prompt-6904439d7aba |access-date=2022-10-12 |website=Medium |language=en}}</ref><ref>{{Cite web |title=Meta's Make-A-Video AI creates videos from text |url=https://www.fonearena.com/blog/375627/meta-make-a-video-ai-create-videos-from-text.html |access-date=2022-10-12 |website=www.fonearena.com}}</ref> and [[Google]]'s [[Google Brain|Brain]] (later [[Google DeepMind]]) introduced Imagen Video, a text-to-video model built on a 3D [[U-Net]].<ref>{{Cite news |title=google: Google takes on Meta, introduces own video-generating AI |url=https://m.economictimes.com/tech/technology/google-takes-on-meta-introduces-own-video-generating-ai/articleshow/94681128.cms |access-date=2022-10-12 |website=[[The Economic Times]]| date=6 October 2022 }}</ref><ref name="Monge"/><ref>{{Cite web |title=Nuh-uh, Meta, we can do text-to-video AI, too, says Google |url=https://www.theregister.com/AMP/2022/10/06/google_ai_imagen_video/ |access-date=2022-10-12 |website=[[The Register]]}}</ref><ref>{{Cite web |title=Papers with Code - See, Plan, Predict: Language-guided Cognitive Planning with Video Prediction |url=https://paperswithcode.com/paper/see-plan-predict-language-guided-cognitive |access-date=2022-10-12 |website=paperswithcode.com |language=en}}</ref><ref>{{Cite web |title=Papers with Code - Text-driven Video Prediction |url=https://paperswithcode.com/paper/text-driven-video-prediction |access-date=2022-10-12 |website=paperswithcode.com |language=en}}</ref>
"The cartoon farmer digs a pit in the ground using a shovel, then gently places the coconut plant into it. The monkey watches curiously from the side. Rural farm setting, cartoon style."
"The farmer and monkey pour lots of water onto the planted coconut tree using buckets. Water splashes joyfully. The plant looks fresh and happy. Cartoon animation style
 
In March 2023, a research paper titled "VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation" was published, presenting a novel approach to video generation.<ref>{{Cite arXiv |eprint=2303.08320 |class=cs.CV |first1=Zhengxiong |last1=Luo |first2=Dayou |last2=Chen |title=VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation |date=2023 |last3=Zhang |first3=Yingya |last4=Huang |first4=Yan |last5=Wang |first5=Liang |last6=Shen |first6=Yujun |last7=Zhao |first7=Deli |last8=Zhou |first8=Jingren |last9=Tan |first9=Tieniu}}</ref> The VideoFusion model decomposes the per-frame diffusion noise into two components: a base noise that is shared across frames, which promotes temporal coherence, and a frame-specific residual noise. By utilizing a pre-trained image diffusion model as a base generator, the model efficiently generated high-quality and coherent videos. Fine-tuning the pre-trained model on video data addressed the ___domain gap between image and video data, enhancing the model's ability to produce realistic and consistent video sequences.<ref>{{Cite arXiv |title=VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation |eprint=2303.08320 |last1=Luo |first1=Zhengxiong |last2=Chen |first2=Dayou |last3=Zhang |first3=Yingya |last4=Huang |first4=Yan |last5=Wang |first5=Liang |last6=Shen |first6=Yujun |last7=Zhao |first7=Deli |last8=Zhou |first8=Jingren |last9=Tan |first9=Tieniu |date=2023 |class=cs.CV }}</ref> In the same month, [[Adobe Inc.|Adobe]] introduced Firefly AI as part of its creative products.<ref>{{Cite web |date=2024-10-10 |title=Adobe launches Firefly Video model and enhances image, vector and design models. Adobe Newsroom |url=https://news.adobe.com/news/2024/10/101424-adobe-launches-firefly-video-model |access-date=2024-11-18 |publisher=[[Adobe Inc.]]}}</ref>
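The shared/residual noise decomposition can be illustrated with a short, self-contained sketch. This is not the authors' code; the weighting constant and tensor shapes are hypothetical values chosen for illustration.

<syntaxhighlight lang="python">
# Sketch of a VideoFusion-style decomposed noise: each frame's noise mixes a
# base component shared by all frames with a frame-specific residual, so that
# adjacent frames receive strongly correlated noise (aiding temporal coherence).
import numpy as np

rng = np.random.default_rng(0)

num_frames, height, width, channels = 16, 64, 64, 3
lam = 0.8  # hypothetical weight placed on the shared base noise

base_noise = rng.standard_normal((1, height, width, channels))               # shared across frames
residual_noise = rng.standard_normal((num_frames, height, width, channels))  # frame-specific

# Per-frame noise: sqrt(lam) * shared base + sqrt(1 - lam) * residual.
frame_noise = np.sqrt(lam) * base_noise + np.sqrt(1.0 - lam) * residual_noise

# The correlation between the noise of any two frames rises with lam (here ~0.8).
corr = np.corrcoef(frame_noise[0].ravel(), frame_noise[1].ravel())[0, 1]
print(f"inter-frame noise correlation ~ {corr:.2f}")
</syntaxhighlight>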
"The coconut tree has grown tall with many coconuts hanging. The farmer and monkey look up happily. Background: bright day, peaceful field. Cartoon style with cheerful expressions."
 
"The monkey climbs up the coconut tree and throws coconuts down to the farmer, who catches them in a basket. Both laugh joyfully. Cartoon animation with playful energy."
In January 2024, [[Google]] announced development of a text-to-video model named Lumiere, which is anticipated to integrate advanced video editing capabilities.<ref>{{Cite web |last=Yirka |first=Bob |date=2024-01-26 |title=Google announces the development of Lumiere, an AI-based next-generation text-to-video generator. |url=https://techxplore.com/news/2024-01-google-lumiere-ai-based-generation.html |access-date=2024-11-18 |website=Tech Xplore}}</ref> [[Matthias Niessner]] and [[Lourdes Agapito]] at AI company [[Synthesia (company)|Synthesia]] are developing 3D neural rendering techniques that synthesize realistic video from 2D and 3D neural representations of shape, appearance, and motion, enabling controllable video synthesis of avatars.<ref>{{Cite web |title=Text to Speech for Videos |url=https://www.synthesia.io/text-to-speech |access-date=2023-10-17 |website=Synthesia.io}}</ref> In June 2024, Luma Labs launched its [[Dream Machine (text-to-video model)|Dream Machine]] video tool.<ref>{{Cite web |last=Nuñez |first=Michael |date=2024-06-12 |title=Luma AI debuts 'Dream Machine' for realistic video generation, heating up AI media race |url=https://venturebeat.com/ai/luma-ai-debuts-dream-machine-for-realistic-video-generation-heating-up-ai-media-race/ |access-date=2024-11-18 |website=VentureBeat |language=en-US}}</ref><ref>{{Cite web |last=Fink |first=Charlie |title=Apple Debuts Intelligence, Mistral Raises $600 Million, New AI Text-To-Video |url=https://www.forbes.com/sites/charliefink/2024/06/13/apple-debuts-intelligence-mistral-raises-600-million-new-ai-text-to-video/ |access-date=2024-11-18 |website=Forbes |language=en}}</ref> That same month, [[Kuaishou]] extended its Kling AI text-to-video model to international users.<ref>{{Cite web |last=Franzen |first=Carl |date=2024-06-12 |title=What you need to know about Kling, the AI video generator rival to Sora that's wowing creators |url=https://venturebeat.com/ai/what-you-need-to-know-about-kling-the-ai-video-generator-rival-to-sora-thats-wowing-creators/ |access-date=2024-11-18 |website=VentureBeat |language=en-US}}</ref> In July 2024, [[TikTok]] owner [[ByteDance]] released Jimeng AI in China through its subsidiary Faceu Technology.<ref>{{Cite web |date=2024-08-06 |title=ByteDance joins OpenAI's Sora rivals with AI video app launch |url=https://www.reuters.com/technology/artificial-intelligence/bytedance-joins-openais-sora-rivals-with-ai-video-app-launch-2024-08-06/ |access-date=2024-11-18 |publisher=[[Reuters]]}}</ref> In September 2024, the Chinese AI company [[MiniMax (company)|MiniMax]] debuted its video-01 model, joining other established Chinese AI model companies such as [[Zhipu AI]], [[Baichuan]], and [[Moonshot AI]].<ref>{{Cite web |date=2024-09-02 |title=Chinese AI "tiger" MiniMax launches text-to-video-generating model to rival OpenAI's Sora |url=https://finance.yahoo.com/news/chinese-ai-tiger-minimax-launches-093000322.html |access-date=2024-11-18 |website=Yahoo! Finance}}</ref>
"The farmer and monkey sit together on a swing made of rope tied to a tree, drinking coconut juice with straws, smiling and swinging slowly. Beautiful sunny background. Cartoon style, vibrant and happy
 
Alternative approaches to text-to-video models include<ref>{{Citation |title=Text2Video-Zero |date=2023-08-12 |url=https://github.com/Picsart-AI-Research/Text2Video-Zero |access-date=2023-08-12 |publisher=Picsart AI Research (PAIR)}}</ref> Google's Phenaki, Hour One, [[Colossyan]],<ref name=":5" /> [[Runway (company)|Runway]]'s Gen-3 Alpha,<ref>{{Cite web |last=Kemper |first=Jonathan |date=2024-07-01 |title=Runway's Sora competitor Gen-3 Alpha now available |url=https://the-decoder.com/runways-sora-competitor-gen-3-alpha-now-available/ |access-date=2024-11-18 |website=THE DECODER |language=en-US}}</ref><ref>{{Cite news |date=2023-03-20 |title=Generative AI's Next Frontier Is Video |url=https://www.bloomberg.com/news/articles/2023-03-20/generative-ai-s-next-frontier-is-video |access-date=2024-11-18 |work=Bloomberg.com |language=en}}</ref> and OpenAI's [[Sora (text-to-video model)|Sora]].<ref>{{Cite web |date=2024-02-15 |title=OpenAI teases 'Sora,' its new text-to-video AI model |url=https://www.nbcnews.com/tech/tech-news/openai-sora-video-artificial-intelligence-unveiled-rcna139065 |access-date=2024-11-18 |website=NBC News |language=en}}</ref><ref>{{Cite web |last=Kelly |first=Chris |date=2024-06-25 |title=Toys R Us creates first brand film to use OpenAI's text-to-video tool |url=https://www.marketingdive.com/news/toys-r-us-openai-sora-gen-ai-first-text-video/719797/ |access-date=2024-11-18 |website=Marketing Dive |publisher=[[Informa]] |language=en-US}}</ref> Several additional text-to-video models, such as Plug-and-Play, Text2LIVE, and Tune-A-Video, have emerged.<ref>{{Cite book |last1=Jin |first1=Jiayao |last2=Wu |first2=Jianhang |last3=Xu |first3=Zhoucheng |last4=Zhang |first4=Hang |last5=Wang |first5=Yaxin |last6=Yang |first6=Jielong |chapter=Text to Video: Enhancing Video Generation Using Diffusion Models and Reconstruction Network |date=2023-08-04 |title=2023 2nd International Conference on Computing, Communication, Perception and Quantum Technology (CCPQT) |chapter-url=https://ieeexplore.ieee.org/document/10336607 |publisher=IEEE |pages=108–114 |doi=10.1109/CCPQT60491.2023.00024 |isbn=979-8-3503-4269-7}}</ref> [[FLUX.1]] developer Black Forest Labs has announced its text-to-video model SOTA.<ref>{{Cite web |date=2024-08-01 |title=Announcing Black Forest Labs |url=https://blackforestlabs.ai/announcing-black-forest-labs/ |access-date=2024-11-18 |website=Black Forest Labs |language=en-US}}</ref> [[Google]] was preparing to launch a video generation tool named [[Veo (text-to-video model)|Veo]] for [[YouTube Shorts]] in 2025.<ref>{{Cite web |last=Forlini |first=Emily Dreibelbis |date=2024-09-18 |title=Google's veo text-to-video AI generator is coming to YouTube shorts |url=https://www.pcmag.com/news/googles-veo-text-to-video-ai-generator-is-coming-to-youtube-shorts |access-date=2024-11-18 |website=[[PC Magazine]]}}</ref> In May 2025, Google launched Veo 3, the next iteration of the model, which was noted for its audio generation capabilities, addressing a previous limitation of text-to-video models.<ref>{{Cite web |last1=Elias |first1=Jennifer |last2=Subin |first2=Samantha |date=2025-05-20 |title=Google launches Veo 3, an AI video generator that incorporates audio |url=https://www.cnbc.com/2025/05/20/google-ai-video-generator-audio-veo-3.html |access-date=2025-05-22 |website=CNBC |language=en}}</ref>
"The cartoon farmer sits on the ground crying beside the broken coconut tree. The monkey tries to comfort him. Background shows fallen coconuts and snapped tree. Sad cartoon tone."
"Cartoon monkey looks at the camera and says ‘Like aur Subscribe karna mat bhoolna!’ while pointing at a broken signboard with 'Subscribe' written on it. Farmer wipes tears and nods. Fun ending screen, cartoon YouTube style."
 
== Architecture and training ==