== Models ==
{{Globalize|section|date=August 2024}}
There are different models, including [[open source]] models. CogVideo, which accepts Chinese-language input,<ref name=":5">{{Cite web |last=Wodecki |first=Ben |date=2023-08-11 |title=Text-to-Video Generative AI Models: The Definitive List |url=https://aibusiness.com/nlp/ai-video-generation-the-supreme-list |access-date=2024-11-18 |website=AI Business |publisher=[[Informa]]}}</ref> is the earliest text-to-video model to be developed, with 9.4 billion parameters; a demo version of its open-source code was first presented on [[GitHub]] in 2022.<ref>{{Citation |title=CogVideo |date=2022-10-12 |url=https://github.com/THUDM/CogVideo |publisher=THUDM |access-date=2022-10-12}}</ref> That year, [[Meta Platforms]] released a partial text-to-video model called "Make-A-Video",<ref>{{Cite web |last=Davies |first=Teli |date=2022-09-29 |title=Make-A-Video: Meta AI's New Model For Text-To-Video Generation |url=https://wandb.ai/telidavies/ml-news/reports/Make-A-Video-Meta-AI-s-New-Model-For-Text-To-Video-Generation--VmlldzoyNzE4Nzcx |access-date=2022-10-12 |website=Weights & Biases |language=en}}</ref><ref>{{Cite web |last=Monge |first=Jim Clyde |date=2022-08-03 |title=This
In March 2023, a research paper titled "VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation" was published, presenting a novel approach to video generation.<ref name="VideoFusion">{{Cite arXiv |eprint=2303.08320 |class=cs.CV |first1=Zhengxiong |last1=Luo |first2=Dayou |last2=Chen |title=VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation |date=2023 |last3=Zhang |first3=Yingya |last4=Huang |first4=Yan |last5=Wang |first5=Liang |last6=Shen |first6=Yujun |last7=Zhao |first7=Deli |last8=Zhou |first8=Jingren |last9=Tan |first9=Tieniu}}</ref> The VideoFusion model decomposes the per-frame noise in the diffusion process into two components: a base noise that is shared across frames, which promotes temporal coherence, and a per-frame residual noise. By using a pre-trained image diffusion model as a base generator, the model efficiently generates high-quality and coherent videos. Fine-tuning the pre-trained model on video data addresses the ___domain gap between image and video data, enhancing its ability to produce realistic and consistent video sequences.<ref name="VideoFusion" /> In the same month, [[Adobe Inc.|Adobe]] introduced Firefly AI as part of its product suite.<ref>{{Cite web |date=2024-10-10 |title=Adobe launches Firefly Video model and enhances image, vector and design models. Adobe Newsroom |url=https://news.adobe.com/news/2024/10/101424-adobe-launches-firefly-video-model |access-date=2024-11-18 |publisher=[[Adobe Inc.]]}}</ref>
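The noise decomposition can be illustrated with a minimal, hypothetical sketch (not code from the paper): each frame's noise is formed from a base component shared by all frames plus a per-frame residual component, with an assumed weighting parameter <code>mix</code> controlling how strongly frames are correlated.

<syntaxhighlight lang="python">
import numpy as np

def decomposed_video_noise(num_frames, frame_shape, mix=0.5, seed=0):
    """Illustrative sketch of a VideoFusion-style noise decomposition.

    Each frame's noise is a weighted sum of a base component shared by all
    frames (encouraging temporal coherence) and a per-frame residual
    component (allowing frame-to-frame variation). ``mix`` is an assumed
    illustrative parameter, not a value taken from the paper.
    """
    rng = np.random.default_rng(seed)
    base = rng.standard_normal(frame_shape)          # shared across all frames
    noises = []
    for _ in range(num_frames):
        residual = rng.standard_normal(frame_shape)  # unique to each frame
        # Square-root weights keep unit variance for i.i.d. N(0, 1) components.
        noises.append(np.sqrt(mix) * base + np.sqrt(1.0 - mix) * residual)
    return np.stack(noises)                          # shape: (num_frames, *frame_shape)

# Example: 16 frames of 64x64 noise with a strong shared (coherent) component.
video_noise = decomposed_video_noise(num_frames=16, frame_shape=(64, 64), mix=0.8)
print(video_noise.shape)  # (16, 64, 64)
</syntaxhighlight>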