{{short description|Machine learning model}}
A '''text-to-video model''' is a [[machine learning]] model which takes a [[natural language]] description as input and produces a [[video]] or multiple videos matching that description.<ref name="AIIR">{{cite report|url=https://aiindex.stanford.edu/wp-content/uploads/2023/04/HAI_AI-Index-Report_2023.pdf|title=Artificial Intelligence Index Report 2023|publisher=Stanford Institute for Human-Centered Artificial Intelligence|page=98|quote=Multiple high quality text-to-video models, AI systems that can generate video clips from prompted text, were released in 2022.}}</ref>
 
Video prediction, which renders realistic objects against a stable background, can be performed with a [[recurrent neural network]] in a sequence-to-sequence model, with a [[convolutional neural network]] encoding and decoding each frame pixel by pixel,<ref>{{Cite web |title=Leading India |url=https://www.leadingindia.ai/downloads/projects/VP/vp_16.pdf}}</ref> creating video using [[deep learning]].<ref>{{Cite web |last=Narain |first=Rohit |date=2021-12-29 |title=Smart Video Generation from Text Using Deep Neural Networks |url=https://www.datatobiz.com/blog/smart-video-generation-from-text/ |access-date=2022-10-12 |language=en-US}}</ref> The [[data set]] can be tested in a conditional [[generative model]] for existing information from text using a [[variational autoencoder]] and a [[generative adversarial network]] (GAN).
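The encode–recur–decode pipeline described above can be sketched as a toy model. This is a minimal illustration only: random linear projections stand in for a trained convolutional encoder/decoder, and a simple Elman-style recurrence stands in for the sequence-to-sequence network; all dimensions and weights are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: 8x8 grayscale frames, flattened to 64-d vectors.
H = W = 8
frame_dim = H * W
hidden_dim = 32

# Stand-ins for a learned convolutional encoder/decoder: random linear
# projections (a real system would use a trained CNN per frame).
W_enc = rng.normal(0, 0.1, (hidden_dim, frame_dim))
W_dec = rng.normal(0, 0.1, (frame_dim, hidden_dim))

# Simple recurrent cell carrying temporal state across frames.
W_h = rng.normal(0, 0.1, (hidden_dim, hidden_dim))
W_x = rng.normal(0, 0.1, (hidden_dim, hidden_dim))

def predict_next_frames(frames, n_future):
    """Encode each observed frame, roll the recurrent state forward,
    then decode hidden states back to pixel space autoregressively."""
    h = np.zeros(hidden_dim)
    for f in frames:                       # encoder pass over observed frames
        x = W_enc @ f.reshape(-1)
        h = np.tanh(W_h @ h + W_x @ x)
    out = []
    for _ in range(n_future):              # autoregressive decoding
        frame = (W_dec @ h).reshape(H, W)
        out.append(frame)
        x = W_enc @ frame.reshape(-1)
        h = np.tanh(W_h @ h + W_x @ x)
    return out

clip = [rng.normal(size=(H, W)) for _ in range(5)]
future = predict_next_frames(clip, n_future=3)
print(len(future), future[0].shape)        # 3 (8, 8)
```

The key design point is that the CNN handles per-frame spatial structure while the recurrent state carries temporal continuity between frames.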
 
== Methodology ==
* Data collection and data set preparation using clear clips from kinetic human action videos.
* Training the [[convolutional neural network]] to generate video.
* Keyword extraction from text using [[natural language processing]].
* Testing of the data set in a conditional generative model for existing static and dynamic information from text, using a [[variational autoencoder]] and a [[generative adversarial network]].
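The keyword-extraction step above can be illustrated with a toy example. Production systems use learned language models rather than stop-word filtering; the stop-word list and function name here are hypothetical stand-ins.

```python
# Toy keyword extraction: lowercase, tokenize, drop common stop words.
# A real pipeline would use a trained NLP model instead.
STOP_WORDS = {"a", "an", "the", "is", "of", "in", "on", "and"}

def extract_keywords(prompt):
    tokens = prompt.lower().replace(",", " ").split()
    return [t for t in tokens if t not in STOP_WORDS]

print(extract_keywords("A dog running on the beach"))
# ['dog', 'running', 'beach']
```

The extracted keywords would then condition the generative model on the static content (objects) and dynamic content (actions) of the prompt.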
 
== Models ==
{{Update section|date=February 2024}}
There are different models, including [[open source]] models. CogVideo is an early text-to-video model "of 9.4 billion parameters", with a demo version of its code presented on [[GitHub]].<ref>{{Citation |title=CogVideo |date=2022-10-12 |url=https://github.com/THUDM/CogVideo |publisher=THUDM |access-date=2022-10-12}}</ref> [[Meta Platforms]] has a partial text-to-video model{{NoteTag|It can also generate videos from images, insert video between two images, and produce variations of videos.|name=}} called "Make-A-Video".<ref>{{Cite web |last=Davies |first=Teli |date=2022-09-29 |title=Make-A-Video: Meta AI's New Model For Text-To-Video Generation |url=https://wandb.ai/telidavies/ml-news/reports/Make-A-Video-Meta-AI-s-New-Model-For-Text-To-Video-Generation--VmlldzoyNzE4Nzcx |access-date=2022-10-12 |website=Weights & Biases |language=en}}</ref><ref>{{Cite web |last=Monge |first=Jim Clyde |date=2022-08-03 |title=This AI Can Create Video From Text Prompt |url=https://betterprogramming.pub/this-ai-can-create-video-from-text-prompt-6904439d7aba |access-date=2022-10-12 |website=Medium |language=en}}</ref><ref>{{Cite web |title=Meta's Make-A-Video AI creates videos from text |url=https://www.fonearena.com/blog/375627/meta-make-a-video-ai-create-videos-from-text.html |access-date=2022-10-12 |website=www.fonearena.com}}</ref> [[Google]]'s [[Google Brain|Brain]] has released a research paper introducing Imagen Video, a text-to-video model with a 3D [[U-Net]].<ref>{{Cite web |title=google: Google takes on Meta, introduces own video-generating AI - The Economic Times |url=https://m.economictimes.com/tech/technology/google-takes-on-meta-introduces-own-video-generating-ai/amp_articleshow/94681128.cms?amp_gsa=1&amp_js_v=a9&usqp=mq331AQKKAFQArABIIACAw==#amp_tf=From%20%251$s&aoh=16655942495197&referrer=https://www.google.com&ampshare=https://m.economictimes.com/tech/technology/google-takes-on-meta-introduces-own-video-generating-ai/articleshow/94681128.cms |access-date=2022-10-12 |website=m.economictimes.com}}</ref><ref>{{Cite web |last=Monge |first=Jim Clyde |date=2022-08-03 |title=This AI Can Create Video From Text Prompt |url=https://betterprogramming.pub/this-ai-can-create-video-from-text-prompt-6904439d7aba |access-date=2022-10-12 |website=Medium |language=en}}</ref><ref>{{Cite web |title=Nuh-uh, Meta, we can do text-to-video AI, too, says Google |url=https://www.theregister.com/AMP/2022/10/06/google_ai_imagen_video/ |access-date=2022-10-12 |website=www.theregister.com}}</ref><ref>{{Cite web |title=Papers with Code - See, Plan, Predict: Language-guided Cognitive Planning with Video Prediction |url=https://paperswithcode.com/paper/see-plan-predict-language-guided-cognitive |access-date=2022-10-12 |website=paperswithcode.com |language=en}}</ref><ref>{{Cite web |title=Papers with Code - Text-driven Video Prediction |url=https://paperswithcode.com/paper/text-driven-video-prediction |access-date=2022-10-12 |website=paperswithcode.com |language=en}}</ref>
 
Antonia Antonova presented another model.<ref>{{Cite web |title=Text to Video Generation |url=https://antonia.space/text-to-video-generation |access-date=2022-10-12 |website=Antonia Antonova |language=en-US}}</ref>
 
In March 2023, a landmark research paper by Alibaba Research was published, applying many of the principles found in latent image diffusion models to video generation.<ref>{{Cite web |title=Home - DAMO Academy |url=https://damo.alibaba.com/ |access-date=2023-08-12 |website=damo.alibaba.com}}</ref><ref>{{Cite arXiv |last1=Luo |first1=Zhengxiong |last2=Chen |first2=Dayou |last3=Zhang |first3=Yingya |last4=Huang |first4=Yan |last5=Wang |first5=Liang |last6=Shen |first6=Yujun |last7=Zhao |first7=Deli |last8=Zhou |first8=Jingren |last9=Tan |first9=Tieniu |date=2023 |title=VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation |class=cs.CV |eprint=2303.08320}}</ref> Services like Kaiber and Reemix have since adopted similar approaches to video generation in their respective products.
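The core idea behind latent diffusion, also used in the video setting, is to corrupt compressed latents with Gaussian noise according to a fixed schedule and train a network to reverse that corruption. A minimal sketch of the standard closed-form forward (noising) process follows; the latent shape and schedule values are toy assumptions, and a real model would obtain latents from a pretrained autoencoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy latent video: 4 frames of 8x8 latents (a real model would get
# these from a pretrained image autoencoder).
latents = rng.normal(size=(4, 8, 8))

# Linear beta schedule, as in standard DDPM-style diffusion.
T = 100
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)   # cumulative signal-retention factor

def add_noise(x0, t):
    """Closed-form forward process:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

noisy = add_noise(latents, t=50)
print(noisy.shape)   # (4, 8, 8)
```

A denoising network (typically a U-Net extended with temporal layers for video) is then trained to predict the noise added at each step, and generation runs this process in reverse from pure noise.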
 
[[Matthias Niessner]] (TUM) and [[Lourdes Agapito]] (UCL) at AI company [[Synthesia (company)|Synthesia]] work on developing 3D neural rendering techniques that can synthesise realistic video. The goal is to improve existing text-to-video models by using 2D and 3D neural representations of shape, appearance, and motion for controllable video synthesis of avatars that look and sound like real people.<ref>{{Cite web |title=Text to Speech for Videos |url=https://www.synthesia.io/text-to-speech |access-date=2023-10-17}}</ref>
 
Although alternative approaches to text-to-video models exist,<ref>{{Citation |title=Text2Video-Zero |date=2023-08-12 |url=https://github.com/Picsart-AI-Research/Text2Video-Zero |access-date=2023-08-12 |publisher=Picsart AI Research (PAIR)}}</ref> full latent diffusion models are currently regarded as the state of the art for video diffusion.
 
== See also ==
* [[Text-to-image model]]
* [[VideoPoet]], an unreleased Google model, precursor of [[Lumiere (text-to-video model)|Lumiere]]
* [[Sora (text-to-video model)|Sora]], unreleased OpenAI model
* [[Runway (company)|Runway]], the company developing Gen-1 and Gen-2 models