A '''text-to-video model''' is a [[machine learning model]] whichthat takes a [[natural language]] description as input and producingproduces a [[video]] orrelevant multiples videos fromto the input text.<ref name="AIIR">{{cite report|url=https://aiindex.stanford.edu/wp-content/uploads/2023/04/HAI_AI-Index-Report_2023.pdf|title=Artificial Intelligence Index Report 2023|publisher=Stanford Institute for Human-Centered Artificial Intelligence|page=98|quote=Multiple high quality text-to-video models, AI systems that can generate video clips from prompted text, were released in 2022.}}</ref> Recent advancements in generating high-quality, text-conditioned videos have largely been driven by the development of video diffusion models.<ref>{{Citation |last=Melnik |first=Andrew |title=Video Diffusion Models: A Survey |date=2024-05-06 |url=http://arxiv.org/abs/2405.03150 |access-date=2024-06-21 |doi=10.48550/arXiv.2405.03150 |last2=Ljubljanac |first2=Michal |last3=Lu |first3=Cong |last4=Yan |first4=Qi |last5=Ren |first5=Weiming |last6=Ritter |first6=Helge}}</ref>
Video prediction on making objects realistic in a stable background is performed by using [[recurrent neural network]] for a sequence to sequence model with a connector [[convolutional neural network]] encoding and decoding each frame pixel by pixel,<ref>{{Cite web |title=Leading India |url=https://www.leadingindia.ai/downloads/projects/VP/vp_16.pdf}}</ref> creating video using [[deep learning]].<ref>{{Cite web |last=Narain |first=Rohit |date=2021-12-29 |title=Smart Video Generation from Text Using Deep Neural Networks |url=https://www.datatobiz.com/blog/smart-video-generation-from-text/ |access-date=2022-10-12 |language=en-US}}</ref> Testing of the [[data set]] in conditional [[generative model]] for existing information from text can be done by [[variational autoencoder]] and [[generative adversarial network]] (GAN).