A text-to-video model is a machine learning model that takes a natural language description as input and produces a video matching that description.[1] Advancements during the 2020s in the generation of high-quality, text-conditioned videos have largely been driven by the development of video diffusion models.[2]
Example prompt used by OpenAI to demonstrate its Sora model: "A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about."
"A cartoon farmer wearing a red dhoti and turban is happily walking from the city holding a small coconut plant. Bright sunny day, rural background turning to urban as he walks. Cartoon style, vibrant colors." A cute cartoon monkey with a metal bucket runs playfully to the river, fills the bucket with water, and carries it back cheerfully. Background: lush green forest with a flowing river. Cartoon style, bright and fun." "The cartoon farmer digs a pit in the ground using a shovel, then gently places the coconut plant into it. The monkey watches curiously from the side. Rural farm setting, cartoon style." "The farmer and monkey pour lots of water onto the planted coconut tree using buckets. Water splashes joyfully. The plant looks fresh and happy. Cartoon animation style
"The coconut tree has grown tall with many coconuts hanging. The farmer and monkey look up happily. Background: bright day, peaceful field. Cartoon style with cheerful expressions." "The monkey climbs up the coconut tree and throws coconuts down to the farmer, who catches them in a basket. Both laugh joyfully. Cartoon animation with playful energy." "The farmer and monkey sit together on a swing made of rope tied to a tree, drinking coconut juice with straws, smiling and swinging slowly. Beautiful sunny background. Cartoon style, vibrant and happy A big cartoon bear walks in angrily, grabs the coconut tree and breaks it down with force. The monkey and farmer look shocked and scared. Dramatic cartoon style, forest in background "The cartoon farmer sits on the ground crying beside the broken coconut tree. The monkey tries to comfort him. Background shows fallen coconuts and snapped tree. Sad cartoon tone." "Cartoon monkey looks at the camera and says ‘Like aur Subscribe karna mat bhoolna!’ while pointing at a broken signboard with 'Subscribe' written on it. Farmer wipes tears and nods. Fun ending screen, cartoon YouTube style."
Architecture and training
Several architectures have been used to create text-to-video models. Similar to text-to-image models, they can be built on recurrent neural networks (RNNs) such as long short-term memory (LSTM) networks, which have been used in pixel-transformation models and stochastic video-generation models to aid consistency and realism, respectively.[3] Transformer models are an alternative to RNNs. Generative adversarial networks (GANs), variational autoencoders (VAEs), which can aid in the prediction of human motion,[4] and diffusion models have also been used for the image-generation components of these models.[5]
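As a rough illustration of the diffusion approach, the sketch below runs a toy text-conditioned denoising loop over a video tensor in Python (PyTorch). Everything here is a simplifying assumption: the TinyDenoiser network, the noise schedule, and the tensor shapes are placeholders for exposition, not the architecture of any published model.

```python
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    """Toy stand-in for a text-conditioned denoising network.

    Real video diffusion models use spatio-temporal U-Nets or
    transformers; this is only a shape-compatible placeholder.
    """
    def __init__(self, channels=3, text_dim=64):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, channels)
        self.conv = nn.Conv3d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x, text_emb):
        # x: (batch, channels, frames, height, width)
        cond = self.text_proj(text_emb)[:, :, None, None, None]
        return self.conv(x + cond)  # predicted noise

@torch.no_grad()
def sample(model, text_emb, steps=50, shape=(1, 3, 16, 32, 32)):
    """DDPM-style ancestral sampling, heavily simplified."""
    x = torch.randn(shape)                     # start from pure noise
    betas = torch.linspace(1e-4, 0.02, steps)  # toy noise schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    for t in reversed(range(steps)):
        eps = model(x, text_emb)               # predict the noise
        # Remove the predicted noise component (simplified update rule).
        x = (x - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x  # denoised video tensor

model = TinyDenoiser()
text_emb = torch.randn(1, 64)  # stand-in for a real text encoder output
video = sample(model, text_emb)
print(video.shape)  # torch.Size([1, 3, 16, 32, 32])
```

The denoiser is applied once per timestep over the full frame stack, which is why every frame of the clip must fit in memory at once.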
Text-video datasets used to train models include, but are not limited to, WebVid-10M, HDVILA-100M, CCV, ActivityNet, and Panda-70M.[6][7] These datasets contain millions of original videos of interest, generated videos, captioned videos, and textual metadata that help train models for accuracy. Prompt datasets used to train models include, but are not limited to, PromptSource, DiffusionDB, and VidProM.[6][7] These datasets provide the range of text inputs needed to teach models how to interpret a variety of textual prompts.
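To illustrate how such paired data is typically consumed during training, the following minimal sketch wraps a caption-video corpus in a dataset class. The CSV layout, column names, and the load_video_frames helper are hypothetical; real corpora such as WebVid-10M ship with their own metadata formats and loaders.

```python
import csv
import torch
from torch.utils.data import Dataset

def load_video_frames(path, num_frames=16, size=64):
    """Hypothetical loader: a real one would decode the file with a
    library such as PyAV. Random frames keep the example self-contained."""
    return torch.rand(num_frames, 3, size, size)

class CaptionedVideoDataset(Dataset):
    """Pairs each video clip with its text caption, as text-video
    training corpora do."""
    def __init__(self, metadata_csv):
        with open(metadata_csv, newline="", encoding="utf-8") as f:
            # Assumed columns: video_path, caption
            self.rows = list(csv.DictReader(f))

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        row = self.rows[idx]
        frames = load_video_frames(row["video_path"])
        return {"frames": frames, "caption": row["caption"]}
```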
The video generation process involves synchronizing the text input with the video frames, ensuring alignment and consistency throughout the sequence.[7] This predictive process degrades in quality as video length increases, owing to resource limitations.[7]
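One common way to quantify text-frame alignment is to embed the prompt and each generated frame in a shared space and average their cosine similarities, in the spirit of CLIP-based alignment scores. The sketch below is a minimal illustration with random stand-in encoders; in practice embed_text and embed_frames would call a pretrained dual encoder.

```python
import torch
import torch.nn.functional as F

def embed_text(prompt, dim=128):
    """Stand-in for a pretrained text encoder (e.g. a CLIP text tower)."""
    torch.manual_seed(hash(prompt) % (2**31))
    return F.normalize(torch.randn(dim), dim=0)

def embed_frames(frames, dim=128):
    """Stand-in for a pretrained image encoder applied per frame."""
    return F.normalize(torch.randn(frames.shape[0], dim), dim=1)

def alignment_score(prompt, frames):
    """Mean cosine similarity between the prompt embedding and each
    frame embedding; higher means the video tracks the text better."""
    t = embed_text(prompt)
    f = embed_frames(frames)
    return (f @ t).mean().item()

frames = torch.rand(16, 3, 64, 64)  # 16 generated frames
print(alignment_score("a dog running on a beach", frames))
```

A per-frame score like this also exposes temporal drift: a clip whose later frames score much lower than its early frames has wandered away from the prompt.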
Limitations
Despite rapid advances in the performance of text-to-video models, a primary limitation is that they are computationally heavy, which limits their capacity to produce high-quality, lengthy outputs.[8][9] Additionally, these models require large amounts of specific training data to generate high-quality, coherent outputs, which raises the issue of accessibility.[9][8]
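A back-of-the-envelope calculation illustrates the computational burden: memory for a raw video tensor grows linearly with frame count, while full spatio-temporal self-attention grows quadratically with the number of tokens. The resolution, frame rate, and patch size below are illustrative assumptions, not measurements of any particular model.

```python
# Rough memory estimate for one fp16 video tensor at 24 fps.
height, width, channels = 512, 512, 3  # assumed RGB resolution
bytes_per_value = 2                    # fp16

for seconds in (2, 10, 60):
    frames = 24 * seconds
    values = frames * channels * height * width
    gib = values * bytes_per_value / 2**30
    # Full spatio-temporal self-attention scales with tokens squared.
    tokens = frames * (height // 16) * (width // 16)  # assumed 16x16 patches
    print(f"{seconds:3d}s: tensor ~ {gib:5.2f} GiB, attention pairs ~ {tokens**2:.2e}")
```

Going from a 2-second to a 60-second clip multiplies the tensor size by 30 but the attention cost by roughly 900, which is why most released systems cap clips at a few seconds.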
Moreover, models may misinterpret textual prompts, resulting in video outputs that deviate from the intended meaning. This can occur due to limitations in capturing semantic context embedded in text, which affects the model's ability to align generated video with the user's intended message.[9][7] Various models, including Make-A-Video, Imagen Video, Phenaki, CogVideo, GODIVA, and NUWA, are currently being tested and refined to enhance their alignment capabilities and overall performance in text-to-video generation.[9]
Another issue is that text and fine details in AI-generated videos often appear garbled, a problem that image diffusion models such as Stable Diffusion also struggle with. Common examples include distorted hands and unreadable text.
Ethics
The deployment of Text-to-Video models raises ethical considerations related to content generation. These models have the potential to create inappropriate or unauthorized content, including explicit material, graphic violence, misinformation, and likenesses of real individuals without consent.[6] Ensuring that AI-generated content complies with established standards for safe and ethical usage is essential, as content generated by these models may not always be easily identified as harmful or misleading. The ability of AI to recognize and filter out NSFW or copyrighted content remains an ongoing challenge, with implications for both creators and audiences.[6]
Impacts and applications
Text-to-video models offer a broad range of applications across fields from education and marketing to the creative industries. They can streamline the creation of training videos, movie previews, gaming assets, and visualizations.[10]
During the Russo-Ukrainian War, fake videos made with artificial intelligence were created as part of a propaganda campaign against Ukraine and shared on social media. These included depictions of children in the Ukrainian Armed Forces, fake advertisements encouraging children to denounce critics of the Ukrainian government, and fictitious statements by Ukrainian President Volodymyr Zelenskyy about the country's surrender, among others.[11][12][13][14][15][16]
Comparison of models
Model/Product | Company | Year released | Status | Key features | Capabilities | Pricing | Video length | Supported languages |
---|---|---|---|---|---|---|---|---|
Synthesia | Synthesia | 2019 | Released | AI avatars, multilingual support for 60+ languages, customization options[17] | Specialized in realistic AI avatars for corporate training and marketing[17] | Subscription-based, starting around $30/month | Varies based on subscription | 60+ |
Vexub | Vexub | 2023 | Released | Text-to-video from prompt, focus on TikTok and YouTube storytelling formats for social media[18] | Generates AI videos (1–15 mins) from text prompts; includes editing and voice features[18] | Subscription-based, with various plans | Up to ~15 minutes | 70+ |
InVideo AI | InVideo | 2021 | Released | AI-powered video creation, large stock library, AI talking avatars[17] | Tailored for social media content with platform-specific templates[17] | Free plan available, Paid plans starting at $16/month | Varies depending on content type | Multiple (not specified) |
Fliki | Fliki AI | 2022 | Released | Text-to-video with AI avatars and voices, extensive language and voice support[17] | Supports 65+ AI avatars and 2,000+ voices in 70 languages[17] | Free plan available, Paid plans starting at $30/month | Varies based on subscription | 70+ |
Runway Gen-2 | Runway AI | 2023 | Released | Multimodal video generation from text, images, or videos[19] | High-quality visuals, various modes like stylization and storyboard[19] | Free trial, Paid plans (details not specified) | Up to 16 seconds | Multiple (not specified) |
Pika Labs | Pika Labs | 2024 | Beta | Dynamic video generation, camera and motion customization[20] | User-friendly, focused on natural dynamic generation[20] | Currently free during beta | Flexible, supports longer videos with frame continuation | Multiple (not specified) |
Runway Gen-3 Alpha | Runway AI | 2024 | Alpha | Enhanced visual fidelity, photorealistic humans, fine-grained temporal control[21] | Ultra-realistic video generation with precise key-framing and industry-level customization[21] | Free trial available, custom pricing for enterprises | Up to 10 seconds per clip, extendable | Multiple (not specified) |
Google Veo | Google | 2024 | Released | Google Gemini prompting, voice acting, sound effects, background music; cinema-style realistic videos[22] | Can generate very realistic and detailed character models, scenes, and clips, with matching voice acting, ambient sounds, and background music; ability to extend clips with continuity[23] | Varies ($250 Google Pro/Ultra AI subscription, plus additional AI credit top-ups) | 8 seconds per individual clip (clips can be continued/extended as separate clips) | 50+ |
OpenAI Sora | OpenAI | 2024 | Alpha | Deep language understanding, high-quality cinematic visuals, multi-shot videos[24] | Capable of creating detailed, dynamic, and emotionally expressive videos; still under development with safety measures[24] | Pricing not yet disclosed | Expected to generate longer videos; duration specifics TBD | Multiple (not specified) |
See also
- Text-to-image model
- AI slop
- VideoPoet, an unreleased Google model and precursor of Lumiere
- Deepfake
- Human image synthesis
- ChatGPT
References
- ^ Artificial Intelligence Index Report 2023 (PDF) (Report). Stanford Institute for Human-Centered Artificial Intelligence. p. 98. "Multiple high quality text-to-video models, AI systems that can generate video clips from prompted text, were released in 2022."
- ^ Melnik, Andrew; Ljubljanac, Michal; Lu, Cong; Yan, Qi; Ren, Weiming; Ritter, Helge (6 May 2024). "Video Diffusion Models: A Survey". arXiv:2405.03150 [cs.CV].
- ^ Bhagwatkar, Rishika; Bachu, Saketh; Fitter, Khurshed; Kulkarni, Akshay; Chiddarwar, Shital (17 December 2020). "A Review of Video Generation Approaches". 2020 International Conference on Power, Instrumentation, Control and Computing (PICC). IEEE. pp. 1–5. doi:10.1109/PICC51425.2020.9362485. ISBN 978-1-7281-7590-4.
- ^ Kim, Taehoon; Kang, ChanHee; Park, JaeHyuk; Jeong, Daun; Yang, ChangHee; Kang, Suk-Ju; Kong, Kyeongbo (3 January 2024). "Human Motion Aware Text-to-Video Generation with Explicit Camera Control". 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). IEEE. pp. 5069–5078. doi:10.1109/WACV57701.2024.00500. ISBN 979-8-3503-1892-0.
- ^ Singh, Aditi (9 May 2023). "A Survey of AI Text-to-Image and AI Text-to-Video Generators". 2023 4th International Conference on Artificial Intelligence, Robotics and Control (AIRC). IEEE. pp. 32–36. arXiv:2311.06329. doi:10.1109/AIRC57904.2023.10303174. ISBN 979-8-3503-4824-8.
- ^ a b c d Miao, Yibo; Zhu, Yifan; Dong, Yinpeng; Yu, Lijia; Zhu, Jun; Gao, Xiao-Shan (8 September 2024). "T2VSafetyBench: Evaluating the Safety of Text-to-Video Generative Models". arXiv:2407.05965 [cs.CV].
- ^ a b c d e Zhang, Ji; Mei, Kuizhi; Wang, Xiao; Zheng, Yu; Fan, Jianping (August 2018). "From Text to Video: Exploiting Mid-Level Semantics for Large-Scale Video Classification". 2018 24th International Conference on Pattern Recognition (ICPR). IEEE. pp. 1695–1700. doi:10.1109/ICPR.2018.8545513. ISBN 978-1-5386-3788-3.
- ^ a b Bhagwatkar, Rishika; Bachu, Saketh; Fitter, Khurshed; Kulkarni, Akshay; Chiddarwar, Shital (17 December 2020). "A Review of Video Generation Approaches". 2020 International Conference on Power, Instrumentation, Control and Computing (PICC). IEEE. pp. 1–5. doi:10.1109/PICC51425.2020.9362485. ISBN 978-1-7281-7590-4.
- ^ a b c d Singh, Aditi (9 May 2023). "A Survey of AI Text-to-Image and AI Text-to-Video Generators". 2023 4th International Conference on Artificial Intelligence, Robotics and Control (AIRC). IEEE. pp. 32–36. arXiv:2311.06329. doi:10.1109/AIRC57904.2023.10303174. ISBN 979-8-3503-4824-8.
- ^ Singh, Aditi (9 May 2023). "A Survey of AI Text-to-Image and AI Text-to-Video Generators". 2023 4th International Conference on Artificial Intelligence, Robotics and Control (AIRC). IEEE. pp. 32–36. arXiv:2311.06329. doi:10.1109/AIRC57904.2023.10303174. ISBN 979-8-3503-4824-8.
- ^ ქურასბედიანი, ალექსი (9 June 2025). "AI-Generated Photo Of Ukrainian Children In Military Uniforms Circulated Online". Mythdetector.com. Retrieved 16 June 2025.
- ^ "Fake Ukraine ad urges kids to report relatives enjoying Russian music". euronews. 28 March 2025. Retrieved 16 June 2025.
- ^ "Photos of Ukrainian children generated by artificial intelligence". behindthenews.ua. 26 June 2024. Retrieved 16 June 2025.
- ^ "Fake Ukrainian TV advert urges children to report relatives listening to Russian music".
- ^ "Deepfake video of Zelenskyy could be 'tip of the iceberg' in info war, experts warn". NPR. 16 March 2022. Retrieved 16 June 2025.
- ^ "Ukraine war: Deepfake video of Zelenskyy telling Ukrainians to 'lay down arms' debunked". Sky News. Retrieved 16 June 2025.
- ^ a b c d e f "Top AI Video Generation Models of 2024". Deepgram. Retrieved 30 August 2024.
- ^ a b "Vexub – Text-to-video AI generator". Vexub. Retrieved 25 June 2025.
- ^ a b "Runway Research | Gen-2: Generate novel videos with text, images or video clips". runwayml.com. Retrieved 30 August 2024.
- ^ a b Sharma, Shubham (26 December 2023). "Pika Labs' text-to-video AI platform opens to all: Here's how to use it". VentureBeat. Retrieved 30 August 2024.
- ^ a b "Runway Research | Introducing Gen-3 Alpha: A New Frontier for Video Generation". runwayml.com. Retrieved 30 August 2024.
- ^ "Meet Flow, AI-powered filmmaking with Veo 3". blogs.google.com. Retrieved 6 July 2025.
- ^ "Google Veo DeepMind". google.com. Retrieved 6 July 2025.
- ^ a b "Sora | OpenAI". openai.com. Retrieved 30 August 2024.