On Sunday, Runway announced a new AI video synthesis model called Gen-3 Alpha, which is still under development but appears to produce video of similar quality to OpenAI's Sora, which debuted earlier this year (and has not yet been released). It can generate new, high-definition videos from text prompts, depicting everything from realistic humans to surrealist monsters stomping across the countryside.
Unlike Runway's previous best model from June 2023, which could only create two-second clips, Gen-3 Alpha can reportedly create 10-second-long video segments of people, places, and things with a consistency and coherence that easily surpasses Gen-2. If 10 seconds sounds like a pittance compared to Sora's full-minute videos, consider that Runway is working with a much smaller compute budget than the far more lavishly funded OpenAI, and it actually has a track record of shipping video generation capabilities to commercial users.
Gen-3 Alpha doesn't generate audio to accompany the video clips, and it's highly likely that temporally coherent generations (those that keep a character consistent over time) depend on similarly high-quality training material. But it's hard to ignore the improvement in Runway's visual fidelity over the past year.
AI video heats up
The last few weeks have been quite busy for AI video synthesis in the AI research community, including the launch of Kling, a Chinese model built by Beijing-based Kuaishou Technology (sometimes called "Kwai"). Kling can generate two minutes of 1080p HD video at 30 frames per second, with a level of detail and coherency that reportedly matches Sora.
Gen-3 Alpha prompt: "A subtle reflection of a woman in the window of a train moving at high speed in a Japanese city."
Shortly after Kling launched, people on social media began creating surreal AI videos using Luma AI's Luma Dream Machine. These videos were new and weird but generally lacked coherence; we tested the Dream Machine and weren't impressed with what we saw.
Meanwhile, New York City-based Runway, one of the original text-to-video pioneers (founded in 2018), recently found itself the target of memes suggesting that its Gen-2 technology was falling out of favor compared to newer video synthesis models. Perhaps that's why Gen-3 Alpha was announced.
Gen-3 Alpha prompt: "An astronaut running down a street in Rio de Janeiro."
Generating realistic humans has always been difficult for video synthesis models, so Runway specifically showcases Gen-3 Alpha's ability to create what its developers call "expressive" human characters with a range of actions, gestures, and emotions. Although the examples the company provided aren't particularly expressive (most of the people shown just stare and blink slowly), they do look realistic.
The human examples include videos of a woman riding a train, an astronaut running down a street, a man with his face lit by the glow of a TV set, a woman driving a car, and a woman running.
Gen-3 Alpha prompt: "Close-up shot of a young woman driving a car, looking thoughtful, with a hazy green forest visible through the car window during rainy weather."
The generated demo videos also include more surreal video synthesis examples, including a giant creature walking through a dilapidated city, a man made of rocks walking through a forest, and the giant cotton candy monster shown below, which is probably the best video on the entire page.
Gen-3 Alpha prompt: "A giant human made of furry blue cotton candy stomps the ground and roars toward the sky, with a clear blue sky behind it."
Gen-3 Alpha will power various Runway AI editing tools (one of the company's most notable achievements), including Multi Motion Brush, Advanced Camera Controls, and Director Mode. It can create videos from text or image prompts.
Runway says Gen-3 Alpha is the first in a series of models trained on new infrastructure designed for large-scale multimodal training, a step toward the development of "General World Models," which are hypothetical AI systems that build internal representations of environments and use them to simulate future events within those environments.