What just happened? Image generation technology has advanced rapidly in recent years, yet coherent video rendering remains a challenge for contemporary AI models. Google, however, has now demonstrated notable progress in this area, showing off a significantly improved approach to video generation.

Google has just unveiled Lumiere, the company's latest AI model for video creation. Google describes Lumiere as a significant step forward in video synthesis, since producing "realistic, diverse and coherent motion" has long been one of the main challenges for AI-based video generation. Lumiere is built around a space-time diffusion model that aims to tackle, or at least mitigate, that problem.

Mountain View's latest foray into the generative AI business handles text-to-video generation, image-to-video rendering, and stylized generation. Users can create an entirely new video clip by writing a text prompt, by animating a source image (whether authentic, realistic, or heavily edited), or by supplying a reference image whose style the output should match.

Lumiere employs a novel "Space-Time U-Net architecture" that generates the entire video clip at once, in a single pass through the model. Compared to existing models, which synthesize distant keyframes and then fill in the frames between them, Lumiere's approach can achieve state-of-the-art text-to-video results, with far fewer of the temporal glitches seen in earlier systems.
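To make the difference concrete, here is a minimal, purely illustrative sketch of the two data flows. It is not Google's code or an actual diffusion model; the function names, frame counts, and resolutions are hypothetical placeholders chosen only to contrast "keyframes plus temporal upsampling" with "process the whole space-time volume in one pass."

```python
# Toy illustration only -- random arrays stand in for real generative models.
import numpy as np

def cascaded_keyframe_pipeline(prompt: str, num_frames: int = 80) -> np.ndarray:
    """Older approach: generate sparse keyframes, then temporally upsample.
    The gaps between keyframes are where motion inconsistencies tend to appear."""
    keyframes = np.random.rand(num_frames // 8, 64, 64, 3)  # stand-in base model output
    video = np.repeat(keyframes, 8, axis=0)                  # stand-in temporal super-resolution
    return video

def space_time_single_pass(prompt: str, num_frames: int = 80) -> np.ndarray:
    """Lumiere-style idea: handle the full clip as one space-time volume,
    so every frame is produced in the same single pass through the network."""
    video = np.random.rand(num_frames, 64, 64, 3)            # whole clip at once
    return video

clip = space_time_single_pass("a bear playing the guitar")
print(clip.shape)  # (80, 64, 64, 3)
```

The key point of the contrast is simply where temporal coherence has to come from: in the cascaded version it must be stitched in after the fact, while in the single-pass version the model sees the entire clip at once.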

Lumiere's additional capabilities include video stylization, which re-renders a source video in different visual styles or materials, and Cinemagraphs, which animate only a selected, highlighted region of a still image. The Video Inpainting feature can alter specific portions of a source video, such as the color, material, or texture of a girl's dress.

As Google highlights in the official paper, Lumiere can generate "low-resolution," 1024×1024 videos lasting no more than five seconds. Previous AI video models could produce longer clips, but Google claims that users preferred Lumiere's output over that of existing models. Mountain View says Lumiere was trained on a dataset of 30 million videos along with their text descriptions, though the origin (and copyright status) of those sourced videos is currently unknown.

The paper by Google researchers also highlights the potential "societal impact" of generative video tech like Lumiere, stating that the model's primary goal is to enable "novice users" to generate visual content in creative and flexible new ways. New tools for detecting biases and "malicious" use cases of video generative models should, however, be developed as soon as possible to avoid spoiling the fun.