Sora

Prior work: recurrent networks [1-3], generative adversarial networks [4-7], autoregressive transformers [8,9], and diffusion models.
 
Images + video → represented as patches. Whereas LLMs have text tokens, Sora has visual patches. Patches have previously been shown to be an effective representation for models of visual data [15-18]. We find that patches are a highly scalable and effective representation for training generative models on diverse types of videos and images.
 
How (video → patches):
1) Compress the video into a low-dimensional latent space.
2) Decompose the representation into spacetime patches.

Video compression:
1) A trained network reduces the dimensionality of raw video, compressing in both the spatial and temporal dimensions.
2) A decoder maps generated latents back to pixel space.
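A minimal sketch of such a compression network, assuming a simple 3D-convolutional autoencoder; Sora's actual architecture is not public, so every layer choice here is a placeholder:

```python
import torch
import torch.nn as nn

class VideoAutoencoder(nn.Module):
    """Hypothetical video compression network: a 3D-conv encoder that
    downsamples space AND time, plus a decoder back to pixel space.
    Illustrative only — not Sora's actual (unpublished) architecture."""
    def __init__(self, latent_channels: int = 8):
        super().__init__()
        # Encoder: stride-2 3D convs halve T, H, and W at each layer.
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, stride=2, padding=1),                # (T/2, H/2, W/2)
            nn.SiLU(),
            nn.Conv3d(64, latent_channels, kernel_size=3, stride=2, padding=1),  # (T/4, H/4, W/4)
        )
        # Decoder: transposed convs map generated latents back to pixels.
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(latent_channels, 64, kernel_size=4, stride=2, padding=1),
            nn.SiLU(),
            nn.ConvTranspose3d(64, 3, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, 3, T, H, W) in pixel space
        z = self.encoder(video)   # low-dimensional spacetime latent
        return self.decoder(z)    # reconstruction back to pixel space

# Usage: an 8-frame 64x64 clip compresses to an (8, 2, 16, 16) latent.
model = VideoAutoencoder()
clip = torch.randn(1, 3, 8, 64, 64)
recon = model(clip)
```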
 
Spacetime latent patches:
1) Given a compressed input video, we extract a sequence of spacetime patches which act as transformer tokens.
2) Our patch-based representation enables Sora to train on videos and images of variable resolutions, durations, and aspect ratios.

Diffusion transformer → input: noisy patches (plus conditioning such as text prompts); output: the original clean patches.
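A minimal sketch of the patchify step under assumed (hypothetical) patch sizes — the real tokenization details are not public:

```python
import torch

def extract_spacetime_patches(latent: torch.Tensor,
                              pt: int = 2, ph: int = 4, pw: int = 4) -> torch.Tensor:
    """Hypothetical patchify: carve a compressed latent video
    (batch, channels, T, H, W) into a flat sequence of spacetime patches,
    each flattened into one transformer token. Patch sizes are illustrative."""
    b, c, t, h, w = latent.shape
    assert t % pt == 0 and h % ph == 0 and w % pw == 0
    x = latent.reshape(b, c, t // pt, pt, h // ph, ph, w // pw, pw)
    # Group each (pt, ph, pw, c) cube into a single token vector.
    x = x.permute(0, 2, 4, 6, 3, 5, 7, 1)
    tokens = x.reshape(b, (t // pt) * (h // ph) * (w // pw), pt * ph * pw * c)
    return tokens  # (batch, num_tokens, token_dim)

# The (8, 2, 16, 16) latent from above yields 1 * 4 * 4 = 16 tokens of dim 256.
```

During training, Gaussian noise is added to these tokens and the diffusion transformer is trained to predict the original clean patches.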
 
Re-captioning technique: We apply the re-captioning technique introduced in DALL·E 3 [30] to videos → first train a highly descriptive captioner model, then use it to produce text captions for all videos in the training set. We find that training on highly descriptive video captions improves text fidelity as well as the overall quality of videos.
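A toy sketch of that data pass, with a hypothetical `captioner` callable standing in for the trained captioner model:

```python
from typing import Callable, Iterable

def recaption_dataset(videos: Iterable[str],
                      captioner: Callable[[str], str]) -> dict[str, str]:
    """Hypothetical re-captioning pass: run a descriptive captioner model
    over every video in the training set and keep its caption as the
    text label used for training."""
    return {path: captioner(path) for path in videos}

# captions = recaption_dataset(["clip_0001.mp4", "clip_0002.mp4"], my_captioner)
```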
 
Generating images → Sora can also generate images. We do this by arranging patches of Gaussian noise in a spatial grid with a temporal extent of one frame.
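In tensor terms (a hypothetical illustration, reusing the latent shape conventions from the sketches above):

```python
import torch

# In this framing an image is just a one-frame video: a spatial grid of
# Gaussian-noise latent patches with temporal extent T = 1.
b, c, h, w = 1, 8, 16, 16
image_noise = torch.randn(b, c, 1, h, w)  # (batch, channels, T=1, H, W)
# The same patchify step as above then yields purely spatial tokens,
# so one diffusion transformer serves both images and videos.
```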
 
A significant challenge for video generation systems → maintaining temporal consistency when sampling long videos. We find that Sora is often, though not always, able to effectively model both short- and long-range dependencies.
 