Imagen Video: video generation with diffusion models

https://arxiv.org/pdf/2210.02303.pdf

applies diffusion to text-to-video generation → text conditioning comes from a pretrained, frozen text encoder (T5)

text-conditional video generation → based on → a cascade of video diffusion models

generates videos using → 1) a base video generation model 2) a sequence of interleaved spatial and temporal video super-resolution models

design choices for scaling up → 1) fully convolutional temporal and spatial super-resolution models at certain resolutions 2) the v-parameterization of diffusion models

progressive distillation with classifier-free guidance is applied to the video models → fast, high-quality sampling
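A minimal sketch of the classifier-free guidance step, assuming one common convention in which the guided prediction extrapolates from the unconditional output toward the text-conditional one. The function name `guided_prediction`, the weight `w`, and the toy arrays are hypothetical stand-ins, not the paper's code.

```python
import numpy as np

def guided_prediction(pred_cond, pred_uncond, w):
    """Combine conditional and unconditional model outputs.

    w = 0 returns the conditional prediction unchanged;
    w > 0 pushes the result further away from the unconditional one.
    """
    return (1.0 + w) * pred_cond - w * pred_uncond

# Toy stand-ins for the two model outputs (assumed values, illustration only).
pred_cond = np.array([1.0, 2.0])
pred_uncond = np.array([0.5, 1.0])

no_guidance = guided_prediction(pred_cond, pred_uncond, 0.0)
guided = guided_prediction(pred_cond, pred_uncond, 2.0)
```

In a real sampler both predictions would come from the same network, evaluated once with the text embedding and once with it dropped.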

Prior work on video generation → mostly on restricted datasets, using → 1) autoregressive models 2) latent-variable models with autoregressive priors 3) non-autoregressive latent-variable approaches

diffusion in prior video work → autoregressive generation with an RNN-based model → with conditional diffusion observations

Model is → 1) frozen T5 text encoder 2) base video diffusion model 3) interleaved spatial and temporal super-resolution diffusion models

Key contributions -

7 sub-models which perform 1) text-conditional video generation 2) spatial super-resolution 3) temporal super-resolution

Imagen Video is built from → diffusion models → specified in continuous time

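models are parameterized with the v-parameterization (v-prediction) → the network predicts v = alpha_t * eps - sigma_t * x instead of the noise eps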
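A small sketch of the v-parameterization, assuming a variance-preserving forward process z = alpha * x + sigma * eps with alpha^2 + sigma^2 = 1. The helper names and the angle-based schedule `alpha_sigma` are illustrative assumptions; the point is only that a v-prediction can be converted back to the clean sample x and the noise eps.

```python
import numpy as np

def alpha_sigma(phi):
    """Variance-preserving coefficients: cos^2 + sin^2 = 1 (assumed schedule form)."""
    return np.cos(phi), np.sin(phi)

def v_target(x, eps, alpha, sigma):
    """Regression target under the v-parameterization."""
    return alpha * eps - sigma * x

def x_from_v(z, v, alpha, sigma):
    """Recover the clean sample from the noisy latent z and v."""
    return alpha * z - sigma * v

def eps_from_v(z, v, alpha, sigma):
    """Recover the noise from the noisy latent z and v."""
    return sigma * z + alpha * v

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8, 8, 3))      # toy "video": frames, height, width, channels
eps = rng.normal(size=x.shape)
alpha, sigma = alpha_sigma(0.3)

z = alpha * x + sigma * eps            # noisy latent
v = v_target(x, eps, alpha, sigma)
x_rec = x_from_v(z, v, alpha, sigma)
eps_rec = eps_from_v(z, v, alpha, sigma)
```

Here `v` is computed from the ground truth; in training, a network would be regressed onto it.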

Cascaded diffusion models → generate an image at low resolution → then sequentially increase its resolution through a series of super-resolution diffusion models
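The cascade idea can be sketched as a chain of stages that each raise either the spatial or the temporal resolution. In this sketch simple pixel and frame repetition stands in for the super-resolution diffusion models, and the stage list and scale factors are made-up illustration values, not the paper's configuration.

```python
import numpy as np

def spatial_sr(video, factor):
    """Stand-in for a spatial super-resolution model: repeat pixels along height and width."""
    return video.repeat(factor, axis=1).repeat(factor, axis=2)

def temporal_sr(video, factor):
    """Stand-in for a temporal super-resolution model: repeat frames along time."""
    return video.repeat(factor, axis=0)

def run_cascade(base_video, stages):
    """Apply each (kind, factor) stage in order and record the shape after every stage."""
    video = base_video
    shapes = [video.shape]
    for kind, factor in stages:
        if kind == "ssr":
            video = spatial_sr(video, factor)
        elif kind == "tsr":
            video = temporal_sr(video, factor)
        else:
            raise ValueError(f"unknown stage kind: {kind}")
        shapes.append(video.shape)
    return video, shapes

# Toy base output with layout (frames, height, width, channels); sizes are assumptions.
base = np.zeros((4, 6, 10, 3))
stages = [("tsr", 2), ("ssr", 2), ("ssr", 2), ("tsr", 2)]
final, shapes = run_cascade(base, stages)
```

In the real system each stage would be a diffusion model conditioned on the previous stage's output (and on the text), rather than a deterministic resize.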

methods carried over from image to video generation → 1) v-parameterization 2) conditioning augmentation 3) classifier-free guidance
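A hedged sketch of noise conditioning augmentation for a super-resolution stage: the low-resolution conditioning input is corrupted with Gaussian noise at a chosen signal-to-noise ratio, and that level would also be passed to the model. The variance-preserving mixing, the function name `augment_conditioning`, and the example SNR value are assumptions made for illustration.

```python
import numpy as np

def augment_conditioning(low_res, snr, rng):
    """Corrupt the conditioning video with Gaussian noise at a given SNR.

    Uses alpha^2 + sigma^2 = 1 and alpha^2 / sigma^2 = snr (assumed form).
    Returns the noisy conditioning input together with alpha and sigma.
    """
    alpha = np.sqrt(snr / (1.0 + snr))
    sigma = np.sqrt(1.0 / (1.0 + snr))
    noise = rng.normal(size=low_res.shape)
    return alpha * low_res + sigma * noise, alpha, sigma

rng = np.random.default_rng(0)
low_res = np.zeros((4, 6, 10, 3))      # toy low-resolution video (frames, height, width, channels)
noisy, alpha, sigma = augment_conditioning(low_res, 3.0, rng)
```

During training the SNR would typically be drawn at random per example, while sampling would use a fixed value; both choices are left out of this sketch.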