Imagen Video: video generation with diffusion

 
 
applies text to video → with diffusion → conditioned on a pretrained (frozen) text encoder
text conditional video generation → based on → cascade of video diffusion models
 
generate videos using → 1) base video generation model 2) sequence of interleaved spatial and temporal video super-resolution models
design changes → 1) fully convolutional temporal and spatial super-resolution models at certain resolutions 2) choice of v-parametrization of diffusion models
 
apply progressive distillation to the video models with classifier-free guidance → fast, high-quality sampling
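A minimal sketch of the classifier-free guidance combination mentioned above, assuming the common form where the guided prediction is `(1 + w) * conditional - w * unconditional`. The function name, the toy arrays, and the weight value are illustrative and not taken from the paper.

```python
import numpy as np

def classifier_free_guidance(pred_cond, pred_uncond, guidance_weight):
    """Combine conditional and unconditional model predictions.

    guidance_weight = 0 recovers the plain conditional prediction;
    larger values push the sample further toward the text condition.
    """
    return (1.0 + guidance_weight) * pred_cond - guidance_weight * pred_uncond

# Toy predictions standing in for the outputs of a video diffusion model.
pred_cond = np.array([1.0, 2.0])
pred_uncond = np.array([0.5, 1.0])
guided = classifier_free_guidance(pred_cond, pred_uncond, guidance_weight=2.0)
```

The same combination can be applied to noise predictions or v-predictions, since it is linear in the model output.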
Prior work on video generation → mostly on restricted datasets → 1) autoregressive models 2) latent variable models with autoregressive priors 3) non-autoregressive latent variable approaches
also → autoregressive generation + RNN → with conditional diffusion observations
 
Model is → 1) frozen T5 text encoder 2) base video diffusion model 3) interleaved spatial and temporal super-resolution diffusion models
 
Key contributions -
 
7 sub models which perform 1) text conditional video generation 2) spatial super resolution 3) temporal super resolution
Imagen Video is built from → diffusion models → specified in continuous time
 
We parametrize in terms of v-parametrization
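A small sketch of what the v-parametrization means, assuming the usual definitions `z_t = alpha_t * x + sigma_t * eps` and `v = alpha_t * eps - sigma_t * x` with `alpha_t^2 + sigma_t^2 = 1`. The cosine schedule and all function names here are assumptions for illustration only.

```python
import numpy as np

def alpha_sigma(t):
    # Assumed variance-preserving cosine schedule: alpha^2 + sigma^2 = 1.
    return np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)

def v_target(x, eps, t):
    # Training target under v-parametrization.
    alpha, sigma = alpha_sigma(t)
    return alpha * eps - sigma * x

def x_from_v(z, v, t):
    # Recover the clean sample from the noisy latent and a v-prediction.
    alpha, sigma = alpha_sigma(t)
    return alpha * z - sigma * v

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4, 4, 3))      # toy "video": frames, height, width, channels
eps = rng.normal(size=x.shape)
t = 0.3
alpha, sigma = alpha_sigma(t)
z = alpha * x + sigma * eps            # forward diffusion at time t
v = v_target(x, eps, t)
x_rec = x_from_v(z, v, t)              # equals x up to floating-point error
```

With a perfect v-prediction the clean video is recovered exactly, which is why a network trained to predict `v` can be used in place of one trained to predict `x` or `eps`.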
 
Cascaded diffusion models → generate an image at low resolution → then sequentially increase its resolution through a series of super-resolution diffusion models
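The cascade idea can be sketched as a simple pipeline. This is only a structural illustration: the base model and the super-resolution stages below are stand-ins (random noise and nearest-neighbour repetition), and the stage list, shapes, and function names are hypothetical rather than the paper's actual configuration.

```python
import numpy as np

def base_model(prompt, frames=4, height=6, width=10):
    # Stand-in for the base video diffusion model: returns a small random video.
    rng = np.random.default_rng(len(prompt))
    return rng.normal(size=(frames, height, width, 3))

def spatial_sr(video, factor):
    # Stand-in for a spatial super-resolution model: upsample height and width.
    return video.repeat(factor, axis=1).repeat(factor, axis=2)

def temporal_sr(video, factor):
    # Stand-in for a temporal super-resolution model: increase the frame count.
    return video.repeat(factor, axis=0)

# Hypothetical interleaving of spatial and temporal stages.
CASCADE = [("spatial", 2), ("temporal", 2), ("spatial", 2)]

def generate(prompt):
    video = base_model(prompt)
    for kind, factor in CASCADE:
        if kind == "spatial":
            video = spatial_sr(video, factor)
        else:
            video = temporal_sr(video, factor)
    return video

video = generate("a cat")
print(video.shape)  # (8, 24, 40, 3)
```

In the real system each stage would be its own diffusion model conditioned on the previous stage's output and on the text embedding. Only the data flow is shown here.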
methods carried over from image to video → 1) v-parametrization 2) conditioning augmentation 3) classifier-free guidance
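A rough sketch of noise conditioning augmentation for a super-resolution stage, assuming it amounts to corrupting the low-resolution conditioning input with Gaussian noise at a chosen level (which is also passed to the model). The variance-preserving mix and the function name are assumptions, not the paper's exact recipe.

```python
import numpy as np

def augment_conditioning(low_res, noise_level, rng):
    """Corrupt the low-resolution conditioning video with Gaussian noise.

    noise_level is in [0, 1]; 0 leaves the input unchanged. The returned
    noise_level would also be given to the super-resolution model as input.
    """
    noise = rng.normal(size=low_res.shape)
    noisy = np.sqrt(1.0 - noise_level ** 2) * low_res + noise_level * noise
    return noisy, noise_level

rng = np.random.default_rng(0)
low_res = rng.normal(size=(4, 6, 10, 3))   # toy low-resolution video
noisy, level = augment_conditioning(low_res, noise_level=0.2, rng=rng)
clean, _ = augment_conditioning(low_res, noise_level=0.0, rng=rng)
```

The motivation is that a stage trained on slightly corrupted inputs is less sensitive to artifacts produced by the previous stage in the cascade.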