applies text to video → with diffusion
→ but uses pretrained text to video model
text conditional video generation → based on
→ cascade of video diffusion models
generate videos using→
1) base video gen tool
2) sequence of interleaved spatial and temporal video super-resolution models
changes
1) fully convolutional temporal and spatial super-resolution models at certain resolutions
2) choice of v-parametrization of diffusion models
apply progressive distillation to our video models with classifier-free guidance for fast, high-quality sampling
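
A rough sketch of why progressive distillation gives fast sampling: in the standard formulation, each distillation round trains a student to match two sampler steps of its teacher in one step, so the number of sampling steps halves per round. The function name and the step counts (256 and 8) are illustrative assumptions, not values from these notes.

```python
def distillation_schedule(start_steps, target_steps):
    """Return the sampler step count after each distillation round.

    Each round halves the step count (student = 2 teacher steps in 1).
    Both arguments are hypothetical settings chosen for illustration.
    """
    if target_steps < 1 or start_steps < target_steps:
        raise ValueError("need start_steps >= target_steps >= 1")
    steps = [start_steps]
    # Keep halving while the next student still has at least target_steps.
    while steps[-1] // 2 >= target_steps:
        steps.append(steps[-1] // 2)
    return steps


schedule = distillation_schedule(256, 8)
print(schedule)  # [256, 128, 64, 32, 16, 8]
```

Five rounds take the sampler from 256 steps to 8 in this example, which is where the speed-up comes from.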
Prior work on video generation
→ approaches so far → restricted datasets
1) autoregressive models
2) latent variable models with autoregressive priors
3) non-autoregressive + latent variable approaches
autoregressive generation + RNN → with conditional diffusion observations
Model is →
1) frozen T5 text encoder
2) base video diffusion model
3) interleaved spatial and temporal super-resolution diffusion models
Key contributions -
7 sub models which perform
1) text conditional video generation
2) spatial super resolution
3) temporal super resolution
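
The cascade can be pictured by tracking only the video shape as it passes through the sub-models. This is a minimal sketch: the base shape, the stage order, and the upsampling factors are assumptions chosen for illustration; only the count of seven sub-models (one base plus six super-resolution stages) comes from the notes above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoShape:
    frames: int
    height: int
    width: int


def temporal_sr(shape, factor):
    """Temporal super-resolution: more frames, same spatial size."""
    return VideoShape(shape.frames * factor, shape.height, shape.width)


def spatial_sr(shape, factor):
    """Spatial super-resolution: same frame count, larger frames."""
    return VideoShape(shape.frames, shape.height * factor, shape.width * factor)


# Hypothetical base output and stage list (assumed values, for illustration).
BASE = VideoShape(frames=16, height=24, width=40)
STAGES = [
    ("temporal", 2), ("spatial", 2),
    ("temporal", 2), ("temporal", 2),
    ("spatial", 4), ("spatial", 4),
]


def run_cascade(base, stages):
    """Return the shape after the base model and after every SR stage."""
    shapes = [base]
    for kind, factor in stages:
        step = temporal_sr if kind == "temporal" else spatial_sr
        shapes.append(step(shapes[-1], factor))
    return shapes


for shape in run_cascade(BASE, STAGES):
    print(shape)
```

Each stage only has to solve a small upsampling problem conditioned on the previous output, which is the point of splitting generation into a cascade.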
Imagen Video is built from → diffusion models
→ specified in continuous time
We parametrize in terms of v-parametrization
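
A small numerical sketch of the v-parametrization, assuming a variance-preserving schedule (alpha^2 + sigma^2 = 1) and the usual definitions z_t = alpha * x + sigma * eps and v = alpha * eps - sigma * x. The function names and sample values are hypothetical. It checks that both the clean data x and the noise eps can be recovered from (z_t, v).

```python
import math


def diffuse(x, eps, alpha, sigma):
    """Noisy latent z_t = alpha * x + sigma * eps."""
    return [alpha * xi + sigma * ei for xi, ei in zip(x, eps)]


def v_target(x, eps, alpha, sigma):
    """Training target v = alpha * eps - sigma * x."""
    return [alpha * ei - sigma * xi for xi, ei in zip(x, eps)]


def x_from_v(z, v, alpha, sigma):
    """Recover x = alpha * z_t - sigma * v (uses alpha^2 + sigma^2 = 1)."""
    return [alpha * zi - sigma * vi for zi, vi in zip(z, v)]


def eps_from_v(z, v, alpha, sigma):
    """Recover eps = sigma * z_t + alpha * v (uses alpha^2 + sigma^2 = 1)."""
    return [sigma * zi + alpha * vi for zi, vi in zip(z, v)]


# Assumed toy values: an angle phi gives alpha = cos(phi), sigma = sin(phi).
phi = 0.3
alpha, sigma = math.cos(phi), math.sin(phi)
x = [1.0, -2.0, 0.5]
eps = [0.3, 0.1, -0.7]

z = diffuse(x, eps, alpha, sigma)
v = v_target(x, eps, alpha, sigma)
x_rec = x_from_v(z, v, alpha, sigma)
eps_rec = eps_from_v(z, v, alpha, sigma)
print(x_rec, eps_rec)
```

In practice the network predicts v from z_t, and x or eps is derived from that prediction with the two recovery formulas.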
Cascaded diffusion models → generate image at low resolution
→ then sequentially increase resolution of image through a series of
→ super resolution diffusion models
methods carried over from image to video →
1) v-parametrization
2) conditioning augmentation
3) classifier-free guidance
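
Classifier-free guidance can be sketched as a weighted combination of a conditional and an unconditional prediction: guided = (1 + w) * cond - w * uncond. The lists below stand in for model outputs and the guidance weight w = 2.0 is an arbitrary assumption; with w = 0 the rule reduces to the plain conditional prediction.

```python
def classifier_free_guidance(cond_pred, uncond_pred, w):
    """Combine predictions as (1 + w) * cond - w * uncond, elementwise.

    cond_pred / uncond_pred are stand-ins for the model output with and
    without text conditioning; w is the guidance weight.
    """
    return [(1.0 + w) * c - w * u for c, u in zip(cond_pred, uncond_pred)]


# Toy stand-in predictions (assumed values, not real model outputs).
cond = [0.5, -1.0, 2.0]
uncond = [0.25, -0.5, 1.0]

guided = classifier_free_guidance(cond, uncond, w=2.0)
print(guided)  # [1.0, -2.0, 4.0]
```

Larger w pushes the sample further in the direction that the text conditioning favours, usually trading diversity for fidelity to the prompt.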