NaViT

 
Resizing images to a fixed resolution is a suboptimal (and, given ViT, arguably obsolete) choice → ViT offers flexible sequence-based modeling + can handle varying input lengths. NaViT (Native Resolution ViT) → uses sequence packing during training to process inputs of arbitrary resolutions + aspect ratios. Principle of ViT → split an image into patches, each of which is linearly projected into a token (but in practice input images are resized to a fixed square aspect ratio and then split into a fixed number of patches)
Multiple patch sizes = smooth variation of input length → via random sampling of the patch size at each training step + a resizing algorithm that lets the initial convolutional embedding support multiple patch sizes. Pix2Struct → an alternative patching approach which preserves the aspect ratio
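As a rough picture of "resizing the embedding to support multiple patch sizes", here is a minimal sketch that naively bilinearly resizes the patch-embedding kernel; the function name and the simple interpolation (rather than the more careful resize used in the actual method) are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def resize_patch_embed(weight: torch.Tensor, new_patch: int) -> torch.Tensor:
    """Resize a patch-embedding kernel (dim, C, p, p) so the same weights can
    embed patches of a different size sampled at this training step."""
    return F.interpolate(weight, size=(new_patch, new_patch),
                         mode="bilinear", align_corners=False)

w16 = torch.randn(768, 3, 16, 16)      # kernel trained for 16x16 patches
w32 = resize_patch_embed(w16, 32)      # reused for 32x32 patches this step
```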
 
Principle of NaViT → patches from multiple different images are packed into a single sequence ("Patch n' Pack") → enables variable resolution while preserving the aspect ratio. Inspired by example packing in NLP, where multiple examples are packed into a single sequence to accommodate efficient training on variable-length inputs.
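A minimal sketch of the packing idea, assuming a greedy first-fit packer over NumPy arrays; `patchify`, `pack_examples`, and the padding-id convention are illustrative names, not the paper's implementation.

```python
import numpy as np

def patchify(img, patch=16):
    """Split an H x W x C image (H, W divisible by `patch`) into flattened patches."""
    H, W, C = img.shape
    gh, gw = H // patch, W // patch
    p = img.reshape(gh, patch, gw, patch, C).transpose(0, 2, 1, 3, 4)
    return p.reshape(gh * gw, patch * patch * C)          # (num_patches, patch_dim)

def pack_examples(images, seq_len, patch=16):
    """Greedily pack the patch sequences of several images into fixed-length sequences.

    Assumes each image has at most `seq_len` patches. Returns (tokens, example_ids)
    pairs; example_ids marks which image each token came from (-1 = padding) and is
    what the block-diagonal attention mask and per-example pooling are built from.
    """
    sequences, tokens, ids = [], [], []
    for ex_id, img in enumerate(images):
        p = patchify(img, patch)
        if len(ids) + len(p) > seq_len:                   # current sequence is full -> flush
            pad = seq_len - len(ids)
            tokens.append(np.zeros((pad, p.shape[1])))
            ids.extend([-1] * pad)
            sequences.append((np.concatenate(tokens), np.array(ids)))
            tokens, ids = [], []
        tokens.append(p)
        ids.extend([ex_id] * len(p))
    if ids:                                               # flush the last partial sequence
        pad = seq_len - len(ids)
        tokens.append(np.zeros((pad, tokens[0].shape[1])))
        ids.extend([-1] * pad)
        sequences.append((np.concatenate(tokens), np.array(ids)))
    return sequences

# Three images with different resolutions and aspect ratios, never resized to a square.
imgs = [np.random.rand(h, w, 3) for h, w in [(64, 96), (128, 48), (80, 80)]]
packed = pack_examples(imgs, seq_len=64)

# Block-diagonal mask: a token may only attend to tokens of its own example
# (padding tokens attend nowhere).
tok, ids = packed[0]
attn_mask = (ids[:, None] == ids[None, :]) & (ids[:, None] >= 0)
```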
 
Observations:
1) randomly sampling resolutions at training time reduces cost
2) good performance across a wide range of resolutions
3) the fixed batch shapes enabled by example packing opened up new ideas → aspect-ratio-preserving resolution sampling, variable token dropping rates, adaptive computation
 
 
To handle different image sizes, 2 things are usually done: 1) resizing 2) padding the image. BUT both are flawed → 1) resizing hampers performance 2) padding is inefficient + most images are not square anyway
 
 
Self-attention mask → prevents examples from attending to each other
Attention pooling → pool the token representations within each example, giving a single vector representation per example in the sequence
Factorized positional embeddings → standard 2D positional embeddings learn an embedding of size (maxLen, maxLen) indexed with the (x, y) coordinates of each patch; instead, decompose into separate embeddings of the x and y coordinates. Candidates considered → learned embeddings, sinusoidal embeddings, learned Fourier positional embeddings
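A minimal sketch of the factorized variant with learned embeddings (PyTorch; the module name and sizes are illustrative assumptions, not the paper's code):

```python
import torch
import torch.nn as nn

class FactorizedPositionEmbedding(nn.Module):
    """Separate learned embeddings for the x and y patch coordinates, summed.

    Only max_len + max_len vectors are learned instead of max_len * max_len,
    and any (x, y) position up to max_len is supported, whatever the aspect ratio.
    """
    def __init__(self, max_len: int, dim: int):
        super().__init__()
        self.emb_x = nn.Embedding(max_len, dim)
        self.emb_y = nn.Embedding(max_len, dim)

    def forward(self, xy: torch.Tensor) -> torch.Tensor:
        # xy: (..., 2) integer patch coordinates for every token in the packed sequence
        return self.emb_x(xy[..., 0]) + self.emb_y(xy[..., 1])

# Example: positions for a 3 x 5 patch grid (a non-square image).
ys, xs = torch.meshgrid(torch.arange(3), torch.arange(5), indexing="ij")
coords = torch.stack([xs.flatten(), ys.flatten()], dim=-1)      # (15, 2)
pos = FactorizedPositionEmbedding(max_len=64, dim=128)(coords)  # (15, 128)
```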
 
Token dropping → random omission of input patches during training, to accelerate training
Result → greater throughput / greater performance
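A minimal sketch of per-example token dropping applied before packing (the function and argument names are illustrative, not the paper's code):

```python
import numpy as np

def drop_tokens(patches, coords, drop_rate, rng=np.random.default_rng()):
    """Randomly omit a fraction of an image's patches (and their coordinates),
    so fewer tokens per example are packed and processed during training."""
    n = len(patches)
    keep = max(1, int(round(n * (1 - drop_rate))))
    idx = np.sort(rng.choice(n, size=keep, replace=False))   # keep the original order
    return patches[idx], coords[idx]

patches = np.random.rand(24, 768)             # e.g. a 4 x 6 grid of 16x16x3 patches
coords = np.stack(np.meshgrid(np.arange(6), np.arange(4)), -1).reshape(-1, 2)
kept_patches, kept_coords = drop_tokens(patches, coords, drop_rate=0.5)  # 12 tokens left
```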
 
NaViT enables mixed-resolution training by sampling from a distribution of image sizes, while retaining each image's original aspect ratio
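A sketch of aspect-ratio-preserving resolution sampling (the uniform side-length distribution and the range below are assumptions for illustration, not the paper's distribution):

```python
import numpy as np

def sample_resolution(orig_h, orig_w, rng=np.random.default_rng(),
                      side_range=(128, 512), patch=16):
    """Sample a target resolution from a distribution of image sizes while
    keeping (approximately) the original aspect ratio."""
    target_area = rng.uniform(*side_range) ** 2          # sample an area budget
    scale = np.sqrt(target_area / (orig_h * orig_w))     # isotropic rescale factor
    new_h = max(patch, int(round(orig_h * scale / patch)) * patch)
    new_w = max(patch, int(round(orig_w * scale / patch)) * patch)
    return new_h, new_w   # both sides rounded to multiples of the patch size

print(sample_resolution(480, 640))   # e.g. something like (288, 384)
```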
 
The paper notes that the quadratic cost of self-attention is minuscule compared to the MLP cost, so the longer packed sequences add little overhead.
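A quick back-of-the-envelope check of why that holds (standard transformer matmul cost estimates; the widths and sequence lengths below are illustrative):

```python
def per_layer_costs(n, d, mlp_ratio=4):
    """Rough per-layer matmul cost (multiply-accumulates) for sequence length n, width d."""
    attn_quad = 2 * n * n * d            # QK^T and the attention-weighted sum: the part packing grows
    attn_proj = 4 * n * d * d            # q, k, v and output projections
    mlp = 2 * mlp_ratio * n * d * d      # the two MLP linear layers
    return attn_quad, attn_proj, mlp

for n, d in [(256, 768), (1024, 768), (2048, 1024)]:
    quad, proj, mlp = per_layer_costs(n, d)
    print(f"n={n:5d} d={d}: quadratic attention / MLP = {quad / mlp:.2f}")
# The ratio is n / (4 * d): the quadratic attention term only becomes comparable to
# the MLP cost once the packed sequence length reaches several times the model width.
```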
 
Architectural details carried over from prior ViT work → query-key normalization, omission of biases, attention pooling
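A small sketch of what query-key normalization with omitted biases could look like in a single attention head (PyTorch; the class is illustrative, not the paper's implementation):

```python
from typing import Optional

import torch
import torch.nn as nn

class QKNormAttention(nn.Module):
    """Single-head self-attention with LayerNorm on queries and keys, no bias terms."""
    def __init__(self, dim: int):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)   # biases omitted
        self.out = nn.Linear(dim, dim, bias=False)
        self.q_norm = nn.LayerNorm(dim)
        self.k_norm = nn.LayerNorm(dim)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor, mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k = self.q_norm(q), self.k_norm(k)            # normalize before the dot product
        attn = (q @ k.transpose(-2, -1)) * self.scale
        if mask is not None:                             # e.g. the block-diagonal packing mask
            attn = attn.masked_fill(~mask, float("-inf"))
        return self.out(attn.softmax(dim=-1) @ v)

y = QKNormAttention(64)(torch.randn(1, 10, 64))          # (1, 10, 64)
```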
 
Benefits come from → preserved aspect ratios + the ability to evaluate over many resolutions, but the main driver is → the increase in the number of training examples, achieved through → the combination of sampling multiple variable-resolution examples and token dropping, which leads to variable-size images
 
Finetuning with variable resolutions >> finetuning at a single resolution
 
 
Token dropping strategies:
1. continuously sampled token dropping rates
2. resolution-dependent token dropping rates
Constant token dropping rate >> sampling the rate from a beta distribution
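To make the strategies above (plus the constant baseline) concrete, a tiny sketch of how per-image dropping rates could be chosen; the Beta parameters and the linear resolution-dependent mapping are illustrative assumptions, not the paper's values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Constant baseline: every image drops the same fraction of tokens.
constant_rate = 0.5

# 1. Continuously sampled rate: draw a fresh rate per image, e.g. from a Beta distribution.
sampled_rates = rng.beta(2.0, 2.0, size=8)        # one dropping rate per image in the batch

# 2. Resolution-dependent rate: drop more tokens from images that produce more patches.
def resolution_dependent_rate(num_patches, lo=0.2, hi=0.7, max_patches=1024):
    return lo + (hi - lo) * min(num_patches / max_patches, 1.0)

print(sampled_rates.round(2), resolution_dependent_rate(256))
```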
 
RELATED WORK -
3 different spatio-temporal patches (“tubelets”), central frame embedding
 
feature maps at multiple spatial scales
compute can't be scaled down by reducing resolution after tuning at a higher resolution
large batches with coarse spatio-temporal resolution first + finer resolution later