Resizing images to a fixed resolution is a suboptimal choice → obsolete
BUT ViT → gives flexible sequence-based modeling
:: + varying input lengths
Native Resolution ViT (NaViT) →
uses sequence packing during training to process inputs of arbitrary resolutions + aspect ratios.
Principle of ViT→ splitting an image into patches,
→ each of which is linearly projected into a token
(but input images are resized to a fixed square aspect ratio and then split into a fixed number of patches)
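A minimal sketch of that standard ViT patching step, assuming a 224×224 input and 16×16 patches; the projection matrix here is random, standing in for the learned linear projection.

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an HxWxC image into non-overlapping patch_size x patch_size patches,
    each flattened into a vector (standard ViT patching)."""
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0
    h, w = H // patch_size, W // patch_size
    patches = image.reshape(h, patch_size, w, patch_size, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(h * w, patch_size * patch_size * C)
    return patches  # (num_patches, patch_dim)

rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))             # fixed square input in vanilla ViT
patches = patchify(image)                              # (196, 768)
W_proj = rng.standard_normal((patches.shape[1], 768))  # learned in practice; random here
tokens = patches @ W_proj                              # (196, 768) -> one token per patch
```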
multiple patch sizes = smooth variation of input length
→ via random sampling of a patch size at each training step + a resizing algorithm to allow
→ the initial convolutional embedding to support multiple patch sizes
Pix2Struct → alternative patching approach which preserves the aspect ratio
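A sketch of the aspect-ratio-preserving idea behind Pix2Struct-style patching, assuming a patch size and a maximum patch budget; the exact heuristics used by Pix2Struct may differ in detail.

```python
import math

def aspect_preserving_grid(h, w, patch_size=16, max_patches=256):
    """Pick a patch grid that preserves the image's aspect ratio while keeping
    the total number of patches <= max_patches."""
    # Scale factor so that (h*s/patch) * (w*s/patch) ~= max_patches
    scale = math.sqrt(max_patches * patch_size * patch_size / (h * w))
    rows = max(1, math.floor(h * scale / patch_size))
    cols = max(1, math.floor(w * scale / patch_size))
    return rows, cols  # image is then resized to (rows*patch_size, cols*patch_size)

print(aspect_preserving_grid(480, 640))   # (13, 18) -> 234 patches, 3:4-ish ratio kept
```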
Principle of NaViT → patches from multiple different images are packed into a single sequence
→ Patch n' Pack
: enables variable resolution while preserving the aspect ratio
::: inspired by — example packing in NLP,
where multiple examples are packed into a single sequence to accommodate efficient training on variable-length inputs
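A minimal sketch of example packing, assuming a fixed target sequence length and a simple greedy first-fit policy; `pack_examples` and `max_seq_len` are illustrative names, and the real packing heuristic may differ.

```python
def pack_examples(token_counts, max_seq_len):
    """Greedy first-fit packing: group variable-length examples so each packed
    sequence holds as many examples as fit within max_seq_len; leftover slots
    become padding."""
    packs = []          # list of lists of example indices
    remaining = []      # free token slots left in each pack
    for idx, n in enumerate(token_counts):
        for p, free in enumerate(remaining):
            if n <= free:
                packs[p].append(idx)
                remaining[p] -= n
                break
        else:
            packs.append([idx])
            remaining.append(max_seq_len - n)
    return packs, remaining

# Images with different resolutions/aspect ratios -> different token counts.
counts = [196, 49, 120, 64, 256, 30]
packs, free = pack_examples(counts, max_seq_len=256)
print(packs)  # [[0, 1], [2, 3, 5], [4]] with a little padding per sequence
```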
Observations →
1) randomly sampling resolutions at training time reduces cost
2) good performance across a wide range of resolutions
3) fixed batch shapes from example packing gave new ideas →
aspect ratio preserving resolution sampling
variable token dropping rates
adaptive computation
To handle different image sizes, 2 things are typically done
1) resizing
2) padding the image
BUT FLAWED - 1) resizing hampers performance 2) padding is inefficient
+ most images are typically not square
Self-attention mask - to prevent examples from attending to each other
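A sketch of how such a mask can be built from per-token example IDs, assuming the packed sequence carries an example ID for every token; names are illustrative.

```python
import numpy as np

def cross_example_mask(example_ids):
    """Boolean (seq, seq) mask that is True only where query and key tokens belong
    to the same packed example; applied to the attention logits so examples packed
    into one sequence cannot attend to each other."""
    ids = np.asarray(example_ids)
    return ids[:, None] == ids[None, :]

# Tokens of 3 packed examples (0, 0, 1, 1, 1, 2) in one sequence:
mask = cross_example_mask([0, 0, 1, 1, 1, 2])
# logits = logits + np.where(mask, 0.0, -1e9)   # typical use inside attention
```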
Self-attention pooling - pool the token representations within each example, resulting in a single vector representation per example in the sequence
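A small sketch of per-example pooling under the same example-ID masking, using a single (here random) query vector; the paper's pooling head is more elaborate, so treat this as a simplified stand-in.

```python
import numpy as np

def per_example_attention_pool(tokens, example_ids, query):
    """Pool token representations within each packed example into one vector:
    one query scores every token, the softmax is restricted to tokens of the same
    example, and each example gets its own weighted sum."""
    ids = np.asarray(example_ids)
    scores = tokens @ query                      # (seq,)
    pooled = {}
    for ex in np.unique(ids):
        sel = ids == ex
        w = np.exp(scores[sel] - scores[sel].max())
        w = w / w.sum()                          # softmax over this example only
        pooled[int(ex)] = w @ tokens[sel]        # (dim,) representation per example
    return pooled

rng = np.random.default_rng(0)
tokens = rng.standard_normal((6, 8))             # 6 packed tokens, dim 8
pooled = per_example_attention_pool(tokens, [0, 0, 1, 1, 1, 2], rng.standard_normal(8))
print({k: v.shape for k, v in pooled.items()})   # {0: (8,), 1: (8,), 2: (8,)}
```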
Factorized positional embeddings -
2D positional embeddings - where embeddings of size (maxLen, maxLen) are learned,
and indexed with the (x, y) coordinates of each patch
Factorized positional embeddings -
decompose into separate embeddings of the x and y coordinates
Consider - learned embeddings, sinusoidal embeddings, learned Fourier positional embeddings
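A minimal sketch of the factorized learned variant, assuming a maximum grid side `max_len` and embedding width `dim`; fractional coordinates and the sinusoidal/Fourier alternatives are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
max_len, dim = 64, 128                    # max patches per side, embedding width

# Factorized positional embeddings: one table for x, one for y, summed per patch,
# instead of a full (max_len, max_len, dim) table indexed by (x, y).
emb_x = rng.standard_normal((max_len, dim)) * 0.02
emb_y = rng.standard_normal((max_len, dim)) * 0.02

def factorized_pos_embed(rows, cols):
    """Positional embeddings for an image whose patch grid is rows x cols
    (any aspect ratio up to max_len per side)."""
    ys, xs = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    return emb_x[xs.ravel()] + emb_y[ys.ravel()]   # (rows*cols, dim)

print(factorized_pos_embed(9, 14).shape)   # (126, 128): 9x14 grid, no square assumption
```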
Token dropping - (random omission of input patches during training)
→ to accelerate training
greater throughput/greater performance
enables mixed-resolution training by sampling from a distribution of image sizes, while retaining each image's original aspect ratio
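A sketch of random token dropping for one example, with an illustrative drop rate; keeping the indices of the surviving patches matters because their positional embeddings must follow them.

```python
import numpy as np

def drop_tokens(tokens, drop_rate, rng):
    """Randomly omit a fraction of an example's patch tokens during training;
    the kept subset is what gets packed into the training sequence."""
    n = tokens.shape[0]
    keep = max(1, int(round(n * (1.0 - drop_rate))))
    idx = np.sort(rng.choice(n, size=keep, replace=False))  # keep original patch order
    return tokens[idx], idx                                  # idx needed for pos-embeds

rng = np.random.default_rng(0)
tokens = rng.standard_normal((196, 128))      # e.g. a 14x14 patch grid
kept, idx = drop_tokens(tokens, drop_rate=0.25, rng=rng)
print(kept.shape)                             # (147, 128): ~25% of patches omitted
```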
Paper says the quadratic O(n²) part of the self-attention cost is minuscule compared to the MLP cost
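A back-of-the-envelope check of that claim using standard, approximate transformer FLOP counts; `n` and `d` below are illustrative values, not the paper's.

```python
# Rough per-layer FLOP estimate (standard transformer accounting, constants approximate):
#   attention projections (Q, K, V, out): ~4 * n * d^2
#   attention scores + weighted sum:      ~2 * n^2 * d   <- the only O(n^2) term
#   MLP with 4x expansion:                ~8 * n * d^2
def layer_flops(n, d):
    attn_proj = 4 * n * d * d
    attn_quad = 2 * n * n * d
    mlp = 8 * n * d * d
    return attn_quad, attn_proj + mlp

n, d = 256, 768                      # a packed-sequence length, ViT-B-like width
quad, rest = layer_flops(n, d)
print(quad / (quad + rest))          # ~0.05: the quadratic term is ~5% of the layer here
```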
things copied from ViT→
query key normalization
omission of biases
attention pooling
benefit from →
preserved aspect ratios
evaluate over many resolutions
but the main benefit is → the increase in the number of training examples
::: achieved through → a combination of → sampling multiple variable-resolution examples and token dropping, leading to variable-sized images
finetuning with variable resolutions >> a single resolution
Token dropping strategies-
1. continuously sampled token dropping rates
2. resolution-dependent token dropping rates
constant rate >> beta-distributed rate
_ = token dropping rate
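A sketch of the two rate schedules, assuming an illustrative Beta parameterization and a linear resolution-to-rate mapping; neither is necessarily the paper's exact choice.

```python
import numpy as np

rng = np.random.default_rng(0)

def sampled_drop_rate(base=0.3, a=4.0, b=10.0):
    """Continuously sampled token dropping rate: instead of one constant rate,
    draw a rate per example (here a Beta(a, b) draw centered around `base`;
    the distribution and parameters are illustrative)."""
    return float(np.clip(base + rng.beta(a, b) - a / (a + b), 0.0, 0.9))

def resolution_dependent_drop_rate(num_tokens, lo=0.1, hi=0.5, max_tokens=1024):
    """Resolution-dependent token dropping rate: drop more tokens from
    higher-resolution (longer) examples, interpolating linearly between lo and hi."""
    frac = min(num_tokens / max_tokens, 1.0)
    return lo + frac * (hi - lo)

print(sampled_drop_rate())                       # ~0.3 plus some per-example spread
print(resolution_dependent_drop_rate(196))       # small image -> lower drop rate
print(resolution_dependent_drop_rate(1024))      # large image -> drop rate of `hi`
```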
3D spatio-temporal patches (“tubelets”)
central frame embedding
RELATED WORK -
→ feature maps at multiple spatial scales
compute can't be scaled by reducing resolution after tuning at a higher resolution
large batches with coarse spatiotemporal resolution + finer resolution later