Resizing images to a fixed resolution is a suboptimal choice → obsolete
BUT ViT → gives flexible sequence-based modeling
:: + varying input lengths
Native Resolution ViT (NaViT) →
uses sequence packing during training to process inputs of arbitrary resolutions + aspect ratios.
Principle of ViT→ splitting an image into patches,
→ each of which is linearly projected into a token
(but input images are resized to a fixed square aspect ratio and then split into a fixed number of patches)
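A minimal sketch of that standard ViT patching step, assuming a 224×224 input and 16×16 patches; the projection matrix here is random, standing in for the learned linear projection.

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an HxWxC image into non-overlapping patch_size x patch_size patches,
    each flattened into a vector (standard ViT patching)."""
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0
    h, w = H // patch_size, W // patch_size
    patches = image.reshape(h, patch_size, w, patch_size, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(h * w, patch_size * patch_size * C)
    return patches  # (num_patches, patch_dim)

rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))             # fixed square input in vanilla ViT
patches = patchify(image)                              # (196, 768)
W_proj = rng.standard_normal((patches.shape[1], 768))  # learned in practice; random here
tokens = patches @ W_proj                              # (196, 768) -> one token per patch
```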
multiple patch sizes = smooth variation of input length
→ via random sampling of a patch size at each training step + a resizing algorithm to allow
→ the initial convolutional embedding to support multiple patch sizes
Pix2Struct → alternative patching approach which preserves the aspect ratio
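A sketch of the aspect-ratio-preserving idea behind Pix2Struct-style patching, assuming a patch size and a maximum patch budget; the exact heuristics used by Pix2Struct may differ in detail.

```python
import math

def aspect_preserving_grid(h, w, patch_size=16, max_patches=256):
    """Pick a patch grid that preserves the image's aspect ratio while keeping
    the total number of patches <= max_patches."""
    # Scale factor so that (h*s/patch) * (w*s/patch) ~= max_patches
    scale = math.sqrt(max_patches * patch_size * patch_size / (h * w))
    rows = max(1, math.floor(h * scale / patch_size))
    cols = max(1, math.floor(w * scale / patch_size))
    return rows, cols  # image is then resized to (rows*patch_size, cols*patch_size)

print(aspect_preserving_grid(480, 640))   # (13, 18) -> 234 patches, 3:4-ish ratio kept
```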
Principle of NaViT → patches from multiple different images are packed into a single sequence
→ Patch n' Pack
: enables variable resolution while preserving the aspect ratio
::: inspired by — example packing in NLP,
where multiple examples are packed into a single sequence to accommodate efficient training on variable-length inputs
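A minimal sketch of example packing, assuming a fixed target sequence length and a simple greedy first-fit policy; `pack_examples` and `max_seq_len` are illustrative names, and the real packing heuristic may differ.

```python
def pack_examples(token_counts, max_seq_len):
    """Greedy first-fit packing: group variable-length examples so each packed
    sequence holds as many examples as fit within max_seq_len; leftover slots
    become padding."""
    packs = []          # list of lists of example indices
    remaining = []      # free token slots left in each pack
    for idx, n in enumerate(token_counts):
        for p, free in enumerate(remaining):
            if n <= free:
                packs[p].append(idx)
                remaining[p] -= n
                break
        else:
            packs.append([idx])
            remaining.append(max_seq_len - n)
    return packs, remaining

# Images with different resolutions/aspect ratios -> different token counts.
counts = [196, 49, 120, 64, 256, 30]
packs, free = pack_examples(counts, max_seq_len=256)
print(packs)  # [[0, 1], [2, 3, 5], [4]] with a little padding per sequence
```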
Observations →
1) randomly sampling resolutions at training time reduces cost
2) good performance across a wide range of resolutions
3) fixed batch shapes from example packing gave new ideas →
aspect ratio preserving resolution sampling
variable token dropping rates
adaptive computation
To handle different image sizes, 2 things are typically done
1) resizing
2) padding the image
BUT FLAWED - 1) resizing hampers performance 2) padding is inefficient
+ most images are typically not square
Self-attention mask - to prevent examples from attending to each other
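A sketch of how such a mask can be built from per-token example IDs, assuming the packed sequence carries an example ID for every token; names are illustrative.

```python
import numpy as np

def cross_example_mask(example_ids):
    """Boolean (seq, seq) mask that is True only where query and key tokens belong
    to the same packed example; applied to the attention logits so examples packed
    into one sequence cannot attend to each other."""
    ids = np.asarray(example_ids)
    return ids[:, None] == ids[None, :]

# Tokens of 3 packed examples (0, 0, 1, 1, 1, 2) in one sequence:
mask = cross_example_mask([0, 0, 1, 1, 1, 2])
# logits = logits + np.where(mask, 0.0, -1e9)   # typical use inside attention
```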
Self-attention pooling - pool the token representations within each example, resulting in a single vector representation per example in the sequence
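A small sketch of per-example pooling under the same example-ID masking, using a single (here random) query vector; the paper's pooling head is more elaborate, so treat this as a simplified stand-in.

```python
import numpy as np

def per_example_attention_pool(tokens, example_ids, query):
    """Pool token representations within each packed example into one vector:
    one query scores every token, the softmax is restricted to tokens of the same
    example, and each example gets its own weighted sum."""
    ids = np.asarray(example_ids)
    scores = tokens @ query                      # (seq,)
    pooled = {}
    for ex in np.unique(ids):
        sel = ids == ex
        w = np.exp(scores[sel] - scores[sel].max())
        w = w / w.sum()                          # softmax over this example only
        pooled[int(ex)] = w @ tokens[sel]        # (dim,) representation per example
    return pooled

rng = np.random.default_rng(0)
tokens = rng.standard_normal((6, 8))             # 6 packed tokens, dim 8
pooled = per_example_attention_pool(tokens, [0, 0, 1, 1, 1, 2], rng.standard_normal(8))
print({k: v.shape for k, v in pooled.items()})   # {0: (8,), 1: (8,), 2: (8,)}
```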
Factorized positional embeddings -
2D positional embeddings - where embeddings of size (maxLen, maxLen) are learned,
and indexed with the (x, y) coordinates of each patch
Factorized positional embeddings -
decompose into separate embeddings of the x and y coordinates
Consider - learned embeddings, sinusoidal embeddings, learned Fourier positional embeddings
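A minimal sketch of the factorized learned variant, assuming a maximum grid side `max_len` and embedding width `dim`; fractional coordinates and the sinusoidal/Fourier alternatives are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
max_len, dim = 64, 128                    # max patches per side, embedding width

# Factorized positional embeddings: one table for x, one for y, summed per patch,
# instead of a full (max_len, max_len, dim) table indexed by (x, y).
emb_x = rng.standard_normal((max_len, dim)) * 0.02
emb_y = rng.standard_normal((max_len, dim)) * 0.02

def factorized_pos_embed(rows, cols):
    """Positional embeddings for an image whose patch grid is rows x cols
    (any aspect ratio up to max_len per side)."""
    ys, xs = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    return emb_x[xs.ravel()] + emb_y[ys.ravel()]   # (rows*cols, dim)

print(factorized_pos_embed(9, 14).shape)   # (126, 128): 9x14 grid, no square assumption
```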
Token dropping - (random omission of input patches during training)
→ to accelerate training
greater throughput/greater performance
enables mixed-resolution training by sampling from a distribution of image sizes, while retaining each image's original aspect ratio
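A sketch of random token dropping for one example, with an illustrative drop rate; keeping the indices of the surviving patches matters because their positional embeddings must follow them.

```python
import numpy as np

def drop_tokens(tokens, drop_rate, rng):
    """Randomly omit a fraction of an example's patch tokens during training;
    the kept subset is what gets packed into the training sequence."""
    n = tokens.shape[0]
    keep = max(1, int(round(n * (1.0 - drop_rate))))
    idx = np.sort(rng.choice(n, size=keep, replace=False))  # keep original patch order
    return tokens[idx], idx                                  # idx needed for pos-embeds

rng = np.random.default_rng(0)
tokens = rng.standard_normal((196, 128))      # e.g. a 14x14 patch grid
kept, idx = drop_tokens(tokens, drop_rate=0.25, rng=rng)
print(kept.shape)                             # (147, 128): ~25% of patches omitted
```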
Paper says the quadratic O(n²) part of the self-attention cost is minuscule compared to the MLP cost
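A back-of-the-envelope check of that claim using standard, approximate transformer FLOP counts; `n` and `d` below are illustrative values, not the paper's.

```python
# Rough per-layer FLOP estimate (standard transformer accounting, constants approximate):
#   attention projections (Q, K, V, out): ~4 * n * d^2
#   attention scores + weighted sum:      ~2 * n^2 * d   <- the only O(n^2) term
#   MLP with 4x expansion:                ~8 * n * d^2
def layer_flops(n, d):
    attn_proj = 4 * n * d * d
    attn_quad = 2 * n * n * d
    mlp = 8 * n * d * d
    return attn_quad, attn_proj + mlp

n, d = 256, 768                      # a packed-sequence length, ViT-B-like width
quad, rest = layer_flops(n, d)
print(quad / (quad + rest))          # ~0.05: the quadratic term is ~5% of the layer here
```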
things copied from ViT→
query key normalization
omission of biases
attention pooling
benefit from →
preserved aspect ratios
evaluate over many resolutions
but the main benefit is → the increase in the number of training examples
::: achieved through → a combination of → sampling multiple variable-resolution examples and token dropping, leading to variable-sized images
finetuning with variable resolutions >> a single resolution
Token dropping strategies-
1. continuously sampled token dropping rates
2. resolution-dependent token dropping rates
constant rate >> beta-distributed rate
_ = token dropping rate
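A sketch of the two rate schedules, assuming an illustrative Beta parameterization and a linear resolution-to-rate mapping; neither is necessarily the paper's exact choice.

```python
import numpy as np

rng = np.random.default_rng(0)

def sampled_drop_rate(base=0.3, a=4.0, b=10.0):
    """Continuously sampled token dropping rate: instead of one constant rate,
    draw a rate per example (here a Beta(a, b) draw centered around `base`;
    the distribution and parameters are illustrative)."""
    return float(np.clip(base + rng.beta(a, b) - a / (a + b), 0.0, 0.9))

def resolution_dependent_drop_rate(num_tokens, lo=0.1, hi=0.5, max_tokens=1024):
    """Resolution-dependent token dropping rate: drop more tokens from
    higher-resolution (longer) examples, interpolating linearly between lo and hi."""
    frac = min(num_tokens / max_tokens, 1.0)
    return lo + frac * (hi - lo)

print(sampled_drop_rate())                       # ~0.3 plus some per-example spread
print(resolution_dependent_drop_rate(196))       # small image -> lower drop rate
print(resolution_dependent_drop_rate(1024))      # large image -> drop rate of `hi`
```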
3D spatio-temporal patches (“tubelets”)
central frame embedding
RELATED WORK -
→ feature maps at multiple spatial scales
compute can't be scaled by reducing resolution after tuning at a higher resolution
large batches with coarse spatiotemporal resolution + finer resolution later