Video Vision Transformer (ViViT)

We present pure-transformer based models for video classification, drawing upon the recent success of such models in image classification.
The model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers
to handle the long sequences of tokens in video → variants of the model which factorise the spatial and temporal dimensions of the input
although transformer models are known to be effective only when large-scale data is available → how to effectively regularise the model during training and leverage pretrained image models, to be able to run on comparatively small datasets
State-of-the-art results on several video classification benchmarks - Kinetics 400 and 600, Epic Kitchens, Something-Something v2 and Moments in Time, outperforming prior methods based on deep 3D convolutional networks.
 
Since AlexNet, many approaches based on deep convolutional networks have advanced the SOTA.
 
Transformer vs CNN - self-attention gives every token a global "receptive field" from the very first layer. This is in stark contrast to convolutions, where the corresponding receptive field is limited and grows linearly with the depth of the network.
Hence, approaches to - 1) integrate transformers into CNNs, 2) replace convolutions entirely
 
Main benefits of transformers were observed at large scale → as transformers lack the inductive biases of convolutions (like translational equivariance) —> which means they require either more data or stronger regularisation
Several transformer-based models for video classification are developed here → but the previous best models were based on: — deep 3D convolutional architectures :: which were extensions of image-classification CNNs —> these models were later modified by incorporating self-attention into their later layers to better capture long-range dependencies
main operation - self-attention → computed on: a sequence of spatio-temporal tokens extracted from the input video :: to process the large number of tokens - innovations in methods to factorise the model along spatial and temporal dimensions → to train on small data - how to regularise the model during training and leverage pretrained image models
 
💡
As pure transformer models present different characteristics, we need to determine the best design choices for each architecture
 
 
 
3 analyses done - 1) tokenisation strategies 2) model architecture 3) regularisation methods
 
 
Evolution of video architectures:
hand-crafted features to encode appearance and motion info
2D image CNNs as two-stream networks - RGB frames + optical flow images
spatio-temporal 3D CNNs - have more params + require larger training datasets
3D CNNs need more computation → reduced by factorising convolutions across the spatial and temporal dimensions + using grouped convolutions (see the sketch below)
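To make the factorised-convolution idea concrete, here is a minimal sketch (PyTorch; the class and argument names are mine, not from any specific paper) of splitting a 3D convolution into a 2D spatial convolution followed by a 1D temporal convolution:

```python
import torch
import torch.nn as nn

class FactorisedConv3d(nn.Module):
    """A 3D conv factorised into spatial (1 x k x k) + temporal (k x 1 x 1) convs."""
    def __init__(self, in_ch, mid_ch, out_ch, k=3):
        super().__init__()
        # Spatial convolution: acts only on H and W of the (T, H, W) volume
        self.spatial = nn.Conv3d(in_ch, mid_ch, kernel_size=(1, k, k),
                                 padding=(0, k // 2, k // 2))
        # Temporal convolution: acts only on T
        self.temporal = nn.Conv3d(mid_ch, out_ch, kernel_size=(k, 1, 1),
                                  padding=(k // 2, 0, 0))
        self.relu = nn.ReLU()

    def forward(self, x):  # x: (B, C, T, H, W)
        return self.temporal(self.relu(self.spatial(x)))

x = torch.randn(2, 3, 8, 32, 32)        # batch of 8-frame clips
y = FactorisedConv3d(3, 16, 32)(x)      # -> (2, 32, 8, 32, 32)
```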
transformer - self-attention, layer normalisation, MLP
Variants of the transformer - to reduce the computational cost of self-attention when processing longer sequences + to improve parameter efficiency
Self-attention has been used extensively in CV, but typically only in the later layers / to augment residual blocks within a ResNet architecture
pure-transformer image classification + video classification took off after ViT - a paper which used only transformers (ditched convolutions)
 
do additional regularisation + use pretrained models
What the ViT-style video model does → models pairwise interactions bw all spatio-temporal tokens { idea is pretty neat - split the video into patches, embed and combine them, and then you have a sequence, and then send it to the transformer }
 
{ first step is embedding the video into tokens; then, after you have the tokens, how are you going to send them to the transformer architecture for further processing - a tokenisation sketch follows below } {{{ what I think they are doing is factorising the spatial and temporal dimensions of the video at various levels of the transformer architecture }}}
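A minimal sketch of the tokenisation step, assuming PyTorch - this follows ViViT's tubelet-embedding idea (a 3D convolution whose kernel and stride equal the tubelet size, so tubelets don't overlap); the hyperparameters are illustrative:

```python
import torch
import torch.nn as nn

class TubeletEmbed(nn.Module):
    """Map non-overlapping t x p x p spatio-temporal tubelets to tokens."""
    def __init__(self, in_ch=3, dim=768, t=2, p=16):
        super().__init__()
        # kernel == stride -> each tubelet becomes exactly one token
        self.proj = nn.Conv3d(in_ch, dim, kernel_size=(t, p, p),
                              stride=(t, p, p))

    def forward(self, video):  # video: (B, C, T, H, W)
        x = self.proj(video)                 # (B, dim, T/t, H/p, W/p)
        return x.flatten(2).transpose(1, 2)  # (B, num_tokens, dim)

tokens = TubeletEmbed()(torch.randn(1, 3, 16, 224, 224))
print(tokens.shape)  # torch.Size([1, 1568, 768])  (8 * 14 * 14 tokens)
```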
 
 
first archi → just one long sequence: concatenate all spatio-temporal tokens and run full pairwise self-attention over them
second archi → (called Factorised encoder) first send tokens through spatial layers, → then send the per-frame representations into temporal layers —> late fusion of temporal information (see the sketch below)
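A rough sketch of the Factorised encoder, assuming PyTorch. The real model uses CLS tokens and positional embeddings, which are omitted here; each frame's representation is taken as a mean over its tokens for simplicity:

```python
import torch
import torch.nn as nn

class FactorisedEncoder(nn.Module):
    """Spatial transformer per frame, then temporal transformer across frames."""
    def __init__(self, dim=192, heads=3, depth=2):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.spatial = nn.TransformerEncoder(layer(), depth)
        self.temporal = nn.TransformerEncoder(layer(), depth)

    def forward(self, x):  # x: (B, T, N, D) - N tokens per frame
        b, t, n, d = x.shape
        x = self.spatial(x.reshape(b * t, n, d))     # attend within each frame
        frame_repr = x.mean(dim=1).reshape(b, t, d)  # one vector per frame
        return self.temporal(frame_repr)             # attend across frames (late fusion)

out = FactorisedEncoder()(torch.randn(2, 8, 196, 192))
print(out.shape)  # torch.Size([2, 8, 192])
```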
Four variants of computing attention:
First
self-attention between all spatial and temporal tokens jointly
Second
(Factorised encoder) spatial self-attention within each frame first, then self-attention only between the temporal (per-frame) representations
Third
(Factorised self-attention) within each transformer block, self-attention computed over the spatial and temporal dimensions separately - a sketch follows below
Fourth
(Factorised dot-product attention) the heads of the multi-head dot-product attention operation are split, some attending over the spatial dimension and some over the temporal dimension
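A minimal sketch of the third variant (factorised self-attention), assuming PyTorch; residual connections, LayerNorm and the MLP are omitted to show just the spatial-then-temporal attention pattern:

```python
import torch
import torch.nn as nn

class FactorisedSelfAttention(nn.Module):
    """Attention over the spatial dim, then over the temporal dim, in one block."""
    def __init__(self, dim=192, heads=3):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):  # x: (B, T, N, D)
        b, t, n, d = x.shape
        # Spatial: each frame's N tokens attend to each other
        xs = x.reshape(b * t, n, d)
        xs, _ = self.spatial_attn(xs, xs, xs)
        # Temporal: each spatial location's T tokens attend to each other
        xt = xs.reshape(b, t, n, d).permute(0, 2, 1, 3).reshape(b * n, t, d)
        xt, _ = self.temporal_attn(xt, xt, xt)
        return xt.reshape(b, n, t, d).permute(0, 2, 1, 3)  # back to (B, T, N, D)

y = FactorisedSelfAttention()(torch.randn(2, 8, 196, 192))
print(y.shape)  # torch.Size([2, 8, 196, 192])
```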
Because of fewer labelled examples - training video models from scratch to high accuracy is difficult → hence → video models are initialised from pretrained image models. But questions —> how to initialise parameters not present in / incompatible with image models (e.g. the 3D tubelet-embedding filters - see the sketch below)
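One of the paper's strategies for this is "central frame initialisation" of the tubelet-embedding filters: the pretrained 2D patch-embedding weights are copied into the central temporal slice and the rest is zero, so at initialisation the model behaves like per-frame embedding. A minimal sketch (PyTorch tensors; the function name is mine):

```python
import torch

def central_frame_init(w2d, t):
    """Inflate a 2D patch-embedding filter (D, C, P, P) into a 3D tubelet
    filter (D, C, t, P, P), zero everywhere except the central temporal slice."""
    d, c, p, _ = w2d.shape
    w3d = torch.zeros(d, c, t, p, p)
    w3d[:, :, t // 2] = w2d  # copy 2D weights into the central temporal index
    return w3d

w2d = torch.randn(768, 3, 16, 16)   # stand-in for pretrained ViT patch embedding
w3d = central_frame_init(w2d, t=2)  # tubelet-embedding weights, shape (768, 3, 2, 16, 16)
```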
 
 
 