Video Vision Transformer (ViViT)

We present pure-transformer based models for video classification, drawing upon the recent success of such models in image classification.
The model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers
to handle the long sequences of tokens in video → variants of the model which factorise the spatial and temporal dimensions of the input
although transformer models are known to be effective only when large-scale data is available → how to effectively regularise the model during training and leverage pretrained image models, to be able to run on comparatively small datasets
State-of-the-art results on several video classification benchmarks - Kinetics 400 and 600, Epic Kitchens, Something-Something v2 and Moments in Time, outperforming prior methods based on deep 3D convolutional networks.
 
Since AlexNet, many approaches based on deep convolutional networks have advanced the SOTA.
 
Transformer vs CNN - self-attention gives every token a global "receptive field" from the very first layer. This is in stark contrast to convolutions, where the corresponding receptive field is limited and grows linearly with the depth of the network.
Hence, approaches to - 1) integrate transformers into CNNs, 2) replace convolutions entirely
 
Main benefits of transformers were observed at large scale → as transformers lack the inductive biases of convolutions (like translational equivariance) —> which means they require either more data or stronger regularisation
Several transformer-based models for video classification are developed here → but the previous best models were based on: — deep 3D convolutional architectures :: which were extensions of image-classification CNNs —> these models were later modified by incorporating self-attention into their later layers to better capture long-range dependencies
main operation - self-attention → computed on: a sequence of spatio-temporal tokens extracted from the input video :: to process the large number of tokens - innovations in methods to factorise the model along spatial and temporal dimensions → to train on small data - how to regularise the model during training and leverage pretrained image models
 
💡
As pure transformer models present different characteristics, we need to determine the best design choices for each architecture
 
 
 
3 analyses done - 1) tokenisation strategies 2) model architecture 3) regularisation methods
 
 
Evolution of video architectures:
hand-crafted features to encode appearance and motion info
2D image CNNs as two-stream networks - RGB frames + optical flow images
spatio-temporal 3D CNNs - have more params + require larger training datasets
3D CNNs need more computation → reduced by factorising convolutions across the spatial and temporal dimensions + using grouped convolutions (see the sketch below)
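To make the factorised-convolution idea concrete, here is a minimal sketch (PyTorch; the class and argument names are mine, not from any specific paper) of splitting a 3D convolution into a 2D spatial convolution followed by a 1D temporal convolution:

```python
import torch
import torch.nn as nn

class FactorisedConv3d(nn.Module):
    """A 3D conv factorised into spatial (1 x k x k) + temporal (k x 1 x 1) convs."""
    def __init__(self, in_ch, mid_ch, out_ch, k=3):
        super().__init__()
        # Spatial convolution: acts only on H and W of the (T, H, W) volume
        self.spatial = nn.Conv3d(in_ch, mid_ch, kernel_size=(1, k, k),
                                 padding=(0, k // 2, k // 2))
        # Temporal convolution: acts only on T
        self.temporal = nn.Conv3d(mid_ch, out_ch, kernel_size=(k, 1, 1),
                                  padding=(k // 2, 0, 0))
        self.relu = nn.ReLU()

    def forward(self, x):  # x: (B, C, T, H, W)
        return self.temporal(self.relu(self.spatial(x)))

x = torch.randn(2, 3, 8, 32, 32)        # batch of 8-frame clips
y = FactorisedConv3d(3, 16, 32)(x)      # -> (2, 32, 8, 32, 32)
```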
transformer - self-attention, layer normalisation, MLP
Variants of the transformer - to reduce the computational cost of self-attention when processing longer sequences + to improve parameter efficiency
Self-attention has been used extensively in CV, but typically only in the later layers / to augment residual blocks within a ResNet architecture
pure-transformer image classification + video classification took off after ViT - a paper which used only transformers (ditched convolutions)
 
do additional regularisation + use pretrained models
What the ViT-style video model does → models pairwise interactions bw all spatio-temporal tokens { idea is pretty neat - split the video into patches, embed and combine them, and then you have a sequence, and then send it to the transformer }
 
{ first step is embedding the video into tokens; then, after you have the tokens, how are you going to send them to the transformer architecture for further processing - a tokenisation sketch follows below } {{{ what I think they are doing is factorising the spatial and temporal dimensions of the video at various levels of the transformer architecture }}}
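A minimal sketch of the tokenisation step, assuming PyTorch - this follows ViViT's tubelet-embedding idea (a 3D convolution whose kernel and stride equal the tubelet size, so tubelets don't overlap); the hyperparameters are illustrative:

```python
import torch
import torch.nn as nn

class TubeletEmbed(nn.Module):
    """Map non-overlapping t x p x p spatio-temporal tubelets to tokens."""
    def __init__(self, in_ch=3, dim=768, t=2, p=16):
        super().__init__()
        # kernel == stride -> each tubelet becomes exactly one token
        self.proj = nn.Conv3d(in_ch, dim, kernel_size=(t, p, p),
                              stride=(t, p, p))

    def forward(self, video):  # video: (B, C, T, H, W)
        x = self.proj(video)                 # (B, dim, T/t, H/p, W/p)
        return x.flatten(2).transpose(1, 2)  # (B, num_tokens, dim)

tokens = TubeletEmbed()(torch.randn(1, 3, 16, 224, 224))
print(tokens.shape)  # torch.Size([1, 1568, 768])  (8 * 14 * 14 tokens)
```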
 
 
first archi → just one long sequence: concatenate all spatio-temporal tokens and run full pairwise self-attention over them
second archi → (called Factorised encoder) first send tokens through spatial layers, → then send the per-frame representations into temporal layers —> late fusion of temporal information (see the sketch below)
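A rough sketch of the Factorised encoder, assuming PyTorch. The real model uses CLS tokens and positional embeddings, which are omitted here; each frame's representation is taken as a mean over its tokens for simplicity:

```python
import torch
import torch.nn as nn

class FactorisedEncoder(nn.Module):
    """Spatial transformer per frame, then temporal transformer across frames."""
    def __init__(self, dim=192, heads=3, depth=2):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.spatial = nn.TransformerEncoder(layer(), depth)
        self.temporal = nn.TransformerEncoder(layer(), depth)

    def forward(self, x):  # x: (B, T, N, D) - N tokens per frame
        b, t, n, d = x.shape
        x = self.spatial(x.reshape(b * t, n, d))     # attend within each frame
        frame_repr = x.mean(dim=1).reshape(b, t, d)  # one vector per frame
        return self.temporal(frame_repr)             # attend across frames (late fusion)

out = FactorisedEncoder()(torch.randn(2, 8, 196, 192))
print(out.shape)  # torch.Size([2, 8, 192])
```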
Four variants of computing attention:
First
self-attention between all spatial and temporal tokens jointly
Second
(Factorised encoder) spatial self-attention within each frame first, then self-attention only between the temporal (per-frame) representations
Third
(Factorised self-attention) within each transformer block, self-attention computed over the spatial and temporal dimensions separately - a sketch follows below
Fourth
(Factorised dot-product attention) the heads of the multi-head dot-product attention operation are split, some attending over the spatial dimension and some over the temporal dimension
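A minimal sketch of the third variant (factorised self-attention), assuming PyTorch; residual connections, LayerNorm and the MLP are omitted to show just the spatial-then-temporal attention pattern:

```python
import torch
import torch.nn as nn

class FactorisedSelfAttention(nn.Module):
    """Attention over the spatial dim, then over the temporal dim, in one block."""
    def __init__(self, dim=192, heads=3):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):  # x: (B, T, N, D)
        b, t, n, d = x.shape
        # Spatial: each frame's N tokens attend to each other
        xs = x.reshape(b * t, n, d)
        xs, _ = self.spatial_attn(xs, xs, xs)
        # Temporal: each spatial location's T tokens attend to each other
        xt = xs.reshape(b, t, n, d).permute(0, 2, 1, 3).reshape(b * n, t, d)
        xt, _ = self.temporal_attn(xt, xt, xt)
        return xt.reshape(b, n, t, d).permute(0, 2, 1, 3)  # back to (B, T, N, D)

y = FactorisedSelfAttention()(torch.randn(2, 8, 196, 192))
print(y.shape)  # torch.Size([2, 8, 196, 192])
```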
Because of fewer labelled examples - training video models from scratch to high accuracy is difficult → hence → video models are initialised from pretrained image models. But questions —> how to initialise parameters not present in / incompatible with image models (e.g. the 3D tubelet-embedding filters - see the sketch below)
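One of the paper's strategies for this is "central frame initialisation" of the tubelet-embedding filters: the pretrained 2D patch-embedding weights are copied into the central temporal slice and the rest is zero, so at initialisation the model behaves like per-frame embedding. A minimal sketch (PyTorch tensors; the function name is mine):

```python
import torch

def central_frame_init(w2d, t):
    """Inflate a 2D patch-embedding filter (D, C, P, P) into a 3D tubelet
    filter (D, C, t, P, P), zero everywhere except the central temporal slice."""
    d, c, p, _ = w2d.shape
    w3d = torch.zeros(d, c, t, p, p)
    w3d[:, :, t // 2] = w2d  # copy 2D weights into the central temporal index
    return w3d

w2d = torch.randn(768, 3, 16, 16)   # stand-in for pretrained ViT patch embedding
w3d = central_frame_init(w2d, t=2)  # tubelet-embedding weights, shape (768, 3, 2, 16, 16)
```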
 
 
 