We present pure-transformer based models for video classification, drawing upon the recent success of such models in image classification.
The model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers.
to handle the long sequences of tokens in video →
efficient variants of the model which factorise the spatial and temporal dimensions of the input
although transformer models are known to be effective only when large-scale data is available,
→ the paper shows how to effectively regularise the model during training and leverage pretrained image models so it can be trained on comparatively small datasets
Results reported on multiple video classification benchmarks -
Kinetics 400 and 600,
Epic Kitchens,
Something-Something v2 and
Moments in Time,
outperforming prior methods based on deep 3D convolutional networks.
Since AlexNet, many approaches based on deep convolutional networks have advanced the state of the art.
transformer vs CNN
Self-attention can model long-range dependencies across all tokens from the very first layer. This is in stark contrast to convolutions, where the corresponding "receptive field" is limited and grows linearly with the depth of the network.
Hence, approaches to -
1) integrate transformers into CNNs
2) replace convolutions entirely
Main benefits of transformers were observed only at large scale -
→ as transformers lack the inductive biases of convolutions (like translational equivariance)
—> so they either require more data or stronger regularisation
The paper develops several transformer-based models for video classification.
→ until now, the best-performing video models have been based on:
— deep 3D convolutional architectures :: which were
extensions of image classification CNNs
—> more recently, these models were augmented by incorporating self-attention into their later layers to better capture long-range dependencies
main operation - self-attention
→ computed on: a sequence of spatio-temporal tokens extracted from the input video
:: to process the large number of tokens - new methods to factorise the model along the spatial and temporal dimensions
→ to train on small datasets - show how to regularise the model during training and how to leverage pretrained image models
As pure-transformer models have different characteristics from convolutional ones, we need to determine the best design choices for each architecture.
3 analyses done -
1) tokenisation strategies
2) model architecture
3) regularisation methods
hand-crafted features to encode appearance and motion info
2D image CNNs as two-stream networks - RGB frames + optical flow images
spatio-temporal 3D CNNs - have more params + require larger training datasets
3D CNNs need more computation - reduced by factorising convolutions across the spatial and temporal dimensions + using grouped convolutions (see the sketch below)
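{ a minimal sketch of the factorisation idea in PyTorch - a full 3D convolution replaced by a 2D spatial conv followed by a 1D temporal conv; layer sizes are illustrative, not from any specific model }

```python
import torch
import torch.nn as nn

class Factorised3DConv(nn.Module):
    """Sketch of a factorised (2+1)D block: a spatial conv over each frame,
    followed by a conv over time, instead of one full 3D convolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # spatial convolution: kernel (time=1, height=3, width=3)
        self.spatial = nn.Conv3d(in_ch, out_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        # temporal convolution: kernel (time=3, height=1, width=1)
        self.temporal = nn.Conv3d(out_ch, out_ch, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        self.relu = nn.ReLU()

    def forward(self, x):               # x: (batch, channels, time, height, width)
        return self.relu(self.temporal(self.relu(self.spatial(x))))

# usage: an 8-frame, 32x32 clip with 3 channels
clip = torch.randn(2, 3, 8, 32, 32)
out = Factorised3DConv(3, 16)(clip)    # -> (2, 16, 8, 32, 32)
```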
transformer - self-attention, layer normalisation, MLP (a minimal block is sketched below)
variants of the transformer - proposed to reduce the computational cost of self-attention
→ when processing longer sequences + to improve parameter efficiency
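{ a minimal pre-norm transformer encoder block, just to make the three components above concrete - widths and head counts are illustrative }

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Minimal pre-norm transformer encoder block: self-attention + MLP,
    each preceded by layer normalisation and wrapped in a residual connection."""
    def __init__(self, dim=192, heads=3, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Linear(dim * mlp_ratio, dim)
        )

    def forward(self, x):                                   # x: (batch, tokens, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]   # self-attention + residual
        x = x + self.mlp(self.norm2(x))                     # MLP + residual
        return x

tokens = torch.randn(2, 196, 192)      # e.g. 196 tokens of width 192
out = TransformerBlock()(tokens)       # same shape out
```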
Self-attention has been used extensively in CV, but typically only in the later layers / to augment residual blocks within a ResNet architecture
image classification + video classification - this changed after ViT, a paper which used only transformers (ditched convolutions entirely)
→ needs additional regularisation + pretrained models when labelled data is limited
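{ the notes don't name the regularisers; as an illustration, two commonly used ones - label smoothing and mixup - look roughly like this in PyTorch }

```python
import torch
import torch.nn as nn

# Label smoothing: soften the one-hot targets inside the cross-entropy loss.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

# Mixup: blend pairs of clips and their labels with a random coefficient.
def mixup_loss(model, x, y, alpha=0.2):
    lam = float(torch.distributions.Beta(alpha, alpha).sample())
    perm = torch.randperm(x.size(0))
    logits = model(lam * x + (1 - lam) * x[perm])
    return lam * criterion(logits, y) + (1 - lam) * criterion(logits, y[perm])

# toy usage with a linear "model" over flattened 8-frame clips
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 32 * 32, 10))
clips = torch.randn(4, 3, 8, 32, 32)
labels = torch.randint(0, 10, (4,))
loss = mixup_loss(model, clips, labels)
```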
What the ViT-based video model does → models pairwise interactions between all spatio-temporal tokens
{ the idea is pretty neat - split the video into patches, embed them, and then you have a sequence of tokens that you can send to the transformer }
{ the first step is embedding the video into tokens; then, once you have the tokens, the question is how to send them through the transformer architecture for further processing }
{{{ what I think they are doing is factorising the spatial and temporal dimensions of the video at various levels of the transformer architecture }}}
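{ a sketch of one way to embed a video into tokens - "tubelet" embedding via a strided 3D convolution; patch sizes and widths here are illustrative assumptions }

```python
import torch
import torch.nn as nn

class TubeletEmbedding(nn.Module):
    """Sketch of tubelet tokenisation: a 3D convolution with stride equal to its
    kernel extracts non-overlapping spatio-temporal patches and linearly projects
    each one to a token."""
    def __init__(self, dim=192, t=2, h=16, w=16):
        super().__init__()
        self.proj = nn.Conv3d(3, dim, kernel_size=(t, h, w), stride=(t, h, w))

    def forward(self, video):                  # video: (batch, 3, frames, height, width)
        x = self.proj(video)                   # (batch, dim, nt, nh, nw)
        return x.flatten(2).transpose(1, 2)    # (batch, nt*nh*nw, dim): the token sequence

video = torch.randn(2, 3, 16, 224, 224)        # 16-frame clip
tokens = TubeletEmbedding()(video)             # -> (2, 8*14*14, 192) = (2, 1568, 192)
```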
first architecture →
just a concatenation of all spatio-temporal tokens into one sequence, attended to jointly
second architecture → (called the Factorised Encoder, sketched below)
first send tokens through the spatial layers,
→ then send the resulting per-frame representations through the temporal layers
—> late fusion of temporal information
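{ a sketch of the factorised-encoder idea - spatial transformer per frame, pool each frame, temporal transformer across frames; layer counts and widths are illustrative, not the paper's exact configuration }

```python
import torch
import torch.nn as nn

class FactorisedEncoder(nn.Module):
    """Sketch of the factorised encoder: a spatial transformer processes the tokens
    of each frame independently, each frame is pooled to one vector, and a temporal
    transformer then fuses information across frames (late fusion)."""
    def __init__(self, dim=192, heads=3, depth=2):
        super().__init__()
        def encoder():
            layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
            return nn.TransformerEncoder(layer, depth)
        self.spatial = encoder()
        self.temporal = encoder()

    def forward(self, x):                            # x: (batch, frames, tokens_per_frame, dim)
        b, t, n, d = x.shape
        x = self.spatial(x.reshape(b * t, n, d))     # spatial attention within each frame
        frame_repr = x.mean(dim=1).reshape(b, t, d)  # pool each frame to a single vector
        return self.temporal(frame_repr)             # temporal attention across frames

x = torch.randn(2, 8, 196, 192)                      # 8 frames x 196 tokens of width 192
out = FactorisedEncoder()(x)                         # -> (2, 8, 192)
```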
4 attention variants:
First | self-attention between all spatial and temporal tokens jointly |
second | self-attention only across the temporal dimension, on top of per-frame spatial encodings |
third | self-attention across the spatial and temporal dimensions separately (see the sketch below) |
fourth | a factorised multi-head dot-product attention operation, with different heads attending over the spatial and temporal dimensions |
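{ a sketch of the third variant, factorised self-attention - attention computed first among tokens of the same frame, then among tokens at the same spatial location across frames; dimensions are illustrative }

```python
import torch
import torch.nn as nn

class FactorisedSelfAttention(nn.Module):
    """Sketch of factorised self-attention: instead of attending over all
    spatio-temporal tokens at once, attend spatially (within a frame), then
    temporally (across frames at the same spatial location)."""
    def __init__(self, dim=192, heads=3):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def attend(self, attn, x):
        return x + attn(x, x, x, need_weights=False)[0]            # attention + residual

    def forward(self, x):                       # x: (batch, frames, tokens_per_frame, dim)
        b, t, n, d = x.shape
        x = self.attend(self.spatial_attn, x.reshape(b * t, n, d))  # spatial attention
        x = x.reshape(b, t, n, d).transpose(1, 2).reshape(b * n, t, d)
        x = self.attend(self.temporal_attn, x)                      # temporal attention
        return x.reshape(b, n, t, d).transpose(1, 2)                # back to (b, t, n, d)

x = torch.randn(2, 8, 196, 192)
out = FactorisedSelfAttention()(x)              # -> (2, 8, 196, 192)
```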
because of fewer labelled training examples - training video models from scratch to high accuracy is difficult
→ hence →
video models are typically initialised from
→ pretrained image models
But a question remains —> how to initialise parameters that are not present in, or are incompatible with, the image models
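{ a sketch of how such parameters could be initialised from a 2D image model - repeating the spatial positional embeddings over time, and a "central frame" initialisation of the tubelet-embedding filters; all shapes here are illustrative assumptions }

```python
import torch

# Hypothetical shapes: an image model trained on 14x14 = 196 patch positions,
# and a video model that adds nt = 8 temporal indices on top of them.
image_pos_embed = torch.randn(196, 192)        # (spatial_positions, dim) from the 2D model
nt = 8

# Positional embeddings: repeat the image model's spatial embeddings for every
# temporal index, so each frame starts with the same spatial layout.
video_pos_embed = image_pos_embed.unsqueeze(0).repeat(nt, 1, 1)   # (nt, 196, 192)

# Tubelet-embedding filters: inflate the 2D patch-embedding kernel along time.
# One option is central-frame initialisation: zeros everywhere except the centre
# temporal slice, which is copied from the 2D filter.
conv2d_weight = torch.randn(192, 3, 16, 16)    # (dim, channels, ph, pw) from the 2D model
t = 2                                          # temporal extent of a tubelet
conv3d_weight = torch.zeros(192, 3, t, 16, 16)
conv3d_weight[:, :, t // 2] = conv2d_weight    # copy the 2D filter into the central frame
```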