We have a bunch of videos, so how do we encode them and pass them through a transformer for a classification task?
Here’s how:
Let's say the input video is black and white (so there is 1 input channel),
and each frame is 28*28, and there are 28 frames in total (in simple terms, each video is roughly 3 seconds long).
So, here's how we convert these videos into a sequence of tokens for the transformer (akin to how ViViT works):
The input video shape is (28*28*28), i.e. 28 frames of 28*28 pixels each.
Let's say the kernel is 8*8*8 (yes, we are doing a 3D convolution here), and we choose 128 output channels (to capture more relations across the different dimensions of the video).
So the output size along each dimension is
floor((28 - 8 + 2*0 (padding)) / 8 (stride)) + 1 = floor(2.5) + 1 = 3
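A minimal PyTorch sketch of this 3D-convolution (tubelet embedding) step; the names `tubelet_embed` and `video` are illustrative assumptions, not from any particular implementation:

```python
import torch
import torch.nn as nn

# 3D convolution that turns non-overlapping 8*8*8 tubelets of the video
# into 128-dimensional embeddings (stride = kernel size, no padding).
tubelet_embed = nn.Conv3d(in_channels=1, out_channels=128, kernel_size=8, stride=8)

# A dummy batch of 32 black-and-white videos: (batch, channels, frames, height, width)
video = torch.randn(32, 1, 28, 28, 28)

features = tubelet_embed(video)
print(features.shape)  # torch.Size([32, 128, 3, 3, 3])  -> 3 = floor((28 - 8) / 8) + 1
```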
Video to be inputted to the transformer encoder
So the output is a 3*3*3 grid with 128 channels for each input video, i.e. a (3, 3, 3, 128) feature volume.
Then you add a positional encoding to each of the 27 tokens (3*3*3) of the input video. So, we are inputting to the transformer:
videos reduced to the shape →
(32, 27, 128), where 32 is the batch size, 27 is the number of tokens, and 128 is the embedding dimension of each token.
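Continuing the hypothetical snippet above, here is a sketch of flattening the (3, 3, 3, 128) feature volume into 27 tokens and adding a learnable positional encoding (ViViT also uses learned positional embeddings, but the exact scheme below is an assumption):

```python
# features: (32, 128, 3, 3, 3) from the 3D convolution above
tokens = features.flatten(2)      # (32, 128, 27) -- merge the 3*3*3 grid into 27 positions
tokens = tokens.transpose(1, 2)   # (32, 27, 128) -- one 128-dim token per tubelet

# One learnable positional embedding per token position, broadcast over the batch.
pos_embed = nn.Parameter(torch.zeros(1, 27, 128))
tokens = tokens + pos_embed       # still (32, 27, 128), now position-aware
```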
How attention works on the reduced video tokens
Let's see how the transformer works on these tokens:
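Here is a sketch of that step: a stack of standard self-attention (transformer encoder) layers over the 27 tokens, so every tubelet token can attend to every other tubelet token across space and time. The number of heads and layers below are arbitrary assumptions:

```python
# d_model must match the 128-dim tokens; nhead and num_layers are illustrative choices.
encoder_layer = nn.TransformerEncoderLayer(d_model=128, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)

encoded = encoder(tokens)  # (32, 27, 128): same shape, but each token now mixes
                           # information from all 27 tubelets via self-attention
```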
Final Output (11 classes)
The encoder output (32, 27, 128) is first pooled into a single 128-dimensional vector per video (for example by averaging the 27 token embeddings or by using a class token), giving (32, 128). A feed-forward (linear) layer then maps (32, 128) to (32, 11), which a softmax turns into probability values for each of the 11 classes.
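A minimal sketch of that head, assuming we average the 27 token embeddings to get one 128-dimensional vector per video (a learnable class token would work just as well):

```python
# Pool the encoder output over the token dimension: (32, 27, 128) -> (32, 128)
pooled = encoded.mean(dim=1)

# Feed-forward classification head: 128 features -> 11 class logits -> probabilities
classifier = nn.Linear(128, 11)
logits = classifier(pooled)              # (32, 11)
probs = torch.softmax(logits, dim=-1)    # probability for each of the 11 classes
```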