Attention Is All You Need

Transformer - based solely on attention mechanisms, dispensing with recurrence and convolution entirely
 
Recurrent NN, CNN, and Encoder-Decoder
RNN - sequential computation precludes parallelization within a training example, and this becomes the bottleneck at longer sequence lengths
Attention mechanism - already prominent in sequence models, but almost always used together with an RNN; the Transformer relies on attention alone.
In the Transformer → the number of operations needed to relate any two positions in the text is reduced to a constant → at the cost of → reduced effective resolution, because attention-weighted positions are averaged :: solved with the help of Multi-Head Attention
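
A minimal sketch of scaled dot-product attention and Multi-Head Attention in PyTorch, assuming the paper's defaults of d_model = 512 and h = 8 heads; the helper names and masking convention here are my own, not taken from the paper.

```python
import torch
from torch import nn

def scaled_dot_product_attention(q, k, v, mask=None):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = scores.softmax(dim=-1)      # averaging over attention-weighted positions
    return weights @ v

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # One projection per role; heads are split out of the projected tensor.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, query, key, value, mask=None):
        b = query.size(0)
        def split(x):  # (b, seq, d_model) -> (b, heads, seq, d_head)
            return x.view(b, -1, self.num_heads, self.d_head).transpose(1, 2)
        q, k, v = split(self.w_q(query)), split(self.w_k(key)), split(self.w_v(value))
        out = scaled_dot_product_attention(q, k, v, mask)   # each head attends independently
        out = out.transpose(1, 2).contiguous().view(b, -1, self.num_heads * self.d_head)
        return self.w_o(out)                                # concatenate heads, project back
```

Multiple heads let the model attend to different representation subspaces at once, which is what counters the averaging / resolution loss of a single attention head.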
 
 
Self-attention: an attention mechanism → relating different positions of a single sequence to compute a representation of that sequence
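
Concretely, self-attention just means query, key, and value all come from the same sequence; reusing the MultiHeadAttention sketch above (shapes below are only illustrative):

```python
x = torch.randn(2, 10, 512)                       # (batch, seq_len, d_model) - dummy input
self_attn = MultiHeadAttention(d_model=512, num_heads=8)
y = self_attn(x, x, x)                            # Q, K, V are all the same sequence
print(y.shape)                                    # torch.Size([2, 10, 512])
```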
 
Most competitive neural sequence transduction models have → an encoder-decoder structure
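
A rough skeleton of that structure, with the concrete encoder/decoder stacks and embeddings left abstract; the constructor arguments and call signatures here are my own assumptions, not the paper's API.

```python
class EncoderDecoder(nn.Module):
    """Generic encoder-decoder: the encoder maps an input symbol sequence to
    continuous representations z; the decoder then emits the output sequence
    one symbol at a time, consuming previously generated symbols (auto-regressive)."""
    def __init__(self, encoder, decoder, src_embed, tgt_embed, generator):
        super().__init__()
        self.encoder, self.decoder = encoder, decoder
        self.src_embed, self.tgt_embed = src_embed, tgt_embed
        self.generator = generator   # final linear + softmax over the target vocabulary

    def forward(self, src, tgt, src_mask=None, tgt_mask=None):
        z = self.encoder(self.src_embed(src), src_mask)               # encode the source once
        h = self.decoder(self.tgt_embed(tgt), z, src_mask, tgt_mask)  # decode with access to z
        return self.generator(h)                                      # per-position output distribution
```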