Attention Is All You Need

Transformer - based solely on attention mechanisms, dispensing with recurrence and convolution entirely
 
Recurrent NN, CNN, and Encoder-Decoder
RNN - sequential computation precludes parallelization within a training example, and this becomes the bottleneck at longer sequence lengths
Attention mechanism - already prominent in sequence models, but almost always used together with an RNN; the Transformer relies on attention alone.
In the Transformer → the number of operations needed to relate any two positions in the text is reduced to a constant → at the cost of → reduced effective resolution, because attention-weighted positions are averaged :: solved with the help of Multi-Head Attention
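
A minimal sketch of scaled dot-product attention and Multi-Head Attention in PyTorch, assuming the paper's defaults of d_model = 512 and h = 8 heads; the helper names and masking convention here are my own, not taken from the paper.

```python
import torch
from torch import nn

def scaled_dot_product_attention(q, k, v, mask=None):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = scores.softmax(dim=-1)      # averaging over attention-weighted positions
    return weights @ v

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # One projection per role; heads are split out of the projected tensor.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, query, key, value, mask=None):
        b = query.size(0)
        def split(x):  # (b, seq, d_model) -> (b, heads, seq, d_head)
            return x.view(b, -1, self.num_heads, self.d_head).transpose(1, 2)
        q, k, v = split(self.w_q(query)), split(self.w_k(key)), split(self.w_v(value))
        out = scaled_dot_product_attention(q, k, v, mask)   # each head attends independently
        out = out.transpose(1, 2).contiguous().view(b, -1, self.num_heads * self.d_head)
        return self.w_o(out)                                # concatenate heads, project back
```

Multiple heads let the model attend to different representation subspaces at once, which is what counters the averaging / resolution loss of a single attention head.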
 
 
Self-attention: an attention mechanism → relating different positions of a single sequence to compute a representation of that sequence
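
Concretely, self-attention just means query, key, and value all come from the same sequence; reusing the MultiHeadAttention sketch above (shapes below are only illustrative):

```python
x = torch.randn(2, 10, 512)                       # (batch, seq_len, d_model) - dummy input
self_attn = MultiHeadAttention(d_model=512, num_heads=8)
y = self_attn(x, x, x)                            # Q, K, V are all the same sequence
print(y.shape)                                    # torch.Size([2, 10, 512])
```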
 
Most competitive neural sequence transduction models have → an encoder-decoder structure
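
A rough skeleton of that structure, with the concrete encoder/decoder stacks and embeddings left abstract; the constructor arguments and call signatures here are my own assumptions, not the paper's API.

```python
class EncoderDecoder(nn.Module):
    """Generic encoder-decoder: the encoder maps an input symbol sequence to
    continuous representations z; the decoder then emits the output sequence
    one symbol at a time, consuming previously generated symbols (auto-regressive)."""
    def __init__(self, encoder, decoder, src_embed, tgt_embed, generator):
        super().__init__()
        self.encoder, self.decoder = encoder, decoder
        self.src_embed, self.tgt_embed = src_embed, tgt_embed
        self.generator = generator   # final linear + softmax over the target vocabulary

    def forward(self, src, tgt, src_mask=None, tgt_mask=None):
        z = self.encoder(self.src_embed(src), src_mask)               # encode the source once
        h = self.decoder(self.tgt_embed(tgt), z, src_mask, tgt_mask)  # decode with access to z
        return self.generator(h)                                      # per-position output distribution
```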