Method
An attention function can be described as mapping a query and a set of key-value pairs to an output. We compute the scaled dot-product attention on a set of queries simultaneously, packed together into a matrix Q; the keys and values are likewise packed into matrices K and V. Each query is scored against every key by a dot product, the scores are divided by √d_k to prevent extreme values, and a softmax over the scores yields the weights on the values:

Attention(Q, K, V) = softmax(QK^T / √d_k) V
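The computation above can be sketched in a few lines of NumPy. This is a minimal illustration, not an optimized implementation; the matrix shapes and random inputs are placeholders chosen for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # relevance of every query to every key
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted average of the values

# Illustrative shapes: 4 queries, 6 key-value pairs.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(6, 8))
V = rng.normal(size=(6, 16))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 16): one d_v-dimensional output per query
```

Note that the output has one row per query: each output is a convex combination of the value vectors, with weights given by the softmaxed scores.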
Instead of performing a single attention function with d_model-dimensional keys, values, and queries, we project them h = 8 times with different learned linear projections, and apply the attention function to each projected version in parallel.
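A minimal sketch of this multi-head scheme, again in NumPy: each of the h = 8 heads gets its own projection triple, attention runs per head, and the head outputs are concatenated and projected back to d_model. The dimensions d_model = 64 and the random projection matrices here are illustrative placeholders, not values fixed by the text.

```python
import numpy as np

d_model, h = 64, 8        # d_model is an assumed example size; h = 8 as in the text
d_k = d_model // h        # per-head dimension

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o):
    # X: (n, d_model); W_q, W_k, W_v: (h, d_model, d_k); W_o: (h * d_k, d_model)
    heads = []
    for Wq, Wk, Wv in zip(W_q, W_k, W_v):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv          # each (n, d_k)
        scores = Q @ K.T / np.sqrt(d_k)            # scaled dot-product per head
        heads.append(softmax(scores) @ V)          # (n, d_k)
    # Concatenate the h heads and project back to d_model.
    return np.concatenate(heads, axis=-1) @ W_o

rng = np.random.default_rng(1)
X = rng.normal(size=(5, d_model))                  # 5 example positions
W_q = rng.normal(size=(h, d_model, d_k))
W_k = rng.normal(size=(h, d_model, d_k))
W_v = rng.normal(size=(h, d_model, d_k))
W_o = rng.normal(size=(h * d_k, d_model))
out = multi_head_attention(X, W_q, W_k, W_v, W_o)
print(out.shape)  # (5, 64): back to d_model per position
```

Because each head attends in its own d_k-dimensional projected subspace, the total cost is similar to single-head attention with full d_model-dimensional keys, while letting different heads attend to different aspects of the input.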