Positional encoding is a crucial component of the Transformer architecture. Its output feeds into the first sub-layer of the Transformer, which is a multi-head attention mechanism.
The multi-head attention mechanism is a key feature of the Transformer model that helps it handle sequential data more effectively, as it allows the model to look at different parts of the input sequence all at once.
In this chapter, we will explore the structure of the multi-head attention mechanism, its advantages, and its Python implementation.
What is Self-Attention Mechanism?
The self-attention mechanism, also called scaled dot-product attention, is an essential component of Transformer-based models. It allows the model to focus on different tokens in the input sequence relative to each other. This is done by calculating a weighted sum of the input values, where the weights are based on the similarity between tokens.
Self-Attention Mechanism
Below are the steps involved in the self-attention mechanism −
- Creating Queries, Keys, and Values − The self-attention mechanism transforms each token in the input sequence into three vectors namely Query(Q), Key(K), and Value(V).
- Calculating Attention Scores − Next, the self-attention mechanism calculates the attention scores by taking the dot product of the Query (Q) and Key (K) matrices. The attention scores show the significance of each word to the current word being processed.
- Applying Softmax Function − Now, in this step a softmax function is applied to these attention scores to convert them into probabilities, which ensures that the attention weights sum up to 1.
- Weighted Sum of Values − In this last step, to produce the output, the softmax attention scores are used to compute a weighted sum of Value vectors.
Mathematically, the self-attention mechanism can be summarized by the following equation −
$$\mathrm{Self-Attention(Q,K,V) \: = \: softmax \left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V}$$
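To make the equation concrete, here is a minimal NumPy sketch of scaled dot-product attention for a single head. The token matrix X and the dimensions used below are made-up values chosen only for illustration, and the learned projections that produce Q, K, and V are omitted for brevity −

import numpy as np

def self_attention(Q, K, V):
    # softmax(QK^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # similarity between tokens
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V                              # weighted sum of the values

# Hypothetical input: 4 tokens, each an 8-dimensional vector
X = np.random.randn(4, 8)
print(self_attention(X, X, X).shape)   # (4, 8)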
What is Multi-Head Attention?
Multi-head attention extends the concept of self-attention mechanism by allowing the model to focus on different parts of the input sequence simultaneously. Rather than running a single attention function, multi-head attention runs multiple self-attention mechanisms, or “heads,” in parallel. This method enables the model to better understand various relationships and dependencies within the input sequence.
Take a look at the following image; it is a part of the original Transformer architecture and represents the structure of the multi-head attention sublayer −
Steps of Multi-head Attention
Given below are the key steps involved in the multi-head attention −
- Applying Multiple Heads − First, the input embeddings are linearly projected into multiple sets (one set per head) of Query(Q), Key(K), and Value(V) matrices.
- Performing Parallel Self-Attention − Next, the self-attention mechanism is performed in parallel by each head on its respective projections.
- Concatenation − Now, the outputs of all the heads are concatenated.
- Combining the information − In this last step, the information from all the heads is combined. It is done by passing the concatenated output through a final linear layer.
Mathematically, the multi-head attention mechanism can be summarized by the following equation −
$$\mathrm{MultiHead(Q,K,V) \: = \: Concat(head_{1}, \: \dotso \: ,head_{h})W^{O}}$$
Where each head is calculated as −
$$\mathrm{head_{i}\:=\: Attention(QW_{i}^{Q}, \: KW_{i}^{K}, \: VW_{i}^{V} )\:=\: softmax\left(\frac{QW_{i}^{Q} (KW_{i}^{K})^{T}}{\sqrt{d_{k}}}\right)VW_{i}^{V}}$$
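To make the dimensions concrete, the short sketch below uses the sizes reported in the original Transformer paper (d_model = 512 and h = 8 heads) to show how the model dimension is split across heads and restored by the concatenation before the output projection W^O −

# Assumed sizes from the original Transformer paper
d_model = 512          # embedding dimension
h = 8                  # number of attention heads
d_k = d_model // h     # per-head dimension: 512 // 8 = 64

# Each head attends in its own 64-dimensional subspace; concatenating
# the h head outputs restores the full model dimension (8 * 64 = 512),
# which the (d_model x d_model) output projection W_O then mixes.
print(d_k, h * d_k)    # 64 512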
Advantages of Multi-Head Attention
- Enhanced Representation − By focusing on different parts of the input sequence simultaneously, multi-head attention enables the model to better understand various relationships and dependencies within the input sequence.
- Parallel Processing − By enabling parallel processing, the multi-head attention mechanism significantly improves training efficiency compared to sequential models like RNNs.
Example
The following Python script implements the multi-head attention mechanism from scratch using NumPy −

import numpy as np

class MultiHeadAttention:
    def __init__(self, d_model, num_heads):
        self.d_model = d_model
        self.num_heads = num_heads
        assert d_model % num_heads == 0
        self.depth = d_model // num_heads

        # Initializing weight matrices for queries, keys, and values
        self.W_q = np.random.randn(d_model, d_model) * np.sqrt(2 / d_model)
        self.W_k = np.random.randn(d_model, d_model) * np.sqrt(2 / d_model)
        self.W_v = np.random.randn(d_model, d_model) * np.sqrt(2 / d_model)

        # Initializing weight matrix for the output projection
        self.W_o = np.random.randn(d_model, d_model) * np.sqrt(2 / d_model)

    def split_heads(self, x, batch_size):
        """
        Split the last dimension into (num_heads, depth) and transpose
        the result to shape (batch_size, num_heads, seq_len, depth).
        """
        x = np.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return np.transpose(x, (0, 2, 1, 3))

    def scaled_dot_product_attention(self, q, k, v):
        """
        Compute scaled dot-product attention.
        `k` is expected to be already transposed to (..., depth, seq_len).
        """
        matmul_qk = np.matmul(q, k)
        scaled_attention_logits = matmul_qk / np.sqrt(self.depth)

        # Numerically stable softmax over the last axis
        scaled_attention_logits -= np.max(scaled_attention_logits, axis=-1, keepdims=True)
        attention_weights = np.exp(scaled_attention_logits)
        attention_weights /= np.sum(attention_weights, axis=-1, keepdims=True)

        output = np.matmul(attention_weights, v)
        return output, attention_weights

    def call(self, inputs):
        q, k, v = inputs
        batch_size = q.shape[0]

        # The linear transformations
        q = np.dot(q, self.W_q)
        k = np.dot(k, self.W_k)
        v = np.dot(v, self.W_v)

        # Split heads
        q = self.split_heads(q, batch_size)
        k = self.split_heads(k, batch_size)
        v = self.split_heads(v, batch_size)

        # The scaled dot-product attention (keys transposed for the dot product)
        attention_output, attention_weights = self.scaled_dot_product_attention(
            q, k.transpose(0, 1, 3, 2), v)

        # Combining heads
        attention_output = np.transpose(attention_output, (0, 2, 1, 3))
        concat_attention = np.reshape(attention_output, (batch_size, -1, self.d_model))

        # Linear transformation for output
        output = np.dot(concat_attention, self.W_o)
        return output, attention_weights

# An example usage
d_model = 512
num_heads = 8
batch_size = 2
seq_len = 10

# Creating an instance of MultiHeadAttention
multi_head_attn = MultiHeadAttention(d_model, num_heads)

# Example input (batch_size, sequence_length, embedding_dim)
Q = np.random.randn(batch_size, seq_len, d_model)
K = np.random.randn(batch_size, seq_len, d_model)
V = np.random.randn(batch_size, seq_len, d_model)

# Performing multi-head attention
output, attention_weights = multi_head_attn.call([Q, K, V])

print("Input Query (Q):\n", Q)
print("Multi-Head Attention Output:\n", output)
Output
On running the above script, we will get the following output (the exact values will differ on each run because the inputs and weights are randomly initialized) −
Input Query (Q):
 [[[ 1.38223113 -0.41160481  1.00938637 ... -0.23466982 -0.20555623  0.80305284]
  [ 0.64676968 -0.83592083  2.45028238 ... -0.1884722  -0.25315478  0.18875416]
  [-0.52094419 -0.03697595 -0.61598294 ...  1.25611974 -0.35473911  0.15091853]
  ...
  [ 1.15939786 -0.5304271  -0.45396363 ...  0.8034571   0.66646109 -1.28586743]
  [ 0.6622964  -0.62871864  0.61371113 ... -0.59802729 -0.66135327 -0.24437055]
  [ 0.83111283 -0.81060387 -0.30858598 ... -0.74773536 -1.3032037   3.06236077]]

 [[-0.88579467 -0.15480352  0.76149486 ... -0.5033709   1.20498808 -0.55297549]
  [-1.11233207  0.7560376  -1.41004173 ... -2.12395203  2.15102493  0.09244935]
  [ 0.33003584  1.67364745 -0.30474183 ...  1.65907682 -0.61370707  0.58373516]
  ...
  [-2.07447136 -1.04964997 -0.15290381 ... -0.19912739 -1.02747937  0.20710549]
  [ 0.38910395 -1.04861089 -1.66583867 ...  0.21530474 -1.45005951  0.04472527]
  [-0.4718725  -0.45374148 -0.59990784 ... -1.9545574   0.11470969  1.03736175]]]
Multi-Head Attention Output:
 [[[ 0.36106079 -2.04297889  0.34937837 ...  0.11306262  0.53263072 -1.32641213]
  [ 1.09494311 -0.56658386  0.24210239 ...  1.1671274  -0.02322074  0.90110388]
  [ 0.45854972 -0.54493138 -0.63421376 ...  1.12479291  0.02585155 -0.08487499]
  ...
  [ 0.18252303 -0.17292067  0.46922657 ... -0.41311278 -1.17954406 -0.17005412]
  [-0.7849032  -2.12371221 -0.80403028 ... -2.35884088  0.15292393 -0.05569091]
  [ 1.07844261  0.18249226  0.81735183 ...  1.16346645 -1.71611237 -1.09860234]]

 [[ 0.58842816 -0.04493786 -1.72007093 ... -2.37506208 -1.83098896  2.84016717]
  [ 0.36608434  0.11709812  0.79108595 ... -1.6308595  -0.96052828  0.40893208]
  [-1.42113667  0.67459219 -0.8731377  ... -1.47390056 -0.42947079  1.04828361]
  ...
  [ 1.14151388 -1.5437165  -1.23405718 ...  0.29237056  0.56595327 -0.19385628]
  [-2.33028535  0.7245296   1.01725021 ... -0.9380485  -1.78988485  0.9938851 ]
  [-0.88115094  3.03051907  0.39447342 ... -1.89168756  0.94973626  0.61657539]]]
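For comparison, most deep-learning frameworks ship multi-head attention as a built-in layer. The following is a minimal sketch using PyTorch's torch.nn.MultiheadAttention (assuming PyTorch is available); unlike the NumPy version above, its projection matrices are learnable parameters −

import torch
import torch.nn as nn

d_model, num_heads = 512, 8
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads, batch_first=True)

# Random example inputs of shape (batch_size, seq_len, d_model)
Q = torch.randn(2, 10, d_model)
K = torch.randn(2, 10, d_model)
V = torch.randn(2, 10, d_model)

output, attention_weights = attn(Q, K, V)
print(output.shape)              # torch.Size([2, 10, 512])
print(attention_weights.shape)   # torch.Size([2, 10, 10]), averaged over heads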
Conclusion
The multi-head attention sublayer is a key component of the Transformer architecture. It helps the model handle sequential data more effectively by allowing it to look at different parts of the input sequence all at once.
In this chapter, we provided a comprehensive overview of the multi-head attention mechanism, its advantages, and how it can be implemented using Python.