Positional encoding is a crucial component of the Transformer architecture. Its output feeds into the first sub-layer of the Transformer, which is a multi-head attention mechanism.
The multi-head attention mechanism is a key feature of the Transformer model that helps it handle sequential data more effectively, as it allows the model to look at different parts of the input sequence all at once.
In this chapter, we will explore the structure of the multi-head attention mechanism, its advantages, and its Python implementation.
What is Self-Attention Mechanism?
The self-attention mechanism, also called scaled dot-product attention, is an essential component of Transformer-based models. It allows the model to focus on different tokens in the input sequence relative to each other. This is done by calculating a weighted sum of the input values, where the weights are based on the similarity between tokens.
Self-Attention Mechanism
Below are the steps involved in the self-attention mechanism −
- Creating Queries, Keys, and Values − The self-attention mechanism transforms each token in the input sequence into three vectors namely Query(Q), Key(K), and Value(V).
- Calculating Attention Scores − Next, the self-attention mechanism calculates the attention scores by taking the dot product of the Query (Q) and Key (K) matrices. The attention scores show the significance of each word to the current word being processed.
- Applying Softmax Function − Now, in this step a softmax function is applied to these attention scores to convert them into probabilities, which ensures that the attention weights sum up to 1.
- Weighted Sum of Values − In this last step, to produce the output, the softmax attention scores are used to compute a weighted sum of Value vectors.
Mathematically, the self-attention mechanism can be summarized by the following equation −
$$\mathrm{Self-Attention(Q,K,V) \: = \: softmax \left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V}$$
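To make the equation concrete, here is a minimal NumPy sketch of scaled dot-product attention for a single head. The token matrix X and the dimensions used below are made-up values chosen only for illustration, and the learned projections that produce Q, K, and V are omitted for brevity −

import numpy as np

def self_attention(Q, K, V):
    # softmax(QK^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # similarity between tokens
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V                              # weighted sum of the values

# Hypothetical input: 4 tokens, each an 8-dimensional vector
X = np.random.randn(4, 8)
print(self_attention(X, X, X).shape)   # (4, 8)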
What is Multi-Head Attention?
Multi-head attention extends the concept of self-attention mechanism by allowing the model to focus on different parts of the input sequence simultaneously. Rather than running a single attention function, multi-head attention runs multiple self-attention mechanisms, or “heads,” in parallel. This method enables the model to better understand various relationships and dependencies within the input sequence.
Take a look at the following image; it is a part of the original Transformer architecture and represents the structure of the multi-head attention sublayer −
Steps of Multi-head Attention
Given below are the key steps involved in the multi-head attention −
- Applying Multiple Heads − First, the input embeddings are linearly projected into multiple sets (one set per head) of Query(Q), Key(K), and Value(V) matrices.
- Performing Parallel Self-Attention − Next, the self-attention mechanism is performed in parallel by each head on its respective projections.
- Concatenation − Now, the outputs of all the heads are concatenated.
- Combining the information − In this last step, the information from all the heads is combined. It is done by passing the concatenated output through a final linear layer.
Mathematically, the multi-head attention mechanism can be summarized by the following equation −
$$\mathrm{MultiHead(Q,K,V) \: = \: Concat(head_{1}, \: \dotso \: ,head_{h})W^{O}}$$
Where each head is calculated as −
$$\mathrm{head_{i}\:=\: Attention(QW_{i}^{Q}, \: KW_{i}^{K}, \: VW_{i}^{V} )\:=\: softmax\left(\frac{QW_{i}^{Q} (KW_{i}^{K})^{T}}{\sqrt{d_{k}}}\right)VW_{i}^{V}}$$
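To make the dimensions concrete, the short sketch below uses the sizes reported in the original Transformer paper (d_model = 512 and h = 8 heads) to show how the model dimension is split across heads and restored by the concatenation before the output projection W^O −

# Assumed sizes from the original Transformer paper
d_model = 512          # embedding dimension
h = 8                  # number of attention heads
d_k = d_model // h     # per-head dimension: 512 // 8 = 64

# Each head attends in its own 64-dimensional subspace; concatenating
# the h head outputs restores the full model dimension (8 * 64 = 512),
# which the (d_model x d_model) output projection W_O then mixes.
print(d_k, h * d_k)    # 64 512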
Advantages of Multi-Head Attention
- Enhanced Representation − By focusing on different parts of the input sequence simultaneously, multi-head attention enables the model to better understand various relationships and dependencies within the input sequence.
- Parallel Processing − By enabling parallel processing, the multi-head attention mechanism significantly improves training efficiency compared to sequential models like RNNs.
Example
The following Python script implements the multi-head attention mechanism from scratch using NumPy −

import numpy as np

class MultiHeadAttention:
    def __init__(self, d_model, num_heads):
        self.d_model = d_model
        self.num_heads = num_heads
        assert d_model % num_heads == 0
        self.depth = d_model // num_heads

        # Initializing weight matrices for queries, keys, and values
        self.W_q = np.random.randn(d_model, d_model) * np.sqrt(2 / d_model)
        self.W_k = np.random.randn(d_model, d_model) * np.sqrt(2 / d_model)
        self.W_v = np.random.randn(d_model, d_model) * np.sqrt(2 / d_model)

        # Initializing weight matrix for the output projection
        self.W_o = np.random.randn(d_model, d_model) * np.sqrt(2 / d_model)

    def split_heads(self, x, batch_size):
        """
        Split the last dimension into (num_heads, depth) and transpose
        the result to shape (batch_size, num_heads, seq_len, depth).
        """
        x = np.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return np.transpose(x, (0, 2, 1, 3))

    def scaled_dot_product_attention(self, q, k, v):
        """
        Compute scaled dot-product attention.
        `k` is expected to be already transposed to (..., depth, seq_len).
        """
        matmul_qk = np.matmul(q, k)
        scaled_attention_logits = matmul_qk / np.sqrt(self.depth)

        # Numerically stable softmax over the last axis
        scaled_attention_logits -= np.max(scaled_attention_logits, axis=-1, keepdims=True)
        attention_weights = np.exp(scaled_attention_logits)
        attention_weights /= np.sum(attention_weights, axis=-1, keepdims=True)

        output = np.matmul(attention_weights, v)
        return output, attention_weights

    def call(self, inputs):
        q, k, v = inputs
        batch_size = q.shape[0]

        # The linear transformations
        q = np.dot(q, self.W_q)
        k = np.dot(k, self.W_k)
        v = np.dot(v, self.W_v)

        # Split heads
        q = self.split_heads(q, batch_size)
        k = self.split_heads(k, batch_size)
        v = self.split_heads(v, batch_size)

        # The scaled dot-product attention (keys transposed for the dot product)
        attention_output, attention_weights = self.scaled_dot_product_attention(
            q, k.transpose(0, 1, 3, 2), v)

        # Combining heads
        attention_output = np.transpose(attention_output, (0, 2, 1, 3))
        concat_attention = np.reshape(attention_output, (batch_size, -1, self.d_model))

        # Linear transformation for output
        output = np.dot(concat_attention, self.W_o)
        return output, attention_weights

# An example usage
d_model = 512
num_heads = 8
batch_size = 2
seq_len = 10

# Creating an instance of MultiHeadAttention
multi_head_attn = MultiHeadAttention(d_model, num_heads)

# Example input (batch_size, sequence_length, embedding_dim)
Q = np.random.randn(batch_size, seq_len, d_model)
K = np.random.randn(batch_size, seq_len, d_model)
V = np.random.randn(batch_size, seq_len, d_model)

# Performing multi-head attention
output, attention_weights = multi_head_attn.call([Q, K, V])

print("Input Query (Q):\n", Q)
print("Multi-Head Attention Output:\n", output)
Output
On running the above script, we will get the following output (the exact values will differ on each run because the inputs and weights are randomly initialized) −
Input Query (Q):
 [[[ 1.38223113 -0.41160481  1.00938637 ... -0.23466982 -0.20555623  0.80305284]
  [ 0.64676968 -0.83592083  2.45028238 ... -0.1884722  -0.25315478  0.18875416]
  [-0.52094419 -0.03697595 -0.61598294 ...  1.25611974 -0.35473911  0.15091853]
  ...
  [ 1.15939786 -0.5304271  -0.45396363 ...  0.8034571   0.66646109 -1.28586743]
  [ 0.6622964  -0.62871864  0.61371113 ... -0.59802729 -0.66135327 -0.24437055]
  [ 0.83111283 -0.81060387 -0.30858598 ... -0.74773536 -1.3032037   3.06236077]]

 [[-0.88579467 -0.15480352  0.76149486 ... -0.5033709   1.20498808 -0.55297549]
  [-1.11233207  0.7560376  -1.41004173 ... -2.12395203  2.15102493  0.09244935]
  [ 0.33003584  1.67364745 -0.30474183 ...  1.65907682 -0.61370707  0.58373516]
  ...
  [-2.07447136 -1.04964997 -0.15290381 ... -0.19912739 -1.02747937  0.20710549]
  [ 0.38910395 -1.04861089 -1.66583867 ...  0.21530474 -1.45005951  0.04472527]
  [-0.4718725  -0.45374148 -0.59990784 ... -1.9545574   0.11470969  1.03736175]]]
Multi-Head Attention Output:
 [[[ 0.36106079 -2.04297889  0.34937837 ...  0.11306262  0.53263072 -1.32641213]
  [ 1.09494311 -0.56658386  0.24210239 ...  1.1671274  -0.02322074  0.90110388]
  [ 0.45854972 -0.54493138 -0.63421376 ...  1.12479291  0.02585155 -0.08487499]
  ...
  [ 0.18252303 -0.17292067  0.46922657 ... -0.41311278 -1.17954406 -0.17005412]
  [-0.7849032  -2.12371221 -0.80403028 ... -2.35884088  0.15292393 -0.05569091]
  [ 1.07844261  0.18249226  0.81735183 ...  1.16346645 -1.71611237 -1.09860234]]

 [[ 0.58842816 -0.04493786 -1.72007093 ... -2.37506208 -1.83098896  2.84016717]
  [ 0.36608434  0.11709812  0.79108595 ... -1.6308595  -0.96052828  0.40893208]
  [-1.42113667  0.67459219 -0.8731377  ... -1.47390056 -0.42947079  1.04828361]
  ...
  [ 1.14151388 -1.5437165  -1.23405718 ...  0.29237056  0.56595327 -0.19385628]
  [-2.33028535  0.7245296   1.01725021 ... -0.9380485  -1.78988485  0.9938851 ]
  [-0.88115094  3.03051907  0.39447342 ... -1.89168756  0.94973626  0.61657539]]]
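For comparison, most deep-learning frameworks ship multi-head attention as a built-in layer. The following is a minimal sketch using PyTorch's torch.nn.MultiheadAttention (assuming PyTorch is available); unlike the NumPy version above, its projection matrices are learnable parameters −

import torch
import torch.nn as nn

d_model, num_heads = 512, 8
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads, batch_first=True)

# Random example inputs of shape (batch_size, seq_len, d_model)
Q = torch.randn(2, 10, d_model)
K = torch.randn(2, 10, d_model)
V = torch.randn(2, 10, d_model)

output, attention_weights = attn(Q, K, V)
print(output.shape)              # torch.Size([2, 10, 512])
print(attention_weights.shape)   # torch.Size([2, 10, 10]), averaged over heads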
Conclusion
The multi-head attention sublayer is a key component of the Transformer architecture. It helps the model handle sequential data more effectively by allowing it to look at different parts of the input sequence all at once.
In this chapter, we provided a comprehensive overview of the multi-head attention mechanism, its advantages, and how it can be implemented using Python.