Generative AI

Generative AI is a type of artificial intelligence that generates new text, audio, video, or other content using models such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs). It learns patterns from existing training data and produces new, unique output that resembles real-world data.
Transformers in Generative AI

Transformers are a neural network architecture that transforms an input sequence into an output sequence. The GPT models are transformer neural networks, and ChatGPT uses the transformer architecture because it allows the model to focus on the most relevant segments of the input data. Read this chapter to understand what the Transformer model is, its key components, why the transformer model is needed, and a comparative analysis between Transformers and Generative Adversarial Networks (GANs).

What is a Transformer Model?

A Transformer model is a type of neural network that learns context by analyzing sequential data. Transformers are what enable Large Language Models (LLMs) to understand context in language and write so fluently. Transformers can process and analyze an entire article at once, not just individual words or sentences, which allows LLMs to capture context and generate better content. Unlike Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs), transformers rely on a mathematical technique known as the self-attention mechanism to process and generate text. The self-attention mechanism captures how distant data elements depend on each other.

Key Components of the Transformer Model

This section presents a brief overview of the key components that make the Transformer model so successful −

Self-Attention Mechanism

The self-attention mechanism allows the model to weigh different parts of the input sequence differently. It enables the model to capture long-range dependencies and relationships within the text, leading to more coherent and context-aware text generation.

Multi-Head Attention

The Transformer model uses multiple attention heads, where each head operates independently and captures different aspects of the input data. The outputs of these heads are combined to produce the result. With multi-head attention, transformers provide a richer representation of the input data.

Positional Encoding

Transformers cannot inherently capture the sequential nature of text, which is why positional encoding is added to the input embeddings. The role of positional encoding is to provide information about the position of each word in the sequence.

Feedforward Neural Networks

After the self-attention mechanism is applied, the transformed input representations are passed through a feedforward neural network (FFNN) for further processing.

Layer Normalization

Layer normalization helps stabilize and accelerate the training process, allowing the model to converge more efficiently.

Encoder-Decoder Structure

The Transformer model is composed of an encoder and a decoder, each consisting of multiple layers. The encoder processes the input sequence and generates an encoded representation, while the decoder uses this representation to generate the output sequence.

Why Do We Need Transformer Models?

In this section, we will highlight the reasons why the transformer architecture is needed.

Transformers Can Capture Long-Range Dependencies

Due to the vanishing gradient problem, Recurrent Neural Networks (RNNs) and their variants like Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRUs) cannot handle long-range dependencies effectively. Transformers, on the other hand, use self-attention mechanisms that allow them to consider the entire sequence at once. This ability allows transformers to capture long-range dependencies more effectively than RNNs.
Transformers Can Handle Parallel Processing

RNNs process sequences sequentially, which leads to longer training times and inefficiency, especially with large datasets and long sequences. The self-attention mechanism in transformers allows parallel processing of input sequences, which speeds up training.

Transformers are Scalable

Although CNNs can process data in parallel, they are not inherently suitable for sequential data and cannot capture global context effectively. The transformer architecture is designed to handle input sequences of varying lengths, which makes transformers more scalable than CNNs.

Difference between Transformers and Generative Adversarial Networks

Although both Transformers and GANs are powerful deep learning models, they serve different purposes and are used in different domains. The following comparison contrasts the two models feature by feature −

Architecture
Transformers − Use self-attention mechanisms to process input data. They process input sequences in parallel, which enables them to handle long-range dependencies, and they are composed of encoder and decoder layers.
GANs − Primarily used for generating realistic synthetic data. They consist of two competing networks, a generator and a discriminator: the generator creates fake data, and the discriminator evaluates it against real data.

Key Features
Transformers − Can handle tasks like image classification and speech recognition that go beyond NLP, but require significant computational resources for training.
GANs − Can generate high-quality, realistic synthetic data, but GAN training can be unstable and requires careful parameter tuning.

Applications
Transformers − Versatile and adaptable to various machine learning tasks: language translation, text summarization, sentiment analysis, image processing, speech recognition, etc.
GANs − Focused on tasks that require high-quality synthetic data generation: image and video generation, creating synthetic faces, data augmentation, medical imaging, enhancing image resolution, etc.

Advantages
Transformers − Handle long-range dependencies effectively, save training time through parallel processing, and outperform other models on NLP tasks.
GANs − Useful for creative applications and scenarios where labeled data is limited, capable of generating highly realistic synthetic data, and have significantly improved the capabilities of image and video generation.

Limitations
Transformers − Require large amounts of training data and computational power, can be less interpretable than simpler models, and face scalability issues with very long sequences due to the quadratic complexity of the self-attention mechanism.
GANs − Training is complex and can be unstable (for example, mode collapse), they are less effective for sequential data tasks, and their computational cost is high.

Conclusion

Transformer models have fundamentally transformed the field of natural language processing (NLP). By using transformers and their multimodal architecture, ChatGPT can generate multimodal output for a wide range of applications. Like Transformers, GANs are also powerful deep learning models used for various applications. We presented a comparative analysis between Transformers and GANs.
Positional Encoding in Transformer Models

With the help of input embeddings, transformers get vector representations of discrete tokens like words, subwords, or characters. However, these vector representations do not provide information about the position of these tokens within the sequence. That is why a critical component named "positional encoding" is used in the Transformer architecture just after the input embedding sub-layer. Positional encoding enables the model to understand the sequence order by providing each token in the input sequence with information about its position. In this chapter, we will understand what positional encoding is, why we need it, how it works, and how to implement it in Python.

What is Positional Encoding?

Positional encoding is a mechanism used in the Transformer to provide information about the order of tokens within an input sequence. In the Transformer architecture, the positional encoding component is added after the input embedding sub-layer. Take a look at the following diagram; it is a part of the original transformer architecture, representing the structure of the positional encoding component −

Why Do We Need Positional Encoding in the Transformer Model?

The Transformer, despite its powerful self-attention mechanism, lacks an inherent sense of order. Unlike Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks that process a sequence in a specific order, the Transformer's parallel processing provides no information about the position of the tokens within the input sequence. As a result, the model cannot understand context, particularly in tasks where the order of words is important. To overcome this limitation, positional encoding is introduced to give each token in the input sequence information about its position. These encodings are then added to the input embeddings, which ensures that the Transformer processes the tokens along with their positional context.

How Does Positional Encoding Work?

We discussed in the previous chapter that the Transformer expects a fixed dimensional space (it may be dmodel = 512 or any other constant value) for each vector representation output by the positional encoding function. As an example, consider the sentence given below −

I am playing with the brown ball and my brother is playing with the red ball.

The words "brown" and "red" may be similar, but in this sentence they are far apart: the word "brown" is in position 6 (pos = 6) and the word "red" is in position 15 (pos = 15). The problem is that we need to add a value to the word embedding of each word in the input sentence so that it carries information about its position in the sequence, and this information must be spread across all 512 embedding dimensions.

Positional encoding can be achieved in many ways, but Vaswani et al. (2017) in the original Transformer model used a specific method based on sinusoidal functions to generate a unique positional encoding for each position in the sequence. The equations below show how the positional encoding for a given position pos and dimension i is defined −

$$\mathrm{PE_{(pos,\: 2i)} \: = \: sin\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right)}$$

$$\mathrm{PE_{(pos,\: 2i+1)} \: = \: cos\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right)}$$

Here, dmodel is the dimension of the embeddings.
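As a quick worked example of these formulas with dmodel = 512: for position pos = 1 and the first pair of dimensions (i = 0), the exponent 2i/dmodel is 0, so the denominator is 1 and −

$$\mathrm{PE_{(1,\:0)} \: = \: sin(1) \: \approx \: 0.8415, \qquad PE_{(1,\:1)} \: = \: cos(1) \: \approx \: 0.5403}$$

These are exactly the first two values of the second row (position 1) of the positional encodings printed by the script below.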
Creating Positional Encodings Using Sinusoidal Functions

Given below is a Python function that creates positional encodings using sinusoidal functions −

def positional_encoding(max_len, d_model):
    # One row per position, one column per embedding dimension
    pe = np.zeros((max_len, d_model))
    position = np.arange(0, max_len).reshape(-1, 1)
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
    pe[:, 0::2] = np.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = np.cos(position * div_term)  # odd dimensions
    return pe

Now, let's see how we can add these encodings to the input embeddings we implemented in the previous chapter −

import numpy as np

# Example text and tokenization
text = "Transformers revolutionized the field of NLP"
tokens = text.split()

# Creating a vocabulary
vocab = {word: idx for idx, word in enumerate(tokens)}

# Example input (sequence of token indices)
input_indices = np.array([vocab[word] for word in tokens])

print("Vocabulary:", vocab)
print("Input Indices:", input_indices)

# Parameters
vocab_size = len(vocab)
embed_dim = 512  # Dimension of the embeddings

# Initialize the embedding matrix with random values
embedding_matrix = np.random.rand(vocab_size, embed_dim)

# Get the embeddings for the input indices
input_embeddings = embedding_matrix[input_indices]

print("Embedding Matrix:\n", embedding_matrix)
print("Input Embeddings:\n", input_embeddings)

def positional_encoding(max_len, d_model):
    pe = np.zeros((max_len, d_model))
    position = np.arange(0, max_len).reshape(-1, 1)
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
    pe[:, 0::2] = np.sin(position * div_term)
    pe[:, 1::2] = np.cos(position * div_term)
    return pe

# Parameters
max_len = len(tokens)

# Generate positional encodings
pos_encodings = positional_encoding(max_len, embed_dim)

# Add the positional encodings to the matching input embeddings
input_embeddings_with_pos = input_embeddings + pos_encodings[:len(tokens)]

print("Positional Encodings:\n", pos_encodings)
print("Input Embeddings with Positional Encoding:\n", input_embeddings_with_pos)

Output

After running the above script, we will get the following output −

Vocabulary: {'Transformers': 0, 'revolutionized': 1, 'the': 2, 'field': 3, 'of': 4, 'NLP': 5}
Input Indices: [0 1 2 3 4 5]
Embedding Matrix:
[[0.71034683 0.08027048 0.89859858 … 0.48071898 0.76495253 0.53869711]
[0.71247114 0.33418585 0.15329225 … 0.61768814 0.32710687 0.89633072]
[0.11731439 0.97467007 0.66899319 … 0.76157481 0.41975638 0.90980636]
[0.42299987 0.51534082 0.6459627 … 0.58178494 0.13362482 0.13826352]
[0.2734792 0.80146145 0.75947837 … 0.15180679 0.93250566 0.43946461]
[0.5750698 0.49106984 0.56273384 … 0.77180581 0.18834177 0.6658962 ]]
Input Embeddings:
[[0.71034683 0.08027048 0.89859858 … 0.48071898 0.76495253 0.53869711]
[0.71247114 0.33418585 0.15329225 … 0.61768814 0.32710687 0.89633072]
[0.11731439 0.97467007 0.66899319 … 0.76157481 0.41975638 0.90980636]
[0.42299987 0.51534082 0.6459627 … 0.58178494 0.13362482 0.13826352]
[0.2734792 0.80146145 0.75947837 … 0.15180679 0.93250566 0.43946461]
[0.5750698 0.49106984 0.56273384 … 0.77180581 0.18834177 0.6658962 ]]
Positional Encodings:
[[ 0.00000000e+00 1.00000000e+00 0.00000000e+00 … 1.00000000e+00 0.00000000e+00
1.00000000e+00]
[ 8.41470985e-01 5.40302306e-01 8.21856190e-01 … 9.99999994e-01 1.03663293e-04 9.99999995e-01]
[ 9.09297427e-01 -4.16146837e-01 9.36414739e-01 … 9.99999977e-01 2.07326584e-04 9.99999979e-01]
[ 1.41120008e-01 -9.89992497e-01 2.45085415e-01 … 9.99999948e-01 3.10989874e-04 9.99999952e-01]
[-7.56802495e-01 -6.53643621e-01 -6.57166863e-01 … 9.99999908e-01 4.14653159e-04 9.99999914e-01]
[-9.58924275e-01 2.83662185e-01 -9.93854779e-01 … 9.99999856e-01 5.18316441e-04 9.99999866e-01]]
Input Embeddings with Positional Encoding:
[[0.71034683 1.08027048 0.89859858 … 1.48071898 0.76495253 1.53869711]
[1.55394213 0.87448815 0.97514844
CycleGAN and StyleGAN

Read this chapter to understand CycleGAN and StyleGAN and how they stand out for their remarkable capabilities in generating and transforming images.

What is a Cycle Generative Adversarial Network?

CycleGAN, short for Cycle-Consistent Generative Adversarial Network, is a GAN framework designed to transfer the characteristics of one image to another. In other words, CycleGAN is designed for unpaired image-to-image translation tasks, where there is no direct pairing between the input and output images. In contrast to traditional GANs that require paired training data, CycleGAN can learn mappings between two different domains without such supervision.

How does a CycleGAN Work?

CycleGAN works by treating translation as an image reconstruction problem. Let's understand how it works −

The CycleGAN first takes an image input, say "X". It then uses a generator, say "G", to convert the input image into the translated image. Once translation is done, it reverses the process, mapping the translated image back to the original image with the help of another generator, say "F".

Architecture of CycleGAN

Like a traditional GAN, CycleGAN has two parts: a generator and a discriminator. Along with these two components, CycleGAN introduces the concept of cycle consistency. Let's understand these components of CycleGAN in detail −

Generator Networks (G_AB and G_BA)

CycleGAN has two generator networks, say G_AB and G_BA. These generators translate images from domain A to domain B and vice versa. They are responsible for minimizing the reconstruction error between the original and translated images.

Discriminator Networks (D_A and D_B)

CycleGAN has two discriminator networks, say D_A and D_B. These discriminators distinguish between real and translated images in domains A and B respectively. They are responsible for improving the realism of the generated images through an adversarial loss.

Cycle Consistency Loss

CycleGAN introduces a third component called the cycle consistency loss. It enforces consistency between the original and reconstructed images across both domains A and B. With the help of the cycle consistency loss, the generators learn meaningful mappings between the two domains and ensure the realism of the generated images.

Given below is the schematic diagram of a CycleGAN −

Applications of CycleGAN

CycleGAN finds its applications in various image-to-image translation tasks, including the following −

Style Transfer − CycleGAN can be used for transferring the style of images between different domains. This includes converting photos to paintings, day scenes to night scenes, aerial photos to maps, etc.

Domain Adaptation − CycleGAN can be used for adapting models trained on synthetic data to real-world data. It improves generalization and performance in tasks like object detection and semantic segmentation.

Image Enhancement − CycleGAN can be used for enhancing the quality of images by removing artifacts, adjusting colors, and improving visual aesthetics.

What is a Style Generative Adversarial Network?

StyleGAN, short for Style Generative Adversarial Network, is a GAN framework developed by NVIDIA. StyleGAN is specifically designed for generating photo-realistic, high-quality images. In contrast to traditional GANs, StyleGAN introduces some innovative techniques for improved image synthesis and finer control over specific attributes.
Architecture of StyleGAN

StyleGAN builds on the traditional progressive GAN architecture and proposes some modifications to its generator part; the discriminator part is almost the same as in a traditional progressive GAN. Let's understand how the StyleGAN architecture is different −

Progressive Growing

In comparison with a traditional GAN, StyleGAN uses a progressive growing strategy in which the generator and discriminator networks are gradually increased in size and complexity during training. Progressive growing allows StyleGAN to generate images of higher resolution (up to 1024×1024 pixels).

Mapping Network

To control the style and appearance of the generated images, StyleGAN uses a mapping network. This mapping network converts the input latent vectors into intermediate latent vectors.

Synthesis Network

StyleGAN also incorporates a synthesis network that takes the intermediate latent vectors produced by the mapping network and generates the final image output. The synthesis network, consisting of a series of convolutional layers with adaptive instance normalization, enables the model to generate high-quality images with fine details.

Style Mixing Regularization

StyleGAN also introduces style mixing regularization during training, which allows the model to combine different styles from multiple latent vectors. The advantage of style mixing regularization is that it enhances the realism of the generated output images.

Applications of StyleGAN

StyleGAN finds its application in various domains, including the following −

Artistic Rendering − With fine control over specific attributes like age, gender, and facial expression, StyleGAN can be used to create realistic portraits, artwork, and other kinds of images.

Fashion and Design − StyleGAN can be used to generate diverse clothing designs, textures, and styles. This makes StyleGAN a valuable model for fashion design and virtual try-on applications.

Face Morphing − StyleGAN provides smooth morphing between different facial attributes. This makes StyleGAN useful for applications like age progression, gender transformation, and facial expression transfer.

Conclusion

In this chapter, we explained two popular variants of the traditional Generative Adversarial Network, namely CycleGAN and StyleGAN. While CycleGAN is designed for unpaired image-to-image translation tasks where there is no pairing between the input and output images, StyleGAN is specifically designed for generating photo-realistic, high-quality images. Understanding the architectures and innovations behind both CycleGAN and StyleGAN provides insight into their potential to create realistic output images.
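To make the cycle consistency idea from this chapter concrete, here is a minimal sketch of how a CycleGAN generator loss could combine the adversarial and cycle consistency terms. The names gen_AB, gen_BA, disc_B, and real_A are illustrative placeholders (assumed Keras models and an image batch), not part of any official API −

import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=False)
LAMBDA_CYCLE = 10.0  # weight of the cycle consistency term (commonly set to 10)

def cycle_consistency_loss(real_image, reconstructed_image):
    # L1 distance between the original image and its reconstruction
    return tf.reduce_mean(tf.abs(real_image - reconstructed_image))

def generator_AB_loss(real_A, gen_AB, gen_BA, disc_B):
    fake_B = gen_AB(real_A, training=True)      # translate A -> B
    cycled_A = gen_BA(fake_B, training=True)    # translate back B -> A
    disc_fake_B = disc_B(fake_B, training=True)
    # Adversarial term: try to make disc_B rate the translated image as real
    adversarial = bce(tf.ones_like(disc_fake_B), disc_fake_B)
    # Cycle term: F(G(A)) should reconstruct the original A
    cycle = cycle_consistency_loss(real_A, cycled_A)
    return adversarial + LAMBDA_CYCLE * cycle

The cycle term is what allows training without paired examples: even though no ground-truth translation of real_A exists, the round trip back to domain A gives the generators a reconstruction target.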
Multi-Head Attention in Transformers

Positional encoding is a crucial component of the Transformer architecture, and its output goes into the first sub-layer of the Transformer. This sub-layer is the multi-head attention mechanism. The multi-head attention mechanism is a key feature of the Transformer model that helps it handle sequential data more effectively. It allows the model to look at different parts of the input sequence all at once. In this chapter, we will explore the structure of the multi-head attention mechanism, its advantages, and its Python implementation.

What is the Self-Attention Mechanism?

The self-attention mechanism, also called scaled dot-product attention, is an essential component of Transformer-based models. It allows the model to focus on different tokens in the input sequence relative to each other. This is done by calculating a weighted sum of the input values, where the weights are based on the similarity between tokens.

Below are the steps involved in the self-attention mechanism −

Creating Queries, Keys, and Values − The self-attention mechanism transforms each token in the input sequence into three vectors, namely Query (Q), Key (K), and Value (V).

Calculating Attention Scores − Next, the self-attention mechanism calculates the attention scores by taking the dot product of the Query (Q) and Key (K) matrices. The attention scores show the significance of each word to the current word being processed.

Applying the Softmax Function − A softmax function is then applied to these attention scores to convert them into probabilities, which ensures that the attention weights sum to 1.

Weighted Sum of Values − Finally, to produce the output, the softmax attention weights are used to compute a weighted sum of the Value vectors.

Mathematically, the self-attention mechanism can be summarized by the following equation (a minimal single-head code sketch of it appears at the end of this section, followed by the full multi-head implementation) −

$$\mathrm{Self\text{-}Attention(Q,K,V) \: = \: softmax \left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V}$$

What is Multi-Head Attention?

Multi-head attention extends the self-attention mechanism by allowing the model to focus on different parts of the input sequence simultaneously. Rather than running a single attention function, multi-head attention runs multiple self-attention mechanisms, or "heads," in parallel. This enables the model to capture various relationships and dependencies within the input sequence. Take a look at the following image; it is a part of the original transformer architecture and represents the structure of the multi-head attention sub-layer −

Steps of Multi-Head Attention

Given below are the key steps involved in multi-head attention −

Applying Multiple Heads − First, the input embeddings are linearly projected into multiple sets (one set per head) of Query (Q), Key (K), and Value (V) matrices.

Performing Parallel Self-Attention − Next, self-attention is performed in parallel by each head on its respective projections.

Concatenation − The outputs of all the heads are then concatenated.

Combining the Information − In the last step, the information from all the heads is combined by passing the concatenated output through a final linear layer.
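Before the full multi-head implementation below, here is a minimal single-head sketch of the scaled dot-product attention equation above, using NumPy (the array shapes are illustrative) −

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    # Similarity scores between every query and every key
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over the key dimension turns scores into attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Weighted sum of the value vectors
    return weights @ V, weights

# Example: 4 tokens, d_k = 8
Q = np.random.randn(4, 8)
K = np.random.randn(4, 8)
V = np.random.randn(4, 8)
output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape)          # (4, 8)
print(weights.sum(axis=-1))  # each row of attention weights sums to 1

Multi-head attention simply runs several copies of this computation on different learned projections of Q, K, and V, as the class in the next example shows.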
Mathematically, the multi-head attention mechanism can be summarized by the following equation −

$$\mathrm{MultiHead(Q,K,V) \: = \: Concat(head_{1}, \: \dotso \: ,head_{h})W^{O}}$$

Where each head is calculated as −

$$\mathrm{head_{i}\:=\: Attention(QW_{i}^{Q}, \: KW_{i}^{K}, \: VW_{i}^{V} )\:=\: softmax\left(\frac{QW_{i}^{Q} (KW_{i}^{K})^{T}}{\sqrt{d_{k}}}\right)VW_{i}^{V}}$$

Advantages of Multi-Head Attention

Enhanced Representation − By focusing on different parts of the input sequence simultaneously, multi-head attention enables the model to capture various relationships and dependencies within the input sequence.

Parallel Processing − By enabling parallel processing, the multi-head attention mechanism significantly improves training efficiency compared to sequential models like RNNs.

Example

The following Python script implements the multi-head attention mechanism −

import numpy as np

class MultiHeadAttention:
    def __init__(self, d_model, num_heads):
        self.d_model = d_model
        self.num_heads = num_heads
        assert d_model % num_heads == 0
        self.depth = d_model // num_heads

        # Initializing weight matrices for queries, keys, and values
        self.W_q = np.random.randn(d_model, d_model) * np.sqrt(2 / d_model)
        self.W_k = np.random.randn(d_model, d_model) * np.sqrt(2 / d_model)
        self.W_v = np.random.randn(d_model, d_model) * np.sqrt(2 / d_model)

        # Initializing weight matrix for the output
        self.W_o = np.random.randn(d_model, d_model) * np.sqrt(2 / d_model)

    def split_heads(self, x, batch_size):
        """
        Split the last dimension into (num_heads, depth).
        Transpose the result to shape (batch_size, num_heads, seq_len, depth).
        """
        x = np.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return np.transpose(x, (0, 2, 1, 3))

    def scaled_dot_product_attention(self, q, k, v):
        """
        Compute scaled dot-product attention.
        """
        matmul_qk = np.matmul(q, k)
        scaled_attention_logits = matmul_qk / np.sqrt(self.depth)
        scaled_attention_logits -= np.max(scaled_attention_logits, axis=-1, keepdims=True)
        attention_weights = np.exp(scaled_attention_logits)
        attention_weights /= np.sum(attention_weights, axis=-1, keepdims=True)
        output = np.matmul(attention_weights, v)
        return output, attention_weights

    def call(self, inputs):
        q, k, v = inputs
        batch_size = q.shape[0]

        # The linear transformations
        q = np.dot(q, self.W_q)
        k = np.dot(k, self.W_k)
        v = np.dot(v, self.W_v)

        # Split heads
        q = self.split_heads(q, batch_size)
        k = self.split_heads(k, batch_size)
        v = self.split_heads(v, batch_size)

        # The scaled dot-product attention (keys are pre-transposed for the matmul)
        attention_output, attention_weights = self.scaled_dot_product_attention(q, k.transpose(0, 1, 3, 2), v)

        # Combining heads
        attention_output = np.transpose(attention_output, (0, 2, 1, 3))
        concat_attention = np.reshape(attention_output, (batch_size, -1, self.d_model))

        # Linear transformation for the output
        output = np.dot(concat_attention, self.W_o)
        return output, attention_weights

# An example usage
d_model = 512
num_heads = 8
batch_size = 2
seq_len = 10

# Creating an instance of MultiHeadAttention
multi_head_attn = MultiHeadAttention(d_model, num_heads)

# Example input (batch_size, sequence_length, embedding_dim)
Q = np.random.randn(batch_size, seq_len, d_model)
K = np.random.randn(batch_size, seq_len, d_model)
V = np.random.randn(batch_size, seq_len, d_model)

# Performing multi-head attention
output, attention_weights = multi_head_attn.call([Q, K, V])

print("Input Query (Q):\n", Q)
print("Multi-Head Attention Output:\n", output)

Output

After running the above script, we will get the following output −

Input Query (Q):
[[[ 1.38223113 -0.41160481 1.00938637 … -0.23466982 -0.20555623 0.80305284]
[ 0.64676968 -0.83592083 2.45028238 … -0.1884722 -0.25315478 0.18875416]
[-0.52094419 -0.03697595 -0.61598294 … 1.25611974 -0.35473911 0.15091853]
…
[ 1.15939786 -0.5304271 -0.45396363 … 0.8034571 0.66646109 -1.28586743]
[ 0.6622964 -0.62871864 0.61371113 … -0.59802729 -0.66135327 -0.24437055]
[ 0.83111283 -0.81060387 -0.30858598 … -0.74773536 -1.3032037 3.06236077]]

[[-0.88579467 -0.15480352 0.76149486 … -0.5033709 1.20498808 -0.55297549]
[-1.11233207 0.7560376 -1.41004173 … -2.12395203 2.15102493 0.09244935]
[ 0.33003584 1.67364745 -0.30474183 … 1.65907682 -0.61370707 0.58373516]
…
[-2.07447136 -1.04964997 -0.15290381 … -0.19912739 -1.02747937 0.20710549]
[ 0.38910395 -1.04861089 -1.66583867 … 0.21530474 -1.45005951 0.04472527]
[-0.4718725 -0.45374148 -0.59990784 … -1.9545574 0.11470969 1.03736175]]]
Multi-Head
Input Embeddings in Transformers

The two main components of a Transformer, i.e., the encoder and the decoder, contain various mechanisms and sub-layers. In the Transformer architecture, the first sub-layer is the input embedding. Input embeddings are a crucial component that serves as the initial representation of the input data. Before text is fed into the model for processing, these embeddings convert raw text data like words or subwords into a format that the model can process. Read this chapter to understand what input embeddings are, why they are important, and how they are implemented in Transformers, with Python examples to illustrate these concepts.

What are Input Embeddings?

Input embeddings are basically vector representations of discrete tokens like words, subwords, or characters. These vectors capture the semantic meaning of the tokens, which enables the model to understand and manipulate the text data effectively. The role of the input embedding sub-layer in the Transformer is to map the input tokens into a high-dimensional space of dimension $\mathrm{d_{model} \: = \: 512}$, where similar tokens have similar vector representations.

Importance of Input Embeddings in Transformers

Let's now understand why input embeddings are important in Transformers −

Semantic Representation − Input embeddings capture semantic similarities between words of the input text. For example, words like "king" and "queen", or "cat" and "dog", will have vectors that are close to each other in the embedding space.

Dimensionality Reduction − Traditional one-hot encoding represents each token as a binary vector with a single high value, which requires a large amount of space. Input embeddings, on the other hand, reduce this computational complexity by providing a compact and dense representation.

Enhanced Learning − Embeddings also capture contextual relationships, so they improve the model's ability to learn patterns and relationships in the data. This leads to better performance of an ML model on NLP tasks.

Working of the Input Embedding Sub-layer

The working of the input embedding sub-layer, like that of other standard transduction models, includes the following steps −

Step 1: Tokenization

Before the input tokens can be embedded, the raw text data must be tokenized. Tokenization is the process of splitting the text into smaller units like words or subwords. Let's see both kinds of tokenization −

Word-level tokenization − As the name implies, it splits the text into individual words.

Subword-level tokenization − As the name implies, it splits the text into smaller units, which can be parts of words. This kind of tokenization is used in models like BERT and GPT to handle rare words and misspellings.

For example, a word-level tokenizer applied to the text "Transformers revolutionized the field of NLP" will produce the tokens ["Transformers", "revolutionized", "the", "field", "of", "NLP"].

Step 2: Embedding Layer

The second step is the embedding layer, which is basically a lookup table that maps each token to a dense vector of fixed dimension. This involves the following two notions −

Vocabulary − The set of unique tokens recognized by the model.

Embedding Dimension − The size of the vector space in which tokens are represented, for example 512.

When a token is passed to the embedding layer, it returns the corresponding dense vector from the embedding matrix.

How are Input Embeddings Implemented in a Transformer?
Given below is a Python example to illustrate how input embeddings are implemented in a Transformer −

Example

import numpy as np

# Example text and tokenization
text = "Transformers revolutionized the field of NLP"
tokens = text.split()

# Creating a vocabulary
vocab = {word: idx for idx, word in enumerate(tokens)}

# Example input (sequence of token indices)
input_indices = np.array([vocab[word] for word in tokens])

print("Vocabulary:", vocab)
print("Input Indices:", input_indices)

# Parameters
vocab_size = len(vocab)
embed_dim = 512  # Dimension of the embeddings

# Initialize the embedding matrix with random values
embedding_matrix = np.random.rand(vocab_size, embed_dim)

# Get the embeddings for the input indices
input_embeddings = embedding_matrix[input_indices]

print("Embedding Matrix:\n", embedding_matrix)
print("Input Embeddings:\n", input_embeddings)

Output

The above Python script first splits the text into tokens and creates a vocabulary that maps each word to a unique index. After that, it initializes an embedding matrix with random values, where each row corresponds to the embedding of a word. We are using an embedding dimension of 512.

Vocabulary: {'Transformers': 0, 'revolutionized': 1, 'the': 2, 'field': 3, 'of': 4, 'NLP': 5}
Input Indices: [0 1 2 3 4 5]
Embedding Matrix:
[[0.29083668 0.70830247 0.22773598 … 0.62831348 0.90910366 0.46552784]
[0.01269533 0.47420163 0.96242738 … 0.38781376 0.33485277 0.53721033]
[0.62287977 0.09313765 0.54043664 … 0.7766359 0.83653342 0.75300144]
[0.32937143 0.51701913 0.39535506 … 0.60957358 0.22620172 0.60341522]
[0.65193484 0.25431826 0.55643452 … 0.76561879 0.24922971 0.96247851]
[0.78385765 0.58940282 0.71930539 … 0.61332926 0.24710099 0.5445185 ]]
Input Embeddings:
[[0.29083668 0.70830247 0.22773598 … 0.62831348 0.90910366 0.46552784]
[0.01269533 0.47420163 0.96242738 … 0.38781376 0.33485277 0.53721033]
[0.62287977 0.09313765 0.54043664 … 0.7766359 0.83653342 0.75300144]
[0.32937143 0.51701913 0.39535506 … 0.60957358 0.22620172 0.60341522]
[0.65193484 0.25431826 0.55643452 … 0.76561879 0.24922971 0.96247851]
[0.78385765 0.58940282 0.71930539 … 0.61332926 0.24710099 0.5445185 ]]

Conclusion

The input embedding sub-layer converts raw text data like words or subwords into a format that the model can process. We also explained why input embeddings are important for the successful working of the Transformer. By capturing semantic similarities between words and reducing computational complexity through a compact and dense representation, this sub-layer ensures that the model can effectively learn patterns and relationships in the data. We also provided a Python implementation example covering the fundamental steps required to transform raw text data into a format suitable for further processing in a Transformer model. Understanding and implementing the input embedding sub-layer is crucial for effectively using Transformer models for NLP tasks.
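As a small follow-up to the dimensionality-reduction point above, the lookup embedding_matrix[input_indices] used in the script is mathematically equivalent to multiplying a one-hot matrix by the embedding matrix, while never materializing the large, sparse one-hot vectors −

import numpy as np

vocab_size, embed_dim = 6, 512
embedding_matrix = np.random.rand(vocab_size, embed_dim)
input_indices = np.array([0, 1, 2, 3, 4, 5])

# Dense lookup (what the embedding layer does)
lookup = embedding_matrix[input_indices]

# Equivalent one-hot formulation (wasteful: vocab_size-wide vectors per token)
one_hot = np.eye(vocab_size)[input_indices]
via_one_hot = one_hot @ embedding_matrix

print(np.allclose(lookup, via_one_hot))  # True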
Training a Generative Adversarial Network (GAN)

We have already explored the architecture of Generative Adversarial Networks and how they work. In this chapter, we will take a practical example to demonstrate how you can implement and train a GAN to generate handwritten digits like those in the MNIST dataset. We'll use Python along with TensorFlow and Keras for this example.

Process of Training a Generative Adversarial Network

Training a GAN involves optimizing both the generator model and the discriminator model iteratively. Let's understand the training process of a Generative Adversarial Network (GAN) using the following steps −

Initialization − The process starts with two neural networks: the Generator Network (G) and the Discriminator Network (D). The generator takes a random seed or noise vector as input and produces generated samples. The discriminator takes either real data samples or generated samples as input and classifies them as real or fake.

Generating Fake Data − A random noise vector is fed into the generator network. The generator processes this noise and outputs generated data samples that are intended to resemble real data.

Generator Training − First, fake data is generated from input random noise. Then the generator's loss is calculated using the discriminator's output. Finally, the generator's weights are updated to minimize this loss.

Discriminator Training − First, a batch of real data and a batch of fake data are taken. Then the discriminator's loss is calculated for both real and fake data. Finally, the discriminator's weights are updated to minimize this loss.

Iterative Training − Repeat the previous three steps (generating fake data, generator training, and discriminator training). During each iteration, the generator and discriminator are trained alternately and try to improve each other's performance. This alternating optimization continues until the generator produces data that closely resembles the real data and the discriminator can no longer reliably distinguish between real and fake data.

Training and Building a GAN

Here, we will show the step-by-step procedure of training and building a GAN using Python and the MNIST dataset −

Step 1: Setting Up the Environment

Before we start, we need to set up our Python environment with the necessary libraries. Ensure you have TensorFlow and Keras installed on your computer. You can install them using pip as follows −

pip install tensorflow

Step 2: Import Necessary Libraries

We need to import the essential libraries −

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.datasets import mnist
import matplotlib.pyplot as plt

Step 3: Load and Preprocess the MNIST Dataset

The MNIST dataset consists of 60,000 training images and 10,000 testing images of handwritten digits, each of size 28×28 pixels. We will normalize the pixel values to the range [-1, 1] to make training more efficient −

# Load the dataset
(x_train, _), (_, _) = mnist.load_data()

# Normalize the images to [-1, 1] to match the generator's tanh output
x_train = ((x_train - 127.5) / 127.5).astype("float32")
x_train = np.expand_dims(x_train, axis=-1)

# Set batch size and buffer size
BUFFER_SIZE = 60000
BATCH_SIZE = 256

Step 4: Create the Generator and Discriminator Models

The generator creates fake images from random noise, and the discriminator attempts to distinguish between real and fake images.
Implementing the Generator Model

The generator model takes a random noise vector as input and transforms it through a series of layers to produce a fake image −

def build_generator():
    model = models.Sequential()
    model.add(layers.Dense(256, use_bias=False, input_shape=(100,)))
    model.add(layers.BatchNormalization())
    model.add(layers.LeakyReLU())
    model.add(layers.Dense(512, use_bias=False))
    model.add(layers.BatchNormalization())
    model.add(layers.LeakyReLU())
    model.add(layers.Dense(28 * 28 * 1, use_bias=False, activation="tanh"))
    model.add(layers.Reshape((28, 28, 1)))
    return model

generator = build_generator()

Implementing the Discriminator Model

The discriminator model takes an image as input (either real or generated) and outputs a probability value indicating whether the image is real or fake −

def build_discriminator():
    model = models.Sequential()
    model.add(layers.Flatten(input_shape=(28, 28, 1)))
    model.add(layers.Dense(512))
    model.add(layers.LeakyReLU())
    model.add(layers.Dropout(0.3))
    model.add(layers.Dense(256))
    model.add(layers.LeakyReLU())
    model.add(layers.Dropout(0.3))
    model.add(layers.Dense(1, activation="sigmoid"))
    return model

discriminator = build_discriminator()

Step 5: Define Loss Functions and Optimizers

In this step, we will use binary cross-entropy loss for both the generator and the discriminator. The generator aims to maximize the probability of the discriminator making a mistake, while the discriminator aims to minimize its classification error.

# The discriminator ends with a sigmoid, so its outputs are probabilities, not logits
cross_entropy = tf.keras.losses.BinaryCrossentropy(from_logits=False)

def generator_loss(fake_output):
    return cross_entropy(tf.ones_like(fake_output), fake_output)

def discriminator_loss(real_output, fake_output):
    real_loss = cross_entropy(tf.ones_like(real_output), real_output)
    fake_loss = cross_entropy(tf.zeros_like(fake_output), fake_output)
    total_loss = real_loss + fake_loss
    return total_loss

generator_optimizer = tf.keras.optimizers.Adam(1e-4)
discriminator_optimizer = tf.keras.optimizers.Adam(1e-4)

Step 6: Define the Training Loop

The training process for a GAN involves training the generator and discriminator iteratively. Here, we will define a training step that includes generating fake images, calculating losses, and updating the model weights using backpropagation.

@tf.function
def train_step(images):
    noise = tf.random.normal([BATCH_SIZE, 100])

    with tf.GradientTape() as gen_tape, tf.GradientTape() as disc_tape:
        generated_images = generator(noise, training=True)

        real_output = discriminator(images, training=True)
        fake_output = discriminator(generated_images, training=True)

        gen_loss = generator_loss(fake_output)
        disc_loss = discriminator_loss(real_output, fake_output)

    gradients_of_generator = gen_tape.gradient(gen_loss, generator.trainable_variables)
    gradients_of_discriminator = disc_tape.gradient(disc_loss, discriminator.trainable_variables)

    generator_optimizer.apply_gradients(zip(gradients_of_generator, generator.trainable_variables))
    discriminator_optimizer.apply_gradients(zip(gradients_of_discriminator, discriminator.trainable_variables))

def train(dataset, epochs):
    for epoch in range(epochs):
        for image_batch in dataset:
            train_step(image_batch)
        print(f"Epoch {epoch+1} completed")

Step 7: Prepare the Dataset and Train the GAN

Next, we will prepare the dataset by shuffling and batching the MNIST images, and then start the training process.
# Prepare the dataset for training
train_dataset = tf.data.Dataset.from_tensor_slices(x_train).shuffle(BUFFER_SIZE).batch(BATCH_SIZE)

# Train the GAN
EPOCHS = 50
train(train_dataset, EPOCHS)

Step 8: Generate and Display Images

Now, after training the GAN, we can generate and display new images created by the generator. This involves creating random noise, feeding it to the generator, and displaying the resulting images.

def generate_and_save_images(model, epoch, test_input):
    predictions = model(test_input, training=False)
    fig = plt.figure(figsize=(7.50, 3.50))
    for i in range(predictions.shape[0]):
        plt.subplot(4, 4, i + 1)
        # Rescale pixel values from [-1, 1] back to [0, 255] for display
        plt.imshow(predictions[i, :, :, 0] * 127.5 + 127.5, cmap="gray")
        plt.axis("off")
    plt.savefig("image_at_epoch_{:04d}.png".format(epoch))
    plt.show()

seed = tf.random.normal([16, 100])
generate_and_save_images(generator, EPOCHS, seed)

After running this code, you will get a 4×4 grid of generated handwritten-digit images as output.

Conclusion

Training a GAN using Python involves several key steps, such as setting up the environment, creating the generator and discriminator models, defining loss functions and optimizers, and implementing the training loop. By following these steps, you can train your own GAN and explore the fascinating world of generative adversarial networks. In this chapter, we provided a detailed guide to building and training a GAN on the MNIST dataset of handwritten digits.
Conditional Generative Adversarial Networks (cGAN)

What is a Conditional GAN?

A Generative Adversarial Network (GAN) is a deep learning framework that can generate new, plausible examples for a given dataset. A Conditional GAN (cGAN) extends the GAN framework by feeding conditioning information, such as class labels, attributes, or even other data samples, into both the generator and the discriminator networks. With the help of this conditioning information, Conditional GANs give us control over the characteristics of the generated output. Read this chapter to understand the concept of Conditional GANs, their architecture, applications, and challenges.

Where Do We Need a Conditional GAN?

While working with GANs, there may arise a situation where we want the model to generate specific types of images. For example, to produce fake pictures of dogs, you train your GAN on a broad spectrum of dog images. While we can use our trained model to generate an image of a random dog, we cannot instruct it to generate an image of, say, a Dalmatian or a Rottweiler.

To produce fake pictures of dogs with a Conditional GAN, during training we pass the images to the network along with their actual labels (Dalmatian, Rottweiler, Pug, etc.) so that the model learns the difference between these breeds. In this way, we can make our model generate images of specific dog breeds.

A Conditional GAN is an extension of the traditional GAN architecture that allows us to generate images by conditioning the network on additional information.

Architecture of Conditional GANs

Like traditional GANs, the architecture of a Conditional GAN consists of two main components: a generator network and a discriminator network. The only difference is that in Conditional GANs, both the generator and the discriminator receive additional conditioning information y along with their respective inputs. Let's understand it with the help of this diagram −

The Generator Network

The generator network, as shown in the above diagram, takes two inputs: a random noise vector sampled from a predefined distribution and the conditioning information "y". It transforms them into synthetic data samples. The goal of the generator is not only to produce data that resembles real data, but also to align with the provided conditioning information.

The Discriminator Network

The discriminator network receives both real data samples and fake samples generated by the generator, along with the conditioning information "y". The goal of the discriminator network is to evaluate the input data and distinguish real data samples from the dataset from fake data samples produced by the generator, while taking the provided conditioning information into account.

We have seen the use of conditioning information in the cGAN architecture. Let's now understand conditioning information and its types.

Conditioning Information

Conditioning information, often denoted by "y", is additional information provided to both the generator and the discriminator networks to condition the generation process. Depending on the application and the required control over the generated output, conditioning information can take various forms.

Types of Conditioning Information

Some of the common types of conditioning information are as follows −

Class Labels − In image classification tasks, conditioning information "y" may represent the class labels corresponding to different categories.
For example, in a handwritten digits dataset, "y" could indicate the digit class (0-9) that the generator network should produce.

Attributes − In image generation tasks, conditioning information "y" may represent specific attributes or features of the desired output, such as the color of objects, the style of clothing, or the pose of a person.

Textual Descriptions − For text-to-image synthesis tasks, conditioning information "y" may consist of textual descriptions or captions describing the desired characteristics of the generated image.

Applications of Conditional GANs

Listed below are some of the fields where Conditional GANs find their applications −

Image-to-Image Translation − Conditional GANs are well suited for translating images from one domain to another, such as converting satellite images to maps, transforming sketches into realistic images, or converting day-time scenes to night-time scenes.

Semantic Image Synthesis − Conditional GANs can condition on semantic labels, so they can generate realistic images based on textual descriptions or semantic layouts.

Super-Resolution and Inpainting − Conditional GANs can be used for image super-resolution tasks, in which low-resolution images are transformed into corresponding high-resolution images. They can also be used for inpainting tasks, in which missing parts of an image are filled in based on contextual information.

Style Transfer and Editing − Conditional GANs allow us to manipulate specific attributes like color, texture, or artistic style while preserving other aspects of the image.

Challenges in Using Conditional GANs

Conditional GANs offer significant advances in generative modeling, but they also come with some challenges. Let's see what kind of challenges you can face while using Conditional GANs −

Mode Collapse − Like traditional GANs, Conditional GANs can experience mode collapse, in which the generator learns to produce only a limited variety of samples and fails to capture the entire data distribution.

Conditioning Information Quality − The effectiveness of Conditional GANs depends on the quality and relevance of the provided conditioning information. Noisy or irrelevant conditioning information can lead to poor generation outputs.

Training Instability − The training instability issues observed in traditional GANs can also affect Conditional GANs. To avoid this, cGANs require careful architecture design and training approaches.

Scalability − As the complexity of the conditioning information grows, Conditional GANs become harder to handle and require more computational resources.

Conclusion

A Conditional GAN (cGAN) extends the GAN framework by including conditioning information such as class labels, attributes, or even other data samples. Conditional GANs give us control over the characteristics of the generated output. From image-to-image translation to semantic image synthesis, Conditional GANs find their applications across various domains.
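As a minimal illustration of how the conditioning information "y" can be injected into a generator, here is a sketch of a conditional generator for the ten MNIST digit classes in Keras. The layer sizes and the function name build_conditional_generator are illustrative assumptions, not a reference implementation −

import tensorflow as tf
from tensorflow.keras import layers, models

NOISE_DIM = 100
NUM_CLASSES = 10  # digit classes 0-9 used as the conditioning label y

def build_conditional_generator():
    noise = layers.Input(shape=(NOISE_DIM,))
    label = layers.Input(shape=(1,), dtype="int32")

    # Embed the class label and flatten it to a vector
    label_embedding = layers.Flatten()(layers.Embedding(NUM_CLASSES, 50)(label))

    # Condition the generator by concatenating noise and label embedding
    x = layers.Concatenate()([noise, label_embedding])
    x = layers.Dense(256)(x)
    x = layers.LeakyReLU()(x)
    x = layers.Dense(28 * 28 * 1, activation="tanh")(x)
    image = layers.Reshape((28, 28, 1))(x)

    return models.Model([noise, label], image)

# Generate one image conditioned on the digit class 7
# (the weights are untrained here, so the output is still noise)
generator = build_conditional_generator()
fake_image = generator([tf.random.normal([1, NOISE_DIM]), tf.constant([[7]])])
print(fake_image.shape)  # (1, 28, 28, 1)

The discriminator would receive the same label in an analogous way, so that after training the same noise vector produces different digits when paired with different labels.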
Architecture of Transformers in Generative AI

Large Language Models (LLMs) based on transformers have outperformed the earlier Recurrent Neural Networks (RNNs) in various tasks like sentiment analysis, machine translation, and text summarization. Transformers derive these unique capabilities from their architecture. This chapter explains the main ideas of the original transformer model in simple terms to make it easier to understand. We will focus on the key components that make up the transformer: the encoder, the decoder, and the unique attention mechanism that connects them.

How do Transformers Work in Generative AI?

Let's understand how a transformer works −

First, when we provide a sentence to the transformer, it pays extra attention to the important words in that sentence. It considers all the words simultaneously rather than one after another, which helps the transformer find the dependencies between the words in the sentence. It then works out the relationships between the words: for example, if a sentence is about stars and galaxies, it recognizes that these words are related. Once done, the transformer uses this knowledge to understand the complete story and how the words connect with each other. With this understanding, the transformer can even predict which word might come next.

Transformer Architecture in Generative AI

The transformer has two main components: the encoder and the decoder. Below is a simplified architecture of the transformer −

As you can see in the diagram, on the left side of the transformer the input enters the encoder. The input is first converted to input embeddings and then passes through an attention sub-layer and a FeedForward Network (FFN) sub-layer. Similarly, on the right side, the target output enters the decoder. The output is also first converted to output embeddings and then passes through two attention sub-layers and a FeedForward Network (FFN) sub-layer. In this architecture, there is no RNN, LSTM, or CNN. Recurrence has been discarded and replaced by the attention mechanism.

Let's discuss the two main components of the transformer, the encoder and the decoder, in detail.

A Layer of the Encoder Stack of the Transformer

In the transformer, the encoder processes the input sequence and breaks it down into meaningful representations. The encoder of the transformer model is a stack of layers, where each encoder layer has the following structure −

This encoder layer structure remains the same for all the layers of the Transformer model. Each layer of the encoder stack contains the following two sub-layers −

A multi-head attention mechanism
A FeedForward Network (FFN)

As we can see in the above diagram, there is a residual connection around both sub-layers, i.e., the multi-head attention mechanism and the FeedForward Network. The job of these residual connections is to send the unprocessed input x of a sub-layer to a layer normalization function. In this way, the normalized output of each sub-layer can be calculated as −

LayerNormalization(x + Sublayer(x))

We will discuss the sub-layers, i.e., multi-head attention and FFN, as well as input embeddings, positional encodings, normalization, and residual connections in detail in subsequent chapters.

A Layer of the Decoder Stack of the Transformer

In the transformer, the decoder takes the representations generated by the encoder and processes them to generate output sequences, much like a translation or a text continuation.
Like the encoder, the decoder of the transformer model is also a stack of layers, where each decoder layer has the following structure −

Like the encoder layers, the decoder layer structure remains the same for all N = 6 layers of the Transformer model. Each layer of the decoder stack contains the following three sub-layers −

A masked multi-head attention mechanism
A multi-head attention mechanism
A FeedForward Network (FFN)

In contrast to the encoder, the decoder has an additional sub-layer called masked multi-head attention, in which, for a given position, the subsequent words are masked. The advantage of this sub-layer is that the transformer makes its predictions based on its own inferences, without seeing the rest of the sequence. Like the encoder, there is a residual connection around all the sub-layers, and the normalized output of each sub-layer can be calculated as −

LayerNormalization(x + Sublayer(x))

As we can see in the above diagram, after all the decoder blocks there is a final linear layer. The role of this linear layer is to map the data to the size of the output vocabulary. A softmax function is then applied to the mapped data to generate a probability distribution over the target vocabulary. This produces the final output sequence.

Conclusion

In this chapter, we explained in detail the architecture of transformers in Generative AI. We mainly focused on its two main parts: the encoder and the decoder. The role of the encoder is to understand the input sequence by looking at the relationships between all the words. It uses self-attention and feed-forward layers to create a detailed representation of the input. The decoder takes the detailed representations of the input and generates the output sequence. It uses masked self-attention to ensure it generates the sequence in the correct order and utilizes encoder-decoder attention to integrate the information from the encoder.

By exploring how the encoder and the decoder work, we see how Transformers have fundamentally transformed the field of natural language processing (NLP). It is the encoder-decoder structure that makes the Transformer so powerful and effective across various industries and transforms the way we interact with AI systems.
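To tie the pieces of this chapter together, here is a minimal NumPy sketch of the residual-plus-layer-normalization pattern LayerNormalization(x + Sublayer(x)) used around each sub-layer. The sub-layer here is a stand-in position-wise feed-forward network with illustrative weights; a real encoder layer would apply the same wrapper around the multi-head attention and FFN sub-layers described above −

import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token vector to zero mean and unit variance
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def feed_forward(x, W1, b1, W2, b2):
    # Position-wise FFN: two linear layers with a ReLU in between
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def sublayer_block(x, sublayer):
    # Residual connection followed by layer normalization
    return layer_norm(x + sublayer(x))

# Example: 10 tokens, d_model = 512, FFN hidden size 2048
d_model, d_ff, seq_len = 512, 2048, 10
x = np.random.randn(seq_len, d_model)
W1, b1 = np.random.randn(d_model, d_ff) * 0.01, np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model) * 0.01, np.zeros(d_model)

out = sublayer_block(x, lambda h: feed_forward(h, W1, b1, W2, b2))
print(out.shape)  # (10, 512) - same shape as the input, ready for the next sub-layer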