Scikit Image – Using Plotly

Plotly in Python is commonly referred to as "plotly.py". It is a free and open-source plotting library built on top of "plotly.js". Plotly.py provides a rich set of features and supports more than 40 unique chart types. It is widely used for financial analysis, geographical mapping, scientific visualization, 3D plotting, and data analysis applications. It offers an interactive interface that allows users to explore and interact with data visualizations. It provides functionalities like zooming, panning, tooltips, and hover effects, making it easy to analyse and understand complex datasets.

Scikit Image Using Plotly

Plotly.py can be used along with the scikit-image library to achieve various data visualization tasks related to image processing. To set up Plotly, you need to ensure that the library is installed and properly configured.

Installing plotly using pip

Execute the below command in the command prompt to install the plotly module. It is an easy way to install the latest package of Plotly from PyPI.

pip install plotly

Installing plotly using conda

If you're already using the Anaconda distribution in your system, then you can directly use the conda package manager to install Plotly.

conda install -c plotly plotly

Once Plotly is installed, you can import it into your Python scripts or interactive sessions using the following statement −

import plotly

This imports the necessary modules from Plotly to create interactive and customizable visualizations. Below are a few basic Python programs that demonstrate how to use Plotly along with scikit-image to perform data visualization in image processing tasks effectively.

Example 1

The following example displays an RGB image using the plotly.express.imshow() method.

import plotly.express as px
from skimage import io

# Read an image
image = io.imread("Images/Tajmahal.jpg")

# Display the image using Plotly
fig = px.imshow(image)
fig.show()

Output

On executing the above program, you will get the following output −

Example 2

The following example demonstrates how to apply a circular mask to an image using scikit-image and NumPy, and then display the original image and the masked image side by side using Matplotlib.

import matplotlib.pyplot as plt
from skimage import io
import numpy as np

# Load the image
image_path = "Images_/Zoo.jpg"
image = io.imread(image_path)
image_copy = np.copy(image)

# Create a circular mask
rows, cols, _ = image.shape
row, col = np.ogrid[:rows, :cols]
center_row, center_col = rows / 2, cols / 2
radius = min(rows, cols) / 2
outer_disk_mask = ((row - center_row)**2 + (col - center_col)**2 > radius**2)

# Apply the mask to the image
image[outer_disk_mask] = 0

# Display the original and masked images using Matplotlib
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))
axes[0].imshow(image_copy)
axes[0].set_title("Original Image")
axes[0].axis("off")
axes[1].imshow(image)
axes[1].set_title("Masked Image")
axes[1].axis("off")
plt.tight_layout()
plt.show()

Output
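Since this chapter focuses on Plotly, the same side-by-side comparison can also be rendered with Plotly itself. The snippet below is a minimal sketch of that approach using plotly.subplots.make_subplots together with go.Image; the image path is the same illustrative path used above and is assumed to exist on your system.

import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from skimage import io

# Load the image and keep an untouched copy (illustrative path)
image = io.imread("Images_/Zoo.jpg")
image_copy = np.copy(image)

# Build the same circular mask as in Example 2
rows, cols, _ = image.shape
row, col = np.ogrid[:rows, :cols]
center_row, center_col = rows / 2, cols / 2
radius = min(rows, cols) / 2
outer_disk_mask = ((row - center_row)**2 + (col - center_col)**2 > radius**2)
image[outer_disk_mask] = 0

# Place the original and masked images in a 1x2 Plotly figure
fig = make_subplots(rows=1, cols=2, subplot_titles=("Original Image", "Masked Image"))
fig.add_trace(go.Image(z=image_copy), row=1, col=1)
fig.add_trace(go.Image(z=image), row=1, col=2)
fig.update_xaxes(showticklabels=False)
fig.update_yaxes(showticklabels=False)
fig.show()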
Scikit Learn – K-Nearest Neighbors (KNN)

This chapter will help you in understanding the nearest neighbor methods in Sklearn.

Neighbor-based learning methods are of both types, namely supervised and unsupervised. Supervised neighbors-based learning can be used for both classification as well as regression predictive problems, but it is mainly used for classification predictive problems in industry.

Neighbors-based learning methods do not have a specialised training phase and use all the data for training while classifying. They also do not assume anything about the underlying data. That's the reason they are lazy and non-parametric in nature.

The main principle behind nearest neighbor methods is −

To find a predefined number of training samples closest in distance to the new data point

Predict the label from this number of training samples.

Here, the number of samples can be a user-defined constant, as in K-nearest neighbor learning, or vary based on the local density of points, as in radius-based neighbor learning.

sklearn.neighbors Module

Scikit-learn has the sklearn.neighbors module that provides functionality for both unsupervised and supervised neighbors-based learning methods. As input, the classes in this module can handle either NumPy arrays or scipy.sparse matrices.

Types of algorithms

Different types of algorithms which can be used in the implementation of neighbor-based methods are as follows −

Brute Force

The brute-force computation of distances between all pairs of points in the dataset provides the most naive neighbor search implementation. Mathematically, for N samples in D dimensions, the brute-force approach scales as O[DN²]. For small data samples, this algorithm can be very useful, but it becomes infeasible as the number of samples grows. Brute force neighbor search can be enabled by writing the keyword algorithm='brute'.

K-D Tree

One of the tree-based data structures that have been invented to address the computational inefficiencies of the brute-force approach is the KD tree data structure. Basically, the KD tree is a binary tree structure which is called a K-dimensional tree. It recursively partitions the parameter space along the data axes by dividing it into nested axis-aligned (orthotropic) regions into which the data points are filed.

Advantages

Following are some advantages of the K-D tree algorithm −

Construction is fast − As the partitioning is performed only along the data axes, the K-D tree's construction is very fast.

Less distance computations − This algorithm requires very few distance computations to determine the nearest neighbor of a query point. It only takes O[log(N)] distance computations.

Disadvantages

Fast for only low-dimensional neighbor searches − It is very fast for low-dimensional (D < 20) neighbor searches, but as D grows it becomes inefficient. K-D tree neighbor searches can be enabled by writing the keyword algorithm='kd_tree'.

Ball Tree

As we know, the KD tree is inefficient in higher dimensions; hence, to address this inefficiency of the KD tree, the Ball tree data structure was developed. Mathematically, it recursively divides the data into nodes defined by a centroid C and radius r, in such a way that each point in the node lies within the hyper-sphere defined by centroid C and radius r.
It uses the triangle inequality, given below, which reduces the number of candidate points for a neighbor search −

$$\lvert X+Y\rvert \leq \lvert X\rvert + \lvert Y\rvert$$

Advantages

Following are some advantages of the Ball Tree algorithm −

Efficient on highly structured data − As the ball tree partitions the data into a series of nesting hyper-spheres, it is efficient on highly structured data.

Out-performs KD-tree − The ball tree out-performs the KD tree in high dimensions because of the spherical geometry of the ball tree nodes.

Disadvantages

Costly − Partitioning the data into a series of nesting hyper-spheres makes its construction very costly.

Ball tree neighbor searches can be enabled by writing the keyword algorithm='ball_tree'.

Choosing Nearest Neighbors Algorithm

The choice of an optimal algorithm for a given dataset depends upon the following factors −

Number of samples (N) and Dimensionality (D)

These are the most important factors to be considered while choosing a Nearest Neighbor algorithm. It is because of the reasons given below −

The query time of the Brute Force algorithm grows as O[DN].

The query time of the Ball tree algorithm grows as O[D log(N)].

The query time of the KD tree algorithm changes with D in a strange manner that is very difficult to characterize. When D < 20, the cost is O[D log(N)] and this algorithm is very efficient. On the other hand, it is inefficient when D > 20 because the cost increases to nearly O[DN].

Data Structure

Another factor that affects the performance of these algorithms is the intrinsic dimensionality of the data or the sparsity of the data. It is because the query times of the Ball tree and KD tree algorithms can be greatly influenced by it, whereas the query time of the Brute Force algorithm is unchanged by data structure. Generally, Ball tree and KD tree algorithms produce faster query times when implemented on sparser data with smaller intrinsic dimensionality.

Number of Neighbors (k)

The number of neighbors (k) requested for a query point affects the query time of the Ball tree and KD tree algorithms. Their query time becomes slower as the number of neighbors (k) increases, whereas the query time of Brute Force remains unaffected by the value of k.

Number of query points

Because they need a construction phase, both KD tree and Ball tree algorithms will be effective if there are a large number of query points. On the other hand, if there are a smaller number of query points, the Brute Force algorithm performs better than the KD tree and Ball tree algorithms.
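As a quick illustration of the algorithm keyword discussed in this chapter, the following minimal sketch queries the three nearest neighbors of each point in a small, made-up dataset with sklearn.neighbors.NearestNeighbors; swapping 'ball_tree' for 'kd_tree', 'brute', or the default 'auto' (which lets scikit-learn attempt to pick the most appropriate method based on the factors above) changes only the search strategy, not the results.

import numpy as np
from sklearn.neighbors import NearestNeighbors

# A small illustrative 2-D dataset
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])

# Build the index with a ball tree; 'kd_tree', 'brute' or 'auto' work the same way
nbrs = NearestNeighbors(n_neighbors=3, algorithm="ball_tree").fit(X)

# Distances to, and indices of, the 3 nearest neighbors of each point (including itself)
distances, indices = nbrs.kneighbors(X)
print(indices)
print(distances)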
Scikit Learn – Anomaly Detection

Here, we will learn about what anomaly detection in Sklearn is and how it is used in the identification of data points.

Anomaly detection is a technique used to identify data points in a dataset that do not fit well with the rest of the data. It has many applications in business such as fraud detection, intrusion detection, system health monitoring, surveillance, and predictive maintenance. Anomalies, which are also called outliers, can be divided into the following three categories −

Point anomalies − It occurs when an individual data instance is considered anomalous w.r.t the rest of the data.

Contextual anomalies − Such kind of anomaly is context specific. It occurs if a data instance is anomalous in a specific context.

Collective anomalies − It occurs when a collection of related data instances is anomalous w.r.t the entire dataset rather than individual values.

Methods

Two methods, namely outlier detection and novelty detection, can be used for anomaly detection. It is necessary to see the distinction between them.

Outlier detection

The training data contains outliers, i.e. observations that are far from the rest of the data. That's the reason outlier detection estimators always try to fit the region having the most concentrated training data while ignoring the deviant observations. It is also known as unsupervised anomaly detection.

Novelty detection

It is concerned with detecting an unobserved pattern in new observations which is not included in the training data. Here, the training data is not polluted by outliers. It is also known as semi-supervised anomaly detection.

There is a set of ML tools, provided by scikit-learn, which can be used for both outlier detection as well as novelty detection. These tools first learn a model from the data in an unsupervised way by using the fit() method as follows −

estimator.fit(X_train)

Now, the new observations would be sorted as inliers (labeled 1) or outliers (labeled -1) by using the predict() method as follows −

estimator.predict(X_test)

The estimator will first compute the raw scoring function and then the predict method will make use of a threshold on that raw scoring function. We can access this raw scoring function with the help of the score_samples method and can control the threshold by the contamination parameter. We can also use the decision_function method that defines outliers as negative values and inliers as non-negative values.

estimator.decision_function(X_test)

Sklearn algorithms for Outlier Detection

Let us begin by understanding what an elliptic envelope is.

Fitting an elliptic envelope

This algorithm assumes that regular data comes from a known distribution such as a Gaussian distribution. For outlier detection, Scikit-learn provides an object named covariance.EllipticEnvelope. This object fits a robust covariance estimate to the data, and thus fits an ellipse to the central data points. It ignores the points outside the central mode.

Parameters

The following table consists of the parameters used by the sklearn.covariance.EllipticEnvelope method −

Sr.No Parameter & Description

1 store_precision − Boolean, optional, default = True
We can specify it if the estimated precision is stored.

2 assume_centered − Boolean, optional, default = False
If we set it False, it will compute the robust location and covariance directly with the help of the FastMCD algorithm. On the other hand, if set True, it will compute the support of the robust location and covariance.
3 support_fraction − float in (0., 1.), optional, default = None
This parameter tells the method how much proportion of points is to be included in the support of the raw MCD estimates.

4 contamination − float in (0., 1.), optional, default = 0.1
It provides the proportion of the outliers in the data set.

5 random_state − int, RandomState instance or None, optional, default = none
This parameter represents the seed of the pseudo random number generator which is used while shuffling the data. Following are the options −
int − In this case, random_state is the seed used by the random number generator.
RandomState instance − In this case, random_state is the random number generator.
None − In this case, the random number generator is the RandomState instance used by np.random.

Attributes

The following table consists of the attributes used by the sklearn.covariance.EllipticEnvelope method −

Sr.No Attributes & Description

1 support_ − array-like, shape(n_samples,)
It represents the mask of the observations used to compute robust estimates of location and shape.

2 location_ − array-like, shape (n_features)
It returns the estimated robust location.

3 covariance_ − array-like, shape (n_features, n_features)
It returns the estimated robust covariance matrix.

4 precision_ − array-like, shape (n_features, n_features)
It returns the estimated pseudo inverse matrix.

5 offset_ − float
It is used to define the decision function from the raw scores. decision_function = score_samples - offset_

Implementation Example

import numpy as np
from sklearn.covariance import EllipticEnvelope

true_cov = np.array([[.5, .6],[.6, .4]])
X = np.random.RandomState(0).multivariate_normal(mean=[0, 0], cov=true_cov, size=500)
cov = EllipticEnvelope(random_state=0).fit(X)
# Now we can use the predict method. It will return 1 for an inlier and -1 for an outlier.
cov.predict([[0, 0],[2, 2]])

Output

array([ 1, -1])

Isolation Forest

In the case of a high-dimensional dataset, one efficient way for outlier detection is to use random forests. Scikit-learn provides the ensemble.IsolationForest method that isolates the observations by randomly selecting a feature. Afterwards, it randomly selects a value between the maximum and minimum values of the selected feature. Here, the number of splittings needed to isolate a sample is equivalent to the path length from the root node to the terminating node.

Parameters

The following table consists of the parameters used by the sklearn.ensemble.IsolationForest method −

Sr.No Parameter & Description

1 n_estimators − int, optional, default = 100
It represents the number of base estimators in the ensemble.

2 max_samples − int or float, optional, default = "auto"
It represents the number of samples to be drawn from X to train each base estimator. If we choose int as its value, it will draw max_samples samples. If we choose float as its value, it will draw max_samples * X.shape[0] samples. And, if we choose auto as its value, it will draw max_samples = min(256, n_samples).

3 support_fraction − float in (0., 1.), optional, default =
Machine Learning – Performance Metrics

Performance metrics in machine learning are used to evaluate the performance of a machine learning model. These metrics provide quantitative measures to assess how well a model is performing and to compare the performance of different models.

Performance metrics are important because they help us understand how well our model is performing and whether it is meeting our requirements. In this way, we can make informed decisions about whether to use a particular model or not.

There are many performance metrics that can be used in machine learning, depending on the type of problem being solved and the specific requirements of the problem. Some common performance metrics include −

Accuracy − Accuracy is one of the most basic performance metrics and measures the proportion of correctly classified instances in the dataset. It is calculated as the number of correctly classified instances divided by the total number of instances in the dataset.

Precision − Precision measures the proportion of true positive instances out of all predicted positive instances. It is calculated as the number of true positive instances divided by the sum of true positive and false positive instances.

Recall − Recall measures the proportion of true positive instances out of all actual positive instances. It is calculated as the number of true positive instances divided by the sum of true positive and false negative instances.

F1 Score − F1 score is the harmonic mean of precision and recall. It is a balanced measure that takes into account both precision and recall. It is calculated as 2 * (precision × recall) / (precision + recall).

ROC AUC Score − ROC AUC (Receiver Operating Characteristic Area Under the Curve) score is a measure of the ability of a classifier to distinguish between positive and negative instances. It is calculated by plotting the true positive rate against the false positive rate at different classification thresholds and calculating the area under the curve.

Confusion Matrix − A confusion matrix is a table that is used to evaluate the performance of a classification model. It shows the number of true positives, true negatives, false positives, and false negatives for each class in the dataset.

Example

Here is an example code snippet to calculate the accuracy, precision, recall, and F1 score for a multiclass classification problem (the Iris dataset has three classes, so macro averaging is used) −

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a logistic regression model on the training set
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Compute performance metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average="macro")
recall = recall_score(y_test, y_pred, average="macro")
f1 = f1_score(y_test, y_pred, average="macro")

# Print the performance metrics
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)

Output

When you execute this code, it will produce the following output −

Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1 Score: 1.0
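The ROC AUC score and the confusion matrix described above can be computed in the same way. The snippet below is a minimal sketch that repeats the same setup; for a multiclass problem, roc_auc_score needs class probabilities and a multi_class strategy such as one-vs-rest. The max_iter value is only there to let the solver converge cleanly and is not part of the original example.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score

# Same setup as the previous example
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Confusion matrix: rows are actual classes, columns are predicted classes
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

# ROC AUC for a multiclass problem: use predicted probabilities and one-vs-rest averaging
y_proba = model.predict_proba(X_test)
print("ROC AUC Score:", roc_auc_score(y_test, y_proba, multi_class="ovr"))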
Prompt Engineering – What is Generative AI? In this chapter, we will delve into the world of generative AI and its role in prompt engineering. Generative AI refers to a class of artificial intelligence techniques that focus on creating data, such as images, text, or audio, rather than processing existing data. We will explore how generative AI models, particularly generative language models, play a crucial role in prompt engineering and how they can be fine-tuned for various NLP tasks. Generative Language Models Generative language models, such as GPT-3 and other variants, have gained immense popularity due to their ability to generate coherent and contextually relevant text. Generative language models can be used for a wide range of tasks, including text generation, translation, summarization, and more. They serve as a foundation for prompt engineering by providing contextually aware responses to custom prompts. Fine-Tuning Generative Language Models Fine-tuning is the process of adapting a pre-trained language model to a specific task or domain using task-specific data. Prompt engineers can fine-tune generative language models with domain-specific datasets, creating prompt-based language models that excel in specific tasks. Customizing Model Responses Custom Prompt Engineering − Prompt engineers have the flexibility to customize model responses through the use of tailored prompts and instructions. Role of Generative AI − Generative AI models allow for more dynamic and interactive interactions, where model responses can be modified by incorporating user instructions and constraints in the prompts. Creative Writing and Storytelling Creative Writing Applications − Generative AI models are widely used in creative writing tasks, such as generating poetry, short stories, and even interactive storytelling experiences. Co-Creation with Users − By involving users in the writing process through interactive prompts, generative AI can facilitate co-creation, allowing users to collaborate with the model in storytelling endeavors. Language Translation Multilingual Prompting − Generative language models can be fine-tuned for multilingual translation tasks, enabling prompt engineers to build prompt-based translation systems. Real-Time Translation − Interactive translation prompts allow users to obtain instant translation responses from the model, making it a valuable tool for multilingual communication. Multimodal Prompting Integrating Different Modalities − Generative AI models can be extended to multimodal prompts, where users can combine text, images, audio, and other forms of input to elicit responses from the model. Enhanced Contextual Understanding − Multimodal prompts enable generative AI models to provide more comprehensive and contextually aware responses, enhancing the user experience. Ethical Considerations Responsible Use of Generative AI − As with any AI technology, prompt engineers must consider ethical implications, potential biases, and the responsible use of generative AI models. Addressing Potential Risks − Prompt engineers should be vigilant in monitoring and mitigating risks associated with content generation and ensure that the models are deployed responsibly. Future Directions Continual Advancements − Generative AI is an active area of research, and prompt engineers can expect continuous advancements in model architectures and training techniques. 
Integration with Other AI Technologies − The integration of generative AI with other AI technologies, such as reinforcement learning and multimodal fusion, holds promise for even more sophisticated prompt-based language models. Conclusion In this chapter, we explored the role of generative AI in prompt engineering and how generative language models serve as a powerful foundation for contextually aware responses. By fine-tuning generative language models and customizing model responses through tailored prompts, prompt engineers can create interactive and dynamic language models for various applications. From creative writing and language translation to multimodal interactions, generative AI plays a significant role in enhancing user experiences and enabling co-creation between users and language models. As prompt engineering continues to evolve, generative AI will undoubtedly play a central role in shaping the future of human-computer interactions and NLP applications.
Prompt Engineering – Introduction

Prompt engineering is the process of crafting text prompts that help large language models (LLMs) generate more accurate, consistent, and creative outputs. By carefully choosing the words and phrases in a prompt, prompt engineers can influence the way that an LLM interprets a task and the results that it produces.

What are Prompts?

In the context of AI models, prompts are input instructions or cues that shape the model's response. These prompts can be in the form of natural language instructions, system-defined instructions, or conditional constraints. A prompt is a short piece of text that is used to guide an LLM's response. It can be as simple as a single sentence, or it can be more complex, with multiple clauses and instructions. The goal of a prompt is to provide the LLM with enough information to understand what is being asked of it, and to generate a relevant and informative response. By providing clear and explicit prompts, developers can guide the model's behavior and influence the generated output.

Types of Prompts

There can be a wide variety of prompts, which you will get to know during the course of this tutorial. This being an introductory chapter, let's start with a small set to highlight the different types of prompts that one can use −

Natural Language Prompts − These prompts emulate human-like instructions, providing guidance in the form of natural language cues. They allow developers to interact with the model more intuitively, using instructions that resemble how a person would communicate.

System Prompts − System prompts are predefined instructions or templates that developers provide to guide the model's output. They offer a structured way of specifying the desired output format or behavior, providing explicit instructions to the model.

Conditional Prompts − Conditional prompts involve conditioning the model on specific context or constraints. By incorporating conditional prompts, developers can guide the model's behavior based on conditional statements, such as "If X, then Y" or "Given A, generate B."

How Does Prompt Engineering Work?

Prompt engineering is a complex and iterative process. There is no single formula for creating effective prompts, and the best approach will vary depending on the specific LLM and the task at hand. However, there are some general principles that prompt engineers can follow −

Start with a clear understanding of the task − What do you want the LLM to do? What kind of output are you looking for? Once you have a clear understanding of the task, you can start to craft a prompt that will help the LLM achieve your goals.

Use clear and concise language − The LLM should be able to understand your prompt without any ambiguity. Use simple words and phrases, and avoid jargon or technical terms.

Be specific − The more specific you are in your prompt, the more likely the LLM is to generate a relevant and informative response. For example, instead of asking the LLM to "write a poem," you could ask it to "write a poem about a lost love."

Use examples − If possible, provide the LLM with examples of the kind of output you are looking for. This will help the LLM to understand your expectations and to generate more accurate results (an illustrative prompt pair is shown after this list).

Experiment − There is no one-size-fits-all approach to prompt engineering. The best way to learn what works is to experiment with different prompts and see what results you get.
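To make these principles concrete, here is an illustrative, made-up pair of prompts for the same task; the second applies the guidelines above by being specific about the topic, form, and tone, and by including an example of the expected output.

Vague prompt: "Write a poem."

Improved prompt: "Write a four-line rhyming poem about a lost love, in a melancholic tone. For example, a suitable opening line would be: 'The letters that you left behind still speak.'"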
Evaluating and Validating Prompts

Evaluating prompt effectiveness is crucial to assess the model's behavior and performance. Metrics such as output quality, relevance, and coherence can help evaluate the impact of different prompts. User feedback and human evaluation can provide valuable insights into prompt efficacy, ensuring the desired output is achieved consistently.

Ethical Considerations in Prompt Engineering

Prompt engineering should address ethical considerations to ensure fairness and mitigate biases. Designing prompts that promote inclusivity and diversity while avoiding the reinforcement of existing biases is essential. Careful evaluation and monitoring of prompt impact on the model's behavior can help identify and mitigate potential ethical risks.

Benefits of Prompt Engineering

Prompt engineering can be a powerful tool for improving the performance of LLMs. By carefully crafting prompts, prompt engineers can help LLMs to generate more accurate, consistent, and creative outputs. This can be beneficial for a variety of applications, including −

Question answering − Prompt engineering can be used to improve the accuracy of LLMs' answers to factual questions.

Creative writing − Prompt engineering can be used to help LLMs generate more creative and engaging text, such as poems, stories, and scripts.

Machine translation − Prompt engineering can be used to improve the accuracy of LLMs' translations between languages.

Coding − Prompt engineering can be used to help LLMs generate more accurate and efficient code.

Future Directions and Open Challenges

Prompt engineering is an evolving field, and there are ongoing research efforts to explore its potential further. Future directions may involve automated prompt generation techniques, adaptive prompts that evolve with user interactions, and addressing challenges related to nuanced prompts for complex tasks.

Prompt engineering is a powerful tool in enhancing AI models and achieving desired outputs. By employing effective prompts, developers can guide the behavior of AI models, control biases, and improve the overall performance and reliability of AI applications. As the field progresses, continued exploration of prompt engineering techniques and best practices will pave the way for even more sophisticated and contextually aware AI models.
Scikit Learn – Boosting Methods

In this chapter, we will learn about the boosting methods in Sklearn, which enable building an ensemble model.

Boosting methods build an ensemble model in an incremental way. The main principle is to build the model incrementally by training each base model estimator sequentially. In order to build a powerful ensemble, these methods basically combine several weak learners which are sequentially trained over multiple iterations of the training data. The sklearn.ensemble module has the following two boosting methods.

AdaBoost

It is one of the most successful boosting ensemble methods, whose main key is in the way it gives weights to the instances in the dataset: misclassified instances receive higher weights, so the algorithm pays more attention to them while constructing subsequent models.

Classification with AdaBoost

For creating an AdaBoost classifier, the Scikit-learn module provides sklearn.ensemble.AdaBoostClassifier. While building this classifier, the main parameter this module uses is base_estimator. Here, base_estimator is the value of the base estimator from which the boosted ensemble is built. If we set this parameter's value to None, the base estimator would be DecisionTreeClassifier(max_depth=1).

Implementation example

In the following example, we are building an AdaBoost classifier by using sklearn.ensemble.AdaBoostClassifier and also predicting and checking its score.

from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=10, n_informative=2, n_redundant=0, random_state=0, shuffle=False)
ADBclf = AdaBoostClassifier(n_estimators=100, random_state=0)
ADBclf.fit(X, y)

Output

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, learning_rate=1.0, n_estimators=100, random_state=0)

Example

Once fitted, we can predict for new values as follows −

print(ADBclf.predict([[0, 2, 3, 0, 1, 1, 1, 1, 2, 2]]))

Output

[1]

Example

Now we can check the score as follows −

ADBclf.score(X, y)

Output

0.995

Example

We can also use a sklearn dataset to build a classifier using the AdaBoost method. For example, in the example given below, we are using the Pima-Indian dataset.

from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import AdaBoostClassifier
path = r"C:pima-indians-diabetes.csv"
headernames = ["preg", "plas", "pres", "skin", "test", "mass", "pedi", "age", "class"]
data = read_csv(path, names=headernames)
array = data.values
X = array[:,0:8]
Y = array[:,8]
seed = 5
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
num_trees = 100
ADBclf = AdaBoostClassifier(n_estimators=num_trees)
results = cross_val_score(ADBclf, X, Y, cv=kfold)
print(results.mean())

Output

0.7851435406698566

Regression with AdaBoost

For creating a regressor with the AdaBoost method, the Scikit-learn library provides sklearn.ensemble.AdaBoostRegressor. While building a regressor, it will use the same parameters as used by sklearn.ensemble.AdaBoostClassifier.

Implementation example

In the following example, we are building an AdaBoost regressor by using sklearn.ensemble.AdaBoostRegressor and also predicting for new values by using the predict() method.
from sklearn.ensemble import AdaBoostRegressor
from sklearn.datasets import make_regression
X, y = make_regression(n_features=10, n_informative=2, random_state=0, shuffle=False)
ADBregr = AdaBoostRegressor(random_state=0, n_estimators=100)
ADBregr.fit(X, y)

Output

AdaBoostRegressor(base_estimator=None, learning_rate=1.0, loss='linear', n_estimators=100, random_state=0)

Example

Once fitted, we can predict from the regression model as follows −

print(ADBregr.predict([[0, 2, 3, 0, 1, 1, 1, 1, 2, 2]]))

Output

[85.50955817]

Gradient Tree Boosting

It is also called Gradient Boosted Regression Trees (GBRT). It is basically a generalization of boosting to arbitrary differentiable loss functions. It produces a prediction model in the form of an ensemble of weak prediction models. It can be used for regression and classification problems. Their main advantage lies in the fact that they naturally handle mixed-type data.

Classification with Gradient Tree Boost

For creating a Gradient Tree Boost classifier, the Scikit-learn module provides sklearn.ensemble.GradientBoostingClassifier. While building this classifier, the main parameter this module uses is 'loss'. Here, 'loss' is the value of the loss function to be optimized. If we choose loss = deviance, it refers to deviance for classification with probabilistic outputs. On the other hand, if we choose this parameter's value to be exponential, then it recovers the AdaBoost algorithm. The parameter n_estimators will control the number of weak learners. A hyper-parameter named learning_rate (in the range of (0.0, 1.0]) will control overfitting via shrinkage.

Implementation example

In the following example, we are building a Gradient Boosting classifier by using sklearn.ensemble.GradientBoostingClassifier. We are fitting this classifier with 50 weak learners.

from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier
X, y = make_hastie_10_2(random_state=0)
X_train, X_test = X[:5000], X[5000:]
y_train, y_test = y[:5000], y[5000:]
GDBclf = GradientBoostingClassifier(n_estimators=50, learning_rate=1.0, max_depth=1, random_state=0).fit(X_train, y_train)
GDBclf.score(X_test, y_test)

Output

0.8724285714285714

Example

We can also use a sklearn dataset to build a classifier using the Gradient Boosting Classifier. As in the following example, we are using the Pima-Indian dataset.

from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingClassifier
path = r"C:pima-indians-diabetes.csv"
headernames = ["preg", "plas", "pres", "skin", "test", "mass", "pedi", "age", "class"]
data = read_csv(path, names=headernames)
array = data.values
X = array[:,0:8]
Y = array[:,8]
seed = 5
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
num_trees = 100
max_features = 5
GDBclf = GradientBoostingClassifier(n_estimators=num_trees, max_features=max_features)
results = cross_val_score(GDBclf, X, Y, cv=kfold)
print(results.mean())

Output

0.7946582356674234

Regression with Gradient Tree Boost

For creating a regressor with the Gradient Tree Boost method, the Scikit-learn library provides sklearn.ensemble.GradientBoostingRegressor. It can specify the loss function for regression via the parameter named loss. The default value for loss is 'ls'.
Implementation example

In the following example, we are building a Gradient Boosting regressor by using sklearn.ensemble.GradientBoostingRegressor and also finding the mean squared error by using the mean_squared_error() method.

import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
X, y = make_friedman1(n_samples=2000, random_state=0, noise=1.0)
X_train, X_test = X[:1000], X[1000:]
y_train, y_test = y[:1000], y[1000:]
GDBreg = GradientBoostingRegressor(n_estimators=80, learning_rate=0.1, max_depth=1, random_state=0, loss='ls').fit(X_train, y_train)

Once fitted, we can find the mean squared error as follows −

mean_squared_error(y_test, GDBreg.predict(X_test))

Output

5.391246106657164
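To see how the learning_rate and n_estimators mentioned earlier interact, the sketch below (illustrative, not part of the original example) uses staged_predict() to track the test error after each boosting stage, which is a common way to judge how many weak learners are actually needed.

import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor

# Same synthetic dataset as above
X, y = make_friedman1(n_samples=2000, random_state=0, noise=1.0)
X_train, X_test = X[:1000], X[1000:]
y_train, y_test = y[:1000], y[1000:]

# A smaller learning rate usually needs more estimators to reach the same error
reg = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1, max_depth=1, random_state=0).fit(X_train, y_train)

# staged_predict() yields the ensemble prediction after each added weak learner
test_errors = [mean_squared_error(y_test, y_pred) for y_pred in reg.staged_predict(X_test)]

best_stage = int(np.argmin(test_errors)) + 1
print("Lowest test MSE:", min(test_errors), "reached with", best_stage, "weak learners")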
Scikit Learn – Support Vector Machines

This chapter deals with a machine learning method termed Support Vector Machines (SVMs).

Introduction

Support vector machines (SVMs) are powerful yet flexible supervised machine learning methods used for classification, regression, and outlier detection. SVMs are very efficient in high-dimensional spaces and are generally used in classification problems. SVMs are popular and memory efficient because they use a subset of training points in the decision function.

The main goal of SVMs is to divide the datasets into a number of classes in order to find a maximum marginal hyperplane (MMH), which can be done in the following two steps −

Support Vector Machines will first generate hyperplanes iteratively that separate the classes in the best way.

After that, it will choose the hyperplane that segregates the classes correctly.

Some important concepts in SVM are as follows −

Support Vectors − They may be defined as the datapoints which are closest to the hyperplane. Support vectors help in deciding the separating line.

Hyperplane − The decision plane or space that divides a set of objects having different classes.

Margin − The gap between two lines on the closest data points of different classes is called the margin.

The following diagrams will give you an insight into these SVM concepts −

SVM in Scikit-learn supports both sparse and dense sample vectors as input.

Classification of SVM

Scikit-learn provides three classes, namely SVC, NuSVC and LinearSVC, which can perform multiclass classification.

SVC

It is C-support vector classification whose implementation is based on libsvm. The module used by scikit-learn is sklearn.svm.SVC. This class handles the multiclass support according to a one-vs-one scheme.

Parameters

The following table consists of the parameters used by the sklearn.svm.SVC class −

Sr.No Parameter & Description

1 C − float, optional, default = 1.0
It is the penalty parameter of the error term.

2 kernel − string, optional, default = 'rbf'
This parameter specifies the type of kernel to be used in the algorithm. We can choose any one among 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'. The default value of kernel is 'rbf'.

3 degree − int, optional, default = 3
It represents the degree of the 'poly' kernel function and will be ignored by all other kernels.

4 gamma − {'scale', 'auto'} or float, optional, default = 'scale'
It is the kernel coefficient for the kernels 'rbf', 'poly' and 'sigmoid'. If you choose the default, i.e. gamma = 'scale', then the value of gamma used by SVC is 1/(n_features * X.var()). On the other hand, if gamma = 'auto', it uses 1/n_features.

5 coef0 − float, optional, default = 0.0
An independent term in the kernel function which is only significant in 'poly' and 'sigmoid'.

6 tol − float, optional, default = 1.e-3
This parameter represents the stopping criterion for iterations.

7 shrinking − Boolean, optional, default = True
This parameter represents whether we want to use the shrinking heuristic or not.

8 verbose − Boolean, default: false
It enables or disables verbose output. Its default value is false.

9 probability − boolean, optional, default = false
This parameter enables or disables probability estimates. It is disabled by default and must be enabled before we call fit.

10 max_iter − int, optional, default = -1
As the name suggests, it represents the maximum number of iterations within the solver. A value of -1 means there is no limit on the number of iterations.
11 cache_size − float, optional
This parameter will specify the size of the kernel cache. The value will be in MB (MegaBytes).

12 random_state − int, RandomState instance or None, optional, default = none
This parameter represents the seed of the pseudo random number generator which is used while shuffling the data. Following are the options −
int − In this case, random_state is the seed used by the random number generator.
RandomState instance − In this case, random_state is the random number generator.
None − In this case, the random number generator is the RandomState instance used by np.random.

13 class_weight − {dict, 'balanced'}, optional
This parameter will set the parameter C of class j to class_weight[j] * C for SVC. If we use the default option, it means all the classes are supposed to have weight one. On the other hand, if you choose class_weight:balanced, it will use the values of y to automatically adjust weights.

14 decision_function_shape − 'ovo', 'ovr', default = 'ovr'
This parameter will decide whether the algorithm will return an 'ovr' (one-vs-rest) decision function of shape as all other classifiers, or the original 'ovo' (one-vs-one) decision function of libsvm.

15 break_ties − boolean, optional, default = false
True − The predict will break ties according to the confidence values of decision_function.
False − The predict will return the first class among the tied classes.

Attributes

The following table consists of the attributes used by the sklearn.svm.SVC class −

Sr.No Attributes & Description

1 support_ − array-like, shape = [n_SV]
It returns the indices of support vectors.

2 support_vectors_ − array-like, shape = [n_SV, n_features]
It returns the support vectors.

3 n_support_ − array-like, dtype=int32, shape = [n_class]
It represents the number of support vectors for each class.

4 dual_coef_ − array, shape = [n_class-1, n_SV]
These are the coefficients of the support vectors in the decision function.

5 coef_ − array, shape = [n_class * (n_class-1)/2, n_features]
This attribute, only available in the case of a linear kernel, provides the weight assigned to the features.

6 intercept_ − array, shape = [n_class * (n_class-1)/2]
It represents the independent term (constant) in the decision function.

7 fit_status_ − int
The output would be 0 if it is correctly fitted. The output would be 1 if it is incorrectly fitted.

8 classes_ − array of shape = [n_classes]
It gives the labels of the classes.

Implementation Example

Like other classifiers, SVC also has to be fitted with the following two arrays −

An array X holding the training samples. It is of size [n_samples, n_features].

An array Y holding the target values, i.e. class labels for the training samples. It is of size [n_samples].

The following Python script uses the sklearn.svm.SVC class −
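Here, as a minimal sketch, a small hand-made dataset is used; the arrays X and Y below are purely illustrative and stand in for real training data.

import numpy as np
from sklearn.svm import SVC

# A tiny illustrative dataset: two well-separated classes in 2-D
X = np.array([[1, 1], [2, 1], [1, 2], [8, 8], [9, 8], [8, 9]])
Y = np.array([0, 0, 0, 1, 1, 1])

# Fit an SVC with a linear kernel
SVCclf = SVC(kernel="linear", C=1.0)
SVCclf.fit(X, Y)

# Inspect a few of the attributes listed above and predict a new sample
print(SVCclf.support_vectors_)
print(SVCclf.n_support_)
print(SVCclf.predict([[5, 5]]))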
Scikit Image – Numpy Images

NumPy (also known as "Numerical Python") is one of the most crucial fundamental packages in Python for numerical computing. The core data structure of NumPy is the ndarray (N-dimensional array), which is a homogeneous collection of elements of the same data type. These arrays can be of any dimension, such as 1D, 2D, or even higher-dimensional arrays. NumPy provides a vast collection of mathematical functions to operate on these N-dimensional arrays efficiently.

Images in scikit-image are represented as NumPy ndarrays (multidimensional arrays). The scikit-image library is built on top of NumPy and uses NumPy arrays to represent images. Hence, the scikit-image library can perform various image-processing tasks effectively.

Representing images as NumPy arrays

Representing images as NumPy arrays provides a convenient and efficient way to store and manipulate image data. Here, the dimensions of the NumPy array correspond to the image dimensions, such as height, width, and color channels.

For grayscale images, the array is typically two-dimensional (height x width). For color images, the array is three-dimensional (height x width x 3), where the last dimension represents the Red, Green, and Blue color channels.

Example 1

The following example demonstrates how a color image can be represented as a NumPy array in scikit-image.

from skimage import io

# Read a color image
img_array = io.imread("Images/Dog.jpg")

# Display image properties from the image array
print("The following are the properties of the loaded image:")
print("Data type of the image object:", type(img_array))
print("Image shape:", img_array.shape)
print("Image data type:", img_array.dtype)

Input Image

Output

The following are the properties of the loaded image:
Data type of the image object: <class 'numpy.ndarray'>
Image shape: (479, 500, 3)
Image data type: uint8

Example 2

Let's see the NumPy array representation of a grayscale image.

from skimage import io

# Read an image as a grayscale image
img_array = io.imread("Images/dog.jpg", as_gray=True)

# Display image properties from the image array
print("The following are the properties of the loaded image:")
print("Data type of the image object:", type(img_array))
print("Image shape:", img_array.shape)
print("Image data type:", img_array.dtype)

Output

The following are the properties of the loaded image:
Data type of the image object: <class 'numpy.ndarray'>
Image shape: (479, 500)
Image data type: float64

Indexing and Slicing

NumPy's indexing and slicing features can be used to access and manipulate image data. Cropping images, selecting specific color channels, or applying operations to specific regions within the image is possible by using NumPy's flexible indexing and slicing syntax.

Example

The following example demonstrates how the indexing and slicing syntax of NumPy can be used to modify an image in scikit-image.
from skimage import io

# Read a color image
img_array = io.imread("Images/Tajmahal.jpg")

# Get the value of the pixel at the 10th row and 20th column
pixel_value = img_array[10, 20]
print("The pixel at the 10th row and 20th column of the image array", pixel_value)

# Set value 0 to the pixel at the 3rd row and 10th column
img_array[3, 10] = 0

# Select a region in the image
roi = img_array[100:200, 200:300]

# Set the pixel values in the selected region to red (255, 0, 0)
roi[:] = (255, 0, 0)

# Display the modified image
io.imshow(img_array)
io.show()

Input Image

Output

Running the above code gives us the following result −

The pixel at the 10th row and 20th column of the image array [ 81 97 110]

In addition, it generates the following image −
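Building on the same idea, the sketch below (an illustrative example, reusing the hypothetical Images/Tajmahal.jpg path from above) shows two other common indexing operations mentioned earlier: cropping a rectangular region and selecting a single color channel.

from skimage import io

# Read a color image (illustrative path)
img_array = io.imread("Images/Tajmahal.jpg")

# Crop a 150 x 200 pixel region starting at row 50, column 100
cropped = img_array[50:200, 100:300]

# Select the red channel only (a 2-D array of shape height x width)
red_channel = img_array[:, :, 0]

print("Original shape:", img_array.shape)
print("Cropped shape:", cropped.shape)
print("Red channel shape:", red_channel.shape)

# Display the cropped region
io.imshow(cropped)
io.show()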
Scikit Image – Using Mayavi

Mayavi is an application and library for interactive scientific data visualization and 3D plotting in Python. It provides a simple and clean scripting interface in Python for 3D visualization. It offers ready-to-use 3D visualization functionality similar to MATLAB or Matplotlib, especially when using the mlab module. This module provides a high-level interface that allows you to easily create various types of 3D plots and visualizations.

Mayavi also offers an object-oriented programming interface, allowing you to have more control and flexibility over your 3D visualizations. And it can work natively and transparently with NumPy arrays, which makes it convenient to visualize scientific data stored in NumPy arrays without the need for data conversion or preprocessing.

Scikit Image with Mayavi

To use Mayavi as a plotting engine in your Python scripts, you can use the mlab scripting API, which provides a simple and convenient way to work with Mayavi and generate TVTK datasets using NumPy arrays or other sequences.

Installing Mayavi

To set up Mayavi and run the visualizations generated by the code, you need to install PyQt along with the Mayavi library. PyQt is a dependency that provides the necessary graphical user interface (GUI) functionality for displaying the visualizations created with Mayavi.

pip install mayavi
pip install PyQt5

It is recommended to use pip, the Python package installer, for installing Python packages from PyPI. This installs the latest version of Mayavi available on PyPI. Once the required packages are installed successfully, you can import Mayavi into your Python scripts or interactive sessions using −

from mayavi import mlab

This imports the necessary modules from Mayavi for 3D visualization and scientific data plotting in your Python scripts. Below are a few basic Python programs that demonstrate how to use scikit-image along with Mayavi to perform data visualization in image processing tasks effectively.

Example 1

The following example demonstrates how to display an image using Mayavi's mlab.imshow() function.

from mayavi import mlab
import numpy as np

# Create a random 10x10 array to use as an image
image = np.random.random((10, 10))

# Display the image using Mayavi
mlab.figure(fgcolor=(0, 0, 0), bgcolor=(1, 1, 1))
mlab.imshow(image)
mlab.show()

Output

Example 2

Here is another example that demonstrates how to use Mayavi and scikit-image (skimage) together to display a grayscale image using Mayavi's visualization capabilities.

from mayavi import mlab
from skimage import io

# Read an image as a grayscale image
image = io.imread("Images/logo-w.png", as_gray=True)

# Display the grayscale image using Mayavi
mlab.figure(fgcolor=(0, 0, 0), bgcolor=(1, 1, 1))
mlab.imshow(image)
mlab.show()

Output
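Since Mayavi is primarily a 3D engine, image data from scikit-image can also be rendered as a surface whose height encodes pixel intensity. The following is a minimal sketch of that idea using mlab.surf() and the sample camera image bundled with scikit-image.

from mayavi import mlab
from skimage import data
import numpy as np

# Use a sample grayscale image shipped with scikit-image
image = data.camera().astype(np.float64)

# Render pixel intensity as a 3D surface; warp_scale='auto' keeps the height readable
mlab.figure(bgcolor=(1, 1, 1), fgcolor=(0, 0, 0))
mlab.surf(image, warp_scale="auto", colormap="gray")
mlab.show()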