Machine Learning – Naive Bayes Algorithm

The Naive Bayes algorithm is a classification algorithm based on Bayes' theorem. The algorithm assumes that the features are independent of each other, which is why it is called "naive." It calculates the probability of a sample belonging to a particular class based on the probabilities of its features. For example, a phone may be considered smart if it has a touch screen, internet facility, a good camera, and so on. Even though these features may depend on each other, each one independently contributes to the probability that the phone is a smartphone.

In Bayesian classification, the main interest is to find the posterior probabilities, i.e., the probability of a label given some observed features, P(L | features). With the help of Bayes' theorem, we can express this in quantitative form as follows −

$$P\left(L \mid features\right) = \frac{P\left(L\right)\, P\left(features \mid L\right)}{P\left(features\right)}$$

Here, $P\left(L \mid features\right)$ is the posterior probability of the class, $P\left(L\right)$ is the prior probability of the class, $P\left(features \mid L\right)$ is the likelihood, which is the probability of the predictor given the class, and $P\left(features\right)$ is the prior probability of the predictor.

In the Naive Bayes algorithm, we use Bayes' theorem to calculate the probability of a sample belonging to a particular class. We calculate the probability of each feature of the sample given the class and multiply them to get the likelihood of the sample belonging to the class. We then multiply the likelihood by the prior probability of the class to get the posterior probability of the sample belonging to the class. We repeat this process for each class and choose the class with the highest probability as the class of the sample.

Types of Naive Bayes Algorithm

There are three types of Naive Bayes algorithm −

Gaussian Naive Bayes − This algorithm is used when the features are continuous variables that follow a normal distribution. It assumes that the probability distribution of each feature is Gaussian, which means it is a bell-shaped curve.

Multinomial Naive Bayes − This algorithm is used when the features are discrete variables. It is commonly used in text classification tasks where the features are the frequency of words in a document.

Bernoulli Naive Bayes − This algorithm is used when the features are binary variables. It is also commonly used in text classification tasks where the features indicate whether a word is present or not in a document. (A brief sketch of the Multinomial and Bernoulli variants is given after the Gaussian implementation below.)

Implementation in Python

Here we will implement the Gaussian Naive Bayes algorithm in Python. We will use the iris dataset, which is a popular dataset for classification tasks. It contains 150 samples of iris flowers, each with four features: sepal length, sepal width, petal length, and petal width. The flowers belong to three classes: setosa, versicolor, and virginica.
First, we will import the necessary libraries and load the dataset −

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# load the iris dataset
iris = load_iris()

# split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.35, random_state=0)

We then create an instance of the Gaussian Naive Bayes classifier and train it on the training set −

# Create a Gaussian Naive Bayes classifier
gnb = GaussianNB()

# Fit the classifier to the training data
gnb.fit(X_train, y_train)

We can now use the trained classifier to make predictions on the testing set −

# Make predictions on the testing data
y_pred = gnb.predict(X_test)

We can evaluate the performance of the classifier by calculating its accuracy −

# Calculate the accuracy of the classifier
accuracy = np.sum(y_pred == y_test) / len(y_test)
print("Accuracy:", accuracy)

Complete Implementation Example

Given below is the complete implementation example of the Naïve Bayes classification algorithm in Python using the iris dataset −

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# load the iris dataset
iris = load_iris()

# split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.35, random_state=0)

# Create a Gaussian Naive Bayes classifier
gnb = GaussianNB()

# Fit the classifier to the training data
gnb.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = gnb.predict(X_test)

# Calculate the accuracy of the classifier
accuracy = np.sum(y_pred == y_test) / len(y_test)
print("Accuracy:", accuracy)

Output

When you execute this program, it will produce the following output −

Accuracy: 0.9622641509433962
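As mentioned above, the Multinomial and Bernoulli variants are typically used for text data. The snippet below is a minimal sketch, not part of the original example: the four documents and their labels are purely illustrative toy data, used only to show how the two variants are instantiated in scikit-learn.

# A minimal sketch of Multinomial and Bernoulli Naive Bayes on a tiny,
# made-up text corpus (documents and labels are purely illustrative).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

docs = ["free offer win money", "meeting schedule for project",
        "win a free prize now", "project deadline and meeting notes"]
labels = [1, 0, 1, 0]   # 1 = spam-like, 0 = work-like (illustrative)

# Multinomial NB works with word-count features
vec = CountVectorizer()
counts = vec.fit_transform(docs)
mnb = MultinomialNB().fit(counts, labels)
print(mnb.predict(vec.transform(["free money prize now"])))

# Bernoulli NB works with binary presence/absence features
bin_vec = CountVectorizer(binary=True)
binary = bin_vec.fit_transform(docs)
bnb = BernoulliNB().fit(binary, labels)
print(bnb.predict(bin_vec.transform(["project meeting schedule"])))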
Machine Learning – Mean-Shift Clustering

The Mean-Shift clustering algorithm is a non-parametric clustering algorithm that works by iteratively shifting the mean of a data point towards the densest area of the data. The densest area of the data is determined by the kernel function, which is a function that assigns weights to the data points based on their distance from the mean. The kernel function used in Mean-Shift clustering is usually a Gaussian function.

The steps involved in the Mean-Shift clustering algorithm are as follows −

Initialize the mean of each data point to its own value.

For each data point, compute the mean shift vector, which is the vector that points towards the densest area of the data.

Update the mean of each data point by shifting it towards the densest area of the data.

Repeat steps 2 and 3 until convergence is reached.

The Mean-Shift clustering algorithm is a density-based clustering algorithm, which means that it identifies clusters based on the density of the data points rather than the distance between them. In other words, the algorithm identifies clusters based on the areas where the density of the data points is highest.

Implementation of Mean-Shift Clustering in Python

The Mean-Shift clustering algorithm can be implemented in the Python programming language using the scikit-learn library. The scikit-learn library is a popular machine learning library in Python that provides various tools for data analysis and machine learning. The following steps are involved in implementing the Mean-Shift clustering algorithm in Python using the scikit-learn library −

Step 1 − Import the necessary libraries

The numpy library is used for scientific computing in Python, while the matplotlib library is used for data visualization. The sklearn.cluster module contains the MeanShift class, which is used for implementing the Mean-Shift clustering algorithm in Python. The estimate_bandwidth function is used to estimate the bandwidth of the kernel function, which is an important parameter in the Mean-Shift clustering algorithm.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import MeanShift, estimate_bandwidth

Step 2 − Generate the data

In this step, we generate a random dataset with 500 data points and 2 features. We use the numpy.random.randn function to generate the data.

# Generate the data
X = np.random.randn(500, 2)

Step 3 − Estimate the bandwidth of the kernel function

In this step, we estimate the bandwidth of the kernel function using the estimate_bandwidth function. The bandwidth is an important parameter in the Mean-Shift clustering algorithm, which determines the width of the kernel function.

# Estimate the bandwidth
bandwidth = estimate_bandwidth(X, quantile=0.1, n_samples=100)

Step 4 − Initialize the Mean-Shift clustering algorithm

In this step, we initialize the Mean-Shift clustering algorithm using the MeanShift class. We pass the bandwidth parameter to the class to set the width of the kernel function.

# Initialize the Mean-Shift algorithm
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)

Step 5 − Train the model

In this step, we train the Mean-Shift clustering algorithm on the dataset using the fit method of the MeanShift class.
# Train the model
ms.fit(X)

Step 6 − Visualize the results

# Visualize the results
labels = ms.labels_
cluster_centers = ms.cluster_centers_
n_clusters_ = len(np.unique(labels))
print("Number of estimated clusters:", n_clusters_)

# Plot the data points and the centroids
plt.figure(figsize=(7.5, 3.5))
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap="viridis")
plt.scatter(cluster_centers[:, 0], cluster_centers[:, 1], marker="*", s=300, c="r")
plt.show()

In this step, we visualize the results of the Mean-Shift clustering algorithm. We extract the cluster labels and the cluster centers from the trained model. We then print the number of estimated clusters. Finally, we plot the data points and the centroids using the matplotlib library.

Example

Here is the complete implementation example of the Mean-Shift clustering algorithm in Python −

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import MeanShift, estimate_bandwidth

# Generate the data
X = np.random.randn(500, 2)

# Estimate the bandwidth
bandwidth = estimate_bandwidth(X, quantile=0.1, n_samples=100)

# Initialize the Mean-Shift algorithm
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)

# Train the model
ms.fit(X)

# Visualize the results
labels = ms.labels_
cluster_centers = ms.cluster_centers_
n_clusters_ = len(np.unique(labels))
print("Number of estimated clusters:", n_clusters_)

# Plot the data points and the centroids
plt.figure(figsize=(7.5, 3.5))
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap="summer")
plt.scatter(cluster_centers[:, 0], cluster_centers[:, 1], marker="*", s=200, c="r")
plt.show()

Output

When you execute the program, it will produce the following plot as the output −

Applications of Mean-Shift Clustering

The Mean-Shift clustering algorithm has several applications in various fields. Some of the applications of Mean-Shift clustering are as follows −

Computer vision − Mean-Shift clustering is widely used in computer vision for object tracking, image segmentation, and feature extraction.

Image processing − Mean-Shift clustering is used for image segmentation, which is the process of dividing an image into multiple segments based on the similarity of the pixels.

Anomaly detection − Mean-Shift clustering can be used for detecting anomalies in data by identifying the areas with low density.

Customer segmentation − Mean-Shift clustering can be used for customer segmentation in marketing by identifying groups of customers with similar behavior and preferences.

Social network analysis − Mean-Shift clustering can be used for clustering users in social networks based on their interests and interactions.
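One practical point not covered in the example above: a fitted MeanShift model can also assign new, unseen points to the discovered clusters. The snippet below is a minimal sketch that reuses the ms model trained in the example; the two query points are illustrative values only.

# A minimal sketch of assigning new points to the clusters found above,
# reusing the fitted `ms` model; the query points are illustrative values.
import numpy as np

new_points = np.array([[0.0, 0.0], [2.5, -1.5]])

# Each new point is assigned to the cluster whose center is nearest
print("Assigned clusters:", ms.predict(new_points))
print("Cluster centers:\n", ms.cluster_centers_)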
Machine Learning – Distribution-Based Clustering

Distribution-based clustering algorithms, also known as probabilistic clustering algorithms, are a class of machine learning algorithms that assume that the data points are generated from a mixture of probability distributions. These algorithms aim to identify the underlying probability distributions that generate the data, and use this information to cluster the data into groups with similar properties.

One common distribution-based clustering algorithm is the Gaussian Mixture Model (GMM). GMM assumes that the data points are generated from a mixture of Gaussian distributions, and aims to estimate the parameters of these distributions, including the means and covariances of each distribution. Let's see below what GMM is in ML and how we can implement it in the Python programming language.

Gaussian Mixture Model

Gaussian Mixture Models (GMM) is a popular clustering algorithm used in machine learning that assumes that the data is generated from a mixture of Gaussian distributions. In other words, GMM tries to fit a set of Gaussian distributions to the data, where each Gaussian distribution represents a cluster in the data.

GMM has several advantages over other clustering algorithms, such as the ability to handle overlapping clusters, model the covariance structure of the data, and provide probabilistic cluster assignments for each data point. This makes GMM a popular choice in many applications, such as image segmentation, pattern recognition, and anomaly detection.

Implementation in Python

In Python, the Scikit-learn library provides the GaussianMixture class for implementing the GMM algorithm. The class takes several parameters, including the number of components (i.e., the number of clusters to identify), the covariance type, and the initialization method. Here is an example of how to implement GMM using the Scikit-learn library in Python −

Example

from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# generate a dataset
X, _ = make_blobs(n_samples=200, centers=4, random_state=0)

# create an instance of the GaussianMixture class
gmm = GaussianMixture(n_components=4)

# fit the model to the dataset
gmm.fit(X)

# predict the cluster labels for the data points
labels = gmm.predict(X)

# print the cluster labels
print("Cluster labels:", labels)

plt.figure(figsize=(7.5, 3.5))
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap="viridis")
plt.show()

In this example, we first generate a synthetic dataset using the make_blobs() function from Scikit-learn. We then create an instance of the GaussianMixture class with 4 components and fit the model to the dataset using the fit() method. Finally, we predict the cluster labels for the data points using the predict() method and print the resulting labels.

Output

When you execute this program, it will produce the following plot as the output −

In addition, you will get the following output on the terminal −

Cluster labels: [2 0 1 3 2 1 0 1 1 1 1 2 0 0 2 1 3 3 3 1 3 1 2 0 2 2 3 2 2 1 3 1 0 2 0 1 0 1 1 3 3 3 3 1 2 0 1 3 3 1 3 0 0 3 2 3 0 2 3 2 3 1 2 1 3 1 2 3 0 0 2 2 1 1 0 3 0 0 2 2 3 1 2 2 0 1 1 2 0 0 3 3 3 1 1 2 0 3 2 1 3 2 2 3 3 0 1 2 2 1 3 0 0 2 2 1 2 0 3 1 3 0 1 2 1 0 1 0 2 1 0 2 1 3 3 0 3 3 2 3 2 0 2 2 2 2 1 2 0 3 3 3 1 0 2 1 3 0 3 2 3 2 2 0 0 3 1 2 2 0 1 1 0 3 3 3 1 3 0 0 1 2 1 2 1 0 0 3 1 3 2 2 1 3 0 0 0 1 3 1]

The covariance type parameter in GMM controls the type of covariance matrix to use for the Gaussian distributions.
The available options include "full" (full covariance matrix), "tied" (tied covariance matrix for all clusters), "diag" (diagonal covariance matrix), and "spherical" (a single variance parameter for all dimensions). The initialization method parameter controls the method used to initialize the parameters of the Gaussian distributions. (A short sketch of these parameters is given at the end of this section.)

Advantages of Gaussian Mixture Models

Following are the advantages of using Gaussian Mixture Models −

Gaussian Mixture Models (GMM) can model arbitrary distributions of data, making it a flexible clustering algorithm.

It can handle datasets with missing or incomplete data.

It provides a probabilistic framework for clustering, which can provide more information about the uncertainty of the clustering results.

It can be used for density estimation and generation of new data points that follow the same distribution as the original data.

It can be used for semi-supervised learning, where some data points have known labels and are used to train the model.

Disadvantages of Gaussian Mixture Models

Following are some of the disadvantages of using Gaussian Mixture Models −

GMM can be sensitive to the choice of initial parameters, such as the number of clusters and the initial values for the means and covariances of the clusters.

It can be computationally expensive for high-dimensional datasets, as it involves computing the inverse of the covariance matrix, which can be expensive for large matrices.

It assumes that the data is generated from a mixture of Gaussian distributions, which may not be true for all datasets.

It may be prone to overfitting, especially when the number of parameters is large or the dataset is small.

It can be difficult to interpret the resulting clusters, especially when the covariance matrices are complex.
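As promised above, here is a minimal sketch of how the covariance type and the initialization-related parameters can be passed to GaussianMixture. The particular values used (a "diag" covariance, k-means initialization, five restarts) are illustrative choices, not recommendations from the original text.

# A minimal sketch of the covariance type and initialization options of
# GaussianMixture; the parameter values below are illustrative.
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=4, random_state=0)

# "diag" restricts each component to a diagonal covariance matrix;
# init_params="kmeans" initializes the means with k-means;
# n_init=5 repeats the initialization and keeps the best fit.
gmm = GaussianMixture(n_components=4, covariance_type="diag",
                      init_params="kmeans", n_init=5, random_state=0)
labels = gmm.fit_predict(X)

# The fitted mixture also exposes per-sample membership probabilities
print(gmm.predict_proba(X[:5]).round(3))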
Machine Learning – Decision Trees Algorithm

The Decision Tree algorithm is a hierarchical tree-based algorithm that is used to classify or predict outcomes based on a set of rules. It works by splitting the data into subsets based on the values of the input features. The algorithm recursively splits the data until it reaches a point where the data in each subset belongs to the same class or has the same value for the target variable. The resulting tree is a set of decision rules that can be used to make predictions or classify new data.

The Decision Tree algorithm works by selecting the best feature to split the data at each node. The best feature is the one that provides the most information gain or the most reduction in entropy. Information gain is a measure of the amount of information gained by splitting the data at a particular feature, while entropy is a measure of the randomness or disorder in the data. The algorithm uses these measures to determine the best feature to split the data at each node. (A small worked sketch of entropy and information gain is given after the implementation below.)

An example of a binary tree for predicting whether a person is fit or unfit, based on information such as age, eating habits, and exercise habits, is given below −

In the above decision tree, the questions are decision nodes and the final outcomes are leaves.

Types of Decision Tree Algorithm

There are two main types of Decision Tree algorithm −

Classification Tree − A classification tree is used to classify data into different classes or categories. It works by splitting the data into subsets based on the values of the input features and assigning each subset to a different class.

Regression Tree − A regression tree is used to predict numerical values or continuous variables. It works by splitting the data into subsets based on the values of the input features and assigning each subset a numerical value.

Implementation in Python

Let's implement the Decision Tree algorithm in Python using a popular dataset for classification tasks named the Iris dataset. It contains 150 samples of iris flowers, each with four features: sepal length, sepal width, petal length, and petal width. The flowers belong to three classes: setosa, versicolor, and virginica.
First, we will import the necessary libraries and load the dataset −

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the iris dataset
iris = load_iris()

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=0)

We then create an instance of the Decision Tree classifier and train it on the training set −

# Create a Decision Tree classifier
dtc = DecisionTreeClassifier()

# Fit the classifier to the training data
dtc.fit(X_train, y_train)

We can now use the trained classifier to make predictions on the testing set −

# Make predictions on the testing data
y_pred = dtc.predict(X_test)

We can evaluate the performance of the classifier by calculating its accuracy −

# Calculate the accuracy of the classifier
accuracy = np.sum(y_pred == y_test) / len(y_test)
print("Accuracy:", accuracy)

We can visualize the Decision Tree using the Matplotlib library −

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Visualize the Decision Tree using Matplotlib
plt.figure(figsize=(20, 10))
plot_tree(dtc, filled=True, feature_names=iris.feature_names, class_names=iris.target_names)
plt.show()

The plot_tree function from the sklearn.tree module can be used to plot the Decision Tree. We can pass in the trained Decision Tree classifier, the filled argument to fill the nodes with color, the feature_names argument to label the features, and the class_names argument to label the target classes. We also specify the figsize argument to set the size of the figure and call the show function to display the plot.

Complete Implementation Example

Given below is the complete implementation example of the Decision Tree classification algorithm in Python using the iris dataset −

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the iris dataset
iris = load_iris()

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=0)

# Create a Decision Tree classifier
dtc = DecisionTreeClassifier()

# Fit the classifier to the training data
dtc.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = dtc.predict(X_test)

# Calculate the accuracy of the classifier
accuracy = np.sum(y_pred == y_test) / len(y_test)
print("Accuracy:", accuracy)

# Visualize the Decision Tree using Matplotlib
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plt.figure(figsize=(20, 10))
plot_tree(dtc, filled=True, feature_names=iris.feature_names, class_names=iris.target_names)
plt.show()

Output

This will create a plot of the Decision Tree that looks like this −

Accuracy: 0.9777777777777777

As you can see, the plot shows the structure of the Decision Tree, with each node representing a decision based on the value of a feature, and each leaf node representing a class or numerical value. The color of each node indicates the majority class or value of the samples in that node, and the numbers at the bottom indicate the number of samples that reach that node.
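As mentioned earlier, splits in a classification tree are chosen by measures such as entropy and information gain. The following is a minimal NumPy sketch of those two measures on a small illustrative label array; note that it is not how scikit-learn computes its splits internally (DecisionTreeClassifier uses Gini impurity by default), and the split mask below is purely hypothetical.

# A minimal sketch of entropy and information gain on an illustrative
# binary label array; the split mask is purely hypothetical.
import numpy as np

def entropy(labels):
    # H = -sum(p * log2(p)) over the class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, split_mask):
    # Reduction in entropy achieved by splitting into two groups
    left, right = labels[split_mask], labels[~split_mask]
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted

y = np.array([0, 0, 0, 1, 1, 1, 1, 1])                      # illustrative class labels
split = np.array([True, True, True, True, False, False, False, False])

print("Entropy of y:", entropy(y))                           # about 0.954 bits
print("Information gain of the split:", information_gain(y, split))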
Machine Learning – Dimensionality Reduction

Dimensionality reduction in machine learning is the process of reducing the number of features or variables in a dataset while retaining as much of the original information as possible. In other words, it is a way of simplifying the data by reducing its complexity.

The need for dimensionality reduction arises when a dataset has a large number of features or variables. Having too many features can lead to overfitting and increase the complexity of the model. It can also make it difficult to visualize the data and can slow down the training process.

There are two main approaches to dimensionality reduction −

Feature Selection

This involves selecting a subset of the original features based on certain criteria, such as their importance or relevance to the target variable. The following are some commonly used feature selection techniques −

Filter Methods
Wrapper Methods
Embedded Methods

Feature Extraction

Feature extraction is a process of transforming raw data into a set of meaningful features that can be used for machine learning models. It involves reducing the dimensionality of the input data by selecting, combining or transforming features to create a new set of features that are more useful for the machine learning model.

Dimensionality reduction can improve the accuracy and speed of machine learning models, reduce overfitting, and simplify data visualization. A short example of both approaches is sketched below.
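The snippet below is a minimal sketch of the two approaches described above, applied to the iris dataset: a filter-style feature-selection method (SelectKBest with an ANOVA F-test) and a feature-extraction method (PCA). The choices k=2 and n_components=2 are illustrative, not recommendations.

# A minimal sketch of feature selection vs. feature extraction on the iris
# dataset; k=2 and n_components=2 are illustrative choices.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

iris = load_iris()
X, y = iris.data, iris.target          # 150 samples, 4 features

# Feature selection (filter method): keep the 2 features most related to y
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
print("Selected features:", [iris.feature_names[i] for i in selector.get_support(indices=True)])

# Feature extraction: combine all 4 features into 2 new components
pca = PCA(n_components=2)
X_extracted = pca.fit_transform(X)
print("Explained variance ratio:", pca.explained_variance_ratio_)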
Machine Learning – OPTICS Clustering

OPTICS is like DBSCAN (Density-Based Spatial Clustering of Applications with Noise), another popular density-based clustering algorithm. However, OPTICS has several advantages over DBSCAN, including the ability to identify clusters of varying densities, the ability to handle noise, and the ability to produce a hierarchical clustering structure.

Implementation of OPTICS in Python

To implement OPTICS clustering in Python, we can use the scikit-learn library. The scikit-learn library provides a class called OPTICS that implements the OPTICS algorithm. Here's an example of how to use the OPTICS class in scikit-learn to cluster a dataset −

Example

from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# Generate sample data
X, y = make_blobs(n_samples=2000, centers=4, cluster_std=0.60, random_state=0)

# Cluster the data using OPTICS
optics = OPTICS(min_samples=50, xi=.05)
optics.fit(X)

# Plot the results
labels = optics.labels_
plt.figure(figsize=(7.5, 3.5))
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap="turbo")
plt.show()

In this example, we first generate a sample dataset using the make_blobs function from scikit-learn. We then instantiate an OPTICS object with the min_samples parameter set to 50 and the xi parameter set to 0.05. The min_samples parameter specifies the minimum number of samples required for a cluster to be formed, and the xi parameter controls the steepness of the cluster hierarchy. We then fit the OPTICS object to the dataset using the fit method. Finally, we plot the results using a scatter plot, where each data point is colored according to its cluster label.

Output

When you execute this program, it will produce the following plot as the output −

Advantages of OPTICS Clustering

Following are the advantages of using OPTICS clustering −

Ability to handle clusters of varying densities − OPTICS can handle clusters that have varying densities, unlike some other clustering algorithms that require clusters to have uniform densities.

Ability to handle noise − OPTICS can identify noise data points that do not belong to any cluster, which is useful for removing outliers from the dataset.

Hierarchical clustering structure − OPTICS produces a hierarchical clustering structure that can be useful for analyzing the dataset at different levels of granularity.

Disadvantages of OPTICS Clustering

Following are some of the disadvantages of using OPTICS clustering −

Sensitivity to parameters − OPTICS requires careful tuning of its parameters, such as the min_samples and xi parameters, which can be challenging.

Computational complexity − OPTICS can be computationally expensive for large datasets, especially when using a high min_samples value.
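The hierarchical structure mentioned above is usually inspected with a reachability plot. Scikit-learn's fitted OPTICS object exposes the reachability_ and ordering_ attributes for this purpose; the snippet below is a minimal sketch that reuses the optics object and data X fitted in the example above.

# A minimal sketch of an OPTICS reachability plot, reusing the fitted
# `optics` object and the data X from the example above.
import numpy as np
import matplotlib.pyplot as plt

# Reachability distances, taken in the cluster ordering computed by OPTICS
reachability = optics.reachability_[optics.ordering_]

plt.figure(figsize=(7.5, 3.5))
plt.plot(np.arange(len(reachability)), reachability, lw=0.8)
plt.xlabel("Points (cluster order)")
plt.ylabel("Reachability distance")
plt.title("OPTICS reachability plot: valleys correspond to clusters")
plt.show()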
Machine Learning – Hierarchical Clustering

Hierarchical clustering is another unsupervised learning algorithm that is used to group together the unlabeled data points having similar characteristics. Hierarchical clustering algorithms fall into the following two categories −

Agglomerative hierarchical algorithms − In agglomerative hierarchical algorithms, each data point is treated as a single cluster and then pairs of clusters are successively merged or agglomerated (bottom-up approach). The hierarchy of the clusters is represented as a dendrogram or tree structure.

Divisive hierarchical algorithms − On the other hand, in divisive hierarchical algorithms, all the data points are treated as one big cluster and the process of clustering involves dividing (top-down approach) the one big cluster into various small clusters.

Steps to Perform Agglomerative Hierarchical Clustering

We are going to explain the most used and important hierarchical clustering, i.e., agglomerative. The steps to perform it are as follows −

Step 1 − Treat each data point as a single cluster. Hence, we will have, say, K clusters at the start. The number of data points will also be K at the start.

Step 2 − Now, in this step we need to form a bigger cluster by joining the two closest data points. This will result in a total of K-1 clusters.

Step 3 − Now, to form more clusters we need to join the two closest clusters. This will result in a total of K-2 clusters.

Step 4 − Now, to form one big cluster, repeat the above steps until K becomes 1, i.e., until a single cluster containing all the data points remains.

Step 5 − At last, after making one single big cluster, dendrograms will be used to divide it into multiple clusters depending upon the problem.

Role of Dendrograms in Agglomerative Hierarchical Clustering

As we discussed in the last step, the role of the dendrogram starts once the big cluster is formed. The dendrogram will be used to split the clusters into multiple clusters of related data points depending upon our problem. It can be understood with the help of the following example −

Example 1

To understand, let's start with importing the required libraries as follows −

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

Next, we will be plotting the datapoints we have taken for this example −

X = np.array([[7,8],[12,20],[17,19],[26,15],[32,37],[87,75],[73,85],[62,80],[73,60],[87,96]])
labels = range(1, 11)
plt.figure(figsize=(10, 7))
plt.subplots_adjust(bottom=0.1)
plt.scatter(X[:,0], X[:,1], label='True Position')
for label, x, y in zip(labels, X[:, 0], X[:, 1]):
   plt.annotate(label, xy=(x, y), xytext=(-3, 3), textcoords='offset points', ha='right', va='bottom')
plt.show()

Output

When you execute this code, it will produce the following plot as the output −

From the above diagram, it is very easy to see that we have two clusters in our datapoints, but in real-world data there can be thousands of clusters. Next, we will be plotting the dendrograms of our datapoints by using the Scipy library −

from scipy.cluster.hierarchy import dendrogram, linkage
from matplotlib import pyplot as plt

linked = linkage(X, 'single')
labelList = range(1, 11)
plt.figure(figsize=(10, 7))
dendrogram(linked, orientation='top', labels=labelList, distance_sort='descending', show_leaf_counts=True)
plt.show()

It will produce the following plot −

Now, once the big cluster is formed, the longest vertical distance (the tallest vertical line that is not crossed by any other line) is selected, and a horizontal line is drawn through it as shown in the following diagram. As this horizontal line crosses the blue line at two points, the number of clusters would be two.
Next, we need to import the class for clustering and call its fit_predict method to predict the clusters. We are importing the AgglomerativeClustering class from the sklearn.cluster module −

from sklearn.cluster import AgglomerativeClustering

cluster = AgglomerativeClustering(n_clusters=2, affinity='euclidean', linkage='ward')
cluster.fit_predict(X)

Next, plot the clusters with the help of the following code −

plt.scatter(X[:,0], X[:,1], c=cluster.labels_, cmap='rainbow')

The following diagram shows the two clusters from our datapoints.

Example 2

As we understood the concept of dendrograms from the simple example above, let's move to another example in which we are creating clusters of the data points in the Pima Indian Diabetes Dataset by using hierarchical clustering −

import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
import numpy as np
from pandas import read_csv

path = r"C:\pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=headernames)
array = data.values
X = array[:,0:8]
Y = array[:,8]
patient_data = data.iloc[:, 3:5].values

import scipy.cluster.hierarchy as shc

plt.figure(figsize=(10, 7))
plt.title("Patient Dendrograms")
dend = shc.dendrogram(shc.linkage(data, method='ward'))

from sklearn.cluster import AgglomerativeClustering

cluster = AgglomerativeClustering(n_clusters=4, affinity='euclidean', linkage='ward')
cluster.fit_predict(patient_data)

plt.figure(figsize=(7.2, 5.5))
plt.scatter(patient_data[:,0], patient_data[:,1], c=cluster.labels_, cmap='rainbow')

Output

When you run this code, it will produce the following two plots as the output −
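In Example 1 we read the number of clusters off the dendrogram by eye. SciPy can also cut the linkage tree programmatically with fcluster; the following is a minimal sketch on the same ten points used in Example 1 (the distance threshold of 30 is an illustrative value chosen for that small dataset).

# A minimal sketch of cutting a dendrogram programmatically with SciPy's
# fcluster, using the small two-cluster dataset from Example 1.
# The distance threshold t=30 is an illustrative value for that data.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X1 = np.array([[7,8],[12,20],[17,19],[26,15],[32,37],
               [87,75],[73,85],[62,80],[73,60],[87,96]])
linked = linkage(X1, 'single')

# Cut the tree at a distance of 30: every merge above this height is undone
print(fcluster(linked, t=30, criterion='distance'))

# Alternatively, ask directly for a fixed number of clusters
print(fcluster(linked, t=2, criterion='maxclust'))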
Machine Learning – Simple Linear Regression

Simple linear regression is a type of regression analysis in which a single independent variable (also known as a predictor variable) is used to predict the dependent variable. In other words, it models the linear relationship between the dependent variable and a single independent variable.

Python Implementation

Given below is an example that shows how to implement simple linear regression using the scikit-learn Diabetes dataset in Python. We will also plot the regression line.

Data Preparation

First, we need to import the Diabetes dataset from scikit-learn and split it into training and testing sets. We will use 80% of the data for training the model and the remaining 20% for testing.

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

# Load the Diabetes dataset
diabetes = load_diabetes()

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(diabetes.data[:, 2], diabetes.target, test_size=0.2, random_state=0)

# Reshape the input data
X_train = X_train.reshape(-1, 1)
X_test = X_test.reshape(-1, 1)

Here, we are using the third feature (column) of the dataset, which represents the body mass index (BMI), as our independent variable (predictor variable) and the target variable as our dependent variable (response variable).

Model Training

We will use scikit-learn's LinearRegression class to train a simple linear regression model on the training data. The code for this is as follows −

from sklearn.linear_model import LinearRegression

# Create a linear regression object
lr_model = LinearRegression()

# Fit the model on the training data
lr_model.fit(X_train, y_train)

Here, X_train represents the input feature (BMI) of the training data and y_train represents the output variable (target variable).

Model Testing

Once the model is trained, we can use it to make predictions on the test data. The code for this is as follows −

# Make predictions on the test data
y_pred = lr_model.predict(X_test)

Here, X_test represents the input feature of the test data and y_pred represents the predicted output variable (target variable).

Model Evaluation

We need to evaluate the performance of the model to determine its accuracy. We will use the mean squared error (MSE) and the coefficient of determination (R^2) as evaluation metrics. The code for this is as follows −

from sklearn.metrics import mean_squared_error, r2_score

# Calculate the mean squared error
mse = mean_squared_error(y_test, y_pred)

# Calculate the coefficient of determination
r2 = r2_score(y_test, y_pred)

print('Mean Squared Error:', mse)
print('Coefficient of Determination:', r2)

Here, y_test represents the actual output variable of the test data.

Plotting the Regression Line

We can also visualize the regression line to see how well it fits the data. The code for this is as follows −

import matplotlib.pyplot as plt

# Plot the training data
plt.scatter(X_train, y_train, color='gray')

# Plot the regression line
plt.plot(X_train, lr_model.predict(X_train), color='red', linewidth=2)

# Add axis labels
plt.xlabel('Body Mass Index')
plt.ylabel('Disease Progression')

# Show the plot
plt.show()

Here, we are using the scatter() function from the matplotlib library to plot the training data points and the plot() function to plot the regression line. The xlabel() and ylabel() functions are used to label the x-axis and y-axis of the plot, respectively. Finally, we use the show() function to display the plot.
Complete Implementation Example

The complete code for implementing simple linear regression in Python is as follows −

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

# Load the Diabetes dataset
diabetes = load_diabetes()

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(diabetes.data[:, 2], diabetes.target, test_size=0.2, random_state=0)

# Reshape the input data
X_train = X_train.reshape(-1, 1)
X_test = X_test.reshape(-1, 1)

# Create a linear regression object
lr_model = LinearRegression()

# Fit the model on the training data
lr_model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = lr_model.predict(X_test)

# Calculate the mean squared error
mse = mean_squared_error(y_test, y_pred)

# Calculate the coefficient of determination
r2 = r2_score(y_test, y_pred)

print('Mean Squared Error:', mse)
print('Coefficient of Determination:', r2)

# Plot the training data
plt.figure(figsize=(7.5, 3.5))
plt.scatter(X_train, y_train, color='gray')

# Plot the regression line
plt.plot(X_train, lr_model.predict(X_train), color='red', linewidth=2)

# Add axis labels
plt.xlabel('Body Mass Index')
plt.ylabel('Disease Progression')

# Show the plot
plt.show()

Output

On executing this code, you will get the following plot as the output and it will also print the Mean Squared Error and the Coefficient of Determination on the terminal −

Mean Squared Error: 4150.680189329983
Coefficient of Determination: 0.19057346847560164
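For reference, the quantities reported above follow the standard definitions, which is what mean_squared_error and r2_score compute. The fitted model is a straight line,

$$\hat{y} = b_0 + b_1 x$$

and the two evaluation metrics for $n$ test samples with true values $y_i$, predictions $\hat{y}_i$, and mean $\bar{y}$ are

$$MSE = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2, \qquad R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$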
Machine Learning – K-Nearest Neighbors (KNN)

KNN is a supervised learning algorithm that can be used for both classification and regression problems. The main idea behind KNN is to find the k nearest data points to a given test data point and use these nearest neighbors to make a prediction. The value of k is a hyperparameter that needs to be tuned, and it represents the number of neighbors to consider.

For classification problems, the KNN algorithm assigns the test data point to the class that appears most frequently among the k nearest neighbors. In other words, the class with the highest number of neighbors is the predicted class. For regression problems, the KNN algorithm assigns the test data point the average of the k nearest neighbors' values.

The distance metric used to measure the similarity between two data points is an essential factor that affects the KNN algorithm's performance. The most commonly used distance metrics are Euclidean distance, Manhattan distance, and Minkowski distance. (A short sketch of switching the distance metric is given after the implementation below.)

Working of KNN Algorithm

The KNN algorithm can be summarized in the following steps −

Load the data − The first step is to load the dataset into memory. This can be done using various libraries such as pandas or numpy.

Split the data − The next step is to split the data into training and test sets. The training set is used to train the KNN algorithm, while the test set is used to evaluate its performance.

Normalize the data − Before training the KNN algorithm, it is essential to normalize the data to ensure that each feature contributes equally to the distance metric calculation.

Calculate distances − Once the data is normalized, the KNN algorithm calculates the distances between the test data point and each data point in the training set.

Select k-nearest neighbors − The KNN algorithm selects the k nearest neighbors based on the distances calculated in the previous step.

Make a prediction − For classification problems, the KNN algorithm assigns the test data point to the class that appears most frequently among the k nearest neighbors. For regression problems, the KNN algorithm assigns the test data point the average of the k nearest neighbors' values.

Evaluate performance − Finally, the KNN algorithm's performance is evaluated using various metrics such as accuracy, precision, recall, and F1-score.

Implementation in Python

Now that we have discussed the KNN algorithm's theory, let's implement it in Python using scikit-learn. Scikit-learn is a popular library for Machine Learning in Python and provides various algorithms for classification and regression problems.

We will use the Iris dataset, which is a popular dataset in Machine Learning and contains information about three different species of Iris flowers. The dataset has four features, including the sepal length, sepal width, petal length, and petal width, and a target variable, which is the species of the flower. To implement KNN in Python, we need to follow the steps mentioned earlier.
Here's the Python code for implementing KNN on the Iris dataset −

Example

# import libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# load the Iris dataset
iris = load_iris()

# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.35, random_state=42)

# normalize the data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# initialize the KNN algorithm
knn = KNeighborsClassifier(n_neighbors=5)

# train the KNN algorithm
knn.fit(X_train, y_train)

# make predictions on the test set
y_pred = knn.predict(X_test)

# evaluate the performance of the KNN algorithm
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(accuracy*100))

Output

When you execute this code, it will produce the following output −

Accuracy: 98.11%
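As mentioned earlier, the choice of distance metric affects KNN's performance. The snippet below is a minimal sketch showing how the metric can be changed on KNeighborsClassifier; it reuses the scaled X_train, X_test, y_train, y_test variables from the example above, and the metrics tried are illustrative.

# A minimal sketch of trying different distance metrics with KNN, reusing the
# scaled X_train, X_test, y_train, y_test from the example above.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Euclidean is the default (Minkowski with p=2); Manhattan is Minkowski with p=1
for metric, p in [("euclidean", 2), ("manhattan", 1), ("minkowski", 3)]:
    knn = KNeighborsClassifier(n_neighbors=5, metric=metric, p=p)
    knn.fit(X_train, y_train)
    acc = accuracy_score(y_test, knn.predict(X_test))
    print("{} (p={}): accuracy = {:.2f}%".format(metric, p, acc * 100))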
Machine Learning – Random Forest

Random Forest is a machine learning algorithm that uses an ensemble of decision trees to make predictions. The algorithm was first introduced by Leo Breiman in 2001. The key idea behind the algorithm is to create a large number of decision trees, each of which is trained on a different subset of the data. The predictions of these individual trees are then combined to produce a final prediction.

Working of Random Forest Algorithm

We can understand the working of the Random Forest algorithm with the help of the following steps −

Step 1 − First, start with the selection of random samples from a given dataset.

Step 2 − Next, this algorithm will construct a decision tree for every sample. Then it will get the prediction result from every decision tree.

Step 3 − In this step, voting will be performed for every predicted result.

Step 4 − At last, select the most voted prediction result as the final prediction result.

The following diagram illustrates how the Random Forest algorithm works −

Random Forest is a flexible algorithm that can be used for both classification and regression tasks. In classification tasks, the algorithm uses the mode of the predictions of the individual trees to make the final prediction. In regression tasks, the algorithm uses the mean of the predictions of the individual trees.

Advantages of Random Forest Algorithm

The Random Forest algorithm has several advantages over other machine learning algorithms. Some of the key advantages are −

Robustness to Overfitting − The Random Forest algorithm is known for its robustness to overfitting. This is because the algorithm uses an ensemble of decision trees, which helps to reduce the impact of outliers and noise in the data.

High Accuracy − The Random Forest algorithm is known for its high accuracy. This is because the algorithm combines the predictions of multiple decision trees, which helps to reduce the impact of individual decision trees that may be biased or inaccurate.

Handles Missing Data − The Random Forest algorithm can handle missing data without the need for imputation. This is because the algorithm only considers the features that are available for each data point and does not require all features to be present for all data points.

Non-Linear Relationships − The Random Forest algorithm can handle non-linear relationships between the features and the target variable. This is because the algorithm uses decision trees, which can model non-linear relationships.

Feature Importance − The Random Forest algorithm can provide information about the importance of each feature in the model. This information can be used to identify the most important features in the data and can be used for feature selection and feature engineering. (A short example of retrieving feature importances is given after the complete implementation below.)

Implementation of Random Forest Algorithm in Python

Let's take a look at the implementation of the Random Forest algorithm in Python. We will be using the scikit-learn library to implement the algorithm. The scikit-learn library is a popular machine learning library that provides a wide range of algorithms and tools for machine learning.

Step 1 − Importing the Libraries

We will begin by importing the necessary libraries. We will be using the pandas library for data manipulation, and the scikit-learn library for implementing the Random Forest algorithm.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

Step 2 − Loading the Data

Next, we will load the data into a pandas dataframe.
For this tutorial, we will be using the famous Iris dataset, which is a classic dataset for classification tasks.

# Loading the iris dataset
iris = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header=None)
iris.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

Step 3 − Data Preprocessing

Before we can use the data to train our model, we need to preprocess it. This involves separating the features and the target variable and splitting the data into training and testing sets.

# Separating the features and target variable
X = iris.iloc[:, :-1]
y = iris.iloc[:, -1]

# Splitting the data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.35, random_state=42)

Step 4 − Training the Model

Next, we will train our Random Forest classifier on the training data.

# Creating the Random Forest classifier object
rfc = RandomForestClassifier(n_estimators=100)

# Training the model on the training data
rfc.fit(X_train, y_train)

Step 5 − Making Predictions

Once we have trained our model, we can use it to make predictions on the test data.

# Making predictions on the test data
y_pred = rfc.predict(X_test)

Step 6 − Evaluating the Model

Finally, we will evaluate the performance of our model using various metrics such as accuracy, precision, recall, and F1-score.

# Importing the metrics library
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Calculating the accuracy, precision, recall, and F1-score
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)

Complete Implementation Example

Below is the complete implementation example of the Random Forest algorithm in Python using the iris dataset −

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Loading the iris dataset
iris = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header=None)
iris.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

# Separating the features and target variable
X = iris.iloc[:, :-1]
y = iris.iloc[:, -1]

# Splitting the data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.35, random_state=42)

# Creating the Random Forest classifier object
rfc = RandomForestClassifier(n_estimators=100)

# Training the model on the training data
rfc.fit(X_train, y_train)

# Making predictions on the test data
y_pred = rfc.predict(X_test)

# Importing the metrics library
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Calculating the accuracy, precision, recall, and F1-score
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)

Output

This will give us the performance metrics of our Random Forest classifier as follows −

Accuracy: 0.9811320754716981
Precision: 0.9821802935010483
Recall: 0.9811320754716981
F1-score: 0.9811157396063056
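As noted in the advantages above, a fitted Random Forest also reports how much each feature contributed to its decisions. The snippet below is a minimal sketch that reuses the fitted rfc classifier and the feature DataFrame X from the example above; for the iris data, the petal measurements usually come out as the most important features.

# A minimal sketch of inspecting feature importances, reusing the fitted
# `rfc` classifier and the feature DataFrame X from the example above.
import pandas as pd

importances = pd.Series(rfc.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))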