Machine Learning – Performance Metrics

Performance metrics are quantitative measures of how well a machine learning model performs. They let us check whether a model meets our requirements and compare different models, so that we can make an informed decision about which model to use.

Which metric to use depends on the type of problem being solved and its specific requirements. Some common performance metrics for classification include −

Accuracy − Accuracy is one of the most basic performance metrics and measures the proportion of correctly classified instances in the dataset. It is calculated as the number of correctly classified instances divided by the total number of instances in the dataset.

Precision − Precision measures the proportion of true positive instances out of all predicted positive instances. It is calculated as the number of true positive instances divided by the sum of true positive and false positive instances.

Recall − Recall measures the proportion of true positive instances out of all actual positive instances. It is calculated as the number of true positive instances divided by the sum of true positive and false negative instances.

F1 Score − F1 score is the harmonic mean of precision and recall, so it balances the two. It is calculated as 2 × (precision × recall) / (precision + recall).

ROC AUC Score − ROC AUC (Receiver Operating Characteristic, Area Under the Curve) score measures the ability of a classifier to distinguish between positive and negative instances. It is calculated by plotting the true positive rate against the false positive rate at different classification thresholds and computing the area under the resulting curve.

Confusion Matrix − A confusion matrix is a table used to evaluate the performance of a classification model. It shows the number of true positives, true negatives, false positives, and false negatives for each class in the dataset.

Example

Here is an example code snippet that calculates the accuracy, precision, recall, and F1 score for a multiclass classification problem (the three-class iris dataset). Precision, recall, and F1 are macro-averaged, that is, computed per class and then averaged −

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

    # Load the iris dataset
    iris = load_iris()
    X = iris.data
    y = iris.target

    # Split the dataset into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Train a logistic regression model on the training set
    model = LogisticRegression()
    model.fit(X_train, y_train)

    # Make predictions on the test set
    y_pred = model.predict(X_test)

    # Compute performance metrics (macro average over the three classes)
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average="macro")
    recall = recall_score(y_test, y_pred, average="macro")
    f1 = f1_score(y_test, y_pred, average="macro")

    # Print the performance metrics
    print("Accuracy:", accuracy)
    print("Precision:", precision)
    print("Recall:", recall)
    print("F1 Score:", f1)

Output

When you execute this code, it will produce the following output −

    Accuracy: 1.0
    Precision: 1.0
    Recall: 1.0
    F1 Score: 1.0
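To make the formulas for accuracy, precision, recall, and F1 score concrete, the following is a minimal sketch that computes them directly from the confusion-matrix counts of a binary problem, using only the Python standard library. The function name `binary_metrics` and the toy label lists are assumptions made for illustration; they are not part of scikit-learn.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute basic metrics from true and predicted labels (hypothetical helper)."""
    pairs = list(zip(y_true, y_pred))
    # Confusion-matrix counts for the chosen positive class
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels, made up for illustration: 4 actual positives, 6 actual negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

With one false positive and one false negative, precision and recall are both 3/4, so their harmonic mean (the F1 score) is also 0.75, while accuracy is 8/10.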
Overview
Clustering Algorithms – Overview

Introduction to Clustering

Clustering methods are among the most useful unsupervised ML methods. They find similarity and relationship patterns among data samples and then group those samples into clusters based on their features.

Clustering is important because it determines the intrinsic grouping among unlabeled data. Each method makes some assumptions about what makes data points similar, and each assumption produces different but equally valid clusters. For example, a clustering system groups similar kinds of data together into different clusters.

Cluster Formation Methods

Clusters do not have to be spherical. The following are some other cluster formation methods −

Density-based − In these methods, the clusters are formed as dense regions. Their advantage is good accuracy as well as a good ability to merge two clusters. Examples: Density-Based Spatial Clustering of Applications with Noise (DBSCAN), Ordering Points To Identify the Clustering Structure (OPTICS).

Hierarchical-based − In these methods, the clusters are formed as a tree-like structure based on a hierarchy. There are two categories, Agglomerative (bottom-up approach) and Divisive (top-down approach). Examples: Clustering Using Representatives (CURE), Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH).

Partitioning − In these methods, the clusters are formed by partitioning the objects into k clusters. The number of clusters equals the number of partitions. Examples: K-means, Clustering Large Applications based upon Randomized Search (CLARANS).

Grid − In these methods, the clusters are formed as a grid-like structure. Their advantage is that all clustering operations on these grids are fast and independent of the number of data objects. Examples: Statistical Information Grid (STING), Clustering In Quest (CLIQUE).

Measuring Clustering Performance

One of the most important considerations for any ML model is assessing its performance, or quality. For supervised learning algorithms this is straightforward because we already have a label for every example. For unsupervised learning algorithms it is harder because we deal with unlabeled data. Still, some metrics give the practitioner insight into how the clusters change depending on the algorithm.

Before looking at these metrics, note that they only evaluate the comparative performance of models against each other; they do not measure the validity of a model's predictions. The following are some of the metrics we can use to measure the quality of a clustering model −

Silhouette Analysis

Silhouette analysis checks the quality of a clustering model by measuring the distance between the clusters. It gives us a way to assess parameters such as the number of clusters with the help of the Silhouette score, which measures how close each point in one cluster is to the points in the neighboring clusters.

Analysis of Silhouette Score

The range of the Silhouette score is [-1, 1]. It is interpreted as follows −

+1 Score − A Silhouette score near +1 indicates that the sample is far away from its neighboring cluster.

0 Score − A Silhouette score of 0 indicates that the sample is on or very close to the decision boundary separating two neighboring clusters.

-1 Score − A Silhouette score near -1 indicates that the sample has been assigned to the wrong cluster.

The Silhouette score of a sample is calculated with the following formula −

$$\text{silhouette score}=\frac{p-q}{\max(p,q)}$$

Here, p = mean distance from the sample to the points in the nearest cluster that the sample does not belong to,

and q = mean intra-cluster distance, that is, the mean distance from the sample to all the other points in its own cluster.
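As a rough illustration of the Silhouette formula, the sketch below computes the per-sample score (p − q) / max(p, q) in plain Python on four made-up points. The helper name `silhouette_samples`, the toy data, and the convention of scoring a single-point cluster as 0 are assumptions for this example; in practice a library routine would normally be used.

```python
import math


def silhouette_samples(points, labels):
    """Per-sample silhouette score (p - q) / max(p, q); hypothetical helper."""
    scores = []
    for i, (point, label) in enumerate(zip(points, labels)):
        # q: mean distance to the other points of the sample's own cluster
        own = [math.dist(point, other)
               for j, (other, other_label) in enumerate(zip(points, labels))
               if other_label == label and j != i]
        if not own:
            scores.append(0.0)  # assumed convention for single-point clusters
            continue
        q = sum(own) / len(own)
        # p: smallest mean distance to the points of any other cluster
        p = min(
            sum(math.dist(point, other)
                for other, other_label in zip(points, labels)
                if other_label == c) / labels.count(c)
            for c in set(labels) if c != label
        )
        scores.append((p - q) / max(p, q))
    return scores


# Two tight, well-separated groups (toy data made up for illustration)
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good_labels = [0, 0, 1, 1]   # matches the visible grouping
bad_labels = [0, 1, 0, 1]    # deliberately mixes the two groups

mean_good = sum(silhouette_samples(points, good_labels)) / len(points)
mean_bad = sum(silhouette_samples(points, bad_labels)) / len(points)
print(round(mean_good, 3), round(mean_bad, 3))
```

The labeling that follows the visible grouping scores close to +1, while the deliberately mixed labeling scores below 0, which matches the interpretation of the score range given above.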
Davies-Bouldin Index

The Davies-Bouldin (DB) index is another good metric for analyzing clustering algorithms. It helps us answer the following questions about a clustering model −

Are the clusters well spaced from each other?

How dense are the clusters?

We can calculate the DB index with the following formula −

$$DB=\frac{1}{n}\sum\limits_{i=1}^{n}\max_{j\neq i}\left(\frac{\sigma_{i}+\sigma_{j}}{d(c_{i},c_{j})}\right)$$

Here, n = number of clusters,

σi = average distance of all points in cluster i from the cluster centroid ci,

and d(ci, cj) = distance between the centroids of clusters i and j.

The lower the DB index, the better the clustering model.

Dunn Index

The Dunn index works like the DB index, but the two differ in the following points −

The Dunn index considers only the worst case, that is, the clusters that are closest together, while the DB index considers the dispersion and separation of all the clusters in the clustering model.

A higher Dunn index means better clustering, whereas a lower DB index means better clustering. Both improve when the clusters are well spaced and dense.

We can calculate the Dunn index with the following formula −

$$D=\frac{\min_{1\leq i<j\leq n}p(i,j)}{\max_{1\leq k\leq n}q(k)}$$

Here, i, j, k = indices of the clusters,

p = inter-cluster distance,

q = intra-cluster distance.

Types of ML Clustering Algorithms

The following are the most important and useful ML clustering algorithms −

K-means Clustering

This clustering algorithm computes the centroids and iterates until it finds the optimal centroids. It assumes that the number of clusters is already known. It is also called a flat clustering algorithm. The number of clusters that the algorithm identifies in the data is represented by 'K' in K-means.

Mean-Shift Algorithm

It is another powerful clustering algorithm used in unsupervised learning. Unlike K-means clustering, it does not assume the number of clusters in advance; it is a non-parametric algorithm.

Hierarchical Clustering

It is another unsupervised learning algorithm that is used to group together unlabeled data points that have similar characteristics.

We will discuss all these algorithms in detail in the upcoming chapters.
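The Davies-Bouldin and Dunn indices described earlier in this chapter can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names are made up, distances are Euclidean, the inter-cluster distance p(i, j) is taken as the smallest distance between a point of cluster i and a point of cluster j, and the intra-cluster distance q(k) is taken as the largest distance between two points of cluster k. Other definitions of p and q are in use as well.

```python
import math
from itertools import combinations


def centroid(cluster):
    """Coordinate-wise mean of the points in one cluster."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def davies_bouldin(clusters):
    """DB index: mean over clusters of the worst (sigma_i + sigma_j) / d(c_i, c_j)."""
    cents = [centroid(c) for c in clusters]
    sigmas = [sum(math.dist(pt, cen) for pt in c) / len(c)
              for c, cen in zip(clusters, cents)]
    n = len(clusters)
    total = 0.0
    for i in range(n):
        total += max((sigmas[i] + sigmas[j]) / math.dist(cents[i], cents[j])
                     for j in range(n) if j != i)
    return total / n


def dunn(clusters):
    """Dunn index: smallest inter-cluster distance / largest intra-cluster distance."""
    min_inter = min(math.dist(a, b)
                    for ci, cj in combinations(clusters, 2)
                    for a in ci for b in cj)
    max_intra = max(math.dist(a, b)
                    for c in clusters
                    for a, b in combinations(c, 2))
    return min_inter / max_intra


# Two compact clusters that are far apart (toy data made up for illustration)
clusters = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
db_value = davies_bouldin(clusters)
dunn_value = dunn(clusters)
print(db_value, dunn_value)
```

For this toy data each cluster has an average centroid distance of 1 and the centroids are 10 apart, so the DB index is small (about 0.2), while the Dunn index is large (10 / 2 = 5). Both point to compact, well-separated clusters.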
K-means Algorithm
Clustering Algorithms – K-means Algorithm

Introduction to K-Means Algorithm

The K-means clustering algorithm computes the centroids and iterates until it finds the optimal centroids. It assumes that the number of clusters is already known. It is also called a flat clustering algorithm. The number of clusters that the algorithm identifies in the data is represented by 'K' in K-means.

In this algorithm, the data points are assigned to clusters in such a way that the sum of the squared distances between the data points and their centroids is as small as possible. Less variation within a cluster means that the data points in that cluster are more similar to each other.

Working of K-Means Algorithm

We can understand the working of the K-Means clustering algorithm with the help of the following steps −

Step 1 − First, specify the number of clusters, K, that the algorithm should generate.

Step 2 − Next, randomly select K data points as the initial centroids and assign each data point to a cluster.

Step 3 − Now compute the cluster centroids.

Step 4 − Next, keep iterating the following until the optimal centroids are found, that is, until the assignment of data points to clusters no longer changes −

4.1 − First, compute the sum of squared distances between the data points and the centroids.

4.2 − Now, assign each data point to the cluster whose centroid is closest.

4.3 − Finally, compute the centroid of each cluster by taking the average of all data points in that cluster.

K-means follows the Expectation-Maximization approach to solve the problem. The Expectation step assigns the data points to the closest cluster, and the Maximization step computes the centroid of each cluster.
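The steps above can be sketched in plain Python as follows. This is a minimal illustration, not the scikit-learn implementation: the function name `kmeans`, its `init` and `seed` parameters, and the toy points are assumptions. The initial centroids are passed in explicitly here so that the run is reproducible; when `init` is omitted, K random data points are used, as in Step 2.

```python
import math
import random


def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Minimal K-means sketch (hypothetical helper, Euclidean distance)."""
    # Step 2: start from K data points, random unless given explicitly
    centroids = list(init) if init is not None else random.Random(seed).sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Expectation step: assign every point to its closest centroid
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 4: stop once the assignment no longer changes
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Maximization step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignment


# Two small groups of points (toy data made up for illustration)
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2),
          (8.0, 8.0), (8.2, 7.8), (7.8, 8.2)]
centroids, assignment = kmeans(points, k=2, init=[(1.0, 1.0), (8.0, 8.0)])
print(centroids, assignment)
```

With these starting centroids the first three points end up in cluster 0 and the last three in cluster 1. A different starting pair, for example two points taken from the same group, can leave this simple loop stuck in a worse local optimum, which is the usual reason for trying several initializations.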
While working with the K-means algorithm, we need to take care of the following things −

It is recommended to standardize the data when working with clustering algorithms, including K-Means, because such algorithms use distance-based measurements to determine the similarity between data points.

Due to the iterative nature of K-Means and the random initialization of centroids, K-Means may get stuck in a local optimum and may not converge to the global optimum. That is why it is recommended to try different initializations of the centroids.

Implementation in Python

The following two examples of implementing the K-Means clustering algorithm will help us understand it better −

Example 1

This is a simple example to understand how k-means works. We first generate a 2D dataset containing 4 different blobs and then apply the k-means algorithm to see the result.

First, we start by importing the necessary packages −

    # %matplotlib inline is a Jupyter/IPython magic; omit it in a plain Python script
    %matplotlib inline
    import matplotlib.pyplot as plt
    import seaborn as sns; sns.set()
    import numpy as np
    from sklearn.cluster import KMeans

The following code generates the 2D dataset containing four blobs −

    # make_blobs is imported from sklearn.datasets; the older
    # sklearn.datasets.samples_generator module is no longer available
    from sklearn.datasets import make_blobs
    X, y_true = make_blobs(n_samples=400, centers=4, cluster_std=0.60, random_state=0)

Next, the following code helps us visualize the dataset −

    plt.scatter(X[:, 0], X[:, 1], s=20)
    plt.show()

Next, create a KMeans object with the number of clusters, train the model, and make the prediction as follows −

    kmeans = KMeans(n_clusters=4)
    kmeans.fit(X)
    y_kmeans = kmeans.predict(X)

Now, with the help of the following code, we can plot and visualize the cluster centers picked by the k-means estimator −

    plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=20, cmap="summer")
    centers = kmeans.cluster_centers_
    plt.scatter(centers[:, 0], centers[:, 1], c="blue", s=100, alpha=0.9)
    plt.show()

Example 2

Let us move to another example in which we apply K-means clustering to the simple digits dataset. K-means will try to identify similar digits without using the original label information.

First, we start by importing the necessary packages −

    %matplotlib inline
    import matplotlib.pyplot as plt
    import seaborn as sns; sns.set()
    import numpy as np
    from sklearn.cluster import KMeans

Next, load the digits dataset from sklearn. We can also find the number of rows and columns in this dataset as follows −

    from sklearn.datasets import load_digits
    digits = load_digits()
    digits.data.shape

Output

    (1797, 64)

The above output shows that this dataset has 1797 samples with 64 features.

We can perform the clustering as we did in Example 1 above −

    kmeans = KMeans(n_clusters=10, random_state=0)
    clusters = kmeans.fit_predict(digits.data)
    kmeans.cluster_centers_.shape

Output

    (10, 64)

The above output shows that K-means created 10 clusters with 64 features. The following code displays each cluster center as an 8 × 8 image −

    fig, ax = plt.subplots(2, 5, figsize=(8, 3))
    centers = kmeans.cluster_centers_.reshape(10, 8, 8)
    for axi, center in zip(ax.flat, centers):
        axi.set(xticks=[], yticks=[])
        axi.imshow(center, interpolation="nearest", cmap=plt.cm.binary)

Output

As output, we get an image showing the cluster centers learned by k-means.

The following lines of code match each learned cluster label with the most common true label found in that cluster −

    from scipy.stats import mode
    labels = np.zeros_like(clusters)
    for i in range(10):
        mask = (clusters == i)
        # Use the most frequent true digit in this cluster as its label
        labels[mask] = mode(digits.target[mask])[0]

Next, we can check the accuracy as follows −

    from sklearn.metrics import accuracy_score
    accuracy_score(digits.target, labels)

Output

    0.7935447968836951

The above output shows that the accuracy is about 79%.

Advantages and Disadvantages

Advantages

The following are some advantages of K-Means clustering algorithms −

It is very easy to understand and implement.

If we have a large number of variables, K-means would be faster than Hierarchical clustering.

On re-computation of centroids, an instance can change its cluster.

Tighter clusters are formed with K-means as compared to Hierarchical clustering.

Disadvantages

The following are some disadvantages of K-Means clustering algorithms −
Hierarchical Clustering
Clustering Algorithms – Hierarchical Clustering

Introduction to Hierarchical Clustering

Hierarchical clustering is another unsupervised learning algorithm that is used to group together unlabeled data points that have similar characteristics. Hierarchical clustering algorithms fall into the following two categories −

Agglomerative hierarchical algorithms − In agglomerative hierarchical algorithms, each data point is treated as a single cluster, and pairs of clusters are then successively merged, or agglomerated (bottom-up approach). The hierarchy of the clusters is represented as a dendrogram or tree structure.

Divisive hierarchical algorithms − In divisive hierarchical algorithms, on the other hand, all the data points are treated as one big cluster, and clustering proceeds by dividing (top-down approach) the one big cluster into various small clusters.

Steps to Perform Agglomerative Hierarchical Clustering

We are going to explain the most used and most important kind of hierarchical clustering, the agglomerative one. The steps are as follows −

Step 1 − Treat each data point as a single cluster. Hence, with K data points we have K clusters at the start.

Step 2 − Now, form a bigger cluster by joining the two closest data points. This results in a total of K-1 clusters.

Step 3 − Next, join the two closest clusters. This results in a total of K-2 clusters.

Step 4 − Repeat the previous step until only one big cluster remains, that is, until there are no more clusters left to join.

Step 5 − Finally, after the single big cluster has been formed, a dendrogram is used to divide it into multiple clusters, depending on the problem.

Role of Dendrograms in Agglomerative Hierarchical Clustering

As discussed in the last step, the role of the dendrogram starts once the big cluster is formed.
The dendrogram is used to split the big cluster into multiple clusters of related data points, depending on our problem. It can be understood with the help of the following example −

Example 1

To understand, let us start by importing the required libraries as follows −

    %matplotlib inline
    import matplotlib.pyplot as plt
    import numpy as np

Next, we plot the data points we have taken for this example −

    X = np.array([[7, 8], [12, 20], [17, 19], [26, 15], [32, 37],
                  [87, 75], [73, 85], [62, 80], [73, 60], [87, 96]])
    labels = range(1, 11)
    plt.figure(figsize=(10, 7))
    plt.subplots_adjust(bottom=0.1)
    plt.scatter(X[:, 0], X[:, 1], label="True Position")
    # Annotate every point with its number (1 to 10)
    for label, x, y in zip(labels, X[:, 0], X[:, 1]):
        plt.annotate(label, xy=(x, y), xytext=(-3, 3),
                     textcoords="offset points", ha="right", va="bottom")
    plt.show()

From the resulting plot, it is easy to see that our data points form two clusters. In real-world data, however, there can be thousands of clusters.

Next, we plot the dendrogram of our data points using the SciPy library −

    from scipy.cluster.hierarchy import dendrogram, linkage

    linked = linkage(X, "single")
    labelList = list(range(1, 11))
    plt.figure(figsize=(10, 7))
    dendrogram(linked, orientation="top", labels=labelList,
               distance_sort="descending", show_leaf_counts=True)
    plt.show()

Now, once the big cluster is formed, the longest vertical distance in the dendrogram is selected and a horizontal line is drawn through it. Because this horizontal line crosses the vertical (blue) lines at two points, the number of clusters is two.

Next, we need to import the class for clustering and call its fit_predict method to predict the cluster of each point. We import the AgglomerativeClustering class from sklearn.cluster −

    from sklearn.cluster import AgglomerativeClustering

    # Ward linkage always uses Euclidean distance, so no distance argument is passed.
    # (The older affinity="euclidean" argument was renamed to metric in scikit-learn 1.2.)
    cluster = AgglomerativeClustering(n_clusters=2, linkage="ward")
    cluster.fit_predict(X)

Next, plot the clusters with the help of the following code −

    plt.scatter(X[:, 0], X[:, 1], c=cluster.labels_, cmap="rainbow")

The resulting plot shows the two clusters in our data points.

Example 2

Now that we understand the concept of dendrograms from the simple example above, let us move to another example in which we cluster the data points of the Pima Indian Diabetes dataset using hierarchical clustering −

    import matplotlib.pyplot as plt
    import pandas as pd
    %matplotlib inline
    import numpy as np
    from pandas import read_csv

    path = r"C:pima-indians-diabetes.csv"
    headernames = ["preg", "plas", "pres", "skin", "test", "mass", "pedi", "age", "class"]
    data = read_csv(path, names=headernames)
    array = data.values
    X = array[:, 0:8]
    Y = array[:, 8]
    data.shape

Output

    (768, 9)

    data.head()

Output

    slno.  preg  plas  pres  skin  test  mass   pedi  age  class
    0         6   148    72    35     0  33.6  0.627   50      1
    1         1    85    66    29     0  26.6  0.351   31      0
    2         8   183    64     0     0  23.3  0.672   32      1
    3         1    89    66    23    94  28.1  0.167   21      0
    4         0   137    40    35   168  43.1  2.288   33      1

Next, select the two columns to cluster (skin and test) and plot their dendrogram −

    patient_data = data.iloc[:, 3:5].values

    import scipy.cluster.hierarchy as shc
    plt.figure(figsize=(10, 7))
    plt.title("Patient Dendrograms")
    # Build the dendrogram from the same two columns that are clustered below
    dend = shc.dendrogram(shc.linkage(patient_data, method="ward"))

Finally, cluster the selected columns into four clusters and plot the result −

    from sklearn.cluster import AgglomerativeClustering

    cluster = AgglomerativeClustering(n_clusters=4, linkage="ward")
    cluster.fit_predict(patient_data)
    plt.figure(figsize=(10, 7))
    plt.scatter(patient_data[:, 0], patient_data[:, 1], c=cluster.labels_, cmap="rainbow")
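The agglomerative steps listed at the start of this chapter can also be written out by hand. The following is a small sketch in plain Python that uses single linkage (the distance between two clusters is the distance between their closest points) on the ten points from Example 1. The function name `single_linkage` and the choice to stop at a requested number of clusters, instead of building the full tree and cutting it afterwards, are assumptions made to keep the example short.

```python
import math


def single_linkage(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain (hypothetical helper)."""
    # Step 1: every data point starts as its own cluster (stored as point indices)
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > n_clusters:
        best = None
        # Find the pair of clusters whose closest members are nearest to each other
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        # Steps 2 and 3: join the two closest clusters into one
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges


# The ten data points from Example 1
X = [(7, 8), (12, 20), (17, 19), (26, 15), (32, 37),
     (87, 75), (73, 85), (62, 80), (73, 60), (87, 96)]
clusters, merges = single_linkage(X, n_clusters=2)
print(sorted(sorted(c) for c in clusters))
```

Stopping at two clusters separates the first five points from the last five, which agrees with the two groups read off the dendrogram in Example 1. The `merges` list records each join and its distance, which is the information a dendrogram displays.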
Automatic Workflows
Machine Learning – Automatic Workflows

Introduction

In order to execute and produce results successfully, a machine learning project must automate some standard workflows. These standard workflows can be automated with the help of Scikit-learn Pipelines. From a data scientist's perspective, a pipeline is a generalized but very important concept. It allows data to flow from its raw format to some useful information.

The blocks of ML pipelines are as follows −

Data ingestion − As the name suggests, it is the process of importing the data for use in an ML project. The data can be extracted in real time or in batches from single or multiple systems. It is one of the most challenging steps because the quality of the data can affect the whole ML model.

Data Preparation − After importing the data, we need to prepare it for use by our ML model. Data preprocessing is one of the most important techniques of data preparation.

ML Model Training − The next step is to train our ML model. Various kinds of ML algorithms, such as supervised, unsupervised, and reinforcement learning, are available to extract features from the data and make predictions.

Model Evaluation − Next, we need to evaluate the ML model. In an AutoML pipeline, the ML model can be evaluated with the help of various statistical methods and business rules.

ML Model Retraining − In an AutoML pipeline, the first model is not necessarily the best one. The first model is considered a baseline model, and we can train it repeatedly to increase the model's accuracy.

Deployment − Finally, we need to deploy the model. This step involves applying and migrating the model to business operations for their use.

Challenges Accompanying ML Pipelines

Data scientists face many challenges when creating ML pipelines.
These challenges fall into the following three categories −

Quality of Data

The success of any ML model depends heavily on the quality of the data. If the data we provide to the ML model is not accurate, reliable, and robust, we will end up with wrong or misleading output.

Data Reliability

Another challenge associated with ML pipelines is the reliability of the data we provide to the ML model. A data scientist can acquire data from various sources, but to get the best results, the data sources must be reliable and trusted.

Data Accessibility

To get the best results out of ML pipelines, the data itself must be accessible, which requires consolidation, cleansing, and curation of the data. As a result of making the data accessible, metadata will be updated with new tags.

Modelling ML Pipeline and Data Preparation

Data leakage from the training dataset to the testing dataset is an important issue for a data scientist to deal with while preparing data for an ML model. Typically, during data preparation, a data scientist applies techniques such as standardization or normalization to the entire dataset before learning. Doing so leaks information, because the training dataset is then influenced by the scale of the data in the testing dataset.

By using ML pipelines, we can prevent this data leakage because pipelines ensure that data preparation such as standardization is constrained to each fold of our cross-validation procedure.

Example

The following is an example in Python that demonstrates a data preparation and model evaluation workflow. For this purpose, we use the Pima Indian Diabetes dataset, loaded from a CSV file. First, we create a pipeline that standardizes the data. Then a Linear Discriminant Analysis model is created, and finally the pipeline is evaluated using 20-fold cross validation.
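Before the scikit-learn code, the leakage problem itself can be illustrated with the standard library alone. The sketch below compares a scaler fitted on the whole feature column with scalers fitted only on the training part of each cross-validation fold. The helper names, the three-fold split, and the six made-up values (including one outlier) are assumptions for illustration only.

```python
import statistics


def fit_scaler(values):
    """Return the mean and population standard deviation used for standardization."""
    return statistics.fmean(values), statistics.pstdev(values)


def transform(values, mean, std):
    """Standardize values with statistics learned elsewhere."""
    return [(v - mean) / std for v in values]


def kfold_indices(n, k):
    """Yield (train_idx, test_idx) for k contiguous, unshuffled folds."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


# One made-up feature column; the last value is an outlier
feature = [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]

# Leaky approach: statistics computed on ALL rows, test rows included
leaky_mean, leaky_std = fit_scaler(feature)

# Pipeline-style approach: statistics computed on the training rows of each fold only
per_fold_means = []
scaled_test_folds = []
for train_idx, test_idx in kfold_indices(len(feature), 3):
    mean, std = fit_scaler([feature[i] for i in train_idx])
    per_fold_means.append(mean)
    scaled_test_folds.append(transform([feature[i] for i in test_idx], mean, std))

print(leaky_mean, per_fold_means)
```

In the last fold the outlier sits in the test part, so the training mean is 2.5 and is not affected by it. The leaky scaler's mean (about 19.17) is pulled up by that same test value, which is exactly the influence of test data on training that a pipeline is meant to prevent.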
First, import the required packages as follows −

    from pandas import read_csv
    from sklearn.model_selection import KFold
    from sklearn.model_selection import cross_val_score
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import Pipeline
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

Now, we need to load the Pima diabetes dataset as we did in previous examples, and separate the features from the class column −

    path = r"C:pima-indians-diabetes.csv"
    headernames = ["preg", "plas", "pres", "skin", "test", "mass", "pedi", "age", "class"]
    data = read_csv(path, names=headernames)
    array = data.values
    X = array[:, 0:8]   # the eight feature columns
    Y = array[:, 8]     # the class column

Next, we create a pipeline with the help of the following code −

    estimators = []
    estimators.append(("standardize", StandardScaler()))
    estimators.append(("lda", LinearDiscriminantAnalysis()))
    model = Pipeline(estimators)

Finally, we evaluate this pipeline and output its accuracy as follows −

    # No shuffling is used, so random_state is not passed; recent scikit-learn
    # versions raise an error if random_state is set while shuffle is False
    kfold = KFold(n_splits=20)
    results = cross_val_score(model, X, Y, cv=kfold)
    print(results.mean())

Output

    0.7790148448043184

The above output is the mean accuracy of the setup on the dataset.

Modelling ML Pipeline and Feature Extraction

Data leakage can also happen at the feature extraction step of an ML model. That is why feature extraction procedures should also be restricted, to stop data leakage in our training dataset. As in the case of data preparation, we can prevent this data leakage by using ML pipelines. FeatureUnion, a tool provided by ML pipelines, can be used for this purpose.

Example

The following is an example in Python that demonstrates a feature extraction and model evaluation workflow. For this purpose, we use the Pima Indian Diabetes dataset, loaded from a CSV file.

First, 3 features will be extracted with PCA (Principal Component Analysis). Then, 6 features will be extracted with Statistical Analysis. After feature extraction, the results of the multiple feature selection and extraction procedures will be combined by using the FeatureUnion tool.
Finally, a Logistic Regression model will be created, and the pipeline will be evaluated using 10-fold cross validation.

First, import the required packages as follows −

    from pandas import read_csv
    from sklearn.model_selection import KFold
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import Pipeline
    from sklearn.pipeline import FeatureUnion
    from sklearn.linear_model import LogisticRegression
    from sklearn.decomposition import
Discuss Machine Learning With Python

Machine Learning (ML) is the field of computer science that enables computer systems to make sense of data in much the same way as human beings do. In simple words, ML is a type of artificial intelligence that extracts patterns from raw data by using an algorithm or method. The key focus of ML is to allow computer systems to learn from experience without being explicitly programmed and without human intervention.
Finding Nearest Neighbors
KNN Algorithm – Finding Nearest Neighbors

Introduction

The K-nearest neighbors (KNN) algorithm is a supervised ML algorithm that can be used for both classification and regression predictive problems. In industry, however, it is mainly used for classification. The following two properties define KNN well −

Lazy learning algorithm − KNN is a lazy learning algorithm because it has no specialized training phase and uses all the training data at classification time.

Non-parametric learning algorithm − KNN is also a non-parametric learning algorithm because it does not assume anything about the underlying data.

Working of KNN Algorithm

The K-nearest neighbors (KNN) algorithm uses 'feature similarity' to predict the values of new data points, which means that a new data point is assigned a value based on how closely it matches the points in the training set. We can understand its working with the help of the following steps −

Step 1 − To implement any algorithm, we need a dataset. So in the first step of KNN, we must load the training data as well as the test data.

Step 2 − Next, we need to choose the value of K, that is, the number of nearest data points to consider. K can be any positive integer.

Step 3 − For each point in the test data, do the following −

3.1 − Calculate the distance between the test point and each row of the training data using one of the available methods, such as Euclidean, Manhattan, or Hamming distance. The most commonly used method is Euclidean distance.

3.2 − Now, sort the rows in ascending order of distance.

3.3 − Next, choose the top K rows from the sorted array.

3.4 − Now, assign a class to the test point based on the most frequent class among these rows.

Step 4 − End

Example

The following example illustrates the concept of K and the working of the KNN algorithm −

Suppose we have a two-dimensional dataset whose points belong to either a blue class or a red class. We need to classify a new data point, drawn as a black dot at (60, 60), into the blue or the red class. We assume K = 3, so the algorithm finds the three nearest data points.

Among the three nearest neighbors of the black dot, two lie in the red class, hence the black dot is also assigned to the red class.

Implementation in Python

As we know, the K-nearest neighbors (KNN) algorithm can be used for both classification and regression. The following are recipes in Python to use KNN as a classifier as well as a regressor −

KNN as Classifier

First, start by importing the necessary Python packages −

    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd

Next, set the web link from which the iris dataset is downloaded −

    path = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

Next, we need to assign column names to the dataset as follows −

    headernames = ["sepal-length", "sepal-width", "petal-length", "petal-width", "Class"]

Now, we need to read the dataset into a pandas dataframe as follows −

    dataset = pd.read_csv(path, names=headernames)
    dataset.head()

Output

    slno.  sepal-length  sepal-width  petal-length  petal-width  Class
    0               5.1          3.5           1.4          0.2  Iris-setosa
    1               4.9          3.0           1.4          0.2  Iris-setosa
    2               4.7          3.2           1.3          0.2  Iris-setosa
    3               4.6          3.1           1.5          0.2  Iris-setosa
    4               5.0          3.6           1.4          0.2  Iris-setosa

Data preprocessing, which separates the feature columns from the class column, is done with the following script lines −

    X = dataset.iloc[:, :-1].values
    y = dataset.iloc[:, 4].values
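Independently of the scikit-learn walkthrough, the steps listed under Working of KNN Algorithm can be sketched in plain Python. The function name `knn_predict` and the small red and blue training set are assumptions for illustration; the points were made up so that a query at (60, 60) with K = 3 has two red neighbors and one blue neighbor, as in the example described above.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Classify one query point by majority vote among its k nearest rows."""
    # Step 3.1 and 3.2: Euclidean distance to every training row, sorted ascending
    distances = sorted(
        (math.dist(row, query), label) for row, label in zip(train_X, train_y)
    )
    # Step 3.3: keep the labels of the top K rows
    top_k = [label for _, label in distances[:k]]
    # Step 3.4: the most frequent class among them wins
    return Counter(top_k).most_common(1)[0][0]


# Hypothetical training data: three red points and four blue points
train_X = [(57, 63), (64, 62), (80, 80),
           (59, 55), (45, 50), (30, 40), (40, 30)]
train_y = ["red", "red", "red",
           "blue", "blue", "blue", "blue"]

query = (60, 60)
pred_k3 = knn_predict(train_X, train_y, query, k=3)
pred_k7 = knn_predict(train_X, train_y, query, k=7)
print(pred_k3, pred_k7)
```

With K = 3 the vote is two red against one blue, so the query is classified as red. With K = 7 every training point votes and blue wins four to three, which shows how strongly the result can depend on the choice of K.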
The following code splits the dataset into 60% training data and 40% testing data −

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40)

Next, scale the data as follows. The scaler is fitted on the training set only and then applied to both sets −

    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)

Next, train the model with the help of the KNeighborsClassifier class of sklearn as follows −

    from sklearn.neighbors import KNeighborsClassifier
    classifier = KNeighborsClassifier(n_neighbors=8)
    classifier.fit(X_train, y_train)

At last, we need to make predictions. This can be done with the following script −

    y_pred = classifier.predict(X_test)

Next, print the results as follows −

    from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
    result = confusion_matrix(y_test, y_pred)
    print("Confusion Matrix:")
    print(result)
    result1 = classification_report(y_test, y_pred)
    print("Classification Report:")
    print(result1)
    result2 = accuracy_score(y_test, y_pred)
    print("Accuracy:", result2)

Output

Because train_test_split is called without a random_state, the split is random and the exact numbers vary from run to run.

    Confusion Matrix:
    [[21  0  0]
     [ 0 16  0]
     [ 0  7 16]]
    Classification Report:
                     precision    recall  f1-score   support
        Iris-setosa       1.00      1.00      1.00        21
    Iris-versicolor       0.70      1.00      0.82        16
     Iris-virginica       1.00      0.70      0.82        23
          micro avg       0.88      0.88      0.88        60
          macro avg       0.90      0.90      0.88        60
       weighted avg       0.92      0.88      0.88        60
    Accuracy: 0.8833333333333333

KNN as Regressor

First, import the necessary Python packages −

    import numpy as np
    import pandas as pd

Next, download the iris dataset from its weblink as follows −

    path = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

Next, assign column names to the dataset as follows −

    headernames = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']

Now, read the dataset into a pandas dataframe as follows −

    # The URL is stored in the variable path defined above
    data = pd.read_csv(path, names=headernames)
    array = data.values
    X = array[:,:2]
    Y = array[:,2]
    data.shape

    output:(150, 5)

Next, import KNeighborsRegressor from
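
The distance, sort and vote steps listed under Working of KNN Algorithm can also be written directly in NumPy. The following is only a minimal sketch of those steps, not how scikit-learn implements its estimators; the function name knn_predict and the toy coordinates are made up for illustration, and ties between classes are not handled specially.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Classify one point by majority vote among its k nearest training points."""
    # Step 3.1: Euclidean distance from the new point to every training row
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Steps 3.2 and 3.3: sort ascending and keep the indices of the k closest rows
    nearest = np.argsort(distances)[:k]
    # Step 3.4: return the most frequent class among those rows
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two "blue" points and three "red" points
X_train = np.array([[20.0, 35.0], [25.0, 30.0], [55.0, 65.0], [62.0, 58.0], [70.0, 70.0]])
y_train = np.array(["blue", "blue", "red", "red", "red"])

print(knn_predict(X_train, y_train, np.array([60.0, 60.0]), k=3))
```

With these made-up points, the three rows closest to (60, 60) all belong to the red class, so the sketch assigns the new point to red.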
Linear Regression
Regression Algorithms – Linear Regression

Introduction to Linear Regression

Linear regression may be defined as a statistical model that analyzes the linear relationship between a dependent variable and a given set of independent variables. A linear relationship between variables means that when the value of one or more independent variables changes (increases or decreases), the value of the dependent variable changes accordingly (increases or decreases).

Mathematically, the relationship can be represented with the help of the following equation −

Y = mX + b

Here,

Y is the dependent variable we are trying to predict.

X is the independent variable we are using to make predictions.

m is the slope of the regression line, which represents the effect X has on Y.

b is a constant, known as the Y-intercept. If X = 0, Y is equal to b.

Furthermore, the linear relationship can be positive or negative in nature, as explained below −

Positive Linear Relationship

A linear relationship is called positive if the dependent variable increases when the independent variable increases. It can be understood with the help of the following graph −

Negative Linear Relationship

A linear relationship is called negative if the dependent variable decreases when the independent variable increases. It can be understood with the help of the following graph −

Types of Linear Regression

Linear regression is of the following two types −

Simple Linear Regression

Multiple Linear Regression

Simple Linear Regression (SLR)

It is the most basic version of linear regression, which predicts a response using a single feature. The assumption in SLR is that the two variables are linearly related.

Python implementation

We can implement SLR in Python in two ways: one is to provide our own dataset, and the other is to use a dataset from the scikit-learn Python library.

Example 1 − In the following Python implementation example, we are using our own dataset.
First, we will start with importing the necessary packages as follows (the %matplotlib line is a Jupyter notebook magic and can be omitted in a plain script) −

    %matplotlib inline
    import numpy as np
    import matplotlib.pyplot as plt

Next, define a function which will calculate the important values for SLR −

    def coef_estimation(x, y):

The following script line gives the number of observations n −

        n = np.size(x)

The mean of the x and y vectors can be calculated as follows −

        m_x, m_y = np.mean(x), np.mean(y)

We can find the cross-deviation and the deviation about x as follows −

        SS_xy = np.sum(y*x) - n*m_y*m_x
        SS_xx = np.sum(x*x) - n*m_x*m_x

Next, the regression coefficients, i.e. b, can be calculated as follows −

        b_1 = SS_xy / SS_xx
        b_0 = m_y - b_1*m_x
        return (b_0, b_1)

Next, we need to define a function which will plot the regression line as well as predict the response vector −

    def plot_regression_line(x, y, b):

The following script line plots the actual points as a scatter plot −

        plt.scatter(x, y, color="m", marker="o", s=30)

The following script line predicts the response vector −

        y_pred = b[0] + b[1]*x

The following script lines plot the regression line and label the axes −

        plt.plot(x, y_pred, color="g")
        plt.xlabel('x')
        plt.ylabel('y')
        plt.show()

At last, we need to define the main() function for providing the dataset and calling the functions we defined above −

    def main():
        x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
        y = np.array([100, 300, 350, 500, 750, 800, 850, 900, 1050, 1250])
        b = coef_estimation(x, y)
        print("Estimated coefficients:\nb_0 = {} \nb_1 = {}".format(b[0], b[1]))
        plot_regression_line(x, y, b)

    if __name__ == "__main__":
        main()

Output

    Estimated coefficients:
    b_0 = 154.5454545454545
    b_1 = 117.87878787878788

Example 2 − In the following Python implementation example, we are using the diabetes dataset from scikit-learn.
First, we will start with importing the necessary packages as follows −

    %matplotlib inline
    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn import datasets, linear_model
    from sklearn.metrics import mean_squared_error, r2_score

Next, we will load the diabetes dataset and create its object −

    diabetes = datasets.load_diabetes()

As we are implementing SLR, we will be using only one feature (the column at index 2), kept as a two-dimensional array, as follows −

    X = diabetes.data[:, np.newaxis, 2]

Next, we need to split the data into training and testing sets, holding out the last 30 rows for testing, as follows −

    X_train = X[:-30]
    X_test = X[-30:]

Next, we need to split the target into training and testing sets in the same way −

    y_train = diabetes.target[:-30]
    y_test = diabetes.target[-30:]

Now, to train the model we need to create a linear regression object as follows −

    regr = linear_model.LinearRegression()

Next, train the model using the training sets as follows −

    regr.fit(X_train, y_train)

Next, make predictions using the testing set as follows −

    y_pred = regr.predict(X_test)

Next, we will print the coefficient, the mean squared error (MSE) and the variance score (the R² value returned by r2_score) as follows −

    print('Coefficients: \n', regr.coef_)
    print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))
    print('Variance score: %.2f' % r2_score(y_test, y_pred))

Now, plot the outputs as follows −

    plt.scatter(X_test, y_test, color='blue')
    plt.plot(X_test, y_pred, color='red', linewidth=3)
    plt.xticks(())
    plt.yticks(())
    plt.show()

Output

    Coefficients:
     [941.43097333]
    Mean squared error: 3035.06
    Variance score: 0.41

Multiple Linear Regression (MLR)

It is the extension of simple linear regression that predicts a response using two or more features. Mathematically, we can explain it as follows −

Consider a dataset having n observations, p features (i.e. independent variables) and y as the one response (i.e. dependent variable). The regression line for p features can be calculated as follows −

$$h(x_{i})=b_{0}+b_{1}x_{i1}+b_{2}x_{i2}+\dots+b_{p}x_{ip}$$

Here, h(x_i) is the predicted response value and b_0, b_1, b_2, …, b_p are the regression coefficients.

Multiple Linear Regression models always includes the
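
As an illustration of the regression function h(x_i) defined above, the coefficients b_0, b_1, …, b_p can be estimated by ordinary least squares with NumPy alone. The following is a minimal sketch under stated assumptions: the data are hypothetical and generated exactly from y = 1 + 2*x1 + 3*x2, and the variable names (X_design, h) are chosen only for this example.

```python
import numpy as np

# Hypothetical data generated exactly from y = 1 + 2*x1 + 3*x2 (p = 2 features)
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Prepend a column of ones so that the intercept b_0 is estimated
# together with the other coefficients
X_design = np.column_stack([np.ones(len(X)), X])

# Least-squares estimate of b = (b_0, b_1, b_2)
b, residuals, rank, _ = np.linalg.lstsq(X_design, y, rcond=None)

# h(x_i) = b_0 + b_1*x_i1 + b_2*x_i2, evaluated for every observation
h = X_design @ b

print(np.round(b, 6))
```

Because the toy targets contain no noise, the estimated coefficients recover the generating values and h reproduces y; with real data the fit would only be approximate.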
Improving Performance of ML Models

Performance Improvement with Ensembles

Ensembles can boost machine learning results by combining several models. An ensemble model consists of several individually trained supervised learning models whose results are merged in various ways to achieve better predictive performance than a single model. Ensemble methods can be divided into the following two groups −

Sequential ensemble methods

As the name implies, in this kind of ensemble method the base learners are generated sequentially. The motivation of such methods is to exploit the dependency among base learners.

Parallel ensemble methods

As the name implies, in this kind of ensemble method the base learners are generated in parallel. The motivation of such methods is to exploit the independence among base learners.

Ensemble Learning Methods

The following are the most popular ensemble learning methods, i.e. the methods for combining the predictions from different models −

Bagging

Bagging is also known as bootstrap aggregation. In bagging methods, the ensemble model tries to improve prediction accuracy and decrease model variance by combining the predictions of individual models trained over randomly generated training samples. The final prediction of the ensemble model is given by calculating the average of all predictions from the individual estimators. Random forests are one of the best examples of bagging methods.

Boosting

In the boosting method, the main principle is to build the ensemble model incrementally by training each base model estimator sequentially. As the name suggests, it combines several weak base learners, trained sequentially over multiple iterations of the training data, to build a powerful ensemble. During the training of the weak base learners, higher weights are assigned to those training instances which were misclassified earlier. An example of a boosting method is AdaBoost.
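
To make the re-weighting idea concrete, the following sketch performs a single round of the discrete AdaBoost weight update in NumPy: training instances the current weak learner got wrong receive a larger weight for the next round. It is an illustration only, not a full boosting implementation; the function name adaboost_reweight and the toy labels are hypothetical, labels are assumed to be in {-1, +1}, and the weighted error is assumed to lie strictly between 0 and 1.

```python
import numpy as np


def adaboost_reweight(weights, y_true, y_pred):
    """One AdaBoost round: return the learner's vote weight and the updated sample weights."""
    miss = y_true != y_pred
    # Weighted error of the current weak learner
    err = np.sum(weights[miss])
    # Vote weight (alpha): larger when the weighted error is smaller
    alpha = 0.5 * np.log((1.0 - err) / err)
    # Raise the weights of misclassified samples, lower those of correct ones
    new_weights = weights * np.exp(np.where(miss, alpha, -alpha))
    # Normalise so the weights sum to 1 again
    return alpha, new_weights / new_weights.sum()


# Hypothetical labels in {-1, +1}; the weak learner gets only the last sample wrong
y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, 1, -1, -1, -1])
weights = np.full(len(y_true), 1.0 / len(y_true))

alpha, weights = adaboost_reweight(weights, y_true, y_pred)
print(round(alpha, 4), np.round(weights, 4))
```

After the update the single misclassified sample carries half of the total weight, so the next weak learner is pushed to classify it correctly.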
Voting

In this ensemble learning model, multiple models of different types are built and some simple statistics, like the mean or the median, are used to combine the predictions. This prediction serves as the additional input for training to make the final prediction.

Bagging Ensemble Algorithms

The following are three bagging ensemble algorithms −

Bagged Decision Tree

Bagging ensemble methods work well with algorithms that have high variance, and in this respect the best choice is the decision tree algorithm. In the following Python recipe, we are going to build a bagged decision tree ensemble model by using the BaggingClassifier class of sklearn with DecisionTreeClassifier (a classification and regression trees algorithm) on the Pima Indians diabetes dataset.

First, import the required packages as follows −

    from pandas import read_csv
    from sklearn.model_selection import KFold
    from sklearn.model_selection import cross_val_score
    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier

Now, we need to load the Pima diabetes dataset as we did in the previous examples −

    path = r"C:\pima-indians-diabetes.csv"
    headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
    data = read_csv(path, names=headernames)
    array = data.values
    X = array[:,0:8]
    Y = array[:,8]

Next, give the input for 10-fold cross validation as follows −

    seed = 7
    # random_state is omitted because the folds are not shuffled;
    # recent scikit-learn versions reject random_state when shuffle=False
    kfold = KFold(n_splits=10)
    cart = DecisionTreeClassifier()

We need to provide the number of trees we are going to build.
Here we are building 150 trees −

    num_trees = 150

Next, build the model with the help of the following script −

    # The base estimator is passed positionally because its keyword was renamed
    # from base_estimator to estimator in newer scikit-learn versions
    model = BaggingClassifier(cart, n_estimators=num_trees, random_state=seed)

Calculate and print the result as follows −

    results = cross_val_score(model, X, Y, cv=kfold)
    print(results.mean())

Output

    0.7733766233766234

The output above shows that our bagged decision tree classifier model reached around 77% accuracy.

Random Forest

It is an extension of bagged decision trees. For the individual classifiers, the samples of the training dataset are taken with replacement, but the trees are constructed in a way that reduces the correlation between them. Also, a random subset of features is considered when choosing each split point, rather than greedily choosing the best split point over all features in the construction of each tree.

In the following Python recipe, we are going to build a bagged random forest ensemble model by using the RandomForestClassifier class of sklearn on the Pima Indians diabetes dataset.

First, import the required packages as follows −

    from pandas import read_csv
    from sklearn.model_selection import KFold
    from sklearn.model_selection import cross_val_score
    from sklearn.ensemble import RandomForestClassifier

Now, we need to load the Pima diabetes dataset as we did in the previous examples −

    path = r"C:\pima-indians-diabetes.csv"
    headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
    data = read_csv(path, names=headernames)
    array = data.values
    X = array[:,0:8]
    Y = array[:,8]

Next, give the input for 10-fold cross validation as follows −

    # random_state is omitted because the folds are not shuffled;
    # recent scikit-learn versions reject random_state when shuffle=False
    kfold = KFold(n_splits=10)

We need to provide the number of trees we are going to build.
Here we are building 150 trees with split points chosen from 5 features −

    num_trees = 150
    max_features = 5

Next, build the model with the help of the following script −

    model = RandomForestClassifier(n_estimators=num_trees, max_features=max_features)

Calculate and print the result as follows (the forest is not seeded, so the exact value varies between runs) −

    results = cross_val_score(model, X, Y, cv=kfold)
    print(results.mean())

Output

    0.7629357484620642

The output above shows that our bagged random forest classifier model reached around 76% accuracy.

Extra Trees

It is another extension of the bagged decision tree ensemble method. In this method, the random trees are constructed from the samples of the training dataset.

In the following Python recipe, we are going to build an extra trees ensemble model by using the ExtraTreesClassifier class of sklearn on the Pima Indians diabetes dataset.

First, import the required packages as follows −

    from pandas import read_csv
    from sklearn.model_selection import KFold
    from sklearn.model_selection import cross_val_score
    from sklearn.ensemble import ExtraTreesClassifier

Now, we need to load the Pima diabetes dataset as did in previous examples −