Regression Algorithms – Linear Regression

Introduction to Linear Regression

Linear regression may be defined as the statistical model that analyzes the linear relationship between a dependent variable and a given set of independent variables. A linear relationship between variables means that when the value of one or more independent variables changes (increases or decreases), the value of the dependent variable changes accordingly (increases or decreases).

Mathematically, the relationship can be represented with the help of the following equation −

Y = mX + b

Here,

Y is the dependent variable we are trying to predict.
X is the independent variable we are using to make predictions.
m is the slope of the regression line, which represents the effect X has on Y.
b is a constant, known as the Y-intercept. If X = 0, Y would be equal to b.

Furthermore, the linear relationship can be positive or negative in nature, as explained below −

Positive Linear Relationship

A linear relationship is called positive if the dependent variable increases as the independent variable increases.

Negative Linear Relationship

A linear relationship is called negative if the dependent variable decreases as the independent variable increases.

Types of Linear Regression

Linear regression is of the following two types −

Simple Linear Regression
Multiple Linear Regression

Simple Linear Regression (SLR)

It is the most basic version of linear regression, which predicts a response using a single feature. The assumption in SLR is that the two variables are linearly related.

Python implementation

We can implement SLR in Python in two ways: one is to provide your own dataset, and the other is to use a dataset from the scikit-learn Python library.

Example 1 − In the following Python implementation example, we are using our own dataset.

First, we will start with importing the necessary packages as follows −

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

Next, define a function which will calculate the important values for SLR −

def coef_estimation(x, y):

The following script line will give the number of observations n −

n = np.size(x)

The means of the x and y vectors can be calculated as follows −

m_x, m_y = np.mean(x), np.mean(y)

We can find the cross-deviation and deviation about x as follows −

SS_xy = np.sum(y*x) - n*m_y*m_x
SS_xx = np.sum(x*x) - n*m_x*m_x

Next, the regression coefficients, i.e.
b_0 and b_1, can be calculated as follows −

b_1 = SS_xy / SS_xx
b_0 = m_y - b_1*m_x
return (b_0, b_1)

Next, we need to define a function which will plot the regression line as well as predict the response vector −

def plot_regression_line(x, y, b):

The following script line will plot the actual points as a scatter plot −

plt.scatter(x, y, color = "m", marker = "o", s = 30)

The following script line will predict the response vector −

y_pred = b[0] + b[1]*x

The following script lines will plot the regression line and put labels on the axes −

plt.plot(x, y_pred, color = "g")
plt.xlabel("x")
plt.ylabel("y")
plt.show()

At last, we need to define a main() function for providing the dataset and calling the functions we defined above −

def main():
   x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
   y = np.array([100, 300, 350, 500, 750, 800, 850, 900, 1050, 1250])
   b = coef_estimation(x, y)
   print("Estimated coefficients:\nb_0 = {} \nb_1 = {}".format(b[0], b[1]))
   plot_regression_line(x, y, b)

if __name__ == "__main__":
   main()

Output

Estimated coefficients:
b_0 = 154.5454545454545
b_1 = 117.87878787878788

Example 2 − In the following Python implementation example, we are using the diabetes dataset from scikit-learn.

First, we will start with importing the necessary packages as follows −

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

Next, we will load the diabetes dataset and create its object −

diabetes = datasets.load_diabetes()

As we are implementing SLR, we will be using only one feature as follows −

X = diabetes.data[:, np.newaxis, 2]

Next, we need to split the data into training and testing sets as follows −

X_train = X[:-30]
X_test = X[-30:]

Next, we need to split the target into training and testing sets as follows −

y_train = diabetes.target[:-30]
y_test = diabetes.target[-30:]

Now, to train the model we need to create a linear regression object as follows −

regr = linear_model.LinearRegression()

Next, train the model using the training sets as follows −

regr.fit(X_train, y_train)

Next, make predictions using the testing set as follows −

y_pred = regr.predict(X_test)

Next, we will print some evaluation metrics such as the coefficients, the MSE and the variance score as follows −

print("Coefficients: \n", regr.coef_)
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))
print("Variance score: %.2f" % r2_score(y_test, y_pred))

Now, plot the outputs as follows −

plt.scatter(X_test, y_test, color="blue")
plt.plot(X_test, y_pred, color="red", linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()

Output

Coefficients:
[941.43097333]
Mean squared error: 3035.06
Variance score: 0.41

Multiple Linear Regression (MLR)

It is the extension of simple linear regression that predicts a response using two or more features. Mathematically, we can explain it as follows − consider a dataset having n observations, p features (i.e. independent variables) and y as one response (i.e. dependent variable); the regression line for the p features can be calculated as follows −

$$h(x_{i})=b_{0}+b_{1}x_{i1}+b_{2}x_{i2}+\ldots+b_{p}x_{ip}$$

Here, h(x_i) is the predicted response value and b_0, b_1, b_2, …, b_p are the regression coefficients. Multiple Linear Regression models always include the errors in the data, known as the residual error, which changes the calculation as follows −

$$h(x_{i})=b_{0}+b_{1}x_{i1}+b_{2}x_{i2}+\ldots+b_{p}x_{ip}+e_{i}$$

Here, e_i is the residual error for the i-th observation.
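To make MLR concrete, here is a minimal hedged sketch using scikit-learn's LinearRegression on a small made-up dataset with two features; the feature values and their interpretation (e.g., area and age predicting price) are purely illustrative assumptions, not part of the original example −

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: two features per observation, one numeric response
X = np.array([[50, 3], [60, 5], [80, 2], [100, 7], [120, 4]])
y = np.array([150, 170, 210, 250, 290])

# Fit a multiple linear regression model: y = b_0 + b_1*x_1 + b_2*x_2
model = LinearRegression()
model.fit(X, y)

print("Intercept (b_0):", model.intercept_)
print("Coefficients (b_1, b_2):", model.coef_)
print("Prediction for [90, 4]:", model.predict([[90, 4]]))

The fitted intercept and coefficients play exactly the roles of b_0, b_1 and b_2 in the equation above; predict() evaluates h(x) for new observations.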
Improving Performance of ML Models

Performance Improvement with Ensembles

Ensembles can give us a boost in machine learning results by combining several models. Basically, ensemble models consist of several individually trained supervised learning models whose results are merged in various ways to achieve better predictive performance compared to a single model. Ensemble methods can be divided into the following two groups −

Sequential ensemble methods

As the name implies, in this kind of ensemble method the base learners are generated sequentially. The motivation of such methods is to exploit the dependency among base learners.

Parallel ensemble methods

As the name implies, in this kind of ensemble method the base learners are generated in parallel. The motivation of such methods is to exploit the independence among base learners.

Ensemble Learning Methods

The following are the most popular ensemble learning methods, i.e. the methods for combining the predictions from different models −

Bagging

The term bagging is also known as bootstrap aggregation. In bagging methods, the ensemble model tries to improve prediction accuracy and decrease model variance by combining the predictions of individual models trained over randomly generated training samples. The final prediction of the ensemble model is given by averaging all predictions from the individual estimators. One of the best examples of bagging methods is random forests.

Boosting

In the boosting method, the main principle of building the ensemble model is to build it incrementally by training each base model estimator sequentially. As the name suggests, it combines several weak base learners, trained sequentially over multiple iterations of the training data, to build a powerful ensemble. During the training of the weak base learners, higher weights are assigned to the training examples that were misclassified by earlier learners. An example of a boosting method is AdaBoost.

Voting

In this ensemble learning model, multiple models of different types are built, and simple statistics, like the mean or median of the individual predictions, are used to combine them into the final prediction.

Bagging Ensemble Algorithms

The following are three bagging ensemble algorithms −

Bagged Decision Tree

As we know, bagging ensemble methods work well with algorithms that have high variance, and in this regard the best one is the decision tree algorithm. In the following Python recipe, we are going to build a bagged decision tree ensemble model by using the BaggingClassifier class of sklearn with DecisionTreeClassifier (a classification & regression trees algorithm) on the Pima Indians diabetes dataset.

First, import the required packages as follows −

from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

Now, we need to load the Pima diabetes dataset as we did in the previous examples −

path = r"C:\pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=headernames)
array = data.values
X = array[:,0:8]
Y = array[:,8]

Next, give the input for 10-fold cross-validation as follows −

seed = 7
# shuffle=True is required when setting random_state in recent scikit-learn versions
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
cart = DecisionTreeClassifier()

We need to provide the number of trees we are going to build.
Here we are building 150 trees −

num_trees = 150

Next, build the model with the help of the following script −

# Note: in scikit-learn 1.2+ the parameter is named estimator instead of base_estimator
model = BaggingClassifier(base_estimator=cart, n_estimators=num_trees, random_state=seed)

Calculate and print the result as follows −

results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

Output

0.7733766233766234

The output above shows that we got around 77% accuracy with our bagged decision tree classifier model.

Random Forest

It is an extension of bagged decision trees. For the individual classifiers, the samples of the training dataset are taken with replacement, but the trees are constructed in such a way that the correlation between them is reduced. Also, a random subset of features is considered when choosing each split point, rather than greedily choosing the best split point in the construction of each tree.

In the following Python recipe, we are going to build a random forest ensemble model by using the RandomForestClassifier class of sklearn on the Pima Indians diabetes dataset.

First, import the required packages as follows −

from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

Now, we need to load the Pima diabetes dataset as we did in the previous examples −

path = r"C:\pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=headernames)
array = data.values
X = array[:,0:8]
Y = array[:,8]

Next, give the input for 10-fold cross-validation as follows −

seed = 7
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)

We need to provide the number of trees we are going to build. Here we are building 150 trees with split points chosen from 5 features −

num_trees = 150
max_features = 5

Next, build the model with the help of the following script −

model = RandomForestClassifier(n_estimators=num_trees, max_features=max_features)

Calculate and print the result as follows −

results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

Output

0.7629357484620642

The output above shows that we got around 76% accuracy with our random forest classifier model.

Extra Trees

It is another extension of the bagged decision tree ensemble method. In this method, randomized trees are constructed from the samples of the training dataset. In the following Python recipe, we are going to build an extra trees ensemble model by using the ExtraTreesClassifier class of sklearn on the Pima Indians diabetes dataset.

First, import the required packages as follows −

from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import ExtraTreesClassifier

Now, we need to load the Pima diabetes dataset as we did in the previous examples −
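The excerpt breaks off here; the remaining steps mirror the two previous recipes exactly. Below is a hedged sketch of the complete Extra Trees example under that assumption — the dataset path follows the earlier examples, and the choice of 150 trees and 7 features per split is an illustrative assumption, not taken from the original −

from pandas import read_csv
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import ExtraTreesClassifier

# Load the Pima diabetes dataset as in the previous recipes
path = r"C:\pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=headernames)
array = data.values
X = array[:,0:8]
Y = array[:,8]

# 10-fold cross-validation, as before
seed = 7
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)

# Build the Extra Trees ensemble; num_trees and max_features are assumed values
num_trees = 150
max_features = 7
model = ExtraTreesClassifier(n_estimators=num_trees, max_features=max_features)

# Evaluate with cross-validation and print the mean accuracy
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())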
Machine Learning With Python – Quick Guide

Machine Learning with Python – Basics

We are living in the 'age of data', enriched with better computational power and more storage resources. This data or information is increasing day by day, but the real challenge is to make sense of all of it. Businesses and organizations are trying to deal with it by building intelligent systems using concepts and methodologies from data science, data mining and machine learning. Among them, machine learning is the most exciting field of computer science. It would not be wrong to call machine learning the application and science of algorithms that make sense of data.

What is Machine Learning?

Machine Learning (ML) is the field of computer science with the help of which computer systems can make sense of data in much the same way as human beings do. In simple words, ML is a type of artificial intelligence that extracts patterns out of raw data by using an algorithm or method. The main focus of ML is to allow computer systems to learn from experience without being explicitly programmed or requiring human intervention.

Need for Machine Learning

Human beings, at this moment, are the most intelligent and advanced species on earth because they can think, evaluate and solve complex problems. On the other side, AI is still in its initial stage and hasn't surpassed human intelligence in many aspects. Then the question is: what is the need to make machines learn? The most suitable reason for doing this is "to make decisions, based on data, with efficiency and scale".

Lately, organizations have been investing heavily in newer technologies like Artificial Intelligence, Machine Learning and Deep Learning to get key information from data, perform several real-world tasks and solve problems. We can call these data-driven decisions taken by machines, particularly to automate processes. These data-driven decisions can be used, instead of programming logic, in problems that cannot be programmed inherently. The fact is that we can't do without human intelligence, but another aspect is that we all need to solve real-world problems with efficiency at a huge scale. That is why the need for machine learning arises.

Why & When to Make Machines Learn?

We have already discussed the need for machine learning, but another question arises: in what scenarios must we make machines learn? There can be several circumstances where we need machines to take data-driven decisions with efficiency and at a huge scale. The following are some such circumstances where making machines learn would be more effective −

Lack of human expertise

The very first scenario in which we want a machine to learn and take data-driven decisions is a domain where there is a lack of human expertise. Examples can be navigation in unknown territories or on other planets.

Dynamic scenarios

There are some scenarios which are dynamic in nature, i.e. they keep changing over time. In the case of these scenarios and behaviors, we want a machine to learn and take data-driven decisions. Examples can be network connectivity and availability of infrastructure in an organization.

Difficulty in translating expertise into computational tasks

There can be various domains in which humans have expertise; however, they are unable to translate this expertise into computational tasks. In such circumstances we want machine learning. Examples can be the domains of speech recognition, cognitive tasks, etc.
Machine Learning Model

Before discussing the machine learning model, we need to understand the following formal definition of ML given by Professor Tom Mitchell −

"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."

The above definition focuses on three parameters, which are also the main components of any learning algorithm, namely Task (T), Performance (P) and Experience (E). In this context, we can simplify the definition as follows −

ML is a field of AI consisting of learning algorithms that −

Improve their performance (P)
At executing some task (T)
Over time with experience (E)

Let us discuss these components in more detail now −

Task (T)

From the perspective of a problem, we may define the task T as the real-world problem to be solved. The problem can be anything, like finding the best house price in a specific location or finding the best marketing strategy. On the other hand, if we talk about machine learning, the definition of task is different, because it is difficult to solve ML-based tasks by a conventional programming approach. A task T is said to be an ML-based task when it is based on a process that the system must follow for operating on data points. Examples of ML-based tasks are classification, regression, structured annotation, clustering, transcription, etc.

Experience (E)

As the name suggests, it is the knowledge gained from the data points provided to the algorithm or model. Once provided with the dataset, the model will run iteratively and learn some inherent pattern. The learning thus acquired is called experience (E). Making an analogy with human learning, we can think of this situation as one in which a human being is learning or gaining experience from various attributes like situations, relationships, etc. Supervised, unsupervised and reinforcement learning are some ways to learn or gain experience. The experience gained by our ML model or algorithm will be used to solve the task T.

Performance (P)

An ML algorithm is supposed to perform its task and gain experience with the passage of time. The measure which tells whether ML
Classification – Introduction

Introduction to Classification

Classification may be defined as the process of predicting a class or category from observed values or given data points. The categorized output can have a form such as "Black" or "White", or "spam" or "no spam". Mathematically, classification is the task of approximating a mapping function (f) from input variables (X) to output variables (Y). It basically belongs to supervised machine learning, in which targets are provided along with the input data set.

An example of a classification problem is spam detection in emails. There can be only two categories of output, "spam" and "no spam"; hence this is a binary classification. To implement this classification, we first need to train the classifier. For this example, "spam" and "no spam" emails would be used as the training data. After successfully training the classifier, it can be used to detect an unknown email.

Types of Learners in Classification

We have two types of learners with respect to classification problems −

Lazy Learners

As the name suggests, such learners wait for the testing data to appear after storing the training data. Classification is done only after getting the testing data. They spend less time on training but more time on predicting. Examples of lazy learners are K-nearest neighbors and case-based reasoning.

Eager Learners

As opposed to lazy learners, eager learners construct the classification model without waiting for the testing data to appear after storing the training data. They spend more time on training but less time on predicting. Examples of eager learners are Decision Trees, Naïve Bayes and Artificial Neural Networks (ANN).

Building a Classifier in Python

Scikit-learn, a Python library for machine learning, can be used to build a classifier in Python. The steps for building a classifier in Python are as follows −

Step 1: Importing the necessary Python package

For building a classifier using scikit-learn, we need to import it. We can import it by using the following script −

import sklearn

Step 2: Importing the dataset

After importing the necessary package, we need a dataset to build the classification prediction model. We can import it from the sklearn datasets or use another one as per our requirement. We are going to use sklearn's Breast Cancer Wisconsin Diagnostic Database. We can import it with the help of the following script −

from sklearn.datasets import load_breast_cancer

The following script will load the dataset −

data = load_breast_cancer()

We also need to organize the data, and it can be done with the help of the following scripts −

label_names = data['target_names']
labels = data['target']
feature_names = data['feature_names']
features = data['data']

The following command will print the names of the labels, 'malignant' and 'benign' in the case of our database −

print(label_names)

The output of the above command is the names of the labels −

['malignant' 'benign']

These labels are mapped to the binary values 0 and 1. Malignant cancer is represented by 0 and benign cancer is represented by 1.

The individual feature names can be inspected as follows −

print(feature_names[0])

The output of the above command is the name of the first feature −

mean radius

Similarly, the name of the next feature can be produced as follows −

print(feature_names[1])

The output of the above command is the name of the second feature −
mean texture

We can also examine the actual feature values. The following command prints the values of the 30 features for the first record −

print(features[0])

This will give the following output −

[ 1.799e+01 1.038e+01 1.228e+02 1.001e+03 1.184e-01 2.776e-01 3.001e-01
  1.471e-01 2.419e-01 7.871e-02 1.095e+00 9.053e-01 8.589e+00 1.534e+02
  6.399e-03 4.904e-02 5.373e-02 1.587e-02 3.003e-02 6.193e-03 2.538e+01
  1.733e+01 1.846e+02 2.019e+03 1.622e-01 6.656e-01 7.119e-01 2.654e-01
  4.601e-01 1.189e-01 ]

Similarly, the following command prints the feature values for the second record −

print(features[1])

This will give the following output −

[ 2.057e+01 1.777e+01 1.329e+02 1.326e+03 8.474e-02 7.864e-02 8.690e-02
  7.017e-02 1.812e-01 5.667e-02 5.435e-01 7.339e-01 3.398e+00 7.408e+01
  5.225e-03 1.308e-02 1.860e-02 1.340e-02 1.389e-02 3.532e-03 2.499e+01
  2.341e+01 1.588e+02 1.956e+03 1.238e-01 1.866e-01 2.416e-01 1.860e-01
  2.750e-01 8.902e-02 ]

Step 3: Organizing data into training & testing sets

As we need to test our model on unseen data, we will divide our dataset into two parts: a training set and a test set. We can use the train_test_split() function of the sklearn Python package to split the data into sets. The following command will import the function −

from sklearn.model_selection import train_test_split

Now, the next command will split the data into training and testing data. In this example, we are taking 40 percent of the data for testing purposes and 60 percent of the data for training purposes −

train, test, train_labels, test_labels = train_test_split(
   features, labels, test_size = 0.40, random_state = 42
)

Step 4: Model evaluation

After dividing the data into training and testing sets, we need to build the model. We will be using the Naïve Bayes algorithm for this purpose. The following command will import the GaussianNB module −

from sklearn.naive_bayes import GaussianNB

Now, initialize the model as follows −

gnb = GaussianNB()

Next, with the help of the following command, we can train the model −

model = gnb.fit(train, train_labels)

Now, for evaluation purposes, we need to make predictions. It can be done by using the predict() function as follows −

preds = gnb.predict(test)
print(preds)

This will give the following output −

[ 1 0 0 1 1 0 0 0 1 1 1 0 1 0 1 0 1 1 1 0 1 1 0 1 1
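To quantify how well these predicted labels match the true test labels, a natural next step is to compute the accuracy with scikit-learn's accuracy_score. This is a short sketch building on the preds and test_labels variables defined above −

from sklearn.metrics import accuracy_score

# Compare the predicted labels against the true test labels
print(accuracy_score(test_labels, preds))

The printed value is the fraction of test emails — here, tumor samples — that the Gaussian Naïve Bayes classifier labeled correctly.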
Machine Learning – Decision Trees Algorithm

The Decision Tree algorithm is a hierarchical tree-based algorithm that is used to classify or predict outcomes based on a set of rules. It works by splitting the data into subsets based on the values of the input features. The algorithm recursively splits the data until it reaches a point where the data in each subset belongs to the same class or has the same value for the target variable. The resulting tree is a set of decision rules that can be used to make predictions or classify new data.

The Decision Tree algorithm works by selecting the best feature to split the data at each node. The best feature is the one that provides the most information gain or the greatest reduction in entropy. Information gain is a measure of the amount of information gained by splitting the data at a particular feature, while entropy is a measure of the randomness or disorder in the data. The algorithm uses these measures to determine the best feature to split the data at each node (a small numeric sketch of these measures follows the dataset description below).

Consider, for example, a binary tree for predicting whether a person is fit or unfit, given information such as age, eating habits and exercise habits. In such a decision tree, the questions are the decision nodes and the final outcomes are the leaves.

Types of Decision Tree Algorithm

There are two main types of Decision Tree algorithm −

Classification Tree − A classification tree is used to classify data into different classes or categories. It works by splitting the data into subsets based on the values of the input features and assigning each subset to a different class.

Regression Tree − A regression tree is used to predict numerical values or continuous variables. It works by splitting the data into subsets based on the values of the input features and assigning each subset a numerical value.

Implementation in Python

Let's implement the Decision Tree algorithm in Python using a popular dataset for classification tasks named the Iris dataset. It contains 150 samples of iris flowers, each with four features: sepal length, sepal width, petal length, and petal width. The flowers belong to three classes: setosa, versicolor, and virginica.
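Before walking through the scikit-learn implementation, here is the numeric sketch of entropy and information gain referenced above. The label arrays are made up purely for illustration −

import numpy as np

def entropy(labels):
    # Entropy = -sum(p * log2(p)) over the class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    # Information gain = parent entropy minus the weighted entropy of the children
    n = len(parent)
    child_entropy = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - child_entropy

parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # evenly mixed classes: entropy = 1.0
left   = np.array([0, 0, 0, 0])               # pure subset: entropy = 0.0
right  = np.array([1, 1, 1, 1])               # pure subset: entropy = 0.0
print(information_gain(parent, left, right))  # prints 1.0 - a perfect split

A split that separates the classes completely, as here, achieves the maximum possible information gain; the algorithm evaluates candidate splits this way and keeps the best one at each node.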
First, we will import the necessary libraries and load the dataset −

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the iris dataset
iris = load_iris()

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=0)

We then create an instance of the Decision Tree classifier and train it on the training set −

# Create a Decision Tree classifier
dtc = DecisionTreeClassifier()

# Fit the classifier to the training data
dtc.fit(X_train, y_train)

We can now use the trained classifier to make predictions on the testing set −

# Make predictions on the testing data
y_pred = dtc.predict(X_test)

We can evaluate the performance of the classifier by calculating its accuracy −

# Calculate the accuracy of the classifier
accuracy = np.sum(y_pred == y_test) / len(y_test)
print("Accuracy:", accuracy)

We can visualize the Decision Tree using the Matplotlib library −

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Visualize the Decision Tree using Matplotlib
plt.figure(figsize=(20,10))
plot_tree(dtc, filled=True, feature_names=iris.feature_names, class_names=iris.target_names)
plt.show()

The plot_tree function from the sklearn.tree module can be used to plot the Decision Tree. We pass in the trained Decision Tree classifier, the filled argument to fill the nodes with color, the feature_names argument to label the features, and the class_names argument to label the target classes. We also specify the figsize argument to set the size of the figure and call the show function to display the plot.

Complete Implementation Example

Given below is the complete implementation example of the Decision Tree classification algorithm in Python using the Iris dataset −

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the iris dataset
iris = load_iris()

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=0)

# Create a Decision Tree classifier
dtc = DecisionTreeClassifier()

# Fit the classifier to the training data
dtc.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = dtc.predict(X_test)

# Calculate the accuracy of the classifier
accuracy = np.sum(y_pred == y_test) / len(y_test)
print("Accuracy:", accuracy)

# Visualize the Decision Tree using Matplotlib
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
plt.figure(figsize=(20,10))
plot_tree(dtc, filled=True, feature_names=iris.feature_names, class_names=iris.target_names)
plt.show()

Output

This will print the accuracy and create a plot of the Decision Tree −

Accuracy: 0.9777777777777777

The plot shows the structure of the Decision Tree, with each node representing a decision based on the value of a feature, and each leaf node representing a class or numerical value. The color of each node indicates the majority class or value of the samples in that node, and the numbers at the bottom indicate the number of samples that reach that node.
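Once trained, the classifier can also be applied to a single new measurement. A brief sketch building on the dtc and iris objects above; the flower measurements are made up for illustration −

# Predict the class of a hypothetical new flower:
# sepal length, sepal width, petal length, petal width (in cm)
new_flower = [[5.1, 3.5, 1.4, 0.2]]
prediction = dtc.predict(new_flower)
print(iris.target_names[prediction][0])   # likely 'setosa' for these values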
Machine Learning – Mean-Shift Clustering

The Mean-Shift clustering algorithm is a non-parametric clustering algorithm that works by iteratively shifting the mean of a data point towards the densest area of the data. The densest area of the data is determined by the kernel function, which is a function that assigns weights to the data points based on their distance from the mean. The kernel function used in Mean-Shift clustering is usually a Gaussian function.

The steps involved in the Mean-Shift clustering algorithm are as follows −

Initialize the mean of each data point to its own value.
For each data point, compute the mean shift vector, which is the vector that points towards the densest area of the data.
Update the mean of each data point by shifting it towards the densest area of the data.
Repeat steps 2 and 3 until convergence is reached.

The Mean-Shift clustering algorithm is a density-based clustering algorithm, which means that it identifies clusters based on the density of the data points rather than the distance between them. In other words, the algorithm identifies clusters based on the areas where the density of the data points is highest.

Implementation of Mean-Shift Clustering in Python

The Mean-Shift clustering algorithm can be implemented in the Python programming language using the scikit-learn library. Scikit-learn is a popular machine learning library in Python that provides various tools for data analysis and machine learning. The following steps are involved in implementing the Mean-Shift clustering algorithm in Python using the scikit-learn library −

Step 1 − Import the necessary libraries

The numpy library is used for scientific computing in Python, while the matplotlib library is used for data visualization. The sklearn.cluster module contains the MeanShift class, which is used for implementing the Mean-Shift clustering algorithm in Python. The estimate_bandwidth function is used to estimate the bandwidth of the kernel function, which is an important parameter in the Mean-Shift clustering algorithm.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import MeanShift, estimate_bandwidth

Step 2 − Generate the data

In this step, we generate a random dataset with 500 data points and 2 features. We use the numpy.random.randn function to generate the data.

# Generate the data
X = np.random.randn(500,2)

Step 3 − Estimate the bandwidth of the kernel function

In this step, we estimate the bandwidth of the kernel function using the estimate_bandwidth function. The bandwidth is an important parameter in the Mean-Shift clustering algorithm, which determines the width of the kernel function.

# Estimate the bandwidth
bandwidth = estimate_bandwidth(X, quantile=0.1, n_samples=100)

Step 4 − Initialize the Mean-Shift clustering algorithm

In this step, we initialize the Mean-Shift clustering algorithm using the MeanShift class. We pass the bandwidth parameter to the class to set the width of the kernel function.

# Initialize the Mean-Shift algorithm
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)

Step 5 − Train the model

In this step, we train the Mean-Shift clustering algorithm on the dataset using the fit method of the MeanShift class.
# Train the model
ms.fit(X)

Step 6 − Visualize the results

# Visualize the results
labels = ms.labels_
cluster_centers = ms.cluster_centers_
n_clusters_ = len(np.unique(labels))
print("Number of estimated clusters:", n_clusters_)

# Plot the data points and the centroids
plt.figure(figsize=(7.5, 3.5))
plt.scatter(X[:,0], X[:,1], c=labels, cmap="viridis")
plt.scatter(cluster_centers[:,0], cluster_centers[:,1], marker="*", s=300, c="r")
plt.show()

In this step, we visualize the results of the Mean-Shift clustering algorithm. We extract the cluster labels and the cluster centers from the trained model. We then print the number of estimated clusters. Finally, we plot the data points and the centroids using the matplotlib library.

Example

Here is the complete implementation example of the Mean-Shift clustering algorithm in Python −

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import MeanShift, estimate_bandwidth

# Generate the data
X = np.random.randn(500,2)

# Estimate the bandwidth
bandwidth = estimate_bandwidth(X, quantile=0.1, n_samples=100)

# Initialize the Mean-Shift algorithm
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)

# Train the model
ms.fit(X)

# Visualize the results
labels = ms.labels_
cluster_centers = ms.cluster_centers_
n_clusters_ = len(np.unique(labels))
print("Number of estimated clusters:", n_clusters_)

# Plot the data points and the centroids
plt.figure(figsize=(7.5, 3.5))
plt.scatter(X[:,0], X[:,1], c=labels, cmap="summer")
plt.scatter(cluster_centers[:,0], cluster_centers[:,1], marker="*", s=200, c="r")
plt.show()

Output

When you execute the program, it will print the number of estimated clusters and produce a scatter plot of the clustered data points with the estimated centroids marked as stars.

Applications of Mean-Shift Clustering

The Mean-Shift clustering algorithm has several applications in various fields. Some of the applications of Mean-Shift clustering are as follows −

Computer vision − Mean-Shift clustering is widely used in computer vision for object tracking, image segmentation, and feature extraction.

Image processing − Mean-Shift clustering is used for image segmentation, which is the process of dividing an image into multiple segments based on the similarity of the pixels.

Anomaly detection − Mean-Shift clustering can be used for detecting anomalies in data by identifying the areas with low density.

Customer segmentation − Mean-Shift clustering can be used for customer segmentation in marketing by identifying groups of customers with similar behavior and preferences.

Social network analysis − Mean-Shift clustering can be used for clustering users in social networks based on their interests and interactions.
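Returning to the model trained in the example above: a fitted MeanShift instance can also assign previously unseen points to the nearest learned cluster via its predict method. A small sketch; the query points below are made up −

# Assign new, unseen points to the nearest learned cluster center
new_points = np.array([[0.0, 0.0], [2.5, 2.5]])
print(ms.predict(new_points))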
Classification Algorithms – Random Forest

Introduction

Random forest is a supervised learning algorithm which is used for both classification and regression. However, it is mainly used for classification problems. As we know, a forest is made up of trees, and more trees mean a more robust forest. Similarly, the random forest algorithm creates decision trees on data samples, gets a prediction from each of them, and finally selects the best solution by means of voting. It is an ensemble method which is better than a single decision tree because it reduces over-fitting by averaging the results.

Working of Random Forest Algorithm

We can understand the working of the Random Forest algorithm with the help of the following steps −

Step 1 − First, start with the selection of random samples from a given dataset.

Step 2 − Next, this algorithm will construct a decision tree for every sample. Then it will get the prediction result from every decision tree.

Step 3 − In this step, voting will be performed for every predicted result.

Step 4 − At last, select the most voted prediction result as the final prediction result.

Implementation in Python

First, start with importing the necessary Python packages −

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Next, download the iris dataset from its weblink as follows −

path = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

Next, we need to assign column names to the dataset as follows −

headernames = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']

Now, we need to read the dataset into a pandas dataframe as follows −

dataset = pd.read_csv(path, names=headernames)
dataset.head()

   sepal-length  sepal-width  petal-length  petal-width        Class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa

Data preprocessing will be done with the help of the following script lines −

X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values

Next, we will divide the data into train and test splits. The following code will split the dataset into 70% training data and 30% testing data −

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

Next, train the model with the help of the RandomForestClassifier class of sklearn as follows −

from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=50)
classifier.fit(X_train, y_train)

At last, we need to make predictions.
It can be done with the help of the following script −

y_pred = classifier.predict(X_test)

Next, print the results as follows −

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
result = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(result)
result1 = classification_report(y_test, y_pred)
print("Classification Report:")
print(result1)
result2 = accuracy_score(y_test, y_pred)
print("Accuracy:", result2)

Output

Confusion Matrix:
[[14  0  0]
 [ 0 18  1]
 [ 0  0 12]]

Classification Report:
                 precision    recall  f1-score   support
    Iris-setosa       1.00      1.00      1.00        14
Iris-versicolor       1.00      0.95      0.97        19
 Iris-virginica       0.92      1.00      0.96        12

      micro avg       0.98      0.98      0.98        45
      macro avg       0.97      0.98      0.98        45
   weighted avg       0.98      0.98      0.98        45

Accuracy: 0.9777777777777777

Pros and Cons of Random Forest

Pros

The following are the advantages of the Random Forest algorithm −

It overcomes the problem of overfitting by averaging or combining the results of different decision trees.
Random forests work well for a larger range of data items than a single decision tree does.
A random forest has less variance than a single decision tree.
Random forests are very flexible and possess very high accuracy.
Scaling of data is not required in the random forest algorithm. It maintains good accuracy even when data is provided without scaling.

Cons

The following are the disadvantages of the Random Forest algorithm −

Complexity is the main disadvantage of Random Forest algorithms.
Construction of random forests is much harder and more time-consuming than that of decision trees.
More computational resources are required to implement the Random Forest algorithm.
It is less intuitive when we have a large collection of decision trees.
The prediction process using random forests is very time-consuming in comparison with other algorithms.
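One practical way to address the interpretability concerns listed above is to inspect the feature importances that the trained forest exposes. A minimal sketch reusing the classifier and headernames from the example above; the printout format is illustrative −

# Inspect how much each feature contributed to the forest's splits
for name, importance in zip(headernames[:-1], classifier.feature_importances_):
    print(name, round(importance, 3))

For the Iris data, this typically shows the petal measurements carrying most of the predictive weight, which gives some insight into what the otherwise opaque ensemble has learned.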
Machine Learning – Data Loading

Suppose you want to start an ML project: what is the first and most important thing you would require? It is the data that we need to load before starting any ML project. In machine learning, data loading refers to the process of importing or reading data from external sources and converting it into a format that can be used by the machine learning algorithm. The data is then preprocessed to remove any inconsistencies, missing values, or outliers. Once the data is preprocessed, it is split into training and testing sets, which are then used for model training and evaluation.

The data can come from various sources such as CSV files, databases, web APIs, cloud storage, etc. The most common file format for machine learning projects is CSV (Comma Separated Values).

Considerations While Loading CSV Data

CSV is a plain text format that stores tabular data, where each row represents a record and each column represents a field or attribute. It is widely used because it is simple, lightweight, and can be easily read and processed by programming languages such as Python, R, and Java.

In Python, we can load CSV data into ML projects in different ways, but before loading CSV data we must take some considerations into account. In this chapter, let's understand the main parts of a CSV file, how they might affect the loading and analysis of data, and the considerations we should keep in mind before loading CSV data into ML projects.

File Header

This is the first row of the CSV file, and it typically contains the names of the columns in the table. When loading CSV data into an ML project, the file header (also known as column headers or variable names) can play an important role in data analysis and model training. Here are some considerations to keep in mind regarding the file header −

Consistency − The header row should be consistent across the entire CSV file. This means that the number of columns and their names should be the same for each row. Inconsistencies can cause issues with parsing and analysis.

Meaningful names − Column names should be meaningful and descriptive. This can help with understanding the data and building more accurate models. Avoid using generic names like "column1", "column2", etc.

Case sensitivity − Depending on the tool or library being used to load the CSV file, the column names may be case sensitive. It's important to ensure that the case of the header row matches the expected case sensitivity of the tool or library being used.

Special characters − Column names should not contain any special characters, such as spaces, commas, or quotation marks. These characters can cause issues with parsing and analysis. Instead, use underscores or camelCase to separate words.

Missing header − If the CSV file does not have a header row, it's important to specify the column names manually or provide a separate file or documentation that includes the column names.

Encoding − The encoding of the header row can affect its interpretation when loading the CSV file. It's important to ensure that the encoding of the header row is compatible with the tool or library being used to read the file.

Comments

These are optional lines that begin with a specified character, such as "#" or "//", and are ignored by most programs that read CSV files. They can be used to provide additional information or context about the data in the file.
Comments in a CSV file are not typically used to represent data that would be used in a machine learning project. However, if comments are present in a CSV file, it's important to consider how they might affect the loading and analysis of the data. Here are some considerations −

Comment markers − In a CSV file, comments can be indicated using a specific marker, such as "#" or "//". It's important to know which marker is being used, so that the loading process can ignore comments properly.

Placement − Comments should be placed on a separate line from the actual data. If a comment is included on a line with actual data, it may cause issues with parsing and analysis.

Consistency − If comments are used in a CSV file, it's important to ensure that the comment marker is used consistently throughout the entire file. Inconsistencies can cause issues with parsing and analysis.

Handling comments − Depending on the tool or library being used to load the CSV file, comments may be ignored by default or may require a specific parameter to be set. It's important to understand how comments are handled by the tool or library being used.

Effect on analysis − If comments contain important information about the data, it may be necessary to process them separately from the data itself. This can add complexity to the loading and analysis process.

Delimiter

This is the character that separates the fields in each row. While the name suggests that a comma is used as the delimiter, other characters such as tabs, semicolons, or pipes can also be used depending on the file. The delimiter used in a CSV file can significantly affect the accuracy and performance of a machine learning model, so it is important to consider the following while loading data into an ML project −

Delimiter choice − The delimiter used in a CSV file should be carefully chosen based on the data being used. For example, if the data contains commas within the values (e.g. "New York, NY"), then using a comma as a delimiter may cause issues. In this case, a different delimiter, such as a tab or semicolon, may be more appropriate. A short loading sketch covering these considerations is shown below.
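The header, comment, and delimiter considerations above map directly onto parameters of pandas' read_csv. Here is a hedged sketch; the file name and column names are hypothetical −

import pandas as pd

# header=None tells pandas the file has no header row, and names= supplies
# column names manually; comment='#' skips comment lines; sep=';' handles a
# semicolon-delimited file; encoding= makes the assumed encoding explicit
data = pd.read_csv(
    "data.csv",
    header=None,
    names=["feature_1", "feature_2", "target"],
    comment="#",
    sep=";",
    encoding="utf-8",
)
print(data.head())

If the file does have a header row, omitting header and names lets pandas infer the column names from the first line instead.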
Machine Learning – Feature Selection

Feature selection is an important step in machine learning that involves selecting a subset of the available features to improve the performance of the model. The following are some commonly used feature selection techniques −

Filter Methods

This method involves evaluating the relevance of each feature by calculating a statistical measure (e.g., correlation, mutual information, chi-square, etc.) and ranking the features based on their scores. Features that have low scores are then removed from the model. To implement filter methods in Python, you can use the SelectKBest or SelectPercentile classes from the sklearn.feature_selection module. Below is a small code snippet implementing filter-based feature selection −

from sklearn.feature_selection import SelectPercentile, chi2

selector = SelectPercentile(chi2, percentile=10)
X_new = selector.fit_transform(X, y)

Wrapper Methods

This method involves evaluating the model's performance by adding or removing features and selecting the subset of features that yields the best performance. This approach is computationally expensive, but it is more accurate than filter methods. To implement wrapper methods in Python, you can use the RFE (Recursive Feature Elimination) class from the sklearn.feature_selection module. Below is a small code snippet implementing the wrapper method −

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

estimator = LogisticRegression()
selector = RFE(estimator, n_features_to_select=5)
selector = selector.fit(X, y)
X_new = selector.transform(X)

Embedded Methods

This method involves incorporating feature selection into the model building process itself. This can be done using techniques such as Lasso regression, Ridge regression, or Decision Trees. These methods assign weights to each feature, and features with low weights are removed from the model. To implement embedded methods in Python, you can use the Lasso or Ridge regression classes from the sklearn.linear_model module. Below is a small code snippet implementing embedded methods −

from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.1)
lasso.fit(X, y)
coef = pd.Series(lasso.coef_, index = X.columns)
important_features = coef[coef != 0]

Principal Component Analysis (PCA)

This is a type of unsupervised learning method that involves transforming the original features into a set of uncorrelated principal components that explain the maximum variance in the data. The number of principal components can be selected based on a threshold value, which can reduce the dimensionality of the dataset. To implement PCA in Python, you can use the PCA class from the sklearn.decomposition module. For example, to reduce the number of features you can use PCA as given in the following code −

from sklearn.decomposition import PCA

pca = PCA(n_components=3)
X_new = pca.fit_transform(X)

Recursive Feature Elimination (RFE)

This method involves recursively eliminating the least significant features until a subset of the most important features is identified. It uses a model-based approach and can be computationally expensive, but it can yield good results in high-dimensional datasets. To implement RFE in Python, you can use the RFECV (Recursive Feature Elimination with Cross-Validation) class from the sklearn.feature_selection module.
For example, below is a small code snippet with the help of which we can use Recursive Feature Elimination −

from sklearn.feature_selection import RFECV
from sklearn.tree import DecisionTreeClassifier

estimator = DecisionTreeClassifier()
selector = RFECV(estimator, step=1, cv=5)
selector = selector.fit(X, y)
X_new = selector.transform(X)

These feature selection techniques can be used alone or in combination to improve the performance of machine learning models. It is important to choose the appropriate technique based on the size of the dataset, the nature of the features, and the type of model being used.

Example

In the example below, we will implement three feature selection methods − univariate feature selection using the chi-square test, recursive feature elimination with cross-validation (RFECV), and principal component analysis (PCA). We will use the Pima Indians Diabetes dataset, loaded from a CSV file. This dataset contains 768 samples with 8 features, and the task is to classify whether a person has diabetes based on these features.

Here is the Python code to implement these feature selection methods on the diabetes dataset −

# Import necessary libraries
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2, RFECV
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

# Load the dataset
diabetes = pd.read_csv(r"C:\Users\Leekha\Desktop\diabetes.csv")

# Split the dataset into features and target variable
X = diabetes.drop('Outcome', axis=1)
y = diabetes['Outcome']

# Apply univariate feature selection using the chi-square test
selector = SelectKBest(chi2, k=4)
X_new = selector.fit_transform(X, y)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size=0.3, random_state=42)

# Fit a logistic regression model on the selected features
clf = LogisticRegression()
clf.fit(X_train, y_train)

# Evaluate the model on the test set
accuracy = clf.score(X_test, y_test)
print("Accuracy using univariate feature selection: {:.2f}".format(accuracy))

# Recursive feature elimination with cross-validation (RFECV)
estimator = LogisticRegression()
selector = RFECV(estimator, step=1, cv=5)
selector.fit(X, y)
X_new = selector.transform(X)
scores = cross_val_score(LogisticRegression(), X_new, y, cv=5)
print("Accuracy using RFECV feature selection: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

# PCA implementation
pca = PCA(n_components=5)
X_new = pca.fit_transform(X)
scores = cross_val_score(LogisticRegression(), X_new, y, cv=5)
print("Accuracy using PCA feature selection: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Output

When you execute this code, it will produce the following output on the terminal −

Accuracy using univariate feature selection: 0.74
Accuracy using RFECV feature selection: 0.77 (+/- 0.03)
Accuracy using PCA feature selection: 0.75 (+/- 0.07)
Machine Learning – Data Visualization

Data visualization is an important aspect of machine learning (ML), as it helps to analyze and communicate patterns, trends, and insights in the data. Data visualization involves creating graphical representations of the data, which can help to identify patterns and relationships that may not be apparent from the raw data.

Here are some of the ways data visualization is used in machine learning −

Exploring Data − Data visualization is an essential tool for exploring and understanding data. Visualization can help to identify patterns, correlations, and outliers, and can also help to detect data quality issues such as missing values and inconsistencies.

Feature Selection − Data visualization can help to select relevant features for the ML model. By visualizing the data and its relationship with the target variable, you can identify features that are strongly correlated with the target variable and exclude irrelevant features that have little predictive power.

Model Evaluation − Data visualization can be used to evaluate the performance of the ML model. Visualization techniques such as ROC curves, precision-recall curves, and confusion matrices can help to understand the accuracy, precision, recall, and F1 score of the model.

Communicating Insights − Data visualization is an effective way to communicate insights and results to stakeholders who may not have a technical background. Visualizations such as scatter plots, line charts, and bar charts can help to convey complex information in an easily understandable format.

Some popular libraries used for data visualization in Python include Matplotlib, Seaborn, Plotly, and Bokeh. These libraries provide a wide range of visualization techniques and customization options to suit different needs and preferences.

Univariate Plots: Understanding Attributes Independently

The simplest type of visualization is single-variable or "univariate" visualization. With the help of univariate visualization, we can understand each attribute of our dataset independently. The following are some techniques in Python to implement univariate visualization −

Histograms
Density Plots
Box and Whisker Plots

Multivariate Plots: Interaction Among Multiple Variables

Another type of visualization is multi-variable or "multivariate" visualization. With the help of multivariate visualization, we can understand the interaction between multiple attributes of our dataset. The following are some techniques in Python to implement multivariate visualization −

Correlation Matrix Plot
Scatter Matrix Plot

In the next few chapters, we will look at some of the popular and widely used visualization techniques available in machine learning.
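As a quick preview of the univariate and multivariate techniques listed above, here is a hedged sketch using the Iris dataset. The dataset is chosen only for illustration, and the sketch assumes scikit-learn 0.23 or newer for the as_frame option −

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris

# Load the Iris data as a pandas DataFrame (4 feature columns + target)
iris = load_iris(as_frame=True)
df = iris.frame

# Univariate: a histogram of each attribute, viewed independently
df.iloc[:, :4].hist(figsize=(8, 6))
plt.show()

# Multivariate: a scatter matrix showing pairwise interactions between attributes
pd.plotting.scatter_matrix(df.iloc[:, :4], figsize=(8, 8))
plt.show()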