machine Learning With Python Archives - Donotsad where can learn any thing work project and make money

Aug 09

Performance Metrics

Machine Learning – Performance Metrics

Performance metrics in machine learning are used to evaluate the performance of a machine learning model. They provide quantitative measures of how well a model is performing and make it possible to compare different models. They also tell us whether a model meets our requirements, so that we can make an informed decision about whether to use it.

Many performance metrics are available, and the right choice depends on the type of problem being solved and its specific requirements. Some common performance metrics include −

Accuracy − Accuracy is one of the most basic performance metrics and measures the proportion of correctly classified instances in the dataset. It is calculated as the number of correctly classified instances divided by the total number of instances in the dataset.

Precision − Precision measures the proportion of true positive instances out of all predicted positive instances. It is calculated as the number of true positive instances divided by the sum of true positive and false positive instances.

Recall − Recall measures the proportion of true positive instances out of all actual positive instances. It is calculated as the number of true positive instances divided by the sum of true positive and false negative instances.

F1 Score − F1 score is the harmonic mean of precision and recall, so it balances the two. It is calculated as 2 * (precision × recall) / (precision + recall).

ROC AUC Score − ROC AUC (Receiver Operating Characteristic Area Under the Curve) score measures the ability of a classifier to distinguish between positive and negative instances. It is calculated by plotting the true positive rate against the false positive rate at different classification thresholds and calculating the area under the curve.

Confusion Matrix − A confusion matrix is a table that is used to evaluate the performance of a classification model. It shows the number of true positives, true negatives, false positives, and false negatives for each class in the dataset.

Example

Here is an example code snippet that calculates the accuracy, precision, recall, and F1 score. The Iris dataset has three classes, so this is a multi-class problem and precision, recall, and F1 are macro-averaged over the classes −

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

    # Load the iris dataset
    iris = load_iris()
    X = iris.data
    y = iris.target

    # Split the dataset into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Train a logistic regression model on the training set
    model = LogisticRegression()
    model.fit(X_train, y_train)

    # Make predictions on the test set
    y_pred = model.predict(X_test)

    # Compute performance metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average="macro")
    recall = recall_score(y_test, y_pred, average="macro")
    f1 = f1_score(y_test, y_pred, average="macro")

    # Print the performance metrics
    print("Accuracy:", accuracy)
    print("Precision:", precision)
    print("Recall:", recall)
    print("F1 Score:", f1)

Output

When you execute this code, it will produce the following output −

    Accuracy: 1.0
    Precision: 1.0
    Recall: 1.0
    F1 Score: 1.0
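The definitions of accuracy, precision, recall, and F1 score given above can also be checked by hand from the four confusion-matrix counts. The following is a minimal sketch in plain Python; the label lists `y_true` and `y_pred` are made-up values used only for illustration, with 1 as the positive class.

```python
# Hypothetical binary labels, for illustration only (1 = positive, 0 = negative).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))

# The four cells of a binary confusion matrix.
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (precision * recall) / (precision + recall)

print("TP, TN, FP, FN:", tp, tn, fp, fn)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
```

With these labels the counts are 3 true positives, 5 true negatives, 1 false positive, and 1 false negative, so accuracy is 0.8 while precision, recall, and F1 are all 0.75.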

Aug 09

Overview

Clustering Algorithms – Overview

Introduction to Clustering

Clustering methods are among the most useful unsupervised ML methods. They find similarity and relationship patterns among data samples and then group those samples into clusters whose members have similar features.

Clustering is important because it determines the intrinsic grouping among the present unlabeled data. Clustering methods make some assumptions about data points to define their similarity, and each assumption produces different but equally valid clusters.

[Figure: a clustering system grouping similar data points into different clusters]

Cluster Formation Methods

Clusters are not necessarily spherical. The following are some other cluster formation methods −

Density-based − In these methods, the clusters are formed as dense regions. The advantage of these methods is that they have good accuracy as well as a good ability to merge two clusters. Ex. Density-Based Spatial Clustering of Applications with Noise (DBSCAN), Ordering Points To Identify the Clustering Structure (OPTICS) etc.

Hierarchical-based − In these methods, the clusters are formed as a tree-type structure based on the hierarchy. They fall into two categories, namely Agglomerative (bottom-up approach) and Divisive (top-down approach). Ex. Clustering Using Representatives (CURE), Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) etc.

Partitioning − In these methods, the clusters are formed by partitioning the objects into k clusters. The number of clusters is equal to the number of partitions. Ex. K-means, Clustering Large Applications based upon Randomized Search (CLARANS).

Grid − In these methods, the clusters are formed as a grid-like structure. The advantage of these methods is that all the clustering operations done on these grids are fast and independent of the number of data objects. Ex. Statistical Information Grid (STING), Clustering in Quest (CLIQUE).

Measuring Clustering Performance

One of the most important considerations for any ML model is assessing its performance, that is, the model's quality. For supervised learning algorithms, assessing quality is straightforward because we already have labels for every example. For unsupervised learning algorithms it is harder, because we deal with unlabeled data. Still, some metrics give the practitioner insight into how the clusters change depending on the algorithm.

Before we dive into such metrics, we must understand that they only evaluate the comparative performance of models against each other; they do not measure the validity of a model's predictions. The following are some of the metrics that we can use to measure the quality of a clustering model −

Silhouette Analysis

Silhouette analysis is used to check the quality of a clustering model by measuring the distance between the clusters. It gives us a way to assess parameters such as the number of clusters with the help of the Silhouette score. This score measures how close each point in one cluster is to points in the neighboring clusters.

Analysis of Silhouette Score

The range of the Silhouette score is [-1, 1]. Its analysis is as follows −

+1 Score − A Silhouette score near +1 indicates that the sample is far away from its neighboring cluster.

0 Score − A Silhouette score of 0 indicates that the sample is on or very close to the decision boundary separating two neighboring clusters.

-1 Score − A Silhouette score near -1 indicates that the samples have been assigned to the wrong clusters.

The Silhouette score can be calculated with the following formula −

$$\text{silhouette score}=\frac{p-q}{\max(p,q)}$$

Here, p = mean distance to the points in the nearest cluster

And, q = mean intra-cluster distance to all the points.

Davies-Bouldin Index

The DB index is another good metric for analyzing clustering algorithms. With the help of the DB index, we can understand the following points about a clustering model −

Whether the clusters are well-spaced from each other or not?

How dense the clusters are?

We can calculate the DB index with the help of the following formula −

$$DB=\frac{1}{n}\sum_{i=1}^{n}\max_{j\neq i}\left(\frac{\sigma_{i}+\sigma_{j}}{d(c_{i},c_{j})}\right)$$

Here, n = number of clusters

σi = average distance of all points in cluster i from the cluster centroid ci

d(ci, cj) = distance between the centroids of clusters i and j.

The lower the DB index, the better the clustering model.

Dunn Index

It works the same way as the DB index, but the two differ on the following points −

The Dunn index considers only the worst case, i.e. the clusters that are close together, while the DB index considers the dispersion and separation of all the clusters in the clustering model.

The Dunn index increases as the performance increases, while the DB index decreases as the clusters become more well-spaced and dense.

We can calculate the Dunn index with the help of the following formula −

$$D=\frac{\min_{1\leq i<j\leq n}p(i,j)}{\max_{1\leq k\leq n}q(k)}$$

Here, i, j, k = indices of the clusters

p = inter-cluster distance

q = intra-cluster distance

Types of ML Clustering Algorithms

The following are the most important and useful ML clustering algorithms −

K-means Clustering

This clustering algorithm computes the centroids and iterates until it finds the optimal centroids. It assumes that the number of clusters is already known. It is also called a flat clustering algorithm. The number of clusters identified from the data by the algorithm is represented by 'K' in K-means.

Mean-Shift Algorithm

It is another powerful clustering algorithm used in unsupervised learning. Unlike K-means clustering, it does not assume the number of clusters in advance; hence it is a non-parametric algorithm.

Hierarchical Clustering

It is another unsupervised learning algorithm that is used to group together the unlabeled data points having similar characteristics.

We will be discussing all these algorithms in detail in the upcoming chapters.
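To make the Silhouette formula from this overview concrete, here is a minimal sketch in plain Python (3.8 or later, for `math.dist`). The function name `silhouette_score`, the sample points, and the two labelings are illustrative assumptions, not part of the original tutorial. The sketch also assumes at least two clusters and no fully duplicated data, and it uses the common convention that a point alone in its cluster scores 0.

```python
import math

def silhouette_score(points, labels):
    """Mean of (p - q) / max(p, q) over all samples."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention: singleton cluster scores 0
            continue
        # q: mean distance to the other points of the same cluster
        # (the point's distance to itself is 0, so divide by len - 1)
        q = sum(math.dist(point, other) for other in own) / (len(own) - 1)
        # p: mean distance to the points of the nearest other cluster
        p = min(
            sum(math.dist(point, other) for other in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        scores.append((p - q) / max(p, q))
    return sum(scores) / len(scores)

# Hypothetical data: two tight, well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = silhouette_score(points, [0, 0, 0, 1, 1, 1])  # labels follow the groups
bad = silhouette_score(points, [0, 1, 0, 1, 0, 1])   # labels mix the groups

print("good labeling:", round(good, 3))
print("bad labeling:", round(bad, 3))
```

The labeling that follows the two groups scores close to +1, while the labeling that mixes them scores much lower, which matches the interpretation of the score given above.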

Aug 09

K-means Algorithm

Clustering Algorithms – K-means Algorithm

Introduction to K-Means Algorithm

The K-means clustering algorithm computes the centroids and iterates until it finds the optimal centroids. It assumes that the number of clusters is already known. It is also called a flat clustering algorithm. The number of clusters identified from the data by the algorithm is represented by 'K' in K-means.

In this algorithm, the data points are assigned to clusters in such a manner that the sum of the squared distances between the data points and their centroid is minimized. Less variation within a cluster means more similar data points within that cluster.

Working of K-Means Algorithm

We can understand the working of the K-Means clustering algorithm with the help of the following steps −

Step 1 − First, we need to specify the number of clusters, K, to be generated by this algorithm.

Step 2 − Next, randomly select K data points as the initial centroids and assign each data point to the cluster of its closest centroid.

Step 3 − Now it will compute the cluster centroids.

Step 4 − Next, keep iterating the following until we find the optimal centroids, that is, until the assignment of data points to clusters no longer changes −

4.1 − First, the sum of squared distances between data points and centroids is computed.

4.2 − Now, we have to assign each data point to the cluster whose centroid is closest.

4.3 − At last, compute the centroids for the clusters by taking the average of all data points of each cluster.

K-means follows the Expectation-Maximization approach to solve the problem. The Expectation step assigns the data points to the closest cluster, and the Maximization step computes the centroid of each cluster.
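The steps above can be sketched in a few lines of plain Python (3.8 or later, for `math.dist`). This is an illustrative sketch, not the scikit-learn implementation: the function name `kmeans`, its parameters, and the sample points are assumptions made for the example. The initial centroids are drawn at random from the data, and an empty cluster simply keeps its previous centroid.

```python
import math
import random

def kmeans(points, k, n_iter=100, seed=0):
    """Plain K-means: returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # Step 2: K random data points as centroids
    labels = None
    for _ in range(n_iter):
        # Expectation step: assign every point to its closest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
        # Maximization step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids

# Hypothetical data: two small, well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2)

print("labels:", labels)
print("centroids:", sorted(centroids))
```

Which group ends up as cluster 0 depends on the random initialization, so only the grouping itself is meaningful, not the label numbers.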
While working with the K-means algorithm, we need to take care of the following things −

While working with clustering algorithms including K-Means, it is recommended to standardize the data because such algorithms use distance-based measurement to determine the similarity between data points.

Due to the iterative nature of K-Means and the random initialization of centroids, K-Means may get stuck in a local optimum and may not converge to the global optimum. That is why it is recommended to try several different initializations of the centroids.

Implementation in Python

The following two examples of implementing the K-Means clustering algorithm will help us understand it better −

Example 1

It is a simple example to understand how k-means works. In this example, we first generate a 2D dataset containing 4 different blobs and then apply the k-means algorithm to see the result.

First, we will start by importing the necessary packages −

    # "%matplotlib inline" is a Jupyter notebook magic; omit it in a plain script
    %matplotlib inline
    import matplotlib.pyplot as plt
    import seaborn as sns; sns.set()
    import numpy as np
    from sklearn.cluster import KMeans

The following code will generate the 2D dataset containing four blobs −

    # make_blobs is imported from sklearn.datasets directly
    # (the old sklearn.datasets.samples_generator module has been removed)
    from sklearn.datasets import make_blobs
    X, y_true = make_blobs(n_samples=400, centers=4, cluster_std=0.60, random_state=0)

Next, the following code will help us to visualize the dataset −

    plt.scatter(X[:, 0], X[:, 1], s=20)
    plt.show()

Next, make an object of KMeans, providing the number of clusters, then train the model and do the prediction as follows −

    kmeans = KMeans(n_clusters=4)
    kmeans.fit(X)
    y_kmeans = kmeans.predict(X)

Now, with the help of the following code we can plot and visualize the cluster centers picked by the k-means Python estimator −

    plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=20, cmap="summer")
    centers = kmeans.cluster_centers_
    plt.scatter(centers[:, 0], centers[:, 1], c="blue", s=100, alpha=0.9)
    plt.show()

Example 2

Let us move to another example in which we are going to apply K-means clustering on the simple digits dataset. K-means will try to identify similar digits without using the original label information.

First, we will start by importing the necessary packages −

    %matplotlib inline
    import matplotlib.pyplot as plt
    import seaborn as sns; sns.set()
    import numpy as np
    from sklearn.cluster import KMeans

Next, load the digits dataset from sklearn and make an object of it. We can also find the number of rows and columns in this dataset as follows −

    from sklearn.datasets import load_digits
    digits = load_digits()
    digits.data.shape

Output

    (1797, 64)

The above output shows that this dataset has 1797 samples with 64 features.

We can perform the clustering as we did in Example 1 above −

    kmeans = KMeans(n_clusters=10, random_state=0)
    clusters = kmeans.fit_predict(digits.data)
    kmeans.cluster_centers_.shape

Output

    (10, 64)

The above output shows that K-means created 10 clusters with 64 features.

    fig, ax = plt.subplots(2, 5, figsize=(8, 3))
    centers = kmeans.cluster_centers_.reshape(10, 8, 8)
    for axi, center in zip(ax.flat, centers):
        axi.set(xticks=[], yticks=[])
        axi.imshow(center, interpolation="nearest", cmap=plt.cm.binary)

Output

[Figure: the ten cluster centers learned by k-means, shown as 8×8 images]

The following lines of code will match the learned cluster labels with the true labels found in them −

    from scipy.stats import mode
    labels = np.zeros_like(clusters)
    for i in range(10):
        mask = (clusters == i)
        # assign each cluster the most common true digit among its members
        labels[mask] = mode(digits.target[mask])[0]

Next, we can check the accuracy as follows −

    from sklearn.metrics import accuracy_score
    accuracy_score(digits.target, labels)

Output

    0.7935447968836951

The above output shows that the accuracy is about 79%.

Advantages and Disadvantages

Advantages

The following are some advantages of K-Means clustering algorithms −

It is very easy to understand and implement.

If we have a large number of variables, K-means would be faster than Hierarchical clustering.

On re-computation of centroids, an instance can change its cluster.

Tighter clusters are formed with K-means as compared to Hierarchical clustering.

Disadvantages

The following are some disadvantages of K-Means clustering algorithms −

Aug 09

Hierarchical Clustering

Clustering Algorithms – Hierarchical Clustering

Introduction to Hierarchical Clustering

Hierarchical clustering is another unsupervised learning algorithm that is used to group together the unlabeled data points having similar characteristics. Hierarchical clustering algorithms fall into the following two categories −

Agglomerative hierarchical algorithms − In agglomerative hierarchical algorithms, each data point is treated as a single cluster, and pairs of clusters are then successively merged or agglomerated (bottom-up approach). The hierarchy of the clusters is represented as a dendrogram or tree structure.

Divisive hierarchical algorithms − On the other hand, in divisive hierarchical algorithms, all the data points are treated as one big cluster and the process of clustering involves dividing (top-down approach) the one big cluster into various small clusters.

Steps to Perform Agglomerative Hierarchical Clustering

We are going to explain the most used and important form of hierarchical clustering, i.e. agglomerative. The steps to perform it are as follows −

Step 1 − Treat each data point as a single cluster. Hence, we will have, say, K clusters at the start. The number of data points will also be K at the start.

Step 2 − Now, in this step we need to form a bigger cluster by joining the two closest data points. This will result in a total of K-1 clusters.

Step 3 − Now, to form more clusters we need to join the two closest clusters. This will result in a total of K-2 clusters.

Step 4 − Now, to form one big cluster, repeat the above steps until only one cluster remains, i.e. there are no more clusters left to join.

Step 5 − At last, after making one single big cluster, dendrograms will be used to divide it into multiple clusters depending upon the problem.

Role of Dendrograms in Agglomerative Hierarchical Clustering

As we discussed in the last step, the role of the dendrogram starts once the big cluster is formed. The dendrogram is used to split the big cluster into multiple clusters of related data points depending upon our problem. It can be understood with the help of the following example −

Example 1

To understand, let us start with importing the required libraries as follows −

    # "%matplotlib inline" is a Jupyter notebook magic; omit it in a plain script
    %matplotlib inline
    import matplotlib.pyplot as plt
    import numpy as np

Next, we will be plotting the data points we have taken for this example −

    X = np.array([[7,8],[12,20],[17,19],[26,15],[32,37],[87,75],[73,85],
                  [62,80],[73,60],[87,96]])
    labels = range(1, 11)
    plt.figure(figsize=(10, 7))
    plt.subplots_adjust(bottom=0.1)
    plt.scatter(X[:,0], X[:,1], label="True Position")
    for label, x, y in zip(labels, X[:, 0], X[:, 1]):
        plt.annotate(label, xy=(x, y), xytext=(-3, 3), textcoords="offset points",
                     ha="right", va="bottom")
    plt.show()

From the resulting scatter plot, it is very easy to see that we have two clusters in our data points, but in real-world data there can be thousands of clusters. Next, we will be plotting the dendrogram of our data points by using the Scipy library −

    from scipy.cluster.hierarchy import dendrogram, linkage
    from matplotlib import pyplot as plt
    linked = linkage(X, "single")
    labelList = range(1, 11)
    plt.figure(figsize=(10, 7))
    dendrogram(linked, orientation="top", labels=labelList,
               distance_sort="descending", show_leaf_counts=True)
    plt.show()

Now, once the big cluster is formed, the longest vertical distance is selected. A horizontal line is then drawn through it.

[Figure: dendrogram with a horizontal line drawn through the longest vertical distance]

As the horizontal line crosses the blue line at two points, the number of clusters would be two.

Next, we need to import the class for clustering and call its fit_predict method to predict the cluster. We are importing the AgglomerativeClustering class of the sklearn.cluster library −

    from sklearn.cluster import AgglomerativeClustering
    # Ward linkage uses Euclidean distance, so no distance argument is passed
    # (the former affinity="euclidean" parameter was removed from recent scikit-learn)
    cluster = AgglomerativeClustering(n_clusters=2, linkage="ward")
    cluster.fit_predict(X)

Next, plot the clusters with the help of the following code −

    plt.scatter(X[:,0], X[:,1], c=cluster.labels_, cmap="rainbow")

The resulting plot shows the two clusters from our data points.

Example 2

As we understood the concept of dendrograms from the simple example discussed above, let us move to another example in which we create clusters of the data points in the Pima Indian Diabetes Dataset by using hierarchical clustering −

    import matplotlib.pyplot as plt
    import pandas as pd
    %matplotlib inline
    import numpy as np
    from pandas import read_csv
    path = r"C:\pima-indians-diabetes.csv"
    headernames = ["preg", "plas", "pres", "skin", "test", "mass", "pedi", "age", "class"]
    data = read_csv(path, names=headernames)
    array = data.values
    X = array[:,0:8]
    Y = array[:,8]
    data.shape

    (768, 9)

    data.head()

       preg  plas  pres  skin  test  mass   pedi  age  class
    0     6   148    72    35     0  33.6  0.627   50      1
    1     1    85    66    29     0  26.6  0.351   31      0
    2     8   183    64     0     0  23.3  0.672   32      1
    3     1    89    66    23    94  28.1  0.167   21      0
    4     0   137    40    35   168  43.1  2.288   33      1

    # Use the "skin" and "test" columns for clustering
    patient_data = data.iloc[:, 3:5].values
    import scipy.cluster.hierarchy as shc
    plt.figure(figsize=(10, 7))
    plt.title("Patient Dendrograms")
    # Build the dendrogram on the same columns that are clustered below
    dend = shc.dendrogram(shc.linkage(patient_data, method="ward"))

    from sklearn.cluster import AgglomerativeClustering
    cluster = AgglomerativeClustering(n_clusters=4, linkage="ward")
    cluster.fit_predict(patient_data)
    plt.figure(figsize=(10, 7))
    plt.scatter(patient_data[:,0], patient_data[:,1], c=cluster.labels_, cmap="rainbow")
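The agglomerative steps described in this chapter can also be written out directly. The following is a minimal sketch in plain Python (3.8 or later, for `math.dist`), not the SciPy or scikit-learn implementation. The function name `agglomerative` is an assumption made for the example, and it uses single linkage (the distance between two clusters is the distance between their closest members), which matches the `linkage(X, "single")` dendrogram call but not the Ward linkage used with `AgglomerativeClustering`. The points are the ten data points of Example 1.

```python
import math

def agglomerative(points, n_clusters):
    """Single-linkage agglomerative clustering; returns clusters as lists of point indices."""
    clusters = [[i] for i in range(len(points))]  # Step 1: every point is its own cluster
    while len(clusters) > n_clusters:
        best = None
        # Find the pair of clusters whose closest members are nearest to each other.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(
                    math.dist(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])  # Steps 2-3: merge the two closest clusters
        del clusters[b]
    return [sorted(c) for c in clusters]

points = [(7, 8), (12, 20), (17, 19), (26, 15), (32, 37),
          (87, 75), (73, 85), (62, 80), (73, 60), (87, 96)]

result = sorted(agglomerative(points, n_clusters=2))
print(result)
```

Stopping the merging at two clusters plays the role of cutting the dendrogram at the horizontal line: the first five points end up in one cluster and the last five in the other. This brute-force version recomputes all pairwise distances on every merge, so it is only suitable for small datasets.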

Aug 09

Machine Learning with Python – Resources

Machine Learning With Python – Useful Resources

The following resources contain additional information on Machine Learning With Python. Please use them to get more in-depth knowledge on this.

Useful Video Courses

Machine Learning with Python: The Complete Course (Best Seller) − 63 Lectures, 11 hours, TELCOMA Global

Machine Learning Course with Python – All-in-One Bootcamp (Best Seller) − 16 Lectures, 14.5 hours, GreyCampus Inc.

Fundamentals of Machine Learning With Python By Spotle.ai (Featured) − 33 Lectures, 3.5 hours, Spotle Learn

Machine Learning A-Z with Python with Project (Beginner) (Featured) − 40 Lectures, 13.5 hours, Selfcode Academy

Complete Machine Learning In Python With Projects By Spotle.ai − 59 Lectures, 6 hours, Spotle Learn

Supervised Machine Learning With Python By Spotle.ai − 45 Lectures, 4 hours, Spotle Learn

Aug 09

Automatic Workflows

Machine Learning – Automatic Workflows

Introduction

In order to execute and produce results successfully, a machine learning model must automate some standard workflows. These standard workflows can be automated with the help of Scikit-learn Pipelines. From a data scientist's perspective, the pipeline is a generalized but very important concept. It allows data to flow from its raw format to some useful information.

[Figure: diagram of the blocks of an ML pipeline]

The blocks of ML pipelines are as follows −

Data ingestion − As the name suggests, it is the process of importing the data for use in an ML project. The data can be extracted in real time or in batches from single or multiple systems. It is one of the most challenging steps because the quality of data can affect the whole ML model.

Data Preparation − After importing the data, we need to prepare it for use by our ML model. Data preprocessing is one of the most important techniques of data preparation.

ML Model Training − The next step is to train our ML model. We have various kinds of ML algorithms, such as supervised, unsupervised, and reinforcement learning, to extract features from data and make predictions.

Model Evaluation − Next, we need to evaluate the ML model. In the case of an AutoML pipeline, the ML model can be evaluated with the help of various statistical methods and business rules.

ML Model retraining − In the case of an AutoML pipeline, the first model is not necessarily the best one. The first model is considered a baseline model, and we can train it repeatedly to increase the model's accuracy.

Deployment − At last, we need to deploy the model. This step involves applying and migrating the model to business operations for their use.

Challenges Accompanying ML Pipelines

In order to create ML pipelines, data scientists face many challenges. These challenges fall into the following three categories −

Quality of Data

The success of any ML model depends heavily on the quality of data. If the data we provide to the ML model is not accurate, reliable and robust, then we are going to end up with wrong or misleading output.

Data Reliability

Another challenge associated with ML pipelines is the reliability of the data we provide to the ML model. A data scientist can acquire data from various sources, but to get the best results it must be assured that the data sources are reliable and trusted.

Data Accessibility

To get the best results out of ML pipelines, the data itself must be accessible, which requires consolidation, cleansing and curation of data. As a result of the data accessibility property, metadata will be updated with new tags.

Modelling ML Pipeline and Data Preparation

Data leakage from the training dataset to the testing dataset is an important issue for a data scientist to deal with while preparing data for an ML model. Generally, at the time of data preparation, a data scientist applies techniques like standardization or normalization to the entire dataset before learning. Applied this way, these techniques cause data leakage, because the training dataset is influenced by the scale of the data in the testing dataset.

By using ML pipelines, we can prevent this data leakage because pipelines ensure that data preparation like standardization is constrained to each fold of our cross-validation procedure.

Example

The following is an example in Python that demonstrates a data preparation and model evaluation workflow. For this purpose, we are using the Pima Indian Diabetes dataset, loaded from a CSV file. First, we will create a pipeline that standardizes the data. Then a Linear Discriminant Analysis model will be created, and at last the pipeline will be evaluated using 20-fold cross validation.
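Before turning to the scikit-learn code, the leakage argument can be made concrete with a small library-free sketch. The feature values and the train/test split below are made up purely for illustration. The point is that the mean and standard deviation used for standardization should be computed on the training part only, which is what a pipeline does inside each cross-validation fold.

```python
from statistics import mean, pstdev

# Hypothetical single feature; the last value is held out as the test part.
values = [1.0, 2.0, 3.0, 4.0, 100.0]
train, test = values[:4], values[4:]

# Leakage-free: statistics come from the training part only.
mu, sigma = mean(train), pstdev(train)
train_scaled = [(v - mu) / sigma for v in train]
test_scaled = [(v - mu) / sigma for v in test]

# Leaky: statistics come from the whole dataset, so the held-out value
# shifts the scale that the training data is expressed in.
leaky_mu, leaky_sigma = mean(values), pstdev(values)
leaky_train_scaled = [(v - leaky_mu) / leaky_sigma for v in train]

print("train-only mean:", mu, "whole-data mean:", leaky_mu)
print("leak-free train:", [round(v, 3) for v in train_scaled])
print("leaky train:    ", [round(v, 3) for v in leaky_train_scaled])
```

In the leaky version the training values are all pushed below zero by a test value the model should never have seen, while the leak-free version centers the training values on zero.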
First, import the required packages as follows −

    from pandas import read_csv
    from sklearn.model_selection import KFold
    from sklearn.model_selection import cross_val_score
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import Pipeline
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

Now, we need to load the Pima diabetes dataset as we did in previous examples −

    path = r"C:\pima-indians-diabetes.csv"
    headernames = ["preg", "plas", "pres", "skin", "test", "mass", "pedi", "age", "class"]
    data = read_csv(path, names=headernames)
    array = data.values
    # Separate the eight input features from the class label
    X = array[:,0:8]
    Y = array[:,8]

Next, we will create a pipeline with the help of the following code −

    estimators = []
    estimators.append(("standardize", StandardScaler()))
    estimators.append(("lda", LinearDiscriminantAnalysis()))
    model = Pipeline(estimators)

At last, we are going to evaluate this pipeline and output its accuracy as follows −

    # No shuffling is used, so no random_state is passed
    # (recent scikit-learn rejects random_state when shuffle is False)
    kfold = KFold(n_splits=20)
    results = cross_val_score(model, X, Y, cv=kfold)
    print(results.mean())

Output

    0.7790148448043184

The above output is the mean accuracy of the setup on the dataset.

Modelling ML Pipeline and Feature Extraction

Data leakage can also happen at the feature extraction step of an ML model. That is why feature extraction procedures should also be restricted to the training data, to stop data leakage. As in the case of data preparation, by using ML pipelines we can prevent this data leakage as well. FeatureUnion, a tool provided by ML pipelines, can be used for this purpose.

Example

The following is an example in Python that demonstrates a feature extraction and model evaluation workflow. For this purpose, we are using the Pima Indian Diabetes dataset, loaded from a CSV file.

First, 3 features will be extracted with PCA (Principal Component Analysis). Then, 6 features will be extracted with Statistical Analysis. After feature extraction, the results of the multiple feature selection and extraction procedures will be combined by using the FeatureUnion tool. At last, a Logistic Regression model will be created, and the pipeline will be evaluated using 10-fold cross validation.

First, import the required packages as follows −

    from pandas import read_csv
    from sklearn.model_selection import KFold
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import Pipeline
    from sklearn.pipeline import FeatureUnion
    from sklearn.linear_model import LogisticRegression
    from sklearn.decomposition import

Aug 09

Linear Regression

Regression Algorithms – Linear Regression ”; Previous Next Introduction to Linear Regression Linear regression may be defined as the statistical model that analyzes the linear relationship between a dependent variable with given set of independent variables. Linear relationship between variables means that when the value of one or more independent variables will change (increase or decrease), the value of dependent variable will also change accordingly (increase or decrease). Mathematically the relationship can be represented with the help of following equation − Y = mX + b Here, Y is the dependent variable we are trying to predict X is the dependent variable we are using to make predictions. m is the slop of the regression line which represents the effect X has on Y b is a constant, known as the Y-intercept. If X = 0,Y would be equal to b. Furthermore, the linear relationship can be positive or negative in nature as explained below − Positive Linear Relationship A linear relationship will be called positive if both independent and dependent variable increases. It can be understood with the help of following graph − Negative Linear relationship A linear relationship will be called positive if independent increases and dependent variable decreases. It can be understood with the help of following graph − Types of Linear Regression Linear regression is of the following two types − Simple Linear Regression Multiple Linear Regression Simple Linear Regression (SLR) It is the most basic version of linear regression which predicts a response using a single feature. The assumption in SLR is that the two variables are linearly related. Python implementation We can implement SLR in Python in two ways, one is to provide your own dataset and other is to use dataset from scikit-learn python library. Example 1 − In the following Python implementation example, we are using our own dataset. 
First, we will start with importing necessary packages as follows − %matplotlib inline import numpy as np import matplotlib.pyplot as plt Next, define a function which will calculate the important values for SLR − def coef_estimation(x, y): The following script line will give number of observations n − n = np.size(x) The mean of x and y vector can be calculated as follows − m_x, m_y = np.mean(x), np.mean(y) We can find cross-deviation and deviation about x as follows − SS_xy = np.sum(y*x) – n*m_y*m_x SS_xx = np.sum(x*x) – n*m_x*m_x Next, regression coefficients i.e. b can be calculated as follows − b_1 = SS_xy / SS_xx b_0 = m_y – b_1*m_x return(b_0, b_1) Next, we need to define a function which will plot the regression line as well as will predict the response vector − def plot_regression_line(x, y, b): The following script line will plot the actual points as scatter plot − plt.scatter(x, y, color = “m”, marker = “o”, s = 30) The following script line will predict response vector − y_pred = b[0] + b[1]*x The following script lines will plot the regression line and will put the labels on them − plt.plot(x, y_pred, color = “g”) plt.xlabel(”x”) plt.ylabel(”y”) plt.show() At last, we need to define main() function for providing dataset and calling the function we defined above − def main(): x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) y = np.array([100, 300, 350, 500, 750, 800, 850, 900, 1050, 1250]) b = coef_estimation(x, y) print(“Estimated coefficients:nb_0 = {} nb_1 = {}”.format(b[0], b[1])) plot_regression_line(x, y, b) if __name__ == “__main__”: main() Output Estimated coefficients: b_0 = 154.5454545454545 b_1 = 117.87878787878788 Example 2 − In the following Python implementation example, we are using diabetes dataset from scikit-learn. 
First, we will start by importing the necessary packages:

    %matplotlib inline
    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn import datasets, linear_model
    from sklearn.metrics import mean_squared_error, r2_score

Next, we will load the diabetes dataset and create its object:

    diabetes = datasets.load_diabetes()

As we are implementing SLR, we will be using only one feature (the one at column index 2):

    X = diabetes.data[:, np.newaxis, 2]

Next, we need to split the data into training and testing sets, holding out the last 30 observations for testing:

    X_train = X[:-30]
    X_test = X[-30:]

Next, we need to split the target into training and testing sets in the same way:

    y_train = diabetes.target[:-30]
    y_test = diabetes.target[-30:]

Now, to train the model we need to create a linear regression object:

    regr = linear_model.LinearRegression()

Next, train the model using the training sets:

    regr.fit(X_train, y_train)

Next, make predictions using the testing set:

    y_pred = regr.predict(X_test)

Next, we will print the fitted coefficient together with the mean squared error (MSE) and the variance score (the R² coefficient of determination):

    print("Coefficients: \n", regr.coef_)
    print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))
    print("Variance score: %.2f" % r2_score(y_test, y_pred))

Now, plot the outputs as follows:

    plt.scatter(X_test, y_test, color="blue")
    plt.plot(X_test, y_pred, color="red", linewidth=3)
    plt.xticks(())
    plt.yticks(())
    plt.show()

Output

    Coefficients:
     [941.43097333]
    Mean squared error: 3035.06
    Variance score: 0.41

Multiple Linear Regression (MLR)

It is the extension of simple linear regression that predicts a response using two or more features. Mathematically, we can explain it as follows.

Consider a dataset having n observations, p features (i.e. independent variables) and y as one response (i.e. dependent variable). The regression line for p features can be calculated as follows:

$$h(x_{i})=b_{0}+b_{1}x_{i1}+b_{2}x_{i2}+\dots+b_{p}x_{ip}$$

Here, h(x_i) is the predicted response value and b_0, b_1, b_2, ..., b_p are the regression coefficients. Multiple Linear Regression models always includes the
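
The formula for h(x_i) above can be sketched in code as follows. This is a minimal illustration rather than the tutorial's own implementation: the data is synthetic, the names true_b, b_hat and h are hypothetical, and the coefficients are estimated with NumPy's general least-squares solver, which is one of several ways to fit such a model.

```python
import numpy as np

# Synthetic data (an assumption for illustration): n observations, p features.
rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))

# Hypothetical "true" coefficients b_0 (intercept), b_1, ..., b_p.
true_b = np.array([2.0, 1.5, -0.5, 3.0])
y = true_b[0] + X @ true_b[1:] + rng.normal(scale=0.1, size=n)

# Prepend a column of ones so that b_0 is estimated together with b_1..b_p.
X1 = np.column_stack([np.ones(n), X])
b_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)


def h(x_i, b):
    """Predicted response: b_0 + b_1*x_i1 + ... + b_p*x_ip."""
    return b[0] + np.dot(b[1:], x_i)


print("Estimated coefficients:", b_hat)
print("Prediction for the first observation:", h(X[0], b_hat))
```

With little noise and many observations, the estimated coefficients should land close to the values used to generate the data.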

Aug 09

Improving Performance of ML Models

Performance Improvement with Ensembles

Ensembles can give us a boost in machine learning results by combining several models. Basically, ensemble models consist of several individually trained supervised learning models whose results are merged in various ways to achieve better predictive performance than a single model. Ensemble methods can be divided into the following two groups:

Sequential ensemble methods
As the name implies, in this kind of ensemble method the base learners are generated sequentially. The motivation of such methods is to exploit the dependency among base learners.

Parallel ensemble methods
As the name implies, in this kind of ensemble method the base learners are generated in parallel. The motivation of such methods is to exploit the independence among base learners.

Ensemble Learning Methods

The following are the most popular ensemble learning methods, i.e. the methods for combining the predictions from different models:

Bagging
Bagging is short for bootstrap aggregation. In bagging methods, the ensemble model tries to improve prediction accuracy and decrease model variance by combining the predictions of individual models trained on randomly generated training samples. The final prediction of the ensemble model is obtained by averaging the predictions of the individual estimators (or, for classification, by majority vote). Random forests are one of the best-known examples of bagging methods.

Boosting
In boosting methods, the main principle is to build the ensemble incrementally by training each base estimator sequentially. Boosting combines several weak base learners, trained sequentially over multiple iterations on the training data, to build a powerful ensemble. During the training of the weak base learners, higher weights are assigned to those training instances which were misclassified earlier. AdaBoost is an example of a boosting method.
Voting
In this ensemble learning model, multiple models of different types are built and some simple statistic, such as the mean, the median or a majority vote, is used to combine their predictions into the final prediction. In stacking, a related approach, the individual predictions instead serve as additional input for training a further model that makes the final prediction.

Bagging Ensemble Algorithms

The following are three bagging ensemble algorithms:

Bagged Decision Tree

Bagging ensemble methods work well with algorithms that have high variance, and in this respect the decision tree algorithm is a very good candidate. In the following Python recipe, we are going to build a bagged decision tree ensemble model by using the BaggingClassifier class of sklearn with DecisionTreeClassifier (a classification and regression trees algorithm) on the Pima Indians diabetes dataset.

First, import the required packages as follows:

    from pandas import read_csv
    from sklearn.model_selection import KFold
    from sklearn.model_selection import cross_val_score
    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier

Now, we need to load the Pima diabetes dataset as we did in the previous examples:

    path = r"C:\pima-indians-diabetes.csv"
    headernames = ["preg", "plas", "pres", "skin", "test", "mass", "pedi", "age", "class"]
    data = read_csv(path, names=headernames)
    array = data.values
    X = array[:,0:8]
    Y = array[:,8]

Next, give the input for 10-fold cross validation as follows. Recent scikit-learn releases raise an error if random_state is set without shuffle=True, so shuffling is enabled here:

    seed = 7
    kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
    cart = DecisionTreeClassifier()

We need to provide the number of trees we are going to build.
Here we are building 150 trees:

    num_trees = 150

Next, build the model with the following script. The base estimator is passed as the first positional argument, because its keyword name differs between scikit-learn releases (base_estimator in older ones, estimator in newer ones):

    model = BaggingClassifier(cart, n_estimators=num_trees, random_state=seed)

Calculate and print the result as follows:

    results = cross_val_score(model, X, Y, cv=kfold)
    print(results.mean())

Output

    0.7733766233766234

The output above shows that we got around 77% accuracy with our bagged decision tree classifier model. The exact figure can vary slightly with the scikit-learn version and the fold shuffling.

Random Forest

It is an extension of bagged decision trees. For the individual classifiers, the samples of the training dataset are taken with replacement, but the trees are constructed in a way that reduces the correlation between them. To that end, only a random subset of features is considered at each split point, rather than greedily choosing the best split point over all features.

In the following Python recipe, we are going to build a random forest ensemble model by using the RandomForestClassifier class of sklearn on the Pima Indians diabetes dataset.

First, import the required packages as follows:

    from pandas import read_csv
    from sklearn.model_selection import KFold
    from sklearn.model_selection import cross_val_score
    from sklearn.ensemble import RandomForestClassifier

Now, we need to load the Pima diabetes dataset as we did in the previous examples:

    path = r"C:\pima-indians-diabetes.csv"
    headernames = ["preg", "plas", "pres", "skin", "test", "mass", "pedi", "age", "class"]
    data = read_csv(path, names=headernames)
    array = data.values
    X = array[:,0:8]
    Y = array[:,8]

Next, give the input for 10-fold cross validation as follows (shuffle=True is required by recent scikit-learn releases whenever random_state is set):

    seed = 7
    kfold = KFold(n_splits=10, shuffle=True, random_state=seed)

We need to provide the number of trees we are going to build.
Here we are building 150 trees with split points chosen from 5 features:

    num_trees = 150
    max_features = 5

Next, build the model with the following script:

    model = RandomForestClassifier(n_estimators=num_trees, max_features=max_features)

Calculate and print the result as follows:

    results = cross_val_score(model, X, Y, cv=kfold)
    print(results.mean())

Output

    0.7629357484620642

The output above shows that we got around 76% accuracy with our random forest classifier model. Because the forest is built without a fixed random_state, the exact figure changes from run to run.

Extra Trees

It is another extension of the bagged decision tree ensemble method. In this method, randomized trees are constructed from samples of the training dataset, and the split thresholds for each candidate feature are drawn at random rather than searched for the optimum.

In the following Python recipe, we are going to build an extra trees ensemble model by using the ExtraTreesClassifier class of sklearn on the Pima Indians diabetes dataset.

First, import the required packages as follows:

    from pandas import read_csv
    from sklearn.model_selection import KFold
    from sklearn.model_selection import cross_val_score
    from sklearn.ensemble import ExtraTreesClassifier

Now, we need to load the Pima diabetes dataset as did in previous examples −
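
The recipe is cut off at this point in the source. As a stand-in, the following is a minimal sketch of how an extra trees ensemble can be cross-validated. It assumes a synthetic dataset generated with make_classification (768 rows and 8 features, chosen only to resemble the shape of the Pima data) instead of the CSV file, and it reuses the tree count and max_features values from the random forest recipe; the resulting accuracy therefore says nothing about the Pima dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import KFold, cross_val_score

seed = 7

# Synthetic stand-in data (an assumption, not the Pima Indians dataset).
X, Y = make_classification(
    n_samples=768, n_features=8, n_informative=5, random_state=seed
)

# 10-fold cross validation; shuffle=True is needed when random_state is set.
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)

# 150 randomized trees, each split chosen among 5 randomly selected features.
num_trees = 150
max_features = 5
model = ExtraTreesClassifier(
    n_estimators=num_trees, max_features=max_features, random_state=seed
)

results = cross_val_score(model, X, Y, cv=kfold)
mean_accuracy = results.mean()
print(mean_accuracy)
```

To run it against the real data, X and Y would be replaced by the arrays loaded from the CSV file as in the earlier recipes.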

Aug 09

Machine Learning With Python – Quick Guide

Machine Learning with Python – Basics

We are living in the ‘age of data’, which is enriched with better computational power and more storage resources. This data or information is increasing day by day, but the real challenge is to make sense of all of it. Businesses and organizations are trying to deal with it by building intelligent systems using the concepts and methodologies of data science, data mining and machine learning. Among them, machine learning is the most exciting field of computer science. It would not be wrong to call machine learning the application and science of algorithms that make sense of data.

What is Machine Learning?

Machine Learning (ML) is the field of computer science with the help of which computer systems can make sense of data in much the same way as human beings do. In simple words, ML is a type of artificial intelligence that extracts patterns out of raw data by using an algorithm or method. The main focus of ML is to allow computer systems to learn from experience without being explicitly programmed and without human intervention.

Need for Machine Learning

Human beings, at this moment, are the most intelligent and advanced species on earth because they can think, evaluate and solve complex problems. AI, on the other hand, is still in its initial stage and has not surpassed human intelligence in many aspects. The question, then, is why we need to make machines learn. The most suitable reason for doing this is “to make decisions, based on data, with efficiency and scale”.

Lately, organizations are investing heavily in newer technologies like Artificial Intelligence, Machine Learning and Deep Learning to extract the key information from data, perform several real-world tasks and solve problems. We can call these data-driven decisions taken by machines, particularly to automate the process.
These data-driven decisions can be used, instead of programming logic, in problems that cannot inherently be programmed. The fact is that we cannot do without human intelligence, but we also need to solve real-world problems efficiently and at a huge scale. That is why the need for machine learning arises.

Why & When to Make Machines Learn?

We have already discussed the need for machine learning, but another question arises: in what scenarios must we make the machine learn? There can be several circumstances where we need machines to take data-driven decisions efficiently and at a huge scale. The following are some circumstances where making machines learn would be more effective:

Lack of human expertise
The very first scenario in which we want a machine to learn and take data-driven decisions is a domain where there is a lack of human expertise. Examples are navigation in unknown territories or on other planets.

Dynamic scenarios
Some scenarios are dynamic in nature, i.e. they keep changing over time. For these scenarios and behaviors, we want a machine to learn and take data-driven decisions. Examples are network connectivity and the availability of infrastructure in an organization.

Difficulty in translating expertise into computational tasks
There can be various domains in which humans have expertise but are unable to translate this expertise into computational tasks. In such circumstances we want machine learning. Examples are the domains of speech recognition, cognitive tasks etc.
Machine Learning Model

Before discussing the machine learning model, we need to understand the following formal definition of ML given by Professor Mitchell:

“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”

The above definition focuses on three parameters, which are also the main components of any learning algorithm, namely Task (T), Performance (P) and Experience (E). In this context, we can simplify the definition as follows.

ML is a field of AI consisting of learning algorithms that:
Improve their performance (P)
At executing some task (T)
Over time with experience (E)

(Diagram: a machine learning model relating task T, experience E and performance P.)

Let us discuss them in more detail now.

Task (T)

From the perspective of the problem, we may define the task T as the real-world problem to be solved. The problem can be anything, like finding the best house price in a specific location or finding the best marketing strategy. In machine learning, on the other hand, the definition of a task is different, because ML-based tasks are difficult to solve with a conventional programming approach. A task T is said to be an ML-based task when it is defined by the process the system must follow to operate on data points. Examples of ML-based tasks are classification, regression, structured annotation, clustering, transcription etc.

Experience (E)

As the name suggests, it is the knowledge gained from the data points provided to the algorithm or model. Once provided with the dataset, the model runs iteratively and learns some inherent pattern. The learning thus acquired is called experience (E). As an analogy with human learning, we can think of a human being gaining experience from various attributes like situations, relationships etc.
Supervised, unsupervised and reinforcement learning are some ways to learn or gain experience. The experience gained by our ML model or algorithm will be used to solve the task T.

Performance (P)

An ML algorithm is supposed to perform a task and gain experience with the passage of time. The measure which tells whether ML
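
As a rough illustration of the T, E and P framing above, the following sketch trains a classifier on growing amounts of data and measures its accuracy each time. The choices here are assumptions made for illustration only: the task T is classifying scikit-learn's breast cancer dataset, the experience E is the number of training examples seen, the performance measure P is accuracy on a held-out test set, and the model is Gaussian Naïve Bayes.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Task T: predict the class of each sample.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y
)

# Experience E: increasingly large slices of the training set.
train_sizes = [20, 50, 100, 200, len(X_train)]

# Performance P: accuracy on the same held-out test set each time.
scores = []
for size in train_sizes:
    model = GaussianNB().fit(X_train[:size], y_train[:size])
    scores.append(accuracy_score(y_test, model.predict(X_test)))

for size, score in zip(train_sizes, scores):
    print(f"E = {size:3d} training examples -> P = {score:.3f}")
```

Performance typically improves as experience grows, although on a small dataset like this one the increase need not be strictly monotonic.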

Aug 09

Introduction

Classification – Introduction

Introduction to Classification

Classification may be defined as the process of predicting a class or category from observed values or given data points. The categorized output can take forms such as “Black” or “White”, or “spam” or “no spam”.

Mathematically, classification is the task of approximating a mapping function (f) from input variables (X) to output variables (Y). It belongs to supervised machine learning, in which targets are provided along with the input data set.

An example of a classification problem is spam detection in emails. There can be only two categories of output, “spam” and “no spam”; hence this is binary classification.

To implement this classification, we first need to train the classifier. For this example, “spam” and “no spam” emails would be used as the training data. After the classifier has been trained successfully, it can be used to classify an unknown email.

Types of Learners in Classification

We have two types of learners with respect to classification problems:

Lazy Learners
As the name suggests, such learners store the training data and wait for the testing data to appear. Classification is done only after getting the testing data. They spend less time on training but more time on predicting. Examples of lazy learners are K-nearest neighbor and case-based reasoning.

Eager Learners
In contrast to lazy learners, eager learners construct a classification model from the training data without waiting for the testing data to appear. They spend more time on training but less time on predicting. Examples of eager learners are Decision Trees, Naïve Bayes and Artificial Neural Networks (ANN).

Building a Classifier in Python

Scikit-learn, a Python library for machine learning, can be used to build a classifier in Python.
The steps for building a classifier in Python are as follows:

Step1: Importing necessary python package

For building a classifier using scikit-learn, we need to import it. We can import it by using the following script:

    import sklearn

Step2: Importing dataset

After importing the necessary package, we need a dataset to build a classification prediction model. We can import one from sklearn's datasets or use another one as per our requirement. We are going to use sklearn’s Breast Cancer Wisconsin Diagnostic Database. We can import it with the following script:

    from sklearn.datasets import load_breast_cancer

The following script will load the dataset:

    data = load_breast_cancer()

We also need to organize the data, which can be done with the following scripts:

    label_names = data["target_names"]
    labels = data["target"]
    feature_names = data["feature_names"]
    features = data["data"]

The following command will print the names of the labels, ‘malignant’ and ‘benign’ in the case of our database:

    print(label_names)

The output of the above command is the names of the labels:

    ['malignant' 'benign']

These labels are mapped to the binary values 0 and 1. Malignant cancer is represented by 0 and benign cancer is represented by 1.

The feature names can be inspected with the following commands. They are the names of the measurements, not of the labels: every sample, whether malignant or benign, has the same 30 features.

    print(feature_names[0])

The output of the above command is the name of the first feature:

    mean radius

Similarly, the name of the second feature can be produced as follows:

    print(feature_names[1])

The output of the above command is:

    mean texture

We can print the feature values of the first sample with the following command:

    print(features[0])

This will give the following output:

    [ 1.799e+01 1.038e+01 1.228e+02 1.001e+03 1.184e-01 2.776e-01 3.001e-01
      1.471e-01 2.419e-01 7.871e-02 1.095e+00 9.053e-01 8.589e+00 1.534e+02
      6.399e-03 4.904e-02 5.373e-02 1.587e-02 3.003e-02 6.193e-03 2.538e+01
      1.733e+01 1.846e+02 2.019e+03 1.622e-01 6.656e-01 7.119e-01 2.654e-01
      4.601e-01 1.189e-01 ]

We can print the feature values of the second sample with the following command:

    print(features[1])

This will give the following output:

    [ 2.057e+01 1.777e+01 1.329e+02 1.326e+03 8.474e-02 7.864e-02 8.690e-02
      7.017e-02 1.812e-01 5.667e-02 5.435e-01 7.339e-01 3.398e+00 7.408e+01
      5.225e-03 1.308e-02 1.860e-02 1.340e-02 1.389e-02 3.532e-03 2.499e+01
      2.341e+01 1.588e+02 1.956e+03 1.238e-01 1.866e-01 2.416e-01 1.860e-01
      2.750e-01 8.902e-02 ]

Step3: Organizing data into training & testing sets

As we need to test our model on unseen data, we will divide our dataset into two parts: a training set and a test set. We can use the train_test_split() function of the sklearn Python package to split the data into sets. The following command will import the function:

    from sklearn.model_selection import train_test_split

The next command will split the data into training and testing data. In this example, we are taking 40 percent of the data for testing and 60 percent for training:

    train, test, train_labels, test_labels = train_test_split(
        features, labels, test_size=0.40, random_state=42
    )

Step4: Model evaluation

After dividing the data into training and testing sets, we need to build the model. We will be using the Naïve Bayes algorithm for this purpose.
The following command will import the GaussianNB class:

    from sklearn.naive_bayes import GaussianNB

Now, initialize the model as follows:

    gnb = GaussianNB()

Next, with the following command we can train the model:

    model = gnb.fit(train, train_labels)

Now, for evaluation purposes we need to make predictions. This can be done by using the predict() function as follows:

    preds = gnb.predict(test)
    print(preds)

This will give the following output:

    [ 1 0 0 1 1 0 0 0 1 1 1 0 1 0 1 0 1 1 1 0 1 1 0 1 1
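
The predictions above are 0/1 labels for the test samples. One way to turn them into a single evaluation figure is to compare them with the true test labels. The following self-contained sketch repeats the steps above and then computes accuracy with scikit-learn's accuracy_score; the choice of accuracy as the metric is an assumption here, since the source is cut off before it names one.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Steps 2 and 3: load the data and split it 60/40 as in the text.
data = load_breast_cancer()
train, test, train_labels, test_labels = train_test_split(
    data["data"], data["target"], test_size=0.40, random_state=42
)

# Step 4: train the Naive Bayes model and predict on the unseen test set.
gnb = GaussianNB()
gnb.fit(train, train_labels)
preds = gnb.predict(test)

# Fraction of test samples whose predicted label matches the true label.
accuracy = accuracy_score(test_labels, preds)
print(accuracy)
```

For a medical dataset like this one, precision and recall on the malignant class are usually worth checking as well, since accuracy alone hides which kind of error the model makes.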