Discuss Scikit Learn

Scikit-learn (Sklearn) is one of the most useful and robust libraries for machine learning in Python. It provides a selection of efficient tools for machine learning and statistical modeling, including classification, regression, clustering and dimensionality reduction, via a consistent interface in Python. This library, which is largely written in Python, is built upon NumPy, SciPy and Matplotlib.
Scikit Learn – Clustering Methods

Here, we will study the clustering methods in Sklearn, which help in identifying similarity in data samples. Clustering methods, one of the most useful unsupervised ML techniques, are used to find similarity and relationship patterns among data samples and then cluster those samples into groups based on feature similarity. Clustering is important because it determines the intrinsic grouping among the present unlabeled data.

The Scikit-learn library has the sklearn.cluster module to perform clustering of unlabeled data. Under this module, scikit-learn has the following clustering methods −

KMeans

This algorithm computes the centroids and iterates until it finds the optimal centroids. It requires the number of clusters to be specified, which is why it assumes that they are already known. The main logic of this algorithm is to cluster the data by separating the samples into n groups of equal variance, minimizing a criterion known as the inertia. The number of clusters identified by the algorithm is represented by 'K'. Scikit-learn has the sklearn.cluster.KMeans module to perform K-Means clustering. While computing cluster centers and the value of inertia, the parameter named sample_weight allows the sklearn.cluster.KMeans module to assign more weight to some samples.

Affinity Propagation

This algorithm is based on the concept of 'message passing' between different pairs of samples until convergence. It does not require the number of clusters to be specified before running the algorithm. The algorithm has a time complexity of the order O(N²T), which is its biggest disadvantage. Scikit-learn has the sklearn.cluster.AffinityPropagation module to perform Affinity Propagation clustering.

Mean Shift

This algorithm mainly discovers blobs in a smooth density of samples. It assigns the data points to clusters iteratively by shifting points towards the highest density of data points. It sets the number of clusters automatically; the parameter named bandwidth only dictates the size of the region to search through. Scikit-learn has the sklearn.cluster.MeanShift module to perform Mean Shift clustering.

Spectral Clustering

Before clustering, this algorithm uses the eigenvalues, i.e. the spectrum, of the similarity matrix of the data to perform dimensionality reduction into fewer dimensions. The use of this algorithm is not advisable when there is a large number of clusters. Scikit-learn has the sklearn.cluster.SpectralClustering module to perform Spectral clustering.

Hierarchical Clustering

This algorithm builds nested clusters by merging or splitting the clusters successively. This cluster hierarchy is represented as a dendrogram, i.e. a tree. It falls into the following two categories −

Agglomerative hierarchical algorithms − In this kind of hierarchical algorithm, every data point is treated as a single cluster. It then successively agglomerates pairs of clusters. This uses a bottom-up approach.

Divisive hierarchical algorithms − In this hierarchical algorithm, all data points are treated as one big cluster. The process of clustering involves dividing the one big cluster into various small clusters, using a top-down approach.

Scikit-learn has the sklearn.cluster.AgglomerativeClustering module to perform Agglomerative Hierarchical clustering.

DBSCAN

It stands for "Density-based spatial clustering of applications with noise".
This algorithm is based on the intuitive notion of "clusters" and "noise": clusters are dense regions in the data space, separated by regions of lower density of data points. Scikit-learn has the sklearn.cluster.DBSCAN module to perform DBSCAN clustering. There are two important parameters, namely min_samples and eps, used by this algorithm to define density. A higher value of the parameter min_samples or a lower value of the parameter eps means that a higher density of data points is necessary to form a cluster.

OPTICS

It stands for "Ordering points to identify the clustering structure". This algorithm also finds density-based clusters in spatial data. Its basic working logic is like DBSCAN. It addresses a major weakness of the DBSCAN algorithm − the problem of detecting meaningful clusters in data of varying density − by ordering the points of the database in such a way that spatially closest points become neighbors in the ordering. Scikit-learn has the sklearn.cluster.OPTICS module to perform OPTICS clustering.

BIRCH

It stands for Balanced iterative reducing and clustering using hierarchies. It is used to perform hierarchical clustering over large data sets. It builds a tree named CFT, i.e. Clustering Feature Tree, for the given data. The advantage of the CFT is that the data nodes, called CF (Clustering Feature) nodes, hold the necessary information for clustering, which removes the need to hold the entire input data in memory. Scikit-learn has the sklearn.cluster.Birch module to perform BIRCH clustering.

Comparing Clustering Algorithms

The following comparison (based on parameters, scalability and metric) covers the clustering algorithms in scikit-learn −

1. K-Means − Parameters: no. of clusters; Scalability: very large n_samples; Metric used: distance between points.
2. Affinity Propagation − Parameters: damping; Scalability: not scalable with n_samples; Metric used: graph distance.
3. Mean-Shift − Parameters: bandwidth; Scalability: not scalable with n_samples; Metric used: distance between points.
4. Spectral Clustering − Parameters: no. of clusters; Scalability: medium with n_samples, small with n_clusters; Metric used: graph distance.
5. Hierarchical Clustering − Parameters: distance threshold or no. of clusters; Scalability: large n_samples and large n_clusters; Metric used: distance between points.
6. DBSCAN − Parameters: size of neighborhood; Scalability: very large n_samples and medium n_clusters; Metric used: nearest point distance.
7. OPTICS − Parameters: minimum cluster membership; Scalability: very large n_samples and large n_clusters; Metric used: distance between points.
8. BIRCH − Parameters: threshold, branching factor; Scalability: large n_samples and large n_clusters; Metric used: Euclidean distance between points.

K-Means Clustering on the Scikit-learn Digits dataset

In this example, we will apply K-means clustering on the digits dataset. This algorithm will identify similar digits without using the original label information. The implementation is done in a Jupyter notebook.

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
digits = load_digits()
digits.data.shape

Output

(1797, 64)

This output shows that the digits dataset has 1797 samples with 64 features.

Example

Now, perform the K-Means clustering as follows −

kmeans = KMeans(n_clusters = 10, random_state = 0)
clusters = kmeans.fit_predict(digits.data)
kmeans.cluster_centers_.shape

Output

(10, 64)

This output shows that K-means clustering created 10 cluster centers, each with 64 features.
Example

fig, ax = plt.subplots(2, 5, figsize = (8, 3))
centers = kmeans.cluster_centers_.reshape(10, 8, 8)
for axi, center in zip(ax.flat, centers):
   axi.set(xticks = [], yticks = [])
   axi.imshow(center, interpolation = 'nearest', cmap = plt.cm.binary)

The code above plots the 10 cluster centers learned by K-means as 8×8 images; they typically look like blurred versions of the ten digits.
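The other estimators listed in sklearn.cluster follow the same fit/fit_predict interface. As a minimal sketch (the toy blob data below is illustrative and not part of the digits example above), DBSCAN and AgglomerativeClustering can be used like this −

from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN, AgglomerativeClustering

# Generate a small toy dataset with three well-separated groups
X, _ = make_blobs(n_samples = 300, centers = 3, cluster_std = 0.6, random_state = 0)

# Density-based clustering: eps and min_samples define what counts as a dense region
db_labels = DBSCAN(eps = 0.5, min_samples = 5).fit_predict(X)

# Agglomerative (bottom-up) hierarchical clustering with a fixed number of clusters
agg_labels = AgglomerativeClustering(n_clusters = 3).fit_predict(X)

print(set(db_labels))   # -1, if present, marks noise points
print(set(agg_labels))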
Scikit Learn – K-Nearest Neighbors (KNN)

This chapter will help you in understanding the nearest neighbor methods in Sklearn.

Neighbor-based learning methods are of both types, namely supervised and unsupervised. Supervised neighbors-based learning can be used for both classification and regression predictive problems, but it is mainly used for classification predictive problems in industry.

Neighbor-based learning methods do not have a specialised training phase; they use all the training data at classification time. They also do not assume anything about the underlying data. That is the reason they are lazy and non-parametric in nature.

The main principle behind nearest neighbor methods is −

To find a predefined number of training samples closest in distance to the new data point.

Predict the label from these training samples.

Here, the number of samples can be a user-defined constant, as in K-nearest neighbor learning, or it can vary based on the local density of points, as in radius-based neighbor learning.

sklearn.neighbors Module

Scikit-learn has the sklearn.neighbors module that provides functionality for both unsupervised and supervised neighbors-based learning methods. As input, the classes in this module can handle either NumPy arrays or scipy.sparse matrices.

Types of algorithms

The different types of algorithms which can be used in the implementation of neighbor-based methods are as follows −

Brute Force

The brute-force computation of distances between all pairs of points in the dataset provides the most naïve neighbor search implementation. Mathematically, for N samples in D dimensions, the brute-force approach scales as O[D N²]. For small data samples, this algorithm can be very useful, but it becomes infeasible as the number of samples grows. Brute force neighbor search can be enabled by writing the keyword algorithm = 'brute'.

K-D Tree

One of the tree-based data structures that have been invented to address the computational inefficiencies of the brute-force approach is the KD tree data structure. Basically, the KD tree is a binary tree structure called the K-dimensional tree. It recursively partitions the parameter space along the data axes, dividing it into nested axis-aligned regions into which the data points are filed.

Advantages

Following are some advantages of the K-D tree algorithm −

Construction is fast − As the partitioning is performed only along the data axes, the K-D tree's construction is very fast.

Less distance computations − This algorithm requires very few distance computations to determine the nearest neighbor of a query point. It only takes O[log(N)] distance computations.

Disadvantages

Fast for only low-dimensional neighbor searches − It is very fast for low-dimensional (D < 20) neighbor searches, but as D grows it becomes inefficient.

K-D tree neighbor searches can be enabled by writing the keyword algorithm = 'kd_tree'.

Ball Tree

As we know that the KD Tree is inefficient in higher dimensions, the Ball tree data structure was developed to address this inefficiency. Mathematically, it recursively divides the data into nodes defined by a centroid C and radius r, in such a way that each point in the node lies within the hyper-sphere defined by centroid C and radius r.
It uses the triangle inequality, given below, which reduces the number of candidate points for a neighbor search −

$$\lvert X + Y \rvert \leq \lvert X \rvert + \lvert Y \rvert$$

Advantages

Following are some advantages of the Ball Tree algorithm −

Efficient on highly structured data − As the ball tree partitions the data into a series of nesting hyper-spheres, it is efficient on highly structured data.

Out-performs KD-tree − The ball tree out-performs the KD tree in high dimensions because of the spherical geometry of the ball tree nodes.

Disadvantages

Costly − Partitioning the data into a series of nesting hyper-spheres makes its construction very costly.

Ball tree neighbor searches can be enabled by writing the keyword algorithm = 'ball_tree'.

Choosing Nearest Neighbors Algorithm

The choice of an optimal algorithm for a given dataset depends upon the following factors −

Number of samples (N) and Dimensionality (D)

These are the most important factors to be considered while choosing a Nearest Neighbor algorithm, for the reasons given below −

The query time of the Brute Force algorithm grows as O[D N].

The query time of the Ball tree algorithm grows as O[D log(N)].

The query time of the KD tree algorithm changes with D in a way that is very difficult to characterize. When D < 20, the cost is approximately O[D log(N)] and this algorithm is very efficient. On the other hand, it is inefficient when D > 20, because the cost increases to nearly O[D N].

Data Structure

Another factor that affects the performance of these algorithms is the intrinsic dimensionality of the data or the sparsity of the data. This is because the query times of the Ball tree and KD tree algorithms can be greatly influenced by it, whereas the query time of the Brute Force algorithm is unchanged by data structure. Generally, the Ball tree and KD tree algorithms produce faster query times when applied to sparser data with smaller intrinsic dimensionality.

Number of Neighbors (k)

The number of neighbors (k) requested for a query point affects the query time of the Ball tree and KD tree algorithms. Their query time becomes slower as the number of neighbors (k) increases, whereas the query time of Brute Force remains unaffected by the value of k.

Number of query points

Because they need a construction phase, both the KD tree and Ball tree algorithms are effective when there is a large number of query points. On the other hand, if there is a smaller number of query points, the Brute Force algorithm performs better than the KD tree and Ball tree algorithms.
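One way to see the algorithm keyword discussed above in code is a small sketch (the random data is illustrative): the three search strategies can be swapped via the algorithm parameter of sklearn.neighbors.NearestNeighbors and return the same neighbors −

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Small random dataset: 100 samples in 3 dimensions
rng = np.random.RandomState(0)
X = rng.rand(100, 3)

for algo in ('brute', 'kd_tree', 'ball_tree'):
   nn = NearestNeighbors(n_neighbors = 3, algorithm = algo).fit(X)
   distances, indices = nn.kneighbors(X[:1])   # query the first point
   print(algo, indices)

# All three print the same indices; they differ only in speed and memory use.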
Scikit Learn – Anomaly Detection

Here, we will learn about what anomaly detection is in Sklearn and how it is used in the identification of data points.

Anomaly detection is a technique used to identify data points in a dataset that do not fit well with the rest of the data. It has many applications in business such as fraud detection, intrusion detection, system health monitoring, surveillance, and predictive maintenance. Anomalies, which are also called outliers, can be divided into the following three categories −

Point anomalies − It occurs when an individual data instance is considered anomalous w.r.t. the rest of the data.

Contextual anomalies − Such kind of anomaly is context specific. It occurs if a data instance is anomalous in a specific context.

Collective anomalies − It occurs when a collection of related data instances is anomalous w.r.t. the entire dataset rather than individual values.

Methods

Two methods, namely outlier detection and novelty detection, can be used for anomaly detection. It is necessary to see the distinction between them.

Outlier detection

The training data contains outliers that are far from the rest of the data. Such outliers are defined as deviant observations. That is the reason outlier detection estimators always try to fit the region having the most concentrated training data while ignoring the deviant observations. It is also known as unsupervised anomaly detection.

Novelty detection

It is concerned with detecting an unobserved pattern in new observations which is not included in the training data. Here, the training data is not polluted by outliers. It is also known as semi-supervised anomaly detection.

There is a set of ML tools, provided by scikit-learn, which can be used for both outlier detection as well as novelty detection. These tools first learn the data in an unsupervised way using the fit() method, as follows −

estimator.fit(X_train)

Now, the new observations would be sorted as inliers (labeled 1) or outliers (labeled -1) by using the predict() method, as follows −

estimator.predict(X_test)

The estimator first computes a raw scoring function and then the predict method makes use of a threshold on that raw scoring function. We can access this raw scoring function with the help of the score_samples method and can control the threshold via the contamination parameter. There is also a decision_function method that defines outliers as negative values and inliers as non-negative values.

estimator.decision_function(X_test)

Sklearn algorithms for Outlier Detection

Let us begin by understanding what an elliptic envelope is.

Fitting an elliptic envelope

This algorithm assumes that regular data comes from a known distribution such as a Gaussian distribution. For outlier detection, Scikit-learn provides an object named covariance.EllipticEnvelope. This object fits a robust covariance estimate to the data, and thus fits an ellipse to the central data points. It ignores the points outside the central mode.

Parameters

The following table consists of the parameters used by the sklearn.covariance.EllipticEnvelope method −

1. store_precision − Boolean, optional, default = True. We can specify it if the estimated precision is stored.

2. assume_centered − Boolean, optional, default = False. If we set it to False, it will compute the robust location and covariance directly with the help of the FastMCD algorithm. On the other hand, if set to True, it will compute the support of the robust location and covariance.
3. support_fraction − float in (0., 1.), optional, default = None. This parameter tells the method how much proportion of points is to be included in the support of the raw MCD estimates.

4. contamination − float in (0., 1.), optional, default = 0.1. It provides the proportion of the outliers in the data set.

5. random_state − int, RandomState instance or None, optional, default = None. This parameter represents the seed of the pseudo random number generator which is used while shuffling the data. Following are the options −

int − In this case, random_state is the seed used by the random number generator.

RandomState instance − In this case, random_state is the random number generator.

None − In this case, the random number generator is the RandomState instance used by np.random.

Attributes

The following table consists of the attributes used by the sklearn.covariance.EllipticEnvelope method −

1. support_ − array-like, shape (n_samples,). It represents the mask of the observations used to compute robust estimates of location and shape.

2. location_ − array-like, shape (n_features,). It returns the estimated robust location.

3. covariance_ − array-like, shape (n_features, n_features). It returns the estimated robust covariance matrix.

4. precision_ − array-like, shape (n_features, n_features). It returns the estimated pseudo inverse matrix.

5. offset_ − float. It is used to define the decision function from the raw scores: decision_function = score_samples - offset_.

Implementation Example

import numpy as np
from sklearn.covariance import EllipticEnvelope
true_cov = np.array([[.8, .3],[.3, .4]])
X = np.random.RandomState(0).multivariate_normal(mean = [0, 0], cov = true_cov, size = 500)
cov = EllipticEnvelope(random_state = 0).fit(X)
# Now we can use the predict method. It will return 1 for an inlier and -1 for an outlier.
cov.predict([[0, 0],[2, 2]])

Output

array([ 1, -1])

Isolation Forest

In the case of a high-dimensional dataset, one efficient way to perform outlier detection is to use random forests. Scikit-learn provides the ensemble.IsolationForest method, which isolates the observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature. Here, the number of splittings needed to isolate a sample is equivalent to the path length from the root node to the terminating node.

Parameters

The following table consists of the parameters used by the sklearn.ensemble.IsolationForest method −

1. n_estimators − int, optional, default = 100. It represents the number of base estimators in the ensemble.

2. max_samples − int or float, optional, default = 'auto'. It represents the number of samples to be drawn from X to train each base estimator. If we choose int as its value, it will draw max_samples samples. If we choose float as its value, it will draw max_samples * X.shape[0] samples. And, if we choose 'auto' as its value, it will draw max_samples = min(256, n_samples).

3. contamination − 'auto' or float, optional, default = 'auto'. It provides the proportion of the outliers in the data set.
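The original parameter list is cut off here. As a minimal usage sketch of IsolationForest under the same fit/predict convention described above (the toy data values are illustrative, not from the original text) −

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X_train = 0.3 * rng.randn(100, 2)                              # inliers clustered around the origin
X_outliers = rng.uniform(low = -4, high = 4, size = (10, 2))   # scattered outliers, for comparison

iso = IsolationForest(n_estimators = 100, contamination = 0.1, random_state = 0)
iso.fit(X_train)

# 1 marks inliers, -1 marks outliers
print(iso.predict([[0.1, 0.0], [3.5, -3.5]]))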
Scikit Learn – Boosting Methods

In this chapter, we will learn about the boosting methods in Sklearn, which enable building an ensemble model.

Boosting methods build an ensemble model in an incremental way. The main principle is to build the model incrementally by training each base model estimator sequentially. In order to build a powerful ensemble, these methods combine several weak learners which are sequentially trained over multiple iterations of the training data. The sklearn.ensemble module has the following two boosting methods.

AdaBoost

It is one of the most successful boosting ensemble methods, whose key idea lies in the way it gives weights to the instances in the dataset. The algorithm re-weights the instances so that subsequent models pay more attention to the instances that were misclassified earlier.

Classification with AdaBoost

For creating an AdaBoost classifier, the Scikit-learn module provides sklearn.ensemble.AdaBoostClassifier. While building this classifier, the main parameter this module uses is base_estimator. Here, base_estimator is the base estimator from which the boosted ensemble is built. If we set this parameter's value to None, the base estimator would be DecisionTreeClassifier(max_depth = 1).

Implementation example

In the following example, we are building an AdaBoost classifier by using sklearn.ensemble.AdaBoostClassifier and also predicting and checking its score.

from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification
X, y = make_classification(n_samples = 1000, n_features = 10, n_informative = 2, n_redundant = 0, random_state = 0, shuffle = False)
ADBclf = AdaBoostClassifier(n_estimators = 100, random_state = 0)
ADBclf.fit(X, y)

Output

AdaBoostClassifier(algorithm = 'SAMME.R', base_estimator = None, learning_rate = 1.0, n_estimators = 100, random_state = 0)

Example

Once fitted, we can predict for new values as follows −

print(ADBclf.predict([[0, 2, 3, 0, 1, 1, 1, 1, 2, 2]]))

Output

[1]

Example

Now we can check the score as follows −

ADBclf.score(X, y)

Output

0.995

Example

We can also use an sklearn dataset to build a classifier using the AdaBoost method. In the example given below, we are using the Pima-Indian dataset.

from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import AdaBoostClassifier
path = r"C:pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names = headernames)
array = data.values
X = array[:,0:8]
Y = array[:,8]
seed = 5
kfold = KFold(n_splits = 10, shuffle = True, random_state = seed)
num_trees = 100
ADBclf = AdaBoostClassifier(n_estimators = num_trees)
results = cross_val_score(ADBclf, X, Y, cv = kfold)
print(results.mean())

Output

0.7851435406698566

Regression with AdaBoost

For creating a regressor with the AdaBoost method, the Scikit-learn library provides sklearn.ensemble.AdaBoostRegressor. While building the regressor, it uses the same parameters as used by sklearn.ensemble.AdaBoostClassifier.

Implementation example

In the following example, we are building an AdaBoost regressor by using sklearn.ensemble.AdaBoostRegressor and also predicting new values by using the predict() method.
from sklearn.ensemble import AdaBoostRegressor
from sklearn.datasets import make_regression
X, y = make_regression(n_features = 10, n_informative = 2, random_state = 0, shuffle = False)
ADBregr = AdaBoostRegressor(random_state = 0, n_estimators = 100)
ADBregr.fit(X, y)

Output

AdaBoostRegressor(base_estimator = None, learning_rate = 1.0, loss = 'linear', n_estimators = 100, random_state = 0)

Example

Once fitted, we can predict from the regression model as follows −

print(ADBregr.predict([[0, 2, 3, 0, 1, 1, 1, 1, 2, 2]]))

Output

[85.50955817]

Gradient Tree Boosting

It is also called Gradient Boosted Regression Trees (GBRT). It is basically a generalization of boosting to arbitrary differentiable loss functions. It produces a prediction model in the form of an ensemble of weak prediction models. It can be used for regression and classification problems. Its main advantage lies in the fact that it naturally handles mixed-type data.

Classification with Gradient Tree Boost

For creating a Gradient Tree Boost classifier, the Scikit-learn module provides sklearn.ensemble.GradientBoostingClassifier. While building this classifier, the main parameter this module uses is 'loss'. Here, 'loss' is the loss function to be optimized. If we choose loss = deviance, it refers to deviance for classification with probabilistic outputs. On the other hand, if we set this parameter's value to exponential, it recovers the AdaBoost algorithm. The parameter n_estimators controls the number of weak learners. A hyper-parameter named learning_rate (in the range (0.0, 1.0]) controls overfitting via shrinkage.

Implementation example

In the following example, we are building a Gradient Boosting classifier by using sklearn.ensemble.GradientBoostingClassifier. We are fitting this classifier with 50 weak learners.

from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier
X, y = make_hastie_10_2(random_state = 0)
X_train, X_test = X[:5000], X[5000:]
y_train, y_test = y[:5000], y[5000:]
GDBclf = GradientBoostingClassifier(n_estimators = 50, learning_rate = 1.0, max_depth = 1, random_state = 0).fit(X_train, y_train)
GDBclf.score(X_test, y_test)

Output

0.8724285714285714

Example

We can also use an sklearn dataset to build a classifier using the Gradient Boosting Classifier. As in the following example, we are using the Pima-Indian dataset.

from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingClassifier
path = r"C:pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names = headernames)
array = data.values
X = array[:,0:8]
Y = array[:,8]
seed = 5
kfold = KFold(n_splits = 10, shuffle = True, random_state = seed)
num_trees = 100
max_features = 5
GDBclf = GradientBoostingClassifier(n_estimators = num_trees, max_features = max_features)
results = cross_val_score(GDBclf, X, Y, cv = kfold)
print(results.mean())

Output

0.7946582356674234

Regression with Gradient Tree Boost

For creating a regressor with the Gradient Tree Boost method, the Scikit-learn library provides sklearn.ensemble.GradientBoostingRegressor. It can specify the loss function for regression via the parameter named loss. The default value for loss is 'ls'.
Implementation example

In the following example, we are building a Gradient Boosting regressor by using sklearn.ensemble.GradientBoostingRegressor and also finding the mean squared error by using the mean_squared_error() method.

import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
X, y = make_friedman1(n_samples = 2000, random_state = 0, noise = 1.0)
X_train, X_test = X[:1000], X[1000:]
y_train, y_test = y[:1000], y[1000:]
GDBreg = GradientBoostingRegressor(n_estimators = 80, learning_rate = 0.1, max_depth = 1, random_state = 0, loss = 'ls').fit(X_train, y_train)

Once fitted, we can find the mean squared error as follows −

mean_squared_error(y_test, GDBreg.predict(X_test))

Output

5.391246106657164
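To see how the shrinkage controlled by learning_rate and the number of weak learners interact, a small follow-up sketch (continuing from the GDBreg model fitted above) can track the test error after each boosting stage via staged_predict −

# staged_predict yields the ensemble's prediction after each boosting stage,
# so we can watch the test error fall (and possibly rise again) as trees are added.
errors = [mean_squared_error(y_test, y_pred)
          for y_pred in GDBreg.staged_predict(X_test)]
best_stage = int(np.argmin(errors)) + 1
print("Lowest test MSE of", min(errors), "reached with", best_stage, "trees")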
Scikit Learn – Support Vector Machines

This chapter deals with a machine learning method termed Support Vector Machines (SVMs).

Introduction

Support vector machines (SVMs) are powerful yet flexible supervised machine learning methods used for classification, regression and outlier detection. SVMs are very efficient in high-dimensional spaces and are generally used for classification problems. SVMs are popular and memory efficient because they use a subset of the training points in the decision function.

The main goal of SVMs is to divide the dataset into classes in order to find a maximum marginal hyperplane (MMH), which is done in the following two steps −

Support Vector Machines will first generate hyperplanes iteratively that separate the classes in the best way.

After that, it will choose the hyperplane that separates the classes with the maximum margin.

Some important concepts in SVM are as follows −

Support Vectors − They may be defined as the data points which are closest to the hyperplane. Support vectors help in deciding the separating line.

Hyperplane − The decision plane or space that divides a set of objects having different classes.

Margin − The gap between the two lines on the closest data points of different classes is called the margin.

The following diagrams will give you an insight into these SVM concepts.

SVM in Scikit-learn supports both sparse and dense sample vectors as input.

Classification of SVM

Scikit-learn provides three classes, namely SVC, NuSVC and LinearSVC, which can perform multiclass classification.

SVC

It is C-support vector classification whose implementation is based on libsvm. The module used by scikit-learn is sklearn.svm.SVC. This class handles multiclass support according to a one-vs-one scheme.

Parameters

The following table consists of the parameters used by the sklearn.svm.SVC class −

1. C − float, optional, default = 1.0. It is the penalty parameter of the error term.

2. kernel − string, optional, default = 'rbf'. This parameter specifies the type of kernel to be used in the algorithm. We can choose any one among 'linear', 'poly', 'rbf', 'sigmoid' and 'precomputed'. The default value of kernel is 'rbf'.

3. degree − int, optional, default = 3. It represents the degree of the 'poly' kernel function and is ignored by all other kernels.

4. gamma − {'scale', 'auto'} or float, optional, default = 'scale'. It is the kernel coefficient for the kernels 'rbf', 'poly' and 'sigmoid'. If you choose the default, i.e. gamma = 'scale', then the value of gamma used by SVC is 1/(n_features * X.var()). On the other hand, if gamma = 'auto', it uses 1/n_features.

5. coef0 − float, optional, default = 0.0. An independent term in the kernel function, which is only significant in 'poly' and 'sigmoid'.

6. tol − float, optional, default = 1.e-3. This parameter represents the stopping criterion for iterations.

7. shrinking − Boolean, optional, default = True. This parameter represents whether we want to use the shrinking heuristic or not.

8. verbose − Boolean, default = False. It enables or disables verbose output. Its default value is False.

9. probability − Boolean, optional, default = False. This parameter enables or disables probability estimates. The default value is False, and it must be enabled before we call fit.

10. max_iter − int, optional, default = -1. As the name suggests, it represents the maximum number of iterations within the solver. The value -1 means there is no limit on the number of iterations.
11. cache_size − float, optional. This parameter specifies the size of the kernel cache. The value is in MB (megabytes).

12. random_state − int, RandomState instance or None, optional, default = None. This parameter represents the seed of the pseudo random number generator which is used while shuffling the data. Following are the options −

int − In this case, random_state is the seed used by the random number generator.

RandomState instance − In this case, random_state is the random number generator.

None − In this case, the random number generator is the RandomState instance used by np.random.

13. class_weight − {dict, 'balanced'}, optional. This parameter sets the parameter C of class j to class_weight[j] * C for SVC. If we use the default option, it means all the classes are supposed to have weight one. On the other hand, if you choose class_weight = 'balanced', it will use the values of y to automatically adjust the weights.

14. decision_function_shape − 'ovo', 'ovr', default = 'ovr'. This parameter decides whether the algorithm will return an 'ovr' (one-vs-rest) decision function of the same shape as all other classifiers, or the original 'ovo' (one-vs-one) decision function of libsvm.

15. break_ties − Boolean, optional, default = False. If True, predict will break ties according to the confidence values of decision_function; if False, predict will return the first class among the tied classes.

Attributes

The following table consists of the attributes used by the sklearn.svm.SVC class −

1. support_ − array-like, shape = [n_SV]. It returns the indices of the support vectors.

2. support_vectors_ − array-like, shape = [n_SV, n_features]. It returns the support vectors.

3. n_support_ − array-like, dtype = int32, shape = [n_class]. It represents the number of support vectors for each class.

4. dual_coef_ − array, shape = [n_class-1, n_SV]. These are the coefficients of the support vectors in the decision function.

5. coef_ − array, shape = [n_class * (n_class-1)/2, n_features]. This attribute, only available in the case of a linear kernel, provides the weights assigned to the features.

6. intercept_ − array, shape = [n_class * (n_class-1)/2]. It represents the independent term (constant) in the decision function.

7. fit_status_ − int. The output is 0 if the model is correctly fitted, and 1 if it is incorrectly fitted.

8. classes_ − array of shape = [n_classes]. It gives the labels of the classes.

Implementation Example

Like other classifiers, SVC also has to be fitted with the following two arrays −

An array X holding the training samples. It is of size [n_samples, n_features].

An array Y holding the target values, i.e. class labels, for the training samples. It is of size [n_samples].

The following Python script uses the sklearn.svm.SVC class −
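The original script is cut off here; a minimal sketch of fitting SVC on two tiny toy arrays (the data values below are illustrative, not from the original text) could look like this −

import numpy as np
from sklearn.svm import SVC

# Four training samples with two features each, belonging to two classes
X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
Y = np.array([1, 1, 2, 2])

SVCClf = SVC(kernel = 'linear', gamma = 'scale', shrinking = False)
SVCClf.fit(X, Y)

# With a linear kernel, the learned weights and intercept are available
print(SVCClf.coef_)                      # weight assigned to each feature
print(SVCClf.predict([[-0.5, -0.8]]))    # predicts class 1 for a point near the first group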
Scikit Learn – Modelling Process

This chapter deals with the modelling process involved in Sklearn. Let us understand the same in detail and begin with dataset loading.

Dataset Loading

A collection of data is called a dataset. It has the following two components −

Features − The variables of the data are called its features. They are also known as predictors, inputs or attributes.

Feature matrix − It is the collection of features, in case there is more than one.

Feature Names − It is the list of all the names of the features.

Response − It is the output variable that basically depends upon the feature variables. It is also known as target, label or output.

Response Vector − It is used to represent the response column. Generally, we have just one response column.

Target Names − They represent the possible values taken by a response vector.

Scikit-learn has a few example datasets like iris and digits for classification and the Boston house prices for regression.

Example

Following is an example to load the iris dataset −

from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names
target_names = iris.target_names
print("Feature names:", feature_names)
print("Target names:", target_names)
print("\nFirst 10 rows of X:\n", X[:10])

Output

Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Target names: ['setosa' 'versicolor' 'virginica']
First 10 rows of X:
[[5.1 3.5 1.4 0.2]
[4.9 3. 1.4 0.2]
[4.7 3.2 1.3 0.2]
[4.6 3.1 1.5 0.2]
[5. 3.6 1.4 0.2]
[5.4 3.9 1.7 0.4]
[4.6 3.4 1.4 0.3]
[5. 3.4 1.5 0.2]
[4.4 2.9 1.4 0.2]
[4.9 3.1 1.5 0.1]]

Splitting the dataset

To check the accuracy of our model, we can split the dataset into two pieces − a training set and a testing set. We use the training set to train the model and the testing set to test the model. After that, we can evaluate how well our model did.

Example

The following example will split the data in a 70:30 ratio, i.e. 70% of the data will be used as training data and 30% as testing data. The dataset is the iris dataset, as in the above example.

from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 1)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

Output

(105, 4)
(45, 4)
(105,)
(45,)

As seen in the example above, it uses the train_test_split() function of scikit-learn to split the dataset. This function has the following arguments −

X, y − Here, X is the feature matrix and y is the response vector, which need to be split.

test_size − This represents the ratio of test data to the total given data. As in the above example, we are setting test_size = 0.3 for 150 rows of X. It will produce test data of 150*0.3 = 45 rows.

random_state − It is used to guarantee that the split will always be the same. This is useful in situations where you want reproducible results.

Train the Model

Next, we can use our dataset to train a prediction model. As discussed, scikit-learn has a wide range of Machine Learning (ML) algorithms which have a consistent interface for fitting, predicting accuracy, recall etc.

Example

In the example below, we are going to use the KNN (K nearest neighbors) classifier. Don't go into the details of the KNN algorithm, as there will be a separate chapter for that. This example is used to make you understand the implementation part only.
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state = 1)
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
classifier_knn = KNeighborsClassifier(n_neighbors = 3)
classifier_knn.fit(X_train, y_train)
y_pred = classifier_knn.predict(X_test)
# Finding accuracy by comparing actual response values (y_test) with predicted response values (y_pred)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
# Providing sample data; the model will make predictions out of that data
sample = [[5, 5, 3, 2], [2, 4, 3, 5]]
preds = classifier_knn.predict(sample)
pred_species = [iris.target_names[p] for p in preds]
print("Predictions:", pred_species)

Output

Accuracy: 0.9833333333333333
Predictions: ['versicolor', 'virginica']

Model Persistence

Once you train the model, it is desirable to persist the model for future use so that we do not need to retrain it again and again. It can be done with the help of the dump and load features of the joblib package.

Consider the example below, in which we will be saving the above trained model (classifier_knn) for future use −

import joblib
joblib.dump(classifier_knn, 'iris_classifier_knn.joblib')

The above code will save the model into a file named iris_classifier_knn.joblib. Now, the object can be reloaded from the file with the help of the following code −

joblib.load('iris_classifier_knn.joblib')

Preprocessing the Data

As we are dealing with lots of data and that data is in raw form, we need to convert it into meaningful data before inputting it into machine learning algorithms. This process is called preprocessing the data. Scikit-learn has a package named preprocessing for this purpose. The preprocessing package has the following techniques −

Binarisation

This preprocessing technique is used when we need to convert our numerical values into Boolean values.

Example

import numpy as np
from sklearn import preprocessing
input_data = np.array(
   [[2.1, -1.9, 5.5],
   [-1.5, 2.4, 3.5],
   [0.5, -7.9, 5.6],
   [5.9, 2.3, -5.8]]
)
data_binarized = preprocessing.Binarizer(threshold = 0.5).transform(input_data)
print("\nBinarized data:\n", data_binarized)

In the above example, we used a threshold value of 0.5, and that is why all the values above 0.5 are converted to 1, and all the values below 0.5 are converted to 0.

Output

Binarized data:
[[1. 0. 1.]
[0. 1. 1.]
[0. 0. 1.]
[1. 1. 0.]]

Mean Removal

This technique is used to eliminate the mean from the feature vector so that every feature is centered on zero.

Example

import numpy as np
from sklearn import preprocessing
input_data = np.array(
   [[2.1, -1.9, 5.5],
   [-1.5, 2.4, 3.5],
   [0.5, -7.9, 5.6],
   [5.9, 2.3, -5.8]]
)
data_scaled = preprocessing.scale(input_data)
print("Mean =", data_scaled.mean(axis = 0))
print("Std deviation =", data_scaled.std(axis = 0))
Scikit Learn – KNN Learning

k-NN (k-Nearest Neighbor), one of the simplest machine learning algorithms, is non-parametric and lazy in nature. Non-parametric means that there is no assumption about the underlying data distribution, i.e. the model structure is determined from the dataset. Lazy or instance-based learning means that there is no dedicated training phase for model generation; the whole training data is used during the testing phase.

The k-NN algorithm consists of the following two steps −

Step 1

In this step, it computes and stores the k nearest neighbors for each sample in the training set.

Step 2

In this step, for an unlabeled sample, it retrieves the k nearest neighbors from the dataset. Then, among these k nearest neighbors, it predicts the class through voting (the class with the majority of votes wins).

The module sklearn.neighbors, which implements the k-nearest neighbors algorithm, provides the functionality for unsupervised as well as supervised neighbors-based learning methods.

The unsupervised nearest neighbors implement different algorithms (BallTree, KDTree or Brute Force) to find the nearest neighbor(s) for each sample. This unsupervised version is basically only step 1, discussed above, and is the foundation of many algorithms (KNN and K-means being the famous ones) which require the neighbor search. In simple words, it is an unsupervised learner for implementing neighbor searches. On the other hand, supervised neighbors-based learning is used for classification as well as regression.

Unsupervised KNN Learning

As discussed, there exist many algorithms like KNN and K-Means that require nearest neighbor searches. That is why Scikit-learn decided to implement the neighbor search part as its own "learner". The reason behind making the neighbor search a separate learner is that computing all pairwise distances to find a nearest neighbor is obviously not very efficient. Let's see the module used by Sklearn to implement unsupervised nearest neighbor learning, along with an example.

Scikit-learn module

sklearn.neighbors.NearestNeighbors is the module used to implement unsupervised nearest neighbor learning. It uses specific nearest neighbor algorithms named BallTree, KDTree or Brute Force. In other words, it acts as a uniform interface to these three algorithms.

Parameters

The following table consists of the parameters used by the NearestNeighbors module −

1. n_neighbors − int, optional. The number of neighbors to get. The default value is 5.

2. radius − float, optional. It limits the distance of neighbors to return. The default value is 1.0.

3. algorithm − {'auto', 'ball_tree', 'kd_tree', 'brute'}, optional. This parameter takes the algorithm (BallTree, KDTree or Brute-force) you want to use to compute the nearest neighbors. If you provide 'auto', it will attempt to decide the most appropriate algorithm based on the values passed to the fit method.

4. leaf_size − int, optional. It can affect the speed of construction and query, as well as the memory required to store the tree. It is passed to BallTree or KDTree. Although the optimal value depends on the nature of the problem, its default value is 30.

5. metric − string or callable. It is the metric to use for distance computation between points. We can pass it as a string or a callable function. In the case of a callable function, the metric is called on each pair of rows and the resulting value is recorded. It is less efficient than passing the metric name as a string.
We can choose the metric from scikit-learn or scipy.spatial.distance. The valid values are as follows −

Scikit-learn − ['cosine', 'manhattan', 'euclidean', 'l1', 'l2', 'cityblock']

Scipy.spatial.distance − ['braycurtis', 'canberra', 'chebyshev', 'dice', 'hamming', 'jaccard', 'correlation', 'kulsinski', 'mahalanobis', 'minkowski', 'rogerstanimoto', 'russellrao', 'sokalmichener', 'sokalsneath', 'seuclidean', 'sqeuclidean', 'yule']

The default metric is 'minkowski'.

6. p − integer, optional. It is the parameter for the Minkowski metric. The default value is 2, which is equivalent to using the Euclidean distance (l2).

7. metric_params − dict, optional. These are the additional keyword arguments for the metric function. The default value is None.

8. n_jobs − int or None, optional. It represents the number of parallel jobs to run for the neighbor search. The default value is None.

Implementation Example

The example below will find the nearest neighbors between two sets of data by using the sklearn.neighbors.NearestNeighbors module.

First, we need to import the required module and packages −

from sklearn.neighbors import NearestNeighbors
import numpy as np

Now, after importing the packages, define the set of data in between which we want to find the nearest neighbors −

Input_data = np.array([[-1, 1], [-2, 2], [-3, 3], [1, 2], [2, 3], [3, 4], [4, 5]])

Next, apply the unsupervised learning algorithm as follows −

nrst_neigh = NearestNeighbors(n_neighbors = 3, algorithm = 'ball_tree')

Next, fit the model with the input data set.

nrst_neigh.fit(Input_data)

Now, find the K-neighbors of the data set. It will return the indices and distances of the neighbors of each point.

distances, indices = nrst_neigh.kneighbors(Input_data)
indices

Output

array(
   [
      [0, 1, 3],
      [1, 2, 0],
      [2, 1, 0],
      [3, 4, 0],
      [4, 5, 3],
      [5, 6, 4],
      [6, 5, 4]
   ], dtype = int64
)

distances

Output

array(
   [
      [0. , 1.41421356, 2.23606798],
      [0. , 1.41421356, 1.41421356],
      [0. , 1.41421356, 2.82842712],
      [0. , 1.41421356, 2.23606798],
      [0. , 1.41421356, 1.41421356],
      [0. , 1.41421356, 1.41421356],
      [0. , 1.41421356, 2.82842712]
   ]
)

The above output shows that the nearest neighbor of each point is the point itself, i.e. at distance zero. This is because the query set matches the training set.

Example

We can also show a connection between neighboring points by producing a sparse graph as follows −

nrst_neigh.kneighbors_graph(Input_data).toarray()

Output

array(
   [
      [1., 1., 0., 1., 0., 0., 0.],
      [1., 1., 1., 0., 0., 0., 0.],
      [1., 1., 1., 0., 0., 0., 0.],
      [1., 0., 0., 1., 1., 0., 0.],
      [0., 0., 0., 1., 1., 1., 0.],
      [0., 0., 0., 0., 1., 1., 1.],
      [0., 0., 0., 0., 1., 1., 1.]
   ]
)

Once we fit the unsupervised NearestNeighbors model, the data will be stored in a data structure based on the value set for the argument 'algorithm'. After that, we can use this unsupervised learner's kneighbors in a model which requires neighbor searches.

Complete working/executable program

from sklearn.neighbors import NearestNeighbors
import numpy as np
Input_data = np.array([[-1, 1], [-2, 2], [-3, 3], [1, 2], [2, 3], [3, 4], [4, 5]])
nrst_neigh = NearestNeighbors(n_neighbors = 3, algorithm = 'ball_tree').fit(Input_data)
distances, indices = nrst_neigh.kneighbors(Input_data)
print(indices); print(distances)
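The radius parameter listed above supports a radius-based query as well. A minimal sketch continuing with the Input_data and the fitted nrst_neigh model from the example above (the radius value of 1.5 is illustrative) −

# Find all neighbors within a fixed radius of the first point instead of a fixed k
radius_distances, radius_indices = nrst_neigh.radius_neighbors(Input_data[:1], radius = 1.5)
print(radius_indices[0])    # indices of every training point within distance 1.5
print(radius_distances[0])  # the corresponding distances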
Scikit Learn – Estimator API

In this chapter, we will learn about the Estimator API (application programming interface). Let us begin by understanding what an Estimator API is.

What is Estimator API

It is one of the main APIs implemented by Scikit-learn. It provides a consistent interface for a wide range of ML applications; that is why all machine learning algorithms in Scikit-Learn are implemented via the Estimator API. The object that learns from the data (fitting the data) is an estimator. It can be used with any of the algorithms like classification, regression or clustering, or even with a transformer that extracts useful features from raw data.

For fitting the data, all estimator objects expose a fit method that takes a dataset, shown as follows −

estimator.fit(data)

All the parameters of an estimator can be set when it is instantiated, and each can be read back via the corresponding attribute, as follows −

estimator = Estimator(param1 = 1, param2 = 2)
estimator.param1

The output of the above would be 1.

Once data is fitted with an estimator, parameters are estimated from the data at hand. Now, all the estimated parameters will be attributes of the estimator object, ending with an underscore, as follows −

estimator.estimated_param_

Use of Estimator API

The main uses of estimators are as follows −

Estimation and decoding of a model

An estimator object is used for estimation and decoding of a model. Furthermore, the model is estimated as a deterministic function of the following −

The parameters which are provided in object construction.

The global random state (numpy.random) if the estimator's random_state parameter is set to None.

Any data passed to the most recent call to fit, fit_transform, or fit_predict.

Any data passed in a sequence of calls to partial_fit.

Mapping non-rectangular data representation into rectangular data

It maps a non-rectangular data representation into rectangular data. In simple words, it takes input where each sample is not represented as an array-like object of fixed length, and produces an array-like object of features for each sample.

Distinction between core and outlying samples

It models the distinction between core and outlying samples by using the following methods −

fit

fit_predict, if transductive

predict, if inductive

Guiding Principles

While designing the Scikit-Learn API, the following guiding principles were kept in mind −

Consistency

This principle states that all the objects should share a common interface drawn from a limited set of methods. The documentation should also be consistent.

Limited object hierarchy

This guiding principle says −

Algorithms should be represented by Python classes.

Datasets should be represented in a standard format like NumPy arrays, Pandas DataFrames or SciPy sparse matrices.

Parameter names should use standard Python strings.

Composition

As we know, ML algorithms can be expressed as a sequence of many fundamental algorithms. Scikit-learn makes use of these fundamental algorithms whenever needed.

Sensible defaults

According to this principle, the Scikit-learn library defines an appropriate default value whenever ML models require user-specified parameters.

Inspection

As per this guiding principle, every specified parameter value is exposed as a public attribute.

Steps in using Estimator API

Following are the steps in using the Scikit-Learn estimator API −

Step 1: Choose a class of model

In this first step, we need to choose a class of model. It can be done by importing the appropriate Estimator class from Scikit-learn.
Step 2: Choose model hyperparameters

In this step, we need to choose the class model hyperparameters. It can be done by instantiating the class with the desired values.

Step 3: Arranging the data

Next, we need to arrange the data into a features matrix (X) and a target vector (y).

Step 4: Model Fitting

Now, we need to fit the model to the data. It can be done by calling the fit() method of the model instance.

Step 5: Applying the model

After fitting the model, we can apply it to new data. For supervised learning, use the predict() method to predict the labels for unknown data. For unsupervised learning, use predict() or transform() to infer properties of the data.

Supervised Learning Example

Here, as an example of this process, we take the common case of fitting a line to (x, y) data, i.e. simple linear regression.

First, we need to load the dataset; we are using the iris dataset −

Example

import seaborn as sns
iris = sns.load_dataset('iris')
X_iris = iris.drop('species', axis = 1)
X_iris.shape

Output

(150, 4)

Example

y_iris = iris['species']
y_iris.shape

Output

(150,)

Example

Now, for this regression example, we are going to use the following sample data −

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
rng = np.random.RandomState(35)
x = 10*rng.rand(40)
y = 2*x - 1 + rng.randn(40)
plt.scatter(x, y);

Output

(A scatter plot of the sample data.)

So, we have the above data for our linear regression example. Now, with this data, we can apply the above-mentioned steps.

Choose a class of model

Here, to compute a simple linear regression model, we need to import the linear regression class as follows −

from sklearn.linear_model import LinearRegression

Choose model hyperparameters

Once we choose a class of model, we need to make some important choices which are often represented as hyperparameters, i.e. the parameters that must be set before the model is fit to the data. Here, for this example of linear regression, we would like to fit the intercept by using the fit_intercept hyperparameter, as follows −

Example

model = LinearRegression(fit_intercept = True)
model

Output

LinearRegression(copy_X = True, fit_intercept = True, n_jobs = None, normalize = False)

Arranging the data

Now, as we know, our target variable y is in the correct form, i.e. a 1-D array of length n_samples. But we need to reshape the feature matrix X to make it a matrix of size [n_samples, n_features]. It can be done as follows −

Example

X = x[:, np.newaxis]
X.shape

Output

(40, 1)

Model fitting

Once we arrange the data, it is time to fit the model, i.e. to apply our model to the data. This can be done with the help of the fit() method as follows −

Example

model.fit(X, y)

Output

LinearRegression(copy_X = True, fit_intercept = True, n_jobs = None, normalize = False)

In Scikit-learn, all model parameters learned during the fit() process have trailing underscores. For this example, the fitted slope and intercept are available as model.coef_ and model.intercept_.
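As a minimal sketch of the remaining step − applying the fitted model to new data, as described in Step 5 above (the xfit grid below is illustrative) −

# Predict on a grid of new x values and plot the fitted line over the data
xfit = np.linspace(-1, 11)
Xfit = xfit[:, np.newaxis]          # reshape into [n_samples, n_features], as above
yfit = model.predict(Xfit)
plt.scatter(x, y)
plt.plot(xfit, yfit);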
Scikit Learn – Linear Modeling

This chapter will help you in learning about linear modeling in Scikit-Learn. Let us begin by understanding what linear regression is in Sklearn. The following table lists out the various linear models provided by Scikit-Learn −

1. Linear Regression − It is one of the best statistical models; it studies the relationship between a dependent variable (Y) and a given set of independent variables (X).

2. Logistic Regression − Logistic regression, despite its name, is a classification algorithm rather than a regression algorithm. Based on a given set of independent variables, it is used to estimate a discrete value (0 or 1, yes/no, true/false).

3. Ridge Regression − Ridge regression or Tikhonov regularization is the regularization technique that performs L2 regularization. It modifies the loss function by adding a penalty (shrinkage quantity) equivalent to the square of the magnitude of the coefficients.

4. Bayesian Ridge Regression − Bayesian regression allows a natural mechanism to survive insufficient data or poorly distributed data by formulating linear regression using probability distributions rather than point estimates.

5. LASSO − LASSO is the regularization technique that performs L1 regularization. It modifies the loss function by adding a penalty (shrinkage quantity) equivalent to the sum of the absolute values of the coefficients.

6. Multi-task LASSO − It allows fitting multiple regression problems jointly, enforcing the selected features to be the same for all the regression problems, also called tasks. Sklearn provides a linear model named MultiTaskLasso, trained with a mixed L1/L2-norm for regularization, which estimates sparse coefficients for multiple regression problems jointly.

7. Elastic-Net − The Elastic-Net is a regularized regression method that linearly combines both penalties, i.e. the L1 and L2 of the Lasso and Ridge regression methods. It is useful when there are multiple correlated features.

8. Multi-task Elastic-Net − It is an Elastic-Net model that allows fitting multiple regression problems jointly, enforcing the selected features to be the same for all the regression problems, also called tasks.
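As a minimal sketch of how a few of these models are used through the same fit/predict interface (the toy regression data below is illustrative, not from the original text) −

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# A small regression problem where only a few features are truly informative
X, y = make_regression(n_samples = 200, n_features = 10, n_informative = 3, noise = 10.0, random_state = 0)

for model in (LinearRegression(), Ridge(alpha = 1.0), Lasso(alpha = 1.0)):
   model.fit(X, y)
   # Lasso's L1 penalty tends to drive the coefficients of uninformative features to exactly zero
   print(type(model).__name__, np.round(model.coef_, 1))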