Scikit Learn

Scikit-learn (Sklearn) is one of the most widely used and robust libraries for machine learning in Python. It provides a selection of efficient tools for machine learning and statistical modeling, including classification, regression, clustering and dimensionality reduction, via a consistent interface in Python. This library, which is largely written in Python, is built upon NumPy, SciPy and Matplotlib.
Scikit Learn – Clustering Methods

Here, we will study the clustering methods in Sklearn, which help identify similarity among data samples.

Clustering methods are among the most useful unsupervised ML methods. They are used to find similarity and relationship patterns among data samples, and then to group those samples into clusters based on the similarity of their features. Clustering is important because it determines the intrinsic grouping among unlabeled data.

The Scikit-learn library has the sklearn.cluster module to perform clustering of unlabeled data. Under this module, scikit-learn has the following clustering methods −

KMeans

This algorithm computes the centroids and iterates until it finds the optimal centroids. It requires the number of clusters to be specified, so it assumes that this number is already known. The main logic of this algorithm is to cluster the data by separating the samples into n groups of equal variance, minimizing a criterion known as the inertia. The number of clusters is represented by ‘K’.

Scikit-learn has the sklearn.cluster.KMeans class to perform K-Means clustering. While computing the cluster centers and the value of inertia, the parameter named sample_weight allows sklearn.cluster.KMeans to assign more weight to some samples.

Affinity Propagation

This algorithm is based on the concept of ‘message passing’ between different pairs of samples until convergence. It does not require the number of clusters to be specified before running the algorithm. The algorithm has a time complexity of the order O(N²T), where N is the number of samples and T is the number of iterations until convergence, which is its biggest disadvantage.

Scikit-learn has the sklearn.cluster.AffinityPropagation class to perform Affinity Propagation clustering.

Mean Shift

This algorithm mainly discovers blobs in a smooth density of samples. It assigns the data points to clusters iteratively by shifting points towards the highest density of data points.
It sets the number of clusters automatically, relying on a parameter named bandwidth that dictates the size of the region to search through.

Scikit-learn has the sklearn.cluster.MeanShift class to perform Mean Shift clustering.

Spectral Clustering

Before clustering, this algorithm uses the eigenvalues, i.e. the spectrum, of the similarity matrix of the data to perform dimensionality reduction to fewer dimensions. Its use is not advisable when there is a large number of clusters.

Scikit-learn has the sklearn.cluster.SpectralClustering class to perform Spectral clustering.

Hierarchical Clustering

This algorithm builds nested clusters by merging or splitting the clusters successively. This cluster hierarchy is represented as a dendrogram, i.e. a tree. It falls into the following two categories −

Agglomerative hierarchical algorithms − In this kind of hierarchical algorithm, every data point starts as a single cluster. The algorithm then successively merges pairs of clusters. This is the bottom-up approach.

Divisive hierarchical algorithms − In this kind of hierarchical algorithm, all data points start in one big cluster. The clustering process then divides this big cluster into smaller clusters. This is the top-down approach.

Scikit-learn has the sklearn.cluster.AgglomerativeClustering class to perform Agglomerative Hierarchical clustering.

DBSCAN

It stands for “Density-based spatial clustering of applications with noise”. This algorithm is based on the intuitive notion of “clusters” and “noise”: clusters are dense regions in the data space, separated by regions of lower density.

Scikit-learn has the sklearn.cluster.DBSCAN class to perform DBSCAN clustering. Two important parameters, namely min_samples and eps, are used by this algorithm to define what dense means.
A higher value of min_samples or a lower value of eps indicates a higher density of data points necessary to form a cluster.

OPTICS

It stands for “Ordering points to identify the clustering structure”. This algorithm also finds density-based clusters in spatial data. Its basic working logic is like that of DBSCAN. It addresses a major weakness of the DBSCAN algorithm, the problem of detecting meaningful clusters in data of varying density, by ordering the points of the database in such a way that spatially closest points become neighbors in the ordering.

Scikit-learn has the sklearn.cluster.OPTICS class to perform OPTICS clustering.

BIRCH

It stands for “Balanced iterative reducing and clustering using hierarchies”. It is used to perform hierarchical clustering over large data sets. It builds a tree named CFT, i.e. Clustering Feature Tree, for the given data. The advantage of the CFT is that its nodes, called CF (Clustering Feature) nodes, hold the necessary information for clustering, which removes the need to hold the entire input data in memory.

Scikit-learn has the sklearn.cluster.Birch class to perform BIRCH clustering.

Comparing Clustering Algorithms

The following table compares the clustering algorithms in scikit-learn by parameters, scalability and metric.

Sr.No | Algorithm Name | Parameters | Scalability | Metric Used
1 | K-Means | No. of clusters | Very large n_samples | The distance between points
2 | Affinity Propagation | Damping | Not scalable with n_samples | Graph distance
3 | Mean-Shift | Bandwidth | Not scalable with n_samples | The distance between points
4 | Spectral Clustering | No. of clusters | Medium n_samples, small n_clusters | Graph distance
5 | Hierarchical Clustering | Distance threshold or No. of clusters | Large n_samples and large n_clusters | The distance between points
6 | DBSCAN | Size of neighborhood | Very large n_samples and medium n_clusters | Nearest point distance
7 | OPTICS | Minimum cluster membership | Very large n_samples and large n_clusters | The distance between points
8 | BIRCH | Threshold, Branching factor | Large n_samples and large n_clusters | The Euclidean distance between points

K-Means Clustering on Scikit-learn Digit dataset

In this example, we will apply K-means clustering to the digits dataset. The algorithm will identify similar digits without using the original label information. The implementation is done in a Jupyter notebook.

    %matplotlib inline
    import matplotlib.pyplot as plt
    import seaborn as sns; sns.set()
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import load_digits
    digits = load_digits()
    digits.data.shape

Output

    (1797, 64)

This output shows that the digits dataset has 1797 samples with 64 features.

Example

Now, perform the K-Means clustering as follows −

    kmeans = KMeans(n_clusters = 10, random_state = 0)
    clusters = kmeans.fit_predict(digits.data)
    kmeans.cluster_centers_.shape

Output

    (10, 64)

This output shows that K-means clustering created 10 cluster centers, each with 64 features.

Example

    fig, ax = plt.subplots(2, 5, figsize = (8, 3))
    centers = kmeans.cluster_centers_.reshape(10, 8, 8)
    for axi, center in zip(ax.flat,
Scikit Learn – K-Nearest Neighbors (KNN)

This chapter will help you understand the nearest neighbor methods in Sklearn. Neighbor-based learning methods are of two types, namely supervised and unsupervised. Supervised neighbors-based learning can be used for both classification and regression problems, but in industry it is mainly used for classification.

Neighbors-based learning methods do not have a specialised training phase and use all the training data at classification time. They also do not assume anything about the underlying data. That is why they are lazy and non-parametric in nature.

The main principle behind nearest neighbor methods is −

To find a predefined number of training samples closest in distance to the new data point.

To predict the label from these training samples.

Here, the number of samples can be a user-defined constant, as in K-nearest neighbor learning, or can vary based on the local density of points, as in radius-based neighbor learning.

sklearn.neighbors Module

Scikit-learn has the sklearn.neighbors module, which provides functionality for both unsupervised and supervised neighbors-based learning methods. As input, the classes in this module can handle either NumPy arrays or scipy.sparse matrices.

Types of algorithms

The different types of algorithms that can be used to implement neighbor-based methods are as follows −

Brute Force

The brute-force computation of distances between all pairs of points in the dataset provides the most naïve neighbor search implementation. Mathematically, for N samples in D dimensions, the brute-force approach scales as O[DN²].

For small data samples, this algorithm can be very useful, but it becomes infeasible as the number of samples grows. Brute-force neighbor search can be enabled by writing the keyword algorithm=’brute’.
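As a minimal sketch of the brute-force option described above, the following snippet runs an unsupervised neighbor search with sklearn.neighbors.NearestNeighbors and algorithm="brute". The six sample points and the choice of n_neighbors=2 are illustrative assumptions and do not come from the original text.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Illustrative data (an assumption): six points in two dimensions
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])

# algorithm="brute" computes the distance from each query to every sample
nbrs = NearestNeighbors(n_neighbors=2, algorithm="brute").fit(X)

# Querying with the training data itself: each point's closest
# neighbor is the point itself, followed by its nearest other point
distances, indices = nbrs.kneighbors(X)
print(indices)
print(distances)
```

Because every query is compared with every stored sample, the cost of this search grows directly with the number of samples, which is the limitation noted above.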
K-D Tree

The KD tree is one of the tree-based data structures invented to address the computational inefficiencies of the brute-force approach. The KD tree, short for K-dimensional tree, is a binary tree structure. It recursively partitions the parameter space along the data axes, dividing it into nested orthotropic regions into which the data points are filed.

Advantages

Following are some advantages of the K-D tree algorithm −

Construction is fast − As the partitioning is performed only along the data axes, the K-D tree’s construction is very fast.

Fewer distance computations − This algorithm needs very few distance computations to determine the nearest neighbor of a query point. It only takes O[log(N)] distance computations.

Disadvantages

Fast only for low-dimensional neighbor searches − It is very fast for low-dimensional (D < 20) neighbor searches, but it becomes inefficient as D grows.

K-D tree neighbor searches can be enabled by writing the keyword algorithm=’kd_tree’.

Ball Tree

The Ball tree data structure was developed to address the inefficiency of the KD tree in higher dimensions. Mathematically, it recursively divides the data into nodes defined by a centroid C and radius r, in such a way that each point in the node lies within the hyper-sphere defined by centroid C and radius r. It uses the triangle inequality, given below, to reduce the number of candidate points for a neighbor search −

$$\lvert X+Y\rvert \leq \lvert X\rvert + \lvert Y\rvert$$

Advantages

Following are some advantages of the Ball Tree algorithm −

Efficient on highly structured data − As the ball tree partitions the data into a series of nesting hyper-spheres, it is efficient on highly structured data.

Out-performs KD tree − The ball tree out-performs the KD tree in high dimensions because of the spherical geometry of its nodes.
Disadvantages

Costly − Partitioning the data into a series of nesting hyper-spheres makes its construction very costly.

Ball tree neighbor searches can be enabled by writing the keyword algorithm=’ball_tree’.

Choosing Nearest Neighbors Algorithm

The choice of an optimal algorithm for a given dataset depends upon the following factors −

Number of samples (N) and Dimensionality (D)

These are the most important factors to consider when choosing a Nearest Neighbor algorithm, for the reasons given below −

The query time of the Brute Force algorithm grows as O[DN].

The query time of the Ball tree algorithm grows as O[D log(N)].

The query time of the KD tree algorithm changes with D in a way that is difficult to characterize precisely. When D < 20, the cost is O[D log(N)] and the algorithm is very efficient. When D > 20, it is inefficient because the cost increases to nearly O[DN].

Data Structure

Another factor that affects the performance of these algorithms is the intrinsic dimensionality or sparsity of the data, because the query times of the Ball tree and KD tree algorithms can be greatly influenced by it. The query time of the Brute Force algorithm, in contrast, is unchanged by data structure. Generally, the Ball tree and KD tree algorithms produce faster query times when applied to sparser data with smaller intrinsic dimensionality.

Number of Neighbors (k)

The number of neighbors (k) requested for a query point affects the query time of the Ball tree and KD tree algorithms. Their queries become slower as k increases, whereas the query time of Brute Force remains largely unaffected by the value of k.

Number of query points

Because they need a construction phase, both the KD tree and Ball tree algorithms are effective when there is a large number of query points. If there are only a few query points, the Brute Force algorithm performs better than the KD tree and Ball tree algorithms.
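The factors above can be explored empirically. The sketch below fits NearestNeighbors with each of the three algorithm keywords on the same data and times the queries. The data sizes (N = 2000, D = 3, 100 query points, k = 5) are arbitrary assumptions chosen for illustration, and the timings depend on the machine and on the scikit-learn version, so they should not be read as a benchmark.

```python
import time
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X = rng.rand(2000, 3)        # assumed sizes: N = 2000 samples, D = 3
queries = rng.rand(100, 3)   # 100 query points

results = {}
for algorithm in ("brute", "kd_tree", "ball_tree"):
    # Tree construction (if any) happens in fit; only the query is timed
    model = NearestNeighbors(n_neighbors=5, algorithm=algorithm).fit(X)
    start = time.perf_counter()
    distances, indices = model.kneighbors(queries)
    elapsed = time.perf_counter() - start
    results[algorithm] = distances
    print(f"{algorithm:>9}: {elapsed * 1000:.2f} ms")

# All three algorithms are exact, so they should find the same distances
same = all(
    np.allclose(results["brute"], results[name])
    for name in ("kd_tree", "ball_tree")
)
print("All algorithms agree on neighbor distances:", same)
```

Changing N, D, k or the number of query points in this sketch is a simple way to see how the trade-offs described above play out on a particular dataset.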
Scikit Learn – Anomaly Detection

Here, we will learn what anomaly detection is in Sklearn and how it is used to identify data points.

Anomaly detection is a technique used to identify data points in a dataset that do not fit well with the rest of the data. It has many applications in business, such as fraud detection, intrusion detection, system health monitoring, surveillance, and predictive maintenance. Anomalies, which are also called outliers, can be divided into the following three categories −

Point anomalies − These occur when an individual data instance is considered anomalous with respect to the rest of the data.

Contextual anomalies − This kind of anomaly is context specific. It occurs when a data instance is anomalous in a specific context.

Collective anomalies − These occur when a collection of related data instances is anomalous with respect to the entire dataset, even though the individual values may not be.

Methods

Two methods, namely outlier detection and novelty detection, can be used for anomaly detection. It is necessary to see the distinction between them.

Outlier detection

The training data contains outliers, which are defined as observations that are far from the rest of the data. For this reason, outlier detection estimators try to fit the region where the training data is most concentrated, while ignoring the deviant observations. It is also known as unsupervised anomaly detection.

Novelty detection

It is concerned with detecting an unobserved pattern in new observations that is not included in the training data. Here, the training data is not polluted by outliers. It is also known as semi-supervised anomaly detection.

Scikit-learn provides a set of ML tools that can be used for both outlier detection and novelty detection.
These tools first learn from the data in an unsupervised way by using the fit() method as follows −

    estimator.fit(X_train)

The new observations are then sorted as inliers (labeled 1) or outliers (labeled -1) by using the predict() method as follows −

    estimator.predict(X_test)

The estimator first computes a raw scoring function, and the predict method then applies a threshold to that raw score. We can access the raw scoring function with the score_samples method and control the threshold with the contamination parameter.

We can also use the decision_function method, which gives outliers a negative value and inliers a non-negative value.

    estimator.decision_function(X_test)

Sklearn algorithms for Outlier Detection

Let us begin by understanding what an elliptic envelope is.

Fitting an elliptic envelope

This algorithm assumes that regular data comes from a known distribution, such as a Gaussian distribution. For outlier detection, Scikit-learn provides an object named covariance.EllipticEnvelope.

This object fits a robust covariance estimate to the data, and thus fits an ellipse to the central data points. It ignores the points outside the central mode.

Parameters

The following table lists the parameters used by the sklearn.covariance.EllipticEnvelope class −

Sr.No Parameter & Description
1 store_precision − Boolean, optional, default = True. Specifies whether the estimated precision is stored.
2 assume_centered − Boolean, optional, default = False. If set to False, the robust location and covariance are computed directly with the FastMCD algorithm. If set to True, the support of the robust location and covariance estimates is computed.
3 support_fraction − float in (0., 1.), optional, default = None. This parameter tells the method what proportion of points to include in the support of the raw MCD estimates.
4 contamination − float in (0., 0.5), optional, default = 0.1. It provides the proportion of outliers in the data set.
5 random_state − int, RandomState instance or None, optional, default = None. This parameter represents the seed of the pseudo-random number generator used while shuffling the data. The options are as follows − int − random_state is the seed used by the random number generator. RandomState instance − random_state is the random number generator. None − the random number generator is the RandomState instance used by np.random.

Attributes

The following table lists the attributes of the sklearn.covariance.EllipticEnvelope class −

Sr.No Attributes & Description
1 support_ − array-like, shape (n_samples,). It represents the mask of the observations used to compute robust estimates of location and shape.
2 location_ − array-like, shape (n_features,). It returns the estimated robust location.
3 covariance_ − array-like, shape (n_features, n_features). It returns the estimated robust covariance matrix.
4 precision_ − array-like, shape (n_features, n_features). It returns the estimated pseudo-inverse matrix.
5 offset_ − float. It is used to define the decision function from the raw scores: decision_function = score_samples - offset_

Implementation Example

    import numpy as np
    from sklearn.covariance import EllipticEnvelope
    # A covariance matrix must be symmetric and positive semi-definite
    true_cov = np.array([[.8, .3], [.3, .4]])
    X = np.random.RandomState(0).multivariate_normal(mean = [0, 0], cov = true_cov, size = 500)
    cov = EllipticEnvelope(random_state = 0).fit(X)
    # Now we can use the predict method. It returns 1 for an inlier and -1 for an outlier.
    cov.predict([[0, 0], [2, 2]])

Output

    array([ 1, -1])

Isolation Forest

For high-dimensional datasets, one efficient way to perform outlier detection is to use random forests. Scikit-learn provides the ensemble.IsolationForest class, which isolates observations by randomly selecting a feature.
Afterwards, it randomly selects a split value between the maximum and minimum values of the selected feature.

Here, the number of splits needed to isolate a sample is equivalent to the path length from the root node to the terminating node.

Parameters

The following table lists the parameters used by the sklearn.ensemble.IsolationForest class −

Sr.No Parameter & Description
1 n_estimators − int, optional, default = 100. It represents the number of base estimators in the ensemble.
2 max_samples − int or float, optional, default = “auto”. It represents the number of samples to be drawn from X to train each base estimator. If we choose an int, it will draw max_samples samples. If we choose a float, it will draw max_samples ∗ X.shape[0] samples. If we choose “auto”, it will draw max_samples = min(256, n_samples).
3 support_fraction − float in (0., 1.), optional, default =
Scikit Learn – Boosting Methods

In this chapter, we will learn about the boosting methods in Sklearn, which enable building an ensemble model.

Boosting methods build an ensemble model incrementally, by training each base estimator sequentially. To build a powerful ensemble, these methods combine several weak learners that are trained in sequence over multiple iterations of the training data. The sklearn.ensemble module has the following two boosting methods.

AdaBoost

It is one of the most successful boosting ensemble methods. Its key idea lies in the way it assigns weights to the instances in the dataset: instances that earlier models handled poorly receive larger weights, so subsequent models pay more attention to them and less attention to the instances that are already handled well.

Classification with AdaBoost

For creating an AdaBoost classifier, Scikit-learn provides sklearn.ensemble.AdaBoostClassifier. While building this classifier, the main parameter is base_estimator (renamed estimator in scikit-learn 1.2). Here, base_estimator is the base estimator from which the boosted ensemble is built. If we set this parameter to None, the base estimator is DecisionTreeClassifier(max_depth=1).

Implementation example

In the following example, we build an AdaBoost classifier by using sklearn.ensemble.AdaBoostClassifier, then predict with it and check its score.
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.datasets import make_classification
    X, y = make_classification(n_samples = 1000, n_features = 10, n_informative = 2, n_redundant = 0, random_state = 0, shuffle = False)
    ADBclf = AdaBoostClassifier(n_estimators = 100, random_state = 0)
    ADBclf.fit(X, y)

Output

    AdaBoostClassifier(algorithm = "SAMME.R", base_estimator = None, learning_rate = 1.0, n_estimators = 100, random_state = 0)

Example

Once fitted, we can predict for new values as follows −

    print(ADBclf.predict([[0, 2, 3, 0, 1, 1, 1, 1, 2, 2]]))

Output

    [1]

Example

Now we can check the score as follows −

    ADBclf.score(X, y)

Output

    0.995

Example

We can also build an AdaBoost classifier on an external dataset. In the example given below, we use the Pima Indians diabetes dataset, read from a CSV file.

    from pandas import read_csv
    from sklearn.model_selection import KFold
    from sklearn.model_selection import cross_val_score
    from sklearn.ensemble import AdaBoostClassifier
    path = r"C:pima-indians-diabetes.csv"
    headernames = ["preg", "plas", "pres", "skin", "test", "mass", "pedi", "age", "class"]
    data = read_csv(path, names = headernames)
    array = data.values
    X = array[:,0:8]
    Y = array[:,8]
    # random_state is only valid in KFold together with shuffle = True, so it is omitted here
    kfold = KFold(n_splits = 10)
    num_trees = 100
    # AdaBoostClassifier has no max_features parameter, so it is not passed
    ADBclf = AdaBoostClassifier(n_estimators = num_trees)
    results = cross_val_score(ADBclf, X, Y, cv = kfold)
    print(results.mean())

Output (the exact value may vary with the scikit-learn version)

    0.7851435406698566

Regression with AdaBoost

For creating a regressor with the AdaBoost method, the Scikit-learn library provides sklearn.ensemble.AdaBoostRegressor. While building the regressor, it uses the same parameters as sklearn.ensemble.AdaBoostClassifier.

Implementation example

In the following example, we build an AdaBoost regressor by using sklearn.ensemble.AdaBoostRegressor and predict for new values by using the predict() method.
    from sklearn.ensemble import AdaBoostRegressor
    from sklearn.datasets import make_regression
    X, y = make_regression(n_features = 10, n_informative = 2, random_state = 0, shuffle = False)
    ADBregr = AdaBoostRegressor(random_state = 0, n_estimators = 100)
    ADBregr.fit(X, y)

Output

    AdaBoostRegressor(base_estimator = None, learning_rate = 1.0, loss = "linear", n_estimators = 100, random_state = 0)

Example

Once fitted, we can predict from the regression model as follows −

    print(ADBregr.predict([[0, 2, 3, 0, 1, 1, 1, 1, 2, 2]]))

Output

    [85.50955817]

Gradient Tree Boosting

It is also called Gradient Boosted Regression Trees (GBRT). It is a generalization of boosting to arbitrary differentiable loss functions. It produces a prediction model in the form of an ensemble of weak prediction models. It can be used for both regression and classification problems. Its main advantage lies in the fact that it naturally handles mixed-type data.

Classification with Gradient Tree Boost

For creating a Gradient Tree Boost classifier, Scikit-learn provides sklearn.ensemble.GradientBoostingClassifier. While building this classifier, the main parameter is ‘loss’, the loss function to be optimized. If we choose loss = ‘deviance’ (named ‘log_loss’ in scikit-learn 1.1 and later), it refers to deviance for classification with probabilistic outputs. If we set this parameter to ‘exponential’, gradient boosting recovers the AdaBoost algorithm.

The parameter n_estimators controls the number of weak learners. A hyper-parameter named learning_rate (in the range (0.0, 1.0]) controls overfitting via shrinkage.

Implementation example

In the following example, we build a Gradient Boosting classifier by using sklearn.ensemble.GradientBoostingClassifier. We fit this classifier with 50 weak learners.
    from sklearn.datasets import make_hastie_10_2
    from sklearn.ensemble import GradientBoostingClassifier
    X, y = make_hastie_10_2(random_state = 0)
    X_train, X_test = X[:5000], X[5000:]
    y_train, y_test = y[:5000], y[5000:]
    GDBclf = GradientBoostingClassifier(n_estimators = 50, learning_rate = 1.0, max_depth = 1, random_state = 0).fit(X_train, y_train)
    GDBclf.score(X_test, y_test)

Output

    0.8724285714285714

Example

We can also build a Gradient Boosting classifier on an external dataset. In the following example, we use the Pima Indians diabetes dataset, read from a CSV file.

    from pandas import read_csv
    from sklearn.model_selection import KFold
    from sklearn.model_selection import cross_val_score
    from sklearn.ensemble import GradientBoostingClassifier
    path = r"C:pima-indians-diabetes.csv"
    headernames = ["preg", "plas", "pres", "skin", "test", "mass", "pedi", "age", "class"]
    data = read_csv(path, names = headernames)
    array = data.values
    X = array[:,0:8]
    Y = array[:,8]
    # random_state is only valid in KFold together with shuffle = True, so it is omitted here
    kfold = KFold(n_splits = 10)
    num_trees = 100
    max_features = 5
    GDBclf = GradientBoostingClassifier(n_estimators = num_trees, max_features = max_features)
    results = cross_val_score(GDBclf, X, Y, cv = kfold)
    print(results.mean())

Output (the exact value may vary between runs and scikit-learn versions)

    0.7946582356674234

Regression with Gradient Tree Boost

For creating a regressor with the Gradient Tree Boost method, the Scikit-learn library provides sklearn.ensemble.GradientBoostingRegressor. The loss function for regression can be specified via the parameter named loss. The default value for loss is ‘ls’ (least squares), which is named ‘squared_error’ in scikit-learn 1.0 and later.

Implementation example

In the following example, we build a Gradient Boosting regressor by using sklearn.ensemble.GradientBoostingRegressor and compute the mean squared error by using the mean_squared_error() function.
    import numpy as np
    from sklearn.metrics import mean_squared_error
    from sklearn.datasets import make_friedman1
    from sklearn.ensemble import GradientBoostingRegressor
    X, y = make_friedman1(n_samples = 2000, random_state = 0, noise = 1.0)
    X_train, X_test = X[:1000], X[1000:]
    y_train, y_test = y[:1000], y[1000:]
    # The least-squares loss is named "squared_error" in scikit-learn 1.0 and later ("ls" before that)
    GDBreg = GradientBoostingRegressor(n_estimators = 80, learning_rate = 0.1, max_depth = 1, random_state = 0, loss = "squared_error").fit(X_train, y_train)

Once fitted, we can find the mean squared error as follows −

    mean_squared_error(y_test, GDBreg.predict(X_test))

Output

    5.391246106657164
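To see how n_estimators and learning_rate interact, it can help to look at the test error after each boosting stage. The sketch below reuses the Friedman dataset and settings from the regression example, and relies on the staged_predict method of GradientBoostingRegressor, which yields the ensemble's prediction after each added weak learner. The loss parameter is left at its default (squared error) so that the snippet does not depend on the scikit-learn version. The stages chosen for printing are an arbitrary assumption.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

X, y = make_friedman1(n_samples=2000, random_state=0, noise=1.0)
X_train, X_test = X[:1000], X[1000:]
y_train, y_test = y[:1000], y[1000:]

model = GradientBoostingRegressor(
    n_estimators=80, learning_rate=0.1, max_depth=1, random_state=0
).fit(X_train, y_train)

# staged_predict yields one prediction array per boosting stage,
# so test_errors[i] is the test MSE of the first i + 1 weak learners
test_errors = [
    mean_squared_error(y_test, y_pred)
    for y_pred in model.staged_predict(X_test)
]

for n_trees in (1, 10, 40, 80):
    print(n_trees, "weak learners, test MSE:", round(test_errors[n_trees - 1], 3))
```

A curve like this is one way to judge whether more weak learners still help, or whether a smaller learning_rate with more estimators is worth trying.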
Scikit Learn – Support Vector Machines

This chapter deals with a machine learning method known as Support Vector Machines (SVMs).

Introduction

Support vector machines (SVMs) are powerful yet flexible supervised machine learning methods used for classification, regression, and outlier detection. SVMs are very efficient in high-dimensional spaces and are generally used for classification problems. SVMs are popular and memory efficient because they use only a subset of the training points in the decision function.

The main goal of SVMs is to divide the dataset into classes by finding a maximum marginal hyperplane (MMH), which can be done in the following two steps −

Support Vector Machines first generate hyperplanes iteratively that separate the classes in the best way.

After that, they choose the hyperplane that segregates the classes correctly.

Some important concepts in SVM are as follows −

Support Vectors − They may be defined as the data points that are closest to the hyperplane. Support vectors help in deciding the separating line.

Hyperplane − The decision plane or space that divides a set of objects having different classes.

Margin − The gap between two lines on the closest data points of different classes is called the margin.

The following diagrams give an insight into these SVM concepts −

SVM in Scikit-learn supports both sparse and dense sample vectors as input.

Classification of SVM

Scikit-learn provides three classes, namely SVC, NuSVC and LinearSVC, which can perform multiclass classification.

SVC

It is C-support vector classification, whose implementation is based on libsvm. The class provided by scikit-learn is sklearn.svm.SVC. This class handles multiclass support according to a one-vs-one scheme.

Parameters

The following table lists the parameters used by the sklearn.svm.SVC class −

Sr.No Parameter & Description
1 C − float, optional, default = 1.0. It is the penalty parameter of the error term.
2 kernel − string, optional, default = ‘rbf’. This parameter specifies the type of kernel to be used in the algorithm. We can choose any one among ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’ and ‘precomputed’.
3 degree − int, optional, default = 3. It represents the degree of the ‘poly’ kernel function and is ignored by all other kernels.
4 gamma − {‘scale’, ‘auto’} or float, optional, default = ‘scale’. It is the kernel coefficient for the kernels ‘rbf’, ‘poly’ and ‘sigmoid’. With the default, gamma = ‘scale’, the value of gamma used by SVC is 1/(n_features ∗ X.var()). With gamma = ‘auto’, it uses 1/n_features.
5 coef0 − float, optional, default = 0.0. An independent term in the kernel function, which is only significant in ‘poly’ and ‘sigmoid’.
6 tol − float, optional, default = 1.e-3. This parameter represents the stopping criterion for iterations.
7 shrinking − Boolean, optional, default = True. This parameter specifies whether to use the shrinking heuristic.
8 verbose − Boolean, optional, default = False. It enables or disables verbose output.
9 probability − Boolean, optional, default = False. This parameter enables or disables probability estimates. It must be enabled before we call fit.
10 max_iter − int, optional, default = -1. As the name suggests, it represents the maximum number of iterations within the solver. The value -1 means there is no limit on the number of iterations.
11 cache_size − float, optional. This parameter specifies the size of the kernel cache, in MB (megabytes).
12 random_state − int, RandomState instance or None, optional, default = None. This parameter represents the seed of the pseudo-random number generator used while shuffling the data. The options are as follows − int − random_state is the seed used by the random number generator. RandomState instance − random_state is the random number generator. None − the random number generator is the RandomState instance used by np.random.
13 class_weight − {dict, ‘balanced’}, optional. This parameter sets the parameter C of class j to class_weight[j] ∗ C for SVC. With the default option, all classes are given weight one. With class_weight = ‘balanced’, the values of y are used to adjust the weights automatically.
14 decision_function_shape − {‘ovo’, ‘ovr’}, default = ‘ovr’. This parameter decides whether the algorithm returns a ‘ovr’ (one-vs-rest) decision function of the same shape as all other classifiers, or the original ‘ovo’ (one-vs-one) decision function of libsvm.
15 break_ties − Boolean, optional, default = False. True − predict breaks ties according to the confidence values of decision_function. False − predict returns the first class among the tied classes.

Attributes

The following table lists the attributes of the sklearn.svm.SVC class −

Sr.No Attributes & Description
1 support_ − array-like, shape = [n_SV]. It returns the indices of the support vectors.
2 support_vectors_ − array-like, shape = [n_SV, n_features]. It returns the support vectors.
3 n_support_ − array-like, dtype = int32, shape = [n_class]. It represents the number of support vectors for each class.
4 dual_coef_ − array, shape = [n_class-1, n_SV]. These are the coefficients of the support vectors in the decision function.
5 coef_ − array, shape = [n_class * (n_class-1)/2, n_features]. This attribute, only available with a linear kernel, provides the weights assigned to the features.
6 intercept_ − array, shape = [n_class * (n_class-1)/2]. It represents the independent term (constant) in the decision function.
7 fit_status_ − int. The value is 0 if the model is correctly fitted and 1 otherwise.
8 classes_ − array of shape = [n_classes]. It gives the labels of the classes.
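The attributes listed above can be inspected on a small fitted model. The following is a minimal sketch, assuming a toy dataset of four two-dimensional points in two classes. The data, the linear kernel and the query point are illustrative assumptions, not taken from the original example.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data (an assumption for illustration): two well-separated classes
X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
y = np.array([1, 1, 2, 2])

# A linear kernel is used so that coef_ is available
clf = SVC(kernel="linear", C=1.0).fit(X, y)

print("classes_:", clf.classes_)
print("support_:", clf.support_)                  # indices of the support vectors
print("support_vectors_:", clf.support_vectors_)
print("n_support_:", clf.n_support_)              # support vectors per class
print("dual_coef_ shape:", clf.dual_coef_.shape)
print("coef_ shape:", clf.coef_.shape)            # only defined for the linear kernel
print("intercept_:", clf.intercept_)
print("prediction:", clf.predict([[-0.8, -1]]))
```

With two classes there is a single one-vs-one problem, so coef_ and intercept_ each have one row, in line with the n_class * (n_class-1)/2 shapes given in the table.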
Implementation Example
Like other classifiers, SVC has to be fitted with the following two arrays −
An array X holding the training samples. It is of size [n_samples, n_features].
An array Y holding the target values, i.e. class labels for the training samples. It is of size [n_samples].
The following Python script uses the sklearn.svm.SVC class −
import
Scikit Learn – Estimator API
In this chapter, we will learn about the Estimator API (application programming interface). Let us begin by understanding what an Estimator API is.

What is Estimator API
It is one of the main APIs implemented by Scikit-learn. It provides a consistent interface for a wide range of ML applications, which is why all machine learning algorithms in Scikit-learn are implemented via the Estimator API. The object that learns from the data (fitting the data) is an estimator. It can be used with any kind of algorithm, such as classification, regression or clustering, or even with a transformer that extracts useful features from raw data.
For fitting the data, all estimator objects expose a fit method that takes a dataset, as follows −
    estimator.fit(data)
Next, all the parameters of an estimator can be set when it is instantiated, or by modifying the corresponding attribute, as follows −
    estimator = Estimator(param1=1, param2=2)
    estimator.param1
The output of the above would be 1.
Once data is fitted with an estimator, parameters are estimated from the data at hand. All the estimated parameters are attributes of the estimator object, with names ending in an underscore, as follows −
    estimator.estimated_param_

Use of Estimator API
The main uses of estimators are as follows −

Estimation and decoding of a model
An estimator object is used for estimation and decoding of a model. Furthermore, the model is estimated as a deterministic function of the following −
The parameters which are provided in object construction.
The global random state (numpy.random), if the estimator's random_state parameter is set to None.
Any data passed to the most recent call to fit, fit_transform, or fit_predict.
Any data passed in a sequence of calls to partial_fit.

Mapping non-rectangular data representation into rectangular data
It maps a non-rectangular data representation into rectangular data.
In simple words, it takes input where each sample is not represented as an array-like object of fixed length, and produces an array-like object of features for each sample.

Distinction between core and outlying samples
It models the distinction between core and outlying samples by using the following methods −
fit
fit_predict, if transductive
predict, if inductive

Guiding Principles
While designing the Scikit-learn API, the following guiding principles were kept in mind −

Consistency
This principle states that all objects should share a common interface drawn from a limited set of methods. The documentation should also be consistent.

Limited object hierarchy
This guiding principle says −
Algorithms should be represented by Python classes.
Datasets should be represented in standard formats such as NumPy arrays, Pandas DataFrames and SciPy sparse matrices.
Parameter names should use standard Python strings.

Composition
As we know, ML algorithms can be expressed as a sequence of many fundamental algorithms. Scikit-learn makes use of these fundamental algorithms whenever needed.

Sensible defaults
According to this principle, the Scikit-learn library defines an appropriate default value whenever ML models require user-specified parameters.

Inspection
As per this guiding principle, every specified parameter value is exposed as a public attribute.

Steps in using Estimator API
The steps in using the Scikit-learn estimator API are as follows −

Step 1: Choose a class of model
In this first step, we need to choose a class of model. It can be done by importing the appropriate Estimator class from Scikit-learn.

Step 2: Choose model hyperparameters
In this step, we need to choose the model hyperparameters. It can be done by instantiating the class with the desired values.

Step 3: Arranging the data
Next, we need to arrange the data into a features matrix (X) and a target vector (y).

Step 4: Model Fitting
Now, we need to fit the model to the data.
It can be done by calling the fit() method of the model instance.

Step 5: Applying the model
After fitting the model, we can apply it to new data. For supervised learning, use the predict() method to predict the labels for unknown data. For unsupervised learning, use predict() or transform() to infer properties of the data.

Supervised Learning Example
Here, as an example of this process, we take the common case of fitting a line to (x, y) data, i.e. simple linear regression.
First, we need to load the dataset. We are using the iris dataset −
Example
    import seaborn as sns
    iris = sns.load_dataset("iris")
    X_iris = iris.drop("species", axis = 1)
    X_iris.shape
Output
    (150, 4)
Example
    y_iris = iris["species"]
    y_iris.shape
Output
    (150,)
Example
Now, for this regression example, we are going to use the following sample data −
    %matplotlib inline
    import matplotlib.pyplot as plt
    import numpy as np
    rng = np.random.RandomState(35)
    x = 10 * rng.rand(40)
    y = 2 * x - 1 + rng.randn(40)
    plt.scatter(x, y);
Output
    (scatter plot of y against x)
So, we have the above data for our linear regression example. Now, with this data, we can apply the above-mentioned steps.

Choose a class of model
Here, to compute a simple linear regression model, we need to import the linear regression class as follows −
    from sklearn.linear_model import LinearRegression

Choose model hyperparameters
Once we choose a class of model, we need to make some important choices, which are often represented as hyperparameters, i.e. the parameters that must be set before the model is fit to data. Here, for this example of linear regression, we would like to fit the intercept by using the fit_intercept hyperparameter as follows −
Example
    model = LinearRegression(fit_intercept = True)
    model
Output
    LinearRegression(copy_X = True, fit_intercept = True, n_jobs = None, normalize = False)

Arranging the data
Now, we know that our target variable y is already in the correct form, i.e. a 1-D array of length n_samples.
But we need to reshape the feature data x to make it a matrix X of size [n_samples, n_features]. It can be done as follows −
Example
    X = x[:, np.newaxis]
    X.shape
Output
    (40, 1)

Model fitting
Once we have arranged the data, it is time to fit the model, i.e. to apply our model to the data. This can be done with the help of the fit() method as follows −
Example
    model.fit(X, y)
Output
    LinearRegression(copy_X = True, fit_intercept = True, n_jobs = None, normalize = False)
In Scikit-learn, the parameters learned during the fit() process have trailing underscores. For this example, the
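The five steps described in this chapter can be put together in one short script. This is a minimal sketch that reuses the synthetic data from the example above; the variable x_new is a hypothetical set of new inputs introduced only to illustrate the predict() step.

```python
import numpy as np
from sklearn.linear_model import LinearRegression  # Step 1: choose a class of model

# Same synthetic data as in the chapter: y is roughly 2*x - 1 plus noise.
rng = np.random.RandomState(35)
x = 10 * rng.rand(40)
y = 2 * x - 1 + rng.randn(40)

model = LinearRegression(fit_intercept=True)  # Step 2: choose hyperparameters
X = x[:, np.newaxis]                          # Step 3: features matrix, shape (40, 1)
model.fit(X, y)                               # Step 4: fit the model

# Parameters learned from the data carry a trailing underscore.
print(model.coef_, model.intercept_)

# Step 5: apply the model to new data (x_new is a hypothetical input grid).
x_new = np.linspace(-1, 11, 5)
y_new = model.predict(x_new[:, np.newaxis])
print(y_new.shape)
```

Since the data was generated with slope 2 and intercept -1, the fitted coef_ and intercept_ should land close to those values, up to the added noise.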
Scikit Learn – Linear Modeling
This chapter will help you learn about linear modeling in Scikit-learn. Let us begin by looking at the linear models available in Sklearn.
The following table lists the various linear models provided by Scikit-learn −
Sr.No Model & Description
1 Linear Regression
It is one of the best-known statistical models and studies the relationship between a dependent variable (Y) and a given set of independent variables (X).
2 Logistic Regression
Logistic regression, despite its name, is a classification algorithm rather than a regression algorithm. Based on a given set of independent variables, it is used to estimate a discrete value (0 or 1, yes/no, true/false).
3 Ridge Regression
Ridge regression, or Tikhonov regularization, is the regularization technique that performs L2 regularization. It modifies the loss function by adding a penalty (shrinkage quantity) equivalent to the square of the magnitude of the coefficients.
4 Bayesian Regression
Bayesian regression provides a natural mechanism for coping with insufficient or poorly distributed data by formulating linear regression using probability distributions rather than point estimates.
5 LASSO
LASSO is the regularization technique that performs L1 regularization. It modifies the loss function by adding a penalty (shrinkage quantity) equivalent to the sum of the absolute values of the coefficients.
6 Multi-task LASSO
It allows fitting multiple regression problems jointly, enforcing the selected features to be the same for all the regression problems, also called tasks. Sklearn provides a linear model named MultiTaskLasso, trained with a mixed L1/L2-norm for regularization, which estimates sparse coefficients for multiple regression problems jointly.
7 Elastic-Net
The Elastic-Net is a regularized regression method that linearly combines both penalties, i.e. L1 and L2, of the Lasso and Ridge regression methods. It is useful when there are multiple correlated features.
8 Multi-task Elastic-Net
It is an Elastic-Net model that allows fitting multiple regression problems jointly, enforcing the selected features to be the same for all the regression problems, also called tasks.
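To see how the penalties described in the table affect the fitted coefficients, several of these models can be trained on the same data. The sketch below is illustrative only: the synthetic dataset (five features, of which only the first two influence the target) and the alpha values are assumptions chosen for the example, not recommended settings.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet

# Synthetic data (an assumption for illustration): only the first two of
# the five features actually influence the target.
rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.randn(100)

models = {
    "LinearRegression": LinearRegression(),
    "Ridge (L2)": Ridge(alpha=1.0),
    "Lasso (L1)": Lasso(alpha=0.1),
    "ElasticNet (L1 + L2)": ElasticNet(alpha=0.1, l1_ratio=0.5),
}

# Fit every model on the same data and collect the learned coefficients.
coefs = {}
for name, model in models.items():
    model.fit(X, y)
    coefs[name] = model.coef_
    print(name, np.round(model.coef_, 3))
```

All four classes share the same fit() interface, so they can be swapped freely. Comparing the printed rows shows the general tendency: the L2 penalty shrinks the coefficients, while the L1 penalty tends to push the coefficients of uninformative features towards zero.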
Scikit Learn – Dimensionality Reduction using PCA
Dimensionality reduction, an unsupervised machine learning method, is used to reduce the number of feature variables for each data sample by selecting a set of principal features. Principal Component Analysis (PCA) is one of the popular algorithms for dimensionality reduction.

Exact PCA
Principal Component Analysis (PCA) performs linear dimensionality reduction using the Singular Value Decomposition (SVD) of the data to project it to a lower-dimensional space. During decomposition with PCA, the input data is centered but not scaled for each feature before the SVD is applied.
The Scikit-learn ML library provides the sklearn.decomposition.PCA class, which is implemented as a transformer object that learns n components in its fit() method. It can also be used on new data to project it onto these components.
Example
The example below uses sklearn.decomposition.PCA to find the best 5 principal components of the Pima Indians Diabetes dataset.
    from pandas import read_csv
    from sklearn.decomposition import PCA
    path = r"C:\Users\Leekha\Desktop\pima-indians-diabetes.csv"
    names = ["preg", "plas", "pres", "skin", "test", "mass", "pedi", "age", "class"]
    dataframe = read_csv(path, names = names)
    array = dataframe.values
    X = array[:, 0:8]
    Y = array[:, 8]
    pca = PCA(n_components = 5)
    fit = pca.fit(X)
    print("Explained Variance: %s" % fit.explained_variance_ratio_)
    print(fit.components_)
Output
    Explained Variance: [0.88854663 0.06159078 0.02579012 0.01308614 0.00744094]
    [
       [-2.02176587e-03  9.78115765e-02  1.60930503e-02  6.07566861e-02
         9.93110844e-01  1.40108085e-02  5.37167919e-04 -3.56474430e-03]
       [-2.26488861e-02 -9.72210040e-01 -1.41909330e-01  5.78614699e-02
         9.46266913e-02 -4.69729766e-02 -8.16804621e-04 -1.40168181e-01]
       [-2.24649003e-02  1.43428710e-01 -9.22467192e-01 -3.07013055e-01
         2.09773019e-02 -1.32444542e-01 -6.39983017e-04 -1.25454310e-01]
       [-4.90459604e-02  1.19830016e-01 -2.62742788e-01  8.84369380e-01
        -6.55503615e-02  1.92801728e-01  2.69908637e-03 -3.01024330e-01]
       [ 1.51612874e-01 -8.79407680e-02 -2.32165009e-01  2.59973487e-01
        -1.72312241e-04  2.14744823e-02  1.64080684e-03  9.20504903e-01]
    ]

Incremental PCA
Incremental Principal Component Analysis (IPCA) addresses the biggest limitation of Principal Component Analysis (PCA): PCA only supports batch processing, which means all the input data to be processed must fit in memory.
The Scikit-learn ML library provides the sklearn.decomposition.IncrementalPCA class, which makes it possible to implement out-of-core PCA either by using its partial_fit method on sequentially fetched chunks of data or by using np.memmap, a memory-mapped file, without loading the entire file into memory.
As with PCA, during decomposition with IPCA the input data is centered but not scaled for each feature before the SVD is applied.
Example
The example below uses sklearn.decomposition.IncrementalPCA on the Sklearn digits dataset.
    from sklearn.datasets import load_digits
    from sklearn.decomposition import IncrementalPCA
    X, _ = load_digits(return_X_y = True)
    transformer = IncrementalPCA(n_components = 10, batch_size = 100)
    transformer.partial_fit(X[:100, :])
    X_transformed = transformer.fit_transform(X)
    X_transformed.shape
Output
    (1797, 10)
Here, we can partially fit on smaller batches of data (as we did with 100 samples per batch), or we can let the fit() function divide the data into batches.

Kernel PCA
Kernel Principal Component Analysis, an extension of PCA, achieves non-linear dimensionality reduction using kernels. It supports both transform and inverse_transform (the latter when fit_inverse_transform = True).
The Scikit-learn ML library provides the sklearn.decomposition.KernelPCA class.
Example
The example below uses sklearn.decomposition.KernelPCA on the Sklearn digits dataset. We are using the sigmoid kernel.
    from sklearn.datasets import load_digits
    from sklearn.decomposition import KernelPCA
    X, _ = load_digits(return_X_y = True)
    transformer = KernelPCA(n_components = 10, kernel = "sigmoid")
    X_transformed = transformer.fit_transform(X)
    X_transformed.shape
Output
    (1797, 10)

PCA using randomized SVD
Principal Component Analysis (PCA) using randomized SVD projects data to a lower-dimensional space, preserving most of the variance by dropping the singular vectors of components associated with lower singular values. Here, the sklearn.decomposition.PCA class with the optional parameter svd_solver = 'randomized' is very useful.
Example
The example below uses sklearn.decomposition.PCA with the optional parameter svd_solver = 'randomized' to find the best 7 principal components of the Pima Indians Diabetes dataset.
    from pandas import read_csv
    from sklearn.decomposition import PCA
    path = r"C:\Users\Leekha\Desktop\pima-indians-diabetes.csv"
    names = ["preg", "plas", "pres", "skin", "test", "mass", "pedi", "age", "class"]
    dataframe = read_csv(path, names = names)
    array = dataframe.values
    X = array[:, 0:8]
    Y = array[:, 8]
    pca = PCA(n_components = 7, svd_solver = "randomized")
    fit = pca.fit(X)
    print("Explained Variance: %s" % fit.explained_variance_ratio_)
    print(fit.components_)
Output
    Explained Variance: [8.88546635e-01 6.15907837e-02 2.57901189e-02 1.30861374e-02
     7.44093864e-03 3.02614919e-03 5.12444875e-04]
    [
       [-2.02176587e-03  9.78115765e-02  1.60930503e-02  6.07566861e-02
         9.93110844e-01  1.40108085e-02  5.37167919e-04 -3.56474430e-03]
       [-2.26488861e-02 -9.72210040e-01 -1.41909330e-01  5.78614699e-02
         9.46266913e-02 -4.69729766e-02 -8.16804621e-04 -1.40168181e-01]
       [-2.24649003e-02  1.43428710e-01 -9.22467192e-01 -3.07013055e-01
         2.09773019e-02 -1.32444542e-01 -6.39983017e-04 -1.25454310e-01]
       [-4.90459604e-02  1.19830016e-01 -2.62742788e-01  8.84369380e-01
        -6.55503615e-02  1.92801728e-01  2.69908637e-03 -3.01024330e-01]
       [ 1.51612874e-01 -8.79407680e-02 -2.32165009e-01  2.59973487e-01
        -1.72312241e-04  2.14744823e-02  1.64080684e-03  9.20504903e-01]
       [-5.04730888e-03  5.07391813e-02  7.56365525e-02  2.21363068e-01
        -6.13326472e-03 -9.70776708e-01 -2.02903702e-03 -1.51133239e-02]
       [ 9.86672995e-01  8.83426114e-04 -1.22975947e-03 -3.76444746e-04
         1.42307394e-03 -2.73046214e-03 -6.34402965e-03 -1.62555343e-01]
    ]
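The out-of-core use of Incremental PCA described in this chapter can be sketched by calling partial_fit once per chunk and then projecting the data with transform. This is a minimal illustration under stated assumptions: the digits dataset easily fits in memory and is sliced here only to imitate chunks read one at a time, and the chunk size of 200 is an arbitrary choice.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import IncrementalPCA

X, _ = load_digits(return_X_y=True)

ipca = IncrementalPCA(n_components=10)

# Feed the data chunk by chunk, as if each chunk had been fetched separately.
# The chunk size is arbitrary, but every chunk must contain at least
# n_components samples.
chunk_size = 200
for start in range(0, X.shape[0], chunk_size):
    ipca.partial_fit(X[start:start + chunk_size])

# transform() projects data onto the learned components without refitting.
X_transformed = ipca.transform(X)
print(X_transformed.shape)
print(ipca.explained_variance_ratio_.sum())
```

Using transform() rather than fit_transform() after the loop keeps the components accumulated by the partial_fit calls; fit_transform() would fit the model again on whatever array it is given.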
Scikit Learn – Data Representation
As we know, machine learning is about creating a model from data. For this purpose, the computer must understand the data first. Next, we are going to discuss various ways to represent the data so that it can be understood by the computer −

Data as table
The best way to represent data in Scikit-learn is in the form of tables. A table represents a 2-D grid of data where the rows represent the individual elements of the dataset and the columns represent the quantities related to those individual elements.
Example
With the example given below, we can download the iris dataset in the form of a Pandas DataFrame with the help of the Python seaborn library.
    import seaborn as sns
    iris = sns.load_dataset("iris")
    iris.head()
Output
       sepal_length  sepal_width  petal_length  petal_width  species
    0           5.1          3.5           1.4          0.2   setosa
    1           4.9          3.0           1.4          0.2   setosa
    2           4.7          3.2           1.3          0.2   setosa
    3           4.6          3.1           1.5          0.2   setosa
    4           5.0          3.6           1.4          0.2   setosa
From the above output, we can see that each row of the data represents a single observed flower, and the number of rows represents the total number of flowers in the dataset. Generally, we refer to the rows of the matrix as samples.
On the other hand, each column of the data represents a piece of quantitative information describing each sample. Generally, we refer to the columns of the matrix as features.

Data as Feature Matrix
The features matrix may be defined as the table layout where the information can be thought of as a 2-D matrix. It is stored in a variable named X and assumed to be two-dimensional with shape [n_samples, n_features]. Mostly, it is contained in a NumPy array or a Pandas DataFrame. As mentioned earlier, the samples always represent the individual objects described by the dataset, and the features represent the distinct observations that describe each sample in a quantitative manner.

Data as Target array
Along with the features matrix, denoted by X, we also have a target array. It is also called the label. It is denoted by y.
The label or target array is usually one-dimensional, with length n_samples. It is generally contained in a NumPy array or a Pandas Series. The target array may hold either continuous numerical values or discrete values.

How does the target array differ from the feature columns?
The distinguishing point is that the target array is usually the quantity we want to predict from the data, i.e. in statistical terms it is the dependent variable.
Example
In the example below, from the iris dataset we predict the species of a flower based on the other measurements. In this case, the species column is considered the target array.
    %matplotlib inline
    import seaborn as sns
    sns.set()
    iris = sns.load_dataset("iris")
    sns.pairplot(iris, hue="species", height=3);
Output
    (pair plot of the iris features, colored by species)
    X_iris = iris.drop("species", axis=1)
    print(X_iris.shape)
    y_iris = iris["species"]
    print(y_iris.shape)
Output
    (150, 4)
    (150,)
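The same features matrix and target array can also be obtained without seaborn. The sketch below is an alternative that uses the copy of the iris dataset bundled with Scikit-learn; it assumes pandas is installed, and in this copy the target holds integer class codes rather than the species names.

```python
from sklearn.datasets import load_iris

# as_frame=True returns pandas objects (this assumes pandas is installed).
iris = load_iris(as_frame=True)

X_iris = iris.data    # features matrix: DataFrame of shape (n_samples, n_features)
y_iris = iris.target  # target array: Series of length n_samples (integer class codes)

print(X_iris.shape)
print(y_iris.shape)

# Estimators also accept plain NumPy arrays with the same layout.
X_array = X_iris.to_numpy()
y_array = y_iris.to_numpy()
print(X_array.ndim, y_array.ndim)
```

Either representation, pandas or NumPy, follows the layout described above: a two-dimensional X and a one-dimensional y with one entry per sample.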