Scikit Learn – Linear Modeling
This chapter will help you learn about linear modeling in Scikit-learn. Let us begin by understanding what linear regression in Sklearn is. The following table lists the various linear models provided by Scikit-learn −
Sr.No Model & Description
1 Linear Regression
It is one of the best statistical models; it studies the relationship between a dependent variable (Y) and a given set of independent variables (X).
2 Logistic Regression
Logistic regression, despite its name, is a classification algorithm rather than a regression algorithm. Based on a given set of independent variables, it is used to estimate a discrete value (0 or 1, yes/no, true/false).
3 Ridge Regression
Ridge regression, or Tikhonov regularization, is a regularization technique that performs L2 regularization. It modifies the loss function by adding a penalty (shrinkage quantity) equivalent to the square of the magnitude of the coefficients.
4 Bayesian Regression
Bayesian regression provides a natural mechanism to cope with insufficient or poorly distributed data by formulating linear regression using probability distributions rather than point estimates.
5 LASSO
LASSO is a regularization technique that performs L1 regularization. It modifies the loss function by adding a penalty (shrinkage quantity) equivalent to the sum of the absolute values of the coefficients.
6 Multi-task LASSO
It allows fitting multiple regression problems jointly, enforcing the selected features to be the same for all the regression problems, also called tasks. Sklearn provides a linear model named MultiTaskLasso, trained with a mixed L1/L2-norm for regularization, which estimates sparse coefficients for multiple regression problems jointly.
7 Elastic-Net
The Elastic-Net is a regularized regression method that linearly combines both penalties, i.e. the L1 and L2 penalties of the Lasso and Ridge regression methods. It is useful when there are multiple correlated features.
8 Multi-task Elastic-Net
It is an Elastic-Net model that allows fitting multiple regression problems jointly, enforcing the selected features to be the same for all the regression problems, also called tasks.
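All of these estimators share scikit-learn's common fit()/predict() interface. The short script below is an illustrative sketch (not part of the original tutorial); the toy data and the alpha values are assumptions chosen only to show the shared API −
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Toy regression data (illustrative only): y depends linearly on two features.
rng = np.random.RandomState(0)
X = rng.rand(100, 2)
y = 3 * X[:, 0] + 2 * X[:, 1] + 0.01 * rng.randn(100)

# Ordinary least squares, L2-penalized and L1-penalized models, fitted the same way.
for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.01)):
    model.fit(X, y)
    print(type(model).__name__, model.coef_)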
Scikit Learn – Dimensionality Reduction using PCA
Dimensionality reduction, an unsupervised machine learning method, is used to reduce the number of feature variables for each data sample by selecting a set of principal features. Principal Component Analysis (PCA) is one of the popular algorithms for dimensionality reduction.
Exact PCA
Principal Component Analysis (PCA) is used for linear dimensionality reduction using Singular Value Decomposition (SVD) of the data to project it to a lower-dimensional space. When decomposing with PCA, the input data is centered but not scaled for each feature before the SVD is applied.
The Scikit-learn ML library provides the sklearn.decomposition.PCA module, which is implemented as a transformer object that learns n components in its fit() method. It can also be used on new data to project it onto these components.
Example
The example below uses the sklearn.decomposition.PCA module to find the best 5 principal components from the Pima Indians Diabetes dataset.
from pandas import read_csv
from sklearn.decomposition import PCA
path = r"C:\Users\Leekha\Desktop\pima-indians-diabetes.csv"
names = ["preg", "plas", "pres", "skin", "test", "mass", "pedi", "age", "class"]
dataframe = read_csv(path, names = names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
pca = PCA(n_components = 5)
fit = pca.fit(X)
print(("Explained Variance: %s") % (fit.explained_variance_ratio_))
print(fit.components_)
Output
Explained Variance: [0.88854663 0.06159078 0.02579012 0.01308614 0.00744094]
[[-2.02176587e-03  9.78115765e-02  1.60930503e-02  6.07566861e-02
   9.93110844e-01  1.40108085e-02  5.37167919e-04 -3.56474430e-03]
 [-2.26488861e-02 -9.72210040e-01 -1.41909330e-01  5.78614699e-02
   9.46266913e-02 -4.69729766e-02 -8.16804621e-04 -1.40168181e-01]
 [-2.24649003e-02  1.43428710e-01 -9.22467192e-01 -3.07013055e-01
   2.09773019e-02 -1.32444542e-01 -6.39983017e-04 -1.25454310e-01]
 [-4.90459604e-02  1.19830016e-01 -2.62742788e-01  8.84369380e-01
  -6.55503615e-02  1.92801728e-01  2.69908637e-03 -3.01024330e-01]
 [ 1.51612874e-01 -8.79407680e-02 -2.32165009e-01  2.59973487e-01
  -1.72312241e-04  2.14744823e-02  1.64080684e-03  9.20504903e-01]]
Incremental PCA
Incremental Principal Component Analysis (IPCA) is used to address the biggest limitation of Principal Component Analysis (PCA), namely that PCA only supports batch processing, which means all the input data to be processed must fit in memory.
The Scikit-learn ML library provides the sklearn.decomposition.IncrementalPCA module, which makes it possible to implement out-of-core PCA either by using its partial_fit method on sequentially fetched chunks of data or by enabling the use of np.memmap, a memory-mapped file, without loading the entire file into memory.
As with PCA, when decomposing with IPCA, the input data is centered but not scaled for each feature before the SVD is applied.
Example
The example below uses the sklearn.decomposition.IncrementalPCA module on the Sklearn digits dataset.
from sklearn.datasets import load_digits
from sklearn.decomposition import IncrementalPCA
X, _ = load_digits(return_X_y = True)
transformer = IncrementalPCA(n_components = 10, batch_size = 100)
transformer.partial_fit(X[:100, :])
X_transformed = transformer.fit_transform(X)
X_transformed.shape
Output
(1797, 10)
Here, we can partially fit on smaller batches of data (as we did with 100 samples per batch), or we can let the fit() function divide the data into batches.
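To make the out-of-core idea concrete, the following assumed sketch feeds the digits data to IncrementalPCA chunk by chunk with partial_fit, instead of calling fit_transform on the whole array −
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import IncrementalPCA

X, _ = load_digits(return_X_y = True)
ipca = IncrementalPCA(n_components = 10)
# Process the data in chunks of roughly 100 samples; only one chunk is used at a time.
for chunk in np.array_split(X, 18):
    ipca.partial_fit(chunk)
X_transformed = ipca.transform(X)
print(X_transformed.shape)   # (1797, 10)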
Kernel PCA
Kernel Principal Component Analysis, an extension of PCA, achieves non-linear dimensionality reduction using kernels. It supports both transform and inverse_transform. The Scikit-learn ML library provides the sklearn.decomposition.KernelPCA module.
Example
The example below uses the sklearn.decomposition.KernelPCA module on the Sklearn digits dataset. We are using the sigmoid kernel.
from sklearn.datasets import load_digits
from sklearn.decomposition import KernelPCA
X, _ = load_digits(return_X_y = True)
transformer = KernelPCA(n_components = 10, kernel = "sigmoid")
X_transformed = transformer.fit_transform(X)
X_transformed.shape
Output
(1797, 10)
PCA using randomized SVD
Principal Component Analysis (PCA) using randomized SVD is used to project data to a lower-dimensional space, preserving most of the variance, by dropping the singular vectors of components associated with lower singular values. Here, the sklearn.decomposition.PCA module with the optional parameter svd_solver = 'randomized' is going to be very useful.
Example
The example below uses the sklearn.decomposition.PCA module with the optional parameter svd_solver = 'randomized' to find the best 7 principal components from the Pima Indians Diabetes dataset.
from pandas import read_csv
from sklearn.decomposition import PCA
path = r"C:\Users\Leekha\Desktop\pima-indians-diabetes.csv"
names = ["preg", "plas", "pres", "skin", "test", "mass", "pedi", "age", "class"]
dataframe = read_csv(path, names = names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
pca = PCA(n_components = 7, svd_solver = "randomized")
fit = pca.fit(X)
print(("Explained Variance: %s") % (fit.explained_variance_ratio_))
print(fit.components_)
Output
Explained Variance: [8.88546635e-01 6.15907837e-02 2.57901189e-02 1.30861374e-02
 7.44093864e-03 3.02614919e-03 5.12444875e-04]
[[-2.02176587e-03  9.78115765e-02  1.60930503e-02  6.07566861e-02
   9.93110844e-01  1.40108085e-02  5.37167919e-04 -3.56474430e-03]
 [-2.26488861e-02 -9.72210040e-01 -1.41909330e-01  5.78614699e-02
   9.46266913e-02 -4.69729766e-02 -8.16804621e-04 -1.40168181e-01]
 [-2.24649003e-02  1.43428710e-01 -9.22467192e-01 -3.07013055e-01
   2.09773019e-02 -1.32444542e-01 -6.39983017e-04 -1.25454310e-01]
 [-4.90459604e-02  1.19830016e-01 -2.62742788e-01  8.84369380e-01
  -6.55503615e-02  1.92801728e-01  2.69908637e-03 -3.01024330e-01]
 [ 1.51612874e-01 -8.79407680e-02 -2.32165009e-01  2.59973487e-01
  -1.72312241e-04  2.14744823e-02  1.64080684e-03  9.20504903e-01]
 [-5.04730888e-03  5.07391813e-02  7.56365525e-02  2.21363068e-01
  -6.13326472e-03 -9.70776708e-01 -2.02903702e-03 -1.51133239e-02]
 [ 9.86672995e-01  8.83426114e-04 -1.22975947e-03 -3.76444746e-04
   1.42307394e-03 -2.73046214e-03 -6.34402965e-03 -1.62555343e-01]]
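As a self-contained, assumed illustration of the randomized solver (the local CSV above is not required), the same idea can be run on the built-in digits dataset and compared against the exact solver −
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y = True)
# Exact SVD versus randomized SVD; the explained variance ratios come out very close.
pca_full = PCA(n_components = 7, svd_solver = "full").fit(X)
pca_rand = PCA(n_components = 7, svd_solver = "randomized", random_state = 0).fit(X)
print(pca_full.explained_variance_ratio_)
print(pca_rand.explained_variance_ratio_)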
Scikit Learn – Data Representation
As we know, machine learning is about creating a model from data. For this purpose, the computer must understand the data first. Next, we are going to discuss various ways of representing the data so that it can be understood by the computer −
Data as table
The best way to represent data in Scikit-learn is in the form of tables. A table represents a 2-D grid of data where the rows represent the individual elements of the dataset and the columns represent the quantities related to those individual elements.
Example
With the example given below, we can download the iris dataset in the form of a Pandas DataFrame with the help of the Python seaborn library.
import seaborn as sns
iris = sns.load_dataset("iris")
iris.head()
Output
  sepal_length sepal_width petal_length petal_width species
0          5.1         3.5          1.4         0.2  setosa
1          4.9         3.0          1.4         0.2  setosa
2          4.7         3.2          1.3         0.2  setosa
3          4.6         3.1          1.5         0.2  setosa
4          5.0         3.6          1.4         0.2  setosa
From the above output, we can see that each row of the data represents a single observed flower and the number of rows represents the total number of flowers in the dataset. Generally, we refer to the rows of the matrix as samples. On the other hand, each column of the data represents quantitative information describing each sample. Generally, we refer to the columns of the matrix as features.
Data as Feature Matrix
A features matrix may be defined as the table layout where the information can be thought of as a 2-D matrix. It is stored in a variable named X and is assumed to be two-dimensional with shape [n_samples, n_features]. Mostly, it is contained in a NumPy array or a Pandas DataFrame. As told earlier, the samples always represent the individual objects described by the dataset and the features represent the distinct observations that describe each sample in a quantitative manner.
Data as Target array
Along with the features matrix, denoted by X, we also have a target array. It is also called the label. It is denoted by y. The label or target array is usually one-dimensional, having length n_samples. It is generally contained in a NumPy array or Pandas Series. The target array may hold both continuous numerical values and discrete values.
How does the target array differ from feature columns? We can distinguish the two by one point: the target array is usually the quantity we want to predict from the data, i.e. in statistical terms it is the dependent variable.
Example
In the example below, from the iris dataset we predict the species of flower based on the other measurements. In this case, the species column would be considered the target (label) and the remaining columns the features.
import seaborn as sns
iris = sns.load_dataset("iris")
%matplotlib inline
import seaborn as sns; sns.set()
sns.pairplot(iris, hue="species", height=3);
Output
(A pairplot of the iris features, colored by species, is displayed.)
X_iris = iris.drop("species", axis=1)
X_iris.shape
y_iris = iris["species"]
y_iris.shape
Output
(150, 4)
(150,)
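The same feature matrix/target array split can be seen without seaborn; this short assumed sketch uses scikit-learn's built-in copy of the iris data −
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data      # feature matrix, shape (n_samples, n_features)
y = iris.target    # target array, length n_samples
print(X.shape)     # (150, 4)
print(y.shape)     # (150,)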
Scikit Learn – Quick Guide
Scikit Learn – Introduction
In this chapter, we will understand what Scikit-Learn or Sklearn is, the origin of Scikit-Learn and some other related topics such as the communities and contributors responsible for the development and maintenance of Scikit-Learn, its prerequisites, installation and its features.
What is Scikit-Learn (Sklearn)
Scikit-learn (Sklearn) is the most useful and robust library for machine learning in Python. It provides a selection of efficient tools for machine learning and statistical modeling including classification, regression, clustering and dimensionality reduction via a consistent interface in Python. This library, which is largely written in Python, is built upon NumPy, SciPy and Matplotlib.
Origin of Scikit-Learn
It was originally called scikits.learn and was initially developed by David Cournapeau as a Google Summer of Code project in 2007. Later, in 2010, Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort, and Vincent Michel, from INRIA (the French Institute for Research in Computer Science and Automation), took this project to another level and made the first public release (v0.1 beta) on 1st Feb. 2010.
Let's have a look at its version history −
May 2019: scikit-learn 0.21.0
March 2019: scikit-learn 0.20.3
December 2018: scikit-learn 0.20.2
November 2018: scikit-learn 0.20.1
September 2018: scikit-learn 0.20.0
July 2018: scikit-learn 0.19.2
July 2017: scikit-learn 0.19.0
September 2016: scikit-learn 0.18.0
November 2015: scikit-learn 0.17.0
March 2015: scikit-learn 0.16.0
July 2014: scikit-learn 0.15.0
August 2013: scikit-learn 0.14
Community & contributors
Scikit-learn is a community effort and anyone can contribute to it. The project is hosted on GitHub. The following people are currently the core contributors to Sklearn's development and maintenance −
Joris Van den Bossche (Data Scientist)
Thomas J Fan (Software Developer)
Alexandre Gramfort (Machine Learning Researcher)
Olivier Grisel (Machine Learning Expert)
Nicolas Hug (Associate Research Scientist)
Andreas Mueller (Machine Learning Scientist)
Hanmin Qin (Software Engineer)
Adrin Jalali (Open Source Developer)
Nelle Varoquaux (Data Science Researcher)
Roman Yurchak (Data Scientist)
Various organisations like Booking.com, JP Morgan, Evernote, Inria, AWeber, Spotify and many more are using Sklearn.
Prerequisites
Before we start using the latest release of scikit-learn, we require the following −
Python (>= 3.5)
NumPy (>= 1.11.0)
SciPy (>= 0.17.0)
Joblib (>= 0.11)
Matplotlib (>= 1.5.1) is required for Sklearn plotting capabilities.
Pandas (>= 0.18.0) is required for some of the scikit-learn examples using data structures and analysis.
Installation
If you have already installed NumPy and SciPy, the following are the two easiest ways to install scikit-learn −
Using pip
The following command can be used to install scikit-learn via pip −
pip install -U scikit-learn
Using conda
The following command can be used to install scikit-learn via conda −
conda install scikit-learn
On the other hand, if NumPy and SciPy are not yet installed on your Python workstation, you can install them by using either pip or conda. Another option is to use a Python distribution like Canopy or Anaconda, as both ship the latest version of scikit-learn.
Features
Rather than focusing on loading, manipulating and summarising data, the Scikit-learn library is focused on modeling the data.
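After installing, a quick way (an assumed check, not part of the original text) to verify the installation and see which version is in use −
import sklearn
print(sklearn.__version__)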
Some of the most popular groups of models provided by Sklearn are as follows −
Supervised Learning algorithms − Almost all the popular supervised learning algorithms, like Linear Regression, Support Vector Machine (SVM), Decision Tree etc., are part of scikit-learn.
Unsupervised Learning algorithms − On the other hand, it also has all the popular unsupervised learning algorithms, from clustering, factor analysis and PCA (Principal Component Analysis) to unsupervised neural networks.
Clustering − This model is used for grouping unlabeled data.
Cross Validation − It is used to check the accuracy of supervised models on unseen data.
Dimensionality Reduction − It is used for reducing the number of attributes in data, which can be further used for summarisation, visualisation and feature selection.
Ensemble methods − As the name suggests, it is used for combining the predictions of multiple supervised models.
Feature extraction − It is used to extract features from data, to define the attributes in image and text data.
Feature selection − It is used to identify useful attributes to create supervised models.
Open Source − It is an open source library and is also commercially usable under the BSD license.
Scikit Learn – Modelling Process
This chapter deals with the modelling process involved in Sklearn. Let us understand it in detail, beginning with dataset loading.
Dataset Loading
A collection of data is called a dataset. It has the following two components −
Features − The variables of the data are called its features. They are also known as predictors, inputs or attributes.
Feature matrix − It is the collection of features, in case there is more than one.
Feature Names − It is the list of all the names of the features.
Response − It is the output variable that basically depends upon the feature variables. It is also known as the target, label or output.
Response Vector − It is used to represent the response column. Generally, we have just one response column.
Target Names − They represent the possible values taken by a response vector.
Scikit-learn has a few example datasets like iris and digits for classification and the Boston house prices for regression.
Example
Following is an example to load the iris dataset −
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names
target_names = iris.target_names
print("Feature names:", feature_names)
print("Target names:", target_names)
print("\nFirst 10 rows of X:\n", X[:10])
Output
Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Target names: ['setosa' 'versicolor' 'virginica']
First 10 rows of X:
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]]
Splitting the dataset
To check the accuracy of our model, we can split the dataset into two pieces − a training set and a testing set. We use the training set to train the model and the testing set to test the model. After that, we can evaluate how well the model performs on the unseen test data.
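A minimal sketch of such a split on the iris data loaded above (the 70/30 ratio and random_state are illustrative assumptions, not from the original text) −
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
# Hold out 30% of the samples for testing; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size = 0.3, random_state = 1
)
print(X_train.shape, X_test.shape)   # (105, 4) (45, 4)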
Scikit Learn – Extended Linear Modeling
This chapter focuses on the polynomial features and pipelining tools in Sklearn.
Introduction to Polynomial Features
Linear models trained on non-linear functions of the data generally maintain the fast performance of linear methods, while allowing them to fit a much wider range of data. That is why such linear models, trained on non-linear functions, are used in machine learning. One example is that a simple linear regression can be extended by constructing polynomial features from the data.
Mathematically, suppose we have a standard linear regression model; for 2-D data it would look like this −
$$Y=W_{0}+W_{1}X_{1}+W_{2}X_{2}$$
Now, we can combine the features in second-order polynomials and our model will look as follows −
$$Y=W_{0}+W_{1}X_{1}+W_{2}X_{2}+W_{3}X_{1}X_{2}+W_{4}X_1^2+W_{5}X_2^2$$
The above is still a linear model. Here, we see that the resulting polynomial regression is in the same class of linear models and can be solved similarly.
To do so, scikit-learn provides a module named PolynomialFeatures. This module transforms an input data matrix into a new data matrix of a given degree.
Parameters
The following table consists of the parameters used by the PolynomialFeatures module −
Sr.No Parameter & Description
1 degree − integer, default = 2
It represents the degree of the polynomial features.
2 interaction_only − Boolean, default = False
By default it is False, but if set to True, only features that are products of at most degree distinct input features are produced. Such features are called interaction features.
3 include_bias − Boolean, default = True
It includes a bias column, i.e. the feature in which all polynomial powers are zero.
4 order − str in {'C', 'F'}, default = 'C'
This parameter represents the order of the output array in the dense case. 'F' order may be faster to compute, but it may slow down subsequent estimators.
Attributes
The following table consists of the attributes used by the PolynomialFeatures module −
Sr.No Attribute & Description
1 powers_ − array, shape (n_output_features, n_input_features)
powers_[i, j] is the exponent of the jth input in the ith output.
2 n_input_features_ − int
As the name suggests, it gives the total number of input features.
3 n_output_features_ − int
As the name suggests, it gives the total number of polynomial output features.
Implementation Example
The following Python script reshapes an array of 8 values into shape (4, 2) and transforms it with the PolynomialFeatures transformer −
from sklearn.preprocessing import PolynomialFeatures
import numpy as np
Y = np.arange(8).reshape(4, 2)
poly = PolynomialFeatures(degree=2)
poly.fit_transform(Y)
Output
array([[ 1.,  0.,  1.,  0.,  0.,  1.],
       [ 1.,  2.,  3.,  4.,  6.,  9.],
       [ 1.,  4.,  5., 16., 20., 25.],
       [ 1.,  6.,  7., 36., 42., 49.]])
Streamlining using Pipeline tools
The above sort of preprocessing, i.e. transforming an input data matrix into a new data matrix of a given degree, can be streamlined with the Pipeline tools, which are basically used to chain multiple estimators into one.
Example
The Python script below uses Scikit-learn's Pipeline tools to streamline the preprocessing (it will fit the data to an order-3 polynomial).
#First, import the necessary packages.
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
import numpy as np
#Next, create an object of the Pipeline tool.
Stream_model = Pipeline([("poly", PolynomialFeatures(degree=3)), ("linear", LinearRegression(fit_intercept=False))])
#Provide the size of the array and the order of the polynomial data to fit the model.
x = np.arange(5)
y = 3 - 2 * x + x ** 2 - x ** 3
Stream_model = Stream_model.fit(x[:, np.newaxis], y)
#Retrieve the fitted polynomial coefficients.
Stream_model.named_steps["linear"].coef_
Output
array([ 3., -2., 1., -1.])
The above output shows that the linear model trained on polynomial features is able to recover the exact input polynomial coefficients.
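To make the interaction_only parameter and the powers_ attribute from the tables above concrete, here is a short assumed example −
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.arange(6).reshape(3, 2)
# With interaction_only=True, squared terms are dropped: the columns are 1, x1, x2, x1*x2.
poly = PolynomialFeatures(degree=2, interaction_only=True)
print(poly.fit_transform(X))
# powers_[i, j] is the exponent of input feature j in output feature i.
print(poly.powers_)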
Scikit Learn – Useful Resources
The following resources contain additional information on Scikit Learn. Please use them to get more in-depth knowledge of the topic.
Scikit Learn – Classification with Naïve Bayes
Naïve Bayes methods are a set of supervised learning algorithms based on applying Bayes' theorem with the strong assumption that all the predictors are independent of each other, i.e. the presence of a feature in a class is independent of the presence of any other feature in the same class. This is a naïve assumption, which is why these methods are called Naïve Bayes methods.
Bayes' theorem states the following relationship in order to find the posterior probability of the class, i.e. the probability of a label given some observed features, $P(Y \mid features)$ −
$$P(Y \mid features)=\frac{P(Y)\:P(features \mid Y)}{P(features)}$$
Here,
$P(Y \mid features)$ is the posterior probability of the class.
$P(Y)$ is the prior probability of the class.
$P(features \mid Y)$ is the likelihood, i.e. the probability of the predictors given the class.
$P(features)$ is the prior probability of the predictors.
Scikit-learn provides different naïve Bayes classifier models, namely Gaussian, Multinomial, Complement and Bernoulli. They differ mainly in the assumption they make regarding the distribution of $P(features \mid Y)$, i.e. the probability of the predictors given the class.
Sr.No Model & Description
1 Gaussian Naïve Bayes
The Gaussian Naïve Bayes classifier assumes that the data for each label is drawn from a simple Gaussian distribution.
2 Multinomial Naïve Bayes
It assumes that the features are drawn from a simple multinomial distribution.
3 Bernoulli Naïve Bayes
The assumption in this model is that the features are binary (0s and 1s) in nature. An application of Bernoulli Naïve Bayes classification is text classification with the 'bag of words' model.
4 Complement Naïve Bayes
It was designed to correct the severe assumptions made by the Multinomial Naïve Bayes classifier. This kind of NB classifier is suitable for imbalanced data sets.
Building a Naïve Bayes Classifier
We can also apply a Naïve Bayes classifier to a Scikit-learn dataset. In the example below, we apply GaussianNB and fit it on the breast_cancer dataset of Scikit-learn.
Example
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
data = load_breast_cancer()
label_names = data["target_names"]
labels = data["target"]
feature_names = data["feature_names"]
features = data["data"]
print(label_names)
print(labels[0])
print(feature_names[0])
print(features[0])
train, test, train_labels, test_labels = train_test_split(
   features, labels, test_size = 0.40, random_state = 42
)
from sklearn.naive_bayes import GaussianNB
GNBclf = GaussianNB()
model = GNBclf.fit(train, train_labels)
preds = GNBclf.predict(test)
print(preds)
Output
[1 0 0 1 1 0 0 0 1 1 1 0 1 0 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 0 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 0 0 1 1 0 0 1 1 1 0 0 1 1 0 0 1 0 1 1 1 1 1 1 0 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 1 0 0 1 0 0 1 1 1 0 1 1 0 1 1 0 0 0 1 1 1 0 0 1 1 0 1 0 0 1 1 0 0 0 1 1 1 0 1 1 0 0 1 0 1 1 0 1 0 0 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1 0 0 0 1 1 0 1 0 1 1 1 1 0 1 1 0 1 1 1 0 1 0 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 0 0 1 1 0 1]
The above output consists of a series of 0s and 1s, which are the predicted values for the tumor classes, namely malignant and benign.
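As an assumed follow-up (not part of the original example), the predictions can be scored against the held-out labels with accuracy_score −
from sklearn.metrics import accuracy_score
# test_labels and preds come from the GaussianNB example above.
print(accuracy_score(test_labels, preds))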
Scikit Learn – Clustering Performance Evaluation
There are various functions with the help of which we can evaluate the performance of clustering algorithms. Following are some important and commonly used functions provided by Scikit-learn for evaluating clustering performance −
Adjusted Rand Index
The Rand Index is a function that computes a similarity measure between two clusterings. For this computation, the Rand Index considers all pairs of samples and counts the pairs that are assigned to the same or to different clusters in the predicted and true clusterings. Afterwards, the raw Rand Index score is 'adjusted for chance' into the Adjusted Rand Index score by using the following formula −
$$Adjusted\:RI=\frac{RI-Expected\:RI}{\max(RI)-Expected\:RI}$$
It has two parameters, namely labels_true, which is the ground-truth class labels, and labels_pred, which are the cluster labels to evaluate.
Example
from sklearn.metrics.cluster import adjusted_rand_score
labels_true = [0, 0, 1, 1, 1, 1]
labels_pred = [0, 0, 2, 2, 3, 3]
adjusted_rand_score(labels_true, labels_pred)
Output
0.4444444444444445
Perfect labeling is scored 1, while bad or independent labeling is scored 0 or negative.
Mutual Information Based Score
Mutual Information is a function that computes the agreement of two assignments. It ignores permutations. The following versions are available −
Normalized Mutual Information (NMI)
Scikit-learn has the sklearn.metrics.normalized_mutual_info_score module.
Example
from sklearn.metrics.cluster import normalized_mutual_info_score
labels_true = [0, 0, 1, 1, 1, 1]
labels_pred = [0, 0, 2, 2, 3, 3]
normalized_mutual_info_score(labels_true, labels_pred)
Output
0.7611702597222881
Adjusted Mutual Information (AMI)
Scikit-learn has the sklearn.metrics.adjusted_mutual_info_score module.
Example
from sklearn.metrics.cluster import adjusted_mutual_info_score
labels_true = [0, 0, 1, 1, 1, 1]
labels_pred = [0, 0, 2, 2, 3, 3]
adjusted_mutual_info_score(labels_true, labels_pred)
Output
0.4444444444444448
Fowlkes-Mallows Score
The Fowlkes-Mallows function measures the similarity of two clusterings of a set of points. It may be defined as the geometric mean of the pairwise precision and recall. Mathematically,
$$FMS=\frac{TP}{\sqrt{(TP+FP)(TP+FN)}}$$
Here,
TP = True Positive − number of pairs of points belonging to the same cluster in the true as well as the predicted labels.
FP = False Positive − number of pairs of points belonging to the same cluster in the true labels but not in the predicted labels.
FN = False Negative − number of pairs of points belonging to the same cluster in the predicted labels but not in the true labels.
Scikit-learn has the sklearn.metrics.fowlkes_mallows_score module −
Example
from sklearn.metrics.cluster import fowlkes_mallows_score
labels_true = [0, 0, 1, 1, 1, 1]
labels_pred = [0, 0, 2, 2, 3, 3]
fowlkes_mallows_score(labels_true, labels_pred)
Output
0.6546536707079771
Silhouette Coefficient
The Silhouette function computes the mean Silhouette Coefficient of all samples using the mean intra-cluster distance and the mean nearest-cluster distance for each sample. Mathematically,
$$S=\frac{b-a}{\max(a,b)}$$
Here, a is the mean intra-cluster distance and b is the mean nearest-cluster distance.
Scikit-learn has the sklearn.metrics.silhouette_score module −
Example
from sklearn.metrics import silhouette_score
from sklearn.metrics import pairwise_distances
from sklearn import datasets
import numpy as np
from sklearn.cluster import KMeans
dataset = datasets.load_iris()
X = dataset.data
y = dataset.target
kmeans_model = KMeans(n_clusters = 3, random_state = 1).fit(X)
labels = kmeans_model.labels_
silhouette_score(X, labels, metric = "euclidean")
Output
0.5528190123564091
Contingency Matrix
This matrix reports the intersection cardinality for every (true, predicted) cluster pair. The confusion matrix for classification problems is a square contingency matrix.
Scikit-learn has the sklearn.metrics.cluster.contingency_matrix module.
Example
from sklearn.metrics.cluster import contingency_matrix
x = ["a", "a", "a", "b", "b", "b"]
y = [1, 1, 2, 0, 1, 2]
contingency_matrix(x, y)
Output
array([[0, 2, 1],
       [1, 1, 1]])
The first row of the above output shows that among the three samples whose true cluster is "a", none is in cluster 0, two are in cluster 1 and one is in cluster 2. On the other hand, the second row shows that among the three samples whose true cluster is "b", one is in cluster 0, one is in cluster 1 and one is in cluster 2.
Scikit Learn – Randomized Decision Trees
This chapter will help you in understanding randomized decision trees in Sklearn.
Randomized Decision Tree algorithms
As we know, a decision tree (DT) is usually trained by recursively splitting the data, but because decision trees are prone to overfitting, they have been transformed into random forests by training many trees over various subsamples of the data. The sklearn.ensemble module provides the following two algorithms based on randomized decision trees −
The Random Forest algorithm
For each feature under consideration, it computes the locally optimal feature/split combination. In a random forest, each decision tree in the ensemble is built from a sample drawn with replacement from the training set. The forest then gets a prediction from each tree and selects the final answer by voting. It can be used for both classification and regression tasks.
Classification with Random Forest
For creating a random forest classifier, the Scikit-learn module provides sklearn.ensemble.RandomForestClassifier. While building a random forest classifier, the main parameters this module uses are 'max_features' and 'n_estimators'.
Here, 'max_features' is the size of the random subsets of features to consider when splitting a node. If we set this parameter's value to None, it will consider all the features rather than a random subset. On the other hand, 'n_estimators' is the number of trees in the forest. The higher the number of trees, the better the result will be, but it will also take longer to compute.
Implementation example
In the following example, we are building a random forest classifier by using sklearn.ensemble.RandomForestClassifier and also checking its accuracy by using the cross_val_score module.
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
X, y = make_blobs(n_samples = 10000, n_features = 10, centers = 100, random_state = 0)
RFclf = RandomForestClassifier(n_estimators = 10, max_depth = None, min_samples_split = 2, random_state = 0)
scores = cross_val_score(RFclf, X, y, cv = 5)
scores.mean()
Output
0.9997
Example
We can also use a sklearn dataset to build a Random Forest classifier. In the following example we are using the iris dataset. We will also find its accuracy score and confusion matrix.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
path = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
headernames = ["sepal-length", "sepal-width", "petal-length", "petal-width", "Class"]
dataset = pd.read_csv(path, names = headernames)
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30)
RFclf = RandomForestClassifier(n_estimators = 50)
RFclf.fit(X_train, y_train)
y_pred = RFclf.predict(X_test)
result = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(result)
result1 = classification_report(y_test, y_pred)
print("Classification Report:",)
print(result1)
result2 = accuracy_score(y_test, y_pred)
print("Accuracy:", result2)
Output
Confusion Matrix:
[[14  0  0]
 [ 0 18  1]
 [ 0  0 12]]
Classification Report:
                 precision  recall  f1-score  support
Iris-setosa           1.00    1.00      1.00       14
Iris-versicolor       1.00    0.95      0.97       19
Iris-virginica        0.92    1.00      0.96       12
micro avg             0.98    0.98      0.98       45
macro avg             0.97    0.98      0.98       45
weighted avg          0.98    0.98      0.98       45
Accuracy: 0.9777777777777777
Regression with Random Forest
For creating a random forest regression, the Scikit-learn module provides sklearn.ensemble.RandomForestRegressor. While building a random forest regressor, it will use the same parameters as used by sklearn.ensemble.RandomForestClassifier.
Implementation example
In the following example, we are building a random forest regressor by using sklearn.ensemble.RandomForestRegressor and also predicting for new values by using the predict() method.
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
X, y = make_regression(n_features = 10, n_informative = 2, random_state = 0, shuffle = False)
RFregr = RandomForestRegressor(max_depth = 10, random_state = 0, n_estimators = 100)
RFregr.fit(X, y)
Output
RandomForestRegressor(
   bootstrap = True, criterion = "mse", max_depth = 10,
   max_features = "auto", max_leaf_nodes = None,
   min_impurity_decrease = 0.0, min_impurity_split = None,
   min_samples_leaf = 1, min_samples_split = 2,
   min_weight_fraction_leaf = 0.0, n_estimators = 100, n_jobs = None,
   oob_score = False, random_state = 0, verbose = 0, warm_start = False
)
Once fitted, we can predict from the regression model as follows −
print(RFregr.predict([[0, 2, 3, 0, 1, 1, 1, 1, 2, 2]]))
Output
[98.47729198]
Extra-Tree Methods
For each feature under consideration, the Extra-Tree method selects a random value for the split. The benefit of using extra tree methods is that they reduce the variance of the model a bit more. The disadvantage is that they slightly increase the bias.
Classification with Extra-Tree Method
For creating a classifier using the Extra-Tree method, the Scikit-learn module provides sklearn.ensemble.ExtraTreesClassifier. It uses the same parameters as used by sklearn.ensemble.RandomForestClassifier. The only difference is in the way, discussed above, they build trees.
Implementation example
In the following example, we are building an Extra-Trees classifier by using sklearn.ensemble.ExtraTreesClassifier and also checking its accuracy by using the cross_val_score module.
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_blobs
from sklearn.ensemble import ExtraTreesClassifier
X, y = make_blobs(n_samples = 10000, n_features = 10, centers = 100, random_state = 0)
ETclf = ExtraTreesClassifier(n_estimators = 10, max_depth = None, min_samples_split = 10, random_state = 0)
scores = cross_val_score(ETclf, X, y, cv = 5)
scores.mean()
Output
1.0
Example
We can also use a sklearn dataset to build a classifier using the Extra-Tree method. In the following example we are using the Pima-Indian dataset.
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import ExtraTreesClassifier
path = r"C:\pima-indians-diabetes.csv"
headernames = ["preg", "plas", "pres", "skin", "test", "mass", "pedi", "age", "class"]
data = read_csv(path, names = headernames)
array = data.values
X = array[:,0:8]
Y = array[:,8]
seed = 7
kfold = KFold(n_splits = 10, random_state = seed)
num_trees = 150
max_features = 5
ETclf = ExtraTreesClassifier(n_estimators = num_trees, max_features = max_features)
results = cross_val_score(ETclf, X, Y, cv = kfold)
print(results.mean())
Output
0.7551435406698566
Regression with Extra-Tree Method
For creating an Extra-Tree regression, the Scikit-learn module provides sklearn.ensemble.ExtraTreesRegressor. While building an Extra-Tree regressor, it will use the same parameters as used by sklearn.ensemble.ExtraTreesClassifier.
Implementation example
In the following example, we are applying sklearn.ensemble.ExtraTreesRegressor on the same data as we used while creating the random forest regressor. Let us see the difference in the output.
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.datasets import make_regression
X, y = make_regression(n_features = 10, n_informative = 2, random_state = 0, shuffle = False)
ETregr = ExtraTreesRegressor(max_depth = 10, random_state = 0, n_estimators = 100)
ETregr.fit(X, y)
Output
ExtraTreesRegressor(bootstrap = False, criterion =
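Once fitted, predictions can be obtained from the Extra-Tree regressor exactly as with the random forest regressor; a brief assumed sketch (the input values are illustrative) −
# Predict a new value with the fitted ExtraTreesRegressor from the example above.
print(ETregr.predict([[0, 2, 3, 0, 1, 1, 1, 1, 2, 2]]))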
Scikit Learn – Stochastic Gradient Descent
Here, we will learn about an optimization algorithm in Sklearn termed Stochastic Gradient Descent (SGD).
Stochastic Gradient Descent (SGD) is a simple yet efficient optimization algorithm used to find the values of parameters/coefficients of functions that minimize a cost function. In other words, it is used for discriminative learning of linear classifiers under convex loss functions such as SVM and logistic regression. It has been successfully applied to large-scale datasets because the update to the coefficients is performed for each training instance, rather than at the end of all instances.
SGD Classifier
The Stochastic Gradient Descent (SGD) classifier basically implements a plain SGD learning routine supporting various loss functions and penalties for classification. Scikit-learn provides the SGDClassifier module to implement SGD classification.
Parameters
The following table consists of the parameters used by the SGDClassifier module −
Sr.No Parameter & Description
1 loss − str, default = 'hinge'
It represents the loss function to be used while implementing. The default value is 'hinge', which will give us a linear SVM. The other options which can be used are −
log − This loss will give us logistic regression, i.e. a probabilistic classifier.
modified_huber − a smooth loss that brings tolerance to outliers along with probability estimates.
squared_hinge − similar to the 'hinge' loss but quadratically penalized.
perceptron − as the name suggests, it is the linear loss used by the perceptron algorithm.
2 penalty − str, 'none', 'l2', 'l1', 'elasticnet'
It is the regularization term used in the model. By default, it is L2. We can use L1 or 'elasticnet' as well, but both might bring sparsity to the model, which is not achievable with L2.
3 alpha − float, default = 0.0001
Alpha, the constant that multiplies the regularization term, is the tuning parameter that decides how much we want to penalize the model. The default value is 0.0001.
4 l1_ratio − float, default = 0.15
This is called the ElasticNet mixing parameter. Its range is 0 <= l1_ratio <= 1. If l1_ratio = 1, the penalty would be an L1 penalty. If l1_ratio = 0, the penalty would be an L2 penalty.
5 fit_intercept − Boolean, default = True
This parameter specifies that a constant (bias or intercept) should be added to the decision function. If it is set to False, no intercept will be used in the calculation and the data will be assumed to be already centered.
6 tol − float or None, optional, default = 1.e-3
This parameter represents the stopping criterion for iterations. Its default value is 1e-3; if it is set to None, the iterations will stop when loss > best_loss − tol for n_iter_no_change successive epochs.
7 shuffle − Boolean, optional, default = True
This parameter represents whether we want our training data to be shuffled after each epoch or not.
8 verbose − integer, default = 0
It represents the verbosity level. Its default value is 0.
9 epsilon − float, default = 0.1
This parameter specifies the width of the insensitive region. If loss = 'epsilon-insensitive', any difference between the current prediction and the correct label that is less than the threshold will be ignored.
10 max_iter − int, optional, default = 1000
As the name suggests, it represents the maximum number of passes over the training data (i.e. epochs).
11 warm_start − bool, optional, default = False
With this parameter set to True, we can reuse the solution of the previous call to fit as initialization. If we choose the default, i.e.
False, it will erase the previous solution.
12 random_state − int, RandomState instance or None, optional, default = None
This parameter represents the seed of the pseudo-random number generator used while shuffling the data. The options are as follows −
int − In this case, random_state is the seed used by the random number generator.
RandomState instance − In this case, random_state is the random number generator.
None − In this case, the random number generator is the RandomState instance used by np.random.
13 n_jobs − int or None, optional, default = None
It represents the number of CPUs to be used in the OVA (One Versus All) computation for multi-class problems. The default value is None, which means 1.
14 learning_rate − string, optional, default = 'optimal'
If the learning rate is 'constant', eta = eta0;
If the learning rate is 'optimal', eta = 1.0/(alpha*(t+t0)), where t0 is chosen by Leon Bottou;
If the learning rate is 'invscaling', eta = eta0/pow(t, power_t);
If the learning rate is 'adaptive', eta = eta0.
15 eta0 − double, default = 0.0
It represents the initial learning rate for the above-mentioned learning rate options, i.e. 'constant', 'invscaling' or 'adaptive'.
16 power_t − double, default = 0.5
It is the exponent for the 'invscaling' learning rate.
17 early_stopping − bool, default = False
This parameter represents the use of early stopping to terminate training when the validation score is not improving. Its default value is False, but when set to True, it automatically sets aside a stratified fraction of the training data as a validation set and stops training when the validation score is not improving.
18 validation_fraction − float, default = 0.1
It is only used when early_stopping is True. It represents the proportion of training data to set aside as the validation set for early termination of training.
19 n_iter_no_change − int, default = 5
It represents the number of iterations with no improvement that the algorithm should run before early stopping.
20 class_weight − dict, {class_label: weight}, "balanced", or None, optional
This parameter represents the weights associated with classes. If not provided, the classes are supposed to have weight 1.
21 average − Boolean or int, optional, default = False
When set to True, it computes the averaged SGD weights and stores the result in the coef_ attribute. If set to an integer greater than 1, averaging begins once the total number of samples seen reaches that value.
Attributes
The following table consists of the attributes used by the SGDClassifier module −