Scikit Learn – Modelling Process

This chapter deals with the modelling process involved in Sklearn. Let us understand the same in detail and begin with dataset loading.

Dataset Loading

A collection of data is called a dataset. A dataset has the following two main components −

Features − The variables of the data are called its features. They are also known as predictors, inputs or attributes.

Feature matrix − It is the collection of features, in case there is more than one.

Feature Names − It is the list of all the names of the features.

Response − It is the output variable that basically depends upon the feature variables. It is also known as target, label or output.

Response Vector − It is used to represent the response column. Generally, we have just one response column.

Target Names − They represent the possible values taken by a response vector.

Scikit-learn has a few example datasets like iris and digits for classification and the Boston house prices for regression.

Example

Following is an example to load the iris dataset −

from sklearn.datasets import load_iris
iris = load_iris()

X = iris.data
y = iris.target
feature_names = iris.feature_names
target_names = iris.target_names

print("Feature names:", feature_names)
print("Target names:", target_names)
print("\nFirst 10 rows of X:\n", X[:10])

Output

Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Target names: ['setosa' 'versicolor' 'virginica']

First 10 rows of X:
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]]

Splitting the dataset

To check the accuracy of our model, we can split the dataset into two pieces − a training set and a testing set. Use the training set to train the model and the testing set to test the model. After that, we can evaluate how well our model did.

Example

The following example will split the data into a 70:30 ratio, i.e. 70% of the data will be used as training data and 30% will be used as testing data. The dataset is the iris dataset, as in the above example.

from sklearn.datasets import load_iris
iris = load_iris()

X = iris.data
y = iris.target

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
   X, y, test_size = 0.3, random_state = 1
)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

Output

(105, 4)
(45, 4)
(105,)
(45,)

As seen in the example above, it uses the train_test_split() function of scikit-learn to split the dataset. This function has the following arguments −

X, y − Here, X is the feature matrix and y is the response vector, which need to be split.

test_size − This represents the ratio of test data to the total given data. As in the above example, we are setting test_size = 0.3 for 150 rows of X. It will produce test data of 150*0.3 = 45 rows.

random_state − It is used to guarantee that the split will always be the same. This is useful in situations where you want reproducible results.

Train the Model

Next, we can use our dataset to train a prediction model. As discussed, scikit-learn has a wide range of Machine Learning (ML) algorithms which have a consistent interface for fitting, prediction and evaluation (accuracy, recall, etc.).

Example

In the example below, we are going to use the KNN (K nearest neighbors) classifier. Don't go into the details of the KNN algorithm, as there will be a separate chapter for that. This example is used to make you understand the implementation part only.
from sklearn.datasets import load_iris
iris = load_iris()

X = iris.data
y = iris.target

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
   X, y, test_size = 0.4, random_state = 1
)

from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

classifier_knn = KNeighborsClassifier(n_neighbors = 3)
classifier_knn.fit(X_train, y_train)
y_pred = classifier_knn.predict(X_test)

# Finding accuracy by comparing actual response values (y_test) with predicted response values (y_pred)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

# Providing sample data; the model will make predictions out of that data
sample = [[5, 5, 3, 2], [2, 4, 3, 5]]
preds = classifier_knn.predict(sample)
pred_species = [iris.target_names[p] for p in preds]
print("Predictions:", pred_species)

Output

Accuracy: 0.9833333333333333
Predictions: ['versicolor', 'virginica']

Model Persistence

Once you train the model, it is desirable that the model be persisted for future use so that we do not need to retrain it again and again. This can be done with the help of the dump and load features of the joblib package. (In older Scikit-learn releases joblib was also exposed as sklearn.externals.joblib, which has since been removed.)

Consider the example below, in which we will be saving the above trained model (classifier_knn) for future use −

import joblib
joblib.dump(classifier_knn, 'iris_classifier_knn.joblib')

The above code will save the model into a file named iris_classifier_knn.joblib. Now, the object can be reloaded from the file with the help of the following code −

joblib.load('iris_classifier_knn.joblib')

Preprocessing the Data

As we are dealing with lots of data and that data is in raw form, we need to convert it into meaningful data before feeding it to machine learning algorithms. This process is called preprocessing the data. Scikit-learn has a package named preprocessing for this purpose. The preprocessing package has the following techniques −

Binarisation

This preprocessing technique is used when we need to convert our numerical values into Boolean values.

Example

import numpy as np
from sklearn import preprocessing

input_data = np.array(
   [[2.1, -1.9, 5.5],
    [-1.5, 2.4, 3.5],
    [0.5, -7.9, 5.6],
    [5.9, 2.3, -5.8]]
)
data_binarized = preprocessing.Binarizer(threshold = 0.5).transform(input_data)
print("\nBinarized data:\n", data_binarized)

In the above example, we used a threshold value of 0.5, which is why all the values above 0.5 are converted to 1, and all the values at or below 0.5 are converted to 0.

Output

Binarized data:
[[1. 0. 1.]
 [0. 1. 1.]
 [0. 0. 1.]
 [1. 1. 0.]]

Mean Removal

This technique is used to eliminate the mean from the feature vector so that every feature is centered on zero.

Example

import numpy as np
from sklearn import preprocessing

input_data = np.array(
   [[2.1, -1.9, 5.5],
    [-1.5, 2.4, 3.5],
    [0.5, -7.9, 5.6],
    [5.9, 2.3, -5.8]]
)
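The chapter text breaks off at this point. Below is a minimal sketch of how the mean-removal example typically continues, using preprocessing.scale; treat the exact calls as an assumption, since the original continuation is not shown −

# Continuation sketch − assumes numpy, preprocessing and input_data defined above
print("Mean before scaling:", input_data.mean(axis = 0))
print("Std deviation before scaling:", input_data.std(axis = 0))

# scale() removes the mean of each column; by default it also divides by the
# standard deviation, so pass with_std = False to only centre the data
data_centered = preprocessing.scale(input_data, with_std = False)
print("Mean after scaling:", data_centered.mean(axis = 0))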
Scikit Learn – KNN Learning

k-NN (k-Nearest Neighbor), one of the simplest machine learning algorithms, is non-parametric and lazy in nature. Non-parametric means that there is no assumption about the underlying data distribution, i.e. the model structure is determined from the dataset. Lazy or instance-based learning means that it does not require any training data points for model generation; the whole training data is used in the testing phase.

The k-NN algorithm consists of the following two steps −

Step 1 − In this step, it computes and stores the k nearest neighbors for each sample in the training set.

Step 2 − In this step, for an unlabeled sample, it retrieves the k nearest neighbors from the dataset. Then, among these k nearest neighbors, it predicts the class through voting (the class with the majority of votes wins).

The module sklearn.neighbors, which implements the k-nearest neighbors algorithm, provides the functionality for unsupervised as well as supervised neighbors-based learning methods.

The unsupervised nearest neighbors implement different algorithms (BallTree, KDTree or Brute Force) to find the nearest neighbor(s) for each sample. This unsupervised version is basically only step 1, discussed above, and is the foundation of many algorithms (KNN and K-means being the famous ones) which require the neighbor search. In simple words, it is an unsupervised learner for implementing neighbor searches. On the other hand, supervised neighbors-based learning is used for classification as well as regression.

Unsupervised KNN Learning

As discussed, there exist many algorithms like KNN and K-Means that require nearest neighbor searches. That is why Scikit-learn decided to implement the neighbor search part as its own "learner". The reason behind making the neighbor search a separate learner is that computing all pairwise distances to find a nearest neighbor is obviously not very efficient. Let us see the module used by Sklearn to implement unsupervised nearest neighbor learning, along with an example.

Scikit-learn module

sklearn.neighbors.NearestNeighbors is the module used to implement unsupervised nearest neighbor learning. It uses specific nearest neighbor algorithms named BallTree, KDTree or Brute Force. In other words, it acts as a uniform interface to these three algorithms.

Parameters

The following table lists the parameters used by the NearestNeighbors module −

Sr.No Parameter & Description

1 n_neighbors − int, optional
The number of neighbors to get. The default value is 5.

2 radius − float, optional
It limits the distance of neighbors to return. The default value is 1.0.

3 algorithm − {'auto', 'ball_tree', 'kd_tree', 'brute'}, optional
This parameter takes the algorithm (BallTree, KDTree or Brute-force) you want to use to compute the nearest neighbors. If you provide 'auto', it will attempt to decide the most appropriate algorithm based on the values passed to the fit method.

4 leaf_size − int, optional
It can affect the speed of the construction and query as well as the memory required to store the tree. It is passed to BallTree or KDTree. Although the optimal value depends on the nature of the problem, its default value is 30.

5 metric − string or callable
It is the metric to use for distance computation between points. We can pass it as a string or a callable function. In case of a callable function, the metric is called on each pair of rows and the resulting value is recorded. It is less efficient than passing the metric name as a string.
We can choose a metric from scikit-learn or scipy.spatial.distance. The valid values are as follows −

Scikit-learn − ['cosine', 'manhattan', 'euclidean', 'l1', 'l2', 'cityblock']

Scipy.spatial.distance − ['braycurtis', 'canberra', 'chebyshev', 'dice', 'hamming', 'jaccard', 'correlation', 'kulsinski', 'mahalanobis', 'minkowski', 'rogerstanimoto', 'russellrao', 'sokalmichener', 'sokalsneath', 'seuclidean', 'sqeuclidean', 'yule'].

The default metric is 'minkowski'.

6 p − integer, optional
It is the parameter for the Minkowski metric. The default value is 2, which is equivalent to using the Euclidean distance (l2).

7 metric_params − dict, optional
This is the additional keyword arguments for the metric function. The default value is None.

8 n_jobs − int or None, optional
It represents the number of parallel jobs to run for the neighbor search. The default value is None.

Implementation Example

The example below will find the nearest neighbors between two sets of data by using the sklearn.neighbors.NearestNeighbors module.

First, we need to import the required module and packages −

from sklearn.neighbors import NearestNeighbors
import numpy as np

Now, after importing the packages, define the set of data between which we want to find the nearest neighbors −

Input_data = np.array([[-1, 1], [-2, 2], [-3, 3], [1, 2], [2, 3], [3, 4], [4, 5]])

Next, apply the unsupervised learning algorithm, as follows −

nrst_neigh = NearestNeighbors(n_neighbors = 3, algorithm = 'ball_tree')

Next, fit the model with the input data set.

nrst_neigh.fit(Input_data)

Now, find the k-neighbors of the data set. It will return the indices and distances of the neighbors of each point.

distances, indices = nrst_neigh.kneighbors(Input_data)
indices

Output

array([[0, 1, 3],
       [1, 2, 0],
       [2, 1, 0],
       [3, 4, 0],
       [4, 5, 3],
       [5, 6, 4],
       [6, 5, 4]], dtype = int64)

distances

Output

array([[0.        , 1.41421356, 2.23606798],
       [0.        , 1.41421356, 1.41421356],
       [0.        , 1.41421356, 2.82842712],
       [0.        , 1.41421356, 2.23606798],
       [0.        , 1.41421356, 1.41421356],
       [0.        , 1.41421356, 1.41421356],
       [0.        , 1.41421356, 2.82842712]])

The above output shows that the nearest neighbor of each point is the point itself, i.e. at distance zero. This is because the query set matches the training set.

Example

We can also show a connection between neighboring points by producing a sparse graph as follows −

nrst_neigh.kneighbors_graph(Input_data).toarray()

Output

array([[1., 1., 0., 1., 0., 0., 0.],
       [1., 1., 1., 0., 0., 0., 0.],
       [1., 1., 1., 0., 0., 0., 0.],
       [1., 0., 0., 1., 1., 0., 0.],
       [0., 0., 0., 1., 1., 1., 0.],
       [0., 0., 0., 0., 1., 1., 1.],
       [0., 0., 0., 0., 1., 1., 1.]])

Once we fit the unsupervised NearestNeighbors model, the data will be stored in a data structure based on the value set for the argument 'algorithm'. After that, we can use this unsupervised learner's kneighbors in a model which requires neighbor searches.

Complete working/executable program
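The listing itself is cut off in the source; assembled from the snippets shown above, the complete program would read roughly as follows −

from sklearn.neighbors import NearestNeighbors
import numpy as np

Input_data = np.array([[-1, 1], [-2, 2], [-3, 3], [1, 2], [2, 3], [3, 4], [4, 5]])

nrst_neigh = NearestNeighbors(n_neighbors = 3, algorithm = 'ball_tree')
nrst_neigh.fit(Input_data)

# Indices and distances of the 3 nearest neighbors of every point
distances, indices = nrst_neigh.kneighbors(Input_data)
print(indices)
print(distances)

# Sparse connectivity graph between neighboring points
print(nrst_neigh.kneighbors_graph(Input_data).toarray())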
Scikit Learn – Estimator API

In this chapter, we will learn about the Estimator API (application programming interface). Let us begin by understanding what an Estimator API is.

What is Estimator API

It is one of the main APIs implemented by Scikit-learn. It provides a consistent interface for a wide range of ML applications, which is why all machine learning algorithms in Scikit-Learn are implemented via the Estimator API. The object that learns from the data (fits the data) is an estimator. It can be used with any of the algorithms like classification, regression and clustering, or even with a transformer that extracts useful features from raw data.

For fitting the data, all estimator objects expose a fit method that takes a dataset, as shown below −

estimator.fit(data)

All the parameters of an estimator can be set when it is instantiated, and read back via the corresponding attribute, as follows −

estimator = Estimator(param1 = 1, param2 = 2)
estimator.param1

The output of the above would be 1.

Once data is fitted with an estimator, parameters are estimated from the data at hand. Now, all the estimated parameters will be attributes of the estimator object ending with an underscore, as follows −

estimator.estimated_param_

Use of Estimator API

The main uses of estimators are as follows −

Estimation and decoding of a model

The estimator object is used for estimation and decoding of a model. Furthermore, the model is estimated as a deterministic function of the following −

The parameters which are provided in object construction.

The global random state (numpy.random) if the estimator's random_state parameter is set to None.

Any data passed to the most recent call to fit, fit_transform, or fit_predict.

Any data passed in a sequence of calls to partial_fit.

Mapping non-rectangular data representation into rectangular data

It maps a non-rectangular data representation into rectangular data. In simple words, it takes input where each sample is not represented as an array-like object of fixed length, and produces an array-like object of features for each sample.

Distinction between core and outlying samples

It models the distinction between core and outlying samples by using the following methods −

fit

fit_predict, if transductive

predict, if inductive

Guiding Principles

While designing the Scikit-Learn API, the following guiding principles were kept in mind −

Consistency

This principle states that all the objects should share a common interface drawn from a limited set of methods. The documentation should also be consistent.

Limited object hierarchy

This guiding principle says −

Algorithms should be represented by Python classes.

Datasets should be represented in standard formats like NumPy arrays, Pandas DataFrames and SciPy sparse matrices.

Parameter names should use standard Python strings.

Composition

As we know, ML algorithms can be expressed as a sequence of many fundamental algorithms. Scikit-learn makes use of these fundamental algorithms whenever needed.

Sensible defaults

According to this principle, the Scikit-learn library defines an appropriate default value whenever ML models require user-specified parameters.

Inspection

As per this guiding principle, every specified parameter value is exposed as a public attribute.

Steps in using Estimator API

Following are the steps in using the Scikit-Learn estimator API −

Step 1: Choose a class of model

In this first step, we need to choose a class of model. It can be done by importing the appropriate Estimator class from Scikit-learn.
Step 2: Choose model hyperparameters

In this step, we need to choose the model's hyperparameters. It can be done by instantiating the class with the desired values.

Step 3: Arranging the data

Next, we need to arrange the data into a features matrix (X) and a target vector (y).

Step 4: Model Fitting

Now, we need to fit the model to our data. It can be done by calling the fit() method of the model instance.

Step 5: Applying the model

After fitting the model, we can apply it to new data. For supervised learning, use the predict() method to predict the labels for unknown data. For unsupervised learning, use predict() or transform() to infer properties of the data.

Supervised Learning Example

Here, as an example of this process, we are taking the common case of fitting a line to (x, y) data, i.e. simple linear regression.

First, we need to load the dataset; we are using the iris dataset −

Example

import seaborn as sns
iris = sns.load_dataset('iris')
X_iris = iris.drop('species', axis = 1)
X_iris.shape

Output

(150, 4)

Example

y_iris = iris['species']
y_iris.shape

Output

(150,)

Example

Now, for this regression example, we are going to use the following sample data −

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.RandomState(35)
x = 10*rng.rand(40)
y = 2*x-1+rng.randn(40)
plt.scatter(x, y);

Output

(a scatter plot of the sample (x, y) data)

So, we have the above data for our linear regression example. Now, with this data, we can apply the above-mentioned steps.

Choose a class of model

Here, to compute a simple linear regression model, we need to import the linear regression class as follows −

from sklearn.linear_model import LinearRegression

Choose model hyperparameters

Once we choose a class of model, we need to make some important choices which are often represented as hyperparameters, i.e. the parameters that must be set before the model is fit to data. Here, for this example of linear regression, we would like to fit the intercept by using the fit_intercept hyperparameter as follows −

Example

model = LinearRegression(fit_intercept = True)
model

Output

LinearRegression(copy_X = True, fit_intercept = True, n_jobs = None, normalize = False)

Arranging the data

Now, as we know, our target variable y is in the correct form, i.e. a 1-D array of length n_samples. But we need to reshape the feature matrix X to make it a matrix of size [n_samples, n_features]. It can be done as follows −

Example

X = x[:, np.newaxis]
X.shape

Output

(40, 1)

Model fitting

Once we arrange the data, it is time to fit the model, i.e. to apply our model to the data. This can be done with the help of the fit() method as follows −

Example

model.fit(X, y)

Output

LinearRegression(copy_X = True, fit_intercept = True, n_jobs = None, normalize = False)

In Scikit-learn, the parameters learned during the fit() process carry trailing underscores; for this example, they are model.coef_ (the slope) and model.intercept_ (the intercept).
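A brief sketch of how this example is usually finished − inspecting the learned parameters and predicting on new x values. It assumes the model, x, y, np and plt objects defined above; treat it as illustrative rather than the original chapter's code −

# Learned parameters carry a trailing underscore
print(model.coef_)        # slope, should be close to 2
print(model.intercept_)   # intercept, should be close to -1

# Apply the fitted model to new, unseen x values
xfit = np.linspace(-1, 11)
Xfit = xfit[:, np.newaxis]
yfit = model.predict(Xfit)

plt.scatter(x, y)
plt.plot(xfit, yfit);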
Scikit Learn – Linear Modeling

This chapter will help you in learning about linear modeling in Scikit-Learn. Let us begin by understanding what linear modeling in Sklearn covers. The following table lists the various linear models provided by Scikit-Learn −

Sr.No Model & Description

1 Linear Regression
It is one of the best statistical models that studies the relationship between a dependent variable (Y) and a given set of independent variables (X).

2 Logistic Regression
Logistic regression, despite its name, is a classification algorithm rather than a regression algorithm. Based on a given set of independent variables, it is used to estimate a discrete value (0 or 1, yes/no, true/false).

3 Ridge Regression
Ridge regression or Tikhonov regularization is the regularization technique that performs L2 regularization. It modifies the loss function by adding the penalty (shrinkage quantity) equivalent to the square of the magnitude of the coefficients.

4 Bayesian Ridge Regression
Bayesian regression allows a natural mechanism to survive insufficient data or poorly distributed data by formulating linear regression using probability distributions rather than point estimates.

5 LASSO
LASSO is the regularisation technique that performs L1 regularisation. It modifies the loss function by adding the penalty (shrinkage quantity) equivalent to the summation of the absolute values of the coefficients.

6 Multi-task LASSO
It allows fitting multiple regression problems jointly, enforcing the selected features to be the same for all the regression problems, also called tasks. Sklearn provides a linear model named MultiTaskLasso, trained with a mixed L1, L2-norm for regularisation, which estimates sparse coefficients for multiple regression problems jointly.

7 Elastic-Net
The Elastic-Net is a regularized regression method that linearly combines both penalties, i.e. the L1 and L2 of the Lasso and Ridge regression methods. It is useful when there are multiple correlated features.

8 Multi-task Elastic-Net
It is an Elastic-Net model that allows fitting multiple regression problems jointly, enforcing the selected features to be the same for all the regression problems, also called tasks.
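As a quick, illustrative sketch of how several of these linear models share the same estimator interface (the toy data and alpha values below are arbitrary assumptions, not taken from the chapter) −

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet

# Tiny toy regression problem: y = 1 + 2*x1 + 3*x2
X = np.array([[0, 0], [1, 1], [2, 2], [3, 5], [4, 3]], dtype = float)
y = 1 + 2*X[:, 0] + 3*X[:, 1]

for Model in (LinearRegression, Ridge, Lasso, ElasticNet):
    # LinearRegression has no regularisation strength; the others take alpha
    reg = Model() if Model is LinearRegression else Model(alpha = 0.1)
    reg.fit(X, y)
    print(Model.__name__, reg.coef_, reg.intercept_)

All four estimators expose the same fit() / coef_ / intercept_ interface; only the penalty they apply to the coefficients differs.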
Scikit Learn – Dimensionality Reduction using PCA

Dimensionality reduction, an unsupervised machine learning method, is used to reduce the number of feature variables for each data sample by selecting a set of principal features. Principal Component Analysis (PCA) is one of the popular algorithms for dimensionality reduction.

Exact PCA

Principal Component Analysis (PCA) is used for linear dimensionality reduction using Singular Value Decomposition (SVD) of the data to project it to a lower dimensional space. During decomposition with PCA, the input data is centered but not scaled for each feature before the SVD is applied.

The Scikit-learn ML library provides the sklearn.decomposition.PCA module, which is implemented as a transformer object that learns n components in its fit() method. It can also be used on new data to project it on these components.

Example

The example below will use the sklearn.decomposition.PCA module to find the best 5 principal components from the Pima Indians Diabetes dataset.

from pandas import read_csv
from sklearn.decomposition import PCA

path = r"C:\Users\Leekha\Desktop\pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(path, names = names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]

pca = PCA(n_components = 5)
fit = pca.fit(X)
print(("Explained Variance: %s") % (fit.explained_variance_ratio_))
print(fit.components_)

Output

Explained Variance: [0.88854663 0.06159078 0.02579012 0.01308614 0.00744094]
[[-2.02176587e-03  9.78115765e-02  1.60930503e-02  6.07566861e-02  9.93110844e-01  1.40108085e-02  5.37167919e-04 -3.56474430e-03]
 [-2.26488861e-02 -9.72210040e-01 -1.41909330e-01  5.78614699e-02  9.46266913e-02 -4.69729766e-02 -8.16804621e-04 -1.40168181e-01]
 [-2.24649003e-02  1.43428710e-01 -9.22467192e-01 -3.07013055e-01  2.09773019e-02 -1.32444542e-01 -6.39983017e-04 -1.25454310e-01]
 [-4.90459604e-02  1.19830016e-01 -2.62742788e-01  8.84369380e-01 -6.55503615e-02  1.92801728e-01  2.69908637e-03 -3.01024330e-01]
 [ 1.51612874e-01 -8.79407680e-02 -2.32165009e-01  2.59973487e-01 -1.72312241e-04  2.14744823e-02  1.64080684e-03  9.20504903e-01]]

Incremental PCA

Incremental Principal Component Analysis (IPCA) is used to address the biggest limitation of Principal Component Analysis (PCA), which is that PCA only supports batch processing, meaning all the input data to be processed must fit in memory.

The Scikit-learn ML library provides the sklearn.decomposition.IncrementalPCA module, which makes it possible to implement out-of-core PCA either by using its partial_fit method on sequentially fetched chunks of data, or by enabling the use of np.memmap, a memory-mapped file, without loading the entire file into memory. Same as PCA, during decomposition with IPCA the input data is centered but not scaled for each feature before the SVD is applied.

Example

The example below will use the sklearn.decomposition.IncrementalPCA module on the Sklearn digits dataset.

from sklearn.datasets import load_digits
from sklearn.decomposition import IncrementalPCA

X, _ = load_digits(return_X_y = True)
transformer = IncrementalPCA(n_components = 10, batch_size = 100)
transformer.partial_fit(X[:100, :])
X_transformed = transformer.fit_transform(X)
X_transformed.shape

Output

(1797, 10)

Here, we can partially fit on smaller batches of data (as we did, 100 per batch), or we can let the fit() function divide the data into batches.

Kernel PCA

Kernel Principal Component Analysis, an extension of PCA, achieves non-linear dimensionality reduction using kernels.
It supports both transform and inverse_transform (for inverse_transform, the model must be fitted with fit_inverse_transform = True). The Scikit-learn ML library provides the sklearn.decomposition.KernelPCA module.

Example

The example below will use the sklearn.decomposition.KernelPCA module on the Sklearn digits dataset. We are using the sigmoid kernel.

from sklearn.datasets import load_digits
from sklearn.decomposition import KernelPCA

X, _ = load_digits(return_X_y = True)
transformer = KernelPCA(n_components = 10, kernel = 'sigmoid')
X_transformed = transformer.fit_transform(X)
X_transformed.shape

Output

(1797, 10)

PCA using randomized SVD

Principal Component Analysis (PCA) using randomized SVD is used to project data to a lower-dimensional space, preserving most of the variance, by dropping the singular vectors of components associated with lower singular values. Here, the sklearn.decomposition.PCA module with the optional parameter svd_solver = 'randomized' is going to be very useful.

Example

The example below will use the sklearn.decomposition.PCA module with the optional parameter svd_solver = 'randomized' to find the best 7 principal components from the Pima Indians Diabetes dataset.

from pandas import read_csv
from sklearn.decomposition import PCA

path = r"C:\Users\Leekha\Desktop\pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(path, names = names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]

pca = PCA(n_components = 7, svd_solver = 'randomized')
fit = pca.fit(X)
print(("Explained Variance: %s") % (fit.explained_variance_ratio_))
print(fit.components_)

Output

Explained Variance: [8.88546635e-01 6.15907837e-02 2.57901189e-02 1.30861374e-02 7.44093864e-03 3.02614919e-03 5.12444875e-04]
[[-2.02176587e-03  9.78115765e-02  1.60930503e-02  6.07566861e-02  9.93110844e-01  1.40108085e-02  5.37167919e-04 -3.56474430e-03]
 [-2.26488861e-02 -9.72210040e-01 -1.41909330e-01  5.78614699e-02  9.46266913e-02 -4.69729766e-02 -8.16804621e-04 -1.40168181e-01]
 [-2.24649003e-02  1.43428710e-01 -9.22467192e-01 -3.07013055e-01  2.09773019e-02 -1.32444542e-01 -6.39983017e-04 -1.25454310e-01]
 [-4.90459604e-02  1.19830016e-01 -2.62742788e-01  8.84369380e-01 -6.55503615e-02  1.92801728e-01  2.69908637e-03 -3.01024330e-01]
 [ 1.51612874e-01 -8.79407680e-02 -2.32165009e-01  2.59973487e-01 -1.72312241e-04  2.14744823e-02  1.64080684e-03  9.20504903e-01]
 [-5.04730888e-03  5.07391813e-02  7.56365525e-02  2.21363068e-01 -6.13326472e-03 -9.70776708e-01 -2.02903702e-03 -1.51133239e-02]
 [ 9.86672995e-01  8.83426114e-04 -1.22975947e-03 -3.76444746e-04  1.42307394e-03 -2.73046214e-03 -6.34402965e-03 -1.62555343e-01]]
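As a small, illustrative sketch that is not part of the original chapter, the components learned by PCA can also be used to map reduced data back to the original feature space via inverse_transform; here it is shown on the digits dataset used above −

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y = True)

pca = PCA(n_components = 10, svd_solver = 'randomized', random_state = 0)
X_reduced = pca.fit_transform(X)               # projected data, shape (1797, 10)
X_restored = pca.inverse_transform(X_reduced)  # approximate reconstruction, shape (1797, 64)

print(X_reduced.shape, X_restored.shape)
print("Variance kept:", pca.explained_variance_ratio_.sum())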
Scikit Learn – Data Representation

As we know, machine learning is about creating a model from data. For this purpose, the computer must understand the data first. Next, we are going to discuss various ways to represent the data so that it can be understood by the computer −

Data as table

The best way to represent data in Scikit-learn is in the form of tables. A table represents a 2-D grid of data where the rows represent the individual elements of the dataset and the columns represent the quantities related to those individual elements.

Example

With the example given below, we can load the iris dataset in the form of a Pandas DataFrame with the help of the Python seaborn library.

import seaborn as sns
iris = sns.load_dataset('iris')
iris.head()

Output

   sepal_length  sepal_width  petal_length  petal_width  species
0  5.1           3.5          1.4           0.2          setosa
1  4.9           3.0          1.4           0.2          setosa
2  4.7           3.2          1.3           0.2          setosa
3  4.6           3.1          1.5           0.2          setosa
4  5.0           3.6          1.4           0.2          setosa

From the above output, we can see that each row of the data represents a single observed flower and the number of rows represents the total number of flowers in the dataset. Generally, we refer to the rows of the matrix as samples.

On the other hand, each column of the data represents a quantitative piece of information describing each sample. Generally, we refer to the columns of the matrix as features.

Data as Feature Matrix

A features matrix may be defined as the table layout where the information can be thought of as a 2-D matrix. It is stored in a variable named X and assumed to be two-dimensional with shape [n_samples, n_features]. Mostly, it is contained in a NumPy array or a Pandas DataFrame. As told earlier, the samples always represent the individual objects described by the dataset and the features represent the distinct observations that describe each sample in a quantitative manner.

Data as Target array

Along with the features matrix, denoted by X, we also have a target array. It is also called the label. It is denoted by y. The label or target array is usually one-dimensional, having length n_samples. It is generally contained in a NumPy array or a Pandas Series. The target array may have both continuous numerical values and discrete values.

How does the target array differ from feature columns?

We can distinguish both by one point: the target array is usually the quantity we want to predict from the data, i.e. in statistical terms it is the dependent variable.

Example

In the example below, from the iris dataset we predict the species of flower based on the other measurements. In this case, the species column would be considered the target (label) rather than a feature.

import seaborn as sns
iris = sns.load_dataset('iris')

%matplotlib inline
import seaborn as sns; sns.set()
sns.pairplot(iris, hue = 'species', height = 3);

Output

(a pairplot of the iris features, colored by species)

X_iris = iris.drop('species', axis = 1)
X_iris.shape
y_iris = iris['species']
y_iris.shape

Output

(150, 4)
(150,)
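As a small optional illustration (not in the original chapter), the same feature matrix and target array can be pulled out of the DataFrame as NumPy objects, matching the [n_samples, n_features] and (n_samples,) shapes described above −

import seaborn as sns

iris = sns.load_dataset('iris')

X_iris = iris.drop('species', axis = 1).to_numpy()   # feature matrix, shape (150, 4)
y_iris = iris['species'].to_numpy()                  # target array, shape (150,)

print(type(X_iris), X_iris.shape)
print(type(y_iris), y_iris.shape)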
Scikit Learn – Quick Guide

Scikit Learn – Introduction

In this chapter, we will understand what Scikit-Learn or Sklearn is, the origin of Scikit-Learn, and some other related topics such as the communities and contributors responsible for the development and maintenance of Scikit-Learn, its prerequisites, installation and its features.

What is Scikit-Learn (Sklearn)

Scikit-learn (Sklearn) is the most useful and robust library for machine learning in Python. It provides a selection of efficient tools for machine learning and statistical modeling including classification, regression, clustering and dimensionality reduction via a consistent interface in Python. This library, which is largely written in Python, is built upon NumPy, SciPy and Matplotlib.

Origin of Scikit-Learn

It was originally called scikits.learn and was initially developed by David Cournapeau as a Google Summer of Code project in 2007. Later, in 2010, Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort, and Vincent Michel, from INRIA (the French Institute for Research in Computer Science and Automation), took this project to another level and made the first public release (v0.1 beta) on 1st Feb. 2010.

Let's have a look at its version history −

May 2019: scikit-learn 0.21.0
March 2019: scikit-learn 0.20.3
December 2018: scikit-learn 0.20.2
November 2018: scikit-learn 0.20.1
September 2018: scikit-learn 0.20.0
July 2018: scikit-learn 0.19.2
July 2017: scikit-learn 0.19.0
September 2016: scikit-learn 0.18.0
November 2015: scikit-learn 0.17.0
March 2015: scikit-learn 0.16.0
July 2014: scikit-learn 0.15.0
August 2013: scikit-learn 0.14

Community & contributors

Scikit-learn is a community effort and anyone can contribute to it. The project is hosted on GitHub. The following people are currently the core contributors to Sklearn's development and maintenance −

Joris Van den Bossche (Data Scientist)
Thomas J Fan (Software Developer)
Alexandre Gramfort (Machine Learning Researcher)
Olivier Grisel (Machine Learning Expert)
Nicolas Hug (Associate Research Scientist)
Andreas Mueller (Machine Learning Scientist)
Hanmin Qin (Software Engineer)
Adrin Jalali (Open Source Developer)
Nelle Varoquaux (Data Science Researcher)
Roman Yurchak (Data Scientist)

Various organisations like Booking.com, JP Morgan, Evernote, Inria, AWeber, Spotify and many more are using Sklearn.

Prerequisites

Before we start using the latest scikit-learn release, we require the following −

Python (>= 3.5)
NumPy (>= 1.11.0)
SciPy (>= 0.17.0)
Joblib (>= 0.11)
Matplotlib (>= 1.5.1) is required for Sklearn plotting capabilities.
Pandas (>= 0.18.0) is required for some of the scikit-learn examples using data structures and analysis.

Installation

If you have already installed NumPy and SciPy, the following are the two easiest ways to install scikit-learn −

Using pip

The following command can be used to install scikit-learn via pip −

pip install -U scikit-learn

Using conda

The following command can be used to install scikit-learn via conda −

conda install scikit-learn

On the other hand, if NumPy and SciPy are not yet installed on your Python workstation, you can install them by using either pip or conda. Another option is to use Python distributions like Canopy and Anaconda, because they both ship the latest version of scikit-learn.

Features

Rather than focusing on loading, manipulating and summarising data, the Scikit-learn library is focused on modeling the data.
Some of the most popular groups of models provided by Sklearn are as follows −

Supervised Learning algorithms − Almost all the popular supervised learning algorithms, like Linear Regression, Support Vector Machine (SVM), Decision Tree etc., are part of scikit-learn.

Unsupervised Learning algorithms − On the other hand, it also has all the popular unsupervised learning algorithms, from clustering, factor analysis and PCA (Principal Component Analysis) to unsupervised neural networks.

Clustering − This model is used for grouping unlabeled data.

Cross Validation − It is used to check the accuracy of supervised models on unseen data.

Dimensionality Reduction − It is used for reducing the number of attributes in data, which can be further used for summarisation, visualisation and feature selection.

Ensemble methods − As the name suggests, it is used for combining the predictions of multiple supervised models.

Feature extraction − It is used to extract features from data to define the attributes in image and text data.

Feature selection − It is used to identify useful attributes to create supervised models.

Open Source − It is an open source library and is also commercially usable under the BSD license.
Scikit Learn – Clustering Performance Evaluation

There are various functions with the help of which we can evaluate the performance of clustering algorithms. Following are some important and commonly used functions provided by Scikit-learn for evaluating clustering performance −

Adjusted Rand Index

The Rand Index is a function that computes a similarity measure between two clusterings. For this computation, the Rand Index considers all pairs of samples and counts the pairs that are assigned to the same or different clusters in the predicted and true clusterings. Afterwards, the raw Rand Index score is 'adjusted for chance' into the Adjusted Rand Index score by using the following formula −

$$Adjusted\:RI=\left(RI-Expected\_RI\right)/\left(\max\left(RI\right)-Expected\_RI\right)$$

It has two parameters, namely labels_true, which is the ground truth class labels, and labels_pred, which are the cluster labels to evaluate.

Example

from sklearn.metrics.cluster import adjusted_rand_score

labels_true = [0, 0, 1, 1, 1, 1]
labels_pred = [0, 0, 2, 2, 3, 3]

adjusted_rand_score(labels_true, labels_pred)

Output

0.4444444444444445

Perfect labeling would be scored 1, while bad or independent labelling is scored 0 or negative.

Mutual Information Based Score

Mutual Information is a function that computes the agreement of the two assignments. It ignores the permutations. The following versions are available −

Normalized Mutual Information (NMI)

Scikit-learn has the sklearn.metrics.normalized_mutual_info_score module.

Example

from sklearn.metrics.cluster import normalized_mutual_info_score

labels_true = [0, 0, 1, 1, 1, 1]
labels_pred = [0, 0, 2, 2, 3, 3]

normalized_mutual_info_score(labels_true, labels_pred)

Output

0.7611702597222881

Adjusted Mutual Information (AMI)

Scikit-learn has the sklearn.metrics.adjusted_mutual_info_score module.

Example

from sklearn.metrics.cluster import adjusted_mutual_info_score

labels_true = [0, 0, 1, 1, 1, 1]
labels_pred = [0, 0, 2, 2, 3, 3]

adjusted_mutual_info_score(labels_true, labels_pred)

Output

0.4444444444444448

Fowlkes-Mallows Score

The Fowlkes-Mallows function measures the similarity of two clusterings of a set of points. It may be defined as the geometric mean of the pairwise precision and recall. Mathematically,

$$FMS=\frac{TP}{\sqrt{\left(TP+FP\right)\left(TP+FN\right)}}$$

Here,

TP = True Positive − number of pairs of points belonging to the same clusters in the true as well as the predicted labels.

FP = False Positive − number of pairs of points belonging to the same clusters in the true labels but not in the predicted labels.

FN = False Negative − number of pairs of points belonging to the same clusters in the predicted labels but not in the true labels.

Scikit-learn has the sklearn.metrics.fowlkes_mallows_score module −

Example

from sklearn.metrics.cluster import fowlkes_mallows_score

labels_true = [0, 0, 1, 1, 1, 1]
labels_pred = [0, 0, 2, 2, 3, 3]

fowlkes_mallows_score(labels_true, labels_pred)

Output

0.6546536707079771

Silhouette Coefficient

The Silhouette function computes the mean Silhouette Coefficient of all samples using the mean intra-cluster distance and the mean nearest-cluster distance for each sample. Mathematically,

$$S=\left(b-a\right)/\max\left(a,b\right)$$

Here, a is the intra-cluster distance, and b is the mean nearest-cluster distance.
Scikit-learn has the sklearn.metrics.silhouette_score module −

Example

from sklearn.metrics import silhouette_score
from sklearn.metrics import pairwise_distances
from sklearn import datasets
import numpy as np
from sklearn.cluster import KMeans

dataset = datasets.load_iris()
X = dataset.data
y = dataset.target

kmeans_model = KMeans(n_clusters = 3, random_state = 1).fit(X)
labels = kmeans_model.labels_
silhouette_score(X, labels, metric = 'euclidean')

Output

0.5528190123564091

Contingency Matrix

This matrix reports the intersection cardinality for every (true, predicted) cluster pair. The confusion matrix for classification problems is a square contingency matrix.

Scikit-learn has the sklearn.metrics.cluster.contingency_matrix module.

Example

from sklearn.metrics.cluster import contingency_matrix

x = ["a", "a", "a", "b", "b", "b"]
y = [1, 1, 2, 0, 1, 2]

contingency_matrix(x, y)

Output

array([[0, 2, 1],
       [1, 1, 1]])

The first row of the above output shows that among the three samples whose true cluster is "a", none of them is in cluster 0, two are in cluster 1 and one is in cluster 2. On the other hand, the second row shows that among the three samples whose true cluster is "b", one is in cluster 0, one is in cluster 1 and one is in cluster 2.
Scikit Learn – Randomized Decision Trees

This chapter will help you in understanding randomized decision trees in Sklearn.

Randomized Decision Tree algorithms

As we know, a decision tree (DT) is usually trained by recursively splitting the data, but being prone to overfitting, decision trees have been transformed into random forests by training many trees over various subsamples of the data. The sklearn.ensemble module has the following two algorithms based on randomized decision trees −

The Random Forest algorithm

For each feature under consideration, it computes the locally optimal feature/split combination. In a random forest, each decision tree in the ensemble is built from a sample drawn with replacement from the training set; the forest then gets a prediction from each tree and finally selects the best solution by means of voting. It can be used for both classification and regression tasks.

Classification with Random Forest

For creating a random forest classifier, the Scikit-learn module provides sklearn.ensemble.RandomForestClassifier. While building a random forest classifier, the main parameters this module uses are 'max_features' and 'n_estimators'.

Here, 'max_features' is the size of the random subsets of features to consider when splitting a node. If we set this parameter's value to None, it will consider all the features rather than a random subset. On the other hand, n_estimators is the number of trees in the forest. The higher the number of trees, the better the result will be, but it will also take longer to compute.

Implementation example

In the following example, we are building a random forest classifier by using sklearn.ensemble.RandomForestClassifier and also checking its accuracy by using the cross_val_score module.

from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier

X, y = make_blobs(n_samples = 10000, n_features = 10, centers = 100, random_state = 0)
RFclf = RandomForestClassifier(n_estimators = 10, max_depth = None, min_samples_split = 2, random_state = 0)
scores = cross_val_score(RFclf, X, y, cv = 5)
scores.mean()

Output

0.9997

Example

We can also use a sklearn dataset to build a Random Forest classifier. As in the following example, we are using the iris dataset. We will also find its accuracy score and confusion matrix.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

path = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
headernames = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
dataset = pd.read_csv(path, names = headernames)

X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30)

RFclf = RandomForestClassifier(n_estimators = 50)
RFclf.fit(X_train, y_train)
y_pred = RFclf.predict(X_test)

result = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(result)
result1 = classification_report(y_test, y_pred)
print("Classification Report:",)
print(result1)
result2 = accuracy_score(y_test, y_pred)
print("Accuracy:", result2)

Output

Confusion Matrix:
[[14  0  0]
 [ 0 18  1]
 [ 0  0 12]]
Classification Report:
                 precision    recall  f1-score   support
    Iris-setosa       1.00      1.00      1.00        14
Iris-versicolor       1.00      0.95      0.97        19
 Iris-virginica       0.92      1.00      0.96        12
      micro avg       0.98      0.98      0.98        45
      macro avg       0.97      0.98      0.98        45
   weighted avg       0.98      0.98      0.98        45
Accuracy: 0.9777777777777777

Regression with Random Forest

For creating a random forest regression, the Scikit-learn module provides sklearn.ensemble.RandomForestRegressor. While building a random forest regressor, it will use the same parameters as used by sklearn.ensemble.RandomForestClassifier.

Implementation example

In the following example, we are building a random forest regressor by using sklearn.ensemble.RandomForestRegressor and also predicting for new values by using the predict() method.

from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_features = 10, n_informative = 2, random_state = 0, shuffle = False)
RFregr = RandomForestRegressor(max_depth = 10, random_state = 0, n_estimators = 100)
RFregr.fit(X, y)

Output

RandomForestRegressor(bootstrap = True, criterion = 'mse', max_depth = 10,
   max_features = 'auto', max_leaf_nodes = None, min_impurity_decrease = 0.0,
   min_impurity_split = None, min_samples_leaf = 1, min_samples_split = 2,
   min_weight_fraction_leaf = 0.0, n_estimators = 100, n_jobs = None,
   oob_score = False, random_state = 0, verbose = 0, warm_start = False)

Once fitted, we can predict from the regression model as follows −

print(RFregr.predict([[0, 2, 3, 0, 1, 1, 1, 1, 2, 2]]))

Output

[98.47729198]

Extra-Tree Methods

For each feature under consideration, it selects a random value for the split. The benefit of using extra-tree methods is that they allow reducing the variance of the model a bit more. The disadvantage of using these methods is that they slightly increase the bias.

Classification with Extra-Tree Method

For creating a classifier using the Extra-tree method, the Scikit-learn module provides sklearn.ensemble.ExtraTreesClassifier. It uses the same parameters as used by sklearn.ensemble.RandomForestClassifier. The only difference is in the way, discussed above, they build trees.

Implementation example

In the following example, we are building a classifier by using sklearn.ensemble.ExtraTreesClassifier and also checking its accuracy by using the cross_val_score module.
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_blobs
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_blobs(n_samples = 10000, n_features = 10, centers = 100, random_state = 0)
ETclf = ExtraTreesClassifier(n_estimators = 10, max_depth = None, min_samples_split = 10, random_state = 0)
scores = cross_val_score(ETclf, X, y, cv = 5)
scores.mean()

Output

1.0

Example

We can also use a sklearn dataset to build a classifier using the Extra-Tree method. As in the following example, we are using the Pima-Indian dataset.

from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import ExtraTreesClassifier

path = r"C:\pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names = headernames)
array = data.values
X = array[:,0:8]
Y = array[:,8]

seed = 7
kfold = KFold(n_splits = 10, shuffle = True, random_state = seed)  # shuffle so that random_state takes effect
num_trees = 150
max_features = 5
ETclf = ExtraTreesClassifier(n_estimators = num_trees, max_features = max_features)
results = cross_val_score(ETclf, X, Y, cv = kfold)
print(results.mean())

Output

0.7551435406698566

Regression with Extra-Tree Method

For creating an Extra-Tree regression, the Scikit-learn module provides sklearn.ensemble.ExtraTreesRegressor. While building an extra-trees regressor, it will use the same parameters as used by sklearn.ensemble.ExtraTreesClassifier.

Implementation example

In the following example, we are applying sklearn.ensemble.ExtraTreesRegressor on the same data as we used while creating the random forest regressor. Let us see the difference in the output.

from sklearn.ensemble import ExtraTreesRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_features = 10, n_informative = 2, random_state = 0, shuffle = False)
ETregr = ExtraTreesRegressor(max_depth = 10, random_state = 0, n_estimators = 100)
ETregr.fit(X, y)

Output

ExtraTreesRegressor(bootstrap = False, criterion =
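The printed estimator repr is truncated in the source. By analogy with the random forest regressor above, the comparison it sets up would be completed with a prediction on the same sample; the resulting value is not given in the original, so none is claimed here −

# Predict with the fitted Extra-Trees regressor on the same sample used for
# the random forest regressor above, to compare the two models' outputs
print(ETregr.predict([[0, 2, 3, 0, 1, 1, 1, 1, 2, 2]]))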
Scikit Learn – Stochastic Gradient Descent

Here, we will learn about an optimization algorithm in Sklearn, termed Stochastic Gradient Descent (SGD).

Stochastic Gradient Descent (SGD) is a simple yet efficient optimization algorithm used to find the values of parameters/coefficients of functions that minimize a cost function. In other words, it is used for discriminative learning of linear classifiers under convex loss functions such as SVM and Logistic regression. It has been successfully applied to large-scale datasets because the update to the coefficients is performed for each training instance, rather than at the end of all instances.

SGD Classifier

The Stochastic Gradient Descent (SGD) classifier basically implements a plain SGD learning routine supporting various loss functions and penalties for classification. Scikit-learn provides the SGDClassifier module to implement SGD classification.

Parameters

The following table lists the parameters used by the SGDClassifier module −

Sr.No Parameter & Description

1 loss − str, default = 'hinge'
It represents the loss function to be used while implementing. The default value is 'hinge', which will give us a linear SVM. The other options which can be used are −
log − This loss will give us logistic regression, i.e. a probabilistic classifier.
modified_huber − a smooth loss that brings tolerance to outliers along with probability estimates.
squared_hinge − similar to the 'hinge' loss but quadratically penalized.
perceptron − as the name suggests, it is the linear loss used by the perceptron algorithm.

2 penalty − str, 'none', 'l2', 'l1', 'elasticnet'
It is the regularization term used in the model. By default, it is L2. We can use L1 or 'elasticnet' as well, both of which might bring sparsity to the model that is not achievable with L2.

3 alpha − float, default = 0.0001
Alpha, the constant that multiplies the regularization term, is the tuning parameter that decides how much we want to penalize the model. The default value is 0.0001.

4 l1_ratio − float, default = 0.15
This is called the ElasticNet mixing parameter. Its range is 0 <= l1_ratio <= 1. If l1_ratio = 1, the penalty would be an L1 penalty. If l1_ratio = 0, the penalty would be an L2 penalty.

5 fit_intercept − Boolean, default = True
This parameter specifies that a constant (bias or intercept) should be added to the decision function. If it is set to False, no intercept will be used in the calculation and the data will be assumed to be already centered.

6 tol − float or None, optional, default = 1.e-3
This parameter represents the stopping criterion for iterations. Training stops when loss > best_loss − tol for n_iter_no_change successive epochs; if set to None, this stopping criterion is not used.

7 shuffle − Boolean, optional, default = True
This parameter represents whether we want our training data to be shuffled after each epoch or not.

8 verbose − integer, default = 0
It represents the verbosity level. Its default value is 0.

9 epsilon − float, default = 0.1
This parameter specifies the width of the insensitive region. If loss = 'epsilon_insensitive', any difference between the current prediction and the correct label that is less than the threshold would be ignored.

10 max_iter − int, optional, default = 1000
As the name suggests, it represents the maximum number of passes over the training data (i.e. epochs).

11 warm_start − bool, optional, default = False
With this parameter set to True, we can reuse the solution of the previous call to fit as initialization. If we choose the default, i.e.
False, it will erase the previous solution.

12 random_state − int, RandomState instance or None, optional, default = None
This parameter represents the seed of the pseudo-random number generator used while shuffling the data. The following are the options −
int − In this case, random_state is the seed used by the random number generator.
RandomState instance − In this case, random_state is the random number generator.
None − In this case, the random number generator is the RandomState instance used by np.random.

13 n_jobs − int or None, optional, default = None
It represents the number of CPUs to be used in OVA (One Versus All) computation for multi-class problems. The default value is None, which means 1.

14 learning_rate − string, optional, default = 'optimal'
If the learning rate is 'constant', eta = eta0;
If the learning rate is 'optimal', eta = 1.0/(alpha*(t+t0)), where t0 is chosen by Leon Bottou;
If the learning rate is 'invscaling', eta = eta0/pow(t, power_t);
If the learning rate is 'adaptive', eta = eta0.

15 eta0 − double, default = 0.0
It represents the initial learning rate for the above-mentioned learning rate options, i.e. 'constant', 'invscaling' or 'adaptive'.

16 power_t − double, default = 0.5
It is the exponent for the 'invscaling' learning rate.

17 early_stopping − bool, default = False
This parameter represents the use of early stopping to terminate training when the validation score is not improving. Its default value is False, but when set to True, it automatically sets aside a stratified fraction of the training data as a validation set and stops training when the validation score is not improving.

18 validation_fraction − float, default = 0.1
It is only used when early_stopping is True. It represents the proportion of training data to set aside as a validation set for early termination of training.

19 n_iter_no_change − int, default = 5
It represents the number of iterations with no improvement that the algorithm should run before early stopping.

20 class_weight − dict, {class_label: weight} or "balanced", or None, optional
This parameter represents the weights associated with classes. If not provided, the classes are supposed to have weight 1.

21 average − Boolean or int, optional, default = False
When set to True, it computes the averaged SGD weights and stores the result in the coef_ attribute. If set to an int greater than 1, averaging begins once that many samples have been seen. The default value is False.

Attributes

The following table lists the attributes used by the SGDClassifier module −
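The attribute table and the worked example are cut off in the source. As a minimal, illustrative sketch of how SGDClassifier is typically used with the parameters described above (the toy data is an assumption, not taken from the chapter) −

from sklearn.linear_model import SGDClassifier
import numpy as np

# Toy binary classification data
X = np.array([[-1., -1.], [-2., -1.], [1., 1.], [2., 1.]])
y = np.array([1, 1, 2, 2])

# Hinge loss + L2 penalty gives a linear SVM trained with SGD
clf = SGDClassifier(loss = 'hinge', penalty = 'l2', alpha = 0.0001,
                    max_iter = 1000, tol = 1e-3, random_state = 0)
clf.fit(X, y)

print(clf.predict([[2., 2.]]))      # predicts class 2
print(clf.coef_, clf.intercept_)    # learned weights and bias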