Scikit Learn – Conventions

Scikit-learn’s objects share a uniform basic API that consists of the following three complementary interfaces −

Estimator interface − It is for building and fitting models.
Predictor interface − It is for making predictions.
Transformer interface − It is for converting data.

The APIs adopt simple conventions, and the design choices have been guided by the aim of avoiding the proliferation of framework code.

Purpose of Conventions

The purpose of the conventions is to make sure that the API sticks to the following broad principles −

Consistency − All objects, whether basic or composite, must share a consistent interface composed of a limited set of methods.
Inspection − Constructor parameters and parameter values determined by learning algorithms should be stored and exposed as public attributes.
Non-proliferation of classes − Datasets should be represented as NumPy arrays or SciPy sparse matrices, whereas hyper-parameter names and values should be represented as standard Python strings or numbers, to avoid the proliferation of framework code.
Composition − Algorithms that are expressible as sequences or combinations of transformations to the data, or that are naturally viewed as meta-algorithms parameterized on other algorithms, should be implemented and composed from existing building blocks.
Sensible defaults − Whenever an operation requires a user-defined parameter, scikit-learn defines an appropriate default value. This default should cause the operation to be performed in a sensible way, for example, giving a baseline solution for the task at hand.

Various Conventions

The conventions available in Sklearn are explained below −

Type casting

Unless otherwise specified, input is cast to float64.
In the following example, the sklearn.random_projection module is used to reduce the dimensionality of the data −

Example

    import numpy as np
    from sklearn import random_projection

    # Seeded random generator so the data is reproducible
    rng = np.random.RandomState(0)
    X = rng.rand(10, 2000)
    X = np.array(X, dtype="float32")
    X.dtype

    transformer = random_projection.GaussianRandomProjection()
    X_new = transformer.fit_transform(X)
    X_new.dtype

Output

    dtype('float32')
    dtype('float64')

In the above example, X is float32 and fit_transform(X) returns float64 data. This output reflects the release used when the example was run; newer scikit-learn releases preserve float32 for some estimators, so the second dtype may differ on your installation.

Refitting & Updating Parameters

The hyper-parameters of an estimator can be updated after it has been constructed via the set_params() method. Calling fit() again then refits the estimator and overwrites what was learned by any previous fit(). Let’s see the following example to understand it −

Example

    from sklearn.datasets import load_iris
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)
    clf = SVC()
    clf.set_params(kernel="linear").fit(X, y)
    clf.predict(X[:5])

Output

    array([0, 0, 0, 0, 0])

Once the estimator has been constructed, the above code changes the default kernel rbf to linear via SVC.set_params(). The following code changes the kernel back to rbf, refits the estimator and makes a second prediction.

Example

    clf.set_params(kernel="rbf", gamma="scale").fit(X, y)
    clf.predict(X[:5])

Output

    array([0, 0, 0, 0, 0])

Complete code

The following is the complete executable program −

    from sklearn.datasets import load_iris
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)
    clf = SVC()

    # First fit with a linear kernel
    clf.set_params(kernel="linear").fit(X, y)
    print(clf.predict(X[:5]))

    # Refit with the rbf kernel; the earlier fit is overwritten
    clf.set_params(kernel="rbf", gamma="scale").fit(X, y)
    print(clf.predict(X[:5]))

Multiclass & Multilabel fitting

In multiclass fitting, both the learning and the prediction tasks depend on the format of the target data the classifier is fit upon. The module used is sklearn.multiclass. In the example below, a multiclass classifier is fit on a 1d array of labels.
Example

    from sklearn.svm import SVC
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.preprocessing import LabelBinarizer

    X = [[1, 2], [3, 4], [4, 5], [5, 2], [1, 1]]
    y = [0, 0, 1, 1, 2]
    classif = OneVsRestClassifier(estimator=SVC(gamma="scale", random_state=0))
    classif.fit(X, y).predict(X)

Output

    array([0, 0, 1, 1, 2])

In the above example, the classifier is fit on a one-dimensional array of multiclass labels, so the predict() method returns the corresponding multiclass predictions. It is also possible to fit upon a two-dimensional array of binary label indicators, as follows −

Example

    from sklearn.svm import SVC
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.preprocessing import LabelBinarizer

    X = [[1, 2], [3, 4], [4, 5], [5, 2], [1, 1]]
    # y is still the 1d label list from the previous example,
    # and classif is the classifier constructed there
    y = LabelBinarizer().fit_transform(y)
    classif.fit(X, y).predict(X)

Output

    array([[0, 0, 0],
           [0, 0, 0],
           [0, 1, 0],
           [0, 1, 0],
           [0, 0, 0]])

Here predict() returns a 2d array of binary indicators, one column per label. A row of all zeros means that the instance matched none of the labels fit upon.

Similarly, in multilabel fitting, an instance can be assigned multiple labels, as follows −

Example

    from sklearn.preprocessing import MultiLabelBinarizer

    y = [[0, 1], [0, 2], [1, 3], [0, 2, 3], [2, 4]]
    y = MultiLabelBinarizer().fit_transform(y)
    classif.fit(X, y).predict(X)

Output

    array([[1, 0, 1, 0, 0],
           [1, 0, 1, 0, 0],
           [1, 0, 1, 1, 0],
           [1, 0, 1, 1, 0],
           [1, 0, 1, 0, 0]])

In the above example, sklearn.preprocessing.MultiLabelBinarizer is used to binarize the two-dimensional array of multilabels to fit upon. That is why predict() returns a 2d array with multiple labels for each instance.
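
The three interfaces and the principles listed at the start of this chapter (inspection, composition, string-named hyper-parameters) can be tried together in a short session. The following is a minimal sketch, not part of the original tutorial; the choice of StandardScaler, PCA and LogisticRegression on the iris data is an assumption made purely for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Transformer interface: fit() learns per-column statistics,
# transform() applies them to the data.
scaler = StandardScaler()
X_scaled = scaler.fit(X).transform(X)

# Inspection: constructor parameters are readable via get_params(),
# and values learned during fit() live in attributes ending with "_".
scaler_params = scaler.get_params()
learned_means = scaler.mean_

# Estimator and predictor interfaces: fit() builds the model,
# predict() uses it on (possibly new) data.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_scaled, y)
predictions = clf.predict(X_scaled[:5])

# Composition: a pipeline chains transformers and a final predictor,
# and itself exposes the same fit()/predict() methods.
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    LogisticRegression(max_iter=1000),
)
pipe.fit(X, y)
pipe_predictions = pipe.predict(X[:5])

# Hyper-parameters of nested steps are addressed by plain string names
# of the form "<step>__<parameter>".
pipe.set_params(pca__n_components=3)

print(predictions, pipe_predictions)
```

Because the pipeline is itself an estimator, it can be passed anywhere a single model is accepted, which is the practical payoff of the consistency principle.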
Scikit Learn – Introduction

In this chapter, we will look at what Scikit-Learn (Sklearn) is, where it came from, and some related topics such as the community and contributors responsible for its development and maintenance, its prerequisites, its installation and its features.

What is Scikit-Learn (Sklearn)

Scikit-learn (Sklearn) is a widely used and robust library for machine learning in Python. It provides a selection of efficient tools for machine learning and statistical modeling, including classification, regression, clustering and dimensionality reduction, via a consistent interface in Python. This library, which is largely written in Python, is built upon NumPy, SciPy and Matplotlib.

Origin of Scikit-Learn

It was originally called scikits.learn and was initially developed by David Cournapeau as a Google Summer of Code project in 2007. Later, in 2010, Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort, and Vincent Michel, from INRIA (French Institute for Research in Computer Science and Automation), took the project to another level and made the first public release (v0.1 beta) on 1st Feb. 2010.

Let’s have a look at its version history −

May 2019: scikit-learn 0.21.0
March 2019: scikit-learn 0.20.3
December 2018: scikit-learn 0.20.2
November 2018: scikit-learn 0.20.1
September 2018: scikit-learn 0.20.0
July 2018: scikit-learn 0.19.2
July 2017: scikit-learn 0.19.0
September 2016: scikit-learn 0.18.0
November 2015: scikit-learn 0.17.0
March 2015: scikit-learn 0.16.0
July 2014: scikit-learn 0.15.0
August 2013: scikit-learn 0.14

Community & contributors

Scikit-learn is a community effort and anyone can contribute to it.
This project is hosted on

The following people are currently the core contributors to Sklearn’s development and maintenance −

Joris Van den Bossche (Data Scientist)
Thomas J Fan (Software Developer)
Alexandre Gramfort (Machine Learning Researcher)
Olivier Grisel (Machine Learning Expert)
Nicolas Hug (Associate Research Scientist)
Andreas Mueller (Machine Learning Scientist)
Hanmin Qin (Software Engineer)
Adrin Jalali (Open Source Developer)
Nelle Varoquaux (Data Science Researcher)
Roman Yurchak (Data Scientist)

Various organisations like Booking.com, JP Morgan, Evernote, Inria, AWeber, Spotify and many more are using Sklearn.

Prerequisites

Before we start using the latest scikit-learn release, we require the following −

Python (>= 3.5)
NumPy (>= 1.11.0)
SciPy (>= 0.17.0)
Joblib (>= 0.11)
Matplotlib (>= 1.5.1) is required for Sklearn plotting capabilities.
Pandas (>= 0.18.0) is required for some of the scikit-learn examples that use data structures and analysis.

Installation

If you have already installed NumPy and SciPy, the following are the two easiest ways to install scikit-learn −

Using pip

The following command can be used to install scikit-learn via pip −

    pip install -U scikit-learn

Using conda

The following command can be used to install scikit-learn via conda −

    conda install scikit-learn

On the other hand, if NumPy and SciPy are not yet installed on your Python workstation, you can install them by using either pip or conda.

Another option is to use a Python distribution such as Canopy or Anaconda, because both ship a recent version of scikit-learn.

Features

Rather than focusing on loading, manipulating and summarising data, the Scikit-learn library is focused on modeling the data. Some of the most popular groups of models provided by Sklearn are as follows −

Supervised Learning algorithms − Almost all the popular supervised learning algorithms, like Linear Regression, Support Vector Machine (SVM), Decision Tree etc., are part of scikit-learn.
Unsupervised Learning algorithms − It also has the popular unsupervised learning algorithms, from clustering, factor analysis and PCA (Principal Component Analysis) to unsupervised neural networks.
Clustering − It is used for grouping unlabeled data.
Cross Validation − It is used to check the accuracy of supervised models on unseen data.
Dimensionality Reduction − It is used for reducing the number of attributes in data, which can then be used for summarisation, visualisation and feature selection.
Ensemble methods − As the name suggests, they are used for combining the predictions of multiple supervised models.
Feature extraction − It is used to extract features from data, that is, to define the attributes in image and text data.
Feature selection − It is used to identify useful attributes to create supervised models.
Open Source − It is an open source library and is also commercially usable under the BSD license.
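
Several of the feature groups listed above can be exercised in a few lines once the library is installed. The following is a minimal sketch rather than a recommended workflow; the iris dataset, the specific estimators (PCA, SelectKBest, KMeans, RandomForestClassifier) and all parameter values are assumptions chosen only for illustration.

```python
import sklearn
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score

# Confirms that the installation works and shows which release is in use.
print(sklearn.__version__)

X, y = load_iris(return_X_y=True)

# Dimensionality reduction: project the 4 attributes onto 2 components.
X_2d = PCA(n_components=2).fit_transform(X)

# Feature selection: keep the 2 attributes most related to the target.
X_best = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)

# Clustering: group the samples without looking at the labels.
cluster_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Ensemble method evaluated with cross-validation: 5 train/test splits,
# one accuracy score per split.
forest = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(forest, X, y, cv=5)
print(scores.mean())
```

Each step uses the same fit-based methods regardless of the feature group, so swapping one estimator for another usually means changing only the constructor call.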