
Aug 12

Learn Clustering Performance Evaluation

Scikit Learn – Clustering Performance Evaluation

Scikit-learn provides several functions for evaluating the performance of clustering algorithms. The most important and commonly used ones are described below.

Adjusted Rand Index

The Rand Index computes a similarity measure between two clusterings. It considers all pairs of samples and counts the pairs that are assigned to the same or to different clusters in both the predicted and the true clustering. The raw Rand Index score is then 'adjusted for chance' into the Adjusted Rand Index score using the following formula −

$$\text{Adjusted RI}=\frac{RI-\text{Expected RI}}{\max\left(RI\right)-\text{Expected RI}}$$

The function adjusted_rand_score takes two parameters: labels_true, the ground truth class labels, and labels_pred, the cluster labels to evaluate.

Example

    from sklearn.metrics.cluster import adjusted_rand_score

    labels_true = [0, 0, 1, 1, 1, 1]
    labels_pred = [0, 0, 2, 2, 3, 3]
    adjusted_rand_score(labels_true, labels_pred)

Output

    0.4444444444444445

A perfect labeling scores 1, while a poor or independent labeling scores close to 0 or negative.

Mutual Information Based Score

Mutual Information measures the agreement between two assignments, ignoring permutations of the labels. The following versions are available −

Normalized Mutual Information (NMI)

Scikit-learn provides the sklearn.metrics.normalized_mutual_info_score function.

Example

    from sklearn.metrics.cluster import normalized_mutual_info_score

    labels_true = [0, 0, 1, 1, 1, 1]
    labels_pred = [0, 0, 2, 2, 3, 3]
    normalized_mutual_info_score(labels_true, labels_pred)

Output

    0.7611702597222881

This value corresponds to normalizing by the geometric mean of the two label entropies. Since version 0.22, scikit-learn defaults to average_method='arithmetic', which gives a lower score for these labels.

Adjusted Mutual Information (AMI)

Scikit-learn provides the sklearn.metrics.adjusted_mutual_info_score function.
Example

    from sklearn.metrics.cluster import adjusted_mutual_info_score

    labels_true = [0, 0, 1, 1, 1, 1]
    labels_pred = [0, 0, 2, 2, 3, 3]
    adjusted_mutual_info_score(labels_true, labels_pred)

Output

    0.4444444444444448

The exact value can differ between scikit-learn versions, because the default average_method of this function changed in version 0.22.

Fowlkes-Mallows Score

The Fowlkes-Mallows score measures the similarity of two clusterings of a set of points. It is defined as the geometric mean of the pairwise precision and recall. Mathematically,

$$FMS=\frac{TP}{\sqrt{\left(TP+FP\right)\left(TP+FN\right)}}$$

Here,

TP = True Positive − the number of pairs of points that belong to the same cluster in both the true and the predicted labels.

FP = False Positive − the number of pairs of points that belong to the same cluster in the predicted labels but not in the true labels.

FN = False Negative − the number of pairs of points that belong to the same cluster in the true labels but not in the predicted labels.

Because the formula is symmetric in FP and FN, swapping these two definitions does not change the score.

Scikit-learn provides the sklearn.metrics.fowlkes_mallows_score function −

Example

    from sklearn.metrics.cluster import fowlkes_mallows_score

    labels_true = [0, 0, 1, 1, 1, 1]
    labels_pred = [0, 0, 2, 2, 3, 3]
    fowlkes_mallows_score(labels_true, labels_pred)

Output

    0.6546536707079771

Silhouette Coefficient

The silhouette_score function computes the mean Silhouette Coefficient of all samples, using the mean intra-cluster distance and the mean nearest-cluster distance for each sample. Mathematically,

$$S=\left(b-a\right)/\max\left(a,b\right)$$

Here, a is the mean intra-cluster distance, and b is the mean nearest-cluster distance, that is, the mean distance between a sample and the points of the nearest cluster it does not belong to.
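To make the formula concrete, the per-sample coefficient can be worked out by hand. The following is a minimal pure-Python sketch, assuming Euclidean distance and at least two clusters; the function name silhouette_by_hand and the four toy points are made up for illustration and are not part of scikit-learn.

```python
import math


def silhouette_by_hand(points, labels):
    """Return s = (b - a) / max(a, b) for every point (Euclidean distance).

    Assumes at least two distinct cluster labels.
    """
    clusters = sorted(set(labels))
    scores = []
    for i, p in enumerate(points):
        # Mean distance from p to the members of each cluster, excluding p itself.
        mean_dist = {}
        for c in clusters:
            members = [q for j, q in enumerate(points) if labels[j] == c and j != i]
            if members:
                mean_dist[c] = sum(math.dist(p, q) for q in members) / len(members)
        own = labels[i]
        if own not in mean_dist:
            scores.append(0.0)  # p is alone in its cluster: score 0 by convention
            continue
        a = mean_dist[own]                                     # mean intra-cluster distance
        b = min(d for c, d in mean_dist.items() if c != own)   # mean nearest-cluster distance
        scores.append((b - a) / max(a, b))
    return scores


# Hypothetical toy data: two tight pairs of points that lie far apart.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels = [0, 0, 1, 1]

scores = silhouette_by_hand(points, labels)
mean_score = sum(scores) / len(scores)
print(scores)
print(mean_score)  # about 0.80: well-separated clusters score close to 1
```

For the first point, a = 2 (distance to its only cluster mate) and b is the mean of 10 and √104, so S = 1 − a/b ≈ 0.80. The mean of these per-sample values is the quantity that silhouette_score reports for a whole dataset.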
Scikit-learn provides the sklearn.metrics.silhouette_score function −

Example

    from sklearn import datasets
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    dataset = datasets.load_iris()
    X = dataset.data
    y = dataset.target

    # Cluster the iris measurements into three groups and score the result.
    kmeans_model = KMeans(n_clusters = 3, random_state = 1).fit(X)
    labels = kmeans_model.labels_
    silhouette_score(X, labels, metric = "euclidean")

Output

    0.5528190123564091

Contingency Matrix

This matrix reports the intersection cardinality for every (true, predicted) cluster pair. The confusion matrix used for classification problems is a square contingency matrix.

Scikit-learn provides the sklearn.metrics.cluster.contingency_matrix function.

Example

    from sklearn.metrics.cluster import contingency_matrix

    x = ["a", "a", "a", "b", "b", "b"]
    y = [1, 1, 2, 0, 1, 2]
    contingency_matrix(x, y)

Output

    array([[0, 2, 1],
           [1, 1, 1]])

The first row of the output shows that among the three samples whose true cluster is "a", none is in cluster 0, two are in cluster 1 and one is in cluster 2. The second row shows that among the three samples whose true cluster is "b", one is in cluster 0, one is in cluster 1 and one is in cluster 2.
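The contingency counts are also the raw material for the label-based scores in this chapter. The sketch below is a rough pure-Python illustration rather than scikit-learn's implementation: it rebuilds the Adjusted Rand Index, the Fowlkes-Mallows score and NMI from those counts for the labels used in the earlier examples. The function name scores_from_contingency is hypothetical, and degenerate inputs (for example a single cluster) are not handled.

```python
from collections import Counter
from math import comb, log, sqrt


def scores_from_contingency(labels_true, labels_pred):
    """Derive ARI, Fowlkes-Mallows and NMI from contingency counts."""
    n = len(labels_true)
    cells = Counter(zip(labels_true, labels_pred))  # contingency matrix entries
    rows = Counter(labels_true)                     # row sums (true clusters)
    cols = Counter(labels_pred)                     # column sums (predicted clusters)

    # Number of sample pairs grouped together in both labelings,
    # in the true labeling, and in the predicted labeling.
    both = sum(comb(c, 2) for c in cells.values())
    in_true = sum(comb(c, 2) for c in rows.values())
    in_pred = sum(comb(c, 2) for c in cols.values())

    # Adjusted Rand Index: (index - expected) / (max - expected).
    expected = in_true * in_pred / comb(n, 2)
    ari = (both - expected) / ((in_true + in_pred) / 2 - expected)

    # Fowlkes-Mallows: geometric mean of pairwise precision and recall.
    fms = both / sqrt(in_true * in_pred)

    # Mutual information and label entropies (natural logarithm).
    mi = sum(c / n * log(n * c / (rows[t] * cols[p])) for (t, p), c in cells.items())
    h_true = -sum(c / n * log(c / n) for c in rows.values())
    h_pred = -sum(c / n * log(c / n) for c in cols.values())

    return {
        "ari": ari,
        "fms": fms,
        "nmi_geometric": mi / sqrt(h_true * h_pred),
        "nmi_arithmetic": mi / ((h_true + h_pred) / 2),
    }


labels_true = [0, 0, 1, 1, 1, 1]
labels_pred = [0, 0, 2, 2, 3, 3]
scores = scores_from_contingency(labels_true, labels_pred)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
# ari: 0.4444
# fms: 0.6547
# nmi_geometric: 0.7612
# nmi_arithmetic: 0.7337
```

The ARI, Fowlkes-Mallows and geometric-mean NMI values agree with the outputs printed earlier in this chapter. The arithmetic-mean variant is included because it is the normalization scikit-learn uses by default from version 0.22 onwards.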

Aug 12

Learn Randomized Decision Trees

Scikit Learn – Randomized Decision Trees

This chapter will help you understand randomized decision trees in Sklearn.

Randomized Decision Tree algorithms

A decision tree (DT) is usually trained by recursively splitting the data, which makes it prone to overfitting. Random forests address this by training many trees over various subsamples of the data. The sklearn.ensemble module has the following two algorithms based on randomized decision trees −

The Random Forest algorithm

For each feature under consideration, it computes the locally optimal feature/split combination. In a random forest, each decision tree in the ensemble is built from a sample drawn with replacement from the training set. The forest then gets a prediction from each tree and combines them, by voting for classification (scikit-learn averages the trees' predicted class probabilities) and by averaging for regression. It can therefore be used for both classification and regression tasks.

Classification with Random Forest

For creating a random forest classifier, Scikit-learn provides sklearn.ensemble.RandomForestClassifier. The main parameters to adjust when building a random forest classifier are 'max_features' and 'n_estimators'. Here, 'max_features' is the size of the random subsets of features to consider when splitting a node. If this parameter is set to None, all the features are considered rather than a random subset. 'n_estimators' is the number of trees in the forest. More trees generally give better results, but the gains level off beyond a certain number and the computation takes longer.

Implementation example

In the following example, we build a random forest classifier using sklearn.ensemble.RandomForestClassifier and check its accuracy using the cross_val_score function.
    from sklearn.model_selection import cross_val_score
    from sklearn.datasets import make_blobs
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_blobs(n_samples = 10000, n_features = 10, centers = 100, random_state = 0)
    RFclf = RandomForestClassifier(n_estimators = 10, max_depth = None, min_samples_split = 2, random_state = 0)
    scores = cross_val_score(RFclf, X, y, cv = 5)
    scores.mean()

Output

    0.9997

Example

We can also build a random forest classifier on the well-known iris dataset, loaded here from the UCI repository with pandas. We will also compute its accuracy score and confusion matrix.

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

    path = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
    headernames = ["sepal-length", "sepal-width", "petal-length", "petal-width", "Class"]
    dataset = pd.read_csv(path, names = headernames)

    # Features are the first four columns, the class label is the last one.
    X = dataset.iloc[:, :-1].values
    y = dataset.iloc[:, 4].values
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30)

    RFclf = RandomForestClassifier(n_estimators = 50)
    RFclf.fit(X_train, y_train)
    y_pred = RFclf.predict(X_test)

    result = confusion_matrix(y_test, y_pred)
    print("Confusion Matrix:")
    print(result)
    result1 = classification_report(y_test, y_pred)
    print("Classification Report:")
    print(result1)
    result2 = accuracy_score(y_test, y_pred)
    print("Accuracy:", result2)

Output

    Confusion Matrix:
    [[14  0  0]
     [ 0 18  1]
     [ 0  0 12]]
    Classification Report:
                     precision    recall  f1-score   support

        Iris-setosa       1.00      1.00      1.00        14
    Iris-versicolor       1.00      0.95      0.97        19
     Iris-virginica       0.92      1.00      0.96        12

          micro avg       0.98      0.98      0.98        45
          macro avg       0.97      0.98      0.98        45
       weighted avg       0.98      0.98      0.98        45

    Accuracy: 0.9777777777777777

Because neither the train/test split nor the forest is seeded, the exact numbers vary from run to run.

Regression with Random Forest

For creating a random forest regressor, Scikit-learn provides
sklearn.ensemble.RandomForestRegressor. A random forest regressor uses the same parameters as sklearn.ensemble.RandomForestClassifier.

Implementation example

In the following example, we build a random forest regressor using sklearn.ensemble.RandomForestRegressor and predict new values using the predict() method.

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.datasets import make_regression

    X, y = make_regression(n_features = 10, n_informative = 2, random_state = 0, shuffle = False)
    RFregr = RandomForestRegressor(max_depth = 10, random_state = 0, n_estimators = 100)
    RFregr.fit(X, y)

Output

    RandomForestRegressor(
       bootstrap = True, criterion = "mse", max_depth = 10,
       max_features = "auto", max_leaf_nodes = None,
       min_impurity_decrease = 0.0, min_impurity_split = None,
       min_samples_leaf = 1, min_samples_split = 2,
       min_weight_fraction_leaf = 0.0, n_estimators = 100, n_jobs = None,
       oob_score = False, random_state = 0, verbose = 0, warm_start = False
    )

This printout comes from an older scikit-learn release. Recent versions show only the parameters that differ from their defaults and have renamed or removed some of them (for example, the criterion "mse" is now called "squared_error").

Once fitted, we can predict from the regression model as follows −

    print(RFregr.predict([[0, 2, 3, 0, 1, 1, 1, 1, 2, 2]]))

Output

    [98.47729198]

Extra-Tree Methods

For each feature under consideration, it selects a random value for the split. The benefit of extra-tree methods is that they reduce the variance of the model a bit more. The disadvantage is that they slightly increase the bias.

Classification with Extra-Tree Method

For creating a classifier using the Extra-tree method, Scikit-learn provides sklearn.ensemble.ExtraTreesClassifier. It uses the same parameters as sklearn.ensemble.RandomForestClassifier. The only difference is in the way, discussed above, they build trees.

Implementation example

In the following example, we build an extra-trees classifier using sklearn.ensemble.ExtraTreesClassifier and check its accuracy using the cross_val_score function.
    from sklearn.model_selection import cross_val_score
    from sklearn.datasets import make_blobs
    from sklearn.ensemble import ExtraTreesClassifier

    X, y = make_blobs(n_samples = 10000, n_features = 10, centers = 100, random_state = 0)
    ETclf = ExtraTreesClassifier(n_estimators = 10, max_depth = None, min_samples_split = 10, random_state = 0)
    scores = cross_val_score(ETclf, X, y, cv = 5)
    scores.mean()

Output

    1.0

Example

We can also build a classifier with the Extra-Tree method on an external dataset. In the following example we use the Pima Indians Diabetes dataset, loaded from a local CSV file.

    from pandas import read_csv
    from sklearn.model_selection import KFold
    from sklearn.model_selection import cross_val_score
    from sklearn.ensemble import ExtraTreesClassifier

    path = r"C:\pima-indians-diabetes.csv"
    headernames = ["preg", "plas", "pres", "skin", "test", "mass", "pedi", "age", "class"]
    data = read_csv(path, names = headernames)
    array = data.values
    X = array[:, 0:8]
    Y = array[:, 8]

    seed = 7
    # Recent scikit-learn versions require shuffle=True when random_state is set.
    kfold = KFold(n_splits = 10, shuffle = True, random_state = seed)
    num_trees = 150
    max_features = 5
    ETclf = ExtraTreesClassifier(n_estimators = num_trees, max_features = max_features)
    results = cross_val_score(ETclf, X, Y, cv = kfold)
    print(results.mean())

Output

    0.7551435406698566

The classifier itself is not seeded, so the exact value varies slightly between runs.

Regression with Extra-Tree Method

For creating an Extra-Tree regressor, Scikit-learn provides sklearn.ensemble.ExtraTreesRegressor. An extra-trees regressor uses the same parameters as sklearn.ensemble.ExtraTreesClassifier.

Implementation example

In the following example, we apply sklearn.ensemble.ExtraTreesRegressor to the same data that we used for the random forest regressor.
Let's see the difference in the output.

    from sklearn.ensemble import ExtraTreesRegressor
    from sklearn.datasets import make_regression

    X, y = make_regression(n_features = 10, n_informative = 2, random_state = 0, shuffle = False)
    ETregr = ExtraTreesRegressor(max_depth = 10, random_state = 0, n_estimators = 100)
    ETregr.fit(X, y)

Output

    ExtraTreesRegressor(bootstrap = False, criterion =