Classification – Introduction

Introduction to Classification

Classification may be defined as the process of predicting a class or category from observed values or given data points. The categorized output can take a form such as "Black" or "White", or "spam" or "no spam". Mathematically, classification is the task of approximating a mapping function (f) from input variables (X) to output variables (Y). It belongs to supervised machine learning, in which targets are provided along with the input data set.

An example of a classification problem is spam detection in emails. There can be only two categories of output, "spam" and "no spam"; hence this is a binary classification. To implement this classification, we first need to train the classifier. For this example, "spam" and "no spam" emails would be used as the training data. After successfully training the classifier, it can be used to detect an unknown email.

Types of Learners in Classification

We have two types of learners with respect to classification problems −

Lazy Learners − As the name suggests, such learners store the training data and wait for the testing data to appear. Classification is done only after getting the testing data. They spend less time on training but more time on predicting. Examples of lazy learners are K-nearest neighbors and case-based reasoning.

Eager Learners − In contrast to lazy learners, eager learners construct a classification model from the training data without waiting for the testing data to appear. They spend more time on training but less time on predicting. Examples of eager learners are Decision Trees, Naïve Bayes and Artificial Neural Networks (ANN).

Building a Classifier in Python

Scikit-learn, a Python library for machine learning, can be used to build a classifier in Python. The steps for building a classifier in Python are as follows −

Step 1: Importing the necessary Python package

For building a classifier using scikit-learn, we need to import it. We can import it by using the following script −

import sklearn

Step 2: Importing the dataset

After importing the necessary package, we need a dataset to build the classification prediction model. We can import it from the sklearn datasets or use another one as per our requirement. We are going to use sklearn's Breast Cancer Wisconsin Diagnostic Database. We can import it with the help of the following script −

from sklearn.datasets import load_breast_cancer

The following script will load the dataset −

data = load_breast_cancer()

We also need to organize the data, and it can be done with the help of the following scripts −

label_names = data['target_names']
labels = data['target']
feature_names = data['feature_names']
features = data['data']

The following command will print the names of the labels, 'malignant' and 'benign' in the case of our database −

print(label_names)

The output of the above command is the names of the labels −

['malignant' 'benign']

These labels are mapped to the binary values 0 and 1. Malignant cancer is represented by 0 and benign cancer is represented by 1.

The feature names can be inspected with the help of the following commands −

print(feature_names[0])

The output of the above command is the name of the first feature −

mean radius

Similarly, the name of the next feature can be produced as follows −

print(feature_names[1])

The output of the above command is the name of the second feature −
mean texture

We can print the feature values of the first sample in the dataset with the help of the following command −

print(features[0])

This will give the following output −

[1.799e+01 1.038e+01 1.228e+02 1.001e+03 1.184e-01 2.776e-01 3.001e-01
 1.471e-01 2.419e-01 7.871e-02 1.095e+00 9.053e-01 8.589e+00 1.534e+02
 6.399e-03 4.904e-02 5.373e-02 1.587e-02 3.003e-02 6.193e-03 2.538e+01
 1.733e+01 1.846e+02 2.019e+03 1.622e-01 6.656e-01 7.119e-01 2.654e-01
 4.601e-01 1.189e-01]

Similarly, the feature values of the second sample can be printed with the following command −

print(features[1])

This will give the following output −

[2.057e+01 1.777e+01 1.329e+02 1.326e+03 8.474e-02 7.864e-02 8.690e-02
 7.017e-02 1.812e-01 5.667e-02 5.435e-01 7.339e-01 3.398e+00 7.408e+01
 5.225e-03 1.308e-02 1.860e-02 1.340e-02 1.389e-02 3.532e-03 2.499e+01
 2.341e+01 1.588e+02 1.956e+03 1.238e-01 1.866e-01 2.416e-01 1.860e-01
 2.750e-01 8.902e-02]

Step 3: Organizing data into training & testing sets

As we need to test our model on unseen data, we will divide our dataset into two parts: a training set and a test set. We can use the train_test_split() function of the sklearn package to split the data into sets. The following command will import the function −

from sklearn.model_selection import train_test_split

Now, the next command will split the data into training and testing data. In this example, we are taking 40 percent of the data for testing and 60 percent of the data for training −

train, test, train_labels, test_labels = train_test_split(
   features, labels, test_size=0.40, random_state=42
)

Step 4: Building and evaluating the model

After dividing the data into training and testing sets, we need to build the model. We will be using the Naïve Bayes algorithm for this purpose. The following command will import the GaussianNB module −

from sklearn.naive_bayes import GaussianNB

Now, initialize the model as follows −

gnb = GaussianNB()

Next, with the help of the following command, we can train the model −

model = gnb.fit(train, train_labels)

Now, for evaluation purposes, we need to make predictions. It can be done by using the predict() function as follows −

preds = gnb.predict(test)
print(preds)

This will give the following output −

[1 0 0 1 1 0 0 0 1 1 1 0 1 0 1 0 1 1 1 0 1 1 0 1 1 ...
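A natural next step is to check these predictions against the held-out labels. The snippet below is a minimal sketch, added here for completeness, using scikit-learn's accuracy_score; it assumes the preds and test_labels variables from the steps above.

from sklearn.metrics import accuracy_score

# Fraction of test samples whose predicted label matches the true label
print(accuracy_score(test_labels, preds))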
Machine Learning – Decision Trees Algorithm

The Decision Tree algorithm is a hierarchical tree-based algorithm that is used to classify or predict outcomes based on a set of rules. It works by splitting the data into subsets based on the values of the input features. The algorithm recursively splits the data until it reaches a point where the data in each subset belongs to the same class or has the same value for the target variable. The resulting tree is a set of decision rules that can be used to make predictions or classify new data.

The Decision Tree algorithm works by selecting the best feature to split the data at each node. The best feature is the one that provides the most information gain, i.e., the greatest reduction in entropy. Information gain measures the amount of information gained by splitting the data on a particular feature, while entropy measures the randomness or disorder in the data. The algorithm uses these measures to determine the best feature to split the data at each node; a small numeric sketch of both quantities is given after the list of tree types below.

Consider, for example, a binary tree for predicting whether a person is fit or unfit, given information such as age, eating habits and exercise habits: in such a decision tree, the questions are decision nodes and the final outcomes are leaves.

Types of Decision Tree Algorithm

There are two main types of Decision Tree algorithm −

Classification Tree − A classification tree is used to classify data into different classes or categories. It works by splitting the data into subsets based on the values of the input features and assigning each subset to a different class.

Regression Tree − A regression tree is used to predict numerical values or continuous variables. It works by splitting the data into subsets based on the values of the input features and assigning each subset a numerical value.
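As referenced above, here is a minimal sketch (an illustration added to this chapter, not part of the original code) of how entropy and information gain can be computed for a candidate split; the label arrays are made-up toy data.

import numpy as np

def entropy(labels):
   # Entropy H = -sum(p * log2(p)) over the class proportions
   _, counts = np.unique(labels, return_counts=True)
   p = counts / counts.sum()
   return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
   # Weighted reduction in entropy achieved by splitting parent into left/right
   w_left = len(left) / len(parent)
   w_right = len(right) / len(parent)
   return entropy(parent) - (w_left * entropy(left) + w_right * entropy(right))

parent = np.array([0, 0, 0, 1, 1, 1])   # toy labels before the split
left = np.array([0, 0, 0])              # candidate left branch
right = np.array([1, 1, 1])             # candidate right branch
print(information_gain(parent, left, right))   # 1.0: a perfect split of a 50/50 node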
Implementation in Python

Let's implement the Decision Tree algorithm in Python using a popular dataset for classification tasks, the Iris dataset. It contains 150 samples of iris flowers, each with four features: sepal length, sepal width, petal length, and petal width. The flowers belong to three classes: setosa, versicolor, and virginica.

First, we will import the necessary libraries and load the dataset −

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the iris dataset
iris = load_iris()

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target,
   test_size=0.3, random_state=0)

We then create an instance of the Decision Tree classifier and train it on the training set −

# Create a Decision Tree classifier
dtc = DecisionTreeClassifier()

# Fit the classifier to the training data
dtc.fit(X_train, y_train)

We can now use the trained classifier to make predictions on the testing set −

# Make predictions on the testing data
y_pred = dtc.predict(X_test)

We can evaluate the performance of the classifier by calculating its accuracy −

# Calculate the accuracy of the classifier
accuracy = np.sum(y_pred == y_test) / len(y_test)
print("Accuracy:", accuracy)

We can visualize the Decision Tree using the Matplotlib library −

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Visualize the Decision Tree using Matplotlib
plt.figure(figsize=(20,10))
plot_tree(dtc, filled=True, feature_names=iris.feature_names,
   class_names=iris.target_names)
plt.show()

The plot_tree function from the sklearn.tree module can be used to plot the Decision Tree. We pass in the trained Decision Tree classifier, the filled argument to fill the nodes with color, the feature_names argument to label the features, and the class_names argument to label the target classes. We also specify the figsize argument to set the size of the figure and call the show function to display the plot.

Complete Implementation Example

Given below is the complete implementation example of the Decision Tree classification algorithm in Python using the Iris dataset −

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Load the iris dataset
iris = load_iris()

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target,
   test_size=0.3, random_state=0)

# Create a Decision Tree classifier
dtc = DecisionTreeClassifier()

# Fit the classifier to the training data
dtc.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = dtc.predict(X_test)

# Calculate the accuracy of the classifier
accuracy = np.sum(y_pred == y_test) / len(y_test)
print("Accuracy:", accuracy)

# Visualize the Decision Tree using Matplotlib
plt.figure(figsize=(20,10))
plot_tree(dtc, filled=True, feature_names=iris.feature_names,
   class_names=iris.target_names)
plt.show()

Output

This will print the accuracy and display a plot of the Decision Tree −

Accuracy: 0.9777777777777777

The plot shows the structure of the Decision Tree: each node represents a decision based on the value of a feature, and each leaf node represents a class or numerical value. The color of each node indicates the majority class or value of the samples in that node, and the numbers at the bottom indicate the number of samples that reach that node.
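Since the fitted tree is ultimately a set of decision rules, those rules can also be inspected as plain text. The snippet below is a small optional addition using scikit-learn's export_text helper; it assumes the dtc classifier and iris dataset from the example above.

from sklearn.tree import export_text

# Print the learned decision rules as indented text
rules = export_text(dtc, feature_names=list(iris.feature_names))
print(rules)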
Machine Learning – Mean-Shift Clustering

The Mean-Shift clustering algorithm is a non-parametric clustering algorithm that works by iteratively shifting the mean of a data point towards the densest area of the data. The densest area of the data is determined by the kernel function, which assigns weights to the data points based on their distance from the mean. The kernel function used in Mean-Shift clustering is usually a Gaussian function.

The steps involved in the Mean-Shift clustering algorithm are as follows −

1. Initialize the mean of each data point to its own value.
2. For each data point, compute the mean shift vector, which is the vector that points towards the densest area of the data.
3. Update the mean of each data point by shifting it towards the densest area of the data.
4. Repeat steps 2 and 3 until convergence is reached.

The Mean-Shift clustering algorithm is a density-based clustering algorithm, which means that it identifies clusters based on the density of the data points rather than the distance between them. In other words, the algorithm identifies clusters in the areas where the density of the data points is highest.

Implementation of Mean-Shift Clustering in Python

The Mean-Shift clustering algorithm can be implemented in Python using the scikit-learn library, a popular machine learning library in Python that provides various tools for data analysis and machine learning. The following steps are involved in implementing the Mean-Shift clustering algorithm in Python using scikit-learn −

Step 1 − Import the necessary libraries

The numpy library is used for scientific computing in Python, while the matplotlib library is used for data visualization. The sklearn.cluster module contains the MeanShift class, which is used for implementing the Mean-Shift clustering algorithm in Python. The estimate_bandwidth function is used to estimate the bandwidth of the kernel function, which is an important parameter in the Mean-Shift clustering algorithm.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import MeanShift, estimate_bandwidth

Step 2 − Generate the data

In this step, we generate a random dataset with 500 data points and 2 features. We use the numpy.random.randn function to generate the data.

# Generate the data
X = np.random.randn(500, 2)

Step 3 − Estimate the bandwidth of the kernel function

In this step, we estimate the bandwidth of the kernel function using the estimate_bandwidth function. The bandwidth determines the width of the kernel function.

# Estimate the bandwidth
bandwidth = estimate_bandwidth(X, quantile=0.1, n_samples=100)

Step 4 − Initialize the Mean-Shift clustering algorithm

In this step, we initialize the Mean-Shift clustering algorithm using the MeanShift class. We pass the bandwidth parameter to the class to set the width of the kernel function.

# Initialize the Mean-Shift algorithm
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)

Step 5 − Train the model

In this step, we train the Mean-Shift clustering algorithm on the dataset using the fit method of the MeanShift class.
# Train the model
ms.fit(X)

Step 6 − Visualize the results

# Visualize the results
labels = ms.labels_
cluster_centers = ms.cluster_centers_
n_clusters_ = len(np.unique(labels))
print("Number of estimated clusters:", n_clusters_)

# Plot the data points and the centroids
plt.figure(figsize=(7.5, 3.5))
plt.scatter(X[:,0], X[:,1], c=labels, cmap='viridis')
plt.scatter(cluster_centers[:,0], cluster_centers[:,1], marker='*', s=300, c='r')
plt.show()

In this step, we visualize the results of the Mean-Shift clustering algorithm. We extract the cluster labels and the cluster centers from the trained model. We then print the number of estimated clusters. Finally, we plot the data points and the centroids using the matplotlib library.

Example

Here is the complete implementation example of the Mean-Shift clustering algorithm in Python −

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import MeanShift, estimate_bandwidth

# Generate the data
X = np.random.randn(500, 2)

# Estimate the bandwidth
bandwidth = estimate_bandwidth(X, quantile=0.1, n_samples=100)

# Initialize the Mean-Shift algorithm
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)

# Train the model
ms.fit(X)

# Visualize the results
labels = ms.labels_
cluster_centers = ms.cluster_centers_
n_clusters_ = len(np.unique(labels))
print("Number of estimated clusters:", n_clusters_)

# Plot the data points and the centroids
plt.figure(figsize=(7.5, 3.5))
plt.scatter(X[:,0], X[:,1], c=labels, cmap='summer')
plt.scatter(cluster_centers[:,0], cluster_centers[:,1], marker='*', s=200, c='r')
plt.show()

Output

When you execute the program, it will print the number of estimated clusters and produce a scatter plot of the data points with the cluster centroids marked as stars.

Applications of Mean-Shift Clustering

The Mean-Shift clustering algorithm has several applications in various fields. Some of the applications of Mean-Shift clustering are as follows −

Computer vision − Mean-Shift clustering is widely used in computer vision for object tracking, image segmentation, and feature extraction.

Image processing − Mean-Shift clustering is used for image segmentation, which is the process of dividing an image into multiple segments based on the similarity of the pixels.

Anomaly detection − Mean-Shift clustering can be used for detecting anomalies in data by identifying the areas with low density.

Customer segmentation − Mean-Shift clustering can be used for customer segmentation in marketing by identifying groups of customers with similar behavior and preferences.

Social network analysis − Mean-Shift clustering can be used for clustering users in social networks based on their interests and interactions.
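To make the update rule in steps 2 and 3 at the start of this chapter concrete, here is a minimal NumPy sketch (an illustration added here, not part of the original tutorial) of one mean-shift iteration with a Gaussian kernel; the shift() helper and the bandwidth value are made up for demonstration.

import numpy as np

def shift(point, X, bandwidth=1.0):
   # Gaussian kernel weights: nearby points count more than distant ones
   d2 = np.sum((X - point) ** 2, axis=1)
   w = np.exp(-d2 / (2 * bandwidth ** 2))
   # The new mean is the weighted average of all points, which moves
   # the point towards the densest region of the data
   return (w[:, None] * X).sum(axis=0) / w.sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
p = X[0].copy()
for _ in range(10):   # repeat until (approximate) convergence
   p = shift(p, X)
print(p)              # p has drifted towards a local density peak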
Classification Algorithms – Random Forest

Introduction

Random forest is a supervised learning algorithm which is used both for classification and regression. However, it is mainly used for classification problems. As we know, a forest is made up of trees, and more trees mean a more robust forest. Similarly, the random forest algorithm creates decision trees on data samples, gets the prediction from each of them, and finally selects the best solution by means of voting. It is an ensemble method which is better than a single decision tree because it reduces over-fitting by averaging the results.

Working of Random Forest Algorithm

We can understand the working of the Random Forest algorithm with the help of the following steps −

Step 1 − First, start with the selection of random samples from a given dataset.

Step 2 − Next, this algorithm will construct a decision tree for every sample. Then it will get the prediction result from every decision tree.

Step 3 − In this step, voting will be performed for every predicted result.

Step 4 − At last, select the most voted prediction result as the final prediction result.

Implementation in Python

First, start with importing the necessary Python packages −

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Next, download the iris dataset from its weblink as follows −

path = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

Next, we need to assign column names to the dataset as follows −

headernames = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']

Now, we need to read the dataset into a pandas dataframe as follows −

dataset = pd.read_csv(path, names=headernames)
dataset.head()

   sepal-length   sepal-width   petal-length   petal-width   Class
0  5.1            3.5           1.4            0.2           Iris-setosa
1  4.9            3.0           1.4            0.2           Iris-setosa
2  4.7            3.2           1.3            0.2           Iris-setosa
3  4.6            3.1           1.5            0.2           Iris-setosa
4  5.0            3.6           1.4            0.2           Iris-setosa

Data preprocessing will be done with the help of the following script lines −

X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values

Next, we will divide the data into train and test splits. The following code will split the dataset into 70% training data and 30% testing data −

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

Next, train the model with the help of the RandomForestClassifier class of sklearn as follows −

from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=50)
classifier.fit(X_train, y_train)

At last, we need to make predictions.
It can be done with the help of the following script −

y_pred = classifier.predict(X_test)

Next, print the results as follows −

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
result = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(result)
result1 = classification_report(y_test, y_pred)
print("Classification Report:")
print(result1)
result2 = accuracy_score(y_test, y_pred)
print("Accuracy:", result2)

Output

Confusion Matrix:
[[14  0  0]
 [ 0 18  1]
 [ 0  0 12]]

Classification Report:
                 precision    recall  f1-score   support
Iris-setosa          1.00      1.00      1.00        14
Iris-versicolor      1.00      0.95      0.97        19
Iris-virginica       0.92      1.00      0.96        12

micro avg            0.98      0.98      0.98        45
macro avg            0.97      0.98      0.98        45
weighted avg         0.98      0.98      0.98        45

Accuracy: 0.9777777777777777

Pros and Cons of Random Forest

Pros

The following are the advantages of the Random Forest algorithm −

It overcomes the problem of overfitting by averaging or combining the results of different decision trees.

Random forests work well for a larger range of data items than a single decision tree does.

Random forest has less variance than a single decision tree.

Random forests are very flexible and possess very high accuracy.

The random forest algorithm does not require scaling of data. It maintains good accuracy even when provided data without scaling.

Cons

The following are the disadvantages of the Random Forest algorithm −

Complexity is the main disadvantage of Random forest algorithms.

Construction of random forests is much harder and more time-consuming than that of decision trees.

More computational resources are required to implement the Random Forest algorithm.

It is less intuitive when we have a large collection of decision trees.

The prediction process using random forests is very time-consuming in comparison with other algorithms.
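One practical benefit of the voting ensemble described above is that it also yields a ranking of how much each feature contributes to the predictions. The snippet below is a small sketch using scikit-learn's feature_importances_ attribute; it assumes the classifier and headernames variables from the example above.

import pandas as pd

# Impurity-based importance of each feature, averaged over all trees
importances = pd.Series(classifier.feature_importances_,
                        index=headernames[:-1])
print(importances.sort_values(ascending=False))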
Support Vector Machine (SVM)

Introduction to SVM

Support vector machines (SVMs) are powerful yet flexible supervised machine learning algorithms which are used both for classification and regression. Generally, however, they are used in classification problems. SVMs were first introduced in the 1960s and later refined in the 1990s. SVMs have their unique way of implementation as compared to other machine learning algorithms. Lately, they are extremely popular because of their ability to handle multiple continuous and categorical variables.

Working of SVM

An SVM model is basically a representation of different classes in a hyperplane in multidimensional space. The hyperplane is generated in an iterative manner by SVM so that the error can be minimized. The goal of SVM is to divide the datasets into classes to find a maximum marginal hyperplane (MMH).

The following are important concepts in SVM −

Support Vectors − Data points that are closest to the hyperplane are called support vectors. The separating line is defined with the help of these data points.

Hyperplane − A decision plane or space which divides a set of objects having different classes.

Margin − The gap between two lines on the closest data points of different classes. It can be calculated as the perpendicular distance from the line to the support vectors. A large margin is considered a good margin and a small margin is considered a bad margin.

The main goal of SVM is to divide the datasets into classes to find a maximum marginal hyperplane (MMH), and it can be done in the following two steps −

First, SVM will generate hyperplanes iteratively that segregate the classes in the best way.

Then, it will choose the hyperplane that separates the classes correctly.

Implementing SVM in Python

For implementing SVM in Python we will start with the standard library imports as follows −

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import seaborn as sns; sns.set()

Next, we create a sample dataset having linearly separable data, using make_blobs from sklearn.datasets, for classification with SVM −

from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=100, centers=2, random_state=0, cluster_std=0.50)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='summer');

This generates a sample dataset having 100 samples and 2 clusters.

We know that SVM supports discriminative classification. It divides the classes from each other by simply finding a line in the case of two dimensions, or a manifold in the case of multiple dimensions. It is implemented on the above dataset as follows −

xfit = np.linspace(-1, 3.5)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='summer')
plt.plot([0.6], [2.1], 'x', color='black', markeredgewidth=4, markersize=12)

for m, b in [(1, 0.65), (0.5, 1.6), (-0.2, 2.9)]:
   plt.plot(xfit, m * xfit + b, '-k')
plt.xlim(-1, 3.5);

We can see from the resulting plot that there are three different separators that perfectly discriminate the above samples. As discussed, the main goal of SVM is to divide the datasets into classes to find a maximum marginal hyperplane (MMH); hence, rather than drawing a zero-width line between classes, we can draw around each line a margin of some width up to the nearest point.
It can be done as follows −

xfit = np.linspace(-1, 3.5)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='summer')

for m, b, d in [(1, 0.65, 0.33), (0.5, 1.6, 0.55), (-0.2, 2.9, 0.2)]:
   yfit = m * xfit + b
   plt.plot(xfit, yfit, '-k')
   plt.fill_between(xfit, yfit - d, yfit + d, edgecolor='none',
      color='#AAAAAA', alpha=0.4)
plt.xlim(-1, 3.5);

From the resulting plot, we can easily observe the "margins" within the discriminative classifiers. SVM will choose the line that maximizes the margin.

Next, we will use Scikit-Learn's support vector classifier to train an SVM model on this data. Here, we are using a linear kernel to fit the SVM as follows −

from sklearn.svm import SVC   # "Support vector classifier"
model = SVC(kernel='linear', C=1E10)
model.fit(X, y)

The output is as follows −

SVC(C=10000000000.0, cache_size=200, class_weight=None, coef0=0.0,
   decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
   kernel='linear', max_iter=-1, probability=False, random_state=None,
   shrinking=True, tol=0.001, verbose=False)

Now, for a better understanding, the following function will plot the decision function for a 2D SVC. To evaluate the model it creates a grid, then plots the decision boundary and margins, and finally plots the support vectors −

def decision_function(model, ax=None, plot_support=True):
   if ax is None:
      ax = plt.gca()
   xlim = ax.get_xlim()
   ylim = ax.get_ylim()

   # Create a grid to evaluate the model
   x = np.linspace(xlim[0], xlim[1], 30)
   y = np.linspace(ylim[0], ylim[1], 30)
   Y, X = np.meshgrid(y, x)
   xy = np.vstack([X.ravel(), Y.ravel()]).T
   P = model.decision_function(xy).reshape(X.shape)

   # Plot the decision boundary and margins
   ax.contour(X, Y, P, colors='k', levels=[-1, 0, 1],
      alpha=0.5, linestyles=['--', '-', '--'])

   # Plot the support vectors
   if plot_support:
      ax.scatter(model.support_vectors_[:, 0], model.support_vectors_[:, 1],
         s=300, linewidth=1, facecolors='none')
   ax.set_xlim(xlim)
   ax.set_ylim(ylim)

Now, use this function to plot our fitted model as follows −

plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='summer')
decision_function(model);

We can observe from the output an SVM classifier fit to the data, with margins, i.e. dashed lines, and support vectors, the pivotal elements of this fit, touching the dashed lines. These support vector points are stored in the support_vectors_ attribute of the classifier as follows −

model.support_vectors_

The output is as follows −

array([[0.5323772 , 3.31338909],
       [2.11114739, 3.57660449],
       [1.46870582, 1.86947425]])

SVM Kernels

In practice, the SVM algorithm is implemented with a kernel that transforms an input data space into the required form. SVM uses a technique called the kernel trick, in which the kernel takes a low-dimensional input space and transforms it into a higher-dimensional space. In simple words, the kernel converts non-separable problems into separable problems by adding more dimensions to them.
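To illustrate the kernel trick described above, here is a minimal sketch (added here, not from the original chapter) that fits an SVC with an RBF kernel on data that is not linearly separable; the make_circles parameters are chosen arbitrarily for demonstration.

from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: no straight line can separate the two classes
X, y = make_circles(n_samples=100, factor=0.3, noise=0.05, random_state=0)

# The RBF kernel implicitly maps the data into a higher-dimensional
# space where the classes become separable
clf = SVC(kernel='rbf', C=1.0, gamma='scale')
clf.fit(X, y)
print(clf.score(X, y))   # training accuracy; close to 1.0 on this toy data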
Machine Learning – Data Preparation

Data preparation, also known as data preprocessing, is a crucial step in machine learning. The quality of the data you use for your model can have a significant impact on the performance of the model. Data preparation involves cleaning, transforming, and pre-processing the data to make it suitable for analysis and modeling. The goal of data preparation is to make sure that the data is accurate, complete, and relevant for the analysis.

The following are some of the key steps involved in data preparation −

Data cleaning − This involves identifying and correcting errors, missing values, and outliers in the data. Common techniques used for data cleaning include imputation, outlier detection and removal, and data normalization.

Data transformation − This involves converting the data from its original format into a format that is suitable for analysis. This could involve converting categorical variables into numerical variables, or scaling the data to a certain range.

Feature engineering − This involves creating new features from the existing data that may be more informative or useful for the analysis. Feature engineering can involve combining or transforming existing features, or creating new features based on domain knowledge or insights.

Data integration − This involves combining data from multiple sources into a single dataset for analysis. This may involve matching or linking records across different datasets, or merging datasets based on common variables.

Data reduction − This involves reducing the size of the dataset by selecting a subset of features or observations that are most relevant for the analysis. This can help to reduce noise and improve the accuracy of the model.

Data preparation is a critical step in the machine learning process, and can have a significant impact on the accuracy and effectiveness of the final model. It requires careful attention to detail and a thorough understanding of the data and the problem at hand.

Example

Let's check an example of data preparation using the breast cancer dataset −

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# load the dataset
data = load_breast_cancer()

# separate the features and target
X = data.data
y = data.target

# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# normalize the data using StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In this example, we first load the breast cancer dataset using the load_breast_cancer function from scikit-learn. Then we separate the features and target, and split the data into training and testing sets using the train_test_split function. Finally, we normalize the data using StandardScaler from scikit-learn, which subtracts the mean and scales the data to unit variance. This helps to bring all the features to a similar scale, which is particularly important for models like SVM and neural networks.

Why Data Pre-processing?

After selecting the raw data for ML training, the most important task is data pre-processing. In a broad sense, data preprocessing converts the selected data into a form we can work with or can feed to ML algorithms. We always need to preprocess our data so that it meets the expectations of the machine learning algorithm.
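Before turning to specific techniques, here is a minimal sketch of the imputation step mentioned under data cleaning above, using scikit-learn's SimpleImputer; the small array with a missing value is made-up toy data.

import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix with one missing value
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, 6.0]])

# Replace missing entries with the column mean
imputer = SimpleImputer(strategy='mean')
print(imputer.fit_transform(X))   # nan becomes (1 + 7) / 2 = 4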
Data Pre-processing Techniques

The following data preprocessing techniques can be applied to a data set to produce data for ML algorithms.

Scaling

Most probably, our dataset comprises attributes with varying scales, but we cannot provide such data to an ML algorithm; hence it requires rescaling. Data rescaling makes sure that attributes are on the same scale. Generally, attributes are rescaled into the range of 0 and 1. ML algorithms like gradient descent and k-Nearest Neighbors require scaled data. We can rescale the data with the help of the MinMaxScaler class of the scikit-learn Python library.

Example

In this example we will rescale the data of the Pima Indians Diabetes dataset which we used earlier. First, the CSV data will be loaded (as done in the previous chapters) and then, with the help of the MinMaxScaler class, it will be rescaled into the range of 0 and 1.

The first few lines of the following script are the same as we have written in previous chapters while loading CSV data.

from pandas import read_csv
from numpy import set_printoptions
from sklearn import preprocessing
path = r'C:\pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(path, names=names)
array = dataframe.values

Now, we can use the MinMaxScaler class to rescale the data into the range of 0 and 1.

data_scaler = preprocessing.MinMaxScaler(feature_range=(0,1))
data_rescaled = data_scaler.fit_transform(array)

We can also summarize the data for output as per our choice. Here, we are setting the precision to 1 and showing the first 10 rows in the output.

set_printoptions(precision=1)
print("\nScaled data:\n", data_rescaled[0:10])

Output

Scaled data:
[[0.4 0.7 0.6 0.4 0.  0.5 0.2 0.5 1. ]
 [0.1 0.4 0.5 0.3 0.  0.4 0.1 0.2 0. ]
 [0.5 0.9 0.5 0.  0.  0.3 0.3 0.2 1. ]
 [0.1 0.4 0.5 0.2 0.1 0.4 0.  0.  0. ]
 [0.  0.7 0.3 0.4 0.2 0.6 0.9 0.2 1. ]
 [0.3 0.6 0.6 0.  0.  0.4 0.1 0.2 0. ]
 [0.2 0.4 0.4 0.3 0.1 0.5 0.1 0.1 1. ]
 [0.6 0.6 0.  0.  0.  0.5 0.  0.1 0. ]
 [0.1 1.  0.6 0.5 0.6 0.5 0.  0.5 1. ]
 [0.5 0.6 0.8 0.  0.  0.  0.1 0.6 1. ]]

From the above output, all the data got rescaled into the range of 0 and 1.

Normalization

Another useful data preprocessing technique is normalization. This is used to rescale each row of data to have a length of 1. It is mainly useful in sparse datasets where we have lots of zeros. We can rescale the data with the help of the Normalizer class of the scikit-learn Python library.
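The chapter breaks off at this point in the source; the snippet below is a minimal sketch, under the assumption that L2 row normalization is intended, of how the Normalizer class can be applied. The small array is made-up toy data.

import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[3.0, 4.0, 0.0],
              [1.0, 0.0, 0.0]])

# L2 normalization rescales each row to unit Euclidean length
normalizer = Normalizer(norm='l2')
print(normalizer.fit_transform(X))   # first row becomes [0.6, 0.8, 0.0]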
Discuss Machine Learning With Python

Machine Learning (ML) is basically that field of computer science with the help of which computer systems can make sense of data in much the same way as human beings do. In simple words, ML is a type of artificial intelligence that extracts patterns out of raw data by using an algorithm or method. The key focus of ML is to allow computer systems to learn from experience without being explicitly programmed or requiring human intervention.
Machine Learning – Data Loading

Suppose you want to start an ML project: what is the first and most important thing you would require? It is the data that we need to load for starting the project. In machine learning, data loading refers to the process of importing or reading data from external sources and converting it into a format that can be used by the machine learning algorithm. The data is then preprocessed to remove any inconsistencies, missing values, or outliers. Once the data is preprocessed, it is split into training and testing sets, which are then used for model training and evaluation.

The data can come from various sources such as CSV files, databases, web APIs, cloud storage, etc. The most common file format for machine learning projects is CSV (Comma Separated Values).

Considerations While Loading CSV Data

CSV is a plain text format that stores tabular data, where each row represents a record, and each column represents a field or attribute. It is widely used because it is simple, lightweight, and can be easily read and processed by programming languages such as Python, R, and Java.

In Python, we can load CSV data into ML projects in different ways, but before loading CSV data we have to take care of some considerations. In this chapter, let's understand the main parts of a CSV file, how they might affect the loading and analysis of data, and some considerations we should take care of before loading CSV data into ML projects.

File Header

This is the first row of the CSV file, and it typically contains the names of the columns in the table. When loading CSV data into an ML project, the file header (also known as column headers or variable names) can play an important role in data analysis and model training. Here are some considerations to keep in mind regarding the file header −

Consistency − The header row should be consistent across the entire CSV file. This means that the number of columns and their names should be the same for each row. Inconsistencies can cause issues with parsing and analysis.

Meaningful names − Column names should be meaningful and descriptive. This can help with understanding the data and building more accurate models. Avoid using generic names like "column1", "column2", etc.

Case sensitivity − Depending on the tool or library being used to load the CSV file, the column names may be case sensitive. It's important to ensure that the case of the header row matches the expected case sensitivity of the tool or library being used.

Special characters − Column names should not contain any special characters, such as spaces, commas, or quotation marks. These characters can cause issues with parsing and analysis. Instead, use underscores or camelCase to separate words.

Missing header − If the CSV file does not have a header row, it's important to specify the column names manually or provide a separate file or documentation that includes the column names.

Encoding − The encoding of the header row can affect its interpretation when loading the CSV file. It's important to ensure that the encoding of the header row is compatible with the tool or library being used to read the file.

Comments

These are optional lines that begin with a specified character, such as "#" or "//", and are ignored by most programs that read CSV files. They can be used to provide additional information or context about the data in the file.
Comments in a CSV file are not typically used to represent data that would be used in a machine learning project. However, if comments are present in a CSV file, it's important to consider how they might affect the loading and analysis of the data. Here are some considerations −

Comment markers − In a CSV file, comments can be indicated using a specific marker, such as "#" or "//". It's important to know what marker is being used, so that the loading process can ignore comments properly.

Placement − Comments should be placed in a separate line from the actual data. If a comment is included in a line with actual data, it may cause issues with parsing and analysis.

Consistency − If comments are used in a CSV file, it's important to ensure that the comment marker is used consistently throughout the entire file. Inconsistencies can cause issues with parsing and analysis.

Handling comments − Depending on the tool or library being used to load the CSV file, comments may be ignored by default or may require a specific parameter to be set. It's important to understand how comments are handled by the tool or library being used.

Effect on analysis − If comments contain important information about the data, it may be necessary to process them separately from the data itself. This can add complexity to the loading and analysis process.

Delimiter

This is the character that separates the fields in each row. While the name suggests that a comma is used as the delimiter, other characters such as tabs, semicolons, or pipes can also be used depending on the file. The delimiter used in a CSV file can significantly affect the accuracy and performance of a machine learning model, so it is important to consider the following while loading data into an ML project −

Delimiter choice − The delimiter used in a CSV file should be carefully chosen based on the data being used. For example, if the data contains commas within the values (e.g. "New York, NY"), then using a comma as a delimiter may cause issues. In this case, a different delimiter, such as a tab or semicolon, may be more appropriate.
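To tie the header, comment, and delimiter considerations above together, here is a minimal sketch using pandas.read_csv; the file name data.csv and the column names are hypothetical.

import pandas as pd

# Hypothetical file: semicolon-delimited, '#' comment lines, no header row
df = pd.read_csv(
   'data.csv',
   sep=';',                            # delimiter used in the file
   comment='#',                        # lines starting with '#' are skipped
   header=None,                        # the file has no header row...
   names=['age', 'income', 'label'],   # ...so supply column names manually
   encoding='utf-8'                    # encoding of the file, including the header
)
print(df.head())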
Machine Learning – Feature Selection

Feature selection is an important step in machine learning that involves selecting a subset of the available features to improve the performance of the model. The following are some commonly used feature selection techniques −

Filter Methods

This method involves evaluating the relevance of each feature by calculating a statistical measure (e.g., correlation, mutual information, chi-square, etc.) and ranking the features based on their scores. Features that have low scores are then removed from the model. To implement filter methods in Python, you can use the SelectKBest or SelectPercentile functions from the sklearn.feature_selection module. Below is a small code snippet to implement filter-based feature selection −

from sklearn.feature_selection import SelectPercentile, chi2

selector = SelectPercentile(chi2, percentile=10)
X_new = selector.fit_transform(X, y)

Wrapper Methods

This method involves evaluating the model's performance by adding or removing features and selecting the subset of features that yields the best performance. This approach is computationally expensive, but it is more accurate than filter methods. To implement wrapper methods in Python, you can use the RFE (Recursive Feature Elimination) function from the sklearn.feature_selection module. Below is a small code snippet to implement the wrapper method −

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

estimator = LogisticRegression()
selector = RFE(estimator, n_features_to_select=5)
selector = selector.fit(X, y)
X_new = selector.transform(X)

Embedded Methods

This method involves incorporating feature selection into the model building process itself. This can be done using techniques such as Lasso regression, Ridge regression, or Decision Trees. These methods assign weights to each feature, and features with low weights are removed from the model. To implement embedded methods in Python, you can use the Lasso or Ridge regression functions from the sklearn.linear_model module. Below is a small code snippet for implementing embedded methods −

import pandas as pd
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.1)
lasso.fit(X, y)
coef = pd.Series(lasso.coef_, index=X.columns)
important_features = coef[coef != 0]

Principal Component Analysis (PCA)

This is a type of unsupervised learning method that involves transforming the original features into a set of uncorrelated principal components that explain the maximum variance in the data. The number of principal components can be selected based on a threshold value, which can reduce the dimensionality of the dataset. To implement PCA in Python, you can use the PCA function from the sklearn.decomposition module. For example, to reduce the number of features you can use PCA as given in the following code −

from sklearn.decomposition import PCA

pca = PCA(n_components=3)
X_new = pca.fit_transform(X)

Recursive Feature Elimination (RFE)

This method involves recursively eliminating the least significant features until a subset of the most important features is identified. It uses a model-based approach and can be computationally expensive, but it can yield good results in high-dimensional datasets. To implement RFE in Python, you can use the RFECV (Recursive Feature Elimination with Cross-Validation) function from the sklearn.feature_selection module.
For example, below is a small code snippet with the help of which we can implement Recursive Feature Elimination with cross-validation −

from sklearn.feature_selection import RFECV
from sklearn.tree import DecisionTreeClassifier

estimator = DecisionTreeClassifier()
selector = RFECV(estimator, step=1, cv=5)
selector = selector.fit(X, y)
X_new = selector.transform(X)

These feature selection techniques can be used alone or in combination to improve the performance of machine learning models. It is important to choose the appropriate technique based on the size of the dataset, the nature of the features, and the type of model being used.

Example

In the below example, we will implement three feature selection methods − univariate feature selection using the chi-square test, recursive feature elimination with cross-validation (RFECV), and principal component analysis (PCA). We will use the Pima Indians Diabetes dataset, loaded from a local CSV file; the task is to classify whether a patient has diabetes based on diagnostic measurements.

Here is the Python code to implement these feature selection methods on the dataset −

# Import necessary libraries
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2, RFECV
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA

# Load the dataset
diabetes = pd.read_csv(r"C:\Users\Leekha\Desktop\diabetes.csv")

# Split the dataset into features and target variable
X = diabetes.drop('Outcome', axis=1)
y = diabetes['Outcome']

# Apply univariate feature selection using the chi-square test
selector = SelectKBest(chi2, k=4)
X_new = selector.fit_transform(X, y)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_new, y,
   test_size=0.3, random_state=42)

# Fit a logistic regression model on the selected features
clf = LogisticRegression()
clf.fit(X_train, y_train)

# Evaluate the model on the test set
accuracy = clf.score(X_test, y_test)
print("Accuracy using univariate feature selection: {:.2f}".format(accuracy))

# Recursive feature elimination with cross-validation (RFECV)
estimator = LogisticRegression()
selector = RFECV(estimator, step=1, cv=5)
selector.fit(X, y)
X_new = selector.transform(X)
scores = cross_val_score(LogisticRegression(), X_new, y, cv=5)
print("Accuracy using RFECV feature selection: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

# PCA implementation
pca = PCA(n_components=5)
X_new = pca.fit_transform(X)
scores = cross_val_score(LogisticRegression(), X_new, y, cv=5)
print("Accuracy using PCA feature selection: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Output

When you execute this code, it will produce the following output on the terminal −

Accuracy using univariate feature selection: 0.74
Accuracy using RFECV feature selection: 0.77 (+/- 0.03)
Accuracy using PCA feature selection: 0.75 (+/- 0.07)
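As noted above, these techniques can also be combined with the downstream model. One idiomatic way to do this, shown in the sketch below (an addition to this chapter, using the bundled breast cancer data rather than the CSV above), is a scikit-learn Pipeline that chains SelectKBest with a classifier so the selection is re-fit inside each cross-validation fold.

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)

# Feature selection happens inside the pipeline, so each CV fold
# selects features using only its own training portion (no leakage)
pipe = Pipeline([
   ('select', SelectKBest(f_classif, k=10)),
   ('clf', LogisticRegression(max_iter=5000)),
])
scores = cross_val_score(pipe, X, y, cv=5)
print("Mean CV accuracy: %0.2f" % scores.mean())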
Machine Learning – Data Visualization

Data visualization is an important aspect of machine learning (ML) as it helps to analyze and communicate patterns, trends, and insights in the data. Data visualization involves creating graphical representations of the data, which can help to identify patterns and relationships that may not be apparent from the raw data.

Here are some of the ways data visualization is used in machine learning −

Exploring Data − Data visualization is an essential tool for exploring and understanding data. Visualization can help to identify patterns, correlations, and outliers, and can also help to detect data quality issues such as missing values and inconsistencies.

Feature Selection − Data visualization can help to select relevant features for the ML model. By visualizing the data and its relationship with the target variable, you can identify features that are strongly correlated with the target variable and exclude irrelevant features that have little predictive power.

Model Evaluation − Data visualization can be used to evaluate the performance of the ML model. Visualization techniques such as ROC curves, precision-recall curves, and confusion matrices can help to understand the accuracy, precision, recall, and F1 score of the model.

Communicating Insights − Data visualization is an effective way to communicate insights and results to stakeholders who may not have a technical background. Visualizations such as scatter plots, line charts, and bar charts can help to convey complex information in an easily understandable format.

Some popular libraries used for data visualization in Python include Matplotlib, Seaborn, Plotly, and Bokeh. These libraries provide a wide range of visualization techniques and customization options to suit different needs and preferences.

Univariate Plots: Understanding Attributes Independently

The simplest type of visualization is single-variable or "univariate" visualization. With the help of univariate visualization, we can understand each attribute of our dataset independently. The following are some techniques in Python to implement univariate visualization −

Histograms
Density Plots
Box and Whisker Plots

Multivariate Plots: Interaction Among Multiple Variables

Another type of visualization is multi-variable or "multivariate" visualization. With the help of multivariate visualization, we can understand the interaction between multiple attributes of our dataset. The following are some techniques in Python to implement multivariate visualization −

Correlation Matrix Plot
Scatter Matrix Plot

In the next few chapters, we will look at some of the popular and widely used visualization techniques available in machine learning.
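As a small preview of the univariate and multivariate techniques listed above, here is a minimal sketch that draws per-attribute histograms and a pairwise scatter matrix for the Iris data, using pandas and Matplotlib; it assumes a scikit-learn version recent enough to support as_frame.

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame.drop(columns='target')

# Univariate: one histogram per attribute
df.hist(figsize=(8, 6))
plt.show()

# Multivariate: pairwise scatter plots of all attributes
pd.plotting.scatter_matrix(df, figsize=(8, 8))
plt.show()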