Data scaling is a pre-processing technique used in machine learning to normalize or standardize the range or distribution of the features in a dataset. It matters because features are often measured on very different scales, and algorithms that rely on distances or gradient-based optimization (for example, k-nearest neighbors, support vector machines, and models trained with gradient descent) can be dominated by the features with the largest values. Scaling brings the features to a comparable range, which can improve the performance of the machine learning model.
There are two common techniques used for data scaling −
- Normalization − Normalization (also called min-max scaling) rescales the values of a feature to lie between 0 and 1. This is achieved by subtracting the minimum value of the feature from each value and dividing the result by the range of the feature (the difference between the maximum and minimum values).
- Standardization − Standardization rescales the values of a feature to have a mean of 0 and a standard deviation of 1. This is achieved by subtracting the mean of the feature from each value and dividing the result by the standard deviation.
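The two formulas above can be sketched directly with NumPy and compared against scikit-learn's MinMaxScaler and StandardScaler. This is a minimal illustration rather than a reference implementation: the toy array X and the variable names X_norm and X_std are made up for this example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical toy data: two features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# Normalization (min-max): (x - min) / (max - min), computed per column
X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization (z-score): (x - mean) / std, computed per column.
# np.std uses the population standard deviation (ddof=0) by default,
# which is also what StandardScaler uses.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# The scikit-learn scalers should agree with the manual calculations
assert np.allclose(X_norm, MinMaxScaler().fit_transform(X))
assert np.allclose(X_std, StandardScaler().fit_transform(X))

print(X_norm)
print(X_std)
```

After either transformation, the two columns end up with identical values even though the raw features differ by two orders of magnitude, which is the point of scaling.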
Example
In Python, data scaling can be implemented using the scikit-learn library (imported as sklearn). Its sklearn.preprocessing sub-module provides classes for scaling data. Below is an example implementation of data scaling in Python using the StandardScaler class for standardization −
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
import pandas as pd

# Load the iris dataset
data = load_iris()
X = data.data
y = data.target

# Create a DataFrame from the dataset
df = pd.DataFrame(X, columns=data.feature_names)
print("Before scaling:")
print(df.head())

# Scale the data using StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Create a new DataFrame from the scaled data
df_scaled = pd.DataFrame(X_scaled, columns=data.feature_names)
print("After scaling:")
print(df_scaled.head())
In this example, we load the iris dataset and create a DataFrame from it. We then use the StandardScaler class to scale the data and create a new DataFrame from the scaled data. Printing the first five rows of each DataFrame shows the difference in the data before and after scaling. Note that the fit_transform() method of the scaler object does two things in a single call: it computes the mean and standard deviation of each feature (the fit step) and then applies the scaling to the data (the transform step).
Output
When you execute this code, it will produce the following output −
Before scaling:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2
After scaling:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0          -0.900681          1.019004          -1.340227         -1.315444
1          -1.143017         -0.131979          -1.340227         -1.315444
2          -1.385353          0.328414          -1.397064         -1.315444
3          -1.506521          0.098217          -1.283389         -1.315444
4          -1.021849          1.249201          -1.340227         -1.315444
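The fit_transform() call used in the example can also be written as separate fit() and transform() calls. This is useful when the data is split into training and test sets, where a common practice is to learn the scaling statistics from the training set only and reuse them on the test set. The sketch below assumes an 80/20 split and a fixed random_state, both arbitrary choices made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# The 80/20 split and random_state are arbitrary choices for illustration
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

scaler = StandardScaler()
scaler.fit(X_train)  # learn per-feature mean and std from training data only

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuse the training statistics

print(scaler.mean_)   # per-feature means learned by fit()
print(scaler.scale_)  # per-feature standard deviations learned by fit()
```

Because the test set is scaled with the training statistics, its columns will generally not have a mean of exactly 0 or a standard deviation of exactly 1. This is expected.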