Normalization vs. Standardization
- neovijayk
- Jul 6, 2020
- 3 min read
Before going through this article, please read the previous articles in this series.
In this article we will take a look at two techniques that we regularly encounter in papers, examples, and blog posts on Machine Learning and Deep Learning: normalizing and standardizing input features.
What are Normalization and Standardization, in brief?
Normalization and Standardization are used to scale input features that have different ranges of values.
Normalization typically means rescaling the values into the range [0, 1].
Standardization typically means rescaling the data to have a mean of 0 and a standard deviation of 1 (unit variance).
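The most common form of normalization is min-max scaling, which maps each value as:
Min-Max Normalization = (x - x_min) / (x_max - x_min)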
What is Standardization?
Standardization replaces each value with its Z-score.
This redistributes each feature so that its mean is μ = 0 and its standard deviation is σ = 1.
sklearn.preprocessing.scale helps us implement standardization in Python.
Z-Score Standardization = (x - x_mean) / x_std
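A minimal sketch of how this looks in code (the array values here are just illustrative):
import numpy as np
from sklearn import preprocessing

X = np.array([[1.0, 11.0],
              [0.0, 12.0]])
X_scaled = preprocessing.scale(X)  # standardizes each column to mean 0, std 1
print(X_scaled.mean(axis=0))       # ~[0. 0.]
print(X_scaled.std(axis=0))        # [1. 1.]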
When to use which one?
Python's sklearn library provides both techniques, but its documentation does not show how each one affects classification tasks with different classifiers.
I did not find any fixed rule for choosing between Normalization and Standardization.
Hence, across ML and DL models there is no single scaling method to rule them all.
In general, I find that most people use:
- Normalization for most neural networks, especially when images are used as input (see the sketch below).
- Standardization for most classical machine learning algorithms.
But again, it depends on the performance of the algorithm. In my case I used normalization with the L2 norm and found the results were good; on the other hand, in almost all regression work we use standardization.
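As an illustration of the image case mentioned above, here is a minimal sketch (the image is randomly generated): 8-bit pixel intensities in [0, 255] are commonly normalized into [0, 1] by dividing by 255.
import numpy as np

img = np.random.randint(0, 256, size=(28, 28), dtype=np.uint8)  # a fake 8-bit grayscale image
img_norm = img.astype(np.float32) / 255.0                       # rescale pixel values into [0, 1]
print(img_norm.min(), img_norm.max())                           # both now lie within [0, 1]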
Now I am sharing some lines from a blog post that I found very interesting and related to this question:
When do you need to Standardize the Variables in a Regression Model?
Standardization is the process of putting different variables on the same scale. In regression analysis, there are some scenarios where it is crucial to standardize your independent variables or risk obtaining misleading results.
In regression analysis, you need to standardize the independent variables when your model contains polynomial terms to model curvature or interaction terms. These terms provide crucial information about the relationships between the independent variables and the dependent variable, but they also generate high amounts of multicollinearity.
Multicollinearity refers to independent variables that are correlated.
This problem can obscure the statistical significance of model terms, produce imprecise coefficients, and make it more difficult to choose the correct model.
When you include polynomial and interaction terms, your model almost certainly has excessive amounts of multicollinearity.
These higher-order terms multiply independent variables that are in the model. Consequently, it's easy to see how these terms are correlated with other independent variables in the model. When your model includes these types of terms, you are at risk of producing misleading results and missing statistically significant terms.
Standardizing the independent variables is a simple method to reduce multicollinearity that is produced by higher-order terms. However, it's important to note that it won't work for other causes of multicollinearity.
Standardizing your independent variables can also help you determine which variable is the most important.
(Source)
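To make the quoted point concrete, here is a minimal sketch (with made-up data) showing how centering a variable, which is the key step of standardization, removes the correlation between that variable and its own polynomial term:
import numpy as np

x = np.linspace(1, 10, 101)             # a made-up independent variable
print(np.corrcoef(x, x**2)[0, 1])       # ~0.98: x and x**2 are highly correlated
x_c = x - x.mean()                      # center the variable (the key step of standardization)
print(np.corrcoef(x_c, x_c**2)[0, 1])   # ~0: the multicollinearity is gone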
Why do we get different accuracy and model parameters when using Normalization versus Standardization?
One reason is that the two techniques produce different transformed values, and this impacts the performance of the machine learning model.
Here is an example implementation in Python, with the output of both techniques.
First we will look at the output of normalize():
import numpy as np
from sklearn import preprocessing

# Input array
npSample = np.array([[1, 11],
                     [0, 12]])
result = preprocessing.normalize(npSample, norm='l2', axis=0)  # axis=0: normalize along each column
print(result)
print("After normalization, L2 norm of 1st column:", np.linalg.norm(result[:, 0]))
print("After normalization, L2 norm of 2nd column:", np.linalg.norm(result[:, 1]))  # each should be 1.0
Note that after L2 normalization each feature column has unit L2 norm, i.e. the sum of its squared entries is 1 (the plain sum of a column is not, in general, 1).
Now we will look at the output of StandardScaler():
sc_new = preprocessing.StandardScaler()
sc_new.fit(npSample)                  # learn per-column mean and standard deviation
result = sc_new.transform(npSample)
print(result)
print("After Standardization, sum of 1st column:", result[:, 0].sum())
print("After Standardization, sum of 2nd column:", result[:, 1].sum())  # each should be ~0
Note that after standardization the sum of each feature column is 0 (up to floating-point error), since each column now has mean 0.
For the full code implementation and output, please refer to this GitHub repository.
If you like this article, please click the like button and subscribe to my blog. 🙂