
Important Machine Learning Regression Algorithms You Must Know

  • neovijayk
  • Jun 12, 2020
  • 6 min read

Quick! Name five machine learning algorithms (actually try to do it). Chances are that not many of them are regression algorithms. After all, the only widely popularized regression algorithm is linear regression, mostly because of its simplicity. However, linear regression is often not applicable to real-world data because of its basic capabilities and limited flexibility; in research it usually serves only as a baseline for evaluating and comparing new approaches. Here are five regression algorithms that you should have in your toolbox alongside popularized classification algorithms like SVM, decision trees, and neural networks.



1 | Neural Network Regression

Theory

Neural networks are incredibly powerful, but they are usually used for classification. Signals pass through layers of neurons and are generalized into one of several classes. However, they can be adapted into regression models very quickly by changing the last activation function.

Each neuron passes values from the previous connection through an activation function, serving the purpose of generalization and nonlinearity. Usually, the activation function is something like the sigmoid or ReLU function. But by substituting the last activation function (the output neuron) with a linear activation function, the output can be mapped to a range of values beyond fixed classes. This way, the output is not a probability of the input belonging to any one class but the continuous value at which the neural network places the observation. In this sense, it is like a neural network extension of linear regression.

Neural network regression has the advantage of nonlinearity (in addition to complexity), which can be introduced with sigmoid and other nonlinear activation functions earlier in the network. However, excessive use of ReLU (Rectified Linear Unit) as an activation function may give the model a tendency to avoid outputting negative values, since ReLU maps all negative inputs to zero and discards the differences between them. This can be addressed either by limiting the use of ReLU and adding activation functions that handle negative values, or by normalizing the data to a strictly positive range before training.

Implementation

Using Keras, say we build the following artificial neural network structure. The same could be done with a convolutional neural network or another architecture, as long as the last layer is either a dense layer with a linear activation or simply a linear activation layer.



from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, ReLU

model = Sequential()
model.add(Dense(100, input_dim=3, activation='sigmoid'))
model.add(ReLU())
model.add(Dense(50, activation='sigmoid'))
model.add(ReLU())
model.add(Dense(25, activation='softmax'))
model.add(Dense(1, activation='linear'))  # IMPORTANT PART: linear output for regression

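As a minimal follow-up sketch (the synthetic data, optimizer, and epoch count are illustrative assumptions, not from the original), the model is then compiled with a regression loss such as mean squared error and trained like any other Keras model:

import numpy as np

# Hypothetical data: 3 input features, one continuous target
X_train = np.random.rand(500, 3)
y_train = X_train.sum(axis=1) + np.random.normal(0, 0.1, 500)

model.compile(optimizer='adam', loss='mse')            # regression loss, not cross-entropy
model.fit(X_train, y_train, epochs=20, batch_size=32)
preds = model.predict(X_train[:5])                     # continuous predictions, not class labels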

A perennial problem with neural networks is their high variance and tendency to overfit. In the code example above, there are many sources of nonlinearity, like softmax and sigmoid. If your neural network performs well on the training data with a purely linear structure, it may be better to use a pruned decision tree regression, which emulates the high-variance behavior of the neural network but gives the data scientist more control over depth, width, and other attributes to limit overfitting.

2 | Decision Tree Regression

Theory

Decision trees in classification and regression are very similar, in that both work by constructing trees of yes/no nodes. However, while classification end nodes result in a single class value (for example, 1 or 0 for a binary classification problem), regression trees end with a continuous value (for example, 4593.49 or 10.98).

Because regression is a specific and high-variance task, decision tree regressors need to be pruned carefully. Yet the way a tree approaches regression is irregular: instead of computing a value on a continuous scale, it arrives at set end nodes, so if the regressor is pruned too much, it has too few end nodes to properly achieve its task. Hence, a decision tree should be pruned such that it has the most freedom (possible output regression values, i.e. the number of end nodes), but not so much that it grows too deep. If left unpruned, an already high-variance algorithm will skyrocket in overfitting complexity due to the nature of regression.

Implementation

Decision tree regression can easily be created in sklearn:




from sklearn.tree import DecisionTreeRegressor

# Pruning parameters (e.g. max_depth, min_samples_leaf) are left at their defaults here
model = DecisionTreeRegressor()
model.fit(X_train, y_train)


Because the decision tree regressor's parameters are so essential, it is recommended to use sklearn's GridSearchCV parameter-search tool to find the right settings for the model (a sketch follows the Random Forest example below). When evaluating performance formally, use K-fold cross-validation instead of a standard train-test split, so that the randomness of the latter does not interfere with the delicate results of a high-variance model. Bonus: the decision tree's close relative, the Random Forest algorithm, can also be implemented as a regressor. A Random Forest regressor may or may not perform better than a single decision tree in regression (while it usually performs better in classification), because of the delicate overfitting-underfitting balance in the nature of tree-constructing algorithms.




from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
model.fit(X_train, y_train)

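As a hedged sketch of the tuning approach recommended above (the parameter names exist in sklearn, but the grid values and the 5-fold setting are illustrative assumptions), GridSearchCV can search over pruning-related parameters with K-fold cross-validation:

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Illustrative grid over pruning-related parameters
param_grid = {
    'max_depth': [3, 5, 10, None],
    'min_samples_leaf': [1, 5, 10],
    'ccp_alpha': [0.0, 0.01, 0.1],   # cost-complexity pruning strength
}

search = GridSearchCV(DecisionTreeRegressor(), param_grid,
                      cv=5, scoring='neg_mean_squared_error')
search.fit(X_train, y_train)
model = search.best_estimator_   # best pruned tree found by the search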


3 | LASSO Regression

Theory

LASSO regression is a variation of linear regression specifically adapted for data that shows heavy multicollinearity (heavy correlation of features with each other). It automates parts of model selection, like variable selection or parameter elimination. Standing for Least Absolute Shrinkage and Selection Operator, LASSO uses shrinkage, a process in which data values are shrunk towards a central point such as the mean (a toy sketch follows the list below). Shrinking adds several benefits to regression models:

  • More accurate and stable estimates for true parameters.

  • Reduced sampling and non-sampling errors.

  • Smoother spatial fluctuations.

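A toy sketch of shrinkage (the values and shrinkage factor are purely illustrative): each value is pulled partway toward the central point.

import numpy as np

values = np.array([2.0, 8.0, 11.0, 19.0])
center = values.mean()                        # central point, here the mean (10.0)
shrunk = center + 0.5 * (values - center)     # shrink each value halfway toward the mean
print(shrunk)                                 # [ 6.   9.  10.5 14.5]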
Instead of adjusting model complexity to compensate for the complexity of the data, as the high-variance methods of neural network and decision tree regression do, LASSO attempts to reduce the complexity of the data so that it can be handled by simple regression techniques, by warping the space on which the data lies. In the process, LASSO automatically helps eliminate or down-weight highly correlated and redundant features in a low-variance method.

LASSO regression uses L1 regularization, meaning it weights errors at their absolute value (instead of, for example, L2 regularization, which weights errors at their square to punish higher errors more). This regularization often results in sparser models with fewer coefficients, since some coefficients can become exactly zero and hence be eliminated from the model, which also makes the model more interpretable.

Implementation

In sklearn, LASSO regression comes with a cross-validation model (LassoCV) that selects the best-performing of many models trained with varied regularization strengths, automating a task that would otherwise have to be done manually.



from sklearn.linear_model import LassoCV

# LassoCV cross-validates over a path of regularization strengths (alphas)
model = LassoCV()
model.fit(X_train, y_train)

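As a brief follow-up sketch, the sparsity produced by L1 regularization can be checked directly on the fitted model:

import numpy as np

print("Chosen regularization strength:", model.alpha_)
eliminated = np.sum(model.coef_ == 0)               # coefficients driven exactly to zero
print(f"{eliminated} of {len(model.coef_)} coefficients were eliminated")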


4 | Ridge Regression

Theory

Ridge regression is very similar to LASSO regression in that it applies shrinking. Both Ridge and LASSO regression are well suited to datasets with an abundance of features that are not independent of one another (collinearity). The largest difference between the two is that Ridge uses L2 regularization, meaning that none of the coefficients become zero as they do in LASSO regression; instead, the coefficients get closer and closer to zero but have little incentive to reach it, because of the nature of L2 regularization. Geometrically, Ridge's L2 penalty corresponds to a circular constraint region, whereas LASSO's L1 penalty is bounded by straight lines.

In LASSO, improving from an error of 5 to an error of 4 is weighted the same as an improvement from 4 to 3, as well as from 3 to 2, 2 to 1, and 1 to 0. Hence, more coefficients reach 0 and more features are eliminated. In Ridge regression, however, an improvement from an error of 5 to an error of 4 is weighted as 5² − 4² = 9, whereas an improvement from 4 to 3 is weighted as only 4² − 3² = 7. Progressively, the reward for improving decreases; therefore, fewer features are eliminated (a tiny numeric sketch of this arithmetic follows the implementation below). Because of this, if you would like many variables, each with a small effect, to be retained, Ridge is generally the better choice; if you want a few variables, each with a medium to large effect, to dominate the model, LASSO is the better choice.

Implementation

Ridge regression can be implemented in sklearn as follows. Like LASSO regression, sklearn provides a cross-validation implementation that selects the best of many trained models.



from sklearn.linear_model import RidgeCV

# RidgeCV cross-validates over a set of regularization strengths (alphas)
model = RidgeCV()
model.fit(X_train, y_train)

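Here is the tiny numeric sketch referenced above (the values are purely illustrative), contrasting how the two penalties reward shrinking a coefficient:

# L1 penalty (LASSO): every unit of shrinkage is rewarded equally
l1_gain_5_to_4 = abs(5) - abs(4)   # 1
l1_gain_1_to_0 = abs(1) - abs(0)   # 1 -> same incentive all the way, so coefficients often hit 0

# L2 penalty (Ridge): the reward fades as the coefficient approaches zero
l2_gain_5_to_4 = 5**2 - 4**2       # 9
l2_gain_1_to_0 = 1**2 - 0**2       # 1 -> little incentive to reach exactly zero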

5 | ElasticNet Regression

Theory

ElasticNet seeks to take the best from both Ridge regression and LASSO regression by combining L1 and L2 regularization. LASSO and Ridge present two different methods of regularization; λ is the tuning factor in both that controls the strength of the penalty:

  • If λ = 0, the penalty vanishes and the objective reduces to simple linear regression, giving the same coefficients.

  • If λ = ∞, the coefficients are forced to zero: with infinite weight on the penalty, any nonzero coefficient makes the objective infinite.

  • If 0 < λ < ∞, the magnitude of λ decides the weightage given to the different parts of the objective.

In addition to the λ parameter, ElasticNet adds a parameter α, a measure of how 'mixed' the L1 and L2 regularizations should be. When α is equal to 0, the model is purely a Ridge regression model, and when α is equal to 1, it is purely a LASSO regression model; the 'mixing factor' α simply determines how much of the L1 and the L2 regularization should be considered in the loss function. All three popular regularized regression models (Ridge, LASSO, and ElasticNet) aim to decrease the size of their coefficients, each from a different perspective.

Implementation

ElasticNet can be implemented with sklearn's cross-validation model:




from sklearn.linear_model import ElasticNetCV

# ElasticNetCV cross-validates over penalty strengths (and optionally the L1/L2 mix)
model = ElasticNetCV()
model.fit(X_train, y_train)

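As a hedged sketch (the grid of mixing values is an illustrative assumption), the mixing parameter α described above corresponds to l1_ratio in sklearn, and ElasticNetCV can cross-validate over it alongside the penalty strength:

from sklearn.linear_model import ElasticNetCV

# l1_ratio = 1.0 is pure LASSO; values near 0 approach Ridge
model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=5)
model.fit(X_train, y_train)

print("Best mixing value:", model.l1_ratio_)
print("Best penalty strength:", model.alpha_)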

