In very simplistic terms Machine Learning is the science of getting computers to learn, without being explicitly programmed. So, how do computers learn? They learn, just like we do, through experience. They perform a task, measure its performance and with experience, improve the performance.
Tom Mitchell, a professor at CMU, describes Machine Learning as "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”
Machine Learning problems can be classified into many types of problems, with the main types being:
1.1 Supervised Learning
In supervised learning, we have an example data set where we already know the outcome. Computers use this data set to learn about the relationships between input and output.
1.2 Unsupervised Learning
In unsupervised learning, the computer algorithms are used to find patterns in vast amount of data.
1.3 Recommender Learning
In recommender system, the computer algorithm makes recommendations to us, based on our past history, or preferences.
1.4 Reinforced Learning
In reinforcement learning, the computer algorithm gets to predict an action in response to each input data set. Subsequently, it receives a reward signal, indicating how good (or bad) the decision was. Reinforced learning is very popular in robotics.
Computers use powerful algorithms to analyze the data, recognize patterns within data, and learn from the data to improve its performance. In this article, I will discuss the Gradient Descent Algorithm for Linear Regression for machine learning.
At a very basic level, Gradient Descent is just an algorithm that minimizes a function. Given a function, and its set of parameters, Gradient Descent finds the parameters that minimize the function. Gradient Descent starts with an initial set of parameter values and iteratively moves toward a set of parameter values that minimize the function. It takes steps in the negative direction of the function gradient.
Lets take an example. Suppose we have a function y = 5(x*x)+10. We want to minimize this function. Gradient Descent starts with an initial set of values. Say we start with x=2. At x=2 gradient for this function is +ve (y is increasing as x increases). So we move the ‘x’ by a small negative value, say the next x is 2-0.3=1.7. Gradient Descent algorithm keeps on moving x till it minimizes y =10 at x=0.
Gradient Descent is used mostly in supervised learning that uses the training set to learn the relationship between input and output. The training set consists of input variables, called features, and the desired output. We make an hypothesis to predict the output from the input variables. Using hypothesis we find the predicted output value for each example in our training set, and then calculate the error between the predicted output and the correct output as given in the training set. The Gradient Descent algorithm then minimizes this error, by trying different values of the parameters.
Gradient Descent is used in machine learning to try to fit a line to the set of points in our training set. It does so by minimizing the error between the output projected by our algorithm and the points in out data set. Let’s take an example.
Suppose we want to find the price of a house based on its plot size. Our dataset given in Appendix A, has plot size of the house and its price, and it is plotted below:
We want to fit a line through these points, so we can predict the price of a house given its plot size. Let’s use the standard y=mx+c line equation as our hypothesis function. Here ‘m’ is the slope of the line and ‘c’ is the y-intercept. We need to find the best line fit for our data.
To solve this problem using Gradient Descent, we must define an error function (called a cost function) that measures how “good” a given line is. The cost function takes in the parameters (m and c) and returns an error value based on how well the line fits our data. To compute this error value for a given set of parameters m & c, we’ll :
Mathematically, we can write this as follows:
To use Gradient Descent algorithm on this equation, we need to first find out its gradient for parameters m & c. The gradient will tell us if we need to increase the parameters or decrease them for the next iteration. To compute the gradient, we must differentiate the equation. The gradient for this equations are:
Before running Gradient Descent algorithm on our dataset, we must do a few things:
Learning Rate is the rate by which we modify out input parameters for each iterations. If learning rate is too small, gradient descent will be slow to converge. If it is too large, cost function may not decrease on every iteration, and may not converge. For sufficiently small learning rate, cost function will decrease with every iteration of gradient descent
If our features are scaled properly, a good practice is to try with learning rates of 0.001, 0.003, 0.01. 0.03, 0.1, 0.3, 1.0 and see which works best.
Calculating Increment for Parameters
The increment is the value by which we change the parameter after every iteration. This is just our learning rate (alpha) times the gradient for that parameter. Thus in our case, the increments for m & c will be:
In our example we have only one feature - the size of the plot. Typically Gradient Descent is used with many features. Say, to predict the value of house we also use number of rooms in the house and the age of the house. Gradient Descent can be used with several thens of thousands of features.
When we are using multiple features, we can speed up gradient descent by having each of our input values in roughly the same range. For this we modify the ranges of our input variables so that they are all roughly the same say:
-1<=x<=1 or -3 <=x<= 3.
Mean normalization involves subtracting the average value for an input variable from the values for that input variable resulting in a new average value for the input variable of just zero.
We can achieve feature scaling and mean normalization by modifying our input parameter as follows:
where 𝜇 i is the average of feature xi and si is the range of values (max - min) or it can also be the standard deviation.
To run Gradient Descent we must first define initial parameters.
Let’s define the initial parameters as m=0 and c=0 and the learning rate as 0.01. I have run Gradient Descent using Octave library and the plots for different iterations are:
m=0; c=0; Error : 16,901,352,061.183672
After 10 iterations:
m=3,467.001170; c=16,709.107503; Error= 14,001,346,641.117964
After 100 iterations:
m=110,785.020323; c=23,162.604795; Error= 3,102,167,183.723494
After 300 iterations:
m=166,178.861692; c=35,051.360387; Error = 1,001,515,194.505497
After 600 iterations:
m=174,328.421645; c=36,879.959598 ; Error = 963,064,442.402157
After 1000 iterations:
m=174,741.149758; c=36,978.644771; Error = 962,971,196.814512
After 5000 iterations:
m=174,748.693878; c=36,980.606713; Error = 962,971,166.472315
Initally, our line moved quite rapidly towards the “best fit” line. But, we can see there is hardly any difference between 1000 iteration and 5000 iteration.
In this example, we have taken a very simple case, where we have only one input feature(x) - the plot size of the house. In reality we will have several features, thousands or even hundreds of thousands. Gradient Descent algorithm works very well for large number of input features.
Gradient Descent algorithm also requires that we assume initial parameters and the learning rate. Then we run the algorithm iteratively several times. There are other algorithm, say Normal Equation, that do not require these and can directly give you the optimized fit in a single pass. But these algorithms do not perform very well when the number of features or input variables is very large. The following tables gives the advantage of Gradient Descent algorithm versus the Normal Equation algorithm:
|GRADIENT DESCENT||NORMAL EQUATION|
|Need to choose learning rate||No need to choose learning rate|
|Needs many iterations||No need to iterate|
|Computations required is of the order of n squared, where n is the number of input features||Computations required is of the order of n cubed, where n is the number of input features. Need to compute X Transpose, where X in nXn matrix.|
|Works very well when the number of input features is very very large.||Does not work well, when the number of input features in 100s of thousand or more.|
In conclusion, we can say that Gradient Descent is a basic algorithm for machine learning. It provides good insight into how machine learning works, and is suitable for large complex real life problems.
Plot size versus house value data used in this article:
Plot Size, House Value