
How To Explain Gradient Descent to Your Mom: Complete Tutorial

Author(s): Igor Novikov

Originally published on Towards AI.

(Image by the author)

Gradient descent is at the core of most AI/ML techniques. It sounds strange and kinda scary. Descent? Oh man, I hope I won't have to jump with a chute out of a plane… 😒 Well, worry not, you might have to jump, but only if you want to. Here is the explanation even your 10-year-old nephew can understand.

Let's imagine you are trying to learn a new skill. Suppose you live in an ancient tribe, and your task is to tell people their weight from their height. Scales have not been invented yet. I know it sounds ridiculous, but for simplicity, let's assume the survival of the tribe depends on this. So a fellow member of the tribe comes and tells you his height, and you, using your vast experience, should tell him how much he weighs in, say, bags of potatoes.

So John comes and tells you: I am 174 centimeters tall. And you don't like John very much, so you tell him he weighs 100500 bags of potatoes, and he runs away in tears… 😂 But really, you know from your experience that this bloke probably weighs around 70 kilos, or 7 bags of potatoes (10 kilos each). How do you know that? Well, you've seen a lot of people with the same height and complexion as John, so you guess that should be about right.

Now we want to train a system that can do this. For that, we need to simulate this prior experience of seeing different people of different heights and knowing their weight. We do that with training data that looks like this:

(Table: the average height-to-weight correlation)

This is the average height-to-weight correlation we know is true. If we plot this data, it looks like this:

(Figure: a graph of the average height-to-weight correlation)

There is, obviously, a pattern: the greater the height, the greater the weight. OK, let's try to draw a line that represents this insight. Using this line, we can estimate the weight for any given height:

(Figure: the height-to-weight graph with a fitted line)

For example, for someone with a height of around 157 cm, the weight is around 40 kilos. The function that represents this line is, basically, our AI model. Sounds suspiciously simple, right? Well, it is. 😎 All complexity comes from complex dependencies. The weight-to-height dependency is simple and can be represented with a simple function, but in the real world many dependencies (or patterns) are very complex and are represented by very complex functions. Let's stick with ours for now.

A line like ours is represented by the following equation:

f(x) = a * x + b, or y = a * x + b

where a is the slope and b is the intercept. Ah, more definitions 🤯. But these are very simple:

(Figure: a slope and an intercept)

The intercept is the point where the line crosses the y-axis, and the slope measures how steep the line is. From simple triangle geometry, we know that the slope equals A/B (or dy/dx), the tangent of the angle between the line and the x-axis, as below:

(Figure: basic triangle math)

These two parameters, a and b, are the coefficients, or parameters, of our model. In big models like OpenAI's, there are billions of parameters and the underlying function is different, but the principle is the same.

So, to the training process. We decided that in our case we are going to use a simple line to represent the correlation between height and weight, so the function is linear. This is why this is also called linear regression. The "regression" part of the name is historical; I'll explain it at the end.
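To make this concrete, here is a minimal sketch of such a model in Python. The coefficient values below are made up purely for illustration; finding good ones is exactly what training is for:

```python
def predict_weight(height_cm: float, a: float, b: float) -> float:
    """Our 'AI model': the line f(x) = a * x + b, mapping height (cm) to weight (kg)."""
    return a * height_cm + b

# Made-up coefficients, just to show the shape of the model
print(predict_weight(174, a=0.5, b=-17.0))  # 70.0 kg, i.e. 7 bags of potatoes
```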
How do we find the correct slope and intercept? We will do that the same way we learn any skill: we start with a random guess, observe the result, and correct the guess accordingly if it is too far from the truth. Kinda similar to the hot-and-cold game. So we place our line at a random location, like this:

(Figure: distances from the training points to the line)

We measure the distances from the line to all of our points; the sum of those distances is the total error of our current line position. It is called the loss. Our goal is to minimize the loss, so that the line is located in such a way that the sum of distances is as close to zero as possible. Ideally, the distance is zero, meaning all points lie on the line, but that is not feasible in our case with our linear function. Still, we want it to be as small as possible. Now we have formalized our training objective: we want to find coefficients a and b such that the loss is as small as possible, or in other words:

f(x) = a * x + b, with sum( distances_to_line(a, b) ) -> 0

The formula for the distance D between two points is:

D = sqrt((x2 - x1)² + (y2 - y1)²)

Let's randomly pick a = 0.2 and b = 5.2. Given that, let's pick a point from our training data, say {174 cm, 70 kg}. The second point we get from our function y = a * x + b with the a and b we selected: y = 174 * 0.2 + 5.2 = 40. So the point is {174 cm, 40 kg}. The distance is D = sqrt((174 - 174)² + (70 - 40)²) = 30. So we are 30 kg off. Not a good model; we need to find better a and b.

We don't actually need the square root in the above formula. If you think about it, we only use the distance to score the error, so we can just as easily use:

(calculated_value - expected_value)²

where the calculated value comes from our function and the expected value is taken from the training set. Less computation, and it works just as well. That is what is typically used; this approach is called the method of least squares. If we take all the points we have one by one and sum up the […]
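Here is that single-point check as a minimal Python sketch; a = 0.2 and b = 5.2 are the random guess from above, and the point {174, 70} is from the training data:

```python
import math

a, b = 0.2, 5.2                      # our random initial guess
height, actual_weight = 174, 70      # a point from the training data

predicted_weight = a * height + b    # 40.0
# Distance between {174, 70} and {174, 40}; the x-term is zero by construction
error = math.sqrt((height - height) ** 2
                  + (actual_weight - predicted_weight) ** 2)
print(error)  # 30.0 -> we are 30 kg off
```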
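And a sketch of the full least-squares loss summed over the training set, assuming the data is stored as (height, weight) pairs; the sample list below is illustrative, not the article's full table:

```python
# Illustrative (height_cm, weight_kg) pairs; the real training table is larger.
training_data = [(157, 40), (168, 62), (174, 70), (180, 80)]

def loss(a: float, b: float) -> float:
    """Sum of squared errors between predicted and actual weights."""
    return sum((a * h + b - w) ** 2 for h, w in training_data)

print(loss(0.2, 5.2))    # the random guess above -> a large loss (~2955)
print(loss(1.3, -152))   # a line closer to the data -> a much smaller loss (~187)
```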
