We saw a lot of confusion in the forums about what was going on with gradient descent. The equations were presented to you as a black box, and I'd like to open up that black box and see what's going on inside. For those of you who looked at Wikipedia to try to understand gradient descent, the first formula you would have encountered is this one:

b = a − γ · ∇F(a)

This equation describes gradient descent. It's a little intimidating at first. What do all of these parameters mean? b equals a, minus gamma times the gradient of the function evaluated at a. Let's figure out what's going on here.

Before we get into gradient descent, let's first answer the question: what is a gradient? To help me answer this, I'm going to get the help of my blindfolded hill climber. Here's our blindfolded hill climber, and he's climbing this rounded hill. He wants to get to the top, and he wants to do it in as few steps as possible. What is his method going to be? Well, as long as the hill is steep, he knows he can afford to take a really big step, because he's still not close to the top. As the hill smooths out and becomes more level, he's going to want to take smaller and smaller steps, because he's afraid of overshooting the top and winding up on the other side of the hill. How can we describe this mathematically? We're going to do it using the gradient, and when we do, our climber will get to the top of the hill.

Let's look at this hill from a top-down view instead of from the side. I'm going to draw these isoclines; each of these lines represents a line of constant altitude. We can see that they're really close together near the edge and become further and further apart as we move inward, and that's because the hill is getting less and less steep. At any point, the gradient is a vector. Say I take the gradient at this point.
I would get a vector pointing in the direction of steepest ascent, and the length of that vector depends on exactly how steep the hill is. At this point, which maybe corresponds to somewhere down here on the hill, the hill is quite steep, so our vector is quite long, and it points directly uphill. If I took the gradient over here instead, it would still point uphill, but the vector wouldn't be as long, because up here the slope isn't so great, and with reduced slope comes a reduced gradient and a reduced step size for our blindfolded hill climber.

Gradient descent is very similar to blindfolded hill climbing, except that instead of climbing up a hill, we're descending into a valley and trying to find the bottom, because we want to minimize a function. That minimization is encoded in the minus sign: instead of adding the gradient, we subtract it. Let's talk about what each of these variables means. a is our current position; if we were the hill climber, it would be right here where he starts off. b is his next position, so this is how the hill climber decides where to go next. The gradient term ∇F(a) tells us the direction of steepest ascent, and the minus sign flips that around: it says, let's actually go in the direction of steepest descent. The gamma, γ, is just a weighting factor, the step size. That's all this formula is doing. It says: take a step down the hill and use that as our new position.

Now, to elaborate on this, we're going to need a little bit of calculus. If you don't know calculus, don't be intimidated; try your best to follow along, and if you can't, that's okay too. I think this captures the essence of what's going on with gradient descent. First, we have to figure out exactly what function we're trying to minimize.
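Before turning to the specific function from the lecture, here is a minimal sketch of the generic update b = a − γ · ∇F(a) in Python. The one-dimensional bowl F(a) = a² (with gradient 2a) and the parameter values are illustrative choices, not from the video:

```python
# Minimal sketch of the generic update b = a - gamma * grad_F(a).
# F, gamma, and the stopping rule here are illustrative assumptions.

def gradient_descent(grad, start, gamma=0.1, steps=100, tol=1e-8):
    """Repeatedly step downhill: b = a - gamma * grad(a)."""
    a = start
    for _ in range(steps):
        b = a - gamma * grad(a)   # the update from the formula
        if abs(b - a) < tol:      # stop once steps get tiny (slope is ~ 0)
            break
        a = b
    return a

# F(a) = a^2 has gradient 2a and its minimum at a = 0.
minimum = gradient_descent(lambda a: 2 * a, start=5.0)
```

Notice how the step size shrinks on its own as the slope flattens, exactly like the blindfolded climber taking smaller steps near the top.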
For the purposes of the lecture, the function we're trying to minimize was a function of y_i:

f(y_i) = α(x_i − y_i)² + β(y_i − y_{i+1})² + β(y_i − y_{i−1})²

So there's a weighting α times (x_i − y_i) squared, plus a weighting β times the squared difference with the next y coordinate, y_{i+1}, and the same for the previous y coordinate, y_{i−1}. We want to minimize this, and we're going to use gradient descent. Gradient descent says we should iterate this process until we reach a sufficiently shallow slope that we're confident we're at the bottom, that we've really found a minimum. So at each step, our new location b becomes y_i′: it equals our old location y_i, minus the gradient, and the gradient here is just the derivative with respect to y_i. Again, if you don't know calculus, don't worry about it, and if you do, this is a fairly simple derivative. I'm going to write it in simplified form rather than show all the intermediate steps. Once we work it out (taking γ = 1), we find:

y_i′ = y_i + 2α(x_i − y_i) + 2β(y_{i+1} + y_{i−1} − 2y_i)

This almost, almost looks like the equation you were given in class. There's one little problem, and that's these 2s. I'm going to do something where I just erase them. You may be saying, hey, you're not allowed to do that, but if we pretend I had originally called these parameters α/2 and β/2, then everything works out just fine. That's gradient descent, and that's where the equation in our update step comes from. This would be a simultaneous update, which is another good way of using gradient descent; it's not how we implemented it in lecture, but that's okay. I hope this was helpful. Good luck!
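As a sketch, the derived update (with γ = 1 and the 2s folded into α and β, as above) can be iterated like this in Python. The data x and the weights α = 0.5, β = 0.25 are illustrative assumptions, not values from the lecture:

```python
# Sketch of the smoothing update derived above, assuming gamma = 1 and
# the factors of 2 absorbed into alpha and beta:
#   y_i' = y_i + alpha*(x_i - y_i) + beta*(y_{i+1} + y_{i-1} - 2*y_i)

def smooth_step(x, y, alpha=0.5, beta=0.25):
    """One simultaneous gradient-descent update of every interior y_i."""
    y_new = y[:]  # simultaneous update: read the old y, write the new y
    for i in range(1, len(y) - 1):
        y_new[i] = (y[i]
                    + alpha * (x[i] - y[i])                     # pull toward the data x_i
                    + beta * (y[i + 1] + y[i - 1] - 2 * y[i]))  # pull toward the neighbors
    return y_new

# Iterating the step smooths y while keeping it near the data x.
x = [0.0, 1.0, 0.0, 1.0, 0.0]
y = x[:]
for _ in range(50):
    y = smooth_step(x, y)
```

Each pass trades off fidelity to the data (the α term) against agreement with the neighbors (the β term), and repeating it until the changes become tiny is exactly the "iterate until the slope is shallow" recipe described above.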

fool proof teaching…awesome!

Isn't gradient just a vector? Why can you write it as a sum of its components ?

Great video for a novice in AI!

How do you superimpose the text on top of your hand? Where is the camera at??

nice round numbers

thank you

you're very helpful

Maybe I missed it, but if Y(sub-i + 1) is the next Y position, and Y(prime) is the next position. What is the difference between Y(sub-i + 1) and Y(prime)?

I saw this thing 5 years ago and I kinda forgot it, but you helped me remember it. Thank you!

great video! i was wondering the entire time during the lecture, where he got that final equation from.

where did the yi come from out in front in the derivation?

Lovely!

this is awesome.

what is yi-1 and yi+1 doing ?

why did you change the signs of yi in the parenthesis?

what happened to gamma in the equation after updating ?

Did you mention the gamma learning rate is 1?

Awesom explanation

I love ur pen

thank you..it was very helpful

your pen is huge

thank you buddy!

very clear. Isn't called gradient ascent? They are the same thing, so no big deal.

thank you for yo effort. But i didn't understand a shit.

great

Thanks man!

what a waste of time….

The 3rd term in the final expression should be negative.

your teaching was great, thanks a lot for it. And your equipment for it was so fantastic... could you guide me about it please?

to the point!

Great Work!!!!

Can you please upload similar video for Conjugate Gradient Algorithm??

Just to clarify something; the gradient is not where you point. The gradient of a differentiable function is always perpendicular to the level curves of the graph of the function. That being said, the red vectors you plot is -grad(f) (at the given point). That's why the minus in the formula shows up. I know that the point of the video is not to rigorously define the algorithm but rather to give an idea of how it works. However, that's a fundamental mistake which is misleading for someone with basic mathematical understanding.