[AI Lecture Review] 6 - Gradient Descent


파요요요 2022. 5. 9. 15:13

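In this post,

we look at gradient descent, the algorithm that efficiently updates the parameters w and b

by heading along the most direct route toward the lowest point of the logistic regression cost function, which measures the model's error,

that is, the point where the error (the function value) is smallest.

--> In the graph above, it is the algorithm for finding the red dot (global optimum), where the error (function value) is lowest.

As before, let's go through the lecture material piece by piece.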

You've seen the logistic regression model, you've seen the loss function that measures how well you're doing on the single training example.

We have learned about the model trained with logistic regression (= y-hat)

and the loss function, which tells us how well that model is doing on a single training example.

Student A's question: What is the difference between a loss function and a cost function?

Answer --> The loss function measures the error on a single training example,

while the cost function measures the error over the entire training set.
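The difference can be sketched in a few lines of Python. This is a minimal illustration that assumes the cross-entropy loss from the logistic regression cost function lecture; the function names `loss` and `cost`, and the predictions and labels, are made up for this example.

```python
import math

def loss(y_hat, y):
    """Cross-entropy loss for ONE training example."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def cost(y_hats, ys):
    """Cost: the average of the per-example losses over the whole training set."""
    return sum(loss(y_hat, y) for y_hat, y in zip(y_hats, ys)) / len(ys)

# Hypothetical predictions (y-hat) and ground truth labels, for illustration only.
y_hats = [0.9, 0.2, 0.6]
ys = [1, 0, 1]

single_loss = loss(y_hats[0], ys[0])   # error on the first example only
total_cost = cost(y_hats, ys)          # error averaged over all three examples

print("loss on example 1:", single_loss)
print("cost on the whole set:", total_cost)
```

The loss looks at one (y-hat, y) pair at a time, while the cost is simply the mean of those losses.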

You've also seen the cost function that measures how well your parameters W and B are doing on your entire training set.

We also learned the cost function, which is used over the entire training set,

and that it measures how accurate the training is for given values of the parameters w (weights) and b (bias).

Now let's talk about how you can use the gradient descent algorithm to train or to learn the parameters W on your training set.

... y-hat(i) on each of the training examples stacks up or compares to the ground truth label y(i) on each of the training examples, and the full formula is expanded out on the right. So the cost function measures how well your parameters w and b are doing on the training set.

So in order to learn a set of parameters w and b, it seems natural that we want to find w and b that make the cost function J of w, b as small as possible.

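The value of the cost function J(w, b) represents the error of the model we are training,

so it seems natural to look for the values of w and b that make the cost function (the model's error) as small as possible.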

So, here's an illustration of gradient descent.

In this diagram, the horizontal axes represent your space of parameters w and b. In practice w can be much higher dimensional, but for the purposes of plotting, let's illustrate w as a single real number and b as a single real number.

The figure above illustrates gradient descent.

The horizontal axes are the w axis and the b axis.

To make the plot easy to draw, w and b are each assumed to be a single number here.

The cost function J of w, b is then some surface above these horizontal axes w and b. So the height of the surface represents the value of J of w, b at a certain point. And what we want to do is really to find the value of w and b that corresponds to the minimum of the cost function J.

Our goal in the figure above is to find the point where the cost function J reaches its minimum,

and the values of w and b that correspond to it.

It turns out that this particular cost function J is a convex function.

So it's just a single big bowl, so this is a convex function, and this is as opposed to functions that look like this, which are non-convex and have lots of different local optima. So the fact that our cost function J of w, b as defined here is convex is one of the huge reasons why we use this particular cost function J for logistic regression.

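Please refer to the figure above.

The cost function J is a convex function, shaped like one big bowl.

In contrast to the second graph, which has several local minima, it has only a single minimum.

--> Because this is a practice model with no hidden layers, there is a single minimum. (In actual deep learning the cost function has many local minima, and finding which of them is the lowest remains an open problem.)

With gradient descent, we move along the graph above toward the red dot, the point of minimum loss, by the quickest route.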

So to find a good value for the parameters, what we'll do is initialize w and b to some initial value, maybe denoted by that little red dot.

We will keep adjusting the values of w and b

to find the red dot with the minimum value (the global optimum) and the values of w and b that correspond to it.

And for logistic regression, almost any initialization method works. Usually you initialize the values to 0. Random initialization also works, but people don't usually do that for logistic regression.

But because this function is convex, no matter where you initialize, you should get to the same point or roughly the same point.

And what gradient descent does is it starts at that initial point and then takes a step in the steepest downhill direction. So after one step of gradient descent, you might end up there, because it's trying to take a step downhill in the direction of steepest descent, or as quickly downhill as possible. So that's one iteration of gradient descent.

For logistic regression, the parameters are usually initialized to 0.

--> That is, we set the parameters to 0 at the starting point and then search for suitable values from there.

Because the cost function J is a convex function,

no matter where we start, going downhill eventually leads to the same point.

---> That point is the red dot (global optimum).
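To see why the starting point does not matter for a convex function, here is a small sketch. It does not use the lecture's cost function; it assumes a stand-in convex function J(w) = (w - 3)**2, and the helper names and starting values are hypothetical.

```python
def dJ_dw(w):
    # Derivative of the stand-in convex cost J(w) = (w - 3)**2 (minimum at w = 3).
    return 2 * (w - 3)

def gradient_descent(w_init, alpha=0.1, num_iters=200):
    """Run a fixed number of gradient descent steps starting from w_init."""
    w = w_init
    for _ in range(num_iters):
        w = w - alpha * dJ_dw(w)
    return w

# Three very different starting points, including the usual choice of 0.
starts = (0.0, -10.0, 25.0)
results = [gradient_descent(w0) for w0 in starts]

for w0, w_final in zip(starts, results):
    print(f"start at {w0:6.1f} -> ends near {w_final:.6f}")
```

All three runs end up at (roughly) the same w, which is the behavior the lecture describes for the bowl-shaped cost function.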

What the gradient descent algorithm does is

take a step from that initial point in the steepest downhill direction,

toward the lowest point, the red dot (global optimum).

(As in the figure below, each step moves along the steepest path.)

Let's write a little bit more of the details. For the purpose of illustration, let's say that there's some function J of w that you want to minimize, and maybe that function looks like this.

To make this easier to draw, I'm going to ignore b for now, just to make this a one-dimensional plot instead of a higher-dimensional plot. So gradient descent does this.

To make it easier to follow, the figure above leaves out b and shows only w and J(w).

Let's use this figure to walk through gradient descent.

We're going to repeatedly carry out the following update.

We'll take the value of w and update it. Going to use colon equals to represent updating w.

So set w to w minus alpha times, and this is a derivative, dJ(w)/dw. And we repeatedly do that until the algorithm converges. So a couple of points on the notation: alpha here is the learning rate, and controls how big a step we take on each iteration of gradient descent. We'll talk later about some ways of choosing the learning rate alpha. And second, this quantity here, this is a derivative.

We will keep updating w.

The update is w := w - α * (dJ(w)/dw).

We apply this gradient descent step over and over until it converges to the lowest point.

The variable α, which controls how far each gradient descent step moves, is called the learning rate.
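The update rule and the role of the learning rate can be tried out with a short sketch. The cost function J(w) = w**2 - 4w + 5, the stopping tolerance, and the two values of α below are all assumptions chosen for illustration, not values from the lecture.

```python
def J(w):
    # Hypothetical bowl-shaped cost with its minimum at w = 2.
    return w ** 2 - 4 * w + 5

def dJ_dw(w):
    # Derivative of J with respect to w.
    return 2 * w - 4

def run(alpha, w=0.0, tol=1e-8, max_iters=10_000):
    """Repeat w := w - alpha * dJ/dw until the step size drops below tol."""
    steps = 0
    while steps < max_iters:
        step = alpha * dJ_dw(w)
        w = w - step
        steps += 1
        if abs(step) < tol:   # treat a tiny step as "converged"
            break
    return w, steps

w_small, n_small = run(alpha=0.01)   # small learning rate: small steps
w_large, n_large = run(alpha=0.3)    # larger learning rate: bigger steps

print(f"alpha=0.01 -> w={w_small:.6f}, J={J(w_small):.6f}, iterations={n_small}")
print(f"alpha=0.3  -> w={w_large:.6f}, J={J(w_large):.6f}, iterations={n_large}")
```

Both runs reach roughly the same w, but the larger α needs far fewer iterations on this function. A learning rate that is too large can overshoot the minimum instead, which is why choosing α gets its own discussion later in the course.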

-----------> Student A's question:

"The change in J(w) per change in w is the slope, so why do we use the slope to set the w coordinate?"

-------> Student B's answer: Please see the link below! (It explains gradient descent in simple terms.)

https://angeloyeo.github.io/2020/08/16/gradient_descent.html

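We want to reach the point with the minimum value.

To do that, whenever we choose a new value of w, J(w) has to decrease.

For example,

if at some point J(w) increases as w increases, we should decrease w at that point.

Conversely, if at some point J(w) increases as w decreases, we should increase w at that point.

Summary: when deciding whether to increase or decrease w at a given point,

we use the slope of the function (the derivative) to choose the direction.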
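The two cases can be checked numerically. This sketch assumes a stand-in cost J(w) = (w - 3)**2 with its minimum at w = 3; the starting points and α are arbitrary.

```python
def dJ_dw(w):
    # Derivative of the stand-in cost J(w) = (w - 3)**2.
    return 2 * (w - 3)

alpha = 0.1

# Right of the minimum: J rises as w rises, so the slope is positive
# and subtracting alpha * slope makes w smaller.
w_right = 5.0
new_right = w_right - alpha * dJ_dw(w_right)

# Left of the minimum: J rises as w falls, so the slope is negative
# and subtracting alpha * slope makes w larger.
w_left = 1.0
new_left = w_left - alpha * dJ_dw(w_left)

print("slope at w=5:", dJ_dw(w_right), "-> w moves to", new_right)
print("slope at w=1:", dJ_dw(w_left), "-> w moves to", new_left)
```

In both cases the single rule w := w - α * (dJ(w)/dw) moves w toward the minimum, because the sign of the slope flips the direction of the step.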

So when you write code, you write something like w equals, or colon equals, w minus alpha times dw. So we use dw to be the variable name to represent this derivative term. Now, let's just make sure that this gradient descent update makes sense.

In code (in Python),

the derivative dJ(w)/dw is stored in a variable named dw.

Even when J(w) becomes the original function J(w, b),

the derivative is written dJ(w, b)/dw, and this is still the variable dw.

Likewise for b, the derivative dJ(w, b)/db is stored in a variable named db.
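As a rough sketch of how dw and db show up in Python, here is gradient descent for logistic regression with a single feature. The gradient formulas used (the average of (a - y) * x for dw, and the average of (a - y) for db) are the standard ones for the cross-entropy cost and are not derived in this post; the toy data, the learning rate, and the helper name `cost_and_grads` are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost_and_grads(w, b, xs, ys):
    """Return J(w, b) together with dw = dJ/dw and db = dJ/db."""
    m = len(xs)
    J = dw = db = 0.0
    for x, y in zip(xs, ys):
        a = sigmoid(w * x + b)                               # prediction y-hat
        J += -(y * math.log(a) + (1 - y) * math.log(1 - a))  # loss on this example
        dw += (a - y) * x                                    # this example's share of dJ/dw
        db += (a - y)                                        # this example's share of dJ/db
    return J / m, dw / m, db / m

# Hypothetical toy data: one feature x, binary label y.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

w, b = 0.0, 0.0   # zero initialization
alpha = 0.5       # learning rate, an arbitrary choice

initial_cost, _, _ = cost_and_grads(w, b, xs, ys)

for _ in range(100):
    _, dw, db = cost_and_grads(w, b, xs, ys)
    w = w - alpha * dw   # w := w - alpha * dw
    b = b - alpha * db   # b := b - alpha * db

final_cost, _, _ = cost_and_grads(w, b, xs, ys)
print("cost before:", initial_cost)
print("cost after: ", final_cost)
```

The two update lines inside the loop are the code form of the rule from the lecture, applied once to w and once to b on every iteration.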

From now on we will use derivatives often,

so for anyone who is not familiar with them, I will take some time to explain derivatives. Don't worry.

In the next post,

we will look at derivatives and the computation graph.

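All of this content summarizes and re-explains, in simpler terms, Professor Andrew Ng's lectures on Coursera,

and much has been omitted or altered, so I recommend taking the lectures yourself!

This post was written for educational, non-commercial purposes, for people in Korea who want to learn about artificial intelligence.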
