CS229 Lecture Notes
Andrew Ng

Supervised learning

Let's start by talking about a few examples of supervised learning problems. Suppose we have a dataset giving the living areas and prices of 47 houses from Portland, Oregon. Given data like this, how can we learn to predict the prices of other houses in Portland, as a function of the size of their living areas?

To establish notation for future use, we'll use x^(i) to denote the "input" variables (the living area in this example), also called input features, and y^(i) to denote the "output" or target variable that we are trying to predict (the price). A pair (x^(i), y^(i)) is called a training example, and the list of n training examples {(x^(i), y^(i)); i = 1, ..., n} that we'll be using to learn is called a training set. Note that the superscript "(i)" in this notation is simply an index into the training set, and has nothing to do with exponentiation. We will also use X to denote the space of input values, and Y the space of output values; in this example, X = Y = R.

The goal of supervised learning is, given a training set, to learn a function h : X -> Y so that h(x) is a "good" predictor for the corresponding value of y. For historical reasons, this function h is called a hypothesis. Seen pictorially, the process is: a training set is fed to a learning algorithm, which outputs a hypothesis h that takes the living area x of a house and returns a predicted y (the predicted price of the house).

When the target variable that we're trying to predict is continuous, as in our housing example, we call the learning problem a regression problem. When y can take on only a small number of discrete values, we call it a classification problem: for instance, given the living area, we might want to predict whether a dwelling is a house or an apartment; or x^(i) may be some features of a piece of email, and y may be 1 if it is a piece of spam mail and 0 otherwise.
Part I: Linear regression

To make the housing example concrete, here are the first few rows of the training set:

    Living area (feet^2)    Price (1000$s)
    2104                    400
    1600                    330
    2400                    369
    1416                    232
    3000                    540
    ...                     ...

We can plot this data, with living area on one axis and price on the other. To perform supervised learning, we must decide how to represent the hypothesis h. As an initial choice, let's approximate y as a linear function of x: h_θ(x) = θ_0 + θ_1 x_1 (if the number of bedrooms were included as one of the input features as well, we would add a θ_2 x_2 term). Here the θ_j are the parameters. To simplify notation, we adopt the convention of letting x_0 = 1 (the intercept term), so that

    h_θ(x) = Σ_{j=0}^{d} θ_j x_j = θ^T x.

Now, given a training set, how do we pick, or learn, the parameters θ? One reasonable method seems to be to make h(x) close to y, at least for the training examples we have. To formalize this, we define the cost function

    J(θ) = (1/2) Σ_{i=1}^{n} (h_θ(x^(i)) − y^(i))^2,

which measures, for each value of the θ's, how close the h_θ(x^(i))'s are to the corresponding y^(i)'s. If you've seen linear regression before, you may recognize this as the familiar least-squares cost function that gives rise to the ordinary least squares regression model. Whether or not you have seen it previously, let's keep going; we'll eventually show this to be a special case of a much broader family of algorithms.
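To make the hypothesis and cost function concrete, here is a minimal NumPy sketch (the notes themselves contain no code; the array names X, y, theta and the use of only the five rows reproduced above are illustrative choices, not part of the notes):

```python
import numpy as np

# The five training examples tabulated above: living area (feet^2) and price (1000$s).
X = np.array([[2104.0], [1600.0], [2400.0], [1416.0], [3000.0]])
y = np.array([400.0, 330.0, 369.0, 232.0, 540.0])

# Add the intercept term x_0 = 1, so that h_theta(x) = theta^T x.
X = np.hstack([np.ones((X.shape[0], 1)), X])

def h(theta, X):
    """Linear hypothesis h_theta(x) = theta^T x, evaluated for every row of X."""
    return X @ theta

def J(theta, X, y):
    """Least-squares cost J(theta) = (1/2) * sum_i (h_theta(x^(i)) - y^(i))^2."""
    residuals = h(theta, X) - y
    return 0.5 * residuals @ residuals

theta = np.zeros(X.shape[1])
print(J(theta, X, y))  # cost of the all-zeros hypothesis on these five examples
```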
The LMS algorithm

We want to choose θ so as to minimize J(θ). To do so, let's use a search algorithm that starts with some "initial guess" for θ, and that repeatedly changes θ to make J(θ) smaller, until hopefully we converge to a value of θ that minimizes J(θ). Specifically, consider the gradient descent algorithm, which starts with some initial θ and repeatedly performs the update

    θ_j := θ_j − α ∂J(θ)/∂θ_j.

(This update is simultaneously performed for all values of j = 0, ..., d.) Here, α is called the learning rate. This is a very natural algorithm that repeatedly takes a step in the direction of steepest decrease of J. A note on notation: we use "a := b" to denote an operation (in a computer program) in which we set the value of a variable a to be equal to the value of b; that is, the operation overwrites a with the value of b. In contrast, we write "a = b" when we are asserting a statement of fact, that the value of a is equal to the value of b.

In order to implement this algorithm, we have to work out what the partial derivative term on the right hand side is. Let's first work it out for the case of a single training example (x, y), so that we can neglect the sum in the definition of J. This gives, for a single training example, the update rule

    θ_j := θ_j + α (y^(i) − h_θ(x^(i))) x_j^(i).

The rule is called the LMS update rule (LMS stands for "least mean squares"), and is also known as the Widrow-Hoff learning rule. This rule has several properties that seem natural and intuitive. For instance, the magnitude of the update is proportional to the error term (y^(i) − h_θ(x^(i))); thus, if we are encountering a training example on which our prediction nearly matches the actual value of y^(i), then we find that there is little need to change the parameters; in contrast, a larger change to the parameters will be made if our prediction h_θ(x^(i)) has a large error (i.e., if it is very far from y^(i)).

We'd derived the LMS rule for when there was only a single training example. There are two ways to modify this method for a training set of more than one example. The first is to sum the per-example updates over the entire training set on every step; by grouping the updates of the coordinates into an update of the vector θ, the reader can easily verify that the quantity in the resulting summation is just ∂J(θ)/∂θ_j (for the original definition of J), so this is simply gradient descent on the original cost function J. Because it looks at every example in the entire training set on every step, this method is called batch gradient descent. Note that, while gradient descent can be susceptible to local minima in general, the optimization problem we have posed here has only one global minimum and no other local optima, so gradient descent always converges to it (assuming the learning rate α is not too large). Running batch gradient descent to fit θ on our housing dataset, using both the living area and the number of bedrooms as input features, we get θ_0 = 89.60, θ_1 = 0.1392, θ_2 = −8.738.

The second way is the following algorithm: repeatedly run through the training set, and each time we encounter a training example, update the parameters according to the gradient of the error with respect to that single training example only. This algorithm is called stochastic gradient descent (also incremental gradient descent). Whereas batch gradient descent has to scan through the entire training set before taking a single step, a costly operation if n is large, stochastic gradient descent can start making progress right away, and continues to make progress with each example it looks at. Often, stochastic gradient descent gets θ "close" to the minimum much faster than batch gradient descent. (Note however that it may never "converge" to the minimum, and the parameters θ will keep oscillating around the minimum of J(θ); but in practice most of the values near the minimum will be reasonably good approximations to the true minimum. Also, by slowly letting the learning rate α decrease to zero as the algorithm runs, it is possible to ensure that the parameters will converge to the global minimum rather than merely oscillate around it.) For these reasons, particularly when the training set is large, stochastic gradient descent is often preferred over batch gradient descent.
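A minimal sketch of the two variants, written against the NumPy X and y (with intercept column) set up earlier; the function names and default iteration counts are illustrative, not from the notes:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha, iters=10_000):
    """Batch LMS: on every step, sum the update over the whole training set,
    theta := theta + alpha * X^T (y - X theta)."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta += alpha * X.T @ (y - X @ theta)
    return theta

def stochastic_gradient_descent(X, y, alpha, epochs=100):
    """Stochastic (incremental) LMS: update after each training example,
    theta := theta + alpha * (y_i - theta^T x_i) * x_i."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            theta += alpha * (y_i - x_i @ theta) * x_i
    return theta

# With unscaled living areas in the thousands, alpha must be tiny (on the order
# of 1e-8) for these updates to stay stable; rescaling the feature (for example
# dividing by 1000) allows a much larger learning rate.
```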
The normal equations

Gradient descent gives one way of minimizing J. Let's discuss a second way of doing so, this time performing the minimization explicitly and without resorting to an iterative algorithm. In this method, we will minimize J by explicitly taking its derivatives with respect to the θ_j's, and setting them to zero. To enable us to do this without having to write reams of algebra and pages full of matrices of derivatives, let's introduce some notation for doing calculus with matrices. For a function f : R^{n×d} -> R mapping from n-by-d matrices to the real numbers, we define the derivative of f with respect to A to be the n-by-d matrix ∇_A f(A) whose (i, j)-element is ∂f/∂A_{ij}, where A_{ij} denotes the (i, j) entry of the matrix A.

Armed with the tools of matrix derivatives, let us now proceed to find, in closed form, the value of θ that minimizes J(θ). Given a training set, define the design matrix X to be the n-by-d matrix (actually n-by-(d+1), if we include the intercept term) that contains the training examples' input vectors in its rows, and let ~y be the n-dimensional vector containing all the target values y^(i). We can then rewrite J in matrix-vectorial notation as J(θ) = (1/2)(Xθ − ~y)^T (Xθ − ~y). Using the facts ∇_x b^T x = b and ∇_x x^T A x = 2Ax for a symmetric matrix A (for more details, see Section 4.3 of "Linear Algebra Review and Reference"), to minimize J we set its derivatives to zero and obtain the normal equations

    X^T X θ = X^T ~y.

Thus, the value of θ that minimizes J(θ) is given in closed form by the equation

    θ = (X^T X)^{−1} X^T ~y.

Note that in the above step we are implicitly assuming that X^T X is an invertible matrix; this can be checked before calculating the inverse. If either the number of linearly independent examples is fewer than the number of features, or if the features are not linearly independent, then X^T X will not be invertible. Even in such cases, it is possible to "fix" the situation with additional techniques, which we skip here for the sake of simplicity.
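A minimal sketch of the closed-form solution, continuing the NumPy conventions above (the function name is illustrative, and the snippet assumes X^T X is invertible, as discussed in the preceding paragraph):

```python
import numpy as np

def normal_equations(X, y):
    """Closed-form least squares: solves X^T X theta = X^T y.
    Uses a linear solve rather than an explicit matrix inverse; if X^T X is
    singular (e.g. linearly dependent features), np.linalg.lstsq or a
    pseudo-inverse would be needed instead."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# With the design matrix X and targets y from the housing example above,
# normal_equations(X, y) fits the intercept and slope in one step, with no
# learning rate and no iteration.
```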
Probabilistic interpretation

When faced with a regression problem, why might linear regression, and specifically why might the least-squares cost function J, be a reasonable choice? In this section, we will give a set of probabilistic assumptions under which least-squares regression is derived as a very natural algorithm.

Let us assume that the target variables and the inputs are related via the equation

    y^(i) = θ^T x^(i) + ε^(i),

where ε^(i) is an error term that captures either unmodeled effects (such as features very pertinent to predicting housing price that we'd left out of the regression) or random noise. Let us further assume that the ε^(i) are distributed IID (independently and identically distributed) according to a Gaussian distribution (also called a Normal distribution) with mean zero and some variance σ^2. We can write this assumption as "ε^(i) ∼ N(0, σ^2)", and it implies that the distribution of y^(i) given x^(i) and parameterized by θ is y^(i) | x^(i); θ ∼ N(θ^T x^(i), σ^2). The notation "p(y^(i) | x^(i); θ)" indicates that this is the distribution of y^(i) given x^(i) and parameterized by θ; we should not condition on θ (writing "p(y^(i) | x^(i), θ)"), since θ is not a random variable.

Given X (the design matrix, which contains all the x^(i)'s) and θ, what is the distribution of the y^(i)'s? The probability of the data is given by p(~y | X; θ). This quantity is typically viewed as a function of ~y (and perhaps X) for a fixed value of θ. When we wish to explicitly view it as a function of θ, we will instead call it the likelihood function:

    L(θ) = p(~y | X; θ) = Π_{i=1}^{n} p(y^(i) | x^(i); θ),

where the second equality follows from the independence assumption on the ε^(i)'s (and hence also on the y^(i)'s given the x^(i)'s).

Now, given this probabilistic model relating the y^(i)'s and the x^(i)'s, what is a reasonable way of choosing our best guess of the parameters θ? The principle of maximum likelihood says that we should choose θ so as to make the data as high probability as possible; i.e., we should choose θ to maximize L(θ). Instead of maximizing L(θ), we can also maximize any strictly increasing function of L(θ); in particular, the derivations will be a bit simpler if we instead maximize the log likelihood ℓ(θ).
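Spelling out the step the argument relies on (this is just the standard simplification of the Gaussian log likelihood, written in LaTeX):

```latex
\[
\ell(\theta) = \log L(\theta)
  = \log \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}
      \exp\!\left(-\frac{\bigl(y^{(i)}-\theta^{T}x^{(i)}\bigr)^{2}}{2\sigma^{2}}\right)
  = n\,\log\frac{1}{\sqrt{2\pi}\,\sigma}
    \;-\; \frac{1}{\sigma^{2}}\cdot\frac{1}{2}\sum_{i=1}^{n}\bigl(y^{(i)}-\theta^{T}x^{(i)}\bigr)^{2}.
\]
```

Hence, maximizing ℓ(θ) gives the same answer as minimizing (1/2) Σ_i (y^(i) − θ^T x^(i))^2, which is our original least-squares cost J(θ). In other words, under the preceding probabilistic assumptions on the data, least-squares regression is derived as the maximum likelihood estimator of θ.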
Locally weighted linear regression

So far, we've seen a regression example; a classification example comes next. As discussed previously, and as shown in the example above, the choice of features is important to ensuring good performance of a learning algorithm. (When we talk about model selection, we'll also see algorithms for automatically choosing a good set of features.) In this section, let us briefly talk about the locally weighted linear regression (LWR) algorithm which, assuming there is sufficient training data, makes the choice of features less critical. This treatment will be brief, since you'll get a chance to explore some of the properties of the LWR algorithm yourself in the homework.

In the original linear regression algorithm, to make a prediction at a query point x (i.e., to evaluate h(x)), we would fit θ to minimize Σ_i (y^(i) − θ^T x^(i))^2 and then output θ^T x. In contrast, the locally weighted linear regression algorithm does the following: fit θ to minimize Σ_i w^(i) (y^(i) − θ^T x^(i))^2, then output θ^T x. Here, the w^(i)'s are non-negative valued weights. Intuitively, if w^(i) is large for a particular value of i, then in picking θ we'll try hard to make (y^(i) − θ^T x^(i))^2 small; if w^(i) is small, then the (y^(i) − θ^T x^(i))^2 error term will be pretty much ignored in the fit.

A fairly standard choice for the weights is

    w^(i) = exp(−(x^(i) − x)^2 / (2τ^2)),

or, if x is vector-valued, w^(i) = exp(−(x^(i) − x)^T (x^(i) − x) / (2τ^2)), or w^(i) = exp(−(x^(i) − x)^T Σ^{−1} (x^(i) − x) / 2), for an appropriate choice of τ or Σ. Note that the weights depend on the particular point x at which we're trying to evaluate h. Moreover, if |x^(i) − x| is small, then w^(i) is close to 1; and if |x^(i) − x| is large, then w^(i) is small. Hence, θ is chosen giving a much higher "weight" to the (errors on) training examples close to the query point x. (Note also that while the formula for the weights takes a form that is cosmetically similar to the density of a Gaussian distribution, the w^(i)'s do not directly have anything to do with Gaussians, and in particular the w^(i) are not random variables, normally distributed or otherwise.) The parameter τ controls how quickly the weight of a training example falls off with the distance of its x^(i) from the query point x; τ is called the bandwidth parameter, and is also something that you'll get to experiment with in your homework.

Locally weighted linear regression is the first example we're seeing of a non-parametric algorithm. The (unweighted) linear regression algorithm that we saw earlier is known as a parametric learning algorithm, because it has a fixed, finite number of parameters (the θ_j's), which are fit to the data; once we've fit the θ_j's and stored them away, we no longer need to keep the training data around to make future predictions. In contrast, to make predictions using locally weighted linear regression, we need to keep the entire training set around. The term "non-parametric" (roughly) refers to the fact that the amount of stuff we need to keep in order to represent the hypothesis h grows linearly with the size of the training set.
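A minimal sketch of one LWR prediction, continuing the NumPy conventions from the earlier snippets. The function name and the tau default are illustrative, and the plain linear solve assumes enough training points receive non-negligible weight for X^T W X to be invertible:

```python
import numpy as np

def lwr_predict(x_query, X, y, tau=1.0):
    """Locally weighted linear regression prediction at one query point.
    X is assumed to already contain the intercept column of ones, and x_query
    to be a vector of the same width with a leading 1."""
    diffs = X[:, 1:] - x_query[1:]              # distances use the real features only
    w = np.exp(-np.sum(diffs ** 2, axis=1) / (2.0 * tau ** 2))
    XtW = X.T * w                               # scales column i of X^T by w_i
    theta = np.linalg.solve(XtW @ X, XtW @ y)   # solves (X^T W X) theta = X^T W y
    return theta @ x_query

# Example: with the housing X, y built earlier and the living area measured in
# feet^2, something like lwr_predict(np.array([1.0, 2000.0]), X, y, tau=250.0)
# predicts the price near a 2000-square-foot query point.
```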
Part II: Classification and logistic regression

Let's now talk about the classification problem. This is just like the regression problem, except that the values y we now want to predict take on only a small number of discrete values. For now, we will focus on the binary classification problem, in which y can take on only two values, 0 and 1. (Most of what we say here will also generalize to the multiple-class case.) For instance, if we are trying to build a spam classifier for email, then x^(i) may be some features of a piece of email, and y may be 1 if it is a piece of spam mail, and 0 otherwise. 0 is also called the negative class, and 1 the positive class, and they are sometimes also denoted by the symbols "-" and "+". Given x^(i), the corresponding y^(i) is also called the label for the training example.

We could approach the classification problem ignoring the fact that y is discrete-valued, and use our old linear regression algorithm to try to predict y given x. However, it is easy to construct examples where this method performs very poorly. Intuitively, it also doesn't make sense for h_θ(x) to take values larger than 1 or smaller than 0 when we know that y ∈ {0, 1}. To fix this, let's change the form of our hypotheses: we let

    h_θ(x) = g(θ^T x) = 1 / (1 + e^{−θ^T x}),

where g(z) = 1/(1 + e^{−z}) is called the logistic function or the sigmoid function. A useful property of the sigmoid is that its derivative satisfies g′(z) = g(z)(1 − g(z)).

So, given the logistic regression model, how do we fit θ for it? Following how we saw least-squares regression could be derived as the maximum likelihood estimator under a set of assumptions, let's endow our classification model with a set of probabilistic assumptions, and then fit the parameters via maximum likelihood. Let us assume that

    P(y = 1 | x; θ) = h_θ(x),    P(y = 0 | x; θ) = 1 − h_θ(x),

which can be written more compactly as p(y | x; θ) = (h_θ(x))^y (1 − h_θ(x))^{1−y}. Assuming that the n training examples were generated independently, we can then write down the likelihood of the parameters as L(θ) = Π_i p(y^(i) | x^(i); θ). As before, it will be easier to maximize the log likelihood

    ℓ(θ) = Σ_i [ y^(i) log h_θ(x^(i)) + (1 − y^(i)) log(1 − h_θ(x^(i))) ].

How do we maximize the likelihood? Similar to our derivation in the case of linear regression, we can use gradient ascent. Written in vectorial notation, our updates will therefore be given by θ := θ + α ∇_θ ℓ(θ). (Note the positive rather than negative sign in the update formula, since we're maximizing, rather than minimizing, a function now.) Let's start by working with just one training example (x, y), and take derivatives to derive the stochastic gradient ascent rule; using the fact that g′(z) = g(z)(1 − g(z)), this works out to

    θ_j := θ_j + α (y^(i) − h_θ(x^(i))) x_j^(i).

If we compare this to the LMS update rule, we see that it looks identical; but this is not the same algorithm, because h_θ(x^(i)) is now defined as a non-linear function of θ^T x^(i). Nonetheless, it's a little surprising that we end up with the same update rule for a rather different algorithm and learning problem. Is this coincidence, or is there a deeper reason behind this? We'll answer this question when we get to GLM models.

Digression: the perceptron learning algorithm. We now digress to talk briefly about an algorithm that's of some historical interest, and that we will also return to later when we talk about learning theory. Consider modifying the logistic regression method to "force" it to output values that are exactly 0 or 1. To do so, it seems natural to change the definition of g to be the threshold function that outputs 1 if z ≥ 0 and 0 otherwise. If we then let h_θ(x) = g(θ^T x) as before but using this modified definition of g, and if we use the update rule θ_j := θ_j + α (y^(i) − h_θ(x^(i))) x_j^(i), then we have the perceptron learning algorithm.
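A minimal sketch of batch gradient ascent on the logistic regression log likelihood, in the same NumPy style (the function names, learning rate, and iteration count are illustrative; with unscaled features the learning rate would need to be much smaller):

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression(X, y, alpha=0.1, iters=1000):
    """Batch gradient ascent on
    ell(theta) = sum_i [ y_i log h(x_i) + (1 - y_i) log(1 - h(x_i)) ],
    whose gradient is X^T (y - h), giving the update
    theta := theta + alpha * X^T (y - sigmoid(X theta))."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta += alpha * X.T @ (y - sigmoid(X @ theta))
    return theta
```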
Another algorithm for maximizing ℓ(θ): Newton's method

Returning to logistic regression with g(z) being the sigmoid function, let's now talk about a different algorithm for maximizing ℓ(θ). To get us started, consider Newton's method for finding a zero of a function. Specifically, suppose we have some function f : R -> R, and we wish to find a value of θ so that f(θ) = 0, where θ ∈ R is a real number. Newton's method performs the update

    θ := θ − f(θ) / f′(θ).

For a one-dimensional example, suppose the value of θ for which f(θ) = 0 is about 1.3, and we initialize the algorithm at θ = 4.5. Newton's method fits a straight line tangent to f at θ = 4.5 and solves for where that line evaluates to zero, giving the next guess for θ, which is about 2.8. One more iteration updates θ to about 1.8, and after a few more iterations we rapidly approach θ = 1.3. (The notes illustrate these steps with a row of figures, not reproduced here.)

Newton's method gives a way of getting to f(θ) = 0. What if we want to use it to maximize some function ℓ? The maxima of ℓ correspond to points where its first derivative ℓ′(θ) is zero. So, by letting f(θ) = ℓ′(θ), we can use the same algorithm to maximize ℓ, and we obtain the update rule

    θ := θ − ℓ′(θ) / ℓ″(θ).

(Something to think about: how would this change if we wanted to use Newton's method to minimize rather than maximize a function?)

Lastly, in our logistic regression setting, θ is vector-valued, so we need to generalize Newton's method to this setting. The generalization of Newton's method to this multidimensional setting (also called the Newton-Raphson method) is given by

    θ := θ − H^{−1} ∇_θ ℓ(θ).

Here, ∇_θ ℓ(θ) is, as usual, the vector of partial derivatives of ℓ(θ) with respect to the θ_i's; and H is a d-by-d matrix (actually, (d+1)-by-(d+1), assuming that we include the intercept term) called the Hessian, whose entries are given by H_{ij} = ∂^2 ℓ(θ) / ∂θ_i ∂θ_j.

Newton's method typically enjoys faster convergence than (batch) gradient descent, and requires many fewer iterations to get very close to the minimum. One iteration of Newton's can, however, be more expensive than one iteration of gradient descent, since it requires finding and inverting a d-by-d Hessian; but so long as d is not too large, it is usually much faster overall. When Newton's method is applied to maximize the logistic regression log likelihood function ℓ(θ), the resulting method is also called Fisher scoring. (See also the extra credit problem on Q3 of problem set 1.)
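A minimal sketch of the multidimensional update applied to the logistic regression log likelihood, continuing the NumPy conventions above. The gradient and Hessian expressions are the standard ones for this model; the snippet assumes the Hessian stays invertible, and the function name and iteration count are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logistic(X, y, iters=10):
    """Newton's method, theta := theta - H^{-1} grad, for logistic regression:
      grad = X^T (y - h)                    (gradient of ell)
      H    = -X^T diag(h * (1 - h)) X       (Hessian of ell)
    written as a linear solve instead of an explicit inverse."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        h = sigmoid(X @ theta)
        grad = X.T @ (y - h)
        H = -(X.T * (h * (1.0 - h))) @ X    # scales column i of X^T by h_i(1 - h_i)
        theta -= np.linalg.solve(H, grad)
    return theta
```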
Part III: Generalized Linear Models

So far, we've seen a regression example and a classification example. In the regression example we had y | x; θ ∼ N(μ, σ^2), and in the classification one, y | x; θ ∼ Bernoulli(φ), for some appropriate definitions of μ and φ as functions of x and θ. In this section, we will show that both of these methods are special cases of a broader family of models, called Generalized Linear Models (GLMs). We will also show how other models in the GLM family can be derived and applied to other classification and regression problems.

To work our way up to GLMs, we will begin by defining exponential family distributions. We say that a class of distributions is in the exponential family if it can be written in the form

    p(y; η) = b(y) exp(η^T T(y) − a(η)).

Here, η is called the natural parameter (also called the canonical parameter) of the distribution; T(y) is the sufficient statistic (for the distributions we consider, it will often be the case that T(y) = y); and a(η) is the log partition function. The quantity e^{−a(η)} essentially plays the role of a normalization constant, which makes sure the distribution p(y; η) sums/integrates over y to 1. A fixed choice of T, a, and b defines a family (or set) of distributions that is parameterized by η; as we vary η, we then get different distributions within this family.

We now show that the Bernoulli and the Gaussian distributions are examples of exponential family distributions. The Bernoulli distribution with mean φ, written Bernoulli(φ), specifies a distribution over y ∈ {0, 1}, so that p(y = 1; φ) = φ and p(y = 0; φ) = 1 − φ. As we vary φ, we obtain Bernoulli distributions with different means. We now show that this class of Bernoulli distributions, ones obtained by varying φ, is in the exponential family; i.e., that there is a choice of T, a, and b so that the definition above becomes exactly the class of Bernoulli distributions.
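The rewriting the text refers to is the standard one (stated here in LaTeX for reference):

```latex
\[
p(y;\phi) \;=\; \phi^{\,y}(1-\phi)^{1-y}
          \;=\; \exp\bigl(y\log\phi + (1-y)\log(1-\phi)\bigr)
          \;=\; \exp\!\Bigl(\Bigl(\log\tfrac{\phi}{1-\phi}\Bigr)\,y \;+\; \log(1-\phi)\Bigr).
\]
```

Reading off the exponential-family form: the natural parameter is η = log(φ / (1 − φ)), so that φ = 1/(1 + e^{−η}) (the sigmoid function again); T(y) = y; a(η) = −log(1 − φ) = log(1 + e^{η}); and b(y) = 1.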
A similar calculation shows that the Gaussian distribution is also an exponential family distribution, so both of the models we have worked with so far arise from exponential family members. Constructing a GLM from an exponential family distribution recovers our earlier algorithms: the Gaussian gives ordinary least squares, and the Bernoulli gives logistic regression (this also explains why the sigmoid φ = 1/(1 + e^{−η}) appeared in the logistic regression hypothesis). In particular, the fact that the logistic regression update and the LMS update took exactly the same form,

    θ_j := θ_j + α (y^(i) − h_θ(x^(i))) x_j^(i),

is not a coincidence; it is the deeper reason promised earlier, a consequence of both models being GLMs. The same recipe can be used to derive learning algorithms for other classification and regression problems by modeling other members of the exponential family.
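The shared update form can be seen directly in code. This small sketch (the names sgd_update, linear_h, and logistic_h are illustrative) parameterizes one stochastic update by the hypothesis h; the two choices of h give LMS and logistic regression respectively, which remain different algorithms because h_θ(x) is a different function of θ^T x:

```python
import numpy as np

def sgd_update(theta, x, y, alpha, h):
    """One stochastic update: theta := theta + alpha * (y - h(theta, x)) * x."""
    return theta + alpha * (y - h(theta, x)) * x

linear_h = lambda t, x: t @ x                              # LMS / linear regression
logistic_h = lambda t, x: 1.0 / (1.0 + np.exp(-(t @ x)))   # logistic regression

theta = np.zeros(3)
x, y = np.array([1.0, 2.0, -1.0]), 1.0
theta = sgd_update(theta, x, y, alpha=0.1, h=logistic_h)
```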
Pointers to the other CS229 note sets

The remaining note sets referenced throughout this page continue from here:

- Generative learning algorithms (Part IV). So far, we've mainly been talking about learning algorithms that model p(y | x; θ), the conditional distribution of y given x; for instance, logistic regression models p(y | x; θ) as h_θ(x) = g(θ^T x), where g is the sigmoid function. The next set of notes covers generative learning algorithms, which model how the data is generated rather than the conditional distribution directly (the discriminative versus generative distinction).

- Kernel methods (Part V of the notes updated by Tengyu Ma). Feature maps: recall that in our discussion about linear regression, we considered the problem of predicting the price of a house (denoted by y) from the living area of the house (denoted by x), and we fit a linear function of x to the training data; these notes generalize that setup through feature maps.

- Support vector machines (Part V in the original numbering). That set of notes presents the Support Vector Machine (SVM) learning algorithm, with a companion note on properties of the trace and matrix derivatives.

- Deep learning (notes by Andrew Ng and Kian Katanforoosh). These notes begin the study of deep learning: they give an overview of neural networks, discuss vectorization, and discuss training neural networks with backpropagation, starting small and slowly building up a neural network step by step.

- Mixtures of Gaussians and the EM algorithm (Part IX). One set of notes discusses the EM (Expectation-Maximization) algorithm for density estimation as applied to fitting a mixture of Gaussians; the follow-up set gives a broader view of the EM algorithm and shows how it can be applied to a large family of estimation problems with latent variables.

- The k-means clustering algorithm. In the clustering problem, we are given a training set {x^(1), ..., x^(m)} and want to group the data into a few cohesive "clusters." Here, x^(i) ∈ R^n as usual, but no labels y^(i) are given; since we are in the unsupervised learning setting, this is an unsupervised learning problem. (A minimal sketch of the standard procedure follows this list.)

- There are also notes and slides for weak supervision / unsupervised learning, the midterm, and ML advice (Andrew's lecture on getting machine learning algorithms to work in practice), along with a list of previous class projects.
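Since k-means is only named above (it has its own note set), the following is just a hedged sketch of the standard algorithm in the same NumPy style, not the derivation from those notes; the function name, iteration cap, and seeding scheme are illustrative:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Standard k-means: alternately assign each point to its nearest centroid,
    then move each centroid to the mean of the points assigned to it."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # squared distance from every point to every centroid, shape (n, k)
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):          # leave a centroid in place if it loses all points
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels
```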
Course logistics

This professional online course, based on the on-campus Stanford graduate course CS229, features: classroom lecture videos edited and segmented to focus on essential content; coding assignments enhanced with added inline support and milestone code checks; office hours and support from Stanford-affiliated Course Assistants; and a cohort group connected via a vibrant Slack community, providing opportunities to network and collaborate with motivated learners from diverse locations and professions. Lecture videos are organized into "weeks": expect to watch around 10 videos (each roughly 10 minutes) every week, with quizzes (about 10 to 30 minutes to complete) at the end of every week to help you stay up to date and not lose the pace of the class. An adapted version of the course is also offered as part of the Stanford Artificial Intelligence Professional Program.

Time and location: Monday and Wednesday, 4:30pm to 5:50pm; links to the lectures are on Canvas, and the current quarter's class videos are available for both SCPD and non-SCPD students. The syllabus and course schedule are being updated for Spring 2020, and the dates are subject to change as deadlines are worked out. Slides from Andrew's lecture on getting machine learning algorithms to work in practice (advice on applying machine learning) are available from the course site, as is a Chinese translation of these notes (the Kivy-CN/Stanford-CS-229-CN repository, 斯坦福机器学习CS229课程讲义的中文翻译).