
Support Vector Machines
MIT 15.097 Course Notes

Let's start with some intuition about margins. The margin of an example $x_i$ is its distance from the decision boundary, which (for a suitably normalized $f$) equals $y_i f(x_i)$. The margin is positive if the example is on the correct side of the decision boundary, and negative otherwise.

Here's the intuition for SVMs: we want all examples to have large margins, i.e., to be as far from the decision boundary as possible. That way the decision boundary is more stable, and we are confident in all decisions. Most other algorithms (logistic regression, decision trees, the perceptron) don't generally produce large margins. (AdaBoost generally does produce large margins.)

As in logistic regression and AdaBoost, the function $f$ is linear:
$$f(x) = \sum_{j=1}^{m} \theta^{(j)} x^{(j)} + \theta_0.$$
Note that the intercept term could be swept into $x$ by adding a 1 as the last component of each $x$; then $f(x)$ would be just $\theta^T x$. For this lecture we'll keep the intercept term separate, because the SVM handles that term differently than if you put the intercept in as a separate feature. We classify $x$ using $\mathrm{sign}(f(x))$.

If $x_i$ has a large margin, we are confident that we classified it correctly. So we're essentially suggesting to use the margin $y_i f(x_i)$ to measure the confidence in our prediction.

But there is a problem with using $y_i f(x_i)$ to measure confidence in prediction: there is some arbitrariness about it. How? (Rescaling $\theta$ and $\theta_0$ by any positive constant rescales $y_i f(x_i)$ without moving the decision boundary.) What should we do about it?

SVMs maximize the distance from the decision boundary to the nearest training example; that is, they maximize the minimum margin. There is a geometric perspective too. I set the intercept to zero for this picture (so the decision boundary passes through the origin): the decision boundary is the set of $x$'s where $\theta^T x = 0$.
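Before continuing with the geometry, the arbitrariness just mentioned can be checked numerically. Here is a minimal sketch with made-up numbers (not from the notes): rescaling $(\theta, \theta_0)$ changes the functional margin but not the distance to the boundary.

```python
import math

# Hypothetical numbers for illustration (not from the notes).
theta, theta0 = [2.0, -1.0], 0.5        # candidate f(x) = theta . x + theta0
x, y = [1.0, 3.0], -1                   # one labeled example

def functional_margin(theta, theta0, x, y):
    return y * (sum(t * xj for t, xj in zip(theta, x)) + theta0)

def geometric_margin(theta, theta0, x, y):
    # Signed distance from x to the hyperplane theta . x + theta0 = 0.
    return functional_margin(theta, theta0, x, y) / math.hypot(*theta)

# Scaling (theta, theta0) by c > 0 leaves the decision boundary (and the
# geometric margin) unchanged but multiplies the functional margin by c:
c = 10.0
big_theta, big_theta0 = [c * t for t in theta], c * theta0
print(functional_margin(theta, theta0, x, y))            # 0.5
print(functional_margin(big_theta, big_theta0, x, y))    # 5.0
print(geometric_margin(theta, theta0, x, y))             # 0.5 / sqrt(5)
print(geometric_margin(big_theta, big_theta0, x, y))     # same value
```

This is exactly why the derivation below normalizes by $\|\theta\|_2$.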
That means the unit vector for $\theta$ must be perpendicular to those $x$'s that lie on the decision boundary. Now that you have the intuition, we'll put the intercept back, and we have to translate the decision boundary, so it's really the set of $x$'s where $\theta^T x + \theta_0 = 0$.

The margin of example $i$ is denoted $\gamma_i$. Let $B$ be the point on the decision boundary closest to the positive example $x_i$. Then
$$B = x_i - \gamma_i \frac{\theta}{\|\theta\|_2},$$
since we moved $\gamma_i$ units along the unit vector $\theta / \|\theta\|_2$ to get from the example to $B$. Since $B$ lies on the decision boundary, it obeys $\theta^T x + \theta_0 = 0$, where $x$ is $B$ (I wrote the intercept there explicitly). That is,
$$\theta^T \Big( x_i - \gamma_i \frac{\theta}{\|\theta\|_2} \Big) + \theta_0 = 0,$$
and solving for $\gamma_i$,
$$\gamma_i = \Big( \frac{\theta}{\|\theta\|_2} \Big)^T x_i + \frac{\theta_0}{\|\theta\|_2}.$$
Note that here we normalized by $\|\theta\|_2$ so we wouldn't have the arbitrariness in the meaning of the margin. If the example is negative, the same calculation works, with a few sign flips (we'd need to move $-\gamma_i$ units rather than $\gamma_i$ units). So the geometric margin from the picture,
$$\gamma_i = y_i \Big( \Big( \frac{\theta}{\|\theta\|_2} \Big)^T x_i + \frac{\theta_0}{\|\theta\|_2} \Big),$$
is the same as the functional margin $y_i f(x_i)$ once $\theta$ is normalized.

Maximize the minimum margin

Support vector machines maximize the minimum margin. They would like all examples to be far from the decision boundary, so they'll choose $f$ this way:
$$\max_{\gamma, \theta, \theta_0} \gamma \quad \text{s.t.} \quad y_i \frac{\theta^T x_i + \theta_0}{\|\theta\|_2} \ge \gamma, \quad i = 1 \ldots m.$$
For any $\theta$ and $\theta_0$ that satisfy this, any positively scaled multiple satisfies them too, so we can arbitrarily set $\|\theta\|_2 = 1/\gamma$ so that the right side is 1. Now when we maximize $\gamma$, we're maximizing $\gamma = 1 / \|\theta\|_2$. The problem becomes
$$\min_{\theta, \theta_0} \frac{1}{2} \|\theta\|_2^2 \quad \text{s.t.} \quad y_i (\theta^T x_i + \theta_0) - 1 \ge 0, \quad i = 1 \ldots m \qquad (1)$$
(the 1/2 and square are just for convenience), which is the same as
$$\min_{\theta, \theta_0} \frac{1}{2} \|\theta\|_2^2 \quad \text{s.t.} \quad -\big[ y_i (\theta^T x_i + \theta_0) - 1 \big] \le 0, \quad i = 1 \ldots m.$$
Writing the KKT conditions, starting with Lagrangian stationarity, where we need to find the gradient with respect to $\theta$ and the derivative with respect to $\theta_0$:
$$\nabla_\theta L([\theta, \theta_0], \alpha) = \theta - \sum_{i=1}^m \alpha_i y_i x_i = 0 \;\Longrightarrow\; \theta = \sum_{i=1}^m \alpha_i y_i x_i.$$
$$\frac{\partial}{\partial \theta_0} L([\theta, \theta_0], \alpha) = -\sum_{i=1}^m \alpha_i y_i = 0 \;\Longrightarrow\; \sum_{i=1}^m \alpha_i y_i = 0.$$
$$\alpha_i \ge 0 \qquad \text{(dual feasibility)}$$
$$\alpha_i \big[ -y_i (\theta^T x_i + \theta_0) + 1 \big] = 0 \qquad \text{(complementary slackness)}$$
$$-y_i (\theta^T x_i + \theta_0) + 1 \le 0. \qquad \text{(primal feasibility)}$$
Using the KKT conditions, we can simplify the Lagrangian:
$$L([\theta, \theta_0], \alpha) = \frac{1}{2} \|\theta\|_2^2 - \theta^T \sum_{i=1}^m \alpha_i y_i x_i - \sum_{i=1}^m \alpha_i y_i \theta_0 + \sum_{i=1}^m \alpha_i \qquad (2)$$
(We just expanded terms.
Now we'll plug in the first KKT condition.)
$$= -\frac{1}{2} \|\theta\|_2^2 - \theta_0 \sum_{i=1}^m \alpha_i y_i + \sum_{i=1}^m \alpha_i$$
(Plug in the second KKT condition.)
$$= -\frac{1}{2} \|\theta\|_2^2 + \sum_{i=1}^m \alpha_i.$$
Again using the first KKT condition, we can rewrite the first term:
$$\frac{1}{2} \|\theta\|_2^2 = \frac{1}{2} \sum_{i,k} \alpha_i \alpha_k y_i y_k x_i^T x_k.$$
Plugging back into the Lagrangian (2), which now only depends on $\alpha$, and putting in the second and third KKT conditions gives us the dual problem:
$$\max_\alpha L(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i,k} \alpha_i \alpha_k y_i y_k x_i^T x_k \quad \text{s.t.} \quad \alpha_i \ge 0, \; i = 1 \ldots m, \quad \sum_{i=1}^m \alpha_i y_i = 0. \qquad (3)$$
We'll use the last two KKT conditions in what follows, for instance to get conditions on $\theta_0$, but what we've already done is enough to define the dual problem for $\alpha$.

We can solve this dual problem. Either (i) we'd use a generic quadratic programming solver, or (ii) we'd use another algorithm, like SMO, which I will discuss later. For now, assume we solved it, so we have $\alpha_1^*, \ldots, \alpha_m^*$. We can use the solution of the dual problem to get the solution of the primal problem: plug $\alpha^*$ into the first KKT condition to get
$$\theta^* = \sum_{i=1}^m \alpha_i^* y_i x_i, \qquad (4)$$
but we can see something cool in the process.

Support Vectors

Look at the complementary slackness KKT condition and the primal and dual feasibility conditions. For each $i$, either $\alpha_i > 0$, in which case the constraint is active and $y_i(\theta^T x_i + \theta_0) = 1$; or $\alpha_i = 0$, in which case $y_i(\theta^T x_i + \theta_0) \ge 1$. The examples in the first category, for which the scaled margin is 1 and the constraints are active, are called support vectors. They are the closest to the decision boundary.

Finish What We Were Doing Earlier

To get $\theta_0^*$, use the complementarity condition for any of the support vectors (in other words, use the fact that the unnormalized margin of the support vectors is one). If you take a positive support vector ($y_i = 1$), then
$$\theta_0^* = 1 - \theta^{*T} x_i.$$
Written another way, since the support vectors have the smallest margins,
$$\theta_0^* = 1 - \min_{i : y_i = 1} \theta^{*T} x_i.$$
So that's the solution! Just to recap: to get the scoring function $f^*$ for the SVM, you'd compute $\alpha^*$ from the dual problem (3), plug it into (4) to get $\theta^*$, plug that into the equation above to get $\theta_0^*$, and that's the solution to the primal problem, and the coefficients for $f^*$.

[Figure: the support vectors. Image by MIT OpenCourseWare.]

Because of the form of the solution,
$$f^*(x) = \sum_{i=1}^m \alpha_i^* y_i \, x_i^T x + \theta_0^*,$$
it is possible that $f^*$ is very fast to calculate. Why is that?
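As a quick aside (not in the original notes), the recap can be traced end to end on a made-up two-point dataset, where the constraint $\sum_i \alpha_i y_i = 0$ reduces the dual to a single variable:

```python
# Toy problem (hypothetical data, not from the notes):
# x1 = (1, 0) with y1 = +1, and x2 = (-1, 0) with y2 = -1.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

X = [[1.0, 0.0], [-1.0, 0.0]]
y = [1, -1]

def dual(alpha):
    """L(alpha) = sum_i alpha_i - 1/2 sum_{i,k} alpha_i alpha_k y_i y_k x_i^T x_k."""
    quad = sum(alpha[i] * alpha[k] * y[i] * y[k] * dot(X[i], X[k])
               for i in range(2) for k in range(2))
    return sum(alpha) - 0.5 * quad

# The constraint sum_i alpha_i y_i = 0 forces alpha_1 = alpha_2 = a here, and
# L(a, a) = 2a - 2a^2, which is maximized at a = 1/2. Check it on a grid:
grid = [k / 1000.0 for k in range(2001)]
best = max(grid, key=lambda a: dual([a, a]))     # 0.5

# Recover the primal solution via the KKT conditions:
alpha = [best, best]
theta = [sum(alpha[i] * y[i] * X[i][d] for i in range(2)) for d in range(2)]  # eq. (4)
theta0 = 1.0 - min(dot(theta, X[i]) for i in range(2) if y[i] == 1)
# theta = [1, 0] and theta0 = 0: the boundary is the vertical axis, and both
# points are support vectors with unnormalized margin exactly 1.
```

Both points end up with active constraints, matching the complementary-slackness picture above.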
Think support vectors.

The Nonseparable Case

If there is no separating hyperplane, there is no feasible solution to the problem we wrote above, and most real problems are nonseparable. Let's fix our SVM so it can accommodate the nonseparable case. The new formulation will penalize mistakes the farther they are from the decision boundary. So we are allowed to make mistakes now, but we pay a price.

Let's change our primal problem (1) to this new primal problem:
$$\min_{\theta, \theta_0, \xi} \frac{1}{2} \|\theta\|_2^2 + C \sum_{i=1}^m \xi_i \quad \text{s.t.} \quad y_i (\theta^T x_i + \theta_0) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1 \ldots m. \qquad (5)$$
So the constraints allow some slack of size $\xi_i$, but we pay a price for it in the objective. That is, if $y_i f(x_i) \ge 1$ then $\xi_i$ gets set to 0 and the penalty is 0. Otherwise, $y_i f(x_i) = 1 - \xi_i$ and we pay price $\xi_i$. The parameter $C$ trades off between the twin goals of making $\frac{1}{2}\|\theta\|_2^2$ small (making what-was-the-minimum-margin $1/\|\theta\|_2$ large) and ensuring that most examples have margin at least $1/\|\theta\|_2$.

Going on a Little Tangent

Rewrite the penalty another way: if $y_i f(x_i) \ge 1$, zero penalty; else, pay price $\xi_i = 1 - y_i f(x_i)$. Third time's the charm: pay price
$$\xi_i = \lfloor 1 - y_i f(x_i) \rfloor_+,$$
where the notation $\lfloor z \rfloor_+$ means the maximum of $z$ and 0. Equation (5) becomes
$$\min_{\theta, \theta_0} \frac{1}{2} \|\theta\|_2^2 + C \sum_{i=1}^m \lfloor 1 - y_i f(x_i) \rfloor_+.$$
Does that look familiar?

The Dual for the Nonseparable Case

Form the Lagrangian of (5):
$$L(\theta, \theta_0, \xi, \alpha, r) = \frac{1}{2} \|\theta\|_2^2 + C \sum_{i=1}^m \xi_i - \sum_{i=1}^m \alpha_i \big[ y_i (\theta^T x_i + \theta_0) - 1 + \xi_i \big] - \sum_{i=1}^m r_i \xi_i,$$
where the $\alpha_i$'s and $r_i$'s are Lagrange multipliers (constrained to be $\ge 0$). The dual turns out to be (after some work):
$$\max_\alpha \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i,k} \alpha_i \alpha_k y_i y_k x_i^T x_k \quad \text{s.t.} \quad 0 \le \alpha_i \le C, \; i = 1 \ldots m, \quad \sum_{i=1}^m \alpha_i y_i = 0. \qquad (6)$$
So the only difference from the original dual problem (3) is that $0 \le \alpha_i$ was changed to $0 \le \alpha_i \le C$. Neat!

Solving the dual problem with SMO

SMO (Sequential Minimal Optimization) is a type of coordinate ascent algorithm, but adapted to the SVM so that the solution always stays within the feasible region.

Start with (6). Let's say you want to hold $\alpha_2, \ldots, \alpha_m$ fixed and take a coordinate step in the first direction; that is, change $\alpha_1$ to maximize the objective in (6). Can we make any progress? Can we get a better feasible solution by doing this?

It turns out, no. Look at the constraint in (6), $\sum_{i=1}^m \alpha_i y_i = 0$. This means
$$\alpha_1 = -y_1 \sum_{i=2}^m \alpha_i y_i,$$
so, since $\alpha_2, \ldots, \alpha_m$ are fixed, $\alpha_1$ is also fixed. So, if we want to update any of the $\alpha_i$'s, we need to update at least two of them simultaneously to keep the solution feasible (i.e., to keep the constraints satisfied).

Start with a feasible vector $\alpha$. Let's update $\alpha_1$ and $\alpha_2$, holding $\alpha_3, \ldots, \alpha_m$ fixed. What values of $\alpha_1$ and $\alpha_2$ are we allowed to choose? Again, the constraint is $\alpha_1 y_1 + \alpha_2 y_2 = -\sum_{i=3}^m \alpha_i y_i =: \zeta$ (a fixed constant). We are only allowed to choose $\alpha_1, \alpha_2$ on this line, so when we pick $\alpha_2$, we get $\alpha_1$ automatically from
$$\alpha_1 = \frac{1}{y_1} (\zeta - \alpha_2 y_2) = y_1 (\zeta - \alpha_2 y_2)$$
($1/y_1 = y_1$ since $y_1 \in \{+1, -1\}$). Also, the other constraints in (6) say $0 \le \alpha_1, \alpha_2 \le C$. So $\alpha_2$ needs to be within some interval $[L, H]$ (in order for $\alpha_1$ to stay within $[0, C]$), where we will always have $0 \le L, H \le C$. To do the coordinate ascent step, we will optimize the objective over $\alpha_2$, keeping it within $[L, H]$. Intuitively, (6) becomes:
$$\max_{\alpha_2 \in [L, H]} \; \alpha_1 + \alpha_2 + \text{constants} - \frac{1}{2} \sum_{i,k} \alpha_i \alpha_k y_i y_k x_i^T x_k, \quad \text{where } \alpha_1 = y_1 (\zeta - \alpha_2 y_2). \qquad (7)$$
The objective is quadratic in $\alpha_2$. This means we can just set its derivative to 0 to optimize it and get $\alpha_2$ for the next iteration of SMO. If the optimal value is outside of $[L, H]$, just choose $\alpha_2$ to be either $L$ or $H$ for the next iteration. For instance, if a plot of (7)'s objective is a downward-opening parabola whose peak lies outside $[L, H]$ (sometimes it doesn't look like this; sometimes it's upside-down), then we'll choose the endpoint of $[L, H]$ nearest the peak. [Figure omitted. Image by MIT OpenCourseWare.]

Note: there are heuristics for choosing the order in which the $\alpha_i$'s are updated.

MIT OpenCourseWare
https://ocw.mit.edu
15.097 Prediction: Machine Learning and Statistics, Spring 2012
For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms.
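As an appendix-style sketch (not part of the original notes): the pair update described above (pick two multipliers, maximize the quadratic in $\alpha_2$, clip to $[L, H]$, and recover $\alpha_1$ from the linear constraint) can be coded directly. The dataset and all names here are invented for illustration.

```python
# A sketch of SMO-style pairwise coordinate ascent for the dual (6).
# Cyclic over all pairs for simplicity; real SMO uses selection heuristics.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def smo_toy(X, y, C=10.0, sweeps=50):
    m = len(X)
    K = [[dot(X[i], X[k]) for k in range(m)] for i in range(m)]
    alpha = [0.0] * m
    for _ in range(sweeps):
        for i in range(m):
            for j in range(m):
                if i == j:
                    continue
                # "Errors" E_k = f(x_k) - y_k; the intercept cancels in
                # E_i - E_j, so it can be ignored here and recovered later.
                f = [sum(alpha[k] * y[k] * K[k][l] for k in range(m)) for l in range(m)]
                Ei, Ej = f[i] - y[i], f[j] - y[j]
                eta = K[i][i] + K[j][j] - 2.0 * K[i][j]   # curvature in alpha_j
                if eta <= 1e-12:
                    continue
                # Endpoints [L, H] for alpha_j so alpha_i stays in [0, C]
                # on the line alpha_i y_i + alpha_j y_j = constant.
                if y[i] == y[j]:
                    L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
                else:
                    L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
                if L >= H:
                    continue
                # Unconstrained maximizer of the quadratic, clipped to [L, H].
                aj = min(H, max(L, alpha[j] + y[j] * (Ei - Ej) / eta))
                alpha[i] += y[i] * y[j] * (alpha[j] - aj)  # keeps sum alpha_i y_i = 0
                alpha[j] = aj
    theta = [sum(alpha[i] * y[i] * X[i][d] for i in range(m)) for d in range(len(X[0]))]
    sv = [i for i in range(m) if alpha[i] > 1e-8]          # support vectors
    theta0 = sum(y[i] - dot(theta, X[i]) for i in sv) / len(sv)
    return theta, theta0, alpha

# Linearly separable toy data (hypothetical):
X = [[3.0, 0.0], [4.0, 1.0], [0.0, 0.0], [-1.0, 1.0]]
y = [1, 1, -1, -1]
theta, theta0, alpha = smo_toy(X, y)
# All training margins y_i (theta^T x_i + theta0) come out >= 1 here.
```

On this toy problem the nearest points on each side end up as the support vectors, and every update preserves both the box constraints and the equality constraint, which is the whole point of moving two multipliers at once.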
