Computing the Gradient of RSS
It can be shown that the RSS function is convex, so there is a single unique solution to the minimization problem AND the gradient descent algorithm will converge to that minimum.
Remember our RSS function is the sum of the squares of the differences between our predicted and observed values:

$$RSS(w) = \sum_{i=1}^{N} \left(y_i - \left[w_0 + w_1 x_i\right]\right)^2$$
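As a quick illustration, here is a minimal NumPy sketch of this RSS computation; the function name `rss` and the toy arrays `x` and `y` are just placeholders for your own data.

```python
import numpy as np

def rss(w0, w1, x, y):
    """Residual sum of squares for the simple linear model y_hat = w0 + w1 * x."""
    predictions = w0 + w1 * x      # predicted value for each observation
    residuals = y - predictions    # observed minus predicted
    return np.sum(residuals ** 2)  # sum of squared differences

# Toy data set (placeholder values)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.9, 5.1, 7.2])
print(rss(1.0, 2.0, x, y))
```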
ASIDE: When doing this, it is helpful to remember that the derivative of a sum is the same as the sum of the derivatives of each of the components of the sum:

$$\frac{d}{dx}\left[f(x) + g(x)\right] = \frac{d\,f(x)}{dx} + \frac{d\,g(x)}{dx}$$
or simply:

$$\frac{d}{dx}\sum_{i} f_i(x) = \sum_{i} \frac{d\,f_i(x)}{dx}$$
In our case of the RSS function, this means that differentiating with respect to either coefficient $w_j$ amounts to differentiating each squared-error term and summing the results:

$$\frac{\partial}{\partial w_j} RSS(w) = \sum_{i=1}^{N} \frac{\partial}{\partial w_j}\left(y_i - \left[w_0 + w_1 x_i\right]\right)^2$$
So to calculate the gradient, we take the partial derivative of $RSS(w)$ with respect to each element of our coefficient vector $w = (w_0, w_1)$. For the first element, $w_0$:

$$\frac{\partial RSS(w)}{\partial w_0} = \sum_{i=1}^{N} 2\left(y_i - \left[w_0 + w_1 x_i\right]\right)\cdot(-1)$$
and similarly the partial derivative of $RSS(w)$ with respect to $w_1$ is given by:

$$\frac{\partial RSS(w)}{\partial w_1} = \sum_{i=1}^{N} 2\left(y_i - \left[w_0 + w_1 x_i\right]\right)\cdot(-x_i)$$
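If you want to double-check these partial derivatives, a short SymPy sketch (using a tiny three-point symbolic data set of my own choosing) differentiates the same RSS expression and recovers equivalent forms:

```python
import sympy as sp

w0, w1 = sp.symbols('w0 w1')
xs = sp.symbols('x0:3')  # symbolic inputs x0, x1, x2
ys = sp.symbols('y0:3')  # symbolic observations y0, y1, y2

# RSS for a tiny three-point symbolic data set
RSS = sum((ys[i] - (w0 + w1 * xs[i])) ** 2 for i in range(3))

# Partial derivatives; the printed expansions are mathematically equal to
# -2 * sum(y_i - (w0 + w1*x_i)) and -2 * sum((y_i - (w0 + w1*x_i)) * x_i)
print(sp.expand(sp.diff(RSS, w0)))
print(sp.expand(sp.diff(RSS, w1)))
```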
If we do the work to complete these partial derivatives, we find that the gradient of our RSS function is given by a two-element vector:

$$\nabla RSS(w) = \begin{bmatrix} \frac{\partial RSS(w)}{\partial w_0} \\ \frac{\partial RSS(w)}{\partial w_1} \end{bmatrix} = \begin{bmatrix} -2\sum_{i=1}^{N}\left(y_i - \left[w_0 + w_1 x_i\right]\right) \\ -2\sum_{i=1}^{N}\left(y_i - \left[w_0 + w_1 x_i\right]\right)x_i \end{bmatrix}$$
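Putting the two components together, here is a minimal NumPy sketch of this gradient, assuming the same toy `x` and `y` arrays as above; the finite-difference comparison at the end is just a sanity check on the formula.

```python
import numpy as np

def rss_gradient(w0, w1, x, y):
    """Two-element gradient of RSS for the model y_hat = w0 + w1 * x."""
    residuals = y - (w0 + w1 * x)
    return np.array([
        -2.0 * np.sum(residuals),      # partial derivative with respect to w0
        -2.0 * np.sum(residuals * x),  # partial derivative with respect to w1
    ])

# Toy data set (same placeholder values as before)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.9, 5.1, 7.2])
grad = rss_gradient(1.0, 2.0, x, y)

# Sanity check: a central finite difference on w0 should closely match grad[0]
eps = 1e-6
rss = lambda w0, w1: np.sum((y - (w0 + w1 * x)) ** 2)
print(grad[0], (rss(1.0 + eps, 2.0) - rss(1.0 - eps, 2.0)) / (2 * eps))
```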