MAC0460 / MAC5832 (2020)


EP2: linear regression, analytic solution

Goals:

  • to implement and test the analytic solution for the linear regression task (see, for instance, Slides of Lecture 03 and Lecture 03 of Learning from Data)
  • to understand the core idea (optimization of a loss or cost function) for parameter adjustment in machine learning

This notebook makes use of additional auxiliary functions in util/


Linear regression

Given a dataset $\{(\mathbf{x}^{(1)}, y^{(1)}), \dots ,(\mathbf{x}^{(N)}, y^{(N)})\}$ with $\mathbf{x}^{(i)} \in \mathbb{R}^{d}$ and $y^{(i)} \in \mathbb{R}$, we would like to approximate the unknown function $f:\mathbb{R}^{d} \rightarrow \mathbb{R}$ (recall that $y^{(i)} =f(\mathbf{x}^{(i)})$) by means of a linear model $h$: $$ h(\mathbf{x}^{(i)}; \mathbf{w}, b) = \mathbf{w}^\top \mathbf{x}^{(i)} + b $$

Note that $h(\mathbf{x}^{(i)}; \mathbf{w}, b)$ is, in fact, an affine transformation of $\mathbf{x}^{(i)}$. As commonly done, we will use the term "linear" to refer to an affine transformation.

We use the notation $h(\mathbf{x}^{(i)}; \mathbf{w}, b)$ to make clear that $h$ is a parametric model, i.e., the transformation is defined by the parameters $\mathbf{w}$ and $b$. We can view the vector $\mathbf{w}$ as a weight vector that controls the effect of each feature on the prediction.

By adding to each observation $\mathbf{x}^{(i)}$ one extra component with value equal to 1 -- the artificial coordinate -- we can absorb the bias $b$ into the weight vector $\mathbf{w}$ and simplify the notation:

$$ h(\mathbf{x}^{(i)}; \mathbf{w}) = \hat{y}^{(i)} = \mathbf{w}^\top \mathbf{x}^{(i)} $$
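In NumPy, this artificial coordinate can be added by stacking a column of ones to the left of the data matrix. A minimal sketch (the array below is just an illustrative example, not the course dataset):

In [ ]:
import numpy as np

# sketch: prepend a column of 1s (the artificial coordinate) to a data matrix
X_example = np.array([[2.0], [3.0], [5.0]])   # N = 3 observations, d = 1
X_ext = np.hstack([np.ones((X_example.shape[0], 1)), X_example])
print(X_ext)   # each row is now (1, x^(i))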

We would like to determine the optimal parameters $\mathbf{w}$ such that the prediction $\hat{y}^{(i)}$ is as close as possible to $y^{(i)}$ according to some error metric. Adopting the mean squared error as that metric, we have the following cost function:

\begin{equation} J(\mathbf{w}) = \frac{1}{N}\sum_{i=1}^{N}\big(\hat{y}^{(i)} - y^{(i)}\big)^{2} \end{equation}

Thus, the task of determining a function $h$ that is closest to $f$ reduces to finding the vector $\mathbf{w}$ that minimizes $J(\mathbf{w})$.
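
For concreteness, here is a minimal sketch of how $J(\mathbf{w})$ could be evaluated in NumPy for a given weight vector, assuming $\mathbf{X}$ already includes the artificial coordinate (the function name mse_cost is our own, not part of util/):

In [ ]:
import numpy as np

# sketch: mean squared error cost J(w)
# (assumes X already includes the column of 1s and y has shape (N, 1))
def mse_cost(X, y, w):
    residuals = X.dot(w) - y          # predictions minus targets, shape (N, 1)
    return float(np.mean(residuals ** 2))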

Now we will explore this model, starting with a simple dataset.

Import auxiliary functions

In [ ]:
# all imports
import numpy as np
import time
from util.util import get_housing_prices_data, r_squared
from util.plots import plot_points_regression 

%matplotlib inline

The dataset

The first dataset we will use is a toy dataset. We will generate $N=100$ observations with only one feature and a real value associated with each of them. We can view these observations as pairs (area of a real estate property in square meters, price of the property). Our task is to construct a model that is able to predict the price of a property, given its area.

In [ ]:
X, y = get_housing_prices_data(N=100)

Plotting the data

In [ ]:
plot_points_regression(X,
                       y,
                       title='Real estate prices prediction',
                       xlabel="m\u00b2",
                       ylabel='$')

The solution

Given $f:\mathbb{R}^{N\times d} \rightarrow \mathbb{R}$ and $\mathbf{A} \in \mathbb{R}^{N\times d}$, we define the gradient of $f$ with respect to $\mathbf{A}$ as:

\begin{equation*} \nabla_{\mathbf{A}}f = \frac{\partial f}{\partial \mathbf{A}} = \begin{bmatrix} \frac{\partial f}{\partial \mathbf{A}_{1,1}} & \dots & \frac{\partial f}{\partial \mathbf{A}_{1,d}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial \mathbf{A}_{N,1}} & \dots & \frac{\partial f}{\partial \mathbf{A}_{N,d}} \end{bmatrix} \end{equation*}

Let $\mathbf{X} \in \mathbb{R}^{N\times d}$ be a matrix whose rows are the observations of the dataset (sometimes also called the design matrix) and let $\mathbf{y} \in \mathbb{R}^{N}$ be the vector consisting of all values of $y^{(i)}$ (i.e., $\mathbf{X}^{(i,:)} = \mathbf{x}^{(i)}$ and $\mathbf{y}^{(i)} = y^{(i)}$). It can be verified that:

\begin{equation} J(\mathbf{w}) = \frac{1}{N}(\mathbf{X}\mathbf{w} - \mathbf{y})^{T}(\mathbf{X}\mathbf{w} - \mathbf{y}) \end{equation}

Using basic matrix derivative concepts we can compute the gradient of $J(\mathbf{w})$ with respect to $\mathbf{w}$:

\begin{equation} \nabla_{\mathbf{w}}J(\mathbf{w}) = \frac{2}{N} (\mathbf{X}^{T}\mathbf{X}\mathbf{w} -\mathbf{X}^{T}\mathbf{y}) \end{equation}

Thus, when $\nabla_{\mathbf{w}}J(\mathbf{w}) = 0$ we have

\begin{equation} \mathbf{X}^{T}\mathbf{X}\mathbf{w} = \mathbf{X}^{T}\mathbf{y} \end{equation}

Hence,

\begin{equation} \mathbf{w} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y} \end{equation}

Note that this closed-form solution has a high computational cost: as the number of variables (features) $d$ grows, inverting the $d \times d$ matrix $\mathbf{X}^{T}\mathbf{X}$, whose cost grows roughly cubically in $d$, becomes prohibitive. See this text for more details.
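
As a reference, a minimal sketch of this closed-form computation in NumPy could look as follows (it assumes $\mathbf{X}$ already includes the artificial coordinate; the complete version, including the column of 1s, is what Exercise 1 below asks for). Solving the normal equations with np.linalg.solve is numerically preferable to forming the explicit inverse:

In [ ]:
# sketch of the closed-form solution (assumes X already has the column of 1s);
# np.linalg.solve(A, b) solves A w = b without explicitly inverting A
def normal_equation_sketch(X, y):
    XtX = X.T.dot(X)
    Xty = X.T.dot(y)
    return np.linalg.solve(XtX, Xty)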

Exercise 1

Using only NumPy (a quick introduction to this library can be found here), complete the two functions below. Recall that $\mathbf{X} \in \mathbb{R}^{N\times d}$; thus you will need to add a component of value 1 to each of the observations in $\mathbf{X}$ before performing the computation described above.

NOTE: Although the dataset above has data of dimension $d=1$, your code must be generic (it should work for $d\geq1$)

In [ ]:
def normal_equation_weights(X, y):
    """
    Calculates the weights of a linear function using the normal equation method.
    You should add into X a new column with 1s.

    :param X: design matrix
    :type X: np.ndarray(shape=(N, d))
    :param y: regression targets
    :type y: np.ndarray(shape=(N, 1))
    :return: weight vector
    :rtype: np.ndarray(shape=(d+1, 1))
    """
    
    # START OF YOUR CODE:
    raise NotImplementedError("Function normal_equation_weights() is not implemented")
    # END YOUR CODE

    return w
In [ ]:
# test of function normal_equation_weights()

w = 0  # this is not necessary
w = normal_equation_weights(X, y)
print("Estimated w = ", w)
In [ ]:
def normal_equation_prediction(X, w):
    """
    Calculates the prediction over a set of observations X using the linear function
    characterized by the weight vector w.
    You should add into X a new column with 1s.

    :param X: design matrix
    :type X: np.ndarray(shape=(N, d))
    :param w: weight vector
    :type w: np.ndarray(shape=(d+1, 1))
    :return: regression prediction
    :rtype: np.ndarray(shape=(N, 1))
    """
    
    # START OF YOUR CODE:
    raise NotImplementedError("Function normal_equation_prediction() is not implemented")
    # END YOUR CODE
    
    return prediction

You can use the $R^2$ metric to evaluate how well the linear model fits the data.

It is expected that $R^2$ will be close to 0.5.
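
Here we assume that r_squared computes the usual coefficient of determination,

\begin{equation} R^{2} = 1 - \frac{\sum_{i=1}^{N}\big(y^{(i)} - \hat{y}^{(i)}\big)^{2}}{\sum_{i=1}^{N}\big(y^{(i)} - \bar{y}\big)^{2}} \end{equation}

where $\bar{y}$ is the mean of the targets $y^{(i)}$; a value of 1 indicates a perfect fit, while values near 0 indicate a fit no better than predicting the mean.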

In [ ]:
# test of function normal_equation_prediction()
prediction = normal_equation_prediction(X, w)
r_2 = r_squared(y, prediction)
plot_points_regression(X,
                       y,
                       title='Real estate prices prediction',
                       xlabel="m\u00b2",
                       ylabel='$',
                       prediction=prediction,
                       legend=True,
                       r_squared=r_2)

Additional tests

Let us compute a prediction for $x=650$

In [ ]:
# Let us use the prediction function
x = np.asarray([650]).reshape(1, 1)
prediction = normal_equation_prediction(x, w)
print("Area = %.2f  Predicted price = %.4f" % (x[0, 0], prediction[0, 0]))

# another way of computing the same value: prepend the artificial coordinate
# and multiply directly by the weight vector
x_ext = np.asarray([[1.0, 650.0]])
y_pred = x_ext.dot(w)
print("Area = %.2f  Predicted price = %.4f" % (x[0, 0], y_pred[0, 0]))

Exercise 2: Effect of the number of samples

Change the number of samples $N$ and observe how processing time varies.

In [ ]:
# Testing different values for N
X, y = get_housing_prices_data(N=1000000)

start = time.time()
w = normal_equation_weights(X, y)
prediction = normal_equation_prediction(X, w)
elapsed = time.time() - start

print("Execution time = {:.8f} (s)".format(elapsed))

Exercise 3: Effect of the data dimension

Test your code on data with $d>1$. You can create your own dataset (if you do so, you can share the code by posting it to the Moodle forum -- only the code for the dataset generation!). If you have no idea how to generate such a dataset, you can use an existing one, such as the Boston housing dataset from scikit-learn: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html#sklearn.datasets.load_boston

If you used an existing dataset or one generated by a colleague, please acknowledge the fact clearly. Thanks!

In [ ]:
# Testing data with dimension  d > 1
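# A possible sketch (not the only approach): generate a synthetic dataset with
# d = 3 features and a known linear relationship, then fit and evaluate it with
# the functions implemented above. The coefficients and noise level below are
# arbitrary choices, used only for illustration.
rng = np.random.RandomState(42)

N, d = 500, 3
X_multi = rng.uniform(0, 10, size=(N, d))
true_w = np.array([[2.0], [-1.0], [0.5]])    # arbitrary "true" weights
true_b = 4.0                                 # arbitrary "true" bias
y_multi = X_multi.dot(true_w) + true_b + rng.normal(0, 0.5, size=(N, 1))

w_multi = normal_equation_weights(X_multi, y_multi)
pred_multi = normal_equation_prediction(X_multi, w_multi)
print("Estimated w =\n", w_multi)
print("R^2 =", r_squared(y_multi, pred_multi))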