This notebook makes use of additional auxiliary functions provided in the util package (util.util and util.plots).
Given a dataset $\{(\mathbf{x}^{(1)}, y^{(1)}), \dots ,(\mathbf{x}^{(N)}, y^{(N)})\}$ with $\mathbf{x}^{(i)} \in \mathbb{R}^{d}$ and $y^{(i)} \in \mathbb{R}$, we would like to approximate the unknown function $f:\mathbb{R}^{d} \rightarrow \mathbb{R}$ (recall that $y^{(i)} =f(\mathbf{x}^{(i)})$) by means of a linear model $h$: $$ h(\mathbf{x}^{(i)}; \mathbf{w}, b) = \mathbf{w}^\top \mathbf{x}^{(i)} + b $$
Note that $h(\mathbf{x}^{(i)}; \mathbf{w}, b)$ is, in fact, an affine transformation of $\mathbf{x}^{(i)}$. As commonly done, we will use the term "linear" to refer to an affine transformation.
The output of $h$ is a linear transformation of $\mathbf{x}^{(i)}$. We use the notation $h(\mathbf{x}^{(i)}; \mathbf{w}, b)$ to make clear that $h$ is a parametric model, i.e., the transformation $h$ is defined by the parameters $\mathbf{w}$ and $b$. We can view the vector $\mathbf{w}$ as a weight vector that controls the effect of each feature on the prediction.
By adding one component with value equal to 1 to each observation $\mathbf{x}^{(i)}$ (an artificial coordinate), we can absorb the bias $b$ into the weight vector and simplify the notation:
$$ h(\mathbf{x}^{(i)}; \mathbf{w}) = \hat{y}^{(i)} = \mathbf{w}^\top \mathbf{x}^{(i)} $$
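To make the artificial coordinate concrete, here is a minimal sketch (the values of $\mathbf{w}$ and $b$ below are hypothetical, nothing is fitted yet) showing that the affine form and the augmented form produce the same output:
import numpy as np

# hypothetical parameters, just to illustrate the notation
x = np.array([3.0, 5.0])              # one observation with d = 2 features
w = np.array([2.0, -1.0])             # weight vector
b = 4.0                               # bias

h_affine = w @ x + b                  # affine form: w^T x + b

x_aug = np.concatenate(([1.0], x))    # [1, x_1, x_2] -- artificial coordinate added
w_aug = np.concatenate(([b], w))      # [b, w_1, w_2] -- bias absorbed into the weights
h_aug = w_aug @ x_aug                 # w^T x in the augmented notation

print(h_affine, h_aug)                # both print 5.0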
We would like to determine the optimal parameters $\mathbf{w}$ such that the prediction $\hat{y}^{(i)}$ is as close as possible to $y^{(i)}$ according to some error metric. Adopting the mean squared error as that metric, we have the following cost function:
\begin{equation} J(\mathbf{w}) = \frac{1}{N}\sum_{i=1}^{N}\big(\hat{y}^{(i)} - y^{(i)}\big)^{2} \end{equation}
Thus, the task of determining a function $h$ that is closest to $f$ is reduced to the task of finding the value of $\mathbf{w}$ that minimizes $J(\mathbf{w})$.
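As a quick illustration, the cost above can be translated directly into NumPy. This is only a sketch: the helper name mse_cost is not part of the assignment, and X_aug is assumed to already contain the artificial coordinate of 1s.
import numpy as np

def mse_cost(X_aug, y, w):
    """Mean squared error J(w); X_aug is assumed to already include the column of 1s."""
    residuals = X_aug @ w - y          # shape (N, 1)
    return float(np.mean(residuals ** 2))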
Now we will explore this model, starting with a simple dataset.
# all imports
import numpy as np
import time
from util.util import get_housing_prices_data, r_squared
from util.plots import plot_points_regression
%matplotlib inline
The first dataset we will use is a toy dataset. We will generate $N=100$ observations with only one feature and a real value associated with each of them. We can view these observations as pairs (area of a real estate property in square meters, price of the property). Our task is to construct a model that is able to predict the price of a property, given its area.
X, y = get_housing_prices_data(N=100)
plot_points_regression(X,
y,
title='Real estate prices prediction',
xlabel="m\u00b2",
ylabel='$')
Given $f:\mathbb{R}^{n\times m} \rightarrow \mathbb{R}$ and $\mathbf{A} \in \mathbb{R}^{n\times m}$, we define the gradient of $f$ with respect to $\mathbf{A}$ as:
\begin{equation*} \nabla_{\mathbf{A}}f = \frac{\partial f}{\partial \mathbf{A}} = \begin{bmatrix} \frac{\partial f}{\partial \mathbf{A}_{1,1}} & \dots & \frac{\partial f}{\partial \mathbf{A}_{1,m}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial \mathbf{A}_{n,1}} & \dots & \frac{\partial f}{\partial \mathbf{A}_{n,m}} \end{bmatrix} \end{equation*}
Let $\mathbf{X} \in \mathbb{R}^{N\times d}$ be a matrix whose rows are the observations of the dataset (sometimes also called the design matrix) and let $\mathbf{y} \in \mathbb{R}^{N}$ be the vector consisting of all values of $y^{(i)}$ (i.e., $\mathbf{X}^{(i,:)} = \mathbf{x}^{(i)}$ and $\mathbf{y}^{(i)} = y^{(i)}$). It can be verified that:
\begin{equation} J(\mathbf{w}) = \frac{1}{N}(\mathbf{X}\mathbf{w} - \mathbf{y})^{T}(\mathbf{X}\mathbf{w} - \mathbf{y}) \end{equation}
Using basic matrix derivative concepts we can compute the gradient of $J(\mathbf{w})$ with respect to $\mathbf{w}$:
\begin{equation} \nabla_{\mathbf{w}}J(\mathbf{w}) = \frac{2}{N} (\mathbf{X}^{T}\mathbf{X}\mathbf{w} -\mathbf{X}^{T}\mathbf{y}) \end{equation}
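For readers who want the intermediate step, this gradient follows from expanding the quadratic form and applying the standard identities $\nabla_{\mathbf{w}}(\mathbf{w}^{T}\mathbf{A}\mathbf{w}) = 2\mathbf{A}\mathbf{w}$ (for symmetric $\mathbf{A}$) and $\nabla_{\mathbf{w}}(\mathbf{b}^{T}\mathbf{w}) = \mathbf{b}$:
\begin{equation*} J(\mathbf{w}) = \frac{1}{N}\big(\mathbf{w}^{T}\mathbf{X}^{T}\mathbf{X}\mathbf{w} - 2\,\mathbf{y}^{T}\mathbf{X}\mathbf{w} + \mathbf{y}^{T}\mathbf{y}\big) \quad\Longrightarrow\quad \nabla_{\mathbf{w}}J(\mathbf{w}) = \frac{1}{N}\big(2\,\mathbf{X}^{T}\mathbf{X}\mathbf{w} - 2\,\mathbf{X}^{T}\mathbf{y}\big) \end{equation*}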
Thus, when $\nabla_{\mathbf{w}}J(\mathbf{w}) = 0$ we have
\begin{equation} \mathbf{X}^{T}\mathbf{X}\mathbf{w} = \mathbf{X}^{T}\mathbf{y} \end{equation}
Hence, provided that $\mathbf{X}^{T}\mathbf{X}$ is invertible,
\begin{equation} \mathbf{w} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y} \end{equation}
Note that this solution has a high computational cost. As the number of variables (features) increases, the cost of the matrix inversion becomes prohibitive. See this text for more details.
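For the linear-algebra step only, a minimal NumPy sketch is shown below (the function name solve_normal_equation is illustrative; it is not the complete exercise below, which also requires adding the column of 1s). Using np.linalg.solve avoids forming the explicit inverse, which is both cheaper and numerically safer:
import numpy as np

def solve_normal_equation(X_aug, y):
    # Solve (X^T X) w = X^T y without computing the inverse explicitly.
    # X_aug is assumed to already contain the artificial coordinate of 1s.
    A = X_aug.T @ X_aug           # shape (d+1, d+1)
    b = X_aug.T @ y               # shape (d+1, 1)
    return np.linalg.solve(A, b)  # shape (d+1, 1)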
Using only NumPy (a quick introduction to this library can be found here), complete the two functions below. Recall that $\mathbf{X} \in \mathbb{R}^{N\times d}$; thus you will need to add a component of value 1 to each of the observations in $\mathbf{X}$ before performing the computation described above.
NOTE: Although the dataset above has data of dimension $d=1$, your code must be generic (it should work for any $d\geq1$).
def normal_equation_weights(X, y):
"""
Calculates the weights of a linear function using the normal equation method.
You should add to X a new column of 1s.
:param X: design matrix
:type X: np.ndarray(shape=(N, d))
:param y: regression targets
:type y: np.ndarray(shape=(N, 1))
:return: weight vector
:rtype: np.ndarray(shape=(d+1, 1))
"""
# START OF YOUR CODE:
raise NotImplementedError("Function normal_equation_weights() is not implemented")
# END YOUR CODE
return w
# test of function normal_equation_weights()
w = 0 # this is not necessary
w = normal_equation_weights(X, y)
print("Estimated w = ", w)
def normal_equation_prediction(X, w):
"""
Calculates the prediction over a set of observations X using the linear function
characterized by the weight vector w.
You should add to X a new column of 1s.
:param X: design matrix
:type X: np.ndarray(shape=(N, d))
:param w: weight vector
:type w: np.ndarray(shape=(d+1, 1))
:return: predictions
:rtype: np.ndarray(shape=(N, 1))
"""
# START OF YOUR CODE:
raise NotImplementedError("Function normal_equation_prediction() is not implemented")
# END YOUR CODE
return prediction
You can use the $R^2$ metric to evaluate how well the linear model fits the data.
It is expected that $R^2$ is a value close to 0.5.
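The helper r_squared comes from util.util; for reference, the usual coefficient-of-determination formula can be sketched as follows (the provided helper may differ in minor details):
import numpy as np

def r_squared_sketch(y_true, y_pred):
    # R^2 = 1 - SS_res / SS_tot (standard definition of the coefficient of determination)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot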
# test of function normal_equation_prediction()
prediction = normal_equation_prediction(X, w)
r_2 = r_squared(y, prediction)
plot_points_regression(X,
y,
title='Real estate prices prediction',
xlabel="m\u00b2",
ylabel='$',
prediction=prediction,
legend=True,
r_squared=r_2)
Let us compute a prediction for $x=650$
# Let us use the prediction function
x = np.asarray([650]).reshape(1,1)
prediction = normal_equation_prediction(x, w)
print("Area = %.2f Predicted price = %.4f" %(x[0], prediction))
# another way of computing the same
x_aug = np.asarray([1.0, x[0, 0]])   # augmented observation [1, x]
y_hat = np.dot(x_aug, w)
print("Area = %.2f Predicted price = %.4f" % (x[0, 0], y_hat[0]))
Change the number of samples $N$ and observe how processing time varies.
# Testing different values for N
X, y = get_housing_prices_data(N=1000000)
start = time.time()
w = normal_equation_weights(X, y)
prediction = normal_equation_prediction(X, w)
elapsed = time.time() - start
print("Execution time = {:.8f}(s)".format(elapsed))
Test your code on data with $d>1$. You can create your own dataset (if you do so, you may share the code by posting it to the Moodle forum -- only the code for the dataset generation!). If you have no idea how to generate such a dataset, you can use an existing one, such as the scikit-learn dataset at https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html#sklearn.datasets.load_boston
If you used an existing dataset or one generated by a colleague, please acknowledge the fact clearly. Thanks!
# Testing data with dimension d > 1
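One possible sketch with a synthetic dataset is given below (the number of features, the coefficients, and the noise level are arbitrary choices, and it assumes the two functions above have been implemented). Note that load_boston has been removed from recent scikit-learn releases, so a self-generated dataset also avoids any dependency issue.
rng = np.random.default_rng(42)
N, d = 500, 3
X_multi = rng.uniform(0, 10, size=(N, d))
true_w = np.array([[2.0], [-1.5], [0.5]])    # arbitrary "ground-truth" weights
true_b = 4.0
y_multi = X_multi @ true_w + true_b + rng.normal(0, 0.5, size=(N, 1))

w_multi = normal_equation_weights(X_multi, y_multi)
prediction_multi = normal_equation_prediction(X_multi, w_multi)
print("R^2 =", r_squared(y_multi, prediction_multi))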