\n",
"\n",
"## EP2: linear regression, analytic solution\n",
"\n",
"### Goals:\n",
"\n",
"- to implement and test the analytic solution for the linear regression task (see, for instance, Slides of Lecture 03 and Lecture 03 of *Learning from Data*)\n",
"- to understand the core idea (*optimization of a loss or cost function*) for parameter adjustment in machine learning\n",
"\n",
"This notebook makes use of additional auxiliary functions in util/\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Linear regression\n",
"\n",
"Given a dataset $\\{(\\mathbf{x}^{(1)}, y^{(1)}), \\dots ,(\\mathbf{x}^{(N)}, y^{(N)})\\}$ with $\\mathbf{x}^{(i)} \\in \\mathbb{R}^{d}$ and $y^{(i)} \\in \\mathbb{R}$, we would like to approximate the unknown function $f:\\mathbb{R}^{d} \\rightarrow \\mathbb{R}$ (recall that $y^{(i)} =f(\\mathbf{x}^{(i)})$) by means of a linear model $h$:\n",
"$$\n",
"h(\\mathbf{x}^{(i)}; \\mathbf{w}, b) = \\mathbf{w}^\\top \\mathbf{x}^{(i)} + b\n",
"$$\n",
"\n",
"Note that $h(\\mathbf{x}^{(i)}; \\mathbf{w}, b)$ is, in fact, an [affine transformation](https://en.wikipedia.org/wiki/Affine_transformation) of $\\mathbf{x}^{(i)}$. As commonly done, we will use the term \"linear\" to refer to an affine transformation.\n",
"\n",
"The output of $h$ is a linear transformation of $\\mathbf{x}^{(i)}$. We use the notation $h(\\mathbf{x}^{(i)}; \\mathbf{w}, b)$ to make clear that $h$ is a parametric model, i.e., the transformation $h$ is defined by the parameters $\\mathbf{w}$ and $b$. We can view vector $\\mathbf{w}$ as a *weight* vector that controls the effect of each *feature* in the prediction.\n",
"\n",
"By adding one component with value equal to 1 to the observations $\\mathbf{x}^{(i)}$ -- artificial coordinate -- we can simplify the notation:\n",
"\n",
"$$\n",
"h(\\mathbf{x}^{(i)}; \\mathbf{w}) = \\hat{y}^{(i)} = \\mathbf{w}^\\top \\mathbf{x}^{(i)}\n",
"$$\n",
"\n",
"We would like to determine the optimal parameters $\\mathbf{w}$ such that prediction $\\hat{y}^{(i)}$ is as closest as possible to $y^{(i)}$ according to some error metric. Adopting the *mean square error* as such metric we have the following cost function:\n",
"\n",
"\\begin{equation}\n",
"J(\\mathbf{w}) = \\frac{1}{N}\\sum_{i=1}^{N}\\big(\\hat{y}^{(i)} - y^{(i)}\\big)^{2}\n",
"\\end{equation}\n",
"\n",
"Thus, the task of determining a function $h$ that is closest to $f$ is reduced to the task of finding the values $\\mathbf{w}$ that minimizes $J(\\mathbf{w})$.\n",
"\n",
"**Now we will explore this model, starting with a simple dataset.**\n"
]
},
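{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick illustration of the artificial coordinate trick, here is a minimal sketch (with made-up toy values) of how a design matrix can be augmented with a column of 1s using NumPy:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# toy design matrix with N=3 observations and d=2 features (made-up values)\n",
"X_toy = np.array([[2.0, 3.0],\n",
"                  [4.0, 1.0],\n",
"                  [0.5, 2.5]])\n",
"\n",
"# prepend a column of 1s so that the first weight plays the role of the bias b\n",
"X_aug = np.hstack([np.ones((X_toy.shape[0], 1)), X_toy])\n",
"print(X_aug)"
]
},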
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Import auxiliary functions"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# all imports\n",
"import numpy as np\n",
"import time\n",
"from util.util import get_housing_prices_data, r_squared\n",
"from util.plots import plot_points_regression \n",
"\n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The dataset \n",
"\n",
"The first dataset we will use is a toy dataset. We will generate $N=100$ observations with only one *feature* and a real value associated to each of them. We can view these observations as being pairs *(area of a real state in square meters, price of the real state)*. Our task is to construct a model that is able to predict the price of a real state, given its area."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X, y = get_housing_prices_data(N=100)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Ploting the data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plot_points_regression(X,\n",
" y,\n",
" title='Real estate prices prediction',\n",
" xlabel=\"m\\u00b2\",\n",
" ylabel='$')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The solution\n",
"\n",
"Given $f:\\mathbb{R}^{N\\times d} \\rightarrow \\mathbb{R}$ and $\\mathbf{A} \\in \\mathbb{R}^{N\\times d}$, we define the gradient of $f$ with respect to $\\mathbf{A}$ as:\n",
"\n",
"\\begin{equation*}\n",
"\\nabla_{\\mathbf{A}}f = \\frac{\\partial f}{\\partial \\mathbf{A}} = \\begin{bmatrix}\n",
"\\frac{\\partial f}{\\partial \\mathbf{A}_{1,1}} & \\dots & \\frac{\\partial f}{\\partial \\mathbf{A}_{1,m}} \\\\\n",
"\\vdots & \\ddots & \\vdots \\\\\n",
"\\frac{\\partial f}{\\partial \\mathbf{A}_{n,1}} & \\dots & \\frac{\\partial f}{\\partial \\mathbf{A}_{n,m}}\n",
"\\end{bmatrix}\n",
"\\end{equation*}\n",
"\n",
"Let $\\mathbf{X} \\in \\mathbb{R}^{N\\times d}$ be a matrix whose rows are the observations of the dataset (sometimes also called the *design matrix*) and let $\\mathbf{y} \\in \\mathbb{R}^{N}$ be the vector consisting of all values of $y^{(i)}$ (i.e., $\\mathbf{X}^{(i,:)} = \\mathbf{x}^{(i)}$ and $\\mathbf{y}^{(i)} = y^{(i)}$). It can be verified that: \n",
"\n",
"\\begin{equation}\n",
"J(\\mathbf{w}) = \\frac{1}{N}(\\mathbf{X}\\mathbf{w} - \\mathbf{y})^{T}(\\mathbf{X}\\mathbf{w} - \\mathbf{y})\n",
"\\end{equation}\n",
"\n",
"Using basic matrix derivative concepts we can compute the gradient of $J(\\mathbf{w})$ with respect to $\\mathbf{w}$:\n",
"\n",
"\\begin{equation}\n",
"\\nabla_{\\mathbf{w}}J(\\mathbf{w}) = \\frac{2}{N} (\\mathbf{X}^{T}\\mathbf{X}\\mathbf{w} -\\mathbf{X}^{T}\\mathbf{y}) \n",
"\\end{equation}\n",
"\n",
"Thus, when $\\nabla_{\\mathbf{w}}J(\\mathbf{w}) = 0$ we have \n",
"\n",
"\\begin{equation}\n",
"\\mathbf{X}^{T}\\mathbf{X}\\mathbf{w} = \\mathbf{X}^{T}\\mathbf{y}\n",
"\\end{equation}\n",
"\n",
"Hence,\n",
"\n",
"\\begin{equation}\n",
"\\mathbf{w} = (\\mathbf{X}^{T}\\mathbf{X})^{-1}\\mathbf{X}^{T}\\mathbf{y}\n",
"\\end{equation}\n",
"\n",
"Note that this solution has a high computational cost. As the number of variables (*features*) increases, the cost for matrix inversion becomes prohibitive. See [this text](http://cs229.stanford.edu/notes/cs229-notes1.pdf) for more details."
]
},
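{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a self-contained sketch of this closed-form solution (separate from the exercise functions below), the following cell fits synthetic data with known true parameters; all values are made up for illustration. It uses np.linalg.solve on the normal equations, which is numerically preferable to explicitly inverting $\\mathbf{X}^{T}\\mathbf{X}$:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# synthetic data: y = 3 + 2x + noise (toy values, assumed for illustration)\n",
"rng = np.random.RandomState(0)\n",
"X_demo = rng.uniform(0, 10, size=(50, 1))\n",
"y_demo = 3 + 2 * X_demo + rng.normal(scale=0.5, size=(50, 1))\n",
"\n",
"# augment with the artificial coordinate\n",
"X_demo_aug = np.hstack([np.ones((X_demo.shape[0], 1)), X_demo])\n",
"\n",
"# solve (X^T X) w = X^T y rather than inverting X^T X\n",
"w_demo = np.linalg.solve(X_demo_aug.T @ X_demo_aug, X_demo_aug.T @ y_demo)\n",
"print(w_demo)  # should be close to [3, 2]"
]
},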
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercise 1\n",
"Using only **NumPy** (a quick introduction to this library can be found [here](http://cs231n.github.io/python-numpy-tutorial/)), complete the two functions below. Recall that $\\mathbf{X} \\in \\mathbb{R}^{N\\times d}$; thus you will need to add a component of value 1 to each of the observations in $\\mathbf{X}$ before performing the computation described above.\n",
"\n",
"NOTE: Although the dataset above has data of dimension $d=1$, your code must be generic (it should work for $d\\geq1$)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def normal_equation_weights(X, y):\n",
" \"\"\"\n",
" Calculates the weights of a linear function using the normal equation method.\n",
" You should add into X a new column with 1s.\n",
"\n",
" :param X: design matrix\n",
" :type X: np.ndarray(shape=(N, d))\n",
" :param y: regression targets\n",
" :type y: np.ndarray(shape=(N, 1))\n",
" :return: weight vector\n",
" :rtype: np.ndarray(shape=(d+1, 1))\n",
" \"\"\"\n",
" \n",
" # START OF YOUR CODE:\n",
" raise NotImplementedError(\"Function normal_equation_weights() is not implemented\")\n",
" # END YOUR CODE\n",
"\n",
" return w"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# test of function normal_equation_weights()\n",
"\n",
"w = 0 # this is not necessary\n",
"w = normal_equation_weights(X, y)\n",
"print(\"Estimated w = \", w)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def normal_equation_prediction(X, w):\n",
" \"\"\"\n",
" Calculates the prediction over a set of observations X using the linear function\n",
" characterized by the weight vector w.\n",
" You should add into X a new column with 1s.\n",
"\n",
" :param X: design matrix\n",
" :type X: np.ndarray(shape=(N, d))\n",
" :param w: weight vector\n",
" :type w: np.ndarray(shape=(d+1, 1))\n",
" :param y: regression prediction\n",
" :type y: np.ndarray(shape=(N, 1))\n",
" \"\"\"\n",
" \n",
" # START OF YOUR CODE:\n",
" raise NotImplementedError(\"Function normal_equation_prediction() is not implemented\")\n",
" # END YOUR CODE\n",
" \n",
" return prediction"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"You can use the [$R^2$](https://pt.wikipedia.org/wiki/R%C2%B2) metric to evaluate how well the linear model fits the data.\n",
"\n",
"**It is expected that $𝑅^2$ is a value close to 0.5.**"
]
},
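{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, $R^2 = 1 - \\frac{\\sum_{i}(y^{(i)} - \\hat{y}^{(i)})^2}{\\sum_{i}(y^{(i)} - \\bar{y})^2}$, where $\\bar{y}$ is the mean of the targets. The sketch below computes it directly with NumPy; the provided r_squared() in util/ may differ in implementation details:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def r_squared_demo(y_true, y_pred):\n",
"    \"\"\"Minimal sketch of the coefficient of determination.\"\"\"\n",
"    ss_res = np.sum((y_true - y_pred) ** 2)            # residual sum of squares\n",
"    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)   # total sum of squares\n",
"    return 1 - ss_res / ss_tot"
]
},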
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# test of function normal_equation_prediction()\n",
"prediction = normal_equation_prediction(X, w)\n",
"r_2 = r_squared(y, prediction)\n",
"plot_points_regression(X,\n",
" y,\n",
" title='Real estate prices prediction',\n",
" xlabel=\"m\\u00b2\",\n",
" ylabel='$',\n",
" prediction=prediction,\n",
" legend=True,\n",
" r_squared=r_2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Additional tests\n",
"\n",
"Let us compute a prediction for $x=650$\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Let us use the prediction function\n",
"x = np.asarray([650]).reshape(1,1)\n",
"prediction = normal_equation_prediction(x, w)\n",
"print(\"Area = %.2f Predicted price = %.4f\" %(x[0], prediction))\n",
"\n",
"# another way of computing the same\n",
"y = np.dot(np.asarray((1,x)), w)\n",
"print(\"Area = %.2f Predicted price = %.4f\" %(x, y))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercise 2: Effect of the number of samples\n",
"\n",
"Change the number of samples $N$ and observe how processing time varies."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Testing different values for N\n",
"X, y = get_housing_prices_data(N=1000000)\n",
"init = time.time()\n",
"w = normal_equation_weights(X, y)\n",
"prediction = normal_equation_prediction(X,w)\n",
"init = time.time() - init\n",
"\n",
"print(\"Execution time = {:.8f}(s)\".format(init))"
]
},
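{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible way to observe the scaling (a sketch; the values of $N$ are arbitrary) is to time only the weight computation for increasing dataset sizes:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# time the fit for several values of N\n",
"for n in [1000, 10000, 100000, 1000000]:\n",
"    X_n, y_n = get_housing_prices_data(N=n)\n",
"    start_n = time.time()\n",
"    w_n = normal_equation_weights(X_n, y_n)\n",
"    print(\"N = {:>8d}  fit time = {:.6f}(s)\".format(n, time.time() - start_n))"
]
},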
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercise 2: Effect of the data dimension\n",
"\n",
"Test your code for data with $𝑑>1$. You can create your own dataset (if you do so, you can share the code by posting it to the moodle's Forum -- only the code for the dataset generation!). If you have no idea on how to generate such dataset, you can use existing datasets such as the \n",
" one in scikit-learn https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html#sklearn.datasets.load_boston\n",
"\n",
"If you used an existing dataset or one generated by a colleague, please acknowledge the fact clearly. Thanks!"
]
},
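{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want a starting point, here is a minimal sketch that generates a synthetic dataset with $d=3$ from known ground-truth weights (all values made up), so you can check whether your estimate recovers them:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# synthetic dataset with d = 3 features and known true weights (toy values)\n",
"rng = np.random.RandomState(1)\n",
"true_w = np.array([[10.0], [2.0], [-3.0], [0.5]])  # bias followed by 3 weights\n",
"X_multi = rng.uniform(-5, 5, size=(200, 3))\n",
"y_multi = np.hstack([np.ones((200, 1)), X_multi]) @ true_w + rng.normal(scale=0.1, size=(200, 1))\n",
"\n",
"# normal_equation_weights() adds the column of 1s itself, so pass X_multi directly\n",
"w_multi = normal_equation_weights(X_multi, y_multi)\n",
"print(w_multi)  # should be close to true_w"
]
},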
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Testing data with dimension d > 1"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}