{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# MAC0460 / MAC5832 (2020)\n", "
\n", "\n", "## EP2: linear regression, analytic solution\n", "\n", "### Goals:\n", "\n", "- to implement and test the analytic solution for the linear regression task (see, for instance, Slides of Lecture 03 and Lecture 03 of *Learning from Data*)\n", "- to understand the core idea (*optimization of a loss or cost function*) for parameter adjustment in machine learning\n", "\n", "This notebook makes use of additional auxiliary functions in util/\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Linear regression\n", "\n", "Given a dataset $\\{(\\mathbf{x}^{(1)}, y^{(1)}), \\dots ,(\\mathbf{x}^{(N)}, y^{(N)})\\}$ with $\\mathbf{x}^{(i)} \\in \\mathbb{R}^{d}$ and $y^{(i)} \\in \\mathbb{R}$, we would like to approximate the unknown function $f:\\mathbb{R}^{d} \\rightarrow \\mathbb{R}$ (recall that $y^{(i)} =f(\\mathbf{x}^{(i)})$) by means of a linear model $h$:\n", "$$\n", "h(\\mathbf{x}^{(i)}; \\mathbf{w}, b) = \\mathbf{w}^\\top \\mathbf{x}^{(i)} + b\n", "$$\n", "\n", "Note that $h(\\mathbf{x}^{(i)}; \\mathbf{w}, b)$ is, in fact, an [affine transformation](https://en.wikipedia.org/wiki/Affine_transformation) of $\\mathbf{x}^{(i)}$. As commonly done, we will use the term \"linear\" to refer to an affine transformation.\n", "\n", "The output of $h$ is a linear transformation of $\\mathbf{x}^{(i)}$. We use the notation $h(\\mathbf{x}^{(i)}; \\mathbf{w}, b)$ to make clear that $h$ is a parametric model, i.e., the transformation $h$ is defined by the parameters $\\mathbf{w}$ and $b$. We can view vector $\\mathbf{w}$ as a *weight* vector that controls the effect of each *feature* in the prediction.\n", "\n", "By adding one component with value equal to 1 to the observations $\\mathbf{x}^{(i)}$ -- artificial coordinate -- we can simplify the notation:\n", "\n", "$$\n", "h(\\mathbf{x}^{(i)}; \\mathbf{w}) = \\hat{y}^{(i)} = \\mathbf{w}^\\top \\mathbf{x}^{(i)}\n", "$$\n", "\n", "We would like to determine the optimal parameters $\\mathbf{w}$ such that prediction $\\hat{y}^{(i)}$ is as closest as possible to $y^{(i)}$ according to some error metric. Adopting the *mean square error* as such metric we have the following cost function:\n", "\n", "\\begin{equation}\n", "J(\\mathbf{w}) = \\frac{1}{N}\\sum_{i=1}^{N}\\big(\\hat{y}^{(i)} - y^{(i)}\\big)^{2}\n", "\\end{equation}\n", "\n", "Thus, the task of determining a function $h$ that is closest to $f$ is reduced to the task of finding the values $\\mathbf{w}$ that minimizes $J(\\mathbf{w})$.\n", "\n", "**Now we will explore this model, starting with a simple dataset.**\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Import auxiliary functions" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# all imports\n", "import numpy as np\n", "import time\n", "from util.util import get_housing_prices_data, r_squared\n", "from util.plots import plot_points_regression \n", "\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The dataset \n", "\n", "The first dataset we will use is a toy dataset. We will generate $N=100$ observations with only one *feature* and a real value associated to each of them. We can view these observations as being pairs *(area of a real state in square meters, price of the real state)*. Our task is to construct a model that is able to predict the price of a real state, given its area." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X, y = get_housing_prices_data(N=100)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Ploting the data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_points_regression(X,\n", " y,\n", " title='Real estate prices prediction',\n", " xlabel=\"m\\u00b2\",\n", " ylabel='$')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The solution\n", "\n", "Given $f:\\mathbb{R}^{N\\times d} \\rightarrow \\mathbb{R}$ and $\\mathbf{A} \\in \\mathbb{R}^{N\\times d}$, we define the gradient of $f$ with respect to $\\mathbf{A}$ as:\n", "\n", "\\begin{equation*}\n", "\\nabla_{\\mathbf{A}}f = \\frac{\\partial f}{\\partial \\mathbf{A}} = \\begin{bmatrix}\n", "\\frac{\\partial f}{\\partial \\mathbf{A}_{1,1}} & \\dots & \\frac{\\partial f}{\\partial \\mathbf{A}_{1,m}} \\\\\n", "\\vdots & \\ddots & \\vdots \\\\\n", "\\frac{\\partial f}{\\partial \\mathbf{A}_{n,1}} & \\dots & \\frac{\\partial f}{\\partial \\mathbf{A}_{n,m}}\n", "\\end{bmatrix}\n", "\\end{equation*}\n", "\n", "Let $\\mathbf{X} \\in \\mathbb{R}^{N\\times d}$ be a matrix whose rows are the observations of the dataset (sometimes also called the *design matrix*) and let $\\mathbf{y} \\in \\mathbb{R}^{N}$ be the vector consisting of all values of $y^{(i)}$ (i.e., $\\mathbf{X}^{(i,:)} = \\mathbf{x}^{(i)}$ and $\\mathbf{y}^{(i)} = y^{(i)}$). It can be verified that: \n", "\n", "\\begin{equation}\n", "J(\\mathbf{w}) = \\frac{1}{N}(\\mathbf{X}\\mathbf{w} - \\mathbf{y})^{T}(\\mathbf{X}\\mathbf{w} - \\mathbf{y})\n", "\\end{equation}\n", "\n", "Using basic matrix derivative concepts we can compute the gradient of $J(\\mathbf{w})$ with respect to $\\mathbf{w}$:\n", "\n", "\\begin{equation}\n", "\\nabla_{\\mathbf{w}}J(\\mathbf{w}) = \\frac{2}{N} (\\mathbf{X}^{T}\\mathbf{X}\\mathbf{w} -\\mathbf{X}^{T}\\mathbf{y}) \n", "\\end{equation}\n", "\n", "Thus, when $\\nabla_{\\mathbf{w}}J(\\mathbf{w}) = 0$ we have \n", "\n", "\\begin{equation}\n", "\\mathbf{X}^{T}\\mathbf{X}\\mathbf{w} = \\mathbf{X}^{T}\\mathbf{y}\n", "\\end{equation}\n", "\n", "Hence,\n", "\n", "\\begin{equation}\n", "\\mathbf{w} = (\\mathbf{X}^{T}\\mathbf{X})^{-1}\\mathbf{X}^{T}\\mathbf{y}\n", "\\end{equation}\n", "\n", "Note that this solution has a high computational cost. As the number of variables (*features*) increases, the cost for matrix inversion becomes prohibitive. See [this text](http://cs229.stanford.edu/notes/cs229-notes1.pdf) for more details." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 1\n", "Using only **NumPy** (a quick introduction to this library can be found [here](http://cs231n.github.io/python-numpy-tutorial/)), complete the two functions below. Recall that $\\mathbf{X} \\in \\mathbb{R}^{N\\times d}$; thus you will need to add a component of value 1 to each of the observations in $\\mathbf{X}$ before performing the computation described above.\n", "\n", "NOTE: Although the dataset above has data of dimension $d=1$, your code must be generic (it should work for $d\\geq1$)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def normal_equation_weights(X, y):\n", " \"\"\"\n", " Calculates the weights of a linear function using the normal equation method.\n", " You should add into X a new column with 1s.\n", "\n", " :param X: design matrix\n", " :type X: np.ndarray(shape=(N, d))\n", " :param y: regression targets\n", " :type y: np.ndarray(shape=(N, 1))\n", " :return: weight vector\n", " :rtype: np.ndarray(shape=(d+1, 1))\n", " \"\"\"\n", " \n", " # START OF YOUR CODE:\n", " raise NotImplementedError(\"Function normal_equation_weights() is not implemented\")\n", " # END YOUR CODE\n", "\n", " return w" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# test of function normal_equation_weights()\n", "\n", "w = 0 # this is not necessary\n", "w = normal_equation_weights(X, y)\n", "print(\"Estimated w = \", w)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def normal_equation_prediction(X, w):\n", " \"\"\"\n", " Calculates the prediction over a set of observations X using the linear function\n", " characterized by the weight vector w.\n", " You should add into X a new column with 1s.\n", "\n", " :param X: design matrix\n", " :type X: np.ndarray(shape=(N, d))\n", " :param w: weight vector\n", " :type w: np.ndarray(shape=(d+1, 1))\n", " :param y: regression prediction\n", " :type y: np.ndarray(shape=(N, 1))\n", " \"\"\"\n", " \n", " # START OF YOUR CODE:\n", " raise NotImplementedError(\"Function normal_equation_prediction() is not implemented\")\n", " # END YOUR CODE\n", " \n", " return prediction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "You can use the [$R^2$](https://pt.wikipedia.org/wiki/R%C2%B2) metric to evaluate how well the linear model fits the data.\n", "\n", "**It is expected that $𝑅^2$ is a value close to 0.5.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# test of function normal_equation_prediction()\n", "prediction = normal_equation_prediction(X, w)\n", "r_2 = r_squared(y, prediction)\n", "plot_points_regression(X,\n", " y,\n", " title='Real estate prices prediction',\n", " xlabel=\"m\\u00b2\",\n", " ylabel='$',\n", " prediction=prediction,\n", " legend=True,\n", " r_squared=r_2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Additional tests\n", "\n", "Let us compute a prediction for $x=650$\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Let us use the prediction function\n", "x = np.asarray([650]).reshape(1,1)\n", "prediction = normal_equation_prediction(x, w)\n", "print(\"Area = %.2f Predicted price = %.4f\" %(x[0], prediction))\n", "\n", "# another way of computing the same\n", "y = np.dot(np.asarray((1,x)), w)\n", "print(\"Area = %.2f Predicted price = %.4f\" %(x, y))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 2: Effect of the number of samples\n", "\n", "Change the number of samples $N$ and observe how processing time varies." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Testing different values for N\n", "X, y = get_housing_prices_data(N=1000000)\n", "init = time.time()\n", "w = normal_equation_weights(X, y)\n", "prediction = normal_equation_prediction(X,w)\n", "init = time.time() - init\n", "\n", "print(\"Execution time = {:.8f}(s)\".format(init))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 2: Effect of the data dimension\n", "\n", "Test your code for data with $𝑑>1$. You can create your own dataset (if you do so, you can share the code by posting it to the moodle's Forum -- only the code for the dataset generation!). If you have no idea on how to generate such dataset, you can use existing datasets such as the \n", " one in scikit-learn https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html#sklearn.datasets.load_boston\n", "\n", "If you used an existing dataset or one generated by a colleague, please acknowledge the fact clearly. Thanks!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Testing data with dimension d > 1" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" } }, "nbformat": 4, "nbformat_minor": 2 }