{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# MAC0460 / MAC5832 (2020)\n", "
\n", "\n", "## EP4: Logistic regression\n", "\n", "### Topics / concepts explored in this EP:\n", "\n", "- Implementation of the **logistic regression algorithm**, using the gradient descent technique\n", "- Application on binary classification of images, using their feature representation\n", "- Confusion matrix, effects of unbalanced classes\n", "\n", "Complete and submit this notebook. **PLEASE do no change the file name.**\n", "\n", "### Evaluation \n", "- Correctitude of the algorithms\n", "- Code\n", " - do not change the prototype of the functions\n", " - efficiency (you should avoid unnecessary loops; use matrix/vector computation with NumPy wherever appropriate)\n", " - cleanliness (do not leave any commented code or useless variables)\n", "- Appropriateness of the answers\n", "- File format: Complete and submit this notebook with the outputs of the execution. **Do no change the file name.**\n", "
\n", "\n", "### Hints\n", "- It might be wise to first make sure your implementation is correct. For instance, you can compare the results of your implementation and of the one implemented in scikit-learn. After you feel confident, paste your code in the notebook, and then run the rest of the code in the notebook.\n", "- If you face difficulties with Keras or other libraries used in this notebook, as well as clarity issues in the exercises in this notebook, post a message in the Forum for discussions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 1. Logistic regression \n", "\n", "Here we use the formulation described in the book [2].\n", "\n", "The cost function to be minimized is\n", "$$\n", "E_{in}(\\mathbf{w}) = \\frac{1}{N} \\sum_{i=1}^{N} \\ln(1 + e^{-y^{(i)} \\mathbf{w}^T \\mathbf{x}^{(i)}}) \\tag{1}\n", "$$\n", "\n", "Its gradient is given by\n", "\n", "$$\\nabla E_{in}(\\mathbf{w}) = - \\frac{1}{N}\\sum_{i=1}^{N} \\frac{y^{(i)} \\mathbf{x}^{(i)}}{1 + e^{y^{(i)} \\mathbf{w}^T \\mathbf{x}^{(i)}}} \\tag{2}$$\n", "\n", "The logistic (sigmoid) function is\n", "$$\\sigma(z) = \\frac{1}{1 + e^{-z}} \\tag{3}$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 1\n", "In the next three code cells, write the code to implement four functions that will be used below for logistic regression training and prediction. Use vectorial computation with NumPy." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cross-entropy loss and cross-entropy gradient\n", "The two functions in the following cell should implement, respectively, equations (1) and (2) above. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "def cross_entropy(w, X, y):\n", " \"\"\"\n", " Computes the loss (equation 1)\n", " :param w: weight vector\n", " :type: np.ndarray(shape=(1+d, 1))\n", " :param X: design matrix\n", " :type X: np.ndarray(shape=(N, 1+d))\n", " :param y: class labels\n", " :type y: np.ndarray(shape=(N, 1))\n", " :return loss: loss (equation 1)\n", " :rtype: float\n", " \"\"\" \n", " ### Your code begins here _____________________\n", " raise NotImplementedError(\"Function cross_entropy() is not implemented\")\n", " ### Your code ends here _____________________\n", " \n", "\n", "def cross_entropy_gradient(w, X, y):\n", " \"\"\"\n", " Computes the gradient of the loss function (equation 2)\n", " :param w: weight vector\n", " :type: np.ndarray(shape=(1+d, 1))\n", " :param X: design matrix\n", " :type X: np.ndarray(shape=(N, 1+d))\n", " :param y: class labels\n", " :type y: np.ndarray(shape=(N, 1))\n", " :return grad: gradient (equation 2)\n", " :rtype: float\n", " \"\"\" \n", " ### Your code begins here _____________________\n", " raise NotImplementedError(\"Function cross_entropy_gradient() is not implemented\")\n", " ### Your code ends here _____________________\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Logistic regression training\n", "\n", "The function below receives the data matrix X (shape = (N, d)) and the ouput vector y (shape = (N,)), and should return the final weight vector w and, optionally (when \n", "parameter return_history = True), a list of size num_iterations+1 with the cross-entropy loss values at the beginning and after each of the iterations.\n", "\n", "Note that the data matrix needs to be extended with a column of 1's." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def train_logistic(X, y, learning_rate = 1e-1, w0 = None,\\\n", " num_iterations = 1000, return_history = False):\n", " \"\"\"\n", " Computes the weight vector applying the gradient descent technique\n", " :param X: design matrix\n", " :type X: np.ndarray(shape=(N, d))\n", " :param y: class label\n", " :type y: np.ndarray(shape=(N, 1))\n", " :return: weight vector\n", " :rtype: np.ndarray(shape=(1+d, 1))\n", " :return: the history of loss values (optional)\n", " :rtype: list of float\n", " \"\"\" \n", " \n", " ### Your code begins here _____________________\n", " raise NotImplementedError(\"Function train_logistic() is not implemented\")\n", " ### Your code ends here _____________________\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Logistic regression prediction\n", "The function in the next cell will be used to do the prediction of logistic regression. Recall that the prediction is a score in $[0,1]$, given by the sigmoid value of the linear combination." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def sigmoid(z):\n", " return 1 / (1 + np.exp(-z))\n", "\n", "\n", "def predict_logistic(X, w):\n", " \"\"\"\n", " Computes the logistic regression prediction\n", " :param X: design matrix\n", " :type X: np.ndarray(shape=(N, d))\n", " :param w: weight vector\n", " :rtype: np.ndarray(shape=(1+d, 1))\n", " :return: predicted classes \n", " :rtype: np.ndarray(shape=(N, ))\n", " \"\"\" \n", " ### Your code begins here _____________________\n", " raise NotImplementedError(\"Function predict_logistic() is not implemented\")\n", " ### Your code ends here _____________________\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 2. MNIST Dataset #\n", "\n", "This is a well known dataset, commonly used as a first example of image classification tasks. We could say it is the \"Hello world!\" of image classification. It consists of handwritten digits, divided into $60000$ training images and $10000$ test images. All images are gray-scale (one channel with pixel intensities varying from 0 to 255) and have size $28 \\times 28$. There are 10 classes, corresponding to digits 0 to 9.\n", "\n", "The dataset is available in many places. Here we will use the one available with Keras [1]. 
\n", "More information on MNIST can be found at the [oficial site](http://yann.lecun.com/exdb/mnist/).\n", "\n", "[1]: François Chollet and others, Keras, https://keras.io, 2015" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from tensorflow.keras.datasets import mnist\n", "\n", "(X_train_all, y_train_all), (X_test_all, y_test_all) = mnist.load_data()\n", "\n", "print(X_train_all.shape, y_train_all.shape)\n", "print(X_test_all.shape, y_test_all.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Class distribution of MNIST (training set)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "\n", "unique, counts = np.unique(y_train_all, return_counts=True)\n", "plt.bar(unique, counts, 0.5)\n", "plt.title('Class Frequency')\n", "plt.xlabel('Class')\n", "plt.ylabel('Frequency')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 2\n", "Repeat **Class distribution of MNIST** for the testing set (next cell) and compare and comment about the distributions of the training and of the testing sets. Do you think this type of comparison is important? Comment.\n", "\n", "###Your comments here" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Your code here ___________________\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Visualization of some of the examples" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "\n", "fig, ax = plt.subplots(2, 3, figsize = (9, 6))\n", "\n", "for i in range(6):\n", " ax[i//3, i%3].imshow(X_train_all[i], cmap='gray')\n", " ax[i//3, i%3].axis('off')\n", " ax[i//3, i%3].set_title(\"Class: %d\"%y_train_all[i])\n", " \n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 3. Feature extraction\n", "\n", "Note that the images consist of 28×28=784 pixel values and they could be used as \"raw features\" of the images. However, here we will extract some features from the images and perform classification using them instead of the pixel values.\n", "\n", "### Mean intensity\n", "\n", "In the book _Learning from Data_ [2], one of the attributes (features) used by the authors is the mean intensity of pixel values. This feature is directly related to the proportion of the pixels corresponding to the digit in the image. For instance, it is reasonable to expect that a digit 5 or 2 occupies more pixels than the digit 1 and, therefore, the mean intensity of the first two should be larger than that of the digit 1.\n", "\n", "### Symmetry\n", "\n", "The second attribute used by the authors is horizontal symmetry.\n", "\n", "Symmetry will be defined in terms of asymetry. We define asymmetry as the pixelwise mean of the absolute difference between the pixels values from the original image and those from the corresponding horizotally flipped image. 
Then, symmetry is defined as the negative of asymmetry.\n", "\n", "[2]: Yaser S Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin, Learning from Data, 2012" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "def mean_intensity(image):\n", " return np.mean(image)\n", "\n", "def Hsimmetry(image):\n", " # The processing below invert the order of the columns of the image\n", " reflected_image = image[:, ::-1]\n", " return -np.mean(np.abs(image - reflected_image))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pixels $\\rightarrow$ Features ##\n", "\n", "The above functions for feature extraction will be applied to the samples, both on the training and the test sets. After the feature extraction process below, each image will be represented by two features." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 3\n", "\n", "Fill in the appropriate places in the two code cells below as indicated, to print the mean value of each feature in the training and test sets. Compare the mean values on training and test sets after normalization and comment on them.\n", "\n", "\n", "###Your comments here" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "# Function that converts an image into a list of features,\n", "# using the feature computation functions defined above\n", "def convert2features(image):\n", " return np.array([mean_intensity(image),\n", " Hsimmetry(image)])\n", "\n", "# feature names\n", "F = ['Mean intensity', 'Hsimmetry']\n", "\n", "# Generate the feature representation for all images\n", "X_train_features = np.array([convert2features(image) for image in X_train_all])\n", "X_test_features = np.array([convert2features(image) for image in X_test_all])\n", "\n", "print(X_train_features.shape)\n", "print(X_test_features.shape)\n", "\n", "# print the mean value of each of the features\n", "### Your code here __________________________\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Normalization** of feature values is a common procedure. Here we apply the z-score formula. Note that the normalization parameters (mean and standard deviation) are computed only on training data. To normalize the test data, we use the same parameters.\n", "(We suggest you to think why we should not compute the mean and standard deviation over the training+test set. There is no need to answer this here.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Adjust the scale of feature values; standardize them.\n", "# (Yes, the features in the test set should be standardized using\n", "# the statistics of the features in the training set) -- why ??\n", "for i in range(X_train_features.shape[1]):\n", " avg = np.mean(X_train_features[:, i])\n", " stddev = np.std(X_train_features[:, i])\n", " X_train_features[:, i] = (X_train_features[:, i] - avg) / stddev\n", " X_test_features[:, i] = (X_test_features[:, i] - avg) / stddev\n", "\n", "print(X_train_features.shape)\n", "print(X_test_features.shape)\n", "\n", "# print the mean value of each of the features\n", "print(\"Training set, after normalization:\")\n", "### Your code here __________________________\n", " \n", "print(\"Testing set, after normalization:\")\n", "### Your code here __________________________\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 4. 
Logistic regression training and testing\n", "\n", "## 4.1 Select a subset from two of the classes\n", "\n", "Here we select two classes as well as a subset of the examples in each class. All code from here on will use the selected subset. You may later change the selected classes and the number of samples in each class." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Select two classes" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "P = 5 # positive class\n", "N = 1 # negative class\n", "\n", "X_train_P = X_train_features[y_train_all == P]\n", "X_train_N = X_train_features[y_train_all == N]\n", "y_train_P = y_train_all[y_train_all == P]\n", "y_train_N = y_train_all[y_train_all == N]\n", "\n", "X_test_P = X_test_features[y_test_all == P]\n", "X_test_N = X_test_features[y_test_all == N]\n", "y_test_P = y_test_all[y_test_all == P]\n", "y_test_N = y_test_all[y_test_all == N]\n", "\n", "print(\"Positive class: \", X_train_P.shape, y_train_P.shape)\n", "print(\"Negative class: \", X_train_N.shape, y_train_N.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Select a subset of examples from each of the two classes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 4\n", "In the following cell, write the code to change the label of the positive class to +1 and of the negative class to -1. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Number of positives and negatives to be effectively considered\n", "# in the training data to be explored in the remainder of this notebook\n", "nP = 100\n", "nN = 100\n", "\n", "X_train = np.concatenate([X_train_P[:nP], X_train_N[:nN]], axis = 0)\n", "y_train = np.concatenate([y_train_P[:nP], y_train_N[:nN]], axis = 0).astype('float32')\n", "\n", "X_test = np.concatenate([X_test_P, X_test_N], axis = 0)\n", "y_test = np.concatenate([y_test_P, y_test_N], axis = 0).astype('float32')\n", "\n", "\n", "# Change positive class label to +1 and negative class label to -1\n", "### Your code begins here _____________________\n", "\n", "### Your code ends here _____________________\n", "\n", "\n", "# Shuffle\n", "np.random.seed(56789)\n", "def shuffle(X, y):\n", " # inputs and outputs must be shuffled with the same permutation\n", " perm = np.random.permutation(len(X))\n", " return X[perm], y[perm]\n", "\n", "X_train, y_train = shuffle(X_train, y_train)\n", "X_test, y_test = shuffle(X_test, y_test)\n", "\n", "print(X_train.shape, y_train.shape)\n", "print(X_test.shape, y_test.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Plot the selected data\n", "\n", "Let us plot the selected subset of data. Negative examples will be plotted in red and positive ones in blue."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def plot_features(ax,X,y):\n", " # negatives in red\n", " ax.scatter(X[y==-1,0], \\\n", " X[y==-1,1], \\\n", " label='Class=%d'%N, c = 'red', alpha = 0.4)\n", "\n", " # and positives in blue\n", " ax.scatter(x=X[y==1,0], \\\n", " y=X[y==1,1], \\\n", " label='Class=%d'%P, c = 'blue', alpha = 0.4)\n", "\n", " ax.set_xlabel(F[0])\n", " ax.set_ylabel(F[1])\n", " ax.legend(loc='best')\n", " \n", "fig = plt.figure(figsize=(6,6))\n", "ax = fig.add_subplot(111)\n", "plot_features(ax,X_train,y_train)\n", "plt.show() \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4.2 Training" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 5\n", "Run the code in the following cell a few times, each time with different values for the learning rate and the number of iterations. Comment the behavior of the loss curve. Which values do you consider as good choices?\n", "\n", "**Note:** for your submission, keep the execution output corresponding to the best parameter values you have found.\n", " \n", "###Your comments here\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "np.random.seed(56789)\n", "w_logistic, loss = train_logistic(X_train, y_train,\\\n", " learning_rate = 0.005,\n", " num_iterations = 1000,\\\n", " return_history = True)\n", "\n", "print(\"Final weight:\\n\", w_logistic)\n", "print(\"Final loss:\\n\", loss[-1])\n", "\n", "plt.figure(figsize = (12, 8))\n", "plt.plot(loss)\n", "plt.xlabel('Iteration #')\n", "plt.ylabel('Cross Entropy')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Ploting the scores and decision boundary" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from matplotlib.lines import Line2D\n", "\n", "y_pred = predict_logistic(X_train, w_logistic)\n", "\n", "fig = plt.figure(figsize=(20,10))\n", "ax1 = fig.add_subplot(121)\n", "\n", "ax1.scatter(x = X_train[:,0], y = X_train[:,1],\\\n", " c = -y_pred, cmap = 'coolwarm')\n", "legend_elements = [Line2D([0], [0], marker='o', color='r',\\\n", " label='Class %d'%N, markerfacecolor='r',\\\n", " markersize=10),\\\n", "Line2D([0], [0], marker='o', color='b',\\\n", " label='Class %d'%P, markerfacecolor='b',\\\n", " markersize=10)]\n", "ax1.set_xlabel(F[0])\n", "ax1.set_ylabel(F[1])\n", "ax1.legend(handles=legend_elements, loc='best')\n", " \n", "ax2 = fig.add_subplot(122)\n", "plot_features(ax2,X_train,y_train)\n", "\n", "p1 = (-2, -(w_logistic[0] - 2*w_logistic[1])/w_logistic[2])\n", "p2 = (2, -(w_logistic[0] + 2*w_logistic[1])/w_logistic[2])\n", "\n", "lines = ax2.plot([p1[0], p2[0]], [p1[1], p2[1]], '-')\n", "plt.setp(lines, color='g', linewidth=4.0)\n", "\n", "plt.show()\n", " \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Confusion matrix\n", "\n", "Recall that the logistic regression returns a score $\\hat{p}$ in $[0,1]$, which can be interpreted as the probability $p(y=1|\\mathbf{x})$. To compute the confusion matrix, one needs to choose a threshold value $T$ to decide the final class label (that is $\\hat{y} = 1 \\Longleftrightarrow \\hat{p} \\geq T$ )." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import seaborn as sn\n", "import pandas as pd\n", "\n", "def plot_confusion_matrix(y, y_pred):\n", " \"\"\"\n", " It receives an array with the ground-truth (y)\n", " and another with the prediction (y_pred), both with binary labels\n", " (positve=+1 and negative=-1) and plots the confusion\n", " matrix.\n", " It uses P (positive class id) and N (negative class id)\n", " which are \"global\" variables ...\n", " \"\"\"\n", " TP = np.sum((y_pred == 1) * (y == 1))\n", " TN = np.sum((y_pred == -1) * (y == -1))\n", "\n", " FP = np.sum((y_pred == 1) * (y == -1))\n", " FN = np.sum((y_pred == -1) * (y == 1))\n", "\n", " total = TP+FP+TN+FN\n", " print(\"TP = %4d FP = %4d\\nFN = %4d TN = %4d\"%(TP,FP,FN,TN))\n", " print(\"Accuracy = %d / %d (%f)\\n\" %((TP+TN),total, (TP+TN)/total))\n", " confusion = [\n", " [TP/(TP+FN), FP/(TN+FP)],\n", " [FN/(TP+FN), TN/(TN+FP)]\n", " ]\n", "\n", " df_cm = pd.DataFrame(confusion, \\\n", " ['$\\hat{y} = %d$'%P, '$\\hat{y} = %d$'%N],\\\n", " ['$y = %d$'%P, '$y = %d$'%N])\n", " plt.figure(figsize = (8,4))\n", " sn.set(font_scale=1.4)\n", " sn.heatmap(df_cm, annot=True)\n", " plt.show()\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 6\n", "Play with the threshold value in the code (following cell). Did you manage to find a threshold value (other than 0.5) that improves accuracy? How threshold relates to TP, FP, TN and FN ? Comment.\n", "\n", "###Your comment here" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import seaborn as sn\n", "import pandas as pd\n", "\n", "threshold = 0.5\n", "\n", "p_hat = predict_logistic(X_train, w_logistic)\n", "y_hat = np.where(p_hat >= threshold, 1, -1)\n", "\n", "total = len(y_hat)\n", "\n", "plot_confusion_matrix(y_train, y_hat)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4.3 Testing\n", "\n", "### Exercise 7\n", "Repeat score and boundary ploting, as well as confusion matrix ploting with respect to test set. Comment what you observed.\n", "\n", "###Your comment here" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "### Your code for score and boundary plotting (for the test set)\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "### Your code for confusion matrix plotting (for the test set)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 8\n", "If you got to this point, make a copy of your notebook.\n", "Run the copy notebook changing the number of positive and negative examples. Try an unbalanced training set and observe if there are any effects in the accuracy on the test set. Additionaly, you may try with a different pair of classes. Summarize HERE the experiments you did and comment whatever you found interesting. 
There is no need to submit the copied notebook.\n", "\n", "### Your comments here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[1]: François Chollet and others, Keras, https://keras.io, 2015\n", "\n", "[2]: Yaser S Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin, _Learning from Data_, 2012" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 2 }