{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# MAC0459/MAC5865 - Data Science and Engineering\n", "\n", "### Welcome, everyone!\n", "\n", "### Go to https://app.sli.do/event/ipitzcm3 and post your questions about the class." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "pcz285F9QO7V", "slideshow": { "slide_type": "slide" } }, "source": [ "# MAC0459 - Data Science and Engineering\n", "\n", "\n", "## Class 9: Exploratory Data Analysis\n", "\n", "https://app.sli.do/event/ipitzcm3\n", "\n", "- Histogram vs Barplot\n", "- EDA to find structure in the data\n", "- Boxplot and violin plot\n", "- Correlation" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "from PIL import Image\n", "import matplotlib as mtplb\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "\n", "# https://stackoverflow.com/questions/7391945/how-do-i-read-image-data-from-a-url-in-python\n", "\n", "import shutil\n", "import requests\n", "\n", "# Download the image and save it to disk\n", "lena_url = 'https://imagej.nih.gov/ij/images/lena.jpg'\n", "response = requests.get(lena_url, stream=True)\n", "with open('lena.jpg', 'wb') as file:\n", "    shutil.copyfileobj(response.raw, file)\n", "del response" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# Open the image and convert it to 8-bit grayscale\n", "im = Image.open(\"lena.jpg\").convert('L')\n", "im.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# Number of pixels at each gray level\n", "histLena = np.array(im.histogram())\n", "plt.plot(histLena,'.')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "blobs_url = 'https://imagej.nih.gov/ij/images/blobs.gif'\n", "response = requests.get(blobs_url, stream=True)\n", "with open('blobs.gif', 
'wb') as file:\n", "    shutil.copyfileobj(response.raw, file)\n", "del response" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "im = Image.open(\"blobs.gif\").convert('L')\n", "im.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "histBlobs = np.array(im.histogram())\n", "plt.plot(histBlobs,'.')" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "OyS0BARSQO7c", "slideshow": { "slide_type": "slide" } }, "source": [ "## Histogram vs Barplot\n", "\n", "- Histogram - numerical values, grouped into bins\n", "\n", "- Barplot - categorical values, one bar per category" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "26mrECh1B5bP", "slideshow": { "slide_type": "slide" } }, "source": [ "## A Theory of Extramarital Affairs\n", "\n", "Ray Fair\n", "\n", "https://fairmodel.econ.yale.edu/rayfair/pdf/1978a200.pdf\n", "\n", "- \"The purpose of this paper is to consider the determinants of leisure time spent in one particular type of activity with nonhousehold members: extramarital affairs.\"\n", "\n", "- \"Some data are available from two recent magazine surveys, conducted respectively by Psychology Today and Redbook...\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "4rxN7hQ6cLzK", "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "%matplotlib inline \n", "import matplotlib.pyplot as plt\n", "\n", "import warnings; warnings.simplefilter('ignore')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 277 }, "colab_type": "code", "id": "8ZlayVb3QO78", "outputId": "52dd42f1-4b02-40cc-bc1f-5b04deb4a05d", "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "import pandas as pd\n", "\n", "affairs = 
pd.read_csv('http://vision.ime.usp.br/~hirata/Fair.csv')\n", "\n", "affairs.info()\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 294 }, "colab_type": "code", "id": "DesbJBOIB_1l", "outputId": "ffc62364-9a48-4acf-f786-8d7303cff574", "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "affairs.describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 35 }, "colab_type": "code", "id": "FkaNWcp1-9G2", "outputId": "fd83defc-a29a-4f5f-84dd-1ec4fea9d2ac", "scrolled": true, "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "affairs['nbaffairs'].value_counts()\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Seaborn distplot\n", "- distplot can show a histogram, or a bar plot" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 441 }, "colab_type": "code", "id": "ydvxF2mEgKo9", "outputId": "58ae01f1-a6fe-4372-e39a-68c9306b4961", "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "import seaborn as sns\n", "import numpy as np\n", "\n", "sns.distplot(affairs['nbaffairs'], rug=True, kde=False);\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 173 }, "colab_type": "code", "id": "_z9y7UXdA85h", "outputId": "7cfcdeae-1a86-4f19-c044-72db95271ba7", "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "affairs['ym'].describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "sns.distplot(affairs['ym'], rug=True, kde=False);" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## EDA - Find structure in the 
dataset\n", "- Mainly structure that is not supposed to be there\n", "- \"I would like to convince you that the histogram is old-fashioned...\" John Tukey.\n", "- The histogram may be old-fashioned, but it is still a good tool.\n", "- How about combining two or more variables?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# Pass the columns by keyword: recent seaborn versions no longer accept x and y positionally\n", "sns.jointplot(x='ym', y='nbaffairs', data=affairs)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Structure in the data found: ALERT!\n", "\n", "- Two things stand out:\n", " - The points fall on a strange lattice\n", " - Shouldn't the number of affairs be uncorrelated with years of marriage?\n", " \n", "- Time to check if we understood the dataset :-)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## The data\n", "- First survey - 1969 - Psychology Today\n", " - About 20000 replies\n", " - About 2000 were coded onto tape\n", "- Second survey - women only - 1974 - Redbook\n", " - About 100000 replies\n", " - About 18000 were coded onto tape\n", "- https://fairmodel.econ.yale.edu/vote2012/affairs.txt" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## The data details\n", "- How often did the respondent engage in extramarital sexual intercourse during the past year (nbaffairs)?\n", " - 0 = none\n", " - 1 = once\n", " - 2 = twice\n", " - 3 = 3 times\n", " - 7 = 4-10\n", " - 12 = monthly\n", " - 12 = weekly\n", " - 12 = daily" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "sns.distplot(affairs['nbaffairs'],kde=False, bins=50,rug=True)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## The data details\n", "- Age?\n", " - 17.5 = under 20\n", " - 22.0 = 20 - 24\n", " 
- 27.0 = 25 - 29\n", " - 32.0 = 30 - 34\n", " - 37.0 = 35 - 39\n", " - 42.0 = 40 - 44\n", " - 47.0 = 45 - 49\n", " - 52.0 = 50 - 54\n", " - 57 = 55 or over" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 480 }, "colab_type": "code", "id": "QP1vABfZhHS3", "outputId": "8325af6c-34cc-43d3-9004-86414baf5c49", "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "sns.distplot(affairs['age'],kde=False, bins=50,rug=True)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## The data details\n", "- Years of marriage\n", " - .125 = 3 months or less\n", " - .417 = 4-6 months\n", " - .75 = 6 months - 1 year\n", " - 1.5 = 1 - 2 years\n", " - 4.0 = 3 - 5 years\n", " - 7.0 = 6 - 8 years\n", " - 10.0 = 9 - 11 years\n", " - 15.0 = 12 or more" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "sns.distplot(affairs['ym'],kde=False, bins=50,rug=True)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## The AHA, OHOH genes\n", "- NIH challenges Dr. Dougherty to create a method to detect gene expression better than state of the art methods\n", "- They create a microarray experiment (real images) and send to Dr. 
Dougherty's lab\n", "- To make sure the method is working, they make some genes up: AHA must be detected as overexpressed; OHOH must be detected as underexpressed\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Bar plot for categorical data\n", "\n", "- seaborn's distplot can also be used to draw a bar plot" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 139 }, "colab_type": "code", "id": "0XZqo-APL9Pc", "outputId": "755b947e-1e1a-49fc-9d9a-92fe27f3ed1b", "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "affairs.groupby('religious').size()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "sns.distplot(affairs['religious'], rug=True, kde=False);" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## The next slides just show some nice plots\n", "\n", "First, is ym correlated with nbaffairs?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# Scatter plot with a fitted regression line; columns are passed by keyword\n", "sns.jointplot(x='ym', y='nbaffairs', data=affairs, kind='reg')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "sns.jointplot(x='ym', y='age', data=affairs, kind='reg')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Boxplot\n", "- Another way to present a distribution\n", "\n", "https://gawron.sdsu.edu/python_for_ss/course_core/book_draft/visualization/boxplot.html" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "sns.boxplot(x=\"age\", data=affairs);" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "sns.distplot(affairs['age'],kde=False, bins=50,rug=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "sns.boxplot(x=\"sex\", y=\"ym\", data=affairs)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "sns.boxplot(x=\"sex\", y=\"ym\", hue=\"child\", data=affairs)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "sns.boxplot(x=\"religious\", y=\"nbaffairs\", data=affairs)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Violin plot\n", "- Like a boxplot, but the shape of each violin shows the estimated density of the data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "sns.violinplot(x=\"religious\", y=\"nbaffairs\", 
hue=\"sex\", data=affairs, split=True);" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "sns.violinplot(x=\"occupation\", y=\"nbaffairs\", hue=\"sex\", data=affairs, split=True);" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Reverse numbering in the dataset\n", " \n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "sns.violinplot(x=\"education\", y=\"nbaffairs\", hue=\"sex\", data=affairs, split=True);" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Multivariate data\n", "- Real problems usually deal with multivariate data\n", "- One interesting tool to analyse multivariate datasets is the correlation matrix\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Covariance matrix\n", "\n", "- Let $\\mathbf{X} = (X_1,X_2, \\ldots,X_d)$ be a multivariate random variable\n", "\n", "- $\\mu=E[\\mathbf{X}]$ (mean vector)\n", "\n", "- $\\Sigma = Cov(\\mathbf{X})$ (covariance matrix)\n", "\n", "$$\n", "\\left[ \\begin{array}{cccc}\n", "E[(X_1-\\mu_1)^2] & E[(X_1-\\mu_1)(X_2-\\mu_2)] & \\ldots & E[(X_1-\\mu_1)(X_d-\\mu_d)] \\\\\n", "E[(X_2-\\mu_2)(X_1-\\mu_1)] & E[(X_2-\\mu_2)^2] & \\ldots & E[(X_2-\\mu_2)(X_d-\\mu_d)] \\\\\n", "\\vdots & \\vdots & \\ddots & \\vdots \\\\\n", "E[(X_d-\\mu_d)(X_1-\\mu_1)] & E[(X_d-\\mu_d)(X_2-\\mu_2)] & \\ldots &\n", "E[(X_d-\\mu_d)^2]\n", "\\end{array}\n", "\\right]\n", "$$\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Covariance matrix\n", "\n", "- $E[(X_i-\\mu_i)(X_j-\\mu_j)] = E[(X_j-\\mu_j)(X_i-\\mu_i)] = \\sigma^2_{ij}$\n", "\n", "- $\\sigma^2_{ii}$ is the variance of $X_i$\n", "\n", "- $\\sigma^2_{ij}$ is the covariance between $X_i$ and $X_j$\n", "\n", 
"- Example\n", "$$\n", "\\Sigma = \\left[ \\begin{array}{ccc}\n", "\\sigma^2_{11} & \\sigma^2_{12} & \\sigma^2_{13} \\\\\n", "\\sigma^2_{12} & \\sigma^2_{22} & \\sigma^2_{23} \\\\\n", "\\sigma^2_{13} & \\sigma^2_{23} & \\sigma^2_{33} \\\\\n", "\\end{array}\n", "\\right]\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Correlation matrix\n", "$$\n", "\\rho_{ij} = {\\sigma^2_{ij} \\over\n", " {\\sqrt{\\sigma^2_{ii}}\\sqrt{\\sigma^2_{jj}}}}\n", "$$\n", "\n", "- Example\n", "$$\n", "\\mathbf{R} = \\left[ \\begin{array}{ccc}\n", "1 & \\rho_{12} & \\rho_{13} \\\\\n", "\\rho_{12} & 1 & \\rho_{23} \\\\\n", "\\rho_{13} & \\rho_{23} & 1 \\\\\n", "\\end{array}\n", "\\right]\n", "$$\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# Keep only the numeric columns: recent pandas versions raise an error on non-numeric ones\n", "affairs.select_dtypes('number').corr()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "sns.heatmap(affairs.select_dtypes('number').corr(), cmap='coolwarm')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Some words about color palettes\n", "## Why is it very difficult to use and combine colors?\n", "\n", "- A grayscale representation of light intensity forms a chain, i.e., a totally ordered set.\n", "- \"Color\" is usually a tristimulus system.\n", "- If color is represented by Red, Green and Blue (RGB), each in $[0,255]$, the space of all possible values can be represented by a complete lattice. \n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Some words about color palettes\n", "- The problem with complete lattices is that they are only partially ordered, so some pairs of colors cannot be compared.\n", "- A color palette is a finite set of colors that are present in an image.\n", "- In image visualization, a color palette is a map that imposes a total order on a set of colors." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# Return a list of colors defining a color palette.\n", "current_palette = sns.color_palette()\n", "sns.palplot(current_palette)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Qualitative palettes\n", "\n", "- Categorical datasets \n", "- No specific order" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "cmap = sns.choose_colorbrewer_palette('qualitative')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Diverging palettes\n", "\n", "- Non-categorical datasets.\n", "- A specific order.\n", "- Low and high values are of equal interest." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "dmap = sns.choose_colorbrewer_palette('diverging')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Sequential palettes\n", "\n", "- Categorical or non-categorical datasets \n", "- Some specific order" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "smap = sns.choose_colorbrewer_palette('sequential')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Be considerate of color-blind readers\n", "- https://en.wikipedia.org/wiki/Color_blindness\n", "- Choose a color palette that helps color-blind people read your graphics." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# Keep only the numeric columns: recent pandas versions raise an error on non-numeric ones\n", "affairs.select_dtypes('number').corr()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "sns.set_palette('colorblind')\n", "# Note: heatmap takes its colors from its cmap argument, not from the palette set above\n", "sns.heatmap(affairs.select_dtypes('number').corr())" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "affairs[['age','ym','religious','rate','nbaffairs']].corr()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "sns.heatmap(affairs[['age','ym','religious','rate','nbaffairs']].corr(), cmap='coolwarm')" ] } ], "metadata": { "celltoolbar": "Slideshow", "colab": { "name": "MAC0459-Class4.ipynb", "provenance": [], "version": "0.3.2" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" }, "latex_envs": { "LaTeX_envs_menu_present": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "toc": { "colors": { "hover_highlight": "#DAA520", "running_highlight": "#FF0000", "selected_highlight": "#FFD700" }, "moveMenuLeft": true, "nav_menu": { "height": "263.2px", "width": "252px" }, "navigate_menu": true, "number_sections": true, "sideBar": true, "threshold": 4, "toc_cell": false, "toc_section_display": "block", "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 1 }