{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# MAC0459/MAC5865 - Data Science and Engineering\n", "\n", "### Welcome, everyone!\n", "\n", "### Go to https://app.sli.do/event/c6apbrqu and post your questions about the class." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "pcz285F9QO7V", "slideshow": { "slide_type": "slide" } }, "source": [ "# MAC0459 - Topics in Engineering and Data Science\n", "\n", "\n", "## Class 10: Exploratory Data Analysis\n", "\n", "- Reviewing Histogram vs Barplot\n", "- Scatterplot\n", "- Other plots\n", "- Beautiful Visualization" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Some words about color palettes\n", "## Why is it very difficult to use and combine colors?\n", "\n", "- A grayscale representation of light intensity forms a chain, i.e., a totally ordered set.\n", "- \"Color\" is usually a tristimulus system.\n", "- If color is represented by Red, Green and Blue (RGB), each in $[0,255]$, the space of all possible values can be represented by a complete lattice. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Some words about color palettes\n", "- The problem with complete lattices is that some colors cannot be compared.\n", "- A color palette is a finite set of colors that are present in an image.\n", "- In image visualization, a color palette is a map that imposes a total order on a set of colors." 
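] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Some words about color palettes\n", "The cell below is a minimal sketch of why some colors cannot be compared. It assumes the componentwise order on RGB triples, which is only one possible lattice order; the helper `leq` is a hypothetical name introduced for illustration." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "import numpy as np\n", "\n", "# Hypothetical helper: componentwise order on RGB triples (an assumption, one possible choice)\n", "def leq(a, b):\n", "    return bool(np.all(np.asarray(a) <= np.asarray(b)))\n", "\n", "red, yellow, blue = (255, 0, 0), (255, 255, 0), (0, 0, 255)\n", "\n", "# Red and yellow are comparable under this order\n", "print(leq(red, yellow))\n", "# Neither relation holds for red and blue: they are incomparable\n", "print(leq(red, blue), leq(blue, red))\n", "\n", "# The supremum and infimum of any two colors still exist (lattice property)\n", "print(np.maximum(red, blue), np.minimum(red, blue))" 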
] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "import seaborn as sns\n", "import numpy as np\n", "\n", "# Return a list of colors defining a color palette.\n", "current_palette = sns.color_palette()\n", "sns.palplot(current_palette)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Qualitative palettes\n", "\n", "- Categorical datasets \n", "- No specific order" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "cmap = sns.choose_colorbrewer_palette('qualitative')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Diverging palettes\n", "\n", "- Non-categorical datasets.\n", "- A specific order.\n", "- Low and high values are of equal interest." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "dmap = sns.choose_colorbrewer_palette('diverging')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Sequential palettes\n", "\n", "- Categorical or non-categorical datasets \n", "- Some specific order" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "smap = sns.choose_colorbrewer_palette('sequential')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Be considerate of color-blind people\n", "- https://en.wikipedia.org/wiki/Color_blindness\n", "- Choose a color-blind-friendly palette so that everyone can read your graphics." 
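] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "As a minimal sketch, assuming seaborn's built-in `colorblind` palette suits your figure, the cell below displays the default palette followed by the color-blind-friendly one so they can be compared." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "import seaborn as sns\n", "\n", "# Default palette, then seaborn's built-in color-blind-friendly palette\n", "sns.palplot(sns.color_palette())\n", "sns.palplot(sns.color_palette('colorblind'))" 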
] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "import pandas as pd\n", "\n", "affairs = pd.read_csv('http://vision.ime.usp.br/~hirata/Fair.csv')\n", "\n", "# Restrict to numeric columns: recent pandas versions raise an error on non-numeric ones\n", "affairs.select_dtypes('number').corr()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "sns.set_palette('colorblind')\n", "sns.heatmap(affairs.select_dtypes('number').corr())" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "affairs[['age','ym','religious','rate','nbaffairs']].corr()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "sns.heatmap(affairs[['age','ym','religious','rate','nbaffairs']].corr(), cmap='coolwarm')" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "OyS0BARSQO7c", "slideshow": { "slide_type": "slide" } }, "source": [ "# Histogram vs Barplot\n", "\n", "## Just for you to remember\n", "- Histogram - numerical values\n", "\n", "- Barplot - categorical values" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "4rxN7hQ6cLzK", "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "%matplotlib inline \n", "import matplotlib.pyplot as plt\n", "\n", "import warnings; warnings.simplefilter('ignore')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 277 }, "colab_type": "code", "id": "8ZlayVb3QO78", "outputId": "52dd42f1-4b02-40cc-bc1f-5b04deb4a05d", "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "import pandas as pd\n", "\n", "affairs = pd.read_csv('http://vision.ime.usp.br/~hirata/Fair.csv')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## 
Seaborn distplot\n", "- distplot can show a histogram, or a bar plot\n", "- boxplot can show a boxplot" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 441 }, "colab_type": "code", "id": "ydvxF2mEgKo9", "outputId": "58ae01f1-a6fe-4372-e39a-68c9306b4961", "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# Histogram\n", "\n", "import seaborn as sns\n", "import numpy as np\n", "\n", "sns.distplot(affairs['age'], rug=True, kde=False);\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# Bar plot\n", "\n", "sns.distplot(affairs['occupation'], rug=True, kde=False);" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Boxplot\n", "\"Drawing\"\n", "- Source: XKCD boxplot" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "affairs.groupby('sex')['occupation'].plot.hist(alpha=0.5)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Compare with boxplot\n", "\n", "This example shows that we can sometimes break the rules, as long as we do so consciously. However, keep in mind that some marks then make no sense, for instance the median line or the outlier marks." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "sns.boxplot(x='sex', y='occupation', data=affairs)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## EDA - Find structure in the dataset\n", "- Scatterplot\n", " - Show relations between two variables\n", " \"Drawing\"\n", " - Source XKCD - Correlation vs Causation" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Remember last class\n", "- We found the scatterplot below very strange" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# Pass x and y as keywords: recent seaborn versions reject positional data arguments\n", "sns.jointplot(x=affairs['ym'], y=affairs['nbaffairs'])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## The next slides just show some nice plots\n", "\n", "First, is ym correlated with nbaffairs?\n", "\n", "$$\\rho(\\mathbf{X},\\mathbf{Y}) = \\frac{E[(\\mathbf{X}-\\mu_\\mathbf{X})(\\mathbf{Y}-\\mu_\\mathbf{Y})]}{\\sigma_\\mathbf{X} \\sigma_\\mathbf{Y}}$$" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "sns.jointplot(x=affairs['ym'], y=affairs['nbaffairs'], kind='reg')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true, "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "sns.jointplot(x=affairs['ym'], y=affairs['age'], kind='reg')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## That reminds me of another XKCD joke\n", "\n", "\"Drawing\"\n", "- Source XKCD - Correlation and Constellations\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "spellman = 
pd.read_csv('http://www.exploredata.net/ftp/Spellman.csv')\n", "\n", "spellman.info()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Pandas' Histogram\n", "- Fast" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "spellman['40'].hist()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Seaborn's histogram\n", "- Slow \n", "- Approx. 12 seconds on an i7 with 16 GB of RAM " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "sns.distplot(spellman['40'], rug=True, kde=False);" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Seaborn's sampling alternative\n", "- Because seaborn commands can be slow, we can use the pandas method sample to plot only a random subset of the dataset.\n", "- sample(n=N)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "sns.distplot(spellman['40'].sample(n=500), rug=True, kde=False);" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Last class's scatter plot\n", "- We can use the command jointplot to present a scatter plot of two variables." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# Pass x and y as keywords: recent seaborn versions reject positional data arguments\n", "sns.jointplot(x=spellman['40'], y=spellman['50'], kind='reg')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Your mileage may vary\n", "- For some reason, this command does not work on Windows, only on Linux. \n", "- relplot can also be used to draw a scatterplot." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "sns.relplot(x='40',y='50',data=spellman)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Iris flower dataset\n", "- The Iris flower dataset is a classic dataset for machine learning\n", "- https://en.wikipedia.org/wiki/Iris_flower_data_set\n", "- The attributes are the length and width of the sepal and petal for three species of iris flowers.\n", "- An interesting scatterplot is the pairplot, a matrix where each variable is plotted against every other variable. The diagonal shows the distribution of each variable." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "df = sns.load_dataset(\"iris\")\n", "sns.pairplot(df, hue=\"species\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## This plot can be cumbersome\n", "- If the number of variables is large, we will not be able to see anything." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "from pandas.plotting import scatter_matrix\n", "# Adjust the size of the figure\n", "plt.rcParams['figure.figsize'] = [15, 15]\n", "scatter_matrix(spellman)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Prepare the pairplot \n", "- Select only some columns" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "spellman.columns = ['time','40','50','60','70','80','90','100','110','120','130','140','150','160','170','180','190','200','210','220','230','240','250','260']" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# Adjust the size of the figure\n", "plt.rcParams['figure.figsize'] = [10, 10]\n", "scatter_matrix(spellman.loc[:,'40':'100'])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Showing more than one boxplot in the same figure\n", "- Sometimes we want to compare more than one boxplot" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "plt.rcParams['figure.figsize'] = [15, 10]\n", "sns.boxplot(data=spellman,width=1)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Next slides are based on Yeseul Lee's material (Skiena's Data Science Course)\n", "\n", "- Read and learn how to create and present nice graphics. 
\n", "- Pay attention to how the data and labels are prepared.\n", "\n", "https://github.com/yeseullee/Data-science-design-manual-notebooks/blob/master/Chapter4.ipynb\n", "\n", "https://github.com/yeseullee/Data-science-design-manual-notebooks/blob/master/Chapter6.ipynb" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tips that came up on Slido\n", "\n", "https://www.researchgate.net/publication/221517808_Useful_Junk_The_effects_of_visual_embellishment_on_comprehension_and_memorability_of_charts\n", "\n", "https://miami.pure.elsevier.com/en/publications/graphics-lies-misleading-visuals-reflections-on-the-challenges-an\n", "\n", "https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003833" ] } ], "metadata": { "celltoolbar": "Slideshow", "colab": { "name": "MAC0459-Class4.ipynb", "provenance": [], "version": "0.3.2" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" }, "latex_envs": { "LaTeX_envs_menu_present": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "toc": { "colors": { "hover_highlight": "#DAA520", "running_highlight": "#FF0000", "selected_highlight": "#FFD700" }, "moveMenuLeft": true, "nav_menu": { "height": "263.2px", "width": "252px" }, "navigate_menu": true, "number_sections": true, "sideBar": true, "threshold": 4, "toc_cell": false, "toc_section_display": "block", "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 1 }