{ "cells": [ { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "pcz285F9QO7V", "slideshow": { "slide_type": "slide" } }, "source": [ "# MAC0459/MAC5865 - Data Science and Engineering\n", "\n", "### Sejam bem-vindas, sejam bem-vindos! \n", "\n", "### Entre no link https://app.sli.do/event/n58vfsrg faça suas perguntas da aula. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Class 8: Exploratory Data Analysis (EDA)\n", "\n", "- Mean\n", "- Median\n", "- Variance\n", "- Histogram" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "OyS0BARSQO7c", "slideshow": { "slide_type": "slide" } }, "source": [ "## Mean? What does the mean mean?\n", "\n", "- Given a probability density function $f$ of a continous random variable $X$, the expected value of $X$ is given by:\n", "$$\n", "E[X] = \\int \\limits_{-\\infty}^{\\infty} x f(x) dx\n", "$$\n", "- A continous random variable (CRV) takes all possible values in the domain where it is defined.\n", "- It usually represents a physical quantity: distance, temperature, pressure, weight, etc.\n", "- Mathematical properties on $f$ must hold. " ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "mTTO0JTeQO7i", "slideshow": { "slide_type": "slide" } }, "source": [ "## Mean? What does the mean mean?\n", "\n", "- Given a probability distribution function $P$ of a discrete random variable $X$, defined on a countable set $D$, the expected value of $X$ is given by:\n", "$$\n", "E[X] = \\sum \\limits_{x\\in D} x P(x)\n", "$$\n", "- A discrete random variable (DRV) usually represents a countable quantity: years, days, levels, etc.\n", "- Mathematical properties on $P$ must hold.\n", "- We never have access to $P$ in the real world. 
" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "26mrECh1B5bP", "slideshow": { "slide_type": "slide" } }, "source": [ "## A Theory of Extramarital Affairs\n", "\n", "Ray Fair\n", "\n", "https://fairmodel.econ.yale.edu/rayfair/pdf/1978a200.pdf" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 277 }, "colab_type": "code", "id": "8ZlayVb3QO78", "outputId": "52dd42f1-4b02-40cc-bc1f-5b04deb4a05d", "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "import pandas as pd\n", "from statistics import mean, median\n", "\n", "affairs = pd.read_csv('http://vision.ime.usp.br/~hirata/Fair.csv')\n", "\n", "affairs.info()\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 202 }, "colab_type": "code", "id": "Hx-oFIiF-tl3", "outputId": "2f7ee920-0169-4db9-f28e-229721469330", "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "affairs.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "affairs.tail()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 35 }, "colab_type": "code", "id": "FkaNWcp1-9G2", "outputId": "fd83defc-a29a-4f5f-84dd-1ec4fea9d2ac", "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "affairs['nbaffairs'].value_counts()\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 69 }, "colab_type": "code", "id": "r6ZfgqZv_Zcv", "outputId": "354cb1a3-b4fe-4fdf-e2a7-828d92a2c15c", "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "affairs['sex'].value_counts()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", 
"height": 35 }, "colab_type": "code", "id": "AxrXQRcG_ocI", "outputId": "45127f87-e80d-4a2f-a6b4-cf54d16b45ee", "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "mean(affairs['age'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 35 }, "colab_type": "code", "id": "CfWUlWMLCH5r", "outputId": "a52b4dfd-fc28-4860-bdd4-c5951db50ed1", "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "affairs['age'].mean()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 35 }, "colab_type": "code", "id": "2OjY1oR5_voO", "outputId": "8845215a-ce04-44bd-b60f-69b1c115ba14", "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "affairs['ym'].mean()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 35 }, "colab_type": "code", "id": "-ENsd4u2_3dt", "outputId": "500301a5-de5e-46f2-90f1-2fb7f5d19d4e", "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "affairs['rate'].mean()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 35 }, "colab_type": "code", "id": "SA-ch8t0Ahch", "outputId": "1d3bb332-15ce-4e31-fd63-6e1f884a6327", "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "affairs['age'].median()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 35 }, "colab_type": "code", "id": "8-oX8J_rArML", "outputId": "5915498a-2ed3-4910-dfa1-7eb605952485", "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "affairs['ym'].median()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 35 }, "colab_type": "code", "id": "QptizhX6A31_", 
"outputId": "449e4902-d8ad-4a89-95e6-2a5146d8128f", "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "affairs['ym'].max()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 173 }, "colab_type": "code", "id": "_z9y7UXdA85h", "outputId": "7cfcdeae-1a86-4f19-c044-72db95271ba7", "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "affairs['ym'].describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 294 }, "colab_type": "code", "id": "DesbJBOIB_1l", "outputId": "ffc62364-9a48-4acf-f786-8d7303cff574", "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "affairs.describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 294 }, "colab_type": "code", "id": "9y5Gg6MYDkjk", "outputId": "975bf1b8-4b11-40a4-c374-6f381eda9063", "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "affairs[affairs['sex'] == 'female'].describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 294 }, "colab_type": "code", "id": "sIYR8ivWDqME", "outputId": "c3dcdf5a-0a46-475f-82b5-35bb0aaf8def", "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "affairs[affairs['sex'] == 'male'].describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 69 }, "colab_type": "code", "id": "l7SUhB8BEMn5", "outputId": "a2c7a7ea-55c8-4191-d3c8-efe22fffc36d" }, "outputs": [], "source": [ "affairs['below_30'] = affairs['age'] < 30\n", "affairs['below_30'].value_counts()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 202 }, "colab_type": "code", "id": 
"Dpz0f50zEjQK", "outputId": "873f087e-e6ca-4bbc-b0fb-6208e3fe0c68" }, "outputs": [], "source": [ "affairs.tail()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 69 }, "colab_type": "code", "id": "lyXXd_QiEtFK", "outputId": "e8f6dd1a-dff5-423a-af74-cd4e2cab07bb" }, "outputs": [], "source": [ "affairs['affbw_30'] = (affairs['nbaffairs'] != 0) & affairs['below_30']\n", "affairs['affbw_30'].value_counts()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 202 }, "colab_type": "code", "id": "5oSr-QN_IBgl", "outputId": "8d172e1e-aa45-487a-ad87-b27e0a01b2dd" }, "outputs": [], "source": [ "affairs.tail()" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "mkJXvvGLHVQt" }, "source": [ "## Split-Apply-Combine" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 485 }, "colab_type": "code", "id": "Q3YEQol2HIlG", "outputId": "6bc909f7-49c7-4bdc-e994-9f4306b5dc37" }, "outputs": [], "source": [ "def draw_dataframe(df, loc=None, width=None, ax=None, linestyle=None,\n", " textstyle=None):\n", " loc = loc or [0, 0]\n", " width = width or 1\n", "\n", " x, y = loc\n", "\n", " if ax is None:\n", " ax = plt.gca()\n", "\n", " ncols = len(df.columns) + 1\n", " nrows = len(df.index) + 1\n", "\n", " dx = dy = width / ncols\n", "\n", " if linestyle is None:\n", " linestyle = {'color':'black'}\n", "\n", " if textstyle is None:\n", " textstyle = {'size': 12}\n", "\n", " textstyle.update({'ha':'center', 'va':'center'})\n", "\n", " # draw vertical lines\n", " for i in range(ncols + 1):\n", " plt.plot(2 * [x + i * dx], [y, y + dy * nrows], **linestyle)\n", "\n", " # draw horizontal lines\n", " for i in range(nrows + 1):\n", " plt.plot([x, x + dx * ncols], 2 * [y + i * dy], **linestyle)\n", "\n", " # Create index labels\n", " for i in 
range(nrows - 1):\n", " plt.text(x + 0.5 * dx, y + (i + 0.5) * dy,\n", " str(df.index[::-1][i]), **textstyle)\n", "\n", " # Create column labels\n", " for i in range(ncols - 1):\n", " plt.text(x + (i + 1.5) * dx, y + (nrows - 0.5) * dy,\n", " str(df.columns[i]), style='italic', **textstyle)\n", " \n", " # Add index label\n", " if df.index.name:\n", " plt.text(x + 0.5 * dx, y + (nrows - 0.5) * dy,\n", " str(df.index.name), style='italic', **textstyle)\n", "\n", " # Insert data\n", " for i in range(nrows - 1):\n", " for j in range(ncols - 1):\n", " plt.text(x + (j + 1.5) * dx,\n", " y + (i + 0.5) * dy,\n", " str(df.values[::-1][i, j]), **textstyle)\n", "\n", "\n", "#----------------------------------------------------------\n", "# Draw figure\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "%matplotlib inline\n", "\n", "df = pd.DataFrame({'data': [1, 2, 3, 4, 5, 6]},\n", " index=['A', 'B', 'C', 'A', 'B', 'C'])\n", "df.index.name = 'key'\n", "\n", "\n", "fig = plt.figure(figsize=(8, 6), facecolor='white')\n", "ax = plt.axes([0, 0, 1, 1])\n", "\n", "ax.axis('off')\n", "\n", "draw_dataframe(df, [0, 0])\n", "\n", "for y, ind in zip([3, 1, -1], 'ABC'):\n", " split = df[df.index == ind]\n", " draw_dataframe(split, [2, y])\n", "\n", " sum = pd.DataFrame(split.sum()).T\n", " sum.index = [ind]\n", " sum.index.name = 'key'\n", " sum.columns = ['data']\n", " draw_dataframe(sum, [4, y + 0.25])\n", " \n", "result = df.groupby(df.index).sum()\n", "draw_dataframe(result, [6, 0.75])\n", "\n", "style = dict(fontsize=14, ha='center', weight='bold')\n", "plt.text(0.5, 3.6, \"Input\", **style)\n", "plt.text(2.5, 4.6, \"Split\", **style)\n", "plt.text(4.5, 4.35, \"Apply (sum)\", **style)\n", "plt.text(6.5, 2.85, \"Combine\", **style)\n", "\n", "arrowprops = dict(facecolor='black', width=1, headwidth=6)\n", "plt.annotate('', (1.8, 3.6), (1.2, 2.8), arrowprops=arrowprops)\n", "plt.annotate('', (1.8, 1.75), (1.2, 1.75), arrowprops=arrowprops)\n", "plt.annotate('', (1.8, 
-0.1), (1.2, 0.7), arrowprops=arrowprops)\n", "\n", "plt.annotate('', (3.8, 3.8), (3.2, 3.8), arrowprops=arrowprops)\n", "plt.annotate('', (3.8, 1.75), (3.2, 1.75), arrowprops=arrowprops)\n", "plt.annotate('', (3.8, -0.3), (3.2, -0.3), arrowprops=arrowprops)\n", "\n", "plt.annotate('', (5.8, 2.8), (5.2, 3.6), arrowprops=arrowprops)\n", "plt.annotate('', (5.8, 1.75), (5.2, 1.75), arrowprops=arrowprops)\n", "plt.annotate('', (5.8, 0.7), (5.2, -0.1), arrowprops=arrowprops)\n", "\n", "plt.axis('equal')\n", "plt.ylim(-1.5, 5);" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 139 }, "colab_type": "code", "id": "0XZqo-APL9Pc", "outputId": "755b947e-1e1a-49fc-9d9a-92fe27f3ed1b", "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "affairs.groupby('religious').size()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 233 }, "colab_type": "code", "id": "mPQuPAstD06N", "outputId": "e7ad72a0-0399-43e4-e5c2-9bb9377240e1", "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# numeric_only=True restricts the aggregation to numeric columns;\n", "# without it, pandas >= 2.0 raises a TypeError on string columns such as 'sex'\n", "affairs.groupby('religious').mean(numeric_only=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 173 }, "colab_type": "code", "id": "xy5vz5Y3MUse", "outputId": "8e3b252a-d076-4639-dddb-b2384bdd0841", "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "affairs.groupby('occupation').size()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 294 }, "colab_type": "code", "id": "RmQxtX4rIxof", "outputId": "c0b1308d-de6b-4faa-8d55-6b56261b9a58", "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "affairs.groupby('occupation').mean(numeric_only=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/",
"height": 225 }, "colab_type": "code", "id": "JtAT05BZMbDU", "outputId": "c9116762-3b84-45fb-c225-85d69edc022d", "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "affairs.groupby(['sex','rate']).size()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 386 }, "colab_type": "code", "id": "clunHUFJKvNX", "outputId": "7445c7f3-30bf-4cc6-d2c8-c08a9a89ae4a", "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "affairs.groupby(['sex','rate']).mean(numeric_only=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 294 }, "colab_type": "code", "id": "UmoJUbgpJPjR", "outputId": "cf2d6939-d685-489a-cd79-7693d8308917", "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "affairs.groupby('occupation').median(numeric_only=True)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "HtX6RaKrO7Gq", "slideshow": { "slide_type": "slide" } }, "source": [ "## Variance\n", "\n", "- Given a probability distribution function $P$ of a discrete random variable $X$, defined on a countable set $D$, the variance of $X$ is given by:\n", "$$\n", "Var[X] = \\sum\\limits_{x\\in D} P(x)(x - E[X])^2\n", "$$\n", "- By default, the pandas methods `var()` and `std()` compute the sample estimates with `ddof=1`, that is, they divide by $n - 1$ instead of $n$.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 294 }, "colab_type": "code", "id": "wbn4hsyGI9IM", "outputId": "7fc46598-d654-451a-a49f-669a4fa6dab1", "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# for var() and std(), the numeric_only argument needs pandas >= 1.5\n", "affairs.groupby('occupation').var(numeric_only=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 294 }, "colab_type": "code", "id": "NWJaencvRgl3", "outputId": "c7e3ec8a-641a-4958-f556-f0d3a425df3d", "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "affairs.groupby('occupation').std(numeric_only=True)" ] }, { "cell_type": 
"markdown", "metadata": { "colab_type": "text", "id": "sECMBVZbUb_q", "slideshow": { "slide_type": "slide" } }, "source": [ "## Histogram\n", "- Another way to group **numeric** values\n", "- Usual algorithm:\n", "- - Input: Array of values, number of bins\n", "- - Output: Array of integer values with number of bins size\n", "- - Create and initialize the output array\n", "- - For each position of the input array, count its content in the correct position of the output array " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "4rxN7hQ6cLzK", "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "%matplotlib inline \n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from random import gauss, triangular, choice, vonmisesvariate, uniform\n", "\n", "def SC(): return posint(gauss(15.1, 3) + 3 * triangular(1, 4, 13)) # 30.1\n", "def KT(): return posint(gauss(10.2, 3) + 3 * triangular(1, 3.5, 9)) # 22.1\n", "def DG(): return posint(vonmisesvariate(30, 2) * 3.08) # 14.0\n", "def HB(): return posint(gauss(6.7, 1.5) if choice((True, False)) else gauss(16.7, 2.5)) # 11.7\n", "def OT(): return posint(triangular(5, 17, 25) + uniform(0, 30) + gauss(6, 3)) # 37.0\n", "\n", "def posint(x): \"Positive integer\"; return max(0, int(round(x)))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "_7F9qbRcby6M", "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "def repeated_hist(rv, bins=10, k=100000):\n", " \"Repeat rv() k times and make a histogram of the results.\"\n", " samples = [rv() for _ in range(k)]\n", " plt.hist(samples, bins=bins)\n", " return mean(samples),median(samples)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 287 }, "colab_type": "code", "id": 
"3bUjpc5bb9qY", "outputId": "e1e8490d-0b48-4e9f-ba75-0b72d4783f66", "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "repeated_hist(SC)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 288 }, "colab_type": "code", "id": "yAKsAOIWY64h", "outputId": "d08eaff2-7963-4e57-8f86-79e9019b9c8e", "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "repeated_hist(SC, bins=range(100))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 283 }, "colab_type": "code", "id": "u0kkuGpNclun", "outputId": "58596ab6-8c9f-4d17-f3a8-f05d900b78e4", "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "repeated_hist(KT, bins=range(60))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 283 }, "colab_type": "code", "id": "0r-_wNBics7I", "outputId": "943dfebe-d4e1-4f0d-e72b-8314e7922e52" }, "outputs": [], "source": [ "repeated_hist(DG, bins=range(60))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 283 }, "colab_type": "code", "id": "Da0qhUB0c3Cr", "outputId": "b2fe74d5-52ec-43e9-acbe-9fa3f26f2db4" }, "outputs": [], "source": [ "repeated_hist(HB, bins=range(100))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 283 }, "colab_type": "code", "id": "JNHZcnCdc-96", "outputId": "730867ac-c851-41b3-93e4-3875c56cebbb" }, "outputs": [], "source": [ "repeated_hist(OT, bins=range(60))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 283 }, "colab_type": "code", "id": "yzs57n0ndGO6", "outputId": "fafe0fc1-6ada-4a19-fde1-131181076d2f" }, "outputs": [], "source": [ "def GSW(): return SC() + 
KT() + DG() + HB() + OT()\n", "\n", "repeated_hist(GSW, bins=range(70, 160, 2))" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "E762AWi_ZOjl", "slideshow": { "slide_type": "slide" } }, "source": [ "## Pandas - Visualization" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 286 }, "colab_type": "code", "id": "VAtuGqlnZHUk", "outputId": "fbb25642-f3a0-4042-83e2-651716eaa1aa", "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "\n", "affairs['age'].plot.hist()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 335 }, "colab_type": "code", "id": "_NTcfL5WZlp2", "outputId": "8422fb1c-ad68-46f5-82b2-81f232ac493f", "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "affairs.groupby('sex')['age'].plot.hist(alpha=0.5)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "X8HTl2Bgf_nn", "slideshow": { "slide_type": "slide" } }, "source": [ "## Pandas histogram documentation\n", "\n", "https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.DataFrame.hist.html\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "eEVN--W4XaN2" }, "source": [ "## Seaborn\n", "- A beautiful visualization package\n", "\n", "https://seaborn.pydata.org/tutorial/distributions.html#distribution-tutorial\n", "\n", "- Source: https://github.com/MartinSeeler/python-data-exploration/blob/master/Exploring%20Datasets.ipynb" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 441 }, "colab_type": "code", "id": "ydvxF2mEgKo9", "outputId": "58ae01f1-a6fe-4372-e39a-68c9306b4961" }, "outputs": [], "source": [ "import seaborn as sns\n", "import numpy as np\n", "\n", "sns.set(color_codes=True)\n", "sns.set_context('talk')\n", "\n", "x = 
np.random.normal(size=100)\n", "\n", "sns.distplot(x, kde=True, rug=True);\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 441 }, "colab_type": "code", "id": "thCgiFd2gxnM", "outputId": "89941f7e-6bd5-407f-fd01-1f5b66f5d49c" }, "outputs": [], "source": [ "sns.distplot(x, bins=20, kde=False, rug=True);" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 480 }, "colab_type": "code", "id": "QP1vABfZhHS3", "outputId": "8325af6c-34cc-43d3-9004-86414baf5c49" }, "outputs": [], "source": [ "sns.distplot(affairs['age'], kde=False, bins=50, rug=True)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "_pAw30Xhighi" }, "source": [ "## Did you see that?\n", "\n", "- Ages seem to end in 2 or 7" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 477 }, "colab_type": "code", "id": "lr1aOUcziSEI", "outputId": "d87a22ad-3fef-48c8-ec49-cca55007d2b2" }, "outputs": [], "source": [ "\n", "sns.distplot(affairs['ym'], bins=10, kde=False)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "beaEYcXOjUKL" }, "source": [ "## Mean vs years of marriage\n", "- \"The average age of our people is around 32, but most of them have been married for more than 14 years!\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 1590 }, "colab_type": "code", "id": "VD15dX8DkVYX", "outputId": "edd5b80b-bdff-4e7c-ac63-2fb574e90cd5" }, "outputs": [], "source": [ "from PIL import Image\n", "import matplotlib as mtplb\n", "\n", "# https://stackoverflow.com/questions/7391945/how-do-i-read-image-data-from-a-url-in-python\n", "\n", "import shutil\n", "import requests\n", "\n", "lena_url = 'https://imagej.nih.gov/ij/images/lena.jpg'\n", "response = 
requests.get(lena_url, stream=True)\n", "with open('lena.jpg', 'wb') as file:\n", " shutil.copyfileobj(response.raw, file)\n", "del response\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "im = Image.open(\"lena.jpg\").convert('L')\n", "im.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "histLena = np.array(im.histogram())\n", "plt.plot(histLena,'.')" ] } ], "metadata": { "celltoolbar": "Slideshow", "colab": { "name": "MAC0459-Class4.ipynb", "provenance": [], "version": "0.3.2" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" }, "latex_envs": { "LaTeX_envs_menu_present": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "toc": { "colors": { "hover_highlight": "#DAA520", "running_highlight": "#FF0000", "selected_highlight": "#FFD700" }, "moveMenuLeft": true, "nav_menu": { "height": "263.2px", "width": "252px" }, "navigate_menu": true, "number_sections": true, "sideBar": true, "threshold": 4, "toc_cell": false, "toc_section_display": "block", "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 1 }