{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Algums experimentos com estatísticas" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "from scipy import stats" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.style.use('ggplot')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Para gerar dados com distribuições dadas, o NumPy tem disponível o módulo `random`, que já usamos ocasionalmente.\n", "Veja a [documentação](https://numpy.org/doc/1.18/reference/random/index.html).\n", "\n", "Para a análise estatística de dados, o NumPy providencia alguma funções estatísticas. Veja a [documentação](https://numpy.org/doc/1.18/reference/routines.statistics.html)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Para lidar diretamente com as distribuições, ao invés de gerar dados com base nelas, o módulo `scipy.stats` possui uma grande quantidade de funções e tipos de dados, incluindo:\n", "\n", "- Diversas distribuições contínuas e discretas de uma variável.\n", "- Algumas distribuições multivariadas (mais de uma variável).\n", "- Cálculo de diversas estatísticas para um conjunto de dados.\n", "- Análise de frequências.\n", "- Correlação.\n", "- Testes estatísticos.\n", "- E diversos outros.\n", "\n", "Dê uma olhada na documentação no [site oficial](https://docs.scipy.org/doc/scipy/reference/stats.html)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Vamos simular sortei com um dado justo, que pode gerar os valores 1 a 6 todos com a mesma probabilidade." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "generator = np.random.default_rng()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dice = generator.integers(1, 6, endpoint=True, size=100)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dice" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.hist(dice, bins=np.arange(0.5, 7.5, 1.0));" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.hist(dice, bins=np.arange(0.5, 7.5, 1.0), density=True);" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dice = generator.integers(1, 6, endpoint=True, size=10000)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.hist(dice, bins=np.arange(0.5, 7.5, 1.0), density=True);" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.amin(dice), np.amax(dice)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.quantile(dice, [1/6, 1/3, 1/2, 2/3, 5/6])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.median(dice), np.quantile(dice, 0.5)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.mean(dice), np.std(dice, ddof=1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "avg = (1 + 6)/2\n", "dev = np.sqrt(np.sum(np.arange(1, 7)**2)/6 - avg**2)\n", "avg, dev" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bins = np.arange(0.5, 6.5 + 1, 1) # 6 caixas com o meio nos valores inteiros 1 a 6\n", "hist, bins2 = np.histogram(dice, bins=bins, density=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "assert np.all(bins == bins2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hist" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "1/6" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Agora vamos usar uma distribuição contínua, com valores aleatórios entre 2 e 10." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "values = (10 - 2) * generator.random(size=10000) + 2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.amin(values), np.amax(values) " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.quantile(values, [0.25, 0.5, 0.75])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.median(values)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.mean(values), np.std(values, ddof=1), np.var(values, ddof=1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "stats.uniform.mean(loc=2, scale=(10-2)), stats.uniform.std(loc=2, scale=(10-2)), stats.uniform.var(loc=2, scale=(10-2))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "xvalues = np.linspace(2, 10, 50 + 1)\n", "plt.hist(values, bins=50, density=True, label='data');\n", "plt.plot(xvalues, stats.uniform.pdf(xvalues, loc=2, scale=(10 - 2)), label='distribution')\n", "plt.legend();" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.hist(values, bins=20);" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.hist(values, bins=10);" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "values = (10 - 2) * generator.random(size=1000000) + 2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "xvalues = np.linspace(2, 10, 50 + 1)\n", "plt.hist(values, bins=50, density=True, label='data');\n", "plt.plot(xvalues, stats.uniform.pdf(xvalues, loc=2, scale=(10 - 2)), label='distribution')\n", "plt.legend();" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.mean(values), np.std(values, ddof=1), np.var(values, ddof=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Vejamos agora uma distribuição normal com média 5 e desvio padrão 2." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "values = generator.normal(5, 2, size=10000)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.amin(values), np.amax(values)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.quantile(values, [0.25, 0.5, 0.75])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.mean(values), np.std(values, ddof=1), np.var(values, ddof=1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "minx = np.amin(values)\n", "maxx = np.amax(values)\n", "xvalues = np.linspace(minx, maxx, 50 + 1)\n", "plt.hist(values, bins=50, density=True, label='data');\n", "plt.plot(xvalues, stats.norm.pdf(xvalues, loc=5, scale=2), label='distribution')\n", "plt.legend();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Vejamos como fica a comparação entre a binomial, a Poisson e a normal." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "m = 1000000\n", "n = 10000000\n", "mean = 10\n", "p = mean / n\n", "deviation = np.sqrt(n*p*(1-p))\n", "data_binomial = generator.binomial(n, p, size=m)\n", "data_poisson = generator.poisson(mean, size=m)\n", "data_normal = generator.normal(mean, deviation, size=m)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "a = np.amin(data_binomial)\n", "b = np.amax(data_binomial)\n", "bins = np.arange(a, b+2)\n", "plt.hist([data_normal, data_binomial, data_poisson], bins=bins, density=True, label=['normal', 'binomial', 'poisson'])\n", "plt.legend();" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "3*(np.mean(data_binomial)-np.median(data_binomial))/np.std(data_binomial, ddof=1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "3*(np.mean(data_poisson)-np.median(data_poisson))/np.std(data_poisson, ddof=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Agora a distribuição lognormal." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_lognormal = generator.lognormal(5, 2, size=100000)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lgnrm = stats.lognorm(2, loc=0, scale=np.exp(5))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.amin(data_lognormal), np.amax(data_lognormal)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.quantile(data_lognormal, [0.25, 0.5, 0.75])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.mean(data_lognormal), np.std(data_lognormal, ddof=1), np.var(data_lognormal, ddof=1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "m = np.exp(5 + 2*2/2)\n", "s = np.sqrt(np.exp(2 * 5 + 2*2)*(np.exp(2*2) - 1))\n", "m, s" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.hist(data_lognormal, bins=50, density=True);" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "start = np.amin(data_lognormal)\n", "finish = np.amax(data_lognormal)\n", "bins = np.logspace(np.log10(start), np.log10(finish), 50)\n", "h, b = np.histogram(data_lognormal, bins=bins, density=True)\n", "plt.loglog(b[:-1], h, label='data')\n", "plt.loglog(bins, lgnrm.pdf(bins), label='distribution')\n", "plt.legend();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note como a lognormal tem uma ampla gama de valores (várias ordens de grandeza nos dois eixos) e também como aparenta ter uma região quase reta na gráfico loglog, o que tenderia a fazer pensar que ela segue uma regra de lei de potência.\n", "\n", "Porisso, é importante tomar cuidado com deduções experimentais que aparentam encontrar leis de potência: elas podem ser uma lognormal. A importância é que uma lei de potência tende a indicar a existência de operação num ponto crítico (possivelmente com criticalidade auto-organizada), mas se for uma lognormal, a conclusão é menos impressionante: apenas que os dados são resultantes de processo aleatórios multiplicativos!" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }