{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# **PART 3: Project 2 (US Census)**" ] },
{ "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "# import the libraries\n", "import pandas as pd\n", "import numpy as np" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "First, we need to load the dataset. To download it and read more about it, visit: http://archive.ics.uci.edu/ml/datasets/Adult\n", "\n", "This dataset contains US Census information, and its goal is to predict whether a given person earns more than 50K per year. The prediction is based on the social information collected in the Census. Such a prediction could be used, for example, to decide whether to grant a loan (approve the loan if the person earns more than 50K). " ] },
{ "cell_type": "code", "execution_count": 3, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>age</th><th>workclass</th><th>fnlwgt</th><th>education</th><th>education-num</th><th>marital-status</th><th>occupation</th><th>relationship</th><th>race</th><th>sex</th><th>capital-gain</th><th>capital-loss</th><th>hours-per-week</th><th>native-country</th><th>target</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>39</td><td>State-gov</td><td>77516</td><td>Bachelors</td><td>13</td><td>Never-married</td><td>Adm-clerical</td><td>Not-in-family</td><td>White</td><td>Male</td><td>2174</td><td>0</td><td>40</td><td>United-States</td><td>&lt;=50K</td></tr>\n",
" <tr><th>1</th><td>50</td><td>Self-emp-not-inc</td><td>83311</td><td>Bachelors</td><td>13</td><td>Married-civ-spouse</td><td>Exec-managerial</td><td>Husband</td><td>White</td><td>Male</td><td>0</td><td>0</td><td>13</td><td>United-States</td><td>&lt;=50K</td></tr>\n",
" <tr><th>2</th><td>38</td><td>Private</td><td>215646</td><td>HS-grad</td><td>9</td><td>Divorced</td><td>Handlers-cleaners</td><td>Not-in-family</td><td>White</td><td>Male</td><td>0</td><td>0</td><td>40</td><td>United-States</td><td>&lt;=50K</td></tr>\n",
" <tr><th>3</th><td>53</td><td>Private</td><td>234721</td><td>11th</td><td>7</td><td>Married-civ-spouse</td><td>Handlers-cleaners</td><td>Husband</td><td>Black</td><td>Male</td><td>0</td><td>0</td><td>40</td><td>United-States</td><td>&lt;=50K</td></tr>\n",
" <tr><th>4</th><td>28</td><td>Private</td><td>338409</td><td>Bachelors</td><td>13</td><td>Married-civ-spouse</td><td>Prof-specialty</td><td>Wife</td><td>Black</td><td>Female</td><td>0</td><td>0</td><td>40</td><td>Cuba</td><td>&lt;=50K</td></tr>\n",
" </tbody>\n",
"</table>
\n", "" ], "text/plain": [ " age workclass fnlwgt education education-num \\\n", "0 39 State-gov 77516 Bachelors 13 \n", "1 50 Self-emp-not-inc 83311 Bachelors 13 \n", "2 38 Private 215646 HS-grad 9 \n", "3 53 Private 234721 11th 7 \n", "4 28 Private 338409 Bachelors 13 \n", "\n", " marital-status occupation relationship race sex \\\n", "0 Never-married Adm-clerical Not-in-family White Male \n", "1 Married-civ-spouse Exec-managerial Husband White Male \n", "2 Divorced Handlers-cleaners Not-in-family White Male \n", "3 Married-civ-spouse Handlers-cleaners Husband Black Male \n", "4 Married-civ-spouse Prof-specialty Wife Black Female \n", "\n", " capital-gain capital-loss hours-per-week native-country target \n", "0 2174 0 40 United-States <=50K \n", "1 0 0 13 United-States <=50K \n", "2 0 0 40 United-States <=50K \n", "3 0 0 40 United-States <=50K \n", "4 0 0 40 Cuba <=50K " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ],
"source": [ "# load the dataset\n", "\n", "# to download the dataset, use the UCI link given in the text cell above\n", "# Note 1: This dataset already comes split into training and test files. However, I will use only the\n", "# adult.data file (the training portion) as if it were the original dataset, because most datasets are not\n", "# split in advance into a training set and a test set (that is done during the Machine Learning process)\n", "\n", "# the name of each column was taken from the dataset's web page\n", "columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', \n", " 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', \n", " 'target']\n", "\n", "# open the dataset and name each column using the columns list\n", "data = pd.read_csv('C:/Users/Mariana/Dropbox (Pessoal)/2º sem 2020/SME0123/Codigos/adult.data', names=columns)\n", "\n", "# The same thing on Google Colab\n", "# from google.colab import drive\n", "# drive.mount('/content/drive')\n", "# data = pd.read_csv('/content/drive/My Drive/EABDA 2020/adult.data', names=columns)\n", "\n", "# strip the spaces at the beginning and end of the strings\n", "data = data.applymap(lambda x: x.strip() if isinstance(x, str) else x)\n", "\n", "# view the table; the head() method shows the first 5 rows\n", "# note that the 'target' column is the dependent (response) variable, i.e. the class.\n", "# the remaining columns are the explanatory (independent, predictor) variables, also called features or attributes, and each row is an individual (instance or example).\n", "data.head()" ] },
{ "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(32561, 15)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# show the number of rows (sampling units) and columns (variables)\n", "data.shape" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Datasets usually come with missing data; in this dataset, missing values are represented by the character '?'. There are several strategies for working with missing data. Some methods can handle missing values without any problem, but others require complete data. In those cases, the data can be imputed. There are both sophisticated and simple forms of imputation, one of the simplest being replacement by the mean value. In our case, however, for teaching purposes and for simplicity, I will remove every row that has missing data from our dataset. " ] },
{ "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Execute: preprocessing_remove_rows\n", "---Removed 0 rows in the table\n" ] } ], "source": [ "# removes the rows with missing data\n", "# data: a pd.DataFrame\n", "# null_symbol: the value that marks missing data\n", "def remove_rows(data, null_symbol=np.nan):\n", "    # store the total number of examples\n", "    total = len(data)\n", "    # replace null_symbol with NaN\n", "    data = data.replace(null_symbol, np.nan)\n", "    # drop the rows that contain any null value\n", "    data.dropna(inplace=True)\n", "    print('Execute: preprocessing_remove_rows\\n---Removed %d rows in the table' % (total - len(data)))\n", "    return data\n", "\n", "data = remove_rows(data, '?')" ] },
{ "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(30162, 15)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# the updated size of the dataset\n", "data.shape" ] },
{ "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr><th></th><th colspan=\"6\">age</th></tr>\n",
" <tr><th></th><th>count</th><th>min</th><th>max</th><th>median</th><th>mean</th><th>std</th></tr>\n",
" <tr><th>sex</th><th></th><th></th><th></th><th></th><th></th><th></th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>Female</th><td>9782</td><td>17</td><td>90</td><td>35</td><td>36.883459</td><td>13.532427</td></tr>\n",
" <tr><th>Male</th><td>20380</td><td>17</td><td>90</td><td>38</td><td>39.184004</td><td>12.873243</td></tr>\n",
" </tbody>\n",
"</table>
\n", "" ], "text/plain": [ " age \n", " count min max median mean std\n", "sex \n", "Female 9782 17 90 35 36.883459 13.532427\n", "Male 20380 17 90 38 39.184004 12.873243" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ],
"source": [ "# ages of the women and of the men, as two separate Series\n", "dataF = data.loc[data.sex=='Female', \"age\"]\n", "dataM = data.loc[data.sex=='Male', \"age\"]\n", "\n", "# descriptive statistics of age within each sex\n", "data.groupby('sex').agg({'age': ['count', 'min', 'max', 'median', 'mean', 'std']})" ] },
{ "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Ttest_indResult(statistic=-14.039122252627893, pvalue=1.5227465818368628e-44)" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from scipy import stats\n", "\n", "# Two-sided t-test for the means of two Normal populations;\n", "# equal_var=False selects Welch's version, which does not assume equal variances\n", "stats.ttest_ind(dataF,dataM, equal_var = False)" ] },
{ "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LeveneResult(statistic=43.976656455846715, pvalue=3.3792425846568063e-11)" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Levene's test for equality of variances\n", "stats.levene(dataF,dataM)\n", "\n", "#stats.bartlett(dataF,dataM)" ] },
{ "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Ttest_indResult(statistic=-2.8989252152629614, pvalue=0.0037942674534447514)" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Draw a random sample from each group (9% here); also try 95% and 0.5%\n", "# and notice how the test result changes\n", "sampleF=dataF.sample(frac=0.09)\n", "sampleM=dataM.sample(frac=0.09)\n", "\n", "# Two-sided t-test for the means of two Normal populations (Welch's version, since equal_var=False)\n", "stats.ttest_ind(sampleF,sampleM, equal_var = False)" ] },
{ "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" }, { "data": { "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "from scipy import stats\n", "\n", "# Normal probability plots, to check the Normality assumption behind Student's t-test;\n", "# each group is drawn on its own figure\n", "stats.probplot(sampleF, plot=plt)\n", "plt.title('Probability plot: Female')\n", "plt.figure()\n", "stats.probplot(sampleM, plot=plt)\n", "plt.title('Probability plot: Male')\n", "plt.show()" ] },
{ "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "C:\\Users\\Mariana\\Anaconda3\\lib\\site-packages\\scipy\\stats\\morestats.py:1653: UserWarning: p-value may not be accurate for N > 5000.\n", " warnings.warn(\"p-value may not be accurate for N > 5000.\")\n" ] }, { "data": { "text/plain": [ "(0.975343644618988, 0.0)" ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Shapiro-Wilk test for Normality\n", "stats.shapiro(dataM)" ] },
{ "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "KstestResult(statistic=0.08001335891281569, pvalue=8.035805158495415e-55)" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Kolmogorov-Smirnov test for Normality\n", "# (mu and sigma are estimated from the same data, so the p-value is only approximate)\n", "mu = dataF.mean()\n", "sigma = dataF.std()\n", "stats.kstest(dataF,'norm',args=(mu,sigma))" ] },
{ "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LeveneResult(statistic=1.5414156796712146, pvalue=0.21635809587326393)" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Levene's test for equality of variances, now on the two samples\n", "stats.levene(sampleF,sampleM)\n" ] },
{ "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
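<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>counts</th><th>%</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>&lt;=50K</th><td>22654</td><td>75.1</td></tr>\n",
" <tr><th>&gt;50K</th><td>7508</td><td>24.9</td></tr>\n",
" </tbody>\n",
"</table>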
\n", "" ], "text/plain": [ " counts %\n", "<=50K 22654 75.1\n", ">50K 7508 24.9" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ],
"source": [ "# Is there inequity in income?\n", "# absolute counts (c) and percentages (p) of each income class\n", "c = data[\"target\"].value_counts(sort=False)\n", "p = data[\"target\"].value_counts(normalize=True,sort=False).round(3) * 100\n", "pd.concat([c,p], axis=1, keys=['counts', '%'])\n" ] },
{ "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th>target</th><th>&lt;=50K</th><th>&gt;50K</th><th>All</th></tr>\n",
" <tr><th>sex</th><th></th><th></th><th></th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>Female</th><td>8670</td><td>1112</td><td>9782</td></tr>\n",
" <tr><th>Male</th><td>13984</td><td>6396</td><td>20380</td></tr>\n",
" <tr><th>All</th><td>22654</td><td>7508</td><td>30162</td></tr>\n",
" </tbody>\n",
"</table>
\n", "" ], "text/plain": [ "target <=50K >50K All\n", "sex \n", "Female 8670 1112 9782\n", "Male 13984 6396 20380\n", "All 22654 7508 30162" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ],
"source": [ "# Cross-tabulation of frequencies (contingency table), with row and column totals\n", "pd.crosstab(data.sex, data.target,margins=True)" ] },
{ "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
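<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th>target</th><th>&lt;=50K</th><th>&gt;50K</th></tr>\n",
" <tr><th>sex</th><th></th><th></th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>Female</th><td>0.886322</td><td>0.113678</td></tr>\n",
" <tr><th>Male</th><td>0.686163</td><td>0.313837</td></tr>\n",
" </tbody>\n",
"</table>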
\n", "" ], "text/plain": [ "target <=50K >50K\n", "sex \n", "Female 0.886322 0.113678\n", "Male 0.686163 0.313837" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ],
"source": [ "# Row proportions: the share of each income class within each sex\n", "pd.crosstab(data.sex, data.target,normalize='index')" ] },
{ "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1415.2864042410245,\n", " 1.00155254124934e-309,\n", " 1,\n", " array([[ 7347.04024932, 2434.95975068],\n", " [15306.95975068, 5073.04024932]]))" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from scipy.stats import chi2_contingency\n", "\n", "# chi-square test of independence/homogeneity of distributions\n", "# returns the statistic, the p-value, the degrees of freedom and the expected frequencies\n", "obs = pd.crosstab(data.sex, data.target)\n", "chi2_contingency(obs)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }