PMR3508 - Machine Learning and Pattern Recognition
Testing kNN on the Adult dataset obtained from the UCI repository. We start by loading the dataset and running a basic analysis of the data and its attributes.
Author: Fabio G. Cozman Date: 09/08/2018
import pandas as pd
import sklearn
adult = pd.read_csv(
    "/Users/imac/Desktop/HOME/Didatico/Aulas/Graduacao/PMR3508/2018/Datasets/Adult-UCI/adult.data.txt",
    names=[
        "Age", "Workclass", "fnlwgt", "Education", "Education-Num", "Martial Status",
        "Occupation", "Relationship", "Race", "Sex", "Capital Gain", "Capital Loss",
        "Hours per week", "Country", "Target"],
    sep=r'\s*,\s*',
    engine='python',
    na_values="?")
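To see what the `sep`, `engine` and `na_values` arguments do, here is a minimal sketch that parses two illustrative rows from memory instead of the file. The rows are made up for the example (they are not records from the dataset) and `demo` is a hypothetical name.

```python
import io
import pandas as pd

# Two made-up rows in the comma-plus-space layout of the data file.
sample = io.StringIO(
    "30, Private, 100000, Bachelors\n"
    "45, ?, 200000, HS-grad\n"
)

demo = pd.read_csv(
    sample,
    names=["Age", "Workclass", "fnlwgt", "Education"],
    sep=r"\s*,\s*",   # regex separator: a comma with optional whitespace around it
    engine="python",  # regex separators are handled by the Python parsing engine
    na_values="?",    # treat "?" as a missing value
)

print(demo)
print(demo.isna().sum())  # number of missing entries per column
```

Without the regex separator, the string fields would keep their leading space (" ?"), and the `na_values="?"` match would likely fail.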
adult.shape
adult.head()
adult["Country"].value_counts()
import matplotlib.pyplot as plt
adult["Age"].value_counts().plot(kind="bar")
adult["Sex"].value_counts().plot(kind="bar")
adult["Education"].value_counts().plot(kind="bar")
adult["Occupation"].value_counts().plot(kind="bar")
Removing rows with missing data.
nadult = adult.dropna()
nadult
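Before dropping rows it can help to check how many values are missing and how many rows are lost. The sketch below does this on a tiny made-up frame; `toy`, `missing_per_column` and `rows_dropped` are hypothetical names, and the same calls should apply to `adult`.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame (made-up values) with missing entries in two columns.
toy = pd.DataFrame({
    "Age": [30, 45, 28, 51],
    "Workclass": ["Private", np.nan, "Private", "State-gov"],
    "Occupation": ["Sales", np.nan, np.nan, "Tech-support"],
})

missing_per_column = toy.isna().sum()  # count of NaN entries in each column
clean = toy.dropna()                   # drops every row that has at least one NaN
rows_dropped = len(toy) - len(clean)

print(missing_per_column)
print(f"dropped {rows_dropped} of {len(toy)} rows")
```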
Repeating the same process for the test data.
testAdult = pd.read_csv(
    "/Users/imac/Desktop/HOME/Didatico/Aulas/Graduacao/PMR3508/2018/Datasets/Adult-UCI/adult.test.txt",
    names=[
        "Age", "Workclass", "fnlwgt", "Education", "Education-Num", "Martial Status",
        "Occupation", "Relationship", "Race", "Sex", "Capital Gain", "Capital Loss",
        "Hours per week", "Country", "Target"],
    sep=r'\s*,\s*',
    engine='python',
    na_values="?")
nTestAdult = testAdult.dropna()
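One thing worth checking at this point: in the unmodified UCI distribution, the test file begins with a non-data line and its labels end with a period (`<=50K.` instead of `<=50K`). Whether the local `adult.test.txt` used here was already cleaned is not stated, so the following is only a sketch under that assumption, with made-up label columns. If the extra first line is present, `pd.read_csv` can skip it with `skiprows=1`.

```python
import pandas as pd

# Hypothetical label columns: the test labels carry a trailing period.
train_target = pd.Series(["<=50K", ">50K", "<=50K"])
test_target = pd.Series(["<=50K.", ">50K.", ">50K."])

# Strip the trailing "." so both sets use the same label strings.
test_target_clean = test_target.str.rstrip(".")

print(sorted(train_target.unique()))
print(sorted(test_target_clean.unique()))
```

If the labels differ between the two sets, a comparison such as `accuracy_score` on the raw strings would count every prediction as wrong, so comparing the unique label values of both sets is a cheap sanity check.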
First test: selecting the numeric attributes, with kNN for k=3.
Xadult = nadult[["Age","Education-Num","Capital Gain", "Capital Loss", "Hours per week"]]
Yadult = nadult.Target
XtestAdult = nTestAdult[["Age","Education-Num","Capital Gain", "Capital Loss", "Hours per week"]]
YtestAdult = nTestAdult.Target
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
from sklearn.model_selection import cross_val_score
scores = cross_val_score(knn, Xadult, Yadult, cv=10)
scores
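`cross_val_score` returns one accuracy value per fold, so the ten numbers are usually summarized by their mean and spread. A minimal sketch on synthetic data follows; `make_classification` stands in for the Adult data, and `knn_demo` and `demo_scores` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the five numeric attributes (not the Adult data).
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

knn_demo = KNeighborsClassifier(n_neighbors=3)
demo_scores = cross_val_score(knn_demo, X, y, cv=10)  # one accuracy per fold

print(f"mean accuracy: {demo_scores.mean():.3f} (std {demo_scores.std():.3f})")
```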
knn.fit(Xadult,Yadult)
YtestPred = knn.predict(XtestAdult)
YtestPred
from sklearn.metrics import accuracy_score
accuracy_score(YtestAdult,YtestPred)
Another test: same data, but kNN with k=30. This gave the best result so far.
knn = KNeighborsClassifier(n_neighbors=30)
knn.fit(Xadult,Yadult)
scores = cross_val_score(knn, Xadult, Yadult, cv=10)
scores
YtestPred = knn.predict(XtestAdult)
accuracy_score(YtestAdult,YtestPred)
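Comparing values of k can also be done with cross-validation on the training data alone, which keeps the test set out of the model-selection step. The sketch below is one way to do it, using synthetic data as a stand-in for the Adult attributes; the candidate list and the names `mean_accuracy` and `best_k` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training data (not the Adult data).
X, y = make_classification(n_samples=400, n_features=5, random_state=0)

candidate_ks = [3, 10, 30]  # hypothetical values of k to compare
mean_accuracy = {}
for k in candidate_ks:
    model = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 10 folds of the training data.
    mean_accuracy[k] = cross_val_score(model, X, y, cv=10).mean()

best_k = max(mean_accuracy, key=mean_accuracy.get)
print(mean_accuracy)
print("best k by cross-validation:", best_k)
```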
Converting all non-numeric data to numeric values, and running a few tests with different sets of attributes (keeping k=30, since that was the value of k that gave the best accuracy).
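from sklearn import preprocessing
# Caveat: each apply() call refits the encoder on every column of that frame only, so train
# and test get matching integer codes only if each column holds the same set of values in both.
# Numeric columns are re-encoded too, as the position of each value among the sorted unique values.
numAdult = nadult.apply(preprocessing.LabelEncoder().fit_transform)
numTestAdult = nTestAdult.apply(preprocessing.LabelEncoder().fit_transform)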
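Encoding the training and test frames separately can assign different integers to the same category when a value appears in only one of them. One alternative, sketched here on made-up frames rather than the Adult data, is to fit a single `OrdinalEncoder` on the training columns and reuse it for the test columns. This is a swapped-in technique, not the notebook's method, and all variable names are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Made-up frames: "Canada" appears only in the training data.
train = pd.DataFrame({"Sex": ["Male", "Female", "Male"],
                      "Country": ["Brazil", "Canada", "Mexico"]})
test = pd.DataFrame({"Sex": ["Female", "Male"],
                     "Country": ["Mexico", "Brazil"]})

# Separate fitting: Mexico becomes 1 in the test set...
separate = LabelEncoder().fit_transform(test["Country"])

# ...whereas an encoder fitted on the training data maps Mexico to 2 in both sets.
encoder = OrdinalEncoder()
train_codes = encoder.fit_transform(train)
test_codes = encoder.transform(test)

print(separate)
print(test_codes)
```

By default `OrdinalEncoder.transform` raises an error for a category it did not see during fitting, which makes such mismatches visible instead of silently miscoding them.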
Xadult = numAdult.iloc[:,0:14]
Yadult = numAdult.Target
XtestAdult = numTestAdult.iloc[:,0:14]
YtestAdult = numTestAdult.Target
knn = KNeighborsClassifier(n_neighbors=30)
knn.fit(Xadult,Yadult)
YtestPred = knn.predict(XtestAdult)
accuracy_score(YtestAdult,YtestPred)
Xadult = numAdult[["Age", "Workclass", "Education-Num", "Martial Status",
                   "Occupation", "Relationship", "Race", "Sex", "Capital Gain", "Capital Loss",
                   "Hours per week", "Country"]]
XtestAdult = numTestAdult[["Age", "Workclass", "Education-Num", "Martial Status",
                           "Occupation", "Relationship", "Race", "Sex", "Capital Gain", "Capital Loss",
                           "Hours per week", "Country"]]
knn = KNeighborsClassifier(n_neighbors=30)
knn.fit(Xadult,Yadult)
YtestPred = knn.predict(XtestAdult)
accuracy_score(YtestAdult,YtestPred)
Xadult = numAdult[["Age", "Workclass", "Education-Num",
                   "Occupation", "Race", "Sex", "Capital Gain", "Capital Loss",
                   "Hours per week"]]
XtestAdult = numTestAdult[["Age", "Workclass", "Education-Num",
                           "Occupation", "Race", "Sex", "Capital Gain", "Capital Loss",
                           "Hours per week"]]
knn = KNeighborsClassifier(n_neighbors=30)
knn.fit(Xadult,Yadult)
YtestPred = knn.predict(XtestAdult)
accuracy_score(YtestAdult,YtestPred)
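The attribute-subset tests above repeat the same fit-and-score steps, so they could be wrapped in a small helper. The following is a sketch on synthetic data with placeholder column names; `subset_accuracy` is a hypothetical function, and it scores each subset by 10-fold cross-validation rather than on the test set.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in with placeholder column names (not the Adult attributes).
features, target = make_classification(n_samples=400, n_features=4, random_state=0)
frame = pd.DataFrame(features, columns=["a", "b", "c", "d"])

def subset_accuracy(data, labels, columns, k=30):
    """Mean 10-fold cross-validated accuracy of kNN on the given columns."""
    model = KNeighborsClassifier(n_neighbors=k)
    return cross_val_score(model, data[columns], labels, cv=10).mean()

subsets = {
    "all": ["a", "b", "c", "d"],
    "first_two": ["a", "b"],
    "last_two": ["c", "d"],
}
results = {name: subset_accuracy(frame, target, cols) for name, cols in subsets.items()}

for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```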