# MAC0459/MAC5865 - Data Science and Engineering

### Welcome, everyone!

### Go to https://app.sli.do/event/n58vfsrg and post your questions about the class.

## Class 8: Exploratory Data Analysis (EDA)

- Mean
- Median
- Variance
- Histogram

## Mean? What does the mean mean?

- Given a probability density function $f$ of a continuous random variable $X$, the expected value of $X$ is given by:
$$
E[X] = \int \limits_{-\infty}^{\infty} x f(x) dx
$$
- A continuous random variable (CRV) can take any value in the interval where it is defined.
- It usually represents a physical quantity: distance, temperature, pressure, weight, etc.
- $f$ must be a valid density: $f(x) \geq 0$ for all $x$ and $\int_{-\infty}^{\infty} f(x) dx = 1$.
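As a minimal numerical sketch of the integral above, assume an exponential density $f(x) = \lambda e^{-\lambda x}$ with $\lambda = 0.5$ (an illustrative choice, not taken from the class data), whose expected value is known to be $1/\lambda = 2$. The code approximates the integrals with a midpoint Riemann sum; the names `density`, `upper`, and `n` are hypothetical.

```python
import math

LAM = 0.5  # assumed rate parameter, chosen only for illustration

def density(x):
    """Exponential density f(x) = LAM * exp(-LAM * x) for x >= 0."""
    return LAM * math.exp(-LAM * x)

# f is zero for x < 0 and negligible beyond x = 60, so integrate over [0, 60].
upper = 60.0
n = 60_000
dx = upper / n
midpoints = [(i + 0.5) * dx for i in range(n)]

total_mass = sum(density(x) for x in midpoints) * dx     # close to 1
expected = sum(x * density(x) for x in midpoints) * dx   # close to 1 / LAM

print(total_mass, expected)
```

The first sum checks that $f$ integrates to one, the second approximates $E[X]$.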

## Mean? What does the mean mean?

- Given a probability distribution function $P$ of a discrete random variable $X$, defined on a countable set $D$, the expected value of $X$ is given by:
$$
E[X] = \sum \limits_{x\in D} x P(x)
$$
- A discrete random variable (DRV) usually represents a countable quantity: years, days, levels, etc.
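- $P$ must be a valid distribution: $P(x) \geq 0$ for all $x \in D$ and $\sum_{x \in D} P(x) = 1$.
- In the real world we never have direct access to $P$; we only observe samples and estimate $E[X]$ from them.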
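The sum above can be evaluated directly when $P$ is known. The sketch below assumes a fair six-sided die (an illustrative model, not part of the class data) and compares $E[X]$ with the mean of simulated rolls, which is the kind of estimate available when $P$ is not known.

```python
import random
from statistics import mean

# Assumed model: a fair six-sided die, so P(x) = 1/6 for x in D = {1, ..., 6}.
D = range(1, 7)
P = {x: 1 / 6 for x in D}

expected = sum(x * P[x] for x in D)   # E[X] computed from the definition

# Without P we only see samples and use their mean as an estimate of E[X].
rng = random.Random(0)                # fixed seed so the run is repeatable
rolls = [rng.randint(1, 6) for _ in range(100_000)]
sample_mean = mean(rolls)

print(expected, sample_mean)
```

The sample mean gets closer to $E[X] = 3.5$ as the number of rolls grows.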

## A Theory of Extramarital Affairs

Ray Fair

https://fairmodel.econ.yale.edu/rayfair/pdf/1978a200.pdf

In [None]:
import pandas as pd
from statistics import mean, median

affairs = pd.read_csv('http://vision.ime.usp.br/~hirata/Fair.csv')

affairs.info()


In [None]:
affairs.head()

In [None]:
affairs.tail()

In [None]:
affairs['nbaffairs'].value_counts()


In [None]:
affairs['sex'].value_counts()

In [None]:
mean(affairs['age'])

In [None]:
affairs['age'].mean()

In [None]:
affairs['ym'].mean()

In [None]:
affairs['rate'].mean()

In [None]:
affairs['age'].median()

In [None]:
affairs['ym'].median()

In [None]:
affairs['ym'].max()

In [None]:
affairs['ym'].describe()

In [None]:
affairs.describe()

In [None]:
affairs[affairs['sex'] == 'female'].describe()

In [None]:
affairs[affairs['sex'] == 'male'].describe()

In [None]:
affairs['below_30'] = affairs['age'] < 30
affairs['below_30'].value_counts()

In [None]:
affairs.tail()

In [None]:
affairs['affbw_30'] = (affairs['nbaffairs'] != 0) & affairs['below_30']
affairs['affbw_30'].value_counts()

In [None]:
affairs.tail()

## Split-Apply-Combine

In [None]:
def draw_dataframe(df, loc=None, width=None, ax=None, linestyle=None,
                   textstyle=None):
    """Draw `df` as a grid of cells whose lower-left corner is at `loc`."""
    loc = loc or [0, 0]
    width = width or 1

    x, y = loc

    if ax is None:
        ax = plt.gca()

    ncols = len(df.columns) + 1
    nrows = len(df.index) + 1

    dx = dy = width / ncols

    if linestyle is None:
        linestyle = {'color': 'black'}

    if textstyle is None:
        textstyle = {'size': 12}

    textstyle.update({'ha': 'center', 'va': 'center'})

    # draw vertical lines
    for i in range(ncols + 1):
        plt.plot(2 * [x + i * dx], [y, y + dy * nrows], **linestyle)

    # draw horizontal lines
    for i in range(nrows + 1):
        plt.plot([x, x + dx * ncols], 2 * [y + i * dy], **linestyle)

    # Create index labels
    for i in range(nrows - 1):
        plt.text(x + 0.5 * dx, y + (i + 0.5) * dy,
                 str(df.index[::-1][i]), **textstyle)

    # Create column labels
    for i in range(ncols - 1):
        plt.text(x + (i + 1.5) * dx, y + (nrows - 0.5) * dy,
                 str(df.columns[i]), style='italic', **textstyle)

    # Add index label
    if df.index.name:
        plt.text(x + 0.5 * dx, y + (nrows - 0.5) * dy,
                 str(df.index.name), style='italic', **textstyle)

    # Insert data
    for i in range(nrows - 1):
        for j in range(ncols - 1):
            plt.text(x + (j + 1.5) * dx,
                     y + (i + 0.5) * dy,
                     str(df.values[::-1][i, j]), **textstyle)

#----------------------------------------------------------
# Draw figure
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline

df = pd.DataFrame({'data': [1, 2, 3, 4, 5, 6]},
 index=['A', 'B', 'C', 'A', 'B', 'C'])
df.index.name = 'key'


fig = plt.figure(figsize=(8, 6), facecolor='white')
ax = plt.axes([0, 0, 1, 1])

ax.axis('off')

draw_dataframe(df, [0, 0])

for y, ind in zip([3, 1, -1], 'ABC'):
    split = df[df.index == ind]
    draw_dataframe(split, [2, y])

    # named group_sum to avoid shadowing the built-in sum()
    group_sum = pd.DataFrame(split.sum()).T
    group_sum.index = [ind]
    group_sum.index.name = 'key'
    group_sum.columns = ['data']
    draw_dataframe(group_sum, [4, y + 0.25])
 
result = df.groupby(df.index).sum()
draw_dataframe(result, [6, 0.75])

style = dict(fontsize=14, ha='center', weight='bold')
plt.text(0.5, 3.6, "Input", **style)
plt.text(2.5, 4.6, "Split", **style)
plt.text(4.5, 4.35, "Apply (sum)", **style)
plt.text(6.5, 2.85, "Combine", **style)

arrowprops = dict(facecolor='black', width=1, headwidth=6)
plt.annotate('', (1.8, 3.6), (1.2, 2.8), arrowprops=arrowprops)
plt.annotate('', (1.8, 1.75), (1.2, 1.75), arrowprops=arrowprops)
plt.annotate('', (1.8, -0.1), (1.2, 0.7), arrowprops=arrowprops)

plt.annotate('', (3.8, 3.8), (3.2, 3.8), arrowprops=arrowprops)
plt.annotate('', (3.8, 1.75), (3.2, 1.75), arrowprops=arrowprops)
plt.annotate('', (3.8, -0.3), (3.2, -0.3), arrowprops=arrowprops)

plt.annotate('', (5.8, 2.8), (5.2, 3.6), arrowprops=arrowprops)
plt.annotate('', (5.8, 1.75), (5.2, 1.75), arrowprops=arrowprops)
plt.annotate('', (5.8, 0.7), (5.2, -0.1), arrowprops=arrowprops)
 
plt.axis('equal')
plt.ylim(-1.5, 5);
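The three steps in the figure can also be written out by hand. The following is a minimal sketch on the same toy frame; the names `pieces`, `sums`, and `manual` are hypothetical, and the last line shows that `groupby` produces the same result in one call.

```python
import pandas as pd

df = pd.DataFrame({'data': [1, 2, 3, 4, 5, 6]},
                  index=['A', 'B', 'C', 'A', 'B', 'C'])
df.index.name = 'key'

# Split: one sub-frame per distinct key.
pieces = {key: df[df.index == key] for key in sorted(set(df.index))}

# Apply: reduce each piece to a single number.
sums = {key: int(piece['data'].sum()) for key, piece in pieces.items()}

# Combine: gather the per-key results into one Series.
manual = pd.Series(sums, name='data')
manual.index.name = 'key'

# groupby performs the same three steps in one call.
grouped = df.groupby(df.index)['data'].sum()

print(manual.to_dict(), grouped.to_dict())
```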

In [None]:
affairs.groupby('religious').size()

In [None]:
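# numeric_only=True skips string columns such as 'sex', which recent pandas versions refuse to average
affairs.groupby('religious').mean(numeric_only=True)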

In [None]:
affairs.groupby('occupation').size()

In [None]:
# numeric_only=True skips string columns such as 'sex'
affairs.groupby('occupation').mean(numeric_only=True)

In [None]:
affairs.groupby(['sex','rate']).size()

In [None]:
# numeric_only=True skips any remaining string columns
affairs.groupby(['sex', 'rate']).mean(numeric_only=True)

In [None]:
# numeric_only=True skips string columns such as 'sex'
affairs.groupby('occupation').median(numeric_only=True)

## Variance

- Given a probability distribution function $P$ of a discrete random variable $X$, defined on a countable set $D$, the variance of $X$ is given by:
$$
Var[X] = \sum\limits_{x\in D} P(x)(x - E[X])^2
$$
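As a small worked instance of the formula, the sketch below again assumes a fair six-sided die (an illustrative model) and then contrasts the two estimators that are used when only data are available.

```python
from statistics import pvariance, variance

# Assumed model: a fair six-sided die, P(x) = 1/6 on D = {1, ..., 6}.
D = range(1, 7)
P = {x: 1 / 6 for x in D}

mu = sum(x * P[x] for x in D)                    # E[X] = 3.5
var_def = sum(P[x] * (x - mu) ** 2 for x in D)   # Var[X] from the definition

# With data instead of P there are two common estimators:
sample = [1, 2, 3, 4, 5, 6]
var_population = pvariance(sample)   # divides by n
var_sample = variance(sample)        # divides by n - 1

print(var_def, var_population, var_sample)
```

pandas `.var()` and `.std()` use the $n - 1$ divisor by default (`ddof=1`), so they correspond to `variance` here and not to the definition above.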



In [None]:
# numeric_only=True skips string columns such as 'sex'
affairs.groupby('occupation').var(numeric_only=True)

In [None]:
# numeric_only=True skips string columns such as 'sex'
affairs.groupby('occupation').std(numeric_only=True)

## Histogram
- Another way to group **numeric** values
- Usual algorithm:
  - Input: an array of values and the number of bins
  - Output: an array of integer counts, one entry per bin
  - Create the output array and initialize it with zeros
  - For each value of the input array, increment the count of the bin it falls into
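The algorithm above can be sketched in a few lines. This version assumes equal-width bins between the minimum and the maximum of the data, and that the values are not all equal (otherwise the bin width would be zero); the function name `histogram` is hypothetical.

```python
def histogram(values, bins):
    """Count how many values fall into each of `bins` equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins                  # create and initialize the output
    for v in values:
        idx = int((v - lo) / width)      # index of the bin that contains v
        idx = min(idx, bins - 1)         # the maximum goes into the last bin
        counts[idx] += 1
    return counts

values = [1, 2, 2, 3, 5, 8, 8, 9, 10]    # made-up data
counts = histogram(values, bins=3)       # bins: [1, 4), [4, 7), [7, 10]
print(counts)
```

Every bin is half-open except the last one, which also includes its right edge so that the maximum is counted.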

In [None]:
%matplotlib inline 
import matplotlib.pyplot as plt

In [None]:
from random import gauss, triangular, choice, vonmisesvariate, uniform

def SC(): return posint(gauss(15.1, 3) + 3 * triangular(1, 4, 13)) # 30.1
def KT(): return posint(gauss(10.2, 3) + 3 * triangular(1, 3.5, 9)) # 22.1
def DG(): return posint(vonmisesvariate(30, 2) * 3.08) # 14.0
def HB(): return posint(gauss(6.7, 1.5) if choice((True, False)) else gauss(16.7, 2.5)) # 11.7
def OT(): return posint(triangular(5, 17, 25) + uniform(0, 30) + gauss(6, 3)) # 37.0

def posint(x): "Positive integer"; return max(0, int(round(x)))

In [None]:
def repeated_hist(rv, bins=10, k=100000):
    "Repeat rv() k times and make a histogram of the results."
    samples = [rv() for _ in range(k)]
    plt.hist(samples, bins=bins)
    return mean(samples), median(samples)

In [None]:
repeated_hist(SC)

In [None]:
repeated_hist(SC, bins=range(100))

In [None]:
repeated_hist(KT, bins=range(60))

In [None]:
repeated_hist(DG, bins=range(60))

In [None]:
repeated_hist(HB, bins=range(100))

In [None]:
repeated_hist(OT, bins=range(60))

In [None]:
def GSW(): return SC() + KT() + DG() + HB() + OT()

repeated_hist(GSW, bins=range(70, 160, 2))

## Pandas - Visualization

In [None]:

affairs['age'].plot.hist()

In [None]:
affairs.groupby('sex')['age'].plot.hist(alpha=0.5)

## Pandas histogram documentation

https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.DataFrame.hist.html



## Seaborn
- A beautiful visualization package

https://seaborn.pydata.org/tutorial/distributions.html#distribution-tutorial

- Source: https://github.com/MartinSeeler/python-data-exploration/blob/master/Exploring%20Datasets.ipynb

In [None]:
import seaborn as sns
import numpy as np

sns.set(color_codes=True)
sns.set_context('talk')

x = np.random.normal(size=100)

# distplot is deprecated since seaborn 0.11; histplot and rugplot replace it
sns.histplot(x, kde=True)
sns.rugplot(x);


In [None]:
# histplot and rugplot replace the deprecated distplot
sns.histplot(x, bins=20)
sns.rugplot(x);

In [None]:
# histplot and rugplot replace the deprecated distplot
sns.histplot(affairs['age'], bins=50)
sns.rugplot(affairs['age'])

## Did you see that?

- Ages seem to end with two or seven
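One way to check this kind of observation is to count the last digit of each age. The sketch below uses a small hypothetical sample in place of the real `affairs['age']` column, so the numbers it prints are illustrative only.

```python
import pandas as pd

# Hypothetical ages, used here instead of the real affairs['age'] column.
ages = pd.Series([22, 27, 32, 37, 42, 47, 52, 57, 32, 27, 22, 32])

last_digit = (ages % 10).astype(int)
digit_counts = last_digit.value_counts().sort_index()
print(digit_counts)
```

On the real data, `affairs['age'].value_counts().sort_index()` lists the distinct recorded ages directly. A pattern like this would suggest that ages were recorded as bracket midpoints and not as exact values.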

In [None]:

# histplot replaces the deprecated distplot
sns.histplot(affairs['ym'], bins=10)

## Mean vs years of marriage
- "The average age of our people is around 32, but most of them have been married for more than 14 years!"
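A single summary number can hide the shape of a distribution. As a minimal sketch with made-up years of marriage (not the real `ym` column), the mean and median below sit far from the most frequent value.

```python
from statistics import mean, median, mode

# Made-up years of marriage, skewed toward the largest value.
years_married = [1, 1, 2, 4, 7, 10, 15, 15, 15, 15]

avg = mean(years_married)
mid = median(years_married)
most_common = mode(years_married)

print(avg, mid, most_common)
```

This is why the histogram is worth looking at alongside the mean and the median.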

In [None]:
from PIL import Image
import matplotlib as mtplb

# https://stackoverflow.com/questions/7391945/how-do-i-read-image-data-from-a-url-in-python

import shutil
import requests

lena_url = 'https://imagej.nih.gov/ij/images/lena.jpg'
response = requests.get(lena_url, stream=True)
with open('lena.jpg', 'wb') as file:
 shutil.copyfileobj(response.raw, file)
del response


In [None]:
im = Image.open("lena.jpg").convert('L')
im.show()

In [None]:
histLena = np.array(im.histogram())
plt.plot(histLena,'.')