# MAC0459/MAC5865 - Data Science and Engineering

### Welcome, everyone!

### Go to https://app.sli.do/event/c6apbrqu and ask your questions about the class.

# MAC0459 - Topics in Engineering and Data Science


## Class 10: Exploratory Data Analysis

- Reviewing Histogram vs Barplot
- Scatterplot
- Other plots
- Beautiful Visualization

# Some words about color palettes
## Why is it very difficult to use and combine colors?

- The grayscale representation of light intensity forms a chain, i.e., a totally ordered set.
- "Color" is usually a tristimulus system.
- If color is represented by Red, Green and Blue (RGB), each in $[0,255]$, the space of all possible values can be represented by a complete lattice. 

# Some words about color palettes
- The problem with complete lattices is that some colors cannot be compared.
- A color palette is a finite set of colors that are present in an image.
- In image visualization, a color palette is a map that imposes a total order on a set of colors.
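
The incomparability mentioned above can be checked directly. This is a minimal sketch, assuming the usual component-wise order on RGB triples. The helper names `leq` and `luminance`, and the choice of the common Rec. 601 luminance weights as the ordering criterion, are illustrative assumptions and not part of the lecture.

```python
# Component-wise (lattice) order on RGB triples:
# a <= b if and only if every channel of a is <= the same channel of b.
def leq(a, b):
    return all(x <= y for x, y in zip(a, b))

red, blue, white = (255, 0, 0), (0, 0, 255), (255, 255, 255)

print(leq(red, white))                   # red lies below white in the lattice
print(leq(red, blue) or leq(blue, red))  # red and blue cannot be compared

# A palette needs a total order. As an assumption, sort by luminance.
def luminance(color):
    r, g, b = color
    return 0.299 * r + 0.587 * g + 0.114 * b

ordered = sorted([white, red, blue], key=luminance)
print(ordered)
```

Any other scalar key (hue, saturation, a hand-picked ranking) would give a different total order, which is one reason why choosing a palette is a design decision.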

In [None]:
import seaborn as sns
import numpy as np

# Return a list of colors defining a color palette.
current_palette = sns.color_palette()
sns.palplot(current_palette)

# Qualitative palettes

- Categorical datasets 
- No specific order

In [None]:
cmap = sns.choose_colorbrewer_palette('qualitative')

# Diverging palettes

- Non-categorical datasets.
- A specific order.
- Low and high values are of equal interest.

In [None]:
dmap = sns.choose_colorbrewer_palette('diverging')

# Sequential palettes

- Categorical or non-categorical datasets
- Some specific order
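
The three palette families can also be built without the interactive widget. This is a small sketch using `sns.color_palette` with ColorBrewer names. The specific palettes (`Set2`, `RdBu`, `Blues`) and sizes are arbitrary choices made for illustration.

```python
import seaborn as sns

# One example per family; names and sizes are illustrative choices.
qualitative = sns.color_palette("Set2", 5)   # distinct hues, no order implied
diverging = sns.color_palette("RdBu", 7)     # two hues around a neutral midpoint
sequential = sns.color_palette("Blues", 7)   # one hue, light to dark

for name, palette in [("qualitative", qualitative),
                      ("diverging", diverging),
                      ("sequential", sequential)]:
    print(name, palette.as_hex())
```

Passing any of these lists to `sns.palplot` draws the palette as a row of color swatches.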

In [None]:
smap = sns.choose_colorbrewer_palette('sequential')

# Be considerate of color-blind people
- https://en.wikipedia.org/wiki/Color_blindness
- Use a color palette that helps color-blind people read your graphics.

In [None]:
import pandas as pd

affairs = pd.read_csv('http://vision.ime.usp.br/~hirata/Fair.csv')

# numeric_only=True (pandas >= 1.5) skips the non-numeric columns
affairs.corr(numeric_only=True)

In [None]:
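# set_palette changes the default color cycle; heatmap takes its colors from cmap instead
sns.set_palette('colorblind')
sns.heatmap(affairs.corr(numeric_only=True))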

In [None]:
affairs[['age','ym','religious','rate','nbaffairs']].corr()

In [None]:
sns.heatmap(affairs[['age','ym','religious','rate','nbaffairs']].corr(), cmap='coolwarm')

# Histogram vs Barplot

## A quick reminder
- Histogram - numerical values

- Barplot - categorical values
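
The difference can be seen in how the counts behind each plot are computed. The sketch below uses made-up values (`ages` and `jobs` are hypothetical and not taken from the course dataset) and only the standard library.

```python
from collections import Counter

ages = [22, 27, 27, 32, 37, 42, 22, 57]           # hypothetical numerical values
jobs = ["teacher", "clerk", "teacher", "farmer"]  # hypothetical categories

# Histogram: group numerical values into equal-width bins, then count per bin.
bin_width = 10
hist = Counter((age // bin_width) * bin_width for age in ages)

# Bar plot: count each category as it is; there is no binning and no inherent order.
bars = Counter(jobs)

print(sorted(hist.items()))  # bins have a natural order along the axis
print(bars.most_common())    # categories can be shown in any order
```

Changing `bin_width` changes the shape of the histogram, whereas the bar plot counts stay the same however the categories are arranged.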

In [None]:
%matplotlib inline 
import matplotlib.pyplot as plt

import warnings; warnings.simplefilter('ignore')

In [None]:
import pandas as pd

affairs = pd.read_csv('http://vision.ime.usp.br/~hirata/Fair.csv')

## Seaborn distplot
- distplot can show a histogram or a bar plot (it has been deprecated since seaborn 0.11; histplot is its replacement)
- boxplot shows a boxplot

In [None]:
# Histogram

import seaborn as sns
import numpy as np

sns.distplot(affairs['age'], rug=True, kde=False);


In [None]:
# Bar plot

sns.distplot(affairs['occupation'], rug=True, kde=False);

## Boxplot
"Drawing"
- Source: XKCD boxplot

In [None]:
affairs.groupby('sex')['occupation'].plot.hist(alpha=0.5)

## Compare with boxplot

This example shows that we can sometimes break the rules, as long as we do so consciously. However, be aware that some marks may not make sense for a categorical variable, for instance the median line or the outlier marks.

In [None]:
sns.boxplot(x='sex', y='occupation', data= affairs)

## EDA - Find structure in the dataset
- Scatterplot
 - Show relations between two variables
 "Drawing"
 - Source XKCD - Correlation vs Causation

## Remember last class
- We found the scatterplot below very strange

In [None]:
# seaborn >= 0.12 requires x and y as keyword arguments
sns.jointplot(x='ym', y='nbaffairs', data=affairs)

## Next slides are just to show some nice plots

First, is ym correlated with nbaffairs?

$$\rho(\mathbf{X},\mathbf{Y}) = \frac{E[(\mathbf{X}-\mu_\mathbf{X})(\mathbf{Y}-\mu_\mathbf{Y})]}{\sigma_\mathbf{X} \sigma_\mathbf{Y}}$$
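
The definition above can be evaluated by hand on a small sample. This is a minimal sketch in plain Python that replaces the expectation with the sample mean. The function name `pearson` and the toy data are illustrative assumptions.

```python
import math

def pearson(xs, ys):
    """Sample version of rho(X, Y) = E[(X - mu_X)(Y - mu_Y)] / (sigma_X * sigma_Y)."""
    n = len(xs)
    mu_x, mu_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys)) / n
    sigma_x = math.sqrt(sum((x - mu_x) ** 2 for x in xs) / n)
    sigma_y = math.sqrt(sum((y - mu_y) ** 2 for y in ys) / n)
    return cov / (sigma_x * sigma_y)

x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]   # increases linearly with x
y_down = [5, 4, 3, 2, 1]  # decreases linearly with x

print(pearson(x, y_up))    # close to +1
print(pearson(x, y_down))  # close to -1
```

For a DataFrame column pair, `affairs['ym'].corr(affairs['nbaffairs'])` computes the same coefficient.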

In [None]:
# seaborn >= 0.12 requires x and y as keyword arguments
sns.jointplot(x='ym', y='nbaffairs', data=affairs, kind='reg')

In [None]:
# seaborn >= 0.12 requires x and y as keyword arguments
sns.jointplot(x='ym', y='age', data=affairs, kind='reg')

## That reminds me of another XKCD joke

"Drawing"
- Source XKCD - Correlation and Constellations


In [None]:
spellman = pd.read_csv('http://www.exploredata.net/ftp/Spellman.csv')

spellman.info()

## Pandas' Histogram
- Fast

In [None]:
spellman['40'].hist()

## Seaborn's histogram
- Slow
- Approx. 12 seconds on an i7 with 16 GB of RAM

In [None]:
sns.distplot(spellman['40'], rug=True, kde=False);

## Seaborn's sampling alternative
- Because seaborn commands are sometimes heavy, we can use the pandas method sample to plot only a random subset of the dataset.
- sample(n=N)
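
To see what `sample` returns before plotting, here is a small sketch on synthetic data. The Series `values` is a made-up stand-in for a large column, and `random_state` is set only to make the draw reproducible.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a large numerical column (an assumption, not course data).
rng = np.random.default_rng(0)
values = pd.Series(rng.normal(size=10_000))

# Draw 500 rows at random, without replacement; the original index labels are kept.
subset = values.sample(n=500, random_state=0)

print(len(subset))
print(values.mean(), subset.mean())  # the subset roughly preserves the distribution
```

A random subset keeps the overall shape of the distribution, so the histogram looks similar while far fewer points are drawn.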

In [None]:
sns.distplot(spellman['40'].sample(n=500), rug=True, kde=False);

## Last class scatter plot
- We can use the command jointplot to present a scatter plot of two variables.

In [None]:
# seaborn >= 0.12 requires x and y as keyword arguments
sns.jointplot(x='40', y='50', data=spellman, kind='reg')

## Your mileage may vary
- For some reason, this command does not work on Windows, only on Linux.
- relplot can also be used to draw a scatterplot.

In [None]:
sns.relplot(x='40',y='50',data=spellman)

## Iris flower dataset
- Iris flower dataset is a classic dataset for machine learning
- https://en.wikipedia.org/wiki/Iris_flower_data_set
- The attributes are the length and width of the sepal and petal for three species of iris flowers.
- An interesting scatterplot is the pairplot, a matrix in which each variable is plotted against every other variable. The diagonal shows the distribution of each variable (a histogram by default, or a density estimate when hue is set).

In [None]:
df = sns.load_dataset("iris")
sns.pairplot(df, hue="species")

## This plot can be cumbersome
- If the number of variables is large, we are not going to see anything.

In [None]:
from pandas.plotting import scatter_matrix
# Adjust the size of the figure
plt.rcParams['figure.figsize'] = [15, 15]
scatter_matrix(spellman)

## Prepare the pairplot 
- Select only some columns

In [None]:
spellman.columns = ['time','40','50','60','70','80','90','100','110','120','130','140','150','160','170','180','190','200','210','220','230','240','250','260']

In [None]:
# Adjust the size of the figure
plt.rcParams['figure.figsize'] = [10, 10]
scatter_matrix(spellman.loc[:,'40':'100'])

## Showing more than one boxplot in the same figure
- Sometimes we want to compare more than one boxplot

In [None]:
plt.rcParams['figure.figsize'] = [15, 10]
sns.boxplot(data=spellman,width=1)

## Next slides are based on Yeseul Lee's material (Skiena's Data Science Course)

- Read and learn how to create and present nice graphics. 
- Pay attention to how the data and labels are prepared.

https://github.com/yeseullee/Data-science-design-manual-notebooks/blob/master/Chapter4.ipynb

https://github.com/yeseullee/Data-science-design-manual-notebooks/blob/master/Chapter6.ipynb

## Tips that came up on Slido

https://www.researchgate.net/publication/221517808_Useful_Junk_The_effects_of_visual_embellishment_on_comprehension_and_memorability_of_charts

https://miami.pure.elsevier.com/en/publications/graphics-lies-misleading-visuals-reflections-on-the-challenges-an

https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003833