# Data Normalization

## Learning objectives
*After completing this activity, you should be able to:*
- discuss the importance of normalization and scaling
- implement *min-max scaling* in Python
- utilize scikit-learn to perform normalization

## Groups
You may optionally work in groups for this assignment.  The maximum size of a group is 2.


Please list the full names of your group members below:
- person A
- person B


## Background
The features in your data can differ widely in 
magnitude, units, and range. The similarity and dissimilarity measures that ML algorithms employ are not aware of these differences and, depending on the technique applied, can produce misleading results if the data is not preprocessed with care.
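
To make the point concrete, here is a minimal sketch using two made-up observations (the features, units, and values are assumptions for illustration and are not the assignment data). It measures how much each feature contributes to the squared Euclidean distance.

```python
import numpy as np

# Hypothetical observations: column 0 is height in metres, column 1 is weight in grams.
a = np.array([1.60, 70000.0])
b = np.array([1.95, 70100.0])

diff = a - b
dist = np.linalg.norm(diff)           # Euclidean distance between a and b
share = diff**2 / np.sum(diff**2)     # each feature's share of the squared distance

print("distance:", dist)
print("share per feature:", share)
```

In this sketch the weight column accounts for almost all of the distance, even though the height difference (35 cm) is large in human terms and the weight difference (100 g) is negligible.
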
The matrix *X* (shown below) has two columns: column 0 represents the feature *age* and column 1 represents the feature *salary*.  Each row represents one data point/person/observation.

$
\begin{bmatrix} 
23   &  56000 \\
35   &  75000 \\ 
55   &  76000 \\
\end{bmatrix}
$

Consider a single data example, *p*, where p = [39, 75750].


### <span style="color:red">Question #1</span>
Write code to create a numpy matrix *X* and vector *p*
(represent *p* still as a matrix with one row if you need to) 
with this data.


In [None]:
## Your Code here.

import numpy as np

## declare X with values as shown above 
# declare p with values as shown above
# YOUR CODE HERE
raise NotImplementedError()

### <span style="color:red">Question #2</span>
Without using a computer, decide by inspection
which row/observation in matrix *X* is the most similar 
to the observation represented by vector *p*. 
Think about what the columns encode and explain your answer 
(do not do any distance calculations; imagine you are not a machine 
 learning student or a computer scientist).


## *YOUR ANSWER HERE*

### <span style="color:red">Question #3</span>

Compute the distance between *p* and each example in *X* using 
the Euclidean distance. You can use the [sklearn.metrics.pairwise_distances_argmin_min function](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise_distances_argmin_min.html), 
which returns two arrays: the index of the nearest point and the distance to that point. 

Store the result of the function in a variable named **p_closest**. Which vector (row number, starting from 0) in X did this method identify as the closest to p?
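
As a rough sketch of how this function is called, the example below uses small made-up arrays (`A` and `q` are hypothetical names and values, not the assignment data). For each row of the first argument, the function reports the index of the nearest row in the second argument and the distance to it.

```python
import numpy as np
from sklearn.metrics import pairwise_distances_argmin_min

# Hypothetical reference points (one per row) and a single query point.
A = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [10.0, 10.0]])
q = np.array([[2.0, 3.0]])   # kept 2-D: one row, two columns

# The default metric is Euclidean distance.
idx, dist = pairwise_distances_argmin_min(q, A)

print("nearest row index:", idx[0])
print("distance to it:", dist[0])
```

Note that the query must be a 2-D array, which is why *p* is best represented as a matrix with one row.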

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### <span style="color:red">Question #4</span>
Which point was closest by distance?  Is this the same point you identified in question 2?


## *YOUR ANSWER HERE*

## Different Scales in the Feature Space

To compensate for differences in magnitude, scale, units, etc.,
it is important to **normalize** the data. This prevents features with larger scales from dominating the distance calculation. 
The formula below transforms the data using 
[min-max normalization](https://en.wikipedia.org/wiki/Feature_scaling).

$
x'_i = \frac{x_i - \min(X_i)}{\max(X_i) - \min(X_i)}
$

Here $\max(X_i)$ is the maximum value in column $i$ (and $\min(X_i)$ is the minimum), $x_i$ is the original value for a specific row in column $i$,
and $x'_i$ is the modified/scaled version.
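
As a quick worked example of the formula, assume a hypothetical column $X_i$ containing the values 10, 20, and 50 (illustrative numbers, not taken from *X*), so that $\min(X_i) = 10$ and $\max(X_i) = 50$. The value $x_i = 20$ then scales to:

$
x'_i = \frac{20 - 10}{50 - 10} = \frac{10}{40} = 0.25
$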

### <span style="color:red">Question #5</span>
Write Python code in the cell below that creates a new matrix, X_norm, that represents a min-max scaled version of 
*X* using the formula shown in the above cell.  Please note:
- *do* use vectorized numpy operations
- **do NOT** use scikit-learn or other normalization packages for these operations


In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### <span style="color:red">Question #6</span>
Review the data in **X_norm**. What is the range of values in each column? 
How do these ranges compare to those in the original **X** matrix?

## *YOUR ANSWER HERE*

### <span style="color:red">Question #7</span>
Does the transformation of data from *X* to **X_norm**
represent a *linear* transform?  Recall that a linear transformation
of distances means that, for each dimension, the relative distance between points is maintained after the transformation.  

Show the distance between two points (two rows in the matrix) and show the relative distance in each dimension both before and after the transformation.


## *YOUR ANSWER HERE*

## Scikit-learn MinMaxScaler object

For models that are sensitive to scales between features, it is
common practice to scale all the training/testing data. Scikit-learn
provides an object called [MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) to perform these operations. Here is an example of using this
object on *X*.

The code illustrates that the fit and transform operations can be performed in a single function call.  Fit determines how to scale 
each column (recall the equation from earlier: we need to record
each column's min and max values) and transform computes $x'$ for each
entry in the matrix.  Once the scaling object is *fitted*, we can
then transform additional data as required (illustrated below on the 
vector **p**).  

In [None]:
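# Example code using scikit-learn's MinMaxScaler
# (assumes X and p were created in Question 1, with p as a matrix with one row)
import numpy as np
import sklearn.preprocessing as skp

skScaler = skp.MinMaxScaler()
# fit learns each column's min and max from X, then transform scales X
X_norm_scikit = skScaler.fit_transform(X)
# reuse the min and max learned from X to scale p
p_min_max = skScaler.transform(p)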
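
To see what *fit* actually records, the sketch below runs the two steps separately on a small made-up matrix (`A`, `new_point`, and their values are hypothetical, chosen only for illustration). It relies on the fitted scaler's `data_min_` and `data_max_` attributes.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: three observations, two features on different scales.
A = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [5.0, 500.0]])
new_point = np.array([[3.0, 200.0]])

scaler = MinMaxScaler()
scaler.fit(A)                       # records each column's min and max
print("column minima:", scaler.data_min_)
print("column maxima:", scaler.data_max_)

A_scaled = scaler.transform(A)      # applies the min-max formula column by column
new_scaled = scaler.transform(new_point)  # reuses the min/max learned from A

# fit_transform performs both steps in a single call.
A_scaled_one_call = MinMaxScaler().fit_transform(A)

print(A_scaled)
print(new_scaled)
```

Calling `fit` once and then `transform` on later data keeps every observation on the scale learned from the original matrix.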

### <span style="color:red">Question #8</span>
In the code block below, find the nearest point in *X_norm_scikit* to *p_min_max*.  Is the nearest point the same as the one identified in question 2 and/or 3?  Comment briefly on why the same points were or were not identified and which result might be "better".

In [None]:
# Code for question 8

## *YOUR ANSWER HERE*

### <span style="color:red">Question #9</span>
List at least one complication that can arise when applying min-max scaling.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()