So, before we actually analyze the data, we need to do some cleaning of the data.

And for this, I'm going to present some methods related to data normalization.

So, in biology, when we repeat experiments, measuring exactly the same thing, we rarely get exactly the same result.

There is scatter across the repeated results.

So, the data that represent the measurement of a protein, mRNA, or gene have a center, or average, and some scatter around it.

In some cases the source of this scatter is biological, because of the inherent noise within the biological system.

And in some cases, it comes from the experimental procedure or an instrumental error.

If it is the latter, we want to normalize the data so that the experimental bias is removed as much as possible.

So now, we are going to go over several methods for normalizing data, trying to remove this shift in the mean of the data.

So first, how do we quantify the center of the data and the spread of the data?

These can simply be computed using the mean and standard deviation.

The mean is simply the sum of all of the values, the repeated values that were measured, divided by the number of measurements.

To compute the standard deviation, we subtract the mean from each value.

We square those differences to make them positive, then we divide the sum of all those squared differences by the number of measurements.

And then we take the square root of this quotient.

And this gives us a quantification of the spread in the data.
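The mean and standard deviation described above can be sketched in plain Python like this; the measurement values here are hypothetical:

```python
import math

# Hypothetical repeated measurements of the same quantity.
values = [4.1, 3.8, 4.4, 4.0, 3.7]

# Mean: the sum of all repeated values divided by the number of measurements.
mean = sum(values) / len(values)

# Standard deviation: subtract the mean from each value, square those
# differences, average the squares, and take the square root.
variance = sum((v - mean) ** 2 for v in values) / len(values)
std = math.sqrt(variance)

print(mean, std)
```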

So the first method of normalization is called Z-score normalization, and what Z-score normalization does is make the mean of the data 0.

It centers the data on 0 and also makes the spread the same: the standard deviation after Z-score normalization becomes 1.

In order to do that, we first compute the mean and the standard deviation for each row.

Then we subtract the mean from each element in the matrix, and

then divide by the standard deviation.

The final product is a new data set with a centered mean and an even standard deviation.
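Those row-wise steps can be sketched with numpy; the small matrix here is hypothetical, with rows as genes and columns as samples:

```python
import numpy as np

# Hypothetical expression matrix: rows are genes, columns are samples.
data = np.array([[2.0, 4.0, 6.0],
                 [10.0, 20.0, 30.0]])

# Compute the mean and standard deviation of each row.
row_mean = data.mean(axis=1, keepdims=True)
row_std = data.std(axis=1, keepdims=True)

# Subtract the mean from each element, then divide by the standard deviation.
z = (data - row_mean) / row_std

# Every row of z now has mean 0 and standard deviation 1.
print(z)
```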

Not all data that are measured in biology follow the normal distribution that I'm going to mention later.

Quantile normalization is less sensitive to the type of distribution the data have.

It's also a common method used to normalize data from gene expression microarrays.

With quantile normalization we normalize the columns.

So first, we sort each column by its values, keeping track of which row each value came from.

After we sort the values, we compute the average of each row of the sorted matrix.

Then we replace each value in a row with that row's average.

And then, we put the values back into the original rows they came from.

So this makes the new, normalized data matrix have the same distribution, and thus the same sum, in every column.
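The steps above can be sketched as follows; the small matrix is hypothetical, and ties are handled naively by sort order:

```python
import numpy as np

# Hypothetical matrix: rows are genes, columns are arrays.
data = np.array([[5.0, 4.0, 3.0],
                 [2.0, 1.0, 4.0],
                 [3.0, 4.0, 6.0],
                 [4.0, 2.0, 8.0]])

# Step 1: sort each column, keeping track of which row each value came from.
order = np.argsort(data, axis=0)
sorted_cols = np.sort(data, axis=0)

# Step 2: average each row of the sorted matrix.
rank_means = sorted_cols.mean(axis=1)

# Step 3: replace each value with its rank's average, then put the
# values back into the rows they originally came from.
normalized = np.empty_like(data)
for col in range(data.shape[1]):
    normalized[order[:, col], col] = rank_means

print(normalized)
```

After this, every column holds the same set of values, so the column sums are identical.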

A third normalization method is called Median Polish Normalization.

It's also known as the last step of RMA normalization applied to gene expression microarrays.

So the median polish normalizes both the rows and the columns at the same time.

And it follows several steps.

So in the first step, we identify the median of each row.

In this example, the median of the first row is 4.

And then, we subtract that median from all the elements of its row, producing a new data matrix that now holds the differences between the median and the other values.

And then, we take that resultant difference matrix, find the medians of the columns, and subtract those column medians from each value in the corresponding column, resulting in another data matrix.

And then, we repeat this process, again looking for the medians of each row, until we converge to medians of 0 across the rows and columns.

Once the algorithm has converged, we can take that newly generated residual matrix and subtract it from the original data to normalize the original data set, and what we are left with is a normalized matrix.

And now, the averages of each row are considered robust means of the RMA-normalized data matrix.
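A minimal sketch of this iteration, assuming a small hypothetical matrix and a fixed iteration count in place of a proper convergence check:

```python
import numpy as np

def median_polish(data, n_iter=10):
    """Repeatedly subtract row medians and then column medians until
    the residuals have medians of (approximately) 0 in both directions."""
    residuals = data.astype(float).copy()
    for _ in range(n_iter):
        residuals -= np.median(residuals, axis=1, keepdims=True)  # rows
        residuals -= np.median(residuals, axis=0, keepdims=True)  # columns
    return residuals

# Hypothetical matrix: rows are probes, columns are arrays.
data = np.array([[4.0, 5.0, 6.0],
                 [2.0, 3.0, 10.0],
                 [7.0, 8.0, 9.0]])

residuals = median_polish(data)

# Subtracting the residual matrix from the original data leaves the
# normalized (fitted) matrix.
normalized = data - residuals
print(normalized)
```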

Another data cleaning strategy that is very common is log transforming the data.

So many times, when we measure a lot of variables, we are faced with extreme values.

So here, in this particular example, the data set comes from phosphoproteomic experiments, and we see that we have extreme values.

So, when we plot the raw values, we can see a sharp peak in the center, while there are many individual values that are very high or very low.

So, typically, to avoid that dominance of extreme values, we log transform the data.

And this is before and after log transforming this particular phosphoproteomic data, making it look more normal.
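The effect can be sketched on a hypothetical set of raw intensities; log2 is a common choice for expression-style data:

```python
import numpy as np

# Hypothetical raw intensities with a few extreme values.
raw = np.array([2.0, 4.0, 8.0, 6.0, 1024.0, 16.0, 4096.0])

# Log-transform so the extreme values no longer dominate the scale.
logged = np.log2(raw)

print(logged)
```

The spread of the logged values is far smaller than the spread of the raw values, so a histogram of them is no longer dominated by the extremes.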

So what we looked at just now is a histogram of all of the values of a specific experiment.

In the first few slides, we talked about the normal distribution.

This is the most famed distribution; it's the bell-shaped curve.

It has a defined mean, and standard deviations from the mean