This tutorial makes use of the following R package(s): ** ggplot2**.

Datasets do not always follow a nice symmetrical distribution nor do their spreads behave systematically across different levels (e.g. medians). Such distributions do not lend themselves well to visual exploration since they can mask simple patterns. They can also be a problem when testing hypotheses using traditional statistical procedures. A solution to this problem is non-linear **re-expression** (aka transformation) of the values. In univariate analysis, we often seek to **symmetrize** the distribution and/or **equalize** the spread. In multivariate analysis, the objective is to usually **linearize** the relationship between variables and/or to **normalize** the residual in a regression model.

One popular form of re-expression is the **log** (natural or base 10). The other is the **box-cox** family of transformations (of which the *log* is a special case).

One of the most popular transformations used in data analysis is the logarithm. The log, \(y\), of a value \(x\) is the power to which the base must be raised to produce \(x\). This requires that the log function be defined by a **base**, \(b\), such as `10`

, `2`

or `exp(1)`

(the latter defining the natural log).

\[ y = log_b(x) \Leftrightarrow x=b^y \]

In R, the base is defined by passing the parameter `base=`

to the `log()`

function as in `log(x , base=10)`

.

Re-expressing with the log is particularly useful when the change in one value as a function of another is multiplicative and not additive. An example of such a dataset is the compounding interest. Let’s assume that we start off with $1000 in an investment account that yields 10% interest each year. We can calculate the size of our investment for the next 50 years as follows:

```
rate <- 0.1 # Rate is stored as a fraction
y <- vector(length=50) # Create an empty vector that can hold 50 values
y[1] <- 1000 # Start 1st year with $1000
# Next, compute the investment ammount for years 2, 3, ..., 50.
# Each iteration of the loop computes the new amount for year i based
# on the previous year's amount (i-1).
for(i in 2:length(y)){
y[i] <- y[i-1] + (y[i-1] * rate) # Or y[i-1] * (1 + rate)
}
```

The vector `y`

gives us the amount of our investment for each year over the course of 50 years.

```
[1] 1000.000 1100.000 1210.000 1331.000 1464.100 1610.510 1771.561 1948.717 2143.589 2357.948
[11] 2593.742 2853.117 3138.428 3452.271 3797.498 4177.248 4594.973 5054.470 5559.917 6115.909
[21] 6727.500 7400.250 8140.275 8954.302 9849.733 10834.706 11918.177 13109.994 14420.994 15863.093
[31] 17449.402 19194.342 21113.777 23225.154 25547.670 28102.437 30912.681 34003.949 37404.343 41144.778
[41] 45259.256 49785.181 54763.699 60240.069 66264.076 72890.484 80179.532 88197.485 97017.234 106718.957
```

We can plot the values as follows:

```
# Note that a scatter plot is created from 2 variables, however, if only one
# is passed to the plot() function, R will assume that the x variable is
# an equally spaced index.
plot(y, pch=20)
```

The change in difference between values from year to year is **not additive**, in other words, the difference between years 48 and 49 is different than that for years 3 and 4.

Years | Difference |
---|---|

`y[49] - y[48]` |
8819.75 |

`y[4] - y[3]` |
121 |

However, the ratios between the pairs of years are identical:

Years | Ratio |
---|---|

`y[49] / y[48]` |
1.1 |

`y[4] / y[3]` |
1.1 |

We say that the change in value is *multiplicative* across the years. In other words, the value amount 6 years out is \(value(6) = (yearly\_increase)^{6} \times 1000\) or `1.1^6 * 1000`

= 1771.561 which matches value `y[7]`

.

When we expect a variable to change multiplicatively as a function of another variable, it is usually best to transform the variable using the logarithm. To see why, plot the log of `y`

.

`plot(log(y), pch=20)`

Note the change from a curved line to a perfectly straight line. The logarithm will produce a straight line if the rate of change for `y`

is constant over the range of `x`

. This is a nifty property since it makes it so much easier to see if and where the rate of change differs. For example, let’s look at the population growth rate of the US from 1850 to 2013.

```
dat <- read.csv("http://personal.colby.edu/personal/m/mgimond/R/Data/Population.csv", header=TRUE)
plot(US ~ Year, dat, type="l")
```

The population count for the US follows a slightly curved (convex) pattern. It’s difficult to see from this plot if the rate of growth is consistent across the years (though there is an obvious jump in population count around the 1950’s). Let’s log the population count.

`plot(log(US) ~ Year, dat, type="l") `

It’s clear from the log plot that the rate of growth for the US has not been consistent over the years (had it been consistent, the line would have been straight). In fact, there seems to be a gradual decrease in growth rate over the 150 year period (though a more thorough analysis would be needed to see where and when the growth rates changed).

A logarithm is defined by a base. Some of the most common bases are `10`

, `2`

and `exp(1)`

with the latter being the natural log. The bases can be defined in the call to `log()`

by adding a second parameter to that function. For example, to apply the log base 2 to the 5th value of the vector `y`

, type `log( y[5], 2)`

. To apply the natural log to that same value, simply type `log( y[5], exp(1))`

. If you don’t specify a base, `R`

will default to the natural log.

The choice of a log base will not impact the shape of the logged values in the plot, only in its absolute value. So unless interpretation of the logged value is of concern, any base will do. Generally, you want to avoid difficult to interpret logged values. For example, if you apply log base 10 to the investment dataset, you will end up with a smaller range of values thus more decimal places to work with whereas a base 2 logarithm will generate a wider range of values and thus fewer decimal places to work with.

A rule of thumb is to use log base 10 when the range of values to be logged covers 3 or more powers of ten, \(\geq 10^3\) (for example, a range of 5 to 50,000); if the range of values covers 2 or fewer powers of ten, \(\leq 10^2\)(for example, a range of 5 to 500) then a natural log or a log base 2 log is best.

The Box-Cox family of transformations offers a broader range of re-expression options (which includes the log). The values are re-expressed using the algorithm:

\[ re-expressed(x) = \frac{x^p -1}{p}\; for\; p\neq 0 \] \[ re-expressed(x) = ln(x)\; for\; p = 0 \]

The objective is to find a value for \(p\) from a “ladder” of powers (e.g. -1, -1/2, 0, 1/2, 1) that does a good job in re-expressing the batch of values. Technically, \(p\) can take on any value. But in practice, we normally pick a value for \(p\) that may be “interpretable” in the context of our analysis. For example, a log transformation (`p=0`

) may make sense if the process we are studying has a steady growth rate. A cube root transformation (p = 1/3) may make sense if the entity being measured is a volume (e.g. rain fall measurements). But sometimes, the choice of \(p\) may not be directly interpretable or may not be of concern to the analyst.

A nifty solution to finding an appropriate \(p\) is to create a function whose input is the vector (that we want to re-express) and a \(p\) parameter we want to explore.

```
BC <- function(x, p = 0) {
if(p == 0) {
z <- log(x)
} else {
z <- (x^p - 1)/p
}
}
```

To use the custom function `BC`

simply pass two vectors: the batch of numbers being re-expressed and the \(p\) parameter.

```
# Create a skewed distribution of 50 random values
set.seed(9)
a <- rgamma(50, shape=1)
# Let's look at the skewed distribution
boxplot(a, horizontal = TRUE)
```

The batch is strongly skewed to the right. Let’s first try a square-root transformation (`p=1/2`

)

```
a.re <- BC(a, p = 1/2)
boxplot(a.re, horizontal = TRUE)
```

That certainly helps minimize the skew, but the distribution still lacks symmetry. Let’s try a log transformation (`p=0`

):

```
a.re <- BC(a, p = 0)
boxplot(a.re, horizontal = TRUE)
```

That’s a little too much over-correction; we don’t want to substitute a right skew for a left skew. Let’s try a power in between (i.e. `p=1/4`

):

```
a.re <- BC(a, p = 1/4)
boxplot(a.re, horizontal = TRUE)
```

That’s much better. The distribution is now nicely balanced about its median.

It can be helpful to simulate distributions of difference skewness to see how a quantile or a normal quantile plot may behave.

A beta distribution is used to simulate a range of distributions from right skew to left skew. The top row shows the different density distribution plots; the bottom row shows the **quantile plots** for each distribution.

A beta distribution is used to simulate a range of distributions from right skew to left skew. The top row shows the different density distribution plots; the bottom row shows the **normal q-q plots** for each distribution.

A spread vs level plot not only tells us if there is a systematic relationship between spread and level, it can also suggest the power transformation to use. Note that the s-l method discussed here is not the one presented in Cleveland’s book (see the section titled “Spread-location plot” on the course website for an alternative version of the plot). We will make use of a custom function called `sl`

available in the `ES218.R`

script to help construct the s-l plot using .

```
source("http://mgimond.github.io/ES218/es218.R")
df <- read.csv("http://mgimond.github.io/ES218/Data/Food_web.csv")
# Create s-l table from custom sl() function
df.sl <- sl(x=dimension, y=mean.length, df)
# Plot spread vs median
plot(sprd ~ med, df.sl, pch=16)
```

The plot suggests a monotonic relationship between spread and median. Next, we’ll fit a line to this scatter plot and compute its slope. We’ll use the `lm()`

function, but note that many other line fitting operations could be used as well.

```
# Run regression model
M <- lm(sprd ~ med, df.sl)
abline(M, col="red")
```

The slope can be used to come up with the best power transformation to minimize the systematic increase in spread: \(p = 1 - slope\).

The slope can be extracted from the model `M`

as using the `coef`

function:

`coef(M)`

```
(Intercept) med
-2.317414 2.649166
```

The second value in the output is the slope. So the power to use is `1 - 2.65`

or `-1.65`

. We will use the custom Box-Cox power transformation function `bc`

(also available in the `ES218.R`

script) to re-express the `mean.length`

values. We’ll add the re-expressed values as a new column to `df`

:

`df$bc1.mean.length <- bc(df$mean.length, -1.65)`

Let’s compare boxplots between the original values with the re-expressed values.

```
OP <- par(mfrow = c(1,2))
boxplot(mean.length ~ dimension, df, main = "Original data")
boxplot(bc1.mean.length ~ dimension, df, main = "Re-expressed data")
par(OP)
```

Recall that our goal here was to minimize any systematic relationship between spread and median. The re-expression seems to have done the trick.

In his book, Cleveland proposes a power transformation of -1; a power less aggressive than ours. Let’s see how well it performs compared to our transformation using the s-l plot.

```
# Re-express data using Cleveland's power of -1 (or 1/x)
df$bc2.mean.length <- bc(df$mean.length, -1)
# Generate s-l table
df1.sl <- sl(x=dimension, y=bc1.mean.length, df)
df2.sl <- sl(x=dimension, y=bc2.mean.length, df)
# Plot the data
OP <- par(mfrow=c(1,2))
plot(sprd ~ med, df2.sl, pch=16, main="p = -1")
plot(sprd ~ med, df1.sl, pch=16, main="p = -1.65")
par(OP)
```

Neither plot seems to do a bad job in stabilizing spread. To help decide which power to choose, let’s see how well they help balance the spread about the median. We’ll generate letter value summary plots using the custom `lsum`

function available in the `ES218.R`

course script.

```
library(ggplot2)
# Generate letter value summary plot for original data
df %>% group_by(dimension) %>%
do(as.data.frame( lsum(.$mean.length, l=6) )) %>%
ggplot(.) + aes(x=letter, y=mid) + geom_point() + facet_grid(.~dimension) + ylab("Original data")
# Generate letter value summary plot for p = -1.65
df %>% group_by(dimension) %>%
do(as.data.frame( lsum(.$bc1.mean.length, l=6) )) %>%
ggplot(.) + aes(x=letter, y=mid) + geom_point() + facet_grid(.~dimension) + ylab("p = -1.65")
# Generate letter value summary plot for p = -1
df %>% group_by(dimension) %>%
do(as.data.frame( lsum(.$bc2.mean.length, l=6) )) %>%
ggplot(.) + aes(x=letter, y=mid) + geom_point() + facet_grid(.~dimension) + ylab("p = -1")
```

The original data does not suffer from skewness except for the three-dimension dataset which exhibits a consistent skew across all depth of the data towards larger values. The power of `-1.65`

reverses that skew–an outcome we don’t want–whereas the power of `-1`

seems to do a better job in balancing the spread across all depths. The power of `-1`

does not seem to introduce significant skew in the `Mixed`

and `Two`

batches which were fine to start with; this is good news.

If we are interested in both removing skewness AND in stabilizing spread across all levels, the power of `-1`

seems to accomplish this quite well with this dataset. But be aware that most datasets are not that accommodating and that one may be faced with having to sacrifice one characteristic of the spread for another. Also note that re-expressions may not be a common fix across all factors (categories) of the data. This suggests that the data may behave differently under different circumstances. For example, in our food web dataset, skewness is prominent for the mixed dimension batch but not for the two or three dimension batches.