### MATH 401 - Lab 3 - Spring 2016
#### SLR II + Transformations

```{r, message=FALSE}
require(mosaic)
```

Let's start by generating some random data. Note: R is *very* good at this!

```{r}
# create a vector of the integers from 1 to 100
x = 1:100
# generate a standard normal random variable for the exponent
p = rnorm(1); p
# raise x to the p-th power, and add some randomness
y = jitter(x^p, factor = length(x) / 2)
# combine x and y into a data frame
Rand = data.frame(x, y)
```

Now, let's pretend that we don't know the value of $p$. That is, we know that $y$ is a function of $x$ raised to some power, but we don't know what that power is. Our goal is to transform the $y$ variable so that we can find a linear regression model that works well. To do this, we will use graphical techniques to linearize the relationship between the variables (which are most likely related in a non-linear fashion).

#### Choose

* Plot $y$ as a function of $x$. Does the relationship appear to be linear? Note: everyone is obtaining a random value for $p$, so it is possible some of you may get one that looks linear. If so, re-execute the chunk of code above that generates $p$ and saves the new variable $y$.

```{r, eval=FALSE}
plotPoints(y ~ x, data=Rand)
```

#### Fit

* Fit a linear regression model to the data, and generate diagnostic plots.

```{r, eval=FALSE}
fm = lm(y ~ x, data=Rand)
summary(fm)
plot(fm, which=1)  # generates the residuals vs. fitted plot - NEW option
plot(fm, which=2)  # generates the QQ plot of residuals - NEW option
```

#### Assess

* Assess the fit. Is the linear model appropriate? Why or why not?

> ANSWER

### Tukey's Bulging Rule and Ladder of Powers

At this point you have two variables ($x$ and $y$) which are clearly related, but most likely not in an obvious linear pattern. We are going to **manually** apply transformations to the $y$ variable in an attempt to straighten out the relationship between $x$ and $y$. A helpful heuristic for applying transformations is given by Tukey's Bulging Rule and Ladder of Powers. The ladder gives us a set of transformations that correspond to an exponent to which we are going to raise $y$. Here is the ladder:

Rung (exponent $q$) | Step (transformation $f_q(y)$)
------------- | -------------
3 | $y^3$
2 | $y^2$
1 | $y$
$\frac{1}{2}$ | $\sqrt{y}$
0 | $\log{(y)}$
-1 | $-\frac{1}{y}$
-2 | $-\frac{1}{y^2}$
-3 | $-\frac{1}{y^3}$

Note that in general, we can think of moving up or down the ladder as raising $y$ to the power corresponding to the rung of the ladder, $q$. However, this is not *literally* true (for example, $y^0 \neq \log{(y)}$!). The actual rule that generates the ladder is:

$$
f_q(y) =
\begin{cases}
y^q & \text{if } q > 0 \\
\log{y} & \text{if } q = 0 \\
-y^{q} & \text{if } q < 0
\end{cases},
\qquad \text{where } q \text{ is the rung of the ladder}
$$
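Why the minus sign on the negative rungs? When $q < 0$, $y^q$ is a *decreasing* function of $y$, so negating it keeps every step on the ladder increasing in $y$ and preserves the orientation of the scatterplot. A quick numerical illustration (the vector `y.demo` is just a made-up example):

```{r}
# for q = -1: y^(-1) reverses the order of the data, while -y^(-1) preserves it
y.demo = c(1, 2, 4, 8)
y.demo^(-1)   # decreasing in y.demo
-y.demo^(-1)  # increasing in y.demo
```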
The following diagram gives us insight as to which transformations are likely to work. For example, suppose the pattern we see in the data is shaped like the blue line in the first quadrant of the diagram below. Then, since both $x$ and $y$ are positive in that quadrant, we should try moving up the ladder in $y$, in $x$, or in both. Many people find it easiest to think about transformations of the $y$ variable, so let's try that first.

#### Application of Tukey's Rule: Manual Approach

* Use Tukey's Bulging Rule to determine an appropriate transformation of the $y$ variable. Create a new variable in your data frame containing the transformed $y$ values.

I've created an example of how that would work for a log-transformed version of $y$. Note that because we are using random data, it's possible (but unlikely) that the code below will try to take the log of a negative number. If that happens, you can either remove the resulting NaN points via subsetting or simply re-run the Rand code from above to create a new random dataset. The command below DOES add the new variable to the data frame - see if you can figure out how.

```{r}
# e.g. for a log transformation
Rand$y.new = with(data=Rand, log(y))
```

Note that you may have to try multiple transformations until you find one that works well.

```{r, eval=FALSE}
plot(y.new ~ x, data=Rand)
```

* Once you have something that looks roughly linear, re-fit the linear model and assess it. Does the new model fit the assumptions for linear regression better than the old one?

> ANSWER

#### Application of Tukey's Rule: Automatic Approach using the **manipulate** package

The previous method for applying transformations was labor-intensive, since we had to repeatedly compute the transformations and then re-plot them. We can take advantage of the **manipulate** functionality in RStudio to do this automatically.

* Reset the $y$ values by executing the following code.

```{r}
p = rnorm(1)
Rand$y = jitter(x^p, factor = length(x) / 2)
p  # display the new exponent
```

What is your current value of $p$?

> ANSWER

* Repeat the entire process above using the **manipulate** package. You may need to install the package, or at least make sure it is checked in the Packages menu. Execute the following code and **move the slider to best linearize the data**. Again, if your reported value of $p$ is near 1, re-execute the code chunk above to get a relationship that isn't already nearly linear.

```{r, eval=FALSE}
tukeyLadder = function(x, q = NULL) {
  # apply the ladder-of-powers transformation for rung q
  if (is.null(q)) {
    return(x)
  }
  if (q == 0) {
    x.new = log(x)
  } else if (q < 0) {
    x.new = -x^q
  } else {
    x.new = x^q
  }
  return(x.new)
}

tukeyPlot = function(x, y, q.y, q.x = 1, ...) {
  y.new = tukeyLadder(y, q.y)
  x.new = tukeyLadder(x, q.x)
  y.center = mean(y.new, na.rm=TRUE)
  x.center = mean(x.new, na.rm=TRUE)
  x.bottom = 0.1 * (max(y.new) - min(y.new)) + min(y.new)
  y.left = 0.1 * (max(x.new) - min(x.new)) + min(x.new)
  xyplot(y.new ~ x.new,
    panel = function(x, y, ...) {
      panel.xyplot(x, y, pch=19, alpha=0.2, cex=2)
      panel.text(y.left, y.center, paste("q.y =", q.y), col="red", cex=2)
      panel.text(x.center, x.bottom, paste("q.x =", q.x), col="red", cex=2)
    }
  )
}

require(manipulate)
manipulate(
  with(Rand, tukeyPlot(x, y, q.y)),
  q.y = slider(-3, 3, step=0.25, initial=1)
)
```

* Describe the shape of the original scatterplot between $x$ and $y$. Which quadrant of Tukey's Bulge coordinate system does it lie in? Use the slider to find the optimal power to raise $y$ to in order to linearize the data, and report what you found.

> ANSWER

* Does the transformation you prefer for linearizing the data agree with Tukey's Bulging Rule and Ladder?

> ANSWER

Try re-running the reset and manipulate chunks above to get a different relationship, and linearize it one more time.
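Once the slider suggests a rung, you can lock it in and refit the model. Here is a minimal sketch, assuming you have run the `tukeyLadder` definition above; `q.best` is a placeholder for whatever rung your slider suggested:

```{r, eval=FALSE}
q.best = 0.5                               # placeholder: substitute your own rung
Rand$y.best = tukeyLadder(Rand$y, q.best)  # transform y with the chosen rung
fm.best = lm(y.best ~ x, data=Rand)
summary(fm.best)
plot(fm.best, which=1)  # residuals vs. fitted
plot(fm.best, which=2)  # QQ plot of residuals
```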
#### Real Data - Xforms + ANOVA + Outliers

In the previous exercise, we fabricated data using the relationship $y = x^p$. Next, let's look at some real data, where the relationship between $y$ and $x$ is truly unknown. We'll also practice working with outliers and the ANOVA output for regression.

* Load the **Boston** dataset.

```{r}
boston = read.csv("http://personal.denison.edu/~whiteda/files/Teaching/boston.csv", header=TRUE)
```

This data set contains 14 variables on housing values in the suburbs of Boston:

* crime
* zone (residential proportion of housing; always between 0 and 100)
* indust (industry proportion)
* charles (on the river or not; binary)
* nox (amount of nitric oxides)
* rooms (avg. number per dwelling)
* age (proportion built before 1940)
* distance (weighted distance to 5 employment centers)
* radial (index of access to highways; integers from 1 to 24)
* tax (property tax per \$10,000)
* ptratio (pupil-teacher ratio)
* minor (index of proportion of minorities)
* lstat (% lower status of population)
* medv (median value of homes)

The oddest variable is perhaps *minor*. Values of 196 and below indicate a large proportion of minorities, and values around 396 indicate a completely non-minority area; as values decrease from 396 toward 196, the minority proportion of the area increases.

* Suppose we want to try to predict *nox* as a function of *distance*.

```{r}
xyplot(nox ~ distance, data=boston)
```

* Use Tukey's Bulging Rule and Ladder, and the **manipulate** function, to linearize the data. This time, we have sliders to transform both the **nox** and the **distance** variables.

```{r, eval=FALSE}
manipulate(
  with(boston, tukeyPlot(distance, nox, q.y, q.x)),
  q.y = slider(-3, 3, step=0.25, initial=1),
  q.x = slider(-3, 3, step=0.25, initial=1)
)
```

(a) Try moving one slider and then the other. Does it work? Why or why not?

* Find a set of transformations you are satisfied with to fit a linear regression model. An example plot based on the square root of distance and the log of nox is generated below. Report your fitted regression line in terms of the new variables you used. You may find it useful to add the new variables to the dataset so you can call them directly. We'll have an informal competition among our randomly decided groups to see who can come up with the "best" model using transformations. Think about how "best" should be evaluated in this context; it will help motivate some of our later discussions.

```{r, eval=FALSE}
xyplot(log(nox) ~ sqrt(distance), type=c("p","r"), data=boston)
```

Perhaps you are growing concerned about outliers; let's also start working with the ANOVA output. Do you see any points that might be considered outliers based on your preferred fit? You probably do. Now we'll figure out how to identify outliers, based on the log(nox) and sqrt(distance) example. You can adjust the code as needed for your preferred transformations.

```{r}
which(log(boston$nox) > -0.2)       # IDs points with log(nox) > -0.2
which(sqrt(boston$distance) > 3.2)  # IDs points with sqrt(distance) > 3.2
which(log(boston$nox) > -0.2 & sqrt(boston$distance) < 1.2)  # IDs points meeting both conditions
```

Recall that we could also just look for standardized residuals larger than 2 in absolute value; abs() computes the absolute value. So, if you had a fitted model called fm, you could do something like the following:

```{r}
fm = lm(log(nox) ~ sqrt(distance), data=boston)
summary(fm)
standres = rstandard(fm)  # computes standardized residuals
studres = rstudent(fm)    # computes studentized residuals
which(abs(standres) > 2)  # which standardized residuals are > 2 in absolute value
which(abs(studres) > 2)   # which studentized residuals are > 2 in absolute value
```
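If you are curious what rstandard() is doing, it divides each raw residual by its estimated standard deviation, $s\sqrt{1 - h_{ii}}$, where $s$ is the residual standard error and $h_{ii}$ is the point's leverage. A quick check against the built-in function (illustration only):

```{r}
h = hatvalues(fm)      # leverages h_ii
s = summary(fm)$sigma  # residual standard error s
standres.byhand = resid(fm) / (s * sqrt(1 - h))
all.equal(standres.byhand, rstandard(fm))  # should be TRUE
```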
You can SAVE the vector of unusual points to aid in their temporary removal from the dataset. The example below saves the points with high values of log(nox); you could adjust it to instead save the points with standardized or studentized residuals greater than 2 in absolute value.

```{r}
highnox = which(log(boston$nox) > -0.2)  # IDs points with log(nox) > -0.2
boston2 = boston[-highnox, ]             # dataset with those rows removed
```

(b) Report your vector of outliers. How large is this vector? Do you think that makes sense relative to the size of the data? Remove any points you consider outliers from your dataset, and refit your model.

(c) Justify why those points are considered outliers in your analysis. Make reference to the actual data, not just a blind mathematical equation.

(d) Compare the resulting model to the original model. Now, use the **anova** command on the fitted model (anova(fm)) to view the ANOVA table. Verify that you can compute the $R^2$ from the ANOVA table.

```{r}
fm2 = lm(log(nox) ~ sqrt(distance), data=boston2)
anova(fm2)
```

(e) What $R^2$ did you find? Interpret that value with a sentence about how much of the variation is explained.

(f) Are you satisfied with your $R^2$ value? In other words, do you think this is a good model?

(g) Use the predict function to generate a confidence interval for the mean response of nox when distance is 4.5, and interpret your results. See the class example for suggested code, your R Manual, or the sketch after this list.

(h) Finally, use the predict function to generate a prediction interval for an individual response of nox when distance is 4.5, and interpret your results.

(i) Check the conditions on the final model so that you can be happy with using distance to predict nox.
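As a starting point for (g) and (h), here is a minimal sketch, assuming the log(nox) ~ sqrt(distance) example model fm2 from above; adapt it to your own transformations:

```{r, eval=FALSE}
# predict() applies the formula's transformations, so supply distance on its original scale
new = data.frame(distance = 4.5)
predict(fm2, newdata=new, interval="confidence")  # (g) CI for the mean response
predict(fm2, newdata=new, interval="prediction")  # (h) PI for an individual response
# both intervals are on the log(nox) scale; apply exp() to the endpoints
# to express them in nox units
```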