--- output: pdf_document --- ### MATH 401 - Lab 2 - Spring 2016 #### SLR Recap + Introduction to RStudio/RMarkdown > Instructions: This lab exercise is a warm-up designed to give you some familiarity working in R and RStudio. If have been seen R before, these problems should not be very hard. If you have not seen R before, these problems will give you some practice and help to develop your skills. The content will be a brief review of introductory statistics, so it should be familiar to all. > This assignment, like your midsemester and semester-long project, is prepared in [R Markdown](http://www.rstudio.com/ide/docs/authoring/using_markdown), a markup language that allows you to easily format your code for readability and reproducibility. For more information about formatting in R Markdown, click on the **?** button in the menu bar of your Markdown document in [RStudio](http://www.rstudio.org). The benefit of using R Markdown is that it enables you to seamlessly combine R commands, the output from those commands, and written analysis into one document. You should save the file after you open it somewhere you like and keep saving your work as you go! ### Introductory Example - Sleep #### Introduction NOTE: R is a programming language - there are often MULTIPLE ways to accomplish a given task. You may see different code here than in the R Manual accompanying your textbook. You can choose to work with EITHER set of commands. The R Manual is a good reference, but don't forget the slides and your class notes too! We will almost always begin by loading the **mosaic** package. This is a useful add-on for R. Make sure to include the ```message=FALSE``` option in the chunk that loads **mosaic**. This will suppress unwanted messages and keep your Markdown document neat and clean. If you've never used mosaic before, you will need to install the mosaic package. Go to the lower right of RStudio, then Packages, then Install, then type mosaic and click Install. Alernatively, you can type **install.packages("mosaic")** in R. Now we're ready to start our lab. ```{r, message=FALSE} require(mosaic) options(digits=3) trellis.par.set(theme=col.mosaic()) # get a better color scheme for lattice ``` Note that R is case-sensitive. That means capitalization and spelling count! The # in a line of code is a commenting convention. R ignores anything after the # as a comment, but it will still show up in the line of code. Let's start by loading some data. We can either load data that already exists in R using the load function or read in data sets that are provided elsewhere. To get us started, here is a data set on sleep deprivation. > Lingering effects of sleep deprivation? Researchers have established that sleep deprivation has a harmful effect on visual learning. But do these effects linger for several days, or can a person "make up" for sleep deprivation by getting a full night's sleep in subsequent nights? A recent study by Stickgold and colleagues investigated this question by randomly assigning 21 subjects (volunteers between the ages of 18 and 25) to one of two groups: one group was deprived of sleep on the night following training and pre-testing with a visual discrimination task, and the other group was permitted unrestricted sleep on that first night. Both groups were then allowed as much sleep as they wanted on the following two nights. All subjects were then re-tested on the third day. Subjects' performance on the test was recorded as the minimum time (in milliseconds) between stimuli appearing on a computer screen for which they could accurately report what they had seen on the screen. The sorted data presented here are the improvements in those reporting times between the pre-test and the post-test (a negative value indicates a decrease in performance). Sleep deprivation (n=11): -14.7, -10.7, -10.7, 2.2, 2.4, 4.5, 7.2, 9.6, 10.0, 21.3, 21.8 Unrestricted sleep (n=10): -7.0, 11.6, 12.1, 12.6, 14.5, 18.6, 25.2, 30.5, 34.5, 45.6 Does it appear that subjects who get unrestricted sleep on the first night tended to have higher improvement scores than subjects who were sleep deprived on the first night? Let's address this question, using the CHOOSE-FIT-ASSESS-USE framework and learn about R as we go! We don't want to enter the data in manually so it has been loaded online and it's ready to go. The code below will run if you click Run from your menu, or put your cursor in it and hit Ctrl+Alt+C or hit Run Current Chunk from the Chunks menu. Remember to run the chunks as you work, or no output will appear! ```{r} sleep = read.csv("http://personal.denison.edu/~whiteda/files/Teaching/sleep.csv") ``` **sleep** is now in our workspaces as a *data.frame*. This is the basic data structure that we will work with most often. There are several options for figuring out what is the **sleep** data frame. The easiest way is to click on the name of it in the Workspace tab. > Note that clicking on the name of the data set in the Workspace tab will allow you to view it like a spreadsheet. This is fine for small data sets, but will not work for large ones! If the data set was provided with R, then help(datasetname) will open up documentation in the Help window. For data provided within R, you may need to run the View(Dataset) command before you can see it. ```{r} # Some useful R commands - Remember to RUN this # What type of object is it? class(sleep) # What are its dimensions? dim(sleep) # How many rows and columns? nrow(sleep) ncol(sleep) # What variables are present? names(sleep) # A brief summary of each variable summary(sleep) # A sample of the first few rows head(sleep) ``` Note that when we Knit HTML these lines of R code are being executed one-by-one, and the results are appearing with ## before each line. #### Univariate Analysis We will use the *formula* interface whenever possible. This means using the ~ (tilde) operator. To get a basic summary of a variable, use *favstats()*. ```{r} favstats(~discrim, data=sleep) ``` To specifically retrieve the mean or standard deviation, use the respective commands. ```{r} mean(~discrim, data=sleep) sd(~discrim, data=sleep) ``` Note that you can store these values to new variables if you want to use them later on. When you execute the code chunk below, you will see "mu" pop up as a new value in the workspace/environment window. ```{r} mu = mean(~discrim, data=sleep) mu ``` For a graphical description of the distribution of a single variable, we can create a histogram or a density plot. Once you generate plots, you can cycle through them using the arrow keys in the Plots window. ```{r,fig.width=8,fig.height=4} histogram(~discrim, data=sleep) densityplot(~discrim, data=sleep) ``` #### Bivariate Analysis For relating a quantitative variable to a categorical variable, we can draw a boxplot. ```{r,fig.width=8,fig.height=4} bwplot(discrim ~ deprive, data = sleep) ``` A table can also be useful here. ```{r} favstats(discrim ~ deprive, data=sleep) ``` Finally, for counting the frequency of categorical variables, you can use *tally()*. ```{r} tally(~ deprive, data=sleep) ``` If you want a contingency table, it would be ~ firstvariable + secondvariable in the command above. We'll come back to bivariate analysis for scatterplots and regression in the next example. For now, consider what you've seen with the analysis so far in relation to the question: "Does it appear that subjects who get unrestricted sleep on the first night tended to have higher improvement scores than subjects who were sleep deprived on the first night?" Is there an appropriate model we can fit? > ANSWER: PUT YOUR ANSWER HERE > Based off the class and book examples, the answer should be yes! And it's the four parameter model discussed on page 9 that will lead to a two-independent samples t-test. The response is simply a constant (mean) that differs between the two groups plus errors that are normally distributed with possibly different SDs for the two groups. Be sure to return and ASSESS your conditions with appropriate plots before using the model via the t-test. Try the plots below along with your boxplot. ```{r,fig.width=8,fig.height=4} densityplot(~ discrim, groups=deprive, auto.key=TRUE, data=sleep) qqnorm(sleep$discrim[sleep$deprive=="Yes"]) qqline(sleep$discrim[sleep$deprive=="Yes"]) qqnorm(sleep$discrim[sleep$deprive=="No"]) qqline(sleep$discrim[sleep$deprive=="No"]) ``` > The following commands are equivalent to the last four lines above: with(sleep, qqnorm(discrim[deprive=="Yes"])) with(sleep, qqline(discrim[deprive=="Yes"])) with(sleep, qqnorm(discrim[deprive=="No"])) with(sleep, qqline(discrim[deprive=="No"])) The advantage of using the *with()* function is that you don't have to type the name of the *data.frame* repeatedly. Do you have any concerns? > ANSWER: PUT YOUR ANSWER HERE Assuming no major concerns, do you want to run the t-test with or without equal variances assumed? You can adjust the code below as desired (simply change TRUE to FALSE if you don't want equal variances assumed. The default is FALSE.) ```{r} t.test(discrim ~ deprive, var.equal=TRUE, data=sleep) ``` What conclusion do you reach? > ANSWER: PUT YOUR ANSWER HERE ### Regression Example - Galton Data The **Galton** data set is part of the **mosaicData** package, and contains data on the height of many British adults, as well as the height of their mothers and fathers. To load it, we use the *data()* command. As with the **mosaic** package, you may need to Install mosaicData in the Packages pane of RStudio. ```{r} require(mosaicData) require(stats) require(graphics) data(Galton) #help(Galton)#brings up the help menu on the Galton data set #View(Galton)#allows you to view the data set instead of having to click ``` Suppose we want to understand the relationship between the height of an adult and their mother. That gives us two quantitative variables, and we might want to see if the adult height can be predicted by the mother's height. We will follow the modeling framework from our textbook. #### Bivariate Analysis > CHOOSE: For assessing the relationship between two variables, we can draw a scatterplot. ```{r,fig.width=8,fig.height=4} xyplot(height ~ mother, data=Galton) ``` Note that there are many graphical options that we can add to this plot. Here is the same basic plot, but jazzed up a bit. You should experiment with graphical options! If you don't know what is available, hit the tab key inside the function call. Or type ```help(xyplot)```. ```{r,fig.width=8,fig.height=4} xyplot(height ~ mother, groups=sex, data=Galton, type=c("p", "r", "smooth"), auto.key=list(columns = 2), pch = 19, alpha = 0.3, cex=1.5, lwd = 3, main="Relationship between Adult Height and Mother's Height", ylab="Height of Adult (inches)", xlab="Height of Adult's Mother (inches)") ``` Note that in this case we have conditioned on a third variable **sex**, by adding the *groups* argument. For a numerical assessment of the linear relationship between two quantitative variables, we can compute their correlation coefficient. ```{r} with(Galton, cor(height, mother)) ``` We can also do this by referring to the variables directly. ```{r} cor(Galton$height, Galton$mother) ``` Again, the advantage of using the *with()* function is that you don't have to type the name of the *data.frame* repeatedly. Based on your work so far, a simple linear regression model is reasonable, but you may suspect it may not have a great fit. Let's find out! > FIT: Now let's fit the model. This is probably the simplest part. ```{r} fm=lm(height~mother,data=Galton) summary(fm)#need the summary ``` What is the equation of the fitted regression line? > ANSWER: PUT YOUR ANSWER HERE How would you interpret the slope? > ANSWER: PUT YOUR ANSWER HERE Before we can USE the model, we need to ASSESS it. > ASSESS: Normality of errors condition: ```{r} #generates a histogram with fitted density curve histogram(~residuals(fm),xlab="Residuals",density=TRUE,type="density") ``` ```{r} #generates the associated qqplot qqnorm(fm$residuals) qqline(fm$residuals) ``` Constant Variance (and linearity) condition: ```{r} #Residuals vs. fitted plot xyplot(residuals(fm)~fitted(fm),type=c("p","r"),xlab="Predicted values",ylab="Residuals") ``` What do you think about the conditions necessary for SLR to be appropriate? > Remember: the required conditions are linearity of relationship, zero mean for errors (which we have because we are using least squares), constant variance of errors, and independence of errors. > ANSWER: PUT YOUR ANSWER HERE What do you think about the conditions necessary for inference using the regression? > Remember: the additional conditions are randomness of the data, and normality of the error terms. > ANSWER: PUT YOUR ANSWER HERE You will probably also want to look at a scatterplot with the regression line fit to the data. There are multiple ways to do this: ```{r} xyplot(height ~ mother, type = c("p", "r"), xlab = "Mother's Height", ylab = "Adult Height", data = Galton) ``` OR ```{r} with(Galton,plot(mother, height)) abline(fm) ``` Assuming no major issue with the conditions, USE the model to determine if adult height can be predicted by mother's height. > USE: You should already have the output needed for this. > ANSWER: PUT YOUR ANSWER HERE Review: Do you think the predictions will be very accurate? There should be TWO values in the output that you are using as a gauge of this. > ANSWER: PUT YOUR ANSWER HERE ### This Week's Lab - KidsFeet These problems use the **KidsFeet** data set, which is also part of **mosaic**. ```{r} data(KidsFeet) help(KidsFeet) ``` A correct answer to questions a-e consists of a(n) R command(s) that answers the question. For parts f and g, you will need to perform an appropriate analysis. Follow the CHOOSE-FIT-ASSESS-USE framework for part g. Use the code above to help you out, and ask me if you run into any problems! #### Problems a. How many children are present in the data set? b. What is the mean foot length? c. For how many children is their left foot bigger than their right? d. What percentage of the children were left-handed? e. Which month contained the most births? f. Were boys' feet longer, on average, than girls'? If so, by how much? Is the result statistically significant? g. Draw a scatterplot of foot length as a function of foot width. Put the least squares regression line on the plot and provide meaningful axis labels. Is foot width a significant predictor of foot length? h. Ask and answer an interesting question using this dataset and what we've learned so far in the course. ### Five Minute Warning: SAVE YOUR WORK - How this works As you work through the lab, you should save the text you type in/commands, etc. That's as easy as clicking the usual save disc. If you want to see your final printed document, hit the "Knit HTML" button in your menu. It will take a while to compile, but you'll end up with an html document that you could then print, if you desired. You'll need to do that for homework. Proofread your document before printing it! Also, be sure all the code you want to execute is RUN before knitting, etc. ### What to hand in For Friday, February 5, please submit a document titled **username_lab2_class.Rmd** that you create from this document and that fills in all the ANSWER blanks throughout the document. In addition, please submit the usual 2-3 page lab report to analyze the **KidsFeet** dataset as an Rmd file named **username_lab2.Rmd**. For both files, you should upload **both the Rmd file and the PDF** you used it to Knit, to the Shared Drive. You may work with your partner on all aspects of the project until the time when you write your own Lab Report. Use Google to try to find some background information about the dataset, so that you can discuss bias issues. Include at least one graph, interpretations of any graphs in your report, answers to questions (a)-(h), and a brief explanation of statistical techniques used. As usual, put yourself in the role of a statistical consultant, and assume your audience has only taken a first course in statistics.