This lesson is still being designed and assembled (Pre-Alpha version)

Reading in Data

Overview

Teaching: 40 min
Exercises: 5 min
Questions
  • How do I load in a data set?

  • Once I’ve loaded in my data, how do I look at it?

Objectives
  • Read a CSV file into a data.frame

  • Read data files from other formats: SPSS, SAS, Excel

  • Explore a newly loaded data.frame: number of rows and columns, and a preview before merging

  • Understand the variable types of imported data

  • Install R packages

Reading in Data

R contains a number of built-in functions that can read in plain-text formats like CSV. To read in formats associated with particular software packages, we’ll need to use certain R packages (more on that in a bit).
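As a rough sketch of those built-in readers, the example below reads the same two-row table with read.csv(), read.delim(), and read.table(). The inline text and the names csv_text, tsv_text, and from_* are made up for illustration; with a real file you would pass a file path instead of text =.

```r
# Tiny inline tables (made up for illustration) standing in for files on disk.
csv_text <- "ID,Age\n101,40\n102,35"
tsv_text <- "ID\tAge\n101\t40\n102\t35"

# read.csv() expects comma-separated values with a header row.
from_csv <- read.csv(text = csv_text)

# read.delim() is the same idea for tab-separated values.
from_tsv <- read.delim(text = tsv_text)

# read.table() is the general function; here we spell out the settings ourselves.
from_table <- read.table(text = csv_text, header = TRUE, sep = ",")

# All three calls produce a data.frame with the same columns.
colnames(from_csv)
colnames(from_tsv)
colnames(from_table)
```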

Let’s say that we have a CSV file called health.csv. If we open the CSV file in a text editor (or in software like Excel), we see that it contains:

ID,First,Last,Age,Gender,Blood Type,RBC Count,SBP,DPB
101,Dan,Chung,40,M,AB,7.35,121,80
102,Zara,Farooq,35,F,B,6.12,139,82
103,Sarah,Long,22,F,AB,6.89,125,81
104,Kurt,Woods,18,M,AB,5.71,126,82
105,Dan,Davis,45,M,,6.15,111,65
106,Matt,Van Jones,66,,A,4.7,,
107,John,Johnson,22,M,A,4.44,120,75
108,Veronica,Johnson,20,M,A,5.91,130,86
109,Cathy,Mallinson,32,F,,15.65,125,79
110,Mohammed,Ahmad,30,M,B,0.68,116,75

To read in the table, we can use the read.csv() function. Let’s see what we get when we use it:

#TODO: Change the data, right now it’s from a book

read.csv('data/health.csv')
    ID    First      Last Age Gender Blood.Type RBC.Count SBP DPB
1  101      Dan     Chung  40      M         AB      7.35 121  80
2  102     Zara    Farooq  35      F          B      6.12 139  82
3  103    Sarah      Long  22      F         AB      6.89 125  81
4  104     Kurt     Woods  18      M         AB      5.71 126  82
5  105      Dan     Davis  45      M                 6.15 111  65
6  106     Matt Van Jones  66                 A      4.70  NA  NA
7  107     John   Johnson  22      M          A      4.44 120  75
8  108 Veronica   Johnson  20      M          A      5.91 130  86
9  109    Cathy Mallinson  32      F                15.65 125  79
10 110 Mohammed     Ahmad  30      M          B      0.68 116  75

Where did it find the data/health.csv file? #TODO
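One way to approach that question: a relative path such as data/health.csv is resolved against R's current working directory. The sketch below assumes the lesson's layout (a data folder inside the working directory); csv_path is a name introduced here for illustration.

```r
# The directory that relative paths are resolved against.
getwd()

# Build the relative path in a platform-independent way.
# Assumption: health.csv lives in a 'data' folder inside the working directory.
csv_path <- file.path("data", "health.csv")
csv_path

# TRUE if R can see the file from the current working directory, FALSE otherwise.
file.exists(csv_path)

# The full path R would use; mustWork = FALSE avoids an error if the file is absent.
normalizePath(csv_path, mustWork = FALSE)
```

If file.exists() returns FALSE, the working directory is probably not the project folder; in RStudio, opening the project or using Session > Set Working Directory changes it.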

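Let’s notice a few things about what we got out:

  • R added a column of row numbers (1 to 10) on the left; it is not part of the file.

  • The spaces in the column names Blood Type and RBC Count were replaced with periods (Blood.Type, RBC.Count).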

Next we’re going to want to perform some analysis using this data, but to do that, we need to capture it in an R object somehow.

Let’s try creating a variable called health and set it to whatever is returned by read.csv():

health <- read.csv('data/health.csv')

Unlike last time, running that line of code didn’t print out the data table. But in our Environment pane, we now see a new variable called health.
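Once the data is stored in a variable, we can also inspect it from the console rather than the Environment pane. The sketch below inlines the first three rows of health.csv so that it runs on its own; health_small is a name chosen here to avoid overwriting health.

```r
# The first rows of health.csv from above, inlined so the example runs on its own.
csv_text <- "ID,First,Last,Age,Gender,Blood Type,RBC Count,SBP,DPB
101,Dan,Chung,40,M,AB,7.35,121,80
102,Zara,Farooq,35,F,B,6.12,139,82
103,Sarah,Long,22,F,AB,6.89,125,81"

health_small <- read.csv(text = csv_text)

dim(health_small)       # number of rows, then number of columns
nrow(health_small)      # rows only
ncol(health_small)      # columns only
colnames(health_small)  # column names as R stored them
head(health_small, 2)   # first two rows
str(health_small)       # compact summary: one line per column, with its type
```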

#TODO: other parameters in read.csv
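As a starting point for that topic, here is a minimal sketch of three read.csv() parameters that exist in base R: na.strings, check.names, and stringsAsFactors. The inline data is a made-up fragment modeled on health.csv, and default_read and custom_read are illustrative names.

```r
# A made-up fragment modeled on health.csv, with some blank fields.
csv_text <- "ID,Gender,Blood Type,SBP
105,M,,111
106,,A,
107,M,A,120"

# Mostly default settings; stringsAsFactors is set explicitly because its
# default changed from TRUE to FALSE in R 4.0.0.
default_read <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# na.strings: also treat empty fields as missing (NA), even in text columns.
# check.names = FALSE: keep column names exactly as written in the file.
custom_read <- read.csv(
  text = csv_text,
  na.strings = c("", "NA"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

colnames(default_read)    # the space in "Blood Type" becomes a period
colnames(custom_read)     # the space is kept

default_read$Gender       # the blank is an empty string
custom_read$Gender        # the blank is NA
```

Keeping spaces in column names means the column has to be referred to with backticks, as in custom_read$`Blood Type`, which is one reason the default replaces them.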

Exercise

  1. Click on health - see what happens. Put questions here

  2. Click on the health twisty. What did you learn here?

  3. Why do you think the blank values for Gender and Blood Type came through as blanks, but the blanks in SBP and DBP came through as NA?

  4. How would you find out what class of variable health is? (answer: class())

Solution

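3. Because the columns have different variable types. Gender and Blood Type are read as text, where a blank is kept as an empty string. The blood pressure columns are numeric, where a blank becomes NA.

4. class(health)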
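To check these answers from the console, the sketch below looks at the class of the whole object and of each column. It uses a made-up three-row fragment so that it runs on its own; health_demo and column_classes are illustrative names.

```r
# A made-up fragment with one blank text field and one blank numeric field.
csv_text <- "ID,Gender,SBP
105,M,111
106,,
107,M,120"

# stringsAsFactors is set explicitly so the result is the same in any R version.
health_demo <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# The object as a whole is a data.frame.
class(health_demo)

# Each column has its own class.
column_classes <- sapply(health_demo, class)
column_classes

# In the text column the blank is an empty string ...
health_demo$Gender[2] == ""

# ... while in the numeric column the blank is a missing value.
is.na(health_demo$SBP[2])
```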

Talk about data.frame

Let’s look at a more realistic example of public health data. One widely used source of public data in the U.S. is the National Health and Nutrition Examination Survey (NHANES). Much of the NHANES data is not provided in CSV format, and its files often contain many more variables and observations than our first example.

In this case, we’ll be working with .XPT files, which are in a format used by SAS. If you work in public health, you will almost certainly encounter data files in SAS “xport” format. Unlike CSV, XPT is a binary format, so you can’t simply open a file in a text editor to read it.

R’s base packages don’t include a function to read in XPT files, but we can install a “contributed” package called SASxport.

A word about packages

Packages are collections of R functions, data, and documentation that extend what R can do out of the box.

Installing a package downloads its code onto your computer. But we don’t always want every installed package loaded in every project that we work on.

So, there’s a second step involved to load a package that you’ve installed. More on that in a moment.

To install the SASxport package:

install.packages('SASxport')
Installing package into '/Users/kerchner/Library/R/3.6/library'
(as 'lib' is unspecified)
Error in contrib.url(repos, "source"): trying to use CRAN without setting a mirror
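The error shown above means that R did not know which CRAN mirror to download from, which typically happens when install.packages() runs outside an interactive session. A minimal sketch of one way around it, assuming the cloud.r-project.org mirror is acceptable for you:

```r
# Tell R which CRAN mirror to use for this session.
# Assumption: the general-purpose cloud mirror; any CRAN mirror URL works here.
options(repos = c(CRAN = "https://cloud.r-project.org"))

# Confirm the setting.
getOption("repos")[["CRAN"]]

# Install only when working interactively and only if the package is missing,
# so re-running the script does not download it again.
if (interactive() && !requireNamespace("SASxport", quietly = TRUE)) {
  install.packages("SASxport")
}
```

Inside RStudio a mirror is normally already configured, so a plain install.packages('SASxport') is usually enough there.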

Go to the Packages pane in RStudio. Do you see that SASxport is now in the list of available packages? You’ll also notice that some of the packages are checked, and some, like SASxport, are not checked (yet).

TODO: Click on package name to get help (and what’s the other way to get help for a package?)

You’ll also notice Version numbers for all of the packages #Reproducibility

The next thing we need to do is load the package for use in this project, using the library() function:

library(SASxport)

Notice that SASxport is now checked in the Packages pane.
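The difference between an installed package and a loaded one can also be checked from the console. The sketch below uses the tools package as a stand-in, because it ships with every R installation; the same calls should work with "SASxport" once it is installed. The names pkg, is_installed, and is_attached are illustrative.

```r
# Stand-in package that ships with R; swap in "SASxport" once it is installed.
pkg <- "tools"

# TRUE if the package is installed on this computer.
is_installed <- requireNamespace(pkg, quietly = TRUE)
is_installed

# Load (attach) the package, which is what checking its box in the Packages pane does.
library(pkg, character.only = TRUE)

# search() lists what is currently attached; loaded packages appear as "package:<name>".
is_attached <- paste0("package:", pkg) %in% search()
is_attached

# Once attached, the package's functions can be called directly ...
file_ext("data/DEMO_I.XPT")

# ... or with an explicit package prefix, which also works without library().
tools::file_ext("data/DEMO_I.XPT")
```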

We can open up the documentation page for SASxport in our Help pane by running help(SASxport). We learn there that the function we want to use is called read.xport(). Let’s try it and see what we get.

DEMO_I.XPT contains demographic data for study subjects.

demographics <- read.xport('data/DEMO_I.XPT')

After reading our dataset into the demographics data frame, we can inspect the shape and size of the data with functions such as colnames(df), head(df), and tail(df):

colnames(demographics)
 [1] "SEQN"     "SDDSRVYR" "RIDSTATR" "RIAGENDR" "RIDAGEYR" "RIDAGEMN"
 [7] "RIDRETH1" "RIDRETH3" "RIDEXMON" "RIDEXAGM" "DMQMILIZ" "DMQADFC" 
[13] "DMDBORN4" "DMDCITZN" "DMDYRSUS" "DMDEDUC3" "DMDEDUC2" "DMDMARTL"
[19] "RIDEXPRG" "SIALANG"  "SIAPROXY" "SIAINTRP" "FIALANG"  "FIAPROXY"
[25] "FIAINTRP" "MIALANG"  "MIAPROXY" "MIAINTRP" "AIALANGA" "DMDHHSIZ"
[31] "DMDFMSIZ" "DMDHHSZA" "DMDHHSZB" "DMDHHSZE" "DMDHRGND" "DMDHRAGE"
[37] "DMDHRBR4" "DMDHREDU" "DMDHRMAR" "DMDHSEDU" "WTINT2YR" "WTMEC2YR"
[43] "SDMVPSU"  "SDMVSTRA" "INDHHIN2" "INDFMIN2" "INDFMPIR"
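The other inspection functions work the same way. Since DEMO_I.XPT may not be at hand, the sketch below builds a small stand-in data frame that borrows three of the column names listed above; the values are invented for illustration and are not NHANES data.

```r
# Stand-in for the demographics data frame.
# Column names come from the listing above; the values are invented.
demo_small <- data.frame(
  SEQN     = c(1, 2, 3, 4),
  RIAGENDR = c(1, 2, 2, 1),
  RIDAGEYR = c(62, 53, 78, 56)
)

dim(demo_small)        # rows, then columns
colnames(demo_small)   # column names
head(demo_small, 2)    # first two rows
tail(demo_small, 2)    # last two rows
str(demo_small)        # type and first values of every column
summary(demo_small)    # per-column summary statistics
```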

glu_df <- read.xport('data/GLU_I.XPT')
sleep_df <- read.xport('data/SLQ_I.XPT')
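Before merging data frames like these, it helps to preview which columns they share and how many rows will match. The sketch below is an illustration under assumptions: glu_small and sleep_small are invented stand-ins, the columns glucose and sleep_hours are hypothetical names, and SEQN is used as the shared identifier because it appears in the demographics column list.

```r
# Invented stand-ins for two data frames that share an identifier column.
glu_small   <- data.frame(SEQN = c(1, 2, 3), glucose = c(95, 102, 88))
sleep_small <- data.frame(SEQN = c(2, 3, 4), sleep_hours = c(7, 6.5, 8))

# Which column names appear in both data frames?
shared_cols <- intersect(colnames(glu_small), colnames(sleep_small))
shared_cols

# How many identifiers in the first data frame also appear in the second?
n_matching <- sum(glu_small$SEQN %in% sleep_small$SEQN)
n_matching

# merge() keeps only matching rows by default ...
merged <- merge(glu_small, sleep_small, by = "SEQN")
nrow(merged)

# ... while all = TRUE keeps every row and fills the gaps with NA.
merged_all <- merge(glu_small, sleep_small, by = "SEQN", all = TRUE)
nrow(merged_all)
```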

A word about reproducibility

Package versions - maybe put this as an open discussion question in the Exercises? With some links to ways to solve reproducibility issues?

Platform-independent? Maybe put this in the intro (R vs. software that only works on Windows, for example)

Looking at variable types after reading in - Factors etc.

And how to use the stringsAsFactors parameter
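For read.csv(), the parameter that controls whether text columns become factors is stringsAsFactors. Its default was TRUE before R 4.0.0 and is FALSE from 4.0.0 onward, so setting it explicitly keeps results the same across versions. A minimal sketch, using a made-up fragment of health.csv; as_text and as_factor are illustrative names.

```r
# A made-up fragment of health.csv with two text columns.
csv_text <- "ID,Gender,Blood Type
101,M,AB
102,F,B
103,F,AB"

# Text columns stay as character vectors.
as_text <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Text columns are converted to factors (categorical variables).
as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_text$Gender)
class(as_factor$Gender)

# A factor stores the set of distinct values as its levels.
levels(as_factor$Blood.Type)

# str() shows the difference for every column at once.
str(as_text)
str(as_factor)
```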

Key Points

  • Learn to use read.* and related functions

  • Learn to use str()

  • Use colnames() to see a list of column names