Reading in Data
Overview
Teaching: 40 min
Exercises: 5 minQuestions
How do I load in a data set?
Once I’ve loaded in my data, how do I look at it?
Objectives
Read a CSV file into a
data.frame
Read data files from other format: SPSS, SAS, Excel
Explore a newly loaded data.frame – number of rows, columns… preview before merging
Understand the variable types of imported data
Install R packages
Reading in Data
- This occurs in the analysis phase, after you’ve collected data (and/or using data from another source)
- Data should be in some sort of table format. CSV, Excel, SAS, SPSS, Stata, etc.
R contains a number of built-in functions that can read in plain-text formats like CSV. To read in formats associated with particular software packages, we’ll need to use certain R packages (more on that in a bit).
Let’s say that we have a CSV file called health.csv
. If we open the CSV file to look at it in a text reader (or in software like Excel), we see that it:
- has variable names in the first row (it doesn’t have to - more on that soon #TODO)
- contains one observation per row
ID,First ,Last,Age,Gender,Blood Type,RBC Count,SBP,DPB
101,Dan,Chung,40,M,AB,7.35,121,80
102,Zara,Farooq,35,F,B,6.12,139,82
103,Sarah,Long,22,F,AB,6.89,125,81
104,Kurt,Woods,18,M,AB,5.71,126,82
105,Dan,Davis,45,M,,6.15,111,65
106,Matt,Van Jones,66,,A,4.7,,
107,John,Johnson,22,M,A,4.44,120,75
108,Veronica,Johnson,20,M,A,5.91,130,86
109,Cathy,Mallinson,32,F,,15.65,125,79
110,Mohammed,Ahmad,30,M,B,0.68,116,75
To read in the table, we can use the read.csv()
function. Let’s see what what we get when we use it:
#TODO: Change the data, right now it’s from a book
read.csv('data/health.csv')
ID First Last Age Gender Blood.Type RBC.Count SBP DPB
1 101 Dan Chung 40 M AB 7.35 121 80
2 102 Zara Farooq 35 F B 6.12 139 82
3 103 Sarah Long 22 F AB 6.89 125 81
4 104 Kurt Woods 18 M AB 5.71 126 82
5 105 Dan Davis 45 M 6.15 111 65
6 106 Matt Van Jones 66 A 4.70 NA NA
7 107 John Johnson 22 M A 4.44 120 75
8 108 Veronica Johnson 20 M A 5.91 130 86
9 109 Cathy Mallinson 32 F 15.65 125 79
10 110 Mohammed Ahmad 30 M B 0.68 116 75
Where did it find the data/health.csv
file? #TODO
Let’s notice a few things about what we got out:
- There seems to be a column to the left with numbering starting at 1
- other observations?
Next we’re going to want to perform some analysis using this data, but to do that, we need to capture it in an R object somehow.
Let’s try creating a variable called health
and set it to whatever is returned by read.csv()
:
health <- read.csv('data/health.csv')
Running that line of code didn’t print out the data table, like last time. But in our Environment pane, we now see a new variable called health
.
#TODO: other parameters in read.csv
Exercise
Click on
health
- see what happens. Put questions hereClick on the
health
twisty. What did you learn here?Why do you think the blank values for Gender and Blood Type came through as blanks, but the blanks in SBP and DBP came through as
NA
?How would you find out what class of variable
health
is? (answer: class())Solution
3. Because of different variable types
## 4. `class(health)`
Talk about data.frame
Let’s look at a more realistic example of Public Health data. One of the widely-used sources of public data in the U.S. is the National Health and Nutrition Examination Survey, NHANES. Much of the data in NHANES is not provided in CSV format, and often contains many more variables and observations than in our first example.
In this case, we’ll be working with .XPT files, which are in a format used by SAS. If you work in public health, you’ll most definitely be encountering data files in SAS “xport” format. While XPT is a plain-text format, it’s not terribly easy to read.
R’s base packages don’t include a function to read in XPT files, but we can install a “contributed” package called SASxport
.
A word about pacakges
Packages are sets of functions that # FIND A GOOD DEFINITION
Installing a package downloads the code to run the package onto your computer. But we don’t always want every package available in every project that we work on.
So, there’s a second step involved to load a package that you’ve installed. More on that in a moment.
To install the SASxport
package:
install.packages('SASxport')
Installing package into '/Users/kerchner/Library/R/3.6/library'
(as 'lib' is unspecified)
Error in contrib.url(repos, "source"): trying to use CRAN without setting a mirror
Go to the Packages pane in RStudio. Do you see that SASxport
is now in the list of available packages? You’ll also notice that some of the packages are checked, and some, like SASxport, are not checked (yet).
TODO: Click on package name to get help (and what’s the other way to get help for a package?)
You’ll also notice Version numbers for all of the packages #Reproducibility
The next thing we need to do is load in the library for this particular project, using the library()
function:
library(SASxport)
Notice that SASxport
is now checked in the Packages pane.
We can open up the documentation page for SASxport
in our Help pane by running help(SASxport)
. We learn there that the function we want to use is called read.xport()
. let’s try it and see what we get.
DEMO_I.XPT
contains demographic data for study subjects.
demographics <- read.xport('data/DEMO_I.XPT')
After reading in our dataset into the demographics
dataframe, we can inspect the shape and size of the data, such as: colname(df), head(df), tail(df)
colnames(demographics)
[1] "SEQN" "SDDSRVYR" "RIDSTATR" "RIAGENDR" "RIDAGEYR" "RIDAGEMN"
[7] "RIDRETH1" "RIDRETH3" "RIDEXMON" "RIDEXAGM" "DMQMILIZ" "DMQADFC"
[13] "DMDBORN4" "DMDCITZN" "DMDYRSUS" "DMDEDUC3" "DMDEDUC2" "DMDMARTL"
[19] "RIDEXPRG" "SIALANG" "SIAPROXY" "SIAINTRP" "FIALANG" "FIAPROXY"
[25] "FIAINTRP" "MIALANG" "MIAPROXY" "MIAINTRP" "AIALANGA" "DMDHHSIZ"
[31] "DMDFMSIZ" "DMDHHSZA" "DMDHHSZB" "DMDHHSZE" "DMDHRGND" "DMDHRAGE"
[37] "DMDHRBR4" "DMDHREDU" "DMDHRMAR" "DMDHSEDU" "WTINT2YR" "WTMEC2YR"
[43] "SDMVPSU" "SDMVSTRA" "INDHHIN2" "INDFMIN2" "INDFMPIR"
glu_df <- read.xport(‘data/GLU_I.XPT’) sleep_df <- read.xport(‘data/SLQ_I.XPT’)
A word about reproducibility
Package versions - maybe put this as an open discussion question in the Exercises? With some links to ways to solve reproducibility issues?
Platform-independent? Maybe put this in the intro (R vs. software that only works on Windows, for example)
Looking at variable types after reading in - Factors etc.
And how to use the factorsAsStrings
parameter
Key Points
Learn to use read.* and related functions
Learn to use str()
Use colnames() to see a list of column names