Reading in Data

Overview

Teaching: 40 min
Exercises: 5 min

Questions

How do I load in a data set?

Once I’ve loaded in my data, how do I look at it?

Objectives

Read a CSV file into a data.frame

Read data files from other formats: SPSS, SAS, Excel

Explore a newly loaded data.frame – number of rows, columns… preview before merging

Understand the variable types of imported data

Install R packages

Reading in Data

Reading in data happens at the start of the analysis phase, after you’ve collected your data or obtained data from another source.
The data should be in some sort of tabular format, such as CSV, Excel, SAS, SPSS, or Stata.

R contains a number of built-in functions that can read in plain-text formats like CSV. To read in formats associated with particular software packages, we’ll need to use certain R packages (more on that in a bit).

Let’s say that we have a CSV file called health.csv. If we open the CSV file to look at it in a text reader (or in software like Excel), we see that it:

has variable names in the first row (it doesn’t have to - more on that soon #TODO)
contains one observation per row

ID,First ,Last,Age,Gender,Blood Type,RBC Count,SBP,DPB
101,Dan,Chung,40,M,AB,7.35,121,80
102,Zara,Farooq,35,F,B,6.12,139,82
103,Sarah,Long,22,F,AB,6.89,125,81
104,Kurt,Woods,18,M,AB,5.71,126,82
105,Dan,Davis,45,M,,6.15,111,65
106,Matt,Van Jones,66,,A,4.7,,
107,John,Johnson,22,M,A,4.44,120,75
108,Veronica,Johnson,20,M,A,5.91,130,86
109,Cathy,Mallinson,32,F,,15.65,125,79
110,Mohammed,Ahmad,30,M,B,0.68,116,75

To read in the table, we can use the read.csv() function. Let’s see what we get when we use it:

#TODO: Change the data, right now it’s from a book

read.csv('data/health.csv')

    ID    First      Last Age Gender Blood.Type RBC.Count SBP DPB
1  101      Dan     Chung  40      M         AB      7.35 121  80
2  102     Zara    Farooq  35      F          B      6.12 139  82
3  103    Sarah      Long  22      F         AB      6.89 125  81
4  104     Kurt     Woods  18      M         AB      5.71 126  82
5  105      Dan     Davis  45      M                 6.15 111  65
6  106     Matt Van Jones  66                 A      4.70  NA  NA
7  107     John   Johnson  22      M          A      4.44 120  75
8  108 Veronica   Johnson  20      M          A      5.91 130  86
9  109    Cathy Mallinson  32      F                15.65 125  79
10 110 Mohammed     Ahmad  30      M          B      0.68 116  75

Where did it find the data/health.csv file? #TODO

Let’s notice a few things about what we got out:

There seems to be a column to the left with numbering starting at 1
other observations?

Next we’re going to want to perform some analysis using this data, but to do that, we need to capture it in an R object somehow.

Let’s try creating a variable called health and set it to whatever is returned by read.csv():

health <- read.csv('data/health.csv')

Unlike last time, running that line of code didn’t print out the data table. But in our Environment pane, we now see a new variable called health.

#TODO: other parameters in read.csv
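A few other read.csv() parameters come up often enough to be worth a quick look. The sketch below is only an illustration: the inline CSV text is made-up data standing in for a real file, and the particular parameter values are one reasonable choice, not a recommendation for every data set.

```r
# A small inline CSV standing in for a real file (hypothetical data).
csv_text <- "ID,Name,Blood Type,SBP
101,Dan,AB,121
102,Zara,,139
103,Matt,A,"

# text= lets read.csv() parse a string instead of a file path.
demo <- read.csv(
  text = csv_text,
  header = TRUE,             # first row holds the variable names (the default)
  na.strings = c("NA", ""),  # treat empty fields as missing in every column
  stringsAsFactors = FALSE,  # keep text columns as character vectors
  check.names = FALSE        # keep "Blood Type" instead of "Blood.Type"
)

print(demo)
colnames(demo)
is.na(demo$`Blood Type`)     # which rows have a missing blood type
```

With check.names = FALSE, column names that contain spaces have to be wrapped in backticks when used with $, which is why the default behaviour of converting them to names like Blood.Type is often more convenient.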

Exercise

Click on health - see what happens. Put questions here

Click on the health twisty. What did you learn here?

Why do you think the blank values for Gender and Blood Type came through as blanks, but the blanks in SBP and DBP came through as NA?

How would you find out what class of variable health is? (answer: class())
Solution

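3. Because the columns have different variable types. Gender and Blood Type are read as text, and an empty field is a valid text value (the empty string). The blood pressure columns are numeric, and an empty field cannot be a number, so read.csv() records it as NA.
4. `class(health)`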
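The answers to questions 3 and 4 can be checked on a tiny example. This is a minimal sketch using made-up inline data (health_demo is a hypothetical name, not the lesson's health object), so it runs without the data/health.csv file.

```r
# Three rows of hypothetical data; the middle row has a blank text field
# and a blank numeric field.
csv_text <- "ID,Gender,SBP
105,M,111
106,,
107,M,120"

health_demo <- read.csv(text = csv_text, stringsAsFactors = FALSE)

class(health_demo)            # class of the whole object
sapply(health_demo, class)    # class of each column

health_demo$Gender[2] == ""   # blank text field is kept as an empty string
is.na(health_demo$SBP[2])     # blank numeric field becomes NA
```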

Talk about data.frame

Let’s look at a more realistic example of Public Health data. One of the widely-used sources of public data in the U.S. is the National Health and Nutrition Examination Survey, NHANES. Much of the data in NHANES is not provided in CSV format, and often contains many more variables and observations than in our first example.

In this case, we’ll be working with .XPT files, which are in a transport format used by SAS. If you work in public health, you will very likely encounter data files in SAS “xport” format. Unlike CSV, XPT is a binary format, so you can’t easily read it by opening the file in a text editor.

R’s base packages don’t include a function to read in XPT files, but we can install a “contributed” package called SASxport.

A word about packages

Packages are collections of R functions, data, and documentation that extend what R can do out of the box.

Installing a package downloads the code to run the package onto your computer. But we don’t always want every package available in every project that we work on.

So, there’s a second step involved to load a package that you’ve installed. More on that in a moment.

To install the SASxport package:

install.packages('SASxport')

Installing package into '/Users/kerchner/Library/R/3.6/library'
(as 'lib' is unspecified)

Error in contrib.url(repos, "source"): trying to use CRAN without setting a mirror
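The "trying to use CRAN without setting a mirror" error in the output above means R was not told which CRAN mirror to download from. This usually happens when code runs non-interactively. One way around it, sketched below, is to set a mirror for the session before installing. The mirror URL is just one common choice, and install_if_missing() is a hypothetical helper written for this example, not part of R.

```r
# One commonly used mirror (an assumption; any CRAN mirror would do).
cran_mirror <- "https://cloud.r-project.org"

# Tell R where install.packages() should look for the rest of this session.
options(repos = c(CRAN = cran_mirror))

# Hypothetical helper: install a package only when it is not already available.
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report (invisibly) whether the package can now be loaded.
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with R, so this call returns without downloading anything.
install_if_missing("stats")
```

For this lesson the call would be install_if_missing("SASxport"), assuming the package is available from the chosen repository.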

Go to the Packages pane in RStudio. Do you see that SASxport is now in the list of available packages? You’ll also notice that some of the packages are checked, and some, like SASxport, are not checked (yet).

TODO: Click on package name to get help (and what’s the other way to get help for a package?)

You’ll also notice Version numbers for all of the packages #Reproducibility

The next thing we need to do is load the package into our current R session, using the library() function:

library(SASxport)

Notice that SASxport is now checked in the Packages pane.
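The difference between an installed package and a loaded one can also be seen from the console. The sketch below uses splines, a package that ships with R but is not attached by default, as a stand-in for SASxport so that it runs on any installation.

```r
# search() lists everything currently attached to the session.
"package:splines" %in% search()   # usually FALSE in a fresh session

library(splines)                   # attach the package for this session

"package:splines" %in% search()   # TRUE once the package is attached

# A function can also be called without attaching its package,
# using the package::function form.
utils::head(letters, 3)
```

The package::function form is handy when you need only one function from a package, or when two loaded packages define functions with the same name.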

We can open up the documentation page for SASxport in our Help pane by running help(SASxport). We learn there that the function we want to use is called read.xport(). Let’s try it and see what we get.

DEMO_I.XPT contains demographic data for study subjects.

demographics <- read.xport('data/DEMO_I.XPT')

After reading our dataset into the demographics data frame, we can inspect its shape and contents with functions such as colnames(df), head(df), and tail(df).

colnames(demographics)

 [1] "SEQN"     "SDDSRVYR" "RIDSTATR" "RIAGENDR" "RIDAGEYR" "RIDAGEMN"
 [7] "RIDRETH1" "RIDRETH3" "RIDEXMON" "RIDEXAGM" "DMQMILIZ" "DMQADFC" 
[13] "DMDBORN4" "DMDCITZN" "DMDYRSUS" "DMDEDUC3" "DMDEDUC2" "DMDMARTL"
[19] "RIDEXPRG" "SIALANG"  "SIAPROXY" "SIAINTRP" "FIALANG"  "FIAPROXY"
[25] "FIAINTRP" "MIALANG"  "MIAPROXY" "MIAINTRP" "AIALANGA" "DMDHHSIZ"
[31] "DMDFMSIZ" "DMDHHSZA" "DMDHHSZB" "DMDHHSZE" "DMDHRGND" "DMDHRAGE"
[37] "DMDHRBR4" "DMDHREDU" "DMDHRMAR" "DMDHSEDU" "WTINT2YR" "WTMEC2YR"
[43] "SDMVPSU"  "SDMVSTRA" "INDHHIN2" "INDFMIN2" "INDFMPIR"
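The same kind of inspection works on any data frame. As a sketch that runs without the NHANES files, the example below uses airquality, a data set built into R, in place of demographics. The name aq is only for illustration.

```r
# airquality ships with R; it stands in for the demographics data frame here.
aq <- airquality

dim(aq)        # number of rows, then number of columns
nrow(aq)       # rows only
ncol(aq)       # columns only
colnames(aq)   # variable names

head(aq, 3)    # first three rows
tail(aq, 3)    # last three rows

str(aq)        # compact overview: type and first values of each column
```

Checking dim() and colnames() on each table before merging is a quick way to confirm that the files were read in as expected.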

glu_df <- read.xport('data/GLU_I.XPT')
sleep_df <- read.xport('data/SLQ_I.XPT')

A word about reproducibility

Package versions - maybe put this as an open discussion question in the Exercises? With some links to ways to solve reproducibility issues?
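One simple starting point for reproducibility is to record which versions of R and of each package were used. The sketch below shows a few base R functions for that. It queries stats and utils only because they are present in every installation; in a real project the vector would list the packages the analysis actually loads.

```r
# The version of R itself.
R.version.string

# The version of a single package.
packageVersion("stats")

# Versions of several packages at once, as a named character vector.
pkgs <- c("stats", "utils")
versions <- vapply(pkgs, function(p) as.character(packageVersion(p)), character(1))
versions

# A fuller report: R version, platform, and attached packages.
sessionInfo()
```

Saving the output of sessionInfo() alongside your results gives collaborators a record of the environment the analysis ran in.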

Platform-independent? Maybe put this in the intro (R vs. software that only works on Windows, for example)

Looking at variable types after reading in - Factors etc.

And how to use the stringsAsFactors parameter
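The effect of stringsAsFactors can be seen by reading the same data twice. In R versions before 4.0.0, read.csv() converted text columns to factors by default. From 4.0.0 onward they stay as character vectors unless you ask otherwise. The sketch below uses made-up inline data and sets the parameter explicitly, so it behaves the same on either version.

```r
# Hypothetical data with one text column and two numeric columns.
csv_text <- "ID,Gender,Age
101,M,40
102,F,35
103,F,22"

as_char   <- read.csv(text = csv_text, stringsAsFactors = FALSE)
as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_char$Gender)      # text kept as a character vector
class(as_factor$Gender)    # text converted to a factor
levels(as_factor$Gender)   # the distinct categories the factor can take

# A character column can also be converted after reading.
gender_factor <- factor(as_char$Gender)
table(gender_factor)       # counts per category

# str() shows the type of every column at a glance.
str(as_factor)
```

Setting the parameter explicitly, whichever value you choose, keeps a script from silently changing behaviour when it is run on a different version of R.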

Key Points

Learn to use read.* and related functions

Learn to use str()

Use colnames() to see a list of column names

Data Analysis and Visualization in R for Public Health