This lesson is still being designed and assembled (Pre-Alpha version)

Selecting and Renaming Variables, and Subsetting

Overview

Teaching: 15 min
Exercises: 2 min
Questions
  • How can I narrow down to just the variables of interest?

  • How can I rename variables?

Objectives
  • Learn how to scope a data frame down to just variables of interest

  • Learn how to select variables by name using dplyr

  • Learn how to rename variables

Creating a subset with fewer variables

Right now our data frame has hundreds of variables (columns), but our analysis only needs a limited number of them.

Furthermore, we didn’t pick the names for the variables. They might be names like TOTIDE26 or GLUSSPE6 or GLUSLOO6. Variable names like these are common in publicly available data sets, such as NHANES and other CDC data sources, the Framingham Heart Study, and the SWAN study data we’re using here. With variable names like these, it’s especially important to download the codebook that explains each variable.

In addition, the columns in a data set might have names that are difficult to work with in R. For example, a column might be named Study ID or Diastolic BP, and names containing spaces or other special characters can cause problems once the data is loaded into R.
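As a rough illustration of why such names are awkward, the sketch below builds a small hypothetical data frame with the two column names mentioned above. The data frame names (bp_df, mangled_df) and the values are made up for this example and are not part of the SWAN data.

```r
# Hypothetical toy data; the values are made up for illustration.
# check.names = FALSE keeps the original names, spaces included.
bp_df <- data.frame(`Study ID` = c(1, 2),
                    `Diastolic BP` = c(72, 62),
                    check.names = FALSE)

names(bp_df)

# A name containing a space must be wrapped in backticks to be used in code
bp_df$`Diastolic BP`

# With the default check.names = TRUE, data.frame() rewrites the names
# into syntactically valid ones, replacing each space with a dot
mangled_df <- data.frame(`Study ID` = c(1, 2),
                         `Diastolic BP` = c(72, 62))

names(mangled_df)
```

Either way, the names are no longer convenient to type, which is one reason to rename them early.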

So there are at least two steps we’d like to take at this point:

  1. Narrow down to just certain variables
  2. Rename variables

Selecting variables

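One of the most convenient ways to “wrangle” data frames is to use the functions that come with the dplyr package. The examples below also use the pipe operator (%>%), which passes the data frame on its left to the function on its right as that function’s first argument.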

dplyr’s select() function narrows a data frame down to the variables you name, given as a comma-separated list of arguments.

library(dplyr)

analysis_swan_df <- swan_df %>% select(SWANID, AGE6, RACE, BMI6, 
                                       GLUCRES6, SMOKERE6, LDLRESU6, HDLRESU6, 
                                       CRPRESU6, DIABP16, SYSBP16, EXERCIS6)

str(analysis_swan_df)
'data.frame':	2424 obs. of  12 variables:
 $ SWANID  : int  10046 10056 10126 10153 10196 10245 10484 10514 10522 10532 ...
 $ AGE6    : int  58 57 54 57 52 54 56 52 52 49 ...
 $ RACE    : int  2 4 1 3 2 4 1 4 4 4 ...
 $ BMI6    : num  35.6 19.8 26.4 31.6 22.3 ...
 $ GLUCRES6: int  116 89 82 85 80 88 111 101 83 91 ...
 $ SMOKERE6: int  1 1 NA 1 1 1 2 1 1 NA ...
 $ LDLRESU6: int  137 90 136 154 130 129 93 137 103 128 ...
 $ HDLRESU6: int  48 78 57 55 59 83 47 39 65 41 ...
 $ CRPRESU6: num  8.7 0.5 1.5 2.7 0.3 1.3 7.4 1.3 1.1 1.5 ...
 $ DIABP16 : int  72 62 80 68 62 64 68 70 58 70 ...
 $ SYSBP16 : int  134 96 102 108 94 94 130 124 102 118 ...
 $ EXERCIS6: int  2 2 NA 2 2 1 2 2 2 NA ...
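The call above uses the %>% pipe to hand swan_df to select(). As a minimal sketch of what the pipe does, the example below compares the piped form with a direct call. The data frame toy_df is a hypothetical stand-in for swan_df with made-up values.

```r
library(dplyr)

# Hypothetical stand-in for swan_df; the values are made up
toy_df <- data.frame(SWANID = c(1, 2, 3),
                     AGE6 = c(58, 57, 54),
                     RACE = c(2, 4, 1))

# Piped form: toy_df becomes the first argument of select()
piped_df <- toy_df %>% select(SWANID, AGE6)

# Direct form: the same call written without the pipe
direct_df <- select(toy_df, SWANID, AGE6)

# Check that both forms give the same result
identical(piped_df, direct_df)
```

The piped form tends to read more naturally once several steps are chained one after another.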

select() also offers other ways to say which columns to include or exclude. Helper functions such as starts_with(), ends_with(), and contains() can be used inside select().

To create a data frame containing only the variables whose names end in “6”, you could use select(ends_with("6")).
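The sketch below tries these helpers on a small hypothetical data frame. The column names are borrowed from this lesson, but toy_df and its values are made up so the example can run without the SWAN data.

```r
library(dplyr)

# Hypothetical stand-in for swan_df; the values are made up
toy_df <- data.frame(SWANID = c(1, 2),
                     RACE = c(2, 4),
                     AGE6 = c(58, 57),
                     GLUCRES6 = c(116, 89),
                     LDLRESU6 = c(137, 90))

# Columns whose names end in "6"
ends_cols <- toy_df %>% select(ends_with("6")) %>% names()

# Columns whose names start with "AGE"
starts_cols <- toy_df %>% select(starts_with("AGE")) %>% names()

# Columns whose names contain "RES" anywhere
contains_cols <- toy_df %>% select(contains("RES")) %>% names()

ends_cols
starts_cols
contains_cols
```

Helpers can be mixed with plain column names in the same select() call.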

select() can also keep every variable except the ones you name, by prefixing each name with a minus sign (-). For example, select(-NAME, -BMI) keeps all variables except NAME and BMI.
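A minimal sketch of excluding columns, assuming a hypothetical data frame that has NAME and BMI columns (toy_df and its values are made up for this example):

```r
library(dplyr)

# Hypothetical data frame with NAME and BMI columns; the values are made up
toy_df <- data.frame(SWANID = c(1, 2),
                     NAME = c("A", "B"),
                     BMI = c(35.6, 19.8),
                     AGE6 = c(58, 57))

# Keep everything except NAME and BMI
trimmed_df <- toy_df %>% select(-NAME, -BMI)

names(trimmed_df)
```

The remaining columns keep their original order.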

Exercise

How might you create a new data frame from swan_df containing just the id of each participant plus the glucose-related variables, which all start with “GLU”?

Solution

glu_df <- swan_df %>% select(SWANID, starts_with("GLU"))

Renaming variables

Now that we’ve scoped our data frame down to just the variables we want to work with, let’s rename a few, using the rename() function (also from dplyr).

Notice that each item in the parameter list is an expression of the form:

new variable name = old variable name

analysis_swan_df <- analysis_swan_df %>% rename(Glucose = GLUCRES6, LDL = LDLRESU6,
                                                HDL = HDLRESU6, CRP = CRPRESU6,
                                                DBP = DIABP16, SBP = SYSBP16,
                                                Smoker = SMOKERE6, Exercise = EXERCIS6,
                                                Age = AGE6, BMI = BMI6)

str(analysis_swan_df)
'data.frame':	2424 obs. of  12 variables:
 $ SWANID  : int  10046 10056 10126 10153 10196 10245 10484 10514 10522 10532 ...
 $ Age     : int  58 57 54 57 52 54 56 52 52 49 ...
 $ RACE    : int  2 4 1 3 2 4 1 4 4 4 ...
 $ BMI     : num  35.6 19.8 26.4 31.6 22.3 ...
 $ Glucose : int  116 89 82 85 80 88 111 101 83 91 ...
 $ Smoker  : int  1 1 NA 1 1 1 2 1 1 NA ...
 $ LDL     : int  137 90 136 154 130 129 93 137 103 128 ...
 $ HDL     : int  48 78 57 55 59 83 47 39 65 41 ...
 $ CRP     : num  8.7 0.5 1.5 2.7 0.3 1.3 7.4 1.3 1.1 1.5 ...
 $ DBP     : int  72 62 80 68 62 64 68 70 58 70 ...
 $ SBP     : int  134 96 102 108 94 94 130 124 102 118 ...
 $ Exercise: int  2 2 NA 2 2 1 2 2 2 NA ...
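The direction of the new = old form is easy to get backwards, so here is a small check on a hypothetical data frame (toy_df and its values are made up). It suggests that rename() changes only the names, leaving the column order and the values as they were.

```r
library(dplyr)

# Hypothetical stand-in for analysis_swan_df; the values are made up
toy_df <- data.frame(SWANID = c(1, 2),
                     AGE6 = c(58, 57),
                     BMI6 = c(35.6, 19.8))

# New name on the left, old name on the right
renamed_df <- toy_df %>% rename(Age = AGE6, BMI = BMI6)

names(renamed_df)

# The old names are gone after renaming
"AGE6" %in% names(renamed_df)

# The values themselves are unchanged
identical(renamed_df$Age, toy_df$AGE6)
```

Writing the pair the other way round (old = new) makes rename() stop with an error, because it cannot find a column with the new name.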

Exercise

Rename the GLUSOST6 variable to Glucosamine in the glu_df data frame you created in the previous exercise.

Solution

glu_df <- glu_df %>% rename(Glucosamine = GLUSOST6)

#TODO: As its own episode, show use of filter()

Key Points

  • Use select() to create a subset of a data frame based on columns

  • Use rename() to rename columns

  • Use filter() to filter rows