Selecting and Renaming Variables, and Subsetting
Overview
Teaching: 15 min
Exercises: 2 min
Questions
How can I narrow down to just the variables of interest?
How can I rename variables?
Objectives
Learn how to scope a data frame down to just variables of interest
Learn how to select variables by name using dplyr
Learn how to rename variables
Creating a subset with fewer variables
Right now our data frame has hundreds of variables in columns, but for our analysis, we’re only interested in analyzing data from a limited number of those variables.
Furthermore, we didn’t pick the names for the variables. They might be names like TOTIDE26, GLUSSPE6, or GLUSLOO6. Variable names like these are common when working with publicly available data sets, such as NHANES or other CDC data sources, the Framingham Heart Study, or the SWAN study data we’re using. With variable names like these, it’s especially important to download the codebook that explains each variable.
In addition, the columns in a data set might have names that are difficult to work with in R. For example, a column might be named Study ID or Diastolic BP; names with spaces or other special characters can cause problems after the data is loaded into R.
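To make the problem with spaces concrete, here is a minimal sketch. The data frame bp_df and its values are made up for illustration and are not part of the SWAN data.

```r
# Toy data frame with made-up values; check.names = FALSE keeps the
# spaces in the column names instead of converting them to dots.
bp_df <- data.frame(
  "Study ID" = c(1, 2, 3),
  "Diastolic BP" = c(72, 62, 80),
  check.names = FALSE
)

# The names still contain spaces.
names(bp_df)

# A name with a space must be wrapped in backticks to be used in code.
mean_dbp <- mean(bp_df$`Diastolic BP`)

# make.names() shows what R would turn these into if asked for
# syntactically valid names (spaces become dots).
valid_names <- make.names(names(bp_df))
valid_names
```

Renaming such columns once, early on, avoids typing backticks everywhere else in the analysis.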
So there are at least two steps we’d like to take at this point:
- Narrow down to just certain variables
- Rename variables
Selecting variables
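One of the most convenient ways to “wrangle” data frames is to use the functionality that comes with the dplyr package.
The code below uses the pipe operator, %>%, which is available once dplyr is loaded. It passes the data frame on its left as the first argument to the function on its right.
dplyr’s select() function narrows a data frame down to the variables you name in a comma-separated list of arguments.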
library(dplyr)
analysis_swan_df <- swan_df %>% select(SWANID, AGE6, RACE, BMI6,
GLUCRES6, SMOKERE6, LDLRESU6, HDLRESU6,
CRPRESU6, DIABP16, SYSBP16, EXERCIS6)
str(analysis_swan_df)
'data.frame': 2424 obs. of 12 variables:
$ SWANID : int 10046 10056 10126 10153 10196 10245 10484 10514 10522 10532 ...
$ AGE6 : int 58 57 54 57 52 54 56 52 52 49 ...
$ RACE : int 2 4 1 3 2 4 1 4 4 4 ...
$ BMI6 : num 35.6 19.8 26.4 31.6 22.3 ...
$ GLUCRES6: int 116 89 82 85 80 88 111 101 83 91 ...
$ SMOKERE6: int 1 1 NA 1 1 1 2 1 1 NA ...
$ LDLRESU6: int 137 90 136 154 130 129 93 137 103 128 ...
$ HDLRESU6: int 48 78 57 55 59 83 47 39 65 41 ...
$ CRPRESU6: num 8.7 0.5 1.5 2.7 0.3 1.3 7.4 1.3 1.1 1.5 ...
$ DIABP16 : int 72 62 80 68 62 64 68 70 58 70 ...
$ SYSBP16 : int 134 96 102 108 94 94 130 124 102 118 ...
$ EXERCIS6: int 2 2 NA 2 2 1 2 2 2 NA ...
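The %>% in the call above can be read as “take swan_df, then select these columns”. As a small sketch, using a toy data frame (toy_df, with made-up values standing in for swan_df), the piped form and the direct call give the same result:

```r
library(dplyr)

# Toy stand-in for swan_df with made-up values (not real SWAN data).
toy_df <- data.frame(
  SWANID = c(1L, 2L, 3L),
  AGE6 = c(58L, 57L, 54L),
  BMI6 = c(35.6, 19.8, 26.4),
  RACE = c(2L, 4L, 1L)
)

# Piped form: toy_df is passed as the first argument to select().
piped <- toy_df %>% select(SWANID, AGE6)

# Equivalent direct call without the pipe.
direct <- select(toy_df, SWANID, AGE6)

# Both approaches return the same two-column data frame.
identical(piped, direct)
names(piped)
```

The piped form becomes more readable as further steps are chained onto the end.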
select() also allows you to use other ways to express which columns to include or exclude. For example, functions like starts_with(), ends_with(), and contains() can be used with select().
You might want to create a data frame by selecting all of the variables ending in “6”, and you might do that with select(ends_with("6")).
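The following sketch tries each helper on a toy data frame. The column names are borrowed from the SWAN variables used earlier, but toy_df and its values are made up for illustration.

```r
library(dplyr)

# Toy data frame with made-up values (not real SWAN data).
toy_df <- data.frame(
  SWANID = c(1L, 2L),
  RACE = c(2L, 4L),
  LDLRESU6 = c(137L, 90L),
  HDLRESU6 = c(48L, 78L),
  DIABP16 = c(72L, 62L)
)

# Columns whose names begin with "DIA".
dia_names <- names(toy_df %>% select(starts_with("DIA")))

# Columns whose names end in "6".
six_names <- names(toy_df %>% select(ends_with("6")))

# Columns whose names contain "RESU" anywhere.
resu_names <- names(toy_df %>% select(contains("RESU")))

dia_names
six_names
resu_names
```

Helpers can be combined with plain column names in the same select() call, separated by commas.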
select() also allows you to select all variables except certain ones, using the minus sign (-). For example, you could use select(-NAME, -BMI) to select all variables except NAME and BMI.
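A minimal sketch of excluding columns, assuming a toy data frame (toy_df, with made-up values) that has NAME and BMI columns:

```r
library(dplyr)

# Toy data frame with made-up values, for illustration only.
toy_df <- data.frame(
  SWANID = c(1L, 2L),
  NAME = c("A", "B"),
  BMI = c(35.6, 19.8),
  AGE = c(58L, 57L)
)

# Keep everything except NAME and BMI.
no_name_bmi <- toy_df %>% select(-NAME, -BMI)
names(no_name_bmi)

# A helper can be negated the same way: drop columns starting with "A".
no_a_cols <- toy_df %>% select(-starts_with("A"))
names(no_a_cols)
```

Excluding is handy when you want to keep most of a wide data frame and drop only a few columns.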
Exercise
How might you create a new data frame from swan_df containing just the ID of each participant plus the glucose-related variables, which all start with “GLU”?
Solution
glu_df <- swan_df %>% select(SWANID, starts_with("GLU"))
Renaming variables
Now that we’ve scoped our data frame down to just the variables we want to work with, let’s rename a few, using the rename() function (also from dplyr).
Notice that each item in the parameter list is an expression of the form:
new variable name = old variable name
analysis_swan_df <- analysis_swan_df %>% rename(Glucose = GLUCRES6, LDL = LDLRESU6,
HDL = HDLRESU6, CRP = CRPRESU6, DBP = DIABP16, SBP = SYSBP16,
Smoker = SMOKERE6, Exercise = EXERCIS6, Age = AGE6, BMI = BMI6)
str(analysis_swan_df)
'data.frame': 2424 obs. of 12 variables:
$ SWANID : int 10046 10056 10126 10153 10196 10245 10484 10514 10522 10532 ...
$ Age : int 58 57 54 57 52 54 56 52 52 49 ...
$ RACE : int 2 4 1 3 2 4 1 4 4 4 ...
$ BMI : num 35.6 19.8 26.4 31.6 22.3 ...
$ Glucose : int 116 89 82 85 80 88 111 101 83 91 ...
$ Smoker : int 1 1 NA 1 1 1 2 1 1 NA ...
$ LDL : int 137 90 136 154 130 129 93 137 103 128 ...
$ HDL : int 48 78 57 55 59 83 47 39 65 41 ...
$ CRP : num 8.7 0.5 1.5 2.7 0.3 1.3 7.4 1.3 1.1 1.5 ...
$ DBP : int 72 62 80 68 62 64 68 70 58 70 ...
$ SBP : int 134 96 102 108 94 94 130 124 102 118 ...
$ Exercise: int 2 2 NA 2 2 1 2 2 2 NA ...
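As a small check of the new = old pattern, the sketch below renames columns in a toy data frame (toy_df, with made-up values). It also shows, for comparison, that select() accepts the same new = old form when you want to pick and rename in one step.

```r
library(dplyr)

# Toy data frame with made-up values (not real SWAN data).
toy_df <- data.frame(
  SWANID = c(1L, 2L),
  AGE6 = c(58L, 57L),
  BMI6 = c(35.6, 19.8)
)

# rename() keeps every column and changes only the names listed.
renamed <- toy_df %>% rename(Age = AGE6, BMI = BMI6)
names(renamed)

# select() with new = old keeps only the listed columns, renaming as it goes.
picked <- toy_df %>% select(SWANID, Age = AGE6)
names(picked)
```

Using rename() after select(), as in this episode, keeps the two steps easy to read separately.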
Exercise
Rename the GLUSOST6 variable to Glucosamine in the glu_df data frame you created in the previous exercise.
Solution
glu_df <- glu_df %>% rename(Glucosamine = GLUSOST6)
Key Points
Use select() to create a subset of a data frame based on columns
Use rename() to rename columns
Use filter() to filter rows