This lesson is still being designed and assembled (Pre-Alpha version)

Data Cleaning

Overview

Teaching: 45 min
Exercises: 10 min
Questions
  • How can I focus on just the variables I’m interested in?

  • What are some ways of viewing the raw frequencies of variables?

  • How can I count the number of cases vs. controls?

  • How can I assess nulls/missing values in my data?

Objectives
  • Learn how to create a subset with just certain variables

  • Learn techniques for dealing with null values in R, e.g. when a null is represented in the data as 999

TODO: Rename factors and relevel factors (does this need its own section?). Also when you read in and the type is int/char/etc. but it should be a factor.

When cleaning the data, we first need to decide which variables we are going to assess. For that, we can look at the data frame that we created from merging. Next, we want to make sure that the variables of interest have data for all the observations. If, in our case, we want to assess Total Chol, every observation included in the analysis must have a value for it, so we delete all observations that are missing a value for that variable. Make sure to inspect the data after each step of cleaning/restriction.

library(tidyverse)
── Attaching packages ──────────────────────────────────────────────────────── tidyverse 1.3.0 ──
✓ ggplot2 3.2.1     ✓ purrr   0.3.3
✓ tibble  2.1.3     ✓ dplyr   0.8.3
✓ tidyr   1.0.0     ✓ stringr 1.4.0
✓ readr   1.3.1     ✓ forcats 0.4.0
── Conflicts ─────────────────────────────────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
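# Keep only observations with a Total Chol (LBXTC) value; merge_df is the data frame created from merging
hdlanalysis_df <- merge_df %>% drop_na(LBXTC)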
Error in eval(lhs, parent, parent): object 'merge_df' not found
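Since `merge_df` comes from the earlier merging step, the same idea can be tried on a small made-up data frame. The sketch below is only an illustration: the values, the `SEQN` ID column, and the `EXTRA` column are assumptions, and only `LBXTC`, `RIDAGEYR`, and `LBDLDL` appear in this lesson. It keeps just the variables of interest with `select()`, then drops rows without an `LBXTC` value with `drop_na()`.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for merge_df (all values are invented for illustration)
toy_df <- tibble(
  SEQN     = 1:5,                        # hypothetical participant ID
  RIDAGEYR = c(25, 34, 47, 66, 58),      # age
  LBXTC    = c(180, NA, 210, 195, NA),   # Total Chol
  LBDLDL   = c(100, 120, NA, 110, 130),  # the IV used later in this lesson
  EXTRA    = c("a", "b", "c", "d", "e")  # hypothetical column we do not need
)

# Keep only the variables of interest
subset_df <- toy_df %>% select(SEQN, RIDAGEYR, LBXTC, LBDLDL)

# Drop rows where LBXTC is missing; NAs in other columns are left alone
chol_df <- subset_df %>% drop_na(LBXTC)

# Inspect the data after each cleaning step
nrow(subset_df)         # rows before dropping
nrow(chol_df)           # rows after dropping
summary(chol_df$LBXTC)  # should no longer report any NAs
```

Passing a column name to `drop_na()` restricts the check to that column, so rows that are missing only some other variable are kept for now.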

Next, we will restrict the data to observations with an age between 30 and 65, inclusive (30 <= Age <= 65):

hdlanalysis_df <- hdlanalysis_df %>% filter(RIDAGEYR<=65 & RIDAGEYR>=30)
Error in eval(lhs, parent, parent): object 'hdlanalysis_df' not found
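To see what the age restriction does at the boundaries, here is a minimal sketch on invented ages. It uses `dplyr::between()`, which is inclusive on both ends and so should select the same rows as the two comparisons above; the data frame name `ages_df` is hypothetical.

```r
library(dplyr)

# Invented ages, including both boundary values and a missing value
ages_df <- tibble(RIDAGEYR = c(18, 30, 45, 65, 66, NA))

# between(x, 30, 65) is equivalent to x >= 30 & x <= 65
restricted_df <- ages_df %>% filter(between(RIDAGEYR, 30, 65))

# Inspect the result: check the remaining age range and row count
range(restricted_df$RIDAGEYR)
nrow(restricted_df)
```

Note that `filter()` keeps only rows where the condition is `TRUE`, so the row with a missing age is dropped as well.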

Next, let’s count the NA values for our IV. WHAT I NEED TO LOOK UP: can we do the analysis with NAs included?

summary(hdlanalysis_df$LBDLDL)
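As a rough sketch of how missing values can be counted and recoded, the example below uses invented `LBDLDL` values. The 999 code is the hypothetical "null represented as 999" case from the objectives, not something known about this particular variable.

```r
library(dplyr)

# Invented values; 999 stands in for a hypothetical missing-value code
ldl_df <- tibble(LBDLDL = c(110, NA, 999, 95, NA, 130))

# Count the explicit NAs (summary() also reports an NA count)
n_na_before <- sum(is.na(ldl_df$LBDLDL))
summary(ldl_df$LBDLDL)

# Recode the sentinel value 999 as a real NA
ldl_df <- ldl_df %>% mutate(LBDLDL = na_if(LBDLDL, 999))
n_na_after <- sum(is.na(ldl_df$LBDLDL))

# Many summary functions return NA unless missing values are removed
mean(ldl_df$LBDLDL)                # NA, because NAs are present
mean(ldl_df$LBDLDL, na.rm = TRUE)  # mean of the observed values only
```

Recoding the sentinel before summarising matters: left as 999, it would be treated as a real measurement and would inflate the mean and maximum.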

#TODO: Great example of data cleaning - values include: ‘ ‘ ‘ ‘ ‘ xyz’ ‘asfa ‘
bmi_india_cardiogrant2016$out <- as.character(bmi_india_cardiogrant2016$out)
bmi_india_cardiogrant2016$out <- trimws(bmi_india_cardiogrant2016$out)
bmi_india_cardiogrant2016$out <- as.factor(bmi_india_cardiogrant2016$out)

# After trimming, the first level is "" (empty string); set it to NA
levels(bmi_india_cardiogrant2016$out)[1] <- NA
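The whitespace clean-up can be tried on a small invented factor. This is a sketch under the assumption that the column holds blank and padded strings like those listed in the TODO; the vector `out` and its values are made up for illustration.

```r
# Invented values standing in for the `out` column
out <- factor(c(" ", "  ", " xyz", "asfa ", "xyz"))

# Convert to character, strip leading/trailing whitespace, convert back
out <- as.character(out)
out <- trimws(out)
out <- as.factor(out)
levels(out)  # whitespace-only values have collapsed into a "" level

# Only overwrite the first level if it really is the empty string
if (levels(out)[1] == "") {
  levels(out)[1] <- NA
}

levels(out)      # the "" level is gone
sum(is.na(out))  # the blank entries are now counted as NA
```

The guard around `levels(out)[1] <- NA` is a design choice for this sketch: if a column happens to have no blank entries, overwriting the first level unconditionally would turn a real category into NA.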

Key Points

  • Use table() to look at frequencies

  • Get a count …
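As an illustration of looking at raw frequencies with `table()`, the sketch below counts cases and controls in an invented `status` column. The column name and values are hypothetical and do not come from the lesson data.

```r
library(dplyr)

# Invented case/control labels, with one missing value
status_df <- tibble(
  status = c("case", "control", "control", NA, "case", "control")
)

# By default table() leaves NAs out of the counts
freq <- table(status_df$status)
freq

# useNA = "ifany" adds a count for NA when any are present
freq_na <- table(status_df$status, useNA = "ifany")
freq_na

# dplyr alternative: one row per value, including NA
status_counts <- status_df %>% count(status)
status_counts
```

Comparing the two `table()` calls is a quick check for missing values: if the totals differ, some observations have no recorded status.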