This lesson is still being designed and assembled (Pre-Alpha version)

Creating univariate regression models

Overview

Teaching: 45 min
Exercises: 10 min
Questions
  • How can I make a graph of my regression?

  • How can I look at residuals?

  • How can I assess and remove outliers?

  • How can I export publication-quality graphics?

Objectives
  • Make a scatterplot

  • Learn how to compute correlations with cor()

  • Learn how to run regressions with lm() and glm()

  • Learn how to isolate parts of the model object that lm() returns

  • Learn how to run t.test() - get p-value, CI, etc.

  • Assess interactions between variables

Regression models in R

Now that we have cleaned up the variables in our data frame, and are sufficiently satisfied that the variables of interest meet our assumptions for regression, we can proceed with building regression models.

Most of the commonly-used statistical tests and model-building functions are in R’s built-in stats package, although there are certainly many other packages available with additional statistical functionality.

For simple linear regression, we can use the lm() function (lm stands for “linear model”). R also provides glm() for fitting generalized linear models, which you can use not only for linear regression but also for logistic regression, Poisson regression, and so on. The stats package also provides anova() and many other commonly used statistical functions.

We’ll need to specify some parameters when we use lm(). At a minimum, we need to specify:

  • formula: an R formula object describing the model we want to fit

  • data: the data frame containing the variables named in the formula

There are other optional parameters available in lm() as well.
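
If you want to see lm()’s full argument list, base R can show it to you directly (output not shown here):

?lm        # open the help page for lm()
args(lm)   # print the full set of parameters lm() accepts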

Our outcome variable of interest is Glucose, which is a continuous variable, so linear regression is appropriate. (If our outcome variable were categorical, we would instead perform a logistic regression, which requires glm().)
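
Just to illustrate the syntax (this is not part of our analysis plan), a logistic regression call looks very similar to a linear one. The sketch below models smoking status against age, assuming Smoker is stored as a two-level factor in analysis_swan_df:

# Syntax sketch only: logistic regression uses family = binomial.
# Assumes Smoker is a two-level factor (No/Yes) in analysis_swan_df.
glm_smoke_example <- glm(Smoker ~ Age, data = analysis_swan_df, family = binomial)
summary(glm_smoke_example)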

Before we build a full model, we will first assess each variable’s individual impact on the outcome by fitting a univariate regression of the outcome on that variable alone.

As we build these models, we’ll keep an eye on the R-squared values and the F-test results. We’ll also decide on a p-value cutoff as the criterion for which variables to carry forward into the multivariate model. Let’s use a p-value cutoff of 0.15.

For our first univariate model, we’ll try fitting the following model:

$$\text{Glucose} = \beta_0 + \beta_1 \cdot \text{BMI\_cat} + \epsilon$$

where $\beta_0$ is the y-intercept, $\beta_1$ is the slope, and $\epsilon$ is the error term.

To compute this regression using lm(), we need to specify the formula parameter and the data parameter.

The formula parameter expects an R formula object, which is expressed using a tilde (~) to relate a “y” variable to an expression containing “x” variables. For example, formulas might look like:

y ~ x

lung_cancer ~ smoker + age + gender + income

Formulas can also include interaction terms. A colon (:) specifies an interaction between two variables, and an asterisk (*) is shorthand for both main effects plus their interaction, so the following two formulas are equivalent:

y ~ x1 + x2 + x1:x2

y ~ x1 * x2

Note that the variables named in the formula must be columns of the data frame passed to the data parameter.

In our case, we have a relatively simple univariate model where Glucose is the “y” variable, and BMI_cat is our only “x” variable:

Glucose ~ BMI_cat

We’ll store the result of the regression in a variable we’ll call lm_bmi:

lm_bmi <- lm(Glucose ~ BMI_cat, data = analysis_swan_df)
summary(lm_bmi) # p<0.01 for all categories except Underweight --> KEEP

Call:
lm(formula = Glucose ~ BMI_cat, data = analysis_swan_df)

Residuals:
   Min     1Q Median     3Q    Max 
-50.16 -11.56  -4.65   2.63 538.53 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)          86.366      1.127  76.617  < 2e-16 ***
BMI_catUnderweight   -3.952      5.578  -0.709  0.47871    
BMI_catPre-obese      5.284      1.696   3.116  0.00186 ** 
BMI_catObesity I     14.104      1.965   7.178 1.01e-12 ***
BMI_catObesity II    17.796      2.348   7.580 5.34e-14 ***
BMI_catObesity III   24.680      2.639   9.353  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 29.42 on 1933 degrees of freedom
  (485 observations deleted due to missingness)
Multiple R-squared:  0.06994,	Adjusted R-squared:  0.06753 
F-statistic: 29.07 on 5 and 1933 DF,  p-value: < 2.2e-16

Let’s now take a look at the model that lm() computed. Note that the reference level of BMI_cat is the “Normal” category, which is why it does not appear in the coefficient table; each coefficient is the estimated difference in mean Glucose relative to that reference level.
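
You can pull out individual pieces of the fitted model with base R’s extractor functions (output not shown here):

coef(lm_bmi)                # the estimated coefficients
confint(lm_bmi)             # 95% confidence intervals for the coefficients
fitted(lm_bmi)              # fitted (predicted) Glucose values
residuals(lm_bmi)           # residuals for each observation used in the fit
summary(lm_bmi)$r.squared   # the R-squared value on its own

We fit univariate models for the remaining candidate predictors in the same way.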

lm_age <- lm(Glucose ~ Age, data = analysis_swan_df)
summary(lm_age) # p=0.231 > 0.15 --> DROP

Call:
lm(formula = Glucose ~ Age, data = analysis_swan_df)

Residuals:
   Min     1Q Median     3Q    Max 
-46.58 -12.64  -6.82   1.74 545.05 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  78.0139    13.5767   5.746 1.05e-08 ***
Age           0.3125     0.2606   1.199    0.231    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 31.13 on 1988 degrees of freedom
  (434 observations deleted due to missingness)
Multiple R-squared:  0.0007224,	Adjusted R-squared:  0.0002198 
F-statistic: 1.437 on 1 and 1988 DF,  p-value: 0.2307

lm_bp <- lm(formula = Glucose ~ bp_category, data = analysis_swan_df)
summary(lm_bp) # p<0.001 --> KEEP

Call:
lm(formula = Glucose ~ bp_category, data = analysis_swan_df)

Residuals:
   Min     1Q Median     3Q    Max 
-46.77 -12.32  -6.32   1.88 548.60 

Coefficients:
                                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)                       90.4010     0.9457  95.593  < 2e-16 ***
bp_categoryElevated                6.7230     2.1357   3.148 0.001669 ** 
bp_categoryHypertension Stage 1    8.3709     1.7661   4.740  2.3e-06 ***
bp_categoryHypertension Stage 2+   7.9209     2.0992   3.773 0.000166 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 30.28 on 1944 degrees of freedom
  (476 observations deleted due to missingness)
Multiple R-squared:  0.01653,	Adjusted R-squared:  0.01501 
F-statistic: 10.89 on 3 and 1944 DF,  p-value: 4.279e-07

lm_Chol_Ratio <- lm(Glucose ~ Chol_Ratio, data = analysis_swan_df)
summary(lm_Chol_Ratio) # p<0.001 --> KEEP

Call:
lm(formula = Glucose ~ Chol_Ratio, data = analysis_swan_df)

Residuals:
   Min     1Q Median     3Q    Max 
-50.62 -11.71  -5.84   2.27 541.86 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  77.1128     2.5636   30.08  < 2e-16 ***
Chol_Ratio    5.1286     0.7712    6.65 3.79e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 29.34 on 1958 degrees of freedom
  (464 observations deleted due to missingness)
Multiple R-squared:  0.02209,	Adjusted R-squared:  0.02159 
F-statistic: 44.22 on 1 and 1958 DF,  p-value: 3.795e-11

lm_race <- lm(Glucose ~ RACE, data = analysis_swan_df)
summary(lm_race) # overall F-test p=0.001 < 0.15 --> KEEP (stepwise selection can reconsider this later)

Call:
lm(formula = Glucose ~ RACE, data = analysis_swan_df)

Residuals:
   Min     1Q Median     3Q    Max 
-50.88 -12.60  -6.09   2.29 546.91 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)     98.877      1.301  75.986  < 2e-16 ***
RACEChinese     -5.280      2.594  -2.035   0.0419 *  
RACEJapanese    -6.212      2.459  -2.526   0.0116 *  
RACECaucasian   -6.785      1.631  -4.160 3.31e-05 ***
RACEHispanic    -3.127      7.861  -0.398   0.6909    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 31.01 on 1986 degrees of freedom
  (433 observations deleted due to missingness)
Multiple R-squared:  0.009062,	Adjusted R-squared:  0.007066 
F-statistic:  4.54 on 4 and 1986 DF,  p-value: 0.001186

lm_exercise <- lm(Glucose ~ Exercise, data = analysis_swan_df)
summary(lm_exercise) # p<0.001 --> KEEP

Call:
lm(formula = Glucose ~ Exercise, data = analysis_swan_df)

Residuals:
   Min     1Q Median     3Q    Max 
-46.69 -11.69  -5.64   2.36 540.31 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   98.688      1.275  77.387  < 2e-16 ***
ExerciseYes   -7.052      1.502  -4.694 2.87e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 29.22 on 1877 degrees of freedom
  (545 observations deleted due to missingness)
Multiple R-squared:  0.0116,	Adjusted R-squared:  0.01108 
F-statistic: 22.03 on 1 and 1877 DF,  p-value: 2.873e-06

lm_crp <- lm(Glucose ~ log_CRP, data = analysis_swan_df)
summary(lm_crp) # p<0.001 --> KEEP

Call:
lm(formula = Glucose ~ log_CRP, data = analysis_swan_df)

Residuals:
   Min     1Q Median     3Q    Max 
-54.01 -12.95  -5.32   3.24 536.07 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  91.1544     0.7293  124.99   <2e-16 ***
log_CRP       5.3887     0.5024   10.72   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 30 on 1982 degrees of freedom
  (440 observations deleted due to missingness)
Multiple R-squared:  0.05485,	Adjusted R-squared:  0.05437 
F-statistic:   115 on 1 and 1982 DF,  p-value: < 2.2e-16

Finally, let’s fit a univariate model for smoking status:

lm_smoke <- lm(Glucose ~ Smoker, data = analysis_swan_df)
summary(lm_smoke) # p=0.035 < 0.15 --> KEEP

Call:
lm(formula = Glucose ~ Smoker, data = analysis_swan_df)

Residuals:
   Min     1Q Median     3Q    Max 
-45.25 -12.25  -6.25   1.75 545.75 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  93.2473     0.7283 128.027   <2e-16 ***
SmokerYes     4.2790     2.0291   2.109   0.0351 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 29.76 on 1915 degrees of freedom
  (507 observations deleted due to missingness)
Multiple R-squared:  0.002317,	Adjusted R-squared:  0.001796 
F-statistic: 4.447 on 1 and 1915 DF,  p-value: 0.03509
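
Rather than reading each summary by eye, you can also collect the overall F-test p-values from the models we just fit. Here is a minimal sketch using only base R (the list names are simply labels we chose):

# Collect the overall F-test p-value from each univariate fit above.
univariate_models <- list(BMI = lm_bmi, Age = lm_age, BP = lm_bp,
                          Chol_Ratio = lm_Chol_Ratio, Race = lm_race,
                          Exercise = lm_exercise, CRP = lm_crp, Smoker = lm_smoke)
sapply(univariate_models, function(m) {
  f <- summary(m)$fstatistic                        # F value, numerator df, denominator df
  unname(pf(f[1], f[2], f[3], lower.tail = FALSE))  # convert the F statistic to a p-value
})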

Next, we need to consider interaction terms: an interaction allows the effect of one predictor on the outcome to depend on the value of another predictor (see the link at the top of this episode).

# Manually create the Age x Chol_Ratio interaction term
analysis_swan_df$age_Chol_Ratio <- analysis_swan_df$Chol_Ratio * analysis_swan_df$Age
lm_ageChol <- lm(Glucose ~ age_Chol_Ratio, data = analysis_swan_df)
summary(lm_ageChol) # This is significant on its own.

Call:
lm(formula = Glucose ~ age_Chol_Ratio, data = analysis_swan_df)

Residuals:
   Min     1Q Median     3Q    Max 
-51.02 -11.52  -5.78   2.47 542.26 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    77.12423    2.51703  30.641  < 2e-16 ***
age_Chol_Ratio  0.09850    0.01453   6.779  1.6e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 29.33 on 1957 degrees of freedom
  (465 observations deleted due to missingness)
Multiple R-squared:  0.02294,	Adjusted R-squared:  0.02244 
F-statistic: 45.95 on 1 and 1957 DF,  p-value: 1.597e-11
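
As an aside, you don’t have to create the product column by hand: R’s formula syntax can build the interaction for you. A minimal sketch (the object name lm_age_chol_int is just illustrative), where Age * Chol_Ratio expands to both main effects plus their interaction:

lm_age_chol_int <- lm(Glucose ~ Age * Chol_Ratio, data = analysis_swan_df)
summary(lm_age_chol_int) # reports Age, Chol_Ratio, and the Age:Chol_Ratio interaction term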

Key Points

  • R’s lm() function fits linear regression models; at a minimum, supply a formula and a data frame.

  • Use glm() when the outcome is not continuous (for example, logistic regression for a binary outcome).

  • Fitting a univariate regression for each candidate predictor, with a pre-chosen p-value cutoff, helps decide which variables to carry into the multivariate model.

  • summary() of a fitted model reports the coefficients, their p-values, R-squared, and the overall F-test.