Let’s Regress in R!

Regression is a statistical method used to estimate the magnitude of association between exposure(s) and outcome of interest. For example, we know that smoking is associated with increased incidence of lung cancer – but how strong is that association? Questions like this can be answered with regression.

Linear and logistic regression are perhaps the most common types of regression. The paper by Castro and Ferreira summarises the difference well: “Linear regression is used for continuous outcome variables (e.g., days of hospitalization or FEV1), and logistic regression is used for categorical outcome variables, such as death. Independent variables can be continuous, categorical, or a mix of both.”

Today we’ll use the linelist_cleaned.rds dataset from Applied Epi to take a quick look at how each of these types of regression work. You can download this dataset here. Below we’ll work through an R script to get a taste of how linear and logistic regression works.

Load packages
pacman::p_load(
rio, # File import
here, # File locator
tidyverse, # data management + ggplot2 graphics,
stringr, # manipulate text strings
purrr, # loop over objects in a tidy way
gtsummary, # summary statistics and tests
broom, # tidy up results from regressions
lmtest, # likelihood-ratio tests
parameters, # alternative to tidy up results from regressions
see # alternative to visualise forest plots
)

Anchor my code (file location)
here::i_am("regression.R")

Import the linelist
linelist <- import("linelist_cleaned.rds")

Define variables of interest
explanatory_vars <- c("gender", "fever", "chills", "cough", "aches", "vomit")

Convert dichotomous variables to 0/1
linelist_01 <- linelist %>%
mutate(across(
.cols = all_of(c(explanatory_vars, "outcome")), ## for each column listed and "outcome"
.fns = ~case_when(
. %in% c("m", "yes", "Death") ~ 1, ## recode male, yes and death to 1
. %in% c("f", "no", "Recover") ~ 0, ## female, no and recover to 0
TRUE ~ NA_real_) ## otherwise set to missing
)
)

Or you can factor the variables of interest
linelist_factor <- linelist %>%
mutate(gender = factor(gender, levels = c("f","m")),
fever = factor(fever, levels = c("no","yes")),
chills = factor(chills, levels = c("no","yes")),
cough = factor(cough, levels = c("no","yes")),
aches = factor(aches, levels = c("no","yes")),
vomit = factor(vomit, levels = c("no","yes")),
outcome = factor(outcome, levels = c("Recover","Death")))

Add in age_category to the explanatory vars
explanatory_vars <- c(explanatory_vars, "age_cat")

Drop rows with missing information for variables of interest
linelist_01 <- linelist_01 %>%
drop_na(any_of(c("outcome", explanatory_vars)))

linelist_factor <- linelist_factor %>%
drop_na(any_of(c("outcome", explanatory_vars)))

Univariate linear regression (outcome ~ exposure)
lm_results <- lm(ht_cm ~ age, data = linelist_01)
summary(lm_results)
tidy(lm_results)

Plot regression results
Pull the regression points and observed data in to one dataset
points <- augment(lm_results)

Plot the data using age as the x-axis
ggplot(points, aes(x = age)) +
## add points for height
geom_point(aes(y = ht_cm)) +
## add your regression line
geom_line(aes(y = .fitted), colour = "red")

Easy way to add regression line to ggplot (if univariate linear); add your data to a plot
ggplot(linelist_01, aes(x = age, y = ht_cm)) +
## show points
geom_point() +
## add a linear regression
geom_smooth(method = "lm", se = TRUE)

Multivariate linear regression
# outcome ~ exposure1 + exposure2 + exposure3....
lm_results_2 <- lm(ht_cm ~ age + wt_kg + ct_blood, data = linelist_01)
summary(lm_results_2)
# the "equation" for the above looks like:
# ht_cm = 13.98662 + 1.79823*age + 1.33526*wt_kg + 0.56266*ct_blood

LOGISTIC REGRESSION

Make a little table for reality check
linelist_01 %>% count(outcome, age_cat)
table(linelist_01$outcome, linelist_01$age_cat5)

Run logistic regression model
model <- glm(outcome ~ age_cat5, family = "binomial", data = linelist_01)
summary(model)

Run model and exponentiate results so you can interpret them directly
model <- glm(outcome ~ age_cat5, family = "binomial", data = linelist_01) %>%
tidy(exponentiate = TRUE, conf.int = TRUE) %>% # exponentiate and produce CIs
mutate(across(where(is.numeric), round, digits = 2)) %>% # round all numeric columns
select(term, estimate, conf.low, conf.high)

model

Run multivariate logistic regression model
model_2 <- glm(outcome ~ gender + hospital + age_cat5, family = "binomial", data = linelist_01) %>%
tidy(exponentiate = TRUE, conf.int = TRUE) %>% # exponentiate and produce CIs
mutate(across(where(is.numeric), round, digits = 2)) %>% # round all numeric columns
select(term, estimate, conf.low, conf.high)

model_2

Related Articles

Responses

Your email address will not be published. Required fields are marked *