Let’s Regress in R! – afredac.net

Regression is a statistical method used to estimate the magnitude of association between exposure(s) and outcome of interest. For example, we know that smoking is associated with increased incidence of lung cancer – but how strong is that association? Questions like this can be answered with regression.

Linear and logistic regression are perhaps the most common types of regression. The paper by Castro and Ferreira summarises the difference well: “Linear regression is used for continuous outcome variables (e.g., days of hospitalization or FEV1), and logistic regression is used for categorical outcome variables, such as death. Independent variables can be continuous, categorical, or a mix of both.”

Today we’ll use the linelist_cleaned.rds dataset from Applied Epi to take a quick look at how each of these types of regression work. You can download this dataset here. Below we’ll work through an R script to get a taste of how linear and logistic regression works.

Load packages
pacman::p_load( rio, # File import here, # File locator tidyverse, # data management + ggplot2 graphics, stringr, # manipulate text strings purrr, # loop over objects in a tidy way gtsummary, # summary statistics and tests broom, # tidy up results from regressions lmtest, # likelihood-ratio tests parameters, # alternative to tidy up results from regressions see # alternative to visualise forest plots )

Anchor my code (file location)
here::i_am("regression.R")

Import the linelist
linelist <- import("linelist_cleaned.rds")

Define variables of interest
explanatory_vars <- c("gender", "fever", "chills", "cough", "aches", "vomit")

Convert dichotomous variables to 0/1
linelist_01 <- linelist %>% mutate(across( .cols = all_of(c(explanatory_vars, "outcome")), ## for each column listed and "outcome" .fns = ~case_when( . %in% c("m", "yes", "Death") ~ 1, ## recode male, yes and death to 1 . %in% c("f", "no", "Recover") ~ 0, ## female, no and recover to 0 TRUE ~ NA_real_) ## otherwise set to missing ) )

Or you can factor the variables of interest
linelist_factor <- linelist %>% mutate(gender = factor(gender, levels = c("f","m")), fever = factor(fever, levels = c("no","yes")), chills = factor(chills, levels = c("no","yes")), cough = factor(cough, levels = c("no","yes")), aches = factor(aches, levels = c("no","yes")), vomit = factor(vomit, levels = c("no","yes")), outcome = factor(outcome, levels = c("Recover","Death")))

Add in age_category to the explanatory vars
explanatory_vars <- c(explanatory_vars, "age_cat")

Drop rows with missing information for variables of interest
linelist_01 <- linelist_01 %>% drop_na(any_of(c("outcome", explanatory_vars)))

linelist_factor <- linelist_factor %>% drop_na(any_of(c("outcome", explanatory_vars)))

Univariate linear regression (outcome ~ exposure)
lm_results <- lm(ht_cm ~ age, data = linelist_01) summary(lm_results) tidy(lm_results)

Plot regression results
Pull the regression points and observed data in to one dataset
points <- augment(lm_results)

Plot the data using age as the x-axis
ggplot(points, aes(x = age)) + ## add points for height geom_point(aes(y = ht_cm)) + ## add your regression line geom_line(aes(y = .fitted), colour = "red")

Easy way to add regression line to ggplot (if univariate linear); add your data to a plot
ggplot(linelist_01, aes(x = age, y = ht_cm)) + ## show points geom_point() + ## add a linear regression geom_smooth(method = "lm", se = TRUE)

Multivariate linear regression
# outcome ~ exposure1 + exposure2 + exposure3.... lm_results_2 <- lm(ht_cm ~ age + wt_kg + ct_blood, data = linelist_01) summary(lm_results_2) # the "equation" for the above looks like: # ht_cm = 13.98662 + 1.79823*age + 1.33526*wt_kg + 0.56266*ct_blood

LOGISTIC REGRESSION

Make a little table for reality check
linelist_01 %>% count(outcome, age_cat) table(linelist_01$outcome, linelist_01$age_cat5)

Run logistic regression model
model <- glm(outcome ~ age_cat5, family = "binomial", data = linelist_01) summary(model)

Run model and exponentiate results so you can interpret them directly
model <- glm(outcome ~ age_cat5, family = "binomial", data = linelist_01) %>% tidy(exponentiate = TRUE, conf.int = TRUE) %>% # exponentiate and produce CIs mutate(across(where(is.numeric), round, digits = 2)) %>% # round all numeric columns select(term, estimate, conf.low, conf.high)

model

Run multivariate logistic regression model
model_2 <- glm(outcome ~ gender + hospital + age_cat5, family = "binomial", data = linelist_01) %>% tidy(exponentiate = TRUE, conf.int = TRUE) %>% # exponentiate and produce CIs mutate(across(where(is.numeric), round, digits = 2)) %>% # round all numeric columns select(term, estimate, conf.low, conf.high)

model_2

Related Articles

Responses