Simplified data cleaning in R

Data Cleaning, Cross-tabulation & Summary stats

In this session we will learn: - How to import, view and plot your data - How to change values and create age categories - Cross-tabulation - Summary statistics

Load your packages

pacman::p_load(
  rio,          # File import
  here,         # File locator
  epikit,       # Simplified age categories
  skimr,        # get overview of data
  tidyverse,    # data management + ggplot2 graphics 
  gtsummary,    # summary statistics and tests
  rstatix,      # summary statistics and statistical tests
  janitor,      # adding totals and percents to tables
  scales,       # easily convert proportions to percents  
  flextable     # converting tables to pretty images
)

A case study

You have been provided with a dataset of patients who tested for HIV, CD4 and viral load. Their ages and sex were also collected.

  1. Can you visualise the dataset and tell which data is included?
  2. What type of data are in the dataset?
  3. Can you present this data in summary?

Load your data

hiv_data <- import('https://afredac.net/wp-content/uploads/2024/04/hivdata.xlsx') # import the dataset

How can you view the dataset?

# view the dataset, you can do that using glimpse() or view() or just type the dataset's name

glimpse(hiv_data) %>%
  head() # let us view only the top 5 data
## Rows: 1,500
## Columns: 6
## $ id     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, …
## $ age    <dbl> 49, 26, 21, 24, 1, 3, 17, 83, 28, 28, 76, 62, 29, 35, 83, 32, 7…
## $ gender <dbl> 1, 1, 2, 1, 2, 2, 1, 2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 1, 2, 2, 1, …
## $ hiv    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, …
## $ CD4    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 643, NA, NA, 481, 643, NA, …
## $ VL     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 790, NA, NA, 972, 1736, NA,…
##   id age gender hiv CD4 VL
## 1  1  49      1   0  NA NA
## 2  2  26      1   0  NA NA
## 3  3  21      2   0  NA NA
## 4  4  24      1   0  NA NA
## 5  5   1      2   0  NA NA
## 6  6   3      2   0  NA NA
view(hiv_data)

# Let's try to use a 'skimr' package

# By using the skimr package, you can get a detailed and aesthetically pleasing overview of each of the variables in your dataset. Read more about skimr at its github page.

skim(hiv_data)
Data summary
Name hiv_data
Number of rows 1500
Number of columns 6
_______________________
Column type frequency:
numeric 6
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1.0 750.50 433.16 1 375.75 750.5 1125.25 1500 ▇▇▇▇▇
age 0 1.0 42.20 24.33 0 21.00 43.0 63.00 84 ▇▇▇▇▇
gender 0 1.0 1.49 0.50 1 1.00 1.0 2.00 2 ▇▁▁▁▇
hiv 0 1.0 0.10 0.29 0 0.00 0.0 0.00 1 ▇▁▁▁▁
CD4 1357 0.1 488.88 225.36 100 299.50 485.0 678.50 899 ▇▇▇▇▆
VL 1357 0.1 1979.76 985.04 558 1274.50 1849.0 2299.50 4989 ▆▇▂▁▁
# You will see skim gives you
   # - data summary
   # - Group variables
   # - summary statistics

# But you will see our data needs cleaning for gender and hiv status 

1. Changing gender 1 to mean ‘female’ and 2 to mean ‘male’

# By using the dplyr as part of tidverse library we can use the functions mutate() and recode()

hiv_data%>%
  mutate(gender = 
           recode(as.character (gender),
                  '1' = 'female',
                  '2' = 'male'))%>%
  head() # let us view only the top 5 data
##   id age gender hiv CD4 VL
## 1  1  49 female   0  NA NA
## 2  2  26 female   0  NA NA
## 3  3  21   male   0  NA NA
## 4  4  24 female   0  NA NA
## 5  5   1   male   0  NA NA
## 6  6   3   male   0  NA NA
# you will see on your console, the gender has changed to 'male' and 'female'

2. Changing hiv status 1 to mean ‘positive’ and 0 to mean ‘negative’

# As in the previous code, we will use the functions mutate() and recode()

hiv_data%>%
  mutate(hiv = 
           recode(as.character (hiv),
                      '0' = 'negative',
                      '1' = 'positive'))%>%
  head() # let us view only the top 5 data
##   id age gender      hiv CD4 VL
## 1  1  49      1 negative  NA NA
## 2  2  26      1 negative  NA NA
## 3  3  21      2 negative  NA NA
## 4  4  24      1 negative  NA NA
## 5  5   1      2 negative  NA NA
## 6  6   3      2 negative  NA NA

3. What if we want age categories for better summarisation of our data

summary(hiv_data$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    21.0    43.0    42.2    63.0    84.0
#With the epikit package, you can use the age_categories() function to easily 
#categorize and label numeric columns (note: this function can be applied to non-age numeric variables too). 
#As a bonus, the output column is automatically an ordered factor.

# what about if you want age categories?


hiv_data%>%
  mutate(
    age_cat = age_categories(
      age,
      breakers = c(0, 5, 10, 15, 20,        # break points
                   30, 40, 50, 60, 70)))%>%
  head() # let us view only the top 5 data
##   id age gender hiv CD4 VL age_cat
## 1  1  49      1   0  NA NA   40-49
## 2  2  26      1   0  NA NA   20-29
## 3  3  21      2   0  NA NA   20-29
## 4  4  24      1   0  NA NA   20-29
## 5  5   1      2   0  NA NA     0-4
## 6  6   3      2   0  NA NA     0-4

4. Let us combine both and assign it to a new dataset

#now lets combine all our data

hiv_data_cl <- 
  hiv_data%>%
  mutate(gender = 
           recode(as.character (gender),
                  '1' = 'female',
                  '2' = 'male'),
         hiv = 
           recode(as.character (hiv),
                  '0' = 'negative',
                  '1' = 'positive'),
         
         age_cat = age_categories(
           age,
           breakers = c(0, 5, 10, 15, 20,        # break points
                        30, 40, 50, 60, 70)))



glimpse(hiv_data_cl) %>%
  head() # let us view only the top 5 data
## Rows: 1,500
## Columns: 7
## $ id      <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
## $ age     <dbl> 49, 26, 21, 24, 1, 3, 17, 83, 28, 28, 76, 62, 29, 35, 83, 32, …
## $ gender  <chr> "female", "female", "male", "female", "male", "male", "female"…
## $ hiv     <chr> "negative", "negative", "negative", "negative", "negative", "n…
## $ CD4     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 643, NA, NA, 481, 643, NA,…
## $ VL      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 790, NA, NA, 972, 1736, NA…
## $ age_cat <fct> 40-49, 20-29, 20-29, 20-29, 0-4, 0-4, 15-19, 70+, 20-29, 20-29…
##   id age gender      hiv CD4 VL age_cat
## 1  1  49 female negative  NA NA   40-49
## 2  2  26 female negative  NA NA   20-29
## 3  3  21   male negative  NA NA   20-29
## 4  4  24 female negative  NA NA   20-29
## 5  5   1   male negative  NA NA     0-4
## 6  6   3   male negative  NA NA     0-4
# you will see on your console, the gender has changed to 'male' and 'female'

5. Cross tabulation

## Now you will be wondering, how many were positive/negative

## let us use the tabyl() functino from janitor package to make a simple cross-tabulation

hiv_data_cl%>%
  tabyl(gender, hiv)%>%
  head() # let us view only the top 5 data
##  gender negative positive
##  female      699       66
##    male      658       77

6. Now you are curious to learn the totals, proportions, percents etc

## You can use janitors 'adorn' functions to convert proportions, percents and display them

hiv_data_cl%>%
  tabyl(gender, hiv)%>%
  adorn_totals(where = 'row')%>%
  adorn_percentages(denominator = 'col') %>%
  adorn_ns(position = 'front') %>%
  adorn_title(
    row_name = 'gender',
    col_name = 'hiv status'
  )
##                        hiv status                        
##  gender                  negative                positive
##  female   699 (0.515106853352985)  66 (0.461538461538462)
##    male   658 (0.484893146647015)  77 (0.538461538461538)
##   Total 1,357                 (1) 143                 (1)

7. How do you print a table output?

## You can use flextable package to print a visually appealing table

hiv_data_cl%>%
  tabyl(gender, hiv)%>%
  adorn_totals(where = 'row')%>%
  adorn_percentages(denominator = 'col') %>%
  adorn_ns(position = 'front') %>%
  adorn_title(
    row_name = 'gender',
    col_name = 'hiv status',
    placement = 'combined') %>%
  flextable::flextable()%>%
  flextable::autofit()

gender/hiv status

negative

positive

female

699 (0.515106853352985)

66 (0.461538461538462)

male

658 (0.484893146647015)

77 (0.538461538461538)

Total

1,357 (1)

143 (1)

7. if you want frequency of age categories of your patients

hiv_data_cl%>%
  tabyl(age_cat)
##  age_cat   n    percent
##      0-4  77 0.05133333
##      5-9  89 0.05933333
##    10-14  88 0.05866667
##    15-19  86 0.05733333
##    20-29 191 0.12733333
##    30-39 184 0.12266667
##    40-49 170 0.11333333
##    50-59 168 0.11200000
##    60-69 182 0.12133333
##      70+ 265 0.17666667

Challenge. Can you try to put percents and print a table output for the age categories?

Related Articles

Responses

Your email address will not be published. Required fields are marked *