Simplified data cleaning in R
Afredac R for Fun Club
2024-08-08
Data Cleaning, Cross-tabulation & Summary stats
In this session we will learn: - How to import, view and plot your data - How to change values and create age categories - Cross-tabulation - Summary statistics
Load your packages
pacman::p_load(
rio, # File import
here, # File locator
epikit, # Simplified age categories
skimr, # get overview of data
tidyverse, # data management + ggplot2 graphics
gtsummary, # summary statistics and tests
rstatix, # summary statistics and statistical tests
janitor, # adding totals and percents to tables
scales, # easily convert proportions to percents
flextable # converting tables to pretty images
)
A case study
You have been provided with a dataset of patients who tested for HIV, CD4 and viral load. Their ages and sex were also collected.
- Can you visualise the dataset and tell which data is included?
- What type of data are in the dataset?
- Can you present this data in summary?
Load your data
hiv_data <- import('https://afredac.net/wp-content/uploads/2024/04/hivdata.xlsx') # import the dataset
How can you view the dataset?
# view the dataset, you can do that using glimpse() or view() or just type the dataset's name
glimpse(hiv_data) %>%
head() # let us view only the top 5 data
## Rows: 1,500
## Columns: 6
## $ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, …
## $ age <dbl> 49, 26, 21, 24, 1, 3, 17, 83, 28, 28, 76, 62, 29, 35, 83, 32, 7…
## $ gender <dbl> 1, 1, 2, 1, 2, 2, 1, 2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 1, 2, 2, 1, …
## $ hiv <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, …
## $ CD4 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 643, NA, NA, 481, 643, NA, …
## $ VL <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 790, NA, NA, 972, 1736, NA,…
## id age gender hiv CD4 VL
## 1 1 49 1 0 NA NA
## 2 2 26 1 0 NA NA
## 3 3 21 2 0 NA NA
## 4 4 24 1 0 NA NA
## 5 5 1 2 0 NA NA
## 6 6 3 2 0 NA NA
view(hiv_data)
# Let's try to use a 'skimr' package
# By using the skimr package, you can get a detailed and aesthetically pleasing overview of each of the variables in your dataset. Read more about skimr at its github page.
skim(hiv_data)
Name | hiv_data |
Number of rows | 1500 |
Number of columns | 6 |
_______________________ | |
Column type frequency: | |
numeric | 6 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
id | 0 | 1.0 | 750.50 | 433.16 | 1 | 375.75 | 750.5 | 1125.25 | 1500 | ▇▇▇▇▇ |
age | 0 | 1.0 | 42.20 | 24.33 | 0 | 21.00 | 43.0 | 63.00 | 84 | ▇▇▇▇▇ |
gender | 0 | 1.0 | 1.49 | 0.50 | 1 | 1.00 | 1.0 | 2.00 | 2 | ▇▁▁▁▇ |
hiv | 0 | 1.0 | 0.10 | 0.29 | 0 | 0.00 | 0.0 | 0.00 | 1 | ▇▁▁▁▁ |
CD4 | 1357 | 0.1 | 488.88 | 225.36 | 100 | 299.50 | 485.0 | 678.50 | 899 | ▇▇▇▇▆ |
VL | 1357 | 0.1 | 1979.76 | 985.04 | 558 | 1274.50 | 1849.0 | 2299.50 | 4989 | ▆▇▂▁▁ |
# You will see skim gives you
# - data summary
# - Group variables
# - summary statistics
# But you will see our data needs cleaning for gender and hiv status
1. Changing gender 1 to mean ‘female’ and 2 to mean ‘male’
# By using the dplyr as part of tidverse library we can use the functions mutate() and recode()
hiv_data%>%
mutate(gender =
recode(as.character (gender),
'1' = 'female',
'2' = 'male'))%>%
head() # let us view only the top 5 data
## id age gender hiv CD4 VL
## 1 1 49 female 0 NA NA
## 2 2 26 female 0 NA NA
## 3 3 21 male 0 NA NA
## 4 4 24 female 0 NA NA
## 5 5 1 male 0 NA NA
## 6 6 3 male 0 NA NA
# you will see on your console, the gender has changed to 'male' and 'female'
2. Changing hiv status 1 to mean ‘positive’ and 0 to mean ‘negative’
# As in the previous code, we will use the functions mutate() and recode()
hiv_data%>%
mutate(hiv =
recode(as.character (hiv),
'0' = 'negative',
'1' = 'positive'))%>%
head() # let us view only the top 5 data
## id age gender hiv CD4 VL
## 1 1 49 1 negative NA NA
## 2 2 26 1 negative NA NA
## 3 3 21 2 negative NA NA
## 4 4 24 1 negative NA NA
## 5 5 1 2 negative NA NA
## 6 6 3 2 negative NA NA
3. What if we want age categories for better summarisation of our data
summary(hiv_data$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 21.0 43.0 42.2 63.0 84.0
#With the epikit package, you can use the age_categories() function to easily
#categorize and label numeric columns (note: this function can be applied to non-age numeric variables too).
#As a bonus, the output column is automatically an ordered factor.
# what about if you want age categories?
hiv_data%>%
mutate(
age_cat = age_categories(
age,
breakers = c(0, 5, 10, 15, 20, # break points
30, 40, 50, 60, 70)))%>%
head() # let us view only the top 5 data
## id age gender hiv CD4 VL age_cat
## 1 1 49 1 0 NA NA 40-49
## 2 2 26 1 0 NA NA 20-29
## 3 3 21 2 0 NA NA 20-29
## 4 4 24 1 0 NA NA 20-29
## 5 5 1 2 0 NA NA 0-4
## 6 6 3 2 0 NA NA 0-4
4. Let us combine both and assign it to a new dataset
#now lets combine all our data
hiv_data_cl <-
hiv_data%>%
mutate(gender =
recode(as.character (gender),
'1' = 'female',
'2' = 'male'),
hiv =
recode(as.character (hiv),
'0' = 'negative',
'1' = 'positive'),
age_cat = age_categories(
age,
breakers = c(0, 5, 10, 15, 20, # break points
30, 40, 50, 60, 70)))
glimpse(hiv_data_cl) %>%
head() # let us view only the top 5 data
## Rows: 1,500
## Columns: 7
## $ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
## $ age <dbl> 49, 26, 21, 24, 1, 3, 17, 83, 28, 28, 76, 62, 29, 35, 83, 32, …
## $ gender <chr> "female", "female", "male", "female", "male", "male", "female"…
## $ hiv <chr> "negative", "negative", "negative", "negative", "negative", "n…
## $ CD4 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 643, NA, NA, 481, 643, NA,…
## $ VL <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 790, NA, NA, 972, 1736, NA…
## $ age_cat <fct> 40-49, 20-29, 20-29, 20-29, 0-4, 0-4, 15-19, 70+, 20-29, 20-29…
## id age gender hiv CD4 VL age_cat
## 1 1 49 female negative NA NA 40-49
## 2 2 26 female negative NA NA 20-29
## 3 3 21 male negative NA NA 20-29
## 4 4 24 female negative NA NA 20-29
## 5 5 1 male negative NA NA 0-4
## 6 6 3 male negative NA NA 0-4
# you will see on your console, the gender has changed to 'male' and 'female'
5. Cross tabulation
## Now you will be wondering, how many were positive/negative
## let us use the tabyl() functino from janitor package to make a simple cross-tabulation
hiv_data_cl%>%
tabyl(gender, hiv)%>%
head() # let us view only the top 5 data
## gender negative positive
## female 699 66
## male 658 77
6. Now you are curious to learn the totals, proportions, percents etc
## You can use janitors 'adorn' functions to convert proportions, percents and display them
hiv_data_cl%>%
tabyl(gender, hiv)%>%
adorn_totals(where = 'row')%>%
adorn_percentages(denominator = 'col') %>%
adorn_ns(position = 'front') %>%
adorn_title(
row_name = 'gender',
col_name = 'hiv status'
)
## hiv status
## gender negative positive
## female 699 (0.515106853352985) 66 (0.461538461538462)
## male 658 (0.484893146647015) 77 (0.538461538461538)
## Total 1,357 (1) 143 (1)
7. How do you print a table output?
## You can use flextable package to print a visually appealing table
hiv_data_cl%>%
tabyl(gender, hiv)%>%
adorn_totals(where = 'row')%>%
adorn_percentages(denominator = 'col') %>%
adorn_ns(position = 'front') %>%
adorn_title(
row_name = 'gender',
col_name = 'hiv status',
placement = 'combined') %>%
flextable::flextable()%>%
flextable::autofit()
gender/hiv status | negative | positive |
---|---|---|
female | 699 (0.515106853352985) | 66 (0.461538461538462) |
male | 658 (0.484893146647015) | 77 (0.538461538461538) |
Total | 1,357 (1) | 143 (1) |
7. if you want frequency of age categories of your patients
hiv_data_cl%>%
tabyl(age_cat)
## age_cat n percent
## 0-4 77 0.05133333
## 5-9 89 0.05933333
## 10-14 88 0.05866667
## 15-19 86 0.05733333
## 20-29 191 0.12733333
## 30-39 184 0.12266667
## 40-49 170 0.11333333
## 50-59 168 0.11200000
## 60-69 182 0.12133333
## 70+ 265 0.17666667
Responses