Introduction to Relative Risk
Relative Risk (RR) is one of the most fundamental measures in public health, offering insights into the strength of association between an exposure (like smoking) and an outcome (such as lung cancer). Here, we’ll quickly cover what Relative Risk is, and how to calculate it in R.
Understanding the Mathematical Concept of Relative Risk
Before we delve into the R programming aspects, it’s essential to grasp the mathematical foundation of Relative Risk. RR is a ratio that compares the probability of an event occurring in two different groups: those exposed to a certain factor versus those not exposed.
The Basic Formula
The formula for calculating Relative Risk is relatively straightforward:
[math] \text{RR} = \frac{\text{Re}}{\text{Ru}} [/math]
Here,
- Risk in Exposed Group (Re) is the probability of an event occurring in the group exposed to a certain factor.
- Risk in Unexposed Group (Ru) is the probability of the same event in a group not exposed to that factor.
Now we have to ask ourselves how to calculate Re and Ru. Simply put, the risk in a group is the number of individuals in that group who have the condition of interest divided by all individuals observed in that group.
[math] \text{Re} = \frac{\text{Number of Cases in Exposed Group}}{\text{Total Exposed Group}} [/math]
[math] \text{Ru} = \frac{\text{Number of Cases in Unexposed Group}}{\text{Total Unexposed Group}} [/math]
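The three formulas above can be sketched as a small R helper. The function name `relative_risk` and all of the counts are hypothetical, chosen only to make the arithmetic easy to follow.

```r
# Relative risk from four counts (all numbers below are hypothetical).
relative_risk <- function(cases_exposed, total_exposed,
                          cases_unexposed, total_unexposed) {
  risk_exposed   <- cases_exposed / total_exposed      # Re
  risk_unexposed <- cases_unexposed / total_unexposed  # Ru
  risk_exposed / risk_unexposed                        # RR = Re / Ru
}

# 30 cases among 100 exposed people, 10 cases among 100 unexposed people
rr_example <- relative_risk(30, 100, 10, 100)
print(round(rr_example, 2))
```

With these made-up counts, Re is 0.30 and Ru is 0.10, so the exposed group has three times the risk of the unexposed group.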
Interpreting Relative Risk
- RR = 1: Indicates no association between exposure and outcome.
- RR > 1: Suggests a higher risk of the event with exposure.
- RR < 1: Implies a lower risk with exposure, potentially indicating a protective factor.
Relative Risk provides a quantifiable measure to understand the strength and direction of the association, making it an invaluable tool alongside techniques like Odds Ratios.
Project Introduction
For this scenario, imagine a senior epidemiologist has called you in to figure out if there’s a significant risk of developing two kinds of cancer due to smoking. They hand you a CSV with 3500 entries, including smoking status and a field for diagnosis codes that looks a bit tricky. There was some talk about further data cleaning and organization, but that would take days to sift through manually. No worries though, we’ll get it all sorted out and answer the question of whether there’s a significant risk in a matter of minutes. If you want to follow along, feel free to download the relative risk folder from Cody’s GitHub.
Preparing the Environment: Installing and Loading Necessary R Packages
To begin, we need to set up our R environment by installing and loading the required packages. We’ll use dplyr for data manipulation and epiR for analysis. Both install from CRAN with install.packages() and load with library().
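A minimal setup sketch follows. The check for already-installed packages is an addition for convenience and is not necessarily how the original script did it.

```r
# One-time install from CRAN (skipped for packages that are already installed)
pkgs <- c("dplyr", "epiR")
missing_pkgs <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
if (length(missing_pkgs) > 0) install.packages(missing_pkgs)

# Load both packages for the current session
library(dplyr)
library(epiR)
```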
Data Preparation: Importing Data into R
Our analysis begins by importing the dataset into R. We’ll use a practice file, smoking_survey.csv, which contains each person’s smoking status, diagnosis codes, age, and the zip code in which they reside.
For our work, we really only need to know if the person was a smoker, and if they have a diagnosis code of interest.
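The import step might look like the sketch below. So that it runs on its own, it first writes a tiny stand-in file; the column names `age` and `zip_code`, the format of the diagnosis field, and every value in the rows are assumptions for illustration, not the contents of the real smoking_survey.csv.

```r
# Stand-in for smoking_survey.csv so the sketch runs on its own.
# Column names `age` and `zip_code`, and every value, are hypothetical.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "smoking_status,diagnosis_codes,age,zip_code",
  "smoker,\"C34.1; J44.9\",61,30301",
  "non-smoker,\"I10\",45,30302",
  "smoker,\"E11.9; C34.3\",58,30303"
), csv_path)

# With the real data, set csv_path <- "smoking_survey.csv" instead
survey <- read.csv(csv_path, stringsAsFactors = FALSE)
str(survey)
```

Setting `stringsAsFactors = FALSE` keeps the diagnosis field as plain text, which makes the later string matching simpler.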
Identifying the Variables: Selecting Relevant Variables for Analysis
We focus on two key variables: smoking_status and diagnosis_codes. To simplify our dataset, we’ll keep only these two columns.
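One way to do the column selection with dplyr is sketched here. The data frame is a hypothetical stand-in for the imported survey, and the object name `survey_small` is an assumption.

```r
library(dplyr)

# Hypothetical rows standing in for the imported survey
survey <- data.frame(
  smoking_status  = c("smoker", "non-smoker", "smoker"),
  diagnosis_codes = c("C34.1; J44.9", "I10", "E11.9; C34.3"),
  age             = c(61, 45, 58),
  zip_code        = c("30301", "30302", "30303"),
  stringsAsFactors = FALSE
)

# Keep only the two columns the analysis needs
survey_small <- select(survey, smoking_status, diagnosis_codes)
names(survey_small)
```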
Applying Functions to Data: Using the apply Function to Process Data
Next, we classify each individual’s lung cancer status based on diagnosis codes. We define a function, has_lung_cancer, which returns “yes” if the codes match either of the specific lung cancer diagnoses and “no” otherwise. We then apply this function to every row of our data.
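A sketch of this classification step follows. The two target codes in `lung_codes` are placeholders, the sample rows are hypothetical, and using `sapply()` from the apply family is an assumption about how the function was applied; substitute the two codes your own analysis calls for.

```r
# Hypothetical target codes; substitute the two codes of interest
lung_codes <- c("C34.1", "C34.3")

# Returns "yes" if the free-text field contains either target code, else "no"
has_lung_cancer <- function(codes) {
  found <- vapply(lung_codes,
                  function(code) grepl(code, codes, fixed = TRUE),
                  logical(1))
  if (any(found)) "yes" else "no"
}

# Hypothetical rows standing in for the survey data
survey <- data.frame(
  smoking_status  = c("smoker", "non-smoker", "smoker", "non-smoker"),
  diagnosis_codes = c("C34.1; J44.9", "I10", "E11.9; C34.3", ""),
  stringsAsFactors = FALSE
)

# Apply the function to each person's diagnosis field
survey$lung_cancer <- sapply(survey$diagnosis_codes, has_lung_cancer,
                             USE.NAMES = FALSE)
print(survey)
```

Because `fixed = TRUE` does a plain substring search, a code such as C34.1 would also match a longer code that begins with the same characters, so check the matches against the raw field before relying on them.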
From this point we can do another quick column selection to only keep what we really need. Now we can go about calculating relative risk.
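The second selection, together with the conversion to factors that the table relies on, could be sketched like this. The level name “non-smoker”, the object name `analysis_data`, and the sample rows are assumptions.

```r
library(dplyr)

# Hypothetical rows, already classified for lung cancer
survey <- data.frame(
  smoking_status  = c("smoker", "non-smoker", "smoker", "non-smoker"),
  diagnosis_codes = c("C34.1; J44.9", "I10", "E11.9; C34.3", ""),
  lung_cancer     = c("yes", "no", "yes", "no"),
  stringsAsFactors = FALSE
)

# Keep the two analysis columns and fix the level order explicitly,
# so "smoker" and "yes" come first instead of R's alphabetical default
analysis_data <- survey %>%
  select(smoking_status, lung_cancer) %>%
  mutate(
    smoking_status = factor(smoking_status, levels = c("smoker", "non-smoker")),
    lung_cancer    = factor(lung_cancer, levels = c("yes", "no"))
  )

levels(analysis_data$smoking_status)
levels(analysis_data$lung_cancer)
```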
Calculating Relative Risk: Manual Method
To calculate relative risk, we first create a contingency table using our selected variables:
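A sketch of building that table is below. The counts are hypothetical (100 smokers with 30 cases, 100 non-smokers with 10 cases) and will not reproduce the results from the real file; the level name “non-smoker” is also an assumption.

```r
# Hypothetical data: 100 smokers (30 cases) and 100 non-smokers (10 cases)
analysis_data <- data.frame(
  smoking_status = factor(rep(c("smoker", "non-smoker"), each = 100),
                          levels = c("smoker", "non-smoker")),
  lung_cancer    = factor(c(rep(c("yes", "no"), times = c(30, 70)),
                            rep(c("yes", "no"), times = c(10, 90))),
                          levels = c("yes", "no"))
)

# First argument becomes the rows (exposure), second the columns (outcome)
smoking_table <- table(analysis_data$smoking_status, analysis_data$lung_cancer)
print(smoking_table)
```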
This is done by using the table() function, where smoking_status supplies the rows and lung_cancer supplies the columns. We turned both into factors earlier on in this script to ensure that “smoker” is the first value for smoking_status and “yes” is the first value for lung_cancer. Factors are basically categorical values with an explicit order. Without that order, R sorts the categories alphabetically, the counts land in the wrong cells, and we get a messed up statistic as a result.
Then, we calculate the risk for both exposed (smokers) and unexposed (non-smokers) groups:
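The manual calculation can be sketched as follows. The table is rebuilt from hypothetical counts so the block runs on its own, which means it gives a relative risk of 3 rather than the value obtained from the real survey file; the variable names are assumptions.

```r
# Hypothetical 2x2 table: rows are exposure, columns are outcome
smoking_table <- as.table(matrix(
  c(30, 70,
    10, 90),
  nrow = 2, byrow = TRUE,
  dimnames = list(smoking_status = c("smoker", "non-smoker"),
                  lung_cancer    = c("yes", "no"))
))

# Risk = cases in the group / everyone in the group (the row total)
risk_exposed   <- smoking_table[1, 1] / sum(smoking_table[1, ])  # Re
risk_unexposed <- smoking_table[2, 1] / sum(smoking_table[2, ])  # Ru

relative_risk <- risk_exposed / risk_unexposed
print(round(c(Re = risk_exposed, Ru = risk_unexposed, RR = relative_risk), 2))
```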
Let’s break down the syntax of what we just did. Because we already have our data formatted into factors, we can essentially point to the location in the table instead of having clunky variable references. In these cases, the values within the brackets [X,Y] follow the system where the first value, X, is the row, and the second value, Y, is the column. You might notice that we have a sum() function that takes in only one argument, smoking_table[X,]. Leaving the column position empty asks R to collect every value in row X, and sum() then adds them all up to give the total number of people in that group. After calculating our risk for exposed persons (Re) and risk for unexposed persons (Ru), we divide the first by the second to get our relative risk of ~3.85, which at first blush seems to indicate a much higher risk of cancer for smokers when compared to non-smokers. But let’s confirm this with a more advanced function by way of a package made for epidemiology.
Calculating Relative Risk: Using epiR
epiR simplifies this process. We can just take our table and pass it through the epi.2by2 function.
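A sketch of the call is below, again on a hypothetical table so that it runs without the survey file. The object name `rr_result` is an assumption, and the printed layout depends on the installed epiR version.

```r
library(epiR)

# Hypothetical 2x2 table: rows are exposure, columns are outcome,
# with the exposed group and the positive outcome listed first
smoking_table <- as.table(matrix(
  c(30, 70,
    10, 90),
  nrow = 2, byrow = TRUE,
  dimnames = list(smoking_status = c("smoker", "non-smoker"),
                  lung_cancer    = c("yes", "no"))
))

# Cohort-style counts, with the outcome stored in the columns
rr_result <- epi.2by2(dat = smoking_table,
                      method = "cohort.count",
                      outcome = "as.columns")
print(rr_result)
```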
Here, we’re first telling the function what data we want to use, then indicating that we want to treat the data as counts from a cohort study (method = "cohort.count"), and finally that our outcomes, which is to say the counts of people with and without cancer, are the columns (outcome = "as.columns"). This code not only calculates relative risk but also provides other useful statistics, including confidence intervals, a test of significance, and the Odds Ratio (which is covered in a separate article).
Interpreting the Results: Understanding the Meaning of Relative Risk Values
The relative risk value helps us understand the likelihood of lung cancer in smokers compared to non-smokers. Values greater than 1 suggest a higher risk in the exposed group, while values less than 1 suggest a lower risk. Unsurprisingly given our scenario, our risk is far higher than 1, and using the added functionality we found in the epiR package, we were also able to find these results to be statistically significant.
Conclusion
Here, we covered how to take fairly raw data, including some free-form string data, turn it into a simple table, and calculate the relative risk in R.
Humanities Moment
The featured image for this article is Smoke Rings (1887) by Georgios Jakobides (Greek, 1853-1932). Aside from completing around 200 oil paintings which often fetched high prices at time of creation, Jakobides also created contemporary designs for currency in his native Greece.