Introduction
Odds Ratio (OR) calculations are a cornerstone of public health research, providing insight into the strength of association between an exposure and an outcome. In this tutorial, we’ll explore what an Odds Ratio is, as well as how to calculate it in Python.
Understanding Odds Ratio
Odds Ratio (OR) is a measure used in epidemiology to compare the odds of an event occurring in one group to the odds of it occurring in another group. This is especially useful in case-control studies where you’re comparing the odds of exposure in cases (those with the disease or outcome of interest) to the odds of exposure in controls (those without the disease).
Odds as a Concept
Before diving into odds ratios, it’s important to understand what “odds” are. In a health context, odds are a way of representing the likelihood of an event happening. If the probability of an event happening is P, the odds are calculated as:
[math]{\text{Odds} = \frac{P}{1 - P}}[/math]
Simply put, this equation reads as “Odds can be calculated as the probability of an event occurring [math]{(P)}[/math], divided by the probability of the event not occurring [math]{(1 - P)}[/math]”.
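As a minimal sketch of the formula above, the function below converts a probability into odds. The function name is an illustrative choice, not something from the original tutorial.

```python
def odds_from_probability(p: float) -> float:
    """Convert a probability P into odds, P / (1 - P)."""
    # P = 1 would divide by zero, so it is excluded here.
    if not 0 <= p < 1:
        raise ValueError("p must be in the range [0, 1)")
    return p / (1 - p)


# A probability of 0.2 corresponds to odds of 0.25 (1 to 4).
print(odds_from_probability(0.2))
# A probability of 0.5 corresponds to even odds of 1.
print(odds_from_probability(0.5))
```

Note that odds and probability are close only when P is small; as P approaches 1, the odds grow without bound.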
Calculating the Odds Ratio
Now, consider a case-control study with the following data:
- [math]{A}[/math]: Number of cases (people with the disease) who were exposed to a certain risk factor.
- [math]{B}[/math]: Number of controls (people without the disease) who were exposed to the same risk factor.
- [math]{C}[/math]: Number of cases who were not exposed to the risk factor.
- [math]{D}[/math]: Number of controls who were not exposed to the risk factor.
The odds of exposure among the cases is [math]{A/C}[/math], and the odds of exposure among the controls is [math]{B/D}[/math]. The odds ratio is calculated as:
[math]{\text{Odds Ratio (OR)} = \frac{\text{Odds of exposure in cases}}{\text{Odds of exposure in controls}} = \frac{A/C}{B/D} = \frac{A \times D}{B \times C}}[/math]
Interpretation of the Odds Ratio
- OR = 1: This suggests there is no association between the exposure and the outcome.
- OR > 1: This indicates a positive association, meaning the exposure might increase the odds of the outcome.
- OR < 1: This implies a negative association, suggesting the exposure might decrease the odds of the outcome.
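The formula and the three interpretation cases above can be sketched as follows. The function names and the counts are hypothetical, chosen only to illustrate the arithmetic.

```python
def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """Odds ratio from case-control counts.

    a: exposed cases, b: exposed controls,
    c: unexposed cases, d: unexposed controls.
    """
    if b == 0 or c == 0:
        raise ValueError("B and C must be non-zero to compute (A * D) / (B * C)")
    return (a * d) / (b * c)


def interpret(or_value: float) -> str:
    """Map an odds ratio onto the three cases described above."""
    if or_value > 1:
        return "positive association"
    if or_value < 1:
        return "negative association"
    return "no association"


# Hypothetical counts: 40 exposed cases, 20 exposed controls,
# 60 unexposed cases, 80 unexposed controls.
or_value = odds_ratio(a=40, b=20, c=60, d=80)
print(round(or_value, 2), interpret(or_value))
```

In practice an OR is rarely exactly 1, so a confidence interval or a significance test is what tells you whether an estimate is distinguishable from "no association".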
Limitations
It’s important to remember that an odds ratio overestimates the relative risk when the outcome is common, and only approximates it when the outcome is rare. Also, an odds ratio does not by itself imply causation. For more explanation of the underlying mathematics and mechanics of Odds Ratios, please check out our Epi Explained series! For now, let’s get on with the calculation of Odds Ratios in Python.
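As a quick numeric illustration of the first limitation before moving on, the sketch below compares the odds ratio with the risk ratio for a common and a rare outcome. It assumes cohort-style counts (events out of a known group total), and every number is invented for illustration.

```python
def risk_ratio(exp_events, exp_total, unexp_events, unexp_total):
    """Risk in the exposed group divided by risk in the unexposed group."""
    return (exp_events / exp_total) / (unexp_events / unexp_total)


def cohort_odds_ratio(exp_events, exp_total, unexp_events, unexp_total):
    """Odds of the outcome in the exposed group divided by odds in the unexposed group."""
    exp_odds = exp_events / (exp_total - exp_events)
    unexp_odds = unexp_events / (unexp_total - unexp_events)
    return exp_odds / unexp_odds


# Hypothetical counts: (exposed events, exposed total, unexposed events, unexposed total)
common_outcome = (60, 100, 30, 100)   # outcome affects 60% vs 30%
rare_outcome = (6, 1000, 3, 1000)     # outcome affects 0.6% vs 0.3%

for label, counts in [("common", common_outcome), ("rare", rare_outcome)]:
    rr = risk_ratio(*counts)
    or_value = cohort_odds_ratio(*counts)
    print(f"{label}: risk ratio = {rr:.2f}, odds ratio = {or_value:.2f}")
```

Both scenarios have the same risk ratio of 2, but the odds ratio is noticeably larger when the outcome is common and nearly identical when it is rare.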
Practical Example: Smoking and Cancer
Let’s analyze a dataset to determine if smoking is associated with higher odds of lung cancer. To follow along, please download the folder for Odds Ratio on Cody’s GitHub.
Step 1: Load Python Libraries
As we get started, there are a few packages you’ll need ready. If you haven’t already, set up a virtual environment (venv) for this project by pointing your command line at the project folder and typing the following: python -m venv .venv
This keeps your project management cleaner by limiting the project to the files and packages you bring in explicitly, rather than having every file and package in a single location which, if moved, breaks everything. Next, we’ll want to install some packages for our project: pandas for basic data science functions, numpy for more advanced calculations, and scipy for statistical work. To bring these all in, activate your virtual environment using source .venv/bin/activate (on Windows, .venv\Scripts\activate) and then enter python -m pip install X, where X can be replaced by any of the packages mentioned. With all that out of the way, we can finally get to the proper coding. To start, we load our packages into the script and give them shorter aliases. Note that for scipy we only need to pull in a specific part of the package.
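The import step described above might look like the following sketch. The aliases pd and np are the conventional ones and are assumed here rather than taken from the original script; fisher_exact is the specific part of scipy used later in this tutorial.

```python
# pandas for loading and reshaping the data
import pandas as pd
# numpy for numerical arrays and calculations
import numpy as np
# only the Fisher's exact test function is needed from scipy
from scipy.stats import fisher_exact
```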
Importing and Preparing the Dataset
We’ll use the pandas library to import our dataset (named “smoking_survey.csv”) and select relevant columns, in our case those being smoking_status and diagnosis_codes.
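A minimal sketch of this loading step is shown below. So that it runs on its own, it reads a small in-memory stand-in rather than the real file; the rows, the extra respondent_id column, the status labels, and the semicolon-separated placeholder codes are all assumptions made for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for smoking_survey.csv (all values invented).
sample_csv = io.StringIO(
    "respondent_id,smoking_status,diagnosis_codes\n"
    "1,Smoker,CODE_A;CODE_X\n"
    "2,Smoker,CODE_Y\n"
    "3,Non-smoker,CODE_B\n"
    "4,Non-smoker,CODE_X;CODE_Y\n"
)

# With the real dataset this would be: pd.read_csv("smoking_survey.csv")
survey = pd.read_csv(sample_csv)

# Keep only the two columns needed for the analysis.
survey = survey[["smoking_status", "diagnosis_codes"]]
print(survey.head())
```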
Data Transformation for Analysis
Next, we need to define a quick function to look through our diagnosis_codes column and check whether it contains either of our two ICD codes of interest. In Python, this is fairly straightforward using a combination of if/else, in, and or statements. We can then .apply our function directly to our dataframe’s column of interest, which runs it on every row of that column and gives us a new field called lung_cancer. From this point, we can retain what are now our fields of interest, smoking_status and lung_cancer. We now need only one more adjustment before we can build our contingency table and finish this analysis: converting our variables from string (object) type to Categorical. This helps structure later table work and lets us rely on the order in which the categories appear, which would otherwise be determined by the values themselves rather than by us.
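The transformation described above can be sketched as follows. The two codes are placeholders (the ICD codes actually used are not named in this text), and the function name, the "Yes"/"No" labels, the smoking status labels, and the sample rows are likewise assumptions.

```python
import pandas as pd

# Placeholder codes standing in for the two ICD codes of interest.
CODE_A = "CODE_A"
CODE_B = "CODE_B"


def flag_lung_cancer(codes: str) -> str:
    """Return "Yes" if either code of interest appears in the string."""
    # Substring matching; assumes codes is a string with no missing values.
    if CODE_A in codes or CODE_B in codes:
        return "Yes"
    else:
        return "No"


# Hypothetical rows for illustration.
survey = pd.DataFrame({
    "smoking_status": ["Smoker", "Smoker", "Non-smoker", "Non-smoker"],
    "diagnosis_codes": ["CODE_A;CODE_X", "CODE_Y", "CODE_B", "CODE_X;CODE_Y"],
})

# Run the function on every row of the column to build the new field.
survey["lung_cancer"] = survey["diagnosis_codes"].apply(flag_lung_cancer)

# Retain the fields of interest (copy to avoid modifying a slice).
survey = survey[["smoking_status", "lung_cancer"]].copy()

# Convert to Categorical with an explicit category order.
survey["smoking_status"] = pd.Categorical(
    survey["smoking_status"], categories=["Smoker", "Non-smoker"]
)
survey["lung_cancer"] = pd.Categorical(
    survey["lung_cancer"], categories=["Yes", "No"]
)
print(survey)
```

Listing the categories explicitly is what fixes their order in later tables.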
Creating a Contingency Table
Using the pandas.crosstab function, we create a contingency table to visualize the distribution of smoking status and lung cancer occurrence.
This creates a table that cross-tabulates the two smoking status categories against whether or not cancer had developed.
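A minimal sketch of the crosstab step, using a small hypothetical dataframe in place of the real survey data (the labels and counts are invented for illustration):

```python
import pandas as pd

# Hypothetical data: 3 smokers with cancer, 1 without;
# 1 non-smoker with cancer, 3 without.
survey = pd.DataFrame({
    "smoking_status": pd.Categorical(
        ["Smoker"] * 4 + ["Non-smoker"] * 4,
        categories=["Smoker", "Non-smoker"],
    ),
    "lung_cancer": pd.Categorical(
        ["Yes", "Yes", "Yes", "No", "Yes", "No", "No", "No"],
        categories=["Yes", "No"],
    ),
})

# Rows are smoking status, columns are lung cancer status.
table = pd.crosstab(survey["smoking_status"], survey["lung_cancer"])
print(table)
```

Putting smoking status on the rows is a choice made for this sketch; the odds ratio of a 2x2 table is the same if the table is transposed.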
Calculating the Odds Ratio
We employ the fisher_exact function from scipy.stats to calculate the Odds Ratio and obtain a p-value for statistical significance.
Here, we’re assigning two variables, odds_ratio and p_value, in a single line from the two values that are sent back by the fisher_exact function. It’s worth mentioning that in Python, functions can return multiple values, so if you’re not familiar with a function, printing its result before assigning it to variables can show you what it returns.
Printing these values, we can see an OR of 5.77 and a p-value of 1.4131903723243655e-95, which is far below our significance threshold of 0.05, meaning the association is statistically significant.
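The call described above can be sketched on a small hypothetical 2x2 table. The counts are invented, so the output will not match the values reported for the real dataset.

```python
import numpy as np
from scipy.stats import fisher_exact

# Hypothetical counts. Rows: smokers, non-smokers.
# Columns: lung cancer, no lung cancer.
table = np.array([
    [30, 70],
    [10, 90],
])

# fisher_exact returns two values, unpacked here in a single line.
odds_ratio, p_value = fisher_exact(table)

print(f"Odds ratio: {odds_ratio:.2f}")
print(f"p-value: {p_value:.4g}")
```

The odds ratio returned here is the sample estimate, (30 × 90) / (70 × 10), which is the same cross-product formula introduced earlier.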
Conclusion
This tutorial provided an understanding of Odds Ratios and demonstrated how to calculate them in Python using real-world data. This skill is essential for public health researchers and epidemiologists to assess associations in epidemiological studies.
Humanities Moment
The featured image for this article is Ships Riding on the Seine at Rouen (1872-1873), by Claude Monet (French, 1840-1926). Monet is perhaps the most famous French Impressionist, and the movement itself takes its name from his painting Impression, Sunrise. Unlike many artists of his time, Monet was wildly popular during both his lifetime and throughout the 1900s, with caricatures, portraits and landscapes all being featured in museums and exhibits worldwide.