Prepare NHANES diet recall data
This tutorial shows how to access NHANES dietary data for downstream use in Polyphenol Estimator. The R scripts below walk through downloading one cycle of NHANES dietary data directly from the CDC website and performing several diet cleaning steps. Users can directly download this code here and generate the same outputs by running the R code in RStudio.
About NHANES
NHANES is a nationally representative sample of non-institutionalized individuals in the United States. NHANES uses the Food and Nutrient Database for Dietary Studies (FNDDS) to generate nutrient intakes from food composition data. FNDDS is released every two years in conjunction with the What We Eat in America (WWEIA), NHANES dietary data release. For each new version of FNDDS, foods/beverages, portions, and nutrient values are reviewed and updated.
Users interested in analyzing multiple cycles of NHANES dietary data should use crosswalks to harmonize changes in FNDDS foods and beverages across cycles. Crosswalk information is available in the ‘Documentation’ file for each FNDDS release. Visit the USDA Food Survey Research Group for more information here.
INPUT
- 2021 - 2023 Demographic Data
- 2021 - 2023 Dietary Interview - Itemized foods from two separate diet recalls
- Dietary Interview Technical Support File - Food Codes
- Dietary Totals - Total Nutrient Intakes, two separate recalls
OUTPUT
- NHANES_2021_2023_diet_adults.csv.bz2 - NHANES 2021-2023 ingredient-level diet data, first and second recalls combined, restricted to adults >=20 years old, with nutrient outliers removed (kcal, fat, protein, vitamin C, and beta-carotene). Each participant has two complete recalls.
- (Optional) NHANES_2021_2023_demographics_adults.csv.bz2 - NHANES 2021-2023 demographic data
SCRIPTS
Load packages
# dplyr: helps with data wrangling
# haven: loads SAS files
# vroom: writes the compressed output file in the export step
required = c("dplyr", "haven", "vroom")
# Loop to install and load packages
for (pkg in required) {
  # This will install the package if you do not already have it
  if (!requireNamespace(pkg, quietly = TRUE)) install.packages(pkg)
  # This will load the package so it's active for this R session
  library(pkg, character.only = TRUE)
}
Extract 2021-2023 NHANES Data from CDC website
To analyze 2021-2023 NHANES dietary data, we need to pull down the relevant files for our cycle of interest. An array of files from the 2021-2023 NHANES cycle are available from the CDC here, but for our purposes, we will pull down several key files:
- Demographic Data - We will use this data to filter the number of individuals we analyze.
- Dietary Interview - “Individual Foods, First Day” and “Individual Foods, Second Day” - Diet data is stored in separate files which we will combine.
- Dietary Interview Technical Support File - Food Codes - Contains three columns (food codes, a short food description, and a long food description)
- Dietary Totals - “Total Nutrient Intakes, First Day” and “Total Nutrient Intakes, Second Day” - Data is stored in separate files which we will combine. We will use selected total nutrients to quality-check (QC) our dietary data.
# 1. Demographic data
demo_data = read_xpt('https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2021/DataFiles/DEMO_L.xpt')
# 2. Dietary Interview Data
recall1 = read_xpt("https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2021/DataFiles/DR1IFF_L.xpt")
recall2 = read_xpt("https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2021/DataFiles/DR2IFF_L.xpt")
# 3. Food Codes
diet_codes = read_xpt("https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2021/DataFiles/DRXFCD_L.xpt")
# 4. Dietary Totals
tot_recall1 = read_xpt("https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2021/DataFiles/DR1TOT_L.xpt")
tot_recall2 = read_xpt("https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2021/DataFiles/DR2TOT_L.xpt")
Checkpoint: How many participants (SEQN) are in our starting files?
## 2021-2023 demographic file, n = 11933
## 2021-2023 diet recall 1, n = 6751
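The checkpoint counts above can be reproduced by counting unique SEQN values. The following is a minimal sketch on toy data; `demo_toy`, `recall_toy`, and the helper `count_participants()` are hypothetical names used only for illustration, and in the tutorial the same call would be applied to `demo_data` and `recall1`.

```r
library(dplyr)

# Toy stand-ins for demo_data and recall1 (assumed shapes, not real NHANES data)
demo_toy   <- data.frame(SEQN = c(1, 2, 3, 4))
recall_toy <- data.frame(SEQN     = c(1, 1, 1, 2, 2, 4),
                         DR1IFDCD = c(11, 12, 13, 11, 14, 15))

# Hypothetical helper: number of unique participants in a data frame.
# The food files have one row per food item, so rows must not be counted directly.
count_participants <- function(df) n_distinct(df$SEQN)

n_demo   <- count_participants(demo_toy)
n_recall <- count_participants(recall_toy)

cat("demographic file, n =", n_demo, "\n")
cat("diet recall 1, n =", n_recall, "\n")
```

Counting distinct SEQN values rather than rows matters here because each participant contributes many food rows to a recall file.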
Clean Column Names
Many columns in these NHANES recall files are labelled with “DR1” or “DR2”, which denote the recall they came from. Since we want to analyze both recalls, it makes sense to convert these labels so they are no longer specific to the recall and we can merge the dataframes together (e.g., DR1DBIH and DR2DBIH turn into DRXDBIH). To make sure we still know which recall the data came from, we can create a new column specifying this information.
Individual Food Files
recall1_clean = recall1 %>%
  # replace column recall labels
  rename_with(~ sub("^DR1", "DRX", .x), starts_with("DR1")) %>%
  # Create Recall Column
  mutate(RecallNo = 1)
# Repeat for Recall 2
recall2_clean = recall2 %>%
  rename_with(~ sub("^DR2", "DRX", .x), starts_with("DR2")) %>%
  mutate(RecallNo = 2)
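Before merging, it can be worth confirming that the two renamed files now share the same column names. The sketch below uses toy data frames (`r1` and `r2` are hypothetical stand-ins for `recall1` and `recall2`) and applies the same renaming pattern used in the tutorial.

```r
library(dplyr)

# Toy stand-ins with recall-specific prefixes (not real NHANES data)
r1 <- data.frame(SEQN = 1, DR1IFDCD = 11, DR1IGRMS = 100, WTDRD1 = 1.5)
r2 <- data.frame(SEQN = 1, DR2IFDCD = 12, DR2IGRMS = 50,  WTDRD1 = 1.5)

# Same renaming pattern as in the tutorial code
r1c <- r1 %>% rename_with(~ sub("^DR1", "DRX", .x), starts_with("DR1"))
r2c <- r2 %>% rename_with(~ sub("^DR2", "DRX", .x), starts_with("DR2"))

# Columns present in one file but not the other; both should be empty
only_in_1 <- setdiff(names(r1c), names(r2c))
only_in_2 <- setdiff(names(r2c), names(r1c))

print(names(r1c))
print(only_in_1)
print(only_in_2)
```

If either `setdiff()` result were non-empty, the merged dataframe would contain columns filled with missing values for one of the recalls.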
Now that the columns within our files have been cleaned, we can merge them together! We can also merge in food descriptions from our diet_codes dataframe.
recall_merge_clean = full_join(recall1_clean, recall2_clean) %>%
  # Merge food descriptions into recall data. The food code column is named differently
  # in the two files (DRXIFDCD vs. DRXFDCD), so the key pair is given explicitly.
  left_join(diet_codes, by = c("DRXIFDCD" = "DRXFDCD")) %>%
  # Move RecallNo Column up
  relocate(RecallNo, .after = SEQN) %>%
  # Move Food Descriptions after food code
  relocate(DRXFCSD, DRXFCLD, .after = DRXIFDCD)
## Joining with `by = join_by(SEQN, WTDRD1, WTDR2D, DRXILINE, DRXDRSTZ, DRXEXMER,
## DRABF, DRDINT, DRXDBIH, DRXDAY, DRXLANG, DRXCCMNM, DRXCCMTX, DRX_020, DRX_030Z,
## DRXFS, DRX_040Z, DRXIFDCD, DRXIGRMS, DRXIKCAL, DRXIPROT, DRXICARB, DRXISUGR,
## DRXIFIBE, DRXITFAT, DRXISFAT, DRXIMFAT, DRXIPFAT, DRXICHOL, DRXIATOC, DRXIATOA,
## DRXIRET, DRXIVARA, DRXIACAR, DRXIBCAR, DRXICRYP, DRXILYCO, DRXILZ, DRXIVB1,
## DRXIVB2, DRXINIAC, DRXIVB6, DRXIFOLA, DRXIFA, DRXIFF, DRXIFDFE, DRXICHL,
## DRXIVB12, DRXIB12A, DRXIVC, DRXIVD, DRXIVK, DRXICALC, DRXIPHOS, DRXIMAGN,
## DRXIIRON, DRXIZINC, DRXICOPP, DRXISODI, DRXIPOTA, DRXISELE, DRXICAFF, DRXITHEO,
## DRXIALCO, DRXIMOIS, DRXIS040, DRXIS060, DRXIS080, DRXIS100, DRXIS120, DRXIS140,
## DRXIS160, DRXIS180, DRXIM161, DRXIM181, DRXIM201, DRXIM221, DRXIP182, DRXIP183,
## DRXIP184, DRXIP204, DRXIP205, DRXIP225, DRXIP226, RecallNo)`
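The join message above lists every column because `full_join()` was called without `by`, so all shared columns act as keys. Since RecallNo differs between the two files, no rows should match, and the result should behave like stacking the files. The sketch below checks that on toy data; `r1` and `r2` are hypothetical stand-ins, and `bind_rows()` is shown only as a comparison, not as the tutorial's method.

```r
library(dplyr)

# Toy stand-ins for the cleaned recall files (not real NHANES data)
r1 <- data.frame(SEQN = c(1, 1, 2), DRXIFDCD = c(11, 12, 11), RecallNo = 1)
r2 <- data.frame(SEQN = c(1, 2),    DRXIFDCD = c(11, 13),     RecallNo = 2)

# Join on all shared columns (message suppressed here for brevity)
joined  <- suppressMessages(full_join(r1, r2))
# Plain row-stacking for comparison
stacked <- bind_rows(r1, r2)

# If no rows matched, the joined row count equals the sum of the inputs
same_rows <- nrow(joined) == nrow(r1) + nrow(r2)
print(same_rows)
print(table(joined$RecallNo))
```

A row count smaller than the sum of the inputs would indicate that some rows matched across recalls, which is worth investigating before continuing.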
Total Files
We will repeat this column cleaning process for total nutrient intake files.
tot_recall1_clean = tot_recall1 %>%
  # replace column recall labels
  rename_with(~ sub("^DR1", "DRX", .x), starts_with("DR1")) %>%
  # Create Recall Column
  mutate(RecallNo = 1)
# Repeat for Recall 2
tot_recall2_clean = tot_recall2 %>%
  rename_with(~ sub("^DR2", "DRX", .x), starts_with("DR2")) %>%
  mutate(RecallNo = 2)
Now that the names are cleaned, we can merge our recalls together and pull out our columns of interest: the total nutrients used to flag outliers during dietary cleaning (kcal, protein, fat, vitamin C, and beta-carotene).
qc_nutrients = c("DRXTKCAL", # kcal
"DRXTPROT", # protein
"DRXTTFAT", # Fat
"DRXTVC", # Vitamin C
"DRXTBCAR") #Beta Carotene
# Dataframe of the selected total nutrients needed for nutrient QC
tot_recall_merge_clean = full_join(tot_recall1_clean, tot_recall2_clean) %>%
  # Move RecallNo Column up
  relocate(RecallNo, .after = SEQN) %>%
  # Keep only the columns needed for nutrient outlier determination
  select(c(SEQN, RecallNo, all_of(qc_nutrients)))
## Joining with `by = join_by(SEQN, WTDRD1, WTDR2D, DRXDRSTZ, DRXEXMER, DRABF,
## DRDINT, DRXDBIH, DRXDAY, DRXLANG, DRXMRESP, DRXHELP, DRXSTY, DRXSKY, DRXTNUMF,
## DRXTKCAL, DRXTPROT, DRXTCARB, DRXTSUGR, DRXTFIBE, DRXTTFAT, DRXTSFAT, DRXTMFAT,
## DRXTPFAT, DRXTCHOL, DRXTATOC, DRXTATOA, DRXTRET, DRXTVARA, DRXTACAR, DRXTBCAR,
## DRXTCRYP, DRXTLYCO, DRXTLZ, DRXTVB1, DRXTVB2, DRXTNIAC, DRXTVB6, DRXTFOLA,
## DRXTFA, DRXTFF, DRXTFDFE, DRXTCHL, DRXTVB12, DRXTB12A, DRXTVC, DRXTVD, DRXTVK,
## DRXTCALC, DRXTPHOS, DRXTMAGN, DRXTIRON, DRXTZINC, DRXTCOPP, DRXTSODI, DRXTPOTA,
## DRXTSELE, DRXTCAFF, DRXTTHEO, DRXTALCO, DRXTMOIS, DRXTS040, DRXTS060, DRXTS080,
## DRXTS100, DRXTS120, DRXTS140, DRXTS160, DRXTS180, DRXTM161, DRXTM181, DRXTM201,
## DRXTM221, DRXTP182, DRXTP183, DRXTP184, DRXTP204, DRXTP205, DRXTP225, DRXTP226,
## DRX_300, DRX_320Z, DRX_330Z, DRXBWATZ, DRXTWSZ, RecallNo)`
Data Filtering
With the recalls now cleaned and merged into a single dataframe, we will apply several filtering steps to restrict the participants we analyze.
- Include adults aged 20+ years old. We will leverage the demographic file (RIDAGEYR) to do this.
- Include participants who completed two recalls (DRDINT).
- Include participants who had dietary recalls that passed quality control (specific by recall: DR1DRSTZ, DR2DRSTZ).
  - 1, Reliable and met the minimum criteria
  - 2, Not reliable or not met the minimum criteria
  - 4, Reported consuming breast-milk
  - 5, Not done
  - ., Missing
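Before filtering on recall status, it may help to see how many participants fall under each status code. The sketch below tabulates the codes per recall on toy data; `status_toy` is a hypothetical stand-in for `recall_merge_clean`, and the column name `n_participants` is chosen here for illustration.

```r
library(dplyr)

# Toy stand-in: one row per participant, recall, and status (not real NHANES data)
status_toy <- data.frame(
  SEQN     = c(1, 1, 2, 2, 3, 3),
  RecallNo = c(1, 2, 1, 2, 1, 2),
  DRXDRSTZ = c(1, 1, 1, 2, 1, 5)
)

status_counts <- status_toy %>%
  # Collapse food-level rows to one row per participant and recall
  distinct(SEQN, RecallNo, DRXDRSTZ) %>%
  # Count participants for each recall and status code
  count(RecallNo, DRXDRSTZ, name = "n_participants")

print(status_counts)
```

The `distinct()` step matters with the real food-level files, where each participant has many rows per recall.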
# Filtering for age
demo_adults = demo_data %>%
  filter(RIDAGEYR >= 20)
# Let us apply our filters
diet_data_filtered = recall_merge_clean %>%
  # Include only adults 20+
  filter(SEQN %in% demo_adults$SEQN) %>%
  # Include only participants who completed two recalls
  filter(DRDINT == 2) %>%
  # Include recalls that passed QC
  filter(DRXDRSTZ == 1) %>%
  # Double-check that we have two recalls per participant:
  # a participant with only one recall that passed QC would otherwise slip through
  group_by(SEQN) %>%
  filter(n_distinct(RecallNo) == 2) %>%
  ungroup()
Checkpoint: How many participants (SEQN) remain after filtering?
## Participants post-filtering, n = 4284
Remove Nutrient Outliers
Nutrient cleaning cut points, which are based on NHANES data, are provided in Appendix A of the National Cancer Institute’s (NCI) “Reviewing and Cleaning ASA24 Data”.
Note: What this script provides is not an exhaustive list of cleaning steps for 24-hour recall data. Users may want to filter for specific populations or perform other diet cleaning steps (e.g. Portion Outliers).
| Sex | Nutrient | Minimum | Maximum |
|---|---|---|---|
| Female | Energy (kcal) | 600 | 4400 |
| Female | Protein (g) | 10 | 180 |
| Female | Fat (g) | 15 | 185 |
| Female | Vitamin C (mg) | 5 | 350 |
| Female | Beta-carotene (mcg) | 15 | 7100 |
| Male | Energy (kcal) | 650 | 5700 |
| Male | Protein (g) | 25 | 240 |
| Male | Fat (g) | 25 | 230 |
| Male | Vitamin C (mg) | 5 | 400 |
| Male | Beta-carotene (mcg) | 15 | 8200 |
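One way to keep the cut points in a single place is to store the table as a lookup dataframe and join it to the totals by sex. This is a sketch of an alternative layout, not the approach the tutorial code uses; `cut_points`, `totals_toy`, and the `*_min`/`*_max` column names are hypothetical. The sex coding (1 = male, 2 = female) follows the RIAGENDR values used in the tutorial code.

```r
library(dplyr)

# Cut points from the table above; RIAGENDR: 1 = male, 2 = female
cut_points <- data.frame(
  RIAGENDR = c(2, 1),
  kcal_min = c(600, 650), kcal_max = c(4400, 5700),
  prot_min = c(10, 25),   prot_max = c(180, 240),
  fat_min  = c(15, 25),   fat_max  = c(185, 230),
  vc_min   = c(5, 5),     vc_max   = c(350, 400),
  bcar_min = c(15, 15),   bcar_max = c(7100, 8200)
)

# Toy daily totals (not real NHANES data); participant 2 exceeds the male kcal maximum
totals_toy <- data.frame(
  SEQN     = c(1, 2, 3),
  RecallNo = 1,
  RIAGENDR = c(2, 1, 1),
  DRXTKCAL = c(2000, 6000, 2500),
  DRXTPROT = c(70, 100, 90),
  DRXTTFAT = c(60, 80, 85),
  DRXTVC   = c(80, 90, 60),
  DRXTBCAR = c(2000, 3000, 1500)
)

passed <- totals_toy %>%
  # Attach the sex-specific limits; rows with a missing or unknown sex are dropped
  inner_join(cut_points, by = "RIAGENDR") %>%
  # Keep recalls whose totals fall inside every range (inclusive bounds)
  filter(
    DRXTKCAL >= kcal_min, DRXTKCAL <= kcal_max,
    DRXTPROT >= prot_min, DRXTPROT <= prot_max,
    DRXTTFAT >= fat_min,  DRXTTFAT <= fat_max,
    DRXTVC   >= vc_min,   DRXTVC   <= vc_max,
    DRXTBCAR >= bcar_min, DRXTBCAR <= bcar_max
  ) %>%
  select(SEQN, RecallNo)

print(passed)
```

With this layout, updating a cut point means editing one cell of `cut_points` rather than a condition inside the filter.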
# Let's reduce the number of columns for merging into other dataframes
metadata = demo_adults %>%
  select(c(SEQN, RIDAGEYR, RIAGENDR)) %>%
  left_join(tot_recall_merge_clean, by = "SEQN")
diet_data_filtered_QC = left_join(diet_data_filtered, metadata, by = c("SEQN", "RecallNo")) %>%
  relocate(RIDAGEYR, RIAGENDR, DRXTKCAL, DRXTPROT, DRXTTFAT, DRXTVC, DRXTBCAR, .after = RecallNo) %>%
  # Apply Nutrient QC Filters (RIAGENDR is numeric: 1 = male, 2 = female)
  filter(
    case_when(
      RIAGENDR == 2 ~ # Female
        DRXTKCAL >= 600 & DRXTKCAL <= 4400 &
        DRXTPROT >= 10 & DRXTPROT <= 180 &
        DRXTTFAT >= 15 & DRXTTFAT <= 185 &
        DRXTVC >= 5 & DRXTVC <= 350 &
        DRXTBCAR >= 15 & DRXTBCAR <= 7100,
      RIAGENDR == 1 ~ # Male
        DRXTKCAL >= 650 & DRXTKCAL <= 5700 &
        DRXTPROT >= 25 & DRXTPROT <= 240 &
        DRXTTFAT >= 25 & DRXTTFAT <= 230 &
        DRXTVC >= 5 & DRXTVC <= 400 &
        DRXTBCAR >= 15 & DRXTBCAR <= 8200,
      # removes anyone with a missing RIAGENDR
      TRUE ~ FALSE)) %>%
  # Remove metadata columns
  select(-c(RIDAGEYR, RIAGENDR, DRXTKCAL, DRXTPROT, DRXTTFAT, DRXTVC, DRXTBCAR))
Checkpoint: How many participants (SEQN) remain after nutrient outlier control?
## Participants post-QC, n = 3976
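The nutrient filter is applied to each recall separately, so in principle a participant could keep one recall and lose the other. The sketch below shows one way to check for this on toy data; `qc_toy` is a hypothetical stand-in for `diet_data_filtered_QC`, and whether such cases occur in the real data is not something this tutorial reports.

```r
library(dplyr)

# Toy stand-in for the post-QC food-level data (not real NHANES data)
qc_toy <- data.frame(
  SEQN     = c(1, 1, 1, 2, 2, 3),
  RecallNo = c(1, 1, 2, 1, 1, 2)
)

# Number of distinct recalls remaining for each participant
recalls_per_person <- qc_toy %>%
  group_by(SEQN) %>%
  summarise(n_recalls = n_distinct(RecallNo), .groups = "drop")

# Participants left with fewer than two recalls
incomplete <- filter(recalls_per_person, n_recalls < 2)

# Optional: keep only participants who still have both recalls
complete_only <- qc_toy %>%
  semi_join(filter(recalls_per_person, n_recalls == 2), by = "SEQN")

print(incomplete)
print(n_distinct(complete_only$SEQN))
```

If `incomplete` has any rows, applying the final `semi_join()` step would restore the guarantee that every remaining participant has two recalls.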
Export Data Files
Given the number of entries, we will compress this file to reduce file size.
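# Create the output folder if it does not already exist
dir.create('user_inputs', showWarnings = FALSE)
vroom::vroom_write(diet_data_filtered_QC,
                   'user_inputs/NHANES_2021_2023_diet_adults.csv.bz2', delim = ",")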
# Optional for users who want to keep the filtered demographic data
#vroom::vroom_write(demo_adults, 'user_inputs/NHANES_2021_2023_demographics_adults.csv.bz2', delim = ",")
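After exporting, a quick round trip can confirm that the compressed file reads back as expected. The sketch below uses base R only (`bzfile()`, `write.csv()`, `read.csv()`) on a toy dataframe written to a temporary path, so it does not touch the tutorial's output files; `out_toy` and `tmp_path` are hypothetical names.

```r
# Toy stand-in for the exported data (not real NHANES data)
out_toy <- data.frame(
  SEQN     = c(1, 1, 2),
  RecallNo = c(1, 2, 1),
  DRXIGRMS = c(120.5, 30, 250)
)

# Write a bzip2-compressed CSV to a temporary location
tmp_path <- file.path(tempdir(), "toy_diet.csv.bz2")
con <- bzfile(tmp_path, open = "w")
write.csv(out_toy, con, row.names = FALSE)
close(con)

# read.csv() opens the path with file(), which detects bzip2 compression when reading
read_back <- read.csv(tmp_path)

print(dim(read_back))
print(names(read_back))
```

The same check on the real export would compare the row count and column names of the file read back against `diet_data_filtered_QC`.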