Prepare NHANES diet recall data

This is a tutorial to help users access NHANES dietary data for downstream utilization in Polyphenol Estimator. The R scripts below walk you through how to directly download one cycle of NHANES dietary data from the CDC website and perform several diet cleaning steps. Users can directly download this code here and generate the same outputs by running the R code in RStudio.

About NHANES
NHANES is a nationally representative sample of non-institutionalized individuals in the United States. NHANES uses the Food and Nutrient Database for Dietary Studies (FNDDS) to generate nutrient intakes from food composition data. FNDDS is released every two years in conjunction with the What We Eat in America (WWEIA), NHANES dietary data release. For each new version of FNDDS, foods/beverages, portions, and nutrient values are reviewed and updated.

Users interested in analyzing multiple cycles of NHANES dietary data should use crosswalks to harmonize changes in FNDDS foods and beverages across cycles. Crosswalk information is available in the ‘Documentation’ file for each FNDDS release. Visit the USDA Food Survey Research Group for more information here.

INPUT

  1. 2021 - 2023 Demographic Data
  2. 2021 - 2023 Dietary Interview - Itemized foods from two separate diet recalls
  3. Dietary Interview Technical Support File - Food Codes
  4. Dietary Totals - Total Nutrient Intakes, two separate recalls

OUTPUT

  • NHANES_2021_2023_diet_adults.csv.bz2 - NHANES 2021-2023 ingredient-level diet data, first and second recalls combined, restricted to adults aged 20 years or older, with nutrient outliers removed (kcal, fat, protein, vitamin C, and beta-carotene). Each participant has two complete recalls.
  • (Optional) NHANES_2021_2023_demographics_adults.csv.bz2 - NHANES 2021-2023 demographic data

SCRIPTS

Load packages

# dplyr: helps with data wrangling
# haven: loads SAS files
# vroom: writes the compressed output files
required = c("dplyr", "haven", "vroom")

# Loop to install and load packages
for (pkg in required) {
  # This will install the package if you do not already have it
  if (!requireNamespace(pkg, quietly = TRUE)) install.packages(pkg)
  # This will load the package so it's active for this R session
  library(pkg, character.only = TRUE)
}

Extract 2021-2023 NHANES Data from CDC website

To analyze 2021-2023 NHANES dietary data, we need to pull down the relevant files for our cycle of interest. An array of files from the 2021-2023 NHANES cycle are available from the CDC here, but for our purposes, we will pull down several key files:

  1. Demographic Data - We will use this data to restrict the analysis to adults.
  2. Dietary Interview - “Individual Foods, First Day” and “Individual Foods, Second Day” - Diet data is stored in separate files which we will combine.
  3. Dietary Interview Technical Support File - Food Codes - Contains three columns (food codes, a short food description, and a long food description)
  4. Dietary Totals - “Total Nutrient Intakes, First Day” and “Total Nutrient Intakes, Second Day” - Data is stored in separate files which we will combine. We will use selected total nutrients to quality-check (QC) our dietary data.
# 1. Demographic data
demo_data = read_xpt('https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2021/DataFiles/DEMO_L.xpt')

# 2. Dietary Interview Data
recall1 = read_xpt("https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2021/DataFiles/DR1IFF_L.xpt")
recall2 = read_xpt("https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2021/DataFiles/DR2IFF_L.xpt")

# 3. Food Codes
diet_codes = read_xpt("https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2021/DataFiles/DRXFCD_L.xpt")

# 4. Dietary Totals
tot_recall1 = read_xpt("https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2021/DataFiles/DR1TOT_L.xpt")
tot_recall2 = read_xpt("https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2021/DataFiles/DR2TOT_L.xpt")

Checkpoint: How many participants (SEQN) are in our starting files?

## 2021-2023 demographic file, n = 11933
## 2021-2023 diet recall 1, n = 6751
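
Counts like the ones above can be obtained by counting distinct SEQN values. The following is a minimal sketch on made-up data; toy_demo and toy_recall are hypothetical stand-ins for demo_data and recall1, and the values are invented for illustration.

```r
library(dplyr)

# Toy stand-ins for demo_data and recall1 (hypothetical values)
toy_demo = data.frame(SEQN = c(1, 2, 3, 4))
# Recall files hold one row per reported food, so SEQN repeats
toy_recall = data.frame(SEQN = c(1, 1, 2, 2, 2),
                        DR1ILINE = c(1, 2, 1, 2, 3))

# Count participants, not rows
n_demo = n_distinct(toy_demo$SEQN)
n_recall = n_distinct(toy_recall$SEQN)

cat("demographic file, n =", n_demo, "\n")
cat("diet recall 1, n =", n_recall, "\n")
```

Counting rows with nrow() instead would overstate the number of participants in the recall file, since each participant contributes many food items.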

Clean Column Names

Many columns in these NHANES recall files are labelled with “DR1” or “DR2” which denote the recall they came from. Since we want to analyze both recalls, it makes sense to convert these labels so they are no longer specific to the recall and we can merge dataframes together (Ex: DR1DBIH and DR2DBIH turn into DRXDBIH). To make sure we still know which recall the data came from, we can create a new column specifying this information.

Individual Food Files

recall1_clean = recall1 %>%
  # replace column recall labels
  rename_with(~ sub("^DR1", "DRX", .x), starts_with("DR1")) %>%
  # Create Recall Column
  mutate(RecallNo = 1)

# Repeat for Recall 2
recall2_clean = recall2 %>%
  rename_with(~ sub("^DR2", "DRX", .x), starts_with("DR2")) %>%
  mutate(RecallNo = 2)
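
Before merging, it can be worth confirming that the two cleaned dataframes ended up with the same column names. A small sketch of that check on toy data follows; toy_recall1 and toy_recall2 are hypothetical two-column stand-ins for the real recall files.

```r
library(dplyr)

# Toy stand-ins for recall1 and recall2 (hypothetical values)
toy_recall1 = data.frame(SEQN = 1, DR1IFDCD = 11111000, DR1IKCAL = 150)
toy_recall2 = data.frame(SEQN = 1, DR2IFDCD = 94000100, DR2IKCAL = 0)

# Same renaming steps as in the tutorial
toy1_clean = toy_recall1 %>%
  rename_with(~ sub("^DR1", "DRX", .x), starts_with("DR1")) %>%
  mutate(RecallNo = 1)
toy2_clean = toy_recall2 %>%
  rename_with(~ sub("^DR2", "DRX", .x), starts_with("DR2")) %>%
  mutate(RecallNo = 2)

# Columns present in one dataframe but not the other; both should be empty
only_in_1 = setdiff(names(toy1_clean), names(toy2_clean))
only_in_2 = setdiff(names(toy2_clean), names(toy1_clean))
print(only_in_1)
print(only_in_2)
```

If either vector is non-empty, the named columns would be filled with NA for one of the recalls after merging.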

Now that the columns within our files have been cleaned, we can merge them together! We can also merge in food descriptions from our diet_codes dataframe.

recall_merge_clean = full_join(recall1_clean, recall2_clean) %>%
  # Merge food descriptions into recall data; the key columns are named differently, so the mapping is given explicitly
  left_join(diet_codes, by = c("DRXIFDCD" = "DRXFDCD")) %>%
  # Move RecallNo Column up
  relocate(RecallNo, .after = SEQN) %>%
  # Move Food Descriptions after food code
  relocate(DRXFCSD, DRXFCLD, .after = DRXIFDCD)
## Joining with `by = join_by(SEQN, WTDRD1, WTDR2D, DRXILINE, DRXDRSTZ, DRXEXMER,
## DRABF, DRDINT, DRXDBIH, DRXDAY, DRXLANG, DRXCCMNM, DRXCCMTX, DRX_020, DRX_030Z,
## DRXFS, DRX_040Z, DRXIFDCD, DRXIGRMS, DRXIKCAL, DRXIPROT, DRXICARB, DRXISUGR,
## DRXIFIBE, DRXITFAT, DRXISFAT, DRXIMFAT, DRXIPFAT, DRXICHOL, DRXIATOC, DRXIATOA,
## DRXIRET, DRXIVARA, DRXIACAR, DRXIBCAR, DRXICRYP, DRXILYCO, DRXILZ, DRXIVB1,
## DRXIVB2, DRXINIAC, DRXIVB6, DRXIFOLA, DRXIFA, DRXIFF, DRXIFDFE, DRXICHL,
## DRXIVB12, DRXIB12A, DRXIVC, DRXIVD, DRXIVK, DRXICALC, DRXIPHOS, DRXIMAGN,
## DRXIIRON, DRXIZINC, DRXICOPP, DRXISODI, DRXIPOTA, DRXISELE, DRXICAFF, DRXITHEO,
## DRXIALCO, DRXIMOIS, DRXIS040, DRXIS060, DRXIS080, DRXIS100, DRXIS120, DRXIS140,
## DRXIS160, DRXIS180, DRXIM161, DRXIM181, DRXIM201, DRXIM221, DRXIP182, DRXIP183,
## DRXIP184, DRXIP204, DRXIP205, DRXIP225, DRXIP226, RecallNo)`
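
As the message shows, full_join() without a by argument joins on every shared column, including RecallNo. Because RecallNo differs between the two dataframes, no rows match and the result is simply the rows of both recalls stacked. The sketch below illustrates this on toy data (toy1, toy2, and toy_codes are hypothetical), compares the result with bind_rows(), and then attaches food descriptions through differently named key columns.

```r
library(dplyr)

# Toy cleaned recalls (hypothetical values)
toy1 = data.frame(SEQN = c(1, 2), DRXIFDCD = c(100, 200), RecallNo = 1)
toy2 = data.frame(SEQN = c(1, 2), DRXIFDCD = c(100, 300), RecallNo = 2)

# Joining on all shared columns: RecallNo never matches, so every row is kept
joined = full_join(toy1, toy2, by = c("SEQN", "DRXIFDCD", "RecallNo"))
stacked = bind_rows(toy1, toy2)

# Toy food-code lookup; the key is DRXFDCD here but DRXIFDCD in the recalls
toy_codes = data.frame(DRXFDCD = c(100, 200, 300),
                       DRXFCSD = c("Food A", "Food B", "Food C"))
with_desc = left_join(joined, toy_codes, by = c("DRXIFDCD" = "DRXFDCD"))
print(with_desc)
```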

Total Files

We will repeat this column cleaning process for total nutrient intake files.

tot_recall1_clean = tot_recall1 %>%
  # replace column recall labels
  rename_with(~ sub("^DR1", "DRX", .x), starts_with("DR1")) %>%
  # Create Recall Column
  mutate(RecallNo = 1)

# Repeat for Recall 2
tot_recall2_clean = tot_recall2 %>%
  rename_with(~ sub("^DR2", "DRX", .x), starts_with("DR2")) %>%
  mutate(RecallNo = 2)

Now that names are cleaned, we can merge our recalls together and pull out our columns of interest, notably the total nutrients used to identify outliers during dietary cleaning (kcal, protein, fat, vitamin C, and beta-carotene).

qc_nutrients = c("DRXTKCAL", # kcal
                 "DRXTPROT", # protein
                 "DRXTTFAT", # Fat
                 "DRXTVC", # Vitamin C
                 "DRXTBCAR") #Beta Carotene

# Dataframe of total select nutrients needed for nutrient QC
tot_recall_merge_clean = full_join(tot_recall1_clean, tot_recall2_clean) %>%
  # Move RecallNo Column up
  relocate(RecallNo, .after = SEQN) %>%
  # keep the columns needed for nutrient outlier determination
  select(c(SEQN, RecallNo, all_of(qc_nutrients)))
## Joining with `by = join_by(SEQN, WTDRD1, WTDR2D, DRXDRSTZ, DRXEXMER, DRABF,
## DRDINT, DRXDBIH, DRXDAY, DRXLANG, DRXMRESP, DRXHELP, DRXSTY, DRXSKY, DRXTNUMF,
## DRXTKCAL, DRXTPROT, DRXTCARB, DRXTSUGR, DRXTFIBE, DRXTTFAT, DRXTSFAT, DRXTMFAT,
## DRXTPFAT, DRXTCHOL, DRXTATOC, DRXTATOA, DRXTRET, DRXTVARA, DRXTACAR, DRXTBCAR,
## DRXTCRYP, DRXTLYCO, DRXTLZ, DRXTVB1, DRXTVB2, DRXTNIAC, DRXTVB6, DRXTFOLA,
## DRXTFA, DRXTFF, DRXTFDFE, DRXTCHL, DRXTVB12, DRXTB12A, DRXTVC, DRXTVD, DRXTVK,
## DRXTCALC, DRXTPHOS, DRXTMAGN, DRXTIRON, DRXTZINC, DRXTCOPP, DRXTSODI, DRXTPOTA,
## DRXTSELE, DRXTCAFF, DRXTTHEO, DRXTALCO, DRXTMOIS, DRXTS040, DRXTS060, DRXTS080,
## DRXTS100, DRXTS120, DRXTS140, DRXTS160, DRXTS180, DRXTM161, DRXTM181, DRXTM201,
## DRXTM221, DRXTP182, DRXTP183, DRXTP184, DRXTP204, DRXTP205, DRXTP225, DRXTP226,
## DRX_300, DRX_320Z, DRX_330Z, DRXBWATZ, DRXTWSZ, RecallNo)`
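
The totals will later be joined onto the item-level data by SEQN and RecallNo, which assumes one totals row per participant per recall. A quick sketch of how that assumption could be checked is shown here on toy data; toy_totals is a hypothetical stand-in for tot_recall_merge_clean.

```r
library(dplyr)

# Toy stand-in for tot_recall_merge_clean (hypothetical values)
toy_totals = data.frame(SEQN = c(1, 1, 2, 2),
                        RecallNo = c(1, 2, 1, 2),
                        DRXTKCAL = c(2100, 1850, 2400, NA))

# Any SEQN/RecallNo pair listed here would duplicate food rows
# when the totals are joined onto the item-level data
dup_keys = toy_totals %>%
  count(SEQN, RecallNo) %>%
  filter(n > 1)
print(dup_keys)
```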

Data Filtering

With the recalls now cleaned and merged into a single dataframe, we will apply several data filtering steps to reduce the number of participants we will analyze.

  1. Include adults aged 20+ years old. We will leverage the demographic file (RIDAGEYR) to do this.
  2. Include participants who completed two recalls (DRDINT)
  3. Include participants whose dietary recalls passed quality control (recall-specific columns DR1DRSTZ and DR2DRSTZ, renamed to DRXDRSTZ above). The recall status codes are:
       • 1, Reliable and met the minimum criteria
       • 2, Not reliable or not met the minimum criteria
       • 4, Reported consuming breast-milk
       • 5, Not done
       • ., Missing
# Filtering for age
demo_adults = demo_data  %>%
  filter(RIDAGEYR >= 20)

# Let us apply our filters
diet_data_filtered = recall_merge_clean %>%
  # Include only adults 20+
  filter(SEQN %in% demo_adults$SEQN) %>%
  # Include only participants who completed two recalls
  filter(DRDINT == 2) %>%
  # Include recalls that passed QC 
  filter(DRXDRSTZ == 1) %>%
  # Double-check that we have two recalls per participant
  # A participant can pass the filters above with only one usable recall (e.g. the other failed QC)
  group_by(SEQN) %>%
  filter(n_distinct(RecallNo) == 2) %>%
  ungroup()

Checkpoint: How many participants (SEQN) remain after filtering?

## Participants post-filtering, n = 4284
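
To see where participants are lost, the filters can be applied one at a time and the distinct SEQN values counted after each step. The sketch below does this on made-up data; toy_diet and toy_adults are hypothetical stand-ins for recall_merge_clean and demo_adults.

```r
library(dplyr)

# Toy merged recall data (hypothetical values); one row per food item
toy_diet = data.frame(
  SEQN     = c(1, 1, 2, 2, 3, 4, 4),
  RecallNo = c(1, 2, 1, 2, 1, 1, 2),
  DRDINT   = c(2, 2, 2, 2, 1, 2, 2),
  DRXDRSTZ = c(1, 1, 1, 2, 1, 1, 1)
)
# Participant 4 is treated as under 20 in this toy example
toy_adults = data.frame(SEQN = c(1, 2, 3))

step1 = toy_diet %>% filter(SEQN %in% toy_adults$SEQN)  # adults only
step2 = step1 %>% filter(DRDINT == 2)                   # completed two recalls
step3 = step2 %>% filter(DRXDRSTZ == 1)                 # recall passed QC
step4 = step3 %>%                                       # both recalls still present
  group_by(SEQN) %>%
  filter(n_distinct(RecallNo) == 2) %>%
  ungroup()

attrition = data.frame(
  step = c("start", "adults 20+", "two recalls", "passed QC", "both recalls kept"),
  n = c(n_distinct(toy_diet$SEQN), n_distinct(step1$SEQN),
        n_distinct(step2$SEQN), n_distinct(step3$SEQN),
        n_distinct(step4$SEQN))
)
print(attrition)
```

In this toy data, participant 2 completed two recalls but the second one failed QC, so they survive the first three filters with a single recall and are only removed by the final grouped filter.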

Remove Nutrient Outliers

Nutrient cleaning cut points, which are based on NHANES data, are provided in Appendix A of the National Cancer Institute’s “Reviewing and Cleaning ASA24 Data”.

Note: What this script provides is not an exhaustive list of cleaning steps for 24-hour recall data. Users may want to filter for specific populations or perform other diet cleaning steps (e.g. Portion Outliers).

Sex      Nutrient               Minimum   Maximum
Female   Energy (kcal)              600      4400
Female   Protein (g)                 10       180
Female   Fat (g)                     15       185
Female   Vitamin C (mg)               5       350
Female   Beta-carotene (mcg)         15      7100
Male     Energy (kcal)              650      5700
Male     Protein (g)                 25       240
Male     Fat (g)                     25       230
Male     Vitamin C (mg)               5       400
Male     Beta-carotene (mcg)         15      8200

# Let's reduce the number of columns for merging into other dataframes
metadata = demo_adults %>%
  select(c(SEQN, RIDAGEYR, RIAGENDR)) %>%
  left_join(tot_recall_merge_clean, by = "SEQN") 

diet_data_filtered_QC = left_join(diet_data_filtered, metadata, by = c("SEQN", "RecallNo")) %>%
  relocate(RIDAGEYR, RIAGENDR, DRXTKCAL, DRXTPROT, DRXTTFAT, DRXTVC, DRXTBCAR, .after = RecallNo) %>%
  # Apply Nutrient QC Filters
  filter(
    case_when(
      RIAGENDR == 2 ~ # Female
        DRXTKCAL >= 600 & DRXTKCAL <= 4400 &
        DRXTPROT >= 10 & DRXTPROT <= 180 &
        DRXTTFAT >= 15 & DRXTTFAT <= 185 &
        DRXTVC >= 5 & DRXTVC <= 350 &
        DRXTBCAR >= 15 & DRXTBCAR <= 7100,

      RIAGENDR == 1 ~ # Male
        DRXTKCAL >= 650 & DRXTKCAL <= 5700 &
        DRXTPROT >= 25 & DRXTPROT <= 240 &
        DRXTTFAT >= 25 & DRXTTFAT <= 230 &
        DRXTVC >= 5 & DRXTVC <= 400 &
        DRXTBCAR >= 15 & DRXTBCAR <= 8200,
      # removes anyone with a missing RIAGENDR
      TRUE ~ FALSE)) %>% 
  # Remove meta data columns
  select(-c(RIDAGEYR, RIAGENDR, DRXTKCAL, DRXTPROT, DRXTTFAT, DRXTVC, DRXTBCAR))
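
The same cut points can also be kept in a small lookup table and joined by sex, so that the thresholds live in one place. This is an alternative sketch, not the method used above; the cut_points and toy dataframes and the *_min / *_max column names are made up for illustration.

```r
library(dplyr)

# Cut points from the table above, one row per sex code (1 = male, 2 = female)
cut_points = data.frame(
  RIAGENDR = c(1, 2),
  kcal_min = c(650, 600), kcal_max = c(5700, 4400),
  prot_min = c(25, 10),   prot_max = c(240, 180),
  fat_min  = c(25, 15),   fat_max  = c(230, 185),
  vc_min   = c(5, 5),     vc_max   = c(400, 350),
  bcar_min = c(15, 15),   bcar_max = c(8200, 7100)
)

# Toy recall totals (hypothetical values)
toy = data.frame(
  SEQN     = c(1, 2, 3, 4),
  RIAGENDR = c(1, 2, 2, NA),
  DRXTKCAL = c(2500, 500, 2000, 2000),
  DRXTPROT = c(90, 40, 70, 70),
  DRXTTFAT = c(80, 30, 60, 60),
  DRXTVC   = c(60, 50, 80, 80),
  DRXTBCAR = c(1500, 900, 2000, 2000)
)

toy_qc = toy %>%
  # inner_join drops rows whose RIAGENDR is missing or has no cut points
  inner_join(cut_points, by = "RIAGENDR") %>%
  # conditions separated by commas are combined with AND
  filter(
    DRXTKCAL >= kcal_min, DRXTKCAL <= kcal_max,
    DRXTPROT >= prot_min, DRXTPROT <= prot_max,
    DRXTTFAT >= fat_min,  DRXTTFAT <= fat_max,
    DRXTVC   >= vc_min,   DRXTVC   <= vc_max,
    DRXTBCAR >= bcar_min, DRXTBCAR <= bcar_max
  ) %>%
  # drop the helper threshold columns again
  select(-ends_with("_min"), -ends_with("_max"))
print(toy_qc)
```

Here participant 2 is removed for energy below 600 kcal and participant 4 for a missing RIAGENDR. As with the case_when() version, rows with a missing nutrient value are also dropped, because filter() only keeps rows where the condition is TRUE.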

Checkpoint: How many participants (SEQN) remain after nutrient outlier control?

## Participants post-QC, n = 3976
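
Because the nutrient filter is applied to each recall separately, a participant could in principle keep one recall and lose the other. The sketch below shows one way to check for this and, if desired, to re-apply the two-recall requirement. The toy_qc dataframe is a hypothetical stand-in for diet_data_filtered_QC, and whether to drop such participants is an analysis decision left to the user.

```r
library(dplyr)

# Toy stand-in for diet_data_filtered_QC (hypothetical values):
# participant 2 lost recall 2 to the nutrient filter
toy_qc = data.frame(SEQN = c(1, 1, 1, 2, 2),
                    RecallNo = c(1, 1, 2, 1, 1))

# Number of distinct recalls left per participant
recalls_per_person = toy_qc %>%
  group_by(SEQN) %>%
  summarise(n_recalls = n_distinct(RecallNo))

# Participants left with a single recall
single_recall = recalls_per_person %>% filter(n_recalls < 2)
print(single_recall)

# Optionally keep only participants who still have both recalls
toy_two_recalls = toy_qc %>%
  group_by(SEQN) %>%
  filter(n_distinct(RecallNo) == 2) %>%
  ungroup()
```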

Export Data Files

Given the number of entries, we will compress this file to reduce file size.

vroom::vroom_write(diet_data_filtered_QC,
                   'user_inputs/NHANES_2021_2023_diet_adults.csv.bz2', delim = ",")

# Optional for users who want to keep the filtered demographic data
#vroom::vroom_write(demo_adults, 'user_inputs/NHANES_2021_2023_demographics_adults.csv.bz2', delim = ",")
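
The export assumes that the user_inputs folder already exists. The sketch below shows one way to create an output folder if it is missing and to confirm that a compressed file reads back correctly. It uses only base R and writes to a temporary directory, so the folder location, the toy data, and the file name toy_diet.csv.bz2 are assumptions made for the demonstration.

```r
# Create the output folder if it does not exist yet
# (tempdir() is used only so this demo does not touch your project folder)
out_dir = file.path(tempdir(), "user_inputs")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# Toy data (hypothetical values)
toy = data.frame(SEQN = c(1, 2), DRXIKCAL = c(150, 90))
out_file = file.path(out_dir, "toy_diet.csv.bz2")

# Write a bzip2-compressed CSV through a base R connection
con = bzfile(out_file, open = "wt")
write.csv(toy, con, row.names = FALSE)
close(con)

# Read the file back to confirm the round trip
check = read.csv(bzfile(out_file))
print(check)
```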