Pre‐process raw career placement data - UCSB-MEDS/shiny-dashboard GitHub Wiki

IMPORTANT: In an ideal world, the format of these raw career data remain consistent from year to year...however slight changes to column names, additional / missing columns, etc. have necessitated updates to the data processing / cleaning pipeline. If the code breaks while trying to process new data, you will need to do a bit of sleuthing and update the code, as necessary.

ADDITIONALLY: The steps below are for the initial processing of MESM & MEDS career placement (initial & active) data -- this primarily includes adding columns for program acronym (program) and graduation year (year), and ensuring that they same columns exist across years so that they can be combined into four tidy data frames (two for MESM and two for MEDS). Additional cleaning is required (e.g. ensuring state names and abbreviations are consistent, removing student melt, etc.). Cleaning steps are described in later sections of this wiki.

1. Convert raw data to Google Sheets

  • 1. Open the Bren Dashboard shared Drive (Sam Shanny-Csik or Jamie Montgomery can provide access) and navigate to the data/career/raw-data/ folder.
  • 2. Create a new year folder (YYYY/, e.g. 2024/) within both the MESM/ & MEDS/ directories
  • 3. Add both the Initial Placement and Active Placement files to the appropriate program/year directory
  • 4. Convert all files to Google Sheets, as necessary -- open .xlsx file, then click File > Save as Google Sheets
  • 5. Move any .xslx files to data/career/raw-data/xlsx files (original), for organization's sake
  • 6. Manually remove any accessory header rows, footer rows, or preceding extra columns from the Google Sheets (often in Active Placement files; the code currently does not automate this process)

2. Get shareable links

We use the {googlesheets4} package to read raw data directly into process-career-data.qmd (see career-data repo). This requires a shareable link:

  • 1. From an open Google Sheet, click Share > make sure General access is set to "Restricted" > Copy link. It should look something like this:

3. Pre-process / harmonize data

The career-data repository houses three custom functions and one processing file (process-career-data.qmd) which calls those functions to read in (read_career()) and process initial (process_placement()) and active (process_status()) placement data:

  • 1. Open process-career-data.qmd in your cloned career-data project
  • 2. Add the shareable Google Sheets links to the appropriate lists in section 1. READ IN DATA. Do not delete links from past years. It is also helpful to add a short annotation next to the link with the year. E.g. MEDS initial placement data is read in and saved to a list called, meds_placement_list like this:
# INITIAL PLACEMENT (IP) ----
meds_placement_list <- read_career(link_str = c("https://docs.google.com/spreadsheets/d/1jNOfvnUGeg0EjCqFI4KktFv-lWsKI2ttj7BCDj-KZGg/edit?usp=sharing", # MEDS IP 2022
                                                "https://docs.google.com/spreadsheets/d/1WA5c-1iufh2O5_olumq4IE6oN1o-n1rCvrh9gg2WGbs/edit?usp=sharing", # MEDS IP 2023
                                                "https://docs.google.com/spreadsheets/d/1JIRt6VDXlsAYoJI_KOYo85DjyvbVl7ppWpe3yxGQhRk/edit?usp=sharing")) # MEDS IP 2024
  • 3. After adding any new Google Sheet links, read in the data by running all code in section 1. If this is your first time u sing the {googlesheets4} package, you'll need to set up your account authentication (follow the prompts, which will likely be displayed when you first run mesm_placement_list <- read_career(link_str = ...). If it's been a while since you've used {googlesheets4}, you may need to go through a similar process to re-authenticate (again, follow the prompts).
  • 4. Run section 2 process all MESM & MEDS initial placement data
  • 5. Open the process_status.R file inside the R/ directory. Add a new stringr::str_detect(campaign_name, "YYYY" == TRUE ~ YYYY line to the following code, ensuring that YYYY matches that of the new file.
data <- data |> 
      clean_names() |>
      mutate(class_year = case_when(
        stringr::str_detect(campaign_name, "2019") == TRUE ~ 2019,
        stringr::str_detect(campaign_name, "2020") == TRUE ~ 2020,
        stringr::str_detect(campaign_name, "2021") == TRUE ~ 2021,
        stringr::str_detect(campaign_name, "2022") == TRUE ~ 2022,
        stringr::str_detect(campaign_name, "2023") == TRUE ~ 2023,
        stringr::str_detect(campaign_name, "2024") == TRUE ~ 2024,
        # ADD NEW LINE HERE
      )) |> 
      select(-any_of(c("contact_count", "campaign_id", "x2"))) # removes columns if they exist (some years have them, others don't)
  • 6. Return to process-career-data.qmd and run section 3 to process all MESM and MEDS status (active placement) data
  • 7. Run section 4 to write processed data to .rds files (NOTE: you may need to create a processed-data/ folder in your project's root directory first). It's helpful to save with file names that include the range of years (e.g. meds_placement_2021_2024.rds).
⚠️ **GitHub.com Fallback** ⚠️