Pre‐process raw career placement data - UCSB-MEDS/shiny-dashboard GitHub Wiki
IMPORTANT: In an ideal world, the format of these raw career data remain consistent from year to year...however slight changes to column names, additional / missing columns, etc. have necessitated updates to the data processing / cleaning pipeline. If the code breaks while trying to process new data, you will need to do a bit of sleuthing and update the code, as necessary.
ADDITIONALLY: The steps below are for the initial processing of MESM & MEDS career placement (initial & active) data -- this primarily includes adding columns for program acronym (program
) and graduation year (year
), and ensuring that they same columns exist across years so that they can be combined into four tidy data frames (two for MESM and two for MEDS). Additional cleaning is required (e.g. ensuring state names and abbreviations are consistent, removing student melt, etc.). Cleaning steps are described in later sections of this wiki.
- 1. Open the Bren Dashboard shared Drive (Sam Shanny-Csik or Jamie Montgomery can provide access) and navigate to the
data/career/raw-data/
folder. - 2. Create a new year folder (
YYYY/
, e.g.2024/
) within both theMESM/
&MEDS/
directories - 3. Add both the Initial Placement and Active Placement files to the appropriate program/year directory
- 4. Convert all files to Google Sheets, as necessary -- open
.xlsx
file, then click File > Save as Google Sheets - 5. Move any
.xslx
files todata/career/raw-data/xlsx files (original)
, for organization's sake - 6. Manually remove any accessory header rows, footer rows, or preceding extra columns from the Google Sheets (often in Active Placement files; the code currently does not automate this process)
We use the {googlesheets4}
package to read raw data directly into process-career-data.qmd
(see career-data
repo). This requires a shareable link:
- 1. From an open Google Sheet, click Share > make sure General access is set to "Restricted" > Copy link. It should look something like this:

The career-data
repository houses three custom functions and one processing file (process-career-data.qmd
) which calls those functions to read in (read_career()
) and process initial (process_placement()
) and active (process_status()
) placement data:
- 1. Open
process-career-data.qmd
in your clonedcareer-data
project - 2. Add the shareable Google Sheets links to the appropriate lists in section
1. READ IN DATA
. Do not delete links from past years. It is also helpful to add a short annotation next to the link with the year. E.g. MEDS initial placement data is read in and saved to a list called,meds_placement_list
like this:
# INITIAL PLACEMENT (IP) ----
meds_placement_list <- read_career(link_str = c("https://docs.google.com/spreadsheets/d/1jNOfvnUGeg0EjCqFI4KktFv-lWsKI2ttj7BCDj-KZGg/edit?usp=sharing", # MEDS IP 2022
"https://docs.google.com/spreadsheets/d/1WA5c-1iufh2O5_olumq4IE6oN1o-n1rCvrh9gg2WGbs/edit?usp=sharing", # MEDS IP 2023
"https://docs.google.com/spreadsheets/d/1JIRt6VDXlsAYoJI_KOYo85DjyvbVl7ppWpe3yxGQhRk/edit?usp=sharing")) # MEDS IP 2024
- 3. After adding any new Google Sheet links, read in the data by running all code in section 1. If this is your first time u sing the
{googlesheets4}
package, you'll need to set up your account authentication (follow the prompts, which will likely be displayed when you first runmesm_placement_list <- read_career(link_str = ...)
. If it's been a while since you've used{googlesheets4}
, you may need to go through a similar process to re-authenticate (again, follow the prompts). - 4. Run section 2 process all MESM & MEDS initial placement data
- 5. Open the
process_status.R
file inside theR/
directory. Add a newstringr::str_detect(campaign_name, "YYYY" == TRUE ~ YYYY
line to the following code, ensuring thatYYYY
matches that of the new file.
data <- data |>
clean_names() |>
mutate(class_year = case_when(
stringr::str_detect(campaign_name, "2019") == TRUE ~ 2019,
stringr::str_detect(campaign_name, "2020") == TRUE ~ 2020,
stringr::str_detect(campaign_name, "2021") == TRUE ~ 2021,
stringr::str_detect(campaign_name, "2022") == TRUE ~ 2022,
stringr::str_detect(campaign_name, "2023") == TRUE ~ 2023,
stringr::str_detect(campaign_name, "2024") == TRUE ~ 2024,
# ADD NEW LINE HERE
)) |>
select(-any_of(c("contact_count", "campaign_id", "x2"))) # removes columns if they exist (some years have them, others don't)
- 6. Return to
process-career-data.qmd
and run section 3 to process all MESM and MEDS status (active placement) data - 7. Run section 4 to write processed data to
.rds
files (NOTE: you may need to create aprocessed-data/
folder in your project's root directory first). It's helpful to save with file names that include the range of years (e.g.meds_placement_2021_2024.rds
).