Resources data cleaning - Enterprise-CMCS/cmcs-eregulations GitHub Wiki
Data is easier to process and search if it's tidy.
Manual QA:
- Sort each column A-Z and Z-A, then check whether anything is missing or inconsistent.
- Are there any "null"s that shouldn't be nulls?
- Run the data through a broken link checker. Are there any broken links?
- Are there any HTTP links? All links should be HTTPS.
- Check for links to non-.gov websites. Is it a legitimate government website or publication? For example, is it an official CMS site run by contractors, such as PASRR Technical Assistance Center or ResDAC?
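Some of these checks can be partly automated before a human reviews the results. The following is a minimal sketch, not the project's actual tooling: it assumes each resource is a dict, and the column names (`name`, `url`), the helper names, and the set of "null-like" strings are all hypothetical. It does not replace a broken link checker, and a `non-gov` flag only marks a link for manual review, since some legitimate sites are not on .gov domains.

```python
from urllib.parse import urlparse

# Assumption: strings that probably represent a missing value.
NULL_LIKE = {"", "null", "none", "nan"}


def flag_link(url):
    """Return a list of QA flags for one resource URL."""
    flags = []
    parsed = urlparse(url.strip())
    host = (parsed.hostname or "").lower()
    if parsed.scheme == "http":
        flags.append("not-https")
    # Flag anything that is not on a .gov domain for manual review.
    if not host.endswith(".gov"):
        flags.append("non-gov")
    return flags


def find_nulls(rows, columns):
    """Return (row_index, column) pairs whose value is missing or null-like."""
    hits = []
    for i, row in enumerate(rows):
        for col in columns:
            value = row.get(col)
            if value is None or str(value).strip().lower() in NULL_LIKE:
                hits.append((i, col))
    return hits
```

For example, `flag_link("http://www.cms.gov/page")` returns `["not-https"]`, and a row whose `url` cell contains the literal text `null` shows up in the `find_nulls` output so someone can decide whether the value is really missing.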
Remove:
- Duplicate items (multiple items with the same URL)
- Hidden Unicode control characters
- Newlines
- Double spaces
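The removals above could be scripted along these lines. This is a sketch under stated assumptions: rows are dicts with a hypothetical `url` key, "hidden Unicode control characters" is interpreted as the Unicode categories Cc (control) and Cf (format, such as zero-width spaces and bidirectional marks), and the first row seen for each URL is the one kept. Review the Cf removal before using it on text that relies on format characters, such as soft hyphens or zero-width joiners.

```python
import re
import unicodedata


def clean_text(value):
    """Remove newlines, hidden control characters, and repeated spaces."""
    # Turn newlines and tabs into spaces first so adjacent words do not fuse.
    value = re.sub(r"[\r\n\t]+", " ", value)
    # Drop remaining control (Cc) and format (Cf) characters.
    value = "".join(
        ch for ch in value if unicodedata.category(ch) not in ("Cc", "Cf")
    )
    # Collapse runs of two or more spaces into one.
    value = re.sub(r" {2,}", " ", value)
    return value.strip()


def dedupe_by_url(rows, key="url"):
    """Keep only the first row for each URL (compared after trimming)."""
    seen = set()
    unique = []
    for row in rows:
        url = row[key].strip()
        if url not in seen:
            seen.add(url)
            unique.append(row)
    return unique
```

Deduplicating on the exact URL string is deliberately conservative. URLs that differ only by a trailing slash or letter case are treated as different items and still need a manual look.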
Consider whether to systematically re-process:
- Curly quotes
- Curly apostrophes ’
- Em dashes and en dashes — (may also need to make sure they have spaces around them, to ensure searchability)
- Copyright and registered trademark symbols ®
- Section symbols
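To help decide whether re-processing is worthwhile, it can be useful to count how often these characters appear, and then normalize only the ones that cause problems. The sketch below is one possible approach, not a settled rule: the function names are hypothetical, and the choices to straighten curly quotes and apostrophes, pad em dashes with single spaces, and leave en dashes and the copyright, registered trademark, and section symbols untouched are assumptions.

```python
import re
from collections import Counter

EM_DASH = "\u2014"

# Characters worth counting before deciding how to handle them.
SPECIAL = {
    "\u201c", "\u201d",  # curly double quotes
    "\u2018", "\u2019",  # curly single quotes / apostrophes
    "\u2014", "\u2013",  # em dash, en dash
    "\u00a9", "\u00ae",  # copyright, registered trademark
    "\u00a7",            # section symbol
}

# Assumption: curly quotes and apostrophes become their straight forms.
STRAIGHTEN = str.maketrans(
    {"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"}
)


def count_special(values):
    """Count occurrences of each special character across all values."""
    counts = Counter()
    for value in values:
        for ch in value:
            if ch in SPECIAL:
                counts[ch] += 1
    return counts


def normalize_punctuation(text):
    """Straighten quotes and put one space on each side of em dashes."""
    text = text.translate(STRAIGHTEN)
    return re.sub(r"\s*" + EM_DASH + r"\s*", " " + EM_DASH + " ", text)
```

Padding em dashes means the words on either side are separated by whitespace, which helps search tools that split text on spaces. En dashes are left alone here because they often appear inside numeric ranges, where added spaces may not be wanted.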