OpenRefine - ua-datalab/AI-for-Professionals GitHub Wiki

OpenRefine


OpenRefine is an open-source desktop application for data cleanup and transformation to other formats, an activity commonly known as data wrangling. It is similar to spreadsheet applications and can handle spreadsheet file formats such as CSV, but it behaves more like a database.

OpenRefine: Guide

Key Features and Capabilities

  • Clean and transform messy data (detect duplicates, split/join cells, fix errors)
  • Explore large datasets interactively without programming
  • Reconcile data (link records to standardized values, e.g., linking hospital names to master lists)
  • Import/export data in multiple formats: CSV, Excel, TSV, Google Sheets, JSON, etc.
  • Track every step with a history of transformations, enabling easy undo/redo

How It Basically Works

  • Load your dataset (CSV, Excel, etc.) into OpenRefine
  • Preview and examine data in a table view
  • Apply transformations using browser-based tools or simple expressions (no coding required)
  • Filter, facet, and cluster data to find and fix inconsistencies
  • Export the cleaned dataset for further analysis or sharing

Five Potential Use Cases

  • Clean and standardize patient names, addresses, or diagnoses in survey or registry data
  • Normalizing healthcare survey data for consistent analysis
  • Remove duplicates and merge records in vaccination or appointment logs
  • Auditing and correcting medication lists for inventory management
  • Map free-text disease names to standardized codes (ICD-10, SNOMED, etc.)

Benefits of Using OpenRefine

  • Saves time on manual data cleaning, improving data reliability
  • Reduces data entry errors, supporting better public health decisions
  • No installation required (can be run locally in a browser); free and open-source
  • Empowers staff without coding experience to handle complex data tasks

References and Further Readings

Example: General Refine Expression Language (GREL)

The Mass Cell Processor (MCP) is a feature in OpenRefine that allows you to perform operations on multiple cells at once. Here's a guide on how to use it.

The MCP feature in OpenRefine lets you apply transformations across multiple cells by writing expressions using the General Refine Expression Language (GREL), Python, or other languages.

Here's a step-by-step example:

  1. Open your project in OpenRefine with your dataset loaded
  2. Select the column you want to transform by clicking on the column header
  3. Click the dropdown arrow and select "Edit cells" > "Transform..."
  4. Write your expression in the expression editor

For example, if you want to convert all text in a column to uppercase:

value.toUppercase()

The preview will show how your data will be transformed. When satisfied, click "OK" to apply the transformation to all cells in that column.

Advanced MCP Examples

Example 1: Trimming whitespace and standardizing case

value.trim().toLowerCase()

Example 2: Extracting part of a string (first 5 characters)

value.substring(0, 5)

Example 3: Replacing text

value.replace("old text", "new text")

Example 4: Converting date formats

value.toDate("MM/dd/yyyy").toString("yyyy-MM-dd")

Example 5: Using regular expressions

value.match(/\d{3}-\d{3}-\d{4}/)

Common MCP Functions

  • value.trim() - Remove leading and trailing whitespace
  • value.replace(s1, s2) - Replace s1 with s2
  • value.contains(s) - Check if value contains s
  • value.split(sep) - Split string into array
  • value.substring(start, length) - Extract substring
  • value.toNumber() - Convert to number
  • value.toString() - Convert to string

Created: 05/13/2025 (C. Lizárraga)

Updated: 05/13/2025 (C. Lizárraga)

DataLab, Data Science Institute, University of Arizona.

CC BY-NC-SA 4.0

⚠️ **GitHub.com Fallback** ⚠️