OpenRefine - ua-datalab/AI-for-Professionals GitHub Wiki
OpenRefine is an open-source desktop application for data cleanup and transformation to other formats, an activity commonly known as data wrangling. It is similar to spreadsheet applications and can handle spreadsheet file formats such as CSV, but it behaves more like a database.

Key Features and Capabilities
- Clean and transform messy data (detect duplicates, split/join cells, fix errors)
- Explore large datasets interactively without programming
- Reconcile data (link records to standardized values, e.g., linking hospital names to master lists)
- Import/export data in multiple formats: CSV, Excel, TSV, Google Sheets, JSON, etc.
- Track every step with a history of transformations, enabling easy undo/redo
How It Basically Works
- Load your dataset (CSV, Excel, etc.) into OpenRefine
- Preview and examine data in a table view
- Apply transformations using browser-based tools or simple expressions (no coding required)
- Filter, facet, and cluster data to find and fix inconsistencies
- Export the cleaned dataset for further analysis or sharing
Five Potential Use Cases
- Clean and standardize patient names, addresses, or diagnoses in survey or registry data
- Normalizing healthcare survey data for consistent analysis
- Remove duplicates and merge records in vaccination or appointment logs
- Auditing and correcting medication lists for inventory management
- Map free-text disease names to standardized codes (ICD-10, SNOMED, etc.)
Benefits of Using OpenRefine
- Saves time on manual data cleaning, improving data reliability
- Reduces data entry errors, supporting better public health decisions
- No installation required (can be run locally in a browser); free and open-source
- Empowers staff without coding experience to handle complex data tasks
References and Further Readings
- OpenRefine Download
- OpenRefine GitHub
- OpenRefine official documentation and tutorials
- OpenRefine Community Forum
- Library Carpentries: Introduction to OpenRefine
- Data Wrangling Made Easy: An Introduction to OpenRefine (video)
Example: General Refine Expression Language (GREL)
The Mass Cell Processor (MCP) is a feature in OpenRefine that allows you to perform operations on multiple cells at once. Here's a guide on how to use it.
The MCP feature in OpenRefine lets you apply transformations across multiple cells by writing expressions using the General Refine Expression Language (GREL), Python, or other languages.
Here's a step-by-step example:
- Open your project in OpenRefine with your dataset loaded
- Select the column you want to transform by clicking on the column header
- Click the dropdown arrow and select "Edit cells" > "Transform..."
- Write your expression in the expression editor
For example, if you want to convert all text in a column to uppercase:
value.toUppercase()
The preview will show how your data will be transformed. When satisfied, click "OK" to apply the transformation to all cells in that column.
Advanced MCP Examples
Example 1: Trimming whitespace and standardizing case
value.trim().toLowerCase()
Example 2: Extracting part of a string (first 5 characters)
value.substring(0, 5)
Example 3: Replacing text
value.replace("old text", "new text")
Example 4: Converting date formats
value.toDate("MM/dd/yyyy").toString("yyyy-MM-dd")
Example 5: Using regular expressions
value.match(/\d{3}-\d{3}-\d{4}/)
Common MCP Functions
-
value.trim()
- Remove leading and trailing whitespace -
value.replace(s1, s2)
- Replace s1 with s2 -
value.contains(s)
- Check if value contains s -
value.split(sep)
- Split string into array -
value.substring(start, length)
- Extract substring -
value.toNumber()
- Convert to number -
value.toString()
- Convert to string
Created: 05/13/2025 (C. Lizárraga)
Updated: 05/13/2025 (C. Lizárraga)
DataLab, Data Science Institute, University of Arizona.