Information adapted from OpenRefine’s documentation wiki.
What is OpenRefine?
OpenRefine, previously known as Google Refine, is an open-source, web-based tool for cleaning and transforming data. It lets you import and clean large datasets in formats such as Excel, XML, RDF/XML, RDF N3 triples, JSON, TSV, and CSV.
What can OpenRefine help you do?
- Use filtering and faceting to explore large datasets more easily
- Cluster similar values to find and fix spelling and data-entry discrepancies, which helps ensure data quality
- Apply bulk transformations to your data, or subset it, using the General Refine Expression Language (GREL), or add an OpenRefine extension to use other common expression languages
- Track your change history automatically, with undo/redo available at any step
- Export your data in TSV, CSV, Excel, or HTML table formats
- Export your project as a .tar.gz file, which includes your data transformation history
- Add further functionality through extensions and reconciliation
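To give a feel for what clustering does, here is a minimal Python sketch of a fingerprint-style "key collision" approach: each value is reduced to a normalized key, and values that share a key are grouped as likely variants of the same entry. This is an illustration only, not OpenRefine's own code. The function names fingerprint and cluster and the sample values are made up for this example.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Reduce a value to a normalized key (illustrative, hypothetical helper).

    Steps: trim and lowercase, strip accents and ASCII punctuation,
    then sort and de-duplicate the whitespace-separated tokens.
    """
    text = unicodedata.normalize("NFKD", value.strip().lower())
    # Drop combining marks so that "é" and "e" compare equal.
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group values that share a fingerprint.

    Only groups containing more than one distinct spelling are returned.
    """
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Sample values invented for this example.
names = [
    "Université de Montréal",
    "universite de montreal",
    "Montreal, Universite de",
    "McGill University",
]
clusters = cluster(names)
print(clusters)
```

In this sample, the three spellings of the first institution end up in one group, while "McGill University" has no variants and is left out. In OpenRefine itself you would review each suggested cluster and choose the value to merge into, rather than merging automatically.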
How do you get it?
OpenRefine is available at no cost from openrefine.org. The same page has information on other distributions and extensions, along with detailed installation instructions. Once installed, OpenRefine runs in your browser.
What else should you know?
You can run into memory issues with OpenRefine, particularly with large datasets, and may need to increase the amount of memory allocated to it. OpenRefine is a great tool for cleaning up messy data, but it is important to follow good data management habits. Keep your raw data separate from your cleaned datasets, document what you did to your data so that the results are reproducible, and follow good backup and storage practices.
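One way to put these habits into practice before you start cleaning is sketched below in Python: copy the raw file into a separate working folder, and append a line to a log recording when the copy was made and a checksum of the untouched original. This is only a sketch under assumptions. The folder layout, the file name survey.csv, the log name provenance.log, and the helper names sha256_of and make_working_copy are all hypothetical, and nothing here is part of OpenRefine.

```python
import hashlib
import shutil
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_working_copy(raw_file: Path, work_dir: Path, log_file: Path) -> Path:
    """Copy the raw file into work_dir and log what was done (hypothetical helper)."""
    work_dir.mkdir(parents=True, exist_ok=True)
    working = work_dir / raw_file.name
    shutil.copy2(raw_file, working)
    # One tab-separated line per action: timestamp, source, checksum, note.
    entry = "\t".join([
        datetime.now(timezone.utc).isoformat(),
        str(raw_file),
        sha256_of(raw_file),
        f"copied to {working}",
    ])
    with log_file.open("a", encoding="utf-8") as log:
        log.write(entry + "\n")
    return working


# Demonstration in a temporary folder, using made-up data.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    raw = root / "raw" / "survey.csv"
    raw.parent.mkdir()
    raw.write_text("id,name\n1,Ada\n", encoding="utf-8")

    copy = make_working_copy(raw, root / "working", root / "provenance.log")
    copy_matches_raw = sha256_of(copy) == sha256_of(raw)
    log_lines = (root / "provenance.log").read_text(encoding="utf-8").splitlines()

print(copy_matches_raw, len(log_lines))
```

You would then import the working copy into OpenRefine and leave the file in the raw folder untouched. If the checksum of the raw file ever differs from the one in the log, the original has been modified.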
If you’re interested in learning more about using OpenRefine, keep an eye out for upcoming Data Carpentry workshops on campus, which include hands-on time with the tool and help from instructors. Also, feel free to contact us if you have any questions in the meantime.
If you enjoy this tool, you can also contribute to the OpenRefine development community.