Published on Apr 28, 2016
Dedupe.io: Quickly and automatically find similar rows in a spreadsheet or database.
Today, DataMade is launching something new. It’s not a map, or an app. It’s a product called dedupe.io. And it will transform your messy data into something much more useful.
At DataMade, we build technology and use data to empower journalists, researchers, governments and advocacy organizations.
What do these disparate datasets have in common? There’s a lot of it, and it’s a mess.
Before we can map how much money we’re spending on incarcerating people per city block, before we can tell you how much money Ed Burke’s political campaign has in its coffers and before we can show residents what lots are for sale in their neighborhood, we have to spend a lot of time just cleaning up and preparing data. It’s a lot of work.
We’re very proud of these tools, and we know they’ve helped a lot. But for organizations with limited developer resources, it’s not enough. We needed to build something robust, powerful and easy to use so anyone can take advantage of these tools without the need for writing code.
That is why we are proud to be launching dedupe.io today. It’s a web interface for quickly and automatically finding similar rows in a spreadsheet or database, using machine learning methods.
Powered by our open source dedupe library, dedupe.io is customized for your data by the training you give it. From there, it learns the best way to compare records to identify duplicates in your data.
It’s built to be simple. Upload your data, provide some training examples, review some of the matches, and we take it from there.
Training dedupe.io to find similar records
When it’s done you can search and review the results.
Dedupe.io results in a searchable, browseable interface
Or, you can download your original data as a spreadsheet which includes a new column called
entity_id on the left that indicates a cluster of records. These results can be opened up in Excel or imported into a database.
Dedupe.io results, viewed in Excel
Dedupe doesn’t just work on one static dataset. We’ve built it to handle a variety of data matching cases, including:
Record linkage - Is your data siloed in multiple systems and databases? We can link together multiple datasets and find overlapping records in each.
Continuous matching - Data is rarely static. New records are added and updated all the time. To handle this, we’ve set up dedupe.io to match streaming data as it comes in and apply your training to it to automatically match new records.
API integrations - Integrate dedupe.io seamlessly into your project or data pipeline with our restful API.