Published on Mar 17, 2014
TL;DR We’re launching a free online tool for de-duplicating spreadsheets.
Data. Every day, more and more of it gets released by governments, non-profits and corporations, often for free. And that’s awesome. Each new dataset opens up important opportunities for journalists, researchers, and citizens to understand what’s going on in our world.
What’s not awesome? This data often requires a lot of cleanup before it can be useful.
The cleanup that we’ve found takes the most time is finding all the different records in a dataset that are really about the same thing, a task often called de-duplication. We’ve been working on building tools to make de-duplication faster and easier for everyone.
Today we are proud to announce a new tool for de-duplicating spreadsheets: Spreadsheet Deduper.
Spreadsheet Deduper is a web application that generically de-duplicates any spreadsheet with up to 10,000 rows in less than 5 minutes.
Here’s how it works:
Spreadsheet Deduper is built on top of dedupe, an open source Python library that we built to generically de-duplicate any kind of database or flat file. It builds on an entire field of academic computational research and is based closely on a Ph.D. dissertation by Mikhail Yuryevich Bilenko called Learnable Similarity Functions and their Application to Record Linkage and Clustering.
The library uses some powerful string comparators, machine learning algorithms and your input to determine the best set of rules for your spreadsheet. This is the secret sauce of dedupe and Spreadsheet Deduper: you actually train the program to identify duplicates in your own spreadsheet.
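To make the idea of "training" concrete, here is a minimal sketch in plain Python. It is not dedupe's API or its actual algorithm: the standard library's `difflib.SequenceMatcher` stands in for the string comparators, a small hand-rolled logistic regression stands in for the learner, and the sample records, field names, and function names are all made up for illustration. The labeled pairs play the role of your input.

```python
import math
from difflib import SequenceMatcher


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def field_similarities(a, b, fields):
    """One similarity score in [0, 1] per field (stand-in string comparator)."""
    return [SequenceMatcher(None, a[f].lower(), b[f].lower()).ratio() for f in fields]


def train(labeled_pairs, fields, epochs=500, lr=0.5):
    """Learn one weight per field from pairs a person marked 1 (duplicate) or 0 (distinct)."""
    weights, bias = [0.0] * len(fields), 0.0
    examples = [(field_similarities(a, b, fields), y) for a, b, y in labeled_pairs]
    for _ in range(epochs):
        for x, y in examples:
            error = y - sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            bias += lr * error
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
    return weights, bias


def duplicate_probability(a, b, fields, weights, bias):
    """Score a pair of records with the learned weights."""
    x = field_similarities(a, b, fields)
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


# Hypothetical rows and labels, invented for this example.
fields = ["name", "address"]
labeled_pairs = [
    ({"name": "Bob Smith", "address": "12 Main St"},
     {"name": "Robert Smith", "address": "12 Main Street"}, 1),
    ({"name": "Acme Corp", "address": "500 W Lake St"},
     {"name": "ACME Corporation", "address": "500 West Lake St"}, 1),
    ({"name": "Bob Smith", "address": "12 Main St"},
     {"name": "Alice Jones", "address": "99 Oak Ave"}, 0),
    ({"name": "Acme Corp", "address": "500 W Lake St"},
     {"name": "Zenith Ltd", "address": "7 Pine Rd"}, 0),
]

weights, bias = train(labeled_pairs, fields)
```

Calling `duplicate_probability(row_a, row_b, fields, weights, bias)` on any two rows then gives a score between 0 and 1. The point of the sketch is that the weights come from your labels rather than from hand-written rules, so a spreadsheet where addresses are reliable and names are messy would end up weighting those fields differently.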
Got some messy data? Give Spreadsheet Deduper a test drive - it’s free! Does it solve your de-duplication problem? If not, send us an email and let us know why.
With Spreadsheet Deduper we are really trying to learn a few things:
Also, if you want to de-duplicate data that is really big, complicated or sensitive, contact us and we can set up a custom implementation for you.