Forest Gregg

Dedupe 0.5

Published on Mar 17, 2014

Today, we are excited to announce a major update to the dedupe library! The new features include parallel processing support, improved record linkage across files, and a new asynchronous architecture.

Parallel processing

Deduplication is ultimately about comparing records to decide whether they are similar enough to be about the same thing. As the number of records grows, so does the number of comparisons. While we use a lot of techniques to reduce the number of comparisons (often by over 99%), comparing records in a large dataset can still take hours.

One way to speed that up is to use more of your computer's processors for these comparisons. Dedupe 0.5 implements multi-core parallel processing using Python's standard multiprocessing library. This can cut the time spent comparing records by a factor of roughly the number of processors on your machine.
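To give a feel for the idea, here is a minimal sketch of spreading pairwise comparisons across cores with a multiprocessing pool. It is not dedupe's internal code: the `similarity` and `score_pairs` functions and the sample records are made up for illustration, and the string comparison from `difflib` stands in for dedupe's learned comparison.

```python
import itertools
import multiprocessing
from difflib import SequenceMatcher


def similarity(pair):
    """Score one pair of strings between 0.0 (different) and 1.0 (identical)."""
    left, right = pair
    return SequenceMatcher(None, left.lower(), right.lower()).ratio()


def score_pairs(records, map_fn=map):
    """Score every unordered pair of records.

    map_fn is the only thing that changes between a serial and a parallel run:
    pass the built-in map, or the map method of a multiprocessing.Pool.
    """
    pairs = list(itertools.combinations(records, 2))
    scores = list(map_fn(similarity, pairs))
    return list(zip(pairs, scores))


if __name__ == "__main__":
    # Hypothetical sample data.
    records = ["Acme Corp.", "ACME Corp", "Widgets Inc", "Widgets Incorporated"]

    # Pool() with no arguments starts one worker process per available CPU.
    with multiprocessing.Pool() as pool:
        scored = score_pairs(records, map_fn=pool.map)

    for (left, right), score in scored:
        print(f"{score:.2f}  {left!r} vs {right!r}")
```

Each comparison is independent of the others, which is why this kind of work splits cleanly across processes. In practice the speedup is usually a bit less than the number of cores, because sending records to the workers has a cost of its own.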

Record linkage across two sources

Sometimes you want to link two datasets where you know that each dataset, by itself, does not have any duplicates. When you can make this strong assumption, the record linkage problem can be solved much more accurately.
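One way to see why the assumption helps: if neither file contains duplicates, a record in one file can match at most one record in the other, so competing candidate matches can be thrown out. The sketch below shows a simple greedy version of that idea. It is an illustration only, not necessarily how dedupe resolves matches, and the `link_one_to_one` function, the threshold, and the scores are all made up.

```python
def link_one_to_one(scored_pairs, threshold=0.5):
    """Greedily keep the best-scoring pairs so that no record is used twice.

    scored_pairs: iterable of (id_in_file_a, id_in_file_b, score) tuples.
    """
    used_a, used_b, links = set(), set(), []
    ranked = sorted(scored_pairs, key=lambda pair: pair[2], reverse=True)
    for a_id, b_id, score in ranked:
        if score < threshold:
            break  # every remaining pair scores lower still
        if a_id in used_a or b_id in used_b:
            continue  # one side is already linked to a better match
        used_a.add(a_id)
        used_b.add(b_id)
        links.append((a_id, b_id, score))
    return links


# Made-up scores for illustration only.
candidates = [
    ("a1", "b1", 0.95),
    ("a1", "b2", 0.80),  # dropped: a1 is already linked to b1
    ("a2", "b2", 0.70),
    ("a3", "b1", 0.60),  # dropped: b1 is already linked to a1
    ("a3", "b3", 0.20),  # dropped: below the threshold
]
print(link_one_to_one(candidates))
```

Without the no-duplicates assumption, the pairs (a1, b2) and (a3, b1) would both be plausible matches and would have to be kept or rejected on their scores alone.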

Last summer, Code for America’s Google Summer of Code student Nikit Saraf wrote most of the code to handle this type of record linkage. Thanks Nikit! Thanks Code for America! Thanks Google!

Check out this example of linking product information between different online stores.

Asynchronous architecture

Rich user interfaces depend upon loose coupling between the user interface and the core dedupe program. We rewrote the active learning interface to support asynchronous user labeling. These architectural changes let us write a rich user interface for Spreadsheet Deduper.
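As a rough picture of what loose coupling means here, the sketch below separates "give me a pair to label" from "here is the answer", so that a user interface can call each one whenever it is ready. The `LabelingSession` class and its method names are hypothetical stand-ins, not dedupe's actual API.

```python
from collections import deque


class LabelingSession:
    """Hypothetical stand-in for an active learner; not dedupe's real API."""

    def __init__(self, uncertain_pairs):
        self._pending = deque(uncertain_pairs)
        self.labels = {"match": [], "distinct": []}

    def next_pair(self):
        """Return the next pair to show to the user, or None when finished."""
        return self._pending.popleft() if self._pending else None

    def mark(self, pair, label):
        """Record an answer; it can arrive at any time after next_pair()."""
        if label not in self.labels:
            raise ValueError(f"unknown label: {label!r}")
        self.labels[label].append(pair)


session = LabelingSession(
    [("Acme Corp.", "ACME Corp"), ("Acme Corp.", "Widgets Inc")]
)

# A web front end could do this across separate HTTP requests:
# one to fetch a pair, and a later one to post the user's answer.
first = session.next_pair()
second = session.next_pair()
session.mark(second, "distinct")  # answers need not arrive in order
session.mark(first, "match")
```

The point of the split is that the core program never blocks waiting on a console prompt, so the same labeling logic can sit behind a terminal, a desktop app, or a web page.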

New dedupe apps!

With these new dedupe features, we’ve been able to create some new and exciting tools to help developers and non-techies de-duplicate their data.

Spreadsheet Deduper

Using dedupe’s new asynchronous architecture, we were able to bring dedupe to the web for a wider audience. With Spreadsheet Deduper, you can de-duplicate any spreadsheet with up to 10,000 rows online, for free. Read our blog post for more details on how and why we built it.

Last year, we received generous support from Knight-Mozilla to build a command line interface for dedupe called csvdedupe.

With the 0.5 dedupe release, we have also updated csvdedupe with:

  • a new csvlink command for linking two datasets together like you would with a SQL JOIN
  • a new ‘destructive’ mode that automatically deletes duplicates in csvdedupe and returns only matching records in csvlink (much like a SQL INNER JOIN)
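For readers less familiar with the SQL comparison, the sketch below uses Python's built-in sqlite3 module to show the difference between a join that returns only matching rows and one that also keeps unmatched rows. The tables and values are made up, and the join here is on an exact name, whereas csvlink links records that are similar rather than identical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE store_a (name TEXT, price REAL);
    CREATE TABLE store_b (name TEXT, price REAL);
    INSERT INTO store_a VALUES ('widget', 9.99), ('gadget', 24.50);
    INSERT INTO store_b VALUES ('widget', 10.49), ('gizmo', 5.00);
""")

# Inner join: only rows that have a match in both tables.
inner = conn.execute("""
    SELECT a.name, a.price, b.price
    FROM store_a AS a JOIN store_b AS b ON a.name = b.name
    ORDER BY a.name
""").fetchall()

# Left outer join: every row from store_a, with NULL where store_b has no match.
left_outer = conn.execute("""
    SELECT a.name, a.price, b.price
    FROM store_a AS a LEFT OUTER JOIN store_b AS b ON a.name = b.name
    ORDER BY a.name
""").fetchall()

print(inner)
print(left_outer)
```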

Join the dedupe community

If you’re interested in contributing to dedupe, or have any trouble working with dedupe or any of the tools we built on top of it, join our dedupe Google group or chat with us on the #dedupe IRC channel on irc.freenode.net.