Entity-Focused Data System

De-duplication, distributed tasks, data linking
Atlanta Journal-Constitution
February 2015

A data system to continually link information about political figures, campaign filings, contracts and lobbyist disclosures to drive investigations.

More about this project

The Entity-Focused Data System addresses basic technical problems that prevent journalists from taking advantage of newly available government and business data. These datasets can help journalists quickly connect the dots and reveal the influence that institutions and individuals exert on society. In other words, with the right preparation, these datasets, and the links across them, contain important stories.

To find the stories buried in data, journalists and researchers need to know that the Mr. John Smith who gave money to State Representative Roberts is the same John Smith who owns Acme Corp Limited, and that Acme Corp Limited is the same company as the Acme Corp Ltd. that got a big government contract from a committee that State Rep. Roberts sits on.
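To make the matching problem concrete, here is a minimal sketch using only the Python standard library. It is not how dedupe works (dedupe learns its matching rules from labeled examples); it is a simplified stand-in. The `SUFFIXES` and `HONORIFICS` tables and the `normalize` and `similarity` helpers are hypothetical names made up for this illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical lookup tables for illustration; a real system would need far more
# variants, or would learn them from data.
SUFFIXES = {"limited": "ltd", "corporation": "corp", "incorporated": "inc", "company": "co"}
HONORIFICS = {"mr", "mrs", "ms", "dr"}


def normalize(name: str) -> str:
    """Lowercase, strip punctuation, drop honorifics, and collapse suffix variants."""
    tokens = re.sub(r"[^\w\s]", "", name.lower()).split()
    tokens = [SUFFIXES.get(t, t) for t in tokens if t not in HONORIFICS]
    return " ".join(tokens)


def similarity(a: str, b: str) -> float:
    """Return a 0-to-1 string similarity score between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


# Both spellings of the company reduce to the same normalized form.
print(normalize("Acme Corp Limited"))
print(normalize("Acme Corp Ltd."))
print(similarity("Mr. John Smith", "John Smith"))
```

A score like this only says that two strings look alike. Deciding whether two John Smiths are the same person usually takes more evidence, such as addresses or employers, which is why record linkage compares several fields at once.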

Journalists have been connecting these dots for decades, but it is slow, painstaking work. With the right tools, it can be cheaper, faster, more accurate, and automatic.

With the Atlanta Journal-Constitution, we are building a system that uses dedupe, a record linkage library we have been working on for a few years (thanks in part to some funding from Knight-Mozilla OpenNews).

The system will include a workflow for continuously adding and linking new data. Journalists will be able to work from the most timely information about public figures and organizations, and to connect figures in incoming data with those already identified in other datasets.
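As a rough illustration of that workflow, the sketch below keeps a running store of entities and attaches each incoming record to an existing entity when its match key agrees. Everything here is an assumption made for the example: the `EntityStore` class, its methods, and the dataset names are hypothetical, and the exact-key matching is a deliberately crude placeholder for probabilistic linking.

```python
from dataclasses import dataclass, field


@dataclass
class EntityStore:
    """Hypothetical in-memory store: match key -> list of (source, raw name) records."""
    entities: dict = field(default_factory=dict)

    @staticmethod
    def match_key(name: str) -> str:
        # Crude exact-key matching: lowercase, drop periods and commas,
        # collapse whitespace. A stand-in for a learned linkage model.
        return " ".join(name.lower().replace(".", "").replace(",", "").split())

    def add(self, source: str, name: str) -> str:
        """Attach an incoming record to its entity, creating the entity if new."""
        key = self.match_key(name)
        self.entities.setdefault(key, []).append((source, name))
        return key

    def sources_for(self, name: str) -> list:
        """List the datasets in which this entity has appeared so far."""
        records = self.entities.get(self.match_key(name), [])
        return sorted({source for source, _ in records})


store = EntityStore()
store.add("campaign_filings", "John Smith")
store.add("contracts", "JOHN  SMITH.")  # arrives later, links to the same entity
print(store.sources_for("john smith"))
```

Exact keys like this would wrongly merge two different people who share a name, so a production system would need to score candidate matches on multiple fields rather than rely on the name alone.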

Tools created under this project

parserator kit

Toolkit for making domain-specific probabilistic parsers.


usaddress

Python library for parsing unstructured US address strings into address components, using advanced natural language processing (NLP) methods. Check out our blog post on Parsing addresses with usaddress.

probable people

Python library for parsing unstructured romanized name strings into name components, using advanced natural language processing (NLP) methods.