DataMade at NICAR 2016

Published on Mar 10, 2016


NICAR 2016, an annual conference devoted to data journalism

This week, DataMade will be in Denver for NICAR 2016, a yearly data journalism conference hosted by Investigative Reporters and Editors (IRE). We’ll be presenting our recent work and sharing our open source tools for working with, cleaning up, and visualizing data.

News nerds, look for us in Denver! We’d love to talk to you!

Derek Eder, Founder and Partner
@derekeder
Ask me about: civic tech, data visualization, DataMade stickers

Forest Gregg, Partner
Ask me about: dedupe, machine learning, open source, deep thoughts

Eric van Zanten, Senior Developer
@evanzanten
Ask me about: SQL, Python, open civic data, graveyards

Cathy Deng, Developer
@cthydng
Ask me about: data visualization, documentation, greyhounds

Our sessions

All four of us will be presenting while at NICAR. Here’s a preview of our sessions:

Advanced data cleaning with Python: Machine learning techniques

Forest Gregg, Cathy Deng
Date/Time: Thursday, March 10 at 4:45 p.m.
Location: Denver II

People and corporations are of interest to reporters, but data about them are often messy. Fundamentally, natural language lacks structure, and the same thing can be represented in many different ways. In many cases, simple deterministic approaches (e.g., regex) can’t get you very far. We’ll show you some of the powerful tools that DataMade uses to efficiently clean and link the worst data, including dedupe, usaddress, and probablepeople.

See Forest and Cathy’s presentation slides.


Can’t afford your own unicorn? Bringing civic hackers into the newsroom

Derek Eder, Josh Stewart, Eva Constantaras, Marco Tulio Pires
Date/Time: Friday, March 11 at 10:15 a.m.
Location: Denver III-IV

Let’s face it, most of us work at media houses that are never going to have a full data team. But we have plenty of evidence from around the world that there is another way! Millions of dollars are being poured into the development of tech hubs, open data portals and data-driven think tanks, all of which are collecting fabulous data, building apps and finding amazing stories, none of which ever reach the public. In this session, the people who have created new models for bringing together civic hackers and under-resourced newsrooms to collaborate on data-driven projects will share what it takes to make these partnerships work and evaluate their impact on the public and the media.

Derek will talk about tracking Chicago’s snow plows with ClearStreets.org and the subsequent collaborations with NBC5 Chicago and the Sun-Times on plowing (or not plowing) streets fairly. See his presentation slides here.


Election: Sane ways of collecting candidate information

Eric van Zanten, Donny Bridges
Date/Time: Sunday, March 13 at 9 a.m.
Location: Denver III-IV

At DataMade, we often need to handle and organize information about candidates running for office, their campaign committees, how those committees are funded, and ultimately the outcomes of their elections. As a result, we’ve found and created open source tools that integrate these pieces into an extensible system for tracking this information and tying it to the people involved. In this session, we’ll share the tools we’ve assembled and how they might be reused.


Journalists, use our open source tools!

We write a lot of open source software for working with, cleaning up, and visualizing data. Here are some tools we built:

De-duplication and data linking

dedupe
A Python library that quickly performs de-duplication, entity resolution, and linking on large, structured data.

csvdedupe
Command line tools that use the dedupe Python library to deduplicate CSV files.

We have been using dedupe to build the Entity-Focused Data System with the Atlanta Journal-Constitution. It continually links information about political figures, campaign filings, contracts, and lobbyist disclosures to drive investigations.
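
To give a feel for why de-duplication needs more than exact matching, here is a minimal sketch using only Python's standard library. It is not dedupe's API or algorithm: the character-similarity measure, the `candidate_duplicates` helper, the hand-picked threshold, and the sample records are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(name: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    kept = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def similarity(a: str, b: str) -> float:
    """Character-level similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def candidate_duplicates(names, threshold=0.8):
    """Return index pairs whose similarity meets a hand-picked threshold."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(names), 2)
        if similarity(a, b) >= threshold
    ]


# Made-up records, for illustration only.
records = [
    "Acme Widget Co.",
    "ACME Widget Co",
    "Acme Widgets Company",
    "Bolt & Nut Supply",
]

print(candidate_duplicates(records))
```

Comparing every pair and tuning a threshold by hand stops scaling quickly, which is the gap a trained, purpose-built library is meant to fill.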

Parsing messy data

usaddress
A Python library for parsing unstructured United States address strings into components like AddressNumber, StreetName, and ZipCode.

probablepeople
A Python library for parsing unstructured Western name strings into components like GivenName, MiddleInitial, Surname, or Corporation.

parserator
Need to parse some messy text? We created a toolkit for making domain-specific probabilistic parsers. To create a parser, all you need is some training data to teach your parser about its domain. We used this framework to build usaddress and probablepeople.

Bonus! You can also parse names and addresses online without using any code. Check out parserator.datamade.us
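
As a rough illustration of why a hand-written deterministic parser falls short on messy addresses, here is a small standard-library sketch. It does not use usaddress or parserator; the pattern, the `regex_parse` helper, and the sample addresses are hypothetical, and only the component names AddressNumber, StreetName, and ZipCode come from the description above.

```python
import re

# A hand-written pattern for "number, street name, street type, ..., ZIP".
ADDRESS_RE = re.compile(
    r"^(?P<AddressNumber>\d+)\s+"
    r"(?P<StreetName>[A-Za-z ]+?)\s+"
    r"(?:St|Ave|Blvd|Rd)\.?,?\s+"
    r".*?(?P<ZipCode>\d{5})$"
)


def regex_parse(address: str):
    """Return a dict of components, or None when the pattern does not fit."""
    match = ADDRESS_RE.match(address.strip())
    return match.groupdict() if match else None


# Made-up addresses, for illustration only.
samples = [
    "123 Main St, Chicago, IL 60601",
    "123 N. Main Street Apt 4, Chicago IL 60601",
    "PO Box 4521, Chicago, IL 60680",
]

for address in samples:
    print(address, "->", regex_parse(address))
```

Only the first sample fits the pattern. Each new variation (a directional prefix, a unit number, a PO box) would need another hand-written rule, whereas a probabilistic parser learns such variations from labeled training data.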

Data visualization

Searchable map template
An HTML and JavaScript template, powered by Fusion Tables, that helps you turn a spreadsheet into a fully customizable searchable map.

CSV to HTML Table
Display any CSV file as a searchable, filterable, pretty HTML table. Done in 100% JavaScript.

Depending on interest, Derek may run an informal session on how to get started with these tools. If you’re interested, ping him on Twitter @derekeder.

Guides

data making guide
DataMade’s guide to creating non-destructive, repeatable scripts for extracting, transforming and loading (ETL) data.

site launch checklist
Did you forget to set up Google Analytics again? What about load testing? We created a checklist of final tasks to do before launching a public, open source website or tool.
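
To make the data making guide's idea of non-destructive, repeatable ETL a little more concrete, here is a minimal Python sketch. It is not taken from the guide itself: the `etl` and `clean_row` names and the whitespace-trimming transform are assumptions, and it expects a rectangular CSV with a header row.

```python
import csv
from pathlib import Path


def clean_row(row: dict) -> dict:
    """Trim whitespace from every header and value."""
    return {key.strip(): value.strip() for key, value in row.items()}


def etl(raw_path: Path, out_path: Path) -> int:
    """Read raw_path, write a cleaned copy to out_path, return the row count.

    The raw file is only ever opened for reading, and the output is rebuilt
    from scratch on each run, so rerunning the script gives the same result.
    """
    with raw_path.open(newline="", encoding="utf-8") as src:
        rows = [clean_row(row) for row in csv.DictReader(src)]
    if not rows:
        raise ValueError(f"no data rows found in {raw_path}")
    with out_path.open("w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Calling `etl(Path("raw.csv"), Path("clean.csv"))` with your own file names leaves the raw file untouched, so a mistake in the transform can be fixed by editing the script and running it again.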

Bonus! We also know quite a bit about structuring information about government people and organizations with Open Civic Data through our work with Councilmatic. If you’re interested, ask us about it!