How did a software engineer de-anonymize data on 173 million NYC taxi trips?
When New York City released a treasure trove of fare logs and historical trip information earlier this year in response to a public records request, it was a dream come true for open-data advocates. But the release created a headache for public officials: there was a privacy flaw in how the information was anonymized.
In June, software engineer Vijay Pandurangan was going through the newly released data and found a data-collection error. As he detailed in a post on Medium, “the entire anonymization was flawed and could be easily reversed.”
In less than two hours, Pandurangan had de-anonymized the data on 173 million taxi trips, including the pickup and drop-off locations, the anonymized hack license number, and the taxi’s medallion number, along with other metadata.
It was the latest example of “the privacy perils of anonymized data,” according to Ars Technica.
In July, Pandurangan spoke about how he identified the flaw in the anonymized data as part of a Dev Bootcamp event hosted by Beta NYC, the local Code for America brigade in New York City. Above is a video of Pandurangan’s talk.
“It’s a really hard computer science problem to anonymize data. It’s not trivial at all,” he said during his talk. “And there are a lot of people who spend their PhD theses doing this and it is a really complicated and challenging problem. I don’t mean to say that the person who did this is an idiot by any means. It shows how difficult it is to get this right.”