We know everything, and it means nothing

This is a written version of a short talk I recently gave during the Philadelphia Area New Media Association (PANMA)'s annual show and tell. It's closer to how I practiced it than how it actually came out of my mouth. I've included one very ugly, but relevant, slide. 

I'm here to tell you a tale of two cities. Really, a tale of one city, and a lament about another city in another time.

They are both about data, or more specifically, how most of our data is useless. Or, if you're an academic, how most of our data are useless. How we know everything and it means nothing.

I work in public health, and we traffic in data. We've been doing big, population-level data for a very long time - since the famous Broad Street cholera outbreak of 1854. Here's a refresher: in 3 days, over 150 people died from cholera in a small part of London. So this doctor, John Snow (not Jon Snow, the Game of Thrones guy), puts all the deaths on a map while everyone else is talking about how these dead people wouldn't have gotten cholera if they weren't so poor and dirty. The prevailing wisdom at the time, by the way, is that disease is spread by smell, or miasma. Florence Nightingale, mother of modern nursing? HUGE miasmatist.

Anyway, Dr. Snow maps the cases, and the miasma theory suddenly seems off. He finds a local guy with a badass beard who knows everyone (and everyone's business). His name is Reverend Henry Whitehead, and he knows all about who's been doing what with whom. Whitehead and Snow go door to door asking really invasive questions. Snow figures out the disease is coming from the water supply, and he breaks the handle off the water pump. The cholera outbreak dies down. Everyone is pissed because some nut job just took away their access to the best-tasting water in London.

Such is the glory of public health. Anyway. That was the beginning of modern epidemiology, the study of epidemics. It was over 150 years ago that Dr. Snow plotted those cholera deaths on a map.

Bring it up to modern day in Philadelphia. What's changed?

Well, the maps are getting fancier. We've been collecting tons and tons of data the whole time. So much data that we have biostatisticians and epidemiologists all over the place, and health departments have different divisions for different diseases, and they all have their own epidemiologists and data analysts, and they all make these nifty reports that go into giant PDFs!

Actually, you only get PDFs if you're lucky. Part of what I do, both for work and fun, is poke around the internet looking for data for Philadelphia and its suburbs. A lot of it isn't published in public-facing reports at all. Formats vary wildly. One health department database has these download instructions: "to download, highlight, then copy and paste". Or my personal favorite, a national dataset on substance use and mental health with 3,120 public variables, which you can only sort by a useful geographic area if you fill out this application. It's 20 pages, has its own manual, and they review applications once a year.

So I go around searching for public data, collecting it, digging through it, and compiling it. If I had to guess what you were thinking right now, it would be something like "Sucks to be you."

But public health is a public problem, so my problems are also your problems. Let's think about a public health issue a lot of us have heard about. In Philadelphia, today, there's a lot of talk about high STD rates among teens. So what do we, the public, know about our city's teens? Let's look at the Youth Risk Behavior Survey. This showed that 38% of Philly high schoolers said they were sexually active, meaning they had sex in the last 3 months. 42% of them didn't use a condom last time. This isn't great, but here's the kicker: 21.8% of all Philly high schoolers had had at least 4 sex partners in their lives, which means that they're picking up partners at a pretty steady clip for people who aren't using condoms. Plus, these are somewhat closed social networks. A few years ago, for fun at work, we mapped how STDs would spread through the cast of the Jersey Shore. Imagine that in a high school.

Now that we've established that this is a problem, what do we know about teens in Philadelphia? And when I say "we" I mean the Henry Whiteheads, not the Dr. Snows. We're only looking at published public data, so we've got information on teen pregnancies, youth risk behaviors, other risk behaviors, STD cases, and HIV cases. And let's define teens as 13 - 19 years old - the conventional definition - and see what data we have for that group, 13 - 19 year olds, that will tell us the most about teen STDs.

I went to the six most relevant data sources, and sorted by age. Then I made a hideous Venn diagram to illustrate why we can't have nice things. What you'll see is that the only published public data available for Philadelphia for the age range 13 - 19 is HIV cases. That's the big bubble. The others have some age groups that fall within the range we're looking for, but none offer a complete picture of what's going on with 13 - 19 year olds.

An obnoxious illustration of what happens when we sort data sources by age groups.

So what do we know about teens in Philadelphia, collectively, as a group? Not a whole hell of a lot. And this is just age groups. Imagine what complex categories like race, ethnicity, and sexual orientation look like.

The thing that drives me bananas about this is that this is an easy problem to fix. This isn't trying to get access to new data, or inputting data differently, or buying new software, or developing new infrastructure. This is just running existing reports on existing data slightly differently so that we can look at information from different places and then extract some meaning from it.
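For the devs in the room, here's a rough sketch of the kind of check I mean: given the age groups each source publishes, can any combination of them add up to exactly 13 - 19? The source names and age bins below are made up for illustration (they are not the six sources from my diagram), and the helper `covers_exactly` is my own hypothetical name, not part of any library.

```python
# Target age range, inclusive on both ends.
TARGET = (13, 19)

# HYPOTHETICAL sources and published age bins, invented for illustration only.
published_bins = {
    "source_a": [(13, 19), (20, 24)],
    "source_b": [(10, 14), (15, 19)],
    "source_c": [(15, 17), (18, 24)],
}

def covers_exactly(bins, target):
    """Return True if the bins that sit inside the target range tile it
    with no gaps and no spillover. Assumes a source's bins don't overlap."""
    lo, hi = target
    # Keep only bins that fall entirely within the target range.
    inside = sorted(b for b in bins if lo <= b[0] and b[1] <= hi)
    next_age = lo
    for start, end in inside:
        if start != next_age:
            return False  # gap: some ages in the target are not reported
        next_age = end + 1
    return next_age == hi + 1

# Which sources can be compared directly for 13 - 19 year olds?
report = {name: covers_exactly(bins, TARGET) for name, bins in published_bins.items()}
for name, ok in report.items():
    print(name, "comparable" if ok else "not comparable")
```

In this toy setup only the first source lines up with the target range. The other two have bins that straddle its edges, and that straddling is the whole problem: the underlying records have ages, but the published groupings can't be recombined after the fact.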

Listen, before we talk about open data, before we get into APIs and ripping things out of PDFs, we need to fix this problem. Machine-readable data is great for epidemiologists and devs and Dr. Snows, but in the meantime, we need something that the community can use.

We need data that's comparable across sources.

Data that tells a story about a population.

Data that means something.