Crowdsourcing Data, While Keeping Yours Private
12:02 minutes
At the 2016 Worldwide Developer Conference, Apple engineer Craig Federighi described one way the company planned to learn from its customers, without compromising the individual privacy of any particular user: “differential privacy.” Google, too, has used a form of differential privacy for several years in its Chrome browser.
Cynthia Dwork, a co-creator of differential privacy, says it’s a “mathematically rigorous definition of privacy” in which the statistical analysis of a dataset has the same outcome, whether any individual user’s data is included or not. That means the dataset as a whole can provide meaningful insights, without revealing anything about the preferences of the individuals within the dataset.
For example: Suppose a government wants to survey its citizens, including you, about whether they’ve ever used illegal drugs. But instead of answering directly, you flip a coin. If it comes up heads, you answer truthfully. If it comes up tails, you flip the coin a second time and answer yes if it lands heads, no if it lands tails.
After this exercise, the government can’t know for sure whether you’ve used illegal drugs—your response has random noise built in. But by analyzing the results from all citizens, trends emerge on the frequency of illegal drug use.
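In code, the per-respondent step is just two simulated coin flips. The sketch below (Python, with illustrative names; it is not drawn from any company’s implementation) produces a single randomized answer under the scheme described above.

```python
import random

def randomized_response(truthful_answer: bool) -> bool:
    """One respondent's answer under the coin-flip scheme described above."""
    if random.random() < 0.5:        # first flip: heads -> answer truthfully
        return truthful_answer
    return random.random() < 0.5     # tails -> flip again: heads = yes, tails = no
```

Each reported answer comes out “yes” with probability 0.5 times the true rate plus 0.25, which is what lets the aggregate trend be recovered later while any single answer stays deniable.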
Dwork and security researcher Matthew Green discuss how differential privacy and randomized responses are being used today, from analyzing texting trends to the smart grid.
Cynthia Dwork is the co-creator of differential privacy, and a distinguished scientist at Microsoft in Mountain View, California.
Matthew D. Green is an assistant professor in the Information Security Institute at Johns Hopkins University in Baltimore, Maryland.
IRA FLATOW: This is Science Friday. I’m Ira Flatow. When you type stuff into your web browser, your computer or your smartphone starts suggesting things, right? As soon as you start typing in the box, it’s predicting what it thinks you want. Same thing when you text: your phone autocompletes your thoughts or maybe suggests the perfect emoji. You share your data, I share mine, and everyone gets to share in the benefit of a smarter, data-trained machine, right? Very time saving, right?
Well, it might come with a little bit of a deal with the devil, how we get all that convenience. It’s sort of a technological deal with the devil. But can you get those crowdsourced benefits and preserve your privacy? The devil is in the details.
Google’s Chrome browser uses something called differential privacy to achieve that goal. And a few weeks ago, at Apple’s Worldwide Developers Conference, Craig Federighi hinted that Apple’s getting into the same game.
CRAIG FEDERIGHI: Differential privacy is a research topic in the area of statistics and data analytics that uses hashing, subsampling, and noise injection to enable this kind of crowdsourced learning, while keeping the information of each individual user completely private.
IRA FLATOW: Hashing, subsampling, noise injection, lots of technobabble. But we have a few folks who might help to straighten that out and tell us what’s really going on. Cynthia Dwork is the co-creator of differential privacy and a distinguished scientist at Microsoft. She’s based in Mountain View, California. Welcome to Science Friday.
CYNTHIA DWORK: Thank you, Ira. It’s a pleasure to be here.
IRA FLATOW: Well, thank you. You’re welcome. Matthew Green is an assistant professor at the Information Security Institute at Johns Hopkins University in Baltimore. Welcome to Science Friday.
MATTHEW GREEN: Thanks for having me.
IRA FLATOW: And our listeners who want to get in on the conversation, 844-724-8255, 844-SCI-TALK. You can also tweet us @SCIFRI.
Dr. Dwork, you invented, shall I say, an English-language definition of what differential privacy is, something a little less technical than what Craig mentioned in his keynote.
CYNTHIA DWORK: Yes, so we invented the mathematical definition, and the English-language definition says essentially this: the outcome of any statistical analysis is essentially equally likely, independent of whether any individual chooses to opt in to the dataset or to opt out of the dataset. So in other words, essentially the same things can happen to me, with basically the same probabilities, whether I have allowed my data to be used in the data analysis or I have withheld my data.
IRA FLATOW: Wow, I bet that does seem like magic. Matthew Green, can you give us an example of what she was talking about?
MATTHEW GREEN: Well, so there’s a very old statistical technique called randomized response, which we use to ask people questions they might not want to answer honestly, like have you ever stolen from your employer. And the basic idea with this technique, which is one example of a differentially private technique, is that when I ask you that question, you flip a coin. If it comes up heads, you answer honestly. And if it comes up tails, you answer at random. And the basic idea there is that if I see your response, I don’t really learn much about what you individually did. But if I can aggregate many, many different responses and compute the average fraction of people who answer that question positively, I can subtract away the noise and learn things that are useful to me.
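The “subtract away the noise” step is simple arithmetic. Under the coin-flip scheme Green describes, the probability of hearing “yes” is 0.5 times the true rate plus 0.25, so the true rate can be estimated from the observed fraction. A minimal sketch, with illustrative names:

```python
def estimate_true_rate(responses: list[bool]) -> float:
    """Estimate the real fraction of 'yes' answers from randomized responses.

    P(reported yes) = 0.5 * p_true + 0.25, so p_true ~= 2 * observed - 0.5.
    """
    observed = sum(responses) / len(responses)
    return 2 * observed - 0.5
```

With a handful of responses the estimate is dominated by noise; with many thousands it converges on the true rate, which is exactly the trade-off the guests describe: individuals stay hidden while the population-level answer emerges.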
IRA FLATOW: So you can basically hide the response through mathematical noise, so to speak, so we don’t know who actually made the response. And that’s why, Dr. Dwork, you say it’s 50/50 whether we know or don’t know who it was.
CYNTHIA DWORK: Actually, in the example that Matthew just gave, we may know exactly who is responding, but because of the randomness that’s introduced in the procedure for generating the yes or no to “I stole from my employer,” we don’t actually know if a yes means it really happened, or if the person is saying yes because the coin flips said to say yes. And so even if we knew exactly who it was, we still would only have a vague statistical hint as to whether they actually did or did not engage in the behavior.
And these statistical hints, while telling us nothing about the individual, or very, very little about the individual, can be aggregated over many individuals to understand the fraction of people who stole from their employers.
IRA FLATOW: Without giving away who that individual was?
CYNTHIA DWORK: Exactly.
IRA FLATOW: So we know about the whole population, Matthew, but we don’t know anything about the individuals in them?
MATTHEW GREEN: That’s correct.
IRA FLATOW: And is this what Apple is doing in their new iPhone system that’s going to be launched?
MATTHEW GREEN: So it appears that they’re using a variant of this technique, this randomized response technique. And what it does is almost exactly the same thing, except they’re taking that technique and they’re kind of putting it on steroids. So instead of just asking one question about what your phone has done today, they can ask very complicated questions like, for example, did you erase a word when you were writing a message and replace it with an emoji.
And they can learn from that once they aggregate many people’s information. They can learn, well, are people commonly using this emoji instead of a particular word? And they can take that information and use that to make that a suggestion to other people as well.
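Apple’s exact construction isn’t spelled out in this segment, but the “hashing” and “noise injection” Federighi mentioned can be illustrated generically: hash each item (say, an emoji that replaced a word) into a small table, one-hot encode it, and randomly flip bits before it ever leaves the phone, in the spirit of Google’s RAPPOR system for Chrome. Everything in the sketch below (bucket count, flip probability, function names) is invented for illustration.

```python
import hashlib
import random

NUM_BUCKETS = 256   # size of the hashed domain (illustrative)
FLIP_PROB = 0.25    # chance of flipping each bit (illustrative)

def noisy_report(item: str) -> list[int]:
    """Client side: hash the item to a bucket, one-hot encode it, then flip
    each bit with probability FLIP_PROB. The server only sees this noisy vector."""
    bucket = int(hashlib.sha256(item.encode()).hexdigest(), 16) % NUM_BUCKETS
    one_hot = [1 if i == bucket else 0 for i in range(NUM_BUCKETS)]
    return [bit ^ 1 if random.random() < FLIP_PROB else bit for bit in one_hot]
```

On the server side, the noisy vectors are summed and the per-bucket counts debiased, just as in the simple survey example, to estimate how often each item appears across the population.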
IRA FLATOW: So what do you say to people, Dr. Dwork, who are afraid that their privacy is being compromised here?
CYNTHIA DWORK: I think it’s quite a challenge to articulate the nature of a probabilistic guarantee. You know, I need to improve my ability to teach the public understanding of risk here.
[LAUGHTER]
IRA FLATOW: So would you say then they shouldn’t be worried? I’ll put it very simply like that. Don’t worry, they’re not giving away who you are.
CYNTHIA DWORK: Yes, that’s what I would say.
[LAUGHTER]
IRA FLATOW: OK, that’s very simple. Now, I understand– let’s talk about Microsoft. I know Microsoft is working with a power company in California to implement some of this differential privacy technology. Is that right? What does that have to do with the power grid?
CYNTHIA DWORK: Well, your individual power consumption can actually reveal quite a bit about you. Apparently, it can even reveal which movies you’re watching, because the patterns of when the screen lights up are very distinctive for a movie, and that takes a certain amount of power. So the California Public Utilities Commission requires certain reporting on smart grid data. And this is aggregate reporting.
And there’s a question about how the data ought to be aggregated. And the administrative law judge made one proposal, which is, of course, the law. But a power company in Southern California is exploring using differential privacy as an easier and more powerful technology to at least comply with the spirit of the law.
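The transcript doesn’t say which mechanism the utility is exploring, but a common building block for this kind of aggregate reporting is the Laplace mechanism: publish the true total plus noise whose scale is the maximum influence of any one household divided by the privacy parameter epsilon. A hedged sketch with made-up numbers:

```python
import random

def noisy_total(consumptions_kwh: list[float],
                epsilon: float = 0.1,
                cap_kwh: float = 50.0) -> float:
    """Release a neighborhood's total consumption with Laplace noise.

    Each household's contribution is capped so no single household can shift
    the total by more than cap_kwh; the noise scale is cap_kwh / epsilon.
    """
    total = sum(min(c, cap_kwh) for c in consumptions_kwh)
    scale = cap_kwh / epsilon
    # The difference of two exponential samples is a Laplace(0, scale) sample.
    noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
    return total + noise
```

A smaller epsilon means more noise and stronger protection; a larger epsilon means sharper totals and weaker protection, which is the tension the rest of the conversation turns on.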
IRA FLATOW: I see. As the internet of things takes off, we’re going to have sensors in everything we wear, we drive, things like that. Is this definition of differential privacy, is this something that we should get used to hearing, because this is how this data will be collected? Matthew?
CYNTHIA DWORK: I can’t predict the future.
MATTHEW GREEN: I hope so. I mean if companies that are building IoT, Internet of Things, devices start building in privacy protections of any kind, we’re going to be a lot better off than we are today.
IRA FLATOW: And can it be made stronger? Or what kind of research can you do on it to make it better?
CYNTHIA DWORK: So any kind of disclosure leaks a little bit of information. And there’s a fundamental law of information recovery, which says that overly accurate estimates of too many statistics can eventually destroy privacy completely. And this can no more be circumvented than the laws of physics can. The goal of algorithmic research on differential privacy is to postpone this inevitability and to push to the theoretical limits.
IRA FLATOW: So you have to decide how much– I understand what you’re saying. There’s a trade off here between collecting more data and less privacy or more privacy and less data. And you want to find a comfort zone there, because it will start leaking some of that information about the person. Would that be right, Matthew?
MATTHEW GREEN: Right, so eventually, if I ask you the same question every day, and you answer that question, even if you randomize your answers and add noise, over time I’m going to learn something about you if you continue to answer honestly. And the danger there, of course, is that there are other questions you could ask me that are correlated with that first question. So instead of asking me whether I stole from my employer, you could ask me whether I have ever stolen from anybody. And so the challenge there is actually answering all of those very specific questions: how often should I ask the question, how should the question be phrased, and what do I do to prevent that information from eventually being tied to a person?
IRA FLATOW: Because if you ask it enough times, you might get better answers. You have to actually, I guess, figure out where that differential point is about the question and the answer and how much–
CYNTHIA DWORK: So what differential privacy lets you do is it lets you measure the cumulative privacy loss over many analyses, so that you can scale your noise accordingly in order to stay safe. But eventually, your answers will become pure noise.
IRA FLATOW: It will become pure noise, which–
CYNTHIA DWORK: Eventually.
IRA FLATOW: –which means the answers are worthless?
CYNTHIA DWORK: Overly accurate estimates of too many statistics are blatantly non-private. There’s no getting around that.
IRA FLATOW: So if you become too accurate, you give up privacy. But so what you want–
CYNTHIA DWORK: And it has nothing to do with differential privacy. This is just a fact.
IRA FLATOW: OK, so then you want to have enough privacy, but not too much, so that you can get some of the information out. And we have to, you and the technologists have to decide where that comfort zone is.
CYNTHIA DWORK: Where the comfort zone is, is an interesting question. If I were to say to you, is a week a long time? What’s the answer to that question? I don’t know; it depends. It depends on the context. And humans have developed some notions about the value of time. Similarly, we have a particular measure of privacy loss. And we need to develop a real understanding of what these numbers mean and in which contexts.
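The “particular measure of privacy loss” Dwork refers to is conventionally called epsilon, and the cumulative loss she mentioned earlier can be tracked explicitly. A minimal sketch under the simplest rule, basic composition, where the epsilons of successive queries add up (the class and numbers are illustrative; tighter composition theorems exist):

```python
class PrivacyBudget:
    """Track cumulative privacy loss under basic composition:
    each query spends some epsilon, and the spends simply add up."""

    def __init__(self, total_epsilon: float):
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon: float) -> bool:
        """Record the spend and return True if the query fits in the budget."""
        if self.spent + epsilon > self.total_epsilon:
            return False   # budget exhausted: refuse to answer, or answers become pure noise
        self.spent += epsilon
        return True

# Example: a total budget of 1.0 allows twenty queries at epsilon = 0.05 each,
# after which further answers must be refused or reduced to pure noise.
budget = PrivacyBudget(total_epsilon=1.0)
```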
IRA FLATOW: In other words, would I be willing to give up some of the convenience of texting ahead and having it guess what I want, but get a little bit more privacy?
MATTHEW GREEN: Well, to be clear, these systems right now are opt-in. So you do have the option to opt out if you feel like you don’t trust this technology now. But just to be clear, on a daily basis, we type things into Google that we probably wouldn’t tell our closest friends. So anything that improves over what’s currently being done in the tech industry is probably going to be a big improvement.
IRA FLATOW: All right, we’re going to leave it at that. Thank you both, Cynthia Dwork, co-creator of differential privacy, distinguished scientist at Microsoft. She’s based out there in Mountain View, California. Matthew Green, assistant professor at the Information Security Institute on the other coast at Johns Hopkins University in Baltimore. Thank you both for taking time to be with us today, and have a happy holiday weekend.
Copyright © 2016 Science Friday Initiative. All rights reserved. Science Friday transcripts are produced on a tight deadline by 3Play Media. Fidelity to the original aired/published audio or video file might vary, and text might be updated or amended in the future. For the authoritative record of Science Friday’s programming, please visit the original aired/published recording. For terms of use and more information, visit our policies pages at http://www.sciencefriday.com/about/policies.