
Features - Enterprise Data Insights:
WITH FALSE NUMBERS, DATA CRUNCHERS TRY TO MINE THE TRUTH
By Anne Eisenberg - New York Times
Online merchants who ask nosy questions like that on surveys at their Web
sites have learned what usually honest visitors will do. Fib, most likely.
People give false answers to protect their privacy. Then, because the data is
so unreliable, companies can't use it to help them run their businesses.
Two I.B.M. researchers have devised software that seeks to get around this
information age impasse. Rakesh Agrawal and Ramakrishnan Srikant, computer
scientists at the I.B.M. Almaden Research Center in San Jose, Calif., have
devised a data-mining program that would cloak individual truthful answers
that people might enter once their trust was won but still recover important
characteristics of the overall group.
For instance, instead of recording the answer "41" to a nosy question like
"How old are you?" the software automatically adds a random number of years
within a specified range, say minus 30 to plus 30, to the answer. No record of
initial answers is kept. Then, using a series of mathematical guesses based
partly on how the initial data was randomized, the program gradually
reconstructs a realistic distribution of the age groups that responded: how
many people were 20 to 25, say, or 40 to 45. Demographic information like this
might be of great interest to a company in quest of 25-year-olds to buy its
sports cars or computer games.
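The masking step described above can be sketched in a few lines of Python. This is only an illustration, not the I.B.M. code: the function name mask_answer, the spread parameter and the sample ages are assumptions made for the example.

```python
import random

def mask_answer(true_value, spread=30, rng=random):
    """Return the answer plus a random whole number in [-spread, spread].

    Only the masked value is returned; the caller is expected to discard
    the true answer, since the article says no record of it is kept.
    """
    return true_value + rng.randint(-spread, spread)

rng = random.Random(0)          # seeded only so the sketch is repeatable
true_ages = [41, 23, 35, 58]    # hypothetical survey answers
masked_ages = [mask_answer(age, 30, rng) for age in true_ages]
print(masked_ages)
```

Each masked age can be off by as much as 30 years in either direction, so a single stored value says little about the person who entered it.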
Some inaccuracy results when the I.B.M. program approximates the actual
distribution of age, salary or other characteristics in such large data sets,
said Ann Cavoukian, the commissioner of information and privacy in Ontario.
"But in return for about 5 percent inaccuracy, you have a privacy model in
which individual answers are not used," she said. Programs like this one could
lead to greater truthfulness in the answers people volunteer on the Web, she
said, provided that they were willing to replace some of their native caution
with a bit of good will toward a company and its need for data-mining.
"Right now, the rate of falsification on Web surveys is extremely high," Dr.
Cavoukian said. Conservative estimates are 42 percent, but anecdotally the
rates are far higher, she added. "People are lying," she said, "and vendors
don't know what is false and accurate, so the information is useless."
Dr. Agrawal said that his way of reconstructing data was based on hiding the true
numbers, although not through the sort of lying practiced by ordinary people
confronting a questionnaire.
"When people lie randomly, and that is what they do now when they answer
questions, we get very poor results," he said. But by "adding random values to
true values," he said, "we can reconstruct a distribution that is very close
to the actual one." Dr. Srikant said, "We know a lot about the distribution of
these random values."
The random numbers generated by the computer could be distributed in a bell
curve, for instance, with most values clustered near zero and fewer at either
end. Or the computer could pick random numbers out of a hat, with the chances
of picking any one number the same as for any other.
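The two kinds of random numbers mentioned here can be compared with a short Python sketch. The standard deviation of 10 and the range of plus or minus 30 are assumptions chosen for illustration; the article does not say which settings the researchers used.

```python
import random

rng = random.Random(1)   # seeded only so the sketch is repeatable

def bell_curve_noise(sigma=10.0):
    """Noise from a bell curve centered on zero (sigma is an assumed value)."""
    return rng.gauss(0.0, sigma)

def uniform_noise(spread=30.0):
    """Noise where every value in [-spread, spread] is equally likely."""
    return rng.uniform(-spread, spread)

bell = [bell_curve_noise() for _ in range(10_000)]
flat = [uniform_noise() for _ in range(10_000)]

# Share of random values that land within 5 of zero under each scheme.
near_zero_bell = sum(abs(x) <= 5 for x in bell) / len(bell)
near_zero_flat = sum(abs(x) <= 5 for x in flat) / len(flat)
print(near_zero_bell, near_zero_flat)
```

With these assumed settings the bell curve puts a larger share of its values close to zero, while the uniform scheme spreads them evenly across the whole range.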
Using this information, Dr. Srikant said, the researchers make a first guess
at what the true distribution should be. Then the program crunches through the
analysis and produces a slightly better guess. This guess is crunched again,
and the process is repeated over and over again, getting closer and closer to
the actual distribution. "When you do this for 10,000 answers, the overall
distribution is likely to be accurate," Dr. Srikant said.
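The guess-and-refine loop can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the researchers' program: it assumes whole-number ages from 20 to 79, uniform noise of plus or minus 30, a flat first guess and an invented population, and the names reconstruct, noise_prob and by_decade are hypothetical.

```python
import random
from collections import Counter

SPREAD = 30                 # assumed noise: uniform whole number in [-30, 30]
AGES = range(20, 80)        # assumed set of possible true ages

def noise_prob(d):
    """Probability that the added random number equals d."""
    return 1.0 / (2 * SPREAD + 1) if -SPREAD <= d <= SPREAD else 0.0

def reconstruct(masked, iterations=100):
    """Refine a guess at the age distribution using only masked answers."""
    counts = Counter(masked)
    est = {a: 1.0 / len(AGES) for a in AGES}        # first guess: flat
    for _ in range(iterations):
        new = dict.fromkeys(AGES, 0.0)
        for w, c in counts.items():
            # How plausible each true age is for masked value w, given
            # the current guess and the known noise distribution.
            weights = {a: noise_prob(w - a) * est[a] for a in AGES}
            total = sum(weights.values())
            if total == 0.0:
                continue
            for a in AGES:
                new[a] += c * weights[a] / total
        norm = sum(new.values())
        est = {a: new[a] / norm for a in AGES}      # slightly better guess
    return est

def by_decade(dist):
    """Add up probabilities within each ten-year age group."""
    out = Counter()
    for a, p in dist.items():
        out[a // 10 * 10] += p
    return dict(out)

rng = random.Random(42)
# Hypothetical population: 70% in their twenties, 30% in their fifties.
true_ages = [rng.randint(20, 29) if rng.random() < 0.7 else rng.randint(50, 59)
             for _ in range(10_000)]
masked = [a + rng.randint(-SPREAD, SPREAD) for a in true_ages]

estimate = by_decade(reconstruct(masked))
truth = by_decade({a: c / len(true_ages) for a, c in Counter(true_ages).items()})
print(estimate)
print(truth)
```

The loop never sees true_ages; they are used here only to generate masked answers and to compare the result against the real age groups. Grouping by decade is a design choice for the sketch: with noise this wide, broad groups are easier to recover than single years of age.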
Johannes Gehrke, an assistant professor of computer science at Cornell
University who specializes in data mining, said the program was the first
effort to address in depth the challenge of reconstructing a distribution of
large data sets in the context of data mining.
"You know the record after randomization and you also know how you randomized
the record," he said. Those two pieces of information, along with a standard
statistical theorem called Bayes' rule, allow the program to estimate the
prior distribution.
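One common way to write such a Bayes' rule update for additive noise is shown below. The notation is an assumption introduced for illustration and does not come from the article: w_i is the i-th randomized answer out of n, q is the known distribution of the added random numbers, and P^{(j)} is the j-th guess at the distribution of true answers.

```latex
P^{(j+1)}(a) \;=\; \frac{1}{n} \sum_{i=1}^{n}
  \frac{q(w_i - a)\, P^{(j)}(a)}
       {\sum_{z} q(w_i - z)\, P^{(j)}(z)}
```

Each term in the sum is the probability, under the current guess, that the i-th respondent's true answer was a given the randomized value that was stored; averaging those probabilities over all respondents gives the next guess.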
Random perturbation, the formal name of the technique used by the I.B.M.
researchers to mask the original answers, satisfies the demand for privacy to
a greater degree than many other procedures available to organizations, said
David F. Andrews, who recently retired as a professor in the department of
statistics at the University of Toronto. "The idea that you can take data from
a population, add random noise to it and then recover important
characteristics from this perturbed data has a long history," he said.
Techniques that reconstruct distributions without revealing individual
information may be welcome not only to people filling out forms but also to
companies that ask touchy questions. "If companies have data and it escapes,
they could be liable for data breaches of security," Dr. Andrews said. "This
way, you can't be sued."
The program and related ones by other researchers may help companies explore
raw data presently closed to them, said Christopher W. Clifton, an associate
professor of computer science at Purdue University and author of a chapter on
security and privacy in the forthcoming LEA Handbook of Data Mining (Lawrence
Erlbaum Associates). "These programs ensure that the original data values
can't be reconstructed, but are still close enough to the real results to be
meaningful."
The I.B.M. program has been tested in the lab and a prototype is
available. Dr. Cavoukian said she hoped that businesses would soon come
forward to do beta tests of the software.
"Usually technology is used to invade privacy," she said. "I like this program
because here we are using technology to protect privacy."