HOW MUCH CAN DATA ANALYSIS BE AUTOMATED?
By Will Dwinnell
In the not too distant past, people wanting to perform a simple linear
regression had to perform many calculations 'by hand' (even with a calculator,
this is tedious work). The amount of data which could be manipulated in this
fashion was severely limited and performing multiple linear regression was
even more involved. In the late 1970s, it became possible for a moderately
competent programmer with affordable hardware to produce a program which would
carry out this exercise on the computer. In the early 1980s, VisiCalc became
available, providing linear regression capabilities to people possessing much
less technical skill. Ultimately, linear regression became encapsulated as a
single, built-in spreadsheet function. Today, of course, not only will
packaged software perform multiple linear regression, run stepwise variable
selection and calculate significance tests, but it will drive even more
powerful modeling and analysis methods, such as neural networks and fractal
dimension estimation.
Modern commercial data analysis software, often driven by expert systems,
will perform many diagnostic tests and generate appropriate data
summarizations automatically. Current data modeling tools can select many
model parameters autonomously using powerful error resampling methods, such as
k-fold cross validation and bootstrapping. The trend is obvious: increasing
amounts of analytical work are being performed by computers. The question
posed by this fact should be equally obvious: how much can data analysis be
automated? Is it possible that all facets of data analysis can be handled by
the machine without human assistance?
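As a rough illustration of the error resampling idea mentioned above, the following Python sketch uses k-fold cross validation to choose between two candidate models. The data, the two candidate models and every function name here are made up for illustration; this is a minimal sketch under those assumptions, not the procedure used by any particular commercial tool.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx if sxx else 0.0
    return my - b * mx, b


def fit_constant(xs, ys):
    """Baseline model that ignores x: intercept = mean(y), slope = 0."""
    return sum(ys) / len(ys), 0.0


def k_fold_mse(xs, ys, fit, k=5):
    """Mean squared error measured only on points held out of each fit."""
    n = len(xs)
    total = 0.0
    for fold in range(k):
        test_idx = set(range(fold, n, k))  # every k-th point is in this fold
        train_x = [x for i, x in enumerate(xs) if i not in test_idx]
        train_y = [y for i, y in enumerate(ys) if i not in test_idx]
        a, b = fit(train_x, train_y)
        total += sum((ys[i] - (a + b * xs[i])) ** 2 for i in test_idx)
    return total / n


# Made-up data: a linear trend plus a small deterministic wiggle.
xs = [float(i) for i in range(20)]
ys = [2.0 + 0.5 * x + (0.3 if i % 2 else -0.3) for i, x in enumerate(xs)]

scores = {
    "constant": k_fold_mse(xs, ys, fit_constant),
    "linear": k_fold_mse(xs, ys, fit_line),
}
best = min(scores, key=scores.get)  # model with the lowest held-out error
print(best, scores)
```

The selection step is entirely mechanical: the software needs no judgment to pick the candidate with the lowest held-out error, which is why this kind of choice has been easy to automate.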
My review of several data modeling tools, published several years ago,
included the following passage:
'While any of these products is capable of performing without preprocessing
or expert guidance, none offers a true "one-button solution" for users.
Knowledge of basic statistical and modeling concepts would benefit users of
such systems, as data analysis and preprocessing make the tool's job easier,
and post-processing and diagnostics ensure the quality of model output. In
particular, questions of sampling and overfitting remain issues that users
must address with these new modeling tools.'
Certainly, at this point in time, some progress has been made toward
mechanizing several of the tasks mentioned, particularly in regard to avoiding
overfitting. Data analysis tasks may be divided by whether or not they may be
performed by humans and whether or not they may be performed by computers.
This gives us the following four categories of tasks, those which:
1. Require human performance (the manual)
2. Require computer performance (the necessarily automatic)
3. Can be handled by computer or human (the optionally automatic)
4. Cannot be handled by man or machine (the unfeasible)
Naturally, the lines which divide these categories change over time. Before
the availability of computers to do this kind of work, all data analysis tasks
by definition were either handled by humans or they were unfeasible. The
introduction of computing machinery and its subsequent growth in power has
eaten into both of these areas, automating increasing amounts of manual work
and claiming previously unfeasible work as computer-only work. In fact, new
data analysis tasks have even been invented specifically for computers. To
answer the question at hand, it will be helpful to understand why different
types of tasks have fallen to the computer.
The most natural applications of computers to analytical work have been
those which require mathematical manipulation of large amounts of data.
Indeed, our very concept of what constitutes a 'large' data set has been
radically transformed by the growth of the information storage and
manipulation capabilities of our machines. Today, it is possible to buy several
gigabytes of hard-disk storage for a few hundred dollars and an increasing
number of organizations maintain terabyte-level storage.
The core math and logic of small-scale data analysis which was formerly
performed manually has largely if not completely been claimed by the
computers. On my shelf sit several books on 'pencil and paper' data analysis
which describe various techniques and tricks for getting the most from your
manual efforts. Such texts include interesting higher-level analytical and
mathematical material but the skills which they describe for low-level,
detailed processing of the numbers are mostly outdated. Tasks such as
building linear regressions, calculating standard deviations and finding
medians have irrevocably been absorbed by computer software.
Of course, many types of work which were once unfeasible were simply
extensions of work performed by humans. Calculating means or correlation
coefficients is impractical for human analysts when data sets become too
large. Theoretically, it is possible for humans to do such things, but it is
simply too costly and time-consuming for them to do so. Such drudgery has
quickly been snapped up by the machines. Finding the average of a million
numbers is quite easy today, even with what is now considered 'low-end'
hardware.
These two types of rote work, the previously manual and the previously
unfeasible, may be considered as one, distinguished only by how much tedium
humans will or can tolerate. Additionally, computers have handled an
escalating level of decision-making in the actual analysis process. This has
been true at the lowest level, as in converting class data to a series of
dummy variables or automatically identifying statistical outliers using simple
heuristics. It has also been true at a higher level: some commercial offerings
embody a great deal of expertise for handling data.
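The two low-level decisions mentioned above can be sketched in a few lines of Python. The function names and sample values are hypothetical, and the outlier rule shown (flagging values more than 1.5 interquartile ranges beyond the quartiles) is only one common heuristic, chosen here as an assumption; real products may use different rules.

```python
from statistics import quantiles


def to_dummies(values):
    """Convert class data to one 0/1 indicator column per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


def flag_outliers(values, k=1.5):
    """Flag values lying more than k interquartile ranges outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Made-up sample data for illustration only.
categories, rows = to_dummies(["red", "green", "red", "blue"])
flagged = flag_outliers([10, 12, 11, 13, 12, 11, 95, 12])
print(categories, rows, flagged)
```

Both routines follow fixed rules with no reference to what the data mean, which is what makes them natural candidates for automation.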
Scenario from Cognos is a data modeling tool which will automatically set
variable types for imported data, select variables for inclusion in the model,
set model parameters and control model complexity. Additionally, Scenario
generates reports in English which summarize its findings. As another example,
Forecast Pro from Business Forecast Systems will automatically choose an
appropriate forecasting method using an expert system which analyzes the data.
Tools like these provide a great deal of data analysis sophistication within
an automatic framework. Computers can definitely handle a large portion of the
data analysis job, even at a high level of decision-making. As with other
fields within computing, when the problem-solving process can be clearly
expressed as an algorithm, the computer can be programmed to perform the task.
To this point, a distinction has been made only between what humans and
computers can and cannot do. Clearly, though, humans possess a range of
abilities in this department. In the book "The Electronic Cottage", Joseph
Deken writes about the ability to deal with abstract data as though it were
another sense. Some humans, obviously, have a more developed 'sense' of
information than others. This is important in that some people possess a
greater ability to interpret, understand and absorb abstract information than
others. To generalize this idea, some people (such as statisticians) will
use available data analysis systems more effectively than others.
Automation will, to some extent, help people with less of this 'sense'
utilize information. Deken touches on the idea of the 'computer as
consultant': 'By using a computer, you can obtain not only new data but
valuable built-in "consultant service" to help you analyze it.' What Deken
describes is a vision of interactive exploration and understanding which is
becoming a reality on today's software tools (On-Line Analytical Processing
and data visualization come to mind). (This is fairly visionary for a book
which was published in 1981!) Deken offers a warning, though, about using only
computer analysis -- specifically in relation to causal relationships:
'Causal relationships are the strongest type you can hope to find, and are
the backbone of engineering and science. A word of caution is in order here,
as you begin to flex your newfound computer capabilities: You will have
unprecedented power to find relationships and associations. Once found, an
association's predictive value can only be established by statistical or
empirical induction. The causal nature of an association can only be
investigated by experiment. All the computer calculation in the world cannot
vault you over a single one of these inductive or experimental hurdles.'
Thus, there are at least some things which cannot be handled by the
computer. The statistical induction which Deken mentions can be implemented
on the computer in the form of things like hold-out testing. His comment about
causal associations, though, brings up an important area which computers do
not cover: the context of the data analysis problem. The difficulty is that
this is a conceptual issue and not a mechanical one. A computer can detect
statistically significant associations between variables, but it cannot
determine whether that association has any real-world significance. For
example, a statistical relationship may be discovered between weather and
sales, but whether that is important to us depends on the context of our
problem. A business analyst would be better suited to answer questions of
real-world significance. For computers to answer such questions would require
a great deal more intelligence than they currently possess. Without an
abstract knowledge of what drives a business, simply knowing that weather
affects sales may be of no use.
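The hold-out testing mentioned above is the mechanical half of this problem, and it can be sketched briefly in Python. The "weather" and "sales" figures below are synthetic, and the names, split fraction and seed are assumptions made for the example. The sketch checks whether an association found in one part of the data still predicts well on data the model never saw; it says nothing about whether the association matters to the business.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx if sxx else 0.0
    return my - b * mx, b


def mse(xs, ys, a, b):
    """Mean squared error of the line y = a + b*x on the given points."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Synthetic data: "sales" rise with "temperature", plus random noise.
temperature = [float(i) for i in range(40)]
sales = [3.0 + 2.0 * t + rng.gauss(0.0, 1.0) for t in temperature]

# Hold out a quarter of the cases; the model never sees them during fitting.
indices = list(range(len(temperature)))
rng.shuffle(indices)
cut = int(0.75 * len(indices))
train, test = indices[:cut], indices[cut:]

a, b = fit_line([temperature[i] for i in train], [sales[i] for i in train])

test_x = [temperature[i] for i in test]
test_y = [sales[i] for i in test]
holdout_error = mse(test_x, test_y, a, b)

# Baseline: always predict the average sales seen in the training cases.
train_mean = sum(sales[i] for i in train) / len(train)
baseline_error = mse(test_x, test_y, train_mean, 0.0)

print(holdout_error, baseline_error)
```

A hold-out error well below the baseline suggests the association has predictive value. Deciding whether that value is worth anything remains the analyst's job.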
As another example, consider a political model of voting patterns in
Pennsylvania: whether it is of any use in Georgia is doubtful, given the
different statistical universes involved. This is precisely the sort of thing
which analysis software has no knowledge of and no control over.
When the problem context is narrowly specified, greater automation is
possible through customization. Vertical applications for fraud detection or
credit scoring are good examples of this. The downside of this approach is the
sacrifice of generality: these systems have limited utility outside of their
intended application.
Another area which computers have not attacked is the creation of new data
analysis techniques. Things like Fourier transforms, wavelets and
morphological smoothing may be immensely useful in data analysis and may also
be automated once they have been discovered. Before their invention, however,
there is no way for computers to make use of them. It doesn't matter how
powerful the computer is: if it's running linear regressions on non-linear
data, then it can't compete with one running non-linear regressions.
Constructing new analysis methods is a creative act which thus far has been
the exclusive domain of humans. This is often a significant portion of the
data analysis job, particularly in modeling. Analysis often requires the
creation of new filters, measures or summaries of data for individual
projects.
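The point about running linear regressions on non-linear data can be shown with a small Python sketch. The data are made up, and the "non-linear" model here is simply a least-squares fit on a squared term, used as a stand-in for non-linear regression in general; all names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx if sxx else 0.0
    return my - b * mx, b


def sse(features, ys, a, b):
    """Sum of squared errors for predictions a + b*feature."""
    return sum((y - (a + b * f)) ** 2 for f, y in zip(features, ys))


# Made-up curved data: y = 1 + 0.5 * x^2, with x symmetric around zero.
xs = [float(x) for x in range(-5, 6)]
ys = [1.0 + 0.5 * x * x for x in xs]

# Method 1: straight line in x. No amount of computing power helps here.
a1, b1 = fit_line(xs, ys)
sse_line = sse(xs, ys, a1, b1)

# Method 2: the same least-squares routine applied to a squared term.
squared = [x * x for x in xs]
a2, b2 = fit_line(squared, ys)
sse_quad = sse(squared, ys, a2, b2)

print(sse_line, sse_quad)
```

The arithmetic is identical in both fits. The only difference is the idea of introducing a squared term, and that idea had to come from a person.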
Genetic programming has shown some promise of automatic generation of new
algorithms, and has demonstrated an ability to assemble relatively low-level
filters. Unfortunately, it has yet to prove itself capable of developing
anything as interesting as higher-order spectral analysis. Whether it will
ever be able to do this remains a matter of speculation. The ability to
innovate leaves humans ahead in the development of methods for such tasks as
preprocessing of training data, postprocessing of model results, actual
modeling of data and graphical display of data.
'One-button' data analysis eludes us and appears likely to do so for the
foreseeable future. Various mechanical activities necessary for data analysis
may be (and many have been) automated. It is apparent that automation extends
beyond the mechanical grinding of the data to higher-level decision-making.
There are, however, two areas of data analysis (at least) which seem unlikely
to be reproduced on computers any time soon: creation of new analysis methods
and understanding of problem context. These would require an enormous leap in
the amount of intelligence employed by computers in their work.
Software Review Mentioned: "Advanced Modeling Systems" by W. Dwinnell, AI
Expert (June 1994).
Texts on Manual Data Analysis: "Exploratory Data Analysis" by John Tukey,
published by Addison-Wesley, ISBN 0-201-07616-0.
"Empirical Equations and Nomography" by Davis, published by McGraw-Hill (no
ISBN; copyright 1943).
Book Mentioned: "The Electronic Cottage" by Joseph Deken, published by Morrow,
ISBN 0-688-00664-7.
Will Dwinnell works in quantitative analysis, especially machine learning
and pattern recognition. He has a background in statistics and neural
networks, and works as a data mining consultant near Philadelphia. Will writes
regularly and has been a Contributing Editor for "PC AI" magazine
(www.pcai.com/pcai) and The Gordian Institute quarterly newsletter
(www.gordianknot.com) since 1997. He has also been an invited speaker at
several technical conferences. Will received his B.S. in Economics and a minor
in Computer Science from Villanova University, and an M.B.A. with a
concentration in Operations from the University of Pittsburgh. Will can be
reached by e-mail at predictor@compuserve.com.