DATA MINING FOR INTRUSION DETECTION
by Neal Rothleder, MITRE
In the Network Operations Center of the future, the security analyst will
come to work in the morning, sit down with a cup of coffee, and press the
"What's New?" button on the network monitoring and analysis screen. A list of
suspicious incidents and attempted intrusions (more commonly called "attacks")
on the network will appear. Perhaps there is a file transfer at 2 a.m. from a
host that usually only has activity during business hours. The analyst will
then investigate these incidents to identify them as, for example, "attack" or
"false alarm." The analyst will also be presented with distilled descriptions
of attacks that he or she had identified the previous day.
An essential technology in this scenario will be data mining (DM). DM
analysis will determine the bounds for normal network activity, and DM
techniques will enable the software to spend the night determining which
characteristics of previously identified attack activity distinguish it from
normal network usage.
To understand the improvement this will represent, it is necessary to
understand the current network intrusion detection (ID) environment. Software
"sensors" deployed along the network record activity: the initiation of a
World Wide Web connection from host A to host B, for example, or a single
outside host trying to connect to every MITRE host. Each sensor records
certain important pieces of information about this activity, such as the time
of day and the duration of the connection. This information is stored in a
database that easily accrues millions of records each day. On a regular basis,
security analysts sift through this data looking for the most serious attacks.
There can be thousands of suspicious activity alarms and each requires further
analysis to fully understand its purpose. Moreover, as commercial ID software
currently favors heightened sensitivity, many of the alerts generated are
false alarms and result in wasted time.
One of the most serious limitations in identifying and describing new
attacks is the sheer volume of data: security experts cannot thoroughly
examine every alert. And, as data collection grows with increased network
usage, little is being done to mitigate the problem by analyzing which data
is most relevant and which is unnecessary to collect.
This area of data overload is where data mining can make its most
significant contribution. A number of MITRE research projects have begun to
explore the use of DM to address data overload in ID by taking one of two
basic approaches: profiling or classification. In profiling, the goal is to
establish some notion of "normal" and then look for deviations from that. In
classification, we take known attacks and try to determine meaningful features
that distinguish that set of traffic from the remainder of the traffic.
Of these two approaches, classification has been used less often in the ID
environment. This is because it is crucial for classification analysis that
there be adequate collections of data representing both attacks and
non-attacks. Because this type of analysis is new to the ID world, rarely is
this information collected in the proper form. For example, when the recent
Knowledge Discovery and Data Mining (KDD) Cup--an annual competition at the
preeminent technical conference in the data mining industry--challenged
contestants to classify attacks in network activity logs, it had to enhance
actual network data with attacks artificially generated according to
predetermined attack signatures. Without explicit labels on records known to
be attacks, it has been nearly impossible for classifiers to learn to
discriminate between attacks and non-attacks.
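To make the classification idea concrete, the following is a minimal sketch of a toy one-feature rule learner applied to labeled records. It is not the method used in any MITRE project or in the KDD Cup; the record fields (service, hour_band, label), the sample data, and the function names are all hypothetical and chosen only for illustration. The sketch shows why labels matter: without the "label" field there is nothing for the learner to count against.

```python
from collections import Counter, defaultdict

# Hypothetical labeled connection records. The field names and values are
# illustrative assumptions, not taken from any real sensor format.
RECORDS = [
    {"service": "ftp", "hour_band": "night", "label": "attack"},
    {"service": "ftp", "hour_band": "day", "label": "legitimate"},
    {"service": "http", "hour_band": "day", "label": "legitimate"},
    {"service": "http", "hour_band": "day", "label": "legitimate"},
    {"service": "telnet", "hour_band": "night", "label": "attack"},
    {"service": "http", "hour_band": "night", "label": "attack"},
    {"service": "ftp", "hour_band": "day", "label": "legitimate"},
    {"service": "telnet", "hour_band": "day", "label": "legitimate"},
]


def learn_one_feature_rule(records, features, label_key="label"):
    """Pick the single feature whose values best predict the label.

    For each feature, map every observed value to its majority label and
    count how many training records that mapping classifies correctly.
    Returns (feature, value-to-label rule, number correct).
    """
    best = None
    for feature in features:
        counts = defaultdict(Counter)
        for rec in records:
            counts[rec[feature]][rec[label_key]] += 1
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        correct = sum(c[rule[value]] for value, c in counts.items())
        if best is None or correct > best[2]:
            best = (feature, rule, correct)
    return best


def classify(record, feature, rule, default="legitimate"):
    """Apply the learned rule; unseen feature values fall back to default."""
    return rule.get(record[feature], default)


feature, rule, correct = learn_one_feature_rule(RECORDS, ["service", "hour_band"])
print(feature, rule, correct)
```

On this made-up data the learner settles on the time-of-day feature, because it separates the labeled attacks from the labeled legitimate traffic better than the service does. Real classifiers combine many features, but they depend on the same labeled examples.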
MITRE's current "Data Mining in ID" project is starting to address this
deficiency by enabling security analysts to tag important records in the
database and assign them to meaningful classes (attack, probe, legitimate,
etc.). By providing the necessary capabilities for labeling attacks and a
better way to maintain the history of intrusion behavior, this work represents
a significant enhancement to the existing security infrastructure. In the near
future, this labeled data will be used to explore and test various data mining
classification techniques. This project has also begun to perform profiling on
individual hosts. This profiling analysis can operate on the basic network
traffic data that is already collected. The hope is that by looking at the
traffic to and from specific machines, unusual activity can be identified. The
initial approach involves doing simple statistical analyses of isolated
features. For example, the chart below shows a 30-day summary of the frequency
of File Transfer Protocol (FTP) connections to a particular host for each hour
of the day. Notice that the activity from 1 a.m. to 2 a.m. is outside the
hours during which the vast majority of connections are made; analysts should
be alerted so they can investigate that activity further. The next stage of
this project will use data clustering techniques to identify more
sophisticated partitions of common activity for that host. Then, traffic that
does not "fit" into any of the normal groups will be reported to the security
analyst for further investigation.
[Figure: Thirty-day summary of File Transfer Protocol (FTP) connections to one host, by hour of day.]
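The simple per-hour statistical analysis described above might look something like the following sketch. It assumes each connection has already been reduced to its hour of day (0 to 23); the sample data, the function names, and the 2 percent threshold are illustrative assumptions, not values from the project.

```python
from collections import Counter


def hourly_profile(hours):
    """Count connections for each hour of the day (index 0-23)."""
    counts = Counter(hours)
    return [counts.get(h, 0) for h in range(24)]


def rare_hours(profile, max_share=0.02):
    """Return hours that saw some traffic, but at most max_share of the total.

    The 2 percent default is an arbitrary illustrative threshold.
    """
    total = sum(profile)
    if total == 0:
        return []
    return [h for h, n in enumerate(profile) if 0 < n <= max_share * total]


# Hypothetical 30-day summary for one host: 20 FTP connections in each
# business hour (9 a.m. through 5 p.m.), plus two connections between
# 1 a.m. and 2 a.m.
hours = [h for h in range(9, 18) for _ in range(20)] + [1, 1]

profile = hourly_profile(hours)
print(rare_hours(profile))
```

Here the only hour flagged is the one between 1 a.m. and 2 a.m., which is the kind of activity an analyst would be asked to look at more closely. A single isolated feature like this is easy to compute from data that is already collected, which is why it makes a reasonable first step before clustering.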
In other emerging work, MITRE is addressing the issue of false alarms
produced by current ID sensors. This work uses data mining to look for
recurring sequences of alarms to help understand which alarms might be the
result of legitimate usage. For example, alarm "A" may be frequently followed
by alarm "B" as a result of legitimate operations. Once this is recognized,
future occurrences of this sequence can be filtered out. In joint work with
George Mason University, MITRE is working on an approach that includes
filtering out data that captures "common" connection activity. It makes use of
association rule detection to identify frequent host pairings. For example,
perhaps host X regularly connects to host Y four times a day. Once these
common connections have been removed, the remaining data is fed to a
classification system to detect attacks. This work has been successfully
tested on synthetically generated data, and it will soon be applied to actual
network data.
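As a rough illustration of filtering out common connections, the sketch below counts how often each (source, destination) host pair appears and removes the pairs that occur frequently. This is plain support counting, the first step of association rule mining, rather than a full rule miner, and it is not the joint MITRE and George Mason University implementation. The host names, the sample connections, and the min_support value are hypothetical.

```python
from collections import Counter


def frequent_pairs(connections, min_support):
    """Return the (source, destination) pairs seen at least min_support times."""
    counts = Counter(connections)
    return {pair for pair, n in counts.items() if n >= min_support}


def filter_common(connections, common):
    """Drop every connection whose host pair is in the common set."""
    return [conn for conn in connections if conn not in common]


# Hypothetical one-day log: host X connects to host Y four times, host A
# connects to host B three times, and host Z connects to host Y once.
connections = [("X", "Y")] * 4 + [("A", "B")] * 3 + [("Z", "Y")]

common = frequent_pairs(connections, min_support=3)
remaining = filter_common(connections, common)
print(remaining)
```

Only the single connection from Z to Y survives the filter, so a downstream classifier would see far fewer records. The choice of support threshold is a judgment call: set too low, it could hide a repeated attack among the "common" traffic.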
The fields of network intrusion detection and data mining are just beginning
to work together. MITRE research is starting to demonstrate that the network
activity data whose sheer quantity has been one of the primary challenges to
current ID efforts can be amenable to analysis via a variety of data mining
techniques. The application of those techniques has already begun to prove
useful in filtering out false alarms and characterizing normal connection
pairs. In the near future, data mining should be able to help us understand
what normal behavior is for individual host machines and better discriminate
network attacks from innocuous activity.
For more information, contact Bill Hill at 703-883-6416 or bill@mitre.org.