TEXT MINING BY FILTER COMPOSITION
by Inderjeet Mani, Linda Van Guilder, Chris Clifton, and Kristian Concepcion,
MITRE
We all have access to lots of information, but are seldom in a position to
exploit it effectively for decision making. In times of crisis, this problem
can be especially severe.
Imagine you are a senior analyst besieged with news and intelligence reports
of a hostage situation at an American embassy. Who is in charge of the
terrorists? Is their group likely to attack other embassies? When the
president calls for an emergency meeting, your boss is asked to make a
20-minute presentation that profiles the terrorist group and develops
arguments describing their likely negotiation positions and the potential for
further attacks.
How can computers help this process, which relies so critically on
collective human understanding and insight, in the midst of the furor of a
crisis?
Genoa, a project of the Defense Advanced Research Projects Agency (DARPA),
is aimed at improving analysis and decision making in crisis situations by
providing tools that allow analysts to collaborate in developing structured
arguments in support of particular conclusions and to help predict likely
future scenarios. Genoa also provides knowledge discovery tools that mine news
and intelligence sources for important patterns, trends, and anomalies, and so
uncover nuggets of valuable information.
One of the challenges Genoa faces is to make it easy for analysts to take
knowledge gleaned with the use of these discovery tools and embed it in a
concise and useful form in an intelligence product, as evidence in support of
structured arguments. MITRE has been tasked with developing a summarization
filter architecture to address this challenge. MITRE’s approach relies on
component-based software composition, i.e., assembly of software units that
have contractually specified interfaces and that can be independently deployed
and reused. This component-based approach, which leverages XML and JavaBeans
technologies, allows the analyst to select various text mining tools from a
menu and, with just a few mouse clicks, assemble them to create a complex
filter that fulfills whatever information discovery function is currently
needed. A filter here is a tool that takes input information and turns it into
some more abstract and useful representation. Filters can also weed out
irrelevant parts of the input information.
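
The idea of assembling small filters into a larger one can be sketched in a few lines of Python. This is only an illustration of composition under simplifying assumptions, not MITRE's XML and JavaBeans architecture: a filter is modeled as a plain function over a list of document strings, and the names compose, keep_mentions, and first_sentence are hypothetical.

```python
from typing import Callable, List

# Assumption: a filter maps a list of documents (plain strings) to a list of strings.
Filter = Callable[[List[str]], List[str]]


def compose(*filters: Filter) -> Filter:
    """Chain filters left to right into a single filter."""
    def composed(documents: List[str]) -> List[str]:
        result = documents
        for f in filters:
            result = f(result)
        return result
    return composed


def keep_mentions(term: str) -> Filter:
    """Hypothetical filter: weed out documents that never mention `term`."""
    def apply(documents: List[str]) -> List[str]:
        return [d for d in documents if term.lower() in d.lower()]
    return apply


def first_sentence(documents: List[str]) -> List[str]:
    """Hypothetical filter: abstract each document to its first sentence."""
    return [d.split(". ")[0].rstrip(".") + "." for d in documents]


# Compose a relevance filter with an abstraction filter.
embassy_brief = compose(keep_mentions("embassy"), first_sentence)
```

Because every filter shares the same input and output type, any filter can be swapped for another or reordered without changing the rest of the chain, which is the property a menu-driven assembly tool depends on.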
For example, in response to the crisis situation discussed earlier, an
analyst might use these mining tools to discover important nuggets of
information in a large collection of news sources. This use of data mining
tools can be illustrated by looking at TopCat, a MITRE-developed system that
identifies different topics in a collection of documents and displays the key
"players" for each topic. TopCat uses association rule mining technology to
identify correlations among people, organizations, locations, and events
(shown below in blue, violet, green, and red, respectively). Clustering these
correlations creates topics such as the three in the following figure, built
from six months of global news from several print, radio, and video
sources--over 60,000 news stories in all.
Topics derived from clustering 60,000 news stories.
This allows the analyst to discover, say, an association between people
involved in a bombing incident, which gives a starting point for further
analysis, e.g., do McVeigh and Nichols belong to a common organization? This,
in turn, can lead to new knowledge that can be leveraged in the analytical
model used to help predict whether this terrorist organization is likely to
strike elsewhere in the next few days. Similarly, the third topic reveals the
important players in an election in Cambodia. This discovered information can
be leveraged to help predict whether the situation in Cambodia is going to
explode into a crisis that affects U.S. interests.
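
The association step behind such topics can be illustrated with a toy co-occurrence miner. This is a minimal sketch that assumes each story has already been reduced to a set of entity names. It computes only pairwise support and confidence, omits the clustering stage, and should not be read as TopCat's actual algorithm. The function names and the sample stories are invented for illustration.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, FrozenSet, List, Set


def frequent_pairs(
    stories: List[Set[str]], min_support: float
) -> Dict[FrozenSet[str], float]:
    """Return entity pairs whose co-occurrence rate meets min_support."""
    pair_counts: Counter = Counter()
    for entities in stories:
        for a, b in combinations(sorted(entities), 2):
            pair_counts[frozenset((a, b))] += 1
    total = len(stories)
    return {
        pair: count / total
        for pair, count in pair_counts.items()
        if count / total >= min_support
    }


def confidence(stories: List[Set[str]], antecedent: str, consequent: str) -> float:
    """Estimate how often `consequent` appears in stories that mention `antecedent`."""
    with_antecedent = [s for s in stories if antecedent in s]
    if not with_antecedent:
        return 0.0
    return sum(consequent in s for s in with_antecedent) / len(with_antecedent)


# Invented toy data: each set stands for the entities tagged in one story.
stories = [
    {"McVeigh", "Nichols", "bombing"},
    {"McVeigh", "Nichols"},
    {"McVeigh", "trial"},
    {"Pol Pot", "Cambodia"},
]
pairs = frequent_pairs(stories, min_support=0.5)
```

With this toy data, only the McVeigh and Nichols pair reaches the 0.5 support threshold, which is the kind of association an analyst would then investigate further.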
Now, suppose an analyst wants to know more about the people in the last
topic. Instead of reading more than 6,000 words of text from 10 articles on
the topic, the analyst can compose a topic detection filter like TopCat with a
biographical summarization filter that gleans facts about key persons from the
topic’s articles. The result of the composition is a short, 86-word summary,
seen below.
An 86-word summary of the news collection.
This summarization filter, developed under DARPA funding, identifies and
aggregates descriptions of people from a collection of documents by means of
an efficient syntactic analysis, the use of a thesaurus, and some simple
natural language generation techniques. It also extracts from these documents
salient sentences related to these people by weighting sentences based on the
presence of the names of people as well as the location and proximity of terms
in a document, their frequency, etc. (TopCat and a summarization filter
perform a similar function for MITRE's Broadcast News Navigator, which applies
them to continuously collected broadcast news in order to extract named
entities and keywords and to identify the transcripts and sentences that
contain them. For further information see
www.mitre.org/pubs/edge/july_97/second.htm.) The summarization filter
includes a parameter to specify the target length or the reduction rate,
allowing summaries of different lengths to be generated. For example, allowing
a longer summary would mean that facts about other people (e.g., Pol Pot)
would also appear in the summary.
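
The sentence-weighting and length-control ideas described above can be sketched as follows. This is an assumed, simplified scoring scheme, not the DARPA-funded filter itself: it covers only name presence, term frequency, and sentence position, and leaves out the syntactic analysis, thesaurus, and generation steps. The weights and the summarize function are hypothetical choices made for illustration.

```python
import re
from collections import Counter
from typing import List


def summarize(text: str, names: List[str], reduction_rate: float = 0.5) -> str:
    """Keep the highest-scoring sentences, preserving their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(index: int, sentence: str) -> float:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        # Average term frequency, so long sentences are not favored outright.
        tf = sum(freq[t] for t in tokens) / max(len(tokens), 1)
        # Assumed weights: +2 per tracked name, small bonus for early sentences.
        name_hits = sum(name.lower() in sentence.lower() for name in names)
        position_bonus = 1.0 / (index + 1)
        return tf + 2.0 * name_hits + position_bonus

    # reduction_rate is the fraction of sentences to drop; keep at least one.
    keep = max(1, round(len(sentences) * (1.0 - reduction_rate)))
    ranked = sorted(
        range(len(sentences)), key=lambda i: score(i, sentences[i]), reverse=True
    )
    return " ".join(sentences[i] for i in sorted(ranked[:keep]))
```

Lowering reduction_rate lets more sentences through, so facts about less prominent people start to appear, mirroring the longer-summary behavior described in the text.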
This example illustrates how mining a text collection using a composed
summarization filter can reveal important associations at varying levels of
detail. The component-based approach also allows these filters to be easily
integrated into intelligence products such as reports and briefings. To help
analysts present structured arguments and supporting information to decision
makers, Genoa provides an electronic notebook briefing tool (the Virtual
Situation Book) developed by Global Infotek. Summarization filters can be
associated with regions on a page in a briefing book that can be shared across
a community of collaborating analysts. When a document or a folder of
documents is dropped onto a region associated with a filter, the filter
applies and the textual summary or visualization appears in that region.
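
The drop-onto-a-region behavior can be modeled as a region object that holds a filter and runs it whenever documents arrive. This is a hypothetical sketch, not the Virtual Situation Book's actual interface. The Region class, its drop method, and the headline_digest filter are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Region:
    """A page region bound to a filter (hypothetical model)."""
    name: str
    attached_filter: Callable[[List[str]], str]
    content: str = ""

    def drop(self, documents: List[str]) -> None:
        """Apply the attached filter to the dropped documents and show the result."""
        self.content = self.attached_filter(documents)


def headline_digest(documents: List[str]) -> str:
    """Hypothetical filter: join the first sentence of each document."""
    return "; ".join(d.split(".")[0] for d in documents)


region = Region("key players", headline_digest)
region.drop(["Talks resumed. More later.", "Polls opened. Turnout high."])
```

Keeping the filter as a field of the region means a shared briefing page can be re-run against a new folder of documents without any change to the page itself.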
For more information, contact Inderjeet Mani at 703-883-6149 or
imani@mitre.org.