
Features:
ADVANCED ANALYTIC-SEARCH TOOLS FOR STRATEGIC STRIKE
Commentary from the High-End Crusader
Starting with the monumental Branscomb/Klausner report "Making the Nation
Safer: The Role of Science and Technology in Countering Terrorism" in 2002,
there has always been a solid consensus among responsible observers that
high-performance advanced analytic-search tools---from the general area of
data mining---are central to helping this nation avoid strategic surprise.
One recommendation in "Making the Nation Safer" is to advance the practical
utility of data fusion and data mining for intelligence analysis. With the
advent of large-scale information fusion and information management, the
report said, "[f]uture intelligence and law-enforcement activities could ...
benefit enormously from advances in automatic interpretation of text, image,
video, sensor, and other kinds of unstructured data. This would enable [a]
computer to sort efficiently through ... massive quantities of data to bring
the [pertinent information] ... to the attention of [an] analyst".
Data-mining applications extend across a broad strategic spectrum. Recently,
a task force of the Defense Science Board released a report entitled "Future
Strategic Strike Forces". Focusing on the need to create strong defenses
against rogue states and terrorist organizations, the task force considered
the problem of acquiring and holding at risk the targets of greatest value
to an adversary, whether these are weapons of mass destruction, leadership
assets, special, e.g., hard and deeply buried, targets, or other assets of
significant value to the leadership.
The task force called for extending current ISR systems containing space and
air-based sensors to include lower tiers of networked close-in sensors, e.g.,
wireless networks of intelligent microsensors, and---the recommendation that
motivated this column---making further efforts to improve and integrate the
process of controlling surveillance and reconnaissance assets---as well as
the strategic strike forces themselves---through advanced data mining of
sensor-network outputs and other appropriate forms of algorithm development.
If data mining is to do so many things, then we should distinguish different
types of data mining. The Department of Homeland Security envisages deploying
a dense network of biological sensors, from which information about pathogens
will be mined. Here, the problem is largely being inundated with data; as it
happens, many of the data-mining techniques that are currently being used in
the financial industry will be applied to these data sets.
Tony Tether, the DARPA director, once offered a lucid taxonomy of new and old
data-mining techniques. This taxonomy extends across the whole spectrum of
strategic data mining. The conventional view of data mining is, use clever
statistical techniques to comb through large amounts of data to discover
previously unknown, but useful, patterns for building predictive models.
Indeed, finding statistical correlations as a means of discovering unknown
behavior patterns, and then building a predictive model, constitute the bulk
of commercial data mining.
Strategic data mining, e.g., in counterterrorism, follows a different model.
As Tether says, "one must find _extremely rare_ instances of patterns across
an _extremely wide_ variety of actions---and _hidden_ relationships among
individuals". One approach to be distinguished from commercial data mining
is searching large data sets looking for _evidence_ of specified patterns.
Highly expert attack scenarios are formulated and then reduced to a series of
questions used to query a data set; this querying may reveal data that are
evidence that such attacks are being planned.
Counterterrorism research at DARPA in the areas of data search and pattern
recognition (at least in 2003, before the financial cut-off of TIA) focused
on two basic types of queries, which are typically used in a complementary
fashion.
A subject-based query begins with an entity, such as a person _known_ to be
suspicious. Analytic search then uncovers evidence of links to other suspects
or suspicious activities. The most well-known form of subject-based query is
link analysis. Link analysis detects connectedness within rare patterns using
known starting points. In contrast, a pattern-based query begins with a
pattern that has been developed by some (possibly semi-automated) means.
Analytic search then uncovers evidence of a specified pattern of activity that
might be a threat.
Data mining goes hand in hand with data fusion, which is acquiring data from
many sources, and then integrating the data into usable and accessible forms.
Data mining formulates, analyzes, and implements basic analytic-search
techniques that facilitate the extraction of meaningful information and
knowledge from structured or unstructured data. Example: Are there telephone
numbers in Columbia that are called more than 50 times per day from distinct
cell phones in New York City? To find the answer, simply data mine the phone
logs. Example: The Department of Defense regularly uses Bayesian inference
for force protection. This requires assigning reliability quotients to vast
bodies of data and is very similar to the way Amazon.com uses past buying to
suggest new purchases.
Data mining identifies unsuspected interesting structure in data. Data fusion
is the preliminary task of reconciling or integrating input from a variety of
data sources to make it sufficiently "standard" so that all the data can be
mined together. Data come from different sources but are jointly necessary to
establish single propositions about the world, such as "This cargo container
left Rotterdam with enriched uranium inside" or "The level of frustration
inside al-Qaeda is growing because it is unable to hit major symbolic
political and military targets inside the United States".
Strategic Strike
A strategic strike is a military operation designed to alter decisively an
adversary's course of action. In planning for the nation's future, we must
identify capabilities that U.S. military forces will need to deter and defeat
adversaries who rely on surprise, deception, and asymmetric warfare to achieve
their objectives.
The war on religiously-motivated Sunni Islamic terrorism will be prolonged,
quite possibly measured in decades. It is likely to be characterized by a
full spectrum of military actions, from tailored strikes against emerging
terrorist targets to full-scale military operations against supporters of
terrorism.
This situation imposes a unique set of targeting requirements. To quote from
the DSB report, "the overall target set could include WMD targets, leadership
targets, and other military assets---any of which can be located or deployed
in such a way as to make them special targets. These special targets will be
particularly difficult to attack. They may be buried deep underground and
additionally hardened as well, or they may be mobile and so difficult to
detect as to allow the United States only occasionally fleeting opportunities
to target them, or they may pose great dangers to surrounding areas if
striking them risks release of dangerous materials or triggers other
particularly large and damaging effects".
The task force recommends an integrated, multi-tier intelligence system
encompassing space and air-based sensors linked to lower tiers of close-in and
intrusive microsensors. We are reaching the limits of being able to identify
and track the most difficult targets from space and airborne systems. Our
adversaries are skilled in dispersing, moving, and hiding what is most
valuable to them.
A close-in sensor network is necessary to gather additional information,
including that obtained from acoustic, seismic, visible, infrared,
radio-frequency, hyperspectral-imagery, and magnetic sensors. This requires
massive data-assimilation capabilities if we are to process, exploit, and
disseminate this information to allow integrated tasking of strategic strike
forces.
Nuclear Payloads
The DSB task force also makes a bold recommendation about strategic payloads
that has nothing to do with data mining but may have a great deal to do with
requirements for high-end computing. Currently, the Stockpile Stewardship
Program (SSP), of which ASCI is a significant part, is focused primarily on
refurbishing the legacy stockpile. The task force is skeptical that the
legacy stockpile will meet the nation's future needs and recommends that SSP
be shifted towards a new vision: enabling the design of new nuclear weapons to
create a more diversified stockpile that would be more relevant to the future
threat environment.
To quote, "[n]uclear weapons are needed that produce much lower collateral
damage (great precision, deep penetration, greatly reduced radioactivity);
have robust performance margins; are devised for ease of manufacture and
maintenance; and produce special effects (e.g., enhanced EMP, enhanced
neutron flux, or reduced fission yield). The Task Force recommends that
research be initiated on weapons that meet this new vision".
The major question that comes to mind is, do the computational capabilities
for new design, even design based on previously tested nuclear devices and/or
designs, differ from the capabilities required to refurbish the legacy
stockpile? Your correspondent senses that this new tasking would
significantly stress what might be called the _agile programmability_ of
high-end systems: the designer must be able to speedily perform a sequence of
computational experiments on the systems available to him, where the results
of one experiment are necessary to set up the next; in this case, the dominant
productivity metric is ease and speed of performance programming (construction
of new applications) by the designer, who may or may not be a seasoned
computer or computational scientist.
Wireless Sensor Networks
The task force sees self-organizing wireless sensor networks, an outgrowth of
DARPA's "smart-dust" program, as one component of a larger intelligence,
surveillance, and reconnaisance (ISR) system. Because communication is
expensive, an intelligent sensor will aggregate data, rather than relaying raw
data back to its gateway. The whole idea of having a little computer next to
every sensor is to be able to do _some_ of the processing inside the network,
and only pass along the most important information. But what is important
information from the network's perspective still has to be fused and mined,
using advanced analytic search, into the relevant big picture at higher
levels.
Suppose our task is to find and identify elusive or hidden targets, including
individuals, small and dispersed WMD components, and buried or otherwise
concealed objects. Assume we have an integrated, multi-tier intelligence
system encompassing space and air-based sensors linked to close-in and
intrusive lower tiers. Improved ISR is the single most important pacing
factor in the future achievement of effective strategic strike. Success in
this endeavor requires both new sensor packages and improved means for
bringing together and fusing, in a timely fashion, the information provided
by sensors, operatives on the ground, and human intelligence (HUMINT).
The entities that correspond to patterns of suspicious activity in classical
counterterrorism are templates ("signatures") of high-value targets. These
templates are subtle, elusive, and dynamic. The development of both patterns
and templates should be computer-aided but requires a significant contribution
from human intelligence (and HUMINT too). The data-mining problem in ISR
systems with lower-tier sensor networks is to fuse sensor products and HUMINT
and then perform analytic search for evidence of specified target templates.
This form of data mining is exceedingly complex due to the extremely wide
variety of data sources and types.
Your correspondent asserts that the algorithmically challenging problems in
strategic data mining involve sparse-graph problems with millions of nodes
and large fanouts. These problems are variations of shortest-paths,
connected-components, and graph and path isomorphism---suitably constrained
if necessary so as not to be NP-hard. These graph problems are well
understood and have little chance of being solved on distributed-memory
systems. A graph of 100 million nodes with an average fanout of 10,000 has
_no_ locality. No clever rearrangement of nodes, similar to that suggested
recently by some numerical analysts as beneficial in solving sparse linear
systems, is going to help.
Some of the features of the sparse-graph problems at the heart of strategic
data analysis make it difficult to design scalable parallel algorithms for
these problems on distributed-memory machines. Another parallel architecture
that should be considered as a potentially effective supercomputer for
high-end analytic search is a high-end scalar computing system with a large
location-insensitive shared memory. Flat shared memories of sufficient size
can be implemented in any high-end computing system that tolerates the latency
of frequent irregular fine-grained long-range communication. This computing
system will be far more programmable if there is abundant lightweight
synchronization as well.
The reader must resist the temptation to request too many details about the
applications or the questions being asked. The only thing that will be
described here is the character of possible solutions and their computing
requirements.
Sparse-Graph Computation
Single-source shortest paths (SSSP) is a relevant sparse-graph problem for
which it is notoriously difficult to design efficient parallel algorithms.
SSSP is an interesting test case for the relative difficulty of designing
parallel algorithms for different architecture classes. When the graph is
dense, reasonable parallel algorithms may be found in any standard textbook.
The usual approach is to partition the distance array 'd' and the adjacency
matrix 'Adj' equally among 'p' processors. In this case, without too much
difficulty, one can design an algorithm in which the computation time is not
swamped by the communication time. Such parallel algorithms succeed because
there are no load-balancing issues and communication is localized.
In sharp contrast, these partitioning schemes break down in the presence of
sparse graphs, which are represented by adjacency lists. No satisfactory
method of partitioning the adjacency lists is known. Either there are severe
load-balancing issues or the time spent communicating information among the
processors that store separate portions of the set of adjacency lists may
increase dramatically. In sum, practical parallel formulations of SSSP are
only known for sparse graphs with very special structures.
We introduce some terminology for directed graphs. G = (V, E) describes a
_directed graph_ with _vertex set_ V and _edge set_ E, which is a subset of
V x V. By convention, n = |V| and m = |E|. The out-degree of vertex v is
|{(v, u) in E}|; the in-degree of vertex v is |{(u, v) in E}|. A weighted
directed graph (digraph) has associated edge weights w: E ---> |R. For our
purposes, we will consider sparse graphs where the average fanout (out-degree)
is sqrt n. In this case, m = n sqrt n.
In the single-source shortest-paths problem, we are given a digraph G =
(V, E) with real-valued, nonnegative edge weights w(u, v) and a distinguished
_source vertex_ 's' in V. The problem is to compute, for all vertices 'v'
in V, the length of a shortest path from 's' to 'v', where the length of the
path is the sum of the weights of the edges on the path.
The standard sequential algorithm is Dijkstra's (or Johnson's) algorithm. The
algorithm maintains for each vertex 'v' the tentative distance (upper bound)
d[v] from the source and a set S of vertices for which a shortest path has
been found. The algorithm keeps all unselected vertices in a priority queue
Q and iterates over these vertices, selecting a vertex of minimum tentative
distance in each iteration. Implementing the priority queue as a binary heap
means that the 'n' extract_min operations and the 'm' decrease_key operations
each cost log n for a total cost of O(n log n + m log n) = O(m log n). The
algorithm is shown below.
'forall' v 'in' V 'do' d[v] <-- infinity 'od'
d[s] <-- 0
Q <-- V // S = V - Q is implicit
'while' Q <> empty-set 'do'
u <-- extract_min(Q) // extract_min heap operation
'forall' v 'in' Adj[u] 'do'
d[v] <-- min{d[v], d[u] + w(u,v)} // decrease_key heap operation
'od'
'od'
Your correspondent would like to introduce two parallel _straw_ algorithms
for SSSP suited to a scalar shared-memory machine modeled after the Cray MTA.
In the first, he relies on the compiler to generate good parallel code for
the parallel loops that appear as ordinary sequential loops. His programming
model assumes: 1) an abundance of cheap flexible parallelism, 2) a flat shared
memory, and 3) an abundance of lightweight synchronization. In particular,
inexpensive future variables are implemented by full-empty bits.
Cray MTA programs are largely (ostensibly) sequential loops written without
concern for data distribution. Parallel code is produced for each loop by
the compiler. The compiler generates code to acquire multiple threads and
distribute iterations of the loops to these threads. Barriers are inserted
by the complier to separate the loops. The barriers are implemented as
binary trees, where pairs of threads use full-empty bit locking. All this
makes it easy to write and tune programs.
Parallelism is implicit in each loop. The compiler can parallelize a loop if
two conditions are met. First, it must be able to determine before the loop
begins how many iterations will be executed. Second, the loop must be one of
the forms the compiler can handle. The three general forms are parallel
loops, recurrences, and reductions. Reductions are simpler forms of
recurrences that are easier to solve in parallel and involve less parallel
overhead. In the author's SSSP code, there are only trivially parallel loops
and reductions (specifically, finding the minimum element of a vector), but
these inner loops are nested inside a sequential outer loop.
First Straw Algorithm
The data structures for this version of SSSP are relatively straightforward.
Consider the data structure for a single vertex. It has fields for the
out-degree and the in-degree. All outgoing edges are kept in a doubly-linked
list 'Adj'; this allows easy deletion of an edge by any thread with a pointer
to it. The outgoing edges are sorted in nondecreasing order of edge weight.
In particular, a minimum-weight outgoing edge from vertex 'v' is stored in the
head node of Adj[v]. In contrast, all incoming edges are kept in an adjacency
array 'incoming'; each array component points to the list node in 'Adj'
representing that edge. The use of an array allows easy parallel access to
the incoming edges of a vertex.
A graphical representation may clarify things. Each list node of 'Adj' looks
like {<*, v>: weight}; similarly, each array component of 'incoming' looks
like [: pointer].
An array 'Gamma', used as a scratch-pad, contains those vertices that meet two
criteria. First, each vertex has been selected for a shortest-paths tree and
its 'dist' value has been settled. Second, each vertex has a strictly
positive out-degree. The candidate nodes for the next selection are precisely
the heads of the adjacency lists 'Adj' of vertices in 'Gamma'. As an
invariant, each 'path-weight' field of 'Gamma' is the sum of dist[v] and the
weight of the leading edge of Adj[v], where 'v' is the vertex indexed by that
component of 'Gamma'. Here is the first code for parallel single-source
shortest paths.
'for' j := 1 'to' n 'do'
dist[j] := infinity // tentative distance
pi[j] := nil // parent in shortest-paths tree
'od'
dist[source] := 0
/* assert parallel
'for' j := 1 'to' source.in-degree 'do' // keep outgoing edges only
delete_edge(source.incoming[j].pointer)
'od'
'if' source.out-degree > 0 'then'
length := 1
Gamma[1] := source
Gamma[1].path-weight := source.Adj.head.weight
'else'
length := 0
'fi'
'while' length <> 0 'do'
j := 1
'for' k := 2 'to' length 'do' // reduction operation
'if' Gamma[k].path-weight < Gamma[j].path-weight 'then' j := k 'fi'
'od'
(vertex, dist[vertex]) := (Gamma[j].Adj.head, Gamma[j].path-weight)
pi[vertex] := Gamma[j]
length++
Gamma[length] := vertex
/* assert parallel
'for' j := 1 'to' vertex.in-degree 'do' // keep outgoing edges only
'if' vertex.incoming[j].source-vertex 'in' Gamma 'then'
delete_edge(vertex.incoming[j].pointer)
'fi'
'od'
"compact Gamma"
'od'
The outer 'while' loop iterates 'n' times. The code was written to use the
system's parallelism to minimize the execution time of each iteration.
With sufficient parallelism, each reduction operation takes O(log length) =
O(log n). Note that, as far as 'Gamma' is concerned, the reduction operation
is a read-only operation, apart from the adding of one vertex. The second
parallel loop deletes any 'Adj' edges that are incident to (directed towards)
the newly selected vertex. This maintains the invariant that there are only
outgoing edges from 'Gamma'. At most one node is deleted from any 'Adj' list.
Procedure 'delete_edge' is reponsible for updating both 'out-degree' and
'path-weight'.
The compaction operation may be emulated in any number of ways. The goal is
to perform the reduction operation on a vector of minimum length. The real
parallelism in this code is in the reduction operation.
How fast is this code? Probably not fast enough. There are 'm' independent
delete_edge operations and 'n' reduction operations on vectors of length at
most 'n'. With unbounded parallelism, the total cost is O(n log n), which is
a speed-up of sqrt n. The main defficiency is that no strategy has been
designed to parallelize the outer loop.
Second Straw Algorithm
The first algorithm is overconstrained because it postulates that shortest
paths are enumerated in the order of increasing distance from the source.
By performing a fixed-point calculation, we can allow all iterations of the
outer loop to proceed in parallel. It is well known that we may relax edges
in arbitrary order and are guaranteed to reach a fixed point. The problem
is that, if we do not specify a deterministic order, one edge may be relaxed
several times, thus doing more work than the sequential algorithm. A good
heuristic---not found by your correspondent---would allow maximal good
nondeterminism while keeping the extra work within reasonable limits.
One useful observation is that, after a vertex has been expanded, there is no
point in expanding it again unless its 'dist' value has been decremented in
the meantime. We assign a thread to each vertex. There is a Boolean variable
fresh[v] (actually a full-empty bit) that indicates whether dist[v] has been
decremented since the last expansion of 'v'. We only expand 'fresh' vertices.
To expand a vertex 'u', we scan Adj[u], temporarily locking each vertex 'v'
before deciding whether to decrement dist[v]. Here is the second code for
parallel single-source shortest paths.
'forall' v 'in' V 'do' dist[v] := infinity; fresh[v] := false 'od'
dist[source] := 0
fresh[source] := true
thread u:
'loop'
readfe(fresh[u]) // wait fresh, set stale
'forall' v 'in' Adj[u] 'do'
readfe(dist[v]) // wait full, set empty
'if' dist[v] > dist[u] + w(u, v) 'then'
dist[v] := dist[u] + w(u, v) // if dist[v] changes, then
fresh[v] := true // vertex v becomes fresh
'fi'
dist[v] := dist[v] // set full
'od'
'forever'
The algorithm terminates when every thread is suspended on a "stale" node,
which is evidence that every node has been settled.
Each thread 'u' waits for vertex 'u' to become 'fresh' and immediately sets it
to "stale". The 'forall' loop steps through the outgoing edges. Each dist[v]
is temporarily locked and the test occurs. If dist[v] is decremented, then
the full-empty bit of fresh[v] is set to full. Finally, dist[v] is unlocked.
An algorithm in a similar vein is to have each vertex suspend until all its
predecessor vertices have settled, and then settle itself. Your correspondent
does not _yet_ see a good self-test that a vertex has settled.
Dynamic Sparse Graphs Are Worse
There are common features of pattern-based analytic search with streaming
data. In counterterrorism, classical graph algorithms, such as suitably
optimized parallel breadth-first search, can be used as inference engines to
derive new knowledge from the knowledge encoded in a concept graph. A concept
graph is a standard knowledge-representation data structure. In intelligence
work, we will be constantly streaming in new knowledge from external sources,
resulting in continual revision of our concept graph. Analytic reasoning
consists of running parallel graph algorithms on dynamic sparse graphs. The
situation is the same in analytic search for strategic strike where, in this
case, data is constantly streaming in from the sensor networks.
Hitherto, we have been dealing with large irregular sparse graphs with no
locality. As it happens, the sparse graphs in strategic analytic search are
typically _dynamic_. At any time in the analytic process, some quantity of
transformed raw data has been represented as a graph and we are searching it
to discover interesting subgraphs. But information from external data sources
is constantly streaming in, which dynamically extends and modifies the graph.
At the same time, analysts are constantly introducing new criteria, new rules,
and new measures of interest. In comparison to commercial data mining,
strategic analytic search---by necessity---is massively more fluid, more
adaptive.
This has a fundamental computational consequence: conventional load balancing
based on data decomposition must be abandoned and replaced by running the
algorithm on a system with a flat shared memory. The reason is that it is no
longer computationally feasible to do data redistribution after each graph
modification prior to the next analytic search of the modified graph.
Partitioning of large irregular sparse graphs is too expensive to be
interleaved into an ongoing analytic-search process. With on-line strategic
data mining of large dynamic graphs, we cannot amortize the high cost of graph
partitioning. However, if the graph is stored in a flat shared memory, data
distribution is unnecessary: location insensitivity means that performance is
completely unaffected by where we chase the pointers to.
Lessons Learned
With all their imperfections, the two straw parallel algorithms are quite
instructive about the computing requirements for efficient parallel SSSP.
We see the need for systems that can hold large tables in large flat shared
memories. If we design sophisticated global data structures, we want every
thread in the system to be able to access these global structures without
incurring any performance penalty. We need to tolerate the latency of
long-range communication. We need to be free from any worries about either
load balancing or localized communication.
The second straw algorithm is very much in the spirit of pipelining with
futures. Some sort of pipelining is necessary to parallelize the outer
'while' loop. Pipelining without futures has enormous practical problems:
maintaining the pipeline is quite complicated for the programmer and the
pipelining forces highly synchronous code execution. Although the second
straw algorithm doesn't have the right heuristics to properly control the
nondeterminism to reasonably minimize the extra work, it rightly assumes a
runtime system that manages the pipelining implicitly.
No designer will succeed in cracking SSSP unless his programming model offers
both lightweight access to global data structures and implicit management of
pipelining with futures. If it will be taken in the right spirit, your
correspondent wishes to assert: features such as flat shared memory and
pipelining with futures allow the sort of _computational play_ that is
necessary to design efficient parallel algorithms for challenging
sparse-graph problems such as SSSP.
Everything revolves around programmability.
As a final note, in analytic search for strategic strike, in computational
nuclear-weapons design, in analytic search for counterterrorism, we need to
better understand the relationship between tool users and tool developers.
The users are military officers and weapons designers and intelligence
professionals; the developers are computer and computational scientists and
software engineers. Because of the sheer complexity of all three tasks, both
tool users and tool developers must be _programmers_ at different levels of
abstraction. We would like the tool users to be able to concentrate on the
target aspects or the physics aspects or the intelligence aspects, but we
cannot shield them from all of the computer aspects (although we can certainly
try).
The problem with current systems is that distributed-memory machines are
difficult to program. Almost every high-end system in production use is a
large cluster of SMP nodes, where each shared-memory node contains a moderate
number of cache-based scalar processors. These processors are not strongly
parallel and the system interconnect does not have exceptional bisection
bandwidth.
Parallel programming of these machines faces many challenging issues from data
distribution to minimizing communication to task scheduling to load balancing.
Each issue increases the cost of writing code. When you sweep these issues
under the rug, the resulting code is either slow or unscalable. We need
something better.
The High-End Crusader, a noted expert in high-performance computing and
communications, shall remain anonymous. He alone bears responsibility for the
opinions he expresses. Comments are always welcome and may be sent to the
HPCwire editor, Tim Curns, at tim@hpcwire.com.
|