HPCwire
 The global publication of record for High Performance Computing / May 14, 2004: Vol. 13, No. 19

Previous Article   |  Table of Contents  |  

Features:

ADVANCED ANALYTIC-SEARCH TOOLS FOR STRATEGIC STRIKE
Commentary from the High-End Crusader

Starting with the monumental Branscomb/Klausner report "Making the Nation Safer: The Role of Science and Technology in Countering Terrorism" in 2002, there has always been a solid consensus among responsible observers that high-performance advanced analytic-search tools---from the general area of data mining---are central to helping this nation avoid strategic surprise.

One recommendation in "Making the Nation Safer" is to advance the practical utility of data fusion and data mining for intelligence analysis. With the advent of large-scale information fusion and information management, the report said, "[f]uture intelligence and law-enforcement activities could ... benefit enormously from advances in automatic interpretation of text, image, video, sensor, and other kinds of unstructured data. This would enable [a] computer to sort efficiently through ... massive quantities of data to bring the [pertinent information] ... to the attention of [an] analyst".

Data-mining applications extend across a broad strategic spectrum. Recently, a task force of the Defense Science Board released a report entitled "Future Strategic Strike Forces". Focusing on the need to create strong defenses against rogue states and terrorist organizations, the task force considered the problem of acquiring and holding at risk the targets of greatest value to an adversary, whether these are weapons of mass destruction, leadership assets, special, e.g., hard and deeply buried, targets, or other assets of significant value to the leadership.

The task force called for extending current ISR systems containing space and air-based sensors to include lower tiers of networked close-in sensors, e.g., wireless networks of intelligent microsensors, and---the recommendation that motivated this column---making further efforts to improve and integrate the process of controlling surveillance and reconnaissance assets---as well as the strategic strike forces themselves---through advanced data mining of sensor-network outputs and other appropriate forms of algorithm development.

If data mining is to do so many things, then we should distinguish different types of data mining. The Department of Homeland Security envisages deploying a dense network of biological sensors, from which information about pathogens will be mined. Here, the problem is largely being inundated with data; as it happens, many of the data-mining techniques that are currently being used in the financial industry will be applied to these data sets.

Tony Tether, the DARPA director, once offered a lucid taxonomy of new and old data-mining techniques. This taxonomy extends across the whole spectrum of strategic data mining. The conventional view of data mining is, use clever statistical techniques to comb through large amounts of data to discover previously unknown, but useful, patterns for building predictive models. Indeed, finding statistical correlations as a means of discovering unknown behavior patterns, and then building a predictive model, constitute the bulk of commercial data mining.

Strategic data mining, e.g., in counterterrorism, follows a different model. As Tether says, "one must find _extremely rare_ instances of patterns across an _extremely wide_ variety of actions---and _hidden_ relationships among individuals". One approach to be distinguished from commercial data mining is searching large data sets looking for _evidence_ of specified patterns. Highly expert attack scenarios are formulated and then reduced to a series of questions used to query a data set; this querying may reveal data that are evidence that such attacks are being planned.

Counterterrorism research at DARPA in the areas of data search and pattern recognition (at least in 2003, before the financial cut-off of TIA) focused on two basic types of queries, which are typically used in a complementary fashion.

A subject-based query begins with an entity, such as a person _known_ to be suspicious. Analytic search then uncovers evidence of links to other suspects or suspicious activities. The most well-known form of subject-based query is link analysis. Link analysis detects connectedness within rare patterns using known starting points. In contrast, a pattern-based query begins with a pattern that has been developed by some (possibly semi-automated) means. Analytic search then uncovers evidence of a specified pattern of activity that might be a threat.

Data mining goes hand in hand with data fusion, which is acquiring data from many sources, and then integrating the data into usable and accessible forms. Data mining formulates, analyzes, and implements basic analytic-search techniques that facilitate the extraction of meaningful information and knowledge from structured or unstructured data. Example: Are there telephone numbers in Columbia that are called more than 50 times per day from distinct cell phones in New York City? To find the answer, simply data mine the phone logs. Example: The Department of Defense regularly uses Bayesian inference for force protection. This requires assigning reliability quotients to vast bodies of data and is very similar to the way Amazon.com uses past buying to suggest new purchases.

Data mining identifies unsuspected interesting structure in data. Data fusion is the preliminary task of reconciling or integrating input from a variety of data sources to make it sufficiently "standard" so that all the data can be mined together. Data come from different sources but are jointly necessary to establish single propositions about the world, such as "This cargo container left Rotterdam with enriched uranium inside" or "The level of frustration inside al-Qaeda is growing because it is unable to hit major symbolic political and military targets inside the United States".

Strategic Strike

A strategic strike is a military operation designed to alter decisively an adversary's course of action. In planning for the nation's future, we must identify capabilities that U.S. military forces will need to deter and defeat adversaries who rely on surprise, deception, and asymmetric warfare to achieve their objectives.

The war on religiously-motivated Sunni Islamic terrorism will be prolonged, quite possibly measured in decades. It is likely to be characterized by a full spectrum of military actions, from tailored strikes against emerging terrorist targets to full-scale military operations against supporters of terrorism.

This situation imposes a unique set of targeting requirements. To quote from the DSB report, "the overall target set could include WMD targets, leadership targets, and other military assets---any of which can be located or deployed in such a way as to make them special targets. These special targets will be particularly difficult to attack. They may be buried deep underground and additionally hardened as well, or they may be mobile and so difficult to detect as to allow the United States only occasionally fleeting opportunities to target them, or they may pose great dangers to surrounding areas if striking them risks release of dangerous materials or triggers other particularly large and damaging effects".

The task force recommends an integrated, multi-tier intelligence system encompassing space and air-based sensors linked to lower tiers of close-in and intrusive microsensors. We are reaching the limits of being able to identify and track the most difficult targets from space and airborne systems. Our adversaries are skilled in dispersing, moving, and hiding what is most valuable to them.

A close-in sensor network is necessary to gather additional information, including that obtained from acoustic, seismic, visible, infrared, radio-frequency, hyperspectral-imagery, and magnetic sensors. This requires massive data-assimilation capabilities if we are to process, exploit, and disseminate this information to allow integrated tasking of strategic strike forces.

Nuclear Payloads

The DSB task force also makes a bold recommendation about strategic payloads that has nothing to do with data mining but may have a great deal to do with requirements for high-end computing. Currently, the Stockpile Stewardship Program (SSP), of which ASCI is a significant part, is focused primarily on refurbishing the legacy stockpile. The task force is skeptical that the legacy stockpile will meet the nation's future needs and recommends that SSP be shifted towards a new vision: enabling the design of new nuclear weapons to create a more diversified stockpile that would be more relevant to the future threat environment.

To quote, "[n]uclear weapons are needed that produce much lower collateral damage (great precision, deep penetration, greatly reduced radioactivity); have robust performance margins; are devised for ease of manufacture and maintenance; and produce special effects (e.g., enhanced EMP, enhanced neutron flux, or reduced fission yield). The Task Force recommends that research be initiated on weapons that meet this new vision".

The major question that comes to mind is, do the computational capabilities for new design, even design based on previously tested nuclear devices and/or designs, differ from the capabilities required to refurbish the legacy stockpile? Your correspondent senses that this new tasking would significantly stress what might be called the _agile programmability_ of high-end systems: the designer must be able to speedily perform a sequence of computational experiments on the systems available to him, where the results of one experiment are necessary to set up the next; in this case, the dominant productivity metric is ease and speed of performance programming (construction of new applications) by the designer, who may or may not be a seasoned computer or computational scientist.

Wireless Sensor Networks

The task force sees self-organizing wireless sensor networks, an outgrowth of DARPA's "smart-dust" program, as one component of a larger intelligence, surveillance, and reconnaisance (ISR) system. Because communication is expensive, an intelligent sensor will aggregate data, rather than relaying raw data back to its gateway. The whole idea of having a little computer next to every sensor is to be able to do _some_ of the processing inside the network, and only pass along the most important information. But what is important information from the network's perspective still has to be fused and mined, using advanced analytic search, into the relevant big picture at higher levels.

Suppose our task is to find and identify elusive or hidden targets, including individuals, small and dispersed WMD components, and buried or otherwise concealed objects. Assume we have an integrated, multi-tier intelligence system encompassing space and air-based sensors linked to close-in and intrusive lower tiers. Improved ISR is the single most important pacing factor in the future achievement of effective strategic strike. Success in this endeavor requires both new sensor packages and improved means for bringing together and fusing, in a timely fashion, the information provided by sensors, operatives on the ground, and human intelligence (HUMINT).

The entities that correspond to patterns of suspicious activity in classical counterterrorism are templates ("signatures") of high-value targets. These templates are subtle, elusive, and dynamic. The development of both patterns and templates should be computer-aided but requires a significant contribution from human intelligence (and HUMINT too). The data-mining problem in ISR systems with lower-tier sensor networks is to fuse sensor products and HUMINT and then perform analytic search for evidence of specified target templates. This form of data mining is exceedingly complex due to the extremely wide variety of data sources and types.

Your correspondent asserts that the algorithmically challenging problems in strategic data mining involve sparse-graph problems with millions of nodes and large fanouts. These problems are variations of shortest-paths, connected-components, and graph and path isomorphism---suitably constrained if necessary so as not to be NP-hard. These graph problems are well understood and have little chance of being solved on distributed-memory systems. A graph of 100 million nodes with an average fanout of 10,000 has _no_ locality. No clever rearrangement of nodes, similar to that suggested recently by some numerical analysts as beneficial in solving sparse linear systems, is going to help.

Some of the features of the sparse-graph problems at the heart of strategic data analysis make it difficult to design scalable parallel algorithms for these problems on distributed-memory machines. Another parallel architecture that should be considered as a potentially effective supercomputer for high-end analytic search is a high-end scalar computing system with a large location-insensitive shared memory. Flat shared memories of sufficient size can be implemented in any high-end computing system that tolerates the latency of frequent irregular fine-grained long-range communication. This computing system will be far more programmable if there is abundant lightweight synchronization as well.

The reader must resist the temptation to request too many details about the applications or the questions being asked. The only thing that will be described here is the character of possible solutions and their computing requirements.

Sparse-Graph Computation

Single-source shortest paths (SSSP) is a relevant sparse-graph problem for which it is notoriously difficult to design efficient parallel algorithms. SSSP is an interesting test case for the relative difficulty of designing parallel algorithms for different architecture classes. When the graph is dense, reasonable parallel algorithms may be found in any standard textbook. The usual approach is to partition the distance array 'd' and the adjacency matrix 'Adj' equally among 'p' processors. In this case, without too much difficulty, one can design an algorithm in which the computation time is not swamped by the communication time. Such parallel algorithms succeed because there are no load-balancing issues and communication is localized.

In sharp contrast, these partitioning schemes break down in the presence of sparse graphs, which are represented by adjacency lists. No satisfactory method of partitioning the adjacency lists is known. Either there are severe load-balancing issues or the time spent communicating information among the processors that store separate portions of the set of adjacency lists may increase dramatically. In sum, practical parallel formulations of SSSP are only known for sparse graphs with very special structures.

We introduce some terminology for directed graphs. G = (V, E) describes a _directed graph_ with _vertex set_ V and _edge set_ E, which is a subset of V x V. By convention, n = |V| and m = |E|. The out-degree of vertex v is |{(v, u) in E}|; the in-degree of vertex v is |{(u, v) in E}|. A weighted directed graph (digraph) has associated edge weights w: E ---> |R. For our purposes, we will consider sparse graphs where the average fanout (out-degree) is sqrt n. In this case, m = n sqrt n.

In the single-source shortest-paths problem, we are given a digraph G = (V, E) with real-valued, nonnegative edge weights w(u, v) and a distinguished _source vertex_ 's' in V. The problem is to compute, for all vertices 'v' in V, the length of a shortest path from 's' to 'v', where the length of the path is the sum of the weights of the edges on the path.

The standard sequential algorithm is Dijkstra's (or Johnson's) algorithm. The algorithm maintains for each vertex 'v' the tentative distance (upper bound) d[v] from the source and a set S of vertices for which a shortest path has been found. The algorithm keeps all unselected vertices in a priority queue Q and iterates over these vertices, selecting a vertex of minimum tentative distance in each iteration. Implementing the priority queue as a binary heap means that the 'n' extract_min operations and the 'm' decrease_key operations each cost log n for a total cost of O(n log n + m log n) = O(m log n). The algorithm is shown below.

  'forall' v 'in' V 'do' d[v] <-- infinity 'od'
   d[s] <-- 0
   Q <-- V                                // S = V - Q is implicit
  'while' Q <> empty-set 'do'
     u <-- extract_min(Q)                 // extract_min  heap operation
    'forall' v 'in' Adj[u] 'do'
       d[v] <-- min{d[v], d[u] + w(u,v)}  // decrease_key heap operation
    'od'
  'od'

Your correspondent would like to introduce two parallel _straw_ algorithms for SSSP suited to a scalar shared-memory machine modeled after the Cray MTA. In the first, he relies on the compiler to generate good parallel code for the parallel loops that appear as ordinary sequential loops. His programming model assumes: 1) an abundance of cheap flexible parallelism, 2) a flat shared memory, and 3) an abundance of lightweight synchronization. In particular, inexpensive future variables are implemented by full-empty bits.

Cray MTA programs are largely (ostensibly) sequential loops written without concern for data distribution. Parallel code is produced for each loop by the compiler. The compiler generates code to acquire multiple threads and distribute iterations of the loops to these threads. Barriers are inserted by the complier to separate the loops. The barriers are implemented as binary trees, where pairs of threads use full-empty bit locking. All this makes it easy to write and tune programs.

Parallelism is implicit in each loop. The compiler can parallelize a loop if two conditions are met. First, it must be able to determine before the loop begins how many iterations will be executed. Second, the loop must be one of the forms the compiler can handle. The three general forms are parallel loops, recurrences, and reductions. Reductions are simpler forms of recurrences that are easier to solve in parallel and involve less parallel overhead. In the author's SSSP code, there are only trivially parallel loops and reductions (specifically, finding the minimum element of a vector), but these inner loops are nested inside a sequential outer loop.

First Straw Algorithm

The data structures for this version of SSSP are relatively straightforward. Consider the data structure for a single vertex. It has fields for the out-degree and the in-degree. All outgoing edges are kept in a doubly-linked list 'Adj'; this allows easy deletion of an edge by any thread with a pointer to it. The outgoing edges are sorted in nondecreasing order of edge weight. In particular, a minimum-weight outgoing edge from vertex 'v' is stored in the head node of Adj[v]. In contrast, all incoming edges are kept in an adjacency array 'incoming'; each array component points to the list node in 'Adj' representing that edge. The use of an array allows easy parallel access to the incoming edges of a vertex.

A graphical representation may clarify things. Each list node of 'Adj' looks like {<*, v>: weight}; similarly, each array component of 'incoming' looks like [: pointer].

An array 'Gamma', used as a scratch-pad, contains those vertices that meet two criteria. First, each vertex has been selected for a shortest-paths tree and its 'dist' value has been settled. Second, each vertex has a strictly positive out-degree. The candidate nodes for the next selection are precisely the heads of the adjacency lists 'Adj' of vertices in 'Gamma'. As an invariant, each 'path-weight' field of 'Gamma' is the sum of dist[v] and the weight of the leading edge of Adj[v], where 'v' is the vertex indexed by that component of 'Gamma'. Here is the first code for parallel single-source shortest paths.

  'for' j := 1 'to' n 'do'
     dist[j]  := infinity                     // tentative distance
     pi[j]    := nil                          // parent in shortest-paths tree
  'od'
   dist[source] := 0
   /* assert parallel
  'for' j := 1 'to' source.in-degree 'do'     // keep outgoing edges only
     delete_edge(source.incoming[j].pointer)
  'od'
  'if' source.out-degree > 0 'then'
     length               := 1
     Gamma[1]             := source
     Gamma[1].path-weight := source.Adj.head.weight
  'else'
     length := 0
  'fi'
  'while' length <> 0 'do'
     j := 1
    'for' k := 2 'to' length 'do'             // reduction operation
      'if' Gamma[k].path-weight < Gamma[j].path-weight 'then' j := k 'fi'
    'od'
     (vertex, dist[vertex]) := (Gamma[j].Adj.head, Gamma[j].path-weight)
     pi[vertex] := Gamma[j]
     length++
     Gamma[length] := vertex
     /* assert parallel
    'for' j := 1 'to' vertex.in-degree 'do'   // keep outgoing edges only
      'if' vertex.incoming[j].source-vertex 'in' Gamma 'then'
         delete_edge(vertex.incoming[j].pointer)
      'fi'
    'od'
    "compact Gamma"
  'od'

The outer 'while' loop iterates 'n' times. The code was written to use the system's parallelism to minimize the execution time of each iteration. With sufficient parallelism, each reduction operation takes O(log length) = O(log n). Note that, as far as 'Gamma' is concerned, the reduction operation is a read-only operation, apart from the adding of one vertex. The second parallel loop deletes any 'Adj' edges that are incident to (directed towards) the newly selected vertex. This maintains the invariant that there are only outgoing edges from 'Gamma'. At most one node is deleted from any 'Adj' list. Procedure 'delete_edge' is reponsible for updating both 'out-degree' and 'path-weight'.

The compaction operation may be emulated in any number of ways. The goal is to perform the reduction operation on a vector of minimum length. The real parallelism in this code is in the reduction operation.

How fast is this code? Probably not fast enough. There are 'm' independent delete_edge operations and 'n' reduction operations on vectors of length at most 'n'. With unbounded parallelism, the total cost is O(n log n), which is a speed-up of sqrt n. The main defficiency is that no strategy has been designed to parallelize the outer loop.

Second Straw Algorithm

The first algorithm is overconstrained because it postulates that shortest paths are enumerated in the order of increasing distance from the source. By performing a fixed-point calculation, we can allow all iterations of the outer loop to proceed in parallel. It is well known that we may relax edges in arbitrary order and are guaranteed to reach a fixed point. The problem is that, if we do not specify a deterministic order, one edge may be relaxed several times, thus doing more work than the sequential algorithm. A good heuristic---not found by your correspondent---would allow maximal good nondeterminism while keeping the extra work within reasonable limits.

One useful observation is that, after a vertex has been expanded, there is no point in expanding it again unless its 'dist' value has been decremented in the meantime. We assign a thread to each vertex. There is a Boolean variable fresh[v] (actually a full-empty bit) that indicates whether dist[v] has been decremented since the last expansion of 'v'. We only expand 'fresh' vertices. To expand a vertex 'u', we scan Adj[u], temporarily locking each vertex 'v' before deciding whether to decrement dist[v]. Here is the second code for parallel single-source shortest paths.

  'forall' v 'in' V 'do' dist[v] := infinity; fresh[v] := false 'od'
   dist[source]  := 0
   fresh[source] := true

   thread u:

  'loop'
     readfe(fresh[u])                           // wait fresh, set stale
    'forall' v 'in' Adj[u] 'do'
       readfe(dist[v])                          // wait full, set empty
      'if' dist[v] > dist[u] + w(u, v) 'then'
         dist[v]  := dist[u] + w(u, v)          // if dist[v] changes, then
         fresh[v] := true                       // vertex v becomes fresh
      'fi'
       dist[v] := dist[v]                       // set full
    'od'
  'forever'

The algorithm terminates when every thread is suspended on a "stale" node, which is evidence that every node has been settled.

Each thread 'u' waits for vertex 'u' to become 'fresh' and immediately sets it to "stale". The 'forall' loop steps through the outgoing edges. Each dist[v] is temporarily locked and the test occurs. If dist[v] is decremented, then the full-empty bit of fresh[v] is set to full. Finally, dist[v] is unlocked.

An algorithm in a similar vein is to have each vertex suspend until all its predecessor vertices have settled, and then settle itself. Your correspondent does not _yet_ see a good self-test that a vertex has settled.

Dynamic Sparse Graphs Are Worse

There are common features of pattern-based analytic search with streaming data. In counterterrorism, classical graph algorithms, such as suitably optimized parallel breadth-first search, can be used as inference engines to derive new knowledge from the knowledge encoded in a concept graph. A concept graph is a standard knowledge-representation data structure. In intelligence work, we will be constantly streaming in new knowledge from external sources, resulting in continual revision of our concept graph. Analytic reasoning consists of running parallel graph algorithms on dynamic sparse graphs. The situation is the same in analytic search for strategic strike where, in this case, data is constantly streaming in from the sensor networks.

Hitherto, we have been dealing with large irregular sparse graphs with no locality. As it happens, the sparse graphs in strategic analytic search are typically _dynamic_. At any time in the analytic process, some quantity of transformed raw data has been represented as a graph and we are searching it to discover interesting subgraphs. But information from external data sources is constantly streaming in, which dynamically extends and modifies the graph. At the same time, analysts are constantly introducing new criteria, new rules, and new measures of interest. In comparison to commercial data mining, strategic analytic search---by necessity---is massively more fluid, more adaptive.

This has a fundamental computational consequence: conventional load balancing based on data decomposition must be abandoned and replaced by running the algorithm on a system with a flat shared memory. The reason is that it is no longer computationally feasible to do data redistribution after each graph modification prior to the next analytic search of the modified graph. Partitioning of large irregular sparse graphs is too expensive to be interleaved into an ongoing analytic-search process. With on-line strategic data mining of large dynamic graphs, we cannot amortize the high cost of graph partitioning. However, if the graph is stored in a flat shared memory, data distribution is unnecessary: location insensitivity means that performance is completely unaffected by where we chase the pointers to.

Lessons Learned

With all their imperfections, the two straw parallel algorithms are quite instructive about the computing requirements for efficient parallel SSSP. We see the need for systems that can hold large tables in large flat shared memories. If we design sophisticated global data structures, we want every thread in the system to be able to access these global structures without incurring any performance penalty. We need to tolerate the latency of long-range communication. We need to be free from any worries about either load balancing or localized communication.

The second straw algorithm is very much in the spirit of pipelining with futures. Some sort of pipelining is necessary to parallelize the outer 'while' loop. Pipelining without futures has enormous practical problems: maintaining the pipeline is quite complicated for the programmer and the pipelining forces highly synchronous code execution. Although the second straw algorithm doesn't have the right heuristics to properly control the nondeterminism to reasonably minimize the extra work, it rightly assumes a runtime system that manages the pipelining implicitly.

No designer will succeed in cracking SSSP unless his programming model offers both lightweight access to global data structures and implicit management of pipelining with futures. If it will be taken in the right spirit, your correspondent wishes to assert: features such as flat shared memory and pipelining with futures allow the sort of _computational play_ that is necessary to design efficient parallel algorithms for challenging sparse-graph problems such as SSSP.

Everything revolves around programmability.

As a final note, in analytic search for strategic strike, in computational nuclear-weapons design, in analytic search for counterterrorism, we need to better understand the relationship between tool users and tool developers. The users are military officers and weapons designers and intelligence professionals; the developers are computer and computational scientists and software engineers. Because of the sheer complexity of all three tasks, both tool users and tool developers must be _programmers_ at different levels of abstraction. We would like the tool users to be able to concentrate on the target aspects or the physics aspects or the intelligence aspects, but we cannot shield them from all of the computer aspects (although we can certainly try).

The problem with current systems is that distributed-memory machines are difficult to program. Almost every high-end system in production use is a large cluster of SMP nodes, where each shared-memory node contains a moderate number of cache-based scalar processors. These processors are not strongly parallel and the system interconnect does not have exceptional bisection bandwidth.

Parallel programming of these machines faces many challenging issues from data distribution to minimizing communication to task scheduling to load balancing. Each issue increases the cost of writing code. When you sweep these issues under the rug, the resulting code is either slow or unscalable. We need something better.


The High-End Crusader, a noted expert in high-performance computing and communications, shall remain anonymous. He alone bears responsibility for the opinions he expresses. Comments are always welcome and may be sent to the HPCwire editor, Tim Curns, at tim@hpcwire.com.


Top of Page

Previous Article   |  Table of Contents  |