![]() |
|
| The global publication of record for High Performance Computing / May 21, 2004: Vol. 13, No. 20 | |
|
||||
Features:LETTERS TO THE EDITOR: RESPONDING TO THE HIGH END CRUSADERReaders respond to the High End Crusader's article, "ADVANCED ANALYTIC-SEARCH TOOLS FOR STRATEGIC STRIKE" [107638] published in HPCwire May 14, 2004. The HEC also provides a general response to readers' comments. LETTER ONE:Dear HPCwire, I would like to add some comments along the Crusader's debate, especially on the issue of effective cache usage in sparse matrix computations. The main point is that although sparse linear systems, eigenvalue and least squares problems may use data caches effectively, the next round of high-performance statistical computations may not be that lucky. Super-pivots have been with us since the early 90's and they have proved an effective way to use data caches effectively in the context of sparse matrix factorizations. But since this forum is about super-computers, let's look at computations that are the next step in complexity beyond the bread and butter type. Take ordinary linear least squares (OLS) specifically, that is the problem of finding a vector x that minimizes the Euclidean norm of the residual vector r=b-Ax where A is a dense m by n matrix with m larger than n. That is a well known problem; it is often solved using a QR matrix factorization although other methods do exist. So what is a QR++ computation? OLS assumes that the elements of the residual vector are statistically independent and normally distributed with mean zero and variance one. If you would write the OLS matrix equation in short hand notation as:
b_i = A_ij * x_j + e_i ,
(1)
this says that the elements e_i of the vector e come from a specific bell- shaped distribution, that the expected value of e_i is zero and that of e_i*e_j is equal to zero when i not equal to j and equal to 1 otherwise. These are hardly trivial assumptions yet often they hold true because the e_i's can be thought as independent measurement errors. But what if the ith measurement is contaminated by the i-1st measurement? Our model now might look like this:
b_i = A_ij * x_j + f_i,
(2)
where f_1 = e_1 and f_i = e_i+e_{i-1} for i>1. In practice we need not know exactly how e_i and e_{i-1} add up to f_i, and one postulates f_i = u_1*e_i + u_2*e_{i-1} where u_1 and u_2 are unknown but bounded. Our problem is now to find u_1 and u_2 that maximize the likelihood of the vector f subject to linear constraints that involve both the dense matrix A and another sparse matrix, say L. The main advantage of solving (2) instead of (1) is that the result of the former can lead to more accurate estimates of the uncertainty in the computed vector x. When solving a 'regularized' version of (2), one comes about computing the gradient of the determinant of a matrix B ' * B with L*B=A and where B ' is the transpose of B. The adjoint computation of this gradient is in theory more efficient (flop-wise) than the corresponding tangent computation because it involves far fewer solutions of linear systems in L than the latter. It also involves multiplying the dense transpose of the adjoint of B by the sparse column of L. That multiplication is not flop intensive and there doesn't seem to be any easy way to overlap it with other flop intensive operations. The matrix L is actually the Cholesky factorization of the variance-covariance matrix of the vector f, that is L*L' is the expected value of rank-1 matrix f * f '. Often that matrix is block-diagonal and one might be tempted to conclude that the computation can be expressed in terms of many level-3 BLAS operations on those blocks. In practice however, a technique known to the statisticians as absorption can often reduce the size of those diagonal blocks to the point where level-3 BLAS is no longer attractive. Computations from totally new areas might represent too radical a change to use effectively contemporary hardware; bread and butter computations on the other hand are arguably too well suited for this hardware. In between, there is a class of computations from which one might learn much about the limits of current architectures, how to analyze those and how they might be pushed back. (To the editor:) I tried to skim over unnecessary details, more details can be found for instance along with some parallel "QR++" (aka. REML) software at http://csm.pnl.gov/statistics/dll. With best regards, Joel M. Malard, Ph.D. LETTER 2:Dear HPCWire, We at SGI are seeing the same trends as the High-End Crusader described, and are in fact delivering systems being used for this type of analysis. For example, last quarter SGI delivered it's first Altix system with 4 Terabytes of globally addressable memory to a customer working on very large, in-memory, data-sets. We see HPC leaders exploring Terabyte sized shared memory systems, and several systems with Terabyte scale memory have already been deployed in homeland defense, life sciences, oil & gas, and other sciences. High end users need to scale systems not only in processor and I/O capabilities, but also memory size. The need for larger memories is growing as we start to get requests to configure up to 8 Terabytes of cache coherent commodity memory in a single system today. This memory capacity is independent of the number of processors in the system, allowing customers to make the right tradeoff for their application requirements, rather than buying more processors than they can make use of. This gain in memory capacity does not come at a sacrifice of performance, either bandwidth or latency. The maximum access time from any processor executing a load to any data in memory is less than 900ns (a large proportion of the data can be accessed in a small fraction of that time). A key to delivering the application performance is to maintain memory bandwidth across the interconnect, allowing the user to configure the system so data can be delivered at processor bus speeds, regardless of the location in the system. Several users are also seeing large, shared memory systems as an important complement to their traditional small-node clusters. Clusters are very good at a variety of tasks, and often produce a tremendous amount of data. These folks view a shared memory system as essential for the analysis required to turn that data into information and insight. Shared memory capability removes the difficulties in programming distributed memory machines - why not just have the processor issue a load instruction to read any piece of data in the system? The time is right to ponder: what could you do with a Petabyte of memory? Breakthroughs are happening today with terabytes of memory and we are excited to see what more of our customers will achieve as the old assumptions are broken and new algorithms emerge. Mike Woodacre, Chief Engineer, SGI, on behalf of the SGI Server Platform Group. GENERALIZED REPLY FROM THE HECSeveral columns ago, Rob Schreiber from Hewlett-Packard took issue [HPCwire, 107235] with some things I had written about sparse-matrix multiply. Paraphrasing Bill Dally, I had written: If the cache line is 16 words long, then random single-word memory accessing, say, in sparse-matrix operations, destroys spatial locality and leads to an effective bandwidth that is 16 times smaller than the physical bandwidth. Dally's own words were: For scientific codes that perform random single-word memory references, e.g., gathers and scatters for sparse-matrix operations, the bandwidth problem is multiplied by the large granularity of cache accesses. Schreiber wrote that intelligent ordering of rows and columns of the matrix in matrix-vector multiply using compressed row storage can cluster nonzero elements near the diagonal, and so the part of the source vector that has been touched once and will be touched again will likely fit in some level of the cache. Etc., etc. I'm afraid that Rob has sold us all a bill of goods. With all these sophisticated techniques, it requires real expertise to know which one provides the win in any particular situation. Moreover, rearrangement of rows and columns is possible _only in some cases_, and even then the pre-processing and post-processing come at enormous cost. Imagine the case where numerous sparse-matrix multiplies are generated dynamically with great frequency--- that is, with completely different sparse matrices every time---from inside a larger algorithm, such as climate modeling or radiation hydro. In "Analytic Search For Strategic Strike", I focused on some real graphs ('n' vertices, average out-degree sqrt n) taken from genuine analytic-search problems where any thought of finding locality was a joke. Dealing with problems where locality won't save your bacon comes hard to people addicted to cache-based processors. This reminds me of a sentence in the HECRTF final report: [The Federal government went with COTS-based solutions.] In the absence of clear evidence against this strategy, the promise of high aggregate performance at relatively low cost made procurement of COTS-based systems a sensible and appropriate course of action. We now have evidence that there are applications of national importance that would benefit significantly from an alternative to COTS-based solutions. Better late than never, although a few crusaders have been warning of the rather obvious sparse-matrix gap for nearly a decade now. What is the lesson in all of this? We are learning that similar seeming algorithms are not the same but may require _very different_ things from a high-end architecture; in particular, we are learning to put sparse-matrix and sparse-graph problems into different boxes with very different computational requirements. Finally, just a word about what a flat shared memory is. Obviously, it is not one in which are processors are physically equidistant from all memory locations. Rather, it is one where redistributing the data to bring it closer to the processor that computes them doesn't bring significant performance improvement. Thus, a conventional distributed-shared memory machine with cache-based processors does _not_ have a flat shared memory because it makes an enormous difference if working sets fit into caches or if larger working sets are always contained in local memories, even if remote nodes appear to be "reasonably close" in terms of network-transport latency. Flat shared memory---which is sometimes an expensive goal---is usually implemented by supplying uniform exceptional bandwidth to all words of system memory, and then using processor parallelism to tolerate latency ---which smooths the nonuniformity out (if you tolerate the physical latency, it doesn't matter how much it actually is). In "Analytic Search", I focused on extremely demanding sparse-graph problems where the requirement for genuinely flat shared memory for global tables is at its most exacting. The High-End Crusader |
||||
| | Table of Contents | |