Multi-processor Performance on the Tera MTA

Allan Snavely and Larry Carter (SDSC/UCSD) with
Jay Boisseau and Amit Majumdar (SDSC),
Kang Su Gatlin and Nick Mitchell (UCSD),
John Feo and Brian Koblenz (Tera Computer)

Abstract

The Tera MTA is a revolutionary commercial computer based on a multithreaded processor architecture. In contrast to many other parallel architectures, the Tera MTA can effectively use high amounts of parallelism on a single processor. By running multiple threads on a single processor, it can tolerate memory latency and to keep the processor saturated. If the computation is sufficiently large, it can benefit from running on multiple processors. A primary architectural goal of the MTA is that it provide scalable performance over multiple processors. This paper is a preliminary investigation of the first multi-processor Tera MTA.

In a previous paper [1] we reported that on the kernel NAS 2 benchmarks [2], a single-processor MTA system running at the architected clock speed would be similar in performance to a single processor of the Cray T90. We found that the compilers of both machines were able to find the necessary threads or vector operations, after making standard changes to the random number generator. In this paper we update the single-processor results in two ways: we use only actual clock speeds, and we report improvements given by further tuning of the MTA codes.

We then investigate the performance of the best single-processor codes when run on a two-processor MTA, making no further tuning effort. The parallel efficiency of the codes range from 77% to 99%. An analysis shows that the "serial bottlenecks" -- unparallelized code sections and the cost of allocating and freeing the parallel hardware resources -- account for less than a percent of the runtimes. Thus, Amdahl's Law needn't take effect on the NAS benchmarks until there are hundreds of processors running thousands of threads.

Instead, the major source of inefficiency appears to be an imperfect network connecting the processors to the memory. Ideally, the network can support one memory reference per instruction. The current hardware has defects that reduce the throughput to about 85% of this rate. Except for the EP benchmark, the tuned codes issue memory references at nearly the peak rate of one per instruction. Consequently, the network can support the memory references issued by one, but not two, processors. As a result, the parallel efficiency of EP is near-perfect, but the others are reduced accordingly.

Another reason for imperfect speedup pertains to the compiler. While the definition of a thread in a single processor or multi-processor mode is essentially the same, there is a different implementation and an associated overhead with running on multiple processors. We characterize the overhead of running "frays" (a collection of threads running on a single processor) and "crews" (a collection of frays, one per processor.)

Keywords: Tera multithreaded architecture, performance evaluation, NAS parallel benchmarks, scalability analysis, performance tuning, parallelizing compilers.

Contact Author:

Allan Snavely
UCSD Mail Code 0505
9500 Gilman Drive
La Jolla, CA 92093-0505
Email: allans@sdsc.edu
Phone: 619 534-5158
Web: www.sdsc.edu/~allans

1 INTRODUCTION

The Tera MTA ("Multi-Threaded Architecture") is a revolutionary commercial computer with the ability to perform zero-cost context switching and synchronization among threads. An architectural goal of the MTA is that a program written and optimized for a single processor will seamlessly scale up to multiple processor if there is sufficient parallel work in the program. On April 28th, 1998, the Tera Computer Company upgraded the MTA machine at the San Diego Supercomputer Center at the University of California, San Diego to a two processor configuration. This paper reports our experiences in taking the five NAS 2.3 kernel benchmarks [2] which we had previously optimized for a single processor MTA prototype [1] and assessing their performance in a two processor environment. We report performance obtained on one and two processors for the same size jobs as well as a quantification of parallel overhead incurred when running on more than one processor. We show how two useful Tera tools, Canal and Traceview, can be used to understand how the compiler has multithreaded a program and how effectively the program actually utilizes the hardware. The following sections briefly describe the MTA architecture, the NAS 2.3 kernel benchmarks, our tuning methodology, our results for each NAS benchmark and an analysis of parallel overhead.

2 The Tera MTA

The Tera MTA has no memory hierarchy for data. Instead, processors are connected via a network to DRAM memory units. Latencies to memory are high (on the order of 140-160 cycles.) However, even higher latencies can be tolerated by a combination of hardware lookahead and executing instructions from independent threads. To this end, the MTA keeps the context of up to 128 threads very close to the processor in hardware called streams. Each stream includes a program counter (PC) and a set of 32 registers; a stream holds the context for one thread. The processor executes instructions from streams that are not blocked in a fair fashion approximating round-robin. A stream can execute an instruction only once every 21 cycles (the length of the instruction pipeline), so a minimum of 21 streams are required to keep a processor 100% utilized even if no instructions reference memory. This implies that single-threaded performance on the MTA is poor. Also, if every instruction needed data from the previous instruction then only 128 cycles of memory latency could be tolerated. However, each stream can issue 8 memory references without waiting for any to return, so the instruction level parallelism (ILP) present in most codes sidesteps this problem.

The MTA uses a 3-wide instruction (VLIW) encoding 3 operations. Each cycle, an instruction can issue one memory reference and two other operations. The compiler annotates each instruction with a lookahead number. The lookahead is the number of subsequent instructions from the same stream that can be executed before the memory reference of the current instruction must be completed. A lookahead of 0 means that the next instruction needs the current memory reference and cannot be executed until it has completed. If lookahead were always 5, about 30 streams would suffice to saturate the processor. Fortunately, scientific codes exhibit a high level of ILP. The compiler effectively uses software pipelining to schedule the memory reference ahead of the uses, and it often achieves the maximum lookahead of 7 on the NAS kernel benchmarks.

The clock rate of our MTA is 255 MHz. Because each VLIW instruction can include a fused multiply-add operation and another (unfused) floating-point operation, the MTA is theoretically capable of over three quarters of a GFLOP/sec per processor. However, the limitation of only one memory reference per cycle, and the need to do indexing calculations, prevents this level of performance from being achievable on any but artificial codes.

The Tera MTA comes with a sophisticated compiler that automatically parallelizes serial code by decomposing it into threads. The compiler attempts to determine statically how much parallelism is available in a piece of computation (for example a loop) and decomposes it into a number of threads sufficient to saturate the hardware. While the results of such automatic parallelization are often satisfactory, the programmer may wish to "look under the hood" to see how the compiler has decomposed the computation. The feedback loop whereby the programmer observes the choices that the compiler has made and influences it to make different choices (due to the programmer's deeper understanding of the dynamic computation and/or the resources available) can lead to improved performance.

Tera provides two powerful tools that allow the programmer to profile performance and to understand how the compiler has structured computation. When a program is compiled with tracing enabled, an execution trace is created at runtime which can be viewed post-mortem via a graphical tool called Traceview. Traceview gives a clear idea of how well the executable uses the available hardware. If the programmer wishes to see how the program has been decomposed into threads and targeted to the hardware, a tool called Canal provides an annotated version of the source file which reveals this information. The use of these tools will be illustrated.

3 The NAS kernel benchmarks

Researchers at NASA have developed a set of eight benchmark programs that are representative of computational fluid dynamics applications. These NAS benchmarks have become a standard measure of supercomputer performance. Five of these programs, designated the "kernel benchmarks", are relatively small programs that nonetheless provide a significant performance challenge to processors, memory hierarchy and communications fabrics of today's supercomputers. These programs are CG (Conjugate Gradient), EP (Embarrassingly Parallel), FT (Fourier Transform), IS (Integer Sort) and MG (Multigrid). We use the NPB 2.3-serial versions of these codes, which are essentially the NPB 2.3 parallel benchmarks that have been "de-parallelized" by removing all MPI calls.

In [1] we extrapolated the performance of a one processor MTA prototype, which ran at 145 MHz, to 333 MHz and compared that performance to a single processor of the T90. In [3] we updated these results and present one- and two-processor results on three full application codes. Subsequently, we have made further improvements to the NAS kernel benchmarks. The performance of the previous and current versions are shown in table and graphical form below.

Benchmark	T90 Vanilla	MTA
Benchmark	T90 Vanilla	Vanilla	Best from [3]	Current best
CG (MFLOP/S)	176	142	158	178
FT (MFLOP/S)	177	39.2	138	199
IS (MOP/S)	62	61.6	72.1	103
EP (MOP/S)	7.51	4.27	7.60	7.72
MG (MFLOP/S)	519	160	160	237

Table 1: Single-processor performance
By the "vanilla" benchmarks, we mean the versions provided by NAS for CG, FT, IS and MG, and a modified EP that has a parallelized (for the MTA) or vectorized (for the T90) random number generator. In Figure 1 the line at "1" on the y-axis represents the performance of one processor of SDSC's 14-processor Cray T90 vector supercomputer. This computer has a much faster clock (440 MHz) than the Tera MTA (255 MHz), and the T90 is able to perform multiple memory operations per cycle. On the vanilla versions of CG and IS, the MTA is able to compensate for these handicaps. As we will describe later, FT required "de-optimizing" code that was intended to benefit cache-based processors, and EP benefited from simple inlining of procedure calls. The final code, MG, is limited in performance by the memory bandwidth, for which the T90 has an unbeatable advantage.

Figure 1.

It should be noted that we did not optimize the T90 code beyond ensuring that the important inner loops are vectorized. The T90 performance could undoubtedly be improved; however, we believe that since the vanilla codes were written in style suitable for vector machines, the improvements would not be as great as for the MTA.

Below we describe the tuning effort for each benchmark on the MTA.

3.1 CG

To understand how the programmer and compiler can cooperate together to obtain high performance on the MTA, it will be useful to make a somewhat extended example of CG. Conjugate Gradient (CG) is a well known method for solving symmetric positive definite systems by iteration. By far the major timed portion of the NAS 2.3 kernel benchmark is spent computing a sparse matrix-vector product:

do j=1, lastrow-firstrow+1 sum = 0.d0 do k = rowstr(j), rowstr(j+1)-1 sum = sum + a(k)*p(colidx(k)) enddo w(j) = sum enddo

This doubly nested loop is parallelizable at the outer level (which the MTA prefers to inner loop parallelism [4]). The computation in the inner loop requires 2 flops and 3 memory accesses. If we recall than the MTA can issue at most one memory access per cycle, we observe that, even with pipelining and prefetching, the MTA will be able to get at most 2 flops every 3 instructions on this code. In fact, the compiler does generate tight code which achieves a high percentage of this upper bound [1]. A "heroic" tuning effort can, as we previously reported, break through the 2/3 flop per cycle by bit packing colidx. Bit packing involves packing several values, each requiring fewer than 64 bits, into a single 64 bit word. In this case three values of colidx can be packed in a single word. This allows for three-fold unrolling of the inner loop with 7 memory references. (Note that 3 iterations of the inner loop would require 9 memory references without bit packing.) The optimized code achieves 6/7 flop per instruction.

The Canal tool, applied to the compiled object, outputs an annotated version of the source which allows the programmer to verify the number of instructions used by the compiler to carry out the computation and to see how many streams will be allocated at runtime. Here is Canal output for the loop above;

Loop 27 in conj_grad_ at line 724 in loop 26 Loop scheduled: 3 memory operations, 2 floating point operations 3 instructions, needs 25 streams for full utilization

The Tera compiler, while powerful, is not infallible. Examining the Canal output can help the programmer determine, for instance, whether the operations are packed into the VLIW instructions as tightly as they could theoretically be. If not, the computation may be restructured to coax the compiler to "get it right." We required several rounds of trial-and-error to get the desired 7-instruction loop for the bit-packed code.

Even when the instruction stream is packed tightly, performance could be lower than expected. Recall that the compiler estimates the number of streams required to saturate the processor. (In the example above, it estimated that 25 streams would suffice, which turned out not to be true.) If the programmer suspects that the processor has not been kept saturated, the program can be recompiled with a -trace flag and rerun to generate an execution trace viewable post-mortem. Figure 2 shows a graphical representation from the Traceview tool applied to a run of CG on two processors.

Figure 2.

The horizontal axis represents runtime in clock cycles ("tics"). The vertical axis represents available issue slots and the yellow curve shows the utilization of these issue slots by CG. The 1.0 level would represent full utilization of the available issue slots on one processor. The 2.0 level represents full utilization of two processors. As can be seen, CG runs almost purely serial at first. This is all setup which is untimed in the benchmark. We made no effort to parallelize the untimed portion. The yellow block at the right hand end of the graph shows the execution "going parallel"; this is the timed portion of the benchmark. CG is using about 1.7 of the available issue slots on two processors. Interestingly, the one processor version uses about 95%. The difference is largely due to the imperfect network.

If Traceview shows disappointing utilization in a timed section, we can zoom in an area of interest and locate the source code of the problem area. If the problem area is narrowed down to a particular loop, we would consult the Canal output again to see what mix of floating point operations and memory operations made up the loop. We could also see if, in our estimation, the compiler had allocated enough streams for the loop. To try more streams we can insert a pragma in the source code to convince the compiler to ask for more streams. Iterating in this way allows the programmer to improve code in a guided fashion with a clear picture of what the compiler and hardware are doing in response to the code.

The improvements between the earlier and current versions of CG are mainly due to using directives to allocate more streams per processor than the compiler thought necessary. There was also a minor improvement achieved by directing it to allocating chunks of work at a coarses granularity -- 5 rows of the matrix at once rather than one at a time.

3.2 FT

The vanilla NAS 2.3 serial kernel version of FT (Fourier Transform) is optimized for modern workstations with memory hierarchies. Parallelism was sacrificed in order to obtain a higher degrees of data locality. For the Tera MTA, the opposite must be done in order to achieve good performance.

The NAS serial code copies a set of 1-D FFT's (the number is a parameter in a configuration file) to contiguous storage. This is done to increase spatial data locality and to remove associativity problems, which are notorious on the FFT with its large power-of-2 strides. The inner loop iterates over the 1-D FFT's providing reuse of the twiddle factors and spatial data locality. This data locality provided by the copying obscures the inherent parallelism.

For the MTA though, data locality is not a concern, but exposing parallelism is of paramount important. Thus to tune FT for the MTA the code was "un-optimized", in the sense that we went back to a more intuitive notion of how to perform the 3D FFT. We eliminated all copies to temporary storage and the computation is done over the 3D grid (even though this involved large power-of-2 strides). We also moved the 1-D FFT's to the outermost loop for each dimension of the 3D FFT; yielding approximately N^2 degrees of parallelism (The parallelism depends on which dimension is being FFT'ed. For the "Class A" size problem, it is at least 128*256 = 32,768-way parallelism).

The 1-D FFT's are implemented using a combination of the Stockham variant 1 and variant 2 algorithm (as described in [5]). The original NAS code only used the second variant, but the second variant has the problem that the first few iterations have short inner loops. Our code employed a method similar to that in the NAS 1 FT where the first variant is done for two iterations and then the second variant is done for the rest of the code. Thus the smallest inner loop consists of doing at least four butterflies.

Furthermore while doing the first variant we do both stages at a the same time, thereby reducing load/stores. As it turns out, the addressing necessary to write the data into the correct location at the end of the second stage is straightforward and this allows us to switch from the first to the second variant without performing a transpose.

Other changes to the code included parallelizing the time-consuming loop in compute_initial_conditions, and constructing two distinct procedures for the FFT and inverse FFT. This is done to prevent conditional branches checking if the conjugate of the twiddle factors need to be computed.

3.3 IS

The primary optimization employed for IS was quite similar to the bit packing employed for CG. The MTA can issue one word memory operation per cycle. Bit packing gives the effect of multiple references per cycle by returning multiple values in a single word. For memory intensive codes this technique increases the ratio of floating point operations to memory references and improves performance.

3.4 EP

EP (Embarrassingly Parallel) is perfectly parallel at the outer loop level. As mentioned earlier, it was necessary to change the random number generator to remove it as a sequential bottleneck. Most of the time in EP is spent generating the random numbers and in library routines for square root and log. The improvement in EP's running time was due to inlining these library routines.

Unlike the other kernel benchmarks, EP does not have a high percentage of memory references per instructions, so its parallel efficiency is not affected by the suboptimal network.

3.5 MG

MG provides another instance of the burden of legacy code. Profiling quickly directed us to four routines: resid, psinv, interp, and comm1p. Together, these routines comprised roughly 70 percent of the execution time. Studying these routines, a pattern became apparent: MG was expressly optimized for vector performance.

The first three share a similar structure. Each consists of a three-deep loop nest, where the inner loop has been split into two loops. A temporary array holds the values between the two inner loops. Each of these two loops vectorizes perfectly. However, both loops perform suboptimally on the MTA and superscalar workstations, since they suffer from a high communication to computation ratio (i.e. they perform many load/store operations for each add/multiply operation). We remedied this problems by a three-step optimization: eliminate the temporary array, then fuse the two loops, and finally skew and tile the resulting single inner loop. Tiling and eliminating the temporary array decrease the communication to computation ratio. Tiling and fusing increase instruction level parallelism.

The fourth, comm1p, had a different structure. comm1p changes the grid resolution in the computation. Doing so involves only copying of data from one array to another. For historical reasons probably reflecting the MPI heritage of the code, the vanilla implementation breaks this simple copy operation into four stages. As with the other routines, the implementors introduced extra arrays to buffer between these stages. While, each of the stages vectorizes perfectly, the resulting computation involves many more memory operations than are necessary. We coded an optimized version of comm1p to perform only the simple array-to-array copy.

4 Two processor results

Figure 3 shows the results of running our most highly-tuned versions of each NAS 2.3 kernel benchmark on one and two processors of the MTA. "Parallel Efficiency" is defined in Section 5. Results are reported for the MTA at San Diego and the MTA belonging to Tera in Seattle. To enable the benchmarks to run on two processors, it was only necessary, in all cases, to recompile the unmodified source file with the appropriate compiler flags. All runs are for Class A of the benchmarks, the smallest supercomputer class size.

Figure 3.

The primary cause of sublinear speedup for these benchmarks is insufficient memory bandwidth. For both MTA's, at the time of writing this paper, the networks connecting the processors to memory have less bandwidth than specified in the design due to defective links in the networks. The Seattle two-processor machine's network is estimated to be about 85% of full bandwidth. This means it can support two processors each issuing a memory reference on about 85% of their cycles, or one processor issuing a memory reference at the maximum rate of one memory reference per cycle. The San Diego machine's network has still lower bandwidth.

EP is not a memory intensive code and so scales linearly. Each of the other benchmarks are highly memory intensive and need more memory bandwidth than provided by either machine. All but FT run faster on the machine with more bandwidth. At the time of writing, we do not understand FT's performance completely, but it may be related to compiler or library differences at the two sites.

5 Scaling Analysis

In this section, we attempt to categorize the different causes of inefficient parallel execution in our two-processor experiments. Letting T(P) represent the time required to execute a program on a P-processor machine, we define

Parallel Efficiency = T(1) / (P*T(P))

For a "perfectly scalable" program this ratio is 1. A value less than 1 is typical, a value greater than 1 would correspond to superlinear speedup.

We will measure times in cycles. The quantity then P*T(P) represents all the cycles issued by all processors during a parallel execution. Our goal is to classify these cycles into those that represent efficient parallel execution plus those that represent (different types of) overhead. The simplest breakdown is given by the identity

T(1) / (PT(P)) + (PT(P)-T(1)) / (P*T(P)) = 100%
i.e., Parallel Efficiency + Overhead = 100%

This says that the parallel efficiency plus the overhead (the fraction of cycles required by a parallel machine that are in excess of the cycles required by a uniprocessor) gives the total of all cycles in a parallel execution.

We will give a more detailed breakdown, guided by a more detailed understanding of the MTA. Wherever the compiler is unable to find parallelism, it generates "serial code" with only one thread of execution. Otherwise, it generates threaded code. Rather than statically allocating streams once per program, the Tera compiler generates code that allocates streams only for the duration of a parallel region. This ensures that if multiple jobs are running simultaneously, there will not be idle streams allocated to one job that could be used by another. However, it does incur extra overhead, overhead that might be considered unnecessary when (as in our experiments) only one job is running.

Threaded code comes in two varieties: if the compiler is generating threads for a single processor, it can generate low-overhead "fray code"; otherwise, it generates "crew code" that can use streams on multiple processors. The overhead of allocating streams for a parallel region is about 8500 cycles for fray code, 25,500 cycles for crew code on one processor, and 32,000 cycles per processor (64,000 total cycles) for crew code on two processors [6].

Whenever the MTA is executing sequential code on one processor, the cycles on the remaining (P-1) processors will be wasted overhead. Thus, we can define:

Serial Overhead = (P-1)T(serial) / (PT(P))

where T(serial) is the time spent in serial code. We measured T(serial) by instrumenting our code with calls to the hardware cycle counter at the beginning and end of each serial region, and running this code on a single processor.

Another source of parallel overhead is the difference between 8500 cycles for allocating streams in fray code (used in the one-processor times), versus the 64,000 cycles for crew code. This gives:

Allocation Overhead = 55,500Regions / (PT(P))

where "Regions" is the number of times the program enters a parallel region. We were able to compute this number for each benchmark by using the compiler flag -trace:PAR_SIG to identify where streams are allocated in the code, together with knowing how many times each loop is executed. The number of regions is 825 for CG, 99 for FT, 20 for IS, 1 for EP, and 951 for MG.

In another set of experiments, we tried running crew code on a single processor. We found that the crew code was slower by more than the extra allocation overhead (25,500-8500 cycles) per parallel region. A possible explanation is that, when generating crew code, the compiler sometimes uses a different method of distributing loop iterations to processors (it can use block, block-cyclic, or dynamic distribution). To reflect this difference, we define:

Crew Overhead = (Tcrew(1) - Tfray(1) - 17,000Regions) / (PT(P))

Finally, we define

Other Overhead = 100% - Efficiency - (Serial+Allocation+Crew) Overheads

We believe the major contributor to "other overhead" is the imperfect network. This is supported by the fact that the codes (except FT) ran significantly faster on the Seattle machine with the better network, and that the size of the "other overhead" is consistent with the amount of network degradation. However, another contributing factor could be load imbalance between the streams. This might be particularly relevant for FT, where the units of work being allocated to the streams are fairly large -- an entire 1-D FFT. Unfortunately, we didn't run any experiments that could distinguish between network and load-balancing problems.

The result of this analysis is shown in the table below. The efficiency reported here differs from that shown in Figure 3 since these experiments used instrumented versions of codes that preceeded some of the final tuning efforts.

Benchmark	Parallel Efficiency	Serial Overhead	Allocation Overhead	Crew Overhead	Other Overhead
CG	78.82%	0.23%	1.58%	2.05%	17.33%
FT	74.59	0.04	0.04	4.84	20.50
IS	79.37	0.71	0.30	0.44	19.18
EP	99.26	0.00	0.00	0.03	0.71
MG	84.04	0.22	1.02	1.45	13.28

Table 2: Classification of Parallel Cycles

6 Conclusions

This preliminary study has focused on the performance and scaling characteristics of the first two-processor MTA. Our approach was intended to be as demanding as possible, in the following senses:

The single-processor code was tuned to be as fast as possible. (It's easy to get impressive parallel efficiency for inefficient codes.)
We performed no additional tuning for the two-processor runs.
We use the same problem size for one- and two-processor cases, rather than scaling the problem up with the number of processors.

Our investigation has yielded what might be called a negative result: we were unable to identify any serious limit to the scalability of the MTA performance up to a hundred or so processors. Before this can be made into a positive result, the network problems need to be fixed and we need a better understanding of the load-balancing issues. Other conclusions that can be drawn from our work include:

A single processor of the MTA is powerful and can perform well on scientific codes.
The programming model for the MTA is simple and the compiler is effective in automatically parallelizing code.
No source code modification is required to run on multiple processors.
Overhead required to spawn threads in fray and crew mode is light
The unparallelized sections of code in the NAS 2.3 benchmarks are almost insignificant.

References

[1] "NAS Benchmarks on the Tera MTA": J. Boisseau, L. Carter, K. Gatlin, A. Majumdar, A. Snavely, Workshop on Multi-Threaded Execution, Architecture, and Compilers, February 1998.
[2] see http://science.nas.nasa.gov/Software/NPB
[3] "CRAY T90 vs. Tera MTA: The Old Champ Faces a New Challenger",: Jay Boisseau, Larry Carter and Allan Snavely with David Callahan, John Feo, Simon Kahan and Zhijun Wu, Cray User's Group Conference, June, 1998.
[4] "Tuning the BLAS for the Tera": P. Briggs, Workshop on Multi-Threaded Execution, Architecture, and Compilers, February 1998.
[5] "A High-Performance FFT Algorithm for Vector Supercomputers": D.H. Bailey, NAS Technical Report, Nov 23, 1987.
[6] P. Briggs, personal correspondence.