
SAR Case Study

4 Architecture Design

4.1 Architecture Process Description

The architecture process (Figure 4-1) transformed processing requirements into candidate architectures of hardware and software elements. The system-level processing requirements were allocated to hardware and/or software functions. The architecture process resulted in an abstract behavioral description of the hardware and definition of the software required for each processor in the system. Hardware/software co-simulation verified the correctness and performance of the resulting architecture.

Figure 4-1: Architecture process.

4.1.1 RASSP Innovations in the Architecture Design Process

4.1.1.1 Hierarchical simulation (Performance Modeling)

A hierarchical VHSIC Hardware Description Language (VHDL)-based virtual prototype approach was used for signal-processor simulation in the Rapid Prototyping of Application-Specific Signal Processors (RASSP) methodology. This enabled design aspects to be followed through the design process from the Performance Model to the detailed design level within a common framework. The purposes of the performance simulations were the following:

Provided a quick, accurate performance comparison for candidate architectures. This included hardware and software architectures.
Provided sizing information with regard to the number of processing elements needed. This reduced the risk of ending with a system deficient in processing power or with a more expensive system with too much power.
Identified bottlenecks in the interconnect network and defined remedies early in the development process before reaching the detailed design phase. The bottlenecks could be in hardware or software.
Evaluated different mappings of the processing algorithms to the hardware processing elements early in the design process. Potential mapping inefficiencies were identified early in the design process where the impact of change was minimal.
Determined memory requirements based on the actual sequence of processing steps. Proper memory sizing had to be based on peak use rather than average use. Simulations also identified inefficient distribution of memory among processing elements.

4.1.1.2 Autocoding Software

The software development of the RASSP architecture process deviated significantly from traditional (functional decomposition) approaches. The partitioned software functionality was broken into four major areas for real-time application software:

Algorithm - Specified by data-flow graphs (DFG).
Scheduling, communications, and execution - Specified by mapping graphs to a specific architecture.
Command Program - Specified by a hierarchical, finite-state machine that controlled the execution and interaction of the application for its various modes and sub-modes.
Application-Specific Interface - Specified by a set of callable procedures that translated between the command program view of modes and sub-modes and the algorithm view of a collection of signal-processing graphs.

The RASSP program intended to automate these four areas as much as possible. This was accomplished by using a graph-based programming approach that supported correct-by-construction algorithm development. The scheduling, communications, and execution software was generated by an autocoding tool after the user defined the partitioning and mapping of the data-flow graph onto the specific hardware architecture. The command program was captured graphically in a state diagram, and the software code was auto-generated by the tool. An Application Interface Builder automatically generated the application-specific interface from the data-flow graph and state diagrams.

4.2 Functional Design

4.2.1 Architecture Sizing of SAR Algorithm

4.2.1.1 Algorithm Implementation Analysis (Latency, Bandwidth, Computational, and Memory Requirements)

Figure 4-2 is the SAR signal processing block diagram. The SAR Signal Processor had to process up to three of four possible polarizations. Its architecture had to be scalable by a factor of two in processing power and inter-processor communication bandwidth. This scalability was for future enhancements, such as polarimetric whitening filtering, CFAR target recognition processing, and autofocusing for other modes of operation, such as spotlight.

Each image frame was composed of 512 pulses with 2048 complex samples per pulse. Storing one image frame of one polarization required 8.4 Mbytes of memory, assuming 8 bytes for each complex point in the array. Azimuth processing required two frames of data.

At the maximum pulse repetition frequency (PRF) of 556 Hz, the 512 pulses needed to form an image frame were collected in about 0.92 seconds. If images for three different polarizations were produced at this rate, then the output interface had to support an average transfer rate of 27.32 Mbytes/sec, or 512 pulses per frame x 2048 samples per pulse x 8 bytes per sample x 3 polarizations x 1/0.92 frames per second.
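The arithmetic behind these figures can be checked with a short sketch. The variable names are illustrative only; every constant comes from the text above, including the assumption of 8 bytes per complex sample.

```python
# Frame size, frame collection time, and output rate from the figures in the text.
PULSES_PER_FRAME = 512
SAMPLES_PER_PULSE = 2048
BYTES_PER_SAMPLE = 8      # one complex point, as assumed in the text
POLARIZATIONS = 3
MAX_PRF_HZ = 556

# Memory for one image frame of one polarization.
frame_bytes = PULSES_PER_FRAME * SAMPLES_PER_PULSE * BYTES_PER_SAMPLE

# Time to collect the pulses of one frame at the maximum PRF.
frame_time_s = PULSES_PER_FRAME / MAX_PRF_HZ

# Average output rate when three polarization images are produced per frame time.
output_rate = POLARIZATIONS * frame_bytes / frame_time_s

print(f"one frame, one polarization: {frame_bytes / 1e6:.1f} Mbytes")
print(f"frame collection time:       {frame_time_s:.2f} s")
print(f"average output rate:         {output_rate / 1e6:.1f} Mbytes/sec")
```

Run as written, the sketch reports roughly 8.4 Mbytes per frame, 0.92 s per frame, and 27.3 Mbytes/sec, in line with the values quoted above.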

The interconnect bandwidth requirements were analyzed for the candidate architectures by the performance modeling effort.

Latency through the SAR Signal Processor could not exceed 3 seconds. The PRF of 200 to 556 pulses per second, coupled with the 512 pulses per frame, gave an interval of 2.56 seconds to 0.92 seconds between frames of the same polarization. The three polarization frames were received interlaced, and the frame output was required to be sequential. Latency in this case was defined as the interval between the arrival of the last pulse of an image frame and the start of the resulting image frame output. With this definition of latency, maximum latency was not a design driver. Reduction of memory demand was more of a design driver than latency when developing an implementation that needed to process and output data as quickly as possible.

Figure 4-2: SAR image processing data flow block diagram.

Table 4-1 lists the memory requirements and processing throughput estimates at the maximum input data rate; the estimates are the result of manual calculations. They provided a starting point for the performance modeling effort that defined the number of processors needed to meet the real-time algorithm requirements. The 48-tap Finite Impulse Response (FIR) filter and the Fast Fourier Transforms (FFTs) in range and azimuth compression dominated the processing requirement. The memory requirements for azimuth compression were caused by corner turning.

Table 4-1: SAR Processing and Memory Requirements

Processing Step	MOPS	KBYTES
Data Preparation	33	49
Video To Baseband (48 tap FIR)	639	49
Equalization	21	49
Range FFT	188	49
RCS Compensation	7	16
Azimuth FFT	342	50,356
Kernel Multiply	41	254
Azimuth IFFT	342	254
Input/Output Formatting	16	25,165
TOTAL	1629	76,241

Scalability, performance, and future upgradability requirements led to the investigation of commercial-off-the-shelf (COTS), floating-point, digital-signal-processor (DSP) modules for most of the SAR processing. The FIR filter, comprising 40 percent of the total processing requirement, was a strong candidate for dedicated hardware implementation. Specialized processors sacrificed total programmability for improved efficiency in implementing a given functionality. For example, a custom module using specialized, programmable, FIR-filter integrated circuits had a recurring cost of less than $2,000 to perform the FIR filtering of the SAR algorithm. If the 48-tap FIR filter processing had been computed in the time domain using quad i860 COTS DSP modules, at ~$30,000 and 320 MOPS computing capability each, then two modules would have been needed and the cost would have been ~$60,000. The architecture options to be investigated were identified at this point in the architecture process. The final selection was not made until after the more detailed evaluation by performance modeling and cost analysis. The detailed analysis evaluated a variety of architectures with different combinations of COTS and dedicated hardware. This included evaluation of a custom processor architecture specialized for high-performance, fixed-point, block-oriented algorithms and array processing, such as FFTs.
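The FIR share and the COTS cost estimate follow directly from Table 4-1 and the module figures quoted above. The sketch below reproduces that arithmetic; it assumes, optimistically, that each module delivers its full rated 320 MOPS to the filter, and the variable names are illustrative.

```python
import math

# MOPS column of Table 4-1.
MOPS = {
    "Data Preparation": 33,
    "Video To Baseband (48 tap FIR)": 639,
    "Equalization": 21,
    "Range FFT": 188,
    "RCS Compensation": 7,
    "Azimuth FFT": 342,
    "Kernel Multiply": 41,
    "Azimuth IFFT": 342,
    "Input/Output Formatting": 16,
}
MODULE_MOPS = 320      # quad i860 COTS module rating, from the text
MODULE_COST = 30_000   # approximate dollars per module, from the text

total_mops = sum(MOPS.values())
fir_mops = MOPS["Video To Baseband (48 tap FIR)"]
fir_share = fir_mops / total_mops

# Whole modules needed for the FIR alone (assumes ideal module utilization).
fir_modules = math.ceil(fir_mops / MODULE_MOPS)
fir_cots_cost = fir_modules * MODULE_COST

print(f"total: {total_mops} MOPS, FIR share: {fir_share:.0%}")
print(f"FIR on COTS: {fir_modules} modules, about ${fir_cots_cost:,}")
```

Compared with the recurring cost of less than $2,000 quoted for the dedicated FIR hardware, the estimate shows why the filter was singled out as a candidate for a custom implementation.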

4.2.1.2 Numerical Sensitivity Analysis

Matlab simulations were used to perform a numerical sensitivity analysis of the SAR algorithm to determine whether integer formats or fewer bits of precision would meet system accuracy requirements. Matlab was easier and more effective to use than VHDL because post-analysis tools, math libraries, and experienced personnel were available. The reference image supplied by MIT/LL was computed using IEEE double-precision floating point. The SAR requirement was that error power be less than -103 dB relative to the maximum output signal power. Table 4-2 lists analysis results for the six architectures identified in Section 4.3.2 and Table 4-3.

Table 4-2: Results of Numerical Sensitivity Analysis

SAR Processing	Alt 1	Alt 2	Alt 3	Alt 4	Alt 5	Alt 6
FIR	SP FP	12 bit	23 bit	24-bit BFP	12 bit	23 bit
FFT processing	SP FP	SP FP	SP FP	24-bit BFP	24-bit BFP	24-bit BFP
Rest of Algorithm	SP FP	SP FP	SP FP	SP FP	SP FP	SP FP
Accuracy	-163 dB	-113 dB	-161 dB	-147 dB	-113 dB	-145 dB

SP FP - Single Precision Floating Point	12-bit - 12 bit integer
BFP - Block Floating Point	24-bit - 24 bit integer.

All combinations met the requirement; however, developers decided that alternative 2 was too close to the requirement, particularly since the supplied image did not have the maximum allowed differences in pixel values.
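The kind of measurement behind such an analysis can be sketched as follows. This is not the Matlab analysis from the case study: the metric definition (mean error power divided by peak reference power), the synthetic complex tone, and all function names are assumptions made for illustration. Because the sketch quantizes a test tone directly instead of running the SAR algorithm, its numbers are not comparable with Table 4-2.

```python
import math

REQUIREMENT_DB = -103.0   # error-power limit quoted in the text


def quantize(value, bits):
    """Round a value in [-1, 1] to the nearest step of a signed fixed-point grid."""
    scale = 2 ** (bits - 1)
    return round(value * scale) / scale


def quantized_copy(signal, bits):
    """Quantize the real and imaginary parts of every sample."""
    return [complex(quantize(s.real, bits), quantize(s.imag, bits)) for s in signal]


def relative_error_db(reference, candidate):
    """Mean error power over peak reference power, in dB (assumed definition)."""
    error_power = sum(abs(r - c) ** 2 for r, c in zip(reference, candidate)) / len(reference)
    peak_power = max(abs(r) ** 2 for r in reference)
    if error_power == 0.0:
        return float("-inf")
    return 10.0 * math.log10(error_power / peak_power)


# Hypothetical reference: a unit-amplitude complex tone, not SAR data.
reference = [complex(math.cos(0.37 * n), math.sin(0.37 * n)) for n in range(512)]

error_12 = relative_error_db(reference, quantized_copy(reference, 12))
error_24 = relative_error_db(reference, quantized_copy(reference, 24))

for bits, err in ((12, error_12), (24, error_24)):
    verdict = "meets" if err < REQUIREMENT_DB else "misses"
    print(f"{bits}-bit copy: {err:.1f} dB, {verdict} the {REQUIREMENT_DB} dB limit")
```

The same comparison, applied to the output image of each candidate number format against the double-precision reference image, is the sort of check the analysis above describes.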

4.2.1.3 First Pass Partitioning of Hardware and Software

The following hardware features were common to all candidate architectures of the SAR Signal Processor (Figure 4-3):

A Host Interface Board provided control interface to the SAR Signal Processor from the host through the required RS232 link. The Host Interface was the master control processor for booting the SAR processor, running test diagnostics, and performing configuration control in the SAR processor. A COTS single board computer was chosen to perform these functions.
A Data I/O Board provided the required fiber-optic (FO) interface to the radar's data source and data sink. The board extracted data samples and auxiliary data from incoming radar data, and it provided this data to the signal processors. The board handled the formatting of the output image data. The type of non-standard data operations that were required, combined with the high data throughput, necessitated a custom design for the board. The board could have included a hardware implementation of certain signal-processing functions, such as FIR filtering.
Signal Processor Board(s) performed most of the SAR processing. The board(s) received input radar data from the Data I/O Board, sent the output image data to the Data I/O Board, and were controlled by the Host Interface Board. Signal processing was FFT intensive, and it could have been performed with COTS or specialized processor boards.
A VME bus and RACEway network provided communication paths between the Host Interface, Data I/O, and Signal Processor Boards.

Figure 4-3: Block Diagram of SAR processor architecture with external interfaces in test mode.

The following software features were common to all candidate architectures of the SAR Signal Processor (Figure 4-4):

GUI - Graphical User Interface. The user controlled the operation of the signal processor through the GUI program in the test (non-embedded operation) mode. The GUI command data and response data were sent to and from the Command Program via the Host Interface Board.
Command Program - The command program controlled the SAR processing program that ran the SAR algorithm.
SAR Processing Program - The SAR processing program was composed of control and polarization programs.

Figure 4-4: Top level software partitioning.

4.2.2 Flow-Graph Generation

The SAR signal-processing algorithm was expressed in an architecture-independent format as a data flow graph (DFG). This allowed developers to follow RASSP's hardware/software co-design process, in which the DFG algorithm could be partitioned between hardware and software. The SAR development used the Processing Graph Method (PGM) technology from the Naval Research Laboratory (NRL) because PGM was a standard, at least in the Navy, and tools were available to assist development. See the following documents for more details:

For PGM:

Processing Graph Method Specification: Version 1.0, Navy Standard Signal Processing Program, (PMS-412), December 1987
Processing Graph Method Tutorial, Navy Standard Signal Processing Program, (PMS-412), January 1990.

For a standard set of signal-processing PGM primitives developed for the Navy: ECOS Primitives Specification Library, CDRL Q003. These are referred to as the Q003 primitives.

For data flow graphs and autocoding process, see the application note Autocoding for DSP Algorithm (PGM).

The JRS RSS CAD (computer-aided design) tools were used to construct the SAR DFG. For more information on the CAD toolset, see Reusable Software Subsystem (RSS) User's Manual, JRS Research Laboratories, March 1994. When constructing a DFG, components from the existing library should be used. For PGM, this requires familiarity with the Q003 primitive library. Most signal-processing functions can be implemented by using a combination of these primitives. For the SAR, all required functions were defined down to existing Q003 library elements. If existing primitives could not have implemented some of the processing, then a special primitive would have been written. New primitives were defined within an Ada environment; however, the underlying code for the primitive itself could be written in C. The necessary interfaces to the Autocode toolset also had to be generated.

A data flow graph is relatively simple to read and put together; however, it is helpful to know the following PGM terminology. A graph represents a complete algorithm for a particular application, such as SAR. Graphs may contain subgraphs, which provide a hierarchical structure and simplify the creation of complex graphs. Examples of subgraphs are the range and azimuth subgraphs in the SAR graph (Figure 4-5). A graph consists of a set of nodes that represent primitive functions, such as a Q003 library element. A node contains input and output ports (Figure 4-6). Queues provide the primary data storage and transfer mechanism in a PGM graph and are represented by a first-in-first-out (FIFO) data structure. Nodes are low-level functions that range from simple to complex and perform processing for an application domain. An example of a node is the Finite Impulse Response (FIR) filter (Figure 4-7) of the range subgraph. Associated with each node is a set of Node Execution Parameters (NEP):

Threshold Amount - minimum number of data elements on each input queue before the node executes
Read Amount - number of data elements read in from the input queue when the node executes
Consume Amount - number of data elements to be deleted from each input queue after the node executes
Produce Amount - number of data elements written to each output queue after the node executes

The PGM provided two additional data sources: Graph Variables and Graph Instantiation Parameters. These are individual data items used to parameterize the graph during execution, for example to set the number of taps in a FIR filter or to supply the FIR coefficients in the range subgraph.

Figure 4-5: PGM graph of SAR.

Figure 4-6: PGM node.

The fundamental rule governing node execution is that a node executes when every one of its input queues contains at least its threshold amount of data. There is no notion of a sequential execution time for a node, as there would be in a conventional thread of control flow. Instead, nodes execute whenever there is sufficient data to process.
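The firing rule and the four Node Execution Parameters can be illustrated with a small model. This is a sketch of the data-driven behavior only, not PGM or the Q003 library: the Node class, its field names, and the sliding-window example are hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List


@dataclass
class Node:
    """Illustrative node: one primitive function plus its execution parameters."""
    func: Callable[[List[List[float]]], List[float]]
    threshold: int   # minimum elements on each input queue before firing
    read: int        # elements read from each input queue per firing
    consume: int     # elements deleted from each input queue per firing (1..threshold)
    produce: int     # elements written to the output queue per firing

    def ready(self, inputs: List[Deque[float]]) -> bool:
        # A node may fire only when every input queue holds the threshold amount.
        return all(len(q) >= self.threshold for q in inputs)

    def fire(self, inputs: List[Deque[float]], output: Deque[float]) -> None:
        windows = [list(q)[: self.read] for q in inputs]
        result = self.func(windows)
        if len(result) != self.produce:
            raise ValueError("primitive did not honor its produce amount")
        output.extend(result)
        for q in inputs:
            for _ in range(self.consume):
                q.popleft()


def run_until_starved(node: Node, inputs: List[Deque[float]], output: Deque[float]) -> int:
    """Fire the node for as long as its inputs satisfy the threshold."""
    firings = 0
    while node.ready(inputs):
        node.fire(inputs, output)
        firings += 1
    return firings


# Sliding sum over 4 samples that advances 2 samples per firing.
samples: Deque[float] = deque(range(8))
results: Deque[float] = deque()
window_sum = Node(func=lambda w: [sum(w[0])], threshold=4, read=4, consume=2, produce=1)
firings = run_until_starved(window_sum, [samples], results)
print(firings, list(results), list(samples))
```

Because the consume amount is smaller than the read amount in this example, successive firings overlap, which is how windowed operations such as FIR filtering can be expressed without any explicit loop counter.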

The last PGM concept to discuss is a family of nodes. Family notation allows a set of nodes that accomplish the same function to be grouped and handled graphically as one entity; this is represented by a heavy outline of the box, as seen for the range node and the input and output queues of range in Figure 4-5. Families are used to simplify the representation of parallelized functions.

The SAR algorithm has inherent parallelism. For example, in range compression, each pulse of data could be processed independently, so pulse data could be divided among a set of processing nodes for concurrent processing. The SAR algorithm input signal consisted of 512 pulses of 4064 data items. The input data stream could be split: the first pulse of range data was sent to the first of a family of range subgraphs, and each subsequent pulse was sent to the next family member. For the SAR, this split was done in the SPLIT node using the Q003 DFC_SWTH primitive. The range processing itself was grouped into one subgraph called range (Figure 4-7). After each data pulse or block was processed in range, the processed data blocks were written into a double-indexed queue.
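The pulse distribution described above amounts to a round-robin split. The sketch below shows the idea only; it does not use the DFC_SWTH primitive, and the function name and the choice to carry the pulse index alongside each pulse are assumptions for illustration.

```python
def split_round_robin(pulses, n_members):
    """Deal pulses to family members in turn: pulse 0 to member 0, pulse 1 to member 1, ..."""
    lanes = [[] for _ in range(n_members)]
    for pulse_index, pulse in enumerate(pulses):
        # Keep the pulse index with the data so later stages can restore pulse order.
        lanes[pulse_index % n_members].append((pulse_index, pulse))
    return lanes


# Seven placeholder pulses dealt to a family of three range subgraphs.
pulses = [f"pulse{n}" for n in range(7)]
lanes = split_round_robin(pulses, 3)
for member, lane in enumerate(lanes):
    print(member, [index for index, _ in lane])
```

Each family member then sees every third pulse and can run its range processing independently of the others.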

Azimuth processing could now be performed in parallel. Figure 4-8 shows azimuth processing; this subgraph was called azimuth. The data was recombined with the CONCAT node, using the Q003 DFC_CAT primitive, to produce the full frame of polarization data.

Each pulse or set of data was processed through a series of range-processing algorithms represented by the blocks video-to-baseband, equalization weight, range DFT, and RCS (Radar Cross-Section) calibration shown in Figure 4-2.

The D_MUX node used the Q003 DFC_DMUX primitive to form sequences of even and odd pulse samples. Each sequence was passed through a FIR node that had a NEP threshold amount of 2032 and that used the Q003 FIR_R1S primitive. The node had 8 to 48 taps, as determined by a graph variable. The coefficients for the FIR were graph variables to the node. The even and odd outputs were combined into a family and multiplexed back into a single stream of data. The data was then converted to complex data and filled to 2048 points so that the length was a power of two. The complex equalization weights, graph variables in the V_MULT node, were multiplied with the data. Weighted I/Q data were transformed to (compressed) range data by using a 2048-point FFT node. The FFT node used the Q003 FFT_CC primitive. The RCS weights, graph variables in the R_MULT node using the Q003 VCC_VMULT primitive, were multiplied with the data. The FANOUT node, using the Q003 DFC_SEP primitive, distributed the data to the appropriate azimuth channel.

Figure 4-7: PGM representation of range processing.

Figure 4-8 shows azimuth processing. Each range subgraph output its range data in segments, one segment for each azimuth segment. Each azimuth subgraph used the Q003 primitive DFC_SEP to collect all of the range segments that belonged in its azimuth segment. Azimuth processing then transposed the data using the Q003 primitive MOC_TPSE.

Figure 4-8: PGM graph of azimuth processing.
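The fan-out, collection, and transpose steps together form the corner turn. A minimal sketch follows, assuming that each range-compressed pulse is cut into equal contiguous segments; the case study does not spell out the segmentation, and the function names are hypothetical stand-ins for what the DFC_SEP and MOC_TPSE primitives accomplish in the graph.

```python
def fan_out(pulse, n_segments):
    """Cut one range-compressed pulse into equal contiguous segments (assumed layout)."""
    size = len(pulse) // n_segments
    return [pulse[i * size:(i + 1) * size] for i in range(n_segments)]


def corner_turn(pulses, n_segments):
    """Give each azimuth member its range segment from every pulse, then transpose it."""
    collected = [[] for _ in range(n_segments)]
    for pulse in pulses:                        # pulses must arrive in pulse order
        for member, segment in enumerate(fan_out(pulse, n_segments)):
            collected[member].append(segment)   # rows are pulses, columns are range bins
    # After the transpose each row holds one range bin across all pulses.
    return [[list(column) for column in zip(*rows)] for rows in collected]


# Three pulses of four range bins; the value 10 * pulse + bin makes positions visible.
pulses = [[10 * p + r for r in range(4)] for p in range(3)]
azimuth_inputs = corner_turn(pulses, 2)
print(azimuth_inputs[0])
print(azimuth_inputs[1])
```

After the transpose, each azimuth member holds complete pulse histories for its own range bins, which is the layout the azimuth FFT needs.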

Developers used PGSE to simulate and debug the SAR graph. The PGSE tool provided capabilities to execute graphs, along with debug facilities that used the Telesoft Ada debugger.

4.2.3 Develop Command Program

The command program initialized the SAR system, controlled the processing graphs as commanded by the radar system, and controlled the self-test functions. The command program was designed using the Shlaer-Mellor object-oriented approach and the Cadre ObjectTeam OOA/OOD tools. Information was manually transferred from the RDD-100 description because RDD-100 and the Cadre tool used different data paradigms. The program was developed in four stages:

Create an object relationship diagram
Create a state diagram for each object
Develop a driver procedure
Generate Ada code using the Cadre OOA2ADA tool.

The driver procedure performed the following functions:

Created each instance of each command program object
Established the object relationships
Set initial state of each object
Accepted user commands over the external command interface; generated events to the command processing object to process each command; and sent results back to the user command interface.

Developers experienced difficulty with the immature and unsupported OOA2ADA tool; the resultant code had to be extensively rewritten.

There were 3500 lines of code in the Command Program, of which 1800 were autocoded.

The CP_Callable Interface library implemented the interface between the command program and the autocoded application software. The design of the interface library was based on the SAR implementation in PGSE. The message structure was taken from an Autocode Design Document written by the autocoder vendor Management Communications and Control Incorporated (MCCI). There were 2300 lines of code in the CP_Callable Interface.

4.3 Architecture Selection

4.3.1 Initial Size, Weight, and Power

There was a requirement for four 6U VME customer-supplied modules to be placed in the chassis, which led to the selection of a VME backplane for the SAR processor. The maximum allowable dimensions for the SAR Signal Processor chassis were 10.5 x 20.5 x 17.5 inches, which allowed up to a 21-slot 6U VME card rack. The physical specifications of the architecture were the following:

Processing Capability - 4.8 GFLOPS computing capability - twice the processing requirement for SAR. Assumed 15 COTS processor boards, each with four i860 (least computational power of candidates being investigated), four customer-reserved modules, one host board, and one I/O module.
Weight - 55 pounds. The fully loaded chassis had to be less than 60 pounds, which was not a design driver, even with 21 modules at 24 oz. each, 3 pounds for the backplane, 10 pounds for the power supply, and 10 pounds for chassis.
Power - 740 watts for a fully populated chassis. Assumed 21 modules at 30 watts apiece (40 watts is maximum average power dissipation per 6U VME card for most air-cooled chassis), 85-percent efficiency for the power supplies, and input power of 24 to 32 volts DC. Maximum power for a baseline system had to be 500 watts.
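The slot, weight, and power figures above can be reproduced from the stated assumptions. In the sketch below, every constant is taken from the list; the per-processor rating is not quoted in the text and is derived here as the value implied by 4.8 GFLOPS spread over the i860 processors.

```python
# Chassis population assumed in the text.
COTS_BOARDS = 15
I860_PER_BOARD = 4
CUSTOMER_MODULES = 4
HOST_BOARDS = 1
IO_MODULES = 1

slots = COTS_BOARDS + CUSTOMER_MODULES + HOST_BOARDS + IO_MODULES

# Weight: 24 oz per module plus backplane, power supply, and chassis (pounds).
weight_lb = slots * 24 / 16 + 3 + 10 + 10

# Input power: 30 W per module delivered through 85-percent-efficient supplies.
power_w = slots * 30 / 0.85

# Per-processor rating implied by the 4.8 GFLOPS total (derived, not quoted).
processors = COTS_BOARDS * I860_PER_BOARD
implied_mflops_per_i860 = 4800 / processors

print(f"slots used:    {slots}")
print(f"weight:        {weight_lb:.1f} lb")
print(f"input power:   {power_w:.0f} W")
print(f"per processor: {implied_mflops_per_i860:.0f} MFLOPS implied")
```

The results, 21 slots, about 55 pounds, and about 740 watts, match the specifications listed above.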

4.3.2 Architecture Definition

The candidate architectures included COTS and custom processor boards. However, certain features were common to all candidates:

A Host Interface Board that provided external control interface via RS232 link to host
A Data I/O Board that provided external FO interface
Signal Processor Boards that performed most of the SAR processing
An Interconnect Network that provided communication.

The high throughput requirement and the accuracy and scalability requirements narrowed candidate DSP components for the Signal Processor Boards to high-performance floating-point processors, such as Intel's i860, Analog Devices' ADSP21060 (SHARC), Motorola's DSP96002, and Texas Instruments' TMS320C40. The ADSP21060 had the best performance and the i860 had the second best performance. The ADSP21060 could also cluster several DSPs together and had its own internal memory to reduce the number of peripheral components. This allowed more DSPs per board, or about two to three times the number of i860s.

Candidate COTS board solutions needed to be expandable to a number of DSPs across multiple Processing Boards. Also important was the available interprocessor communication, operating system (OS), and software support. COTS boards from Mercury Computer Systems, Inc., were selected over comparable boards from Sky Computer and CSPI because RASSP's autocoding tools from MCCI were being implemented first on Mercury software.

One architecture evaluated for the SAR processor was a custom board based on the Sharp LH9124 DSP chip. The LH9124 was a high-performance, fixed-point DSP optimized for block-oriented algorithms and array processing, including FIR and FFT operations. For example, the LH9124 was capable of performing a 1K complex FFT in 80.7 microseconds, which was well under the 460 microseconds required for the Analog Devices SHARC DSP. The LH9124 had no address-generation capability, so it needed external addressing, such as that generated by the Sharp LH9320 DSP address generator chip. A signal processing board would have required a more general-purpose processor for control and system interface functions, or it would have been managed completely by hardware control using FPGAs (Field Programmable Gate Arrays).

Performance modeling and Matlab simulations were used to size the different architectures. The eight candidate SAR processor architectures evaluated were the following:

COTS design with Mercury i860 MCV6 Processor Boards with and without a FIR Filter on the Data I/O Board
COTS design with Mercury ADSP21060 MCE6/MCV6 Processor Boards with and without a FIR Filter on the Data I/O Board
Custom Processor Boards based on ADSP21060 MCMs with and without a FIR Filter on the Data I/O Board
Custom Processor Boards based on the Sharp LH9124 with and without a FIR Filter on the Data I/O Board

During simulation the computation agent read pseudo-code that represented the program being executed from a file. The four basic pseudo-code instructions were compute, send, receive, and jump. The compute instruction represented execution of an application subroutine as a simple time delay. The delay times were obtained from published times for the candidate COTS library functions. The send instruction caused the computation agent to direct the communications agent to send a token to another CE. The token defined the data source, data destination, and data packet size. The receive instruction consumed received data. If the data had arrived, the specified queue was decremented. If the data had not arrived, the computation agent was blocked until the data arrived. The model tracked how much data was stored in the various queues, but it did not store actual data.

Figure 4-11: Processing Element model structure

The communications agent transferred data tokens between the local CE's memory queues and other CEs. In the SAR Performance Model, the communications agent broke data packets into the actual packets that were sent over RACEway. Upon receiving a token, the communications agent incremented the amount of data in the appropriate queue by the received amount. When sending a token, the agent decremented the appropriate data queue by the transmitted amount. Figure 4-12 shows the top level of the computation element in the form of the VHDL model.

Figure 4-12: Top level of the computation element in the form of the VHDL model.
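The behavior of the computation agent can be illustrated with a small interpreter for the four pseudo-code instructions. This is a sketch in Python, not the VHDL model: the instruction operands, the pre-scheduled arrival list that stands in for the communications agent, the unconditional form of jump, and all names are assumptions. As in the Performance Model, queues track amounts of data only.

```python
from dataclasses import dataclass, field


@dataclass
class ComputationAgent:
    """Illustrative CE: runs pseudo-code and tracks queue amounts, never real data."""
    program: list            # tuples such as ("compute", delay) or ("receive", queue, amount)
    arrivals: dict           # queue name -> list of (time, amount), sorted by time
    ce_id: int = 0
    clock: float = 0.0
    queues: dict = field(default_factory=dict)   # queue name -> amount stored
    sent: list = field(default_factory=list)     # tokens handed to the comms agent

    def deliver(self):
        """Credit each queue with the data that has arrived by the current time."""
        for name, events in self.arrivals.items():
            while events and events[0][0] <= self.clock:
                _, amount = events.pop(0)
                self.queues[name] = self.queues.get(name, 0) + amount

    def run(self, max_steps):
        pc = 0
        for _ in range(max_steps):
            if pc >= len(self.program):
                break
            op, *args = self.program[pc]
            pc += 1
            self.deliver()
            if op == "compute":      # a subroutine is modeled as a pure time delay
                self.clock += args[0]
            elif op == "send":       # token: data source, destination, packet size
                dest, size = args
                self.sent.append({"source": self.ce_id, "dest": dest,
                                  "size": size, "time": self.clock})
            elif op == "receive":    # block until enough data is present, then consume
                name, amount = args
                while self.queues.get(name, 0) < amount:
                    pending = self.arrivals.get(name)
                    if not pending:
                        raise RuntimeError(f"blocked forever on queue {name!r}")
                    self.clock = max(self.clock, pending[0][0])
                    self.deliver()
                self.queues[name] -= amount
            elif op == "jump":       # assumed unconditional: loop back to an instruction
                pc = args[0]
            else:
                raise ValueError(f"unknown instruction {op!r}")
        return self.clock


program = [("receive", "in", 4), ("compute", 2.0), ("send", 1, 4), ("jump", 0)]
agent = ComputationAgent(program, arrivals={"in": [(1.0, 4), (2.0, 4)]})
finish_time = agent.run(max_steps=8)   # two passes through the four-instruction loop
print(finish_time, [token["time"] for token in agent.sent])
```

In this run the first receive blocks until data arrives at time 1.0, while the second finds its data already queued, so the two result tokens leave at times 3.0 and 5.0.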

4.3.3.1.2 Switch Model

The switch element model, labeled X in Figure 4-10, modeled the RACEway crossbar when evaluating architectures based on COTS DSP boards from Mercury Computer Systems. The Mercury crossbar had six ports, with any port capable of connecting to any other port. Connections were made if the destination port was not blocked. The input and output ports are shown separately in Figure 4-13, although they were actually the same physical bi-directional port. Most of the switch model development time was devoted to accurately modeling how the RACEway crossbar handled message blocking and contention. A message was blocked if the output port was in use, either as the output or input port for another message. When messages arrived concurrently, priority was given to the message that arrived on the lower indexed port. Accurate modeling of message blocking and contention was needed to accurately evaluate the interconnect network performance. Once a link was established through the crossbar, it remained in use for a period of time determined by the data packet size.

Figure 4-13: Switch Element X model structure.

The message token used to model messages passing through the switch element was defined as a record in VHDL (Figure 4-14).

Figure 4-14: Message token used to model messages passing through the switch element.

The token "purpose" was used to request an interconnect link, acknowledge granting of a request, not acknowledge granting a request, or to preempt a link. The "route" and "index" fields were used to determine the switch output port, and the "length" field determined how long the link would be busy. The combination of switch models and tokens provided accurate modeling of the SAR processor RACEway interconnect.
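The blocking and priority rules of the switch model, together with the token fields, can be sketched as follows. The sketch is written in Python and is not the VHDL record of Figure 4-14: the field types, the encoding of route as one output port per crossbar, the use of length directly as busy time, and the class names are assumptions. Preemption is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Token:
    purpose: str    # "request", "ack", "nack", or "preempt"
    route: tuple    # assumed encoding: output port to take at each crossbar on the path
    index: int      # which entry of route applies at the current crossbar
    length: int     # packet size, used here directly as the link-busy time


class Crossbar:
    """Six bi-directional ports; a port carries at most one message at a time."""
    PORTS = 6

    def __init__(self):
        self.busy_until = [0] * self.PORTS   # time at which each port becomes free

    def arbitrate(self, now, requests):
        """requests maps input port -> Token; returns input port -> "ack" or "nack"."""
        replies = {}
        for in_port in sorted(requests):     # lower-indexed port wins concurrent arrivals
            token = requests[in_port]
            out_port = token.route[token.index]
            free = (in_port != out_port
                    and self.busy_until[in_port] <= now
                    and self.busy_until[out_port] <= now)
            if free:
                # Both ends of the link stay busy for a time set by the packet size.
                self.busy_until[in_port] = now + token.length
                self.busy_until[out_port] = now + token.length
                replies[in_port] = "ack"
            else:
                replies[in_port] = "nack"
        return replies


switch = Crossbar()
first = switch.arbitrate(0, {
    0: Token("request", (3,), 0, 10),
    1: Token("request", (3,), 0, 5),    # contends with port 0 for output port 3
    2: Token("request", (4,), 0, 5),
})
retry_early = switch.arbitrate(5, {1: Token("request", (3,), 0, 5)})
to_busy_input = switch.arbitrate(5, {5: Token("request", (0,), 0, 5)})
retry_late = switch.arbitrate(10, {1: Token("request", (3,), 0, 5)})
print(first, retry_early, to_busy_input, retry_late)
```

The request from port 5 to port 0 is refused even though port 0 is only the input side of another message, which reflects the rule that a port in use in either direction blocks new links.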

4.3.3.1.3 Hierarchical Structural Model

The CE and X elements were first assembled into models for the various board types as shown in Figure 4-10. The board models were then connected together to generate the model of the entire SAR processor. The use of the hierarchical VHDL structural models at the processor board and system level made it easy to modify the architecture to investigate architectural alternatives.

4.3.3.2 Performance Modeling of the SAR Processor Software

The first step in developing the SAR processor software model was to describe the algorithm as the data flow graph shown in Figure 4-5. Each graph node represented a SAR processing primitive, such as FFT, vector multiply, or convolution. The arcs between graph nodes represented data dependencies.

Because a single processor could not perform all SAR processing in real time, the next step was to partition the data flow graph into a set of partitioned graphs. The partitioned graphs were then mapped to the processing elements in the hardware model. Graph partitioning and mapping for the SAR application were performed manually because tools for automatic partitioning and mapping were unavailable.

The final step was to generate the pseudo-code application program for each processing element by scheduling graph-node execution. An existing program was then used to generate the set of pseudo-code application programs for each processing element in the SAR processor. Static partitioning/mapping/scheduling were used because the required processing did not change dynamically. The pseudo-code programs were stored in files, and each instantiated processor element in the model read its program from file during simulation and performed the indicated operation. Arithmetic operations were modeled by a delay, and I/O operations were used to set up the queues in the processor element model's communication interface.

Figure 4-15: Steps in developing the SAR processor software model.

Data communication was modeled by passing tokens through the modeled interconnect network. The Performance Model tokens identified message type, size, source, and destination. The size determined how long interconnect links were "busy" with the message, and the message type was used by the receiving processing node to determine when to fire the next processing step. When modeling the RACEway interconnect, the tokens also included the network routing information and, in some cases, message priority. Figure 4-16 is an example of the pseudo-code generated for a CE in an 8-CE partition by the software generation program.

4.3.3.3 Performance Model Simulations

Several candidate SAR processor architectures were evaluated using simulation of the VHDL Performance Model performing the SAR algorithms. For example, the number of processing boards required was determined by simulating several image frames on models having different numbers of boards. The simulation results were post-processed to generate time-line plots showing use for each processing element. Changing the number of boards required minimal effort. The structural model of the hardware was modified by adding or subtracting boards, and the software generation program was rerun for the different number of processors and/or mapping assignment. A change in number of boards in the model took less than a day to complete, including resimulation. Changes to mapping assignment were completed in four hours or less. The low-level hardware models and the signal-processing DFGs were unchanged by the architecture variations.

Figure 4-16: Pseudo-code generated for a CE in an 8-CE partition by the software generation program.

Five frames of data were processed to allow processing to reach the steady-state condition. The maximum resource requirement occurred in steady-state when data input, range processing, azimuth processing, and data output were all active. The performance simulations determined that three processing boards were required for the SHARC COTS architecture and six boards were required for the i860 COTS architecture.

If the rest of the board architecture was left unaffected, then switching between SHARC and i860 required changing only the delay values assigned to processing operations in the processing element model. This was possible because the SHARC links were not used by the SAR processor architectures and so they were not included in the model. The full custom SHARP-based architectures were not performance modeled, and they were eliminated based on cost and schedule risks. A performance simulation of the SHARP-based architectures would have required more extensive model modifications. Also, modeling custom architectures required more effort in determining the time required for performing standard signal-processing operations. These times were usually available for COTS DSP boards and were incorporated into the processor element model.

Performance Model simulations also provided memory use at each processing element. The candidate COTS architectures had memory associated with each processor element instead of global memory. Dynamic memory use was captured during simulation by statements included in each processor element model, and memory use was plotted after post-processing the use data. Equalization of memory requirements over the processor elements was desired to minimize the number of processor/memory module types. The highest memory requirements were for the I/O control processor. This processor was a processor element assigned the data I/O control function during mapping of the SAR application. The performance simulations were used in developing a mapping that reduced the I/O processor memory requirements to those of a standard module type. In addition, the performance simulations were used to develop a priority scheme that avoided bottlenecks at the interface to the Data I/O Board. Incoming data was given higher priority than outgoing data.

Time-line plots of the interconnect network were used to identify bottlenecks due to hardware or software. One result of the performance-based simulations was the determination that corner-turn data should be distributed as soon as it was calculated during range processing. Waiting to distribute the data until a full frame of range processing completed resulted in degraded performance due to high peak demand on the interconnect network. The corner-turn problem was detected when the use time-line plots for processor and interconnect link were examined. When the corner-turn data was not distributed when first calculated, all processors were stalled during corner-turn, while the interconnect became bogged down with multiple corner-turn transfers at the end of each frame of range processing. When the distribution of corner-turn data was spread over time, the number of processors required was reduced because processors did not stall waiting for input data, and the load on the interconnect network was leveled.
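The corner-turn effect can be reproduced with a toy model. The sketch below assumes a single shared link that carries one transfer at a time, with made-up compute and transfer times. This is a deliberate simplification of the modeled interconnect, intended only to show why deferring all transfers to the end of a frame lengthens the frame.

```python
def frame_finish_time(pulses, compute_s, transfer_s, distribute_early):
    """Time at which the last corner-turn transfer of one frame completes.

    pulses: number of range-processing steps in the frame
    compute_s: compute time per step (hypothetical)
    transfer_s: link time needed to move one step's output (hypothetical)
    distribute_early: send each result as soon as it is computed
    """
    link_free = 0.0
    for k in range(1, pulses + 1):
        # Data is ready either right after its own step or only at frame end.
        ready = k * compute_s if distribute_early else pulses * compute_s
        link_free = max(ready, link_free) + transfer_s
    return link_free


early = frame_finish_time(pulses=8, compute_s=1.0, transfer_s=0.5, distribute_early=True)
late = frame_finish_time(pulses=8, compute_s=1.0, transfer_s=0.5, distribute_early=False)
```

With these illustrative numbers, early distribution hides all but the last transfer behind computation, while late distribution serializes every transfer after range processing has finished.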

Developing the SAR processor's VHDL performance models and running the simulations took two engineers about five weeks, for a total of 371 hours. About 1378 source lines of code (SLOC) were generated for the models, and an additional 1657 SLOC were generated for the test benches that verified the correctness of the models. Future efforts should require much less time because this original effort included significant learning time and time to develop models from scratch. Later efforts can reuse existing models, which will greatly reduce development time.

A SPARC-10 CPU took 28 minutes to run a SAR processor performance simulation of a 24-processor architecture that ran five seconds of SAR application. When considering the number of processor elements modeled and their instruction rate, the effective execution rate of the simulation was about 2.8 million instructions-per-second. The performance simulations yielded measurements of processing and communication latencies; throughput; event timelines; and use of memories, processors, and links. The final SAR processor system met its timing and resource-use requirements, and performance fell within eight percent of that predicted by the performance modeling.
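The effective execution rate can be sanity-checked with simple arithmetic. The sketch below assumes that "effective execution rate" means total modeled instructions divided by simulation run time. The source does not spell out this definition, so the per-processor rate it yields is an inference, not a quoted figure.

```python
def implied_instruction_rate(effective_rate, sim_run_s, processors, app_s):
    """Per-processor instruction rate implied by the quoted simulation figures.

    Assumes effective_rate = total modeled instructions / simulation run time.
    """
    total_instructions = effective_rate * sim_run_s
    return total_instructions / (processors * app_s)


# 2.8 million instructions per second, 28 minutes, 24 processors, 5 s of application
rate = implied_instruction_rate(2.8e6, 28 * 60, 24, 5.0)
```

Under that assumption the figures correspond to roughly 39 million instructions per second for each modeled processor element.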

Time-line information was captured by placing statements in the models to write the time and name of relevant events to a history file. The history files were used to produce time-line graphs that showed the history of task execution on each processor node. The time-lines were useful in visualizing and understanding the impact of software mapping options. The time-line graphs showed the time when the processor elements were idle due to data starvation or buffer saturation, and they helped to isolate resource contentions and bottlenecks. Figure 4- 17 is a processing timeline plot of when specific processor elements were busy processing tasks. Similar timeline graphs can be generated that show when processor elements are sending or receiving data or when communication links are in use.
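A history file of this kind is straightforward to post-process. The following Python sketch assumes a hypothetical line format of `<time> <processor> <start|end> <task>`. The actual format written by the VHDL models is not given in the source.

```python
from collections import defaultdict


def utilization(history_lines, total_time):
    """Fraction of total_time each processor spent busy.

    Each line is '<time> <processor> <start|end> <task>' (hypothetical format).
    """
    busy = defaultdict(float)
    started = {}
    for line in history_lines:
        time_s, proc, kind, _task = line.split()
        if kind == "start":
            started[proc] = float(time_s)
        elif kind == "end":
            busy[proc] += float(time_s) - started.pop(proc)
    return {proc: t / total_time for proc, t in busy.items()}


history = [
    "0.0 CE0 start range",
    "0.0 CE1 start range",
    "1.0 CE1 end range",
    "2.0 CE0 end range",
    "3.0 CE0 start azimuth",
    "4.0 CE0 end azimuth",
]
result = utilization(history, total_time=4.0)
```

The same start/end pairs could be drawn as horizontal bars per processor to produce a time-line graph. The gaps between bars are the idle periods that indicate data starvation or buffer saturation.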

Figure 4- 17: Sample processing timeline.

Plots of memory allocation as a function of time were valuable in visualizing and balancing memory use during execution of the SAR algorithm. Figure 4- 18 is a memory allocation time line from performance modeling.
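A memory allocation time history can be derived from allocation and release events. The sketch below is illustrative only: the event representation and the byte counts are assumptions, not data from the SAR simulations.

```python
def memory_profile(events):
    """Build a memory-in-use history for one processor element.

    events: (time, delta_bytes) pairs; a positive delta is an allocation and
    a negative delta is a release. Returns the (time, bytes_in_use) history
    and the peak number of bytes in use.
    """
    in_use = 0
    history = []
    for time_s, delta in sorted(events):
        in_use += delta
        history.append((time_s, in_use))
    peak = max(used for _, used in history)
    return history, peak


events = [(0.0, 4096), (1.0, 8192), (2.0, -4096), (3.0, 2048)]  # made-up values
history, peak = memory_profile(events)
```

Comparing the peak across processor elements is one way to judge whether a mapping equalizes memory requirements, which was the stated goal for limiting the number of module types.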

Figure 4- 18: Example memory allocation time history.

4.3.4 Architecture Trade-off Analysis

The selected architecture for the SAR processor was COTS ADSP21060/2 boards with a FIR Filter on the Data I/O Board (candidate 4 in Table 4- 3). The FIR Filter provided greater processing margin in the COTS DSPs and a substantial recurring cost savings. Performance Model simulations determined the processing margin. The ADSP21060 architecture was the best candidate in size and weight. Schedule, cost, and technical considerations were also important factors. The backup architecture was the i860 COTS Processor Board (candidate 2 in Table 4- 3).

Table 4- 3: Comparison Matrix for Candidate SAR Processor Architectures

Architecture Candidates	1	2 (backup)	3	4 (selected)	5	6	7	8
Configuration
Host I/F Module	COTS 68040 SBC	COTS 68040 SBC	COTS 68040 SBC	COTS 68040 SBC	COTS 68040 SBC	COTS 68040 SBC	COTS 68040 SBC	COTS 68040 SBC
Data I/O Module FO I/F	TriQuint HRC-500	TriQuint HRC-500	TriQuint HRC-500	TriQuint HRC-500	TriQuint HRC-500	TriQuint HRC-500	TriQuint HRC-500	TriQuint HRC-500
FIR	NO	PDSP16256	NO	PDSP16256	NO	PDSP16256	NO	PDSP16256
Complexity	Medium	High	Medium	High	Medium	High	Medium	High
Processor Module Type	MCV6	MCV6	MCE6/MCV6	MCV6	Custom	Custom	MCV6 Sharp	MCV6 Sharp
# of modules	7	5	3	2	3	2	1 3	1 3
Module Config.	4 i860's	4 i860's	8 ADSP21060	8 ADSP21060	8 ADSP21060	8 ADSP21060	4 i860 2 LH9124	2 i860 1 LH9124
Memory	32 Mb per module	32 Mb per module	32 Mb per module	32 Mb per module	32 Mb per module	32 Mb per module	32 Mb 26 Mb	32 Mb 24 Mb
Interconnect (VME +)	RACEway	RACEway	RACEway	RACEway	RACEway	RACEway	RACEway	RACEway
Risks
Schedule/Cost	Lowest	Low	Medium	Medium	High	High	High	High
Technical	Lowest	Low	Low	Low	High	High	High	Medium
Major Risk Item	Obsolescence	Data I/O Complexity	NO VME to MCE6	Data I/O Complexity	Software - Board Support Package	Data I/O Complexity	Module Design	Data I/O Complexity
		Obsolescence	Module Availability	Module Availability	MCM Design	MCM Design
System Characteristics
Recurring cost
Memory (Total System)	240 Mbytes	176 Mbytes	184 Mbytes	136 Mbytes	160 Mbytes	144 Mbytes	111 Mbytes	109 Mbytes
Computation FFT	Single Precision Floating Point	Single Precision Floating Point	Single Precision Floating Point	Single Precision Floating Point	Single Precision Floating Point	Single Precision Floating Point	24 bit Block Floating Point	24 bit Block Floating Point
FIR	Single Precision Floating Point	12 or 23 bit integer	Single Precision Floating Point	12 or 23 bit integer	Single Precision Floating Point	12 or 23 bit integer	24 bit Block Floating Point	12 or 23 bit integer
Accuracy	-163 dB	-113 dB (12 bits) -161 dB (23 bits)	-163 dB	-113 dB (12 bits) -161 dB (23 bits)	-163 dB	-113 dB (12 bits) -161 dB (23 bits)	-147 dB	-113 dB (12 bits) -145 dB (23 bits)
Latency	< 3 Sec	< 3 Sec	< 3 Sec	< 3 Sec	< 3 Sec	< 3 Sec	< 3 Sec	< 3 Sec
Controllability, Testability, & Maintainability	Good	Good	Good	Good	Fair	Fair	Fair	Fair
Scalability(2x)	Does not meet requirement	Requirements met if modify chassis design	Meets Requirement	Exceeds Requirement	Meets Requirement	Exceeds Requirement	Meets Requirement	Meets Requirement
Size & Weight	Poor	Fair	Good	Good	Good	Excellent	Good	Good
Worst Case Power (Watts)	431	371	309	299	310	300	359	320

The lowest-risk architecture in terms of schedule and cost was the i860 COTS Processor Board because it was available. PRICE was the tool used to estimate development and life-cycle cost. The main concern with the i860 COTS boards was future obsolescence of the i860: Intel said it did not intend to upgrade the product. However, the i860 COTS architecture could accommodate model-year upgrades because the backplane interface was processor independent. The main risk associated with the ADSP21060 COTS architecture was the availability of the COTS boards, which were unavailable when the architecture selection decision was made. Developing a custom ADSP21060 board or LH9124 board had greater schedule and cost risks associated with MCM (multi-chip module) development, custom processor-board development, and lack of software support. The final SAR processor hardware used i860 COTS boards because of the availability problems with the ADSP21060 COTS boards. The SAR processor architecture provided a path for future upgrade to ADSP21060 or some other COTS boards.

4.4 Architecture Verification

4.4.1 Abstract Behavioral Simulation

An abstract Behavioral Model describes function as well as timing. The model is abstract in that interfaces are not resolved down to the individual hardware pin level. The abstract Behavioral Model, also called the abstract behavioral virtual prototype, for the SAR processor was generated by adding function to the Performance Model. The abstract behavioral virtual prototype was used to verify the numerical correctness of the software-to-hardware mapping, to generate test data, to provide system visualization, and to verify overall SAR processor implementation. Unlike performance modeling, actual data values were used in the abstract behavioral virtual prototype. However, bit-true formats were not necessary. Therefore, the signal links that connected interconnect functional units were represented abstractly as pathways over which data packets were transferred. Time resolution was at the major event level.

The starting point for developing the SAR processor abstract Behavioral Model was the Performance Model. The processor element models were modified by adding actual program code for each software operation. The tokens used in modeling interconnect network activity were augmented by the addition of a field containing the actual data in the packet. The processor element models received the data packets, performed operations defined by the software for the abstract application program statements, and sent data packets to the next processing node. Sufficient memory had to be allocated at each processor element to store real data. Timing was handled using delays, as was the case for performance modeling.
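The step from performance token to behavioral token can be sketched in Python as follows. The class names, field names, and the scaling operation are hypothetical. The point is only that the behavioral token carries real data while timing is still represented by a fixed delay.

```python
from dataclasses import dataclass, field


@dataclass
class PerfToken:
    msg_type: str
    size_bytes: int
    source: int
    destination: int


@dataclass
class BehavioralToken(PerfToken):
    payload: list = field(default_factory=list)  # actual data values in the packet


def process(token, operation, delay_s, now, next_node):
    """Apply a real operation to the payload; timing still comes from a delay."""
    result = operation(token.payload)
    out = BehavioralToken(token.msg_type, token.size_bytes,
                          source=token.destination, destination=next_node,
                          payload=result)
    return out, now + delay_s


def scale_by_two(values):
    # Stand-in for a real signal-processing step such as an FFT.
    return [2.0 * v for v in values]


token = BehavioralToken("pulse", 16, source=0, destination=1, payload=[1.0, 2.0])
out, finished_at = process(token, scale_by_two, delay_s=0.25, now=1.0, next_node=2)
```

Because the behavioral token extends the performance token and does not replace it, the routing and timing logic of a performance model can in principle be reused unchanged, which matches the traceability between the two model types described here.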

Figure 4- 19 is an example of the pseudo-code software program for the abstract behavioral simulation that corresponds to one pulse of range processing Performance Model pseudo-code in Section 4.3.3.2.

Figure 4- 19: Example of pseudo-code program for the abstract behavioral simulation.

A comparison of this code to that for the Performance Model in Section 4.3.3.2 shows that the two are similar, but that more information is required in the abstract Behavioral Model. In the Performance Model all the range processing steps were lumped into one combined delay term in a compute instruction. In the abstract Behavioral Model, each operation was defined separately and had its own call to a procedure in the CE model.

In the Performance Model, the Data I/O Board was modeled as a source and sink for data packets. In the abstract behavioral virtual prototype, the Data I/O Board model included functions, such as FIR filtering, that were implemented in hardware. In addition, the abstract behavioral virtual prototype was designed to interface to the Executable Specification test bench. The Executable Specification test bench modeled the SAR processor interface at the bit-true level, which required more detail in the Data I/O Board model to convert to the token representation of the abstract Behavioral Model elements.

The SAR processor abstract behavioral virtual prototype was used to:

Provide an unambiguous definition of SAR processor functionality, including system level timing. This was valuable system documentation and facilitated future modifications or upgrades.
Provide a means to visualize and understand operation of the SAR processor. By demonstrating the behavior of the system before it was built, the virtual prototype provided a convenient means to communicate this understanding to other engineers and customers.
Generate intermediate data sets for design verification of individual SAR processor functional elements. This was possible because the virtual prototype provided access to intermediate data, such as the data at the interface between the Data I/O Board and the rest of the processor.
Provide test data sets used in hardware/software debug of the SAR processor.

The abstract behavioral virtual prototyping required 1,171 labor hours for model generation and simulations. The model required 3,480 lines of new code and 1,102 lines of reuse code. Most of the reuse code was from the Executable Specification. The test benches required 500 lines of new code and 1,657 lines of reused code.

The abstract behavioral simulation of the SAR system consumed approximately 14 CPU-hours for 5 seconds of real-time data and exhibited an effective execution rate of 23,810 instructions per second. The processed output images shown in Figure 4- 20 matched the resulting target system to within -150 dB of error power per pixel. It was much more convenient to work with smaller data sets and test images when investigating design options. A test image that was 1/64 the size of a full image was developed and used during debug.

4.4.2 Autocode Generation

The Autocoding Toolset developed by MCCI was used to render the SAR PGM graphs into a set of C language source files that implemented SAR's signal processing functionality. The source code produced contained calls to the MCCI's Static Run Time System (SRTS) libraries, which provided run-time support for graph execution and control and for queue/data management.

The Autocoding Toolset was composed of the Partition Builder, MPID Generator, and the Application Generator.

The Partition Builder processed the node-to-processor assignment information to determine unique partitions called Partition Graphs (PG). A partition was a group of nodes whose functions all executed on the same processor. A processor could have more than one partition assigned to it.
The MPID Generator translated each partition graph into C source-code statements that implemented a control flow version, called an MPID, of the processing described by the partition graph. This MPID referenced the vendor math libraries for a particular target processor and included calls to services provided by the SRTS, such as reading and writing to queues.
The Application Generator (AG) tool translated the Equivalent Application Graphs (EAGs) for the application into C source code and data structures that interfaced with the graph executing SRTS.
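The grouping performed by the Partition Builder can be illustrated with a small Python sketch. It is not MCCI's algorithm or file format: the dictionaries, node names, and partition labels are hypothetical and serve only to show nodes grouped into partitions, with more than one partition allowed per processor.

```python
from collections import defaultdict


def build_partitions(node_to_partition, partition_to_processor):
    """Group graph nodes into partitions, then group partitions by processor."""
    partitions = defaultdict(list)
    for node, part in node_to_partition.items():
        partitions[part].append(node)
    by_processor = defaultdict(dict)
    for part, nodes in partitions.items():
        by_processor[partition_to_processor[part]][part] = sorted(nodes)
    return {proc: dict(parts) for proc, parts in by_processor.items()}


# Hypothetical node and partition names
nodes = {"fir": "PG1", "fft_range": "PG1", "corner_turn": "PG2", "fft_azimuth": "PG3"}
processors = {"PG1": "CE0", "PG2": "CE0", "PG3": "CE1"}
layout = build_partitions(nodes, processors)
```

In this example, processor CE0 receives two partitions and CE1 receives one. Each partition would then be autocoded separately into its own control-flow program.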

Figure 4- 20: SAR image generated by virtual prototype.

The following summarizes the development of the SAR application using the Autocoding Toolset (Figure 4- 21):

Converted the SAR graph, which used Q003 primitives and was tested using PGSE (Section 4.2.2 Flow Graph Generation), to a Domain Primitive Graph. Domain Primitives could not be used with PGSE at the time. For most Q003 primitives, there was a direct mapping to a Domain Primitive, although some reordering of inputs and outputs was required.
Partitioned the graph by constructing a partition file, or by copying the file from the JRS tools if the partitioning was done there. Developers generated the file from the mapping information developed during performance modeling and then executed the Partition Builder to generate PG1 to PGn.
Used the MPID Generator to autocode each PG.
Compiled and unit tested source code generated for each PG.
Used the Application Generator to autocode the Equivalent Application Graph. Compiled the application by using the cross compiler for the target processor and by using the Makefile system generated by the Autocoding Toolset.
Placed the load image on the SAR processor and ran the application.

Figure 4- 21: Autocode tool methodology.

The Autocoding Toolset produced a complete solution for the SAR application:

A full set of executables for the SAR was generated using the AG.
The graph manager fully booted the application program.
The software ran on the SAR signal processor interfaced with the MIT/LL real-time source/sink hardware.
The run-time service library handled all interprocessor communication.
The custom I/O board interfaced seamlessly to the SAR application graph.
Overall, generated software was robust and did not cause crashing during testing.

Autocoding demonstrated a substantial time saving, as shown in Table 4- 4. Development time for the real-time application software was reduced by a factor of seven overall (10X in software development and 5X in integration and test time), and the development cost was decreased by a factor of 4. The processing efficiency of the autocoded software was within 10 percent of manually optimized code. The autocoded software data memory size was about 50 percent higher than for manually generated code. This was a problem in testing because there was not enough memory in the card set in the system; therefore, one of the DSP cards had to be replaced with one that had more memory.

Table 4- 4: Autocoded Software Versus Hand-coded Software

Lines of Code	Total number of lines of code generated with autocoding was 60 percent greater than hand-coding
Performance	Same number of processors; about equal with hand-coding, within 10 percent
Memory	Amount of data memory was 50 percent greater than hand coding. This was an impact because a DSP card with more memory was required
Development time	10X improvement over hand-coding
Test time	5X improvement over hand-coding
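The relationship between the 10X, 5X, and overall 7X figures can be explored with a short calculation. The sketch below assumes that the overall factor is the ratio of total hand-coded time to total autocoded time, and that the effort consists only of development and test. The 4/7 development share used in the example is back-computed from the quoted factors and is not a figure stated in the source.

```python
def overall_speedup(dev_fraction, dev_gain=10.0, test_gain=5.0):
    """Overall time reduction when development and test speed up differently.

    dev_fraction: share of the hand-coded effort spent on development;
    the remainder is assumed to be integration and test.
    """
    test_fraction = 1.0 - dev_fraction
    return 1.0 / (dev_fraction / dev_gain + test_fraction / test_gain)


speedup = overall_speedup(4.0 / 7.0)
```

Under these assumptions, an overall factor of seven corresponds to a hand-coded effort split of roughly 57 percent development and 43 percent integration and test. Any split necessarily yields an overall factor between 5 and 10.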

4.5 Lessons Learned in the Architecture Design of the SAR Benchmark

4.5.1 Hierarchical Simulation (Performance Modeling)

Creation of the SAR processor Performance Model was a learning experience for LM ATL. Techniques and models were not in place when the benchmark started. LM ATL had performed performance modeling effectively in the past on multiprocessor systems using a C-language-based in-house tool called CSIM. The lessons learned during the performance modeling effort were the following:

Careful selection of the model's primitive level is essential. Early attempts at model generation using a tool from the University of Virginia, called Adept, resulted in models that were time consuming to generate and simulation times that were excessive (> 1 day). Adept used very low-level primitives and protocols to build complex processor and switch models. Much better results (simulation times of less than 30 minutes) resulted when processor and switch models were generated as the primitive level elements. In addition to saving generation and simulation time, these models were more accurate in modeling the processing and switch elements because they produced no artifacts resulting from the combination of more primitive elements.
The use of hierarchical models, where each level of the hierarchy corresponded to a level in the hardware hierarchy, simplified the investigation of different hardware combinations. In the SAR processor COTS architecture Performance Model, the hierarchy levels were the processor, daughter-card, processor board, and SAR processor. This made it easy to add or subtract processor boards in the model when sizing the SAR processor. This also led to the development of a family of basic models that could be used in future performance simulation.
The performance simulations accurately predicted the physical processor system's actual run-time performance. With the SAR, performance simulations were within eight percent of the physical system. This accuracy could be improved by better modeling of the operating system's overhead. The Performance Model's predictions tended to be slightly optimistic unless all overhead was accounted for.
Performance simulations were valuable in resolving hardware/software codesign issues early in the design process before detailed design started. An example from the SAR processor was the bottleneck in scheduling corner-turn data distribution.
Accuracy of Performance Model simulations depended on the ability to define the signal processing in terms of DSP primitives for which execution times on the hardware were either published or could be benchmarked. Obtaining execution times for signal-processing primitives was easier for COTS-based systems than it was for full custom architectures. This was one reason why a Performance Model was not developed for the Sharp LH9124 based architectures.

4.5.2 Hierarchical Simulation (Abstract Behavioral)

Adding actual data and processing operations to the Performance Model created the abstract Behavioral Model for the SAR processor. The resulting virtual-prototype simulations were numerically correct for the software mapping to the hardware. The following were lessons learned during the development and simulation of the abstract behavioral virtual prototype:

The abstract, behavioral, virtual prototype was an excellent predictor of processor performance, and it was a useful vehicle for debug because it provided access to intermediate values in the signal processing.
The simulations can provide test data for the detailed design of individual software and hardware components. The risk of incompatibility at component interfaces was minimized because the simulation verified that the components operated together. This was demonstrated by the data format definition at the interface between the Data I/O and the Processor Boards. The data generated by the hardware on the Data I/O Board had to be the same as that expected by the software on the Processor Boards.
The best starting point for developing the abstract Behavioral Model was the Performance Model. The structural description of the hardware hierarchy was already developed in the Performance Model. Traceability between the Performance Model and the abstract Behavioral Model was maintained. A common language, VHDL, was used for both model types.
The correct level of abstraction had to be used in the models. Too much detail resulted in simulation times that were too long. Too little detail provided less evaluation of the software to hardware mapping. The SAR processor abstract Behavioral Model simulation time was in the 12-hour range. Full simulations were most useful as a final check of the design.

4.5.3 Autocoding Software

The SAR benchmark was performed early in the RASSP program before several of the RASSP tools used for the software development were in place. The following were lessons learned with data flow capture using JRS's PGM-based tools and MCCI's beta version of the Autocoding Toolset:

Autocoding demonstrated a substantial time saving: development time for the real-time application software was reduced by a factor of 7, and the development cost was decreased by a factor of 4. The processing efficiency of the autocoded software was within 10 percent of the manually optimized code. The data memory size of the autocoded software was less favorable, being about 50 percent higher than for the manually generated code.
The data flow graphical capture tool (GRED) had promise but was awkward to use. It was seven years old and needed to be updated to a more modern graphical user interface.
Using separate simulators and autocoding tools with different libraries was inefficient. The graph had to go through manual translations from the Q003 primitives used in the PGSE simulator to the domain primitives used in the autocoder. The simulation was ADA based and the autocoder was C based, so there were some interface problems with wrappers around C-coded primitives.
The automated computer-generated mapping software in the JRS netsyn tool needed more work before it could be used effectively.
Automatic instrumentation of the code would provide an easy way to determine the margin on each processor, which would help optimize the mapping.

A new tool, LM ATL's Graphical Entry, Distributed Application Environment (GEDAE™), corrected the above problems about one year later (See Appendix A.2).

The following were lessons learned on the command program with using an object-oriented approach and autocoding:

Autocoding must work without manual corrections; otherwise, cost and schedule improvements might be lost. The command program was quickly and easily captured in the object relationship diagrams. However, on this benchmark, no efficiency gains were made because of the immaturity of the ADA translator tool and the manual fixes needed to work around tool defects. The SAR benchmark had a requirement to use ADA wherever it was possible. Because the single board computer that the command program resided on had an ADA compiler available, the command program had to be in ADA. CADRE had an unsupported tool for ADA generation, but it did not work well for this application. Quite a bit of manually coded software was generated to work around the problems.
The Command-Program generation tools and DFG simulator should be integrated together for easy verification during development.
An automatically generated interface between the command program and the application program is needed. On this benchmark a CP_Callable interface library was manually generated, which took a couple of weeks of effort. On later benchmarks, a tool was created called Application Interface Builder that automated this generation of code.


Page Status: in-review, January 1998 Dennis Basara