The RASSP program intended to automate these four areas as much as possible. This was
accomplished by using a graph-based programming approach that supported
correct-by-construction algorithm development. The scheduling,
communications, and execution software was generated efficiently from an autocoding tool
after the user defined the partitioning and mapping of the data-flow graph onto the
specific hardware architecture. The command program was graphically captured in a state
diagram and the software code was auto-generated from the tool. An Application
Interface Builder automatically generated the application-specific interface from the
data flow graph and state diagrams.
Figure 4-2 is the SAR signal processing block diagram. The SAR Signal Processor
had to process up to three of four possible polarizations. Its architecture had to be scalable
by a factor of two in processing power and inter-processor communication bandwidth.
This scalability was for future enhancements, such as polarimetric whitening
filtering, CFAR target recognition processing, and autofocusing for other modes of
operation, such as spotlight.
Each image frame was composed of 512 pulses with 2048 complex samples per pulse.
Storing one image frame of one polarization required 8.4 Mbytes of memory, assuming 8
bytes for each complex point in the array. Azimuth
processing required two frames of data.
At the maximum pulse repetition frequency (PRF) of 556 Hz, the 512 pulses needed to form an
image frame were collected in about 0.92 seconds. If images for three different
polarizations were produced at this rate, then the output interface
had to support an average transfer rate of 27.32 Mbytes/sec: 512 pulses x 2048 samples
per pulse x 8 bytes per sample x 3 polarizations x (1/0.92) frames
per second.
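The frame-size and output-rate figures above can be rechecked with a few lines of arithmetic. The following is a minimal sketch, not part of the original study; the function names are illustrative, and Mbytes are taken as 10^6 bytes, which is an assumption.

```python
# Frame geometry and timing taken from the text above.
PULSES_PER_FRAME = 512
SAMPLES_PER_PULSE = 2048
BYTES_PER_SAMPLE = 8      # one complex point
POLARIZATIONS = 3
MAX_PRF_HZ = 556


def frame_bytes():
    """Memory needed for one image frame of one polarization."""
    return PULSES_PER_FRAME * SAMPLES_PER_PULSE * BYTES_PER_SAMPLE


def frame_period_s(prf_hz):
    """Time needed to collect the pulses of one frame at a given PRF."""
    return PULSES_PER_FRAME / prf_hz


def output_rate_bytes_per_s(prf_hz, polarizations=POLARIZATIONS):
    """Average output rate when every polarization is imaged each frame period."""
    return polarizations * frame_bytes() / frame_period_s(prf_hz)


print(f"frame size  : {frame_bytes() / 1e6:.2f} Mbytes")
print(f"frame period: {frame_period_s(MAX_PRF_HZ):.3f} s")
print(f"output rate : {output_rate_bytes_per_s(MAX_PRF_HZ) / 1e6:.2f} Mbytes/sec")
```

With the unrounded frame period the rate comes out a few hundredths of a Mbyte/sec away from the quoted 27.32, which is a rounding effect rather than a disagreement.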
The interconnect bandwidth requirements were analyzed for the candidate architectures by
the performance modeling
effort.
Latency through the SAR Signal Processor could not exceed 3 seconds. The PRF of 200 to
556 pulses per second, coupled with the 512 pulses per frame, gave an interval of 2.56
seconds to 0.92 seconds between frames of the same
polarization. The three polarization frames were received interlaced, and the frame output
was required to be sequential. Latency in this case was defined as the interval between the
arrival of the last pulse of an image frame and
the start of the resulting image frame output. With this definition of latency, maximum
latency was not a design driver. Reduction of memory demand was more of a design driver
than latency when developing an implementation
that needed to process and output data as quickly as possible.
Table 4-1 lists the memory requirements and processing throughput estimates at the
maximum input data rate; they are the result of manual calculations. This provided a starting
point for the performance modeling effort that defined the
number of processors needed to meet the real-time algorithm requirements. The
48-tap Finite Impulse Response (FIR) filter and the Fast Fourier Transforms (FFTs) in
range and azimuth compression dominated the
processing requirement.
The memory requirements for azimuth compression were caused by corner turning.
4 Architecture Design
4.1 Architecture Process Description
The architecture process (Figure 4-1) transformed processing requirements into
candidate architectures of hardware and
software elements. The system-level processing requirements were allocated to
hardware and/or software functions.
The architecture process resulted in an abstract behavioral description of the hardware and
definition of the software
required for each processor in the system. Hardware/software co-simulation verified
the correctness and performance of
the resulting architecture.
4.1.1 RASSP Innovations in the Architecture Design
Process
4.1.1.1 Hierarchical simulation (Performance Modeling)
A hierarchical VHSIC Hardware Description Language (VHDL)-based virtual
prototype approach was used for
signal-processor simulation in the Rapid Prototyping of
Application-Specific Signal Processors (RASSP) methodology.
This enabled design aspects to be followed through the design process from the
Performance Model to the detailed
design level within a common framework. The purposes of the performance simulations
were the following:
4.1.1.2 Autocoding Software
The software development of the RASSP architecture process deviated significantly from
traditional (functional
decomposition) approaches. The partitioned software functionality was broken into four
major areas for real-time
application software:
4.2 Functional Design
4.2.1 Architecture Sizing of SAR Algorithm
4.2.1.1 Algorithm Implementation Analysis (Latency,
Bandwidth, Computational, and Memory Requirements)
  | MOPS | KBYTES |
Data Preparation | 33 | 49 |
Video To Baseband (48 tap FIR) | 639 | 49 |
Equalization | 21 | 49 |
Range FFT | 188 | 49 |
RCS Compensation | 7 | 16 |
Azimuth FFT | 342 | 50,356 |
Kernel Multiply | 41 | 254 |
Azimuth IFFT | 342 | 254 |
Input/Output Formatting | 16 | 25,165 |
TOTAL | 1629 | 76,241 |
Scalability, performance, and future upgradability requirements led to the investigation of commercial-off-the-shelf (COTS), floating-point, digital-signal-processor (DSP) modules for most of the SAR processing. The FIR filter, comprising about 40 percent of the total processing requirement (639 of 1629 MOPS), was a strong candidate for dedicated hardware implementation. Specialized processors sacrificed total programmability for improved efficiency in implementing a given functionality. For example, a custom module using specialized, programmable, FIR-filter integrated circuits had a recurring cost of less than $2,000 to perform the FIR filtering of the SAR algorithm. If the 48-tap FIR filter had instead been computed in the time domain using quad i860 COTS DSP modules, at ~$30,000 and 320 MOPS of computing capability each, two modules would have been needed and the cost would have been ~$60,000. The architecture options to be investigated were identified at this point in the architecture process. The final selection was not made until after the more detailed evaluation by performance modeling and cost analysis. The detailed analysis evaluated a variety of architectures with different combinations of COTS and dedicated hardware. This included evaluation of a custom processor architecture specialized for high-performance, fixed-point, block-oriented algorithms and array processing, such as FFTs.
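The sizing argument above can be reproduced from the table. The sketch below is illustrative only: the MOPS values, the 320 MOPS rating, and the ~$30,000 module price come from the text, while the dictionary layout and variable names are assumptions.

```python
import math

# Processing estimates (MOPS) from the sizing table above.
MOPS = {
    "Data Preparation": 33,
    "Video To Baseband (48 tap FIR)": 639,
    "Equalization": 21,
    "Range FFT": 188,
    "RCS Compensation": 7,
    "Azimuth FFT": 342,
    "Kernel Multiply": 41,
    "Azimuth IFFT": 342,
    "Input/Output Formatting": 16,
}

QUAD_I860_MOPS = 320          # per COTS module, from the text
QUAD_I860_COST_USD = 30_000   # approximate, from the text

total_mops = sum(MOPS.values())
fir_mops = MOPS["Video To Baseband (48 tap FIR)"]
fir_share = fir_mops / total_mops

# Whole modules needed to carry the FIR alone in software.
fir_modules = math.ceil(fir_mops / QUAD_I860_MOPS)
fir_cots_cost = fir_modules * QUAD_I860_COST_USD

print(f"total: {total_mops} MOPS, FIR share: {fir_share:.0%}")
print(f"quad i860 modules for FIR: {fir_modules}, cost ~${fir_cots_cost:,}")
```

The sketch ignores processor efficiency and communication overhead, which the performance modeling described later accounts for.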
 
All combinations met requirements; however, developers decided that alternative 2 was too
close to the requirement,
particularly when the supplied image did not have the maximum allowed differences in
pixel values.
The following software features were common to all candidate architectures of the SAR
Signal Processor (Figure 4-4):
For a standard set of signal-processing PGM primitives developed for
the Navy, see the ECOS Primitives
Specification Library, CDRL Q003; these are referred to as the Q003 Primitives.
For data flow graphs and the autocoding process, see the application note Autocoding for
DSP Algorithm (PGM).
The JRS RSS CAD (computer-aided design) tools were used to construct the SAR
DFG. For more information on the CAD toolset, see Reusable Software
Subsystem (RSS) User's Manual, JRS Research
Laboratories, March 1994. When constructing the DFG, components from the existing
library should be used. For PGM, this requires familiarity with the Q003 primitive library.
Most signal-processing functions can be implemented by using a combination of these
primitives. For the SAR, all required functions were defined down to existing Q003 library
elements. If existing primitives could not implement some of the processing, then a special
primitive would be written. The new primitives were defined within an Ada environment;
however, the underlying code for the primitive itself could be written in C.
Also, the necessary interfaces to the Autocode toolset must be generated.
A data flow graph is relatively simple to read and put together; however, it is helpful to
know the following PGM terminology. A graph represents a complete algorithm for a
particular application, such as SAR. Graphs may contain subgraphs, which provide a
particular hierarchical structure and simplify the creation of complex graphs.
Examples of subgraphs are the range and azimuth subgraphs in the SAR graph (Figure
4-5). A graph consists of a set of nodes that represent primitive functions, such
as a Q003 library element. A node contains input and output ports (Figure 4-6).
Queues provide the primary data storage and transfer mechanism in a PGM graph and
are represented by a first-in-first-out (FIFO) data structure. Nodes are
low-level functions that range from simple to complex and perform processing for
an application domain. An example of a node is the Finite Impulse Response (FIR) filter
(Figure 4-7) of the range subgraph. Associated with each node was a set of Node
Execution Parameters (NEP):
The PGM provided two additional data sources: Graph Variables and Graph Instantiation
Parameters. These are individual data items used to parameterize the graph
during execution, for example to set the number of taps in a FIR filter or to supply the FIR
coefficients in the range subgraph.
The fundamental rule governing node execution is that a node executes when all of its
input queues contain more data than the threshold amounts. There is no notion of
sequential execution time for a node, as would occur in a conventional thread control flow.
Instead, nodes execute whenever there is sufficient data to process.
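The firing rule can be illustrated with a toy data-flow scheduler. This is a minimal sketch and not the PGM or Q003 implementation: the `Node` class, the `run` loop, and the example node functions are hypothetical, and the sketch assumes that a queue holding exactly the threshold amount is enough to fire and that each firing consumes the threshold amount.

```python
from collections import deque


class Node:
    """A node that fires when every input queue holds its threshold amount."""

    def __init__(self, name, func, inputs, thresholds, output):
        self.name = name
        self.func = func              # maps one list per input to output items
        self.inputs = inputs          # FIFO queues (deque objects)
        self.thresholds = thresholds  # positive item counts, one per input
        self.output = output

    def ready(self):
        return all(len(q) >= t for q, t in zip(self.inputs, self.thresholds))

    def fire(self):
        # Consume the threshold amount from each input, then produce output.
        args = [[q.popleft() for _ in range(t)]
                for q, t in zip(self.inputs, self.thresholds)]
        self.output.extend(self.func(*args))


def run(nodes):
    """Fire ready nodes until none can fire; there is no fixed node order."""
    firings = 0
    progress = True
    while progress:
        progress = False
        for node in nodes:
            if node.ready():
                node.fire()
                firings += 1
                progress = True
    return firings


samples = deque([1, 2, 3, 4, 5])
pairs = deque()
doubled = deque()
graph = [
    Node("double", lambda xs: [2 * x for x in xs], [pairs], [1], doubled),
    Node("sum_pairs", lambda xs: [sum(xs)], [samples], [2], pairs),
]
firings = run(graph)
print(list(doubled), list(samples), firings)
```

The nodes are listed in the "wrong" order on purpose: `double` still runs only after `sum_pairs` has produced data, because execution is driven by queue contents rather than by program order.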
The last PGM concept to discuss is a family of nodes. Family notation allows a set of
nodes that accomplish the same function to be grouped and handled graphically as one
entity; this is represented by a heavy outline of the box, as seen for the range node and the input
and output queues of range in Figure 4-5. Families are used to simplify the
representation of parallelized functions.
The SAR algorithm has inherent parallelism. For example, in range compression, each
pulse of data could be processed independently, so pulse data could be divided among a set
of processing nodes for concurrent processing. The SAR algorithm input signal consisted
of 512 pulses of 4064 data items. The input data stream could be split: the first pulse of
range data was sent to the first of a family of range subgraphs, and each subsequent pulse
was sent to the next family member of range subgraphs. The range processing was
grouped into one subgraph called range (Figure 4-7). For the SAR, the split was
done in the SPLIT node using the Q003 DFC_SWTH primitive. After each data
pulse or block was processed in range, the processed data blocks were written into a
double-indexed queue.
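The pulse distribution described above amounts to dealing pulses out to family members in turn. The following sketch shows that idea only; it does not reproduce the Q003 DFC_SWTH primitive, and the function name and the (index, pulse) bookkeeping are assumptions made for illustration.

```python
def split_round_robin(pulses, family_size):
    """Deal pulses to family members in turn, keeping each pulse's index."""
    members = [[] for _ in range(family_size)]
    for index, pulse in enumerate(pulses):
        members[index % family_size].append((index, pulse))
    return members


# 512 placeholder pulses dealt to a hypothetical family of 4 range subgraphs.
pulses = [f"pulse-{n}" for n in range(512)]
family = split_round_robin(pulses, 4)

print([len(member) for member in family])
print([index for index, _ in family[0][:3]])
```

Keeping the pulse index alongside the data is one way to let a later stage place each processed block correctly, whatever order the family members finish in.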
Azimuth processing could then proceed in parallel. Figure 4-8 shows
azimuth processing; this subgraph was called azimuth. The data was recombined with
the CONCAT node using the Q003 DFC_CAT primitive to produce the full frame of
polarization data.
Each pulse or set of data was processed through a series of range-processing
algorithms represented by the blocks video-to-baseband, equalization weight,
range DFT, and RCS (Radar Cross-Section) calibration shown in Figure 4-2.
The D_MUX node used the Q003 DFC_DMUX primitive to form sequences of even and
odd pulse samples. Each sequence was passed through a FIR node that had a NEP
threshold amount of 2032 and that used the Q003 FIR_R1S primitive. The node had 8 to
48 taps, as determined by a graph variable. The coefficients for the FIR were graph variables
to the node. Even/odd outputs were combined into a family and muxed back together into a
stream of data. The data was then converted to complex data and filled to 2048 points, a
power of two. The complex equalization weights, graph variables in the V_MULT node,
were multiplied with the data. Weighted I/Q data were transformed to (compressed) range
data by using a 2048-point FFT node. The FFT node used the Q003 FFT_CC
primitive. The RCS weights, graph variables in the R_MULT node using the Q003
VCC_VMULT primitive, were multiplied with the data. The FANOUT node, using the
Q003 DFC_SEP primitive, distributed the data to the appropriate azimuth channel.
Figure 4-8 shows azimuth processing. Each range subgraph output segments of
range data, one for each azimuth segment. Each azimuth subgraph used the Q003
primitive DFC_SEP to collect all of the range segments that belonged in its azimuth
segment. Azimuth processing then transposed the data using the Q003 primitive
MOC_TPSE.
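The collect-and-transpose step (the corner turn) can be sketched on a small array. This is an illustration under stated assumptions, not the behavior of DFC_SEP or MOC_TPSE: it assumes each azimuth segment owns a contiguous, equal-width block of range bins, and the function names are hypothetical.

```python
def corner_turn(rows):
    """Transpose pulse-major data (rows = pulses) into range-bin-major data."""
    return [list(column) for column in zip(*rows)]


def azimuth_segment(range_lines, segment, n_segments):
    """Collect one block of range bins from every pulse, then transpose it.

    Assumes the number of range bins divides evenly by n_segments.
    """
    width = len(range_lines[0]) // n_segments
    start = segment * width
    block = [line[start:start + width] for line in range_lines]
    return corner_turn(block)


# 4 pulses of 6 range bins; each value encodes (pulse, bin) as pulse*10 + bin.
range_lines = [[pulse * 10 + rbin for rbin in range(6)] for pulse in range(4)]
segment_1 = azimuth_segment(range_lines, segment=1, n_segments=3)
print(segment_1)
```

After the transpose, each row holds one range bin across all pulses, which is the layout the azimuth FFT operates on.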
Developers used PGSE to simulate and debug the SAR graph. The PGSE tool provided
capabilities to execute graphs
and debug facilities that used the Telesoft Ada debugger.
The driver procedure performed the following functions:
Developers experienced difficulty with the immature and unsupported OOA2ADA tool; the
resultant code had to be extensively rewritten. There were 3500 lines of code in the
Command Program, of which 1800 were autocoded.
The CP_Callable Interface library implemented the interface between the command program
and the autocoded application software. The design of the interface library was based on
the SAR implementation in PGSE. The message structure was taken from an Autocode Design
Document written by the autocoder vendor, Management Communications and Control
Incorporated (MCCI). There were 2300 lines of code in the CP_Callable Interface.
The high throughput requirement and the accuracy and scalability requirements narrowed
candidate DSP components for the Signal Processor Boards to high-performance
floating-point processors, such as Intel's i860, Analog Devices'
ADSP21060 (SHARC), Motorola's DSP96002, and Texas Instruments' TMS320C40. The ADSP21060 had
the best performance and the i860 had the second best performance. The ADSP21060
could also cluster several DSPs together and had its own internal memory to reduce the
number of peripheral components. This allowed more DSPs per board, or about two to
three times the number of i860s.
Candidate COTS board solutions needed to be expandable to a number of DSPs across
multiple Processing Boards. Also important was the available interprocessor
communication, operating system (OS), and software support. COTS boards from
Mercury Computer Systems, Inc., were selected over comparable boards from Sky
Computer and CSPI because RASSP's autocoding tools from MCCI were being
implemented first on Mercury software.
One architecture evaluated for the SAR processor was a custom board based on the
SHARP LH9124 DSP chip. The LH9124 was a high-performance, fixed-point
DSP optimized for block-oriented algorithms and array processing, including FIR and
FFT operations. For example, the LH9124 was capable of performing a 1K complex FFT
in 80.7 microseconds, which was well under the 460 microseconds required by the Analog
Devices SHARC DSP. The LH9124 had no addressing capability, so it needed external
addressing, such as that generated by the SHARP LH9320 DSP address generator chip. A
signal processing board would have required a more general-purpose processor for
control and system interface functions or would have been managed completely by hardware
control using FPGAs (Field Programmable Gate Arrays).
Performance modeling and Matlab simulations were used to size the different architectures.
The eight candidate SAR processor architectures evaluated were the following:
Performance modeling goals were realized by developing VHDL token-based
Performance Models for the candidate
architectures. The Performance Models described the SAR processor's
time-related aspects, including response,
throughput, and use. Neither the actual application data nor the transforms on it were
described, other than what was
required to control the sequence of events. For more detail on performance modeling, see
the application note Token-based Performance Modeling.
During simulation the computation agent read pseudo-code that represented the
program being executed from a file.
The four basic pseudo-code instructions were compute, send, receive, and jump. The
compute instruction represented
execution of an application subroutine as a simple time delay. The delay times were
obtained from published times
for the candidate COTS library functions. The send instruction caused the computation
agent to direct the communications agent to send a token to another CE. The token defined
the data source,
data destination, and data
packet size. The receive instruction consumed received data. If the data had arrived, the
specified queue was
decremented. If the data had not arrived, the computation agent was blocked until the data
arrived. The model tracked
how much data was stored in the various queues, but it did not store actual data.
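The four pseudo-code instructions can be illustrated with a small interpreter for a single computation agent. This is a minimal sketch in Python rather than the project's VHDL model: the tuple encoding of instructions, the `arrivals` list standing in for the communications agent, and the `max_steps` guard are all assumptions made for illustration.

```python
def run_ce(program, arrivals, max_steps=1000):
    """Interpret compute/send/receive/jump for one CE.

    program  : list of tuples such as ("compute", seconds),
               ("send", destination, nbytes), ("receive", queue, nbytes),
               ("jump", target_index)
    arrivals : list of (time, queue, nbytes) describing incoming tokens
    Only byte counts are tracked; no application data is stored.
    """
    now = 0.0
    pc = 0
    queues = {}
    sent = []
    pending = sorted(arrivals)

    def deliver(up_to):
        # Credit every token that has arrived by the given time.
        while pending and pending[0][0] <= up_to:
            _, queue, nbytes = pending.pop(0)
            queues[queue] = queues.get(queue, 0) + nbytes

    steps = 0
    while pc < len(program) and steps < max_steps:
        op, *args = program[pc]
        steps += 1
        if op == "compute":
            now += args[0]                    # a subroutine is just a delay
        elif op == "send":
            destination, nbytes = args
            sent.append((now, destination, nbytes))
        elif op == "receive":
            queue, nbytes = args
            deliver(now)
            while queues.get(queue, 0) < nbytes:
                if not pending:
                    raise RuntimeError("blocked forever waiting for data")
                now = max(now, pending[0][0])  # block until the next arrival
                deliver(now)
            queues[queue] -= nbytes
        elif op == "jump":
            pc = args[0]
            continue
        else:
            raise ValueError(f"unknown instruction {op!r}")
        pc += 1
    return now, queues, sent


program = [("receive", "in", 100), ("compute", 2.0), ("send", "CE1", 100)]
finish, queues, sent = run_ce(program, arrivals=[(1.5, "in", 100)])
print(finish, queues, sent)
```

In this example the agent blocks until the data arrives at 1.5 s, computes for 2.0 s, and then issues its send, so timing emerges from delays and data availability alone.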
 
The communications agent transferred data tokens between the local CE's memory
queues and other CEs. In the
SAR Performance Model, the communication agent broke data packets into the actual
packets that were sent over
RACEway. Upon receiving a token, the communications agent incremented the amount of
data in the appropriate
queue by the received amount. When sending a token, the agent decremented the
appropriate data queue by the
transmitted amount. Figure 4-12 shows the top level of the computation element in
the form of the VHDL model.
The message token used to model messages passing through the switch element was
defined as a record in VHDL
(Figure 4-14).
The token "purpose" was used to request an interconnect link, acknowledge the granting of a
request, signal that a request was not granted, or preempt a link. The "route" and "index" fields were used to
determine the switch output port,
and the "length" field determined how long the link would be busy. The combination of
switch models and tokens
provided accurate modeling of the SAR processor RACEway interconnect.
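A token of this kind can be sketched as a small record type. The version below is in Python rather than VHDL and is an assumption-laden illustration: only the field names purpose, route, index, and length come from the text; the reading of "route" as a per-switch list of output ports selected by "index", the helper functions, and the 160 Mbytes/sec link rate used in the example are all hypothetical.

```python
from dataclasses import dataclass, replace
from enum import Enum


class Purpose(Enum):
    REQUEST = "request"
    ACK = "acknowledge"
    NACK = "not acknowledged"
    PREEMPT = "preempt"


@dataclass(frozen=True)
class Token:
    purpose: Purpose
    route: tuple   # assumed: output port to take at each switch on the path
    index: int     # assumed: position in route for the current switch
    length: int    # message size in bytes


def output_port(token):
    """Port the current switch should forward the token to."""
    return token.route[token.index]


def advance(token):
    """Token as seen by the next switch along the route."""
    return replace(token, index=token.index + 1)


def link_busy_time(token, bytes_per_second):
    """How long a link is held for this message at a given link rate."""
    return token.length / bytes_per_second


token = Token(Purpose.REQUEST, route=(2, 0, 5), index=0, length=1600)
next_hop = advance(token)
print(output_port(token), output_port(next_hop))
print(link_busy_time(token, bytes_per_second=160e6))
```

Because the token carries only routing and size information, the switch model can decide which port is busy and for how long without moving any payload.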
Because a single processor could not perform all SAR processing in real time, the next step
was to partition the data
flow graph into a set of partitioned graphs. The partitioned graphs were then mapped to the
processing elements in
the hardware model. Graph partitioning and mapping for the SAR application were
performed manually because tools
for automatic partitioning and mapping were unavailable.
The final step was to generate the pseudo-code application program for each
processing element by scheduling graph-node execution.
An existing program was then used to generate the set of pseudo-code
application programs for each
processing element in the SAR processor. Static partitioning/mapping/scheduling were
used because the required
processing did not change dynamically. The pseudo-code programs were stored in
files, and each instantiated
processor element in the model read its program from file during simulation and performed
the indicated operation.
Arithmetic operations were modeled by a delay, and I/O operations were used to set up the
queues in the processor
element model's communication interface.
Data communication was modeled by passing tokens through the modeled interconnect
network. The Performance Model tokens identified message type, size, source, and
destination. The size determined how long interconnect links were "busy" with the
message, and the message type was used by the receiving processing node to determine
when to fire the next processing step. When modeling the RACEway interconnect, the
tokens also included the network routing information and, in some cases, message priority.
Figure 4-16 is an example of the pseudo-code generated for a CE in an
8-CE partition by the software generation program.
Five frames of data were processed to allow processing to reach the steady-state
condition. The maximum resource requirement occurred in steady-state when data
input, range processing, azimuth processing, and data output were all active. The
performance simulations determined that three processing boards were required for the
SHARC COTS architecture and six boards were required for the i860 COTS architecture.
If the rest of the board architecture was left unaffected, then switching between SHARC and
i860 required changing only the delay values assigned to processing operations in the
processing element model. This was possible because the SHARC links were not used by
the SAR processor architectures and so they were not included in the model. The full
custom SHARP-based architectures were not performance modeled, and they were
eliminated based on cost and schedule risks. A performance simulation of the
SHARP-based architectures would have required more extensive model modifications.
Also, modeling custom architectures required more effort in determining the time required
for performing standard signal-processing operations. These times were usually
available for COTS DSP boards and were incorporated into the processor element model.
Performance Model simulations also provided memory use at each processing element. The
candidate COTS architectures had memory associated with each processor element instead
of global memory. Dynamic memory use was captured during simulation by statements
included in each processor element model, and memory use was plotted after
post-processing the use data. Equalization of memory requirements over the processor
elements was desired to minimize the number of processor/memory module types. The
highest memory requirements were for the I/O control processor. This processor was a
processor element assigned the data I/O control function during mapping of the SAR
application. The performance simulations were used in developing a mapping that reduced
the I/O processor memory requirements to those of a standard module type. In addition, the
performance simulations were used to develop a priority scheme that avoided bottlenecks at
the interface to the Data I/O Board. Incoming data was given higher priority than outgoing
data.
Time-line plots of interconnect network use helped identify bottlenecks due to
hardware or software. One result of the performance-based simulations was the
determination that corner-turn data should be distributed as soon as it was calculated
during range processing. Waiting to distribute the data until a full frame of range
processing completed resulted in degraded performance due to high peak demand on the
interconnect network. The corner-turn problem was detected when the use
time-line plots for the processors and interconnect links were examined. When the
corner-turn data was not distributed as soon as it was calculated, all processors stalled
during the corner turn, while the interconnect became bogged down with multiple
corner-turn transfers at the end of each frame of range processing. When the
distribution of corner-turn data was spread over time, the number of processors
required was reduced because processors did not stall waiting for input data, and the load
on the interconnect network was leveled.
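The effect of spreading the corner-turn traffic can be seen in a very small load model. This sketch is not the VHDL Performance Model: it assumes one time slot per pulse and a fixed number of corner-turn bytes per pulse, and it only compares peak demand per slot for the two distribution policies.

```python
def link_load(pulses, bytes_per_pulse, spread):
    """Bytes offered to the interconnect in each pulse-time slot of one frame."""
    load = [0] * pulses
    for pulse in range(pulses):
        # Either send each pulse's data right away, or hold it to frame end.
        slot = pulse if spread else pulses - 1
        load[slot] += bytes_per_pulse
    return load


PULSES = 512
BYTES_PER_PULSE = 2048 * 8   # one range line of complex samples

burst = link_load(PULSES, BYTES_PER_PULSE, spread=False)
leveled = link_load(PULSES, BYTES_PER_PULSE, spread=True)

print("peak demand, end-of-frame burst:", max(burst))
print("peak demand, spread over frame :", max(leveled))
print("total bytes identical          :", sum(burst) == sum(leveled))
```

Both policies move the same number of bytes per frame; only the peak differs, by a factor equal to the number of pulses in this simplified model.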
Developing the SAR processor's VHDL performance models and
simulations took two engineers about five weeks. The total time was 371 hours. About
1378 source lines of code (SLOC) were generated for the models, and an additional 1657
SLOC were generated for the test benches that verified the correctness of the models.
Future efforts should require much less time because this original effort included significant
learning time and time to develop models from scratch. Later efforts can reuse existing
models, which will greatly reduce development time.
A SPARC-10 CPU took 28 minutes to run a SAR processor performance simulation
of a 24-processor architecture that ran five seconds of SAR application. When
considering the number of processor elements modeled and their instruction rate, the
effective execution rate of the simulation was about 2.8 million
instructions per second. The performance simulations yielded measurements of
processing and communication latencies; throughput; event timelines; and use of memories,
processors, and links. The final SAR processor system met its timing and
resource-use requirements, and performance fell within eight percent of that predicted by the performance
modeling.
Time-line information was captured by placing statements in the models to write the
time and name of relevant events to a history file. The history files were used to produce
time-line graphs that showed the history of task execution on each processor node.
The time-lines were useful in visualizing and understanding the impact of software
mapping options. The time-line graphs showed the time when the processor elements
were idle due to data starvation or buffer saturation, and they helped to isolate resource
contentions and bottlenecks. Figure 4-17 is a processing timeline plot of when
specific processor elements were busy processing tasks. Similar timeline graphs can be
generated that show when processor elements are sending or receiving data or when
communication links are in use.
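Turning a history file into busy and idle figures is a simple post-processing step. The sketch below assumes a hypothetical record layout of (time, processor, task, "start" or "end"); the actual history-file format used in the study is not given in the text.

```python
from collections import defaultdict


def utilization(history, total_time):
    """Fraction of total_time each processor spent executing tasks."""
    started = {}
    busy = defaultdict(float)
    for time, ce, task, edge in sorted(history):
        if edge == "start":
            started[(ce, task)] = time
        elif edge == "end":
            busy[ce] += time - started.pop((ce, task))
    return {ce: seconds / total_time for ce, seconds in busy.items()}


# Hypothetical events for two processor elements over a 4-second window.
history = [
    (0.0, "CE0", "range", "start"),
    (1.0, "CE0", "range", "end"),
    (2.0, "CE0", "azimuth", "start"),
    (3.0, "CE0", "azimuth", "end"),
    (0.0, "CE1", "range", "start"),
    (3.0, "CE1", "range", "end"),
]
use = utilization(history, total_time=4.0)
print(use)
```

The gap between 1.0 s and 2.0 s on CE0 is the kind of idle period a time-line graph makes visible; whether it is caused by data starvation or buffer saturation has to be read from the surrounding send and receive events.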
Plots of memory allocation as a function of time were valuable in visualizing and balancing
memory use during
execution of the SAR algorithm. Figure 4-18 is a memory allocation time line from
performance modeling.
The lowest risk architecture in terms of schedule and cost was the i860 COTS Processor
Board because it was available. PRICE was used as the tool to estimate development and
life-cycle cost. The main concern with the i860 COTS boards was future
obsolescence of the i860. Intel said it did not intend to upgrade the product. However, the
i860 COTS architecture could accommodate model-year upgrades because the
backplane interface was processor independent. The main risk associated with the
ADSP21060 COTS architecture was the availability of the COTS boards. They were
unavailable when the architecture selection decision was made. Developing a custom
ADSP21060 board or LH9124 board had greater schedule and cost risks associated with
MCM (multi-chip module) development, custom processor-board development,
and lack of software support. The final SAR processor hardware used i860 COTS boards
because the ADSP21060 COTS boards were not yet available. The SAR processor architecture
provided a path for future upgrade to ADSP21060 or some other COTS boards.
The starting point for developing the SAR processor abstract Behavioral Model was the
Performance Model. The processor element models were modified by adding actual
program code for each software operation. The tokens used in modeling interconnect
network activity were augmented by the addition of a field containing the actual data in the
packet. The processor element models received the data packets, performed operations
defined by the software for the abstract application program statements, and sent data
packets to the next processing node. Sufficient memory had to be allocated at each processor
element to store real data. Timing was handled using delays, as was the case for
performance modeling.
Figure 4-19 is an example of the pseudo-code software program for the abstract
behavioral simulation; it corresponds to the Performance Model pseudo-code for one
pulse of range processing in Section 4.3.3.2.
A comparison of this code to that for the Performance Model in Section 4.3.3.2 shows that
the two are similar, but that more information is required in the abstract Behavioral Model.
In the Performance Model all the range processing steps were lumped into one combined
delay term in a compute instruction. In the abstract Behavioral Model, each operation was
defined separately and had its own call to a procedure in the CE model.
In the Performance Model, the Data I/O Board was modeled as a source and sink for data
packets. In the abstract behavior virtual prototype, the Data I/O Board model included
functions, such as FIR filtering, that were implemented in hardware. In addition, the
abstract behavior virtual prototype was designed to interface to the Executable Specification
test bench. The Executable Specification test bench modeled the SAR processor interface at
the bit-true level, which required more detail in the Data I/O Board model to convert to
the token representation of the abstract Behavioral Model elements.
The SAR processor abstract behavioral virtual prototype was used to:
The abstract behavioral virtual prototyping required 1,171 labor hours for model
generation and simulations. The model required 3,480 lines of new code and 1,102 lines of
reuse code. Most of the reuse code was from the Executable Specification. The test benches
required 500 lines of new code and 1,657 lines of reused code.
The abstract behavioral simulation of the SAR system consumed approximately 14
CPU-hours for 5 seconds of real-time data and exhibited an effective execution rate of 23,810 instructions per second. The processed output images shown in Figure 4-20 matched those of the resulting target system to within -150 dB of error power per pixel. It was much more convenient to work with smaller data sets and test images when investigating design options. A test image that was 1/64 the size of a full image was developed and used during debug.
The Autocoding Toolset was composed of the Partition Builder, MPID Generator, and the
Application Generator.
The following summarizes the development of the SAR application using the Autocoding
Toolset (Figure 4-21):
The Autocoding Toolset produced a complete solution for the SAR application:
Autocoding demonstrated a substantial time saving, as shown in Table 4-4. Overall
development time for the real-time application software was
reduced by a factor of seven (10X in software development and 5X in integration
and test time), and the development cost was decreased by a factor of 4. The processing
efficiency of the autocoded software was within 10 percent of manually optimized code.
The autocoded software data memory size was about 50 percent higher than for
manually generated code. This was a problem in testing because there was not enough
memory in the card set in the system; therefore, one of the DSP cards had to be replaced
with one that had more memory.
A new tool, LM ATL's Graphical Entry, Distributed Application Environment
(GEDAE), corrected the above problems about one year later (see Appendix A.2).
The following were lessons learned from developing the command program with an
object-oriented approach and autocoding:
4.2.1.2 Numerical Sensitivity Analysis
Matlab simulations were used to perform a numerical sensitivity analysis of the SAR
algorithm to determine whether
using integer formats or fewer bits of precision would meet system accuracy requirements.
Matlab was easier and
more effective to use than VHDL because post-analysis tools,
math libraries, and
experienced personnel were available. The reference image supplied by MIT/LL was computed using
IEEE double-precision floating
point. The SAR requirement was that error power be less than -103 dB relative to
maximum output signal power.
Table 4-2 lists analysis results for the six architectures identified in Section 4.3.2 and
Table 4-3.
SAR Processing | Alt 1 | Alt 2 | Alt 3 | Alt 4 | Alt 5 | Alt 6 |
FIR | SP FP | 12 bit | 23 bit | 24-bit BFP | 12 bit | 23 bit |
FFT processing | SP FP | SP FP | SP FP | 24-bit BFP | 24-bit BFP | 24-bit BFP |
Rest of Algorithm | SP FP | SP FP | SP FP | SP FP | SP FP | SP FP |
Accuracy | -163 dB | -113 dB | -161 dB | -147 dB | -113 dB | -145 dB |
SP FP - Single Precision Floating Point; 12-bit - 12 bit integer; BFP - Block Floating Point; 24-bit - 24 bit integer.
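The accuracy metric, error power relative to maximum output signal power, can be illustrated on a toy signal. This sketch is written in Python rather than Matlab and does not reproduce the study's analysis: the exact error-power definition, the rounding quantizer, and the sine test signal are assumptions, so the resulting dB values are not comparable to Table 4-2.

```python
import math


def quantize(x, bits, full_scale=1.0):
    """Round x to a signed fixed-point grid with the given word length."""
    levels = 2 ** (bits - 1)
    step = full_scale / levels
    code = max(-levels, min(levels - 1, round(x / step)))
    return code * step


def error_power_db(reference, test):
    """Mean error power relative to the peak power of the reference, in dB."""
    error = sum(abs(r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    peak = max(abs(r) ** 2 for r in reference)
    if error == 0:
        return float("-inf")
    return 10 * math.log10(error / peak)


# Toy reference signal; it stands in for the double-precision reference image.
reference = [0.9 * math.sin(0.1 * n) for n in range(1000)]
db_12 = error_power_db(reference, [quantize(x, 12) for x in reference])
db_24 = error_power_db(reference, [quantize(x, 24) for x in reference])

print(f"12-bit quantization: {db_12:.1f} dB")
print(f"24-bit quantization: {db_24:.1f} dB")
```

Each added bit of word length lowers the error power by roughly 6 dB in this toy case, which is why the 12-bit alternatives in the table sit much closer to the -103 dB requirement than the others.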
4.2.1.3 First Pass Partitioning of Hardware and
Software
The following hardware features were common to all candidate architectures of the SAR
Signal Processor (Figure 4-3):
4.2.2 Flow-Graph Generation
The SAR signal-processing algorithm was expressed in the
architecture-independent format of a data flow graph (DFG). This allowed developers to
follow RASSP's
hardware/software co-design process, where the DFG algorithm could be partitioned
to hardware and software. The
SAR development used the Processing Graph Method (PGM) technology from the Naval
Research Laboratory
(NRL). This was done because PGM was a standard, at least in the Navy, and tools were
available to assist
development. See the following documents for more details:
For PGM:
4.2.3 Develop Command Program
The command program initialized the SAR system, controlled the processing graphs as
commanded by the radar system, and controlled the self-test functions. The command program
was designed using the Shlaer-Mellor object-oriented approach and the Cadre
ObjectTeam OOA/OOD tools. Information was manually transferred from the
RDD-100 description because RDD-100 and the Cadre tool used different data
paradigms. The program was developed in four stages:
4.3 Architecture Selection
4.3.1 Initial Size, Weight, and Power
Four customer-supplied 6U VME modules had to be placed in the chassis, which led to the
selection of a VME backplane for the SAR processor. The maximum allowable dimensions for
the SAR Signal Processor chassis were 10.5 x 20.5 x 17.5 inches, which allowed up to a
21-slot 6U VME card rack. The physical specifications of the architecture were the
following:
4.3.2 Architecture Definition
The candidate architectures included COTS and custom processor boards. However,
certain features were common to all candidates:
4.3.3 Performance Modeling
The RASSP design process emphasized the integrated design and development of hardware
and software in a hardware/software codesign process that included performance modeling
and simulation. Performance modeling provided early design verification via simulation of
the software as partitioned, mapped, and executed on the hardware architecture. Design
verification early in the design process reduced the risk of costly architectural
modifications later in the detailed design phase. Performance modeling enabled a range of
potential architectures to be investigated before selecting the "best" architecture for
implementation (Figure 4-9). Performance modeling and simulation were performed during the
selection of the SAR processor architecture to help determine the size of the system,
interconnect network architecture, software-to-hardware mapping, and
performance required of each component.
4.3.3.1 Performance Modeling of the SAR Processor Hardware
A hierarchical approach was taken to develop hardware models for performance simulation
of the candidate architectures. Processor and switch models were at the lowest level of the
hierarchy (Figure 4-10). Tokens, rather than actual data, represented data passing between
CEs (processing elements) and through crossbars. The token was coded in VHDL as a record
with fields that defined source CE, destination CE, data size, data ID, and route through
the RACEway interconnect.
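The token record described above can be pictured as follows. This is a Python sketch
rather than the original VHDL; the field names, types, and the route encoding are
assumptions based only on the field list in the text.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Token:
    """Stand-in for a data transfer in the performance model (no payload carried)."""

    source_ce: int          # CE that sends the data
    dest_ce: int            # CE that receives the data
    size_bytes: int         # amount of data the token represents
    data_id: int            # identifies which data set is being moved
    route: Tuple[int, ...]  # crossbar ports to take, hop by hop (assumed encoding)


# Example token with made-up values.
token = Token(source_ce=0, dest_ce=5, size_bytes=16384, data_id=42, route=(2, 4))
print(token)
```

Because the token carries only a size and routing information, the simulator can account
for transfer time and contention without moving any image data.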
4.3.3.1.1 CE Model
The processor element model, labeled CE in Figure 4-10, modeled the computation
and communication of the processor chip, such as the Analog Devices SHARC or Intel i860
for the SAR processor benchmark. Figure 4-11 is a block diagram of the CE model. The CE
model was conceptually divided into two concurrent processes: a computation agent and a
communications agent.
4.3.3.1.2 Switch Model
The switch element model, labeled X in Figure 4-10, modeled the RACEway crossbar when
evaluating architectures based on COTS DSP boards from Mercury Computer Systems. The
Mercury crossbar had six ports, with any port capable of connecting to any other port.
Connections were made if the destination port was unblocked. The input and output ports
are shown separately in Figure 4-13, although they were actually the same physical
bi-directional port.
Most of the switch model development time was devoted to accurately modeling how the
RACEway crossbar handled message blocking and contention. A message was blocked if the
output port was in use, either as the output or input port for another message. When
messages arrived concurrently, priority was given to the message that arrived on the
lower-indexed port. Accurate modeling of message blocking and contention was needed to
evaluate the interconnect network performance reliably. Once a link was established
through the crossbar, it remained in use for a period of time determined by the data
packet size.
4.3.3.1.3 Hierarchical Structural Model
The CE and X elements were first assembled into models for the various board types as
shown in Figure 4-10. The board models were then connected together to generate the model
of the entire SAR processor. The use of the hierarchical VHDL structural models at the
processor board and system level made it easy to modify the architecture to investigate
architectural alternatives.
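The blocking and priority rules of the switch model (Section 4.3.3.1.2) can be sketched in
Python as follows. This is an illustration only, not the VHDL model: the class name, the
request format, and the link rate are hypothetical, and treating a busy input port as a
blocking condition is an assumption that follows from the ports being bi-directional.
Retrying blocked messages is left out.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

NUM_PORTS = 6  # port count given for the Mercury crossbar


@dataclass
class Crossbar:
    bytes_per_us: float = 160.0  # hypothetical link rate, for illustration only
    busy_until: List[float] = field(default_factory=lambda: [0.0] * NUM_PORTS)

    def arbitrate(self, requests: List[Tuple[int, int, int]], now: float):
        """Grant or block requests of the form (in_port, out_port, size_bytes)."""
        granted, blocked = [], []
        # Concurrent arrivals: the lower-indexed input port is served first.
        for in_port, out_port, size in sorted(requests, key=lambda r: r[0]):
            # Ports are bi-directional, so a port busy in either role blocks the message.
            free = self.busy_until[in_port] <= now and self.busy_until[out_port] <= now
            if free:
                hold = size / self.bytes_per_us  # link held in proportion to packet size
                self.busy_until[in_port] = now + hold
                self.busy_until[out_port] = now + hold
                granted.append((in_port, out_port))
            else:
                blocked.append((in_port, out_port))
        return granted, blocked


xbar = Crossbar()
granted, blocked = xbar.arbitrate(
    [(2, 5, 1600), (0, 5, 1600), (1, 3, 1600), (4, 0, 1600)], now=0.0
)
print("granted:", granted)
print("blocked:", blocked)
```

In this example two messages contend for port 5, and the one arriving on port 0 wins. The
message from port 4 is blocked because port 0 is already in use as an input for another
message.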
4.3.3.2 Performance Modeling of the SAR Processor Software
Figure 4-5 shows the first step in the process followed to develop the SAR processor
software model. Each graph node represented a SAR processing primitive, such as FFT, vector
multiply, or convolution. The arcs between graph nodes represented data dependencies.
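A data flow graph of this kind can be represented as a mapping from each node to the nodes
it depends on. The Python sketch below uses the standard-library graphlib module to derive
an execution order that respects the data dependencies. The node names and the graph shape
are hypothetical and do not reproduce the actual SAR flow graph.

```python
from graphlib import TopologicalSorter

# Hypothetical graph: each key is a node, each value the set of nodes it depends on.
dfg = {
    "fft_a": set(),
    "fft_b": set(),
    "vector_multiply": {"fft_a", "fft_b"},
    "convolution": {"vector_multiply"},
}

# Any valid order runs a node only after all of its input arcs have data.
order = list(TopologicalSorter(dfg).static_order())
print(order)
```

Nodes with no dependency between them, such as the two FFT nodes here, are candidates for
mapping onto different processors.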
4.3.3.3 Performance Model Simulations
Several candidate SAR processor architectures were evaluated by simulating the
VHDL Performance Model performing the SAR algorithms. For example, the number of
processing boards required was determined by simulating several image frames on models
having different numbers of boards. The simulation results were post-processed to
generate time-line plots showing utilization of each processing element. Changing the
number of boards required minimal effort. The structural model of the hardware was
modified by adding or removing boards, and the software generation program was rerun
for the different number of processors and/or mapping assignment. A change in the number of
boards in the model took less than a day to complete, including resimulation. Changes to
the mapping assignment were completed in four hours or less. The low-level hardware models
and the signal-processing DFGs were unchanged by the architecture variations.
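The post-processing step can be illustrated with a short Python sketch that turns a list of
busy intervals into a utilization figure per processing element. The record format
(CE name, start time, end time), the sample values, and the function name are assumptions;
the sketch also assumes that the intervals for one CE do not overlap.

```python
from collections import defaultdict


def utilization(busy_intervals, sim_time):
    """Return the fraction of simulated time each CE spent busy."""
    busy = defaultdict(float)
    for ce, start, end in busy_intervals:
        busy[ce] += end - start
    return {ce: total / sim_time for ce, total in busy.items()}


# Made-up simulation log: (CE name, start, end) in seconds.
log = [("CE0", 0.0, 2.0), ("CE0", 3.0, 4.0), ("CE1", 0.0, 1.0)]
result = utilization(log, sim_time=4.0)
print(result)
```

Comparing such figures across runs with different board counts shows where processing
margin remains.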
4.3.4 Architecture Trade-off Analysis
The selected architecture for the SAR processor was COTS ADSP21060/2 boards with a
FIR filter on the Data I/O Board (candidate 4 in Table 4-3). The FIR filter provided
greater processing margin in the COTS DSPs and a substantial recurring cost savings.
Performance Model simulations determined the processing margin. The ADSP21060
architecture was the best candidate in size and weight. Schedule, cost, and technical
considerations were also important factors. The backup architecture was the i860 COTS
Processor Board (candidate 2 in Table 4-3).
Architecture Candidates | 1 | 2 (backup) | 3 | 4 (selected) | 5 | 6 | 7 | 8
Configuration
Host I/F Module | COTS 68040 SBC | COTS 68040 SBC | COTS 68040 SBC | COTS 68040 SBC | COTS 68040 SBC | COTS 68040 SBC | COTS 68040 SBC | COTS 68040 SBC
Data I/O Module
  FO I/F | TriQuint HRC-500 | TriQuint HRC-500 | TriQuint HRC-500 | TriQuint HRC-500 | TriQuint HRC-500 | TriQuint HRC-500 | TriQuint HRC-500 | TriQuint HRC-500
  FIR | NO | PDSP16256 | NO | PDSP16256 | NO | PDSP16256 | NO | PDSP16256
  Complexity | Medium | High | Medium | High | Medium | High | Medium | High
Processor Module
  Type | MCV6 | MCV6 | MCE6/MCV6 | MCV6 | Custom | Custom | MCV6, Sharp | MCV6, Sharp
  # of modules | 7 | 5 | 3 | 2 | 3 | 2 | 1, 3 | 1, 3
  Module Config. | 4 i860's | 4 i860's | 8 ADSP21060 | 8 ADSP21060 | 8 ADSP21060 | 8 ADSP21060 | 4 i860, 2 LH9124 | 2 i860, 1 LH9124
  Memory | 32 Mb per module | 32 Mb per module | 32 Mb per module | 32 Mb per module | 32 Mb per module | 32 Mb per module | 32 Mb, 26 Mb | 32 Mb, 24 Mb
Interconnect (VME +) | RACEway | RACEway | RACEway | RACEway | RACEway | RACEway | RACEway | RACEway
Risks
  Schedule/Cost | Lowest | Low | Medium | Medium | High | High | High | High
  Technical | Lowest | Low | Low | Low | High | High | High | Medium
  Major Risk Item | Obsolescence | Data I/O Complexity | NO VME to MCE6 | Data I/O Complexity | Software - Board Support Package | Data I/O Complexity | Module Design | Data I/O Complexity
  Major Risk Item (further entries, candidate not identified) | Obsolescence | Module Availability | Module Availability | MCM Design | MCM Design
System Characteristics
  Recurring cost | (no values listed)
  Memory (Total System) | 240 Mbytes | 176 Mbytes | 184 Mbytes | 136 Mbytes | 160 Mbytes | 144 Mbytes | 111 Mbytes | 109 Mbytes
  Computation
    FFT | Single Precision Floating Point | Single Precision Floating Point | Single Precision Floating Point | Single Precision Floating Point | Single Precision Floating Point | Single Precision Floating Point | 24 bit Block Floating Point | 24 bit Block Floating Point
    FIR | Single Precision Floating Point | 12 or 23 bit integer | Single Precision Floating Point | 12 or 23 bit integer | Single Precision Floating Point | 12 or 23 bit integer | 24 bit Block Floating Point | 12 or 23 bit integer
    Accuracy | -163 dB | -113 dB (12 bits), -161 dB (23 bits) | -163 dB | -113 dB (12 bits), -161 dB (23 bits) | -163 dB | -113 dB (12 bits), -161 dB (23 bits) | -147 dB | -113 dB (12 bits), -145 dB (23 bits)
  Latency | < 3 Sec | < 3 Sec | < 3 Sec | < 3 Sec | < 3 Sec | < 3 Sec | < 3 Sec | < 3 Sec
  Controllability, Testability, & Maintainability | Good | Good | Good | Good | Fair | Fair | Fair | Fair
  Scalability (2x) | Does not meet requirement | Requirements met if modify chassis design | Meets Requirement | Exceeds Requirement | Meets Requirement | Exceeds Requirement | Meets Requirement | Meets Requirement
  Size & Weight | Poor | Fair | Good | Good | Good | Excellent | Good | Good
  Worst Case Power (Watts) | 431 | 371 | 309 | 299 | 310 | 300 | 359 | 320
For candidates 7 and 8, paired Processor Module values give the MCV6 module first and the
Sharp module second.
4.4 Architecture Verification
4.4.1 Abstract Behavioral Simulation
An abstract Behavioral Model describes function as well as timing. The model is abstract
in that interfaces are not resolved down to the individual hardware pin level. The abstract
Behavioral Model, also called the abstract behavioral virtual prototype, for the SAR
processor was generated by adding function to the Performance Model. The abstract
behavioral virtual prototype was used to verify the numerical correctness of the
software-to-hardware mapping, to generate test data, to provide system
visualization, and to verify overall SAR processor implementation. Unlike performance
modeling, actual data values were used in the abstract behavioral virtual prototype.
However, bit-true formats were not necessary. Therefore, the signal links that
connected interconnect functional units were represented abstractly as pathways over which
data packets were transferred. Time resolution was at the major event level.
4.4.2 Autocode Generation
The Autocoding Toolset developed by MCCI was used to render the SAR PGM graphs
into a set of C language source files that implemented the SAR signal-processing
functionality. The generated source code contained calls to MCCI's Static Run
Time System (SRTS) libraries, which provided run-time support for graph execution
and control and for queue/data management.
Lines of Code: Total number of lines of code generated with autocoding was 60 percent
greater than with hand-coding.
Performance: Same number of processors; about equal to hand-coding, within 10 percent.
Memory: Amount of data memory was 50 percent greater than with hand-coding. This had an
impact because a DSP card with more memory was required.
Development time: 10X improvement over hand-coding.
Test time: 5X improvement over hand-coding.
4.5 Lessons Learned in the Architecture Design of the SAR Benchmark
4.5.1 Hierarchical Simulation (Performance Modeling)
Creation of the SAR processor Performance Model was a learning experience for LM ATL.
Techniques and models were not in place when the benchmark started. LM ATL had
performed performance modeling effectively in the past on multiprocessor systems using a
C-language-based in-house tool called CSIM. The lessons learned
during the performance modeling effort were the following:
4.5.2 Hierarchical Simulation (Abstract Behavioral)
Adding actual data and processing operations to the Performance Model created the abstract
Behavioral Model for the SAR processor. The resulting virtual-prototype simulations
were numerically correct for the software mapping to the hardware. The following were
lessons learned during the development and simulation of the abstract behavioral virtual
prototype:
4.5.3 Autocoding Software
The SAR benchmark was performed early in the RASSP program, before several of the
RASSP tools used for the software development were in place. The following were lessons
learned with data flow capture using the PGM-based tools from JRS and MCCI's beta
version of the Autocoding Toolset: