
Semi-Automated IMINT Processing System (SAIP) Case Study

4.0 The Successes, Setbacks and Results of the SAIP Benchmark Development Effort.

There were two goals for the RASSP SAIP BM4 development effort. The first was to develop an improved SAIP HighClass processor that demonstrated the size, weight, power and cost reductions needed to meet future SAIP operational system requirements. The second was to apply and demonstrate the RASSP hardware/software codesign methodology and tools for improving the design and significantly reducing the time, effort and cost of developing signal processing systems. A summary of the BM4 results relative to these objectives is shown in Figure 4-1.

Figure 4-1 Summary of SAIP Benchmark Development Accomplishments

The following subsections describe the results of the BM4 development efforts in more detail. The first subsection summarizes the accomplishments in achieving the SAIP goals for a tactically sized prototype. The subsequent subsections describe the results, successes and setbacks encountered in applying RASSP's rapid prototyping tools and techniques.

4.1 Results, Successes and Setbacks in Applying the RASSP Rapid Prototyping Processes and Tools

A comparison of the SAIP HighClass processor development goals and the final results of the BM4 development efforts is summarized in Figure 4-2.

Figure 4-2 Summary of SAIP HighClass Prototype Requirements/Results.

The final BM4 prototype hardware met or exceeded most of the SAIP HighClass processing requirements/goals. The result of the BM4 effort was an all COTS HighClass processor consisting of six VME-6U boards, one fewer than the proposed seven board solution. While the system achieved a throughput rate of only 25 image chips per second, adding a fifth Sharc DSP board would have resulted in a seven board design exceeding the 30 image chips per second requirement. The prototype achieved the 100X throughput density improvement by increasing throughput 7X while reducing the size by more than 15X. In addition, the effort led to a 3X decrease in the number of computations required to meet the 30 chips per second processing requirement (reducing the total computation from 18 Gflops to 6 Gflops). The BM4 prototype matched 100% of the executable specification target classification results, performed all of the required HighClass mode control functions, and interfaced directly with the existing SAIP control system. Finally, RASSP's virtual prototyping techniques led to a design that eliminated the need for a sophisticated interconnect network or external processor memory, resulting in a 3X reduction in the prototype hardware cost.
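The headline figures in this paragraph can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above. The assumption that throughput scales linearly with the number of Sharc DSP boards (four in the prototype, as implied by "adding a fifth") is ours and is not stated in the source.

```python
# Cross-check of the figures quoted in the text (values copied from the paragraph).
THROUGHPUT_GAIN = 7.0   # "increasing throughput 7X"
SIZE_REDUCTION = 15.0   # "reducing the size by more than 15X" (used as a lower bound)

# Throughput density is throughput per unit size, so the two gains multiply.
density_gain = THROUGHPUT_GAIN * SIZE_REDUCTION

# Computation reduction: 18 Gflops down to 6 Gflops.
compute_reduction = 18.0 / 6.0

# ASSUMPTION (not in the source): throughput scales linearly with DSP board count.
measured_rate = 25.0    # image chips per second with four Sharc DSP boards
rate_per_board = measured_rate / 4
projected_rate_five_boards = rate_per_board * 5

print(density_gain, compute_reduction, projected_rate_five_boards)
```

Under the linear scaling assumption, a fifth DSP board would bring the rate to roughly 31 image chips per second, which is consistent with the claim that a seven board design would exceed the 30 chips per second requirement.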

The BM4 SAIP HighClass processor development efforts clearly demonstrated the feasibility of achieving the size, weight, power and cost reductions needed to meet future operational SAIP system requirements. Efforts are currently underway to leverage the BM4 computational improvements in the current SAIP ACTD System and to apply RASSP's model year architecture and virtual prototyping concepts to future SAR Automatic Target Recognizer (ATR) development efforts.

4.2 Successes, Setbacks and Lessons Learned

A key objective of the BM4 development effort was to demonstrate the capabilities and benefits of applying RASSP's innovative processes and tools. The BM4 development effort was focused on using the RASSP spiral development model and hardware/software codesign processes. The effort was accomplished in three spiral development cycles. During each cycle, the system design was extended, the hardware/software architecture was refined, and the final system requirements were updated. RASSP's iterative, risk retirement process led to a final HighClass processor design that was significantly less complex and lower cost than originally proposed. In this section, the successes, setbacks and lessons learned in applying the RASSP process are described and analyzed. Insights are provided on how to apply the RASSP processes and tools to future signal processing efforts, and on the capabilities and benefits that can be derived by adopting RASSP's innovative concepts, techniques and tools.

Figure 4-3 shows the evolution of the BM4 SAIP processor architecture through the three spiral design cycles. The system architecture consisted of three basic subsystems:

a custom design MSE subsystem
a COTS DSP HDI subsystem
a HighClass control processor subsystem integrated in a standard VME chassis.

At the end of the System Architecture Definition design cycle, the original seven board, mixed COTS/custom architecture had expanded to an all COTS 13 board design or a mixed COTS/custom 9 board solution. During the Architecture Hardware/Software Codesign design cycle the all COTS hardware/software architecture was reduced to a six board design, while the mixed COTS/custom architecture was reduced to a five board solution. Since both came in under the seven board design goal, the lower cost/lower risk all COTS approach was selected for final implementation.

Figure 4-3 Evolution of the SAIP HighClass Processor Design.

During each development cycle significant improvements were made in the design and performance of the HighClass architecture. At the start of each cycle, efforts were focused on critical design issues and risk retirement efforts. The highest payoff hardware/software development activities were identified and accomplished (see Section 3). Throughout the design cycles, key risk retirement activities were concentrated in three areas:

HighClass algorithm functional analysis and hardware/software tradeoffs to optimize the computational implementation of the HDI and MSE processing functions,
Virtual prototyping of candidate architectures to optimize and select the best hardware/software approach, and
Implementation of the HighClass prototype using RASSP's software development and autocoding processes and tools

Applying the RASSP processes and tools made key contributions in each of these areas. Employing the RASSP concepts and techniques, we were able to evaluate and assess the benefits and shortcomings of the process and develop insights for enhancing its future use. The following paragraphs relate some of the key benefits, shortcomings and lessons learned during the BM4 development effort.

4.3 HighClass Algorithm Functional Analysis and Computational Optimization

A key factor in achieving the final BM4 performance was a 3X reduction in the computational requirements based on detailed HDI and MSE functional analysis and optimization efforts. This reduction was accomplished using RASSP's iterative requirement and specification refinement process throughout the entire development effort. The functional analysis and optimization efforts were initiated during the System Architecture Definition design cycle and continued until the latter stages of the final Detailed Design, Integration and Test phase. Concepts that played a critical role in these efforts included:

Use of an executable specification (E-Spec) to establish the functional performance requirements of the HighClass processor design
Development of a functional virtual prototype to analyze the execution timelines and implement computational improvements
Use of a simple, iterative process to refine and optimize the final implementation

At the start of the BM4 development effort, MIT/LL provided ATL with a HighClass executable specification. This executable specification consisted of 39 C, C++ and Fortran source code files containing over 18,000 lines of source code. While the executable specification was cumbersome and difficult to analyze, it provided an accurate functional performance specification.

Original E-Spec
Language	Files	SLOC
C++	2	316
C	27	2,516
Fortran	10	15,803
Total	39	18,635

Rewritten E-Spec
Language	Files	SLOC
C	3	1,648

Figure 4-4 Conversion of Executable Specification is a Key Element in Capturing System Requirements.

A lesson we have learned on each of the RASSP benchmarks is that the executable specification provided at the start of the project will not be mature or efficient. Quite the contrary, an executable specification will consist of code developed by algorithm developers who are more concerned with algorithm performance than computation or memory efficiency. It can be made up of individual simulation programs developed over time and loosely integrated to support refinement and improvement of algorithm performance. As a result, the signal processor developer must plan on refining and streamlining the executable specification at the start of a project. In the BM4 case, the original executable specification (18,635 SLOC) was analyzed and rewritten as a single language, C code performance simulation with fewer than 1,700 lines of code.
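The size of that reduction can be checked directly from the file and SLOC counts in Figure 4-4. The short sketch below only re-adds the figures given there; the dictionary layout and function name are ours.

```python
# File and SLOC counts copied from the E-Spec figures quoted in the text.
original_espec = {"C++": (2, 316), "C": (27, 2516), "Fortran": (10, 15803)}
rewritten_espec = {"C": (3, 1648)}


def totals(spec):
    """Return (total files, total SLOC) for a {language: (files, sloc)} table."""
    files = sum(file_count for file_count, _ in spec.values())
    sloc = sum(line_count for _, line_count in spec.values())
    return files, sloc


original_files, original_sloc = totals(original_espec)
rewritten_files, rewritten_sloc = totals(rewritten_espec)
sloc_reduction = original_sloc / rewritten_sloc

print(original_files, original_sloc, rewritten_files, rewritten_sloc)
print(round(sloc_reduction, 1))
```

The rewritten simulation is a little more than eleven times smaller than the original executable specification in source lines.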

Having an executable specification as a starting point for the functional analysis was a clear benefit for evaluating the computation timelines and identifying algorithm optimization opportunities. The insights provided by the executable specification were a key factor in achieving the 3X improvement in execution time.

The BM4 computational optimization efforts followed a simple iterative improvement process. The HighClass execution time line profiles were evaluated to identify the most dominant computational functions. These functions were analyzed and alternative computational approaches were implemented and evaluated. A test bench was established to compare the functional results of alternative implementations with the executable specification to assure equivalent classification performance. As better computational approaches were identified, they were implemented and the process was iterated to improve the other computational elements. In the BM4 case, the process was iterated several times with the final results being a reduction of 3.3X in the HDI computational requirement and decreases of 2.6X and 1.75X for the MSE LRC and HRC processing requirements respectively.
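One iteration of this process can be sketched as two small steps: pick the dominant function from a timing profile, then check a candidate replacement against the reference before accepting it. The sketch below is illustrative only. The profile values, the function names and the simple "energy" routine are hypothetical stand-ins, and the BM4 test bench compared classification results rather than a single numeric output.

```python
from typing import Callable, Dict, Iterable, Sequence


def dominant_function(profile: Dict[str, float]) -> str:
    """Return the name of the function that accounts for the most execution time."""
    return max(profile, key=lambda name: profile[name])


def reference_energy(samples: Sequence[float]) -> float:
    """Stand-in for a routine taken from the executable specification."""
    total = 0.0
    for value in samples:
        total += value * value
    return total


def candidate_energy(samples: Sequence[float]) -> float:
    """Stand-in for an alternative implementation of the same routine."""
    return sum(value * value for value in samples)


def matches_reference(
    reference: Callable[[Sequence[float]], float],
    candidate: Callable[[Sequence[float]], float],
    test_vectors: Iterable[Sequence[float]],
    tolerance: float = 1e-9,
) -> bool:
    """Test bench: accept the candidate only if it agrees with the reference."""
    return all(abs(reference(v) - candidate(v)) <= tolerance for v in test_vectors)


# HYPOTHETICAL profile (seconds per function), for illustration only.
profile = {"hdi": 6.0, "mse_lrc": 2.5, "mse_hrc": 1.5}
test_vectors = [[0.0, 1.0, 2.0], [3.5, -1.25], [10.0]]

print(dominant_function(profile))
print(matches_reference(reference_energy, candidate_energy, test_vectors))
```

Keeping the equivalence check separate from the profiling step means the same test vectors can be reused each time the process is iterated on the next most expensive function.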

Figure 4-5 Functional Analysis of Processing Requirements Can Lead to Significant Savings

Applying RASSP's simple iterative functional analysis and optimization process benefited the final BM4 design by reducing the processing requirements and prototype hardware by a factor of three. By following RASSP's concept of emphasizing system functional requirement tradeoffs throughout the entire hardware/software codesign effort, we were able not only to capitalize on hardware and software architectural improvements but also to reduce the prototype cost by 3X.
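How the individual reductions (3.3X for HDI, 2.6X for MSE LRC, 1.75X for MSE HRC) combine into one overall factor depends on how much of the original computation each function accounted for. Those shares are given in Figure 4-5 and are not reproduced in the text, so the sketch below uses clearly hypothetical shares purely to show the arithmetic; only the three reduction factors come from the source.

```python
from typing import Dict


def overall_reduction(shares: Dict[str, float], reductions: Dict[str, float]) -> float:
    """Combined reduction when each function's cost shrinks by its own factor.

    shares:     fraction of the original computation spent in each function
    reductions: per-function reduction factor (3.3 means 3.3X less work)
    """
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("workload shares must sum to 1")
    remaining = sum(shares[name] / reductions[name] for name in shares)
    return 1.0 / remaining


# Reduction factors quoted in the text.
reductions = {"HDI": 3.3, "MSE LRC": 2.6, "MSE HRC": 1.75}

# HYPOTHETICAL workload shares, for illustration only (not from the source).
hypothetical_shares = {"HDI": 0.8, "MSE LRC": 0.1, "MSE HRC": 0.1}

combined = overall_reduction(hypothetical_shares, reductions)
print(round(combined, 2))
```

Whatever the actual shares are, the combined factor always lies between the smallest and largest individual reduction, and it moves toward the HDI figure as the HDI share of the workload grows.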

4.4 Virtual Prototyping of Candidate Hardware/Software Architectures

Virtual prototyping is a key element of the RASSP methodology and was used on all of the RASSP benchmarks. In the case of BM4, virtual prototyping was used to investigate and evaluate hardware/software architecture tradeoffs for the top level system design and the candidate MSE custom board designs. ATL's RASSP virtual prototyping process uses a hierarchical performance modeling approach to analyze and validate the candidate architectures as the hardware and software design matures. During the system definition cycle, high level, abstract performance models of the individual subsystems were used to verify the SAIP HighClass requirements and specifications. The fidelity of these models was refined as the design progressed through the hardware/software codesign cycle and the final detailed design, integration and test phases. Performance modeling was also a key element of the custom MSE board design tradeoff study. Both the system level and MSE custom board virtual prototypes used RASSP's VHDL abstract modeling concepts to analyze design tradeoffs and arrive at a less complex and significantly lower cost final design.

During the system definition cycle a top level performance model was developed to analyze and verify the communication requirements between the individual HDI and MSE processors. This virtual prototype was assembled using Omniview's Cosmos modeling tools. Using Cosmos' model library, a Mercury RACEway Sharc board model was developed and evaluated. The results of this model established that the communication requirements between the individual HDI and MSE processors were minimal (less than 5 percent of the total execution time). In addition, it identified the potential for replacing the Mercury RACEway communication network with a simple Sharclink design.

During the Architecture Hardware/Software Codesign cycle, performance models were used to analyze and validate the hardware/software designs for the FPGA, C80 and C6201 custom board designs. The FPGA custom board virtual prototype was developed using Omniview's Cosmos performance modeling tools, while the C80 and C6201 processor models were developed using in-house, lightweight, less detailed VHDL models and tools (see Performance Modeling Application Note).

Figure 4-6 Virtual Prototypes Provide a Low Cost Approach to Verify and Optimize the Hardware/Software Designs

Cosmos was successfully used to model and analyze the MSE/MAD Classifier FPGA custom board hardware and software design. Techniques were devised to use Cosmos' existing library elements and adapt them to model the custom FPGA MSE operators, the image chip and template caches, the board controller, the template storage memory, as well as the VME and RACEway network interfaces. Similarly, existing token based VHDL processor and network models were used to assemble virtual prototypes of the C80 and C6201 custom board designs. In the case of the C80 model, the on chip CPUs and the internal communication network were modeled as an on chip network. Using this model we were able to test and validate the MSE custom board software design for distributing the low and high resolution image chip and template data.

Results from the custom board performance modeling efforts are shown in Table 4-1.

Design Factor	FPGA (40 MHz)	C80 (40 MHz)	C80 (60 MHz)	C6201 (200 MHz)
# Processors Per Board	6	4	4	6
Processor Rate (chips/sec/board)	17.5	17	25	31
Bus Utilization (%)	N/A	10.3	10.8	11.1
Processor Utilization (%)	89/57*	94	94	96
* low resolution/high resolution

Table 4-1 Custom Board Virtual Prototyping Results

Virtual prototyping of the FPGA, C80 and C6201 MSE custom board designs proved to be invaluable in resolving a number of critical hardware/software architectural issues. First and foremost, the virtual prototypes provided the performance data for establishing the processing throughput rates of the candidate DSP custom board designs. This data was a key factor in the MSE custom board tradeoff analysis and selection. The virtual prototypes also provided the mechanism for investigating and establishing the custom board software designs. The MSE low and high resolution control software was developed and verified. The feasibility of performing both the low and high resolution processing on a single DSP was established. Finally, the performance simulations clearly demonstrated that the bandwidth requirements for image chip and template data transfer were minimal and that neither a high speed interconnect network nor dedicated template caches were required.
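The per-board rates in Table 4-1 translate directly into a board count for each candidate. The sketch below does that conversion, assuming (our assumption, not stated in the source) that boards of the same type add their throughput linearly and that the 30 image chips per second system requirement is the rate the MSE boards must sustain.

```python
import math

REQUIRED_RATE = 30.0  # image chips per second (system requirement quoted in Section 4.1)

# Processing rates in chips/sec/board, copied from Table 4-1.
board_rates = {
    "FPGA (40 MHz)": 17.5,
    "C80 (40 MHz)": 17.0,
    "C80 (60 MHz)": 25.0,
    "C6201 (200 MHz)": 31.0,
}

# ASSUMPTION: throughput adds linearly across boards of the same type.
boards_needed = {
    name: math.ceil(REQUIRED_RATE / rate) for name, rate in board_rates.items()
}

for name, count in boards_needed.items():
    print(f"{name}: {count} board(s)")
```

On these figures only the C6201 design reaches the required rate with a single board; the other three candidates each need two.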

Two different performance modeling tools, Omniview's Cosmos tools and ATL's VHDL library of reusable models, were used to model the SAIP system. Performance modeling of the SAIP BM4 system was highly beneficial in analyzing the COTS board tradeoffs, clarifying the communication requirements, retiring software design risks and steering the detailed hardware/software design process.

Figure 4-7 System Level Virtual Prototyping Provided a Mechanism for Verifying the Final System Design and Reducing the Hardware/Software Integration Time

The SAIP system virtual prototyping effort provided the following key contributions:

Verified the Alex COTS boards would meet the MSE and HDI processing requirements
Showed the 72 Sharc processors could be programmed efficiently
Validated the Sharclink network would meet the HDI/MSE communication requirements
Established the image chip as being the basic level of processing granularity
Verified a static schedule would provide high efficiency
Established that the HDI execution time was the dominant factor in meeting the system requirements
Confirmed the LRC and HRC random execution times would not degrade system performance.
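Two of the findings above, that a static schedule gives high efficiency and that random LRC and HRC execution times do not degrade performance, can be illustrated with a toy model. The sketch below is not the Cosmos or ATL VHDL model. It assigns image chips to processors round-robin (a static schedule at image chip granularity) and draws each execution time from a uniform range; the timing values are hypothetical.

```python
import random
from typing import Tuple


def simulate_static_schedule(
    num_processors: int,
    num_chips: int,
    mean_time: float,
    jitter: float,
    seed: int = 0,
) -> Tuple[float, float]:
    """Return (makespan, average utilization) for a round-robin static schedule.

    Each image chip takes mean_time +/- jitter seconds, drawn uniformly.
    Utilization is total busy time divided by (processors * makespan).
    """
    rng = random.Random(seed)
    busy = [0.0] * num_processors
    for chip in range(num_chips):
        exec_time = mean_time + rng.uniform(-jitter, jitter)
        busy[chip % num_processors] += exec_time  # assignment fixed in advance
    makespan = max(busy)
    utilization = sum(busy) / (num_processors * makespan)
    return makespan, utilization


# HYPOTHETICAL workload: 72 processors, 100 chips each, 1.0 s +/- 0.3 s per chip.
makespan, utilization = simulate_static_schedule(72, 7200, 1.0, 0.3, seed=1)
print(f"makespan={makespan:.1f} s, utilization={utilization:.3f}")
```

Because each processor handles many chips, the random variations tend to average out over the run, which is the intuition behind a fixed schedule remaining efficient even when individual execution times vary.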

Using both Cosmos and ATL's lightweight VHDL modeling approaches (shown in Figure 4-7) provided the opportunity to compare and assess the differences between these tools. The SAIP virtual prototyping efforts highlighted the strengths and weaknesses of each approach. The following paragraphs summarize the salient features, strengths and limitations of the tools.

Some of the key features and characteristics of the Cosmos performance modeling tools are:

Different components/tools integrated into a highly effective single GUI
Easy to use hardware architecture GUI
Good software task graph GUI
Provides a library of sophisticated general purpose processor and network models
Detailed VHDL model behavior hidden in the underlying library models
Supports both static and dynamic task scheduling
Supports multitasking and interrupt process modeling
Performs acceptably for simulations of 60 simple processors or 25/30 complex processors
Automatically traces all processor, task and network events, unless explicitly turned off
Generates large simulation models with increased memory requirements and relatively slow simulation times
Routing/simulation code generation time for larger systems increases to several hours

Similarly, the key features and characteristics of ATL's lightweight performance modeling tools and models are:

Separate architecture and high-level data flow graph (DFG) GUIs
Good software GUI for mapping application to architecture
Makes use of existing library models as a starting point for model development
Requires VHDL modeling experience to develop specific simulation models
Requires development of VHDL hierarchical structural models
Only provides static task scheduling
Efficient static scheduler for automatically generating the program and routing files
Multitasking provided automatically via time multiplexing by GUI/scheduler.
Trace events explicitly defined by detailed VHDL code
Individual tools that are compatible but not integrated into a single GUI.
Extremely fast simulation runtimes. (No significant system performance degradation with 70 processors.)

The following table provides a comparison of the Cosmos and the lightweight VHDL performance modeling simulations.

Model	Simulation Time (Cosmos)	Simulation Time (LW VHDL)	Simulation Runtime (Cosmos)	Simulation Runtime (LW VHDL)	Simulation Size (Cosmos)	Simulation Size (LW VHDL)
Single Board Model	20 sec	N/A	75 min	N/A	155 MB	N/A
Two Board Model	20 sec	5 sec	10 hrs	5 sec	260 MB	10 MB
Four Board Model	N/A	20 sec	N/A	40 sec	N/A	48 MB

From the simulation runtimes, it is apparent that the simulation code generated by the Cosmos tool is significantly larger and runs slower than the lightweight VHDL models. Cosmos' sophisticated, general purpose models result in large simulation executables and longer runtimes. Comparison of the Cosmos one board model with the two board model shows the nonlinear effect the simulation size had on the simulation runtime. In this case, the size of the two board model exceeded the workstation's memory limits and significantly diminished the usefulness of the Cosmos tool. On the other hand, Cosmos' integrated tools and general purpose model library made assembling and integrating the model relatively simple.
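The scale of these differences is easier to see as ratios. The sketch below recomputes them from the table values only; the variable names are ours.

```python
# Values copied from the comparison table (runtimes converted to seconds).
cosmos_one_board = {"runtime_s": 75 * 60, "size_mb": 155}
cosmos_two_board = {"runtime_s": 10 * 3600, "size_mb": 260}
lw_vhdl_two_board = {"runtime_s": 5, "size_mb": 10}

# Two board model: Cosmos versus lightweight VHDL.
runtime_ratio = cosmos_two_board["runtime_s"] / lw_vhdl_two_board["runtime_s"]
size_ratio = cosmos_two_board["size_mb"] / lw_vhdl_two_board["size_mb"]

# Cosmos only: growth from the one board model to the two board model.
cosmos_size_growth = cosmos_two_board["size_mb"] / cosmos_one_board["size_mb"]
cosmos_runtime_growth = cosmos_two_board["runtime_s"] / cosmos_one_board["runtime_s"]

print(runtime_ratio, size_ratio)
print(round(cosmos_size_growth, 2), cosmos_runtime_growth)
```

For the two board model the Cosmos simulation is 26 times larger and 7,200 times slower than the lightweight VHDL model. Within Cosmos, the model size grows by about 1.7X from one board to two while the runtime grows 8X, which is the nonlinear effect described above.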

There were a number of lessons learned as the result of the SAIP system level performance modeling effort. Some of these lessons are:

RASSP's token level abstract modeling techniques were highly beneficial in establishing the HighClass VME and interprocessor communications requirements.
Using virtual prototypes we were able to quickly validate the performance of candidate custom board and COTS hardware architectures and make cost effective tradeoffs.
The SAIP communication requirements were low, resulting in the performance being mainly dependent on functional execution times.
The VME communication bandwidth was sufficient to meet the required image chip data transfer rates.
The Sharclinks provided a highly efficient network for interprocessor communication.
The random LRC and HRC execution times did not degrade system performance.
Lightweight VHDL performance models provide a practical approach for modeling large systems.
Static scheduling worked well and outperformed the dynamic scheduling.
The Cosmos tools begin to lose effectiveness for systems with more than 50 simple processors or 25 complex processors.

RASSP's virtual prototyping concepts and techniques made significant contributions to both the MSE custom board tradeoff analysis and the final COTS system design. Using a combination of the Cosmos tools and ATL's in-house VHDL modeling tools, ATL was able to evaluate and verify that a single C6201 custom MSE board design would meet the SAIP system requirements. Virtual prototyping efforts also retired a number of hardware/software design risks for the final all COTS prototype design. The ability to model hardware/software design concepts provided the performance and timing data needed to make critical design tradeoffs and achieve the 100X throughput density improvement in the final HighClass processor architecture.

4.5 HighClass Software Development and Autocoding Efforts

The HighClass MSE, HDI and control processor software was all captured, optimized and integrated using RASSP's data flow graph (DFG) embedded software development processes and tools. As part of RASSP, ATL developed innovative DFG software concepts for efficiently implementing embedded DSP software for complex signal processing applications. These processes and tools were used to capture the HighClass image chip processing functions and integrate them with the SAIP system. The tools used to develop the BM4 software were GEDAE™, a graphic DFG software development tool, and the Application Interface Builder (AIB), which provided the interface between application control software and the embedded signal processing functions. The overall software development environment is shown in Figure 4-8. For BM4, GEDAE™ was used to capture, distribute and map the MSE and HDI software onto the Alex COTS DSP boards. AIB was used to build the HighClass command program for managing the SAIP image chip target classification processing requests.

Figure 4-8 RASSP's Integrated Software Development and Autocoding Processes and Tools were a Key Factor in Efficiently Implementing the SAIP Application

GEDAE™ is a graphical data flow software development tool that allows signal processing software to be captured as DFGs and autocoded for COTS DSP boards. GEDAE™ was a new software tool developed by ATL that had only been introduced as a commercial product in mid 1997. As a result, a number of limitations and shortcomings arose during the HighClass software development. In addition, the DFG software development processes were also immature. The combination of GEDAE™'s shortcomings and the lack of a proven DFG development process resulted in a number of challenges and setbacks in capturing and optimizing the MSE and HDI software.

The figure below highlights the challenges and results of the MSE and HDI DFG development efforts. In the case of the MSE function, the critical issue was developing an efficient implementation of the highly repetitive template matching function. For the HDI DFG, the key challenge was efficiently capturing and distributing the highly complex HDI function. The following paragraphs describe the issues that arose in implementing and optimizing the MSE and HDI DFGs, discuss the solutions that were developed, and relate lessons learned for improving the process in the future.

Figure 4-9 The RASSP Software Development and Autocoding Tools were Successfully Applied to Achieve Unprecedented Processor Efficiencies

The initial MSE DFG development effort focused on developing an efficient GEDAE™ primitive for the MSE iterative processing loop. The effort started by capturing the optimized MSE C code as a GEDAE™ primitive and running it on a single Sharc processor. The C code was compiled using the optimized C compiler and required just under 15 operations per pixel to perform the MSE function. The resulting assembly code was analyzed and found to have an excessive amount of loop overhead. As a result, an effort was initiated to optimize the MSE assembly code to reduce the inner loop execution time. This optimization effort took two man-weeks and resulted in an assembly code primitive that required slightly less than 3 operations per pixel. In this case, we were able to reduce the simple, highly repetitive MSE execution time by 5X using a custom Sharc assembly code primitive. It should be pointed out that, while GEDAE™ provides the capability to easily build signal processing DFGs using library or encapsulated C code primitives, highly repetitive functions like MSE can benefit significantly from the development and integration of application specific assembly code primitives.
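To make the "operations per pixel" figure concrete, the sketch below shows the shape of the inner loop in question, assuming that MSE here denotes a mean squared error match between an image chip and a target template (the source does not spell out the acronym). It is written in Python for readability and is not the BM4 C or Sharc assembly code; the operation counts in the comments are our own tally, not figures from the source.

```python
from typing import Sequence


def mse_score(chip: Sequence[float], template: Sequence[float]) -> float:
    """Mean squared error between an image chip and a template of equal length."""
    if len(chip) != len(template):
        raise ValueError("chip and template must have the same number of pixels")
    accumulator = 0.0
    for chip_pixel, template_pixel in zip(chip, template):
        difference = chip_pixel - template_pixel   # 1 subtract per pixel
        accumulator += difference * difference     # 1 multiply + 1 add per pixel
    return accumulator / len(chip)


# Speedup quoted in the text: about 15 operations per pixel down to about 3.
speedup = 15 / 3

print(mse_score([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
print(speedup)
```

The loop body itself needs only a subtract, a multiply and an accumulate per pixel, so a figure near 15 operations per pixel suggests most of the compiled code's time went to loop overhead rather than arithmetic, which matches the analysis described above.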

The second major challenge faced in developing the MSE DFG was managing and distributing the large template data sets. In this case, the MSE low resolution classification DFG needed the ability to control the size and location of the target template and the image chip data in the Sharc's internal memory banks. Neither GEDAE™ nor Alex's operating system provided the ability to explicitly allocate and control the storage of data in the Sharc's internal memory. As a result, extensions were made to both the GEDAE™ and Alex software tools. Using these extensions, we were able to explicitly control the location and size of the MSE template and image chip data storage, capitalize on the Sharc's vector processing capabilities, and achieve more than 90% utilization of the Sharc's internal memory resources.

The final challenge was developing a mechanism to perform the high resolution template match by loading a single image chip and cycling multiple high resolution templates through the primitive. GEDAE™'s original data flow concept received and consumed equal amounts of data for each execution cycle. The MSE HRC DFG needed to be able to receive the image chip data once and cycle multiple template data sets through the primitive. This limitation was identified and extensions were developed to allow the image chip data to remain static in memory while the template data was cycled through.
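The data access pattern described here, one image chip held in place while many templates stream past it, can be sketched as follows. This is an illustration of the pattern only and not of the GEDAE™ extension itself; the mean squared error scoring, the function names and the template names are all assumptions made for the example.

```python
from typing import Iterable, Iterator, List, Optional, Sequence, Tuple


def best_template(
    chip: Sequence[float],
    template_stream: Iterable[Tuple[str, Sequence[float]]],
) -> Tuple[Optional[str], float]:
    """Score one resident image chip against templates that arrive one at a time."""
    resident_chip: List[float] = list(chip)  # loaded once, reused for every template
    best_name: Optional[str] = None
    best_score = float("inf")
    for name, template in template_stream:   # only one template is held at a time
        score = sum(
            (c - t) ** 2 for c, t in zip(resident_chip, template)
        ) / len(resident_chip)
        if score < best_score:
            best_name, best_score = name, score
    return best_name, best_score


def template_source() -> Iterator[Tuple[str, List[float]]]:
    """HYPOTHETICAL template set, yielded one template at a time."""
    yield "target_a", [0.0, 0.0, 0.0, 0.0]
    yield "target_b", [1.0, 2.0, 3.0, 4.0]
    yield "target_c", [1.0, 2.0, 3.0, 5.0]


name, score = best_template([1.0, 2.0, 3.0, 4.2], template_source())
print(name)
```

The point of the pattern is that the chip input is consumed once per classification while the template input is consumed many times, which is exactly the unequal consumption rate the original data flow model did not allow.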

The MSE DFG development effort resulted in greater than 90% processor and memory utilization. Achieving this efficiency meant overcoming a number of roadblocks that required extensions to GEDAE™ as well as the Alex software tools. This is not unusual when new, emerging software tools and DSP products are being used. In most instances, developing a state of the art signal processor involves dealing with emerging tools and products whose limitations and shortcomings must be resolved. In developing the MSE DFG, all of the GEDAE™ and Alex operating system limitations were eliminated and the desired results were achieved. The final MSE DFG was efficiently mapped, autocoded and distributed across 25 Sharc processors. The final result was that the MSE DFGs, shown in Figure 4-10, achieved an average execution time of less than 3 cycles per pixel, over 95% Sharc processor utilization and in excess of 90% memory usage.

Figure 4-10 Using Autocoding and Operating System Extensions, Greater than 90% Processor and Memory Efficiencies were Achieved for the MSE DFG.

While MSE was a simple repetitive function, the HDI function was highly complex, involving hundreds of computational primitives. This complexity slowed down the algorithm optimization as well as the DFG development efforts. DFG developments were initiated before the final algorithm requirement analysis was completed. These initial DFG development efforts faced a number of problems. Because the final HDI algorithm requirements had not been established, these early efforts failed to account for DFG complexity issues that arose during the final DFG development phase.

Early HDI DFG development efforts focused on identifying and developing the primitives needed for the HDI "remove prior transforms" functions. Our development process emphasized making maximum use of the existing library primitives. The HDI preprocessing functions were made up of complicated indexing, sampling and mathematical functions that did not exist as primitives in GEDAE™'s function library. In some instances these functions were HDI specific. In other cases, they were general purpose primitives that had not yet been incorporated in GEDAE™'s library. As a result, the DFG development efforts were diverted to develop the required low level library elements.

At the start of the HDI DFG development effort, GEDAE™'s primitive library was immature. While some functions existed, in many cases they did not support the required data types. The preprocessing functions also required variable size data types that were, at the time, not supported by GEDAE™. In all, a total of 126 primitives were identified that needed to be developed for the HDI DFGs. Of these, 95 were general purpose functions, which were subsequently added to GEDAE™'s library. The remaining 31 were HDI specific. Development of these primitives significantly expanded the scope of the initial HDI DFG development efforts.

Once the required primitives were implemented, development of the "remove prior transforms" DFG was initiated. The preprocessing functions included complicated indexing, sampling and mathematical functions specific to the HDI algorithm. These complex functions were captured as DFGs using low level GEDAE™ library primitives (e.g. add, subtract, multiply, etc.). This approach led to highly complicated DFGs with multiple levels of hierarchy. These complicated DFGs resulted in large execution schedules and increased program memory requirements, and they were later replaced by HDI specific C code primitives to meet the memory and runtime requirements.

These early efforts taught us that a top down design approach is critical to the development of efficient DFGs. Understanding the overall algorithm requirements is essential to efficient data flow design. Literal translation of the C code functions into complex DFGs using low level primitives is not an effective approach for capturing complex algorithms. This lesson was clearly brought home when the early, highly complex DFGs and primitives had to be modified to make use of encapsulated HDI specific C code primitives to meet the memory and execution time requirements.

Figure 4-11 Use of Low Level Library Primitives to Perform Application Specific Function Can Result in Complex, Inefficient DFGs

In summary, early HDI DFG development efforts resulted in the development of over 100 individual DFG primitives. Translation of the C code for the "remove prior transforms" functions into complex low level DFGs led to highly inefficient DFG designs. In the end, much of the work produced in this early development phase was replaced by newly developed GEDAE™ library primitives and/or encapsulated HDI specific primitives, which were more efficient.

When the final HDI functional analysis was completed, the final HDI DFG implementation, optimization and testing were initiated. The initial "remove prior transforms" DFGs were integrated with DFGs developed for the HDI image formation functions. These new DFGs were assembled using a combination of DFG library functions and encapsulated C code for HDI specific functions. Once the full HDI DFG had been assembled, it was compiled and executed on a single Sharc processor. This initial compilation resulted in an execution time of more than 10 seconds (versus the requirement of 1.5 seconds) and memory use in excess of 1 megabyte (versus the 512 Kbytes available).

These results presented significant challenges for achieving the desired performance. To overcome these challenges, the final HDI DFG development efforts focused on two critical aspects. The first was restructuring and modifying the DFGs to fit in the Sharc's 512 Kbytes of on chip memory. The second was reducing the HDI DFG execution time to less than 1.5 seconds. An iterative, three cycle process was used to attack these issues. Each of the optimization cycles focused on analyzing memory use and execution times, and on identifying and implementing DFG improvements and autocoding enhancements to reduce the memory usage and DFG execution times. The following table shows the memory use and execution times at the end of each cycle. The following sections summarize the changes made to achieve those improvements and the lessons learned during each phase.

Progression of Memory Usage and Execution Time Reductions
	Cycle 1	Cycle 2	Cycle 3
Memory Usage (Kbytes)	873	725	456
Execution Time (seconds)	4.45	2.97	1.42

Table 4 - 3
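
The progression in Table 4 - 3 can be checked against the 512 Kbyte on chip memory and 1.5 second execution time limits with a few lines of Python. This is only an illustrative sketch: the figures are copied from the table and the text, while the names (CYCLES, meets_limits, memory_utilization) are assumptions introduced here and are not part of GEDAE™ or any RASSP tool.

```python
# Limits quoted in the text: 512 Kbytes of on chip memory, 1.5 seconds per image chip.
MEMORY_LIMIT_KB = 512
TIME_LIMIT_S = 1.5

# End-of-cycle results copied from Table 4 - 3: (memory in Kbytes, time in seconds).
CYCLES = {
    "Cycle 1": (873, 4.45),
    "Cycle 2": (725, 2.97),
    "Cycle 3": (456, 1.42),
}


def meets_limits(memory_kb, time_s):
    """True when the DFG fits in on chip memory and runs within the time limit."""
    return memory_kb <= MEMORY_LIMIT_KB and time_s <= TIME_LIMIT_S


def memory_utilization(memory_kb):
    """Fraction of the on chip memory occupied by the DFG."""
    return memory_kb / MEMORY_LIMIT_KB


for name, (memory_kb, time_s) in CYCLES.items():
    print(
        f"{name}: {memory_kb} Kbytes "
        f"({memory_utilization(memory_kb):.0%} of on chip memory), "
        f"{time_s} s, meets limits: {meets_limits(memory_kb, time_s)}"
    )
```

Only the third cycle satisfies both limits; 456 of 512 Kbytes corresponds to roughly 89% memory utilization.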

During the initial cycle, the HDI DFGs were restructured in two ways. One effort focused on quickly reimplementing the HDI DFG. Code that was recognized as common signal processing primitives was replaced with GEDAE™ library functions. Large portions of code that were not fundamental signal processing operations were left as custom C code primitives that call optimized GEDAE™ vector functions. The focus, decompose, and MLM parts of the algorithm were encapsulated. Only small amounts of the code in decompose and MLM were replaced with vector operations. These changes were responsible for the reduction from the initial 10 second execution time to 4.45 seconds during the first cycle.

The second major change was the elimination of GEDAE™ family functions. GEDAE™ provides the ability to use families to implement repetitive functions (for loops). Families allow the user to design DFGs where the individual family subtasks can be distributed across multiple processors. Each family element is allocated static input and output memory buffers. While this is beneficial when the function is distributed across multiple processors, it leads to inefficient memory use if the function is performed on a single processor. Since the HDI DFG was targeted to run on a single processor, the use of families added significantly to the memory requirements. As a result, during the initial cycle, most of the families were removed. This change was the primary contributor to the reduction of program memory requirements from over 1 megabyte to the 873 kilobytes shown for Cycle 1 in Table 4 - 3.
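
The memory cost of families on a single processor can be illustrated with simple arithmetic. The sketch below is not GEDAE™ code; the function names and the buffer sizes are hypothetical values chosen only to show why static per-element buffers grow with the number of family elements while a sequential loop can reuse one buffer pair.

```python
def family_buffer_bytes(elements, in_bytes, out_bytes):
    """Static allocation: every family element owns its own input and output buffer."""
    return elements * (in_bytes + out_bytes)


def loop_buffer_bytes(in_bytes, out_bytes):
    """Sequential loop on one processor: a single input/output buffer pair is reused."""
    return in_bytes + out_bytes


# Hypothetical example: 8 family elements, 16 Kbyte input and output buffers each.
ELEMENTS = 8
IN_BYTES = 16 * 1024
OUT_BYTES = 16 * 1024

as_family = family_buffer_bytes(ELEMENTS, IN_BYTES, OUT_BYTES)
as_loop = loop_buffer_bytes(IN_BYTES, OUT_BYTES)
print(f"family: {as_family // 1024} Kbytes, loop: {as_loop // 1024} Kbytes")
```

With these assumed sizes the family form needs 256 Kbytes of buffers against 32 Kbytes for the loop form, which is half of a 512 Kbyte on chip memory spent on buffers that are never in use at the same time.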

At the end of the first cycle it became apparent that we were not going to achieve the required execution time and memory storage requirements without changes to GEDAE™, the Alex operating system, the Wideband optimized Sharc library, and the HDI DFG implementation approach.

As a result, the necessary GEDAE™ and Alex operating system enhancements were identified and efforts were initiated to make the changes. These enhancements were:

Moving the memory allocation task from the embedded GEDAE™ kernel up to the host processor.
Modifying GEDAE™'s subscheduler to allow the use of "in-place" input/output memory buffers.
Modifying GEDAE™'s scheduler to allow sequential subschedules to reuse the same memory resources.
Providing the capability to define and control the specific memory location and size of parameters and variables.
Modifying the Alex operating system to provide the ability to control the allocation of the Sharc on chip memory.
Revising the Alex GEDAE™ port software to reduce the routing table for each individual Sharc processor.
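
Two of the enhancements above, in-place buffers and memory reuse across sequential subschedules, are general memory-saving techniques that can be sketched independently of the tools. The Python below is an assumption-laden illustration, not a description of how GEDAE™ implements them; all names and the subschedule sizes are hypothetical.

```python
from array import array


def scale_copy(samples, gain):
    """Out-of-place: allocates a second buffer of the same size for the output."""
    return array("f", (s * gain for s in samples))


def scale_in_place(samples, gain):
    """In-place: the output overwrites the input buffer, so no second buffer is needed."""
    for i in range(len(samples)):
        samples[i] *= gain
    return samples


def peak_without_reuse(subschedule_kb):
    """Each subschedule keeps its own buffers for the whole run."""
    return sum(subschedule_kb)


def peak_with_reuse(subschedule_kb):
    """Sequential subschedules share one region sized for the largest of them."""
    return max(subschedule_kb)


buf = array("f", [1.0, 2.0, 3.0])
out = scale_in_place(buf, 2.0)
print(out is buf, list(buf))

# Hypothetical buffer needs (Kbytes) of three subschedules that run one after another.
needs = [120, 200, 80]
print(peak_without_reuse(needs), peak_with_reuse(needs))
```

Both techniques trade flexibility for memory: an in-place primitive destroys its input, and shared regions are only safe when the subschedules never execute concurrently.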

In addition, Wideband was contacted to determine if they could provide optimized versions of their Sharc library functions. The Wideband libraries provided with the Alex boards did not take advantage of the Sharc's ability to perform multiple memory fetches in a single clock cycle. Discussions with Wideband indicated that, while they normally provided optimized C code library functions, they could provide optimized Sharc functions. Wideband was given a list of the functions required for HDI and agreed to furnish the required library primitives.

While the GEDAE™, Alex and Wideband changes were being accomplished, efforts were focused on restructuring the HDI DFG. These efforts concentrated on completely removing families from the HDI DFG. In addition, the C code primitives were rewritten to maximize the use of optimized vector routines. The individual HDI focus, make-looks, decompose and MLM functions were analyzed and recoded to use fundamental, optimized GEDAE™ primitives. The new implementations maximized the use of coarse grain vector operations to exploit the Sharc's vector processing capabilities. These changes reduced the memory storage requirements to 725 kilobytes and the execution time to less than 3 seconds. With the projected savings associated with the GEDAE™, Alex and Wideband enhancements, the memory storage and execution time objectives seemed to be achievable.

During the final refinement cycle the GEDAE™, Alex and Wideband enhancements were incorporated. Two problems slowed down the final cycle. First, the GEDAE™ modifications were not completed at the start of the cycle and had to be integrated incrementally. Second, bugs discovered in the enhanced software as well as in the Analog Devices compiler had to be corrected. Once these bugs were identified and corrected, the final DFG refinements could be accomplished.

Efforts were focused on incorporating all of the GEDAE™ and Alex software enhancements to allow the DFG to be accommodated in the on chip memory. When the changes were incorporated, the HDI DFG total memory requirements were reduced to 456 Kbytes allowing it to fit in the on chip memory.

When the DFG could be loaded in the on chip memory, the Wideband optimized Sharc functions were integrated. These final changes involved only minor modifications to the HDI DFG. The final modifications reduced the execution time to 1.42 seconds for a single Sharc processor. When the final DFG was later integrated in the top level HighClass DFG, it ran at a rate of less than 1.45 seconds per image chip.

In summary, the final integration and optimization of the HDI DFG faced a number of challenges. The first and most significant was optimizing the memory usage to allow the HDI DFG to fit on a single Sharc processor. This requirement was critical to exploiting the Sharc's vector processing capabilities. Major enhancements had to be made to the GEDAE™ autocoding capabilities as well as the Alex software to allow the code to fit in the on chip memory. These tool enhancements give future users access to and control of memory allocation, which is critical to realtime software efficiency. Using these extensions, we were able to achieve 90% memory utilization and fully exploit the vector processing capabilities of the Sharc.

The second major challenge was achieving a 3X reduction in execution time. The primary factor in achieving the execution time improvement was the use of optimized DFG functions. These included optimized HDI specific C code functions, optimized GEDAE™ library functions, and optimized Wideband Sharc assembly code functions. Focusing on the use of optimized functions, together with GEDAE™'s ability to integrate them automatically into the final embedded executable modules through autocoding, was the key to achieving the final 1.42 second execution time.

In retrospect, the HDI DFG development efforts faced a number of challenges and problems, and provided a number of lessons on how to improve future DFG software development efforts. Some of the key lessons learned are:

Initiating detailed DFG development before the top level data flow design is established can be unproductive.
DFG design needs to reflect the final partitioning and mapping strategy and primitives must be designed to support this distribution strategy.
The availability of the library primitives is critical to efficient DFG development.
Primitive development requirements must be taken into account in defining the top level DFG design and development strategy.
Attempting to maximize the use of existing library primitives for application specific functions leads to very complex, inefficient DFGs.
Encapsulation of application specific C code functions can eliminate the need for complex, cumbersome DFGs.

A key process that emerged during the BM4 software development effort was the use of a top level DFG to develop and optimize an application's data flow design. For BM4, a top level DFG was developed, mapped and optimized for the Alex COTS DSP boards while the final MSE and HDI DFGs were still under development. This graph was constructed using time delay functions to represent the execution times for the HDI and MSE low and high resolution classification functions.

Figure 4-12 Using a Top Level Virtual HighClass DFG Allowed us to Optimize the Data Flow Design Prior to Completion of the Final HDI and MSE DFGs

In effect, this top level DFG represented an emulation of the HighClass image processing functions that could be mapped and optimized to the prototype hardware while the final HDI and MSE DFGs were still under development. Using this DFG, we were able to identify and resolve a number of critical data flow design and integration issues before the final system integration and test. We were able to identify shortcomings in the Alex and GEDAE™ software tools and have them updated to provide the necessary capabilities. Once the GEDAE™ and Alex software had been updated, the top level DFG was used to refine and optimize the HighClass DFG design. This top level data flow development effort resulted in a hardware/software design that achieved better than 90% processor efficiency. By overcoming these shortcomings and demonstrating a highly efficient data flow design early in the HighClass DFG development effort, we avoided costly delays later in the final system integration efforts.
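
The idea of standing in for unfinished functions with time delays can be sketched as a simple steady-state throughput model. This is a deliberately simplified assumption: it ignores data queues and communication costs, it is not how GEDAE™ represents a graph, and the stage names, processor counts and delays below are hypothetical placeholders rather than BM4 measurements.

```python
from dataclasses import dataclass


@dataclass
class DelayStage:
    """A processing stage represented only by a stand-in delay per image chip."""
    name: str
    processors: int
    seconds_per_chip: float

    def chips_per_second(self):
        # With the stage replicated across processors, its rate scales linearly.
        return self.processors / self.seconds_per_chip


def bottleneck(stages):
    """The stage with the lowest rate limits the whole pipeline."""
    return min(stages, key=lambda s: s.chips_per_second())


def utilization(stages):
    """Fraction of time each stage is busy when fed at the bottleneck rate."""
    rate = bottleneck(stages).chips_per_second()
    return {s.name: rate / s.chips_per_second() for s in stages}


# Hypothetical placeholder values for illustration only.
stages = [
    DelayStage("low resolution classification", processors=2, seconds_per_chip=1.0),
    DelayStage("image formation", processors=3, seconds_per_chip=2.0),
]
limiting = bottleneck(stages)
print(limiting.name, limiting.chips_per_second())
print(utilization(stages))
```

Adjusting the processor counts until every stage's utilization approaches 1.0 is the kind of load balancing that can be explored before the real functions exist, provided the stand-in delays are later replaced by measured execution times.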

Once the HDI and MSE DFGs were completed, integration of the final HighClass DFG was initiated. Figure 4 - 13 shows the two cycle process used to integrate the final HighClass DFGs with the Alex DSP boards. The initial cycle focused on integrating and optimizing a "two family" version of the HighClass top level DFG. The "two family DFG" designation refers to the use of two MSE low resolution classification families for the low resolution classification function. The final "five family DFG" used 5 family elements.

The initial cycle focused on integrating two MSE-LRC families (using 8 Sharcs), 14 HDI processors, 2 MSE HRC processors and 5 individual Sharcs to control GEDAE™ graph execution, perform pre and post processing functions, assemble the HDI high resolution images and assemble the HRC template sets. A total of 29 Sharcs (representing approximately 2/5 of the final system) were used to host the two-family DFG. This reduced size DFG allowed us to optimize the data queues and partitioning, and balance the HDI and MSE execution times using a smaller, simpler configuration before attempting to implement the final 72 processor DFG.
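
The processor budget for the two-family configuration can be tallied directly from the allocation given above. The dictionary keys are descriptive labels introduced for this sketch; the counts are the ones stated in the text.

```python
# Sharc counts for the two-family DFG, as stated in the text.
TWO_FAMILY_ALLOCATION = {
    "MSE-LRC families (two families)": 8,
    "HDI processors": 14,
    "MSE HRC processors": 2,
    "control, pre/post processing and assembly": 5,
}
FINAL_SYSTEM_SHARCS = 72

two_family_total = sum(TWO_FAMILY_ALLOCATION.values())
fraction_of_final = two_family_total / FINAL_SYSTEM_SHARCS

print(f"two-family DFG: {two_family_total} Sharcs")
print(f"fraction of final system: {fraction_of_final:.2f}")
```

The total of 29 Sharcs is about 0.40 of the 72 processor system, consistent with the "approximately 2/5" figure.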

Figure 4-13 Using a Less Complex HighClass DFG Decreased Final Integration; Extending to the Full DFG Took Only Two Weeks

The smaller DFG allowed us to refine and optimize the DFG more efficiently than the full 72 processor graph. Even with the reduced size, each change took several hours to modify, recompile and load the new DFG and to evaluate the resulting execution performance data. By comparison, changes to the final 72 processor DFG required 4 to 6 hours, which significantly limited our ability to optimize the final graph.

Once the data queues, partitioning and load balance had been optimized for the "two family" DFG, the final 72 processor "five family" DFG was assembled and optimized. The expansion from the 29 processor configuration to the final 72 processor configuration was accomplished in two weeks.

Using the GEDAE™ autocoding tools we were able to achieve over 90% memory and processor utilization across the final distributed 72 processor architecture. We were able to develop, distribute, debug and optimize the HighClass DFG without writing a single line of code for interprocessor communication, memory allocation, or final code debugging. All of the executable software for the 72 Sharc processors was automatically generated, compiled, linked and downloaded by the autocoding tools. The execution and timing data needed to optimize the design were provided by GEDAE™'s unique execution trace table capabilities, which provided the insights needed to achieve the 90% processor efficiency.

Achieving this level of efficiency on a network of 72 tightly coupled DSPs meets or exceeds the level of efficiency that can be achieved using hand coded embedded processor software development processes. In fact, in most cases just measuring the overall network performance requires a significant amount of effort, which is rarely expended to show the actual execution timing of the final system hardware. RASSP's unique software development and autocoding processes and tools not only provided the capability to demonstrate the final hardware/software performance but also provided the debugging and optimization capabilities needed to achieve this high level of memory and processor efficiency.

4.6 Overview of BM4 Manhours and Schedule

The primary purpose of the RASSP benchmark development efforts was to demonstrate the advantages, benefits and improvements associated with applying the RASSP methodology and tools for developing future signal processors. As a result, a primary requirement was to apply and demonstrate as many of the RASSP concepts and processes as possible, and to record the amount of effort required to accomplish the individual design tasks. Consequently, ATL monitored and recorded the amount of effort, time and results for each of the individual BM4 development tasks. The development results of the BM4 effort have been reviewed in the previous sections of this case study. In this final section, the level of effort and time required to accomplish the BM4 prototype development effort are discussed.

As part of the benchmark process, metrics were established to measure the time and effort expended in developing and integrating the benchmark applications. The original level of effort proposed for the BM4 HighClass processor prototype development was 70 manmonths (5.2 manyears). The development was scheduled to be accomplished in 9 months. As the project evolved, a number of changes occurred, both technical and programmatic. These changes resulted in the total effort growing to 99 manmonths (7.9 manyears) and the schedule being extended to 17 months. The program changes and problems that caused these increases are described below. In addition, insights are provided for future users of the RASSP methodology and tools, to help estimate the effort required to accomplish a rapid prototyping development project like BM4.

The original level of effort estimated for accomplishing the SAIP HighClass processor development is shown in Figure 4 - 14. The figure shows the percentage of the effort budgeted for each of the second level development tasks as well as the number of manmonths budgeted for completing the effort. Figure 4 - 15 shows the same breakdown for the level of effort that was actually expended on the individual tasks. Finally, Figure 4 - 16 shows the difference between the original BM4 estimates and the actual levels of effort.

Figure 4 - 14 Summary of Proposed BM4 Development Effort

Figure 4 - 15 Summary of Actual BM4 Development Effort

Figure 4 - 16 Summary of the Increases to the BM4 Development Effort

Figure 4 - 14 shows the number of manhours allocated in the original development plan and was used to estimate the cost for developing the BM4 prototype. As previously described, the RASSP process is a spiral design process, where individual risk retirement tasks are identified and performed, and the results are used to establish the most cost beneficial approach for achieving the final system design. Because the spiral design process is a flexible, iterative process, changes occurred in the BM4 development plan that impacted the effort required to accomplish the development objectives. Several significant changes as well as unanticipated design problems and resource conflicts impacted the amount of effort and length of time needed to accomplish the BM4 project. As shown in Figure 4 - 15, not only did the amount of effort change but the distribution of the effort shifted significantly. The changes, problems and outside influences that created these differences are described in the following paragraphs.

In the case of the functional analysis effort, several issues arose causing our initial estimate to increase by 8 manmonths. First, the HDI algorithm and executable specification were more complex than originally anticipated and took significantly longer to analyze than planned. This problem resulted in an increase of 4 manmonths in the HDI functional analysis effort. The second issue was the necessity to continue optimization efforts to support the final DFG memory and execution time optimization efforts. This support effort added another two manmonths. Finally, the functional analysis effort was expanded to include analysis and implementation tradeoffs for the MSE algorithms that were not originally planned. The proposed architecture was based on a custom board for the MSE computation. As a result, the original plan did not include a detailed implementation tradeoff effort for the MSE algorithm. When the potential benefits of a COTS DSP board implementation of the MSE "early termination" approach were discovered, two manmonths of effort were added to investigate this alternative.

The 5.5 manmonth increase in the virtual prototyping efforts was caused by two separate factors. First, the Omniview Cosmos tool is an emerging performance modeling tool. ATL had minimal experience with the tool and its library models. In addition, Cosmos was still under development during the BM4 development effort. The immaturity of the Cosmos tool and ATL's lack of experience with it added 1.5 manmonths to the Cosmos virtual prototyping efforts, spent becoming familiar with the tool and overcoming problems and software bugs. The second factor was the development of both Cosmos and lightweight VHDL virtual prototypes of the BM4 system. These dual modeling activities were accomplished to demonstrate a comparison of the two virtual prototyping approaches as well as to overcome execution time limitations of the Cosmos tools. The development of two versions of the original Mercury and final Alex system designs added approximately 4 manmonths to the virtual prototyping effort. Future users should not incur the increases caused by these tool maturity problems or the need for dual modeling efforts.

By far the most significant increase in the BM4 development effort was associated with the detailed development of the DFG and control program software. In this case the overall effort increased by nearly a manyear (a 75% increase). Four factors led to this increase. The first factor was the maturity of the GEDAE, Alex and ADI software tools. Like many advanced technology programs, BM4 had to deal with problems arising from the use of new, emerging tools and processors. These included immature GEDAE primitive libraries, limitations and errors in the GEDAE tools and Alex operating system, and bugs in the ADI compiler. Combined, these problems contributed approximately 3 manmonths to the increase in the DFG development effort. A second factor was the use of low level primitives and a bottom up approach to develop the initial HDI DFGs. These early efforts created memory and execution problems and later had to be replaced with high level DFGs. Consequently, these early DFG development efforts expended 4 manmonths that were essentially lost when the early DFGs were replaced. The third factor was the set of problems encountered in optimizing the HDI DFG memory usage and execution times to achieve the 90% memory utilization and processor efficiency goals. Since BM4 was focused on a 100X increase in processor density, memory and execution time constraints placed a premium on the DFG program size and execution efficiency. These constraints resulted in requirements for extensions to both GEDAE™ and the Alex operating system. These extensions added to the complexity of the HDI DFG implementation effort. The effort had to be expanded to include identification of the necessary changes, debugging of the extensions, and several iterative DFG integration cycles. These complicating factors increased the DFG development effort by 3 manmonths.

The final factor contributing to the increase was a 2 manmonth growth in the effort required to integrate and test the HighClass control software with the SAIP system emulator. To fully test the HighClass prototype and perform the final acceptance test, the SAIP emulator code had to be modified to support both fixed and random sequences of test image chips. In addition, the software had to be extended to allow the test results to be recorded and displayed for evaluation during the formal acceptance tests.

With the exception of the 2 manmonth increase to modify the SAIP system emulator, none of the problems causing the 12 manmonth increase should occur on future DFG and control program development efforts. As a result of the BM4 effort, GEDAE™'s libraries and tools have matured and now provide significantly more primitive functions and fewer autocoding limitations. Definition of a better, top down DFG development process should eliminate the need to rework future DFG designs. Finally, extensions made to GEDAE for optimizing the BM4 memory usage and execution time should allow future applications to be developed, mapped and tested more efficiently.

The increase in system hardware/software integration effort of 3.7 manmonths was associated with two issues. The first was the unanticipated requirement to integrate an early version of the system for the RASSP Final Technical Review. Originally, the BM4 prototype was scheduled to be completed prior to this review. Slippage in the development schedule necessitated an interim integration effort prior to the final system completion. This added approximately 1.5 manmonths to the system integration task. The second factor was an underestimate of the effort needed to develop and integrate the final acceptance test procedures and software. This second factor contributed the remaining 2.2 manmonth increase.

Termination of the custom board development effort resulted in a net 3.4 manmonth decrease in this effort. These savings resulted from the elimination of the final custom board detailed design and fabrication tasks. However, the custom board preliminary design activities actually exceeded the original plan by approximately 6 manmonths because the effort was broadened during the architecture codesign cycle to include programmable DSP custom board designs as well as the FPGA custom logic approach. In the end these increased efforts provided the necessary tradeoff data to allow ATL to select the most efficient implementation for the MSE subsystem.

Finally, the program management/case study effort increased 3.1 manmonths. This increase was totally attributable to the expanded case study documentation efforts. The case study effort was originally planned to require 5 manmonths but actually ended up requiring just less than 10 manmonths. On the other hand, the program management costs decreased by approximately 2 manmonths even though the schedule grew from nine to seventeen months.

The 8 month extension of the development schedule was the result of a number of factors. During the architecture codesign cycle the BM4 development effort was suspended for two months to address higher priority RASSP development and legacy documentation efforts. In addition, during the detailed design and integration cycle, the effort was again interrupted to prepare for the RASSP Final Technical Review and resolve personnel resource conflicts with the Benchmark 3 development efforts. These two problems added a further one month of slippage. The remaining 5 months of slippage were directly related to the HDI functional analysis and HDI DFG optimization efforts. The complexity of the HDI functional analysis task added three months to the planned completion date of the functional design, while the problems encountered in optimizing HDI DFG memory usage and execution time added 2 months. While future RASSP users will be subject to slippage caused by the complexity of the application, they should not experience the schedule interruptions or DFG optimization delays faced by the BM4 development effort.
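
The individual effort and schedule changes described in the preceding paragraphs can be reconciled with the overall totals using a short tally. The figures are those stated in the text; the dictionary labels and variable names are assumptions introduced for this sketch.

```python
PROPOSED_EFFORT_MM = 70        # original estimate, manmonths
PROPOSED_SCHEDULE_MONTHS = 9   # original schedule, months

# Net change in effort per task area (manmonths), as stated in the text.
EFFORT_CHANGES_MM = {
    "functional analysis": 8.0,
    "virtual prototyping": 5.5,
    "DFG and control program software": 12.0,
    "hardware/software integration": 3.7,
    "custom board development": -3.4,
    "program management/case study": 3.1,
}

# Schedule slippage per cause (months), as stated in the text.
SCHEDULE_SLIPS_MONTHS = {
    "suspension during architecture codesign": 2,
    "Final Technical Review and Benchmark 3 conflicts": 1,
    "HDI functional analysis": 3,
    "HDI DFG optimization": 2,
}

net_effort_change = sum(EFFORT_CHANGES_MM.values())
actual_effort = PROPOSED_EFFORT_MM + net_effort_change
actual_schedule = PROPOSED_SCHEDULE_MONTHS + sum(SCHEDULE_SLIPS_MONTHS.values())

print(f"net effort change: {net_effort_change:.1f} manmonths")
print(f"actual effort: {actual_effort:.1f} manmonths")
print(f"actual schedule: {actual_schedule} months")
```

The task-level changes net to 28.9 manmonths, which rounds to the reported growth from 70 to 99 manmonths, and the four slippage causes sum to the 8 month extension from 9 to 17 months.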

In summary, the increase in the level of effort and the schedule extensions were caused by a number of problems. The primary factors were: expanded virtual prototyping activities and tradeoff analyses; maturity issues associated with the virtual prototyping, autocoding and DSP board tools; and the complexity of the HighClass application. In the case of the virtual prototyping and tradeoff analyses, many of the increases were the result of efforts that were added to demonstrate key aspects of the RASSP process and will not be a factor for future developers. Similarly, extensions made to the Cosmos and GEDAE tools as a result of the BM4 development effort should eliminate many of the problems that caused the growth of these efforts. Finally, lessons learned during the BM4 development effort and described in this case study should provide insights necessary for improving the RASSP rapid prototyping concepts, techniques and tools, and lead to significant savings on future signal processing development efforts.


Approved for Public Release; Distribution Unlimited Bill Ealy