4.5 HighClass Software Development and Autocoding Efforts
The HighClass MSE, HDI and control processor software was captured, optimized and integrated using RASSP's data flow graph (DFG) embedded software development processes and tools. As part of RASSP, ATL developed innovative DFG software concepts for efficiently implementing embedded DSP software for complex signal processing applications. These processes and tools were used to capture the HighClass image chip processing functions and integrate them with the SAIP system. The tools used to develop the BM4 software were GEDAE, a graphical DFG software development tool, and the Application Interface Builder (AIB), which provided the interface between application control software and the embedded signal processing functions. The overall software development environment is shown in figure 4-8. For BM4, GEDAE was used to capture, distribute and map the MSE and HDI software onto the Alex COTS DSP boards. AIB was used to build the HighClass command program for managing the SAIP image chip target classification processing requests.
Figure 4-8 RASSP's Integrated Software Development and Autocoding Processes and Tools were a Key Factor in Efficiently Implementing the SAIP Application
GEDAE is a graphical data flow software development tool that allows signal processing software to be captured as DFGs and autocoded for COTS DSP boards. GEDAE was a new software tool, developed by ATL, that had only been introduced as a commercial product in mid 1997. As a result, a number of limitations and shortcomings surfaced during the HighClass software development. In addition, the DFG software development processes were immature. The combination of GEDAE's shortcomings and the lack of a proven DFG development process resulted in a number of challenges and setbacks in capturing and optimizing the MSE and HDI software.
The figure below highlights the challenges and results of the MSE and HDI DFG development efforts. In the case of the MSE function, the critical issue was developing an efficient implementation of the highly repetitive template matching function. For the HDI DFG, the key challenge was efficiently capturing and distributing the highly complex HDI function. The following paragraphs describe the issues that arose in implementing and optimizing the MSE and HDI DFGs, discuss the solutions that were developed, and relate lessons learned for improving the process in the future.
Figure 4-9 The RASSP Software Development and Autocoding Tools were Successfully Applied to Achieve Unprecedented Processor Efficiencies 
The initial MSE DFG development effort focused on developing an efficient GEDAE primitive for the MSE iterative processing loop. The effort started by capturing the optimized MSE C code as a GEDAE primitive and running it on a single Sharc processor. The C code was compiled using the optimizing C compiler and required just under 15 operations per pixel to perform the MSE function. The resulting assembly code was analyzed and found to have an excessive amount of loop overhead. As a result, an effort was initiated to optimize the MSE assembly code to reduce the inner loop execution time. This optimization effort took two manweeks and resulted in an assembly code primitive that required slightly less than 3 operations per pixel. In this case, we were able to reduce the execution time of the simple, highly repetitive MSE function by 5X using a custom Sharc assembly code primitive. It should be pointed out that, while GEDAE provides the capability to easily build signal processing DFGs using library or encapsulated C code primitives, highly repetitive functions like MSE can benefit significantly from the development and integration of application specific assembly code primitives.
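The per-pixel cost quoted above can be made concrete with a small sketch. The Python below is illustrative only: it is not the Sharc C or assembly primitive, and the function names and the 2x2 sample data are assumptions introduced here. It computes a mean squared error between an image chip and a template, and the speedup implied by moving from roughly 15 to roughly 3 operations per pixel.

```python
def mse(chip, template):
    """Mean squared error between two equal-sized 2-D pixel arrays."""
    total = 0.0
    count = 0
    for chip_row, tmpl_row in zip(chip, template):
        for c, t in zip(chip_row, tmpl_row):
            diff = c - t
            total += diff * diff  # one subtract, one multiply, one add per pixel
            count += 1
    return total / count


def speedup(ops_before, ops_after):
    """Ratio of per-pixel operation counts before and after optimization."""
    return ops_before / ops_after


# Hypothetical 2x2 sample data, not taken from the SAIP data sets.
chip = [[1.0, 2.0], [3.0, 4.0]]
template = [[1.0, 2.0], [3.0, 6.0]]

mse_value = mse(chip, template)
loop_speedup = speedup(15, 3)  # approximate figures quoted in the text
```

The inner loop does only a subtract, a multiply and an accumulate per pixel, which is why loop overhead dominated the compiled code and why a hand-tuned inner loop paid off.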
The second major challenge faced in developing the MSE DFG was managing and distributing the large template data sets. In this case, the MSE low resolution classification DFG needed the ability to control the size and location of the target template and the image chip data in the Sharc's internal memory banks. Neither GEDAE nor Alex's operating system provided the ability to explicitly allocate and control the storage of data in the Sharc's internal memory. As a result, extensions were made to both the GEDAE and Alex software tools. Using these extensions, we were able to explicitly control the location and size of the MSE template and image chip data storage, capitalize on the Sharc's vector processing capabilities, and achieve more than 90% utilization of the Sharc's internal memory resources.
The final challenge was developing a mechanism to perform the high resolution template match by loading a single image chip and cycling multiple high resolution templates through the primitive. GEDAE's original data flow concept received and consumed equal amounts of data for each execution cycle. The MSE HRC DFG needed to be able to receive the image chip data once and cycle multiple template data sets through the primitive. This limitation was identified and extensions were developed to allow the image chip data to remain static in memory while the template data was cycled through.
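The idea of holding the image chip static while templates cycle through can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the GEDAE extension itself: the function name, the template labels and the sample data are hypothetical.

```python
def match_templates(chip, templates):
    """Score one image chip against a stream of templates.

    The chip is flattened once and stays resident; only template data
    changes from one iteration to the next.
    """
    flat_chip = [p for row in chip for p in row]  # loaded once
    best_name, best_score = None, float("inf")
    for name, template in templates:  # templates cycle through
        flat_tmpl = [p for row in template for p in row]
        score = sum((c - t) ** 2 for c, t in zip(flat_chip, flat_tmpl)) / len(flat_chip)
        if score < best_score:
            best_name, best_score = name, score
    return best_name, best_score


# Hypothetical chip and template set for illustration only.
chip = [[1, 2], [3, 4]]
templates = [
    ("template_a", [[0, 0], [0, 0]]),
    ("template_b", [[1, 2], [3, 5]]),
]
best = match_templates(chip, templates)
```

The input rates are deliberately unequal here: one chip is consumed for many templates, which is the case the original equal-rate data flow model could not express.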
The MSE DFG development effort resulted in greater than 90% processor and memory utilization. Achieving this efficiency meant overcoming a number of roadblocks, which required extensions to GEDAE as well as the Alex software tools. This is not unusual when new, emerging software tools and DSP products are being used. In most instances, developing state of the art signal processors involves dealing with emerging tools and products where limitations and shortcomings must be resolved. In developing the MSE DFG, all of the GEDAE and Alex operating system limitations were eliminated and the desired results were achieved. The final MSE DFG was efficiently mapped, autocoded and distributed across 25 Sharc processors. The final result was that the MSE DFGs, shown in figure 4-10, achieved an average execution time of less than 3 cycles per pixel, over 95% Sharc processor utilization and in excess of 90% memory usage.
Figure 4-10 Using Autocoding and Operating System Extensions, Greater than 90% Processor and Memory Efficiencies were Achieved for the MSE DFG.
While MSE was a simple repetitive function, the HDI function was highly complex, involving hundreds of computational primitives. This complexity slowed down the algorithm optimization as well as the DFG development efforts. DFG development was initiated before the final algorithm requirement analysis was completed, and these initial efforts faced a number of problems. Because the final HDI algorithm requirements had not been established, the early efforts failed to account for DFG complexity issues that arose during the final DFG development phase.
Early HDI DFG development efforts focused on identifying and developing the primitives needed for the HDI "remove prior transforms" functions. Our development process emphasized making maximum use of the existing library primitives. The HDI preprocessing functions were made up of complicated indexing, sampling and mathematical functions that did not exist as primitives in GEDAE's function library. In some instances these functions were HDI specific. In other cases, they were general purpose primitives that had not yet been incorporated in GEDAE's library. As a result, the DFG development efforts were diverted to develop the required low level library elements.
At the start of the HDI DFG development effort, GEDAE's primitive library was immature. While some functions existed, in many cases they did not support the required data types. The preprocessing functions also required variable size data types that were, at the time, not supported by GEDAE. In all, 126 primitives were identified that needed to be developed for the HDI DFGs. Of these, 95 were general purpose functions, which were subsequently added to GEDAE's library. The remaining 31 were HDI specific. Development of these primitives significantly expanded the scope of the initial HDI DFG development efforts.
Once the required primitives were implemented, development of the "remove prior transforms" DFG was initiated. The preprocessing functions included complicated indexing, sampling and mathematical functions specific to the HDI algorithm. These complex functions were captured as DFGs using low level GEDAE library primitives (e.g. add, subtract, multiply, etc.). This approach led to highly complicated DFGs with multiple levels of hierarchy. These complicated DFGs resulted in large execution schedules and increased program memory requirements. They were later replaced by HDI specific C code primitives in order to meet the memory and runtime requirements.
These early efforts taught us that a top down design approach is critical to the development of efficient DFGs. Understanding the overall algorithm requirements is essential to efficient data flow design. Literal translation of the C code functions into complex DFGs using low level primitives is not an effective approach for capturing complex algorithms. This lesson was clearly brought home when the early, highly complex DFGs and primitives had to be modified to make use of encapsulated HDI specific C code primitives to meet the memory and execution time requirements.
Figure 4-11 Use of Low Level Library Primitives to Perform Application Specific Functions Can Result in Complex, Inefficient DFGs
In summary, early HDI DFG development efforts resulted in the development of over 100 individual DFG primitives. Translation of the C code for the "remove prior transforms" functions into complex low level DFGs led to highly inefficient DFG designs. In the end, much of the work produced in this early development phase was replaced by newly developed GEDAE library primitives and/or encapsulated HDI specific primitives, which were more efficient.
When the final HDI functional analysis was completed, the final HDI DFG implementation, optimization and testing were initiated. The initial "remove prior transforms" DFGs were integrated with DFGs developed for the HDI image formation functions. These new DFGs were assembled using a combination of DFG library functions and encapsulated C code for HDI specific functions. Once the full HDI DFG had been assembled, it was compiled and executed on a single Sharc processor. This initial compilation resulted in an execution time of more than 10 seconds (versus the requirement of 1.5 seconds) and memory use in excess of 1 megabyte (versus the 512 Kbytes available).
These results presented significant challenges for achieving the desired performance. To overcome these challenges, the final HDI DFG development efforts focused on two critical aspects. The first was restructuring and modifying the DFGs to fit in the Sharc's 512 Kbytes of on chip memory. The second was reducing the HDI DFG execution time to less than 1.5 seconds. An iterative, three cycle process was used to attack these issues. Each optimization cycle focused on analyzing memory use and execution times, and on identifying and implementing DFG improvements and autocoding enhancements to reduce them. The following table shows the memory use and execution times at the end of each cycle. The following sections summarize the changes made to achieve those improvements and the lessons learned during each phase.
Table 4-3 Progression of Memory Usage and Execution Time Reductions
  | Cycle 1 | Cycle 2 | Cycle 3 |
Memory Usage (Kbytes) | 873 | 725 | 456 |
Execution Time (seconds) | 4.45 | 2.97 | 1.42 |
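As a worked check of the table against the stated limits (512 Kbytes of on chip memory and a 1.5 second execution time), the short Python sketch below uses only the figures given in the text. The variable and function names are illustrative assumptions.

```python
MEMORY_LIMIT_KB = 512   # Sharc on chip memory, from the text
TIME_LIMIT_S = 1.5      # execution time requirement, from the text

# (memory in Kbytes, execution time in seconds) at the end of each cycle
cycles = {
    "Cycle 1": (873, 4.45),
    "Cycle 2": (725, 2.97),
    "Cycle 3": (456, 1.42),
}


def meets_requirements(memory_kb, time_s):
    """True when both the memory and the execution time targets are met."""
    return memory_kb <= MEMORY_LIMIT_KB and time_s < TIME_LIMIT_S


status = {name: meets_requirements(*values) for name, values in cycles.items()}

# Execution time improvement from the end of cycle 1 to the end of cycle 3
time_speedup = cycles["Cycle 1"][1] / cycles["Cycle 3"][1]
```

Only the third cycle satisfies both constraints, and the ratio between the cycle 1 and cycle 3 execution times is a little over 3.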
During the initial cycle, the HDI DFGs were restructured in two ways. One effort focused on quickly reimplementing the HDI DFG. Code that was recognized as common signal processing primitives was replaced with GEDAE library functions. Large portions of code that were not fundamental signal processing operations were left as custom C code primitives with calls to optimized GEDAE vector functions. The focus, decompose, and MLM parts of the algorithm were encapsulated. Only small amounts of the code in decompose and MLM were replaced with vector operations. These changes were responsible for the reduction from the initial 10 second execution time to 4.45 seconds during the first cycle.
The second major change was the elimination of GEDAE family functions. GEDAE provides the ability to use families to implement repetitive functions (for loops). Families allow the user to design DFGs where the individual family subtasks can be distributed across multiple processors. Each family element is allocated static input and output memory buffers. While this is beneficial when the function is distributed across multiple processors, it leads to inefficient memory use if the function is performed on a single processor. Since the HDI DFG was targeted to run on a single processor, the use of families added significantly to the memory requirements. As a result, during the initial cycle, most of the families were removed. This change was the primary contributor to the reduction of program memory requirements from over 1 megabyte to the 873 kilobytes recorded at the end of the first cycle.
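The memory cost of families on a single processor can be illustrated with a rough model. The sketch below is an assumption-laden simplification, not GEDAE's actual allocator: the function names and the buffer sizes are hypothetical, and it only captures the point that every family element owns its own static buffers while a loop inside one primitive can reuse a single pair.

```python
def family_memory_kb(elements, input_kb, output_kb):
    """Each family element holds its own static input and output buffers."""
    return elements * (input_kb + output_kb)


def loop_memory_kb(input_kb, output_kb):
    """A loop inside a single primitive reuses one input/output buffer pair."""
    return input_kb + output_kb


# Hypothetical sizes: 8 family elements, 16 Kbyte input and output buffers.
with_families = family_memory_kb(elements=8, input_kb=16, output_kb=16)
without_families = loop_memory_kb(input_kb=16, output_kb=16)
```

With these assumed sizes the family version needs eight times the buffer memory, which is acceptable when the elements are spread over eight processors but wasteful when they all run on one.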
At the end of the first cycle it became apparent that we were not going to achieve the required execution time and memory storage requirements without changes to GEDAE, the Alex operating system, the Wideband optimized Sharc library, and the HDI DFG implementation approach.
As a result, the necessary GEDAE and Alex operating system enhancements were identified and efforts were initiated to make the changes. These enhancements were:
- Moving the memory allocation task from the embedded GEDAE kernel up onto the host processor
- Modifying GEDAE's subscheduler to allow the use of "in-place" input/output memory buffers
- Modifying GEDAE's scheduler to allow sequential subschedules to reuse the same memory resources
- Providing the capability to define and control the specific memory location and size of parameters and variables
- Modifying the Alex operating system to provide the ability to control the allocation of the Sharc on chip memory
- Revising the Alex GEDAE port software to reduce the routing tables for the individual Sharc processors
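Two of the enhancements listed above, in-place buffers and memory reuse across sequential subschedules, can be illustrated with a simple footprint model. This is a sketch under stated assumptions rather than a description of GEDAE's scheduler: the function names and the buffer sizes are hypothetical.

```python
def footprint_separate(subschedules):
    """Every subschedule keeps distinct input and output buffers, none shared."""
    return sum(inp + out for inp, out in subschedules)


def footprint_in_place_reused(subschedules):
    """In-place I/O shares one buffer per subschedule, and sequential
    subschedules reuse the same region, so the peak is the largest buffer."""
    return max(max(inp, out) for inp, out in subschedules)


# Hypothetical (input Kbytes, output Kbytes) for three sequential subschedules.
subschedules = [(64, 64), (128, 32), (32, 96)]

separate_kb = footprint_separate(subschedules)
reused_kb = footprint_in_place_reused(subschedules)
```

Because the subschedules run one after another, their buffers are never live at the same time, so the peak requirement is set by the largest one instead of the sum.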
In addition, Wideband was contacted to determine if they could provide optimized versions of their Sharc library functions. The Wideband libraries provided with the Alex boards did not take advantage of the Sharc's ability to perform multiple memory fetches in a single clock cycle. Discussions with Wideband indicated that, while they normally provided optimized C code library functions, they could provide optimized Sharc functions. Wideband was provided with a list of the functions required for HDI and agreed to furnish the required library primitives.
While the GEDAE, Alex and Wideband changes were being made, efforts were focused on restructuring the HDI DFG. These efforts concentrated on completely removing families from the HDI DFG. In addition, the C code primitives were rewritten to maximize the use of optimized vector routines. The individual HDI focus, make-looks, decompose and MLM functions were analyzed and recoded to use fundamental, optimized GEDAE primitives. The new implementations maximized the use of coarse grain vector operations to exploit the Sharc's vector processing capabilities. These changes reduced the memory storage requirement to 725 kilobytes and the execution time to less than 3 seconds. With the projected savings associated with the GEDAE, Alex and Wideband enhancements, the memory storage and execution time objectives seemed to be achievable.
During the final refinement cycle the GEDAE, Alex and Wideband enhancements were incorporated. Two problems slowed the final cycle. First, the GEDAE modifications were not completed at the start of the cycle and had to be integrated incrementally. Second, bugs discovered in the enhanced software as well as in the Analog Devices compiler had to be corrected. Once these bugs were identified and corrected, the final DFG refinements could be accomplished.
Efforts were focused on incorporating all of the GEDAE and Alex software enhancements to allow the DFG to be accommodated in the on chip memory. When the changes were incorporated, the HDI DFG total memory requirements were reduced to 456 Kbytes allowing it to fit in the on chip memory.
When the DFG could be loaded in the on chip memory, the Wideband optimized Sharc functions were integrated. These final changes involved only minor modifications to the HDI DFG. The final modifications reduced the execution time to 1.42 seconds for a single Sharc processor. When the final DFG was later integrated in the top level HighClass DFG, it ran at a rate of less than 1.45 seconds per image chip.
In summary, the final integration and optimization of the HDI DFG faced a number of challenges. The first and most significant was optimizing the memory usage to allow the HDI DFG to fit on a single Sharc processor. This requirement was critical to exploiting the Sharc's vector processing capabilities. Major enhancements had to be made to the GEDAE autocoding capabilities as well as the Alex software to allow the code to fit in the on chip memory. These tool enhancements will give future users access to and control of memory allocation, which is critical to realtime software efficiency. Using these extensions, we were able to achieve 90% memory utilization and fully exploit the vector processing capabilities of the Sharc.
The second major challenge was achieving a 3X reduction in execution time. The primary factor in achieving the execution time improvement was the use of optimized DFG functions. These included optimized HDI specific C code functions, optimized GEDAE library functions, and optimized Wideband Sharc assembly code functions. Focusing on the use of optimized functions, together with GEDAE's ability to automatically integrate them into the final embedded executable modules, was the key to achieving the final 1.42 second execution time.
In retrospect, the HDI DFG development efforts faced a number of challenges and problems, and provided a number of lessons on how to improve future DFG software development efforts. Some of the key lessons learned are:
- Initiating detailed DFG development before the top level data flow design is established can be unproductive.
- DFG design needs to reflect the final partitioning and mapping strategy and primitives must be designed to support this distribution strategy.
- The availability of the library primitives is critical to efficient DFG development.
- Primitive development requirements must be taken into account in defining the top level DFG design and development strategy.
- Attempting to maximize the use of existing library primitives for application specific functions leads to very complex, inefficient DFGs.
- Encapsulation of application specific C code functions can eliminate the need for complex, cumbersome DFGs.
A key process that emerged during the BM4 software development effort was the use of a top level DFG to develop and optimize an application's data flow design. For BM4, a top level DFG was developed, mapped and optimized for the Alex COTS DSP boards while the final MSE and HDI DFGs were still under development. This graph was constructed using time delay functions to represent the execution times for the HDI and MSE low and high resolution classification functions.
Figure 4-12 Using a Top Level Virtual HighClass DFG Allowed us to Optimize the Data Flow Design Prior to Completion of the Final HDI and MSE DFGs  
In effect, this top level DFG represented an emulation of the HighClass image processing functions that could be mapped and optimized to the prototype hardware while the final HDI and MSE DFGs were still under development. Using this DFG, we were able to identify and resolve a number of critical data flow design and integration issues before the final system integration and test. We were able to identify shortcomings in the Alex and GEDAE software tools and have them updated to provide the necessary capabilities. Once the GEDAE and Alex software had been updated, the top level DFG was used to refine and optimize the HighClass DFG design. This top level data flow development effort resulted in a hardware/software design that achieved better than 90% processor efficiency. By overcoming these shortcomings and demonstrating a highly efficient data flow design early in the HighClass DFG development effort, we avoided costly delays later in the final system integration efforts.
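The value of replacing unfinished functions with time delays can be shown with a simple load balance model. The Python sketch below is not GEDAE and does not reproduce the actual HighClass graph: the stage names, the processor counts and the MSE delays are hypothetical, and only the 1.45 second HDI figure comes from the text.

```python
def stage_intervals(stages):
    """Average time between completed chips for each stage.

    stages maps a stage name to (seconds_per_chip, processor_count).
    """
    return {name: secs / procs for name, (secs, procs) in stages.items()}


def bottleneck(stages):
    """Return the stage that limits throughput and its chip interval."""
    intervals = stage_intervals(stages)
    name = max(intervals, key=intervals.get)
    return name, intervals[name]


stages = {
    "mse_lrc": (0.30, 2),   # hypothetical delay and processor count
    "hdi": (1.45, 14),      # 1.45 s per chip from the text; count is assumed
    "mse_hrc": (0.20, 2),   # hypothetical delay and processor count
}

slowest_stage, chip_interval = bottleneck(stages)
```

A model like this shows which stage would starve the others before any real primitive exists, which is the kind of partitioning question the top level DFG was used to answer on the actual hardware.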
Once the HDI and MSE DFGs were completed, integration of the final HighClass DFG was initiated. Figure 4 - 13 shows the two cycle process used to integrate the final HighClass DFGs with the Alex DSP boards. The initial cycle focused on integrating and optimizing a "two family" version of the HighClass top level DFG. The "two family DFG" designation refers to the use of two MSE low resolution classification families for the low resolution classification function. The final "five family DFG" used 5 family elements.
The initial cycle focused on integrating two MSE-LRC families (using 8 Sharcs), 14 HDI processors, 2 MSE HRC processors and 5 individual Sharcs used to control GEDAE graph execution, perform pre and post processing functions, assemble the HDI high resolution images and assemble the HRC template sets. A total of 29 Sharcs (representing approximately 2/5 of the final system) were used to host the two-family DFG. This reduced size DFG allowed us to optimize the data queues and partitioning, and balance the HDI and MSE execution times, using a smaller, simpler configuration before attempting to implement the final 72 processor DFG.
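The processor budget for the two-family configuration can be checked with a few lines of arithmetic. The counts below come from the text; the dictionary keys are only descriptive labels chosen for this sketch.

```python
FINAL_PROCESSORS = 72  # size of the final five-family configuration

two_family = {
    "MSE-LRC families": 8,
    "HDI": 14,
    "MSE HRC": 2,
    "control, pre/post processing and assembly": 5,
}

total_sharcs = sum(two_family.values())
fraction_of_final = total_sharcs / FINAL_PROCESSORS
```

The total is 29 Sharcs, or about 40% of the final system, consistent with the "approximately 2/5" quoted above.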
Figure 4-13 Using a Less Complex HighClass DFG Decreased Final Integration; Extending to the Full DFG Took Only Two Weeks  
The smaller DFG allowed us to refine and optimize the DFG more efficiently than the full 72 processor graph. Even with the reduced size, changes to the DFG involved several hours to modify, recompile and load the new DFG, and evaluate the resulting execution data performance. By comparison, changes to the final 72 processor DFG required 4 to 6 hours, which significantly limited our ability to optimize the final graph.
Once the data queues, partitioning and load balance had been optimized for the "two family" DFG, the final 72 processor "five family" DFG was assembled and optimized. The expansion from the 29 processor configuration to the final 72 processor configuration was accomplished in two weeks.
Using the GEDAE autocoding tools we were able to achieve over 90% memory and processor utilization across the final distributed 72 processor architecture. We were able to develop, distribute, debug and optimize the HighClass DFG without writing a single line of code for interprocessor communication, memory allocation, or final code debugging. All of the executable software for the 72 Sharc processors was automatically generated, compiled, linked and downloaded by the autocoding tools. The execution and timing data needed to optimize the design was provided by GEDAE's unique execution trace table capabilities, which provided the insights needed to achieve the 90% processor efficiency.
Achieving this level of efficiency on a network of 72 tightly coupled DSPs meets or exceeds the level of efficiency that can be achieved using hand coded embedded processor software development processes. In fact, in most cases just measuring the overall network performance represents a significant amount of effort, and that effort is rarely expended to show the actual execution timing of the final system hardware. RASSP's unique software development and autocoding processes and tools not only provided the capability to demonstrate the final hardware/software performance but also provided the debugging and optimization capabilities needed to achieve this high level of memory and processor efficiency.
4.6 Overview of BM4 Manhours and Schedule
The primary purpose of the RASSP benchmark development efforts was to demonstrate the advantages, benefits and improvements associated with applying the RASSP methodology and tools for developing future signal processors. As a result, a primary requirement was to apply and demonstrate as many of the RASSP concepts and processes as possible, and to record the amount of effort required to accomplish the individual design tasks. Consequently, ATL monitored and recorded the amount of effort, time and results for each of the individual BM4 development tasks. The development results of the BM4 effort have been reviewed in the previous sections of this case study. In this final section, the level of effort and time required to accomplish the BM4 prototype development effort are discussed.
As part of the benchmark process, metrics were established to measure the time and effort expended in developing and integrating the benchmark applications. The original level of effort, proposed for the BM4 HighClass processor prototype development, was 70 manmonths (5.2 manyears). The development was scheduled to be accomplished in 9 months. As the project evolved a number of changes occurred, both technical and programmatic. These changes resulted in the total effort growing to 99 manmonths (7.9 manyears) and the schedule was extended to 17 months. The program changes and problems that caused these increases are described below. In addition, insights are provided for future users of the RASSP methodology and tools, to help estimate the effort required to accomplish a rapid prototyping development project like BM4.
The original level of effort estimated for accomplishing the SAIP HighClass processor development is shown in figure 4 - 14. The figure shows the percentage of the effort budgeted for each of the second level development tasks as well as the number of manmonths budgeted for completing the effort. Figure 4 - 15 shows the same breakdown for the level of effort that was actually expended on the individual tasks. Finally, figure 4 - 16 shows the difference between the original BM4 estimates and the actual levels of effort.
Figure 4 - 14 Summary of Proposed BM4 Development Effort  
Figure 4 - 15 Summary of Actual BM4 Development Effort  
Figure 4 - 16 Summary of the Increases to the BM4 Development Effort  
Figure 4 - 14 shows the number of manhours allocated in the original development plan and was used to estimate the cost for developing the BM4 prototype. As previously described, the RASSP process is a spiral design process, where individual risk retirement tasks are identified and performed, and the results are used to establish the most cost beneficial approach for achieving the final system design. Because the spiral design process is a flexible, iterative process, changes occurred in the BM4 development plan that impacted the effort required to accomplish the development objectives. Several significant changes, as well as unanticipated design problems and resource conflicts, impacted the amount of effort and length of time needed to accomplish the BM4 project. As shown in figure 4 - 15, not only did the amount of effort change but the distribution of the effort shifted significantly. The changes, problems and outside influences that created these differences are described in the following paragraphs.
In the case of the functional analysis effort, several issues arose that caused our initial estimate to increase by 8 manmonths. First, the HDI algorithm and executable specification were more complex than originally anticipated and took significantly longer to analyze than planned. This problem resulted in an increase of 4 manmonths in the HDI functional analysis effort. The second issue was the necessity to continue optimization efforts to support the final DFG memory and execution time optimization efforts. This support effort added another two manmonths. Finally, the functional analysis effort was expanded to include analysis and implementation tradeoffs for the MSE algorithms that were not originally planned. The proposed architecture was based on a custom board for the MSE computation. As a result, the original plan did not include a detailed implementation tradeoff effort for the MSE algorithm. When the potential benefits of a COTS DSP board implementation of the MSE "early termination" approach were discovered, two manmonths of effort were added to investigate this alternative.
The 5.5 manmonth increase in the virtual prototyping efforts was caused by two separate factors. First, the Omniview Cosmos tool is an emerging performance modeling tool. ATL had minimal experience with the tool and its library models. In addition, Cosmos was still under development during the BM4 development effort. The immaturity of the Cosmos tool and ATL's lack of experience with it added 1.5 manmonths to the Cosmos virtual prototyping efforts, spent becoming familiar with the tool and overcoming problems and software bugs. The second factor was the development of both Cosmos and lightweight VHDL virtual prototypes of the BM4 system. These dual modeling activities were undertaken to compare the two virtual prototyping approaches as well as to overcome execution time limitations of the Cosmos tools. The development of two versions of the original Mercury and final Alex system designs added approximately 4 manmonths to the virtual prototyping effort. Future users should not incur the increases caused by these tool maturity problems or the need for dual modeling efforts.
By far the most significant increase in the BM4 development effort was associated with the detailed development of the DFG and control program software. In this case the overall effort increased by nearly a manyear (a 75% increase). Four factors led to this increase. The first was the maturity of the GEDAE, Alex and ADI software tools. Like many advanced technology programs, BM4 had to deal with problems arising from the use of new, emerging tools and processors. These included immature GEDAE primitive libraries, limitations and errors in the GEDAE tools and Alex operating system, and bugs in the ADI compiler. Combined, these problems contributed approximately 3 manmonths to the increase in the DFG development effort. A second factor was the use of low level primitives and a bottom up approach to develop the initial HDI DFGs. These early efforts created memory and execution time problems and later had to be replaced with high level DFGs. Consequently, these early DFG development efforts expended 4 manmonths that were essentially lost when the early DFGs were replaced. The third factor was the set of problems encountered in optimizing the HDI DFG memory usage and execution times to achieve the 90% memory utilization and processor efficiency goals. Since BM4 was focused on a 100X increase in processor density, memory and execution time constraints placed a premium on the DFG program size and execution efficiency. These constraints resulted in requirements for extensions to both GEDAE and the Alex operating system. These extensions added to the complexity of the HDI DFG implementation effort. The effort had to be expanded to include identification of the necessary changes, debugging of the extensions, and several iterative DFG integration cycles. These complicating factors increased the DFG development effort by 3 manmonths.
The final factor contributing to the increase was a 2 manmonth growth in the effort required to integrate and test the HighClass control software with the SAIP system emulator. To fully test the HighClass prototype and perform final acceptance test, the SAIP emulator code had to be modified to provide the capability to support both fixed and random sequences of test image chips. In addition, the software had to be extended to allow the test results to be recorded and displayed for evaluation during the formal acceptance tests. These unanticipated efforts resulted in a 2 manmonth increase.
With the exception of the 2 manmonth increase to modify the SAIP system emulator, none of the problems causing the 12 manmonth increase should occur on future DFG and control program development efforts. As a result of the BM4 effort, GEDAE's libraries and tools have matured and now provide significantly more primitive functions and fewer autocoding limitations. Definition of a better, top-down DFG development process should eliminate the need to rework future DFG designs. Finally, extensions made to GEDAE for optimizing the BM4 memory usage and execution time should allow future applications to be developed, mapped and tested more efficiently.
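The four contributions described above can be tallied as a quick consistency check. The following Python sketch is illustrative only: the dictionary keys are labels chosen here, and the implied baseline is derived from the stated 75 % figure rather than taken from the original plan (the text says the growth was "nearly" a manyear, so the baseline is approximate).

```python
# Man-month increases in the DFG and control program effort,
# using the figures stated in the text. Key names are illustrative labels.
dfg_increase = {
    "tool_maturity": 3,         # immature GEDAE, Alex and ADI tools
    "early_hdi_dfg_rework": 4,  # low-level, bottom-up DFGs later replaced
    "hdi_optimization": 3,      # memory and execution-time optimization
    "saip_emulator": 2,         # emulator changes for acceptance testing
}

total = sum(dfg_increase.values())

# Only the emulator work is expected to recur on future efforts.
expected_to_recur = dfg_increase["saip_emulator"]
avoidable = total - expected_to_recur

# Assumption: the quoted 75 % growth applies to this total, which
# implies an approximate planned effort for this activity.
implied_baseline = total / 0.75

print(f"total increase:   {total} manmonths")
print(f"avoidable share:  {avoidable} manmonths")
print(f"implied baseline: {implied_baseline:.0f} manmonths (approximate)")
```

On these figures, roughly 10 of the 12 manmonths fall into categories the text expects future developers to avoid.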
The 3.7 manmonth increase in the system hardware/software integration effort was associated with two issues. The first was the unanticipated requirement to integrate an early version of the system for the RASSP Final Technical Review. Originally, the BM4 prototype was scheduled to be completed prior to this review. Slippage in the development schedule necessitated an interim integration effort before the final system was complete. This added approximately 1.5 manmonths to the system integration task. The second issue was an underestimate of the effort needed to develop and integrate the final acceptance test procedures and software, which contributed the remaining 2.2 manmonths.
Termination of the custom board development effort resulted in a net 3.4 manmonth decrease in this effort. These savings resulted from the elimination of the final custom board detailed design and fabrication tasks. However, the custom board preliminary design activities actually exceeded the original plan by approximately 6 manmonths because the effort was broadened during the architecture codesign cycle to include programmable DSP custom board designs as well as the FPGA custom logic approach. In the end these increased efforts provided the necessary tradeoff data to allow ATL to select the most efficient implementation for the MSE subsystem.
Finally, the program management/case study effort increased by 3.1 manmonths. This increase was entirely attributable to the expanded case study documentation efforts. The case study effort was originally planned to require 5 manmonths but ended up requiring just under 10 manmonths. On the other hand, the program management costs decreased by approximately 2 manmonths even though the schedule grew from nine to seventeen months.
The 8 month extension of the development schedule was the result of a number of factors. During the architecture codesign cycle the BM4 development effort was suspended for two months to address higher priority RASSP development and legacy documentation efforts. In addition, during the detailed design and integration cycle, the effort was again interrupted to prepare for the RASSP Final Technical Review and to resolve personnel resource conflicts with the Benchmark 3 development efforts. These two interruptions added a further one month of slippage. The remaining 5 months of slippage were directly related to the HDI functional analysis and HDI DFG optimization efforts. The complexity of the HDI functional analysis task added three months to the planned completion date of the functional design, while the problems encountered in optimizing HDI DFG memory usage and execution time added 2 months. While future RASSP users will be subject to slippage caused by the complexity of the application, they should not experience the schedule interruptions or DFG optimization delays faced by the BM4 development effort.
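The schedule figures above can be cross-checked in the same way. This is a minimal sketch using only the numbers quoted in the text; the dictionary keys are labels chosen here for illustration.

```python
# Schedule slippage in months, as stated in the text.
# Key names are illustrative labels, not terms from the original plan.
planned_months = 9

slips = {
    "codesign_suspension": 2,              # higher priority RASSP work
    "review_and_staffing_conflicts": 1,    # Final Technical Review, Benchmark 3
    "hdi_functional_analysis": 3,          # complexity of the application
    "hdi_dfg_optimization": 2,             # memory and execution-time tuning
}

total_slip = sum(slips.values())
actual_months = planned_months + total_slip

# Only application complexity is expected to affect future users.
application_driven = slips["hdi_functional_analysis"]
program_specific = total_slip - application_driven

print(f"total slippage:   {total_slip} months")
print(f"actual schedule:  {actual_months} months")
print(f"program-specific: {program_specific} months")
```

The total reproduces the growth from nine to seventeen months, with 5 of the 8 months tied to interruptions and optimization delays specific to BM4.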
In summary, the increase in the level of effort and the schedule extensions were caused by a number of problems. The primary factors were: expanded virtual prototyping activities and tradeoff analyses; maturity issues associated with the virtual prototyping, autocoding and DSP board tools; and the complexity of the HighClass application. In the case of the virtual prototyping and tradeoff analyses, many of the increases resulted from efforts that were added to demonstrate key aspects of the RASSP process and will not be a factor for future developers. Similarly, extensions made to the Cosmos and GEDAE tools as a result of the BM4 development effort should eliminate many of the problems that caused the growth of these efforts. Finally, lessons learned during the BM4 development effort and described in this case study should provide the insights necessary for improving the RASSP rapid prototyping concepts, techniques and tools, and lead to significant savings on future signal processing development efforts.
Approved for Public Release; Distribution Unlimited
Bill Ealy