Up:Case Study Index Previous:10 References

SAR Case Study

Appendix

A.1 Introduction to Data-Flow Methodology

What is data flow and how does the application developer design with it? The data-flow design philosophy has existed a long time, although it may not always have been called "data flow." Anyone who has used block diagrams to help explain or solve a problem has used something similar to data-flow techniques.

Think of a machine or a system that performs a function and can be explained by a block diagram, like a data modem. The first step in creating a block diagram is to functionally decompose the design. Figure A-1 shows an entire modem communications system containing a source, modulator, transmission line or "channel," demodulator, and display. Each box can be a subsystem, a single function, or a representation of some physical phenomenon like a communications transmission (atmosphere or fiber-optic cable).

Figure A- 1: Block diagram of a communication system.

In this diagram, the output of the source goes into the modulator, whose output is transmitted across a channel to a demodulator and then displayed. The output port of one function block is connected to the output port of the next block. Until the source starts to generate output, nothing occurs. Likewise, the activity of each block in the chain depends on the input of the previous block. Also, each functional block may have to wait until it receives a "certain amount" of information before it begins processing; this is called the threshold or "firing granularity." For example, if the modulator encodes more than one bit of information at a time - for example, two bits - it has to wait for two bits of data from the source. Another example is a Fast Fourier Transform (FFT). A radix-2 FFT must have 2^Ndata tokens before it can execute. Until the function's input port receives that much data, the function cannot execute. Therefore, the FFT has a threshold or firing granularity of 2^N.

For a function to execute, the output port must have a place to go to continue the data flow. This is an important idea in data-flow design. Data-flow graphs (DFGs) are "data driven." For a function to execute, all inputs must be available and there must be some place to put the output. This lends itself to the idea of information flowing through a system.

In a DFG environment, an application can be designed by installing some basic building blocks and then adding specifics in a hierarchical, top-down fashion. The user can design some hierarchical boxes that will be used later in larger graphs, bottom-up. Or an entire application can be designed in a single level.

A.1.1 Control-Driven Versus Data-Driven Computing

Conventional programming is based on the concept of being control-driven. Applications execute each instruction under sequential program control. Programs execute the instruction at a location indicated by a program counter. Execution continues in the sequence of instructions pointed to by the changing address locations in the program counter. An instruction executes only when its address is in the program counter. Each program instruction executes and saves the result in some memory location. Results are passed from instruction to instruction, primarily by reference to the memory addresses where the results are stored. An instruction executes in its turn and the program assumes that the correct data exists in currently addressed memory locations. Everything has to be ready when the program statement executes. In this sequential control-driven model, there is a single thread of control that is passed from instruction to instruction by the program counter.

In the parallel control-driven model, special parallel control operators are used to explicitly specify parallelism. These operators allow more than one thread of control to be active at a time and provide synchronization for these threads. In theory, there are some typical characteristics identified for the sequential and parallel control-driven models.

Data is passed by reference between instructions.
Control flow is implicitly sequential, but special control operators can be used explicitly for parallelism.
Program counters are used to sequence the execution of instructions in a centralized control environment.

Programs that have a control-driven computation organization have complete control over function sequencing. Synchronous computations are performed in the control-driven paradigm using centralized control. In a conventional programming methodology, a declarative language is used and an intelligent compiler is needed to generate the centralized control and optimization processes.

Data flow is based on the concept of data-driven computation. In a data-driven computing environment, functions execute when data is available. The main difference between control-driven and data-driven paradigms is that the program counter drives the former and data availability drives the latter.

In the data-flow approach, a functional programming environment generates a flow graph. The flow graph generates code with distributed control, which is mapped onto an architecture. Data-flow computations have a data-driven organization characterized by an inspection stage. Input/output queues are inspected for data availability and for a place to put results, accordingly. Once these criteria are satisfied, the data immediately execute if the processor is available.

The data-driven concept implies that many functions can be executed simultaneously and asynchronously. That is, functions can execute in parallel on an appropriate architecture. A high degree of implicit parallelism may exist in data-flow applications. Information in a data-flow application are packets of data tokens. A function is represented by the graphic on a canvas and destinations of its data are shown by connections. A data token is composed of a resultant value and many of these tokens are passed among the various resources in a data-flow application. Functionally decomposing any application into basic building blocks provides great flexibility in design and reuse and allows multiple designers to work on the same project. Also, the use of the "family" notation greatly increases a designer's ability to quickly implement large algorithms containing regular structures.

A.1.2 Data Flow Design

Data-flow programs are typically represented by directed graphs that show the flow of data between functions. Data-driven models also have some typical characteristics:

Intermediate or final results are passed directly as data between functions. Data is passed by value, not by reference.
There is no concept of shared data.
Program sequencing is constrained only by data dependency among functions.

A DFG is a directed graph whose nodes correspond to operators, and connections are pointers for forwarding data tokens. The graph demonstrates sequencing constraints (consistent with data dependencies) among instructions. In a data-flow application, DFGs represent the program. The firing rule of functions is based on data availability.

A DFG can be thought of as a collection of nodes, each node having input/output data ports, and a collection of arcs, connecting output ports to one or more input ports. A DFG operates by " firing" nodes that are " ready. " A node becomes ready when it has sufficient data on its input ports and enough room on its output ports to fire. When a node fires, it may " consume" data from one or more of its input ports and " produce" data on one or more of its output ports. When a node fires and executes its " function, " there may be some type of data transformation between input/output data. Data on an output port is transferred to the destination input ports of other nodes to which it is connected via arcs. When a node fires, it changes the state of the system by consuming and producing data. This, in turn, may make other nodes ready to fire. There may be many nodes that are ready to fire at a given time. The process of creating the sequence in which nodes fire is called " scheduling. " In multi-processor systems, the process of distributing nodes to different processors on which they will fire is called " mapping. " There are numerous scheduling and mapping strategies. The applicability of a given strategy depends greatly on the type of nodes of which the DFG is composed, target hardware on which the graph is to be scheduled, and type of communication paths on which the data may travel.

A.2 Autocoding the Synthetic Aperture Radar (SAR) Using GEDAE™

This section describes a step-by-step construction of the SAR application using the GEDAE™ (Graphical Entry Distributed Applications Environment) tool set. This work occurred one year after the SAR benchmark was originally completed. The resulting autocoded application achieved the same execution and memory efficiency as the hand- coded version but with about a 10X reduction in implementation time. The same GEDAE™SAR application was successfully remapped to three other architectures without modifying the original DFG. This was done by simply repartitioning and remapping the application to the new architecture. These remappings were accomplished in only a few hours.

A.2.1 Building the SAR Application

The SAR application had four parts: simulated source (in place of a real time source/sink), simulated display (instead of a real radar scope terminal), and two data-processing segments. The interesting part of the application design was building it in sections that could be concurrently executed; therefore, the distribution method was important.

This type of processing typically involved Single Instruction, Multiple Data (SIMD) calculations. The same small program code was executed again and again using different data. Instead of doing the calculations over and over, some of them were done concurrently. The number of concurrent processing pipes was the number of families in the application. The number of families could also be the number of processors the application would execute.

Start the construction of a graph named SAR by performing the following:

- sar_source
- sar_range
- sar_azimuth
- sar_sink

Figure A- 2: The top level data flow graph of the SAR application.

The SAR image processor creates a two-dimension image using the signal from an airplane-mounted radar pointed sideways toward the ground. One dimension, called range, is along the line-of-sight of the radar antenna. The second is parallel to the ground path of the airplane. The processing is done in two steps. First, in the range dimension the reflection signal from each radar pulse is processed. After processing some number (64 in this example) of pulse returns, the processed data sequences are lined up alongside the previous frame and azimuth processing is done in the other dimension. One half of the result in each azimuth sequence is saved for the output. For this vector-oriented processing the data must be transposed, or corner-turned, between the range and azimuth processing sections.

A.2.1.1 The SAR Source

Instead of having a real radar source, the input data was simulated by reading a formatted file of radar return-like data. The data was converted to vectors for each scan, which could be independently processed. The next part was to demultiplex each scan line down a different path to be processed concurrently. Because the file contained information about only one image instead of a continuous stream of images, developers used the image many times. Hence, the x_repeat box. This box was specially created to take as input a specified amount of data and continuously output that data. See Figure A-3.

Enter the image data by performing the following:
Open the hierarchical box sar_source.
Add the function boxes:

stream / source / x _ scanf
stream / complex / x _ repeat
vector / complex / x _ vx
vector / complex / vx _ demux
vector / complex / vx _ x

Add the data objects:

input const int C
input const int R
input const int D
local const int N = R*C / D
output out

Set the family index, Edit Set Families . . ., of the objects vx _ x, x _ repeat, and out to "I."

Figure A- 3: Simulating SAR input data.

x _ scanf>out x _ vx < in
const int C x _ vx < N
x _ vx>out vx _ demux < in
vx _ demux>out vx _ x < in
vx _ x>out x _ repeat < in
const int N x _ repeat < N
x _ repeat>out out

A.2.1.2 Synthetic Aperture Radar Range Processing

This part of the processing takes a complex FFT on a zero-padded fixed-size vector. After the FFT was calculated, developers cut the vector in preparation for the "matrix transpose" prior to azimuth processing. The bulk, or coarse granularity, of the transpose would be accomplished through the connections. These connections would be between different processors.

Process radar range by performing the following. Figure A- 4, SAR Range Processing, is the DFG that results from performing this effort.

embeddable /vector /complex /x _ vx
embeddable /vector /complex /vx _ multV
embeddable /vector /complex /vx _ zfill
embeddable /vector /complex /vx _ fft
embeddable /vector /complex /vx _ fft

input in
input const int N
input T
local const int n = log(N) /log(2.0)
local const int M = 1 < < n
local const int n1 = M < N ? n+1 : n
local const int M1 = 1 < < n1
local const int Pad = M1 - N
output out

Edit Set Families

out

Figure A- 4: SAR Range Processing.

Make the connections:

in x_vx<in
const int N x_vx<N
x_vx>out vx_multV<in
T vx_multV<V
vx_multV>out vx_zfill<in
const int Pad vx_zfill<Pad
vx_zfill>out vx_fft<in
vx_fft>out vx_cut<C
vx_cut>out out

A.2.1.3 Synthetic Aperture Radar Azimuth Processing

This segment applies a Taylor window and then completes the matrix transpose at a finer level. It puts the matrix pieces back together and divides them into vectors. An FFT is performed on the vectors with a specially designed AZimuth KERnel (azker coefficients) window applied. An inverse FFT is done before the vector is formatted for output. (Details have been left out. )

Process radar azimuth by performing the following. Figure A- 5, SAR Azimuth Processing, is the DFG that results from performing this effort.

embeddable/vector/complex/vx_mux
embeddable/matrix/complex/vx_mx
embeddable/matrix/complex/mx_trans
embeddable/delay
embeddable/matrix/complex/mx_adjoin
embeddable/matrix/complex/mx_vx
embeddable/vector/complex/vx_fft
embeddable/vector/complex/vx_multVX
embeddable/vector/complex/vx_ifft
embeddable/vector/complex/vx_subvec
embeddable/vector/complex/vx_x

Figure A- 5: SAR Azimuth Processing

Add the data objects:

input in
input const int R
input A
local const int one = 1
local const int zero = 0
local const int Rm1 = R-1
output out

Set the family index of input in to "i."
Make the connections:

in vx_ mux< in
vx_ mux> out vx_ mx< in
const int R vx_ mx< R
vx_ mx> out mx_ trans< in
mx_ trans> out mx_ adjoin< a
mx_ trans> out delay< in
const int one delay< n
delay> out mx_ adjoin< b
mx_ adjoin> out mx_ vx< in
mx_ vx> out vx_ fft< in
vx_ fft> out vx_ multVX< in
A vx_ multVX< VX
vx_ multVX> out vx_ ifft< in
vx_ ifft> out vx_ subvec< in
const int zero vx_ subvec< L
const int Rm1 vx_ subvec< U
vx_ subvec> out vx_ x< in
vx_ x> out out

A.2.1.4 The Synthetic Aperture Radar Display

Process radar display by performing the following. Figure A- 6, SAR Display (Sink) is the DFG that results from performing this effort.
Open the hierarchical box sar_sink.
Add the function boxes:

stream / complex / x_copy
vector / complex /x_vx
vector / complex /vx_mux

Add the hierarchical box: my_work/display_sar
Add the data objects:

input in
input const int C

Set the family index, Edit Set Families. . ., of the objects in, x_copy, and x_vx to "j."

Figure A- 6 : SAR Display (Sink)

Make the connections:

in x_copy < in
x_copy < out x_vx < in
const int C x_vx < N
x_vx > out vx_mux < in

Open the hierarchical box display_sar. Figure A- 7, SAR Sink Display, is the DFG that results from performing this effort.
Add the function boxes:

vector/complex/vx_x
stream/complex/x_mag
vector/s_v
vector/sink/v_disp

Add the data objects:

input in
input const int C

Figure A- 7 : SAR Sink Display

Make the connections as shown:

in vx_x < in
vx_x > out x_mag < in
x_mag > out s_v < in
const int C s_v < N
s_v > out v_disp < in

Close the hierarchical box display_sar.
In the sar_sink box finish making the connections.

vx_mux > out display_sar < in
const int C display_sar < C

A.2.1.5 Finishing the Top Level

Add the remaining data objects by performing the following (These data will define all the family sizes and operating parameters). Figure A- 8, completed SAR Application DFG, is the final SAR DFG that results from performing this effort.

local const int n = 2
local const int m = 2
local const int N = 1 < < n
local const int M = 1 << m
local const int C = 234
local const int R = 64
range i = 0..N-1
range j = 0..M-1
local const float taylor[] = . . .
local const complex azker[] = . . .

The local variable taylor[] is an array with 234 floating-point values. The local variable azker[] is an array with 128 complex number values. Instead of typing in 362 values in two arrays, developers have included in the FGlibraries/datafiles the files: " taylor_vals " and " azker_vals. " These are two ASCII files that can cut and paste the multiplicative window coefficients.

Figure A- 8 : Completed SAR Application DFG

Make the connections as shown:

const int C sar _ range < N
const int C sar_ source < C
const int R sar _ azimuth < R
const int R sar _ source < R
const int R sar _ sink < C
const int N sar _ source < D
float taylor [] sar _ range < T
complex azker[] sar _ azimuth < A
sar _ source > out sar _ range < in
sar _ range > out sar _ azimuth < in
sar _ azimuth > out sar _ sink < in

Double click on the family connection between the sar _ range and sar _ azimuth boxes. The connection pattern made is between a family of boxes indexed on "i" with a family output to a family of boxes indexed on "j" with a family input. This does not create an "all to all" connection scheme. Instead, the undefined family I/O indices are instantiated with the index of the box to which it is connected. The connection pattern is then between [j]i ' [i]j. Figure A- 9 the defined data-path pattern when this connection is made.

Figure A- 9: The j-th output of the i-th sar_range box is connected to the i-th input of the j-th sar_azimuth box.

A.2.2 Running the Graph

Run the graph by selecting Control Run . Figure A- 10, output of SAR Graph, displays the results of the SAR Processing that was performed on the input data file.

Figure A- 10 : Output of SAR Graph.

A.2.3 Distributing or "Embedding" the Graph

After the graph is built and tested, developers can look at distribution and performance on an " embedded " system. This section describes how to perform the techniques necessary to embed an application. There is no one way to get efficient performance, achieving that is application and hardware specific. Methods used to achieve higher performance are more important than performance itself.

It is instructive to see what schedule of events the application executes, and memory map it uses before changes are made. Select View Groups. Depending on the options chosen, the function box titles will change color to reflect its " group. " A group is the largest collection of connected embeddable boxes that can have their execution scheduled in the same " Static Schedule. " Each group can be partitioned for greater execution control.

Embed an application by performing the following:
Select the sar_range box or any box connected to sar_range with the same color.
Select Options Group Control... (GC) The " Group Control " window opens Figure A- 11.

Figure A- 11 : Group Controls Window

Select GC Display Schedules. . . . The " Static Schedule " of events that opens lists every function and every data transfer that is required for the execution of the application. Because the application has not been partitioned into smaller execution units, GEDAE™ executes, by default, the application on a single host processor. The application will execute on a single processor even if the host is made up of multiple processors. To take advantage of multiple internal processors, the application must be partitioned and the partitions mapped to separate processors.

Figure A- 12 : Before partitioning the application, a single schedule containing all the execution events must be created, by default, to run the graph.

Figure A- 12 is the schedule of events for this application. It has all the embeddable functions in a single partition executing on a single processor with one memory map. This single schedule should be used as a baseline when comparing performance of distributed execution. This is what happens when you do not distribute.

The execution times, " f_time " and " total_time, " are only valid after executing the graph with Utilities Enable Trace turned on.

A.2.3.1 Partitioning

Perform the following to partition the elements of a group forming smaller units and then have each unit execute on a different processor:

GC Partition Group. . .

Figure A- 13: The function boxes partitioned according to their family index.

sar_range

sar_azimuth

Options Set Partition. . .

A.2.3.2 Mapping

After partitioning the application, the next step is to select on which processor each partition will execute. In this dialog, assign a " processor number " corresponding to the partition name. The processor number is a specific designation from the " embedded/embedded_config " file and it is a hardware-specific name listed under listed under " Processor_Desc. "

Select which processor each partition will execute by performing the following:

Select GC Map Partitions. . .

Figure A- 14 : Map Partition Window

A.2.3.3 Transfer methods

The transfer methods are those paths used to send data between functions that are in different partitions that are executing on different processors. The number of these transfers is dependent on the number of connections that break the " partition boundary " on the graph. Any time a function box in one partition must send its output to a function in another partition, the action constitutes crossing the partition boundary. The different types of methods available to transfer data along these connections are dependent on the hardware on which the application is running.

Cross a partition boundary by performing following:

GC Set Transfer Methods. . .

Figure A- 15 : Transfer Methods Selection Window

The available data paths that can be used are set in the configuration file "embedded/embedded_config." This file details the types of communication paths available to the system in the "Communication_Desc" section.

A.2.3.4 Schedule

View some results by performing the following (Now that all of the performance enhancing techniques and characteristics have been entered:

GC Display Schedules

Figure A- 16: Group O Static Schedule

A.2.3.5 Embedding

Section A.2.3 was a prelude to actually embedding and running an application on multiple processors. Because a schedule was created without selecting any particular operating characteristics, Figure 6.2-5, developers could have skipped Sections A.2.3.1 through A.2.3.4. As simple as GEDAE™ has made it to partition and map an application, its even easier to run the graph on multiple processors once partitions are assigned to processors (Ref. Section A.2.3.2).

Once a single embeddable function box in a graph is used, a group is created and its execution characteristics can be selected. When selections are made, a schedule of execution events is created. When the graph is run, the static schedule of events is used to generate C -language program code. The code is compiled using the native compiler of the machine on which that code will execute. When GC Run on Embedded, a separate polled process, is started whose function is to perform the events in the schedule.

Up:Case Study Index Previous:10 References

Page Status: in-review, January 1998 Dennis Basara