Verilog Technical Tidbits

Abstract

This is a collection of articles from the Technical Tidbits series in the OVI OpenExchange.

Avoiding Race Conditions

One of the most difficult categories of bugs in Verilog models is the one caused by race conditions. A model with a race condition is one which is sensitive to the order in which the simulator executes events, but that order is not well-defined. That is, if two events occur at the same instant of simulated time, the behavior of the model is different depending upon which is executed first, but the order of event execution is not defined by the language semantics.

It is relatively easy to write code which is sensitive to event ordering, especially when writing behavioral models. Debugging models with race conditions is often difficult, because sometimes they work as intended, and sometimes they do not. Their behavior can change due to seemingly unrelated changes in other parts of the model, or by being run by different Verilog simulators (or different releases of the same simulator). A simple example of a race is:
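As a hedged illustration of such a race (a minimal sketch, not the article's original listing; the module name race_simple is an assumption), two initial blocks can set and display the same register at time 0:

```verilog
// Minimal sketch of a time-0 race; race_simple is a hypothetical name.
module race_simple;
    reg x;

    // Both initial blocks start at time 0. The language does not define
    // which one the simulator executes first.
    initial x = 1;
    initial $display("x = %b", x);   // may show 1 or the unknown value x
endmodule
```

Swapping the two initial statements may change the displayed value on some simulators, which is exactly the order dependency described in the text.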
In this case, there is an order dependency between setting x to 1 (at time 0) and displaying its value (also at time 0). The results of this model are not defined by the semantics of Verilog. In Verilog-XL, if you reverse the two initial statements, you will get different results. A somewhat more interesting variation of the above is:
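A sketch of this variation, under the assumption that one initial block drives x while another waits for its rising edge (the module name race_edge and the message text are hypothetical, and the line numbers cited in the text refer to the original listing, not to this sketch):

```verilog
// Minimal sketch of an edge-detection race; race_edge is a hypothetical name.
module race_edge;
    reg x;

    initial begin
        x = 1;                       // x changes from unknown to 1 at time 0
    end

    initial begin
        @(posedge x);                // armed at time 0, possibly after x changed
        $display("rising edge of x seen at time %0t", $time);
    end
endmodule
```

If the simulator runs the assignment before the second block reaches its event control, the edge is missed and the message never appears.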
In this case, there is a race between setting the value of x to 1 at time 0 (line 2) and waiting for the rising edge of x (line 4). Again, the language semantics do not define whether or not a rising edge event will be detected at time 0.

Zero-delay models provide many opportunities for creating race conditions. A common one is the use of clocked registers as follows:
    always @(posedge clk)
        q1 = d1;
    always @(posedge clk)
        q2 = q1;
It is indeterminate which of the two events will occur first, so the values of q1 and q2 might end up being the same, or they might end up being different. Furthermore, it is possible for the events to be executed in one order at one clock cycle and in the other order in another clock cycle.

A more subtle variation of the above register problem is the following:
    always @(posedge clk)    // (1)
        #5 q1 = d1;          // (2)
    always @(posedge clk)    // (3)
        #5 q2 = q1;          // (4)
At first glance, one would think that by introducing a delay, any races
in the model would be eliminated. However, the race is still on here,
and for the same reason. On the rising edge of clk, the events of lines
2 and 4 are both scheduled to occur, but the order is indeterminate. The
correct solution to the above problem is:
    always @(posedge clk)
        q1 = #5 d1;
    always @(posedge clk)
        q2 = #5 q1;
If the model is written this way, there is no race, and the two registers
act like a pipeline. However, consider the following optimization:
    always @(posedge clk) begin    // (1)
        q1 = #5 d1;                // (2)
        @(d1) ;                    // (3)
    end                            // (4)
    always @(posedge clk) begin    // (5)
        q2 = #5 q1;                // (6)
        @(q1) ;                    // (7)
    end                            // (8)
The additional line of code in each register prevents the execution
of the assignment of d to q in cycles where the input data has not changed.
This optimization can often make a significant difference in execution
time. However, this has introduced a new race at line 7. q1 will change
at clk+5, and the @(q1) will be executed at clk+5. If q1 changed first,
then q2 will not change on the following cycle. A solution to this problem
uses the non-blocking assign:
    always @(posedge clk) begin
        q1 <= #5 d1;
        @(d1) ;
    end
    always @(posedge clk) begin
        q2 <= #5 q1;
        @(q1) ;
    end
This construction will in fact always work correctly. The moral of these examples is that race conditions occur if a data value changes at the same instant of simulated time that it is sampled. Sometimes it can be difficult to recognize simultaneous use and change of a model element.

Quick and Easy Test Vectors

A large percentage of time spent in RTL and behavioral level modeling is writing and "shaking out" the individual modules of some larger model. Testing is often done "by hand" -- single-stepping through the code to make sure the main control flow works, and by writing a handful of test vectors.

The Verilog method for handling test vectors that need to be hand-written is inadequate in many cases. The current way of setting up test vectors is to read them from a text file using $readmemb or $readmemh. These system routines read a file into a memory array. The file format is very rigid in that only binary or hex numbers can be used. If hex numbers are used, they may not align in convenient fields, e.g. a 3-bit field followed by a 6-bit field.

Here is a technique for a different way of supplying and formatting the test vectors that must be written by hand. The test vectors themselves are contained in a module, perhaps called "vectors". A single vector is divided into convenient fields; each field is stored in its own register array. Example:
    module vect;
    ...
    // maxsize can be a parameter or a `define
    reg [4:0]  op           [0:maxsize];
    reg [31:0] operand      [0:maxsize];
    reg [9:0]  displacement [0:maxsize];
Macros are then used to format the fields into a convenient and readable
form:
    `define vectr1 {op[i], operand[i], displacement[i]} =
    `define vector i=i+1; `vectr1
    integer i;
    initial begin
        i = 0;
        // `vectr1 stores at the current index; `vector advances i first
        //       op      operand       displacement
        `vectr1 {5'h12, 32'h12345678, 10'h274};
        `vector {5'h04, 32'h000054ea, 10'h0};
        `vector {5'h10, 32'h05008e19, 10'h12e};
        `vector {5'h1e, 32'h000054ea, 10'h1a0};
Also, techniques can be used that modify previous values:
        `vector {5'h1e, operand[i-1] << 2, 10'h283};
        // shift last operand by 2
        ...
        // begin/end is needed because `vector expands to two statements
        repeat(5) begin
            `vector {5'h1e, operand[i-1]+1, 10'h283};
        end
        // inc oper. by 1, 5 times
        ...
This sequence implements a walking 1: 1, 10, 100, 1000, etc.
        `vector {5'h1e, 32'h1, 10'habc};
        repeat(31) begin
            `vector {5'h1e, operand[i-1]<<1, displacement[i-1]};
        end
        ...
This sequence implements a growing 1: 1, 11, 111, 1111, etc.
        `vector {5'h1e, 32'h1, 10'habc};
        repeat(31) begin
            `vector {5'h1e, (operand[i-1]+1<<1)-1, displacement[i-1]};
        end
        ...
In a driver module, often written as a shell script that instantiates
the module under test, the register array is read:
    module driver;
    ...
    // instantiate the vector module
    vect vect ();
    integer iteration;
    reg [4:0]  op;
    reg [31:0] oper;
    reg [9:0]  disp;    // same width as vect.displacement
    ...
    op = vect.op[iteration];
    oper = vect.operand[iteration];
    disp = vect.displacement[iteration];
    iteration = iteration + 1;
    ...
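Putting the fragments above together, a self-contained version might look like the following sketch. The MAXSIZE value, the $display format, and the #1 delay before reading are assumptions made for illustration, and the vector values are taken from the examples above.

```verilog
// Hedged sketch combining the vector module and a driver.
`define MAXSIZE 15                       // assumed array bound
`define vectr1 {op[i], operand[i], displacement[i]} =
`define vector i = i + 1; `vectr1

module vect;
    reg [4:0]  op           [0:`MAXSIZE];
    reg [31:0] operand      [0:`MAXSIZE];
    reg [9:0]  displacement [0:`MAXSIZE];
    integer i;

    initial begin
        i = 0;
        //       op      operand       displacement
        `vectr1 {5'h12, 32'h12345678, 10'h274};
        `vector {5'h04, 32'h000054ea, 10'h000};
        // derive three more vectors by shifting the previous operand
        repeat (3) begin
            `vector {5'h1e, operand[i-1] << 1, displacement[i-1]};
        end
    end
endmodule

module driver;
    vect vect ();                        // instantiate the vector module
    integer iteration;
    reg [4:0]  op;
    reg [31:0] oper;
    reg [9:0]  disp;

    initial begin
        #1;                              // read after the time-0 setup, avoiding a race
        for (iteration = 0; iteration <= vect.i; iteration = iteration + 1) begin
            op   = vect.op[iteration];
            oper = vect.operand[iteration];
            disp = vect.displacement[iteration];
            $display("%0d: op=%h operand=%h disp=%h", iteration, op, oper, disp);
        end
    end
endmodule
```

Reading the arrays one time unit after they are filled keeps the driver from sampling the vectors at the same instant they are written.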
Naturally, there is no reason why the vector module and the test driver module cannot be in the same file. Also, care must be taken to ensure that the bit lengths agree. Once one test jig is written, it is very easy to retool it for other modules.

Double Vision
Often it is convenient to use two windows: one for typing commands and one for viewing long traces. To do this, open two windows, one with scrolling enabled and the other without. In the scrollable window, type "tty" to get the device name of that window. The response will be something like "/dev/ttyp3". In the non-scrollable window, type the Verilog command line appended with "| tee /dev/ttyp3". The output will appear in both windows, but only the non-scrollable window will allow command input. (There is no reason why the command window cannot be scrollable, too; it is just not necessary.)

Ways to write a state machine

A common use of Verilog is to express the functionality of a finite-state machine, or fsm. These, of course, are common in logic, and are especially useful since they can be synthesized by today's logic synthesizers. In this column, we will present several different ways of writing a state machine. The point is to illustrate different ways of expressing the same behavior, and to highlight the essential characteristics.

We will assume a three-state machine with one state variable and one input. The behavior of this state machine is described by the following table:

Procedural code - Version 1

The state machine can be represented by procedural code in Verilog as follows:

Note that the state of the fsm is determined by both where the program counter is (state 0) and by a state variable (state 1 and 2). Note also that this representation is a synchronous one, where the state only changes on the rising clock edge and the input is assumed to not change at the clock edge. We can further note that this is pretty ungainly code, and not very easy to understand.

Version 2

A more straightforward representation is as follows. In this style there is an explicit state variable.
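As a hedged illustration of this style, the sketch below uses an assumed transition table: state 0 advances to state 1 when the input is 1, state 1 always advances to state 2, and state 2 always returns to state 0. The table, the module name fsm_v2, and the port names are assumptions and are not taken from the article.

```verilog
// Sketch of an explicit-state-variable fsm; the transitions are assumed.
module fsm_v2 (clk, in, state);
    input        clk, in;
    output [1:0] state;
    reg    [1:0] state;

    initial state = 2'd0;

    // One transition per rising clock edge; in is assumed stable at the edge.
    always @(posedge clk)
        case (state)
            2'd0:    if (in) state = 2'd1;   // stay in state 0 while in is 0
            2'd1:    state = 2'd2;
            2'd2:    state = 2'd0;
            default: state = 2'd0;           // recover from an unknown state
        endcase
endmodule
```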
Each state transition begins at the top of the always loop, so there is no state information contained in the program counter.

Again, this assumes that the input has changed at a time other than the rising edge of the clock, and it is being sampled at the rising edge. Note that this code is executed on every rising clock edge, regardless of whether or not the input has changed. Note also that this is necessary for correct behavior when in states 1 and 2, though not in state 0.

Version 3

It is often desirable to split the operation of the state machine into two parts, one part which computes the next state, and the other which changes the state variable. This would be written as follows:

This version has split the operation into state change and new state computation. When written this way, the new state computation is only done on those cycles where the input or the state changes. That is, if there are many cycles where the fsm is in state 0 and the input remains constant at 0, the state computation will not be done. Note however, that the state updating always takes place.

Version 4

A stylistic variation of the previous version is as follows:

The advantage of this is that the operation of the state computation is clear. It is also apparent, when written this way, that the next variation is equivalent.

Declarative code - Version 5

Note that newstate must be declared as a wire. We can see from the declarative version of the state machine, there are some other variations which are possible. One is:

This version is somewhat more efficient than the previous one which uses a function, since it avoids the function call overhead, but not all new state functions are simple enough to fit into this formulation.

Version 6

This variation is more efficient than the preceding one if state 0 occurs with input 0 a significant percentage of the time. If that is not the case, it will be slightly more expensive.
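The split between next-state computation and state update described in Versions 3 to 5 might be sketched as below. The names newstate and StateFunc follow the text; the transition table (0 to 1 on input 1, 1 to 2, 2 to 0), the module name fsm_split, and the port names are assumptions for illustration only.

```verilog
// Sketch of a declarative next-state computation with a clocked update.
module fsm_split (clk, in, state);
    input        clk, in;
    output [1:0] state;
    reg    [1:0] state;
    wire   [1:0] newstate;               // must be a wire: driven by assign

    // Assumed next-state function.
    function [1:0] StateFunc;
        input [1:0] cur;
        input       inp;
        case (cur)
            2'd0:    StateFunc = inp ? 2'd1 : 2'd0;
            2'd1:    StateFunc = 2'd2;
            2'd2:    StateFunc = 2'd0;
            default: StateFunc = 2'd0;
        endcase
    endfunction

    // Re-evaluated only when state or in changes.
    assign newstate = StateFunc(state, in);

    initial state = 2'd0;

    // The state update itself happens on every rising clock edge.
    always @(posedge clk)
        state = newstate;
endmodule
```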
Note again how dependent this code is on the assumption that input does not change at the same time as the rising edge of clock.

Cascaded State Machines

The preceding style is fine for a single state machine, but what about the case where the output of one state machine is the input to the next? In that case, some care must be taken to make sure that the input of each fsm does not change at the same time as the state variable is updated. To address that problem, consider the following two fsms:

This set of fsms violates the assumption, because the input to the second fsm, state 1, changes on the rising clock edge. This can be fixed as follows:

where #delay is equal to some value less than the clock period. There are a variety of ways to accomplish the same thing, but all of them have the property that they cause another event to occur after the rising clock edge. Here are some possible solutions:

Using a delay off the rising clock edge:

Using the falling clock edge:

Of these choices, the first one using the continuous assignment is probably the clearest. The last one, using the falling edge of the clock, is the most robust, since it does not depend on the relationship between delay and the clock period.

There is another solution which does not involve the introduction of a new variable, and that is to delay updating the state variable:

This solution has the advantage that no signals change on the clock edge; a signal changing on the clock edge is usually dangerous. Notice that the value of newstate is sampled at the clock edge, and that is what is used to update the current state, even though newstate may change during delay. This could also be written as:

Question: why does this formulation require a non-blocking assign, rather than a normal blocking assign with an intra-assignment delay?

We can observe that this formulation causes StateFunc to be evaluated twice during the cycle, once when state changes and once when input changes.
In many cases, it is worthwhile to avoid doing the computation twice. The following formulation will do the trick (similar to Version 4, but using the falling clock edge):

This works if delay is less than half the clock cycle, and input changes before the falling clock edge.

Conclusion

We can generalize about the above styles as follows.
* compute the new state from the current state and the input
* update the current state with the value of the new state when sampled at the clock edge
* make sure inputs do not change on the clock edge
By following these rules, races will be avoided and the logic will be fairly clear and relatively efficient to simulate.

A Better Method for Viewing Simulation Wave Forms

Cadence Design Systems' Verilog-XL simulator provides two methods for viewing signals as graphical wave forms. The most popular method is based on the $gr_waves command. A lesser known but much more productive method is based on a wave form viewing tool used in conjunction with the $dumpvars command. This article describes the techniques for using a wave form viewing tool, such as Design Acceleration's Signalscan, in conjunction with the Verilog $dumpvars command. Other Verilog simulators, such as Chronologic Simulation's VCS, also provide the $dumpvars feature to support the wave form viewing methods described. Other wave form viewing tools, both commercial and in-house, can also support this methodology.

Traditional Method ($gr_waves)

The traditional Verilog wave form viewing method utilizes the Verilog $gr_waves command. This method provides rudimentary signal display, but has some significant drawbacks. The $gr_waves method requires a designer to enter every signal he might want to view into a $gr_waves command before starting the simulation. It is not possible to view a signal if it was not entered into a $gr_waves command. Often a designer will need to view additional signals to locate a design problem.
With the $gr_waves methodology he is forced to re-run the simulation after adding additional $gr_waves entries. This can make locating a design problem an extremely lengthy process. If a long simulation must be run several times to observe all the necessary signals, a designer's productivity will be severely reduced.

A Better Method

A better method to view signals as wave forms is available when Verilog is used in conjunction with a wave form viewing tool. The Verilog $dumpvars command is used to generate a value change dump (vcd) file which stores the data for a large number of signals -- often 1000 times more signals than with $gr_waves. In fact, it is common to create a vcd file which contains all the signals in a system design or chip design.

Once the vcd file has been created it can be displayed with the wave form viewing tool. The file can also be examined and displayed while the simulation is proceeding if post processing is not acceptable. In the wave form viewing tool a much larger selection of signals is available to the designer. This makes it less likely he will have to re-run a simulation to obtain additional signals if he discovers a design problem. Instead, the signals can be added to the waveform display based on the data already in the vcd file.

VCD File Generation

The Verilog $dumpvars command is used to create the vcd file for the wave form viewing tool. Execution of a $dumpvars command causes the simulator to create the vcd file and store signal transition information into the file as the simulation proceeds. Two common methods are used to generate a vcd file containing a design's signals. The $dumpvars command can be placed in the top most module of the design's hierarchy as shown below:

The $dumpvars command tells Verilog to store signal transitions for all signals in the current module and all modules below the current module into the vcd file.
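A minimal sketch of the first method, assuming a top-most module named top with a simple clock (the module name, the file name top.vcd, and the 100 time unit run length are illustrative assumptions):

```verilog
// Sketch: $dumpvars placed directly in the top-most module.
module top;
    reg clk;

    initial begin
        $dumpfile("top.vcd");   // optional: choose the vcd file name
        $dumpvars;              // no arguments: dump every signal in the design
        clk = 0;
        #100 $finish;
    end

    always #5 clk = ~clk;
endmodule
```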
The $dumpvars command can also be placed in a separate module to avoid cluttering the top most module. This is shown below:
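A sketch of the second method follows. Here top stands in for the name of the top-most module, and the separate module is called dump; both names, and the small clock generator, are assumptions for illustration.

```verilog
// Sketch: a stand-in design plus a separate module that only starts the dump.
module top;
    reg clk;

    initial begin
        clk = 0;
        #100 $finish;
    end

    always #5 clk = ~clk;
endmodule

module dump;
    // Level 0 means: this module and everything instantiated below it.
    initial $dumpvars(0, top);
endmodule
```

Both modules are simulated as top-level modules, so the design itself does not need to be edited to enable or disable dumping.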
The <top_most_module_name> is the name of the highest module in the design's hierarchy. Both of these methods cause the Verilog simulator to create a vcd file containing all the signals in the design. Note that the $dumpvars command does not require the specification of any signal names. Instead signals are specified by their modules and their hierarchy. This eliminates the very tedious task of entering every signal name as is necessary with $gr_waves. After the simulation has proceeded to a point of interest the wave form viewing tool can be used to display the vcd file contents and view the simulation results.

Performance

Verilog simulators will normally encounter a performance penalty when storing data for wave form viewing. A performance penalty is observed for both the $gr_waves and the $dumpvars methods. Different simulators will obviously have different performance characteristics. The following data was obtained for comparison between $gr_waves and $dumpvars using Cadence Design Systems' Verilog-XL simulator. The simulator was run on a Sun 4 Sparcstation II and the execution times of the two different wave form display methods were recorded. The results show a significant decrease in simulation time when using $dumpvars in conjunction with a wave form viewing tool. See table 1.

Two different designs were benchmarked. Both designs were synthesizable Register Transfer Level (RTL) models which included a small amount of behavioral logic to provide stimulus. Test Case one and two were simulations of design #1. Test Case three and four were simulations of design #2. Test Case one compares the two wave form viewing methods with an equal number of stored signals. The results show $gr_waves to be significantly slower than $dumpvars (72 seconds versus 9 seconds). Test Case two compares the two methods again with the $dumpvars method storing many more signals, but still performing significantly better.
Test Case three is a comparison of the two viewing methods with a similar number of stored wave forms using design #2. Test Case four is indicative of the most common use of the $dumpvars method, and once again the $dumpvars method is much faster. A very large number of signals are stored using $dumpvars (19055 for $dumpvars versus 312 for $gr_waves) yet even in this extreme case the performance of the simulation with $dumpvars is better.

Performing a simulation without storing any signals for wave form viewing will generally obtain the best simulation time. A typical slow down when using $dumpvars in Cadence Design Systems' Verilog-XL is 10% to 30%. This assumes that a significant portion of the signals (20-60%) in the design are being stored into the vcd file. The slow down, however, is design dependent and this rule of thumb estimate is based on synthesizable RTL type designs. Note that other simulators may have drastically different performance variations when using a $dumpvars feature.

VCD File Size

The wave form viewing tool used in conjunction with $dumpvars requires a vcd file stored on disk. This file can grow quite large if a long simulation history or an extremely large design is simulated. Some sample sizes are shown in the chart below:
Test Case one is a short simulation which requires little disk space for the vcd file. Test Case two is a long simulation storing a similar number of signals in the vcd file. Its disk space requirements are larger (28.4 MB). Test Case three is a relatively short simulation which stores a large number of signals (86199). Its disk space requirements are modest at 6.4 MB. Test Case four is a short simulation which stores an even larger number of signals. It requires modest amounts of disk space (13.8 MB). Test Case five stores a large number of signals and simulates for a significant amount of time. Its disk space requirements, at 298 MB, are relatively large.

Although vcd files larger than 0.5 gigabytes are used in some design environments, it is generally best to keep vcd files below 100 MB in size. Larger sizes tend to slow the wave form viewing tool significantly. When simulating very large designs it may not be reasonable to create a vcd file which contains every signal. The disk space requirements may grow too large. The $dumpvars command, however, provides convenient methods for selecting the signals to store in the vcd file. One format of the $dumpvars command is $dumpvars(<number_of_levels>, <module_name>); The <number_of_levels> is the number of levels of hierarchy below the <module_name> to store in the vcd file. The <number_of_levels> should be set to zero to store all modules and signals below <module_name> (an infinite number of levels).

A designer would have available all the signals which connect the chips of the system and all the signals in the chip he is designing. He would not, however, have available the signals in any other chips in the system design.

The second technique for reducing vcd file size is to limit the depth of the vcd file dump. It is rare that a designer needs to examine the internal workings of a gate or flip-flop. Often statements like $dumpvars(4, <module_name>); can be used to eliminate the internal workings of flip-flops or gates
from the vcd file. This statement will limit vcd storage to four levels
of hierarchy. Eliminating the lowest level of the hierarchy can reduce
the vcd file size by more than 50%.
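As a hedged sketch of a depth-limited dump, the example below builds a small three-level hierarchy and limits the dump with the levels argument. The module and instance names are hypothetical. Note that in IEEE 1364 a levels value of 1 dumps only the named module itself, so the value 2 used here covers top and the modules instantiated directly in it.

```verilog
// Sketch: limiting the depth of the vcd dump; all names are illustrative.
module flop;                     // lowest level, e.g. a flip-flop model
    reg q;
    initial q = 0;
endmodule

module chip;
    reg busy;
    flop u_flop ();
    initial busy = 0;
endmodule

module top;
    reg clk;
    chip u_chip ();

    initial begin
        // Dump top and u_chip, but not the internals of u_flop.
        $dumpvars(2, top);
        clk = 0;
        #20 $finish;
    end

    always #5 clk = ~clk;
endmodule
```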
Note 1: $freeze_waves was used to reduce display update time for the $gr_waves method.
Note 2: The computer contained enough physical memory such that swapping did not occur for any of the simulations.

TABLE 1. Performance Comparison Between $gr_waves and $dumpvars

Conclusion

We have described an alternative to $gr_waves for viewing wave forms in Verilog simulators. This better method utilizes an additional wave form viewing tool in conjunction with the Verilog $dumpvars command. The alternative method provides for storage of many more signals and for faster simulation times (in Cadence Design Systems' Verilog-XL). These factors both create a higher productivity environment for designers.
Copyright Rajesh Bawankule 1997-2013