Today, for a change of pace, I'm reading Microsoft's
ECMA submissions
of the .NET Platform technologies (C# and Common Language Infrastructure).
Today we're celebrating this site's 100,000th page view.
Thank you for your interest.
Three years ago I
noted
that
multiprocessors-on-an-FPGA had become feasible; later I sketched
a strawman FPGA-array-supercomputer.
The last pages of my
Computer Architecture Education workshop paper
also discuss FPGA multiprocessors.
More recently I wrote that even in an era of embedded hard processor cores,
arrays of soft CPU cores will still
play an important role.
This evening, as a lark, I designed an 8-way MP-on-a-chip.
I took the work-in-progress
GR0000 16-bit RISC core,
which is now floorplanned as 8 rows by 6 columns of CLBs, plus
two block RAMs (which provide a byte-addressable 16-bit wide
1 KB dual-ported embedded instruction/data memory), and simply instanced
eight of them as 2 rows by 4 columns of processors plus memories,
in the smallest Virtex-E part (XCV50E), which provides 16 rows by 24
columns of CLBs and 4 rows by 4 columns of block RAMs.
Here's the floorplan; diagonal stripes denote hand-floorplanned primitives:
Here's the device utilization data. As you can see below,
it's a very tight fit, but thanks to the floorplanning, the design
placed and routed in three minutes.
Design Summary:
Number of errors: 0
Number of warnings: 10
Number of Slices: 729 out of 768 94%
Number of Slices containing
unrelated logic: 0 out of 729 0%
Number of Slice Flip Flops: 288 out of 1,536 18%
Total Number 4 input LUTs: 1,392 out of 1,536 90%
Number used as LUTs: 1,136
Number used for Dual Port RAMs: 256
(Two LUTs used per Dual Port RAM)
Number of bonded IOBs: 16 out of 94 17%
Number of Tbufs: 768 out of 832 92%
Number of Block RAMs: 16 out of 16 100%
Number of GCLKs: 1 out of 4 25%
Number of GCLKIOBs: 1 out of 4 25%
Number of RPM macros: 8
Total equivalent gate count for design: 291,768
This design will have a guaranteed-never-to-exceed performance of
8x50 "MIPS". Of course, this is currently a 100% useless MP-on-a-chip,
with no interprocessor interconnect, no external I/O, no spare
programmable logic for custom instructions/function units, etc., but
it stands as proof-by-example of the feasibility of FPGA MPSoCs and further
illustrates the utility of simple,
compact, floorplanned processor cores.
(To be clear, even the uniprocessor GR0000 is not yet up and
running in hardware, nor are its lcc-xr16-derived tools finished yet.)
Yesterday, I did some more work on the
GR0000 implementation. Recall it's
a new, space-optimized 16-bit RISC for Virtex/Spartan-II. Early
(work-in-progress) implementation results look very promising: 50 MHz
in 50 CLBs, in half of an XC2S15-5.
Today, I have been investing in The Knowledge.
As I wrote earlier, each multiplexer in a
processor datapath merits close scrutiny. In 4-LUT FPGAs, a 2-1 multiplexer
wastes as much area as a 16-bit register or an adder/subtractor.
Therefore it is imperative for the FPGA CPU core designer to find
circuit structures that minimize the number of muxes. Alas, some muxes
are unavoidable. Consider a processor's program sequencing unit, which
determines the next value of the program counter, PC.
- Usually PC is incremented by 2 (or 4).
- Sometimes (taken branches), a short relative branch displacement is sign-extended
and added to PC.
- Sometimes (jumps, calls, returns), PC is loaded with an effective address.
How shall we implement this? A naive approach is to write
if (jump)
pc_nxt = eff_addr;
else if (branch)
pc_nxt = pc + sign_ext({br_disp,1'b0});
else
pc_nxt = pc + 2;
which is two adders, and two 2-1 muxes or one 3-1 mux.
However, we can save area by forming a simpler mux and sharing an adder:
pc_disp = sign_ext(branch ? {br_disp,1'b0} : 2);
pc_nxt = jump ? eff_addr : pc + pc_disp;
That's one small 2-1 mux, one adder, and one 2-1 mux. Better. Can
we do better still?
Consult The Knowledge. Is there an efficient circuit structure in
Virtex that implements that last equation? Or put another way, can
o = add ? (a + b) : k;
be implemented in one logic cell per bit?
I tried Synplicity 6.0, but it emits an adder and a mux, i.e., two
LUTs per bit. I looked into the Xilinx F2.1i libraries, in particular,
at the 8-bit loadable counter CC8CLE, and it builds something funky using
the Virtex MUXCY and XORCY carry-chain resources, but it too seems to
require two LUTs per bit.
Still, it looked possible... I stared for a while at the Virtex
slice architecture schematic, and in particular at the MULT_AND, MUXCY,
and XORCY resources. And then I did indeed figure out how to implement
an add-mux circuit in just one LUT per bit! Here's how.
(If you want to make sense of this commentary, I recommend you follow
along with your own copy of the Virtex slice architecture schematic.)
First, a brief review of the Virtex slice architecture. A slice has two
4-LUTs, plus two copies each of the carry-logic primitives -- MUXCY and
XORCY -- and the multiplier primitive -- MULT_AND. In a regular a+b adder,
the 4-LUT is usually configured to compute a[i]^b[i]; the MUXCY generates
carry-out c[i], propagating a[i] if a[i]^b[i]==0 or c[i-1] if
a[i]^b[i]==1; and the XORCY computes a[i]^b[i]^c[i-1] as desired.
Now then, to this happy arrangement, we wish to add two additional inputs,
add and k[i]. It is potentially feasible because,
besides a[i] and b[i], there are two unused inputs on each 4-LUT.
To generate a sum and carry per LUT, we must still use the
MUXCY and XORCY structures. Therefore, to pass the constant k through
when add==0, the carry at every bit must be 0. But if we use the
4-LUT to compute
o[i] = add&(a[i]^b[i]) | ~add&k[i];
then when add==0 and a[i]==1 and k[i]==0, the MUXCY might still
propagate c[i]=a[i]==1 and we will incorrectly generate a carry-out
that will cause the next more significant bit's XORCY to toggle k[i+1] into
~k[i+1].
That's where MULT_AND comes in. The MULT_AND primitive was provided to
help implement compact multipliers. In a multiplier, a x b, if the current
bit of the multiplier b[i] is 0, we add nothing (0 times the multiplicand) to
the product. If it is 1, we add one times the multiplicand to the product.
In Virtex, the MULT_AND is provided so that, when b[i]==0 and a[i]==1,
instead of passing a[i] through to the MUXCY (and then to the
carry-out c[i]), we AND them together, a[i]&b[i], and
pass 0, so the carry-out remains 0.
Using this structure, each bit of a parallel multiplier can be written
as approximately
prod[j][i] = b[i] ? prod[j-1][i] + a[i-j] : prod[j-1][i]
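Behaviorally, the MULT_AND gating amounts to this (a Python model of the structure, not the Virtex netlist; the function name is mine):

```python
def shift_add_mult(a, b, n=4):
    """Behavioral model of an n-bit shift-add multiplier.

    At each step, the MULT_AND role is the gating of the multiplicand:
    the addend is a when b[i]==1 and 0 when b[i]==0, so the carry
    chain never sees a spurious a[i] when this partial product is 0.
    """
    prod = 0
    for i in range(n):
        addend = a if (b >> i) & 1 else 0  # a[j] & b[i], per bit
        prod += addend << i                # add into the running product
    return prod
```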
This said, our goal of
o = add ? (a + b) : k;
is now in sight. The trick is to use MULT_AND to zero the a[i] input
to the MUXCY when add is false.
Here's the source code for one bit of the circuit
(synthesis directives omitted):
module addmux1(add, ci, a, b, k, sum, co);
    input add, ci, a, b, k;
    output sum, co;
    wire add_a, lut;

    // 4-LUT: add&(a^b) | ~add&k
    addmux_lut lut_(.add(add), .a(a), .b(b), .k(k), .o(lut));
    // MULT_AND zeroes the carry-mux data input when add==0
    MULT_AND and_(.I0(add), .I1(a), .LO(add_a));
    // MUXCY: co = lut ? ci : add_a
    MUXCY_L cy_(.S(lut), .DI(add_a), .CI(ci), .LO(co));
    // XORCY: sum = lut ^ ci
    XORCY xor_(.LI(lut), .CI(ci), .O(sum));
endmodule

module addmux_lut(add, a, b, k, o);
    input add, a, b, k;
    output o;
    assign o = add&(a^b) | ~add&k;
endmodule
Does this work? I haven't verified it yet. It looks good though.
I am very pleased to find this. It will shave at least one, and perhaps
two, logic cells per bit from GR0000, the XR processors, and so forth.
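Meanwhile, here's a quick bit-level Python model of my reading of the slice primitives -- a sanity check of the logic, not a substitute for verifying the real netlist:

```python
def addmux(add, a, b, k, n=16):
    """Bit-serial model of the one-LUT-per-bit add/mux circuit:
    o = add ? (a + b) : k, built from LUT, MULT_AND, MUXCY, XORCY."""
    ci, o = 0, 0
    for i in range(n):
        ai, bi, ki = (a >> i) & 1, (b >> i) & 1, (k >> i) & 1
        lut = (add & (ai ^ bi)) | ((1 - add) & ki)  # the 4-LUT equation
        add_a = add & ai                            # MULT_AND output
        s = lut ^ ci                                # XORCY: sum bit
        ci = ci if lut else add_a                   # MUXCY: carry-out
        o |= s << i
    return o

# exhaustive check at 4 bits over every (a, b, k), both modes
for a in range(16):
    for b in range(16):
        for k in range(16):
            assert addmux(1, a, b, k, 4) == (a + b) & 0xF
            assert addmux(0, a, b, k, 4) == k
```

The key case is add==0: the LUT passes k[i], and MULT_AND forces the MUXCY data input to 0, so the carry chain stays at 0 and each XORCY emits k[i] unperturbed.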
This same idea can be modified for other 4-input adder/mux circuits. For
example, it is also possible to do a minimal ALU in one column of LUTs:
o = add ? a + b : a ~& b;
Finally, it is worth pointing out one problem with this construction.
In a conventional adder/mux, the mux latency is obviously one LUT delay.
In this new construction this latency will (incorrectly) appear to a static timing
analyzer to be up to one n-bit adder delay.
Craig Matsumoto, EE Times:
Xilinx deal puts Synopsys in FPGA flow.
Coming soon: a flow that takes C or SystemC and puts out gates.
'But generally, Synopsys expects the C-to-RTL design flow to yield the
same circuits as a Verilog/VHDL-based flow, if not better, because both
will hand off their RTL data to the same synthesis tools. "There is no
difference in quality of results, because the synthesis is the same,"
Kunkel said.'
Apparently EDA vendors hope to realize higher prices
on higher-end FPGA SoC tools... Meanwhile, CNets development is
on hold while I pursue other opportunities.
Xilinx Aligns with Industry Leaders To Announce Platform FPGA Initiative.
A "must read": [some emphasis added]
"Empower! ... Embedded PowerPC 405 microprocessor cores from IBM will operate at
300 MHz to offer over 420 Dhrystone MIPs of performance ... Additionally,
embedded soft processor cores and high performance external interfaces
ensure that designers can implement a wide variety of custom solutions."
"XtremeDSP ... For high performance DSP applications, the Xilinx XtremeDSP solutions
will support over 600 billion multiply accumulate cycles (MACs)/sec,
more than 100 times faster than the industry's leading embedded DSP
processor core. The Virtex-II architecture includes fully distributed
registers and RAM for efficient FIR filters, up to 3.5 Mbits embedded
dual-port RAM for data buffering and embedded 18x18 multipliers for high
performance MAC functions."
"SystemIO ... RocketIO gigabit serial interfaces will deliver
unprecedented bandwidth for networking and communications systems for
Platform FPGAs. Embedded 3.125Gbps SkyRail CMOS serial transceivers
licensed from Conexant Systems, Inc. will support 10 Gbit Ethernet,
OC-192, InfinibandTM and XAUI interface standards."
As Steve Ballmer used to say (and probably still does),
"It's a great time to be us."
We now have our first link from the
Xilinx site.
Go to the IP Center, and click on
Processor Products
and there we are. Thanks Xilinx!
Peter Clarke, EE Times:
ASM, Philips build 70-nanometer gate. Excellent. I had read that
SiO2 won't cut it as the gate insulator at those geometries, and
now they've found something 1.1 nm thick that's a million times better!
Twenty-five million FPGA gates, here we come!
Over on
Scripting News, I comment on why Microsoft has a successful
component software ecology (and why Unix and open-source software don't.)
assembling variable-length instructions
Today I am working to finish up the long-delayed xr32 tools story.
The specs, core, SoC support, compiler, and simulator are done, but
I have to do some more work on the assembler.
For xr16, things are simple. Any reference to a symbol is always going
to be a 16-bit address, and so the immediate instruction
(addi, lw, etc.) is always going to require an immediate prefix.
And any call to a function, all of which are 16-byte aligned, will
require a single call instruction.
For xr32, things are more complicated. In xr32, more than one imm prefix
is permitted, in order to build up a larger immediate constant, even
on the call instruction:
lw r1,0x5678 -> imm 0x567 ;; lw r1,8(r0)
lw r1,0x2345678 -> imm 0x567 ;; imm 0x234 ;; lw r1,8(r0)
lw r1,0x12345678 -> imm 0x567 ;; imm 0x234 ;; imm 0x001 ;; lw r1,8(r0)
call 0x5670 -> call 0x5670
call 0x2345670 -> imm 0x567 ;; call 0x234
call 0x12345670 -> imm 0x567 ;; imm 0x234 ;; call 0x001
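A sketch of how the assembler might choose these sequences (Python, names mine; the call operand is shown in its encoded addr>>4 chunk form, assuming the 4-bit low fields and 12-bit imm chunks of the examples above):

```python
def imm_chunks(value, low_bits=4):
    """Split a value into a low instruction field plus 12-bit chunks
    for imm prefixes, least-significant chunk first -- matching the
    emission order in the examples above."""
    low = value & ((1 << low_bits) - 1)
    rest = value >> low_bits
    chunks = []
    while rest:
        chunks.append(rest & 0xFFF)
        rest >>= 12
    return chunks, low

def emit_lw(addr):
    """imm prefixes plus the lw, which keeps the low 4 bits as its disp."""
    chunks, low = imm_chunks(addr)
    return [f"imm {c:#05x}" for c in chunks] + [f"lw r1,{low}(r0)"]

def emit_call(addr):
    """The callee is 16-byte aligned, so the low 4 bits are 0; the most
    significant chunk rides in the call instruction itself."""
    chunks, _low = imm_chunks(addr)
    if not chunks:
        chunks = [0]
    return [f"imm {c:#05x}" for c in chunks[:-1]] + [f"call {chunks[-1]:#05x}"]
```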
So now a symbol's address determines what instruction sequence addresses it.
But you don't know the address of anything until all the code and data
has been assembled.
This is rather like the long-standing XSOC
Issue #3, which is that far branch displacements are not implemented.
I'm going to fix that one today, too.
This is an old chestnut of the assembler and linker world.
One good strategy is to do a first pass of assembly, building
a table of symbolic references ("fixups"), assuming all references
are short and emitting the shortest code sequence possible.
Then loop over the whole program image, resolving every fixup. If a fixup
site makes a reference to a symbol that is "too far" away, you must insert
instructions to implement the long reference. Inserting code moves the
following code down, which may make some formerly short references long,
which may require yet more instructions, etc. Eventually the whole thing
settles down to a fixed point, no insertions occur, and you can stop.
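Here's a sketch of that loop in Python (a simplified model I made up for illustration: 2-byte instructions, a hypothetical short reach, one widening word per reference):

```python
SHORT_REACH = 128  # hypothetical reach of a short PC-relative reference, bytes

def relax(insns, labels):
    """Widen out-of-range short references until a fixed point.

    insns:  list of dicts {"size": bytes, "target": label or None, "long": bool}
    labels: label name -> index of the instruction it precedes
    """
    changed = True
    while changed:
        changed = False
        # pass 1: assign addresses under the current instruction sizes
        addrs, addr = [], 0
        for ins in insns:
            addrs.append(addr)
            addr += ins["size"]
        where = {name: (addrs[i] if i < len(addrs) else addr)
                 for name, i in labels.items()}
        # pass 2: widen any short reference that can't reach its target;
        # the inserted word moves following code, so loop again
        for i, ins in enumerate(insns):
            if ins["target"] is not None and not ins["long"]:
                disp = where[ins["target"]] - (addrs[i] + ins["size"])
                if not -SHORT_REACH <= disp < SHORT_REACH:
                    ins["long"] = True
                    ins["size"] += 2  # e.g. an inserted imm prefix
                    changed = True
    return insns
```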
One additional complication: the xr call instruction requires the called
function to be 16-byte aligned. Therefore all functions are 16-byte aligned.
If you apply the 16-byte alignment padding early, it will be invalidated
as soon as you insert any instructions in some other function that
precedes it.
If you wait until fixup resolution achieves a fixed-point, and then apply
function alignment padding, the padding that is inserted can make certain
address references long, requiring more passes of fixup resolution.
If you are not careful, you can insert some padding, insert something ahead
of that, insert some more padding, insert something, etc., and the padding
per function grows without bound.
Therefore, when we 16-byte-align functions, we must maintain the invariant
that there are always only 0-14 bytes of padding before any function.
Careful code in the code-inserter, which must now be function-boundary
aware, enforces this.
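The padding amount itself is simple modular arithmetic (a sketch, helper name mine, not the actual assembler code):

```python
def pad_to_16(addr):
    """Bytes of padding needed to 16-byte-align addr.

    With 2-byte instructions, addr is always even, so the result is
    always one of 0, 2, ..., 14 -- the 0-14 byte invariant above.
    """
    return -addr % 16
```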
FCCM 2001, April 30-May 2, 2001,
will be held in Rohnert Park, CA this year due to construction at the Napa site.
It's that time of year again. Time to send a check to Xilinx for a year
of maintenance on Alliance Standard.
For as long as there has been an Alliance product, Alliance Standard
has targeted the entire family of Xilinx devices. No more. Now Xilinx
has introduced a new price tier, Alliance Elite. If you want to target
(or even experiment with targeting) devices larger than a VirtexE-1000,
you need Elite.
I was quoted a yearly maintenance fee on Elite of about triple that
of Standard. Expensive enough to make you think twice.
Since I don't currently need to target a device larger than a V1000E,
and since (as I understand it) upgrading to Elite would put my Alliance
license into the realm of time-based licensing, I'll pass on Elite for now.
Bottom line, Elite will keep at least this one customer from "kicking
the tires" on larger devices -- and so I probably won't be publishing
any reports on how many and how fast an array of processors
will run in an XCV3200E anytime soon.
but the good news is...
The free Xilinx WebPack ISE, which includes synthesis, simulation,
and place-and-route tools for Spartan-II and Virtex V300E devices,
is now
available.
All told, it's a big download -- over 100 MB.
FPGA CPU News, Vol. 1, No. 5
Back issues: Oct, Sep, Aug, Apr.
Opinions expressed herein are those of Jan Gray, President, Gray Research LLC.