Mike Rutenberg reports XSOC is discussed in
c't magazine this month!
> Subject: XSOC has made the press
>
> You are in the latest issue of the (excellent!) c't magazine. The
> magazine URL is below, though I am not sure the article is on the web
> site. The article is on open-source hardware, and it mentions you
> specifically and your processor, along with a number of others. One
> thing it says is that the license only allows non-commercial use,
> implying that there are no options for using the system for other
> things.
Well, that's right, XSOC/xr16 is not currently licensed for commercial use.
Never say never.
The Tao of Static Timing Analysis
One quote that I overlooked in yesterday's fine EE Times piece on tools:
"Both Sevcik and Greenfield said that with device performance a primary
concern, customers are also using static timing analysis tools in high-end
FPGA design."
I believe that all FPGA designers should use static
timing analysis. As I learned from Philip Freidin,
functional simulation plus static timing analysis
is necessary and sufficient,
which leaves little to no value in timing simulation, back-annotated or not.
(Here follows my brief enthusiast's take -- which doesn't do justice to Philip's eloquent and convincing exposition.)
First, I assume you're using a robust synchronous design method: no
asynchronous feedback circuits, no gated clocks. All state is registered
in flip-flops (and other synchronous primitives like RAMs), passes through a cloud of combinational logic,
and sinks into flip-flops (RAMs, etc.). All FFs are clocked by common
clocks on low-skew clock lines. Clock enables, not gated clocks,
determine whether each FF is updated on each cycle. Ideally all FFs
are clocked on the rising clock edge, but sometimes you have to
clock on the falling edge too. Ideally all circuits take one cycle,
but sometimes some take more than one. Ideally all inputs to and
outputs from the chip are registered in FFs.
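For example, a single stage in this style might be sketched in Verilog as follows. (A minimal illustrative sketch -- the module and signal names are mine, and the combinational "cloud" here is arbitrary.)

    // One stage of a robust synchronous design: state registered in FFs,
    // a cloud of combinational logic, one common clock, and a clock
    // enable -- never a gated clock.
    module stage (
        input             clk,   // common, low-skew clock
        input             ce,    // clock enable, not a gated clock
        input      [15:0] d,
        output reg [15:0] q
    );
        wire [15:0] next = d + 16'h0001;   // any combinational function

        always @(posedge clk)              // all FFs clock on the rising edge
            if (ce)
                q <= next;
    endmodule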
In this paradigm, a cycle-by-cycle functional simulation
will suffice to verify the logical correctness of your design.
Once your design is logically correct in all circumstances, it remains
to determine whether it also satisfies its performance (timing) requirements.
That's where static timing analysis comes in.
A single almost-automatic static timing analysis run can quickly
establish that a design meets its timing constraints.
Given a placed-and-routed design, where every primitive element of logic
and interconnect has been determined, a static timing analyzer like
Xilinx's TRCE first determines the worst-case delays on
every net and logic element. Then, using this data, TRCE traces
every circuit path from every FF to every FF to determine the worst-case
delay from one clock edge to the next. This reveals
the maximum guaranteed-safe operating frequency over the rated range of
supply voltages and temperatures.
A static timing analysis is necessary and sufficient because it explores
every path between each group of FFs in the design, and (together with
a cycle-based functional simulation) covers the design even if you
overlook some test vector for some path, and even if some
elements are operating exceptionally fast. This 100% coverage is
difficult to achieve with a timing simulation.
(I don't mean to oversell static timing analysis. But I like it. It does
require that you apply at least minimal timing specifications to your design. Sometimes
as little as a net CLK period = 15 ns; will suffice. Some designs,
with multi-cycle paths, multiple clock domains, etc., require much more
extensive timing specifications. Sometimes it's hard to get the tools to
report 100% coverage, or to trust them when they do. Even so, STA can
achieve better coverage than timing simulation, with less time and effort.)
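For instance, in the Xilinx UCF constraints file of that era, the minimal case and a multi-cycle relaxation might look something like this. (An illustrative sketch only -- SLOW_FFS is a hypothetical user-defined time group, and the exact spelling varies with tool version.)

    # often this single period constraint suffices:
    NET "CLK" PERIOD = 15 ns ;

    # a two-cycle path can be relaxed with a FROM-TO timespec,
    # where SLOW_FFS is a TNM group on the multi-cycle destination FFs:
    TIMESPEC "TS_MC" = FROM "FFS" TO "SLOW_FFS" 30 ns ;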
(This little commentary also does not touch on the other applications
of static timing analysis during prototyping and design, to gain early
insight into feasible circuit designs, track slack between groups
of circuits as the design evolves (another Freidin technique), etc.)
Since functional simulation plus static timing analysis is necessary
and sufficient, how can we explain the false allure of timing simulation, or
the notion that only some designers are using static timing analysis?
- "I'd never thought of it."
- "I'd never heard of it."
- "I don't know how to write timespecs."
- "I've heard of it, but I (my manager) like (insists upon) timing simulation."
- "I don't/won't/can't trust the static timing analyzer so I do functional simulation, as well, to get more confidence."
Today, I am investing some time working on
the Knowledge.
I can now build RPMs in Verilog in Synplify: /* synthesis xc_props="RLOC=RyCx" */ seems to work well. Alas, the slice
coordinate system added with Virtex is awkward. I have to produce
two versions of some of my RPMs -- one with a _s0 suffix, one with a
_s1 suffix. Within the lowest level modules, the suffix controls whether
to apply "RLOC=RyCx.S0" or "RLOC=RyCx.S1".
Unfortunately, this attribute propagates upwards to higher-level modules, and
so I must write addsub16_s0 and addsub16_s1 and so forth.
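For instance, a lowest-level RPM cell might look like this. (A sketch only -- the names are mine, and exactly where Synplify accepts the attribute may vary by version.)

    // Slice-0 flavor of a one-bit RPM cell. The Synplify xc_props
    // attribute attaches a Virtex RLOC to the flip-flop instance;
    // the _s1 twin is identical except for "RLOC=R0C0.S1".
    module rpmbit_s0 (input clk, input ce, input d, output q);
        FDE ff (.C(clk), .CE(ce), .D(d), .Q(q))
            /* synthesis xc_props="RLOC=R0C0.S0" */;
    endmodule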
Factoids:
- a 32-bit add/sub RPM, sinking and sourcing adjacent registers, floorplanned as 16Rx1C of slices, runs at about 150 MHz in a 2S50-5: Tcko + net + 16*Tbyp + Tccky < 7 ns. With care, a pipelined 32-bit RISC in a -5 part should hit 100 MHz.
- floorplanning it as 8Rx2C of slices instead adds about 2.7 ns to
the cycle time, due to the extra vertical net needed to route carry-out[15]
from the top of the first 16-bit adder column to the bottom (carry-in port) of the second, plus the extra Tbxcy. Ouch; so much for that idea.
- a 64-bit ripple-carry adder, 32x1 slices, has a cycle time under 11 ns.
- as does a 64-bit carry-select add/sub implemented with 3 32-bit adders and a mux (sketched below).
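The carry-select scheme is simple enough to sketch in a few lines of behavioral Verilog. (Unfloorplanned -- the real RPM versions are placed by hand.)

    // 64-bit carry-select add: the low half ripples normally, both
    // possible high halves are computed in parallel, and the low
    // half's carry-out selects between them.
    module addcs64 (input [63:0] a, b, output [63:0] sum);
        wire        c32;
        wire [31:0] hi0, hi1;
        assign {c32, sum[31:0]} = a[31:0] + b[31:0];
        assign hi0 = a[63:32] + b[63:32];           // assumes carry-in = 0
        assign hi1 = a[63:32] + b[63:32] + 32'd1;   // assumes carry-in = 1
        assign sum[63:32] = c32 ? hi1 : hi0;
    endmodule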
Michael Santarini, EE Times:
Can tools keep up with programmable silicon? Highly recommended.
Indeed, tools and methodology issues are probably a greater barrier to
widespread deployment of FPGA systems than the increasingly abundant and
cheap programmable logic itself.
For example, floorplanning attributes (RLOCs and so forth) have a
different attribute syntax in the Synopsys, Exemplar, and Synplicity
synthesis tools!
'Most sources agreed that HDLs have replaced the "schematic-sauruses"
-- those who hand-tweaked gates and flip-flops to get the maximum
performance out of FPGAs. Philip Freidin, a longtime schematic-saurus,
said that he has begun to incorporate HDLs into his methodologies.'
'"It isn't because I wanted to do it; it is because customers demand
it," said Freidin, who specializes in high-performance FPGA design
at Fliptronics (Sunnyvale, Calif.). "The issue simply comes down to
design time."'
Peter Clarke, EE Times:
Panel ponders cost of programmable system-on-chip.
Hiro Higuma, Martin Won, Altera, for ChipCenter:
Building Configurable Network Processors.
"Programmable logic-based packet processing functions offer many of the
same flexibility and time-to-market benefits of off-the-shelf network
processors. Further, programmable logic can provide better performance
by utilizing dedicated hardware for specific packet processing functions."
"... Finally, the recent availability of 32-bit RISC soft cores for
programmable logic adds another level of capability to these devices,
affording users greater usability and the option to rapidly develop
custom multiprocessor designs."
It's a great application for dozens of compact (200-500 logic cell)
soft CPU cores per FPGA. See also Soft cores.
One evening last August, I designed a simple RISC processor, even
simpler and smaller than xr16. Like xr16, it is designed to be
the target of lcc
and has 16-bit instructions, 16-bit data, and 16 registers.
Unlike xr16, (but like Brian Davis's
YARD-1A
(description)),
it is a 2-operand architecture, is not pipelined, and uses a single
bank of dual-port RAM for the register file. Initially it will use the
Virtex-family block-RAM as the instruction store.
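In Verilog, such a register file is just an array that synthesis tools map onto dual-port distributed RAM (e.g. Virtex RAM16X1D primitives). (A hedged sketch with invented names -- one write/read port plus a second read port, enough for a 2-operand ISA.)

    // 16 x 16-bit register file in one bank of dual-port RAM:
    // port 1 writes rd and reads it back; port 2 reads rs.
    module regfile (
        input         clk,
        input         we,       // register write enable
        input  [3:0]  rd,       // dest reg: written, and read as operand 1
        input  [3:0]  rs,       // source reg: read as operand 2
        input  [15:0] wdata,
        output [15:0] d1, d2
    );
        reg [15:0] regs [15:0];
        always @(posedge clk)
            if (we)
                regs[rd] <= wdata;
        assign d1 = regs[rd];   // via the write port's read path
        assign d2 = regs[rs];   // via the second read port
    endmodule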
The goals are to provide a simple, fully embeddable MCU, comparable to
KCPSM but C programmable, and to advance the agenda of demystifying
processor design and encouraging student and enthusiast experimentation.
(In retrospect, the pipelining of xr16 is good for performance but
detracts from its simplicity.)
I will be presenting this CPU/SoC design at an upcoming design conference.
Like XSOC/xr16, it is "disclosed source", licensed under the XSOC LICENSE
agreement.
Taking a page from the "literate programming" community,
the write-up of the design is the design. Using Microsoft Word, I
save the document as text and filter that to extract the Verilog source.
Here is the current work-in-progress in PDF.
(The processor is mostly sketched out but it surely doesn't compile yet.)
"Processor and SoC design is not rocket science ... To prove the point,
this paper and accompanying 50-minute class will present the complete
design and implementation of a streamlined, but real, system-on-a-chip
and FPGA RISC processor."
More FPGA CPUs
Universidad de Valladolid (Spain):
The uP1232 8-bit fpga-processor. 8-bit CISC, 32 registers, 200 XC4000E CLBs.
PLD processors
by Jeung Joon Lee:
Reliance-1,
PopCorn-1,
Acorn-1.
Craig Matsumoto, EE Times:
How Cisco beat chip world to net.
Cisco's in-house network processor designs.
"Like most network processor designs, the Toaster is
parallel-pipelined. "Each column you can think of as doing a system
function with separate memories," allowing for better I/O bandwidth in
any column, Nellenbach said. ... Kerr's main concern was in maximizing
the possible number of 32-bit lookups per second, which meant getting
the memory interface right ... So Jennings' team went with eight memory
interfaces, all connecting to synchronous DRAM."
Commentary: some designs (not necessarily this one) are inevitably constrained
by pin-bandwidth to external memory and I/O, regardless of whether the
internals are hard gates or programmable logic.
Brian Dipert, EDN:
Cunning circuits confound crooks. Nice survey article on PLD
design security.
"With otherwise-SRAM-based FPGAs, for example, adding nonvolatile memory
for unique device identifiers might be cost-prohibitive. Instead, Xilinx
is including a hard-wired Triple-DES decryption block, along with two
sets of 56-bit key registers and dedicated battery-backup supply-voltage
inputs for only those registers, on its upcoming Virtex-II FPGAs. Xilinx
estimates that the registers alone consume only nanoamps of current,
orders of magnitude lower than if the entire device needed to be
battery-powered. Xilinx's approach not only prohibits device cloning,
but also prevents unwanted rogue bit streams, such as viruses, from
being downloaded to the part in this increasingly network-connected world."
This afternoon I resumed the Virtex port of XSOC/xr16/xr32 (XSOC2 Log)
and am now (finally) running XSOC/xr16 in my XESS XSV-300 prototyping board.
Today's work involved several compromises.
Since this board does not have a tool to pre-load the SRAM,
I modified the XSOC design to provide a 256x16 boot ROM in a block RAM.
I further modified the new fully synchronous MEMCTRL so that instruction
fetches from this block RAM signal RDY in the same cycle.
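The idea, in a hedged sketch (invented names -- the real MEMCTRL is organized differently): the 256x16 ROM is a block RAM preloaded with the boot code, and since block RAM returns read data on the next clock edge, a fetch that decodes into it needs no wait states, so RDY can be asserted in the same cycle as the request.

    // 256 x 16 boot ROM in a block RAM, ready in the same cycle.
    module bootrom (
        input         clk,
        input         sel,     // address decodes to the boot ROM
        input  [7:0]  addr,
        output [15:0] data,
        output        rdy
    );
        reg [15:0] rom [0:255];
        initial $readmemh("boot.hex", rom);  // preload the boot code

        reg [15:0] data_r;
        always @(posedge clk)
            data_r <= rom[addr];             // synchronous block RAM read
        assign data = data_r;
        assign rdy  = sel;                   // no wait states
    endmodule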
Just as with the XSOC/xr16 kit for XS40 boards, the design currently
includes a bitmapped VGA display, using the DMA engine in the xr16 CPU core
as the video address counter. (With a 50 MHz dot clock, it refreshes
the display at 120 Hz!)
Alas, the XSV's two 16-bit SRAM banks both lack byte write enables.
For the time being I am using just one byte-wide bank of SRAM. Later I
will modify MEMCTRL to perform read-modify-write accesses for byte
stores to RAM.
Using a modified version of xr16 (replacing the double-cycled single-port
RAMs with dual-port RAMs), we get a design that TRCE reports will run
at 60 MHz in a V300-5. (Not floorplanned yet.)
Total size of the design, including MEMCTRL and VGA, is about 400 logic cells.
The design runs fine at 33 MHz. At 50 MHz, the program runs fine,
but accesses to the external SRAM frame buffer fail. I will therefore
modify the memory controller to insert a wait state on each external
SRAM access. That done, I should be able to tune the core design up to
67 MHz in short order, motivating integrated instruction and data caches...
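The wait state itself is about one flip-flop's worth of logic; something like this sketch (invented names; the real controller differs):

    // One wait state per external SRAM access: withhold RDY on the
    // first cycle of each access and assert it on the second.
    module waitgen (
        input  clk, rst,
        input  ext_sel,    // current access targets external SRAM
        output rdy
    );
        reg waited;
        always @(posedge clk)
            if (rst | rdy) waited <= 1'b0;      // idle, or access done
            else           waited <= ext_sel;   // first cycle elapsed
        assign rdy = ext_sel & waited;
    endmodule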
Craig Matsumoto, EE Times:
Adaptive Silicon preps FPGA core for ASICs.
Rich Katz (NASA)'s site
"dedicated to the design and use of programmable and quick-turn
technologies for space flight applications."
MP-on-a-chip
Peter Clarke, EE Times:
Beefy parallel processor packs 128 cores. Pact GmbH's
XPP-128, a 128-CPU MP-on-a-chip, 12.8 billion MACs/s at 100 MHz.
"The XPP is a mixture of a parallel-processing array with an
interconnect architecture similar in style to that of an FPGA. But
Vorbach said the second crucial element is the transparent run-time
reconfiguration technology that dynamically controls the processing
resources. Vorbach said this technology automatically makes changes to
the array interconnect, assigning processes to clusters of processors
based on internal or external events."
Based upon the sketchy description of the programming model,
a pure data-parallel SIMD or a pure MIMD would seem easier to program.
Peter Glaskowsky, Microprocessor Report: PACT Debuts Extreme Processor:
Reconfigurable ALU Array Is Very Powerful -- And Very Complex.
The article provides excellent detail on the design, including its
programmable interconnect, and thoughtful analysis of its prospects.
Factoids -- 400 mm^2 in 0.21 micron, 32 256x32 embedded SRAMs, 8 SDRAM
channels, a 1,521-contact BGA!
XPP reminds me of the Univ. of Washington's
RaPiD configurable computing architecture.
We may build similar things in big Virtex-Es. (With all that block RAM
for local scratchpad RAMs or caches, it's a natural.)
See Multiprocessors, Supercomputers, Soft cores, Using block RAM, and of course,
Danny Hillis'
The Connection Machine.
ARC Tangent
Peter Clarke, EE Times:
ARC Tangent extends configurability to the system level.
'However the company is not introducing high-end features such as out-of-order instruction execution, conditional execution or speculative execution of branches. "The philosophy is still to keep the core simple, the base core gate count is still only about 17,000 gates," said Hakewill.'
An excellent philosophy.
Simple is beautiful.
Over on the
fpga-cpu
mailing list, I posted two messages on how to build
really compact RISC cores, down around 150 logic cells,
half the size of xr16 (and at least twice as slow).
In the first,
I start with an ultra-minimalist datapath, and add features one-by-one
to improve performance.
In the second,
I subtract microarchitectural features one-by-one from xr16 to explore
what savings might be realized, and at what cost in performance.
Ken Chapman, Xilinx, App Note XAPP213:
8-Bit Microcontroller for Virtex Devices.
If I may be permitted to quote so extensively, I'll let this
superb app note speak for itself:
"The Constant (k) Coded Programmable State Machine (KCPSM) presented in
this application note is a fully embedded 8-bit microcontroller macro for
the Virtex and Spartan-II devices. The module is remarkably small
at just 35 CLBs, less than half of the smallest Spartan XC2S15 device,
and virtually free in an XCV2000 device by consuming less than 0.37%
of the device CLBs."
"This KCPSM provides 49 different instructions, 16 registers, 256
directly and indirectly addressable ports, and a maskable interrupt
at 35 million instructions per second (MIPs). This performance exceeds
that of traditional discrete microcontroller devices, making the KCPSM
a cost-attractive solution for data processing as well as control
algorithms."
"Fully embedded including the program memory, the KCPSM can be
connected to many other functions and peripherals tuned to a specific
design. Processing distributed over multiple KCPSM processors within a
single device is suitable for applications such as neural networks." ...
"When a processor is completely embedded within an FPGA, no I/O
resources are required to communicate with other modules in the same
FPGA. Additionally, system design flexibility is included along with
savings on PCB requirements, power consumption, and EMI. Whenever a
special type of instruction is required, it can be created in hardware
(other CLBs) and connected to the KCPSM as a kind of coprocessor. Indeed,
there is nothing to prevent a coprocessor from being another KCPSM
module. In this way, even the 256-instruction program length is not
a limitation."
See also this app note by Chapman from almost six years ago.
Dynamic Microcontroller in an XC4000 FPGA. Nice work, and a nice
prior art reference. (Ah, XBLOX, those were the days. I designed my
first 3-D rendering system with XBLOX.)
This new app note articulates many of the potential advantages
of compact soft CPU cores. I feel strongly that small soft CPU
cores will prove to be indispensable, both standalone in low-cost device
SoCs, and as channel processors and smart peripherals to hard CPU cores
in the forthcoming "hard CPU + PLD" hybrid devices. See also Soft cores
and the theme of my Circuit Cellar articles.
What do I consider small? Not 950 or 1100 or 1700 logic
cells. Certainly not 3000. By small, I mean cores like this
excellent assembler-programmable KCPSM (35 CLBs => ~140 logic cells) or the
integer-C-programmable xr16 (~300 logic cells).
A Spartan-II-150 is a terrible thing to waste.
See also Simple is beautiful.
Ericsson's Erlang FPGA CPU
From the
Sixth International Erlang/OTP User Conference,
Robert Tjärnström and Peter Lundell,
Ericsson Telecom:
ECOMP - an Erlang Processor
(PowerPoint).
"An Erlang processor has been built in an FPGA (i.e. programmable
hardware). The JAM compiler has been changed to generate native code which
allows Erlang programs to be run directly on the processor without any
OS and with improved performance."
This is an interesting LIW architecture that does concurrent
real-time garbage collection in hardware. It also has ~20-cycle hardware
process switching.
Modeled in Erlang, implemented in VHDL, prototyped in an RC1000-PP with
an XC40150XV. Results: a speedup of 30 (cycle per cycle) while decreasing
power by more than an order of magnitude.
Compared to what, they didn't specify...
LEON SPARC update
Over on the
LEON SPARC mailing list,
Jiri Gaisler
announces
version 2.2 beta, which uses AMBA AHB/APB internal buses.
Another milestone.
Xilinx Adds FPGA Support to Free Web Design Tools. Yahoo!
"Xilinx, Inc. today announced full support of the entire Spartan®-II FPGA family as well as the 300,000 system gate Virtex(TM) XCV300E FPGA in the WebPACK ISE(TM) tool suite. ...
The next release of WebPACK ISE, V3.2i WP1.0, which contains the added FPGA support of Spartan-II and the Virtex XCV300E device is scheduled for release in mid-October 2000."
These tools and parts are more than adequate for all manner of
sophisticated processor cores, multiprocessors, DSPs, etc. As the
barriers to entry fall away, we're in for some serious innovation,
serious products, and some serious fun.
[updated 00/08/05] Murray Disman, ChipCenter,
Xilinx Offers Free FPGA Design Tools.
IP business models [revised 00/10/04]
Here is my take on some cogent analysis on IP business models
from Tom Cantrell of Circuit Cellar. As he writes in
Excalibur Marks the Spot,
"Remember
that the cost of any chip is comprised of two parts-what it costs to make,
sell, and support the silicon, and the value of the design (i.e., IP). As a
silicon supplier, Altera has the ability to hide the IP cost in the chip
price. Independent IP providers have no such luxury, short of messy and
unpopular royalty schemes. Also, the free IP news may perk the interest of
lawyers, similar to how MP3 got the recording industry riled up. I look
forward to reading the fine print in the Nios license."
In the old days, chip vendors were also the IP developers and the EDA
tools developers. Nowadays, we have specialized fab companies (TSMC),
IP companies (ARM, MIPS, Gray Research LLC :-) ), and tools companies
(Mentor, Cadence, etc.), and combinations of these (Intel). You can buy
IP bundled with hardware (Intel), bundled with your tools (EDA companies),
or separately (IP providers).
Enter the FPGA vendors (Xilinx, Altera). They have an opportunity to seize
upon a unique business model.
Take Altera Nios. The Nios development kit is relatively inexpensive
(~$1000) and they will supposedly issue you a license to use the Nios core
in Altera FPGAs for $0. The more instances you make, the more
programmable logic they sell. I suppose they make up the cost of developing,
testing, supporting, documenting, etc., the IP through device sales
(which also simplifies the accounting).
This business model gives these vendors a giant, almost unassailable advantage over
third party IP vendors. The latter can never compete on price, because the
FPGA vendors can always price their IP down to $0 and happily make up any
lost revenue with further sales of CLBs. Therefore a third party IP vendor
can only compete on value, quality, and innovation. For example, in the
Altera CPU cores market, which includes the $0 Nios core family, one can
only compete with a different value proposition, perhaps instruction set
compatibility with a legacy ISA, or perhaps by offering a core which is
dramatically smaller and faster than Nios. In the latter case, if your core
uses $2 less programmable logic than the FPGA vendor's does, then it may
have a value of $2/unit to a customer. Or not. It also depends on which
vendor(s) establish a larger value chain of experts, plug-ins, etc.
In pricing their cores, FPGA vendors may also consider the customer
lock-in value of a key piece of IP (such as a processor or
on-chip bus protocol). Once a customer has designed against such
a key facility or interface, it will be extremely costly to change
horses later.
Perhaps FPGA vendors have a vested interest in giving away
the largest cores that the market will bear, so as to sell more CLBs.
There are two problems with this idea: driving away "cores partners",
and competition with other device vendors.
Driving away cores partners: free IP in a market may act to reduce customers' perceived value of IP --
"I'll be darned if I'm going to pay PrettyGoodCores Inc. $1000 for a UART
license (even with validation and support) when I can get a whole processor
soft core and an on-chip bus license from my vendor for nothing!"
If FPGA vendors give away enough free cores, the end effect could be
to discourage pure IP vendors from contributing to that device vendor's
value chain, reducing the supply of device-optimized cores,
hence design wins, hence device sales.
Competition: secondly, the FPGA vendors must compete with each other for design wins,
and if one vendor has an excellent (fast, compact) set of cores they may sell
fewer CLBs per design win, but may be able to win new designs
from the competition.
It's an interesting conundrum.
Eventually there will also be a number of suppliers of high-quality free IP.
This will drive down the price of "me too" equivalent-quality commercial IP
except when propped up by artificial means. Even in that world, I think
there will still be an interesting market for unique or highly-optimized
commercial IP.
By the way, the "hello world" message I wrote to my LCD panel in my
project at the Altera Nios Hands-on Workshop read:
Such a great business model.
FPGA CPU News, Vol. 1, No. 4
Back issues: Sep, Aug, Apr.
Opinions expressed herein are those of Jan Gray, President, Gray Research LLC.