Wishbone B.4 Pipelined Bus Master Interfaces
============================================

Agenda
- 3-State Shared Buses
- Wishbone Bus Standards
- Supported Interconnect Fabrics
- Motivation for switching from B.3 to B.4 for the Kestrel Project
- Circuit Description of the KCP53010's Load/Store Unit
- Error/Retry Handling

3-State Shared Buses
- (block diagram of Kestrel-1)
- Cheap, easy, almost ideal for single-master systems.
- But what if you wanted to support multiple bus masters?
  - Remember, SMP systems were just coming to market, and people wanted to play!
- Early FPGAs allowed 3-state buses on-chip, but they were extremely fickle.
  - "If you can't afford to blow it up, you can't afford to use it." -- Robert Grossblatt
  - Too easy to short-circuit multiple connected logic cells, damaging a $100+ chip in the process.
- Slow, due to RLC time constants.

FPGA Vendors Drop 3-State Support
- Best practices evolved:
  - Avoid 3-state buses on-chip like the plague.
  - Use split data-in and data-out buses.
  - OR together common signals, each selectively masked by a bus arbiter.
- FPGA vendors saw the trend and dropped support for on-chip 3-state buses.
- Many ad-hoc interconnect methods evolved.
- How to support a soft-core IP marketplace? Standards needed!
- Many standards evolved, some open, some proprietary:
  - AXI
  - AHB
  - Wishbone
  - OCP
  - ... more!

Wishbone
- Open-source interconnect standard for open-source or proprietary hardware.
- Authored by a company called Silicore, Inc.; now overseen by LibreCores.
- Two (slightly incompatible) modes: B.3 (aka B.4 Standard) and B.4 Pipelined.
- Both share common traits:
  - Easy to understand!  If you know how the 6502 bus works, you already know most of Wishbone B.3.
  - Very high performance!
    - A 68000-like asynchronous handshake protocol allows very fast (single-cycle) transfers.
    - It also supports very slow devices in the same bus clock domain.
  - Exceptionally low overhead!
    - Wishbone B.3 takes only a small handful of LUTs to implement on an FPGA.
    - Wishbone B.4 takes a bit more than B.3, but it's still quite compact compared to a full AXI implementation.
    - 1/3 to 1/2 the wire count of AXI3/AXI4 or OCP.
  - Clear separation of concerns:
    - Clear distinction between masters, slaves, and "interconnect."
    - B.4 implies further separation of bus commands and bus responses.

Wishbone B.3
- Illustration of B.3/B.4 "Standard Cycle" read and write cycles, with wait states.
- Single-cycle operation is possible by asserting ACK_I during the same cycle as STB_O.
- Observe: the bus must hold its state until the transaction finishes.
- I generally won't discuss standard cycles further in this talk.

Wishbone B.4 Pipelined
- Illustration of B.4 read and write cycles, with wait states.
- Single-cycle operation is still possible by asserting ACK_I during the same cycle as STB_O.
- Observe: emphasis on bus *commands* and *responses* instead of bus *state*.
- Dumb Wishbone interconnects are no less efficient with B.4 than with B.3.
- Smart interconnects can re-use resources when the bus isn't in use, on a cycle-by-cycle basis.
- Simultaneous reads and writes are possible, saving cycles.
- More importantly, though, it fits contemporary FPGA fabrics much better than B.3.

Wishbone Interconnect Fabrics
- Shared Bus
- Data Flow
- Crossbar Switch
- Packet Switch

Interconnects: Shared Bus
- No 3-state logic; use AND/OR logic instead.
- Semantically equivalent to a 3-state shared bus.
- Smallest resource usage.

Interconnects: Data Flow
- All point-to-point logic.
- Fun fact: almost exactly how HyperTransport works.

Interconnects: Crossbar Switch
- Allows independent masters to talk to independent slaves concurrently, with single-cycle resolution.
- Each slave has its own bus arbiter.
- FPGA resource-hungry; lots of MUXes, so lots of LUTs.
- CPU instruction buses need not be routed to I/O.
- I/O can be constrained to specific memory channels.

Interconnects: Packet Switch
- VERY ADVANCED.  Theoretically possible, but no known implementation yet.
- Unknown resource usage, but it can't be small.  Probably on par with a crossbar.
- I might be the first person to think of applying B.4 in a packet-oriented manner.
- Should become smaller than a crossbar switch beyond some critical point, but I don't know where that is.
- Relies on some "creative" interpretation of the B.4 standard.

Example: Kestrel-2
- S16X4A CPU
- 3 16 KB BRAMs:
  - 2 dedicated memories for the CPU
  - 1 shared with the MGIA
- Wishbone B.3
- 1 GPIA, 1 KIA

Example: Kestrel-3 Test Mule
- KCP53000 CPU
- Furcula bus arbiter to convert the Harvard architecture to von Neumann.
- Furcula-to-Wishbone bus bridge.
- 64-bit to 16-bit Wishbone bus bridge.
- 5500 LUTs for the CPU alone; an estimated 2000 LUTs for the bridges alone.
- Remainder of the circuit identical to the Kestrel-2.

Example: Future Kestrel-3 Test Mule
- KCP53010 CPU
  - Return of the 5-stage instruction pipeline!
- Entire bus interconnect designed to be 16-bit Wishbone from the outset.
- Significant LUT reduction anticipated: currently 620 LUTs for the I and D ports.
- External SRAM bridge
- SIA serial I/O core

The KCP53010's Load/Store Unit
- 64-bit address, data, and command inputs.
- 64-bit data out, data-valid strobe.
- 16-bit Wishbone interface.
- Most of the 310 LUTs for a bus master implementation are due to the 64-bit ports.

The Response Circuit
- When building B.4 Pipelined masters, consider the response circuit first.
- (show circuit block diagram)
- Depending on transfer size, exactly ONE of rDFF 3, 1, or 0 is set true (a 64-bit transfer takes four 16-bit beats; a 32-bit transfer, two; a 16- or 8-bit transfer, one).
- The rDFFs form a shift register tracking the receive-side bus state.
  - The rDFFs shift towards bit 0.
  - Whatever is in rDFF 0 falls off the end.
- The bus cycle is complete (CYC_O negates) when all rDFFs are 0.
- Received data is registered iff ACK_I /\ CYC_O; the set rDFF determines which part to store.
  - Note that data is unconditionally registered; it may not be valid, though.

The Command Circuit
- (show circuit block diagram)
- Depending on transfer size, exactly ONE of cDFF 3, 1, or 0 is set true.
- STB_O is asserted for as long as any cDFF is set.
- Like the rDFFs, the cDFFs form a shift register.
  - It shifts iff CYC_O /\ !STALL_I.
- Data and addresses are MUXed onto the Wishbone data outputs according to which cDFF is set.
- NOTE: the cDFFs will almost certainly expire before the rDFFs; this is natural!

Wishbone Error/Retry Signals
- If we want to support errors, ERR_I must have the following effects:
  - Zero all cDFFs and rDFFs, thus immediately terminating the bus cycle.
  - Send a trap to the KCP53010, allowing the CPU to handle the bus fault.
- If we want to support automated retries, RETRY_I must have these effects:
  - Reset all cDFFs and rDFFs to their original start-of-cycle state.
  - If random back-off is desired, INTERNALLY assert STALL_I for a random number of cycles.
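The cDFF/rDFF mechanics above can be sketched as a tiny clock-by-clock simulation. This is a minimal illustrative model, not the KCP53010 RTL: the names `one_hot` and `run_cycle` are made up for this sketch, and it assumes the beats-per-size mapping implied by the 64-bit-to-16-bit port (four, two, or one beat loading bit 3, 1, or 0 respectively).

```python
# Illustrative model of the load/store unit's command (cDFF) and
# response (rDFF) shift registers.  Hypothetical names; a sketch only.

def one_hot(beats):
    """Set exactly ONE of DFF 3, 1, or 0 per transfer size:
    4 beats -> bit 3, 2 beats -> bit 1, 1 beat -> bit 0."""
    return 1 << (beats - 1)

def run_cycle(beats, stall, ack):
    """Simulate one bus cycle.  `stall` and `ack` are per-clock
    sequences of STALL_I and ACK_I as driven by the slave.
    Returns a per-clock trace of (CYC_O, STB_O)."""
    cdff = one_hot(beats)   # command-side shift register
    rdff = one_hot(beats)   # response-side shift register
    trace = []
    for stall_i, ack_i in zip(stall, ack):
        cyc_o = rdff != 0   # cycle completes when all rDFFs are 0
        stb_o = cdff != 0   # STB_O asserted while any cDFF is set
        trace.append((cyc_o, stb_o))
        if not cyc_o:
            break
        # cDFFs shift iff CYC_O /\ !STALL_I (a command beat was accepted);
        # bit 0 falls off the end.
        if not stall_i:
            cdff >>= 1
        # rDFFs shift on each acknowledged response beat (ACK_I /\ CYC_O).
        if ack_i:
            rdff >>= 1
        # (ERR_I would zero both registers here; RETRY_I would reload
        # them to their start-of-cycle one-hot values.)
    return trace

# A 64-bit (4-beat) transfer against a slave that never stalls and
# acks each beat one clock late: STB_O drops after the fourth command
# beat, while CYC_O stays up until the last ack -- the cDFFs expire
# before the rDFFs, exactly as the slide calls "natural."
trace = run_cycle(4, stall=[0] * 8, ack=[0, 1, 1, 1, 1, 1, 1, 1])
```

A single-beat transfer with ACK_I asserted on the first clock (`run_cycle(1, [0, 0], [1, 0])`) completes in one cycle, matching the single-cycle case on the B.4 Pipelined slide.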