GfxVGA Version 1.5 — Design Document #

Target hardware: Existing V1 board (Spartan-6, 16-bit async SRAM, 5-6-5 VGA output)
Purpose: Incremental refactoring of the working V1 codebase. Same feature set, same hardware. The goal is to clean up the architecture while keeping a working system at every stage — not a from-scratch rewrite.
Serves as the foundation for V2 (Spartan-7, 32-bit SRAM, 8-8-8 RGB, HDMI), which will be a clean start on new hardware where bring-up is unavoidable anyway.


1. Goals and Non-Goals #

Goals #

  • Clean, modular VHDL with well-defined inter-module interfaces
  • All inter-module types defined in package files before any module is written
  • Command-FIFO-based drawing engine decoupled from CPU interface
  • Consolidated, configurable tile layer engine (replacing 4 fixed instances)
  • Proper CDC boundaries documented and enforced
  • SRAM arbiter with a clean priority interface (replacing grown-in-place V1 version)
  • VGA pixel pipeline with explicit, documented latency compensation
  • Distributed register ownership: each subsystem module owns its own register decode
  • Text character map moved to SRAM (pointer-based); font glyphs remain in pre-initialised BRAM
  • Designed for low resource usage: target <75% LUT on XC6SLX25 (V1 was at ~98%)
  • Forward-compatible: V2 changes are widening/extending, not restructuring

Non-Goals for V1.5 #

  • New drawing primitives beyond V1 set (add in V2)
  • 320×240 mode changes beyond what V1 supports (validate, don’t redesign)
  • HDMI output (V2)
  • 200MHz clock domain (V2)
  • Amiga-style co-processor (plan register space, implement in V2)
  • Sub-16bpp framebuffer drawing modes (V2)
  • Text rendering as a TileLayer mode (design is documented in §9.6; implement in V2 once foundation is stable)
  • Per-column Y scroll table (register space reserved in tile layer block; implement if a specific effect requires it, but not at the cost of core functionality — global X scroll per row already covers most raster-bar use cases)

2. Clock Architecture #

V1.5 uses three clock domains from a single DCM/PLL (same as V1):

Domain  Clock    Source   Used by
──────  ───────  ───────  ──────────────────────────────────────────────────────────
PIX     25 MHz   PLL ÷ 4  VGA pixel output, line buffer read, compositor
SYS     100 MHz  PLL      SRAM arbiter, tile prefetch, sprite engine, drawing engine
CPU     50 MHz   PLL ÷ 2  CPU bus state machine, register file

CDC crossings (all explicit, none implicit):

Crossing   Signal type         Method
─────────  ──────────────────  ─────────────────────────────────────────────────────
SYS → PIX  HBlank start pulse  Toggle + 3-FF sync + edge detect
SYS → PIX  VBlank end pulse    Toggle + 3-FF sync + edge detect
SYS → PIX  Line buffer swap    Toggle + 3-FF sync
PIX → SYS  Prefetch request    Toggle + 3-FF sync + edge detect
CPU → SYS  Register write      2-FF sync on all control signals
SYS → CPU  DTACK               Combinational from SYS flag, registered in CPU domain

The existing ClockVGAtoSYS.vhd implements the SYS↔PIX crossings and should be preserved/extended.
Rule: No signal crosses a clock domain boundary without appearing in a CDC module port list.
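
The "Toggle + 3-FF sync + edge detect" method in the table can be sketched as a small standalone module. This is only an illustration of the pattern: the entity name pulse_cdc and its port names are hypothetical and are not taken from ClockVGAtoSYS.vhd. It assumes source pulses are spaced several destination clocks apart, so that no toggle is missed.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Sketch only: names are illustrative, not the ClockVGAtoSYS.vhd interface.
entity pulse_cdc is
    port (
        clk_src   : in  std_logic;  -- source domain clock
        pulse_src : in  std_logic;  -- single-cycle pulse in the source domain
        clk_dst   : in  std_logic;  -- destination domain clock
        pulse_dst : out std_logic   -- single-cycle pulse in the destination domain
    );
end entity pulse_cdc;

architecture rtl of pulse_cdc is
    signal toggle_src : std_logic := '0';
    signal sync_dst   : std_logic_vector(2 downto 0) := (others => '0');
    signal prev_dst   : std_logic := '0';
begin
    -- Source domain: each pulse flips the toggle, turning a short pulse
    -- into a level change that cannot be missed by a slower clock.
    process(clk_src)
    begin
        if rising_edge(clk_src) then
            if pulse_src = '1' then
                toggle_src <= not toggle_src;
            end if;
        end if;
    end process;

    -- Destination domain: 3-FF synchroniser, then an edge detector that
    -- emits one pulse for every change of the synchronised toggle.
    process(clk_dst)
    begin
        if rising_edge(clk_dst) then
            sync_dst  <= sync_dst(1 downto 0) & toggle_src;
            prev_dst  <= sync_dst(2);
            pulse_dst <= sync_dst(2) xor prev_dst;
        end if;
    end process;
end architecture rtl;
```

For the "Toggle + 3-FF sync" line buffer swap entry, the same structure would apply without the edge detector: the destination domain uses the synchronised level directly.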


3. Package Structure #

The seven packages below define the target type system for V1.5. During refactoring, modules migrate to these types incrementally — GfxVGA_pkg.vhd is not removed until the last module referencing it has been updated. A module’s port list should contain only types from these packages — no raw std_logic_vector for structured data.

gfx_types_pkg.vhd    Core types: pixel_t, colour_t, coord_t, screen_pos_t
vga_pkg.vhd          VGA timing records, pipeline stage types
sram_pkg.vhd         SRAM request/ack interface types
draw_pkg.vhd         Drawing command record and opcode enum
tile_pkg.vhd         Tile layer config, tile attribute type
sprite_pkg.vhd       Sprite attribute type
vdp_pkg.vhd          Internal VDP signal buses, line buffer types

3.1 Core Types (gfx_types_pkg.vhd) #

-- 5-6-5 pixel colour (V1.5). Widened to 8-8-8 in V2 by changing this package only.
type colour_t is record
    r : std_logic_vector(4 downto 0);
    g : std_logic_vector(5 downto 0);
    b : std_logic_vector(4 downto 0);
end record;
constant COLOUR_BLACK     : colour_t := (others => (others => '0'));
constant COLOUR_TRANSPARENT : colour_t := ...;  -- implementation-defined sentinel

-- Screen coordinate. Signed to allow off-screen drawing.
type coord_t is record
    x : signed(15 downto 0);
    y : signed(15 downto 0);
end record;

-- Unsigned screen position (display domain, always on-screen)
type screen_pos_t is record
    x : unsigned(9 downto 0);   -- 0..639
    y : unsigned(8 downto 0);   -- 0..479
end record;

-- A pixel with compositor metadata attached
type pixel_t is record
    colour      : colour_t;
    priority    : unsigned(2 downto 0);  -- 0 = lowest, 7 = highest
    transparent : std_logic;
    valid       : std_logic;
end record;
constant PIXEL_NULL : pixel_t := (...);

3.2 SRAM Interface (sram_pkg.vhd) #

The interface is burst-oriented, not word-oriented. This is a hard requirement driven by timing budget: fetching 16-bit words individually would exceed the HBlank budget for 4 tile layers + 48 sprites. The existing V1 design already relies on 4-word burst reads for tiles and sprites; V1.5 makes this a first-class interface concept.

-- Maximum burst length supported by SRAMControl.
-- 4 words matches tile row and sprite row data granularity exactly.
-- 8 words is used for bitmap prefetch to reduce per-word overhead on long sequential reads.
constant SRAM_MAX_BURST : integer := 8;

-- Issued once per transfer. burst_len=0 means 1 word; burst_len=7 means 8 words.
type sram_req_t is record
    valid      : std_logic;
    rnw        : std_logic;                   -- '1' = read, '0' = write
    addr       : unsigned(19 downto 0);       -- start address (word-addressed)
    burst_len  : unsigned(2 downto 0);        -- 0..7 → 1..8 words
    ube_n      : std_logic;                   -- byte enables for writes
    lbe_n      : std_logic;
end record;

-- For burst writes, write data is provided word-by-word via a companion stream.
-- The sram_req_t initiates the burst; subsequent words use sram_wdata_t.
type sram_wdata_t is record
    valid  : std_logic;
    data   : std_logic_vector(15 downto 0);
end record;

-- Returned for both reads and write completion.
-- For reads: data_valid pulses once per word (one new rdata word each cycle it is high).
-- done pulses on the same cycle as the final data_valid for reads; one cycle after
-- the last write ack for writes. Clients must not re-assert valid until done is seen.
type sram_ack_t is record
    data_valid : std_logic;                   -- read data present this cycle
    rdata      : std_logic_vector(15 downto 0); -- one 16-bit word, valid when data_valid='1'
    done       : std_logic;                   -- last word of burst completed
end record;

3.2.1 Burst Data Flow #

For read bursts, data_valid pulses once for each word in the burst, and rdata carries that word in the same cycle. The client accumulates words into a local shift register:

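-- Example: 4-word (64-bit) accumulator for tile/sprite prefetch
signal burst_buf : std_logic_vector(63 downto 0);
signal word_cnt  : unsigned(1 downto 0);

process(clk_sys)
begin
    if rising_edge(clk_sys) then
        if sram_ack.data_valid = '1' then
            -- New word enters at the MSB; after 4 words the first word sits in bits 15..0
            burst_buf <= sram_ack.rdata & burst_buf(63 downto 16);
            word_cnt  <= word_cnt + 1;
        end if;
        if sram_ack.done = '1' then
            -- done coincides with the final data_valid, so burst_buf holds all
            -- 4 words from the next clock edge; start tile/sprite row processing then
            null;
        end if;
    end if;
end process;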

For 8-word bitmap bursts, the client uses a 128-bit accumulator. The accumulator feeds the line buffer write engine, which converts 16-bit SRAM words to the line buffer entry format (palette index or direct colour) and writes them sequentially.

Write bursts: sram_wdata_t.valid and .data must be asserted one cycle after sram_req_t.valid for the first word, then each subsequent cycle for remaining words. The client is responsible for holding the data stream valid. sram_ack.done signals completion of all writes.

3.2.2 Timing Budget (100 MHz SYS clock, 640×480) #

One full line period = 800 pixel clocks = 3,200 SYS cycles.

Async SRAM page-mode access cost at 100 MHz (~10 ns SRAM):

  • First word of any burst: ~3 cycles (address setup + access time)
  • Each subsequent word (same page): ~2 cycles

Burst length  Total cycles  Cycles/word
────────────  ────────────  ───────────
1 word        3             3.0
4 words       9             2.25
8 words       17            2.1

Bitmap fetch words per line by mode:

Words per line = (display_width × bpp) / 16. The table below shows the worst and representative cases. 640×480 16bpp is the worst case; all other modes are cheaper. Sub-16bpp drawing is a V2 feature, but sub-16bpp display (palette-indexed framebuffer) is supported in V1.5 and benefits from the reduced prefetch cost.

Bitmap mode             Width  bpp  Words/line  8-word bursts  Cycles
──────────────────────  ─────  ───  ──────────  ─────────────  ──────
Hires RGB (worst case)  640    16   640         80             1,360
Hires 8bpp indexed      640    8    320         40             680
Hires 4bpp indexed      640    4    160         20             340
Hires 2bpp indexed      640    2    80          10             170
Hires 1bpp mono         640    1    40          5              85
Lowres RGB              320    16   320         40             680
Lowres 8bpp indexed     320    8    160         20             340
Lowres 4bpp indexed     320    4    80          10             170

The budget table below uses the worst case (640×480 16bpp). Any lower bit depth or 320-pixel mode reduces bitmap cost proportionally, freeing more headroom for the drawing engine.

Budget with all clients active (640×480 16bpp, 4 tile layers, 48 sprites) — worst case:

Client                          Burst len  Bursts/line  Cycles
──────────────────────────────  ─────────  ───────────  ───────────
4 tile layers (320-px logical)  4 words    80 total     720
Sprites (48 max per line)       4 words    48           432
Bitmap prefetch (640-px 16bpp)  8 words    80           1,360
Text char map prefetch          4 words    20           180
Total display                                           2,692 (84%)
Headroom for drawing engine                             ~508 cycles
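
The budget arithmetic can be kept honest with a small elaboration-time check. The sketch below encodes the cost model from this section (3 cycles for the first word, 2 for each further word) and asserts the totals from the table. The entity name sram_budget_check and the constant names are hypothetical; only the numbers come from this document.

```vhdl
-- Sketch only: a port-less entity whose assertions fail at elaboration/simulation
-- start if the budget figures drift. Uses only types from package STANDARD.
entity sram_budget_check is
end entity sram_budget_check;

architecture sim of sram_budget_check is
    constant FIRST_WORD_CYCLES : natural := 3;        -- address setup + access time
    constant NEXT_WORD_CYCLES  : natural := 2;        -- same-page follow-on word
    constant LINE_CYCLES       : natural := 800 * 4;  -- 800 pixel clocks at 4 SYS cycles each

    -- Cost of one burst of the given length under the model above.
    function burst_cycles(words : positive) return natural is
    begin
        return FIRST_WORD_CYCLES + (words - 1) * NEXT_WORD_CYCLES;
    end function burst_cycles;

    constant TILE_CYCLES   : natural := 80 * burst_cycles(4);  -- 4 layers, 80 bursts total
    constant SPRITE_CYCLES : natural := 48 * burst_cycles(4);  -- 48 sprites per line
    constant BITMAP_CYCLES : natural := 80 * burst_cycles(8);  -- 640 words as 8-word bursts
    constant TEXT_CYCLES   : natural := 20 * burst_cycles(4);  -- 80 char entries
    constant DISPLAY_TOTAL : natural :=
        TILE_CYCLES + SPRITE_CYCLES + BITMAP_CYCLES + TEXT_CYCLES;
begin
    assert burst_cycles(4) = 9 and burst_cycles(8) = 17
        report "burst cost model does not match the table" severity failure;
    assert DISPLAY_TOTAL = 2692
        report "display total does not match the budget table" severity failure;
    assert LINE_CYCLES - DISPLAY_TOTAL = 508
        report "drawing engine headroom does not match the budget table" severity failure;
end architecture sim;
```

If the SRAM timing or the client mix changes, updating the constants in one place shows immediately whether the line budget still closes.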

Bitmap uses 8-word bursts to reduce the per-word overhead versus 4-word bursts (would cost 1,440 cycles for the same 640 words). Tiles and sprites use 4-word bursts to match their natural data granularity (one sprite row = 16 pixels × 4bpp = 64 bits = 4 words).

Bitmap must NOT use a full-line single transaction. By the cost model above, a 640-word monolithic fetch would take about 1,281 cycles (3 + 639 × 2), blocking all other SRAM clients for roughly 40% of the line period in one uninterrupted stretch. The bitmap prefetch controller instead issues 8-word bursts and yields between them, interleaving with tile and sprite access. Total bandwidth consumed is nearly identical (1,360 versus ~1,281 cycles); the difference is fairness.

Client burst length assignments:

Client                burst_len value  Words  Rationale
────────────────────  ───────────────  ─────  ────────────────────────────────────────
Tile prefetch         3 (4 words)      4      16 pixels × 4bpp = 64 bits per burst
Sprite prefetch       3 (4 words)      4      One sprite row exactly
Bitmap prefetch       7 (8 words)      8      Reduce overhead on long sequential reads
Text char map         3 (4 words)      4      4 character entries per burst
Drawing engine spans  0..7             1..8   Depends on span width
CPU DMA               0 (1 word)       1      CPU is waiting for DTACK

All SRAM clients present a sram_req_t to the arbiter. The arbiter routes the winning request to SRAMControl, which is the only module that drives the physical bus.

3.3 Drawing Commands (draw_pkg.vhd) #

type draw_op_t is (
    DRAW_NOP,
    DRAW_PIXEL,
    DRAW_LINE,
    DRAW_RECT_OUTLINE,
    DRAW_RECT_FILL,
    DRAW_TRIANGLE_OUTLINE,
    DRAW_TRIANGLE_FILL,
    DRAW_BLIT,          -- rectangular region copy from SRAM
    DRAW_EXPAND,        -- expand 16-bit 1bpp bitmap to pixels using colour0/colour1
    DRAW_FILL_SPAN      -- internal: drawing engine emits these to SRAM writer
);

type draw_mode_t is (
    MODE_REPLACE,
    MODE_XOR,
    MODE_OR,
    MODE_AND,
    MODE_TRANSPARENT    -- skip if source matches transparency colour
);

type draw_cmd_t is record
    op       : draw_op_t;
    mode     : draw_mode_t;
    colour0  : colour_t;    -- foreground / primary colour
    colour1  : colour_t;    -- background / secondary colour (patterns, expand)
    x0, y0   : signed(15 downto 0);
    x1, y1   : signed(15 downto 0);
    x2, y2   : signed(15 downto 0);  -- triangle third vertex / blit WxH
    src_addr : unsigned(19 downto 0); -- blit source
    pattern  : std_logic_vector(15 downto 0); -- inline dash/expand pattern
end record;

DRAW_EXPAND takes a 16-bit 1bpp bitmap in pattern, a position (x0,y0), and a pixel count, expanding each bit to colour0 (bit=1) or colour1 (bit=0). This is the primary mechanism for hardware text/icon rendering: the CPU queues one EXPAND command per glyph row into the command FIFO, then continues preparing the next character while the drawing engine renders independently.

OR and AND modes follow the same read-modify-write path as XOR, implemented in the centralised SRAM writer (see §10.2).

The CPU fills a register window and writes a COMMIT register to push one draw_cmd_t into the command FIFO. The drawing engine pops commands and processes them entirely in the SYS domain. The CPU does not directly drive any drawing state machine signals.


4. VGA Pixel Pipeline #

This is the recommended starting point for the new project. Get a test pattern on screen before adding any other subsystem.

4.1 Pipeline Stages #

All stages clock at 25 MHz (PIX domain). The pipeline runs one pixel per clock.

Stage  Signal             Description
─────  ─────────────────  ─────────────────────────────────────────────────────
  0    H_cnt, V_cnt       Pixel counters (raw, including blanking)
  1    lb_addr            Line buffer read address = H_cnt - PIPELINE_DEPTH
       lb_sel             Which double-buffer half to read from
  2    lb_data_raw        Line buffer BRAM output (1-cycle registered read)
  3    pal_addr           Palette index extracted from lb_data_raw (if indexed mode)
  4    pal_colour         Palette BRAM output (1-cycle registered read)
       sprite_colour      Sprite line buffer output (registered)
  5    comp_out           Compositor selects highest-priority non-transparent pixel
  6    vga_out            Registered to I/O pins with sync signals

PIPELINE_DEPTH = 5 (for the above 6-stage pipeline, stages 1-5 after the counter).
HSYNC and VSYNC are passed through a 5-stage shift register at 25 MHz so they arrive at the output register in the same cycle as the pixel they correspond to.

-- Sync delay shift register (in VGATiming or a wrapper)
signal hsync_pipe : std_logic_vector(PIPELINE_DEPTH-1 downto 0);
signal vsync_pipe : std_logic_vector(PIPELINE_DEPTH-1 downto 0);
process(clk_pix)
begin
    if rising_edge(clk_pix) then
        hsync_pipe <= hsync_pipe(PIPELINE_DEPTH-2 downto 0) & hsync_raw;
        vsync_pipe <= vsync_pipe(PIPELINE_DEPTH-2 downto 0) & vsync_raw;
    end if;
end process;
vga_hsync_o <= hsync_pipe(PIPELINE_DEPTH-1);
vga_vsync_o <= vsync_pipe(PIPELINE_DEPTH-1);

The line buffer read address is similarly offset:

lb_read_x <= H_cnt - PIPELINE_DEPTH;   -- wraps in blanking, gated by H_active

This approach needs no FIFO and no clock domain above 25 MHz for the pixel path. PIPELINE_DEPTH is a package constant so changing the pipeline depth in one place automatically corrects the sync delay everywhere.

4.2 Compositor #

The compositor (Stage 5) selects the output pixel from several input sources:

Input sources (each a pixel_t with priority and transparent flag):
  - sprite_pix       (from sprite line buffer, SYS domain written, PIX domain read)
  - tile1_pix .. tileN_pix  (from tile line buffers)
  - bitmap_pix       (from bitmap/framebuffer line buffer)
  - text_pix         (from text engine line buffer)
  - background_pix   (solid colour fallback, never transparent, lowest priority)

Selection rule: the highest .priority among non-transparent pixels wins. On equal priority, the source that appears earlier in the array wins (sprite before tiles, in port-list order), so the result is deterministic.

function compositor_select(
    sources : pixel_array_t   -- array of pixel_t, ordered for tie-breaking
) return pixel_t is
    variable best : pixel_t := PIXEL_NULL;
begin
    for i in sources'range loop
        if sources(i).valid = '1' and sources(i).transparent = '0' then
            if best.valid = '0' or
               unsigned(sources(i).priority) > unsigned(best.priority) then
                best := sources(i);
            end if;
        end if;
    end loop;
    return best;
end function;

This function lives in vdp_pkg.vhd. Adding a new layer means adding one element to the array — the compositor logic doesn’t change.

4.3 Palette Lookup #

At Stage 3/4, if the active mode uses indexed colour (8bpp palette mode):

  • The line buffer stores 8-bit palette indices, not direct colour values
  • The palette BRAM is read with that index
  • The output is a colour_t (5-6-5 for V1.5)
  • In direct colour mode (5-6-5 stored in line buffer), stages 3/4 are bypassed

For V2 the palette entry widens from 16-bit to 24-bit. Only colour_t and the palette RAM width change — the pipeline structure is identical.

4.4 Double-Buffered Line Buffers #

Each tile/sprite layer has a line buffer pair (two BRAMs):

  • Buffer A: being read by the PIX domain compositor (current display line)
  • Buffer B: being written by the SYS domain prefetch engine (next line)

At HBlank end: the buffers swap. The swap signal crosses SYS→PIX via the toggle CDC pattern.

SYS writes to "write_buf" pointer (alternates 0/1 each line)
PIX reads from "read_buf" pointer (= ~write_buf, synchronised through toggle CDC)

The PIX domain never writes line buffers. The SYS domain never reads line buffers after the swap signal is issued. This eliminates all dual-clock BRAM complexity in the line buffers — each BRAM port is used by exactly one clock domain.


5. Prefetch Architecture #

Prefetch is entirely in the SYS (100 MHz) domain. The PIX domain only triggers it.

5.1 Prefetch Trigger #

The ClockVGAtoSYS module delivers two prefetch pulses to the SYS domain:

  • prefetch_vga_start_sys: start fetching tile/sprite data for VGA-domain line N+1
  • prefetch_log_start_sys: start fetching logical (source) line content for the line buffer

For 640×480: logical line = VGA line. Prefetch starts at the beginning of HBlank (one VGA line ahead of display).

For 320×240 (pixel doubled):

  • One logical line covers two VGA lines
  • Prefetch starts two VGA lines ahead: prefetch_vga_start_sys is issued on the first of the two VGA lines that display the previous logical line
  • The line buffer holds 320 entries; the PIX domain pixel-doubles by using H_cnt(9 downto 1) (dividing horizontal address by 2) as the line buffer index
  • Vertical doubling: same line buffer is used for two consecutive VGA lines (no swap at the halfway point)

The mode (640×480 or 320×240) is a register in the SYS domain. VGA timing is always 640×480 at 25 MHz at the output — the mode only affects addressing and buffer indexing.
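
The horizontal pixel doubling described above amounts to a one-line address selection on the PIX side. A minimal sketch follows; the package name, the function name lb_index and the lowres flag are hypothetical, and the flag is assumed to be already stable in the PIX domain when it is used.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

package lb_index_sketch_pkg is
    -- Maps the (pipeline-compensated) horizontal counter to a line buffer index.
    function lb_index(
        h_cnt  : unsigned(9 downto 0);
        lowres : std_logic               -- '1' = 320x240 mode (hypothetical flag)
    ) return unsigned;
end package lb_index_sketch_pkg;

package body lb_index_sketch_pkg is
    function lb_index(
        h_cnt  : unsigned(9 downto 0);
        lowres : std_logic
    ) return unsigned is
    begin
        if lowres = '1' then
            -- 320-pixel mode: drop bit 0 so each entry is shown for two pixel clocks.
            return '0' & h_cnt(9 downto 1);
        else
            -- 640-pixel mode: one line buffer entry per pixel clock.
            return h_cnt;
        end if;
    end function lb_index;
end package body lb_index_sketch_pkg;
```

Vertical doubling needs no address change at all: the read buffer select simply stays the same for two consecutive VGA lines.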

5.2 Prefetch Controllers #

Each data source has its own prefetch controller:

Controller      Triggers on             Outputs to
──────────────  ──────────────────────  ──────────────────────────────────────────
TilePrefetch    prefetch_log_start_sys  SRAM arbiter req
BitmapPrefetch  prefetch_log_start_sys  SRAM arbiter req
TextPrefetch    prefetch_log_start_sys  SRAM arbiter req (char map) + FontRAM read
SpriteLineSel   prefetch_vga_start_sys  Sprite attrib BRAM

All prefetch controllers present sram_req_t to the arbiter. They use req/ack internally and are unaware of each other. The arbiter handles contention.


6. Consolidated Tile Layer Architecture #

6.1 Single Generic TileLayer Module #

Replace the four separate tile generators with one TileLayer entity:

entity TileLayer is
    generic (
        LAYER_ID    : integer := 0;
        TILEMAP_W   : integer := 64;   -- tiles per row in the tile map
        TILEMAP_H   : integer := 64;   -- tiles per column
        TILE_W      : integer := 8;    -- pixels per tile (horizontal)
        TILE_H      : integer := 8     -- pixels per tile (vertical)
    );
    port (
        clk_sys     : in  std_logic;
        reset_n     : in  std_logic;
        -- Configuration (from register file, SYS domain)
        cfg         : in  tile_layer_cfg_t;  -- see tile_pkg.vhd
        -- Prefetch trigger
        fetch_start : in  std_logic;
        fetch_line  : in  unsigned(8 downto 0);
        -- SRAM interface
        sram_req    : out sram_req_t;
        sram_ack    : in  sram_ack_t;
        -- Line buffer output (written by SYS, read by PIX via double buffer)
        lb_wr_addr  : out unsigned(9 downto 0);
        lb_wr_data  : out std_logic_vector(15 downto 0);  -- see lb_entry_t in vdp_pkg
        lb_wr_en    : out std_logic;
        lb_swap     : out std_logic   -- toggle signal, PIX domain syncs this
    );
end TileLayer;

tile_layer_cfg_t holds scroll X/Y, enable, base addresses, tile size override, and critically: a priority value (unsigned(2 downto 0)). This priority is passed into the pixel_t.priority field of every pixel this layer produces, allowing the compositor to arbitrate without any fixed wiring.

tile_layer_cfg_t also holds a transparent_idx field (unsigned(3 downto 0)): the palette index that the renderer treats as transparent. Default value 0 preserves V1 behaviour. Setting a different index frees palette entry 0 as a usable colour. Each layer has its own transparent_idx — layers can use different transparent indices independently. The renderer sets pixel_t.transparent = '1' when the pixel’s palette index equals cfg.transparent_idx; the compositor never sees raw palette indices, only the flag.

Important: With transparent_idx configurable, palette index 0 is no longer reserved. Firmware ported from V1 that relies on index 0 being transparent should verify the register default or explicitly write 0 to LAYER_CTRL.transparent_idx.

colour_mode = "00" is the V1.5 native format: 4bpp indexed colour (4 pixels packed per 16-bit word, nibble extraction). "01" reserved for future 8bpp indexed.
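
The nibble extraction and the transparent_idx comparison could look roughly like this. It is a sketch under stated assumptions: the package and function names are hypothetical, and the nibble order (leftmost pixel in bits 15..12) is assumed here and must be checked against the V1 tile data layout.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

package tile_pixel_sketch_pkg is
    -- Palette index of pixel n (0 = leftmost) within one 16-bit tile data word.
    function pixel_index(
        word : std_logic_vector(15 downto 0);
        n    : natural range 0 to 3
    ) return unsigned;

    -- '1' when the pixel's palette index equals the layer's transparent_idx.
    function is_transparent(
        idx             : unsigned(3 downto 0);
        transparent_idx : unsigned(3 downto 0)
    ) return std_logic;
end package tile_pixel_sketch_pkg;

package body tile_pixel_sketch_pkg is
    function pixel_index(
        word : std_logic_vector(15 downto 0);
        n    : natural range 0 to 3
    ) return unsigned is
        -- ASSUMPTION: pixel 0 occupies bits 15..12, pixel 3 occupies bits 3..0.
        constant HI : natural := 15 - 4 * n;
    begin
        return unsigned(word(HI downto HI - 3));
    end function pixel_index;

    function is_transparent(
        idx             : unsigned(3 downto 0);
        transparent_idx : unsigned(3 downto 0)
    ) return std_logic is
    begin
        if idx = transparent_idx then
            return '1';
        end if;
        return '0';
    end function is_transparent;
end package body tile_pixel_sketch_pkg;
```

The renderer would drive pixel_t.transparent with the result of is_transparent, so the compositor only ever sees the flag.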

Line buffer sizing: The line buffer depth is derived from TILE_W * TILEMAP_W * 2 (double-buffered). Standard tile mode (8×40 = 320 entries × 2 = 640) fits in one RAMB18. A 640-pixel-wide mode (8×80 = 640 entries × 2 = 1280) requires a RAMB36. This is handled automatically by synthesis when the BRAM is declared with the derived depth — no structural change to TileLayer is needed. The LAYER_ID generic identifies which BRAM primitive ISE should infer.

6.2 Tile Attribute Format (V1-compatible) #

The tile map entry is a single 16-bit word per tile. This is a deliberate constraint: the 68000 can update any tile map entry with one word write (a single bus cycle). The format is carried forward from V1 unchanged:

Tile attribute word (16-bit):
  [15:13]  Palette select (3-bit, palette bank 0..7)
  [12]     V-flip
  [11]     H-flip
  [10:0]   Tile index (11-bit, 0..2047 — 2048 tiles)
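
A typed view of this word, in the spirit of the package rule in §3, might be sketched as below. The record name tile_attr_t, the package name and both function names are hypothetical; only the bit positions come from the format above.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

package tile_attr_sketch_pkg is
    type tile_attr_t is record
        palette    : unsigned(2 downto 0);   -- [15:13] palette bank 0..7
        vflip      : std_logic;              -- [12]
        hflip      : std_logic;              -- [11]
        tile_index : unsigned(10 downto 0);  -- [10:0]  tile 0..2047
    end record;

    function to_tile_attr(w : std_logic_vector(15 downto 0)) return tile_attr_t;
    function to_slv(a : tile_attr_t) return std_logic_vector;
end package tile_attr_sketch_pkg;

package body tile_attr_sketch_pkg is
    -- Unpack a raw tile map word as read from SRAM.
    function to_tile_attr(w : std_logic_vector(15 downto 0)) return tile_attr_t is
        variable a : tile_attr_t;
    begin
        a.palette    := unsigned(w(15 downto 13));
        a.vflip      := w(12);
        a.hflip      := w(11);
        a.tile_index := unsigned(w(10 downto 0));
        return a;
    end function to_tile_attr;

    -- Pack the record back into the 16-bit V1 layout.
    function to_slv(a : tile_attr_t) return std_logic_vector is
        variable w : std_logic_vector(15 downto 0);
    begin
        w(15 downto 13) := std_logic_vector(a.palette);
        w(12)           := a.vflip;
        w(11)           := a.hflip;
        w(10 downto 0)  := std_logic_vector(a.tile_index);
        return w;
    end function to_slv;
end package body tile_attr_sketch_pkg;
```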

Priority in V1.5 is layer-level only, set via tile_layer_cfg_t.priority. There is no per-tile priority override field in the tile attribute word; all pixels from a layer carry the same priority value through the compositor.

V2 per-tile priority path: switching to a 32-bit tile map entry (two consecutive 16-bit words) would free 16 bits for priority override, extended tile index, and other attributes without altering the tile graphics data format or SRAM layout. The CPU would write two words per tile update — still within the 68000 bus model and acceptable given that full-tile-map updates are done at frame boundaries, not per-pixel.


7. Scroll Tables #

7.1 X Scroll Table (existing, carried forward from V1) #

Each tile layer has a per-row X scroll lookup that allows independent horizontal scroll positions for each display line (raster-bar scroll, wavy effects). This already exists in V1 and is carried forward unchanged. The table is stored in a BRAM local to each tile layer (or shared with configurable base offset). No SRAM bandwidth is consumed.

7.2 Y Scroll Table (deferred — not a V1.5 implementation target) #

Decision: Per-column Y scroll is reserved for V2 or as an optional add-on if a specific game effect requires it. The register space in tile layer block offset +7 (YSCROLL_CTRL) is reserved; the table is not implemented in V1.5.

Rationale: Per-column Y scrolling is primarily a demo-scene effect (Sega Mega Drive-style column waves). Game code can achieve most useful vertical scroll effects with global Y scroll per layer. The implementation cost in SRAM bandwidth is non-trivial (see below), and it is not worth trading core reliability for.

Design notes preserved for V2 reference:

BRAM registered-output reads require 2 clock cycles on real hardware (not 1 as simulation shows) when address computation involves offset addition. This was confirmed in V1: tile maps were originally BRAM-based and moved to SRAM, improving CPU random-access performance by eliminating indirect write overhead. The same constraint applies to any BRAM-based scroll table: addresses must be computed a cycle ahead.

SRAM bandwidth for per-column Y scroll: global Y scroll keeps tile map reads sequential and burst-able (~960 SYS cycles per layer). Per-column scatter breaks burst grouping; worst-case scatter per layer is similar in cycle count but consumes the remaining headroom when all 4 layers are active. A one-layer-at-a-time restriction would be required on this hardware. On V2 (wider SRAM bus, higher clock ceiling) the constraint relaxes. A VBLANK bulk-copy DMA (SRAM → BRAM tile map shadow) would eliminate scatter entirely at the cost of one-frame update latency — acceptable for game code and a clean V2 solution.

7.3 Priority Interaction Between Layers and Sprites #

V1 model (carried forward in V1.5):
Tile layers have fixed compositor positions (layer 0 = bottom, layer 3 = top of tile stack). Sprites use a 4-bit priority field (0-15) for sprite-to-sprite Z-ordering: a higher value draws over a lower value at the same pixel. The PixelCompositor maps the sprite priority value to a position relative to the tile layers. The per-layer register priority (tile_layer_cfg_t.priority) is new in V1.5 and is introduced during the TileLayer consolidation stage (Stage 7); until then the compositor retains the V1 fixed tile ordering.

Source           V1 priority model        V1.5 refactor target
───────────────  ───────────────────────  ─────────────────────────────────
Background fill  Lowest (fallback)        Unchanged
Tile layer 0     Fixed position 0         Configurable via tile_layer_cfg_t
Tile layer 1     Fixed position 1         Configurable via tile_layer_cfg_t
Tile layer 2     Fixed position 2         Configurable via tile_layer_cfg_t
Tile layer 3     Fixed position 3         Configurable via tile_layer_cfg_t
Sprites          4-bit per-sprite (0-15)  Retained; maps via compositor
Overlay/text     Above all                Unchanged

Sprite priority field (4-bit, 0-15): stored in sprite_attr_t.priority, bits [62:59] of the 64-bit attribute word. Within a scanline, higher priority sprites overwrite lower priority sprites in the sprite line buffer during rendering. Bits [58:57] of the attribute word are unused and reserved.

Sprite transparent index: A single global SPRITE_TRANS_IDX register (4-bit, default 0) in the Sprite register group sets the palette index treated as transparent across all sprites. This matches the per-layer transparent_idx approach for tiles. Per-sprite transparent index is not supported in V1.5: the unused and reserved bits of the 64-bit attribute word are left untouched to preserve the V1 format, and a global register covers the common case cleanly.

Full priority unification (all sources compared via pixel_t.priority) is a later refactoring milestone or V2 target — it requires compositor changes that touch the display output and should not be attempted until all other stages are stable.


8. Sprite Engine #

The V1 sprite engine consists of three modules that are retained and cleaned up in V1.5:

Module              Role
──────────────────  ─────────────────────────────────────────────────────────────────────────────────────
SpriteAttribRAM     Dual-port BRAM, 256 × 64-bit attribute words
SpriteLineSelector  Scans all 256 sprites each HBlank; writes up to 48 visible entries to SpriteLineRAM
SpriteLineRenderer  Reads SpriteLineRAM, fetches tile data from SRAM, writes pixels to sprite line buffer

Sprite attribute word (64-bit) — V1 format, preserved for CPU compatibility:

[63]       enable
[62:59]    priority    4-bit, 0-15 (sprite Z-order; higher overwrites lower)
[58:57]    unused      reserved, must be written as 0
[56]       flip_x
[55]       flip_y
[54:51]    palette     4-bit, 0-15 palette bank
[50:40]    tile_index  11-bit, 0-2047 (same tile format as tile layers, 4bpp)
[39:30]    x           10-bit signed, logical pixel coordinate (-512..511)
[29:20]    y           10-bit signed, logical pixel coordinate (-512..511)
[19:0]     reserved

Sprites are fixed 16×16 pixels, 4bpp, operating in 320×240 logical space (same as tile layers). Maximum 48 sprites per scanline; the selector silently drops extras beyond that limit.

SpriteLineRAM intermediate entry (36-bit):

[35]       flip_x
[34:31]    priority    (carried from attribute)
[30:29]    unused
[28:25]    row         4-bit (0-15; flip-Y applied by selector)
[24:15]    x           10-bit signed
[14:11]    palette     4-bit
[10:0]     tile_index  11-bit

The sprite_pkg.vhd types sprite_attr_t and sprite_line_entry_t formalise both formats with to_/from_ conversion functions, replacing the raw bit-slice access currently scattered across the modules.
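
The row field of the SpriteLineRAM entry is where the selector's visibility test and flip-Y handling meet. A possible formulation is sketched below; the package, record and function names are hypothetical, and it assumes the y coordinate marks the sprite's top edge and that the line number is the logical (320×240 space) line being prepared.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

package sprite_row_sketch_pkg is
    type sprite_row_t is record
        visible : std_logic;              -- '1' if the sprite covers this line
        row     : unsigned(3 downto 0);   -- row within the 16x16 sprite, flip-Y applied
    end record;

    function sprite_row(
        y        : signed(9 downto 0);    -- sprite top edge, from attribute bits [29:20]
        flip_y   : std_logic;             -- attribute bit [55]
        scanline : unsigned(8 downto 0)   -- logical line being prepared
    ) return sprite_row_t;
end package sprite_row_sketch_pkg;

package body sprite_row_sketch_pkg is
    function sprite_row(
        y        : signed(9 downto 0);
        flip_y   : std_logic;
        scanline : unsigned(8 downto 0)
    ) return sprite_row_t is
        variable delta : signed(10 downto 0);
        variable r     : sprite_row_t := (visible => '0', row => "0000");
    begin
        -- Distance from the sprite's top edge to the line; 11 bits hold the full range.
        delta := signed(resize(scanline, 11)) - resize(y, 11);
        if delta >= 0 and delta <= 15 then
            r.visible := '1';
            r.row     := unsigned(delta(3 downto 0));
            if flip_y = '1' then
                r.row := 15 - r.row;      -- mirror vertically
            end if;
        end if;
        return r;
    end function sprite_row;
end package body sprite_row_sketch_pkg;
```

Because the flip is resolved here, the renderer can treat row as a plain offset into the tile data and needs no flip_y bit in the 36-bit entry.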


9. Text Subsystem #

The text engine is retained as a separate module in V1.5. The primary change from V1 is moving the character map from a dedicated BRAM to SRAM. The font glyph BRAM is retained. See §9.6 for the V2 path to integrate text as a TileLayer mode.

9.1 Text Buffer in SRAM #

The text character map is stored in SRAM at a configurable base address, pointed to by TEXT_BASE registers. This matches the framebuffer and tile map model.

Benefits over the V1 BRAM character map:

  • Scroll is a pointer increment: advance TEXT_BASE by one row’s worth of character entries. No data movement. This is the critical improvement for EmuTOS terminal operation.
  • Direct CPU writes: the CPU addresses text positions as *(text_base + row*cols + col). No index/data register protocol. The 68000 can write to any character position in one bus cycle.
  • Multiple text screens: swap TEXT_BASE to show a different text page, same as framebuffer page flipping.
  • DMA on text buffer: CMD_MEM_FILL clears the screen in one drawing engine command; CMD_MEM_COPY copies text regions.
  • Frees one RAMB18 that was previously dedicated to the text character map.

The text prefetch reads a row of character entries from SRAM each HBlank using the same arbiter mechanism as tile prefetch. For an 80-column display this is 80 words per line, fetched as 20 four-word bursts (about 180 SYS cycles in the §3.2.2 budget), which is well within the line budget.

9.2 Text Character Map Entry (16-bit per character position) #

Per-character colour is a fundamental text mode feature: terminal emulators set individual character colours via ANSI escape codes, and many text UI conventions (errors in red, headings highlighted) depend on it. Global fg/bg registers cannot provide this. The attribute format therefore embeds colour per character:

Text map word (16-bit):
  [15]     Blink          Per-character foreground blink
  [14:12]  Background     3-bit palette index (0..7)
  [11:8]   Foreground     4-bit palette index (0..15)
  [7:0]    Character      Character code 0..255

This matches the classic CGA/EGA text attribute encoding and is compatible with the V1 text attribute format. Colour indices reference the main palette (same 256-entry palette used by tiles and sprites).

Blink and background colour interaction (same as CGA): when blink is not used, bit 15 can be treated as a 4th background colour bit, extending background to 16 colours (palette entries 0..15). The text engine always respects bit 15 as blink; firmware that wants 16 bg colours simply never uses blink. This is a firmware convention, not a mode switch — no register controls it.

Bold/italic are not hardware attributes. They require a separate bold or italic font to be loaded — a software concern. Attribute bits for these are not allocated; the hardware cannot implement them without font support that is outside the VDP’s scope.

9.3 Font Glyphs in BRAM (pre-initialised) #

Font glyph data (pixel patterns) remains in a dedicated BRAM:

  • 256 characters × 16 rows × 1 word/row = 4,096 words (8 KB)
  • Stored as 16-bit words: upper 8 bits = row pixels, MSB = leftmost pixel; lower 8 bits spare
  • Pre-initialised at FPGA configuration using BRAM init attributes: display works at power-on before the CPU has executed any initialisation code
  • CPU can upload a custom font via the FONT_ADDR / FONT_DATA register interface (with auto-increment, same as V1); the font BRAM owns this interface directly
  • BRAM read is single-cycle, no SRAM arbiter access required for font data

This is the existing FontRAM module, reframed as an internal resource of the text engine rather than a global BRAM managed by RegisterControl.
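As an illustration of the storage layout in the list above, the sketch below computes the BRAM word address of a glyph row and picks one pixel out of the row word. It assumes glyphs are stored character-major (16 consecutive words per character), which the 256 × 16 organisation suggests but the text does not state outright; the package and function names are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical helper package; names and the character-major layout are assumptions.
package font_addr_sketch_pkg is
    function glyph_row_addr (char_code : unsigned(7 downto 0);
                             row       : unsigned(3 downto 0)) return unsigned;
    function glyph_pixel (row_word : std_logic_vector(15 downto 0);
                          col      : unsigned(2 downto 0)) return std_logic;
end package font_addr_sketch_pkg;

package body font_addr_sketch_pkg is
    -- 12-bit word address: char_code * 16 + row, formed by concatenation.
    function glyph_row_addr (char_code : unsigned(7 downto 0);
                             row       : unsigned(3 downto 0)) return unsigned is
    begin
        return char_code & row;
    end function glyph_row_addr;

    -- Row pixels sit in the upper byte, MSB = leftmost pixel (col 0).
    function glyph_pixel (row_word : std_logic_vector(15 downto 0);
                          col      : unsigned(2 downto 0)) return std_logic is
    begin
        return row_word(15 - to_integer(col));
    end function glyph_pixel;
end package body font_addr_sketch_pkg;
```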

9.4 Text Engine Register Block #

The text engine owns its own register group (Group 0x2) in the distributed register scheme:

| Offset | Register | Contents |
|---|---|---|
| +0 | TEXT_BASE_L | Character map SRAM base address, lower 16 bits |
| +1 | TEXT_BASE_H | Character map SRAM base address, upper 4 bits |
| +2 | TEXT_CURSOR | [15:8] cursor row, [7:0] cursor column |
| +3 | TEXT_CURSOR_ATTR | Cursor colour: [7:4] background index (0..7), [3:0] foreground index (0..15) |
| +4 | TEXT_SCROLL_X | Horizontal fine scroll (0..7, sub-character pixel shift) |
| +5 | TEXT_SCROLL_Y | Vertical scroll (row count, 0..N) |
| +6 | TEXT_CTRL | [0]: enable; [1]: cursor enable; [3:2]: blink rate (frames per half-period); [5:4]: columns (00=80, 01=40) |
| +7 | spare | |

9.5 Hardware Cursor #

The text engine tracks cursor position from TEXT_CURSOR. When the rendered character position matches and cursor is enabled, the cursor overlay is applied: the character cell is drawn using the colours from TEXT_CURSOR_ATTR rather than the character’s own attribute byte, effectively highlighting it. Blink is driven by the same frame counter that controls character blink (same rate, same TEXT_CTRL blink field).

Cursor style (underline vs full-block) is controlled via TEXT_CTRL. The cursor does not need a separate BRAM or sprite; it is a conditional colour substitution during character rendering.

The existing hardware cursor sprite (Group 0xA registers) is retained independently for pointer/GUI cursor use and is no longer the primary text cursor mechanism in V1.5.

9.6 V2 Path: Text as TileLayer Mode #

For V2, text rendering can be integrated as a colour_mode = 1bpp variant of the generic TileLayer entity:

  • TILE_H = 16 generic covers 8×16 characters directly; memory footprint per tile is identical to a 4bpp 8×8 tile (16 words each)
  • TILEMAP_W = 80 for 640-pixel text — line buffer becomes 80×8 = 640 pixels wide, double-buffered to 1280 entries, requiring a RAMB36 (vs RAMB18 for 320-pixel tile layers). XC6SLX25 has 26 RAMB36 equivalents; with V1.5 BRAM savings this is affordable.
  • The attribute word is reinterpreted in 1bpp mode using the same V1.5 per-character encoding: [15]=blink, [14:12]=bg palette index (0..7), [11:8]=fg palette index (0..15), [7:0]=character code; transparent_idx is not applicable (1bpp uses palette indices per-character, not a single transparency key)
  • Font glyphs live in a BRAM local to the layer (same as V1.5 FontRAM, just re-owned)
  • Cursor position and style become fields in tile_layer_cfg_t for text mode
  • Removes the separate TextGen module entirely

This path is deferred to V2 to avoid touching the compositor and display pipeline while the V1.5 foundation is still being established.


10. Drawing Engine #

10.1 Command FIFO #

The CPU writes drawing parameters into a register window and then writes the CMD_COMMIT register (shown as REG_COMMIT in the diagram below). This pushes a draw_cmd_t into a BRAM-based FIFO. The drawing engine (SYS domain) pops and processes commands independently. The CPU can poll a FIFO_FULL status bit or be interrupted when the FIFO has space.

CPU side (50 MHz)              Drawing Engine (100 MHz)
┌──────────────────┐           ┌──────────────────────────┐
│  Register file   │──write──▶ │  Command FIFO (BRAM)     │
│  PARAM_X0..X2    │           │  (dual-port: CPU writes, │
│  PARAM_Y0..Y2    │           │   engine reads)          │
│  PARAM_COLOUR    │           └────────────┬─────────────┘
│  PARAM_OP/MODE   │                        │ draw_cmd_t
│  REG_COMMIT ─────┼──push─────▶           ▼
│  REG_STATUS ◀────┼──busy/full────  DrawDispatch
└──────────────────┘                  │         │
                                      ▼         ▼
                                  LineOp    TriangleOp ...
                                      │         │
                                      └────┬────┘
                                           ▼
                                     Span FIFO
                                           │
                                           ▼
                                     SRAM writer (req/ack)

Key benefit: The CPU fills the FIFO non-blocking and continues preparing the next command while the drawing engine renders. This is the primary performance improvement for operations that require many small commands — such as rendering a screen of text using DRAW_EXPAND (one command per glyph row, queued without stalling the CPU).

10.2 Two Output Paths from the Drawing Engine #

The drawing engine produces two types of SRAM write requests depending on the operation:

Path A — Span write (filled rectangles, triangle fill, filled polygon):

type draw_span_t is record
    y        : unsigned(8 downto 0);
    x_start  : unsigned(9 downto 0);
    x_end    : unsigned(9 downto 0);
    colour   : colour_t;
    mode     : draw_mode_t;
end record;

The span FIFO decouples geometry computation from SRAM write throughput. Rectangle fills emit one span per row. Triangle fills use a fixed-point edge walker (left and right edge positions per scanline) that emits spans row by row as edges are stepped. The SRAM writer converts a span into sequential word-aligned writes, issuing burst requests where the span is wide enough to benefit.

Path B — Pixel write (lines, individual pixel primitives, DRAW_EXPAND):

type draw_pixel_t is record
    x      : unsigned(9 downto 0);
    y      : unsigned(8 downto 0);
    colour : colour_t;
    mode   : draw_mode_t;
end record;

Bresenham line drawing generates scattered pixels that do not form horizontal runs. DRAW_EXPAND generates 16 sequential pixels from one command (one per bit of the pattern word); these are adjacent horizontally but arrive as individual pixel writes. Both go through the pixel write path directly to the SRAM writer as individual single-word accesses: a plain write in MODE_REPLACE, or a read-modify-write cycle for the logical modes.

Both paths converge at the SRAM writer, which is the sole location that handles draw modes. Mode handling was distributed across V1 drawing modules; in V1.5 it is in one place:

  • MODE_REPLACE: write directly (span: burst write; pixel: single write)
  • MODE_XOR / MODE_OR / MODE_AND: read-modify-write (single word per operation)
  • MODE_TRANSPARENT: skip write if source pixel matches transparency colour

The span path uses the SRAM burst interface (3-bit burst_len, up to 8 words per request) for wide spans. The pixel path always uses burst_len=0 (single word). A shallow synchronous write FIFO between the geometry engine and SRAM writer allows burst assembly for consecutive addresses, which is particularly effective for solid-colour hline spans.
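The centralised mode handling described above could be expressed as three small functions used by the SRAM writer. This is a sketch under assumptions: `colour_word_t` is a plain 16-bit stand-in for the document's colour_t (whose real definition lives in gfx_types_pkg and is not shown here), the enumeration literal order is invented, and the transparency key is passed in as a parameter.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical package; types and names are stand-ins for draw_pkg contents.
package draw_mode_sketch_pkg is
    subtype colour_word_t is std_logic_vector(15 downto 0);  -- assumed 5-6-5 word
    type draw_mode_t is (MODE_REPLACE, MODE_XOR, MODE_OR, MODE_AND, MODE_TRANSPARENT);

    function needs_read (mode : draw_mode_t) return boolean;
    function skip_write (mode : draw_mode_t; src, key : colour_word_t) return boolean;
    function apply_mode (mode : draw_mode_t; src, dst : colour_word_t) return colour_word_t;
end package draw_mode_sketch_pkg;

package body draw_mode_sketch_pkg is
    -- Logical modes need the destination word first (read-modify-write).
    function needs_read (mode : draw_mode_t) return boolean is
    begin
        return (mode = MODE_XOR) or (mode = MODE_OR) or (mode = MODE_AND);
    end function needs_read;

    -- Transparent mode drops the write when the source equals the key colour.
    function skip_write (mode : draw_mode_t; src, key : colour_word_t) return boolean is
    begin
        return (mode = MODE_TRANSPARENT) and (src = key);
    end function skip_write;

    -- Word to write back; dst is only meaningful when needs_read is true.
    function apply_mode (mode : draw_mode_t; src, dst : colour_word_t) return colour_word_t is
    begin
        case mode is
            when MODE_XOR => return src xor dst;
            when MODE_OR  => return src or dst;
            when MODE_AND => return src and dst;
            when others   => return src;  -- REPLACE, and TRANSPARENT when not skipped
        end case;
    end function apply_mode;
end package body draw_mode_sketch_pkg;
```

Because both the span and pixel paths would call the same functions, adding a further mode touches only this package and the writer's state machine.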

10.3 Canvas Space, Drawbase, and Stride #

Drawing coordinates are in canvas space, not screen space. The canvas is an abstract pixel grid whose origin (0,0) maps to a physical SRAM address given by the DRAW_BASE register. The V1 drawbase register already provides this; V1.5 makes it explicit and adds a companion DRAW_STRIDE register.

SRAM address = DRAW_BASE + (y * DRAW_STRIDE) + x_word_offset

DRAW_STRIDE is the width of the canvas in 16-bit words, and x_word_offset is the pixel's word offset within the row (equal to x at 16 bpp, where each 5-6-5 pixel occupies one word). In normal operation the stride equals the screen width in words (e.g., 640 for a 640-pixel 5-6-5 display), but for off-screen canvases (textures, a back buffer of a different size) it can differ.

This means:

  • Drawing to the front buffer: DRAW_BASE = DISPLAY_BASE, DRAW_STRIDE = screen_stride
  • Drawing to a back buffer: DRAW_BASE = back_buffer_addr, same stride
  • Screen flip: swap DISPLAY_BASE to point at the completed back buffer
  • Drawing to an off-screen texture: DRAW_BASE and DRAW_STRIDE set for that texture
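The address calculation can be sketched as a pure function. The widths below are assumptions drawn from elsewhere in this document: a 20-bit SRAM word address (16 low bits plus 4 high bits in the base registers), a 16-bit stride register, and the 10-bit x / 9-bit y fields of draw_span_t. It also assumes 16 bpp, so one pixel equals one word; the package and function names are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical helper package; widths and names are assumptions.
package canvas_addr_sketch_pkg is
    subtype sram_addr_t is unsigned(19 downto 0);

    function canvas_addr (draw_base   : sram_addr_t;
                          draw_stride : unsigned(15 downto 0);
                          x           : unsigned(9 downto 0);
                          y           : unsigned(8 downto 0)) return sram_addr_t;
end package canvas_addr_sketch_pkg;

package body canvas_addr_sketch_pkg is
    function canvas_addr (draw_base   : sram_addr_t;
                          draw_stride : unsigned(15 downto 0);
                          x           : unsigned(9 downto 0);
                          y           : unsigned(8 downto 0)) return sram_addr_t is
        variable row_offs : unsigned(24 downto 0);  -- 9 + 16 bits
    begin
        row_offs := y * draw_stride;
        -- Addition wraps modulo 2**20; software is expected to keep the
        -- canvas inside the physical SRAM.
        return draw_base + resize(row_offs, 20) + resize(x, 20);
    end function canvas_addr;
end package body canvas_addr_sketch_pkg;
```

In hardware the multiply would more likely be replaced by a running row address that adds DRAW_STRIDE once per scanline; the function form is only meant to pin down the arithmetic.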

All coordinates in draw_cmd_t are canvas-relative (signed(15 downto 0)) so the same command works regardless of where the canvas is located in SRAM. Commands with coordinates entirely outside the clip window are discarded in DrawDispatch before any SRAM access.

Clipping is applied against CLIP_X0/Y0/CLIP_X1/Y1 registers, which are also in canvas space. The default clip window is (0,0) to (canvas_width-1, canvas_height-1).
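The trivial-reject test performed in DrawDispatch could look like the sketch below. It assumes the command's bounding box has already been ordered (bx0 <= bx1, by0 <= by1) and that clip edges are inclusive; `clip_rect_t` and the function names are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical helper package; names and inclusive edges are assumptions.
package clip_sketch_pkg is
    subtype coord_t is signed(15 downto 0);  -- canvas-relative coordinate

    type clip_rect_t is record
        x0, y0, x1, y1 : coord_t;            -- CLIP_X0/Y0/X1/Y1, inclusive
    end record;

    function bbox_outside (bx0, by0, bx1, by1 : coord_t;
                           clip : clip_rect_t) return boolean;
    function clamp (v, lo, hi : coord_t) return coord_t;
end package clip_sketch_pkg;

package body clip_sketch_pkg is
    -- True when the bounding box lies entirely outside the clip window,
    -- so the command can be discarded without any SRAM access.
    function bbox_outside (bx0, by0, bx1, by1 : coord_t;
                           clip : clip_rect_t) return boolean is
    begin
        return (bx1 < clip.x0) or (bx0 > clip.x1)
            or (by1 < clip.y0) or (by0 > clip.y1);
    end function bbox_outside;

    -- Clamp one coordinate into [lo, hi]; useful for trimming span ends.
    function clamp (v, lo, hi : coord_t) return coord_t is
    begin
        if v < lo then
            return lo;
        elsif v > hi then
            return hi;
        else
            return v;
        end if;
    end function clamp;
end package body clip_sketch_pkg;
```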

10.4 Pattern System #

Two pattern mechanisms serve different use cases:

Inline 16-bit pattern (lines, DRAW_EXPAND):
The draw_cmd_t.pattern word is a 1D bitmask. For lines, each bit selects the colour of the corresponding pixel along the line (bit 15 = first pixel). As a special case, pattern = X"0000" selects a solid draw in which all pixels use colour0, rather than the all-colour1 result the per-bit rule would otherwise give. For DRAW_EXPAND, the pattern word IS the source bitmap: each bit expands to a full pixel using colour0 (bit=1) or colour1 (bit=0).

PatternRAM (rectangles, area fills):
A 64-entry × 16-row × 16-bit RAM holds 2D fill patterns. Each entry is a 16×16 tile pattern indexed by (x mod 16, y mod 16) or relative to the shape’s origin. The PatternRAM is pre-initialised at FPGA configuration with the standard EmuTOS fVDI pattern set; remaining entries are available for application upload at runtime.

Both mechanisms produce the same output contract to the SRAM writer: a pattern_set signal (1 = use colour0, 0 = use colour1). The SRAM writer applies colours without knowing which pattern source was used.
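A sketch of how the two pattern sources might derive pattern_set follows. Several details are assumptions not stated in the text: that the line pattern repeats every 16 pixels, that DRAW_EXPAND consumes bits starting from bit 15, and that PatternRAM is addressed as entry index followed by row. All names are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical helper package; bit order and wrap behaviour are assumptions.
package pattern_sketch_pkg is
    function line_pattern_set (pattern   : std_logic_vector(15 downto 0);
                               pixel_idx : unsigned(9 downto 0)) return std_logic;
    function expand_pattern_set (pattern : std_logic_vector(15 downto 0);
                                 bit_idx : unsigned(3 downto 0)) return std_logic;
    function pattern_ram_addr (pat_idx : unsigned(5 downto 0);
                               y       : unsigned(9 downto 0)) return unsigned;
end package pattern_sketch_pkg;

package body pattern_sketch_pkg is
    -- Lines: X"0000" means solid colour0; otherwise bit 15 is the first pixel
    -- and the mask is assumed to wrap every 16 pixels.
    function line_pattern_set (pattern   : std_logic_vector(15 downto 0);
                               pixel_idx : unsigned(9 downto 0)) return std_logic is
    begin
        if pattern = X"0000" then
            return '1';
        end if;
        return pattern(15 - to_integer(pixel_idx(3 downto 0)));
    end function line_pattern_set;

    -- DRAW_EXPAND: the word is the bitmap itself, no solid special case.
    function expand_pattern_set (pattern : std_logic_vector(15 downto 0);
                                 bit_idx : unsigned(3 downto 0)) return std_logic is
    begin
        return pattern(15 - to_integer(bit_idx));
    end function expand_pattern_set;

    -- PatternRAM word address: 64 entries x 16 rows, row = y mod 16.
    function pattern_ram_addr (pat_idx : unsigned(5 downto 0);
                               y       : unsigned(9 downto 0)) return unsigned is
    begin
        return pat_idx & y(3 downto 0);
    end function pattern_ram_addr;
end package body pattern_sketch_pkg;
```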

The PatternRAM is owned by the drawing engine module. CPU access (upload/readback) is via the Pattern register group (Group 0xD), handled by the drawing engine’s distributed register interface.


11. SRAM Arbiter #

11.1 Interface #

The arbiter accepts requests from N clients, all presenting sram_req_t, and forwards one at a time to the physical SRAM controller. The physical controller is entirely internal to SRAMControl.

library ieee;
use ieee.std_logic_1164.all;
-- sram_req_t / sram_ack_t come from sram_pkg; the *_array_t types are
-- assumed to be declared there as well.
use work.sram_pkg.all;

entity SRAMarbiter is
    generic (
        N_CLIENTS   : integer := 6;  -- extensible without touching SRAMControl
        MAX_BURST   : integer := 8   -- must match SRAM_MAX_BURST in sram_pkg
    );
    port (
        clk_sys     : in  std_logic;
        reset_n     : in  std_logic;
        -- Client interfaces (indexed 0..N_CLIENTS-1)
        req_in      : in  sram_req_array_t(0 to N_CLIENTS-1);
        ack_out     : out sram_ack_array_t(0 to N_CLIENTS-1);
        -- Physical SRAM (to SRAMControl)
        phys_req    : out sram_req_t;
        phys_ack    : in  sram_ack_t
    );
end SRAMarbiter;

11.2 Priority Assignment #

| Client index | Client | Priority class |
|---|---|---|
| 0 | Tile prefetch (all layers, round-robined internally) | Medium |
| 1 | Sprite prefetch | Medium |
| 2 | Bitmap prefetch | Medium |
| 3 | Text prefetch (character map reads) | Medium |
| 4 | Drawing engine span writer | Low (background) |
| 5 | CPU DMA (direct CPU memory access) | High (but rare) |

Priority class is a generic on each client’s arbiter slot. Within the same priority class, round-robin scheduling prevents starvation. CPU DMA is high priority because the CPU is waiting for DTACK, but it rarely asserts. Display prefetch is medium — it has a time budget (HBlank) but is not as urgent as CPU.
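One way to express "highest class wins, round-robin within a class" is the selection function sketched below. It is a simplification made for illustration: a single rotation pointer (the last granted client) is shared by all classes, and the request lines are reduced to one bit per client. The package, type and function names are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical sketch; the real arbiter works on sram_req_t records.
package arb_sketch_pkg is
    constant N_CLIENTS : positive := 6;

    type prio_class_t is (PRIO_LOW, PRIO_MEDIUM, PRIO_HIGH);
    type prio_array_t is array (0 to N_CLIENTS-1) of prio_class_t;

    -- Classes from the priority table: clients 0..3 medium, 4 low, 5 high.
    constant CLIENT_PRIO : prio_array_t :=
        (0 to 3 => PRIO_MEDIUM, 4 => PRIO_LOW, 5 => PRIO_HIGH);

    -- Returns the client index to grant, or -1 when nobody is requesting.
    function pick_client (req        : std_logic_vector(0 to N_CLIENTS-1);
                          last_grant : natural) return integer;
end package arb_sketch_pkg;

package body arb_sketch_pkg is
    function pick_client (req        : std_logic_vector(0 to N_CLIENTS-1);
                          last_grant : natural) return integer is
        variable best : integer := -1;
        variable idx  : natural;
    begin
        -- Scan in rotation order starting just after the last grant.
        -- A later client only replaces the candidate if its class is
        -- strictly higher, so equal-class clients are served in rotation.
        for i in 1 to N_CLIENTS loop
            idx := (last_grant + i) mod N_CLIENTS;
            if req(idx) = '1' then
                if best < 0 then
                    best := idx;
                elsif CLIENT_PRIO(idx) > CLIENT_PRIO(best) then
                    best := idx;
                end if;
            end if;
        end loop;
        return best;
    end function pick_client;
end package body arb_sketch_pkg;
```

For example, with clients 1 and 2 both requesting and last_grant = 1, the scan reaches client 2 first and grants it; if client 5 is also requesting, it wins regardless of the rotation position.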

Key improvement over V1: Adding a new SRAM client means incrementing N_CLIENTS and adding one slot. The existing SRAMControl physical layer does not change.


12. Register File and CPU Interface #

12.1 Register Map Organisation #

The V1.5 register map uses a 7-bit word address (128 registers), decoded as:

  • addr[6:3] = 4-bit group select (16 groups)
  • addr[2:0] = 3-bit register within group (8 registers per group)

Note: The V1.5 register map is incompatible with V1. This is deliberate — the V1 map grew without structure and mixed unrelated concerns in the same address group. The new map gives every logical subsystem a clean, independent address block. Firmware targeting V1.5 hardware must use the new addresses.

| Group | Address range | Block | Notes |
|---|---|---|---|
| 0x0 | 0x00–0x07 | System | Status, control (display mode, enables, display_blank), interrupt, SRAM page |
| 0x1 | 0x08–0x0F | Display | Framebuf base L/H, linescroll base L/H, border colour |
| 0x2 | 0x10–0x17 | Text engine | Text base L/H, cursor, cursor attr, scroll X/Y, ctrl |
| 0x3 | 0x18–0x1F | Font RAM | Font addr, font data (auto-increment); spare |
| 0x4 | 0x20–0x27 | Palette | Palette addr, data, ctrl0, ctrl1; spare |
| 0x5 | 0x28–0x2F | Tile layer 0 | Tile base, map base, scroll X/Y, ctrl, xscroll addr/data |
| 0x6 | 0x30–0x37 | Tile layer 1 | (same structure) |
| 0x7 | 0x38–0x3F | Tile layer 2 | (same structure) |
| 0x8 | 0x40–0x47 | Tile layer 3 | (same structure) |
| 0x9 | 0x48–0x4F | Sprites | Sprite base L/H, index, pos X/Y, tile, attrib, transparent index |
| 0xA | 0x50–0x57 | Hardware cursor | Enable, X, Y, addr, data, col1/2/3 |
| 0xB | 0x58–0x5F | Drawing | Base L/H, stride, status, color0/1, mode, pattern |
| 0xC | 0x60–0x67 | Draw params | Param 0..6, CMD_COMMIT |
| 0xD | 0x68–0x6F | Pattern + clip | Pattern addr, pattern data; CLIP_X0/Y0/X1/Y1 |
| 0xE | 0x70–0x77 | SRAM map | Page ctrl; reserved for co-processor (see §14.1) |
| 0xF | 0x78–0x7F | Board ID | Version, type ID |

System block detail (Group 0x0):

| Offset | Register | Notes |
|---|---|---|
| +0 | STATUS | Bit 0: draw engine not busy; bit 1: vblank; bit 2: line int |
| +1 | CONTROL | Bits [2:0]: display mode; [5:3]: bpp; [8]: cursor_en; [9]: blink_en; [10]: mode_80col; [11]: sprite_en; [13:12]: scanline_ctrl; [14]: text_overlay; [15]: display_blank |
| +2 | INTERRUPT_CTRL | Individual interrupt enable/mask bits (see §12.3) |
| +3 | LINEINT_Y | Line interrupt scanline position |
| +4 | SRAM_PAGE_CTRL | SRAM page select |
| +5–+7 | spare | |

CONTROL[15] — display_blank: When set to 1, the VGA output drives black on all colour channels while HSYNC and VSYNC continue normally (monitor stays synchronised, no re-acquire delay on unblank). The drawing engine and all prefetch continue running. This is a display-output gate only — not a pause. Typical use: set blank before loading the framebuffer, update FRAMEBUF_BASE, then clear blank to reveal a complete frame without visible loading artefacts.

Drawing block detail (Group 0xB):

| Offset | Register | Notes |
|---|---|---|
| +0 | DRAW_BASE_LO | Canvas SRAM base address, lower 16 bits |
| +1 | DRAW_BASE_HI | Canvas SRAM base address, upper 4 bits |
| +2 | DRAW_STRIDE | Canvas width in 16-bit words (default = screen stride) |
| +3 | DRAW_STATUS | Bit 0: FIFO full; bit 1: FIFO empty; bit 2: busy |
| +4 | DRAW_COLOR0 | Primary / foreground colour (5-6-5) |
| +5 | DRAW_COLOR1 | Secondary / background colour (5-6-5) |
| +6 | DRAW_MODE | Draw mode (replace/xor/or/and/transparent) |
| +7 | DRAW_PATTERN | Inline 16-bit pattern (lines, DRAW_EXPAND) |

Draw params block detail (Group 0xC):

| Offset | Register | Notes |
|---|---|---|
| +0 to +6 | PARAM_0 to PARAM_6 | Command parameter registers (opcode, coords, etc.); written before CMD_COMMIT |
| +7 | CMD_COMMIT | Write to push draw_cmd_t into FIFO |

Pattern + clip block detail (Group 0xD):

| Offset | Register | Notes |
|---|---|---|
| +0 | PATTERN_ADDR | PatternRAM write address (auto-increment on data write) |
| +1 | PATTERN_DATA | PatternRAM write data (16-bit); increments PATTERN_ADDR |
| +2 | CLIP_X0 | Clip window left edge (canvas space) |
| +3 | CLIP_Y0 | Clip window top edge (canvas space) |
| +4 | CLIP_X1 | Clip window right edge (canvas space) |
| +5 | CLIP_Y1 | Clip window bottom edge (canvas space) |
| +6–+7 | spare | |

Tile layer block detail (one set per layer, Groups 0x5–0x8):

| Offset | Register | Notes |
|---|---|---|
| +0 | TILE_BASE_L | Tile data SRAM base address, lower 16 bits |
| +1 | TILE_BASE_H | Tile data SRAM base address, upper 4 bits |
| +2 | MAP_BASE_L | Tile map SRAM base address, lower 16 bits |
| +3 | MAP_BASE_H | Tile map SRAM base address, upper 4 bits |
| +4 | SCROLL_X | Global X scroll for this layer |
| +5 | SCROLL_Y | Global Y scroll for this layer |
| +6 | LAYER_CTRL | [0]: enable; [2:1]: colour mode; [5:3]: priority; [9:6]: transparent_idx (default 0) |
| +7 | XSCROLL_ADDR | Indirect address for per-row X scroll table write |

(Y scroll indirect data and YSCROLL_CTRL deferred to V2 — see §7.2)

12.2 Distributed Register Ownership #

In V1.5, each subsystem module owns its own register decode, configuration FFs, and any RAM access (text BRAM, font BRAM, sprite attrib BRAM, pattern RAM, palette RAM). The central RegisterControl module is reduced to a thin bus router:

  1. Group decode: RegisterControl decodes addr[6:3] (4-bit group) and drives a qualified req to exactly one subsystem module per transaction.
  2. Read-data mux: RegisterControl selects which subsystem’s data_out to return based on the active group — a shallow 1-of-16 mux on 4 bits, not the deep nested case structure of V1.
  3. ACK arbitration: Only the addressed module ever has its req asserted; its ack is returned directly. No collision is possible.
  4. Bus timing: RegisterControl retains the cpureg_read_pending gate to prevent a new transaction starting while a pipelined BRAM read is in flight.

Each subsystem module presents a standard bus port:

-- Standard register sub-port (all subsystems)
reg_cmd      : in  unsigned(2 downto 0);     -- sub-address within group
reg_data_in  : in  std_logic_vector(15 downto 0);
reg_rw       : in  std_logic;
reg_req      : in  std_logic;                -- asserted only when group matches
reg_ack      : out std_logic;
reg_data_out : out std_logic_vector(15 downto 0)
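To make the sub-port concrete, here is a minimal sketch of a subsystem that owns two configuration registers behind it. The entity name, the two example registers, the reg_rw polarity ('1' = read) and the single-cycle ack are all assumptions for illustration; a module with a pipelined BRAM behind it would delay reg_ack accordingly.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example module; not part of the V1.5 source tree.
entity RegBlockSketch is
    port (
        clk_sys      : in  std_logic;
        reset_n      : in  std_logic;
        reg_cmd      : in  unsigned(2 downto 0);
        reg_data_in  : in  std_logic_vector(15 downto 0);
        reg_rw       : in  std_logic;   -- assumed: '1' = read, '0' = write
        reg_req      : in  std_logic;
        reg_ack      : out std_logic;
        reg_data_out : out std_logic_vector(15 downto 0);
        base_lo      : out std_logic_vector(15 downto 0);
        base_hi      : out std_logic_vector(3 downto 0)
    );
end entity RegBlockSketch;

architecture rtl of RegBlockSketch is
    signal base_lo_r : std_logic_vector(15 downto 0) := (others => '0');
    signal base_hi_r : std_logic_vector(3 downto 0)  := (others => '0');
begin
    process (clk_sys)
    begin
        if rising_edge(clk_sys) then
            reg_ack <= '0';
            if reset_n = '0' then
                base_lo_r    <= (others => '0');
                base_hi_r    <= (others => '0');
                reg_data_out <= (others => '0');
            elsif reg_req = '1' then
                reg_ack <= '1';  -- FF-backed registers answer in one cycle
                if reg_rw = '0' then
                    case to_integer(reg_cmd) is
                        when 0      => base_lo_r <= reg_data_in;
                        when 1      => base_hi_r <= reg_data_in(3 downto 0);
                        when others => null;  -- spare offsets ignore writes
                    end case;
                else
                    reg_data_out <= (others => '0');
                    case to_integer(reg_cmd) is
                        when 0      => reg_data_out <= base_lo_r;
                        when 1      => reg_data_out(3 downto 0) <= base_hi_r;
                        when others => null;  -- spare offsets read as zero
                    end case;
                end if;
            end if;
        end if;
    end process;

    -- Configuration is consumed locally, not routed through RegisterControl.
    base_lo <= base_lo_r;
    base_hi <= base_hi_r;
end architecture rtl;
```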

Benefits of distributed ownership:

  • Eliminates the backward dependency where V1 RegisterControl reached directly into other modules’ BRAMs (TextGen, FontRAM, SpriteAttribRAM, PatternRAM, PaletteRAM)
  • Hot per-line registers (tile scroll X/Y, draw mode) live inside the consuming module with no propagation delay to a central file
  • Adding a new subsystem requires: (a) implementing its reg_* port, and (b) adding one entry to the RegisterControl group mux. No other module changes.
  • VDPController port list shrinks significantly — no tile/sprite/draw config signals need to be routed through RegisterControl

Modules owning registers in V1.5:

| Module | Group(s) | RAM access owned |
|---|---|---|
| RegisterSystem | 0x0 | None (FFs only) |
| RegisterDisplay | 0x1 | None (FFs only) |
| TextGen | 0x2, 0x3 | FontRAM BRAM (write path) |
| PaletteCtrl | 0x4 | PaletteRAM BRAM |
| TileLayer (×4) | 0x5–0x8 | Per-layer scroll table BRAM |
| SpriteEngine | 0x9 | SpriteAttribRAM BRAM |
| HardwareCursor | 0xA | Cursor sprite BRAM |
| DrawEngine | 0xB, 0xC, 0xD | PatternRAM BRAM |
| RegisterControl | (router only) | None |
| BoardID | 0xF | None (constants) |

12.3 Interrupt Sources #

V1.5 defines a structured interrupt register in place of V1's single undifferentiated interrupt wire:

| Bit | Source | Use case |
|---|---|---|
| 0 | VBlank | Frame sync, animation update |
| 1 | HBlank (configurable line) | Raster effects, scroll update |
| 2 | Drawing FIFO empty | CPU knows drawing is complete |
| 3 | Drawing FIFO has space | CPU can submit next command |
| 4–7 | Reserved | V2: DMA complete, co-processor |

Each bit is individually maskable. The physical irq_n_o pin is still a single output, driven by the OR of the unmasked active bits.
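The combination of sources into the single pin can be sketched as below. It assumes irq_n_o is active low (as the _n suffix suggests), that an enable bit of '1' lets the source through, and that pending bits are latched elsewhere; the entity and port names other than irq_n_o are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical sketch; mask polarity and pin polarity are assumptions.
entity IrqCombineSketch is
    port (
        irq_pending : in  std_logic_vector(7 downto 0);  -- one bit per source
        irq_enable  : in  std_logic_vector(7 downto 0);  -- '1' = source unmasked
        irq_n_o     : out std_logic                      -- active low to the CPU
    );
end entity IrqCombineSketch;

architecture rtl of IrqCombineSketch is
    constant NONE_ACTIVE : std_logic_vector(7 downto 0) := (others => '0');
begin
    -- Pin stays high (inactive) only while no unmasked source is pending.
    irq_n_o <= '1' when (irq_pending and irq_enable) = NONE_ACTIVE else '0';
end architecture rtl;
```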


13. Refactoring Stages #

The V1 working system is the starting point. Each stage produces a bitfile that can be tested on hardware immediately — the working display output is the regression test. Stages are ordered from lowest to highest risk.

Stage 0 — Display blank bit (quick win, no structural change) #

Files: RegisterControl.vhd, VGA output stage
Changes:

  • Add display_blank port output from RegisterControl (bit 15 of CONTROL register)
  • In VGA output stage: gate RGB channels to zero when display_blank = '1'; HSYNC/VSYNC continue normally

Test: Set bit, verify screen goes black with monitor remaining in sync. Clear bit, verify display resumes. Entirely self-contained change, zero risk to other functionality.
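A minimal sketch of the output gate follows. It assumes display_blank has already been synchronised into the pixel clock domain and that the output stage registers RGB and sync together; the entity and port names other than display_blank are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical sketch of the Stage 0 change in the VGA output stage.
entity VgaBlankGateSketch is
    port (
        clk_pix       : in  std_logic;
        display_blank : in  std_logic;  -- CONTROL[15], assumed already in PIX domain
        r_in          : in  std_logic_vector(4 downto 0);
        g_in          : in  std_logic_vector(5 downto 0);
        b_in          : in  std_logic_vector(4 downto 0);
        hsync_in      : in  std_logic;
        vsync_in      : in  std_logic;
        r_out         : out std_logic_vector(4 downto 0);
        g_out         : out std_logic_vector(5 downto 0);
        b_out         : out std_logic_vector(4 downto 0);
        hsync_out     : out std_logic;
        vsync_out     : out std_logic
    );
end entity VgaBlankGateSketch;

architecture rtl of VgaBlankGateSketch is
begin
    process (clk_pix)
    begin
        if rising_edge(clk_pix) then
            -- Sync signals pass through untouched so the monitor stays locked.
            hsync_out <= hsync_in;
            vsync_out <= vsync_in;
            if display_blank = '1' then
                r_out <= (others => '0');
                g_out <= (others => '0');
                b_out <= (others => '0');
            else
                r_out <= r_in;
                g_out <= g_in;
                b_out <= b_in;
            end if;
        end if;
    end process;
end architecture rtl;
```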

Stage 1 — Library and language cleanup (no logic change) #

Files: all V1 modules
Changes:

  • Replace STD_LOGIC_UNSIGNED with NUMERIC_STD throughout
  • Fix powerup process non-standard if condition and rising_edge construct
  • Fix DualPortRAM / FontRAM single-process dual-write (causes BRAM inference failure; split into two separate processes using the standard TDP template)

Test: Synthesise each module after change. Bitfile must be functionally identical to V1.
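For reference, the two-process true-dual-port pattern mentioned for the DualPortRAM / FontRAM fix generally looks like the sketch below. It follows the shared-variable style commonly used with XST for write-first dual-port block RAM; the entity name, generics and port names are hypothetical, and the exact template should be checked against the synthesis guide for the tool version in use.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical sketch of a split-process TDP RAM.
entity DualPortRamSketch is
    generic (
        ADDR_W : positive := 12;  -- 4,096 words, e.g. the font BRAM
        DATA_W : positive := 16
    );
    port (
        clk_a  : in  std_logic;
        we_a   : in  std_logic;
        addr_a : in  unsigned(ADDR_W-1 downto 0);
        din_a  : in  std_logic_vector(DATA_W-1 downto 0);
        dout_a : out std_logic_vector(DATA_W-1 downto 0);
        clk_b  : in  std_logic;
        we_b   : in  std_logic;
        addr_b : in  unsigned(ADDR_W-1 downto 0);
        din_b  : in  std_logic_vector(DATA_W-1 downto 0);
        dout_b : out std_logic_vector(DATA_W-1 downto 0)
    );
end entity DualPortRamSketch;

architecture rtl of DualPortRamSketch is
    type ram_t is array (0 to 2**ADDR_W - 1) of std_logic_vector(DATA_W-1 downto 0);
    -- VHDL-93 style shared variable; one writer process per port.
    shared variable ram : ram_t := (others => (others => '0'));
begin
    port_a : process (clk_a)
    begin
        if rising_edge(clk_a) then
            if we_a = '1' then
                ram(to_integer(addr_a)) := din_a;
            end if;
            dout_a <= ram(to_integer(addr_a));  -- write-first read
        end if;
    end process port_a;

    port_b : process (clk_b)
    begin
        if rising_edge(clk_b) then
            if we_b = '1' then
                ram(to_integer(addr_b)) := din_b;
            end if;
            dout_b <= ram(to_integer(addr_b));
        end if;
    end process port_b;
end architecture rtl;
```

The key point for Stage 1 is that each port has its own clocked process; putting both writes in one process is what defeats block RAM inference.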

Stage 2 — Package migration #

Files: all V1 modules, GfxVGA_pkg.vhd
Changes:

  • New packages (gfx_types_pkg, vga_pkg, sram_pkg, draw_pkg, tile_pkg, sprite_pkg, vdp_pkg) are already written and sit alongside GfxVGA_pkg.vhd
  • Migrate modules one at a time to import from new packages; GfxVGA_pkg removed last
  • Update pixel_t usage to new definition

Test: After each module migration, synthesise and confirm display output unchanged.

Stage 3 — CDC formalisation #

Files: VDPController.vhd, ClockVGAtoSYS.vhd
Changes:

  • Audit every signal that crosses a clock domain boundary in VDPController
  • Any crossing not already routed through ClockVGAtoSYS is moved there
  • Remove any ad-hoc 2-FF chains that live inline in VDPController

Test: Synthesise with timing constraints active. Check ISE timing report for unconstrained cross-domain paths. Display output must be unchanged.
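For single-bit level signals, the crossing primitive collected into the CDC module can be as simple as the two-flop synchroniser sketched here. The entity and signal names are hypothetical, and the sketch is only suitable for one-bit, slowly changing signals; multi-bit values need a handshake, Gray coding or a FIFO instead.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical sketch of a reusable single-bit synchroniser.
entity SyncBitSketch is
    port (
        clk_dst : in  std_logic;  -- destination clock domain
        d_async : in  std_logic;  -- signal launched from the other domain
        q_sync  : out std_logic   -- safe to use in clk_dst logic
    );
end entity SyncBitSketch;

architecture rtl of SyncBitSketch is
    signal meta_r : std_logic := '0';  -- first stage, may go metastable
    signal sync_r : std_logic := '0';  -- second stage, resolved value
begin
    process (clk_dst)
    begin
        if rising_edge(clk_dst) then
            meta_r <= d_async;
            sync_r <= meta_r;
        end if;
    end process;

    q_sync <= sync_r;
end architecture rtl;
```

Instantiating one named entity per crossing, rather than inline flop chains, makes the audit in this stage a matter of listing instances and lets timing exceptions be applied to a single, recognisable path pattern.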

Stage 4 — Register file decomposition #

Files: RegisterControl.vhd, all subsystem modules
Changes:

  • Adopt new V1.5 register map (Group 0x0–0xF, see §12.1)
  • Implement distributed register ownership: each subsystem module gains its reg_* port
  • RegisterControl becomes a thin bus router (group decode + 1-of-16 read mux)
  • Modules previously accessing external BRAMs via RegisterControl now own those BRAMs directly (FontRAM → TextGen; PaletteRAM → PaletteCtrl; PatternRAM → DrawEngine; SpriteAttribRAM → SpriteEngine)
  • Expected saving: ~300 LUT slices on XC6SLX25 from eliminating deep decode trees

Test: All CPU register reads and writes must produce same behaviour. DTACK timing must be unchanged. Verify resource report shows LUT reduction. Note: firmware must be updated to use V1.5 register addresses.

Stage 5 — Text SRAM migration #

Files: TextGen.vhd, text engine register interface
Changes:

  • Replace BRAM character map with SRAM pointer (TEXT_BASE register)
  • Add text prefetch controller (reads character map from SRAM each HBlank)
  • TextGen module owns font BRAM directly (was accessed via RegisterControl)
  • Hardware cursor moved from external hardware cursor sprite registers into TextGen
  • Character attribute word updated to V1.5 format (per-character blink, 3-bit background and 4-bit foreground palette indices, 8-bit character code)

Test: Text display must be correct. Scroll by incrementing TEXT_BASE pointer and verify correct screen content. Cursor must blink at configured rate. Custom font upload via FONT_ADDR/FONT_DATA must update display.

Stage 6 — SRAM arbiter interface #

Files: SRAMControl.vhd, VDPController.vhd and all SRAM clients
Changes:

  • Adopt sram_req_t / sram_ack_t / sram_wdata_t port types throughout
  • Replace grown-in-place arbiter with clean N_CLIENTS generic design (see §11)
  • Existing SRAM timing and burst protocol unchanged — only interface type changes
  • Add text prefetch as client index 3

Test: Bitmap display and drawing engine must produce identical output. SRAM timing must remain within spec.

Stage 7 — TileLayer consolidation #

Files: TileGenerator.vhd (×4), new TileLayer.vhd
Changes:

  • Write generic TileLayer entity using tile_layer_cfg_t port (see §6.1)
  • Replace TileGenerator instances one at a time in VDPController
  • Per-layer priority register introduced via tile_layer_cfg_t.priority
  • Carry forward per-row X scroll table unchanged

Test: Replace one layer at a time; confirm tile display correct after each substitution. Scroll behaviour must match V1. Once all four replaced, delete TileGenerator.vhd.

Stage 8 — Sprite engine cleanup #

Files: SpriteAttribRAM.vhd, SpriteLineSelector.vhd, SpriteLineRenderer.vhd
Changes:

  • Adopt sprite_attr_t and sprite_line_entry_t from sprite_pkg; replace raw bit-slice access with to_/from_ conversion functions
  • Fix stale comment in SpriteAttribRAM (bits [59:58] documented as “layer priority” but decoded as unused by selector; clarify as reserved)
  • No change to the 64-bit attribute format or sprite rendering behaviour

Test: Sprites must appear in identical positions with identical graphics to V1.

Stage 9 — Drawing engine command FIFO #

Files: Drawing.vhd and related
Changes:

  • Decouple CPU from drawing state machine via BRAM command FIFO (see §10.1)
  • Add DRAW_STRIDE register and canvas-space addressing (see §10.3)
  • Centralise draw mode handling (XOR/OR/AND/transparent) in a single SRAM writer
  • Add burst writes for solid-colour (MODE_REPLACE) hline spans (shallow write FIFO, BRAM-based)
  • Add DRAW_EXPAND command (16-bit bitmap expand to pixels via colour0/colour1)
  • Add clip window registers and clipping in DrawDispatch
  • Fix known V1 bugs: triangle fill using stale line_pattern; redundant DONE state

Test: All drawing operations (rect, line, fill, blit, expand) must produce correct output. CPU must not stall while FIFO has space. FIFO-full behaviour must be verified. Verify OR and AND modes produce correct results. Verify burst writes are faster than V1 single-word path.

Stage 10 — Compositor unification (optional, or defer to V2) #

Files: PixelCompositor.vhd
Changes:

  • Unify all display sources to produce pixel_t with priority and transparent fields
  • Replace the V1 fixed-priority compositor with the compositor_select function from vdp_pkg (pure priority comparator, no special-case wiring)

Test: Visual output must match V1 at default priority settings. This is the highest-risk stage and may be cleanest to defer to V2.


14. V2 Forward Compatibility #

The following V1.5 decisions are made specifically to make V2 changes localised:

| V2 change | V1.5 preparation |
|---|---|
| 32-bit SRAM bus | sram_pkg.vhd uses record; widen wdata/rdata in one place; burst_len already 3-bit so no interface change needed |
| 8-8-8 RGB colour | colour_t in gfx_types_pkg.vhd; widen fields in one place |
| 25.175 MHz pixel clock | PIPELINE_DEPTH constant drives all timing; no structural change |
| HDMI output | PIX domain produces colour_t and sync; HDMI encoder is a new consumer |
| 200 MHz sprite/draw domain | Span FIFO and line buffer write ports designed for cross-domain; add clock port and CDC |
| More tile layers | N_CLIENTS in arbiter, add TileLayer instances and compositor array entry |
| Per-tile priority | Switch tile map to 32-bit entry; second word holds priority override + extended tile index |
| Text as TileLayer mode | TILE_H=16 generic, 1bpp colour mode, cursor in tile_layer_cfg_t (see §9.6); RAMB36 required for 640-pixel line buffer |
| 640-pixel tile layers (hires mode) | Line buffer depth is TILE_W * TILEMAP_W * 2; synthesis picks RAMB18 or RAMB36 automatically from the derived constant |
| Sub-16bpp drawing | Drawing engine SRAM writer extended to handle sub-word pixel packing; canvas space addressing unchanged |
| Distributed register ownership | V1.5 already distributes ownership; V2 adds modules by extending the group mux only |

14.1 Amiga-style Co-processor (V2, Register Space Reserved) #

Registers Group 0xE (0x70–0x77) are reserved for a display co-processor. The concept:

  • A BRAM holds a list of instructions: (WAIT_H, WAIT_V, REG_ADDR, REG_DATA)
  • The co-processor executes in the PIX domain, watching H/V counters
  • When (H_cnt, V_cnt) matches a WAIT condition, it writes REG_DATA to REG_ADDR
  • This allows palette changes, scroll changes, layer enable/disable mid-frame
  • Requires a secondary write path into the register file from the PIX domain (CDC: PIX→SYS for register writes, with the co-processor having highest register priority)

Plan the register write path in V1.5 RegisterControl to accept writes from two sources (CPU and co-processor) with the co-processor winning ties.


15. V1 Issues Addressed in V1.5 #

| V1 issue | V1.5 resolution |
|---|---|
| DualPortRAM/FontRAM BRAM inference failures (both writes in one process) | All RAMs use split-process TDP template; resource usage drop expected |
| CDC boundaries informal / undocumented | All crossings explicit in §2; CDC module is sole crossing point |
| SRAM arbiter grew in-place with features | Clean N_CLIENTS generic arbiter with burst interface; adding client = adding one slot |
| SRAM interface word-oriented; burst structure implicit in clients | Explicit burst_len (3-bit, 1..8 words) in sram_req_t; burst data flow documented (data_valid/rdata per word, done on last); client accumulator pattern defined |
| Bitmap prefetch uses full-line single transaction (~2,560 cycles), blocking all SRAM clients | Split into 8-word bursts interleaved with other clients; same bandwidth, 84% utilisation including all display layers |
| 4 separate TileGenerator instances, fixed compositor priority | Single generic TileLayer entity; priority via per-layer register |
| Compositor priority fixed in hardware wiring | All sources produce pixel_t with priority field; compositor is a pure priority comparator |
| Sprite priority fixed above all tile layers | Per-sprite priority in attribute table (4 bits); default=5 reproduces V1 behaviour |
| Drawing engine directly driven by CPU register state | Command FIFO decouples CPU from drawing state machines; CPU fills queue and continues |
| Drawing modes (XOR etc.) scattered across drawing modules | Centralised in SRAM writer, applied to both span and pixel paths |
| Only XOR supported as a logical drawing operation | OR and AND added to draw_mode_t; centralised SRAM writer handles all three |
| Line drawing incorrectly mapped to span path | Separate pixel-write path for Bresenham lines; span path for fills only |
| Drawing coordinates in screen space only | Canvas space with DRAW_BASE + DRAW_STRIDE; screen-space drawing = set base to display base |
| No hardware clipping | Clip window registers (canvas space); clipping in DrawDispatch before SRAM access |
| No burst writes for hline spans | Shallow write FIFO enables burst assembly for consecutive addresses; large speedup for fills |
| No 1bpp bitmap expand command | DRAW_EXPAND added; enables hardware text/icon rendering via FIFO with no CPU stall |
| Triangle fill uses stale line_pattern (bug) | Fixed: line_pattern forced to X"0000" in FILLTRI command setup |
| Text character map in dedicated BRAM | Moved to SRAM with pointer register; scroll = pointer increment; direct CPU write access |
| Text BRAM accessed via RegisterControl (backward dependency) | TextGen owns its font BRAM and register interface directly |
| RegisterControl monolithic: all register state in one 970-line module | Distributed ownership: each subsystem module owns its register decode and config FFs |
| RegisterControl reaches into other modules’ BRAMs | Eliminated: each BRAM is owned and accessed only by its module |
| Register map ad-hoc, groups mixing unrelated concerns | New blocked map (Groups 0x0–0xF); each group maps to one subsystem module |
| No framebuffer blank during loading | CONTROL[15] = display_blank blacks RGB output while maintaining sync |
| STD_LOGIC_UNSIGNED (non-standard library) throughout | Replace with NUMERIC_STD throughout |
| powerup process uses non-standard if condition and rising_edge construct | Replace with standard synchronous reset pattern |
| Interrupt is a single wire | Structured interrupt register with individual mask bits |
| Transparent colour hardcoded to palette index 0 for all tile layers and sprites | Per-layer transparent_idx in tile_layer_cfg_t (4-bit, default 0, backward compatible); palette index 0 is no longer reserved once a non-zero transparent index is configured. Global SPRITE_TRANS_IDX register covers sprites. |
| Register file as flat FF array (~350-450 slices on S6, ~9-12% of device) | Distributed ownership moves FFs into consuming modules; eliminates centralised decode tree |

Document version 1.3 — June 2026
Target: GfxVGA V1.5, XC6SLX25, ISE 14.x
Forward reference: GfxVGA V2, XC7S50, Vivado 2025.x