GfxVGA Version 1.5 — Design Document #
Target hardware: Existing V1 board (Spartan-6, 16-bit async SRAM, 5-6-5 VGA output)
Purpose: Incremental refactoring of the working V1 codebase. Same feature set, same
hardware. The goal is to clean up the architecture while keeping a working system at every
stage — not a from-scratch rewrite.
V1.5 also serves as the foundation for V2 (Spartan-7, 32-bit SRAM, 8-8-8 RGB, HDMI), which will
be a clean start on new hardware where bring-up is unavoidable anyway.
1. Goals and Non-Goals #
Goals #
- Clean, modular VHDL with well-defined inter-module interfaces
- All inter-module types defined in package files before any module is written
- Command-FIFO-based drawing engine decoupled from CPU interface
- Consolidated, configurable tile layer engine (replacing 4 fixed instances)
- Proper CDC boundaries documented and enforced
- SRAM arbiter with a clean priority interface (replacing grown-in-place V1 version)
- VGA pixel pipeline with explicit, documented latency compensation
- Distributed register ownership: each subsystem module owns its own register decode
- Text character map moved to SRAM (pointer-based); font glyphs remain in pre-initialised BRAM
- Designed for low resource usage: target <75% LUT on XC6SLX25 (V1 was at ~98%)
- Forward-compatible: V2 changes are widening/extending, not restructuring
Non-Goals for V1.5 #
- New drawing primitives beyond V1 set (add in V2)
- 320×240 mode changes beyond what V1 supports (validate, don’t redesign)
- HDMI output (V2)
- 200MHz clock domain (V2)
- Amiga-style co-processor (plan register space, implement in V2)
- Sub-16bpp framebuffer drawing modes (V2)
- Text rendering as a TileLayer mode (design is documented in §9.6; implement in V2 once foundation is stable)
- Per-column Y scroll table (register space reserved in tile layer block; implement if a specific effect requires it, but not at the cost of core functionality — global X scroll per row already covers most raster-bar use cases)
2. Clock Architecture #
V1.5 uses three clock domains from a single DCM/PLL (same as V1):
| Domain | Clock | Source | Used by |
|---|---|---|---|
| PIX | 25 MHz | PLL ÷ 4 | VGA pixel output, line buffer read, compositor |
| SYS | 100 MHz | PLL | SRAM arbiter, tile prefetch, sprite engine, drawing engine |
| CPU | 50 MHz | PLL ÷ 2 | CPU bus state machine, register file |
CDC crossings (all explicit, none implicit):
| Crossing | Signal type | Method |
|---|---|---|
| SYS → PIX | HBlank start pulse | Toggle + 3-FF sync + edge detect |
| SYS → PIX | VBlank end pulse | Toggle + 3-FF sync + edge detect |
| SYS → PIX | Line buffer swap | Toggle + 3-FF sync |
| PIX → SYS | Prefetch request | Toggle + 3-FF sync + edge detect |
| CPU → SYS | Register write | 2-FF sync on all control signals |
| SYS → CPU | DTACK | Combinational from SYS flag, registered in CPU domain |
The existing ClockVGAtoSYS.vhd implements the SYS↔PIX crossings and should be preserved/extended.
Rule: No signal crosses a clock domain boundary without appearing in a CDC module port list.
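The "Toggle + 3-FF sync + edge detect" method in the table can be sketched as below. This is a minimal illustration, not the contents of ClockVGAtoSYS.vhd: the entity name cdc_pulse_toggle and all port names are hypothetical. It assumes source pulses are spaced at least a few destination clock cycles apart, so that every toggle is seen by the synchroniser.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity; names are illustrative, not taken from the V1 sources.
entity cdc_pulse_toggle is
  port (
    clk_src   : in  std_logic;  -- source domain clock
    pulse_src : in  std_logic;  -- one-cycle pulse in the source domain
    clk_dst   : in  std_logic;  -- destination domain clock
    pulse_dst : out std_logic   -- one-cycle pulse in the destination domain
  );
end entity cdc_pulse_toggle;

architecture rtl of cdc_pulse_toggle is
  signal toggle_src : std_logic := '0';
  signal sync_dst   : std_logic_vector(2 downto 0) := (others => '0');
  signal sync_prev  : std_logic := '0';
begin
  -- Source domain: each input pulse flips the toggle level.
  process(clk_src)
  begin
    if rising_edge(clk_src) then
      if pulse_src = '1' then
        toggle_src <= not toggle_src;
      end if;
    end if;
  end process;

  -- Destination domain: 3-FF synchroniser plus one delayed copy for edge detection.
  process(clk_dst)
  begin
    if rising_edge(clk_dst) then
      sync_dst  <= sync_dst(1 downto 0) & toggle_src;
      sync_prev <= sync_dst(2);
    end if;
  end process;

  -- A change of the synchronised level (either edge) marks one source pulse.
  pulse_dst <= sync_dst(2) xor sync_prev;
end architecture rtl;
```

Only the toggle level crosses the boundary, so the pulse width in the source domain does not matter. For crossings listed without edge detect (the line buffer swap), the synchronised level sync_dst(2) would be used directly.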
3. Package Structure #
The seven packages below define the target type system for V1.5. During refactoring,
modules migrate to these types incrementally — GfxVGA_pkg.vhd is not removed until the
last module referencing it has been updated. A module’s port list should contain only types
from these packages — no raw std_logic_vector for structured data.
gfx_types_pkg.vhd Core types: pixel_t, colour_t, coord_t, screen_pos_t
vga_pkg.vhd VGA timing records, pipeline stage types
sram_pkg.vhd SRAM request/ack interface types
draw_pkg.vhd Drawing command record and opcode enum
tile_pkg.vhd Tile layer config, tile attribute type
sprite_pkg.vhd Sprite attribute type
vdp_pkg.vhd Internal VDP signal buses, line buffer types
3.1 Core Types (gfx_types_pkg.vhd) #
-- 5-6-5 pixel colour (V1.5). Widened to 8-8-8 in V2 by changing this package only.
type colour_t is record
r : std_logic_vector(4 downto 0);
g : std_logic_vector(5 downto 0);
b : std_logic_vector(4 downto 0);
end record;
constant COLOUR_BLACK : colour_t := (others => (others => '0'));
constant COLOUR_TRANSPARENT : colour_t := ...; -- implementation-defined sentinel
-- Screen coordinate. Signed to allow off-screen drawing.
type coord_t is record
x : signed(15 downto 0);
y : signed(15 downto 0);
end record;
-- Unsigned screen position (display domain, always on-screen)
type screen_pos_t is record
x : unsigned(9 downto 0); -- 0..639
y : unsigned(8 downto 0); -- 0..479
end record;
-- A pixel with compositor metadata attached
type pixel_t is record
colour : colour_t;
priority : unsigned(2 downto 0); -- 0 = lowest, 7 = highest
transparent : std_logic;
valid : std_logic;
end record;
constant PIXEL_NULL : pixel_t := (...);
3.2 SRAM Interface (sram_pkg.vhd) #
The interface is burst-oriented, not word-oriented. This is a hard requirement driven by timing budget: fetching 16-bit words individually would exceed the HBlank budget for 4 tile layers + 48 sprites. The existing V1 design already relies on 4-word burst reads for tiles and sprites; V1.5 makes this a first-class interface concept.
-- Maximum burst length supported by SRAMControl.
-- 4 words matches tile row and sprite row data granularity exactly.
-- 8 words is used for bitmap prefetch to reduce per-word overhead on long sequential reads.
constant SRAM_MAX_BURST : integer := 8;
-- Issued once per transfer. burst_len=0 means 1 word; burst_len=7 means 8 words.
type sram_req_t is record
valid : std_logic;
rnw : std_logic; -- '1' = read, '0' = write
addr : unsigned(19 downto 0); -- start address (word-addressed)
burst_len : unsigned(2 downto 0); -- 0..7 → 1..8 words
ube_n : std_logic; -- byte enables for writes
lbe_n : std_logic;
end record;
-- For burst writes, write data is provided word-by-word via a companion stream.
-- The sram_req_t initiates the burst; subsequent words use sram_wdata_t.
type sram_wdata_t is record
valid : std_logic;
data : std_logic_vector(15 downto 0);
end record;
-- Returned for both reads and write completion.
-- For reads: data_valid pulses once per word (one new rdata word each cycle it is high).
-- done pulses on the same cycle as the final data_valid for reads; one cycle after
-- the last write ack for writes. Clients must not re-assert valid until done is seen.
type sram_ack_t is record
data_valid : std_logic; -- read data present this cycle
rdata : std_logic_vector(15 downto 0); -- one 16-bit word, valid when data_valid='1'
done : std_logic; -- last word of burst completed
end record;
3.2.1 Burst Data Flow #
For read bursts, data_valid pulses for one cycle per word, with rdata valid in that cycle; words need not arrive on consecutive cycles.
The client accumulates words into a local shift register:
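-- Example: 4-word (64-bit) accumulator for tile/sprite prefetch
signal burst_buf : std_logic_vector(63 downto 0);
signal word_cnt : unsigned(1 downto 0) := (others => '0');
process(clk_sys)
begin
  if rising_edge(clk_sys) then
    if sram_ack.data_valid = '1' then
      -- Shift in from the MSB end: after 4 words the first word sits in bits 15:0
      burst_buf <= sram_ack.rdata & burst_buf(63 downto 16);
      word_cnt <= word_cnt + 1; -- wraps to 0 after the fourth word
    end if;
    if sram_ack.done = '1' then
      -- done coincides with the final data_valid, so burst_buf holds all
      -- 4 words from the next clock edge; process tile/sprite row data then
      null;
    end if;
  end if;
end process;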
For 8-word bitmap bursts, the client uses a 128-bit accumulator. The accumulator feeds the line buffer write engine, which converts 16-bit SRAM words to the line buffer entry format (palette index or direct colour) and writes them sequentially.
Write bursts: sram_wdata_t.valid and .data must be asserted one cycle after
sram_req_t.valid for the first word, then each subsequent cycle for remaining words.
The client is responsible for holding the data stream valid. sram_ack.done signals
completion of all writes.
3.2.2 Timing Budget (100 MHz SYS clock, 640×480) #
One full line period = 800 pixel clocks = 3,200 SYS cycles.
Async SRAM page-mode access cost at 100 MHz (~10 ns SRAM):
- First word of any burst: ~3 cycles (address setup + access time)
- Each subsequent word (same page): ~2 cycles
| Burst length | Total cycles | Cycles/word |
|---|---|---|
| 1 word | 3 | 3.0 |
| 4 words | 9 | 2.25 |
| 8 words | 17 | 2.1 |
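The cost model behind this table can be written as a small helper package for use in testbench assertions. This is a sketch under the approximate figures stated above (3 cycles for the first word, 2 for each further word in the same page); the package and function names are hypothetical.

```vhdl
-- Hypothetical helper package; cycle costs are the estimates from this section,
-- not measured values.
package sram_budget_pkg is
  constant FIRST_WORD_CYCLES : positive := 3;
  constant NEXT_WORD_CYCLES  : positive := 2;

  -- Cycles for one burst of `words` words within the same page.
  function burst_cycles(words : positive) return positive;

  -- Cycles to move `total_words` using bursts of `burst_words` (burst count rounded up).
  function transfer_cycles(total_words : natural; burst_words : positive) return natural;
end package sram_budget_pkg;

package body sram_budget_pkg is
  function burst_cycles(words : positive) return positive is
  begin
    return FIRST_WORD_CYCLES + NEXT_WORD_CYCLES * (words - 1);
  end function;

  function transfer_cycles(total_words : natural; burst_words : positive) return natural is
    variable bursts : natural;
  begin
    bursts := (total_words + burst_words - 1) / burst_words;
    return bursts * burst_cycles(burst_words);
  end function;
end package body sram_budget_pkg;
```

With these definitions burst_cycles(4) is 9 and burst_cycles(8) is 17, matching the table. transfer_cycles(640, 8) is 1,360 and transfer_cycles(640, 4) is 1,440, the cost of one 640-word line at the two burst lengths.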
Bitmap fetch words per line by mode:
Words per line = (display_width × bpp) / 16. The table below shows the worst and representative cases. 640×480 16bpp is the worst case; all other modes are cheaper. Sub-16bpp drawing is a V2 feature, but sub-16bpp display (palette-indexed framebuffer) is supported in V1.5 and benefits from the reduced prefetch cost.
| Bitmap mode | Width | bpp | Words/line | 8-word bursts | Cycles |
|---|---|---|---|---|---|
| Hires RGB (worst case) | 640 | 16 | 640 | 80 | 1,360 |
| Hires 8bpp indexed | 640 | 8 | 320 | 40 | 680 |
| Hires 4bpp indexed | 640 | 4 | 160 | 20 | 340 |
| Hires 2bpp indexed | 640 | 2 | 80 | 10 | 170 |
| Hires 1bpp mono | 640 | 1 | 40 | 5 | 85 |
| Lowres RGB | 320 | 16 | 320 | 40 | 680 |
| Lowres 8bpp indexed | 320 | 8 | 160 | 20 | 340 |
| Lowres 4bpp indexed | 320 | 4 | 80 | 10 | 170 |
The budget table below uses the worst case (640×480 16bpp). Any lower bit depth or 320-pixel mode reduces bitmap cost proportionally, freeing more headroom for the drawing engine.
Budget with all clients active (640×480 16bpp, 4 tile layers, 48 sprites) — worst case:
| Client | Burst len | Bursts/line | Cycles |
|---|---|---|---|
| 4 tile layers (320-px logical) | 4 words | 80 total | 720 |
| Sprites (48 max per line) | 4 words | 48 | 432 |
| Bitmap prefetch (640-px 16bpp) | 8 words | 80 | 1,360 |
| Text char map prefetch | 4 words | 20 | 180 |
| Total display | | | 2,692 (84%) |
| Headroom for drawing engine | | | ~508 |
Bitmap uses 8-word bursts to reduce the per-word overhead versus 4-word bursts (would cost 1,440 cycles for the same 640 words). Tiles and sprites use 4-word bursts to match their natural data granularity (one sprite row = 16 pixels × 4bpp = 64 bits = 4 words).
Bitmap must NOT use a full-line single transaction. Even at the page-mode rate above, a 640-word monolithic fetch would hold the bus for ~1,281 cycles (3 + 639 × 2), blocking all other SRAM clients for about 40% of the line period. The bitmap prefetch controller issues 8-word bursts and yields between them, interleaving with tile and sprite access. Total bandwidth consumed is nearly the same (1,360 cycles); the difference is fairness.
Client burst length assignments:
| Client | burst_len value | Words | Rationale |
|---|---|---|---|
| Tile prefetch | 3 (4 words) | 4 | 16 pixels × 4bpp per burst |
| Sprite prefetch | 3 (4 words) | 4 | One sprite row exactly |
| Bitmap prefetch | 7 (8 words) | 8 | Reduce overhead on long sequential reads |
| Text char map | 3 (4 words) | 4 | 4 character entries per burst |
| Drawing engine spans | 0..7 | 1..8 | Depends on span width |
| CPU DMA | 0 (1 word) | 1 | CPU is waiting for DTACK |
All SRAM clients present an sram_req_t to the arbiter. The arbiter routes the winning request to
SRAMControl, which is the only module that drives the physical bus.
3.3 Drawing Commands (draw_pkg.vhd) #
type draw_op_t is (
DRAW_NOP,
DRAW_PIXEL,
DRAW_LINE,
DRAW_RECT_OUTLINE,
DRAW_RECT_FILL,
DRAW_TRIANGLE_OUTLINE,
DRAW_TRIANGLE_FILL,
DRAW_BLIT, -- rectangular region copy from SRAM
DRAW_EXPAND, -- expand 16-bit 1bpp bitmap to pixels using colour0/colour1
DRAW_FILL_SPAN -- internal: drawing engine emits these to SRAM writer
);
type draw_mode_t is (
MODE_REPLACE,
MODE_XOR,
MODE_OR,
MODE_AND,
MODE_TRANSPARENT -- skip if source matches transparency colour
);
type draw_cmd_t is record
op : draw_op_t;
mode : draw_mode_t;
colour0 : colour_t; -- foreground / primary colour
colour1 : colour_t; -- background / secondary colour (patterns, expand)
x0, y0 : signed(15 downto 0);
x1, y1 : signed(15 downto 0);
x2, y2 : signed(15 downto 0); -- triangle third vertex / blit WxH
src_addr : unsigned(19 downto 0); -- blit source
pattern : std_logic_vector(15 downto 0); -- inline dash/expand pattern
end record;
DRAW_EXPAND takes a 16-bit 1bpp bitmap in pattern, a position (x0,y0), and a pixel
count, expanding each bit to colour0 (bit=1) or colour1 (bit=0). This is the primary
mechanism for hardware text/icon rendering: the CPU queues one EXPAND command per glyph
row into the command FIFO, then continues preparing the next character while the drawing
engine renders independently.
OR and AND modes follow the same read-modify-write path as XOR, implemented in the
centralised SRAM writer (see §10.2).
The CPU fills a register window and writes a COMMIT register to push one draw_cmd_t into
the command FIFO. The drawing engine pops commands and processes them entirely in the SYS domain.
The CPU does not directly drive any drawing state machine signals.
4. VGA Pixel Pipeline #
This is the recommended starting point for the new project. Get a test pattern on screen before adding any other subsystem.
4.1 Pipeline Stages #
All stages clock at 25 MHz (PIX domain). The pipeline runs one pixel per clock.
Stage Signal Description
───── ───────────────── ─────────────────────────────────────────────────────
0 H_cnt, V_cnt Pixel counters (raw, including blanking)
1 lb_addr Line buffer read address = H_cnt - PIPELINE_DEPTH
lb_sel Which double-buffer half to read from
2 lb_data_raw Line buffer BRAM output (1-cycle registered read)
3 pal_addr Palette index extracted from lb_data_raw (if indexed mode)
4 pal_colour Palette BRAM output (1-cycle registered read)
sprite_colour Sprite line buffer output (registered)
5 comp_out Compositor selects highest-priority non-transparent pixel
6 vga_out Registered to I/O pins with sync signals
PIPELINE_DEPTH = 5 (for the above 6-stage pipeline, stages 1-5 after the counter).
HSYNC and VSYNC are passed through a 5-stage shift register at 25 MHz so they arrive at
the output register in the same cycle as the pixel they correspond to.
-- Sync delay shift register (in VGATiming or a wrapper)
signal hsync_pipe : std_logic_vector(PIPELINE_DEPTH-1 downto 0);
signal vsync_pipe : std_logic_vector(PIPELINE_DEPTH-1 downto 0);
process(clk_pix)
begin
if rising_edge(clk_pix) then
hsync_pipe <= hsync_pipe(PIPELINE_DEPTH-2 downto 0) & hsync_raw;
vsync_pipe <= vsync_pipe(PIPELINE_DEPTH-2 downto 0) & vsync_raw;
end if;
end process;
vga_hsync_o <= hsync_pipe(PIPELINE_DEPTH-1);
vga_vsync_o <= vsync_pipe(PIPELINE_DEPTH-1);
The line buffer read address is similarly offset:
lb_read_x <= H_cnt - PIPELINE_DEPTH; -- wraps in blanking, gated by H_active
This approach needs no FIFO and no clock domain above 25 MHz for the pixel path. PIPELINE_DEPTH is a package constant so changing the pipeline depth in one place automatically corrects the sync delay everywhere.
4.2 Compositor #
The compositor (Stage 5) selects the output pixel from several input sources:
Input sources (each a pixel_t with priority and transparent flag):
- sprite_pix (from sprite line buffer, SYS domain written, PIX domain read)
- tile1_pix .. tileN_pix (from tile line buffers)
- bitmap_pix (from bitmap/framebuffer line buffer)
- text_pix (from text engine line buffer)
- background_pix (solid colour fallback, never transparent, lowest priority)
Selection rule: highest .priority among non-transparent pixels wins.
For equal priority: sprite > tile (left-to-right in port list, deterministic).
function compositor_select(
sources : pixel_array_t -- array of pixel_t, ordered for tie-breaking
) return pixel_t is
variable best : pixel_t := PIXEL_NULL;
begin
for i in sources'range loop
if sources(i).valid = '1' and sources(i).transparent = '0' then
if best.valid = '0' or
unsigned(sources(i).priority) > unsigned(best.priority) then
best := sources(i);
end if;
end if;
end loop;
return best;
end function;
This function lives in vdp_pkg.vhd. Adding a new layer means adding one element to the
array — the compositor logic doesn’t change.
4.3 Palette Lookup #
At Stage 3/4, if the active mode uses indexed colour (8bpp palette mode):
- The line buffer stores 8-bit palette indices, not direct colour values
- The palette BRAM is read with that index
- The output is a colour_t (5-6-5 for V1.5)
- In direct colour mode (5-6-5 stored in line buffer), stages 3/4 are bypassed
For V2 the palette entry widens from 16-bit to 24-bit. Only colour_t and the palette RAM
width change — the pipeline structure is identical.
4.4 Double-Buffered Line Buffers #
Each tile/sprite layer has a line buffer pair (two BRAMs):
- Buffer A: being read by the PIX domain compositor (current display line)
- Buffer B: being written by the SYS domain prefetch engine (next line)
At HBlank end: the buffers swap. The swap signal crosses SYS→PIX via the toggle CDC pattern.
SYS writes to "write_buf" pointer (alternates 0/1 each line)
PIX reads from "read_buf" pointer (= ~write_buf, synchronised through toggle CDC)
The PIX domain never writes line buffers. The SYS domain never reads line buffers after the swap signal is issued. This eliminates all dual-clock BRAM complexity in the line buffers — each BRAM port is used by exactly one clock domain.
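One way to realise this port separation is a simple dual-port memory in which the buffer-select bit forms the top address bit. The sketch below is an assumption about structure, not the V1 implementation: the entity name, port names and the 2 × 1024 depth are hypothetical, and rd_buf is assumed to have already passed through the toggle CDC.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity; depth and names are illustrative only.
entity lb_double_buffer is
  port (
    -- Write port: SYS domain only
    clk_sys : in  std_logic;
    wr_buf  : in  std_logic;                       -- half being filled (0/1)
    wr_en   : in  std_logic;
    wr_addr : in  unsigned(9 downto 0);
    wr_data : in  std_logic_vector(15 downto 0);
    -- Read port: PIX domain only
    clk_pix : in  std_logic;
    rd_buf  : in  std_logic;                       -- half being displayed, already synchronised
    rd_addr : in  unsigned(9 downto 0);
    rd_data : out std_logic_vector(15 downto 0)
  );
end entity lb_double_buffer;

architecture rtl of lb_double_buffer is
  type ram_t is array (0 to 2047) of std_logic_vector(15 downto 0);
  signal ram : ram_t := (others => (others => '0'));
begin
  -- SYS side writes the half selected by wr_buf.
  process(clk_sys)
  begin
    if rising_edge(clk_sys) then
      if wr_en = '1' then
        ram(to_integer(wr_buf & wr_addr)) <= wr_data;
      end if;
    end if;
  end process;

  -- PIX side reads the other half with a 1-cycle registered output.
  process(clk_pix)
  begin
    if rising_edge(clk_pix) then
      rd_data <= ram(to_integer(rd_buf & rd_addr));
    end if;
  end process;
end architecture rtl;
```

The registered read matches the 1-cycle line buffer latency assumed in Stage 2 of the pipeline. Correct operation relies on rd_buf and wr_buf never selecting the same half while a line is being written, which is what the swap protocol above is meant to guarantee.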
5. Prefetch Architecture #
Prefetch is entirely in the SYS (100 MHz) domain. The PIX domain only triggers it.
5.1 Prefetch Trigger #
The ClockVGAtoSYS module delivers two prefetch pulses to the SYS domain:
- prefetch_vga_start_sys: start fetching tile/sprite data for VGA-domain line N+1
- prefetch_log_start_sys: start fetching logical (source) line content for the line buffer
For 640×480: logical line = VGA line. Prefetch starts at the beginning of HBlank (one VGA line ahead of display).
For 320×240 (pixel doubled):
- One logical line covers two VGA lines
- Prefetch starts two VGA lines ahead (prefetch_vga_start_sys is issued on the second-to-last visible line of the previous logical line)
- The line buffer holds 320 entries; the PIX domain pixel-doubles by using H_cnt(9 downto 1) (dividing horizontal address by 2) as the line buffer index
- Vertical doubling: same line buffer is used for two consecutive VGA lines (no swap at the halfway point)
The mode (640×480 or 320×240) is a register in the SYS domain. VGA timing is always 640×480 at 25 MHz at the output — the mode only affects addressing and buffer indexing.
5.2 Prefetch Controllers #
Each data source has its own prefetch controller:
| Controller | Triggers on | Outputs to |
|---|---|---|
| TilePrefetch | prefetch_log_start_sys | SRAM arbiter req |
| BitmapPrefetch | prefetch_log_start_sys | SRAM arbiter req |
| TextPrefetch | prefetch_log_start_sys | SRAM arbiter req (char map) + FontRAM read |
| SpriteLineSel | prefetch_vga_start_sys | Sprite attrib BRAM |
All prefetch controllers present sram_req_t to the arbiter. They use req/ack internally
and are unaware of each other. The arbiter handles contention.
6. Consolidated Tile Layer Architecture #
6.1 Single Generic TileLayer Module #
Replace the four separate tile generators with one TileLayer entity:
entity TileLayer is
generic (
LAYER_ID : integer := 0;
TILEMAP_W : integer := 64; -- tiles per row in the tile map
TILEMAP_H : integer := 64; -- tiles per column
TILE_W : integer := 8; -- pixels per tile (horizontal)
TILE_H : integer := 8 -- pixels per tile (vertical)
);
port (
clk_sys : in std_logic;
reset_n : in std_logic;
-- Configuration (from register file, SYS domain)
cfg : in tile_layer_cfg_t; -- see tile_pkg.vhd
-- Prefetch trigger
fetch_start : in std_logic;
fetch_line : in unsigned(8 downto 0);
-- SRAM interface
sram_req : out sram_req_t;
sram_ack : in sram_ack_t;
-- Line buffer output (written by SYS, read by PIX via double buffer)
lb_wr_addr : out unsigned(9 downto 0);
lb_wr_data : out std_logic_vector(15 downto 0); -- see lb_entry_t in vdp_pkg
lb_wr_en : out std_logic;
lb_swap : out std_logic -- toggle signal, PIX domain syncs this
);
end TileLayer;
tile_layer_cfg_t holds scroll X/Y, enable, base addresses, tile size override, and
critically: a priority value (unsigned(2 downto 0)). This priority is passed into
the pixel_t.priority field of every pixel this layer produces, allowing the compositor
to arbitrate without any fixed wiring.
tile_layer_cfg_t also holds a transparent_idx field (unsigned(3 downto 0)):
the palette index that the renderer treats as transparent. Default value 0 preserves V1
behaviour. Setting a different index frees palette entry 0 as a usable colour. Each layer
has its own transparent_idx — layers can use different transparent indices independently.
The renderer sets pixel_t.transparent = '1' when the pixel’s palette index equals
cfg.transparent_idx; the compositor never sees raw palette indices, only the flag.
Important: With transparent_idx configurable, palette index 0 is no longer
reserved. Firmware ported from V1 that relies on index 0 being transparent should
verify the register default or explicitly write 0 to LAYER_CTRL.transparent_idx.
colour_mode = "00" is the V1.5 native format: 4bpp indexed colour (4 pixels packed per
16-bit word, nibble extraction). "01" reserved for future 8bpp indexed.
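The nibble extraction and transparency test described above might look like the following. It is a sketch only: the package, the tile_px_t helper record and the function name are hypothetical, and the nibble order (leftmost pixel in bits 15:12) is an assumption that should be checked against the V1 tile data format.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical package; not part of the seven packages listed in section 3.
package tile_pixel_pkg is
  -- Palette index plus the flag the compositor sees.
  type tile_px_t is record
    index       : unsigned(3 downto 0);
    transparent : std_logic;
  end record;

  function extract_px(
    word            : std_logic_vector(15 downto 0);  -- one SRAM word = 4 pixels at 4bpp
    sel             : unsigned(1 downto 0);           -- 0 = leftmost pixel (ASSUMED bits 15:12)
    transparent_idx : unsigned(3 downto 0)            -- from tile_layer_cfg_t
  ) return tile_px_t;
end package tile_pixel_pkg;

package body tile_pixel_pkg is
  function extract_px(
    word            : std_logic_vector(15 downto 0);
    sel             : unsigned(1 downto 0);
    transparent_idx : unsigned(3 downto 0)
  ) return tile_px_t is
    variable nib : std_logic_vector(3 downto 0);
    variable px  : tile_px_t;
  begin
    -- Select one of the four nibbles.
    case sel is
      when "00"   => nib := word(15 downto 12);
      when "01"   => nib := word(11 downto 8);
      when "10"   => nib := word(7 downto 4);
      when others => nib := word(3 downto 0);
    end case;
    px.index := unsigned(nib);
    -- The flag is derived here so the compositor never compares palette indices.
    if px.index = transparent_idx then
      px.transparent := '1';
    else
      px.transparent := '0';
    end if;
    return px;
  end function;
end package body tile_pixel_pkg;
```

With transparent_idx left at its default of 0, a nibble of "0000" yields transparent = '1', which reproduces the V1 behaviour.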
Line buffer sizing: The line buffer depth is derived from TILE_W * TILEMAP_W * 2
(double-buffered). Standard tile mode (8×40 = 320 entries × 2 = 640) fits in one RAMB18.
A 640-pixel-wide mode (8×80 = 640 entries × 2 = 1280) requires a RAMB36. This is handled
automatically by synthesis when the BRAM is declared with the derived depth — no structural
change to TileLayer is needed. The LAYER_ID generic identifies which BRAM primitive ISE
should infer.
6.2 Tile Attribute Format (V1-compatible) #
The tile map entry is a single 16-bit word per tile — a deliberate constraint that gives the 68000 peak-performance single-cycle writes into the tile map. The format is carried forward from V1 unchanged:
Tile attribute word (16-bit):
[15:13] Palette select (3-bit, palette bank 0..7)
[12] V-flip
[11] H-flip
[10:0] Tile index (11-bit, 0..2047 — 2048 tiles)
Priority in V1.5 is layer-level only, set via tile_layer_cfg_t.priority. There is no
per-tile priority override field in the tile attribute word; all pixels from a layer carry
the same priority value through the compositor.
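The attribute word layout above maps directly onto a record with a pair of conversion functions, in the spirit of the "no raw std_logic_vector for structured data" rule. The sketch below uses hypothetical names (tile_attr_sketch_pkg, tile_attr_t and its field names); the actual tile_pkg.vhd definitions may differ.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical package and field names, following the bit layout in this section.
package tile_attr_sketch_pkg is
  type tile_attr_t is record
    palette : unsigned(2 downto 0);   -- [15:13] palette bank 0..7
    v_flip  : std_logic;              -- [12]
    h_flip  : std_logic;              -- [11]
    index   : unsigned(10 downto 0);  -- [10:0] tile index 0..2047
  end record;

  function to_tile_attr(w : std_logic_vector(15 downto 0)) return tile_attr_t;
  function from_tile_attr(a : tile_attr_t) return std_logic_vector;
end package tile_attr_sketch_pkg;

package body tile_attr_sketch_pkg is
  -- Unpack a 16-bit tile map word into its fields.
  function to_tile_attr(w : std_logic_vector(15 downto 0)) return tile_attr_t is
    variable a : tile_attr_t;
  begin
    a.palette := unsigned(w(15 downto 13));
    a.v_flip  := w(12);
    a.h_flip  := w(11);
    a.index   := unsigned(w(10 downto 0));
    return a;
  end function;

  -- Pack the fields back into a 16-bit word (inverse of to_tile_attr).
  function from_tile_attr(a : tile_attr_t) return std_logic_vector is
    variable w : std_logic_vector(15 downto 0);
  begin
    w(15 downto 13) := std_logic_vector(a.palette);
    w(12)           := a.v_flip;
    w(11)           := a.h_flip;
    w(10 downto 0)  := std_logic_vector(a.index);
    return w;
  end function;
end package body tile_attr_sketch_pkg;
```

Keeping the bit positions in one package means a move to a 32-bit entry in V2 would change these two functions and the record, not every module that reads a tile attribute.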
V2 per-tile priority path: switching to a 32-bit tile map entry (two consecutive 16-bit words) would free 16 bits for priority override, extended tile index, and other attributes without altering the tile graphics data format or SRAM layout. The CPU would write two words per tile update — still within the 68000 bus model and acceptable given that full-tile-map updates are done at frame boundaries, not per-pixel.
7. Scroll Tables #
7.1 X Scroll Table (existing, carried forward from V1) #
Each tile layer has a per-row X scroll lookup that allows independent horizontal scroll positions for each display line (raster-bar scroll, wavy effects). This already exists in V1 and is carried forward unchanged. The table is stored in a BRAM local to each tile layer (or shared with configurable base offset). No SRAM bandwidth is consumed.
7.2 Y Scroll Table (deferred — not a V1.5 implementation target) #
Decision: Per-column Y scroll is reserved for V2 or as an optional add-on if a
specific game effect requires it. The register space in tile layer block offset +7
(YSCROLL_CTRL) is reserved; the table is not implemented in V1.5.
Rationale: Per-column Y scrolling is primarily a demo-scene effect (Sega Mega Drive-style column waves). Game code can achieve most useful vertical scroll effects with global Y scroll per layer. The implementation cost in SRAM bandwidth is non-trivial (see below), and it is not worth trading core reliability for.
Design notes preserved for V2 reference:
BRAM registered-output reads require 2 clock cycles on real hardware (not 1 as simulation shows) when address computation involves offset addition. This was confirmed in V1: tile maps were originally BRAM-based and moved to SRAM, improving CPU random-access performance by eliminating indirect write overhead. The same constraint applies to any BRAM-based scroll table: addresses must be computed a cycle ahead.
SRAM bandwidth for per-column Y scroll: global Y scroll keeps tile map reads sequential and burst-able (~960 SYS cycles per layer). Per-column scatter breaks burst grouping; worst-case scatter per layer is similar in cycle count but consumes the remaining headroom when all 4 layers are active. A one-layer-at-a-time restriction would be required on this hardware. On V2 (wider SRAM bus, higher clock ceiling) the constraint relaxes. A VBLANK bulk-copy DMA (SRAM → BRAM tile map shadow) would eliminate scatter entirely at the cost of one-frame update latency — acceptable for game code and a clean V2 solution.
7.3 Priority Interaction Between Layers and Sprites #
V1 model (carried forward in V1.5):
Tile layers have fixed compositor positions (layer 0 = bottom, layer 3 = top of tile stack).
Sprites use a 4-bit priority field (0-15) for sprite-to-sprite Z-ordering: a higher value
draws over a lower value at the same pixel. The PixelCompositor maps the sprite priority
value to a position relative to the tile layers. The per-layer register priority
(tile_layer_cfg_t.priority) is new in V1.5 and is introduced during the TileLayer
consolidation stage (Stage 7); until then the compositor retains the V1 fixed tile ordering.
| Source | V1 priority model | V1.5 refactor target |
|---|---|---|
| Background fill | Lowest (fallback) | Unchanged |
| Tile layer 0 | Fixed position 0 | Configurable via tile_layer_cfg_t |
| Tile layer 1 | Fixed position 1 | Configurable via tile_layer_cfg_t |
| Tile layer 2 | Fixed position 2 | Configurable via tile_layer_cfg_t |
| Tile layer 3 | Fixed position 3 | Configurable via tile_layer_cfg_t |
| Sprites | 4-bit per-sprite (0-15) | Retained; maps via compositor |
| Overlay/text | Above all | Unchanged |
Sprite priority field (4-bit, 0-15): stored in sprite_attr_t.priority, bits [62:59]
of the 64-bit attribute word. Within a scanline, higher priority sprites overwrite lower
priority sprites in the sprite line buffer during rendering. Bits [58:57] of the attribute
word are unused and reserved.
Sprite transparent index: A single global SPRITE_TRANS_IDX register (4-bit, default 0)
in the Sprite register group sets the palette index treated as transparent across all sprites.
This matches the per-layer transparent_idx approach for tiles. Per-sprite transparent index
is not supported in V1.5 (the only free bits in the 64-bit attribute word are reserved, and assigning
them would depart from the V1 format); a global register covers the common case cleanly.
Full priority unification (all sources compared via pixel_t.priority) is a later
refactoring milestone or V2 target — it requires compositor changes that touch the display
output and should not be attempted until all other stages are stable.
8. Sprite Engine #
The V1 sprite engine consists of three modules that are retained and cleaned up in V1.5:
| Module | Role |
|---|---|
| SpriteAttribRAM | Dual-port BRAM, 256 × 64-bit attribute words |
| SpriteLineSelector | Scans all 256 sprites each HBlank; writes up to 48 visible entries to SpriteLineRAM |
| SpriteLineRenderer | Reads SpriteLineRAM, fetches tile data from SRAM, writes pixels to sprite line buffer |
Sprite attribute word (64-bit) — V1 format, preserved for CPU compatibility:
[63] enable
[62:59] priority 4-bit, 0-15 (sprite Z-order; higher overwrites lower)
[58:57] unused reserved, must be written as 0
[56] flip_x
[55] flip_y
[54:51] palette 4-bit, 0-15 palette bank
[50:40] tile_index 11-bit, 0-2047 (same tile format as tile layers, 4bpp)
[39:30] x 10-bit signed, logical pixel coordinate (-512..511)
[29:20] y 10-bit signed, logical pixel coordinate (-512..511)
[19:0] reserved
Sprites are fixed 16×16 pixels, 4bpp, operating in 320×240 logical space (same as tile layers). Maximum 48 sprites per scanline; the selector silently drops extras beyond that limit.
SpriteLineRAM intermediate entry (36-bit):
[35] flip_x
[34:31] priority (carried from attribute)
[30:29] unused
[28:25] row 4-bit (0-15; flip-Y applied by selector)
[24:15] x 10-bit signed
[14:11] palette 4-bit
[10:0] tile_index 11-bit
The sprite_pkg.vhd types sprite_attr_t and sprite_line_entry_t formalise both formats
with to_/from_ conversion functions, replacing the raw bit-slice access currently
scattered across the modules.
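As an illustration of what the selector derives from the attribute word, the sketch below tests whether a sprite covers a given logical line and computes the row field stored in SpriteLineRAM, with flip-Y applied. The package, record and function names are hypothetical, and the visibility test (0 <= line - y < 16) is an assumption based on the fixed 16×16 sprite size.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical helper package; bit positions follow the 64-bit attribute word above.
package sprite_row_pkg is
  type sprite_hit_t is record
    visible : std_logic;
    row     : unsigned(3 downto 0);   -- row within the sprite, flip-Y already applied
  end record;

  function sprite_hit(
    attr : std_logic_vector(63 downto 0);
    line : unsigned(8 downto 0)       -- logical line being prefetched
  ) return sprite_hit_t;
end package sprite_row_pkg;

package body sprite_row_pkg is
  function sprite_hit(
    attr : std_logic_vector(63 downto 0);
    line : unsigned(8 downto 0)
  ) return sprite_hit_t is
    variable y    : signed(9 downto 0);
    variable diff : signed(10 downto 0);
    variable hit  : sprite_hit_t;
  begin
    y    := signed(attr(29 downto 20));                    -- [29:20] y, 10-bit signed
    diff := signed(resize(line, 11)) - resize(y, 11);      -- line - y, cannot overflow 11 bits
    hit.visible := '0';
    hit.row     := (others => '0');
    -- attr(63) = enable; sprite covers 16 consecutive lines starting at y
    if attr(63) = '1' and diff >= 0 and diff < 16 then
      hit.visible := '1';
      hit.row     := unsigned(diff(3 downto 0));
      if attr(55) = '1' then                               -- [55] flip_y
        hit.row := 15 - hit.row;
      end if;
    end if;
    return hit;
  end function;
end package body sprite_row_pkg;
```

For an enabled sprite with y = 10 on line 12, diff is 2, so row is 2 without flip_y and 13 with it.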
9. Text Subsystem #
The text engine is retained as a separate module in V1.5. The primary change from V1 is moving the character map from a dedicated BRAM to SRAM. The font glyph BRAM is retained. See §9.6 for the V2 path to integrate text as a TileLayer mode.
9.1 Text Buffer in SRAM #
The text character map is stored in SRAM at a configurable base address, pointed to by
TEXT_BASE registers. This matches the framebuffer and tile map model.
Benefits over the V1 BRAM character map:
- Scroll is a pointer increment: advance TEXT_BASE by one row’s worth of character entries. No data movement. This is the critical improvement for EmuTOS terminal operation.
- Direct CPU writes: the CPU addresses text positions as *(text_base + row*cols + col). No index/data register protocol. The 68000 can write to any character position in one bus cycle.
- Multiple text screens: swap TEXT_BASE to show a different text page, same as framebuffer page flipping.
- DMA on text buffer: CMD_MEM_FILL clears the screen in one drawing engine command; CMD_MEM_COPY copies text regions.
- Frees one RAMB18 that was previously dedicated to the text character map.
The text prefetch reads a row of character entries from SRAM each HBlank using the same arbiter mechanism as tile prefetch. For an 80-column display, this is 80 word reads per line — well within the HBlank budget.
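The addressing and pointer-scroll arithmetic described in this section can be sketched as two small functions. The names are hypothetical, and the sketch assumes one 16-bit word per character entry and a row-major layout, as implied by the text_base + row*cols + col expression above.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical package; names are illustrative, not from the V1 sources.
package text_addr_pkg is
  subtype sram_addr_t is unsigned(19 downto 0);  -- word address, as in sram_req_t

  -- Word address of the character entry at (row, col).
  function text_entry_addr(
    text_base : sram_addr_t;
    row       : unsigned(7 downto 0);
    col       : unsigned(7 downto 0);
    cols      : positive                 -- 80 or 40, per TEXT_CTRL[5:4]
  ) return sram_addr_t;

  -- New TEXT_BASE after scrolling the display up by one text row.
  function scroll_one_row(text_base : sram_addr_t; cols : positive) return sram_addr_t;
end package text_addr_pkg;

package body text_addr_pkg is
  function text_entry_addr(
    text_base : sram_addr_t;
    row       : unsigned(7 downto 0);
    col       : unsigned(7 downto 0);
    cols      : positive
  ) return sram_addr_t is
    variable offset : unsigned(15 downto 0);
  begin
    offset := row * to_unsigned(cols, 8);        -- 8 x 8 bits gives a 16-bit product
    return text_base + resize(offset, 20) + resize(col, 20);
  end function;

  function scroll_one_row(text_base : sram_addr_t; cols : positive) return sram_addr_t is
  begin
    return text_base + to_unsigned(cols, 20);    -- addition wraps modulo 2**20
  end function;
end package body text_addr_pkg;
```

For an 80-column screen, the entry at row 2, column 5 sits at text_base + 165. Scrolling by one row adds 80 to TEXT_BASE without moving any character data.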
9.2 Text Character Map Entry (16-bit per character position) #
Per-character colour is a fundamental text mode feature: terminal emulators set individual character colours via ANSI escape codes, and many text UI conventions (errors in red, headings highlighted) depend on it. Global fg/bg registers cannot provide this. The attribute format therefore embeds colour per character:
Text map word (16-bit):
[15] Blink Per-character foreground blink
[14:12] Background 3-bit palette index (0..7)
[11:8] Foreground 4-bit palette index (0..15)
[7:0] Character Character code 0..255
This matches the classic CGA/EGA text attribute encoding and is compatible with the V1 text attribute format. Colour indices reference the main palette (same 256-entry palette used by tiles and sprites).
Blink and background colour interaction (same as CGA): when blink is not used, bit 15 can be treated as a 4th background colour bit, extending background to 16 colours (palette entries 0..15). The text engine always respects bit 15 as blink; firmware that wants 16 bg colours simply never uses blink. This is a firmware convention, not a mode switch — no register controls it.
Bold/italic are not hardware attributes. They require a separate bold or italic font to be loaded — a software concern. Attribute bits for these are not allocated; the hardware cannot implement them without font support that is outside the VDP’s scope.
9.3 Font Glyphs in BRAM (pre-initialised) #
Font glyph data (pixel patterns) remains in a dedicated BRAM:
- 256 characters × 16 rows × 1 word/row = 4,096 words (8 KB)
- Stored as 16-bit words: upper 8 bits = row pixels, MSB = leftmost pixel; lower 8 bits spare
- Pre-initialised at FPGA configuration using BRAM init attributes: display works at power-on before the CPU has executed any initialisation code
- CPU can upload a custom font via the FONT_ADDR / FONT_DATA register interface (with auto-increment, same as V1); the font BRAM owns this interface directly
- BRAM read is single-cycle, no SRAM arbiter access required for font data
This is the existing FontRAM module, reframed as an internal resource of the text engine
rather than a global BRAM managed by RegisterControl.
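The glyph lookup implied by this layout can be sketched as follows. The function names are hypothetical, and the address ordering (character code × 16 + row) is an assumption consistent with 256 characters × 16 rows; blink handling is deliberately left out.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical helper package for the text engine; names are illustrative.
package font_lookup_pkg is
  -- FontRAM word address: ASSUMED layout is char_code * 16 + row.
  function glyph_addr(char_code : unsigned(7 downto 0);
                      row       : unsigned(3 downto 0)) return unsigned;

  -- Pixel bit for column 0..7; the upper byte holds the row, MSB = leftmost pixel.
  function glyph_pixel(glyph_word : std_logic_vector(15 downto 0);
                       col        : unsigned(2 downto 0)) return std_logic;

  -- 4-bit palette index chosen from the text map word (blink ignored here).
  function text_colour_idx(map_word : std_logic_vector(15 downto 0);
                           pixel_on : std_logic) return unsigned;
end package font_lookup_pkg;

package body font_lookup_pkg is
  function glyph_addr(char_code : unsigned(7 downto 0);
                      row       : unsigned(3 downto 0)) return unsigned is
  begin
    return char_code & row;                     -- 12-bit address, 0..4095
  end function;

  function glyph_pixel(glyph_word : std_logic_vector(15 downto 0);
                       col        : unsigned(2 downto 0)) return std_logic is
  begin
    return glyph_word(15 - to_integer(col));    -- col 0 reads bit 15, col 7 reads bit 8
  end function;

  function text_colour_idx(map_word : std_logic_vector(15 downto 0);
                           pixel_on : std_logic) return unsigned is
  begin
    if pixel_on = '1' then
      return unsigned(map_word(11 downto 8));          -- foreground, 0..15
    else
      return '0' & unsigned(map_word(14 downto 12));   -- background, 0..7
    end if;
  end function;
end package body font_lookup_pkg;
```

Under the assumed layout, character code 0x41 at glyph row 3 reads FontRAM address 0x413. The lower 8 bits of each glyph word are never indexed, matching the "lower 8 bits spare" note above.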
9.4 Text Engine Register Block #
The text engine owns its own register group (Group 0x2) in the distributed register scheme:
| Offset | Register | Contents |
|---|---|---|
| +0 | TEXT_BASE_L | Character map SRAM base address, lower 16 bits |
| +1 | TEXT_BASE_H | Character map SRAM base address, upper 4 bits |
| +2 | TEXT_CURSOR | [15:8] cursor row, [7:0] cursor column |
| +3 | TEXT_CURSOR_ATTR | Cursor colour: [7:4] background index (0..7), [3:0] foreground index (0..15) |
| +4 | TEXT_SCROLL_X | Horizontal fine scroll (0..7, sub-character pixel shift) |
| +5 | TEXT_SCROLL_Y | Vertical scroll (row count, 0..N) |
| +6 | TEXT_CTRL | [0]: enable; [1]: cursor enable; [3:2]: blink rate (frames per half-period); [5:4]: columns (00=80, 01=40) |
| +7 | spare | |
9.5 Hardware Cursor #
The text engine tracks cursor position from TEXT_CURSOR. When the rendered character
position matches and cursor is enabled, the cursor overlay is applied: the character cell
is drawn using the colours from TEXT_CURSOR_ATTR rather than the character’s own
attribute byte, effectively highlighting it. Blink is driven by the same frame counter
that controls character blink (same rate, same TEXT_CTRL blink field).
Cursor style (underline vs full-block) is controlled via TEXT_CTRL. The cursor does not
need a separate BRAM or sprite; it is a conditional colour substitution during character
rendering.
The existing hardware cursor sprite (Group 0xA registers) is retained independently for pointer/GUI cursor use and is no longer the primary text cursor mechanism in V1.5.
9.6 V2 Path: Text as TileLayer Mode #
For V2, text rendering can be integrated as a colour_mode = 1bpp variant of the generic
TileLayer entity:
- TILE_H = 16 generic covers 8×16 characters directly; memory footprint per tile is identical to a 4bpp 8×8 tile (16 words each)
- TILEMAP_W = 80 for 640-pixel text: line buffer becomes 80×8 = 640 pixels wide, double-buffered to 1280 entries, requiring a RAMB36 (vs RAMB18 for 320-pixel tile layers). XC6SLX25 has 26 RAMB36 equivalents; with V1.5 BRAM savings this is affordable.
- The attribute word is reinterpreted in 1bpp mode using the same V1.5 per-character encoding: [15]=blink, [14:12]=bg palette index (0..7), [11:8]=fg palette index (0..15), [7:0]=character code; transparent_idx is not applicable (1bpp uses palette indices per-character, not a single transparency key)
- Font glyphs live in a BRAM local to the layer (same as V1.5 FontRAM, just re-owned)
- Cursor position and style become fields in tile_layer_cfg_t for text mode
- Removes the separate TextGen module entirely
This path is deferred to V2 to avoid touching the compositor and display pipeline while the V1.5 foundation is still being established.
10. Drawing Engine #
10.1 Command FIFO #
The CPU writes drawing parameters into a register window and asserts a COMMIT register.
This pushes a draw_cmd_t into a BRAM-based FIFO. The drawing engine (SYS domain) pops
and processes commands independently. The CPU can poll a FIFO_FULL status bit or be
interrupted when the FIFO has space.
CPU side (50 MHz)                   Drawing Engine (100 MHz)
┌──────────────────┐                ┌──────────────────────────┐
│ Register file    │──write──▶      │ Command FIFO (BRAM)      │
│ PARAM_X0..X2     │                │ (dual-port: CPU writes,  │
│ PARAM_Y0..Y2     │                │  engine reads)           │
│ PARAM_COLOUR     │                └────────────┬─────────────┘
│ PARAM_OP/MODE    │                             │ draw_cmd_t
│ REG_COMMIT  ─────┼──push─────▶                 ▼
│ REG_STATUS  ◀────┼──busy/full────        DrawDispatch
└──────────────────┘                         │       │
                                             ▼       ▼
                                          LineOp TriangleOp ...
                                             │       │
                                             └───┬───┘
                                                 ▼
                                             Span FIFO
                                                 │
                                                 ▼
                                       SRAM writer (req/ack)
Key benefit: The CPU fills the FIFO non-blocking and continues preparing the next
command while the drawing engine renders. This is the primary performance improvement
for operations that require many small commands — such as rendering a screen of text
using DRAW_EXPAND (one command per glyph row, queued without stalling the CPU).
10.2 Two Output Paths from the Drawing Engine #
The drawing engine produces two types of SRAM write requests depending on the operation:
Path A — Span write (filled rectangles, triangle fill, filled polygon):
type draw_span_t is record
    y       : unsigned(8 downto 0);
    x_start : unsigned(9 downto 0);
    x_end   : unsigned(9 downto 0);
    colour  : colour_t;
    mode    : draw_mode_t;
end record;
The span FIFO decouples geometry computation from SRAM write throughput. Rectangle fills emit one span per row. Triangle fills use a fixed-point edge walker (left and right edge positions per scanline) that emits spans row by row as edges are stepped. The SRAM writer converts a span into sequential word-aligned writes, issuing burst requests where the span is wide enough to benefit.
Path B — Pixel write (lines, individual pixel primitives, DRAW_EXPAND):
type draw_pixel_t is record
    x      : unsigned(9 downto 0);
    y      : unsigned(8 downto 0);
    colour : colour_t;
    mode   : draw_mode_t;
end record;
Bresenham line drawing generates scattered pixels that do not form horizontal runs.
DRAW_EXPAND generates 16 sequential pixels from one command (one per bit of the
pattern word); these are adjacent horizontally but arrive as individual pixel writes.
Both go through the pixel write path directly to the SRAM writer as individual
single-word accesses: a plain write in MODE_REPLACE, a read-modify-write cycle in the
logical modes.
Both paths converge at the SRAM writer, which is the sole location that handles draw modes. Mode handling was distributed across V1 drawing modules; in V1.5 it is in one place:
- MODE_REPLACE: write directly (span: burst write; pixel: single write)
- MODE_XOR / MODE_OR / MODE_AND: read-modify-write (single word per operation)
- MODE_TRANSPARENT: skip write if source pixel matches transparency colour
The span path uses the SRAM burst interface for wide spans (burst_len is a 3-bit field,
allowing bursts of up to 8 words; see MAX_BURST in §11.1). The pixel path always uses
burst_len=0 (single word). A shallow synchronous write FIFO between the geometry engine
and SRAM writer allows burst assembly for consecutive addresses, which is particularly
effective for solid-colour hline spans.
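The mode handling listed above can be sketched as three pure functions that the SRAM writer would consult per word. This is an illustration only: mode_sketch_t stands in for draw_mode_t (whose real encoding lives in draw_pkg and is not shown here), and the package name, function names, and the plain 16-bit word type are assumptions.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical stand-in for the mode logic inside the SRAM writer.
package draw_mode_sketch_pkg is
    type mode_sketch_t is (MODE_REPLACE, MODE_XOR, MODE_OR, MODE_AND, MODE_TRANSPARENT);
    subtype word_t is std_logic_vector(15 downto 0);

    -- True when the destination word must be read before writing.
    function needs_read(mode : mode_sketch_t) return boolean;

    -- True when the write is suppressed (source equals the transparency key).
    function skip_write(mode : mode_sketch_t; src, key : word_t) return boolean;

    -- Word to write, given the source pixel and the destination word read back.
    function combine(mode : mode_sketch_t; src, dst : word_t) return word_t;
end package draw_mode_sketch_pkg;

package body draw_mode_sketch_pkg is
    function needs_read(mode : mode_sketch_t) return boolean is
    begin
        return mode = MODE_XOR or mode = MODE_OR or mode = MODE_AND;
    end function;

    function skip_write(mode : mode_sketch_t; src, key : word_t) return boolean is
    begin
        return mode = MODE_TRANSPARENT and src = key;
    end function;

    function combine(mode : mode_sketch_t; src, dst : word_t) return word_t is
    begin
        case mode is
            when MODE_XOR => return dst xor src;
            when MODE_OR  => return dst or src;
            when MODE_AND => return dst and src;
            when others   => return src;  -- REPLACE, and TRANSPARENT when not skipped
        end case;
    end function;
end package body draw_mode_sketch_pkg;
```

Keeping these as functions of (mode, src, dst) means both the span path and the pixel path can share them, which is the point of centralising mode handling in one module.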
10.3 Canvas Space, Drawbase, and Stride #
Drawing coordinates are in canvas space, not screen space. The canvas is an
abstract pixel grid whose origin (0,0) maps to a physical SRAM address given by the
DRAW_BASE register. The V1 drawbase register already provides this; V1.5 makes
it explicit and adds a companion DRAW_STRIDE register.
SRAM address = DRAW_BASE + (y * DRAW_STRIDE) + x_word_offset
DRAW_STRIDE is the width of the canvas in 16-bit words, and x_word_offset is the pixel's
horizontal position in words (equal to x at 16bpp, where each pixel occupies one word).
In normal operation the stride equals the screen width in words (e.g., 640 for a
640-pixel 5-6-5 display), but for off-screen canvases (textures, a back buffer of a
different size) it can differ.
This means:
- Drawing to the front buffer: DRAW_BASE = DISPLAY_BASE, DRAW_STRIDE = screen_stride
- Drawing to a back buffer: DRAW_BASE = back_buffer_addr, same stride
- Screen flip: swap DISPLAY_BASE to point at the completed back buffer
- Drawing to an off-screen texture: DRAW_BASE and DRAW_STRIDE set for that texture
All coordinates in draw_cmd_t are canvas-relative (signed(15 downto 0)) so the same
command works regardless of where the canvas is located in SRAM. Commands with coordinates
entirely outside the clip window are discarded in DrawDispatch before any SRAM access.
Clipping is applied against CLIP_X0/Y0/CLIP_X1/Y1 registers, which are also in canvas
space. The default clip window is (0,0) to (canvas_width-1, canvas_height-1).
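A sketch of the clip test and the canvas-to-SRAM address calculation follows. It assumes 16bpp, where one pixel occupies one 16-bit word so the horizontal offset is simply x, and a 20-bit word address (16 lower bits plus 4 upper bits, as in the DRAW_BASE register pair). The package and function names are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical helpers; not the actual DrawDispatch or SRAM writer code.
package canvas_sketch_pkg is
    subtype coord_t is signed(15 downto 0);    -- canvas-relative coordinate
    subtype addr_t  is unsigned(19 downto 0);  -- 20-bit SRAM word address

    -- True when (x, y) lies inside the inclusive clip window.
    function inside_clip(x, y, x0, y0, x1, y1 : coord_t) return boolean;

    -- DRAW_BASE + y * DRAW_STRIDE + x, all in 16-bit words.
    function canvas_addr(base   : addr_t;
                         stride : unsigned(15 downto 0);
                         x      : unsigned(9 downto 0);
                         y      : unsigned(8 downto 0)) return addr_t;
end package canvas_sketch_pkg;

package body canvas_sketch_pkg is
    function inside_clip(x, y, x0, y0, x1, y1 : coord_t) return boolean is
    begin
        return x >= x0 and x <= x1 and y >= y0 and y <= y1;
    end function;

    function canvas_addr(base   : addr_t;
                         stride : unsigned(15 downto 0);
                         x      : unsigned(9 downto 0);
                         y      : unsigned(8 downto 0)) return addr_t is
        variable row_off : unsigned(24 downto 0);  -- 9 + 16 bits
    begin
        row_off := y * stride;
        -- resize truncates to 20 bits; firmware is assumed to keep the
        -- canvas within the addressable SRAM range.
        return base + resize(row_off, 20) + resize(x, 20);
    end function;
end package body canvas_sketch_pkg;
```

The clip test operates on the signed command coordinates, while the address function takes the narrower unsigned values that remain once clipping has removed everything outside the window.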
10.4 Pattern System #
Two pattern mechanisms serve different use cases:
Inline 16-bit pattern (lines, DRAW_EXPAND):
The draw_cmd_t.pattern word is a 1D bitmask. For lines, each bit selects the colour of
the corresponding pixel along the line (bit 15 = first pixel). When pattern = X"0000",
all pixels use colour0 (solid draw). For DRAW_EXPAND, the pattern word IS the source
bitmap — each bit expands to a full pixel using colour0 (bit=1) or colour1 (bit=0).
PatternRAM (rectangles, area fills):
A 64-entry × 16-row × 16-bit RAM holds 2D fill patterns. Each entry is a 16×16 tile
pattern indexed by (x mod 16, y mod 16) or relative to the shape’s origin. The PatternRAM
is pre-initialised at FPGA configuration with the standard EmuTOS fVDI pattern set;
remaining entries are available for application upload at runtime.
Both mechanisms produce the same output contract to the SRAM writer: a pattern_set
signal (1 = use colour0, 0 = use colour1). The SRAM writer applies colours without
knowing which pattern source was used.
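The pattern_set contract can be sketched for the inline pattern case as below. Names are hypothetical, and one behaviour is an assumption not stated above: the line pattern is taken to repeat every 16 pixels, so only the low four bits of the pixel counter are used.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical helpers illustrating the pattern_set contract.
package pattern_sketch_pkg is
    subtype word_t is std_logic_vector(15 downto 0);

    -- Line drawing: bit 15 selects the first pixel; X"0000" means solid colour0.
    -- ASSUMPTION: the pattern repeats every 16 pixels.
    function line_pattern_set(pattern : word_t;
                              pixel_n : unsigned(3 downto 0)) return std_logic;

    -- DRAW_EXPAND: the pattern word is the bitmap itself, with no solid special case.
    function expand_pattern_set(pattern : word_t;
                                pixel_n : unsigned(3 downto 0)) return std_logic;

    -- Colour choice made by the SRAM writer: 1 = colour0, 0 = colour1.
    function select_colour(pattern_set      : std_logic;
                           colour0, colour1 : word_t) return word_t;
end package pattern_sketch_pkg;

package body pattern_sketch_pkg is
    function line_pattern_set(pattern : word_t;
                              pixel_n : unsigned(3 downto 0)) return std_logic is
    begin
        if pattern = X"0000" then
            return '1';  -- solid draw: every pixel uses colour0
        end if;
        return pattern(15 - to_integer(pixel_n));
    end function;

    function expand_pattern_set(pattern : word_t;
                                pixel_n : unsigned(3 downto 0)) return std_logic is
    begin
        return pattern(15 - to_integer(pixel_n));
    end function;

    function select_colour(pattern_set      : std_logic;
                           colour0, colour1 : word_t) return word_t is
    begin
        if pattern_set = '1' then
            return colour0;
        end if;
        return colour1;
    end function;
end package body pattern_sketch_pkg;
```

A PatternRAM source would feed the same select_colour step with a bit taken from the addressed pattern row, so the writer never needs to know which mechanism produced pattern_set.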
The PatternRAM is owned by the drawing engine module. CPU access (upload/readback) is via the Pattern register group (Group 0xD), handled by the drawing engine’s distributed register interface.
11. SRAM Arbiter #
11.1 Interface #
The arbiter accepts requests from N clients, all presenting sram_req_t, and forwards
one at a time to the physical SRAM controller. The physical controller is entirely
internal to SRAMControl.
entity SRAMarbiter is
    generic (
        N_CLIENTS : integer := 6;  -- extensible without touching SRAMControl
        MAX_BURST : integer := 8   -- must match SRAM_MAX_BURST in sram_pkg
    );
    port (
        clk_sys  : in  std_logic;
        reset_n  : in  std_logic;
        -- Client interfaces (indexed 0..N_CLIENTS-1)
        req_in   : in  sram_req_array_t(0 to N_CLIENTS-1);
        ack_out  : out sram_ack_array_t(0 to N_CLIENTS-1);
        -- Physical SRAM (to SRAMControl)
        phys_req : out sram_req_t;
        phys_ack : in  sram_ack_t
    );
end SRAMarbiter;
11.2 Priority Assignment #
| Client index | Client | Priority class |
|---|---|---|
| 0 | Tile prefetch (all layers, round-robined internally) | Medium |
| 1 | Sprite prefetch | Medium |
| 2 | Bitmap prefetch | Medium |
| 3 | Text prefetch (character map reads) | Medium |
| 4 | Drawing engine span writer | Low (background) |
| 5 | CPU DMA (direct CPU memory access) | High (but rare) |
Priority class is a generic on each client’s arbiter slot. Within the same priority class, round-robin scheduling prevents starvation. CPU DMA is high priority because the CPU is stalled waiting for DTACK, but it issues requests only rarely. Display prefetch is medium: it has a time budget (HBlank) but is not as urgent as the CPU.
Key improvement over V1: Adding a new SRAM client means incrementing N_CLIENTS and
adding one slot. The existing SRAMControl physical layer does not change.
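The grant decision described above (highest class first, round-robin within a class) can be sketched as a single function. This is an illustration under assumptions: the class table is written as a package constant rather than per-slot generics, requests are reduced to one bit per client, and all names are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical sketch of the grant decision only; no burst handling.
package arb_sketch_pkg is
    constant N_CLIENTS : integer := 6;

    type prio_class_t is (PRIO_LOW, PRIO_MEDIUM, PRIO_HIGH);
    type prio_array_t is array (0 to N_CLIENTS-1) of prio_class_t;

    -- Classes from the priority table, indexed by client number.
    constant CLIENT_PRIO : prio_array_t :=
        (PRIO_MEDIUM, PRIO_MEDIUM, PRIO_MEDIUM, PRIO_MEDIUM, PRIO_LOW, PRIO_HIGH);

    -- Returns the client to grant next, or -1 when no client is requesting.
    -- last is the client that received the previous grant.
    function pick_client(req  : std_logic_vector(0 to N_CLIENTS-1);
                         last : integer range 0 to N_CLIENTS-1) return integer;
end package arb_sketch_pkg;

package body arb_sketch_pkg is
    function pick_client(req  : std_logic_vector(0 to N_CLIENTS-1);
                         last : integer range 0 to N_CLIENTS-1) return integer is
        variable idx : integer range 0 to N_CLIENTS-1;
    begin
        -- Scan classes from highest to lowest.
        for cls in PRIO_HIGH downto PRIO_LOW loop
            -- Within a class, start one past the previous grant and wrap.
            idx := last;
            for step in 1 to N_CLIENTS loop
                if idx = N_CLIENTS-1 then
                    idx := 0;
                else
                    idx := idx + 1;
                end if;
                if req(idx) = '1' and CLIENT_PRIO(idx) = cls then
                    return idx;
                end if;
            end loop;
        end loop;
        return -1;
    end function;
end package body arb_sketch_pkg;
```

Starting the scan one past the previous grant means a client that has just been served is considered last within its class, which is what prevents starvation among the four medium-priority prefetchers.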
12. Register File and CPU Interface #
12.1 Register Map Organisation #
The V1.5 register map uses a 7-bit word address (128 registers), decoded as:
- addr[6:3] = 4-bit group select (16 groups)
- addr[2:0] = 3-bit register within group (8 registers per group)
Note: The V1.5 register map is incompatible with V1. This is deliberate — the V1 map grew without structure and mixed unrelated concerns in the same address group. The new map gives every logical subsystem a clean, independent address block. Firmware targeting V1.5 hardware must use the new addresses.
| Group | Address range | Block | Notes |
|---|---|---|---|
| 0x0 | 0x00–0x07 | System | Status, control (display mode, enables, display_blank), interrupt, SRAM page |
| 0x1 | 0x08–0x0F | Display | Framebuf base L/H, linescroll base L/H, border colour |
| 0x2 | 0x10–0x17 | Text engine | Text base L/H, cursor position, cursor attr, scroll X/Y, ctrl (see §9.4) |
| 0x3 | 0x18–0x1F | Font RAM | Font addr, font data (auto-increment); spare |
| 0x4 | 0x20–0x27 | Palette | Palette addr, data, ctrl0, ctrl1; spare |
| 0x5 | 0x28–0x2F | Tile layer 0 | Tile base, map base, scroll X/Y, ctrl, xscroll addr/data |
| 0x6 | 0x30–0x37 | Tile layer 1 | (same structure) |
| 0x7 | 0x38–0x3F | Tile layer 2 | (same structure) |
| 0x8 | 0x40–0x47 | Tile layer 3 | (same structure) |
| 0x9 | 0x48–0x4F | Sprites | Sprite base L/H, index, pos X/Y, tile, attrib, transparent index |
| 0xA | 0x50–0x57 | Hardware cursor | Enable, X, Y, addr, data, col1/2/3 |
| 0xB | 0x58–0x5F | Drawing | Base L/H, stride, status, color0/1, mode, pattern |
| 0xC | 0x60–0x67 | Draw params | Param 0..6, CMD_COMMIT |
| 0xD | 0x68–0x6F | Pattern + clip | Pattern addr, pattern data; CLIP_X0/Y0/X1/Y1 |
| 0xE | 0x70–0x77 | SRAM map | Page ctrl; reserved for co-processor (see §14.1) |
| 0xF | 0x78–0x7F | Board ID | Version, type ID |
System block detail (Group 0x0):
| Offset | Register | Notes |
|---|---|---|
| +0 | STATUS | Bit 0: draw engine not busy; bit 1: vblank; bit 2: line int |
| +1 | CONTROL | Bits [2:0]: display mode; [5:3]: bpp; [8]: cursor_en; [9]: blink_en; [10]: mode_80col; [11]: sprite_en; [13:12]: scanline_ctrl; [14]: text_overlay; [15]: display_blank |
| +2 | INTERRUPT_CTRL | Individual interrupt enable/mask bits (see §12.3) |
| +3 | LINEINT_Y | Line interrupt scanline position |
| +4 | SRAM_PAGE_CTRL | SRAM page select |
| +5–+7 | spare | |
CONTROL[15] — display_blank: When set to 1, the VGA output drives black on all
colour channels while HSYNC and VSYNC continue normally (monitor stays synchronised, no
re-acquire delay on unblank). The drawing engine and all prefetch continue running. This
is a display-output gate only — not a pause. Typical use: set blank before loading the
framebuffer, update FRAMEBUF_BASE, then clear blank to reveal a complete frame
without visible loading artefacts.
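A minimal sketch of the output gate follows. It assumes display_blank has already been brought into the pixel clock domain, and the entity and port names are hypothetical rather than those of the existing VGA output stage.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical output-stage gate: RGB forced to black, sync untouched.
entity vga_blank_gate_sketch is
    port (
        clk_pix       : in  std_logic;
        display_blank : in  std_logic;  -- CONTROL[15], assumed synchronised to clk_pix
        r_in          : in  std_logic_vector(4 downto 0);
        g_in          : in  std_logic_vector(5 downto 0);
        b_in          : in  std_logic_vector(4 downto 0);
        hsync_in      : in  std_logic;
        vsync_in      : in  std_logic;
        r_out         : out std_logic_vector(4 downto 0);
        g_out         : out std_logic_vector(5 downto 0);
        b_out         : out std_logic_vector(4 downto 0);
        hsync_out     : out std_logic;
        vsync_out     : out std_logic
    );
end entity vga_blank_gate_sketch;

architecture rtl of vga_blank_gate_sketch is
begin
    process (clk_pix)
    begin
        if rising_edge(clk_pix) then
            -- Sync passes through with the same one-cycle delay as RGB.
            hsync_out <= hsync_in;
            vsync_out <= vsync_in;
            if display_blank = '1' then
                r_out <= (others => '0');
                g_out <= (others => '0');
                b_out <= (others => '0');
            else
                r_out <= r_in;
                g_out <= g_in;
                b_out <= b_in;
            end if;
        end if;
    end process;
end architecture rtl;
```

Registering sync alongside RGB keeps the two aligned, so the gate adds one pixel of latency to both and does not disturb the existing latency compensation.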
Drawing block detail (Group 0xB):
| Offset | Register | Notes |
|---|---|---|
| +0 | DRAW_BASE_LO | Canvas SRAM base address, lower 16 bits |
| +1 | DRAW_BASE_HI | Canvas SRAM base address, upper 4 bits |
| +2 | DRAW_STRIDE | Canvas width in 16-bit words (default = screen stride) |
| +3 | DRAW_STATUS | Bit 0: FIFO full; bit 1: FIFO empty; bit 2: busy |
| +4 | DRAW_COLOR0 | Primary / foreground colour (5-6-5) |
| +5 | DRAW_COLOR1 | Secondary / background colour (5-6-5) |
| +6 | DRAW_MODE | Draw mode (replace/xor/or/and/transparent) |
| +7 | DRAW_PATTERN | Inline 16-bit pattern (lines, DRAW_EXPAND) |
Draw params block detail (Group 0xC):
| Offset | Register | Notes |
|---|---|---|
| +0 | PARAM_0 | Command parameter registers (opcode, coords, etc.) |
| … | … | Written before CMD_COMMIT |
| +6 | PARAM_6 | |
| +7 | CMD_COMMIT | Write to push draw_cmd_t into FIFO |
Pattern + clip block detail (Group 0xD):
| Offset | Register | Notes |
|---|---|---|
| +0 | PATTERN_ADDR | PatternRAM write address (auto-increment on data write) |
| +1 | PATTERN_DATA | PatternRAM write data (16-bit); increments PATTERN_ADDR |
| +2 | CLIP_X0 | Clip window left edge (canvas space) |
| +3 | CLIP_Y0 | Clip window top edge (canvas space) |
| +4 | CLIP_X1 | Clip window right edge (canvas space) |
| +5 | CLIP_Y1 | Clip window bottom edge (canvas space) |
| +6–+7 | spare | |
Tile layer block detail (one set per layer, Groups 0x5–0x8):
| Offset | Register | Notes |
|---|---|---|
| +0 | TILE_BASE_L | Tile data SRAM base address, lower 16 bits |
| +1 | TILE_BASE_H | Tile data SRAM base address, upper 4 bits |
| +2 | MAP_BASE_L | Tile map SRAM base address, lower 16 bits |
| +3 | MAP_BASE_H | Tile map SRAM base address, upper 4 bits |
| +4 | SCROLL_X | Global X scroll for this layer |
| +5 | SCROLL_Y | Global Y scroll for this layer |
| +6 | LAYER_CTRL | [0]: enable; [2:1]: colour mode; [5:3]: priority; [9:6]: transparent_idx (default 0) |
| +7 | XSCROLL_ADDR | Indirect address for per-row X scroll table write |
(Y scroll indirect data and YSCROLL_CTRL deferred to V2 — see §7.2)
12.2 Distributed Register Ownership #
In V1.5, each subsystem module owns its own register decode, configuration FFs, and any
RAM access (text BRAM, font BRAM, sprite attrib BRAM, pattern RAM, palette RAM). The
central RegisterControl module is reduced to a thin bus router:
- Group decode: RegisterControl decodes addr[6:3] (4-bit group) and drives a qualified req to exactly one subsystem module per transaction.
- Read-data mux: RegisterControl selects which subsystem’s data_out to return based on the active group: a shallow 1-of-16 mux on 4 bits, not the deep nested case structure of V1.
- ACK arbitration: Only the addressed module ever has its req asserted; its ack is returned directly. No collision is possible.
- Bus timing: RegisterControl retains the cpureg_read_pending gate to prevent a new transaction starting while a pipelined BRAM read is in flight.
Each subsystem module presents a standard bus port:
-- Standard register sub-port (all subsystems)
reg_cmd : in unsigned(2 downto 0); -- sub-address within group
reg_data_in : in std_logic_vector(15 downto 0);
reg_rw : in std_logic;
reg_req : in std_logic; -- asserted only when group matches
reg_ack : out std_logic;
reg_data_out : out std_logic_vector(15 downto 0)
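As an illustration of how a subsystem might implement this port, here is a minimal sketch of a block holding eight plain configuration registers. Several points are assumptions not fixed by the text above: reg_rw = '0' is taken to mean write, the acknowledge is a one-cycle pulse, and the router is expected to drop reg_req once it sees the acknowledge. The entity name is hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical register block behind the standard sub-port.
entity reg_slave_sketch is
    port (
        clk_sys      : in  std_logic;
        reset_n      : in  std_logic;
        reg_cmd      : in  unsigned(2 downto 0);
        reg_data_in  : in  std_logic_vector(15 downto 0);
        reg_rw       : in  std_logic;  -- ASSUMPTION: '0' = write, '1' = read
        reg_req      : in  std_logic;
        reg_ack      : out std_logic;
        reg_data_out : out std_logic_vector(15 downto 0)
    );
end entity reg_slave_sketch;

architecture rtl of reg_slave_sketch is
    type reg_file_t is array (0 to 7) of std_logic_vector(15 downto 0);
    signal regs : reg_file_t := (others => (others => '0'));
begin
    process (clk_sys)
    begin
        if rising_edge(clk_sys) then
            reg_ack <= '0';  -- default: no acknowledge this cycle
            if reset_n = '0' then
                regs <= (others => (others => '0'));
            elsif reg_req = '1' then
                if reg_rw = '0' then
                    regs(to_integer(reg_cmd)) <= reg_data_in;
                else
                    reg_data_out <= regs(to_integer(reg_cmd));
                end if;
                reg_ack <= '1';
            end if;
        end if;
    end process;
end architecture rtl;
```

A module that fronts a BRAM (font, palette, pattern) would delay reg_ack by the BRAM read latency instead of acknowledging in the first cycle.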
Benefits of distributed ownership:
- Eliminates the backward dependency where V1 RegisterControl reached directly into other modules’ BRAMs (TextGen, FontRAM, SpriteAttribRAM, PatternRAM, PaletteRAM)
- Hot per-line registers (tile scroll X/Y, draw mode) live inside the consuming module with no propagation delay to a central file
- Adding a new subsystem requires: (a) implementing its reg_* port, and (b) adding one entry to the RegisterControl group mux. No other module changes.
- VDPController port list shrinks significantly: no tile/sprite/draw config signals need to be routed through RegisterControl
Modules owning registers in V1.5:
| Module | Group(s) | RAM access owned |
|---|---|---|
| RegisterSystem | 0x0 | None (FFs only) |
| RegisterDisplay | 0x1 | None (FFs only) |
| TextGen | 0x2, 0x3 | FontRAM BRAM (write path) |
| PaletteCtrl | 0x4 | PaletteRAM BRAM |
| TileLayer (×4) | 0x5–0x8 | Per-layer scroll table BRAM |
| SpriteEngine | 0x9 | SpriteAttribRAM BRAM |
| HardwareCursor | 0xA | Cursor sprite BRAM |
| DrawEngine | 0xB, 0xC, 0xD | PatternRAM BRAM |
| RegisterControl | (router only) | None |
| BoardID | 0xF | None (constants) |
12.3 Interrupt Sources #
V1.5 defines a structured set of interrupt sources behind the irq_n_o pin, rather than V1's single undifferentiated interrupt:
| Bit | Source | Use case |
|---|---|---|
| 0 | VBlank | Frame sync, animation update |
| 1 | HBlank (configurable line) | Raster effects, scroll update |
| 2 | Drawing FIFO empty | CPU knows drawing is complete |
| 3 | Drawing FIFO has space | CPU can submit next command |
| 4–7 | Reserved | V2: DMA complete, co-processor |
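Each bit is individually maskable. The single irq_n_o pin is active low: it is asserted (driven low) whenever any unmasked source bit is active, i.e. it is the inverted OR of the unmasked active bits.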
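The combination of pending bits, enable bits, and the active-low pin can be sketched as follows. The entity and signal names are hypothetical, and an enable bit of '1' is assumed to mean "interrupt allowed".

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical interrupt combiner for the 8 source bits.
entity irq_combine_sketch is
    port (
        irq_pending : in  std_logic_vector(7 downto 0);  -- one bit per source
        irq_enable  : in  std_logic_vector(7 downto 0);  -- ASSUMPTION: '1' = enabled
        irq_n_o     : out std_logic                      -- active low
    );
end entity irq_combine_sketch;

architecture rtl of irq_combine_sketch is
    signal active : std_logic_vector(7 downto 0);
begin
    -- A source contributes only when it is both pending and enabled.
    active  <= irq_pending and irq_enable;
    -- Drive the pin low when any enabled source is pending.
    irq_n_o <= '0' when active /= X"00" else '1';
end architecture rtl;
```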
13. Refactoring Stages #
The V1 working system is the starting point. Each stage produces a bitfile that can be tested on hardware immediately — the working display output is the regression test. Stages are ordered from lowest to highest risk.
Stage 0 — Display blank bit (quick win, no structural change) #
Files: RegisterControl.vhd, VGA output stage
Changes:
- Add display_blank port output from RegisterControl (bit 15 of CONTROL register)
- In VGA output stage: gate RGB channels to zero when display_blank = '1'; HSYNC/VSYNC continue normally
Test: Set bit, verify screen goes black with monitor remaining in sync. Clear bit, verify display resumes. Entirely self-contained change, zero risk to other functionality.
Stage 1 — Library and language cleanup (no logic change) #
Files: all V1 modules
Changes:
- Replace STD_LOGIC_UNSIGNED with NUMERIC_STD throughout
- Fix powerup process non-standard if condition and rising_edge construct
- Fix DualPortRAM/FontRAM single-process dual-write (causes BRAM inference failure; split into two separate processes using the standard TDP template)
Test: Synthesise each module after change. Bitfile must be functionally identical to V1.
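For reference, the split-process true dual-port pattern mentioned above commonly looks like the sketch below: one clocked process per port, both accessing a shared variable. It uses a VHDL-93 style shared variable, as vendor inference templates typically do; the entity name, generics, and port names are illustrative and not those of DualPortRAM or FontRAM.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Illustrative true dual-port RAM with one process per port.
entity tdp_ram_sketch is
    generic (
        ADDR_W : integer := 12;
        DATA_W : integer := 16
    );
    port (
        clk_a  : in  std_logic;
        we_a   : in  std_logic;
        addr_a : in  unsigned(ADDR_W-1 downto 0);
        din_a  : in  std_logic_vector(DATA_W-1 downto 0);
        dout_a : out std_logic_vector(DATA_W-1 downto 0);
        clk_b  : in  std_logic;
        we_b   : in  std_logic;
        addr_b : in  unsigned(ADDR_W-1 downto 0);
        din_b  : in  std_logic_vector(DATA_W-1 downto 0);
        dout_b : out std_logic_vector(DATA_W-1 downto 0)
    );
end entity tdp_ram_sketch;

architecture rtl of tdp_ram_sketch is
    type ram_t is array (0 to 2**ADDR_W - 1) of std_logic_vector(DATA_W-1 downto 0);
    shared variable ram : ram_t := (others => (others => '0'));
begin
    -- Port A: write, then registered read of the same address.
    port_a : process (clk_a)
    begin
        if rising_edge(clk_a) then
            if we_a = '1' then
                ram(to_integer(addr_a)) := din_a;
            end if;
            dout_a <= ram(to_integer(addr_a));
        end if;
    end process port_a;

    -- Port B: identical structure on its own clock.
    port_b : process (clk_b)
    begin
        if rising_edge(clk_b) then
            if we_b = '1' then
                ram(to_integer(addr_b)) := din_b;
            end if;
            dout_b <= ram(to_integer(addr_b));
        end if;
    end process port_b;
end architecture rtl;
```

Whether a given tool maps this to block RAM should be confirmed in the synthesis report for each module, which is the check the Stage 1 test already calls for.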
Stage 2 — Package migration #
Files: all V1 modules, GfxVGA_pkg.vhd
Changes:
- New packages (gfx_types_pkg, vga_pkg, sram_pkg, draw_pkg, tile_pkg, sprite_pkg, vdp_pkg) are already written and sit alongside GfxVGA_pkg.vhd
- Migrate modules one at a time to import from new packages; GfxVGA_pkg removed last
- Update pixel_t usage to new definition
Test: After each module migration, synthesise and confirm display output unchanged.
Stage 3 — CDC formalisation #
Files: VDPController.vhd, ClockVGAtoSYS.vhd
Changes:
- Audit every signal that crosses a clock domain boundary in VDPController
- Any crossing not already routed through ClockVGAtoSYS is moved there
- Remove any ad-hoc 2-FF chains that live inline in VDPController
Test: Synthesise with timing constraints active. Check ISE timing report for unconstrained cross-domain paths. Display output must be unchanged.
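For orientation, the kind of 2-FF chain this stage collects into ClockVGAtoSYS typically looks like the sketch below. It is only suitable for single-bit, slowly changing level signals; multi-bit values need a handshake or Gray-coded crossing. The entity name is hypothetical, and the ASYNC_REG attribute is shown as an assumption about how the tools are told to keep the two registers together.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical single-bit level synchroniser.
entity sync_2ff_sketch is
    port (
        clk_dst  : in  std_logic;  -- destination clock domain
        async_in : in  std_logic;  -- level signal from the other domain
        sync_out : out std_logic
    );
end entity sync_2ff_sketch;

architecture rtl of sync_2ff_sketch is
    signal meta   : std_logic := '0';  -- first stage, may go metastable
    signal stable : std_logic := '0';  -- second stage, safe to use

    -- ASSUMPTION: vendor attribute marking both stages as synchroniser registers.
    attribute ASYNC_REG : string;
    attribute ASYNC_REG of meta   : signal is "TRUE";
    attribute ASYNC_REG of stable : signal is "TRUE";
begin
    process (clk_dst)
    begin
        if rising_edge(clk_dst) then
            meta   <= async_in;
            stable <= meta;
        end if;
    end process;

    sync_out <= stable;
end architecture rtl;
```

Keeping every instance of this pattern inside one module makes the timing-report check in the test step tractable, since any cross-domain path outside that module is by definition unintended.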
Stage 4 — Register file decomposition #
Files: RegisterControl.vhd, all subsystem modules
Changes:
- Adopt new V1.5 register map (Group 0x0–0xF, see §12.1)
- Implement distributed register ownership: each subsystem module gains its reg_* port
- RegisterControl becomes a thin bus router (group decode + 1-of-16 read mux)
- Modules previously accessing external BRAMs via RegisterControl now own those BRAMs directly (FontRAM → TextGen; PaletteRAM → PaletteCtrl; PatternRAM → DrawEngine; SpriteAttribRAM → SpriteEngine)
- Expected saving: ~300 LUT slices on XC6SLX25 from eliminating deep decode trees
Test: All CPU register reads and writes must produce same behaviour. DTACK timing must be unchanged. Verify resource report shows LUT reduction. Note: firmware must be updated to use V1.5 register addresses.
Stage 5 — Text SRAM migration #
Files: TextGen.vhd, text engine register interface
Changes:
- Replace BRAM character map with SRAM pointer (TEXT_BASE register)
- Add text prefetch controller (reads character map from SRAM each HBlank)
- TextGen module owns font BRAM directly (was accessed via RegisterControl)
- Hardware cursor moved from external hardware cursor sprite registers into TextGen
- Character attribute word updated to V1.5 format (blink, reverse video, attribute bits)
Test: Text display must be correct. Scroll by incrementing TEXT_BASE pointer and verify correct screen content. Cursor must blink at configured rate. Custom font upload via FONT_ADDR/FONT_DATA must update display.
Stage 6 — SRAM arbiter interface #
Files: SRAMControl.vhd, VDPController.vhd and all SRAM clients
Changes:
- Adopt sram_req_t / sram_ack_t / sram_wdata_t port types throughout
- Replace grown-in-place arbiter with clean N_CLIENTS generic design (see §11)
- Existing SRAM timing and burst protocol unchanged; only interface type changes
- Add text prefetch as client index 3
Test: Bitmap display and drawing engine must produce identical output. SRAM timing must remain within spec.
Stage 7 — TileLayer consolidation #
Files: TileGenerator.vhd (×4), new TileLayer.vhd
Changes:
- Write generic TileLayer entity using tile_layer_cfg_t port (see §6.1)
- Replace TileGenerator instances one at a time in VDPController
- Per-layer priority register introduced via tile_layer_cfg_t.priority
- Carry forward per-row X scroll table unchanged
Test: Replace one layer at a time; confirm tile display correct after each substitution.
Scroll behaviour must match V1. Once all four replaced, delete TileGenerator.vhd.
Stage 8 — Sprite engine cleanup #
Files: SpriteAttribRAM.vhd, SpriteLineSelector.vhd, SpriteLineRenderer.vhd
Changes:
- Adopt sprite_attr_t and sprite_line_entry_t from sprite_pkg; replace raw bit-slice access with to_/from_ conversion functions
- Fix stale comment in SpriteAttribRAM (bits [59:58] documented as “layer priority” but decoded as unused by selector; clarify as reserved)
- No change to the 64-bit attribute format or sprite rendering behaviour
Test: Sprites must appear in identical positions with identical graphics to V1.
Stage 9 — Drawing engine command FIFO #
Files: Drawing.vhd and related
Changes:
- Decouple CPU from drawing state machine via BRAM command FIFO (see §10.1)
- Add DRAW_STRIDE register and canvas-space addressing (see §10.3)
- Centralise draw mode handling (XOR/OR/AND/transparent) in a single SRAM writer
- Add burst writes for SOLID mode hline spans (shallow write FIFO, BRAM-based)
- Add DRAW_EXPAND command (16-bit bitmap expand to pixels via colour0/colour1)
- Add clip window registers and clipping in DrawDispatch
- Fix known V1 bugs: triangle fill using stale line_pattern; redundant DONE state
Test: All drawing operations (rect, line, fill, blit, expand) must produce correct output. CPU must not stall while FIFO has space. FIFO-full behaviour must be verified. Verify OR and AND modes produce correct results. Verify burst writes are faster than V1 single-word path.
Stage 10 — Compositor unification (optional, or defer to V2) #
Files: PixelCompositor.vhd
Changes:
- Unify all display sources to produce pixel_t with priority and transparent fields
- Replace the V1 fixed-priority compositor with the compositor_select function from vdp_pkg (pure priority comparator, no special-case wiring)
Test: Visual output must match V1 at default priority settings. This is the highest-risk stage and may be cleanest to defer to V2.
14. V2 Forward Compatibility #
The following V1.5 decisions are made specifically to make V2 changes localised:
| V2 change | V1.5 preparation |
|---|---|
| 32-bit SRAM bus | sram_pkg.vhd uses record — widen wdata/rdata in one place; burst_len already 3-bit so no interface change needed |
| 8-8-8 RGB colour | colour_t in gfx_types_pkg.vhd — widen fields in one place |
| 25.175 MHz pixel clock | PIPELINE_DEPTH constant drives all timing — no structural change |
| HDMI output | PIX domain produces colour_t and sync; HDMI encoder is a new consumer |
| 200 MHz sprite/draw domain | Span FIFO and line buffer write ports designed for cross-domain; add clock port and CDC |
| More tile layers | N_CLIENTS in arbiter, add TileLayer instances and compositor array entry |
| Per-tile priority | Switch tile map to 32-bit entry; second word holds priority override + extended tile index |
| Text as TileLayer mode | TILE_H=16 generic, 1bpp colour mode, cursor in tile_layer_cfg_t — see §9.6; RAMB36 required for 640-pixel line buffer |
| 640-pixel tile layers (hires mode) | Line buffer depth is TILE_W * TILEMAP_W * 2; synthesis picks RAMB18 or RAMB36 automatically from the derived constant |
| Sub-16bpp drawing | Drawing engine SRAM writer extended to handle sub-word pixel packing; canvas space addressing unchanged |
| Distributed register ownership | V1.5 already distributes ownership; V2 adds modules by extending the group mux only |
14.1 Amiga-style Co-processor (V2, Register Space Reserved) #
Registers Group 0xE (0x70–0x77) are reserved for a display co-processor. The concept:
- A BRAM holds a list of instructions: (WAIT_H, WAIT_V, REG_ADDR, REG_DATA)
- The co-processor executes in the PIX domain, watching H/V counters
- When (H_cnt, V_cnt) matches a WAIT condition, it writes REG_DATA to REG_ADDR
- This allows palette changes, scroll changes, layer enable/disable mid-frame
- Requires a secondary write path into the register file from the PIX domain (CDC: PIX→SYS for register writes, with the co-processor having highest register priority)
Plan the register write path in V1.5 RegisterControl to accept writes from two sources
(CPU and co-processor) with the co-processor winning ties.
15. V1 Issues Addressed in V1.5 #
| V1 issue | V1.5 resolution |
|---|---|
| DualPortRAM/FontRAM BRAM inference failures (both writes in one process) | All RAMs use split-process TDP template; resource usage drop expected |
| CDC boundaries informal / undocumented | All crossings explicit in §2; CDC module is sole crossing point |
| SRAM arbiter grew in-place with features | Clean N_CLIENTS generic arbiter with burst interface; adding client = adding one slot |
| SRAM interface word-oriented; burst structure implicit in clients | Explicit burst_len (3-bit, 1..8 words) in sram_req_t; burst data flow documented (data_valid/rdata per word, done on last); client accumulator pattern defined |
| Bitmap prefetch uses full-line single transaction (~2,560 cycles), blocking all SRAM clients | Split into 8-word bursts interleaved with other clients; same bandwidth, 84% utilisation including all display layers |
| 4 separate TileGenerator instances, fixed compositor priority | Single generic TileLayer entity; priority via per-layer register |
| Compositor priority fixed in hardware wiring | All sources produce pixel_t with priority field; compositor is a pure priority comparator |
| Sprite priority fixed above all tile layers | Per-sprite priority in attribute table (4 bits); default=5 reproduces V1 behaviour |
| Drawing engine directly driven by CPU register state | Command FIFO decouples CPU from drawing state machines; CPU fills queue and continues |
| Drawing modes (XOR etc.) scattered across drawing modules | Centralised in SRAM writer, applied to both span and pixel paths |
| Only XOR supported as a logical drawing operation | OR and AND added to draw_mode_t; centralised SRAM writer handles all three |
| Line drawing incorrectly mapped to span path | Separate pixel-write path for Bresenham lines; span path for fills only |
| Drawing coordinates in screen space only | Canvas space with DRAW_BASE + DRAW_STRIDE; screen-space drawing = set base to display base |
| No hardware clipping | Clip window registers (canvas space); clipping in DrawDispatch before SRAM access |
| No burst writes for hline spans | Shallow write FIFO enables burst assembly for consecutive addresses; large speedup for fills |
| No 1bpp bitmap expand command | DRAW_EXPAND added; enables hardware text/icon rendering via FIFO with no CPU stall |
| Triangle fill uses stale line_pattern (bug) | Fixed: line_pattern forced to X"0000" in FILLTRI command setup |
| Text character map in dedicated BRAM | Moved to SRAM with pointer register; scroll = pointer increment; direct CPU write access |
| Text BRAM accessed via RegisterControl (backward dependency) | TextGen owns its font BRAM and register interface directly |
| RegisterControl monolithic — all register state in one 970-line module | Distributed ownership: each subsystem module owns its register decode and config FFs |
| RegisterControl reaches into other modules’ BRAMs | Eliminated — each BRAM is owned and accessed only by its module |
| Register map ad-hoc, groups mixing unrelated concerns | New blocked map (Groups 0x0–0xF); each group maps to one subsystem module |
| No framebuffer blank during loading | CONTROL[15] = display_blank blacks RGB output while maintaining sync |
| STD_LOGIC_UNSIGNED (non-standard library) throughout | Replace with NUMERIC_STD throughout |
| powerup process uses non-standard if condition and rising_edge construct | Replace with standard synchronous reset pattern |
| Interrupt is a single wire | Structured interrupt register with individual mask bits |
| Transparent colour hardcoded to palette index 0 for all tile layers and sprites | Per-layer transparent_idx in tile_layer_cfg_t (4-bit, default 0, backward compatible); palette index 0 is no longer reserved once a non-zero transparent index is configured. Global SPRITE_TRANS_IDX register covers sprites. |
| Register file as flat FF array (~350-450 slices on S6, ~9-12% of device) | Distributed ownership moves FFs into consuming modules; eliminates centralised decode tree |
Document version 1.3 — June 2026
Target: GfxVGA V1.5, XC6SLX25, ISE 14.x
Forward reference: GfxVGA V2, XC7S50, Vivado 2025.x