
Rebase/cache refactoring onto main #20

Open
Aquaticfuller wants to merge 23 commits into main from rebase/cache-refactoring-onto-main

Conversation

@Aquaticfuller
Member

No description provided.

Comment thread hardware/src/tcdm_cache_interco.sv Outdated
`ifndef TARGET_SYNTHESIS
// Debug scoreboard: track outstanding requests per (output-bank, input-core)
// and validate that each response targets a core with outstanding traffic.
int unsigned outstanding_q [NumCache+NumRemotePort-1:0][NumCores+NumRemotePort-1:0];
Collaborator

Do not use unpacked format...

Comment thread hardware/src/tcdm_cache_interco.sv Outdated
// Start from previous occupancy.
for (int o = 0; o < NumCache + NumRemotePort; o++) begin
for (int c = 0; c < NumCores + NumRemotePort; c++) begin
delta_d[o][c] = 0;
Collaborator

Remove the logic and loops from inside always_ff, and use non-blocking assignments (<=).

spatz_mem_rsp_pop[p] = 1'b0;
spatz_mem_rsp_valid[p] = 1'b1;
spatz_mem_rsp[p] = tcdm_rsp_i[p].p;
`ifndef TARGET_SYNTHESIS
Collaborator

No logic should be added in always_ff. Please use assert property instead.

parameter int unsigned AxiUserWidth = SpatzAxiUserWidth,
parameter int unsigned AxiInIdWidth = SpatzAxiIdInWidth,
parameter int unsigned AxiOutIdWidth = SpatzAxiIdOutWidth,
parameter bit UseFoldedDataBanks = 1'b1,
Collaborator

Should we pass these parameters from the pkg instead of propagating them level by level?

- wire DataPartSplit/folded params through cluster/group/tile
- implement skewed folded data SRAM mapping in cachepool_tile
- adjust cluster wrapper tb for the new configuration
- add scalar cache tests that run basic and stress patterns without crossing 128-bit parts
- add vector cache tests that use RVV loads/stores on 128-bit chunks and verify data integrity
- integrate both tests into the test CMake and keep patterns aligned to folded-cache part size
Select a single part per column/bank per cycle and prioritize write parts over read parts to avoid clobbering bank signals.
  - cachepool_tile: use EffectiveCoalFactor=1 in folded mode; pass to cache ctrl.
  - cachepool_cc: size Spatz response FIFO with NumSpatzOutstandingLoads; add overflow assert.
  - tcdm_cache_interco: add non-synthesis outstanding scoreboard/asserts for req/rsp matching.
Bump insitu-cache to the folded/hash-way revision, thread
UseHashWaySelect through cluster/tile, and queue Spatz memory
responses through the local response FIFO instead of bypassing
write acks.
  cachepool_cc: per-port sb_q[user.req_id] slot table for out-of-order
  rsp matching; watchdog dumps stuck ids. Gated by parameter
  (default off, +define+ENABLE_SPATZ_REQ_SCOREBOARD to enable).
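The slot-table idea behind the Spatz request scoreboard can be sketched as a small behavioural model in C. This is only an illustration under stated assumptions, not the RTL in cachepool_cc: `SB_SLOTS`, `sb_t`, and the function names are hypothetical, and the real table is indexed by `user.req_id` per port.

```c
#include <stdbool.h>
#include <stdint.h>

#define SB_SLOTS 16u /* hypothetical: number of distinct req_id values */

typedef struct {
    bool     busy[SB_SLOTS]; /* slot holds an outstanding request */
    uint32_t addr[SB_SLOTS]; /* address recorded when the request was issued */
} sb_t;

typedef enum { SB_OK, SB_ID_REUSED, SB_ORPHAN_RSP } sb_status_t;

/* Record a request; report an error if its id is already in flight. */
sb_status_t sb_on_req(sb_t *sb, uint32_t req_id, uint32_t addr) {
    uint32_t slot = req_id % SB_SLOTS;
    if (sb->busy[slot]) {
        return SB_ID_REUSED;
    }
    sb->busy[slot] = true;
    sb->addr[slot] = addr;
    return SB_OK;
}

/* Retire a response. Responses may arrive in any order, so the lookup
 * is by id rather than by issue order. */
sb_status_t sb_on_rsp(sb_t *sb, uint32_t req_id) {
    uint32_t slot = req_id % SB_SLOTS;
    if (!sb->busy[slot]) {
        return SB_ORPHAN_RSP; /* response without a matching request */
    }
    sb->busy[slot] = false;
    return SB_OK;
}

/* Count ids still outstanding, i.e. what a watchdog dump would list. */
unsigned sb_outstanding(const sb_t *sb) {
    unsigned n = 0;
    for (uint32_t i = 0; i < SB_SLOTS; i++) {
        n += sb->busy[i] ? 1u : 0u;
    }
    return n;
}
```

Indexing by id keeps the check order-independent: a response is legal exactly when its slot is busy, whatever order the requests were issued in.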
The skew-bank arbiter at (col, bank_sel) picks writes over reads
without exposing the loser; a hardwired l1_data_bank_gnt=1 caused the
upstream to consume stale rdata when another way wrote the same
column.  Compute any_other_write_in_col (loop-free, depends only on
part_we) and gate gnt by it: writes always granted, reads granted iff
no OTHER way writes the same (col, bank_sel).  Excludes own way's
writes so own idle words aren't spuriously stalled.  Fixes multi-core
coherence in rlc-mimic and unlocks AllowReadDuringWrite=1 on data
banks.
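The grant rule described above (writes always granted, reads granted only if no other way writes the same (col, bank_sel)) can be modelled in a few lines of C. This is a hedged behavioural sketch: the `way_req_t` struct and `data_bank_gnt` are hypothetical names, one request per way is assumed, and the loop here stands in for what the commit describes as loop-free RTL derived from part_we.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-way request for one cycle. */
typedef struct {
    bool     we;       /* this way issues a write */
    unsigned col;      /* target column */
    unsigned bank_sel; /* target bank within the column */
} way_req_t;

/* True if a way other than `self` writes the same (col, bank_sel).
 * The way's own writes are excluded on purpose. */
static bool any_other_write_in_col(const way_req_t *req, size_t num_ways,
                                   size_t self) {
    for (size_t w = 0; w < num_ways; w++) {
        if (w != self && req[w].we &&
            req[w].col == req[self].col &&
            req[w].bank_sel == req[self].bank_sel) {
            return true;
        }
    }
    return false;
}

/* Writes are always granted; reads are granted only when no other way
 * writes the same (col, bank_sel) in this cycle. */
bool data_bank_gnt(const way_req_t *req, size_t num_ways, size_t self) {
    if (req[self].we) {
        return true;
    }
    return !any_other_write_in_col(req, num_ways, self);
}
```

With way 0 writing (col 1, bank 0), a read from way 1 to the same location is stalled while a read from way 2 to column 2 still gets its grant.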
- l1cache: flush+wait before xbar commit so the reconfig doesn't leave
  dirty lines bound to the old hash layout.
- mcs-lock: move cluster barrier before the non-zero-core spin loop
  (otherwise cores 1+ never barrier and core 0 deadlocks).
- load-store: print the correct buffer name (B/C, not A) in the B/C
  error messages; add c_ptr to the pointer dump.
- idotp-32b: include got/expected in Check Failed! print.
…rw} tests

Register five new cache-focused tests in CMakeLists.txt:
- cache-line-rw-smoke   single-core line-granular RW smoke
- cache-rlc-mimic        RLC traffic mimic (vector load/store)
- cache-vector-rw        multi-iteration vector load-store kernel
- cache-coverage         12-phase multi-core cache stress / coverage
- cache-coverage-min     minimal phase-06 writeback-loss repro
- Bender.lock: bump insitu-cache to the rev with the wrapper/coalescer
  SBs and the SYNC_CTRL_CHECK_PEND fix.
- Makefile: define ENABLE_SPATZ_REQ_SCOREBOARD so the in-RTL Spatz
  req/rsp watchdog is on by default.
- cachepool_tile.sv: per-port pre-strip TCDM req tracer
  (+sb_pretrace_addr_lo/hi) and byte-granular shadow-memory model
  (+mm_enable) that $errors on DATA / TYPE / ORPHAN_RSP mismatches.
  Both passive, off by default, sim-only.
- config.mk: derive axi_user_width as base + 2*(idx_width(num_tiles)-1).
  Previous widths truncated bank_id MSB on the AXI loopback, routing
  cache_ctrl refill responses to the icache bypass slot.

- cachepool_group.sv: use the source tile id `t` (not target_tile) for
  the request destination slot, so the response (routed by user.tile_id
  mod NumRemotePortCore) lands on the same xbar mst port as the request.
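The `axi_user_width` derivation from the config.mk item above can be checked with a short C sketch. It assumes `idx_width(n)` means ceil(log2(n)) with a minimum of 1, as in the PULP common_cells helper of the same name; the definition actually used in config.mk may differ, and `base` is a placeholder for the single-tile width.

```c
/* Assumed semantics: ceil(log2(n)), but never less than 1. */
unsigned idx_width(unsigned n) {
    unsigned w = 0;
    while ((1u << w) < n) {
        w++;
    }
    return w > 0 ? w : 1u;
}

/* base + 2 * (idx_width(num_tiles) - 1), as stated in the commit message. */
unsigned axi_user_width(unsigned base, unsigned num_tiles) {
    return base + 2u * (idx_width(num_tiles) - 1u);
}
```

Under that assumption a 4-tile build adds 2 bits to the base width and a 16-tile build adds 6, while 1-tile and 2-tile builds keep the base width.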
The `win` offset combined `it * 64u + cid * 7u`.  The `cid * 7u` term
is not a multiple of 4 for cid = 1, 2 or 3 (7, 14, 21), so
`wp = (base + win + j * 4U)` ended up unaligned for those cores.
load/store exception for unaligned uint32_t accesses, and this
runtime has no exception handler installed, so cores 1+ entered a
trap loop at PC 0x800005fc while cores 0/2/3 stalled at the next
sync_all.  Result: the test always timed out without printing UART.

Change the per-core stride to `cid * 28u` (= 7 * 4) so the offset
stays varied per core but is always 4-aligned, restoring the
original "varied window" intent.  Test now passes with retval=0.
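The alignment argument can be verified with a small host-side C sketch. It assumes `win` is a byte offset added to a 4-aligned `base`, which is what the commit message implies; the helper names are hypothetical and not taken from the test source.

```c
#include <stdbool.h>
#include <stdint.h>

/* Per-core window offset before the fix (byte offset assumed). */
uint32_t win_old(uint32_t it, uint32_t cid) {
    return it * 64u + cid * 7u;
}

/* Per-core window offset after the fix: 28 = 7 * 4 keeps 4-alignment. */
uint32_t win_new(uint32_t it, uint32_t cid) {
    return it * 64u + cid * 28u;
}

/* True if a byte address is suitable for a uint32_t access. */
bool is_word_aligned(uint32_t byte_addr) {
    return (byte_addr & 3u) == 0u;
}
```

For cid = 1, 2, 3 the old offsets are 7, 14 and 21 bytes, none of which is a multiple of 4, whereas the new offsets 28, 56 and 84 all are, and each core still gets a distinct window.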
After the per-core stats printf, non-zero cores entered `while(1){}`
and were never able to reach the second `snrt_cluster_hw_barrier()`
below.  Core 0 then waited forever at that barrier for cores 1+.
Result: the kernel never reached `return 0`, _snrt_exit was never
called, and EOC was never asserted -- the sim always timed out.

Removing the if/while-loop (and the now-pointless second barrier)
lets every core return cleanly; _snrt_exit only fires set_eoc on
core 0 anyway, and the other cores halt naturally.

mcs-lock now reaches EOC retval=0 cleanly.
The existing fft-32b_M1024_N16 test is parameterized for 16 cores --
data_1024_16.h has active_cores=16 baked in and the kernel slices
the work by active_cores.  On a 4-core config only 4/16 of the FFT
actually executes, so the output is uniformly wrong (r:1024,i:1024)
and the test self-fails with retval=1.

Add a 1024_4 variant alongside, generated via gen_data.py from a
new fft_1024_4.json config.  Both variants now coexist; the N16
variant is appropriate for 4t/16c and the N4 variant for 1t/4c.

The new 1024_4 test passes cleanly (r:0, i:0, retval=0).
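The "only 4/16 of the FFT executes" observation follows from simple slicing arithmetic, sketched below in C. The function is hypothetical and assumes the kernel splits the work into `active_cores` equal slices, one per core, which is how the commit message describes it.

```c
/* Elements processed when the work is split into `active_cores` equal
 * slices but only `present_cores` cores actually run. */
unsigned elements_covered(unsigned n, unsigned active_cores,
                          unsigned present_cores) {
    unsigned slice = n / active_cores;
    unsigned workers =
        present_cores < active_cores ? present_cores : active_cores;
    return slice * workers;
}
```

With n = 1024 and active_cores = 16, a 4-core configuration covers only 256 elements; the 1024_4 data set, with active_cores = 4, covers all 1024.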
Midpoint between cachepool_fpu_512 (1t/4c — passes) and
cachepool_4t_fpu_512 (4t/16c — broken).  Used to isolate whether the
multi-tile cache failures are specific to 4 tiles or to any
configuration with NumTiles > 1.  cache-line-rw-smoke fails at 2t/8c
with the same DATA-MISMATCH signature seen at 4t/16c, confirming
the bug is in the inter-tile / group-xbar path itself, not a
4-tile-only artefact.
Reduces cache-line-rw-smoke to the smallest pattern that still
triggers the multi-tile cache bug:

  * only core 0 does work (1 store + 1 load to one cache line)
  * all other cores immediately return 0
  * no printf, no library calls
  * 16 words written + read

On cachepool_fpu_512 (1 tile) this passes cleanly.
On cachepool_2t_fpu_512 and cachepool_4t_fpu_512 the SB still flags
RESP DATA MISMATCH on cache lines touched by the startup/exit
runtime path (not by the test data).  The test's own data check
PASSES because the corrupted line is not the test's buf line, but
the underlying cache-state bug is reproduced.

Conclusion from this repro: the bug fires the moment NumTiles > 1
even on purely single-core local activity -- it is NOT a coherence
problem (no cross-tile sharing happens here) and NOT a remote-port
routing problem (no real remote traffic from cores 1+).  The
suspect surface narrows to the multi-tile-conditional rotation
math in tcdm_cache_interco/cachepool_tile (bits_to_rotate widens
from CacheBankBits to CacheBankBits+TileBits at NumTiles>1) or the
remote-port muxing inside the local cache_ctrl when those ports
are wired in even though they carry no traffic.

Use for future waveform-level debug:
  make vsim config=cachepool_2t_fpu_512 -B
  ./sim/bin/cachepool_cluster.vsim \
      software/build/CachePoolTests/test-cachepool-minimal-tile0-repro
The multi-tile (NumTiles>1) cluster instantiates cachepool_group, which then
instantiates cachepool_tile -- but cachepool_group did not forward
UseHashWaySelect, so the tile fell back to its own 1'b0 default. This silently
disabled hash-way select on every multi-tile build and triggered the
forwarding-buffer / skewed-fold data-corruption path. Add the missing
parameter wiring; default to 1'b1 to match cachepool_cluster.sv.
Aquaticfuller force-pushed the rebase/cache-refactoring-onto-main branch from 37bee71 to 8dc21a5 on May 27, 2026, 21:42
2 participants