A fully private, local Graph-RAG (Retrieval-Augmented Generation) system designed to map, understand, and query large codebases without sending your proprietary code to the cloud.
- 100% Local Inference: Powered by `llama.cpp` and Semantic Kernel, routing strictly to your local GPU (optimized for models like Gemma 4).
- Codebase Semantic Mapping: Uses `networkx` to build a structural and semantic GraphML representation of your repository, understanding how files and classes interact.
- Dynamic Sub-Graph Retrieval: Prevents memory bandwidth bottlenecks (KV Cache overflow) by extracting only the Top-K relevant nodes and their 1-hop neighbors before injecting them into the LLM context.
- Asynchronous & Highly Optimized: Heavy AST parsing, Leiden community clustering, and graph querying are offloaded to background threads. The FastAPI event loop remains unblocked, providing extremely low-latency routing and ingestion orchestration.
- Parallel LLM Analysis: During Phase 2 knowledge graph enrichment, multiple LLM requests are batched and executed concurrently (via semaphores and `asyncio.gather`), drastically reducing overall indexing time.
- Graph Caching Mechanism: Prevents redundant XML parsing across queries by caching the parsed `networkx` graph in memory, reducing system overhead on subsequent questions.
- Streaming Streamlit UI: A responsive chat interface that natively parses and renders fragmented JSON-Lines streams.
- Reasoning Tag Parser: Includes a custom state machine to intercept and render `<think>` tags from reasoning models into clean Markdown blockquotes.
- Real-Time Telemetry: Tracks and displays generation stats (Tokens Per Second, Time Taken, Prompt vs. Generated Tokens) instantly in the UI.
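The Dynamic Sub-Graph Retrieval idea can be sketched in a few lines. This is a minimal illustration, not the project's actual retriever: it uses a plain dict of sets in place of the NetworkX graph so it runs with the standard library only, and the function names, file names, and relevance scores are all made up for the example.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map (stand-in for the NetworkX graph)."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def top_k_with_neighbors(adjacency, scores, k):
    """Return the k best-scoring nodes plus their 1-hop neighbors."""
    # Highest score first; ties are broken by node name so results are stable.
    ranked = sorted(scores, key=lambda node: (-scores[node], node))
    seeds = ranked[:k]
    selected = set(seeds)
    for node in seeds:
        # Pull in direct neighbors only, which keeps the prompt small.
        selected.update(adjacency.get(node, ()))
    return selected


# Hypothetical repository graph and relevance scores for one query.
edges = [
    ("api.py", "auth.py"),
    ("api.py", "db.py"),
    ("db.py", "models.py"),
    ("cli.py", "utils.py"),
]
scores = {
    "api.py": 0.9, "auth.py": 0.2, "db.py": 0.4,
    "models.py": 0.1, "cli.py": 0.05, "utils.py": 0.3,
}

subgraph_nodes = top_k_with_neighbors(build_adjacency(edges), scores, k=1)
print(sorted(subgraph_nodes))
```

With a real NetworkX graph, `G.neighbors(node)` would play the role of the adjacency lookup, and `G.subgraph(selected)` would give a view of the selected nodes to serialize into the prompt.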
The system is decoupled into two primary services:
- The Backend Engine (FastAPI):
  - Handles asynchronous codebase ingestion and sequential AST parsing via an optimized background worker (`asyncio.to_thread`).
  - Executes parallel Phase 2 community clustering and LLM knowledge enrichment using async Semaphores.
  - Manages the NetworkX graph database with centralized caching to prevent redundant I/O.
  - Executes dynamic Graph Retrieval and streams JSONL responses via Semantic Kernel.
- The Command Center (Streamlit):
  - Provides a unified dashboard for triggering background ingestion jobs with real-time progress polling.
  - Manages the streaming chat interface, telemetry UI, and token limit controls.
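The backend's concurrency pattern (blocking work pushed to a thread, LLM calls bounded by a semaphore and collected with `asyncio.gather`) might look roughly like the sketch below. The function names, the fake LLM call, and the concurrency limit are assumptions made for illustration; the project's real worker code is not shown in this README.

```python
import asyncio


def parse_sources(paths):
    """Blocking stand-in for AST parsing; runs in a worker thread."""
    return [path.upper() for path in paths]


async def fake_llm_summary(item):
    """Placeholder for an LLM request (hypothetical, no network involved)."""
    await asyncio.sleep(0.01)
    return f"summary of {item}"


async def enrich_all(items, analyze, max_concurrent=4):
    """Run analyze() over all items with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(item):
        async with semaphore:
            return await analyze(item)

    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(bounded(item) for item in items))


async def main():
    # Keep the event loop free while the blocking parse runs in a thread.
    parsed = await asyncio.to_thread(parse_sources, ["a.py", "b.py"])
    return await enrich_all(parsed, fake_llm_summary, max_concurrent=2)


results = asyncio.run(main())
print(results)
```

The semaphore caps how many requests hit the local inference server at once, which matters when a single GPU is serving every call.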
- Python 3.12+
- uv package manager
- `llama.cpp` (`llama-server`) installed and accessible in your PATH.
To simplify launching the private Graph-RAG ecosystem, a single unified startup script is provided. This script boots the LLM engine, FastAPI API, and Streamlit client in parallel and manages graceful shutdowns:
# Make the script executable (if needed)
chmod +x start.sh
# Start all services
./start.sh
The script performs the following operations:
- Local Inference Server: Boots `llama-server` on port `8080`, loading your local GGUF weights (optimized for Gemma 4) with GPU offloading (`-ngl 999`) and a high context limit (`131072`).
- FastAPI Backend: Boots the backend engine on http://localhost:8000 after allowing the LLM a 30-second window to load weights into VRAM.
- Streamlit UI Command Center: Launches the frontend on http://localhost:8501.
Note: Pressing Ctrl+C triggers a shell trap that sends SIGINT to all spawned services, terminating background tasks cleanly and preventing zombie processes.
If you prefer starting services individually, run them in separate terminal windows:
- Boot the LLM Engine:
cd ~/llama.cpp/build
./bin/llama-server -m ~/llmhost/model/gemma-4-E4B-it-Q4_K_M.gguf -ngl 999 -c 131072 -fa on -ctk q4_0 -ctv q4_0 --host 0.0.0.0 --port 8080 --jinja --pooling rank
- Start the Backend API:
uv run main.py
- Launch the Streamlit UI:
uv run streamlit run app.py
During integration, the following core architecture and performance challenges were addressed to ensure stable local operation:
- Challenge: By default, Streamlit script execution is synchronous and re-runs from top to bottom on any UI interaction (e.g., resizing, clicking toggles). If a stream from the backend is running during a rerun, the HTTP stream gets severed, aborting the generation midway and starting over.
- Optimization: Decoupled UI interactions from network fetching by offloading the stream consumption to a background Python `threading.Thread`. The background runner consumes the backend's JSON-Lines stream and writes incoming tokens/telemetry directly to a thread-safe list in `st.session_state`. The UI reads from this buffer asynchronously, keeping the network pipeline alive across reruns.
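A stripped-down version of that background consumer is sketched below. It is an illustration under assumptions rather than the app's code: the generator fakes the backend stream, the record keys (`token`, `tokens_per_second`) are invented, and a plain list stands in for the buffer that the real UI would keep in `st.session_state`.

```python
import json
import threading


def consume_stream(lines, buffer, done):
    """Parse JSON-Lines records from `lines` and append them to `buffer`."""
    try:
        for raw in lines:
            raw = raw.strip()
            if not raw:
                continue  # skip keep-alive blank lines
            buffer.append(json.loads(raw))
    finally:
        done.set()  # lets the UI know the stream has ended


def fake_backend_stream():
    """Hypothetical stand-in for the backend's streamed HTTP response."""
    yield '{"token": "Hello"}\n'
    yield "\n"
    yield '{"token": " world"}\n'
    yield '{"tokens_per_second": 42.0}\n'


buffer = []                # in the app this would live in st.session_state
done = threading.Event()
worker = threading.Thread(
    target=consume_stream,
    args=(fake_backend_stream(), buffer, done),
    daemon=True,
)
worker.start()
worker.join(timeout=5)     # a rerunning UI would poll the buffer instead

text = "".join(record.get("token", "") for record in buffer)
print(text)
```

Because the thread owns the iterator, a script rerun only re-reads `buffer`; it never touches the connection. In CPython, `list.append` is atomic, so a single-writer, single-reader list is a reasonable buffer here; a `queue.Queue` would be the more explicit choice.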
- Challenge: Streamlit `st.tabs` unmounts inactive tab DOMs on switch. To show the streaming LLM chat response, a `500ms` auto-refresh (`st_autorefresh`) is required. However, if the user switches to view the `vis.js` interactive canvas, the autorefresh triggers a constant reload of the map iframe, locking the UI thread and freezing web rendering.
- Optimization: Replaced client-side tabs with a top-level `st.radio` control acting as a view-state selector. The `st_autorefresh` is programmatically wrapped to trigger only when the user is actively viewing the `💬 AI Assistant` tab and a stream is in progress. Switching to the `🕸️ Interactive Architecture Map` suspends refreshes, allowing smooth canvas rendering and preservation of user zoom/pan states.
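The refresh gate boils down to a single condition. The sketch below isolates it as a plain function so it can run without Streamlit; the function name and the `streaming` session key in the comments are assumptions, not names taken from the app.

```python
CHAT_VIEW = "💬 AI Assistant"
MAP_VIEW = "🕸️ Interactive Architecture Map"


def should_autorefresh(active_view: str, stream_in_progress: bool) -> bool:
    """Refresh only while the chat view is visible and a stream is running."""
    return active_view == CHAT_VIEW and stream_in_progress


# Inside the Streamlit script the gate might be wired up like this
# (the "streaming" key is hypothetical):
#
#   view = st.radio("View", [CHAT_VIEW, MAP_VIEW], horizontal=True)
#   if should_autorefresh(view, st.session_state.get("streaming", False)):
#       st_autorefresh(interval=500, key="chat_refresh")

print(should_autorefresh(CHAT_VIEW, True))   # True
print(should_autorefresh(MAP_VIEW, True))    # False
```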
- Challenge: Rendering streaming text inside dynamic placeholders like `st.empty()` results in repeated DOM node recreation. This leads to heavy layout flickering, visual shifts, and scrollbar bouncing as the container heights collapse.
- Optimization: Eliminated empty placeholder wrappers for standard content generation. Streamlit's React implementation natively diffs the updated markdown string in-place. Completed thought cycles are enclosed inside structured complete statuses (`st.status`), ensuring stable container sizing.
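Separating completed thought cycles from the answer requires a parser that tolerates `<think>` tags split across stream chunks. The state machine below is one possible shape for it, written for illustration; the class and method names are invented, and the project's actual parser may differ.

```python
OPEN_TAG, CLOSE_TAG = "<think>", "</think>"


class ThinkTagParser:
    """Two-state parser that splits a token stream into thought and answer."""

    def __init__(self):
        self.in_think = False
        self.pending = ""   # tail that may hold the start of a tag
        self.thought = []
        self.answer = []

    def _emit(self, text):
        if text:
            (self.thought if self.in_think else self.answer).append(text)

    @staticmethod
    def _partial_suffix(text, tag):
        """Length of the longest suffix of text that is a proper prefix of tag."""
        for size in range(min(len(tag) - 1, len(text)), 0, -1):
            if text.endswith(tag[:size]):
                return size
        return 0

    def feed(self, chunk):
        self.pending += chunk
        while self.pending:
            tag = CLOSE_TAG if self.in_think else OPEN_TAG
            index = self.pending.find(tag)
            if index != -1:
                # Full tag found: emit what precedes it and flip state.
                self._emit(self.pending[:index])
                self.pending = self.pending[index + len(tag):]
                self.in_think = not self.in_think
                continue
            # No full tag: hold back anything that could be a split tag.
            keep = self._partial_suffix(self.pending, tag)
            cut = len(self.pending) - keep
            self._emit(self.pending[:cut])
            self.pending = self.pending[cut:]
            break

    def flush(self):
        """Call at end of stream to release any held-back text."""
        self._emit(self.pending)
        self.pending = ""


def to_markdown(parser):
    """Render thoughts as a Markdown blockquote followed by the answer."""
    thought = "".join(parser.thought).strip()
    answer = "".join(parser.answer).strip()
    quoted = "\n".join(f"> {line}" for line in thought.splitlines())
    return f"{quoted}\n\n{answer}" if quoted else answer


parser = ThinkTagParser()
for chunk in ["<thi", "nk>plan the", " answer</th", "ink>Use ", "NetworkX."]:
    parser.feed(chunk)
parser.flush()
print(to_markdown(parser))
```

Holding back only the suffix that could still become a tag keeps latency low: ordinary text is emitted as soon as it arrives.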
- Challenge: Critical architecture clues reside in metadata files like `pyproject.toml` and `requirements.txt`, which are traditionally skipped by standard AST parsers.
- Optimization: Extended AST analysis boundaries by adding `.toml` and `.txt` extensions to the ingestion parser's supported extensions. This enables parsing configuration files to extract third-party library dependencies and map infrastructure linkages.
- Frontend: Streamlit, Requests, vis.js (via iframe)
- Backend: FastAPI, Uvicorn, Semantic Kernel, NetworkX, Leiden Community Clustering
- Local Inference: llama.cpp, Gemma 4 (Reasoning Model)
More features coming soon.