Inference Engine

TL;DR

What this is: A production-ready distributed inference framework for scaling LLM serving across multiple workers with memory-aware admission control, backpressure handling, and automatic failure recovery.

What this isn’t: A complete LLM inference solution (model integration pending) or a single-server inference engine.

Tech Stack: TypeScript/Node.js (Coordinator) + Rust (Worker) with Express and Axum.

Key Features: O(1) admission control, horizontal scaling, backpressure, session management, heartbeat-based health monitoring.

Use Cases

This framework is designed for:

Scaling LLM inference across multiple GPU workers
Memory-constrained environments where KV cache management is critical
Production deployments requiring high availability and failure resilience
Multi-tenant systems needing session isolation and capacity management
Streaming inference with backpressure to handle slow clients gracefully

Quick Start

git clone <repository-url>
cd inference-engine
./start.sh

Then test inference: python test_inference.py "What is the capital of France?" or use the curl/scripts below. See Setup and running for full prerequisites and options.

Setup and running

Prerequisites

Requirement	Purpose
Node.js 18+	Coordinator (TypeScript/Node)
npm	Install coordinator dependencies
Rust 1.70+	Worker (Rust) — install from rustup.rs
LLVM (Windows only)	Worker build needs libclang, llvm-nm, and llvm-objcopy for the `llama_cpp_sys` crate. Install LLVM (e.g. 17.x) and set `LIBCLANG_PATH` to the LLVM `bin` directory (e.g. `C:\Program Files\LLVM\bin`). Also set `NM_PATH` to the full path to `llvm-nm.exe` and `OBJCOPY_PATH` to the full path to `llvm-objcopy.exe` in the same directory (e.g. `C:\Program Files\LLVM\bin\llvm-objcopy.exe`), or add that directory to PATH. `start.sh` derives `NM_PATH` from `LIBCLANG_PATH` if set.

1. Clone and install

git clone <repository-url>
cd inference-engine

Coordinator (one-time):

cd coordinator
npm install
cd ..

Worker: No separate install step; start.sh (or cargo build) will compile it.

2. Download a model (required for inference)

The worker loads a GGUF model file. Default is TinyLlama 1.1B.

Download a TinyLlama GGUF from TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF (e.g. tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf).
Place it in modelFiles/ in the project root (create the folder if needed).
Optional: set MODEL_PATH to the full path to your .gguf file if you use a different path or filename. Use forward slashes when setting in Git Bash (e.g. E:/Projects/inference-engine/modelFiles/my-model.gguf).

3. Run the system
#

Option A – Start both with one script (recommended):

./start.sh

This will:

Build and start the Coordinator on http://localhost:1337
Build and start the Worker on http://localhost:3001

Press Ctrl+C to stop both.

Option B – Run Coordinator and Worker separately:

Terminal 1 – Coordinator:

cd coordinator
npm run build
npm start

Terminal 2 – Worker (from project root):

# Optional: set model path (use forward slashes on Windows in Git Bash)
# export MODEL_PATH="E:/Projects/inference-engine/modelFiles/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf"

cd worker
cargo build
cargo run

4. Test the API

Health checks:

curl http://localhost:1337/coordinator/health
curl http://localhost:3001/worker/health
curl http://localhost:1337/coordinator/health/workers

Streaming inference (curl):

curl -N -X POST http://localhost:1337/coordinator/infer \
  -H "Content-Type: application/json" \
  -d '{"prompt":"What is the capital of France?","model":"tinyllama-1.1b","max_tokens":1000}'

Python test script:

python test_inference.py "What is the capital of France?" 1000

Shell test script:

./test_inference.sh "What is the capital of France?" 1000

5. Environment variables (optional)

Variable	Where	Description
`MODEL_PATH`	Worker	Path to GGUF model file. Use forward slashes in Git Bash. Default: `.../modelFiles/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf`
`LIBCLANG_PATH`	Worker build (Windows)	LLVM `bin` directory (for libclang), e.g. `C:\Program Files\LLVM\bin`.
`NM_PATH`	Worker build (Windows)	Full path to `llvm-nm.exe`. Can be derived from `LIBCLANG_PATH` (see `start.sh`).
`OBJCOPY_PATH`	Worker build (Windows)	Full path to `llvm-objcopy.exe`, e.g. `C:\Program Files\LLVM\bin\llvm-objcopy.exe`.
`PORT`	Coordinator	Coordinator port (default `1337`).
`HOST`	Coordinator	Coordinator host (default `0.0.0.0`).
`WORKER_ID`, `WORKER_URL`, `COORDINATOR_URL`	Worker	Override worker identity and URLs if running multiple workers or custom topology.

Architecture

System Type: Distributed coordinator-worker architecture with stateless scheduling.

Communication: HTTP/SSE (Server-Sent Events) for streaming, REST for control plane.

Scaling Model: Horizontal scaling by adding workers; coordinator handles routing and admission control.

The system consists of three components, each with a single responsibility:

┌─────────────────────────────────────────────────────────────────┐
│                         CLIENT                                  │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                       COORDINATOR                               │
│  • Admission control (O(1))                                     │
│  • Session tracking                                             │
│  • Backpressure + streaming                                     │
└─────────────────────────────────────────────────────────────────┘
                              │
              ┌───────────────┼───────────────┐
              ▼               ▼               ▼
┌───────────────────┐ ┌───────────────────┐ ┌───────────────────┐
│     WORKER 1      │ │     WORKER 2      │ │     WORKER N      │
│  • KV Cache       │ │  • KV Cache       │ │  • KV Cache       │
│  • Model Weights  │ │  • Model Weights  │ │  • Model Weights  │
│  • Decode Loop    │ │  • Decode Loop    │ │  • Decode Loop    │
└───────────────────┘ └───────────────────┘ └───────────────────┘

Components

Coordinator (TypeScript/Node.js)

Entry point for all client requests
Streams tokens from worker to client
Applies backpressure: when a slow client's buffer fills, the coordinator drops that client rather than stalling the worker
Tracks sessions for real-time capacity awareness
Never touches model weights or KV cache
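
The backpressure rule above can be sketched as a bounded per-client buffer. This is a minimal illustration, not the coordinator's actual code: the class name `ClientTokenBuffer`, its `push`/`drain` methods, and the capacity value are all assumptions made for the example.

```typescript
// Hypothetical sketch of a bounded per-client token buffer.
// Names and the capacity are illustrative, not taken from the project.
type PushResult = "queued" | "drop-client";

class ClientTokenBuffer {
  private readonly queue: string[] = [];
  private readonly capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  // Called when a token arrives from the worker.
  push(token: string): PushResult {
    if (this.queue.length >= this.capacity) {
      // The client is not draining fast enough. Tell the caller to close
      // this client's stream instead of making the worker wait.
      return "drop-client";
    }
    this.queue.push(token);
    return "queued";
  }

  // Called when the client connection is ready for more data.
  drain(maxTokens: number): string[] {
    return this.queue.splice(0, maxTokens);
  }

  get size(): number {
    return this.queue.length;
  }
}
```

When `push` returns `"drop-client"`, the caller would presumably close that client's SSE stream and release its session, leaving the worker free to serve other sessions.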

Scheduler (Pure function)

Selects which worker handles each request
Scores workers by session count (60%) and KV cache usage (40%)
Rejects early if system is at capacity (O(1) check)
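
The 60/40 weighting can be illustrated with a small pure function. The sketch below assumes that both inputs are normalized to a 0..1 utilization and that the lowest score wins; the source does not state either detail, and the `WorkerSnapshot` fields are hypothetical names.

```typescript
// Hypothetical worker view used by the scheduler sketch.
interface WorkerSnapshot {
  id: string;
  activeSessions: number;
  maxSessions: number;
  kvUsedBytes: number;
  kvCapacityBytes: number;
}

const SESSION_WEIGHT = 0.6; // from the README: session count (60%)
const KV_WEIGHT = 0.4;      // from the README: KV cache usage (40%)

// Assumption: lower score means a less loaded worker.
function loadScore(w: WorkerSnapshot): number {
  const sessionLoad = w.activeSessions / w.maxSessions;
  const kvLoad = w.kvUsedBytes / w.kvCapacityBytes;
  return SESSION_WEIGHT * sessionLoad + KV_WEIGHT * kvLoad;
}

// Pure function: the result depends only on the snapshots passed in.
function selectWorker(workers: readonly WorkerSnapshot[]): WorkerSnapshot | null {
  let best: WorkerSnapshot | null = null;
  let bestScore = Infinity;
  for (const w of workers) {
    if (w.activeSessions >= w.maxSessions) continue; // worker is full
    const score = loadScore(w);
    if (score < bestScore) {
      best = w;
      bestScore = score;
    }
  }
  return best; // null means no worker can take the request
}
```

Keeping the scheduler free of I/O and internal state makes it easy to unit test with hand-written snapshots.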

Worker (Rust)

Designed to own the model — weights, tokenizer, KV cache (LLM integration pending)
Prefill: Tokenize prompt, build initial KV cache (infrastructure ready)
Decode: Autoregressive token generation (infrastructure ready)
Enforces local limits — max sessions, max KV per session
No client awareness — just produces tokens into a bounded channel

API Reference

POST `/coordinator/infer`

Start an inference request. Returns streaming tokens via Server-Sent Events.

Request:

{
  "prompt": "string",
  "model": "string",
  "max_tokens": number
}

Response: text/event-stream

Each SSE event:

{
  "token": "string",
  "finished": boolean
}
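
A client might decode these events as sketched below. The sketch assumes standard SSE framing, where each event is a `data:` line carrying the JSON payload and is terminated by a blank line. The exact wire format is specified in protocol/inference.http.md, so treat `parseSseChunk` and `TokenEvent` as illustrative names.

```typescript
interface TokenEvent {
  token: string;
  finished: boolean;
}

// Runtime check that a parsed value has the documented event shape.
function isTokenEvent(v: unknown): v is TokenEvent {
  if (typeof v !== "object" || v === null) return false;
  const r = v as Record<string, unknown>;
  return typeof r["token"] === "string" && typeof r["finished"] === "boolean";
}

// Parses a chunk that contains only complete events. A real client
// must also buffer partial events that span network reads.
function parseSseChunk(chunk: string): TokenEvent[] {
  const events: TokenEvent[] = [];
  for (const block of chunk.split(/\r?\n\r?\n/)) {
    const data = block
      .split(/\r?\n/)
      .filter((line) => line.startsWith("data:"))
      .map((line) => line.slice(5).trimStart())
      .join("\n");
    if (data === "") continue; // comment or keep-alive block
    const parsed: unknown = JSON.parse(data); // throws on malformed JSON
    if (isTokenEvent(parsed)) events.push(parsed);
  }
  return events;
}
```

A consumer would append each `token` to its output and stop reading once an event arrives with `finished` set to `true`.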

Status Codes:

200 - Success (streaming)
400 - Missing required fields
502 - Worker unreachable or failed
503 - System at capacity
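
The 400 case can be illustrated with a small validator. This sketch assumes that all three request fields are required and that `max_tokens` must be a positive integer; the source only says "missing required fields", so these rules and the name `validateInferRequest` are assumptions.

```typescript
interface InferRequest {
  prompt: string;
  model: string;
  max_tokens: number;
}

type Validation =
  | { ok: true; value: InferRequest }
  | { ok: false; status: 400; error: string };

// Hypothetical validation step run before admission control.
function validateInferRequest(body: unknown): Validation {
  if (typeof body !== "object" || body === null) {
    return { ok: false, status: 400, error: "body must be a JSON object" };
  }
  const r = body as Record<string, unknown>;
  const prompt = r["prompt"];
  const model = r["model"];
  const maxTokens = r["max_tokens"];
  if (typeof prompt !== "string" || prompt.length === 0) {
    return { ok: false, status: 400, error: "prompt is required" };
  }
  if (typeof model !== "string" || model.length === 0) {
    return { ok: false, status: 400, error: "model is required" };
  }
  if (typeof maxTokens !== "number" || !Number.isInteger(maxTokens) || maxTokens <= 0) {
    return { ok: false, status: 400, error: "max_tokens must be a positive integer" };
  }
  return { ok: true, value: { prompt, model, max_tokens: maxTokens } };
}
```

Rejecting malformed requests before scheduling keeps invalid input from consuming a session slot.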

See protocol/inference.http.md for complete API documentation.

Configuration

Coordinator

Environment variables (optional):

PORT - Server port (default: 1337)
HOST - Server host (default: 0.0.0.0)
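
Reading these two variables with their documented defaults might look like the sketch below. The function name and the port range check are assumptions; the coordinator's real startup code may differ.

```typescript
interface CoordinatorConfig {
  port: number;
  host: string;
}

// Takes the environment as a plain record so it can be tested
// without touching the real process environment.
function loadCoordinatorConfig(env: Record<string, string | undefined>): CoordinatorConfig {
  const rawPort = env["PORT"];
  const port = rawPort === undefined || rawPort === "" ? 1337 : Number(rawPort);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`Invalid PORT: ${rawPort}`);
  }
  const rawHost = env["HOST"];
  const host = rawHost === undefined || rawHost === "" ? "0.0.0.0" : rawHost;
  return { port, host };
}
```

In Node.js the function would be called as `loadCoordinatorConfig(process.env)`.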

Worker

Environment variables:

MODEL_PATH - Path to GGUF model file. Use forward slashes (e.g. E:/path/to/model.gguf) when setting in Git Bash. Default: E:/Projects/inference-engine/modelFiles/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
WORKER_ID - Unique identifier (default: worker-1)
WORKER_URL - URL at which the coordinator can reach this worker (default: http://localhost:3001)
COORDINATOR_URL - Coordinator base URL (default: http://localhost:1337)

Supported models: The worker uses the llama_cpp Rust crate (v0.3), which bundles llama.cpp. The default is TinyLlama 1.1B (TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF); any quantization can be used, e.g. Q4_K_M. Other supported architectures include Llama, Gemma 2, Phi, and Mistral. Gemma 3 is not yet supported by the bundled llama.cpp.

System Limits

Per Worker:

100 max sessions
512 MB max KV per session
8 GB total KV cache

System-wide:

1000 total sessions
64 GB total KV cache
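
An O(1) admission check against the system-wide limits can be sketched with two running counters. The code assumes the caller supplies a per-request KV estimate and that "GB" means GiB; neither detail is stated in the source, and `AdmissionController` is a hypothetical name.

```typescript
interface Limits {
  maxSessions: number;
  maxKvBytes: number;
}

const GIB = 1024 ** 3; // assumption: the README's "GB" is read as GiB
const SYSTEM_LIMITS: Limits = { maxSessions: 1000, maxKvBytes: 64 * GIB };

class AdmissionController {
  private sessions = 0;
  private kvBytes = 0;
  private readonly limits: Limits;

  constructor(limits: Limits) {
    this.limits = limits;
  }

  // Constant-time check: compares two counters, never scans sessions.
  tryAdmit(estimatedKvBytes: number): boolean {
    if (this.sessions + 1 > this.limits.maxSessions) return false;
    if (this.kvBytes + estimatedKvBytes > this.limits.maxKvBytes) return false;
    this.sessions += 1;
    this.kvBytes += estimatedKvBytes;
    return true;
  }

  // Must be called with the same estimate when the session ends.
  release(estimatedKvBytes: number): void {
    this.sessions = Math.max(0, this.sessions - 1);
    this.kvBytes = Math.max(0, this.kvBytes - estimatedKvBytes);
  }
}
```

A request for which `tryAdmit` returns `false` would map to the 503 "System at capacity" response.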

Project Structure

inference-engine/
├── coordinator/          # TypeScript/Node.js coordinator service
│   ├── src/
│   │   ├── server.ts     # Express server setup
│   │   ├── infer.ts      # Inference request handling
│   │   ├── scheduler.ts  # Worker selection logic
│   │   ├── health.ts     # Health check endpoints
│   │   └── ...
│   └── package.json
│
├── worker/               # Rust worker service
│   ├── src/
│   │   ├── main.rs       # Entry point
│   │   ├── model.rs      # Model loading & inference
│   │   ├── cache.rs      # KV cache management
│   │   ├── stream.rs     # Token streaming
│   │   └── ...
│   └── Cargo.toml
│
├── docs/                 # Detailed documentation
│   ├── ARCHITECTURE.md   # System design deep dive
│   ├── COORDINATOR.md    # Coordinator implementation
│   ├── WORKER.md         # Worker implementation
│   ├── FAILURE_MODES.md  # Failure handling strategies
│   └── ...
│
├── protocol/             # API specifications
│   └── inference.http.md
│
├── start.sh              # Quick start script
└── README.md

Key Features

Distributed Architecture: Framework for scaling inference across multiple workers
Memory-Aware Admission Control: O(1) capacity checks prevent overload
Backpressure: Slow clients are dropped so that they never stall a worker
Failure Resilience: Automatic retries for prefill failures
Session Management: KV cache lifecycle infrastructure with TTL-based cleanup
Real-time Health Tracking: Heartbeat-based worker monitoring
Streaming Infrastructure: Server-Sent Events with bounded channels for backpressure
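
The TTL-based session cleanup mentioned above could work roughly as follows. The source does not give the TTL value or the data structure, so `SessionTable`, its methods, and the millisecond timestamps are assumptions for illustration.

```typescript
// Hypothetical session table that expires idle sessions.
class SessionTable {
  private readonly lastActive = new Map<string, number>();
  private readonly ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  // Record activity for a session (creation or a new token).
  touch(sessionId: string, nowMs: number): void {
    this.lastActive.set(sessionId, nowMs);
  }

  // Remove and return every session idle for longer than the TTL.
  sweep(nowMs: number): string[] {
    const expired: string[] = [];
    this.lastActive.forEach((seenMs, id) => {
      if (nowMs - seenMs > this.ttlMs) {
        expired.push(id);
        this.lastActive.delete(id); // deleting during forEach is safe for Map
      }
    });
    return expired;
  }

  get size(): number {
    return this.lastActive.size;
  }
}
```

Passing the current time as an argument keeps the table deterministic in tests. A caller would run `sweep(Date.now())` on a timer and free the KV cache of each returned session.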

Keywords: distributed inference, LLM serving, KV cache management, backpressure, admission control, worker scheduling, session management, horizontal scaling, memory-aware load balancing, token streaming, Server-Sent Events, coordinator-worker pattern, failure resilience, health monitoring, heartbeat protocol

Documentation

For detailed information, see:

System Overview - High-level design and problem statement
Architecture - Deep dive into system design
Coordinator - Coordinator implementation details
Worker - Worker implementation details
Failure Modes - Failure handling strategies
Streaming - Token streaming and backpressure
KV Cache - KV cache management
Scheduler - Worker selection algorithm
Observability - Metrics and monitoring

Current Status

Project Phase: Infrastructure complete, LLM integration pending.

This project provides the infrastructure layer for distributed LLM inference:

✅ Implemented:

Coordinator with admission control and session tracking
Worker framework with health monitoring and heartbeat
Scheduler for worker selection
Streaming infrastructure with backpressure
Session management and KV cache lifecycle (infrastructure)
Failure handling and retry logic

🚧 Pending:

LLM model integration (model loading, tokenization, inference)
Actual KV cache implementation tied to a specific model backend
Token generation logic

Integration Requirements: To complete LLM integration, implement model loading, tokenization, and inference logic in the worker’s model.rs module. The infrastructure for session management, streaming, and KV cache lifecycle is ready.

Troubleshooting

Worker fails to start
#

Check ports: Ensure port 3001 is not in use
Verify Rust installation: rustc --version should show 1.70+
Check build errors: Review cargo build output for dependency issues

Coordinator returns 503 “System at capacity”

Check worker health: curl http://localhost:3001/worker/health
Verify worker registration: curl http://localhost:1337/coordinator/health/workers
Check system limits: Review session and KV cache limits
Ensure worker is running: Worker must be running and sending heartbeats

Coordinator can’t reach worker

Verify worker URL: Check WORKER_URL environment variable matches actual worker address
Check network: Ensure coordinator can reach worker on the specified port
Check heartbeat: Worker should be sending heartbeats every 10 seconds
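
To reason about when a worker would be treated as unreachable, a staleness check can be sketched as below. The 10-second interval comes from the text above; the tolerance of three missed heartbeats and the function names are assumptions, since the source does not state the coordinator's actual timeout.

```typescript
const HEARTBEAT_INTERVAL_MS = 10000; // from the README: every 10 seconds
const MISSED_BEATS_ALLOWED = 3;      // assumption: not stated in the source

function isWorkerAlive(
  lastHeartbeatMs: number,
  nowMs: number,
  intervalMs: number = HEARTBEAT_INTERVAL_MS,
  missedAllowed: number = MISSED_BEATS_ALLOWED,
): boolean {
  return nowMs - lastHeartbeatMs <= intervalMs * missedAllowed;
}

// Returns the IDs of workers whose last heartbeat is too old, sorted.
function staleWorkers(lastSeen: ReadonlyMap<string, number>, nowMs: number): string[] {
  const stale: string[] = [];
  lastSeen.forEach((seenMs, id) => {
    if (!isWorkerAlive(seenMs, nowMs)) stale.push(id);
  });
  return stale.sort();
}
```

Under these assumptions a worker drops out of scheduling about 30 seconds after its last heartbeat, which is a useful window to keep in mind when checking logs.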

Development

Building

Coordinator:

cd coordinator
npm install
npm run build

Worker:

cd worker
cargo build --release

Testing

See individual component documentation for testing instructions.

Contributing

Contributions welcome! Please read the architecture documentation before making significant changes.

For AI/LLM Parsing

Project Summary: Distributed inference framework for LLM serving with coordinator-worker architecture, memory-aware admission control, and backpressure handling.

Primary Technologies: TypeScript, Node.js, Rust, Express, Axum, Server-Sent Events.

Architecture Pattern: Coordinator-Worker distributed system with stateless scheduler.

Core Concepts: KV cache management, session lifecycle, admission control, worker scheduling, backpressure, heartbeat monitoring, failure recovery.

Current State: Infrastructure layer complete; LLM model integration pending.

Related Documentation: See docs/ directory for detailed architecture, failure modes, streaming, and component-specific documentation.
