Skip to main content

Inference Engine

·1723 words·9 mins

Technologies

TypeScript Node.js Rust Express Axum SSE

A distributed inference framework for large language models that routes requests to workers, manages KV cache lifecycle, handles failures gracefully, and applies backpressure so memory — not compute — is the bottleneck.

Note: This is the infrastructure layer. LLM integration is not yet implemented. The system provides the distributed architecture, routing, and session management, but actual model inference needs to be integrated.

TL;DR
#

What this is: A production-ready distributed inference framework for scaling LLM serving across multiple workers with memory-aware admission control, backpressure handling, and automatic failure recovery.

What this isn’t: A complete LLM inference solution (model integration pending) or a single-server inference engine.

Tech Stack: TypeScript/Node.js (Coordinator) + Rust (Worker) with Express and Axum.

Key Features: O(1) admission control, horizontal scaling, backpressure, session management, heartbeat-based health monitoring.

Use Cases
#

This framework is designed for:

  • Scaling LLM inference across multiple GPU workers
  • Memory-constrained environments where KV cache management is critical
  • Production deployments requiring high availability and failure resilience
  • Multi-tenant systems needing session isolation and capacity management
  • Streaming inference with backpressure to handle slow clients gracefully

Quick Start
#

git clone <repository-url>
cd inference-engine
./start.sh

Then test inference: python test_inference.py "What is the capital of France?" or use the curl/scripts below. See Setup and running for full prerequisites and options.


Setup and running
#

Prerequisites
#

RequirementPurpose
Node.js 18+Coordinator (TypeScript/Node)
npmInstall coordinator dependencies
Rust 1.70+Worker (Rust) — install from rustup.rs
LLVM (Windows only)Worker build needs libclang, llvm-nm, and llvm-objcopy for the llama_cpp_sys crate. Install LLVM (e.g. 17.x) and set LIBCLANG_PATH to the LLVM bin directory (e.g. C:\Program Files\LLVM\bin). Also set NM_PATH to the full path to llvm-nm.exe and OBJCOPY_PATH to the full path to llvm-objcopy.exe in the same directory (e.g. C:\Program Files\LLVM\bin\llvm-objcopy.exe), or add that directory to PATH. start.sh derives NM_PATH from LIBCLANG_PATH if set.

1. Clone and install
#

git clone <repository-url>
cd inference-engine

Coordinator (one-time):

cd coordinator
npm install
cd ..

Worker: No separate install step; start.sh (or cargo build) will compile it.

2. Download a model (required for inference)
#

The worker loads a GGUF model file. Default is TinyLlama 1.1B.

  1. Download a TinyLlama GGUF from TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF (e.g. tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf).
  2. Place it in modelFiles/ in the project root (create the folder if needed).
  3. Optional: set MODEL_PATH to the full path to your .gguf file if you use a different path or filename. Use forward slashes when setting in Git Bash (e.g. E:/Projects/inference-engine/modelFiles/my-model.gguf).

3. Run the system
#

Option A – Start both with one script (recommended):

./start.sh

This will:

  • Build and start the Coordinator on http://localhost:1337
  • Build and start the Worker on http://localhost:3001

Press Ctrl+C to stop both.

Option B – Run Coordinator and Worker separately:

Terminal 1 – Coordinator:

cd coordinator
npm run build
npm start

Terminal 2 – Worker (from project root):

# Optional: set model path (use forward slashes on Windows in Git Bash)
# export MODEL_PATH="E:/Projects/inference-engine/modelFiles/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf"

cd worker
cargo build
cargo run

4. Test the API
#

Health checks:

curl http://localhost:1337/coordinator/health
curl http://localhost:3001/worker/health
curl http://localhost:1337/coordinator/health/workers

Streaming inference (curl):

curl -N -X POST http://localhost:1337/coordinator/infer \
  -H "Content-Type: application/json" \
  -d '{"prompt":"What is the capital of France?","model":"tinyllama-1.1b","max_tokens":1000}'

Python test script:

python test_inference.py "What is the capital of France?" 1000

Shell test script:

./test_inference.sh "What is the capital of France?" 1000

5. Environment variables (optional)
#

VariableWhereDescription
MODEL_PATHWorkerPath to GGUF model file. Use forward slashes in Git Bash. Default: .../modelFiles/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
LIBCLANG_PATHWorker build (Windows)LLVM bin directory (for libclang), e.g. C:\Program Files\LLVM\bin.
NM_PATHWorker build (Windows)Full path to llvm-nm.exe. Can be derived from LIBCLANG_PATH (see start.sh).
OBJCOPY_PATHWorker build (Windows)Full path to llvm-objcopy.exe, e.g. C:\Program Files\LLVM\bin\llvm-objcopy.exe.
PORTCoordinatorCoordinator port (default 1337).
HOSTCoordinatorCoordinator host (default 0.0.0.0).
WORKER_ID, WORKER_URL, COORDINATOR_URLWorkerOverride worker identity and URLs if running multiple workers or custom topology.

Architecture
#

System Type: Distributed coordinator-worker architecture with stateless scheduling.

Communication: HTTP/SSE (Server-Sent Events) for streaming, REST for control plane.

Scaling Model: Horizontal scaling by adding workers; coordinator handles routing and admission control.

The system consists of three components, each with a single responsibility:

┌─────────────────────────────────────────────────────────────────┐
│                         CLIENT                                  │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│                       COORDINATOR                               │
│  • Admission control (O(1))                                     │
│  • Session tracking                                             │
│  • Backpressure + streaming                                     │
└─────────────────────────────────────────────────────────────────┘
              ┌───────────────┼───────────────┐
              ▼               ▼               ▼
┌───────────────────┐ ┌───────────────────┐ ┌───────────────────┐
│     WORKER 1      │ │     WORKER 2      │ │     WORKER N      │
│  • KV Cache       │ │  • KV Cache       │ │  • KV Cache       │
│  • Model Weights  │ │  • Model Weights  │ │  • Model Weights  │
│  • Decode Loop    │ │  • Decode Loop    │ │  • Decode Loop    │
└───────────────────┘ └───────────────────┘ └───────────────────┘

Components
#

Coordinator (TypeScript/Node.js)

  • Entry point for all client requests
  • Streams tokens from worker to client
  • Applies backpressure — buffers fill, clients get dropped, not workers
  • Tracks sessions for real-time capacity awareness
  • Never touches model weights or KV cache

Scheduler (Pure function)

  • Selects which worker handles each request
  • Scores workers by session count (60%) and KV cache usage (40%)
  • Rejects early if system is at capacity (O(1) check)

Worker (Rust)

  • Designed to own the model — weights, tokenizer, KV cache (LLM integration pending)
  • Prefill: Tokenize prompt, build initial KV cache (infrastructure ready)
  • Decode: Autoregressive token generation (infrastructure ready)
  • Enforces local limits — max sessions, max KV per session
  • No client awareness — just produces tokens into a bounded channel

API Reference
#

POST /coordinator/infer
#

Start an inference request. Returns streaming tokens via Server-Sent Events.

Request:

{
  "prompt": "string",
  "model": "string",
  "max_tokens": number
}

Response: text/event-stream

Each SSE event:

{
  "token": "string",
  "finished": boolean
}

Status Codes:

  • 200 - Success (streaming)
  • 400 - Missing required fields
  • 502 - Worker unreachable or failed
  • 503 - System at capacity

See protocol/inference.http.md for complete API documentation.


Configuration
#

Coordinator
#

Environment variables (optional):

  • PORT - Server port (default: 1337)
  • HOST - Server host (default: 0.0.0.0)

Worker
#

Environment variables:

  • MODEL_PATH - Path to GGUF model file. Use forward slashes (e.g. E:/path/to/model.gguf) when setting in Git Bash. Default: E:/Projects/inference-engine/modelFiles/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf

Supported models: The worker uses the llama_cpp Rust crate (v0.3), which bundles llama.cpp. Default is TinyLlama 1.1B (TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF); use any quant e.g. Q4_K_M.gguf. Other supported architectures include Llama, Gemma 2, Phi, Mistral, etc. Gemma 3 is not yet supported by the bundled llama.cpp.

  • WORKER_ID - Unique identifier (default: worker-1)
  • WORKER_URL - Reachable URL for coordinator (default: http://localhost:3001)
  • COORDINATOR_URL - Coordinator base URL (default: http://localhost:1337)

System Limits
#

Per Worker:

  • 100 max sessions
  • 512 MB max KV per session
  • 8 GB total KV cache

System-wide:

  • 1000 total sessions
  • 64 GB total KV cache

Project Structure
#

inference-engine/
├── coordinator/          # TypeScript/Node.js coordinator service
│   ├── src/
│   │   ├── server.ts     # Express server setup
│   │   ├── infer.ts      # Inference request handling
│   │   ├── scheduler.ts  # Worker selection logic
│   │   ├── health.ts     # Health check endpoints
│   │   └── ...
│   └── package.json
├── worker/               # Rust worker service
│   ├── src/
│   │   ├── main.rs       # Entry point
│   │   ├── model.rs      # Model loading & inference
│   │   ├── cache.rs      # KV cache management
│   │   ├── stream.rs     # Token streaming
│   │   └── ...
│   └── Cargo.toml
├── docs/                 # Detailed documentation
│   ├── ARCHITECTURE.md   # System design deep dive
│   ├── COORDINATOR.md    # Coordinator implementation
│   ├── WORKER.md         # Worker implementation
│   ├── FAILURE_MODES.md  # Failure handling strategies
│   └── ...
├── protocol/             # API specifications
│   └── inference.http.md
├── start.sh              # Quick start script
└── README.md

Key Features
#

  • Distributed Architecture: Framework for scaling inference across multiple workers
  • Memory-Aware Admission Control: O(1) capacity checks prevent overload
  • Backpressure: Slow clients are dropped, not workers
  • Failure Resilience: Automatic retries for prefill failures
  • Session Management: KV cache lifecycle infrastructure with TTL-based cleanup
  • Real-time Health Tracking: Heartbeat-based worker monitoring
  • Streaming Infrastructure: Server-Sent Events with bounded channels for backpressure

Keywords: distributed inference, LLM serving, KV cache management, backpressure, admission control, worker scheduling, session management, horizontal scaling, memory-aware load balancing, token streaming, Server-Sent Events, coordinator-worker pattern, failure resilience, health monitoring, heartbeat protocol


Documentation
#

For detailed information, see:


Current Status
#

Project Phase: Infrastructure complete, LLM integration pending.

This project provides the infrastructure layer for distributed LLM inference:

Implemented:

  • Coordinator with admission control and session tracking
  • Worker framework with health monitoring and heartbeat
  • Scheduler for worker selection
  • Streaming infrastructure with backpressure
  • Session management and KV cache lifecycle (infrastructure)
  • Failure handling and retry logic

🚧 Pending:

  • LLM model integration (model loading, tokenization, inference)
  • Actual KV cache implementation tied to a specific model backend
  • Token generation logic

Integration Requirements: To complete LLM integration, implement model loading, tokenization, and inference logic in the worker’s model.rs module. The infrastructure for session management, streaming, and KV cache lifecycle is ready.


Troubleshooting
#

Worker fails to start
#

  • Check ports: Ensure port 3001 is not in use
  • Verify Rust installation: rustc --version should show 1.70+
  • Check build errors: Review cargo build output for dependency issues

Coordinator returns 503 “System at capacity”
#

  • Check worker health: curl http://localhost:3001/worker/health
  • Verify worker registration: curl http://localhost:1337/coordinator/health/workers
  • Check system limits: Review session and KV cache limits
  • Ensure worker is running: Worker must be running and sending heartbeats

Coordinator can’t reach worker
#

  • Verify worker URL: Check WORKER_URL environment variable matches actual worker address
  • Check network: Ensure coordinator can reach worker on the specified port
  • Check heartbeat: Worker should be sending heartbeats every 10 seconds

Development
#

Building
#

Coordinator:

cd coordinator
npm install
npm run build

Worker:

cd worker
cargo build --release

Testing
#

See individual component documentation for testing instructions.


Contributing
#

Contributions welcome! Please read the architecture documentation before making significant changes.


For AI/LLM Parsing
#

Project Summary: Distributed inference framework for LLM serving with coordinator-worker architecture, memory-aware admission control, and backpressure handling.

Primary Technologies: TypeScript, Node.js, Rust, Express, Axum, Server-Sent Events.

Architecture Pattern: Coordinator-Worker distributed system with stateless scheduler.

Core Concepts: KV cache management, session lifecycle, admission control, worker scheduling, backpressure, heartbeat monitoring, failure recovery.

Current State: Infrastructure layer complete; LLM model integration pending.

Related Documentation: See docs/ directory for detailed architecture, failure modes, streaming, and component-specific documentation.