kratos-jarvis


What it is

kratos-jarvis is a self-hosted, voice-driven autonomous AI assistant built around the OpenClaw agent gateway. It runs entirely on local hardware, with no cloud APIs required, and integrates voice input, an LLM relevance gate, screen vision, biometric/location awareness, and self-healing reliability mechanisms into a single system.

It is published as a reference implementation of one person’s real home setup, not a turnkey installer. Paths, containers, and environment variables are parameterised; no tokens or secrets are committed.

Subsystem	Location
Hands-free voice stack (STT, TTS, VAD, screen vision, dashboard)	`voice/`
Consigliere advisor + executor hooks + learning loop	`autonomy/`
Location + health telemetry ingestion (geofencing, persistence)	`telemetry/`
Watchdogs, heartbeats, context-collapse guard, stress tests	`reliability/`
Per-subsystem deep writeups	`docs/`

Voice subsystem

The continuous-listen engine (voice/claw-listen-daemon.py) runs as a background process and handles the full voice pipeline:

Energy VAD with adaptive noise floor: computes RMS energy per audio frame; applies onset debounce to avoid false triggers from breath or ambient noise; arms the silence watchdog only after a configurable minimum speech duration.
Local Whisper STT: transcribes with faster-whisper-large-v3-turbo via a local OpenAI-compatible audio server running on GPU.
Relevance gate: a small local LLM (via Ollama) decides whether the transcription is addressed to Jarvis or is ambient household speech. Utterances that fail the gate are discarded without further processing.
Kokoro TTS: synthesises replies with Kokoro-82M, also served by the local audio sidecar; microphone capture is muted during playback to prevent feedback loops.
TTS cleaning: strips markdown, emojis, and URLs before sending text to the TTS engine so spoken output sounds natural.
On-demand screen vision (voice/claw-ver): triggered by a configurable phrase (e.g. “look at monitor 2”), it captures the screen with grim, converts the image to a format the agent accepts, and returns a spoken answer. Vision is never persistent: each capture is one-shot.
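
To make the energy VAD description above more concrete, here is a minimal sketch of an RMS gate with an adaptive noise floor and onset debounce. It is not taken from voice/claw-listen-daemon.py: the class name `EnergyVAD`, every threshold, and the frame format (a list of integer PCM samples) are assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass


def rms(frame):
    """Root-mean-square energy of one frame of PCM samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


@dataclass
class EnergyVAD:
    # All values below are illustrative assumptions, not the daemon's settings.
    ratio: float = 3.0            # a frame is "loud" above noise_floor * ratio
    alpha: float = 0.05           # smoothing factor for the noise-floor average
    onset_frames: int = 3         # consecutive loud frames needed to start speech
    min_speech_frames: int = 10   # loud frames needed before the watchdog arms
    noise_floor: float = 100.0
    loud_run: int = 0
    speech_frames: int = 0
    in_speech: bool = False

    def process(self, frame):
        """Feed one frame; return True while an utterance is in progress."""
        energy = rms(frame)
        loud = energy > self.noise_floor * self.ratio
        if loud:
            self.loud_run += 1
        else:
            self.loud_run = 0
            if not self.in_speech:
                # Track ambient noise only on quiet frames outside speech.
                self.noise_floor += self.alpha * (energy - self.noise_floor)
        # Onset debounce: a short click or breath never reaches onset_frames.
        if not self.in_speech and self.loud_run >= self.onset_frames:
            self.in_speech = True
        if self.in_speech and loud:
            self.speech_frames += 1
        return self.in_speech

    @property
    def watchdog_armed(self):
        """The silence watchdog may only fire after enough real speech."""
        return self.in_speech and self.speech_frames >= self.min_speech_frames
```

The sketch stops at arming the watchdog; detecting the end of the utterance (the silence timeout itself) is left out to keep the example short.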

Additional voice tools:

Script	Role
`voice/claw-listen`	Control CLI: start, stop, or query the status of the daemon
`voice/claw-menu`	wofi panel for quick actions
`voice/claw-talk` + `voice/clawbar-vad.py`	Push-to-talk bridge + VAD (shared with clawbar)
`voice/jarvis`	Single-screen status dashboard: tower health, insight of the day, one action

Consigliere — autonomy layer

The autonomy/ subsystem provides a structured decision-support and executor framework:

Commitment graph: cross-references incoming decisions against a knowledge graph built from git commits and project documentation, surfacing CONNECTIONS, TENSIONS, and STEELMAN arguments.
Learning loop: captures interaction outcomes and feeds them back into the advisor to improve future recommendations.
Executor hooks: fire shell or Python actions in response to advisor outputs (e.g. run a script, update a file, send a notification).
Focus gate: blocks or defers low-priority interruptions based on current activity state.
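
As a rough illustration of the focus gate idea, the sketch below lets an interruption through only when its priority meets a threshold for the current activity, and holds everything else until the activity changes. The priority levels, the activity names, and the `THRESHOLDS` table are assumptions made for this example; the actual rules in autonomy/ may differ.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Priority(IntEnum):
    LOW = 1
    NORMAL = 2
    URGENT = 3


# Hypothetical policy: minimum priority allowed through in each activity state.
THRESHOLDS = {
    "sleeping": Priority.URGENT,
    "coding": Priority.NORMAL,
    "idle": Priority.LOW,
}


@dataclass
class FocusGate:
    activity: str = "idle"
    deferred: list = field(default_factory=list)

    def submit(self, message, priority):
        """Return True if the message may interrupt now; otherwise defer it."""
        needed = THRESHOLDS.get(self.activity, Priority.LOW)
        if priority >= needed:
            return True
        self.deferred.append((priority, message))
        return False

    def set_activity(self, activity):
        """Switch state and release deferred messages that now pass the gate."""
        self.activity = activity
        needed = THRESHOLDS.get(activity, Priority.LOW)
        released = [m for p, m in self.deferred if p >= needed]
        self.deferred = [(p, m) for p, m in self.deferred if p < needed]
        return released
```

For example, a low-priority notification submitted while the state is "coding" is deferred, and it is returned by `set_activity("idle")` once the user is no longer busy.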

Telemetry and geofencing

The telemetry/ subsystem ingests real-world context to make the assistant activity-aware:

Ingests GPS coordinates and health vitals (heart rate, steps, SpO2) from Android via Health Connect.
Maintains geofences (home, office, gym, etc.) with configurable dwell times.
Infers current activity (coding, commuting, working out, sleeping) approximately every 20 minutes.
Persists telemetry to Postgres for historical analysis and advisor context.
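
A geofence with a dwell time, as described above, can be sketched with a haversine distance check plus a timestamp of when the position first entered the fence. The class and field names, the coordinates in the usage note, and the circular fence shape are assumptions for illustration, not the schema used in telemetry/.

```python
import math
from dataclasses import dataclass
from typing import Optional

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


@dataclass
class Geofence:
    name: str
    lat: float
    lon: float
    radius_m: float
    dwell_s: float                      # time inside before the fence counts
    entered_at: Optional[float] = None  # timestamp of the first fix inside

    def update(self, lat, lon, ts):
        """Return True once fixes have stayed inside for at least dwell_s."""
        if haversine_m(self.lat, self.lon, lat, lon) > self.radius_m:
            self.entered_at = None      # leaving resets the dwell timer
            return False
        if self.entered_at is None:
            self.entered_at = ts
        return ts - self.entered_at >= self.dwell_s
```

With `Geofence("home", 40.0, -3.0, radius_m=100, dwell_s=120)`, a single GPS fix inside the radius does not mark the user as home; only a second fix at least 120 seconds later does, which filters out drive-by passes.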

Reliability

The reliability/ subsystem ensures the system stays running and recovers automatically:

Watchdogs: monitor key processes (daemon, Docker containers, audio sidecar) and restart them on failure.
Dead-man heartbeat: sends periodic signals to a NAS; missed heartbeats trigger an alert.
Context-collapse guard: detects when the LLM context has degraded (e.g. runaway memory) and resets the session.
Resilience stress-test suite: hammers the voice pipeline with synthetic inputs to expose failure modes before they occur in daily use.
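
The dead-man heartbeat above boils down to counting how many expected signals have gone missing. The sketch below shows the receiving side of that logic; the class name, the 60-second interval, and the three-beat tolerance are assumptions for illustration, and the transport to the NAS and the alert delivery are deliberately left out.

```python
import time
from dataclasses import dataclass, field


@dataclass
class DeadMansSwitch:
    interval_s: float = 60.0   # expected gap between heartbeats (assumed)
    max_missed: int = 3        # alert after this many missed beats (assumed)
    last_beat: float = field(default_factory=time.monotonic)

    def beat(self, now=None):
        """Record a heartbeat; `now` can be injected for testing."""
        self.last_beat = time.monotonic() if now is None else now

    def missed(self, now=None):
        """Number of whole heartbeat intervals elapsed since the last beat."""
        now = time.monotonic() if now is None else now
        return max(0, int((now - self.last_beat) // self.interval_s))

    def should_alert(self, now=None):
        """True once the sender has been silent for max_missed intervals."""
        return self.missed(now) >= self.max_missed
```

Using a monotonic clock avoids false alerts when the wall clock jumps, for example after an NTP correction.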

Installation

git clone https://github.com/stevenvo780/kratos-jarvis.git
cd kratos-jarvis

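Then follow the subsystem-specific setup guides in docs/voice.md, docs/autonomy.md, and docs/telemetry.md. The commands below summarise the main entry points for the voice, screen vision, telemetry, and reliability subsystems.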

# Install Python deps for the listen daemon
pip install faster-whisper tqdm

# Start the continuous-listen engine
python3 voice/claw-listen-daemon.py

# Control the daemon
voice/claw-listen start
voice/claw-listen stop
voice/claw-listen status

# Requires grim (Wayland screenshot tool)
# Trigger via voice ("look at monitor 2") or directly:
voice/claw-ver

# Requires a Postgres database; set connection env vars first
export TELEMETRY_DB_URL=postgres://user:pass@host/db

# Start the telemetry ingestion endpoint
python3 telemetry/ingest.py

# Start watchdog suite
bash reliability/watchdog.sh

# Run the stress-test harness
bash reliability/stress-test.sh

Requirements

Linux with PipeWire or PulseAudio
Hyprland compositor (Wayland)
Docker, with:
- An OpenClaw agent container
- An audio sidecar serving OpenAI-compatible /v1/audio/transcriptions and /v1/audio/speech endpoints (e.g. speaches with faster-whisper + Kokoro)
Ollama with a small local model for the relevance gate
Postgres for telemetry persistence
grim (optional, for screen vision)
wofi (optional, for the claw-menu panel)
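
To show how a client might use the audio sidecar's /v1/audio/speech endpoint together with the TTS cleaning step, here is a minimal sketch that strips markdown, URLs, and emojis and then builds (but does not send) the request. The base URL, the `model` and `voice` values, and the regular expressions are placeholders assumed for this example; check your sidecar's configuration for the real names.

```python
import json
import re
import urllib.request

MD_LINK = re.compile(r"\[([^\]]+)\]\([^)]+\)")   # [label](target)
URL = re.compile(r"https?://\S+")
MD_MARKS = re.compile(r"[*`#>~]+")               # common markdown punctuation
# Rough emoji coverage: pictographs, misc symbols, and the variation selector.
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F]")


def clean_for_tts(text):
    """Remove markup that would be read aloud literally by a TTS engine."""
    text = MD_LINK.sub(r"\1", text)   # keep the link label, drop the target
    text = URL.sub("", text)
    text = MD_MARKS.sub("", text)
    text = EMOJI.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


def build_speech_request(
    text,
    base_url="http://localhost:8000",   # assumed sidecar address
    model="hypothetical-kokoro-model",  # placeholder model name
    voice="hypothetical-voice",         # placeholder voice name
):
    """Build a POST request in the OpenAI-compatible speech format."""
    payload = {"model": model, "input": clean_for_tts(text), "voice": voice}
    return urllib.request.Request(
        base_url + "/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it and return the synthesised audio bytes, assuming the sidecar is running at the given address.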

Stack

Layer	Technology
Agent gateway	OpenClaw (Docker container)
STT	faster-whisper-large-v3-turbo (local GPU)
TTS	Kokoro-82M (local GPU, via speaches sidecar)
Relevance / triage LLM	Ollama (small local model)
Voice daemon	Python 3
Control scripts	Bash
Compositor	Hyprland / Wayland
Panel	wofi
Screen capture	grim + ImageMagick
Telemetry DB	Postgres
MCP tooling	MCP tool servers (e.g. agora-mcp)
License	MIT