
kratos-jarvis

Python 3 · Bash · Wayland / Hyprland · OpenClaw · Ollama · Self-hosted · MIT · Daily driver

GitHub: stevenvo780/kratos-jarvis


kratos-jarvis is a self-hosted, voice-driven autonomous AI assistant built around the OpenClaw agent gateway. It runs entirely on local hardware, with no cloud APIs required, and integrates voice input, an LLM relevance gate, screen vision, biometric/location awareness, and self-healing reliability mechanisms into a single system.

It is published as a reference implementation of one person’s real home setup, not a turnkey installer. Paths, containers, and environment variables are parameterised; no tokens or secrets are committed.

| Subsystem | Location |
|---|---|
| Hands-free voice stack (STT, TTS, VAD, screen vision, dashboard) | voice/ |
| Consigliere advisor + executor hooks + learning loop | autonomy/ |
| Location + health telemetry ingestion (geofencing, persistence) | telemetry/ |
| Watchdogs, heartbeats, context-collapse guard, stress tests | reliability/ |
| Per-subsystem deep writeups | docs/ |

The continuous-listen engine (voice/claw-listen-daemon.py) runs as a background process and handles the full voice pipeline:

  • Energy VAD with adaptive noise floor: computes RMS energy per audio frame; applies onset debounce to avoid false triggers from breath or ambient noise; arms the silence watchdog only after a configurable minimum speech duration.
  • Local Whisper STT: transcribes with faster-whisper-large-v3-turbo via a local OpenAI-compatible audio server running on GPU.
  • Relevance gate: a small local LLM (via Ollama) decides whether the transcription is addressed to Jarvis or is ambient household speech. Transcriptions that fail the gate are discarded without further processing.
  • Kokoro TTS: synthesises replies with Kokoro-82M, also served by the local audio sidecar; playback mutes the microphone capture to prevent feedback loops.
  • TTS cleaning: strips markdown, emojis, and URLs before sending text to the TTS engine so spoken output sounds natural.
  • On-demand screen vision (voice/claw-ver): triggered by a configurable phrase (e.g. “look at monitor 2”), captures the screen with grim, converts the image to a format the agent accepts, and returns a spoken answer. Vision is never persistent: each capture is one-shot.
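
The energy VAD described in the first bullet can be illustrated with a minimal sketch. This is not the daemon's code: the class name `EnergyVAD`, every threshold value, and the `"start"` / `"end"` / `"discard"` event names are assumptions made for illustration, and frames are taken to be plain lists of PCM samples.

```python
import math
from dataclasses import dataclass


def rms(frame):
    """Root-mean-square energy of one frame of PCM samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


@dataclass
class EnergyVAD:
    # All defaults are illustrative assumptions, not the daemon's real values.
    threshold_ratio: float = 3.0   # a frame is "loud" above noise_floor * ratio
    floor_alpha: float = 0.05      # smoothing factor for the adaptive noise floor
    onset_frames: int = 3          # consecutive loud frames required to start
    min_speech_frames: int = 10    # loud frames required to arm the watchdog
    silence_frames: int = 15       # consecutive quiet frames that close speech
    noise_floor: float = 100.0
    in_speech: bool = False
    _loud_run: int = 0
    _speech_len: int = 0
    _quiet_run: int = 0

    def _reset(self):
        self.in_speech = False
        self._loud_run = self._speech_len = self._quiet_run = 0

    def process(self, frame):
        """Feed one frame; return 'start', 'end', 'discard', or None."""
        energy = rms(frame)
        loud = energy > self.noise_floor * self.threshold_ratio
        if not self.in_speech:
            if loud:
                # Onset debounce: a single loud frame (breath, click) is ignored.
                self._loud_run += 1
                if self._loud_run >= self.onset_frames:
                    self.in_speech = True
                    self._speech_len = self._loud_run
                    self._quiet_run = 0
                    return "start"
            else:
                self._loud_run = 0
                # Track ambient noise only while nobody is speaking.
                self.noise_floor += self.floor_alpha * (energy - self.noise_floor)
            return None
        if loud:
            self._speech_len += 1
            self._quiet_run = 0
            return None
        self._quiet_run += 1
        if self._quiet_run >= self.silence_frames:
            # The silence watchdog only reports an utterance once armed.
            armed = self._speech_len >= self.min_speech_frames
            self._reset()
            return "end" if armed else "discard"
        return None
```

In this sketch a burst that passes the onset debounce but never reaches `min_speech_frames` is reported as `"discard"` rather than `"end"`, so a caller could skip transcription for it.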

Additional voice tools:

| Script | Role |
|---|---|
| voice/claw-listen | Control CLI: start/stop/status the daemon |
| voice/claw-menu | wofi panel for quick actions |
| voice/claw-talk + voice/clawbar-vad.py | Push-to-talk bridge + VAD (shared with clawbar) |
| voice/jarvis | Single-screen status dashboard: tower health, insight of the day, one action |
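
The TTS cleaning step mentioned for the voice daemon (stripping markdown, emojis, and URLs before synthesis) might look roughly like the following. The function name `clean_for_tts` and the exact regular expressions are assumptions for illustration; the emoji ranges in particular are a coarse approximation rather than a complete Unicode emoji match.

```python
import re

# Markdown links: keep the visible label, drop the target.
_MD_LINK = re.compile(r"\[([^\]]+)\]\([^)]*\)")
# Bare URLs.
_URL = re.compile(r"https?://\S+")
# Coarse emoji coverage (assumption): pictographs, dingbats, variation selector.
_EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F]")
# Common markdown emphasis, heading, and code markers.
_MD_MARKS = re.compile(r"[*_`#~]+")


def clean_for_tts(text):
    """Return text with markdown, emojis, and URLs removed for speech output."""
    text = _MD_LINK.sub(r"\1", text)   # must run before URL stripping
    text = _URL.sub("", text)
    text = _EMOJI.sub("", text)
    text = _MD_MARKS.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


print(clean_for_tts("**Done** ✅ see [the docs](https://example.com/a) now"))
```

Links are rewritten before bare URLs are stripped so that the link label survives; reversing the order would let the URL pattern swallow the closing parenthesis and any text attached to it.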

The autonomy/ subsystem provides a structured decision-support and executor framework:

  • Commitment graph: cross-references incoming decisions against a knowledge graph built from git commits and project documentation, surfacing CONNECTIONS, TENSIONS, and STEELMAN arguments.
  • Learning loop: captures interaction outcomes and feeds them back into the advisor to improve future recommendations.
  • Executor hooks: fire shell or Python actions in response to advisor outputs (e.g. run a script, update a file, send a notification).
  • Focus gate: blocks or defers low-priority interruptions based on current activity state.
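
As a rough illustration of the focus gate idea above, the sketch below defers low-priority messages while the user is in a focus state and releases them afterwards. The activity names, the integer priority scale, and the `FocusGate` class are all hypothetical; the source does not describe how the real gate represents activity or priority.

```python
from collections import deque
from dataclasses import dataclass, field

# Hypothetical set of activity states that count as "do not disturb".
FOCUS_STATES = {"coding", "sleeping"}


@dataclass
class FocusGate:
    # Assumed scale: higher number means more urgent.
    min_priority_when_focused: int = 2
    deferred: deque = field(default_factory=deque)

    def submit(self, message, priority, activity):
        """Deliver now, or queue the message if the user is focused."""
        if activity in FOCUS_STATES and priority < self.min_priority_when_focused:
            self.deferred.append(message)
            return "deferred"
        return "delivered"

    def flush(self, activity):
        """Release queued messages once the user has left a focus state."""
        if activity in FOCUS_STATES:
            return []
        released = list(self.deferred)
        self.deferred.clear()
        return released


gate = FocusGate()
print(gate.submit("weekly digest ready", priority=1, activity="coding"))
print(gate.submit("disk almost full", priority=3, activity="coding"))
print(gate.flush("commuting"))
```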

The telemetry/ subsystem ingests real-world context to make the assistant activity-aware:

  • Ingests GPS coordinates and health vitals (heart rate, steps, SpO2) from Android via Health Connect.
  • Maintains geofences (home, office, gym, etc.) with configurable dwell times.
  • Infers current activity (coding, commuting, working out, sleeping) approximately every 20 minutes.
  • Persists telemetry to Postgres for historical analysis and advisor context.
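
A geofence with a dwell time, as listed above, can be sketched with a haversine distance check plus a timer that restarts whenever a fix falls outside the fence. The `Geofence` class, its field names, and the example coordinates are assumptions for illustration and are not taken from the telemetry/ code.

```python
import math
from dataclasses import dataclass
from typing import Optional

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


@dataclass
class Geofence:
    name: str
    lat: float
    lon: float
    radius_m: float
    dwell_s: float                      # time inside before the fence counts
    _entered_at: Optional[float] = None

    def update(self, lat, lon, now):
        """Feed one GPS fix; return True once inside for at least dwell_s."""
        inside = haversine_m(self.lat, self.lon, lat, lon) <= self.radius_m
        if not inside:
            self._entered_at = None     # leaving restarts the dwell timer
            return False
        if self._entered_at is None:
            self._entered_at = now
        return now - self._entered_at >= self.dwell_s


home = Geofence("home", lat=4.0, lon=-74.0, radius_m=100.0, dwell_s=60.0)
for t in (0, 30, 60):
    print(t, home.update(4.0001, -74.0, now=t))
```

The dwell requirement keeps a single stray fix, or a drive past the location, from flipping the inferred state.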

The reliability/ subsystem ensures the system stays running and recovers automatically:

  • Watchdogs: monitor key processes (daemon, Docker containers, audio sidecar) and restart them on failure.
  • Dead-man heartbeat: sends periodic signals to a NAS; missed heartbeats trigger an alert.
  • Context-collapse guard: detects when the LLM context has degraded (e.g. runaway memory) and resets the session.
  • Resilience stress-test suite: hammers the voice pipeline with synthetic inputs to validate failure modes before they occur in production.
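
The dead-man heartbeat above can be sketched from the receiving side: the monitor records each heartbeat and raises an alert once enough consecutive intervals pass without one. The `DeadManSwitch` class, the 60-second interval, and the three-miss threshold are assumptions for illustration; the source does not state the real interval or how the NAS evaluates missed beats.

```python
from dataclasses import dataclass


@dataclass
class DeadManSwitch:
    interval_s: float = 60.0   # assumed heartbeat period
    max_missed: int = 3        # assumed number of misses before alerting
    last_beat: float = 0.0

    def beat(self, now):
        """Record a heartbeat received at time `now` (seconds)."""
        self.last_beat = now

    def missed(self, now):
        """Number of whole intervals elapsed since the last heartbeat."""
        return max(0, int((now - self.last_beat) // self.interval_s))

    def should_alert(self, now):
        return self.missed(now) >= self.max_missed


switch = DeadManSwitch()
switch.beat(now=0)
print(switch.should_alert(now=130))  # two intervals missed, below threshold
print(switch.should_alert(now=180))  # three intervals missed
```

Passing `now` explicitly, rather than reading the clock inside the class, keeps the logic easy to exercise in a stress test with synthetic timestamps.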

Terminal window
git clone https://github.com/stevenvo780/kratos-jarvis.git
cd kratos-jarvis

Then follow the subsystem-specific setup guides in docs/voice.md, docs/autonomy.md, and docs/telemetry.md.

Terminal window
# Install Python deps for the listen daemon
pip install faster-whisper tqdm
# Start the continuous-listen engine
python3 voice/claw-listen-daemon.py
# Control the daemon
voice/claw-listen start
voice/claw-listen stop
voice/claw-listen status

  • Linux with PipeWire or PulseAudio
  • Hyprland compositor (Wayland)
  • Docker, with:
    • An OpenClaw agent container
    • An audio sidecar serving OpenAI-compatible /v1/audio/transcriptions and /v1/audio/speech endpoints (e.g. speaches with faster-whisper + Kokoro)
  • Ollama with a small local model for the relevance gate
  • Postgres for telemetry persistence
  • grim (optional, for screen vision)
  • wofi (optional, for the claw-menu panel)
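
To show how the Ollama requirement relates to the relevance gate, here is a minimal sketch that asks a local model for a yes/no verdict through Ollama's `/api/generate` endpoint. The prompt wording, the model tag `llama3.2:1b`, and the function names are assumptions; the source only says that a small local model decides whether a transcription is addressed to Jarvis.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local port
MODEL = "llama3.2:1b"  # hypothetical choice of "small local model"

# Hypothetical prompt; the real gate's wording is not given in the source.
PROMPT = (
    "You are a gate for a voice assistant named Jarvis. "
    "Answer with exactly YES if the utterance is addressed to the assistant, "
    "or NO if it is ambient household speech.\n\n"
    "Utterance: {text}\nAnswer:"
)


def ask_ollama(prompt, model=MODEL, url=OLLAMA_URL, timeout=10):
    """Send one non-streaming generate request and return the model's text."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


def is_addressed_to_assistant(text, ask=ask_ollama):
    """Return True if the model's answer starts with YES; empty input fails."""
    if not text.strip():
        return False
    answer = ask(PROMPT.format(text=text))
    return answer.strip().upper().startswith("YES")
```

The `ask` parameter is injectable so the gate logic can be checked without a running Ollama instance, for example `is_addressed_to_assistant("pass the salt", ask=lambda p: "NO")`.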

| Layer | Technology |
|---|---|
| Agent gateway | OpenClaw (Docker container) |
| STT | faster-whisper-large-v3-turbo (local GPU) |
| TTS | Kokoro-82M (local GPU, via speaches sidecar) |
| Relevance / triage LLM | Ollama (small local model) |
| Voice daemon | Python 3 |
| Control scripts | Bash |
| Compositor | Hyprland / Wayland |
| Panel | wofi |
| Screen capture | grim + ImageMagick |
| Telemetry DB | Postgres |
| MCP tooling | MCP tool servers (e.g. agora-mcp) |
| License | MIT |