
Native systemd + supercronic deploy for VPS monitoring #259

Merged: spalen0 merged 25 commits into main from docker on Jun 4, 2026

spalen0 (Collaborator) commented May 31, 2026

Closes #255. First PR in the VPS migration.

What this does

Adds automation/ — a single source of truth (jobs.yaml) plus a thin runner that ports every existing GH Actions cron schedule — and a native systemd deploy that runs it under supercronic on the VPS. No Docker, no compose, no Caddy: liquidity-monitoring already runs on the same box, so a container boundary buys nothing. The existing GH Actions workflows stay live for a parallel-run verification window; cutover is documented in deploy/cutover.md.

Runs alongside liquidity-monitoring's liqmon on the same host — distinct unit (yearn-monitor) and paths (/etc/yearn-monitoring, /srv/yearn-monitoring), so the two coexist.

Layout

  • automation/ — flat package alongside aave/, morpho/, utils/, …
    • jobs.yaml — SSOT: 5 profiles, 32 tasks, one cron each
    • config.py / runner.py / __main__.py — schema+parser; per-task subprocess.run with continue-on-error + one Telegram digest on failure; CLI list / render-crontab / run <profile> [--dry-run]
  • deploy/
    • systemd/yearn-monitor.service — one hardened unit. ExecStartPre requires /etc/yearn-monitoring/.env and renders the crontab from jobs.yaml; ExecStart runs supercronic. Restart=on-failure, ProtectSystem=strict, ReadWritePaths=/srv/cache. No /healthz watchdog — no daemon to wedge.
    • install.sh — idempotent provisioning: uv + Python 3.12 + supercronic (SHA-pinned), clone → /srv/yearn-monitoring, uv sync --frozen, create /srv/cache, install the unit.
    • runbook.md — ops (status, logs, manual runs, updates, failure table).
    • cutover.md — GH Actions → VPS migration playbook (shadow week + flip).
  • tests/test_automation_{config,render,runner}.py — 21 tests.
  • pyproject.toml — adds pyyaml, registers automation. uv.lock synced.

Secrets — plain root-owned .env

This box signs nothing (unlike liquidity's liqmon), so secrets are a plain /etc/yearn-monitoring/.env (0640 root:<deploy-user>) loaded via EnvironmentFile. The unit refuses to start without it. No sops/age ceremony.

Caching — one CACHE_DIR

All on-disk dedupe/cache state resolves against a single CACHE_DIR (set to /srv/cache by the unit; unset locally → repo CWD). utils/cache.py gains cache_path(); the selector cache, stuck-triggers JSON, and the maple/3jane caches (which previously hardcoded repo-relative paths and would have failed writing under the read-only hardened unit) all route through it. jobs.yaml no longer hardcodes /srv/cache paths.
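
As a rough sketch of the resolution rule described above (not the actual utils/cache.py; only the names `CACHE_DIR` and `cache_path` come from this PR, the body is an assumption), a helper built on `os.path.join` gives both the "unset means CWD" default and the absolute-path override:

```python
import os


def cache_path(name: str) -> str:
    """Resolve a cache file name against CACHE_DIR, read at call time."""
    # Unset CACHE_DIR -> empty base -> the name stays relative to the CWD.
    base = os.environ.get("CACHE_DIR", "")
    # os.path.join discards `base` when `name` is absolute,
    # so an absolute override still wins.
    return os.path.join(base, name)


if __name__ == "__main__":
    os.environ["CACHE_DIR"] = "/srv/cache"
    print(cache_path("cache-id-daily.txt"))
    print(cache_path("/tmp/override.txt"))
```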

Deployment steps (fresh VPS)

# 1. Provision (installs uv/Python/supercronic, clones repo, venv, systemd unit)
sudo bash /srv/yearn-monitoring/deploy/install.sh
#    …or curl it — see the header of deploy/install.sh.

# 2. Drop the env (copy from .env.example, fill in RPC/Telegram/API keys)
sudo install -m 640 -o root -g <deploy-user> /dev/stdin /etc/yearn-monitoring/.env   # paste, Ctrl-D

# 3. Start it
sudo systemctl enable --now yearn-monitor
systemctl status yearn-monitor

# 4. Verify
cd /srv/yearn-monitoring && uv run python -m automation render-crontab   # expect 5 lines
journalctl -u yearn-monitor -f
uv run python -m automation run hourly --dry-run                         # dry-run, no sends

Updates are git pull --ff-only && sudo systemctl restart yearn-monitor (add uv sync --frozen for dependency changes).

Cutover steps (GH Actions → VPS)

Full playbook in deploy/cutover.md. In short:

  1. Shadow week — deploy with all TELEGRAM_CHAT_ID_* funneled to one shadow chat and all TELEGRAM_TOPIC_ID_* removed (works around telegram.py's per-protocol chat_id having no DEFAULT fallback). GH Actions keeps posting to the real channels; compare for ~7 days. This also warms /srv/cache, so no cold-start alert burst.
  2. Flip — restore the real channels in the env + systemctl restart, confirm one good real tick, then disable the GH crons:
    gh workflow disable hourly.yml daily.yml weekly.yml multisig-checker.yml --repo yearn/monitoring
    (Go live before disabling GitHub: duplicate alerts are deduped, a gap is not.)
  3. Rollback — gh workflow enable … + stop/quiet the unit. Both run in parallel again.
  4. Cleanup — keep workflows disabled (not deleted) ≥30 days; later strip the now-dead actions/cache steps from _run-monitoring.yml.

Verification (local)

| Check | Result |
| --- | --- |
| uv run python -m automation render-crontab | 5 flock-wrapped lines |
| uv run --extra dev pytest | 442 passed, 4 skipped |
| uv run --extra dev ruff check . / ruff format --check . | clean |
| uv lock --check | in sync |
| bash -n deploy/install.sh | clean |

End-to-end systemd behavior is exercised during provisioning per deploy/runbook.md, not in CI.
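
For reviewers unsure what a "flock-wrapped" crontab line looks like, here is a minimal sketch of rendering one line per profile. It is an illustration only: the function name `render_crontab`, the lock-file location, the exact command string, and the schedules in the demo are all assumptions, not what automation/ actually emits.

```python
import shlex


def render_crontab(profiles: dict[str, str], lock_dir: str = "/tmp") -> str:
    """Render one crontab line per profile, each guarded by `flock -n`."""
    lines = []
    for name, schedule in profiles.items():
        lock = f"{lock_dir}/{name}.lock"  # hypothetical lock-file path
        # flock -n exits immediately if the previous run still holds the
        # lock, so two runs of the same profile never overlap.
        cmd = (
            f"flock -n {shlex.quote(lock)} "
            f"python -m automation run {shlex.quote(name)}"
        )
        lines.append(f"{schedule} {cmd}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder schedules, not the ones in jobs.yaml.
    print(render_crontab({"hourly": "5 * * * *", "daily": "0 6 * * *"}), end="")
```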

Test plan

  • CI green
  • Reviewer: uv run python -m automation render-crontab matches the 5 expected lines
  • Reviewer: skim deploy/systemd/yearn-monitor.service, deploy/runbook.md, deploy/cutover.md
  • After merge: provision the VPS, run the shadow week, then flip per deploy/cutover.md

🤖 Generated with Claude Code

spalen0 and others added 2 commits May 31, 2026 15:15
Add a docker-compose stack that runs every existing GH Actions cron
schedule in a single supercronic container. Closes #255.

- automation/ — flat package: jobs.yaml SSOT + thin runner that executes
  a profile's tasks as subprocesses, posting a single Telegram digest
  on failure. CLI exposes list / render-crontab / run <profile>
  [--dry-run].
- docker/ — multi-stage uv-based Dockerfile (python 3.12-slim,
  supercronic v0.2.34 with pinned SHA, non-root app user, tini as
  PID 1). Single-service compose with a named cache volume; no
  autoheal sidecar.
- jobs.yaml ports hourly, daily, weekly, multisig, and
  yearn-stuck-triggers profiles from the existing workflows; each
  invocation is wrapped in flock -n to prevent overlapping runs.
- Tests cover jobs.yaml parsing/validation, render-crontab output
  shape, argv construction, continue-on-error semantics, spawn
  failure handling, and Telegram digest formatting.

Follow-ups in separate PRs: #256 (python 3.14), #257 (VPS install +
GHCR publishing), #258 (cutover from GH Actions + auto-update).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Drop the docker/ container stack in favor of running supercronic directly
under systemd on the VPS (liquidity-monitoring runs on the same box, so a
container boundary buys nothing here). automation/ is unchanged — same
jobs.yaml SSOT, runner, and CLI; only the execution substrate moves from
container to host.

- remove docker/ (Dockerfile, docker-compose.yml, entrypoint.sh, .dockerignore)
- deploy/systemd/yearn-monitor.service — one hardened unit; ExecStartPre
  decrypts secrets (sops/age) to /etc/yearn-monitoring/.env and renders the
  crontab from jobs.yaml, ExecStart runs supercronic. Restart=on-failure;
  no /healthz watchdog (no daemon to wedge).
- deploy/install.sh — idempotent fresh-VPS provisioning: uv + Python 3.12 +
  supercronic (SHA-pinned) + sops + age, clone, uv sync, /srv/cache, unit.
- deploy/runbook.md — ops: status, logs, manual runs, git-pull updates,
  secret + age-key rotation, host failover, failure table.
- deploy/secrets/ + .sops.yaml — age/sops scaffolding (gitignore, example,
  README) for parity with liquidity-monitoring. prod.env.enc is created by
  the operator after adding their age key.
- automation/: scrub Docker references from jobs.yaml, README, __main__.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
spalen0 changed the title from "Containerize monitoring scripts for Hetzner deployment" to "Native systemd + supercronic deploy for VPS monitoring" on Jun 2, 2026
spalen0 and others added 23 commits June 2, 2026 21:39
This box runs no signing key (unlike liquidity-monitoring), so encrypting the
env in-repo was ceremony with no funds at stake. The operator drops the env at
/etc/yearn-monitoring/.env (0640 root:<deploy-user>) once; the unit loads it via
EnvironmentFile and refuses to start without it.

- remove .sops.yaml and deploy/secrets/
- service: replace the sops ExecStartPre decrypt with a presence guard
- install.sh: drop sops + age installation; checklist now says "drop .env"
- runbook: replace secret/age-key rotation with edit-in-place + restart

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
# Conflicts:
#	pyproject.toml
#	uv.lock
The merge committed main's uv.lock, where pyyaml was only present transitively.
pyproject declares it directly, so regenerate the lock to match (keeps
`uv lock --check` green).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…h it

Now that the runner has a persistent filesystem, every on-disk dedupe/cache
file resolves against one CACHE_DIR knob (set to /srv/cache by the systemd
unit, unset locally → repo CWD as before) instead of per-file absolute paths
spelled out in jobs.yaml.

- utils/cache.py: add CACHE_DIR + cache_path(); wrap cache/nonces/morpho paths.
  Absolute overrides still win (os.path.join semantics).
- utils/calldata/decoder.py: route the selector cache through cache_path so it
  lands in /srv/cache. Previously it defaulted to selector-cache.txt in the repo
  dir, which is read-only under the hardened unit (ProtectSystem=strict) — the
  write silently failed and the cache never persisted. (Plan item #1.)
- yearn/check_stuck_triggers.py: DEFAULT_CACHE_FILE resolves under CACHE_DIR.
- maple/main.py, 3jane/main.py: these hardcoded their own "cache-id.txt" and so
  bypassed CACHE_DIR entirely — would fail writing to the read-only repo on the
  VPS. Route them through cache_path too.
- automation/jobs.yaml: drop the absolute /srv/cache paths; only daily keeps a
  basename override (cache-id-daily.txt) to stay isolated from the hourly file.
- deploy/: set CACHE_DIR=/srv/cache in the unit; update runbook.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Operator playbook for moving the four scheduled workflows (hourly/daily/weekly/
multisig) off GitHub Actions onto the yearn-monitor unit:

- shadow week with all TELEGRAM_CHAT_ID_* funneled to one shadow chat and
  TELEGRAM_TOPIC_ID_* dropped (works around telegram.py's per-protocol chat_id
  having no DEFAULT fallback), which also warms /srv/cache before the flip
- go-live-before-disabling-GitHub flip (duplicates are deduped; a gap is not)
- rollback via `gh workflow enable` + quiet the VPS
- cleanup: keep workflows disabled 30d, strip the now-dead actions/cache steps

Link it from runbook.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
When set, every alert from every protocol is routed to a single chat via
the default bot, prefixed with a [protocol] label and with no topic
threading, bypassing both topic and legacy per-protocol routing. Lets the
whole fleet be sent to one dummy group for staging/comparison without
touching the production TELEGRAM_TOPIC_ID_* / per-protocol vars.

Extract the shared POST into _post_message to avoid duplication. Document
in .env.example and deploy/runbook.md, add a unit test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rnald cap

- Service unit yearn-monitor → monitoring (file, SyslogIdentifier, all docs).
- Config dir /etc/yearn-monitoring → /etc/monitoring.
- REPO_DIR default → /srv/monitoring (matches the GitHub repo name); the unit's
  WorkingDirectory/REPO_ROOT/venv PATH are now templated from it via __REPO_DIR__
  so a single REPO_DIR override rewrites the unit.
- install.sh: add journald persistence + SystemMaxUse cap drop-in (JOURNAL_MAX_USE),
  and a Grafana Cloud log-shipping pointer in the final steps.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- integration-check.sh: read-only smoke test that monitoring + liquidity
  coexist on one host — verifies path/unit/port isolation, masked units,
  shared git-credential reachability for both GitHub repos, liquidity
  /healthz, and WARNs on shared RPC providers / Telegram bot / tight
  RAM-disk. Exit 1 on any hard collision.
- throwaway-vps.sh: one-shot provisioner that clones both repos, runs each
  installer, and brings them up in safe mode (monitoring LOG_LEVEL=DEBUG →
  no Telegram; liquidity SHADOW_MODE=true → dry-run, no signing/push, with
  prod.env.enc removed so it boots without an age key), then runs the
  integration check. `… teardown` undoes it on a reused box.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
integration-check.sh and throwaway-vps.sh reference the private liquidity
signer daemon's internals (shadow mode, SOPS secrets, the 127.0.0.1:8080
surface, the private repo URL). They don't belong in this PUBLIC repo —
they now live in tapired/liquidity-monitoring/deploy/, whose operators are
the ones running coexistence rehearsals.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Fix Safe nonce dedupe and Telegram plain text sends
Capture each subprocess's stdout/stderr and surface a short error tail
(last 4 lines / 500 chars, stderr-preferred) in the failure digest, so
alerts are actionable without an SSH/journalctl round-trip. Full output
is still re-emitted to the daemon logs on failure.

The tail is rendered inside a Markdown V1 fenced code block with
backticks neutralized, so tracebacks containing _ * [ can't break
parsing and silently drop the whole digest.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
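
The error-tail rule in the commit above (last 4 lines / 500 chars, stderr preferred, backticks neutralized inside a fenced block) might look roughly like this. The helper names `error_tail` and `fenced` and the choice of `'` as the replacement character are assumptions for illustration; the commit does not say how the runner neutralizes backticks.

```python
FENCE = "`" * 3  # built at runtime so this example contains no literal fence


def error_tail(stdout: str, stderr: str, max_lines: int = 4, max_chars: int = 500) -> str:
    """Return a short tail of the output, preferring stderr when it has content."""
    source = stderr if stderr.strip() else stdout
    lines = source.strip().splitlines()
    tail = "\n".join(lines[-max_lines:])
    # Cap the length from the end, where the actual error usually sits.
    return tail[-max_chars:]


def fenced(tail: str) -> str:
    """Wrap the tail in a code fence, replacing backticks so it cannot close early."""
    safe = tail.replace("`", "'")
    return f"{FENCE}\n{safe}\n{FENCE}"


if __name__ == "__main__":
    err = "Traceback (most recent call last):\n  ...\nKeyError: `chat_id`"
    print(fenced(error_tail("", err)))
```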
Fold the weekly and yearn-stuck-triggers profiles into the surviving
profiles and turn off two monitors:

- Move yearn-check-endorsed and yearn-check-timelock-delay from the
  weekly profile into daily (weekly profile removed).
- Move yearn-check-stuck-triggers into hourly as enabled: false
  (yearn-stuck-triggers profile removed).
- Disable yearn-check-shadow-debt in daily.
- Shift hourly cron 11 -> 5 (needs systemctl restart to re-render).
- Update docs and the repo-jobs.yaml test to the 3 remaining profiles.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@spalen0 spalen0 merged commit 6094d41 into main Jun 4, 2026
2 checks passed

Development

Successfully merging this pull request may close these issues.

Dockerize monitoring scripts (first PR: packaging + scheduler)
