Native systemd + supercronic deploy for VPS monitoring by spalen0 · Pull Request #259 · yearn/monitoring

spalen0 · 2026-05-31T15:15:34Z

Closes #255. First PR in the VPS migration.

What this does

Adds automation/ — a single source of truth (jobs.yaml) plus a thin runner that ports every existing GH Actions cron schedule — and a native systemd deploy that runs it under supercronic on the VPS. No Docker, no compose, no Caddy: liquidity-monitoring already runs on the same box, so a container boundary buys nothing. The existing GH Actions workflows stay live for a parallel-run verification window; cutover is documented in deploy/cutover.md.

Runs alongside liquidity-monitoring's liqmon on the same host — distinct unit (yearn-monitor) and paths (/etc/yearn-monitoring, /srv/yearn-monitoring), so the two coexist.

Layout

automation/ — flat package alongside aave/, morpho/, utils/, …
- jobs.yaml — SSOT: 5 profiles, 32 tasks, one cron each
- config.py / runner.py / __main__.py — schema+parser; per-task subprocess.run with continue-on-error + one Telegram digest on failure; CLI list / render-crontab / run <profile> [--dry-run]
deploy/
- systemd/yearn-monitor.service — one hardened unit. ExecStartPre requires /etc/yearn-monitoring/.env and renders the crontab from jobs.yaml; ExecStart runs supercronic. Restart=on-failure, ProtectSystem=strict, ReadWritePaths=/srv/cache. No /healthz watchdog — no daemon to wedge.
- install.sh — idempotent provisioning: uv + Python 3.12 + supercronic (SHA-pinned), clone → /srv/yearn-monitoring, uv sync --frozen, create /srv/cache, install the unit.
- runbook.md — ops (status, logs, manual runs, updates, failure table).
- cutover.md — GH Actions → VPS migration playbook (shadow week + flip).
tests/test_automation_{config,render,runner}.py — 21 tests.
pyproject.toml — adds pyyaml, registers automation. uv.lock synced.
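
To make the jobs.yaml to crontab step concrete, here is a minimal sketch of how one flock-wrapped line per profile could be rendered. The `PROFILES` mapping, the schedule values, and the lock directory are illustrative assumptions, not the schema in automation/config.py, which may differ.

```python
import shlex

# Hypothetical stand-in for the parsed jobs.yaml: profile name -> cron expression.
# The schedules below are placeholders, not the repo's real values.
PROFILES = {
    "hourly": "5 * * * *",
    "daily": "0 6 * * *",
}


def render_crontab(profiles, lock_dir="/tmp"):
    """Return crontab text with one flock-wrapped line per profile."""
    lines = []
    for name, schedule in sorted(profiles.items()):
        lock = f"{lock_dir}/{name}.lock"  # assumed lock location
        # flock -n exits immediately if the previous run still holds the lock,
        # so overlapping runs of the same profile are skipped rather than queued.
        command = (
            f"flock -n {shlex.quote(lock)} "
            f"python -m automation run {shlex.quote(name)}"
        )
        lines.append(f"{schedule} {command}")
    return "\n".join(lines) + "\n"
```

Calling `render_crontab(PROFILES)` yields text that supercronic could read as its crontab; keeping the schedule and the command in one rendered file is what lets jobs.yaml stay the single source of truth.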

Secrets — plain root-owned `.env`

This box signs nothing (unlike liquidity's liqmon), so secrets are a plain /etc/yearn-monitoring/.env (0640 root:<deploy-user>) loaded via EnvironmentFile. The unit refuses to start without it. No sops/age ceremony.

Caching — one `CACHE_DIR`

All on-disk dedupe/cache state resolves against a single CACHE_DIR (set to /srv/cache by the unit; unset locally → repo CWD). utils/cache.py gains cache_path(); the selector cache, stuck-triggers JSON, and the maple/3jane caches (which previously hardcoded repo-relative paths and would have failed writing under the read-only hardened unit) all route through it. jobs.yaml no longer hardcodes /srv/cache paths.
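
The resolution rule described above can be sketched in a few lines. This is an assumption about the shape of `cache_path()`, not the code in utils/cache.py; the `cache_dir` parameter is hypothetical and exists only to make the sketch easy to exercise.

```python
import os


def cache_path(name, cache_dir=None):
    """Resolve a cache file name against CACHE_DIR.

    With CACHE_DIR unset the name is returned unchanged, so it resolves
    against the current working directory. An absolute `name` always wins,
    because os.path.join discards every component before an absolute one.
    """
    if cache_dir is None:
        cache_dir = os.environ.get("CACHE_DIR", "")
    return os.path.join(cache_dir, name)
```

On a POSIX host, `cache_path("cache-id.txt", "/srv/cache")` resolves under /srv/cache, while `cache_path("/tmp/override.txt", "/srv/cache")` keeps the absolute override.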

Deployment steps (fresh VPS)

# 1. Provision (installs uv/Python/supercronic, clones repo, venv, systemd unit)
sudo bash /srv/yearn-monitoring/deploy/install.sh
#    …or curl it — see the header of deploy/install.sh.

# 2. Drop the env (copy from .env.example, fill in RPC/Telegram/API keys)
sudo install -m 640 -o root -g <deploy-user> /dev/stdin /etc/yearn-monitoring/.env   # paste, Ctrl-D

# 3. Start it
sudo systemctl enable --now yearn-monitor
systemctl status yearn-monitor

# 4. Verify
cd /srv/yearn-monitoring && uv run python -m automation render-crontab   # expect 5 lines
journalctl -u yearn-monitor -f
uv run python -m automation run hourly --dry-run                         # dry-run, no sends

Updates are git pull --ff-only && sudo systemctl restart yearn-monitor (add uv sync --frozen for dependency changes).

Cutover steps (GH Actions → VPS)

Full playbook in deploy/cutover.md. In short:

Shadow week — deploy with all TELEGRAM_CHAT_ID_* funneled to one shadow chat and all TELEGRAM_TOPIC_ID_* removed (works around telegram.py's per-protocol chat_id having no DEFAULT fallback). GH Actions keeps posting to the real channels; compare for ~7 days. This also warms /srv/cache, so no cold-start alert burst.
Flip — restore the real channels in the env + systemctl restart, confirm one good real tick, then disable the GH crons:
```
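# gh workflow disable accepts one workflow per invocation, so loop over the files
for wf in hourly.yml daily.yml weekly.yml multisig-checker.yml; do gh workflow disable "$wf" --repo yearn/monitoring; done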
```
(Go live before disabling GitHub: duplicate alerts are deduped, a gap is not.)
Rollback — gh workflow enable … + stop/quiet the unit. Both run in parallel again.
Cleanup — keep workflows disabled (not deleted) ≥30 days; later strip the now-dead actions/cache steps from _run-monitoring.yml.

Verification (local)

| Check | Result |
| --- | --- |
| `uv run python -m automation render-crontab` | 5 flock-wrapped lines |
| `uv run --extra dev pytest` | 442 passed, 4 skipped |
| `uv run --extra dev ruff check .` / `ruff format --check .` | clean |
| `uv lock --check` | in sync |
| `bash -n deploy/install.sh` | clean |

End-to-end systemd behavior is exercised during provisioning per deploy/runbook.md, not in CI.

Test plan

CI green
Reviewer: uv run python -m automation render-crontab matches the 5 expected lines
Reviewer: skim deploy/systemd/yearn-monitor.service, deploy/runbook.md, deploy/cutover.md
After merge: provision the VPS, run the shadow week, then flip per deploy/cutover.md

🤖 Generated with Claude Code

Add a docker-compose stack that runs every existing GH Actions cron schedule in a single supercronic container. Closes #255.

- automation/: flat package with a jobs.yaml SSOT + thin runner that executes a profile's tasks as subprocesses, posting a single Telegram digest on failure. CLI exposes list / render-crontab / run <profile> [--dry-run].
- docker/: multi-stage uv-based Dockerfile (python 3.12-slim, supercronic v0.2.34 with pinned SHA, non-root app user, tini as PID 1). Single-service compose with a named cache volume; no autoheal sidecar.
- jobs.yaml ports hourly, daily, weekly, multisig, and yearn-stuck-triggers profiles from the existing workflows; each invocation is wrapped in flock -n to prevent overlapping runs.
- Tests cover jobs.yaml parsing/validation, render-crontab output shape, argv construction, continue-on-error semantics, spawn failure handling, and Telegram digest formatting.

Follow-ups in separate PRs: #256 (python 3.14), #257 (VPS install + GHCR publishing), #258 (cutover from GH Actions + auto-update).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
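
The continue-on-error and single-digest behavior described in this commit can be sketched as follows. This is a simplified illustration under stated assumptions, not the code in automation/runner.py: the `(name, argv)` task shape, the `send_digest` callback, and the digest wording are all hypothetical.

```python
import subprocess
import sys


def run_profile(tasks, send_digest=None):
    """Run every task in order; collect failures and report them once.

    `tasks` is a list of (name, argv) pairs. A failing task never stops
    the loop, and at most one digest is sent per profile run.
    """
    if send_digest is None:
        # Fallback for local use: write the digest to stderr instead of Telegram.
        def send_digest(text):
            print(text, file=sys.stderr)

    failures = []
    for name, argv in tasks:
        try:
            result = subprocess.run(argv)
        except OSError as exc:
            # Spawn failure, e.g. the executable does not exist.
            failures.append((name, f"spawn failed: {exc}"))
            continue
        if result.returncode != 0:
            failures.append((name, f"exit code {result.returncode}"))

    if failures:
        body = "\n".join(f"- {name}: {reason}" for name, reason in failures)
        send_digest(f"{len(failures)} task(s) failed\n{body}")
    return failures
```

Collecting failures and sending one message at the end keeps a bad run from producing one alert per broken task.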

Drop the docker/ container stack in favor of running supercronic directly under systemd on the VPS (liquidity-monitoring runs on the same box, so a container boundary buys nothing here). automation/ is unchanged: same jobs.yaml SSOT, runner, and CLI; only the execution substrate moves from container to host.

- remove docker/ (Dockerfile, docker-compose.yml, entrypoint.sh, .dockerignore)
- deploy/systemd/yearn-monitor.service: one hardened unit; ExecStartPre decrypts secrets (sops/age) to /etc/yearn-monitoring/.env and renders the crontab from jobs.yaml, ExecStart runs supercronic. Restart=on-failure; no /healthz watchdog (no daemon to wedge).
- deploy/install.sh: idempotent fresh-VPS provisioning: uv + Python 3.12 + supercronic (SHA-pinned) + sops + age, clone, uv sync, /srv/cache, unit.
- deploy/runbook.md: ops (status, logs, manual runs, git-pull updates, secret + age-key rotation, host failover, failure table).
- deploy/secrets/ + .sops.yaml: age/sops scaffolding (gitignore, example, README) for parity with liquidity-monitoring. prod.env.enc is created by the operator after adding their age key.
- automation/: scrub Docker references from jobs.yaml, README, __main__.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

This box runs no signing key (unlike liquidity-monitoring), so encrypting the env in-repo was ceremony with no funds at stake. The operator drops the env at /etc/yearn-monitoring/.env (0640 root:<deploy-user>) once; the unit loads it via EnvironmentFile and refuses to start without it.

- remove .sops.yaml and deploy/secrets/
- service: replace the sops ExecStartPre decrypt with a presence guard
- install.sh: drop sops + age installation; checklist now says "drop .env"
- runbook: replace secret/age-key rotation with edit-in-place + restart

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

# Conflicts: # pyproject.toml # uv.lock

The merge committed main's uv.lock, where pyyaml was only present transitively. pyproject declares it directly, so regenerate the lock to match (keeps `uv lock --check` green).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…h it

Now that the runner has a persistent filesystem, every on-disk dedupe/cache file resolves against one CACHE_DIR knob (set to /srv/cache by the systemd unit, unset locally → repo CWD as before) instead of per-file absolute paths spelled out in jobs.yaml.

- utils/cache.py: add CACHE_DIR + cache_path(); wrap cache/nonces/morpho paths. Absolute overrides still win (os.path.join semantics).
- utils/calldata/decoder.py: route the selector cache through cache_path so it lands in /srv/cache. Previously it defaulted to selector-cache.txt in the repo dir, which is read-only under the hardened unit (ProtectSystem=strict); the write silently failed and the cache never persisted. (Plan item #1.)
- yearn/check_stuck_triggers.py: DEFAULT_CACHE_FILE resolves under CACHE_DIR.
- maple/main.py, 3jane/main.py: these hardcoded their own "cache-id.txt" and so bypassed CACHE_DIR entirely, which would fail writing to the read-only repo on the VPS. Route them through cache_path too.
- automation/jobs.yaml: drop the absolute /srv/cache paths; only daily keeps a basename override (cache-id-daily.txt) to stay isolated from the hourly file.
- deploy/: set CACHE_DIR=/srv/cache in the unit; update runbook.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Operator playbook for moving the four scheduled workflows (hourly/daily/weekly/multisig) off GitHub Actions onto the yearn-monitor unit:

- shadow week with all TELEGRAM_CHAT_ID_* funneled to one shadow chat and TELEGRAM_TOPIC_ID_* dropped (works around telegram.py's per-protocol chat_id having no DEFAULT fallback), which also warms /srv/cache before the flip
- go-live-before-disabling-GitHub flip (duplicates are deduped; a gap is not)
- rollback via `gh workflow enable` + quiet the VPS
- cleanup: keep workflows disabled 30d, strip the now-dead actions/cache steps

Link it from runbook.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

When set, every alert from every protocol is routed to a single chat via the default bot, prefixed with a [protocol] label and with no topic threading, bypassing both topic and legacy per-protocol routing. Lets the whole fleet be sent to one dummy group for staging/comparison without touching the production TELEGRAM_TOPIC_ID_* / per-protocol vars. Extract the shared POST into _post_message to avoid duplication. Document in .env.example and deploy/runbook.md, add a unit test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
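
The override routing described in this commit might look roughly like the sketch below. The variable name `TELEGRAM_OVERRIDE_CHAT_ID` is hypothetical (the commit text here does not name the real one), and the fallback branch is a simplification that assumes the per-protocol variables end in the upper-cased protocol name.

```python
import os

# Hypothetical name for the "route everything to one chat" switch.
OVERRIDE_ENV = "TELEGRAM_OVERRIDE_CHAT_ID"


def resolve_route(protocol, text, env=None):
    """Decide where an alert goes and what text is sent."""
    env = os.environ if env is None else env
    override = env.get(OVERRIDE_ENV)
    if override:
        # Single chat, no topic threading, message labelled with its protocol.
        return {"chat_id": override, "topic_id": None, "text": f"[{protocol}] {text}"}
    # Simplified normal routing: per-protocol chat and optional topic.
    suffix = protocol.upper()
    return {
        "chat_id": env.get(f"TELEGRAM_CHAT_ID_{suffix}"),
        "topic_id": env.get(f"TELEGRAM_TOPIC_ID_{suffix}"),
        "text": text,
    }
```

Because the override is checked first, the production topic and per-protocol variables can stay in the env untouched while a staging run is pointed at a dummy group.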

…rnald cap

- Service unit yearn-monitor → monitoring (file, SyslogIdentifier, all docs).
- Config dir /etc/yearn-monitoring → /etc/monitoring.
- REPO_DIR default → /srv/monitoring (matches the GitHub repo name); the unit's WorkingDirectory/REPO_ROOT/venv PATH are now templated from it via __REPO_DIR__ so a single REPO_DIR override rewrites the unit.
- install.sh: add journald persistence + SystemMaxUse cap drop-in (JOURNAL_MAX_USE), and a Grafana Cloud log-shipping pointer in the final steps.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- integration-check.sh: read-only smoke test that monitoring + liquidity coexist on one host. It verifies path/unit/port isolation, masked units, shared git-credential reachability for both GitHub repos, liquidity /healthz, and WARNs on shared RPC providers / Telegram bot / tight RAM-disk. Exit 1 on any hard collision.
- throwaway-vps.sh: one-shot provisioner that clones both repos, runs each installer, and brings them up in safe mode (monitoring LOG_LEVEL=DEBUG → no Telegram; liquidity SHADOW_MODE=true → dry-run, no signing/push, with prod.env.enc removed so it boots without an age key), then runs the integration check. `… teardown` undoes it on a reused box.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

integration-check.sh and throwaway-vps.sh reference the private liquidity signer daemon's internals (shadow mode, SOPS secrets, the 127.0.0.1:8080 surface, the private repo URL). They don't belong in this PUBLIC repo; they now live in tapired/liquidity-monitoring/deploy/, whose operators are the ones running coexistence rehearsals.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Fix Safe nonce dedupe and Telegram plain text sends

Fix stale governance proposal alerts

… pull

Capture each subprocess's stdout/stderr and surface a short error tail (last 4 lines / 500 chars, stderr-preferred) in the failure digest, so alerts are actionable without an SSH/journalctl round-trip. Full output is still re-emitted to the daemon logs on failure. The tail is rendered inside a Markdown V1 fenced code block with backticks neutralized, so tracebacks containing _ * [ can't break parsing and silently drop the whole digest.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
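
A minimal sketch of the error-tail logic described above, using the limits the commit states (last 4 lines, 500 characters, stderr preferred). The function names and the choice to replace backticks with apostrophes are assumptions; the runner may neutralize them differently.

```python
FENCE = "`" * 3  # built at runtime so this example contains no literal fence


def error_tail(stdout, stderr, max_lines=4, max_chars=500):
    """Return a short tail of the captured output, preferring stderr."""
    source = stderr if stderr.strip() else stdout
    lines = source.strip().splitlines()[-max_lines:]
    tail = "\n".join(lines)[-max_chars:]
    # A backtick inside the tail could close the code block early and
    # expose characters like _ * [ to the Markdown parser, so swap it out.
    return tail.replace("`", "'")


def render_failure(name, stdout, stderr):
    """Format one failed task for the digest with its tail in a code block."""
    return f"{name} failed\n{FENCE}\n{error_tail(stdout, stderr)}\n{FENCE}"
```

Keeping the tail inside a code block means a traceback full of underscores or brackets is shown literally instead of being parsed as markup.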

Fold the weekly and yearn-stuck-triggers profiles into the surviving profiles and turn off two monitors:

- Move yearn-check-endorsed and yearn-check-timelock-delay from the weekly profile into daily (weekly profile removed).
- Move yearn-check-stuck-triggers into hourly as enabled: false (yearn-stuck-triggers profile removed).
- Disable yearn-check-shadow-debt in daily.
- Shift hourly cron 11 -> 5 (needs systemctl restart to re-render).
- Update docs and the repo-jobs.yaml test to the 3 remaining profiles.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

spalen0 and others added 2 commits May 31, 2026 15:15

spalen0 changed the title ~~Containerize monitoring scripts for Hetzner deployment~~ Native systemd + supercronic deploy for VPS monitoring Jun 2, 2026

spalen0 and others added 23 commits June 2, 2026 21:39

Merge remote-tracking branch 'origin/main' into docker

d956067

# Conflicts: # pyproject.toml # uv.lock

chore: api3 markets

0e23d06

docs: expalin automation channel

6df19c7

chore: cleanup

5364289

chore: remove rseth check

1c5cc6f

Fix USTB cache path under systemd

71aefb1

Fix Safe nonce dedupe and Telegram plain text sends

899cbd7

Merge pull request #262 from yearn/fix/usdai-safe-alert-dedupe-telegram

05177a7

Fix Safe nonce dedupe and Telegram plain text sends

Fix stale governance proposal alerts

621a6c9

Merge pull request #263 from yearn/fix/proposal-monitor-stale-alerts

0e4d374

Fix stale governance proposal alerts

deploy: auto-sync the checkout via the multisig profile's pre-run git…

8fce342

… pull

chore: lower syrup col ratio

5b5d373

chore: retire migrated GitHub Actions workflows (#264)

315f0eb

spalen0 merged commit 6094d41 into main Jun 4, 2026
2 checks passed


spalen0 merged 25 commits into main from docker.
