Add a docker-compose stack that runs every existing GH Actions cron schedule in a single supercronic container. Closes #255.

- automation/: flat package with a jobs.yaml SSOT plus a thin runner that executes a profile's tasks as subprocesses, posting a single Telegram digest on failure. CLI exposes list / render-crontab / run <profile> [--dry-run].
- docker/: multi-stage uv-based Dockerfile (python 3.12-slim, supercronic v0.2.34 with pinned SHA, non-root app user, tini as PID 1). Single-service compose with a named cache volume; no autoheal sidecar.
- jobs.yaml ports the hourly, daily, weekly, multisig, and yearn-stuck-triggers profiles from the existing workflows; each invocation is wrapped in flock -n to prevent overlapping runs.
- Tests cover jobs.yaml parsing/validation, render-crontab output shape, argv construction, continue-on-error semantics, spawn failure handling, and Telegram digest formatting.

Follow-ups in separate PRs: #256 (python 3.14), #257 (VPS install + GHCR publishing), #258 (cutover from GH Actions + auto-update).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
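The continue-on-error and spawn-failure behavior described for the runner can be illustrated with a minimal sketch. This is not the repository's `runner.py`: the function name `run_profile`, the shape of the `tasks` mapping, and the failure strings are assumptions made for illustration, and the Telegram digest is reduced to a returned list of failure lines.

```python
import subprocess
import sys


def run_profile(tasks: dict[str, list[str]]) -> list[str]:
    """Run every task in order and collect failures instead of stopping.

    `tasks` maps a task name to its argv (hypothetical shape, not the
    real jobs.yaml schema). Returns one line per failed task.
    """
    failures: list[str] = []
    for name, argv in tasks.items():
        try:
            proc = subprocess.run(argv, capture_output=True, text=True)
        except OSError as exc:
            # The process could not be started at all (missing binary,
            # permission error, ...). Record it and keep going.
            failures.append(f"{name}: spawn failed ({exc})")
            continue
        if proc.returncode != 0:
            # A non-zero exit does not abort the remaining tasks.
            failures.append(f"{name}: exit {proc.returncode}")
    return failures


if __name__ == "__main__":
    demo = {
        "ok": [sys.executable, "-c", "pass"],
        "bad": [sys.executable, "-c", "import sys; sys.exit(3)"],
    }
    # A real runner would send one digest if this list is non-empty.
    print(run_profile(demo))
```

Collecting failures and reporting once at the end is what lets a single digest cover a whole profile run, rather than one alert per failing task.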
Drop the docker/ container stack in favor of running supercronic directly under systemd on the VPS (liquidity-monitoring runs on the same box, so a container boundary buys nothing here). automation/ is unchanged: same jobs.yaml SSOT, runner, and CLI; only the execution substrate moves from container to host.

- remove docker/ (Dockerfile, docker-compose.yml, entrypoint.sh, .dockerignore)
- deploy/systemd/yearn-monitor.service: one hardened unit; ExecStartPre decrypts secrets (sops/age) to /etc/yearn-monitoring/.env and renders the crontab from jobs.yaml, ExecStart runs supercronic. Restart=on-failure; no /healthz watchdog (no daemon to wedge).
- deploy/install.sh: idempotent fresh-VPS provisioning: uv + Python 3.12 + supercronic (SHA-pinned) + sops + age, clone, uv sync, /srv/cache, unit.
- deploy/runbook.md: ops: status, logs, manual runs, git-pull updates, secret + age-key rotation, host failover, failure table.
- deploy/secrets/ + .sops.yaml: age/sops scaffolding (gitignore, example, README) for parity with liquidity-monitoring. prod.env.enc is created by the operator after adding their age key.
- automation/: scrub Docker references from jobs.yaml, README, __main__.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This box runs no signing key (unlike liquidity-monitoring), so encrypting the env in-repo was ceremony with no funds at stake. The operator drops the env at /etc/yearn-monitoring/.env (0640 root:<deploy-user>) once; the unit loads it via EnvironmentFile and refuses to start without it.

- remove .sops.yaml and deploy/secrets/
- service: replace the sops ExecStartPre decrypt with a presence guard
- install.sh: drop sops + age installation; checklist now says "drop .env"
- runbook: replace secret/age-key rotation with edit-in-place + restart

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
# Conflicts:
#	pyproject.toml
#	uv.lock
The merge committed main's uv.lock, where pyyaml was only present transitively. pyproject declares it directly, so regenerate the lock to match (keeps `uv lock --check` green). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…h it

Now that the runner has a persistent filesystem, every on-disk dedupe/cache file resolves against one CACHE_DIR knob (set to /srv/cache by the systemd unit, unset locally → repo CWD as before) instead of per-file absolute paths spelled out in jobs.yaml.

- utils/cache.py: add CACHE_DIR + cache_path(); wrap cache/nonces/morpho paths. Absolute overrides still win (os.path.join semantics).
- utils/calldata/decoder.py: route the selector cache through cache_path so it lands in /srv/cache. Previously it defaulted to selector-cache.txt in the repo dir, which is read-only under the hardened unit (ProtectSystem=strict), so the write silently failed and the cache never persisted. (Plan item #1.)
- yearn/check_stuck_triggers.py: DEFAULT_CACHE_FILE resolves under CACHE_DIR.
- maple/main.py, 3jane/main.py: these hardcoded their own "cache-id.txt" and so bypassed CACHE_DIR entirely, which would fail writing to the read-only repo on the VPS. Route them through cache_path too.
- automation/jobs.yaml: drop the absolute /srv/cache paths; only daily keeps a basename override (cache-id-daily.txt) to stay isolated from the hourly file.
- deploy/: set CACHE_DIR=/srv/cache in the unit; update runbook.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
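The "absolute overrides still win" note relies on how `os.path.join` treats an absolute later component. A minimal sketch of that behavior follows, assuming a POSIX host; the real `cache_path()` in `utils/cache.py` may differ, and the optional `cache_dir` parameter here is an addition for testability, not something the commit describes.

```python
import os
from typing import Optional


def cache_path(filename: str, cache_dir: Optional[str] = None) -> str:
    """Resolve a cache filename against CACHE_DIR (illustrative sketch).

    - CACHE_DIR unset or empty: a relative filename stays relative, so it
      resolves against the current working directory as before.
    - filename absolute: os.path.join discards the earlier component, so
      the absolute override wins.
    """
    if cache_dir is None:
        cache_dir = os.environ.get("CACHE_DIR", "")
    return os.path.join(cache_dir, filename)
```

With `CACHE_DIR=/srv/cache`, a basename override such as `cache-id-daily.txt` lands under `/srv/cache`, while an absolute path passed by a caller is returned unchanged.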
Operator playbook for moving the four scheduled workflows (hourly/daily/weekly/multisig) off GitHub Actions onto the yearn-monitor unit:

- shadow week with all TELEGRAM_CHAT_ID_* funneled to one shadow chat and TELEGRAM_TOPIC_ID_* dropped (works around telegram.py's per-protocol chat_id having no DEFAULT fallback), which also warms /srv/cache before the flip
- go-live-before-disabling-GitHub flip (duplicates are deduped; a gap is not)
- rollback via `gh workflow enable` + quiet the VPS
- cleanup: keep workflows disabled 30d, strip the now-dead actions/cache steps

Link it from runbook.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
When set, every alert from every protocol is routed to a single chat via the default bot, prefixed with a [protocol] label and with no topic threading, bypassing both topic and legacy per-protocol routing. Lets the whole fleet be sent to one dummy group for staging/comparison without touching the production TELEGRAM_TOPIC_ID_* / per-protocol vars. Extract the shared POST into _post_message to avoid duplication. Document in .env.example and deploy/runbook.md, add a unit test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
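The override's effect on routing can be sketched as a pure function. The variable name `TELEGRAM_OVERRIDE_CHAT_ID` is hypothetical (the commit message does not name the setting), and the non-override branch is a simplified stand-in for the existing topic and per-protocol lookup in `telegram.py`, not a description of it.

```python
from typing import Mapping, Optional, Tuple

# Hypothetical name: the actual environment variable is not given above.
OVERRIDE_VAR = "TELEGRAM_OVERRIDE_CHAT_ID"


def resolve_route(
    protocol: str, text: str, env: Mapping[str, str]
) -> Tuple[Optional[str], Optional[str], str]:
    """Return (chat_id, topic_id, message_text) for one alert."""
    override = env.get(OVERRIDE_VAR)
    if override:
        # Single chat, no topic threading, message labeled by protocol.
        return override, None, f"[{protocol}] {text}"
    # Simplified assumption about the normal path: per-protocol variables.
    key = protocol.upper()
    return (
        env.get(f"TELEGRAM_CHAT_ID_{key}"),
        env.get(f"TELEGRAM_TOPIC_ID_{key}"),
        text,
    )
```

Because the override is checked first, the production `TELEGRAM_TOPIC_ID_*` and per-protocol variables can stay in place while the whole fleet is pointed at a staging group.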
…rnald cap

- Service unit yearn-monitor → monitoring (file, SyslogIdentifier, all docs).
- Config dir /etc/yearn-monitoring → /etc/monitoring.
- REPO_DIR default → /srv/monitoring (matches the GitHub repo name); the unit's WorkingDirectory/REPO_ROOT/venv PATH are now templated from it via __REPO_DIR__ so a single REPO_DIR override rewrites the unit.
- install.sh: add journald persistence + SystemMaxUse cap drop-in (JOURNAL_MAX_USE), and a Grafana Cloud log-shipping pointer in the final steps.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- integration-check.sh: read-only smoke test that monitoring + liquidity coexist on one host. It verifies path/unit/port isolation, masked units, shared git-credential reachability for both GitHub repos, and liquidity /healthz, and WARNs on shared RPC providers / Telegram bot / tight RAM-disk. Exit 1 on any hard collision.
- throwaway-vps.sh: one-shot provisioner that clones both repos, runs each installer, and brings them up in safe mode (monitoring LOG_LEVEL=DEBUG → no Telegram; liquidity SHADOW_MODE=true → dry-run, no signing/push, with prod.env.enc removed so it boots without an age key), then runs the integration check. `… teardown` undoes it on a reused box.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
integration-check.sh and throwaway-vps.sh reference the private liquidity signer daemon's internals (shadow mode, SOPS secrets, the 127.0.0.1:8080 surface, the private repo URL). They don't belong in this PUBLIC repo — they now live in tapired/liquidity-monitoring/deploy/, whose operators are the ones running coexistence rehearsals. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Fix Safe nonce dedupe and Telegram plain text sends
Fix stale governance proposal alerts
Capture each subprocess's stdout/stderr and surface a short error tail (last 4 lines / 500 chars, stderr-preferred) in the failure digest, so alerts are actionable without an SSH/journalctl round-trip. Full output is still re-emitted to the daemon logs on failure. The tail is rendered inside a Markdown V1 fenced code block with backticks neutralized, so tracebacks containing _ * [ can't break parsing and silently drop the whole digest. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
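The tail selection and the backtick handling can be sketched as two small helpers. The function names are hypothetical, and replacing backticks with apostrophes is only one possible way to "neutralize" them; the commit does not say which substitution the runner uses.

```python
FENCE = "`" * 3  # built at runtime so this example contains no literal fence


def error_tail(stdout: str, stderr: str, max_lines: int = 4, max_chars: int = 500) -> str:
    """Pick a short tail for the digest, preferring stderr when it has content."""
    source = stderr if stderr.strip() else stdout
    lines = source.strip().splitlines()
    tail = "\n".join(lines[-max_lines:])
    # Cap the total size as well, keeping the end of the output.
    return tail[-max_chars:]


def render_tail_block(tail: str) -> str:
    """Wrap the tail in a fenced block that its own content cannot close."""
    # Assumption: swap backticks for apostrophes so a traceback containing
    # backticks cannot terminate the fence early.
    safe = tail.replace("`", "'")
    return f"{FENCE}\n{safe}\n{FENCE}"
```

Inside a fenced block, characters such as `_`, `*` and `[` are not treated as markup, so only the backtick needs special handling.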
Fold the weekly and yearn-stuck-triggers profiles into the surviving profiles and turn off two monitors:

- Move yearn-check-endorsed and yearn-check-timelock-delay from the weekly profile into daily (weekly profile removed).
- Move yearn-check-stuck-triggers into hourly as enabled: false (yearn-stuck-triggers profile removed).
- Disable yearn-check-shadow-debt in daily.
- Shift hourly cron 11 -> 5 (needs systemctl restart to re-render).
- Update docs and the repo-jobs.yaml test to the 3 remaining profiles.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Closes #255. First PR in the VPS migration.
What this does

Adds `automation/`, a single source of truth (`jobs.yaml`) plus a thin runner that ports every existing GH Actions cron schedule, and a native systemd deploy that runs it under supercronic on the VPS. No Docker, no compose, no Caddy: liquidity-monitoring already runs on the same box, so a container boundary buys nothing. The existing GH Actions workflows stay live for a parallel-run verification window; cutover is documented in `deploy/cutover.md`.

Runs alongside liquidity-monitoring's `liqmon` on the same host, with a distinct unit (`yearn-monitor`) and distinct paths (`/etc/yearn-monitoring`, `/srv/yearn-monitoring`), so the two coexist.

Layout
- `automation/`: flat package alongside `aave/`, `morpho/`, `utils/`, …
  - `jobs.yaml`: SSOT with 5 profiles, 32 tasks, one cron each
  - `config.py` / `runner.py` / `__main__.py`: schema + parser; per-task `subprocess.run` with continue-on-error + one Telegram digest on failure; CLI `list` / `render-crontab` / `run <profile> [--dry-run]`
- `deploy/systemd/yearn-monitor.service`: one hardened unit. `ExecStartPre` requires `/etc/yearn-monitoring/.env` and renders the crontab from `jobs.yaml`; `ExecStart` runs supercronic. `Restart=on-failure`, `ProtectSystem=strict`, `ReadWritePaths=/srv/cache`. No `/healthz` watchdog, since there is no daemon to wedge.
- `deploy/install.sh`: idempotent provisioning: uv + Python 3.12 + supercronic (SHA-pinned), clone → `/srv/yearn-monitoring`, `uv sync --frozen`, create `/srv/cache`, install the unit.
- `deploy/runbook.md`: ops (status, logs, manual runs, updates, failure table).
- `deploy/cutover.md`: GH Actions → VPS migration playbook (shadow week + flip).
- `tests/test_automation_{config,render,runner}.py`: 21 tests.
- `pyproject.toml`: adds `pyyaml`, registers `automation`. `uv.lock` synced.

Secrets: plain root-owned `.env`

This box signs nothing (unlike liquidity's `liqmon`), so secrets are a plain `/etc/yearn-monitoring/.env` (`0640 root:<deploy-user>`) loaded via `EnvironmentFile`. The unit refuses to start without it. No sops/age ceremony.

Caching: one `CACHE_DIR`

All on-disk dedupe/cache state resolves against a single `CACHE_DIR` (set to `/srv/cache` by the unit; unset locally → repo CWD). `utils/cache.py` gains `cache_path()`; the selector cache, stuck-triggers JSON, and the `maple`/`3jane` caches (which previously hardcoded repo-relative paths and would have failed writing under the read-only hardened unit) all route through it. `jobs.yaml` no longer hardcodes `/srv/cache` paths.

Deployment steps (fresh VPS)
Updates are `git pull --ff-only && sudo systemctl restart yearn-monitor` (add `uv sync --frozen` for dependency changes).

Cutover steps (GH Actions → VPS)

Full playbook in `deploy/cutover.md`. In short:

1. Shadow week: all `TELEGRAM_CHAT_ID_*` funneled to one shadow chat and all `TELEGRAM_TOPIC_ID_*` removed (works around `telegram.py`'s per-protocol `chat_id` having no DEFAULT fallback). GH Actions keeps posting to the real channels; compare for ~7 days. This also warms `/srv/cache`, so there is no cold-start alert burst.
2. Flip (go live before disabling GitHub): `systemctl restart`, confirm one good real tick, then disable the GH crons.
3. Rollback: `gh workflow enable …` + stop/quiet the unit. Both run in parallel again.
4. Cleanup: keep the workflows disabled for 30 days and strip the now-dead `actions/cache` steps from `_run-monitoring.yml`.

Verification (local)
- `uv run python -m automation render-crontab`
- `uv run --extra dev pytest`
- `uv run --extra dev ruff check .` / `ruff format --check .`
- `uv lock --check`
- `bash -n deploy/install.sh`

End-to-end systemd behavior is exercised during provisioning per `deploy/runbook.md`, not in CI.

Test plan

- `uv run python -m automation render-crontab` matches the 5 expected lines
- `deploy/systemd/yearn-monitor.service`, `deploy/runbook.md`, `deploy/cutover.md`
- `deploy/cutover.md`

🤖 Generated with Claude Code