Add launchd watchdog for silent LaunchAgent failure detection #2

@ns408

Problem

When a LaunchAgent fails silently (e.g., script path breaks after a repo rename), there's no notification — the job just stops running and nobody notices for days. Currently the only visibility is manually checking log files in ~/.local/log/.

Proposed Solution: Hybrid Approach

1. Wrapper script with failure notification (scripts/run-with-notify.sh)

  • Wraps all scheduled jobs; on non-zero exit, sends a macOS notification via osascript
  • Plist templates updated to run jobs through the wrapper
  • Handles the common case: script ran but failed
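
A minimal sketch of what such a wrapper could look like, not the final scripts/run-with-notify.sh. The function names, the argument order (label first, then the command) and the NOTIFY_CMD override hook are assumptions made for illustration; only the use of osascript for the notification comes from the proposal above.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of scripts/run-with-notify.sh.
# Assumed usage: run-with-notify.sh <job-label> <command> [args...]

# notify: show a macOS notification with a title and a message.
# NOTIFY_CMD is an assumed override hook so the wrapper can be tried
# on machines without osascript.
notify() {
  if [ -n "${NOTIFY_CMD:-}" ]; then
    "$NOTIFY_CMD" "$1" "$2"
  elif command -v osascript >/dev/null 2>&1; then
    # Pass the strings as arguments rather than splicing them into
    # the AppleScript source, which avoids quoting problems.
    osascript -e 'on run argv' \
              -e 'display notification (item 2 of argv) with title (item 1 of argv)' \
              -e 'end run' "$1" "$2"
  fi
}

# run_with_notify: run the job, notify on non-zero exit, and
# propagate the job's exit code to the caller (launchd).
run_with_notify() {
  label="$1"
  shift
  rc=0
  "$@" || rc=$?
  if [ "$rc" -ne 0 ]; then
    notify "LaunchAgent failed: $label" "Exit code $rc"
  fi
  return "$rc"
}

# Act as the wrapper only when called with arguments.
if [ "$#" -gt 0 ]; then
  run_with_notify "$@"
  exit $?
fi
```

Returning the original exit code keeps launchd's own record of the last exit status accurate, so the wrapper adds a notification without hiding the failure.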

2. Lightweight watchdog (scripts/launchd-watchdog.sh)

  • Runs every 4 hours via its own LaunchAgent
  • Checks if each monitored job's log file was modified in the last 36 hours
  • Sends a macOS notification for each stale job (one whose log shows no activity in that window)
  • Handles the edge case the wrapper can't: job never executed (broken path, plist unloaded)
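
The freshness check described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the final scripts/launchd-watchdog.sh: the `<label>.log` naming under ~/.local/log, the one-label-per-line config format, the environment overrides and the `--check` flag are all hypothetical. The 36-hour threshold and the config path come from the proposal.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of scripts/launchd-watchdog.sh.

# Assumed layout: one log per job, named <label>.log.
CONFIG="${WATCHDOG_CONFIG:-$HOME/.config/ns-bootstrap/watchdog-jobs.conf}"
LOG_DIR="${WATCHDOG_LOG_DIR:-$HOME/.local/log}"
MAX_AGE_MIN=$((36 * 60))   # 36 hours, in minutes for find -mmin

# is_fresh: succeed if the file exists and was modified within
# the last MAX_AGE_MIN minutes.
is_fresh() {
  [ -f "$1" ] && [ -n "$(find "$1" -mmin "-$MAX_AGE_MIN" 2>/dev/null)" ]
}

# stale_jobs: print every label from the config whose log is
# missing or older than the threshold. Blank lines and lines
# starting with # are skipped.
stale_jobs() {
  [ -r "$CONFIG" ] || return 0
  while IFS= read -r label || [ -n "$label" ]; do
    case "$label" in
      ''|'#'*) continue ;;
    esac
    is_fresh "$LOG_DIR/$label.log" || printf '%s\n' "$label"
  done < "$CONFIG"
}

# notify_stale: send one macOS notification for a stale job.
notify_stale() {
  command -v osascript >/dev/null 2>&1 || return 0
  osascript -e 'on run argv' \
            -e 'display notification (item 1 of argv) with title "launchd watchdog"' \
            -e 'end run' "No log activity in 36h: $1"
}

# Assumed entry point: the watchdog plist would pass --check.
if [ "${1:-}" = "--check" ]; then
  stale_jobs | while IFS= read -r label; do
    notify_stale "$label"
  done
fi
```

A missing log file is treated the same as a stale one, which covers a job that has never run at all.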

Scope

  • Monitor ns-bootstrap's own agents only (com.ns-bootstrap.update-daily, com.ns-bootstrap.update-interactive)
  • Config file at ~/.config/ns-bootstrap/watchdog-jobs.conf allows adding labels later, but don't design a multi-project framework upfront
  • macOS only initially; Ubuntu can defer since systemd has built-in OnFailure= support
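
One way install/bootstrap.sh might seed the default config, sketched under the assumption that the file holds one launchd label per line with # comments. The function name and file format are hypothetical; the two labels and the default path are taken from the scope above.

```shell
# Hypothetical helper for install/bootstrap.sh.
# Writes the default watchdog config only if none exists, so labels
# added by hand later are never overwritten.
write_default_watchdog_config() {
  conf="${1:-$HOME/.config/ns-bootstrap/watchdog-jobs.conf}"
  if [ -e "$conf" ]; then
    return 0
  fi
  mkdir -p "$(dirname "$conf")"
  cat > "$conf" <<'EOF'
# One launchd label per line; lines starting with # are ignored.
com.ns-bootstrap.update-daily
com.ns-bootstrap.update-interactive
EOF
}
```

Calling `write_default_watchdog_config` with no argument targets the default path; passing a path is useful for trying it out in a scratch directory.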

Files to Add/Modify

scripts/
├── run-with-notify.sh                                      # NEW: wrapper with failure notification
├── launchd-watchdog.sh                                     # NEW: checks jobs ran recently
└── launchd/
    ├── com.ns-bootstrap.update-daily.plist.template        # MODIFY: use wrapper
    ├── com.ns-bootstrap.update-interactive.plist.template  # MODIFY: use wrapper
    └── com.ns-bootstrap.watchdog.plist.template            # NEW: runs watchdog every 4h
install/
└── bootstrap.sh                                            # MODIFY: generate default watchdog config

Why hybrid over pure watchdog?

  • The wrapper catches the most common class of failures (runtime errors, roughly 90% by estimate) with near-zero complexity (~15 lines)
  • The watchdog catches the remaining edge case ("job never ran") by checking log freshness — no launchctl list parsing needed
  • A pure watchdog that only polls launchctl list has its own silent-failure problem (turtles all the way down)

Alternatives Considered

Approach                | Catches runtime failures | Catches "never ran" | Complexity
Watchdog only (polling) | Yes                      | Yes                 | Medium: extra plist, config file, launchctl parsing
Wrapper only            | Yes                      | No                  | Low: ~15 lines
Hybrid (recommended)    | Yes                      | Yes                 | Low: two simple scripts
