Skip to content

Pillar · Monitoring

From signal to resolution, declaratively.

Every monitor in OpsLantern carries its own remediation, verification, flap guard, and escalation. Define it once, safely automate the rest.

Monitor → Decide → Act → Verify

A five-part contract for every check.

Services, resources, ports, queries, queues, replication, folder sizes, file counts, cloud metrics — every monitor describes how it should respond to its own alarm.

Monitor definition yaml
monitor: mssql.service.availability
target_selector: "role:mssql AND env:prod"
signal:
  type: service_state
  name: MSSQL$INSTANCE
  interval: 30s
threshold:
  state: "!= Running"
  sustained_for: 60s
remediation:
  - action: win.service.restart
    max_attempts: 2
    pre_checks: [disk_free > 5GB, no_pending_reboot]
verification:
  - check: tcp_port_open
    port: 1433
  - check: query_ok
    query: "SELECT 1"
escalation:
  after_attempts: 2
  notify: ["on_call:dba"]
  open_ticket: true
flap_guard: 3_restarts_in_15m -> suppress_and_escalate

Shipped day-one

Hundreds of monitors, pre-authored.

Host & OS

  • · CPU sustained high → identify top process → recycle
  • · Memory pressure → swap check → candidate restart
  • · Disk tiered: 80% warn · 90% auto-clean · 95% urgent
  • · Network retransmits / interface errors
  • · Pending reboot status

Services & apps

  • · IIS app-pool recycle on 500-rate spike
  • · Apache / Nginx worker exhaustion → graceful reload
  • · Exchange transport queue depth → flush + restart
  • · MSSQL AG latency → failover drill
  • · MySQL replication lag → resync playbook
  • · Postfix / MailEnable / ModusGate queue depth

Folders & files

  • · Size / file-count thresholds → compress, rotate, purge
  • · Stuck-writer detection (no new files during business hours)
  • · Hash baseline on web roots — webshell / tampering alerts
  • · Growth-rate anomaly (grew 10× faster than 30-day avg)

Cloud-native

  • · Azure VM heartbeat loss → JIT recover
  • · M365 tenant service health → ticket + customer notice
  • · Storage account accidental public exposure → auto-lock
  • · Huawei ECS CPU credits exhausted → scale recommendation

Safety rails

Automation with limits you control.

Per-monitor enable

Nothing is auto-remediated by default. You opt each monitor in per environment.

Flap protection

Auto-suppress when a restart loop is detected. Human escalation takes over.

Blackout windows

No restarts during business hours for Customer X. Declare once per tenant.

Dependency awareness

Do not restart the app server when the database it depends on is still down.

Blast-radius limiter

Max N actions per minute across the fleet — hard-capped, configurable.

Global kill switch

One button pauses every automation during a major incident.

opslantern — monitor: mssql.ag.latency

WARN

replica: sql-ag-03.customerY

lag: 6m 12s (threshold 2m)

suggested remediation

→ mssql.ag.resync (attempt 1 of 2)

→ verify: lag < 30s within 5m

→ escalate on failure: on_call:dba

Silent operator

Suggest mode

Awaiting approval — auto-remediate not enabled for this tenant.

Concept preview — not final UI.

See the full monitor library.