Pillar · Monitoring
From signal to resolution, declaratively.
Every monitor in OpsLantern carries its own remediation, verification, flap guard, and escalation. Define it once, safely automate the rest.
Monitor → Decide → Act → Verify
A five-part contract for every check.
Services, resources, ports, queries, queues, replication, folder sizes, file counts, cloud metrics — every monitor describes how it should respond to its own alarm.
monitor: mssql.service.availability
target_selector: "role:mssql AND env:prod"
signal:
type: service_state
name: MSSQL$INSTANCE
interval: 30s
threshold:
state: "!= Running"
sustained_for: 60s
remediation:
- action: win.service.restart
max_attempts: 2
pre_checks: [disk_free > 5GB, no_pending_reboot]
verification:
- check: tcp_port_open
port: 1433
- check: query_ok
query: "SELECT 1"
escalation:
after_attempts: 2
notify: ["on_call:dba"]
open_ticket: true
flap_guard: 3_restarts_in_15m -> suppress_and_escalate
Shipped day-one
Hundreds of monitors, pre-authored.
Host & OS
- · CPU sustained high → identify top process → recycle
- · Memory pressure → swap check → candidate restart
- · Disk tiered: 80% warn · 90% auto-clean · 95% urgent
- · Network retransmits / interface errors
- · Pending reboot status
Services & apps
- · IIS app-pool recycle on 500-rate spike
- · Apache / Nginx worker exhaustion → graceful reload
- · Exchange transport queue depth → flush + restart
- · MSSQL AG latency → failover drill
- · MySQL replication lag → resync playbook
- · Postfix / MailEnable / ModusGate queue depth
Folders & files
- · Size / file-count thresholds → compress, rotate, purge
- · Stuck-writer detection (no new files during business hours)
- · Hash baseline on web roots — webshell / tampering alerts
- · Growth-rate anomaly (grew 10× faster than 30-day avg)
Cloud-native
- · Azure VM heartbeat loss → JIT recover
- · M365 tenant service health → ticket + customer notice
- · Storage account accidental public exposure → auto-lock
- · Huawei ECS CPU credits exhausted → scale recommendation
Safety rails
Automation with limits you control.
Per-monitor enable
Nothing is auto-remediated by default. You opt each monitor in per environment.
Flap protection
Auto-suppress when a restart loop is detected. Human escalation takes over.
Blackout windows
No restarts during business hours for Customer X. Declare once per tenant.
Dependency awareness
Do not restart the app server when the database it depends on is still down.
Blast-radius limiter
Max N actions per minute across the fleet — hard-capped, configurable.
Global kill switch
One button pauses every automation during a major incident.
WARN
replica: sql-ag-03.customerY
lag: 6m 12s (threshold 2m)
suggested remediation
→ mssql.ag.resync (attempt 1 of 2)
→ verify: lag < 30s within 5m
→ escalate on failure: on_call:dba
Silent operator
Suggest mode
Awaiting approval — auto-remediate not enabled for this tenant.
Concept preview — not final UI.