
Ultimate Monitoring & Observability Doctrine

RULE: If the fleet cannot see it, the fleet cannot trust it. Monitoring is not decoration. It is the nervous system of autonomous engineering.


1. Core Principle

Every significant system must answer these questions at all times:

  • Is it alive?
  • Is it healthy?
  • Is it producing value?
  • Is it failing silently?
  • Is it costing too much?
  • Can we explain what happened after the fact?

If a subsystem cannot answer these, it is not production-ready.


2. Observability Layers

2.1 Agent layer

Track:

  • active sessions
  • error rate
  • response latency
  • model/tool availability
  • background task state
  • completion vs timeout vs cancellation

2.2 Workflow layer

Track:

  • trigger frequency
  • execution success/failure
  • queue backlogs
  • node-level bottlenecks
  • side effect confirmation

2.3 Infrastructure layer

Track:

  • CPU / memory / disk
  • network health
  • storage exhaustion risk
  • process uptime
  • HF Space sleep/restart behavior

2.4 Economic layer

Track:

  • token usage
  • hourly cost
  • cost spikes per workflow / agent / model
  • expensive retry loops

3. Mandatory Health Checks

Every production-capable component needs a health signal:

  • websites → reachable URL / status check
  • APIs → health endpoint or equivalent request
  • workflows → last successful execution timestamp
  • agents → last heartbeat / last activity
  • MCP tools → list-tools / smoke-call success

Why

A component that is “probably fine” is already a liability.
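
As a minimal sketch of the health signals above, the check below classifies each component by the age of its last heartbeat or last successful execution. The function name `health_report`, the status strings, and the per-component age limits are assumptions for illustration, not part of this doctrine. A component with no signal at all is reported as "unknown" rather than assumed healthy.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional


def health_report(
    last_seen: Dict[str, datetime],
    max_age: Dict[str, timedelta],
    now: Optional[datetime] = None,
) -> Dict[str, str]:
    """Classify every expected component as 'ok', 'stale', or 'unknown'.

    last_seen: component name -> timestamp of last heartbeat or success.
    max_age:   component name -> longest acceptable silence (hypothetical limits).
    """
    now = now or datetime.now(timezone.utc)
    report = {}
    for name, limit in max_age.items():
        seen = last_seen.get(name)
        if seen is None:
            # No signal at all: never assume "probably fine".
            report[name] = "unknown"
        elif now - seen > limit:
            report[name] = "stale"
        else:
            report[name] = "ok"
    return report
```

Iterating over `max_age` rather than `last_seen` is deliberate: the list of expected components drives the report, so a component that never reported still shows up.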


4. Logging Rules

4.1 Structured logs wherever possible

Prefer machine-readable logs with fields for:

  • timestamp
  • component
  • severity
  • session/task/issue IDs
  • action
  • result
  • error class
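
One way to carry the fields listed above is a single JSON object per log line. This is a sketch only: the function name `log_event` and the exact key names (`session_id`, `task_id`, `issue_id`, `error_class`) are assumed here, and any real fleet may already have its own schema.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def log_event(
    component: str,
    severity: str,
    action: str,
    result: str,
    *,
    session_id: Optional[str] = None,
    task_id: Optional[str] = None,
    issue_id: Optional[str] = None,
    error_class: Optional[str] = None,
    timestamp: Optional[datetime] = None,
) -> str:
    """Build one machine-readable log line as a JSON string."""
    record = {
        "timestamp": (timestamp or datetime.now(timezone.utc)).isoformat(),
        "component": component,
        "severity": severity,
        "session_id": session_id,
        "task_id": task_id,
        "issue_id": issue_id,
        "action": action,
        "result": result,
        "error_class": error_class,
    }
    # sort_keys keeps the field order stable, which makes lines easy to diff.
    return json.dumps(record, sort_keys=True)


print(log_event("workflow-runner", "error", "publish_blog", "failed",
                task_id="task-123", error_class="TimeoutError"))
```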

4.2 Redaction is mandatory

Never log raw secrets, tokens, passwords, or credentials. Log references, not secrets.
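
A simple way to enforce this is to pass every log payload through a redaction step before it is written. The sketch below assumes a hypothetical set of sensitive key names; a real deployment would need its own list and would probably also scan free-text values. The key name stays in the log as the reference, and the value does not.

```python
from typing import Any, Dict

# Hypothetical list of field names treated as sensitive; extend per system.
SENSITIVE_KEYS = {"password", "token", "secret", "api_key", "authorization"}


def redact(fields: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of fields with sensitive values replaced, including nested dicts."""
    clean: Dict[str, Any] = {}
    for key, value in fields.items():
        if key.lower() in SENSITIVE_KEYS:
            # Keep the key so readers know a credential was involved.
            clean[key] = "[REDACTED]"
        elif isinstance(value, dict):
            clean[key] = redact(value)
        else:
            clean[key] = value
    return clean
```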

4.3 Log what matters

Log:

  • starts
  • finishes
  • retries
  • failures
  • unusual paths
  • external side effects

Do not drown the system in meaningless noise.

5. Alerting Rules

Alert only on things that require action.

Critical alerts

  • system down
  • auth broken
  • workflow repeatedly failing
  • data loss risk
  • billing/cost anomaly
  • stuck queue / stuck retry loop

Warning alerts

  • degraded latency
  • elevated error rate
  • stale worker
  • missed execution window

Anti-pattern

Do not send alerts for every minor fluctuation. Alert fatigue blinds operators.
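
One way to separate "repeatedly failing" from a minor fluctuation is to alert on the length of the current failure streak instead of on each individual failure. The thresholds `warn_after` and `critical_after` below are illustrative assumptions, not values this doctrine prescribes.

```python
from typing import List, Optional


def alert_level(
    recent_results: List[bool],
    warn_after: int = 2,
    critical_after: int = 5,
) -> Optional[str]:
    """Map the trailing run of failures to an alert level.

    recent_results is ordered oldest first; True means the execution succeeded.
    Returns 'critical', 'warning', or None when no alert is warranted.
    """
    streak = 0
    for ok in reversed(recent_results):
        if ok:
            break  # the most recent success ends the streak
        streak += 1
    if streak >= critical_after:
        return "critical"
    if streak >= warn_after:
        return "warning"
    return None
```

A single failed run followed by a success produces no alert, which keeps isolated blips out of the operator's queue.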


6. Evidence Preservation

Important incidents must preserve evidence:

  • screenshots
  • logs
  • failing payloads
  • URLs
  • execution IDs
  • commit/branch/issue references

Where LogCenter applies, upload there. /tmp is temporary, not authoritative.


7. Dashboards Must Be Operationally Useful

A good dashboard tells the operator what to do next. A bad dashboard is just colorful guilt.

Minimum useful dashboard sections:

  • overall fleet health
  • current incidents
  • recent failed executions
  • hot repos / open blockers
  • cost trends
  • deployment state

8. Cost Observability

The fleet must watch for:

  • runaway retries
  • bad model routing
  • needlessly expensive model selection
  • repeated failed background tasks
  • duplicate agents doing the same work

Why

An autonomous system that does not watch cost becomes a denial-of-wallet attack against itself.
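
A basic guard against runaway spend is to compare the latest hourly cost with a trailing average. The sketch below is one possible heuristic; the multiplier `factor`, the floor `min_baseline`, and the function name are assumptions chosen for illustration.

```python
from statistics import mean
from typing import Sequence


def is_cost_spike(
    hourly_costs: Sequence[float],
    latest: float,
    factor: float = 3.0,
    min_baseline: float = 0.01,
) -> bool:
    """Return True when latest exceeds factor times the trailing average.

    hourly_costs: recent hourly spend, excluding the hour being checked.
    min_baseline: floor so a near-zero history does not flag every small charge.
    """
    if not hourly_costs:
        return False  # no history yet, so nothing to compare against
    baseline = max(mean(hourly_costs), min_baseline)
    return latest > factor * baseline
```

The same comparison can be run per workflow, agent, or model by keeping a separate history for each.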


9. Silent Failure Prevention

Every important side effect must be checked downstream. Examples:

  • issue supposedly created → confirm issue URL exists
  • blog supposedly published → confirm file exists and page resolves
  • deploy supposedly finished → confirm URL and assets load
  • agent supposedly completed → confirm result retrievable

Do not trust intermediate success messages over final reality.
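
For side effects that end in a URL (an issue, a published page, a deployed site), the downstream check can be as small as one request. This is a sketch using only the Python standard library; the function name `url_resolves` and the injectable `opener` parameter are assumptions added so the check can be exercised without network access.

```python
import urllib.request
from typing import Callable


def url_resolves(
    url: str,
    timeout: float = 10.0,
    opener: Callable = urllib.request.urlopen,
) -> bool:
    """Return True only if the URL answers with a 2xx status.

    Any network error, HTTP error status, timeout, or malformed URL
    counts as "not confirmed".
    """
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, ValueError):
        # URLError, HTTPError, and socket timeouts are all OSError subclasses;
        # ValueError covers malformed URLs.
        return False
```

A caller would run this after the upstream step reports success and treat a `False` result as a failure, regardless of what the intermediate message said.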


10. Suggested Standard Metrics

Area        Metric               Why it matters
Agents      active sessions      detects stalls and overload
Agents      error rate           reveals instability
Agents      avg / p95 latency    reveals slowdowns
Workflows   success rate         proves automation health
Workflows   execution age        detects dead pipelines
Infra       memory / disk        prevents crashes
Costs       hourly spend         prevents runaway economics
Product     live URL health      proves end-user reality
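
For the avg / p95 latency metric, the sketch below uses the nearest-rank percentile definition. This is an assumption: monitoring tools differ in how they interpolate percentiles, so the numbers may not match a given dashboard exactly. The sample latencies are made up.

```python
import math
from statistics import mean
from typing import Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile of an empty sample is undefined")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


latencies_ms = [120, 95, 110, 2400, 105, 98, 101, 99, 130, 97]  # illustrative samples
print("avg:", mean(latencies_ms), "p95:", percentile(latencies_ms, 95))
```

Reporting both values matters because a single slow request barely moves the median but dominates the average and the tail.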

11. Final Rule

Monitoring is not for feeling informed. It is for enabling intervention. If a metric cannot drive a decision, rethink why you are collecting it.


Last updated: 2026-04-10
Status: ACTIVE & MANDATORY
Maintainer: sin-zeus