Observability for B2B Custom Software
Dashboards full of green CPU graphs do not tell a plant manager why orders stopped posting to the ERP. B2B observability must connect technical signals to business journeys: submit, approve, sync, bill. When incidents strike, operators and customer IT need answers in minutes, not after a postmortem next week. This guide covers logs, metrics, traces, alerting, and SLO thinking for custom SaaS, internal tools, and integration-heavy products. It extends production readiness and ERP integration design with operational practices that keep audit and compliance evidence intact while engineers debug fast.
Observability vs traditional monitoring in B2B
Monitoring checks known failure modes: disk full, 5xx rate, queue depth. Observability lets you ask new questions with correlated logs, metrics, and traces when the failure mode was not predicted. B2B adds business dimensions: tenant_id, site_id, integration partner, workflow step. Every signal should carry these tags so incidents filter to one customer without guessing. Customer contracts may require uptime reports and incident notifications. Your telemetry design should produce those reports without manual spreadsheet archaeology.
- Tag all signals with tenant and correlation ID
- Define business journeys, not only microservice names
- Separate operator-facing status from engineer-facing dashboards
- Retain data long enough for month-end incident investigations
Golden signals per business journey
Pick three to seven journeys that matter commercially: user login, create order, approval chain, ERP post, report export, webhook delivery. For each, define success criteria, a latency budget, and an error budget. Instrument at boundaries: API handler start, domain service commit, outbound integration call, job completion. Missing middle spans make a 'slow approval' impossible to localize. Example SLO: 99% of ERP posts complete within 60 seconds, excluding customer-scheduled maintenance windows. Measure from the user action to the confirmed ERP acknowledgment. Align the journey list with the workflows mapped during discovery so observability is designed, not retrofitted.
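The example SLO above can be checked with a few lines of arithmetic. The following is a minimal sketch, not a definitive implementation: the `ErpPost` and `MaintenanceWindow` records and the `slo_compliance` function are hypothetical names, and it assumes a post counts as "in maintenance" when the user action falls inside a window for that tenant.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class ErpPost:
    tenant_id: str
    submitted_at: datetime        # time of the user action
    acked_at: Optional[datetime]  # confirmed ERP acknowledgment, None if never acknowledged


@dataclass
class MaintenanceWindow:
    tenant_id: str
    start: datetime
    end: datetime


def in_maintenance(post: ErpPost, windows: List[MaintenanceWindow]) -> bool:
    """True if the post was submitted during a maintenance window of its own tenant."""
    return any(
        w.tenant_id == post.tenant_id and w.start <= post.submitted_at < w.end
        for w in windows
    )


def slo_compliance(
    posts: List[ErpPost],
    windows: List[MaintenanceWindow],
    budget: timedelta = timedelta(seconds=60),
) -> Optional[float]:
    """Fraction of eligible posts acknowledged within the latency budget.

    Returns None when no post is eligible, so callers do not mistake
    "no traffic" for "100% compliant".
    """
    eligible = [p for p in posts if not in_maintenance(p, windows)]
    if not eligible:
        return None
    good = sum(
        1
        for p in eligible
        if p.acked_at is not None and p.acked_at - p.submitted_at <= budget
    )
    return good / len(eligible)
```

Compare the returned fraction against the 0.99 target per tenant and per reporting period. Posts that were never acknowledged count against the budget here, which is a deliberate choice: a lost post is worse than a slow one.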
Structured logging and distributed tracing
Use structured JSON logs with stable field names: timestamp, level, message, tenant_id, user_id, correlation_id, journey, step. Free-text grep across pods does not scale past three services. Propagate correlation IDs from the browser or API through jobs to integration clients. Support tickets should include one ID that links the entire chain. Tracing helps when you have multiple services or heavy async paths. Sample in production to control cost, but always trace errors and journeys slower than a set latency threshold. Do not log secrets, full PII, or payment payloads. Redact and classify fields according to your audit and privacy policy.
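One way to get these fields onto every log line is a JSON formatter that merges request-scoped context into each record. The sketch below uses only the Python standard library. The names `bind_context`, `JsonFormatter`, and `REDACTED_KEYS`, and the convention of passing per-call fields under `extra={"fields": ...}`, are assumptions made for illustration, and the redaction list is a placeholder for whatever your privacy policy classifies as sensitive.

```python
import contextvars
import json
import logging
from datetime import datetime, timezone

# Request-scoped fields (tenant_id, correlation_id, journey, ...).
# contextvars keeps them isolated per thread and per asyncio task.
_log_context = contextvars.ContextVar("log_context", default={})

# Hypothetical list of keys that must never reach the log sink.
REDACTED_KEYS = {"password", "token", "card_number"}


def bind_context(**fields):
    """Add fields to the current context; a new dict is set, nothing is mutated."""
    _log_context.set({**_log_context.get(), **fields})


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with stable field names."""

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        entry.update(_log_context.get())
        # Per-call fields arrive via logger.info(..., extra={"fields": {...}}).
        for key, value in getattr(record, "fields", {}).items():
            entry[key] = "[REDACTED]" if key in REDACTED_KEYS else value
        return json.dumps(entry, default=str)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("orders")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

    bind_context(tenant_id="acme", correlation_id="c-123", journey="erp_post")
    logger.info("post accepted", extra={"fields": {"step": "outbound_call", "token": "abc"}})
```

Binding the context once at the API handler or job entry point means individual log calls cannot forget the tenant or correlation ID. Key-based redaction only catches fields you name, so treat it as a backstop rather than a substitute for not passing sensitive values to the logger.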
Observability for integrations and queues
Track per-integration health: success rate, latency percentiles, retry count, oldest unprocessed message, reconciliation backlog. An ERP slowdown is your incident even when your app is 'up'. Give customer success a tenant-scoped dashboard showing the last successful sync, the last error message (sanitized), and a self-service replay where it is safe. Alert on anomalies, not only on thresholds: a sudden drop in posts for one tenant may mean credential expiry, not a global outage. Patterns from API design apply to observability too: use stable error codes as metric labels.
- Dead-letter queue depth and age alerts
- Synthetic checks against sandbox ERP every N minutes
- Runbook link in alert payload
- Maintenance calendar overlay to reduce false pages
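The dead-letter queue, runbook, and maintenance items in the list above can be combined into one alert evaluation. This is a minimal sketch under stated assumptions: `QueueSnapshot` and `evaluate_dlq` are hypothetical names, the thresholds are placeholders, and the runbook URL is an example value, not a real endpoint.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Placeholder link; point this at your real runbook.
RUNBOOK_URL = "https://runbooks.example.com/dlq-backlog"


@dataclass
class QueueSnapshot:
    integration: str
    tenant_id: str
    dlq_depth: int
    oldest_message_at: Optional[datetime]  # None when the queue is empty


def evaluate_dlq(
    snapshot: QueueSnapshot,
    now: datetime,
    in_maintenance: bool,
    max_depth: int = 50,
    max_age: timedelta = timedelta(minutes=15),
) -> Optional[dict]:
    """Return an alert payload, or None when nothing should fire."""
    if in_maintenance:
        # Maintenance calendar overlay: suppress to avoid false pages.
        return None

    reasons = []
    if snapshot.dlq_depth > max_depth:
        reasons.append(f"depth {snapshot.dlq_depth} exceeds {max_depth}")
    if snapshot.oldest_message_at is not None:
        age = now - snapshot.oldest_message_at
        if age > max_age:
            reasons.append(f"oldest message is {int(age.total_seconds())}s old")

    if not reasons:
        return None
    return {
        "integration": snapshot.integration,
        "tenant_id": snapshot.tenant_id,
        "reasons": reasons,
        "runbook": RUNBOOK_URL,  # link travels with the alert payload
    }
```

Checking age as well as depth matters for low-volume tenants: a single stuck message never trips a depth threshold but can still block a month-end close.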
Alerting, on-call, and customer communication
Page humans only for user-visible or data-integrity risk. Queue lag within SLA goes to a ticket, not the pager. Define severity levels and expected response times in support contracts. Write runbooks for the top alerts: ERP 401 credential rotation, database connection exhaustion, SSO metadata expiry, stuck migration job. A customer communication template should cover impact scope (which tenants), workaround, the channel for ETA updates, and a post-incident summary for enterprise accounts. The hypercare period after a data migration go-live temporarily needs tighter thresholds.
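The paging rule above can be written down as a small routing function so it is reviewed like any other code. The sketch below is illustrative only: `Signal`, `Route`, and `route` are hypothetical names, it assumes that exceeding the SLA limit is treated as user-visible risk, and the hypercare factor of 0.5 is an arbitrary example of a "tighter threshold".

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    PAGE = "page"
    TICKET = "ticket"


@dataclass
class Signal:
    name: str
    user_visible: bool
    data_integrity_risk: bool
    value: float      # observed value, e.g. queue lag in seconds
    sla_limit: float  # contractual limit in the same unit


def effective_limit(sla_limit: float, hypercare: bool, hypercare_factor: float = 0.5) -> float:
    """Tighten the limit during hypercare; the factor is an assumed example."""
    return sla_limit * hypercare_factor if hypercare else sla_limit


def route(signal: Signal, hypercare: bool = False) -> Route:
    """Page for user-visible or data-integrity risk, otherwise open a ticket."""
    if signal.user_visible or signal.data_integrity_risk:
        return Route.PAGE
    # Assumption: breaching the (possibly tightened) limit counts as user-visible.
    if signal.value > effective_limit(signal.sla_limit, hypercare):
        return Route.PAGE
    return Route.TICKET
```

For example, a queue lag of 40 seconds against a 60-second SLA routes to a ticket in normal operation and to the pager during hypercare, because the effective limit drops to 30 seconds.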
Debugging with tenant isolation intact
Support tools must not disable tenancy for convenience. Use audited impersonation or tenant-scoped diagnostic views. Cross-tenant search is a compliance incident. Provide safe, read-only explain plans or trace viewers scoped to one tenant. Export a diagnostic bundle that customer IT can share without raw database access. In multi-tenant setups, noisy-neighbor detection protects large customers: track CPU, IO, and integration call rate per tenant.
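Noisy-neighbor detection can start as a simple share-of-total check on any per-tenant metric. The sketch below is a rough heuristic rather than a recommended algorithm: `noisy_tenants` is a hypothetical helper, and the 50% share threshold is an assumption to tune against your own traffic.

```python
from typing import Dict, List


def noisy_tenants(usage: Dict[str, float], share_threshold: float = 0.5) -> List[str]:
    """Return tenants whose share of total usage exceeds the threshold.

    `usage` maps tenant_id to one resource metric over a fixed window,
    for example CPU seconds, IO operations, or integration calls.
    """
    total = sum(usage.values())
    if total <= 0 or len(usage) < 2:
        # With no load or a single tenant there is no neighbor to protect.
        return []
    return sorted(t for t, used in usage.items() if used / total > share_threshold)
```

Because the input is already aggregated per tenant, the check needs no cross-tenant access to customer data, which keeps it compatible with the isolation rules in this section.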
Tooling choices and cost control
Managed observability platforms (Datadog, Honeycomb, Grafana Cloud) trade higher cost for faster setup. Log volume is the usual budget surprise: sample debug logs, aggregate metrics, and retain traces selectively. OpenTelemetry gives portability if customers ask for telemetry export. Start with one vendor and instrument with OTel APIs and collectors where possible. Budget observability as an ongoing operating cost, not only as a launch task.
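Selective trace retention usually means keeping every error and slow trace while sampling the rest. The following sketch shows that decision as a plain function made once a trace has finished (a tail-style decision). It is not the OpenTelemetry sampler API: `keep_trace` and its default thresholds are assumptions for illustration.

```python
import hashlib


def keep_trace(
    trace_id: str,
    is_error: bool,
    duration_ms: float,
    slow_threshold_ms: float = 2000.0,
    sample_rate: float = 0.05,
) -> bool:
    """Decide whether to retain a finished trace.

    Errors and slow traces are always kept. Everything else is sampled
    deterministically from the trace ID, so every service that evaluates
    the same trace reaches the same decision.
    """
    if is_error or duration_ms >= slow_threshold_ms:
        return True
    if sample_rate >= 1.0:
        return True
    if sample_rate <= 0.0:
        return False
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1].
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate
```

Hashing the trace ID instead of calling a random number generator keeps the sampled set stable across retries and replays, which helps when a support ticket references a specific correlation ID.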
Observability checklist before launch
Verify that alerts fire in a staging drill. Confirm that a dashboard exists for each golden journey. Test that a correlation ID shown in a UI error banner leads to the matching log query. Document who receives pages and who provides backup coverage. Add these checks to the production readiness gate alongside backup restore and rollback tests.
Next steps
Name your top three business journeys and check whether you can measure their end-to-end latency today. Add correlation ID propagation this sprint if it is missing. If you need an observability review before launch, book a call or get in touch with an architecture sketch, integration count, and current tooling.
FAQ
What is the minimum observability for a B2B MVP?
Structured logs with a correlation ID, error tracking, an uptime check on the critical API, and one dashboard per core journey. Add paging when paying customers depend on the product for daily operations, not for the first prototype user.
How is observability different from audit logging?
Observability helps engineers debug and measure reliability. Audit logs prove who did what for compliance. Some events appear in both, but retention, access control, and immutability differ. Do not use debug logs as a legal audit trail.
Should customers access our Grafana?
Usually not for the full internal dashboards. Offer a tenant-scoped status page, sanitized integration health, and incident communications. Some contracts allow read-only SLA dashboards; plan multi-tenant views carefully.
When to hire SRE or platform support?
When on-call load exceeds the product team's capacity or SLAs are contractual. Until then, define runbooks and automate the top alerts during the build, with senior contractor help if needed.