Platform Observability Stack

Difficulty: Advanced
Team Size: 3-5 people
Time: ~40-50 hours
Demo-ready by: Step 5
Prerequisites: Node.js, basic DevOps, time-series data concepts
Built by: Datadog, Grafana, Prometheus, Jaeger

Skills you'll earn: Metrics collection, log aggregation, distributed tracing, dashboard building, alerting pipelines, OpenTelemetry

Start by collecting a metric from an application. End with a full observability platform: metrics, logs, traces, dashboards, and alerting.

Step 1: Collect and store a metric (~2-3 hours)

You can't improve what you can't measure.

  • Instrument an app to emit a counter (e.g., http_requests_total) using Prometheus client library
  • Expose a /metrics endpoint in Prometheus exposition format
  • Set up Prometheus to scrape the endpoint every 15 seconds
  • Query the metric in Prometheus: rate(http_requests_total[5m])

You now have: Metric collection and querying.
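A minimal sketch of the instrumentation side, assuming a Node.js service with Express and the prom-client library (any Prometheus client works; the port and label names here are placeholders):

    import express from "express";
    import { Counter, register, collectDefaultMetrics } from "prom-client";

    collectDefaultMetrics(); // process-level metrics: CPU, memory, event loop lag

    // Counter of HTTP requests, labelled by method, route, and status code
    const httpRequestsTotal = new Counter({
      name: "http_requests_total",
      help: "Total number of HTTP requests",
      labelNames: ["method", "route", "status"],
    });

    const app = express();

    // Count every finished request
    app.use((req, res, next) => {
      res.on("finish", () => {
        httpRequestsTotal.inc({ method: req.method, route: req.path, status: res.statusCode });
      });
      next();
    });

    // Prometheus exposition endpoint
    app.get("/metrics", async (_req, res) => {
      res.set("Content-Type", register.contentType);
      res.end(await register.metrics());
    });

    app.listen(3000);

With scrape_interval: 15s pointed at this endpoint in prometheus.yml, the rate(http_requests_total[5m]) query from the last bullet returns requests per second.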

Step 2: Build a dashboard (~3-4 hours)

Raw PromQL queries are not dashboards.

  • Set up Grafana and connect it to Prometheus as a data source
  • Create a dashboard with panels: request rate, error rate, latency histogram
  • Add template variables for service name and environment
  • Set a 30-second auto-refresh

You now have: Visual metric dashboards.
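The three panels map onto three PromQL queries. A sketch, assuming the counter from Step 1 plus a service label and an http_request_duration_seconds histogram (both assumptions beyond what Step 1 emits):

    # Request rate per service (requests/second)
    sum by (service) (rate(http_requests_total[5m]))

    # Error ratio: share of requests answered with a 5xx status
    sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

    # 95th-percentile latency, computed from histogram buckets
    histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

Plug the $service template variable into the label selectors so one dashboard serves every service.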

Step 3: Collect logs (~4-5 hours)

Metrics tell you something is wrong. Logs tell you what.

  • Ship application logs to a central store using Loki (or Elasticsearch)
  • Use Promtail or Fluentd as the log shipper
  • Structure logs as JSON: timestamp, level, message, service, request ID
  • Query logs in Grafana: filter by service, level, time range, keyword

You now have: Centralized logging.
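A hand-rolled structured logger is enough to start; the field names below (requestId, orderId) are illustrative:

    // One JSON object per line on stdout, ready for Promtail or Fluentd to ship
    type Level = "debug" | "info" | "warn" | "error";

    function log(level: Level, message: string, fields: Record<string, unknown> = {}): void {
      console.log(JSON.stringify({
        timestamp: new Date().toISOString(),
        level,
        message,
        service: process.env.SERVICE_NAME ?? "unknown",
        ...fields, // e.g. { requestId: "abc-123" }
      }));
    }

    log("info", "order created", { requestId: "abc-123", orderId: 42 });
    log("error", "payment failed", { requestId: "abc-123", reason: "card_declined" });

In Grafana, a LogQL query like {service="checkout"} | json | level="error" then filters those lines by service and level.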

Step 4: Distributed tracing (~4-5 hours)

A request hits 5 services. You need to see the full journey.

  • Instrument services with OpenTelemetry SDK
  • Each request carries a trace ID through all services via headers
  • Export traces to Jaeger or Tempo
  • View a trace waterfall: which service took how long, where errors occurred

You now have: Distributed tracing.
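A sketch of the bootstrap file, assuming the OpenTelemetry Node SDK with auto-instrumentation and an OTLP/HTTP exporter; the package names follow the current OpenTelemetry JS packages and may shift between versions, and the service name and endpoint are placeholders:

    // tracing.ts — load before the rest of the app so instrumentation patches apply first
    import { NodeSDK } from "@opentelemetry/sdk-node";
    import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
    import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

    const sdk = new NodeSDK({
      serviceName: "checkout",
      // Jaeger and Tempo both accept OTLP over HTTP on port 4318
      traceExporter: new OTLPTraceExporter({ url: "http://localhost:4318/v1/traces" }),
      // Auto-instrumentation wraps http, express, pg, etc. and propagates the
      // trace ID to downstream services via the W3C traceparent header
      instrumentations: [getNodeAutoInstrumentations()],
    });

    sdk.start();

Because every service forwards the traceparent header, Jaeger can stitch the spans back into one waterfall per request.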

Step 5: Alerting (~4-5 hours)

Dashboards only help if someone is watching. Alerts catch problems automatically.

  • Define alert rules in Prometheus, e.g. alert when the error ratio exceeds 5%: rate(http_errors_total[5m]) / rate(http_requests_total[5m]) > 0.05
  • Route alerts through Alertmanager: group, deduplicate, route to channels
  • Send alerts to Slack, email, or PagerDuty
  • Add silence rules for maintenance windows

You now have: Proactive incident detection.
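A sketch of a Prometheus rule file, reusing the error-ratio expression above (http_errors_total is assumed to exist alongside http_requests_total, and a service label is assumed on both):

    groups:
      - name: availability
        rules:
          - alert: HighErrorRate
            expr: |
              sum by (service) (rate(http_errors_total[5m]))
                / sum by (service) (rate(http_requests_total[5m])) > 0.05
            for: 10m
            labels:
              severity: page
            annotations:
              summary: "Error ratio above 5% for 10 minutes on {{ $labels.service }}"

Alertmanager can then route on the severity label: severity: page goes to PagerDuty, everything else to a Slack channel.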

Step 6: Correlate metrics, logs, and traces (~4-5 hours)

  • Click a spike on a metric graph → jump to logs from that time window
  • Click a log line with a trace ID → jump to the trace waterfall
  • Wire it together with Grafana's Explore view and linked data sources
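The log-to-trace jump can be wired up in data source provisioning. A sketch, assuming Loki for logs and Tempo for traces; the uid, regex, and URL here are placeholders, and the field names follow Grafana's provisioning format:

    apiVersion: 1
    datasources:
      - name: Loki
        type: loki
        url: http://loki:3100
        jsonData:
          derivedFields:
            # Turn the traceId field of a JSON log line into a link to the matching trace
            - name: TraceID
              matcherRegex: '"traceId":"(\w+)"'
              datasourceUid: tempo       # uid of the Tempo (or Jaeger) data source
              url: '$${__value.raw}'     # $$ escapes $ in provisioning files

Metric-to-log jumps work the other way: use Explore's split view or a panel data link so the time range selected on the graph carries over into the Loki query.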

Step 7: SLOs and error budgets (~4-5 hours)

  • Define SLOs: "99.9% of requests complete in under 500ms"
  • Track error budget burn rate
  • Alert when the error budget is at risk, not just when a threshold is crossed
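A sketch of a fast-burn alert for the 99.9% / 500ms SLO above, using the single-window form of the burn-rate rule (the histogram metric name is an assumption; a 99.9% SLO leaves a 0.1% error budget, and a burn rate of 14.4 empties a 30-day budget in about two days):

    groups:
      - name: slo-latency
        rules:
          - alert: ErrorBudgetFastBurn
            expr: |
              (
                1 - (
                  sum(rate(http_request_duration_seconds_bucket{le="0.5"}[1h]))
                    / sum(rate(http_request_duration_seconds_count[1h]))
                )
              ) > 14.4 * 0.001
            for: 5m
            labels:
              severity: page

Production setups usually pair this with a slower window (e.g. 6 hours) at a lower burn rate so both sudden outages and slow leaks get caught.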

Where to go from here

  • Custom metric exporters for databases, queues, and caches
  • Anomaly detection (ML-based alerting instead of static thresholds)
  • Cost attribution (which team's services generate the most telemetry?)
  • Synthetic monitoring (probe endpoints from outside the cluster)
  • Runbook automation (trigger remediation scripts from alerts)