Platform Observability Stack
| | |
| --- | --- |
| Difficulty | Advanced |
| Team Size | 3-5 people |
| Time | ~40-50 hours |
| Demo-ready by | Step 5 |
| Prerequisites | Node.js, basic DevOps, time-series data concepts |
| Built by | Datadog, Grafana, Prometheus, Jaeger |
Skills you'll earn: Metrics collection, log aggregation, distributed tracing, dashboard building, alerting pipelines, OpenTelemetry
Start by collecting a metric from an application. End with a full observability platform: metrics, logs, traces, dashboards, and alerting.
Step 1: Collect and store a metric (~2-3 hours)
You can't improve what you can't measure.
- Instrument an app to emit a counter (e.g., `http_requests_total`) using a Prometheus client library
- Expose a `/metrics` endpoint in Prometheus exposition format
- Set up Prometheus to scrape the endpoint every 15 seconds
- Query the metric in Prometheus: `rate(http_requests_total[5m])`
You now have: Metric collection and querying.
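A minimal instrumentation sketch for this step, assuming a Node.js/Express service and the `prom-client` library; the port, label names, and service details are illustrative:

```ts
// app.ts: emit http_requests_total and expose /metrics for Prometheus.
// Assumes `npm install express prom-client`.
import express from "express";
import client from "prom-client";

const register = new client.Registry();

const httpRequestsTotal = new client.Counter({
  name: "http_requests_total",
  help: "Total HTTP requests handled",
  labelNames: ["method", "route", "status"],
  registers: [register],
});

const app = express();

// Count every finished request, labelled by method, route, and status code.
app.use((req, res, next) => {
  res.on("finish", () => {
    httpRequestsTotal.inc({
      method: req.method,
      route: req.path,
      status: String(res.statusCode),
    });
  });
  next();
});

// Prometheus scrapes this endpoint in exposition format.
app.get("/metrics", async (_req, res) => {
  res.set("Content-Type", register.contentType);
  res.end(await register.metrics());
});

app.listen(3000);
```

On the Prometheus side, a scrape job pointed at that endpoint (the target address is an assumption):

```yaml
# prometheus.yml (fragment)
scrape_configs:
  - job_name: demo-app
    scrape_interval: 15s
    static_configs:
      - targets: ["localhost:3000"]
```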
Step 2: Build a dashboard (~3-4 hours)
Raw PromQL queries are not dashboards.
- Set up Grafana and connect it to Prometheus as a data source
- Create a dashboard with panels: request rate, error rate, latency histogram
- Add template variables for service name and environment
- Set a 30-second auto-refresh
You now have: Visual metric dashboards.
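If you prefer to configure the Prometheus connection as code rather than through the UI, Grafana supports file-based provisioning; a sketch, assuming Prometheus is reachable at the URL below:

```yaml
# grafana/provisioning/datasources/prometheus.yml (path follows Grafana's provisioning layout)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
```

Typical panel queries: `rate(http_requests_total[5m])` for request rate, `sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))` for error rate, and `histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))` for p95 latency (the duration histogram is assumed to exist).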
Step 3: Collect logs (~4-5 hours)
Metrics tell you something is wrong. Logs tell you what.
- Ship application logs to a central store using Loki (or Elasticsearch)
- Use Promtail or Fluentd as the log shipper
- Structure logs as JSON: timestamp, level, message, service, request ID
- Query logs in Grafana: filter by service, level, time range, keyword
You now have: Centralized logging.
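On the application side, a sketch of structured JSON logging with `pino` (any JSON logger works); Promtail or Fluentd then tails the output and ships it to Loki. The service name and request-ID handling are illustrative:

```ts
// logger.ts: structured JSON logs with the fields this step calls for.
// Assumes `npm install pino`.
import pino from "pino";
import { randomUUID } from "node:crypto";

const logger = pino({
  base: { service: "checkout-service" },      // constant field on every line
  timestamp: pino.stdTimeFunctions.isoTime,   // ISO-8601 timestamps
});

// One child logger per request, so every line carries the same request ID.
const requestLogger = logger.child({ requestId: randomUUID() });

requestLogger.info({ route: "/checkout", durationMs: 42 }, "request completed");
requestLogger.error({ route: "/checkout" }, "payment provider timed out");
// => {"level":30,"time":"...","service":"checkout-service","requestId":"...","route":"/checkout","durationMs":42,"msg":"request completed"}
```

In Grafana, a LogQL query such as `{service="checkout-service"} | json` then parses these fields for filtering, assuming Promtail attaches a `service` label.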
Step 4: Distributed tracing (~4-5 hours)
A request hits 5 services. You need to see the full journey.
- Instrument services with OpenTelemetry SDK
- Each request carries a trace ID through all services via headers
- Export traces to Jaeger or Tempo
- View a trace waterfall: which service took how long, where errors occurred
You now have: Distributed tracing.
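A minimal Node.js tracing bootstrap, assuming the OpenTelemetry NodeSDK packages and an OTLP HTTP endpoint (recent Jaeger and Tempo both accept OTLP on port 4318); the service name, URL, and package versions are assumptions:

```ts
// tracing.ts: load this before the application code starts.
// Assumes @opentelemetry/sdk-node, @opentelemetry/auto-instrumentations-node,
// and @opentelemetry/exporter-trace-otlp-http are installed.
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";

const sdk = new NodeSDK({
  serviceName: "checkout-service",            // appears on every span
  traceExporter: new OTLPTraceExporter({
    url: "http://localhost:4318/v1/traces",   // Jaeger/Tempo OTLP HTTP endpoint
  }),
  // Auto-instruments http, express, database clients, etc., and propagates
  // the trace ID between services via W3C traceparent headers.
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```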
Step 5: Alerting (~4-5 hours)
Dashboards only help if someone is watching. Alerts catch problems automatically.
- Define alert rules in Prometheus: `rate(http_errors_total[5m]) > 0.05`
- Route alerts through Alertmanager: group, deduplicate, route to channels
- Send alerts to Slack, email, or PagerDuty
- Add silence rules for maintenance windows
You now have: Proactive incident detection.
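A sketch of the two configuration halves: a Prometheus alerting rule matching the threshold above, and an Alertmanager route that groups alerts and forwards them to Slack. Metric names, label values, and the webhook URL are assumptions.

```yaml
# alert-rules.yml (loaded via rule_files in prometheus.yml)
groups:
  - name: service-alerts
    rules:
      - alert: HighErrorRate
        expr: rate(http_errors_total[5m]) > 0.05
        for: 10m                 # must hold for 10 minutes before firing
        labels:
          severity: warning
        annotations:
          summary: "Error rate above 0.05/s on {{ $labels.service }}"
```

Alertmanager then handles grouping, deduplication, and routing (the Slack webhook URL is a placeholder):

```yaml
# alertmanager.yml (fragment)
route:
  group_by: ["alertname", "service"]
  group_wait: 30s
  repeat_interval: 4h
  receiver: slack-oncall
receivers:
  - name: slack-oncall
    slack_configs:
      - api_url: https://hooks.slack.com/services/REPLACE_ME
        channel: "#alerts"
```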
Step 6: Correlate metrics, logs, and traces (~4-5 hours)
- Click a spike on a metric graph → jump to logs from that time window
- Click a log line with a trace ID → jump to the trace waterfall
- Use Grafana's Explore view with linked data sources to pivot between them
You now have: Correlated telemetry across metrics, logs, and traces.
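The log-to-trace jump only works if every log line carries the active trace ID. A sketch of that glue, assuming the OpenTelemetry API and the pino logger from Step 3:

```ts
// Attach the current trace ID to each log line so Grafana can turn it into a link.
// Assumes the tracing bootstrap from Step 4 is loaded in the same process.
import { trace } from "@opentelemetry/api";
import pino from "pino";

const logger = pino({ base: { service: "checkout-service" } });

function logWithTrace(message: string): void {
  const span = trace.getActiveSpan();            // undefined outside an active span
  logger.info(
    { trace_id: span?.spanContext().traceId },   // matched by Grafana's derived fields
    message,
  );
}

logWithTrace("order persisted");
```

In Grafana, a derived field on the Loki data source that matches `trace_id` and targets the Tempo or Jaeger data source makes the ID clickable, completing the log-to-trace jump.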
Step 7: SLOs and error budgets (~4-5 hours)
- Define SLOs: "99.9% of requests complete in under 500ms"
- Track error budget burn rate
- Alert when the error budget is at risk, not just when a threshold is crossed
You now have: SLO-driven alerting tied to user-facing reliability.
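A sketch of a fast-burn alert for a 99.9% availability SLO, following the burn-rate approach from the Google SRE workbook but simplified to a single one-hour window; the metric and label names are assumptions:

```yaml
# slo-rules.yml: page when the error budget is burning ~14x faster than sustainable
groups:
  - name: slo-availability
    rules:
      - alert: ErrorBudgetFastBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
              /
            sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "At this rate the 30-day error budget is gone in about two days"
```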
Useful Resources
- Prometheus Documentation
- Grafana Documentation
- OpenTelemetry Documentation
- Grafana Loki
- Jaeger Tracing
- Google SRE Book — Monitoring
Where to go from here
- Custom metric exporters for databases, queues, and caches
- Anomaly detection (ML-based alerting instead of static thresholds)
- Cost attribution (which team's services generate the most telemetry?)
- Synthetic monitoring (probe endpoints from outside the cluster)
- Runbook automation (trigger remediation scripts from alerts)