prometheus

Version: 77.14.0 | Type: application

Prometheus monitoring stack with Grafana and Alertmanager

Component Information

Property          Value
Chart Version     77.14.0
Chart Type        application
Upstream Project  prometheus
Maintainers       Platform Engineering Team

Why Prometheus?

Prometheus uses a pull model: it scrapes metrics from targets on a schedule it controls. The platform therefore decides which targets are scraped, how often, and (through relabeling) which labels are kept. This helps prevent metric explosions in resource-constrained environments.

The kube-prometheus-stack Helm chart bundles Prometheus and the Prometheus Operator with Grafana, Alertmanager, kube-state-metrics, node-exporter, and pre-configured dashboards and alerting rules for Kubernetes components. This reduces the initial configuration work.

ServiceMonitor CRDs make metrics collection declarative. Instead of hand-editing Prometheus scrape configuration, each component ships ServiceMonitor resources alongside its manifests. The Prometheus Operator discovers them and regenerates the scrape configuration automatically, which fits the platform's GitOps approach.
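A minimal ServiceMonitor sketch follows. The resource name, namespace, Service label, and the port name `metrics` are hypothetical placeholders. The `prometheus: kube-prometheus` label is what the `serviceMonitorSelector` in the configuration values matches on, so a ServiceMonitor without it would not be picked up.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app                 # hypothetical name
  namespace: example                # hypothetical namespace
  labels:
    prometheus: kube-prometheus     # matched by prometheusSpec.serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: example-app   # hypothetical label on the target Service
  endpoints:
    - port: metrics                 # named Service port exposing metrics (assumed)
      path: /metrics
```

By default a ServiceMonitor selects Services in its own namespace, so it is typically shipped next to the component it monitors.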

Architecture Role

Prometheus operates at Layer 1 of the platform, the Platform Services layer. It is a cross-cutting service that monitors every layer above it.

Key integration points:

  • ServiceMonitors: Declared by components (Cilium, ArgoCD, Kyverno, etc.) to tell Prometheus which metrics endpoints to scrape
  • Grafana: Queries Prometheus for metrics visualization
  • Pyrra: Uses Prometheus metrics to calculate SLO burn rates
  • Alertmanager: Receives alerts fired by Prometheus alerting rules (currently enabled to support Pyrra burn-rate alerts)
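For the Pyrra integration, an SLO is declared as a `ServiceLevelObjective` resource. Pyrra generates recording and burn-rate alerting rules from it, Prometheus evaluates them, and firing alerts go to Alertmanager. The sketch below assumes a hypothetical service exposing an `http_requests_total` counter with a `code` label; the names, target, and window are placeholders.

```yaml
apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: example-api-availability    # hypothetical name
  namespace: example                # hypothetical namespace
spec:
  target: "99.5"                    # percentage of requests that must succeed
  window: 2w                        # SLO window (illustrative)
  indicator:
    ratio:
      errors:
        # hypothetical metric and labels: 5xx responses count as errors
        metric: http_requests_total{job="example-api",code=~"5.."}
      total:
        metric: http_requests_total{job="example-api"}
```

A window much longer than the Prometheus retention (6h in the configuration values) cannot be fully backed by stored samples, so the window should be chosen with retention in mind.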

Discovery is driven by ServiceMonitor CRDs, and scrape intervals are tuned per target (e.g., 30s for CNI metrics, 60s for application metrics). This balances visibility against resource usage.
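Per-target intervals are set on the ServiceMonitor endpoint. The sketch below shows a hypothetical CNI agent scraped every 30s. All names and labels are placeholders, not Cilium's actual manifests.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-cni-agent           # hypothetical name
  namespace: kube-system            # assumed namespace
  labels:
    prometheus: kube-prometheus     # required by the serviceMonitorSelector
spec:
  selector:
    matchLabels:
      k8s-app: example-cni-agent    # hypothetical Service label
  endpoints:
    - port: metrics
      interval: 30s                 # overrides the global 60s scrapeInterval
      scrapeTimeout: 10s            # set explicitly so it stays below the 30s interval
```

Endpoints that omit `interval` fall back to the global `scrapeInterval` (60s in the configuration values).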

Prometheus doesn’t currently drive any HorizontalPodAutoscalers (HPAs), meaning metrics are used for passive observability rather than active scaling. This is an opportunity for future optimization.

See Observability Model for the complete observability architecture.

Configuration Values

kube-prometheus-stack

Version: 77.14.0

Component Information

Property          Value
Chart Version     77.14.0
Chart Type        (empty)
Upstream Project  N/A


The following table lists the configurable parameters:

Values

Key Type Default Description
alertmanager.alertmanagerSpec.priorityClassName string "platform-observability" Priority class for Alertmanager pod
alertmanager.alertmanagerSpec.resources.limits.cpu string "100m" CPU limit
alertmanager.alertmanagerSpec.resources.limits.memory string "128Mi" Memory limit
alertmanager.alertmanagerSpec.resources.requests.cpu string "25m" CPU request
alertmanager.alertmanagerSpec.resources.requests.memory string "64Mi" Memory request
alertmanager.enabled bool true Enable Alertmanager for alert routing (required for Pyrra burn-rate alerts)
crds object {"enabled":false} Disables the installation of CRDs, as they are managed separately.
grafana."grafana.ini" object {"users":{"allow_sign_up":false,"default_theme":"dark"}} Advanced Grafana configuration via grafana.ini
grafana."grafana.ini".users.allow_sign_up bool false Disables the user sign-up page.
grafana."grafana.ini".users.default_theme string "dark" Set the default UI theme to dark.
grafana.additionalDataSources list [{"access":"proxy","isDefault":false,"name":"Loki","type":"loki","url":"http://loki.observability.svc.cluster.local:3100"}] Adds Loki as an additional, non-default Grafana datasource.
grafana.admin object {"existingSecret":"grafana-admin-credentials","passwordKey":"admin-password","userKey":"admin-user"} Reads the Grafana admin credentials from an existing Secret, synced from Vault by External Secrets Operator (ESO).
grafana.persistence object {"accessModes":["ReadWriteOnce"],"enabled":true,"size":"1Gi","type":"pvc"} Enable persistence for dashboards and settings
grafana.plugins list ["grafana-piechart-panel","grafana-polystat-panel","marcusolsson-json-datasource"] Automatically install useful plugins on startup.
grafana.priorityClassName string "platform-dashboards" Priority class for Grafana pod
grafana.resources.limits.cpu string "250m" CPU limit
grafana.resources.limits.memory string "256Mi" Memory limit
grafana.resources.requests.cpu string "50m" CPU request
grafana.resources.requests.memory string "128Mi" Memory request
grafana.sidecar object {"dashboards":{"enabled":true,"label":"grafana_dashboard","labelValue":""}} Sidecar to automatically discover and load dashboards from ConfigMaps.
grafana.sidecar.dashboards.labelValue string "" An empty labelValue searches for the presence of the label, regardless of its value.
kube-state-metrics object {"extraArgs":["--resources=cronjobs,daemonsets,deployments,jobs,namespaces,networkpolicies,nodes,persistentvolumeclaims,persistentvolumes,pods,services,statefulsets,storageclasses"],"priorityClassName":"platform-observability","prometheus":{"monitor":{"metricRelabelings":[{"action":"labeldrop","regex":"uid"},{"action":"labeldrop","regex":"container_id"},{"action":"labeldrop","regex":"image_id"}]}},"resources":{"limits":{"cpu":"50m","memory":"128Mi"},"requests":{"cpu":"25m","memory":"64Mi"}}} kube-state-metrics settings: collected resource types, priority class, metric relabelings, and resource limits and requests.
kube-state-metrics.extraArgs list ["--resources=cronjobs,daemonsets,deployments,jobs,namespaces,networkpolicies,nodes,persistentvolumeclaims,persistentvolumes,pods,services,statefulsets,storageclasses"] Enable only relevant resource types (allowlist approach)
kube-state-metrics.priorityClassName string "platform-observability" Priority class
kube-state-metrics.prometheus.monitor.metricRelabelings list [{"action":"labeldrop","regex":"uid"},{"action":"labeldrop","regex":"container_id"},{"action":"labeldrop","regex":"image_id"}] Drop high-cardinality labels
kube-state-metrics.resources.limits.cpu string "50m" CPU limit
kube-state-metrics.resources.limits.memory string "128Mi" Memory limit
kube-state-metrics.resources.requests.cpu string "25m" CPU request
kube-state-metrics.resources.requests.memory string "64Mi" Memory request
prometheus-node-exporter object {"extraArgs":["--collector.disable-defaults","--collector.cpu","--collector.cpufreq","--collector.meminfo","--collector.diskstats","--collector.filesystem","--collector.netdev","--collector.loadavg","--collector.pressure","--collector.vmstat","--collector.stat","--collector.uname"],"priorityClassName":"platform-observability","resources":{"limits":{"cpu":"30m","memory":"48Mi"},"requests":{"cpu":"15m","memory":"24Mi"}}} node-exporter settings: enabled collectors, priority class, and resource limits and requests.
prometheus-node-exporter.extraArgs list ["--collector.disable-defaults","--collector.cpu","--collector.cpufreq","--collector.meminfo","--collector.diskstats","--collector.filesystem","--collector.netdev","--collector.loadavg","--collector.pressure","--collector.vmstat","--collector.stat","--collector.uname"] Minimal collector set optimized for k3d
prometheus-node-exporter.priorityClassName string "platform-observability" Priority class
prometheus-node-exporter.resources.limits.cpu string "30m" CPU limit
prometheus-node-exporter.resources.limits.memory string "48Mi" Memory limit
prometheus-node-exporter.resources.requests.cpu string "15m" CPU request
prometheus-node-exporter.resources.requests.memory string "24Mi" Memory request
prometheus.priorityClassName string "platform-observability" Priority class for Prometheus pod
prometheus.prometheusSpec.podMonitorNamespaceSelector object {} Empty selector: discover PodMonitors in all namespaces
prometheus.prometheusSpec.podMonitorSelector object {"matchLabels":{"prometheus":"kube-prometheus"}} Select PodMonitors similarly (if used by components)
prometheus.prometheusSpec.resources.limits.cpu string "250m" CPU limit
prometheus.prometheusSpec.resources.limits.memory string "512Mi" Memory limit
prometheus.prometheusSpec.resources.requests.cpu string "100m" CPU request
prometheus.prometheusSpec.resources.requests.memory string "384Mi" Memory request
prometheus.prometheusSpec.retention string "6h" Metrics retention time.
prometheus.prometheusSpec.ruleNamespaceSelector object {} Empty selector: load PrometheusRules from all namespaces
prometheus.prometheusSpec.ruleSelector object {} Select PrometheusRules from any namespace (needed for Pyrra rules)
prometheus.prometheusSpec.scrapeInterval string "60s" Global scrape interval for all ServiceMonitors (unless overridden).
prometheus.prometheusSpec.scrapeTimeout string "40s" Global scrape timeout for all ServiceMonitors (unless overridden).
prometheus.prometheusSpec.serviceMonitorNamespaceSelector object {} Empty selector: discover ServiceMonitors in all namespaces
prometheus.prometheusSpec.serviceMonitorSelector object {"matchLabels":{"prometheus":"kube-prometheus"}} Select ServiceMonitors with label prometheus=kube-prometheus across all namespaces
prometheus.prometheusSpec.storageSpec object {"volumeClaimTemplate":{"spec":{"accessModes":["ReadWriteOnce"],"resources":{"requests":{"storage":"1Gi"}}}}} Enable persistence for Prometheus TSDB. 1Gi supports 6h retention for ~50 pods with 4x overhead margin. Data survives pod restarts but is lost on cluster destruction.
prometheusOperator object {"priorityClassName":"platform-observability","resources":{"limits":{"cpu":"50m","memory":"64Mi"},"requests":{"cpu":"25m","memory":"32Mi"}}} Prometheus Operator settings: priority class and resource limits and requests.
prometheusOperator.priorityClassName string "platform-observability" Priority class
prometheusOperator.resources.limits.cpu string "50m" CPU limit
prometheusOperator.resources.limits.memory string "64Mi" Memory limit
prometheusOperator.resources.requests.cpu string "25m" CPU request
prometheusOperator.resources.requests.memory string "32Mi" Memory request
windows-exporter object {"enabled":false} Disables the Windows exporter, which this deployment does not need.
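The dotted keys in the table are paths into a nested values file. As a sketch, the Prometheus settings listed above would look like this in a values override. Only keys from the table are shown. If the chart is consumed as a dependency of a wrapper chart, these keys would sit under a `kube-prometheus-stack:` key instead (an assumption about how the chart is packaged).

```yaml
prometheus:
  prometheusSpec:
    scrapeInterval: 60s          # global default; ServiceMonitor endpoints may override
    scrapeTimeout: 40s
    retention: 6h
    # An empty namespace selector matches all namespaces.
    serviceMonitorNamespaceSelector: {}
    podMonitorNamespaceSelector: {}
    ruleNamespaceSelector: {}
    # Only monitors carrying this label are picked up.
    serviceMonitorSelector:
      matchLabels:
        prometheus: kube-prometheus
    podMonitorSelector:
      matchLabels:
        prometheus: kube-prometheus
    ruleSelector: {}
    resources:
      requests:
        cpu: 100m
        memory: 384Mi
      limits:
        cpu: 250m
        memory: 512Mi
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 1Gi
```

One detail worth checking against the upstream chart: kube-prometheus-stack also exposes `prometheus.prometheusSpec.ruleSelectorNilUsesHelmValues` (default `true`). When it is `true`, an empty `ruleSelector` is replaced by a selector on the Helm release label, so it would need to be `false` for `ruleSelector: {}` to match PrometheusRules regardless of their labels.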