---
hide:
  - toc
---

# Metrics

The External Secrets Operator exposes its Prometheus metrics under the `/metrics` path. To enable it, set the `serviceMonitor.enabled` Helm flag to `true`. In addition, you can set `webhook.serviceMonitor.enabled=true` and `certController.serviceMonitor.enabled=true` to create `ServiceMonitor` resources for the other components.

If you are using a different monitoring tool that also needs a `/metrics` endpoint, you can set the `metrics.service.enabled` Helm flag to `true`. In addition, you can set `webhook.metrics.service.enabled` and `certController.metrics.service.enabled` to scrape the other components.
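As a sketch, the Helm flags above can be collected in a values file instead of passing individual `--set` flags (the release and chart names in the usage note are illustrative):

```yaml
# values.yaml — enable ServiceMonitor resources for all three components
serviceMonitor:
  enabled: true
webhook:
  serviceMonitor:
    enabled: true
certController:
  serviceMonitor:
    enabled: true
```

Applied with something like `helm upgrade --install external-secrets external-secrets/external-secrets -f values.yaml`.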

The Operator has the metrics inherited from Kubebuilder plus some custom metrics with the `externalsecret` prefix.

## External Secret Metrics

| Name | Type | Description |
|------|------|-------------|
| `externalsecret_provider_api_calls_count` | Counter | Number of API calls made to an upstream secret provider API. The metric provides `provider`, `call`, and `status` labels. |
| `externalsecret_sync_calls_total` | Counter | Total number of External Secret sync calls |
| `externalsecret_sync_calls_error` | Counter | Total number of External Secret sync errors |
| `externalsecret_status_condition` | Gauge | The status condition of a specific External Secret |
| `externalsecret_reconcile_duration` | Gauge | The duration time to reconcile the External Secret |
| `controller_runtime_reconcile_total` | Counter | Total number of reconciliations per controller. It has two labels: `controller` refers to the controller name and `result` refers to the reconcile result, i.e. `success`, `error`, `requeue`, `requeue_after`. |
| `controller_runtime_reconcile_errors_total` | Counter | Total number of reconcile errors per controller |
| `controller_runtime_reconcile_time_seconds` | Histogram | Length of time per reconcile per controller |
| `controller_runtime_reconcile_queue_length` | Gauge | Length of reconcile queue per controller |
| `controller_runtime_max_concurrent_reconciles` | Gauge | Maximum number of concurrent reconciles per controller |
| `controller_runtime_active_workers` | Gauge | Number of currently used workers per controller |
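For example, the provider API call counter can be turned into a per-provider error ratio using its `status` label (a sketch; the set of `provider` label values depends on the providers you have configured):

```
sum(rate(externalsecret_provider_api_calls_count{status="error"}[5m])) by (provider)
/
sum(rate(externalsecret_provider_api_calls_count[5m])) by (provider)
```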

## Dashboard

We provide a Grafana Dashboard that gives you an overview of External Secrets Operator:

*ESO Dashboard (screenshot)*

## Service Level Indicators and Alerts

We find the following Service Level Indicators (SLIs) useful when operating ESO. They should give you a good starting point and hints to develop your own Service Level Objectives (SLOs).

### Webhook HTTP Status Codes

The webhook HTTP status code indicates whether an HTTP request was answered successfully. If the webhook pod is not able to serve requests properly, that failure may cascade down to the controller or any other client of the kube-apiserver.

SLI Example: request error percentage.

```
sum(increase(controller_runtime_webhook_requests_total{service=~"external-secrets.*",code="500"}[1m]))
/
sum(increase(controller_runtime_webhook_requests_total{service=~"external-secrets.*"}[1m]))
```
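This SLI can be wrapped in a Prometheus alerting rule. A sketch follows; the 5% threshold, the 5m window, and the rule names are illustrative, not a recommendation:

```yaml
groups:
  - name: external-secrets-webhook
    rules:
      - alert: ESOWebhookErrorRateHigh
        expr: |
          sum(increase(controller_runtime_webhook_requests_total{service=~"external-secrets.*",code="500"}[5m]))
          /
          sum(increase(controller_runtime_webhook_requests_total{service=~"external-secrets.*"}[5m]))
          > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "External Secrets webhook error rate above 5% for 10 minutes"
```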

### Webhook HTTP Request Latency

If the webhook server is not able to respond in time, the client may hit a timeout. This failure may cascade down to the controller or any other client of the kube-apiserver.

SLI Example: p99 across all webhook requests.

```
histogram_quantile(0.99,
  sum(rate(controller_runtime_webhook_latency_seconds_bucket{service=~"external-secrets.*"}[5m])) by (le)
)
```

### Controller Workqueue Depth

If the workqueue depth is > 0 for an extended period of time, this is an indicator that the controller is not able to reconcile resources in time, i.e. delivery of secret updates is delayed.

Note: when a controller is restarted, the queue length equals the total number of resources. Make sure to measure the time it takes for the controller to fully reconcile all secrets after a restart. In large clusters this may take a while; define an acceptable timeframe to fully reconcile all resources.

```
sum(
  workqueue_depth{service=~"external-secrets.*"}
) by (name)
```
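A sketch of an alert on sustained queue depth; the `for:` window below is illustrative and should be longer than the post-restart full-reconcile time you measured above:

```yaml
- alert: ESOWorkqueueBacklog
  expr: sum(workqueue_depth{service=~"external-secrets.*"}) by (name) > 0
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "ESO workqueue {{ $labels.name }} has had a backlog for 15 minutes"
```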

### Controller Reconcile Latency

The controller should be able to reconcile resources within a reasonable timeframe. When latency is high, secret delivery may be impacted.

SLI Example: p99 across all controllers.

```
histogram_quantile(0.99,
  sum(rate(controller_runtime_reconcile_time_seconds_bucket{service=~"external-secrets.*"}[5m])) by (le)
)
```

### Controller Reconcile Error

The controller should be able to reconcile resources without errors. When errors occur, secret delivery may be impacted, which could cascade down to the secret-consuming applications.

```
sum(increase(
  controller_runtime_reconcile_total{service=~"external-secrets.*",controller=~"$controller"}[1m])
) by (result)
```
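The same counter can be reduced to an error ratio using the `result` label (a sketch; `$controller` is a Grafana dashboard variable, as in the query above):

```
sum(increase(controller_runtime_reconcile_total{service=~"external-secrets.*",result="error"}[5m]))
/
sum(increase(controller_runtime_reconcile_total{service=~"external-secrets.*"}[5m]))
```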