The External Secrets Operator exposes its Prometheus metrics at the `/metrics` path. To enable them, set the `serviceMonitor.enabled` Helm flag to `true`.
If you are using a different monitoring tool that also needs a `/metrics` endpoint, set the `metrics.service.enabled` Helm flag to `true`. You can also set `webhook.metrics.service.enabled` and `certController.metrics.service.enabled` to scrape the other components.
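For reference, here are the same flags expressed as a minimal `values.yaml` sketch (the paths are taken directly from the flag names above; enable only the ones that apply to your monitoring setup and verify them against your chart version):

```
serviceMonitor:
  enabled: true
metrics:
  service:
    enabled: true
webhook:
  metrics:
    service:
      enabled: true
certController:
  metrics:
    service:
      enabled: true
```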
The Operator exposes [the controller-runtime metrics inherited from kubebuilder](https://book.kubebuilder.io/reference/metrics-reference.html) plus custom metrics prefixed with the resource name, such as `externalsecret_`:

| Name | Type | Description |
|------|------|-------------|
| `externalsecret_provider_api_calls_count` | Counter | Number of API calls made to an upstream secret provider API. The metric provides `provider`, `call`, and `status` labels. |
| `externalsecret_sync_calls_total` | Counter | Total number of External Secret sync calls |
| `externalsecret_sync_calls_error` | Counter | Total number of External Secret sync errors |
| `externalsecret_status_condition` | Gauge | The status condition of a specific External Secret |
| `externalsecret_reconcile_duration` | Gauge | The duration to reconcile the External Secret |
| `secretstore_status_condition` | Gauge | The status condition of a specific Secret Store |
| `secretstore_reconcile_duration` | Gauge | The duration to reconcile the Secret Store |
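As a starting point for alerting on these custom metrics, the sketch below flags External Secrets that produced sync errors within the last 10 minutes. It assumes the error counter carries `name` and `namespace` labels identifying the External Secret; verify the labels exposed by your installation before using it:

```
sum(increase(externalsecret_sync_calls_error[10m])) by (name, namespace) > 0
```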
## Controller Runtime Metrics
See [the kubebuilder documentation](https://book.kubebuilder.io/reference/metrics-reference.html) on the default exported metrics by controller-runtime.
We provide a [Grafana Dashboard](https://raw.githubusercontent.com/external-secrets/external-secrets/main/docs/snippets/dashboard.json) that gives you an overview of External Secrets Operator:
![ESO Dashboard](../pictures/eso-dashboard-1.png)
![ESO Dashboard](../pictures/eso-dashboard-2.png)
## Service Level Indicators and Alerts
We find the following Service Level Indicators (SLIs) useful when operating ESO. They should give you a good starting point and hints to develop your own Service Level Objectives (SLOs).
#### Webhook HTTP Status Codes
The webhook HTTP status code indicates whether an HTTP request was answered successfully or not.
If the webhook pod is not able to serve requests properly, or cannot respond in time and the client runs into a timeout, that failure may cascade down to the controller or any other user of `kube-apiserver`.
SLI Example: p99 latency across all webhook requests.
```
histogram_quantile(0.99,
sum(rate(controller_runtime_webhook_latency_seconds_bucket{service=~"external-secrets.*"}[5m])) by (le)
)
```
#### Controller Workqueue Depth
If the workqueue depth is greater than zero for an extended period of time, the controller is not able to reconcile resources in time, i.e. delivery of secret updates is delayed.
Note: when a controller is restarted, the queue length equals the total number of resources. Make sure to measure the time it takes for the controller to fully reconcile all secrets after a restart. In large clusters this may take a while, so define an acceptable timeframe for a full reconciliation of all resources.
```
sum(
workqueue_depth{service=~"external-secrets.*"}
) by (name)
```
#### Controller Reconcile Latency
The controller should be able to reconcile resources within a reasonable timeframe. When latency is high, secret delivery may be impacted.
SLI Example: p99 across all controllers.
```
histogram_quantile(0.99,
sum(rate(controller_runtime_reconcile_time_seconds_bucket{service=~"external-secrets.*"}[5m])) by (le)
)
```
#### Controller Reconcile Error
The controller should be able to reconcile resources without errors. When errors occur, secret delivery may be impacted, which could cascade down to the secret-consuming applications.
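SLI Example: reconcile error rate across all controllers, based on the controller-runtime `controller_runtime_reconcile_errors_total` counter (adjust the `service` selector to match your installation):

```
sum(rate(controller_runtime_reconcile_errors_total{service=~"external-secrets.*"}[5m])) by (controller)
```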