diff --git a/Documentation/proposals/202411-zone-aware-sharding.md b/Documentation/proposals/202411-zone-aware-sharding.md
new file mode 100644
index 000000000..080a3abcf
--- /dev/null
+++ b/Documentation/proposals/202411-zone-aware-sharding.md
@@ -0,0 +1,387 @@

# Zone-aware sharding for Prometheus

* **Owners:**
  * [arnecls](https://github.com/arnecls)

* **Related Tickets:**
  * [#6437](https://github.com/prometheus-operator/prometheus-operator/issues/6437)

* **Other docs:**
  * [Well-known Kubernetes labels](https://kubernetes.io/docs/reference/labels-annotations-taints/#topologykubernetesiozone)
  * [AWS zone names](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html#concepts-availability-zones)
  * [GCP zone names](https://cloud.google.com/compute/docs/regions-zones#available)
  * [Shard Autoscaling](https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/proposals/202310-shard-autoscaling.md) design proposal

This proposal describes how we can implement zone-aware sharding by adding
support for custom labels and zone configuration options to the existing
Prometheus configuration resources.

## Why

When running large, multi-zone clusters, Prometheus scraping can lead to a
significant increase in inter-zone traffic costs. A solution would be to deploy
one Prometheus shard per zone and configure each shard to scrape only the
targets local to its zone. The current sharding implementation cannot solve
this, though: while it is possible to customize the label (`__address__` by
default) used for distributing targets to the Prometheus instances, there is no
way to configure a single Prometheus resource so that each shard is bound to a
specific zone.

## Goals

* Define a set of configuration options required to allow zone-aware sharding
* Define the relabel configuration to be generated for zone-aware sharding
* Schedule Prometheus pods to their respective zones
* Stay backwards compatible with the current mechanism by default

## Non-goals

* Implement mechanisms to automatically fix user configuration errors
* Support mixed environments (where both Kubernetes and non-Kubernetes targets
  are scraped)
* Support Kubernetes clusters before 1.26 (topology label support)
* Implement zone-aware scraping for targets defined via
  `.spec.additionalScrapeConfigs` and `ScrapeConfig` custom resources

## How

> [!NOTE]
> Due to the size of this feature, it will be placed behind a
> [feature gate](https://github.com/prometheus-operator/prometheus-operator/blob/main/pkg/operator/feature_gates.go)
> to allow incremental testing.

### Algorithm

In order to calculate a stable assignment, the following parameters are
required:

1. `num_shards`: The number of Prometheus shards
2. `shard_index`: A number in the range `[0..num_shards-1]` identifying a
   single Prometheus shard
3. `zones`: A list of the zones to be scraped
4. `zone_label`: A label denoting the zone of a target
5. `address`: The content of the `__address__` label

Note that `zone_label` is expected to contain a value from the `zones` list.
The `num_shards` value refers to the `.spec.shards` field that is already
available in the [Prometheus custom resource definition](https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/api.md#monitoring.coreos.com/v1.Prometheus).
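For illustration only, these inputs could be grouped into a small helper type.
The type and field names below are invented for this sketch and are not part of
the proposed API:

```go
// shardingInputs collects the values needed to decide whether a target is
// scraped by a given Prometheus shard (illustrative sketch only).
type shardingInputs struct {
    NumShards  int      // .spec.shards
    ShardIndex int      // index of this shard, in the range [0..NumShards-1]
    Zones      []string // list of zones to be scraped
    ZoneLabel  string   // value of the topology label attached to a target
    Address    string   // content of the target's __address__ label
}
```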
Given these values, whether a target is scraped by a given Prometheus instance
is decided by the following algorithm:

```go
assert(num_shards >= len(zones))     // Error: zone(s) will not get scraped
assert(num_shards % len(zones) == 0) // Warning: some targets scraped by more than one shard

shards_per_zone := max(1, floor(num_shards / len(zones)))
prom_zone_idx := shard_index % len(zones)

if zone_label == zones[prom_zone_idx] {

    assignment_idx := floor(shard_index / len(zones)) % shards_per_zone

    if hash(address) % shards_per_zone == assignment_idx {
        do_scrape()
    }
}
```

By using modulo to calculate `prom_zone_idx`, instances are distributed to
zones in an alternating `A,B,C,A,B,C` fashion. This allows the `num_shards`
value to be modified without redistributing shards or data.
This was preferred over optimizing for changes to the number of `zones`, as
those are less likely to happen.

#### Edge cases

We have introduced asserts in the section above to warn about edge cases that
might lead to duplicate data or data loss.

With the above algorithm, Prometheus instances are distributed in an
alternating fashion by using the already existing shard index.
This leads to the following edge cases:

When `num_shards` is 10 and `len(zones)` is 3 (zones `[A..C]`),
`shards_per_zone` is 3. This yields the following distribution:

```text
0 1 2 | 3 4 5 | 6 7 8 | 9 | shard index
A B C | A B C | A B C | A | zone
0 0 0 | 1 1 1 | 2 2 2 | 0 | assignment index
```

In this case the second assert warns about double scraping in zone A, as the
same targets are assigned to both shard 0 and shard 9.

When `num_shards` is 2 and `len(zones)` is 3 (zones `[A..C]`),
`shards_per_zone` is 1. This yields the following distribution:

```text
0 1 - | shard index
A B C | zone
0 0 - | assignment index
```

In this case targets in zone C are not being scraped, as no shard is assigned
to that zone.

Both cases should lead to an error during reconciliation, causing the change to
not be rolled out. The first case (double scraping) is not as severe as a zone
not being scraped, but it is otherwise hard to spot.
Note that redundant scraping is supposed to be achieved through replicas, not
through overlapping shards.
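To make the distribution easier to verify, the following self-contained sketch
(illustrative only, not part of the operator code base) prints the zone and
assignment index for every shard in the first edge case:

```go
package main

import "fmt"

// Prints the shard-to-zone distribution for num_shards = 10 and
// zones = [A, B, C], matching the first edge case above.
func main() {
    zones := []string{"A", "B", "C"}
    numShards := 10

    shardsPerZone := numShards / len(zones) // integer division == floor
    if shardsPerZone < 1 {
        shardsPerZone = 1
    }

    for shardIndex := 0; shardIndex < numShards; shardIndex++ {
        zone := zones[shardIndex%len(zones)]
        assignmentIndex := (shardIndex / len(zones)) % shardsPerZone
        fmt.Printf("shard %d -> zone %s, assignment index %d\n",
            shardIndex, zone, assignmentIndex)
    }
}
```

Shards 0 and 9 both end up in zone A with assignment index 0, which is exactly
the double-scraping situation the second assert warns about.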
### Topology field discovery

The Kubernetes service discovery currently does not expose any topology field.
Such a field would have to be added, otherwise users would have to inject such
a label themselves.

Good candidates for such a field are the `topology.kubernetes.io/*` labels,
which should be present on all nodes.

There are two ways to handle this:

1. Change the Prometheus Kubernetes service discovery to add the required
   label to all targets.
2. Let the operator do this discovery and add a relabel rule based on the
   node name.

The second solution would require the operator to constantly update the relabel
configuration. This could lead to increased load on clusters with aggressive
autoscaling, as well as race conditions for pods on newly created nodes, since
the configuration change is neither atomic nor instant.

Because of that, a change to the Kubernetes service discovery is considered the
more stable, and thus preferable, solution. It will require additional
permissions for Prometheus in case it is not already allowed to read node
objects.

### API changes

> [!NOTE]
> This proposal is mutually exclusive with [DaemonSet mode](202405-agent-daemonset.md),
> as Prometheus always scrapes a single node in that case.
> Defining a `shardingStrategy` while DaemonSet mode is active should lead to
> a reconciliation error.

Following the algorithm presented above, we suggest adding the following
configuration options to the [Prometheus](https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/api.md#monitoring.coreos.com/v1.Prometheus)
and PrometheusAgent custom resource definitions.

All values used in this snippet are also the defaults for their corresponding
keys.

```yaml
spec:
  shardingStrategy:
    # Select a sharding mode. Can be 'Classic' or 'Topology'.
    # Defaults to 'Classic'.
    mode: 'Classic'

    # The following section is only valid if "mode" is set to "Topology".
    topology:
      # Prometheus external label used to communicate the topology zone.
      # If not defined, it defaults to "zone".
      # If set to an empty string, no external label is added to the
      # Prometheus configuration.
      externalLabelName: "zone"
      # All topology values to be used by the cluster, i.e. a list of all
      # zones in use.
      values: []
```

The `topology` section deliberately does not use the term `zone`. This avoids
API changes in case other topologies, like regions, need to be supported in
future releases.

Neither mode provides an explicit override of the label used for sharding.
This is already possible by generating a `__tmp_hash` label through
[scrape classes](https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/api.md#monitoring.coreos.com/v1.ScrapeClass).

In case of the `Topology` mode, two labels are used for sharding. One is used
to determine the correct topology of a target, the other one allows sharding
inside a specific topology (e.g. a zone).
The second label implements exactly the same mechanics as the `Classic` mode
and thus uses the same `__tmp_hash` override mechanism.
To allow overriding the topology determination label, a custom label named
`__tmp_topology` can be generated, following the same idea.

The `externalLabelName` label should be added by default to help with
debugging. It also gives some generally valuable insights for multi-zone
setups.

It is possible to change the `mode` field from `Classic` to `Topology` (and
vice-versa) without service interruption.
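For illustration, the corresponding Go API types could look roughly like the
sketch below. The exact type and field names are assumptions derived from the
YAML snippet above, not a final definition:

```go
// ShardingStrategy selects how targets are distributed across shards
// (sketch only; names are derived from the YAML snippet above).
type ShardingStrategy struct {
    // Mode is either "Classic" or "Topology". Defaults to "Classic".
    Mode string `json:"mode,omitempty"`
    // Topology is only evaluated when Mode is "Topology".
    Topology *TopologySharding `json:"topology,omitempty"`
}

// TopologySharding configures zone-aware sharding.
type TopologySharding struct {
    // ExternalLabelName is the Prometheus external label used to
    // communicate the topology zone. Defaults to "zone".
    ExternalLabelName *string `json:"externalLabelName,omitempty"`
    // Values lists all topology values (e.g. zones) used by the cluster.
    Values []string `json:"values,omitempty"`
}
```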
### Generated configuration

The following examples are based on the algorithm above.
Please note that `shard_index` has to be provided by the operator during
config generation.

We use a replica count of 2 in all examples to illustrate that this value
does not have any effect, as both replicas will have the same `shard_index`
assigned.

#### Classic mode

Given the following configuration:

```yaml
spec:
  shards: 4
  replicas: 2
  shardingStrategy:
    mode: 'Classic'
```

we would get the following output for `shard_index == 2`:

```yaml
- source_labels: ['__address__', '__tmp_hash']
  target_label: '__tmp_hash'
  regex: '(.+);'
  replacement: '$1'
  action: 'replace'
- source_labels: ['__tmp_hash']
  target_label: '__tmp_hash'
  modulus: 4
  action: 'hashmod'
- source_labels: ['__tmp_hash']
  regex: '2' # shard_index
  action: 'keep'
```

#### Topology mode

Given the following configuration:

```yaml
spec:
  shards: 4
  replicas: 2
  shardingStrategy:
    mode: 'Topology'
    topology:
      values:
        - 'europe-west4-a'
        - 'europe-west4-b'
```

we would get the following output for `shard_index == 2`:

```yaml
# zones := shardingStrategy.topology.values
# shards_per_zone := max(1, floor(shards / len(zones)))

# Topology determination
- source_labels: ['__meta_kubernetes_endpointslice_endpoint_zone', '__tmp_topology']
  target_label: '__tmp_topology'
  regex: '(.+);'
- source_labels: ['__meta_kubernetes_node_label_topology_kubernetes_io_zone', '__meta_kubernetes_node_labelpresent_topology_kubernetes_io_zone', '__tmp_topology']
  regex: '(.+);true;'
  target_label: '__tmp_topology'
- source_labels: ['__tmp_topology']
  regex: 'europe-west4-a' # zones[shard_index % len(zones)]
  action: 'keep'

# In-topology sharding
- source_labels: ['__address__', '__tmp_hash']
  target_label: '__tmp_hash'
  regex: '(.+);'
  action: 'replace'
- source_labels: ['__tmp_hash']
  target_label: '__tmp_hash'
  modulus: 4
  action: 'hashmod'
- source_labels: ['__tmp_hash']
  regex: '1' # floor(shard_index / len(zones)) % shards_per_zone
  action: 'keep'
```

> [!NOTE]
> Node metadata needs to be attached when using certain monitors so that the
> `__meta_kubernetes_node_label_*` labels above are available.
> This requires an additional flag in the `kubernetes_sd_configs` section:
>
> ```yaml
> kubernetes_sd_configs:
> - attach_metadata:
>     node: true
> ```

### Prometheus instance zone assignment

To make sure that Prometheus instances are deployed to the zone of their
assigned targets, the operator needs to generate a [node affinity](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#node-affinity)
or a [node selector](https://kubernetes.io/docs/tasks/configure-pod-container/assign-pods-nodes/#create-a-pod-that-gets-scheduled-to-your-chosen-node).

As node selectors are simpler to manage, and node affinities might run into
ordering issues when a user defines their own affinities, node selectors should
be used.
If a `nodeSelector` has already been defined, it will be merged with the node
selector generated here. If the same key is used, the user-provided value will
be replaced with the generated value.
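A minimal sketch of this merge behavior, assuming a hypothetical helper inside
the operator (the function name and signature are invented for illustration):

```go
// mergeNodeSelector overlays the generated zone selector on top of a
// user-provided nodeSelector; on key conflicts the generated value wins.
// Hypothetical helper, for illustration only.
func mergeNodeSelector(userSelector map[string]string, zone string) map[string]string {
    merged := make(map[string]string, len(userSelector)+1)
    for key, value := range userSelector {
        merged[key] = value
    }
    merged["topology.kubernetes.io/zone"] = zone
    return merged
}
```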
Given this input:

```yaml
spec:
  nodeSelector:
    'foo': 'bar'
    'topology.kubernetes.io/zone': 'will be replaced'
  shards: 4
  replicas: 2
  scrapeClasses:
    - name: 'topology'
      default: true
      attachMetadata:
        node: true
  shardingStrategy:
    mode: 'Topology'
    topology:
      values:
        - 'europe-west4-a'
        - 'europe-west4-b'
```

The following snippet would be generated for `shard_index == 2`:

```yaml
# zones := shardingStrategy.topology.values
# shards_per_zone := max(1, floor(shards / len(zones)))
spec:
  nodeSelector:
    # An existing nodeSelector entry for 'topology.kubernetes.io/zone' will be
    # replaced with the generated value:
    # zones[shard_index % len(zones)]
    'topology.kubernetes.io/zone': 'europe-west4-a'
    # All other existing nodeSelector entries will be kept.
    'foo': 'bar'
```

## Alternatives

We could allow users to define the complete relabel and node selector logic
themselves. This would be more flexible, but also much harder to configure.

By abstracting this into `shardingStrategy`, we can cover the most common cases
without requiring users to have deep knowledge of Prometheus relabel
configuration.

A field `additionalRelabelConfig` was discussed to allow arbitrary logic to be
added before the sharding configuration. It was decided that this would
duplicate the functionality of [scrape classes](https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/api.md#monitoring.coreos.com/v1.ScrapeClass)
found in, e.g., the [Prometheus](https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/api.md#prometheusspec)
custom resource definition.

The use of `sourceLabel` fields instead of the `__tmp_*` label mechanic was
discussed. It was agreed not to introduce new fields, as an existing feature
would have to be changed and the number of API fields would grow further.

An override for the topology node label was discussed. This field was not
added as there was no clear use case yet. The general structure was kept so
that it will still be possible to add such a field in a future release.

## Action Plan

A rough plan of the steps required to implement the feature:

1. Add the `PrometheusTopologySharding` feature gate.
2. Implement the API changes with pre-flight validations (see the validation
   sketch below).
3. Implement the node selector update when `mode: Topology`.
4. Implement the external label name when `mode: Topology`.
5. Implement the target sharding when `mode: Topology`.
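To illustrate the pre-flight validations mentioned in step 2, a check along the
following lines could reject shard/zone combinations that would leave zones
unscraped. This is a sketch under assumed names, not the actual implementation:

```go
package validation

import "fmt"

// validateTopologySharding mirrors the asserts from the algorithm section:
// it rejects configurations in which some zones would not be scraped and
// reports when targets would be scraped by more than one shard.
// Sketch only; not the actual operator code.
func validateTopologySharding(shards int, zones []string) (warning string, err error) {
    if len(zones) == 0 {
        return "", fmt.Errorf("topology.values must not be empty when mode is Topology")
    }
    if shards < len(zones) {
        return "", fmt.Errorf("%d shard(s) cannot cover %d zones: some zones would not be scraped", shards, len(zones))
    }
    if shards%len(zones) != 0 {
        warning = fmt.Sprintf("%d shards do not divide evenly across %d zones: some targets will be scraped by two shards", shards, len(zones))
    }
    return warning, nil
}
```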