Zone aware sharding for Prometheus

This proposal describes how we can implement zone-aware sharding by adding support for custom labels and zone configuration options to the existing Prometheus configuration resources.

Why

When running large, multi-zone clusters, Prometheus scraping can lead to an increase in inter-zone traffic costs. A solution would be to deploy one Prometheus shard per zone and configure each shard to scrape only the targets local to its zone. The current sharding implementation can't solve this, though: while it's possible to customize the label (__address__ by default) used for distributing targets across the Prometheus instances, there's no way to configure a single Prometheus resource so that each shard is bound to a specific zone.

Goals

  • Define the set of configuration options required to allow zone-aware sharding.
  • Define the relabel configuration to be generated for zone-aware sharding.
  • Schedule Prometheus pods to their respective zones.
  • Stay backward compatible with the current mechanism by default.

Non-goals

  • Implement mechanisms to automatically fix configuration errors made by the user.
  • Support mixed environments (where both Kubernetes and non-Kubernetes targets are scraped).
  • Support Kubernetes clusters older than 1.26 (topology label support).
  • Implement zone-aware scraping for targets defined via .spec.additionalScrapeConfigs and ScrapeConfig custom resources.

How

Note

Due to the size of this feature, it will be placed behind a feature gate to allow incremental testing.

Algorithm

In order to calculate a stable assignment, the following parameters are required:

  1. num_shards: The number of Prometheus shards
  2. shard_index: A number in the range [0..num_shards-1] identifying a single Prometheus shard (all replicas of a shard share the same index)
  3. zones: A list of the zones to be scraped
  4. zone_label: A label denoting the zone of a target
  5. address: The content of the __address__ label

Note that zone_label is expected to contain a value from the zones list. The num_shards value refers to the existing .spec.shards field of the Prometheus custom resource definition.

Given these values, whether a target is scraped by a given Prometheus instance is determined by the following algorithm:

assert(num_shards >= len(zones))     # Error: zone(s) will not get scraped
assert(num_shards % len(zones) == 0) # Warning: multi-scraping of instances

shards_per_zone := max(1, floor(num_shards / len(zones)))
prom_zone_idx := shard_index % len(zones)

if zone_label == zones[prom_zone_idx] {

    assignment_idx := floor(shard_index / len(zones)) % shards_per_zone  # wraps around when num_shards % len(zones) != 0

    if hash(address) % shards_per_zone == assignment_idx {
        do_scrape()
    }
}

By using modulo to calculate prom_zone_idx, instances are distributed to zones in an alternating fashion (A,B,C,A,B,C and so on). This allows the num_shards value to be changed without redistributing shards or data. This was preferred over optimizing for changes to the number of zones, as those are less likely to happen.
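
For illustration, the following Go sketch reproduces the algorithm above and prints the resulting distribution. It is only a sketch: the function and variable names are made up for this example, and FNV-1a merely stands in for the hash used by Prometheus' hashmod action.

package main

import (
	"fmt"
	"hash/fnv"
)

// scrapedByShard reports whether the shard with the given index should scrape a
// target, following the algorithm above. Names are illustrative only, and FNV-1a
// is just a placeholder for the hash applied by Prometheus' hashmod action.
func scrapedByShard(shardIndex, numShards int, zones []string, targetZone, address string) bool {
	shardsPerZone := numShards / len(zones)
	if shardsPerZone < 1 {
		shardsPerZone = 1
	}
	// Shards are assigned to zones in an alternating fashion: A,B,C,A,B,C,...
	if targetZone != zones[shardIndex%len(zones)] {
		return false
	}
	// Position of this shard inside its zone; it wraps around when num_shards
	// is not a multiple of len(zones), causing the double-scraping edge case.
	assignmentIdx := (shardIndex / len(zones)) % shardsPerZone
	h := fnv.New64a()
	h.Write([]byte(address))
	return h.Sum64()%uint64(shardsPerZone) == uint64(assignmentIdx)
}

func main() {
	// Print the distribution for num_shards=10 and zones [A..C] (see the
	// edge cases below).
	zones := []string{"A", "B", "C"}
	numShards := 10
	shardsPerZone := numShards / len(zones) // 3
	for shard := 0; shard < numShards; shard++ {
		fmt.Printf("shard %d -> zone %s, assignment index %d\n",
			shard, zones[shard%len(zones)], (shard/len(zones))%shardsPerZone)
	}
}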

Edge cases

We have introduced asserts in the above section to warn about edge cases that might lead to duplicate data or data loss.

With the above algorithm, Prometheus instances are distributed to zones in an alternating fashion using the already existing shard index. This leads to the following edge cases:

When num_shards is 10 and len(zones) is 3 (zones [A..C]), shards_per_zone is 3. This yields the following distribution:

0 1 2 | 3 4 5 | 6 7 8 | 9 | shard index
A B C | A B C | A B C | A | zone
0 0 0 | 1 1 1 | 2 2 2 | 0 | assignment index

In this case the second assert warns about double scraping of targets in zone A, as the same targets are assigned to both instance 0 and instance 9.

When num_shards is 2 and len(zones) is 3 (zones [A..C]), shards_per_zone is 1. This yields the following distribution:

0 1   | shard index
A B C | zone
0 0 0 | assignment index

In this case targets in zone C are not being scraped.

Both cases should lead to an error during reconciliation, causing the change not to be rolled out. The first case (double scraping) is not as severe as a zone not being scraped, but it is otherwise hard to spot. Note also that replicas, not additional shards, are the intended mechanism for redundant scraping.
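
A pre-flight validation along these lines could reject such configurations during reconciliation. This is only a sketch: the function name and error messages are made up for this example and are not actual operator code.

package main

import "fmt"

// validateTopologySharding sketches the pre-flight checks described above.
func validateTopologySharding(numShards int, zones []string) error {
	if len(zones) == 0 {
		return fmt.Errorf("shardingStrategy.topology.values must not be empty")
	}
	if numShards < len(zones) {
		// Some zones would have no shard assigned and would not be scraped.
		return fmt.Errorf("%d shard(s) cannot cover %d zones", numShards, len(zones))
	}
	if numShards%len(zones) != 0 {
		// The left-over shards wrap around and scrape targets a second time.
		return fmt.Errorf("%d shards do not divide evenly across %d zones", numShards, len(zones))
	}
	return nil
}

func main() {
	fmt.Println(validateTopologySharding(10, []string{"A", "B", "C"})) // double scraping
	fmt.Println(validateTopologySharding(2, []string{"A", "B", "C"}))  // zone C not scraped
}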

Topology field discovery

The Prometheus Kubernetes service discovery currently does not expose any topology field. Such a field would have to be added; otherwise users would have to inject such a label themselves.

Good candidates for such a field are the topology.kubernetes.io/* labels, which should be present on all nodes.

There are two ways to handle this:

  1. A change to the Prometheus Kubernetes service discovery to add the required label to all targets.
  2. The operator could perform this discovery itself and add a relabel rule based on the node name.

The second solution would require the operator to constantly update the relabel configuration. This could lead to increased load on clusters with aggressive autoscaling, as well as race conditions for pods on newly created nodes, since the config change is not atomic or instant.

Therefore, a change to the Kubernetes service discovery is considered the more stable, and thus preferable, solution. It will require additional permissions for Prometheus in case it is not already allowed to read node objects.

API changes

Note

This proposal is mutually exclusive with DaemonSet mode, as Prometheus always scrapes a single node in that case. Defining a shardingStrategy when DaemonSet mode is active should lead to a reconciliation error.

Following the algorithm presented above, we suggest adding the following configuration options to the Prometheus and PrometheusAgent custom resource definitions.

All values used in this snippet should also be the defaults for their corresponding keys.

spec:
  shardingStrategy:
    # Select a sharding mode. Can be 'Classic' or 'Topology'.
    # Defaults to 'Classic'.
    mode: 'Classic'

    # The following section is only valid if "mode" is set to "Topology".
    topology:
      # Prometheus external label used to communicate the topology zone.
      # If not defined, it defaults to "zone".
      # If defined to an empty string, no external label is added to the Prometheus configuration.
      externalLabelName: "zone"
      # All topology values to be used by the cluster, i.e. a list of all
      # zones in use.
      values: []

The topology section deliberately does not use the term zone. This prevents API changes in case other topologies, such as regions, need to be supported in future releases.

Neither mode contains an explicit override of the label used for sharding. This is already possible by generating a __tmp_hash label through scrape classes.

In Topology mode, two labels are used for sharding: one determines the topology of a target, the other allows sharding inside a specific topology (e.g. a zone). The second label implements exactly the same mechanics as Classic mode and thus uses the same __tmp_hash override mechanism. To allow overrides of the topology determination label, a custom label named __tmp_topology can be generated, following the same idea.

The externalLabelName should be added by default to help with debugging. It also provides generally valuable insight into multi-zone setups.

It is possible to change the mode field from Classic to Topology (and vice-versa) without service interruption.

Generated configuration

The following examples are based on the algorithm above. Please note that shard_index has to be provided by the operator during config generation.

We use a replica count of 2 in all examples to illustrate that this value does not have any effect, as both replicas will have the same shard_index assigned.

Classic mode

Given the following configuration

spec:
  shards: 4
  replicas: 2
  shardingStrategy:
    mode: 'Classic'

we would get the following output for shard_index == 2:

- source_labels: ['__address__', '__tmp_hash']
  target_label: '__tmp_hash'
  regex: '(.+);'
  replacement: '$1'
  action: 'replace'
- source_labels: ['__tmp_hash']
  target_label: '__tmp_hash'
  modulus: 4
  action: 'hashmod'
- source_labels: ['__tmp_hash']
  regex: '2'                    # shard_index
  action: 'keep'

Topology mode

Given the following configuration

spec:
  shards: 4
  replicas: 2
  shardingStrategy:
    mode: 'Topology'
    topology:
      values:
        - 'europe-west4-a'
        - 'europe-west4-b'

we would get the following output for shard_index == 2:

# zones := shardingStrategy.topology.values
# shards_per_zone := max(1, floor(shards / len(zones)))

# topology determination
- source_labels: ['__meta_kubernetes_endpointslice_endpoint_zone', '__tmp_topology']
  target_label: '__tmp_topology'
  regex: '(.+);'
- source_labels: ['__meta_kubernetes_node_label_topology_kubernetes_io_zone', '__meta_kubernetes_node_labelpresent_topology_kubernetes_io_zone', '__tmp_topology']
  regex: '(.+);true;'
  target_label: '__tmp_topology'
- source_labels: ['__tmp_topology']
  regex: 'europe-west4-a'          # zones[shard_index % len(zones)]
  action: 'keep'

# In-topology sharding
- source_labels: ['__address__', '__tmp_hash']
  target_label: '__tmp_hash'
  regex: '(.+);'
  replacement: '$1'
  action: 'replace'
- source_labels: ['__tmp_hash']
  target_label: '__tmp_hash'
  modulus: 2                       # shards_per_zone
  action: 'hashmod'
- source_labels: ['__tmp_hash']
  regex: '1'                       # floor(shard_index / len(zones)) % shards_per_zone
  action: 'keep'

Note

Node metadata needs to be attached for the node topology labels to be available on certain monitor types (e.g. pod and endpoint based targets). This requires an additional flag in the kubernetes_sd_configs section like this:

kubernetes_sd_configs:
  - attach_metadata:
      node: true

Prometheus instance zone assignment

To make sure that Prometheus instances are deployed to the correct zone (i.e. the zone of their assigned targets), we need to generate a node affinity or a node selector.

As node selectors are simpler to manage, and node affinities might run into ordering issues when users define their own affinities, node selectors are used. If a nodeSelector has already been defined, it is merged with the node selector generated here; if the same key is used, its value is replaced with the generated value.

Given this input:

spec:
  nodeSelector:
    'foo': 'bar'
    'topology.kubernetes.io/zone': 'will be replaced'
  shards: 4
  replicas: 2
  scrapeClasses:
    - name: 'topology'
      default: true
      attachMetadata:
        node: true
  shardingStrategy:
    mode: 'Topology'
    topology:
      values:
        - 'europe-west4-a'
        - 'europe-west4-b'

The following snippet would be generated for shard_index == 2:

# zones := shardingStrategy.topology.values
# shards_per_zone := max(1, floor(shards / len(zones)))
spec:
  nodeSelector:
    # Existing nodeSelectors using 'topology.kubernetes.io/zone' will be
    # replaced with the generated value:
    # zones[shard_index % shards_per_zone]
    'topology.kubernetes.io/zone': 'europe-west4-a'
    # Existing nodeSelectors will be kept
    'foo': 'bar'
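
For illustration, the merge could look roughly like the following Go sketch. The function name is hypothetical, and only the topology.kubernetes.io/zone key is assumed here.

package main

import "fmt"

// mergeNodeSelector sketches how the generated zone selector could be merged
// into a user-provided nodeSelector: the generated value wins on key conflicts,
// matching the behaviour described above.
func mergeNodeSelector(userSelector map[string]string, zone string) map[string]string {
	merged := make(map[string]string, len(userSelector)+1)
	for k, v := range userSelector {
		merged[k] = v
	}
	// Any user-provided value for the topology key is replaced.
	merged["topology.kubernetes.io/zone"] = zone
	return merged
}

func main() {
	user := map[string]string{
		"foo":                         "bar",
		"topology.kubernetes.io/zone": "will be replaced",
	}
	fmt.Println(mergeNodeSelector(user, "europe-west4-a"))
}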

Alternatives

We could allow users to define the complete relabel and node selector logic themselves. This would be more flexible, but also much harder to configure.

By abstracting into shardingStrategy, we can cover the most common cases without requiring users to have deep knowledge about Prometheus relabel configuration.

A field additionalRelabelConfig was discussed to allow arbitrary logic to be added before the sharding configuration. It was decided that this would duplicate the functionality of scrape classes found in, e.g., the Prometheus custom resource definition.

The use of sourceLabel fields instead of the __tmp_* label mechanism was discussed. It was agreed not to introduce new fields, as an existing feature would have had to be changed and the number of API fields would have increased further.

An override for the topology node label was discussed. This field was not added as there is no clear use case yet. The general structure was kept so that such a field can still be added in a future release.

Action Plan

A rough plan of the steps required to implement the feature:

  1. Add the PrometheusTopologySharding feature gate.
  2. Implement the API changes with pre-flight validations.
  3. Implement the node selector update when mode: Topology.
  4. Implement the external label name when mode: Topology.
  5. Implement the target sharding when mode: Topology.