Zone aware sharding for Prometheus
-
Owners:
-
Related Tickets:
-
Other docs:
- Well known kubernetes labels
- AWS zone names
- GCP zone names
- Shard Autoscaling design proposal.
This proposal describes how we can implement zone-aware sharding by adding support for custom labels and zone configuration options to the existing Prometheus configuration resources.
Why
When running large, multi-zone clusters, Prometheus scraping can lead to an
increase in inter-zone traffic costs. A solution would be to deploy 1 Prometheus shard
per zone and configure each shard to scrape only the targets local to its zone. The
current sharding implementation can't solve the issue though. While
it's possible to customize the label (__address__
by default) used for distributing the
targets to the Prometheus instances, there's no way to configure a single Prometheus
resource so that each shard is bound to a specific zone.
Goals
- Define a set of configuration options required to allow zone aware sharding
- Define the relabel configuration to be generated for zone aware sharding
- Schedule Prometheus pods to their respective zones.
- Stay backwards compatible with the current mechanism by default
Non-goals
- Implement mechanisms to automatically fix configuration errors by the user
- Support mixed environments (kubernetes and non-kubernetes targets are scraped)
- Support Kubernetes clusters before 1.26 (topology label support)
- Implement zone aware scraping for targets defined via .spec.additionalScrapeConfigs and ScrapeConfig custom resources.
How
Note
Due to the size of this feature, it will be placed behind a feature gate to allow incremental testing.
Algorithm
In order to calculate a stable assignment, the following parameters are required:
- num_shards: The number of Prometheus shards
- shard_index: A number in the range [0..num_shards-1] identifying a single Prometheus instance inside a given shard
- zones: A list of the zones to be scraped
- zone_label: A label denoting the zone of a target
- address: The content of the __address__ label
It has to be noted that zone_label is supposed to contain a value from the zones list.
The num_shards value refers to the already existing .spec.shards field of the Prometheus custom resource definition.
Given these values, whether a target is scraped by a given Prometheus instance is determined by the following algorithm:
assert(num_shards >= len(zones))      # Error: zone(s) will not get scraped
assert(num_shards % len(zones) == 0)  # Warning: multi-scraping of instances
shards_per_zone := max(1, floor(num_shards / len(zones)))
prom_zone_idx := shard_index % len(zones)
if zone_label == zones[prom_zone_idx] {
  assignment_idx := floor(shard_index / len(zones)) % shards_per_zone
  if hash(address) % shards_per_zone == assignment_idx {
    do_scrape()
  }
}
By using modulo to calculate the prom_zone_idx, instances are distributed to zones in a round-robin fashion (A,B,C,A,B,C, and so on). This allows the num_shards value to be modified without redistributing shards or data.
This was preferred over optimizing for changes to the zones list, as those are less likely to happen.
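For illustration, with num_shards being 6 and zones being [A..C] (so shards_per_zone is 2), the same scheme yields the following distribution:
0 1 2 | 3 4 5 | shard index
A B C | A B C | zone
0 0 0 | 1 1 1 | assignment index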
Edge cases
We have introduced asserts in the above section to warn about edge cases that might lead to duplicate data or data loss.
With the above algorithm, Prometheus instances will be distributed in an alternating fashion using the already existing shard index. This leads to the following edge cases:
When num_shards is 10 and len(zones) is 3 (as in [A..C]), shards_per_zone is 3. This yields the following distribution:
0 1 2 | 3 4 5 | 6 7 8 | 9 | shard index
A B C | A B C | A B C | A | zone
0 0 0 | 1 1 1 | 2 2 2 | 0 | assignment index
In this case the second assert will warn about double scraping of instances in zone A, as the same targets are assigned to instances 0 and 9.
When num_shards is 2 and len(zones) is 3 (as in [A..C]), shards_per_zone is 1. This yields the following distribution:
0 1 | shard index
A B C | zone
0 0 0 | assignment index
In this case targets in zone C are not being scraped.
Both cases should lead to an error during reconciliation, causing the change to not be rolled out. The first case (double scraping) is not as severe as a zone not being scraped, but it is otherwise hard to spot. It should also be noted that replicas are the intended mechanism for redundant scraping.
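To make the pre-flight validation concrete, a configuration along the following lines (expressed with the shardingStrategy field proposed below; zone names are illustrative) matches the second edge case and should therefore be rejected during reconciliation:
spec:
  shards: 2
  shardingStrategy:
    mode: 'Topology'
    topology:
      values: # 3 zones, but only 2 shards
        - 'europe-west4-a'
        - 'europe-west4-b'
        - 'europe-west4-c'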
Topology field discovery
The kubernetes service discovery currently does not expose any topology field. Such a field would have to be added, otherwise users would have to inject such a label themselves.
Good candidates for such a field are the topology.kubernetes.io/* labels, which should be present on all nodes.
There are two ways to handle this:
- A change to the Prometheus Kubernetes service discovery to add the required label to all targets.
- The operator could do this discovery and add a relabel rule based on the node name.
The second solution would require the operator to constantly update the relabel configuration. This could lead to increased load on clusters with aggressive autoscaling, as well as race conditions for pods on newly created nodes, as the config change is not atomic/instant.
Because of that, a change to the Kubernetes service discovery is considered the more stable, and thus preferable, solution. It will require additional permissions for Prometheus in case it is not already allowed to read node objects.
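For reference, the additional permissions amount to read access on node objects; a minimal RBAC sketch (names are illustrative) could look like this:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-node-reader # illustrative name
rules:
  - apiGroups: ['']
    resources: ['nodes']
    verbs: ['get', 'list', 'watch']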
API changes
Note
This proposal is mutually exclusive with DaemonSet mode, as Prometheus always scrapes a single node in that case. Defining a shardingStrategy when DaemonSet mode is active should lead to a reconciliation error.
Following the algorithm presented above, we suggest adding the following configuration options to the Prometheus and PrometheusAgent custom resource definitions.
All values used in this snippet should also be the defaults for their corresponding keys.
spec:
  shardingStrategy:
    # Select a sharding mode. Can be 'Classic' or 'Topology'.
    # Defaults to 'Classic'.
    mode: 'Classic'
    # The following section is only valid if "mode" is set to "Topology"
    topology:
      # Prometheus external label used to communicate the topology zone.
      # If not defined, it defaults to "zone".
      # If defined to an empty string, no external label is added to the Prometheus configuration.
      externalLabelName: "zone"
      # All topology values to be used by the cluster, i.e. a list of all
      # zones in use.
      values: []
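For example, a 6-shard Prometheus spread across three zones (zone names are illustrative, GCP-style) would be configured as follows, yielding shards_per_zone == 2 and the round-robin shard-to-zone mapping described in the algorithm section:
spec:
  shards: 6
  shardingStrategy:
    mode: 'Topology'
    topology:
      externalLabelName: 'zone'
      values:
        - 'europe-west4-a'
        - 'europe-west4-b'
        - 'europe-west4-c'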
The topology
section does not use the term zone
. This will prevent API
changes in case other topologies, like regions, need to be supported in future
releases.
Neither mode contains an explicit overwrite of the label used for sharding. This is already possible by generating a __tmp_hash label through scrape classes.
In case of the Topology mode, two labels are used for sharding. One is used to determine the correct topology of a target, the other one is used to allow sharding inside a specific topology (e.g. zone).
The second label implements exactly the same mechanics as the Classic mode and thus uses the same __tmp_hash overwrite mechanism.
To allow overwrites of the topology determination label, a custom label named __tmp_topology can be generated, following the same idea.
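As an illustration of that idea (not part of the proposed API changes; the pod labels below are hypothetical), a default scrape class could populate both overwrite labels:
spec:
  scrapeClasses:
    - name: 'custom-sharding'
      default: true
      relabelings:
        # Shard on a custom pod label instead of __address__.
        - sourceLabels: ['__meta_kubernetes_pod_label_sharding_key']
          targetLabel: '__tmp_hash'
          action: 'replace'
        # Derive the topology from a custom pod label instead of the
        # discovered zone.
        - sourceLabels: ['__meta_kubernetes_pod_label_zone']
          targetLabel: '__tmp_topology'
          action: 'replace'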
The externalLabelName
should be added by default to allow debugging. It also
gives some general, valuable insights for multi-zone setups.
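For a shard assigned to zone europe-west4-a (reusing the zone names from the examples below), the generated Prometheus configuration would then roughly contain:
global:
  external_labels:
    zone: 'europe-west4-a'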
It is possible to change the mode
field from Classic
to Topology
(and
vice-versa) without service interruption.
Generated configuration
The following examples are based on the algorithm above.
Please note that shard_index
has to be provided by the operator during
config generation.
We use a replica count of 2 in all examples to illustrate that this value does not have any effect, as both replicas will have the same shard_index assigned.
Classic mode
Given the following configuration
spec:
  shards: 4
  replicas: 2
  shardingStrategy:
    mode: 'Classic'
we would get the following output for shard_index == 2:
- source_labels: ['__address__', '__tmp_hash']
  target_label: '__tmp_hash'
  regex: '(.+);'
  replacement: '$1'
  action: 'replace'
- source_labels: ['__tmp_hash']
  target_label: '__tmp_hash'
  modulus: 4
  action: 'hashmod'
- source_labels: ['__tmp_hash']
  regex: '2' # shard_index
  action: 'keep'
Topology mode
Given the following configuration
spec:
  shards: 4
  replicas: 2
  shardingStrategy:
    mode: 'Topology'
    topology:
      values:
        - 'europe-west4-a'
        - 'europe-west4-b'
we would get the following output for shard_index == 2:
# zones := shardingStrategy.topology.values
# shards_per_zone := max(1, floor(shards / len(zones)))
# topology determination
- source_labels: ['__meta_kubernetes_endpointslice_endpoint_zone', '__tmp_topology']
  target_label: '__tmp_topology'
  regex: '(.+);'
- source_labels: ['__meta_kubernetes_node_label_topology_kubernetes_io_zone', '__meta_kubernetes_node_labelpresent_topology_kubernetes_io_zone', '__tmp_topology']
  regex: '(.+);true;'
  target_label: '__tmp_topology'
- source_labels: ['__tmp_topology']
  regex: 'europe-west4-a' # zones[shard_index % len(zones)]
  action: 'keep'
# In-topology sharding
- source_labels: ['__address__', '__tmp_hash']
  target_label: '__tmp_hash'
  regex: '(.+);'
  action: 'replace'
- source_labels: ['__tmp_hash']
  target_label: '__tmp_hash'
  modulus: 2 # shards_per_zone
  action: 'hashmod'
- source_labels: ['__tmp_hash']
  regex: '1' # floor(shard_index / len(zones))
  action: 'keep'
Note
Node metadata will need to be added when using certain monitors. This requires an additional flag in the kubernetes_sd_configs section like this:
kubernetes_sd_configs:
  - attach_metadata:
      node: true
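In the operator's API, this flag corresponds to the attachMetadata field, for example on a scrape class, as also used in the example below:
spec:
  scrapeClasses:
    - name: 'topology'
      default: true
      attachMetadata:
        node: true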
Prometheus instance zone assignment
To make sure that Prometheus instances are deployed to the correct zone (of their assigned target), we need to generate a node affinity or a node selector.
As node selectors are simpler to manage, and node affinities might run into
ordering issues when a user defines their own affinities, node selectors should
be used.
If a nodeSelector
has already been defined, it will be merged with the node
selector generated here. If the same key was used, the value will be
replaced with the generated value.
Given this input:
spec:
  nodeSelector:
    'foo': 'bar'
    'topology.kubernetes.io/zone': 'will be replaced'
  shards: 4
  replicas: 2
  scrapeClasses:
    - name: 'topology'
      default: true
      attachMetadata:
        node: true
  shardingStrategy:
    mode: 'Topology'
    topology:
      values:
        - 'europe-west4-a'
        - 'europe-west4-b'
The following snippet would be generated for shard_index == 2:
# zones := shardingStrategy.topology.values
# shards_per_zone := max(1, floor(shards / len(zones)))
spec:
  nodeSelector:
    # Existing nodeSelectors using 'topology.kubernetes.io/zone' will be
    # replaced with the generated value:
    # zones[shard_index % len(zones)]
    'topology.kubernetes.io/zone': 'europe-west4-a'
    # Existing nodeSelectors will be kept
    'foo': 'bar'
Alternatives
We could allow users to define the complete relabel and node selector logic themselves. This would be more flexible, but also considerably harder to configure.
By abstracting into shardingStrategy
, we can cover the most common cases
without requiring users to have deep knowledge about Prometheus relabel
configuration.
A field additionalRelabelConfig
was discussed to allow arbitrary logic to be
added before the sharding configuration. It was decided that this would
duplicate the functionality of scrape classes
found in, e.g., the Prometheus
custom resource definition.
The use of sourceLabel fields instead of the __tmp_* label mechanism was discussed. It was agreed not to introduce new fields, as an existing feature would have to be changed, and not to further increase the number of API fields.
An overwrite for the topology node label was discussed. This field was not added as there was no clear use-case yet. The general structure was kept so that it will still be possible to add such a field in a future release.
Action Plan
A rough plan of the steps required to implement the feature:
- Add the PrometheusTopologySharding feature gate.
- Implement the API changes with pre-flight validations.
- Implement the node selector update when mode: Topology.
- Implement the external label name when mode: Topology.
- Implement the target sharding when mode: Topology.
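Once the gate exists, enabling it would presumably follow the operator's usual feature-gate mechanism, e.g. as a container argument (a sketch, assuming the gate name above):
# Operator Deployment (excerpt)
args:
  - '--feature-gates=PrometheusTopologySharding=true'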