
Merge pull request from coreos/roadmap

Move roadmap into its own file and extend it
Frederic Branczyk 2017-02-10 16:40:19 -08:00 committed by GitHub
commit 48a318f4af
4 changed files with 74 additions and 100 deletions

@@ -30,46 +30,3 @@ instances in high availability mode.
| externalUrl | External URL Alertmanager will be reachable under. Used for registering routes. | false | string | |
| paused | If true, the operator won't process any changes affecting the Alertmanager setup | false | bool | false |
## Current state and roadmap
### Config file
The Operator expects a `ConfigMap` named `<alertmanager-name>`, which
contains the configuration for the Alertmanager instances to run. It is left to
the user to populate it with the desired configuration. Note that the
Alertmanager pods will stay in a `Pending` state as long as the `ConfigMap`
does not exist.
### Deployment
The Alertmanager, in high availability mode, is a distributed system. A
desirable deployment ensures no data loss and zero downtime while performing a
deployment. Zero downtime follows from the Alertmanager running in high
availability mode. No data loss is achieved by using PVCs and attaching the
volumes of a previous Alertmanager instance to its replacement. The hard
part, however, is knowing whether a new instance is healthy or not.
A healthy instance is one that has joined the existing mesh network and has
received the state it missed while it was down for the upgrade.
Currently there is no way to tell whether an Alertmanager instance is healthy
under these conditions. There are discussions about using vector clocks to
resolve state merges in this situation and to ensure, on a best-effort
basis, that joining the network was successful.
> Note that single-instance Alertmanager setups will therefore not have zero
> downtime on deployments.
The current implementation of rolling deployments simply decides whether an
instance is considered healthy based on its Pod state. This mechanism may become
part of an implementation with the characteristics mentioned above.
### Cluster-wide version
Currently, the Operator installs a default version, with the option of
explicitly defining the version to use in the TPR.
In the future, there should be a cluster-wide version so that the controller
can orchestrate upgrades of all running Alertmanager setups.

@@ -68,60 +68,3 @@ still benefiting from the Operator's capabilities of managing Prometheus setups.
| name | Name of the Alertmanager endpoints. This equals the targeted Alertmanager service. | true | string |
| port | Name or number of the service port to push alerts to | false | integer or string |
| scheme | HTTP scheme to use when pushing alerts | false | http |
## Current state and roadmap
### Rule files
The Operator creates an empty ConfigMap named `<prometheus-name>-rules` if it
doesn't exist yet. It is left to the user to populate it with the desired rules.
It is still up for discussion whether it should be possible to include rule files living
in arbitrary ConfigMaps by their labels.
Intuitively, it seems fitting to define in each `ServiceMonitor` which rule files (based
on label selections over ConfigMaps) should be deployed with it.
However, rules act upon all metrics in a Prometheus server. Hence, defining the
relationship in each `ServiceMonitor` may cause undesired interference.
### Alerting
The TPR allows configuring multiple namespace/name pairs of Alertmanager
services. The Prometheus instances will send their alerts to every endpoint
of these services.
Currently, Prometheus only allows configuring Alertmanager URLs via flags
at startup. Thus, the Prometheus pods have to be restarted manually if the
endpoints change.
PetSets or manually maintained headless services in Kubernetes provide
stable URLs that work around this. In the future, Prometheus will allow
dynamic service discovery of Alertmanagers ([tracking issue](https://github.com/prometheus/prometheus/issues/2057)).
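As a rough sketch, assuming a Prometheus object named `main` and an Alertmanager service `alertmanager-main` in the `monitoring` namespace, such an endpoint configuration could look like the following. The field names follow the parameters listed in the table above, but the exact spec layout and API version are assumptions and should be checked against the official examples:

```yaml
apiVersion: monitoring.coreos.com/v1alpha1  # assumed TPR API group/version
kind: Prometheus
metadata:
  name: main
spec:
  alerting:
    alertmanagers:
    - namespace: monitoring     # namespace of the Alertmanager service
      name: alertmanager-main   # name of the targeted Alertmanager service
      port: web                 # name or number of the service port to push alerts to
      scheme: http              # HTTP scheme to use when pushing alerts
```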
### Cluster-wide version
Currently, the controller installs a default version, with the option of
explicitly defining the version to use in the TPR.
In the future, there should be a cluster-wide version so that the controller
can orchestrate upgrades of all running Prometheus setups.
### Dashboards
In the future, the Prometheus Operator should register the Prometheus setups
it brings up as data sources in potential Grafana deployments.
### Resource limits
Prometheus instances are deployed with default values for requested and maximum
resource usage of CPU and memory. This will be made configurable in the `Prometheus`
TPR eventually.
Prometheus comes with a variety of configuration flags for its storage engine that
have to be tuned for better performance in large Prometheus servers. It will be the
Operator's job to tune these correctly, in line with the experienced load
and the resource limits configured by the user.
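A purely hypothetical sketch of how such a setting could eventually look in the `Prometheus` TPR, modeled on the standard Kubernetes resource requirements (the `resources` field shown here does not exist yet):

```yaml
# Hypothetical future field, sketched after the Kubernetes ResourceRequirements shape.
spec:
  resources:
    requests:
      cpu: 500m
      memory: 2Gi
    limits:
      cpu: "1"
      memory: 4Gi
```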
### Horizontal sharding
Prometheus has basic capabilities to run horizontally sharded setups. This is only
necessary in the largest of clusters. The Operator is an ideal candidate to manage the
sharding process and make it appear seamless to the user.

@@ -21,6 +21,8 @@ Once installed, the Prometheus Operator provides the following features:
For an introduction to the Prometheus Operator, see the initial [blog
post](https://coreos.com/blog/the-prometheus-operator.html).
The current project roadmap [can be found here](./ROADMAP.md).
## Prometheus Operator vs. kube-prometheus
The Prometheus Operator makes the Prometheus configuration Kubernetes native

ROADMAP.md (new file, 72 lines added)
@@ -0,0 +1,72 @@
# Current state and roadmap
The following is a loose collection of potential features that fit the future
scope of the operator. Their exact implementation and viability are subject
to further discussion.
## Prometheus
### Rule files
The Operator creates an empty ConfigMap named `<prometheus-name>-rules` if it
doesn't exist yet. It is left to the user to populate it with the desired rules.
It is still up for discussion whether it should be possible to include rule files living
in arbitrary ConfigMaps by their labels.
Intuitively, it seems fitting to define in each `ServiceMonitor` which rule files (based
on label selections over ConfigMaps) should be deployed with it.
However, rules act upon all metrics in a Prometheus server. Hence, defining the
relationship in each `ServiceMonitor` may cause undesired interference.
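For illustration, a populated rules ConfigMap for a Prometheus instance named `main` might look roughly like the following; the `.rules` key name and the rule itself (written in the Prometheus 1.x rule syntax) are only assumptions for the example:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  # Follows the <prometheus-name>-rules naming convention for a Prometheus named "main".
  name: main-rules
data:
  # Each key holds a rule file that the Prometheus instance loads.
  example.rules: |
    ALERT InstanceDown
      IF up == 0
      FOR 5m
      LABELS { severity = "warning" }
      ANNOTATIONS { summary = "Instance {{ $labels.instance }} is down" }
```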
### Dashboards
In the future, the Prometheus Operator should register the Prometheus setups
it brings up as data sources in potential Grafana deployments.
### Horizontal sharding
Prometheus has basic capabilities to run horizontally sharded setups. This is only
necessary in the largest of clusters. The Operator is an ideal candidate to manage the
sharding process and make it appear seamless to the user.
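The building block for this in plain Prometheus is hashmod-based relabeling, where each shard keeps only its own subset of targets. A minimal hand-written sketch (not something the Operator generates; the target addresses and the shard count of 4 are placeholders):

```yaml
scrape_configs:
- job_name: node
  static_configs:
  - targets: ['10.0.0.1:9100', '10.0.0.2:9100', '10.0.0.3:9100']
  relabel_configs:
  # Hash every target address into one of 4 buckets.
  - source_labels: [__address__]
    modulus: 4
    target_label: __tmp_hash
    action: hashmod
  # This server is shard 1 and keeps only the targets in its bucket.
  - source_labels: [__tmp_hash]
    regex: "1"
    action: keep
```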
### Federation
Prometheus supports federation patterns, which currently have to be set up manually. Direct
support in the Operator is a desirable feature that could integrate tightly with
Kubernetes cluster federation to minimize user-defined configuration.
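Manually, such a setup typically means a scrape configuration on a global Prometheus server pulling selected series from the `/federate` endpoint of the per-cluster servers; a minimal sketch, with placeholder target addresses and match selectors:

```yaml
scrape_configs:
- job_name: federate
  honor_labels: true
  metrics_path: /federate
  params:
    # Only series matching these selectors are pulled from the federated servers.
    'match[]':
    - '{job="kubernetes-nodes"}'
    - '{__name__=~"job:.*"}'
  static_configs:
  - targets:
    - prometheus-cluster-a.example.com:9090
    - prometheus-cluster-b.example.com:9090
```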
## Alertmanager
### Configuration file
The Operator expects a `ConfigMap` named `<alertmanager-name>`, which
contains the configuration for the Alertmanager instances to run. It is left to
the user to populate it with the desired configuration. Note that the
Alertmanager pods will stay in a `Pending` state as long as the `ConfigMap`
does not exist.
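As a rough illustration, for an Alertmanager named `main` the expected ConfigMap could look like the following; the data key name and the route/receiver shown are assumptions for the example, not defaults:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  # Must match the name of the Alertmanager object, here "main".
  name: main
data:
  # A regular Alertmanager configuration file.
  alertmanager.yml: |
    route:
      receiver: default
      group_by: ['alertname']
    receivers:
    - name: default
      webhook_configs:
      - url: http://alert-sink.example.com/webhook
```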
### Deployment
The Alertmanager, in high availability mode, is a distributed system. A
desirable deployment ensures no data loss and zero downtime while performing a
deployment. Zero downtime follows from the Alertmanager running in high
availability mode. No data loss is achieved by using PVCs and attaching the
volumes of a previous Alertmanager instance to its replacement. The hard
part, however, is knowing whether a new instance is healthy or not.
A healthy instance is one that has joined the existing mesh network and has
received the state it missed while it was down for the upgrade.
Currently there is no way to tell whether an Alertmanager instance is healthy
under these conditions. There are discussions about using vector clocks to
resolve state merges in this situation and to ensure, on a best-effort
basis, that joining the network was successful.
> Note that single-instance Alertmanager setups will therefore not have zero
> downtime on deployments.
The current implementation of rolling deployments simply decides whether an
instance is considered healthy based on its Pod state. This mechanism may become
part of an implementation with the characteristics mentioned above.