Mirror of https://github.com/prometheus-operator/prometheus-operator.git (synced 2025-04-21 03:38:43 +00:00)
Merge pull request #139 from coreos/roadmap
Move roadmap into its own file and extend it
commit 48a318f4af
4 changed files with 74 additions and 100 deletions
@@ -30,46 +30,3 @@ instances in high availability mode.
| externalUrl | External URL Alertmanager will be reachable under. Used for registering routes. | false | string | |
| paused | If true, the operator won't process any changes affecting the Alertmanager setup | false | bool | false |
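
For illustration, a minimal sketch of an Alertmanager TPR object using these fields; the `apiVersion`, `kind`, object name, and the `replicas` field are assumptions based on the operator's conventions and are not taken from this table:

```yaml
# Hypothetical Alertmanager TPR manifest; only externalUrl and paused
# come from the table above, everything else is illustrative.
apiVersion: monitoring.coreos.com/v1alpha1
kind: Alertmanager
metadata:
  name: main
spec:
  replicas: 3
  # External URL the Alertmanager will be reachable under; used for registering routes.
  externalUrl: https://alertmanager.example.com
  # If true, the operator won't process any changes affecting this Alertmanager setup.
  paused: false
```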

## Current state and roadmap

### Config file

The Operator expects a `ConfigMap` of the name `<alertmanager-name>` which
contains the configuration for the Alertmanager instances to run. It is left to
the user to populate it with the desired configuration. Note that the
Alertmanager pods will stay in a `Pending` state as long as the `ConfigMap`
does not exist.

### Deployment

The Alertmanager, in high availability mode, is a distributed system. A
desired deployment ensures no data loss and zero downtime while performing a
deployment. Zero downtime follows directly from the Alertmanager running in
high availability mode. No data loss is achieved by using PVCs and attaching
the volumes of a previous Alertmanager instance to a new instance. The hard
part, however, is knowing whether a new instance is healthy or not.

A healthy node would be one that has joined the existing mesh network and has
received the state it missed while that particular instance was down for the
upgrade.

Currently there is no way to tell whether an Alertmanager instance is healthy
under the above conditions. There are discussions of using vector clocks to
resolve merges in the above-mentioned situation, and to ensure on a best-effort
basis that joining the network was successful.

> Note that single-instance Alertmanager setups will therefore not have zero
> downtime on deployments.

The current implementation of rolling deployments simply decides based on the
Pod state whether an instance is considered healthy. This mechanism may become
part of an implementation with the characteristics mentioned above.

### Cluster-wide version

Currently the operator installs a default version, with an optional explicit
definition of the version to use in the TPR.

In the future, there should be a cluster-wide version so that the controller
can orchestrate upgrades of all running Alertmanager setups.
@@ -68,60 +68,3 @@ still benefiting from the Operator's capabilities of managing Prometheus setups.
| name | Name of the Alertmanager endpoints. This equals the targeted Alertmanager service. | true | string |
| port | Name or number of the service port to push alerts to | false | integer or string |
| scheme | HTTP scheme to use when pushing alerts | false | http |
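
As an illustration, a hypothetical fragment of a Prometheus TPR spec wiring up these fields; the `alerting.alertmanagers` nesting, the `apiVersion`/`kind`, and the concrete values are assumptions, while `name`, `port`, `scheme`, and the namespace/name pairing correspond to the fields documented here and in the Alerting section below:

```yaml
# Hypothetical Prometheus TPR fragment; the alerting.alertmanagers nesting
# is an assumption, the individual endpoint fields come from the table above.
apiVersion: monitoring.coreos.com/v1alpha1
kind: Prometheus
metadata:
  name: main
spec:
  alerting:
    alertmanagers:
    - namespace: monitoring
      name: alertmanager-main   # equals the targeted Alertmanager service
      port: web                 # name or number of the service port to push alerts to
      scheme: http              # HTTP scheme to use when pushing alerts
```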

## Current state and roadmap

### Rule files

The Operator creates an empty ConfigMap of the name `<prometheus-name>-rules` if it
doesn't exist yet. It is left to the user to populate it with the desired rules.

It is still up for discussion whether it should be possible to include rule files living
in arbitrary ConfigMaps by their labels.
Intuitively, it seems fitting to define in each `ServiceMonitor` which rule files (based
on label selections over ConfigMaps) should be deployed with it.
However, rules act upon all metrics in a Prometheus server. Hence, defining the
relationship in each `ServiceMonitor` may cause undesired interference.

### Alerting

The TPR allows configuring multiple namespace/name pairs of Alertmanager
services. The Prometheus instances will send their alerts to each endpoint
of these services.

Currently Prometheus only allows configuring Alertmanager URLs via flags
at startup. Thus the Prometheus pods have to be restarted manually if the
endpoints change.
PetSets or manually maintained headless services in Kubernetes can provide
stable URLs to work around this. In the future, Prometheus will allow
for dynamic service discovery of Alertmanagers ([tracking issue](https://github.com/prometheus/prometheus/issues/2057)).

### Cluster-wide version

Currently the controller installs a default version, with an optional explicit
definition of the version to use in the TPR.
In the future, there should be a cluster-wide version so that the controller
can orchestrate upgrades of all running Prometheus setups.

### Dashboards

In the future, the Prometheus Operator should register new Prometheus setups
it brought up as data sources in potential Grafana deployments.

### Resource limits

Prometheus instances are deployed with default values for requested and maximum
resource usage of CPU and memory. This will be made configurable in the `Prometheus`
TPR eventually.

Prometheus comes with a variety of configuration flags for its storage engine that
have to be tuned for better performance in large Prometheus servers. It will be the
operator's job to tune those correctly to align with the experienced load
and the resource limits configured by the user.
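
A hypothetical sketch of what such configuration could look like once it lands in the `Prometheus` TPR, reusing the standard Kubernetes resource-requirements format; the `resources` field name and all values are assumptions:

```yaml
# Hypothetical future Prometheus TPR fragment; resource configuration is not
# yet supported, so the field name and values below are purely illustrative.
apiVersion: monitoring.coreos.com/v1alpha1
kind: Prometheus
metadata:
  name: main
spec:
  resources:
    requests:
      cpu: 500m
      memory: 2Gi
    limits:
      cpu: "1"
      memory: 4Gi
```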

### Horizontal sharding

Prometheus has basic capabilities to run horizontally sharded setups. This is only
necessary in the largest of clusters. The Operator is an ideal candidate to manage the
sharding process and make it appear seamless to the user.
@@ -21,6 +21,8 @@ Once installed, the Prometheus Operator provides the following features:
For an introduction to the Prometheus Operator, see the initial [blog
post](https://coreos.com/blog/the-prometheus-operator.html).

The current project roadmap [can be found here](./ROADMAP.md).

## Prometheus Operator vs. kube-prometheus

The Prometheus Operator makes the Prometheus configuration Kubernetes native
72
ROADMAP.md
Normal file
@@ -0,0 +1,72 @@
# Current state and roadmap

The following is a loose collection of potential features that fit the future
scope of the operator. Their exact implementation and viability are subject
to further discussion.

## Prometheus

### Rule files

The Operator creates an empty ConfigMap of the name `<prometheus-name>-rules` if it
doesn't exist yet. It is left to the user to populate it with the desired rules.

It is still up for discussion whether it should be possible to include rule files living
in arbitrary ConfigMaps by their labels.
Intuitively, it seems fitting to define in each `ServiceMonitor` which rule files (based
on label selections over ConfigMaps) should be deployed with it.
However, rules act upon all metrics in a Prometheus server. Hence, defining the
relationship in each `ServiceMonitor` may cause undesired interference.
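
For illustration, a minimal `<prometheus-name>-rules` ConfigMap for a Prometheus named `main`; the data key and the recording rule inside it are examples only, not prescribed by the operator:

```yaml
# Illustrative rules ConfigMap for a Prometheus TPR named "main".
# The data key and the rule itself are examples, not operator requirements.
apiVersion: v1
kind: ConfigMap
metadata:
  name: main-rules
data:
  example.rules: |
    # Recording rule in Prometheus 1.x rule syntax.
    job:http_requests:rate5m = sum by (job) (rate(http_requests_total[5m]))
```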

### Dashboards

In the future, the Prometheus Operator should register new Prometheus setups
it brought up as data sources in potential Grafana deployments.

### Horizontal sharding

Prometheus has basic capabilities to run horizontally sharded setups. This is only
necessary in the largest of clusters. The Operator is an ideal candidate to manage the
sharding process and make it appear seamless to the user.

### Federation

Prometheus supports federation patterns, which have to be set up manually. Direct support
in the operator is a desirable feature that could potentially integrate tightly with
Kubernetes cluster federation to minimize user-defined configuration.
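
For context, a manual federation setup today is a plain scrape configuration against another Prometheus server's `/federate` endpoint; the target host and match expression below are placeholders:

```yaml
# Manual federation as it has to be configured today: scrape selected series
# from another Prometheus server's /federate endpoint. Target and match
# expression are placeholders.
scrape_configs:
- job_name: federate
  honor_labels: true
  metrics_path: /federate
  params:
    'match[]':
    - '{job="kubernetes-nodes"}'
  static_configs:
  - targets:
    - prometheus-cluster-a.example.com:9090
```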

## Alertmanager

### Configuration file

The Operator expects a `ConfigMap` of the name `<alertmanager-name>` which
contains the configuration for the Alertmanager instances to run. It is left to
the user to populate it with the desired configuration. Note that the
Alertmanager pods will stay in a `Pending` state as long as the `ConfigMap`
does not exist.
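
A minimal sketch of such a ConfigMap for an Alertmanager named `main`; the data key and the receiver configuration are examples only:

```yaml
# Illustrative ConfigMap for an Alertmanager TPR named "main"; the data key
# and the webhook receiver are examples, not requirements of the operator.
apiVersion: v1
kind: ConfigMap
metadata:
  name: main
data:
  alertmanager.yaml: |
    route:
      receiver: default
    receivers:
    - name: default
      webhook_configs:
      - url: http://example-webhook.monitoring.svc:8080/
```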

### Deployment

The Alertmanager, in high availability mode, is a distributed system. A
desired deployment ensures no data loss and zero downtime while performing a
deployment. Zero downtime follows directly from the Alertmanager running in
high availability mode. No data loss is achieved by using PVCs and attaching
the volumes of a previous Alertmanager instance to a new instance. The hard
part, however, is knowing whether a new instance is healthy or not.

A healthy node would be one that has joined the existing mesh network and has
received the state it missed while that particular instance was down for the
upgrade.

Currently there is no way to tell whether an Alertmanager instance is healthy
under the above conditions. There are discussions of using vector clocks to
resolve merges in the above-mentioned situation, and to ensure on a best-effort
basis that joining the network was successful.

> Note that single-instance Alertmanager setups will therefore not have zero
> downtime on deployments.

The current implementation of rolling deployments simply decides based on the
Pod state whether an instance is considered healthy. This mechanism may become
part of an implementation with the characteristics mentioned above.