
Merge pull request from coreos/roadmap

Move roadmap into its own file and extend it
Frederic Branczyk 2017-02-10 16:40:19 -08:00 committed by GitHub
commit 48a318f4af
4 changed files with 74 additions and 100 deletions

@@ -30,46 +30,3 @@ instances in high availability mode.
| externalUrl | External URL Alertmanager will be reachable under. Used for registering routes. | false | string | |
| paused | If true, the operator won't process any changes affecting the Alertmanager setup | false | bool | false |
## Current state and roadmap
### Config file
The Operator expects a `ConfigMap` named `<alertmanager-name>`, which
contains the configuration for the Alertmanager instances to run. It is left to
the user to populate it with the desired configuration. Note that the
Alertmanager pods will stay in a `Pending` state as long as the `ConfigMap`
does not exist.
### Deployment
The Alertmanager, in high availability mode, is a distributed system. A
desirable deployment ensures no data loss and zero downtime while performing a
deployment. Zero downtime follows from the Alertmanager running in high
availability mode. No data loss is achieved by using PVCs and attaching the
volumes of a previous Alertmanager instance to its replacement. The hard
part, however, is knowing whether a new instance is healthy or not.
A healthy instance is one that has joined the existing mesh network and has
received the state it missed while it was down for the upgrade.
Currently there is no way to tell whether an Alertmanager instance is healthy
under these conditions. There are discussions about using vector clocks to
resolve state merges in this situation and to ensure, on a best-effort
basis, that joining the network was successful.
> Note that single-instance Alertmanager setups will therefore not have zero
> downtime on deployments.
The current implementation of rolling deployments simply decides whether an
instance is considered healthy based on its Pod state. This mechanism may become
part of an implementation with the characteristics mentioned above.
### Cluster-wide version
Currently, the Operator installs a default version, with the option of
explicitly defining the version to use in the TPR.
In the future, there should be a cluster-wide version so that the controller
can orchestrate upgrades of all running Alertmanager setups.

@@ -68,60 +68,3 @@ still benefiting from the Operator's capabilities of managing Prometheus setups.
| name | Name of the Alertmanager endpoints. This equals the targeted Alertmanager service. | true | string |
| port | Name or number of the service port to push alerts to | false | integer or string |
| scheme | HTTP scheme to use when pushing alerts | false | http |
## Current state and roadmap
### Rule files
The Operator creates an empty ConfigMap named `<prometheus-name>-rules` if it
doesn't exist yet. It is left to the user to populate it with the desired rules.
It is still up for discussion whether it should be possible to include rule files living
in arbitrary ConfigMaps by their labels.
Intuitively, it seems fitting to define in each `ServiceMonitor` which rule files (based
on label selections over ConfigMaps) should be deployed with it.
However, rules act upon all metrics in a Prometheus server. Hence, defining the
relationship in each `ServiceMonitor` may cause undesired interference.
### Alerting
The TPR allows configuring multiple namespace/name pairs of Alertmanager
services. The Prometheus instances will send their alerts to every endpoint
of these services.
Currently, Prometheus only allows configuring Alertmanager URLs via flags
at startup. Thus, the Prometheus pods have to be restarted manually if the
endpoints change.
PetSets or manually maintained headless services in Kubernetes provide
stable URLs that work around this. In the future, Prometheus will allow
dynamic service discovery of Alertmanagers ([tracking issue](https://github.com/prometheus/prometheus/issues/2057)).
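As a rough sketch, assuming a Prometheus object named `main` and an Alertmanager service `alertmanager-main` in the `monitoring` namespace, such an endpoint configuration could look like the following. The field names follow the parameters listed in the table above, but the exact spec layout and API version are assumptions and should be checked against the official examples:

```yaml
apiVersion: monitoring.coreos.com/v1alpha1  # assumed TPR API group/version
kind: Prometheus
metadata:
  name: main
spec:
  alerting:
    alertmanagers:
    - namespace: monitoring     # namespace of the Alertmanager service
      name: alertmanager-main   # name of the targeted Alertmanager service
      port: web                 # name or number of the service port to push alerts to
      scheme: http              # HTTP scheme to use when pushing alerts
```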
### Cluster-wide version
Currently, the controller installs a default version, with the option of
explicitly defining the version to use in the TPR.
In the future, there should be a cluster-wide version so that the controller
can orchestrate upgrades of all running Prometheus setups.
### Dashboards
In the future, the Prometheus Operator should register the Prometheus setups
it brings up as data sources in potential Grafana deployments.
### Resource limits
Prometheus instances are deployed with default values for requested and maximum
resource usage of CPU and memory. This will be made configurable in the `Prometheus`
TPR eventually.
Prometheus comes with a variety of configuration flags for its storage engine that
have to be tuned for better performance in large Prometheus servers. It will be the
Operator's job to tune these correctly, in line with the experienced load
and the resource limits configured by the user.
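A purely hypothetical sketch of how such a setting could eventually look in the `Prometheus` TPR, modeled on the standard Kubernetes resource requirements (the `resources` field shown here does not exist yet):

```yaml
# Hypothetical future field, sketched after the Kubernetes ResourceRequirements shape.
spec:
  resources:
    requests:
      cpu: 500m
      memory: 2Gi
    limits:
      cpu: "1"
      memory: 4Gi
```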
### Horizontal sharding
Prometheus has basic capabilities to run horizontally sharded setups. This is only
necessary in the largest of clusters. The Operator is an ideal candidate to manage the
sharding process and make it appear seamless to the user.

@@ -21,6 +21,8 @@ Once installed, the Prometheus Operator provides the following features:
For an introduction to the Prometheus Operator, see the initial [blog
post](https://coreos.com/blog/the-prometheus-operator.html).
The current project roadmap [can be found here](./ROADMAP.md).
## Prometheus Operator vs. kube-prometheus
The Prometheus Operator makes the Prometheus configuration Kubernetes native

ROADMAP.md (new file, 72 lines added)
@@ -0,0 +1,72 @@
# Current state and roadmap
The following is a loose collection of potential features that fit the future
scope of the operator. Their exact implementation and viability are subject
to further discussion.
## Prometheus
### Rule files
The Operator creates an empty ConfigMap named `<prometheus-name>-rules` if it
doesn't exist yet. It is left to the user to populate it with the desired rules.
It is still up for discussion whether it should be possible to include rule files living
in arbitrary ConfigMaps by their labels.
Intuitively, it seems fitting to define in each `ServiceMonitor` which rule files (based
on label selections over ConfigMaps) should be deployed with it.
However, rules act upon all metrics in a Prometheus server. Hence, defining the
relationship in each `ServiceMonitor` may cause undesired interference.
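For illustration, a populated rules ConfigMap for a Prometheus instance named `main` might look roughly like the following; the `.rules` key name and the rule itself (written in the Prometheus 1.x rule syntax) are only assumptions for the example:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  # Follows the <prometheus-name>-rules naming convention for a Prometheus named "main".
  name: main-rules
data:
  # Each key holds a rule file that the Prometheus instance loads.
  example.rules: |
    ALERT InstanceDown
      IF up == 0
      FOR 5m
      LABELS { severity = "warning" }
      ANNOTATIONS { summary = "Instance {{ $labels.instance }} is down" }
```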
### Dashboards
In the future, the Prometheus Operator should register the Prometheus setups
it brings up as data sources in potential Grafana deployments.
### Horizontal sharding
Prometheus has basic capabilities to run horizontally sharded setups. This is only
necessary in the largest of clusters. The Operator is an ideal candidate to manage the
sharding process and make it appear seamless to the user.
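The building block for this in plain Prometheus is hashmod-based relabeling, where each shard keeps only its own subset of targets. A minimal hand-written sketch (not something the Operator generates; the target addresses and the shard count of 4 are placeholders):

```yaml
scrape_configs:
- job_name: node
  static_configs:
  - targets: ['10.0.0.1:9100', '10.0.0.2:9100', '10.0.0.3:9100']
  relabel_configs:
  # Hash every target address into one of 4 buckets.
  - source_labels: [__address__]
    modulus: 4
    target_label: __tmp_hash
    action: hashmod
  # This server is shard 1 and keeps only the targets in its bucket.
  - source_labels: [__tmp_hash]
    regex: "1"
    action: keep
```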
### Federation
Prometheus supports federation patterns, which currently have to be set up manually. Direct
support in the Operator is a desirable feature that could integrate tightly with
Kubernetes cluster federation to minimize user-defined configuration.
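Manually, such a setup typically means a scrape configuration on a global Prometheus server pulling selected series from the `/federate` endpoint of the per-cluster servers; a minimal sketch, with placeholder target addresses and match selectors:

```yaml
scrape_configs:
- job_name: federate
  honor_labels: true
  metrics_path: /federate
  params:
    # Only series matching these selectors are pulled from the federated servers.
    'match[]':
    - '{job="kubernetes-nodes"}'
    - '{__name__=~"job:.*"}'
  static_configs:
  - targets:
    - prometheus-cluster-a.example.com:9090
    - prometheus-cluster-b.example.com:9090
```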
## Alertmanager
### Configuration file
The Operator expects a `ConfigMap` named `<alertmanager-name>`, which
contains the configuration for the Alertmanager instances to run. It is left to
the user to populate it with the desired configuration. Note that the
Alertmanager pods will stay in a `Pending` state as long as the `ConfigMap`
does not exist.
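As a rough illustration, for an Alertmanager named `main` the expected ConfigMap could look like the following; the data key name and the route/receiver shown are assumptions for the example, not defaults:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  # Must match the name of the Alertmanager object, here "main".
  name: main
data:
  # A regular Alertmanager configuration file.
  alertmanager.yml: |
    route:
      receiver: default
      group_by: ['alertname']
    receivers:
    - name: default
      webhook_configs:
      - url: http://alert-sink.example.com/webhook
```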
### Deployment
The Alertmanager, in high availability mode, is a distributed system. A
desirable deployment ensures no data loss and zero downtime while performing a
deployment. Zero downtime follows from the Alertmanager running in high
availability mode. No data loss is achieved by using PVCs and attaching the
volumes of a previous Alertmanager instance to its replacement. The hard
part, however, is knowing whether a new instance is healthy or not.
A healthy instance is one that has joined the existing mesh network and has
received the state it missed while it was down for the upgrade.
Currently there is no way to tell whether an Alertmanager instance is healthy
under these conditions. There are discussions about using vector clocks to
resolve state merges in this situation and to ensure, on a best-effort
basis, that joining the network was successful.
> Note that single-instance Alertmanager setups will therefore not have zero
> downtime on deployments.
The current implementation of rolling deployments simply decides whether an
instance is considered healthy based on its Pod state. This mechanism may become
part of an implementation with the characteristics mentioned above.