The stable/prometheus-operator chart has been deprecated and further
development has moved to prometheus-community/kube-prometheus-stack.
Signed-off-by: Khaled Elkhawaga <k.elkhawaga@gmail.com>
Remove the liveness probe to prevent the Prometheus pod from being
killed during WAL replay.
This should be reverted around the Kubernetes 1.21 release. At that
point a startupProbe should be added instead.
When the Thanos spec doesn't configure object storage, there's no need to
configure the Thanos sidecar for block uploads and mount the
Prometheus data volume.
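The condition described above could be sketched roughly as follows.
This is not the operator's code: `thanosSpec`, `sidecarArgs`, the
Prometheus URL, the `/prometheus` path and the `$(OBJSTORE_CONFIG)`
placeholder are all assumptions made for illustration.

```go
package main

import "fmt"

// thanosSpec is a hypothetical, trimmed-down stand-in for the Thanos
// section of a Prometheus spec.
type thanosSpec struct {
	// ObjectStorageConfigured reports whether object storage was set.
	ObjectStorageConfigured bool
}

// sidecarArgs returns the sidecar arguments and whether the Prometheus
// data volume needs to be mounted into the sidecar container.
func sidecarArgs(spec thanosSpec) (args []string, mountData bool) {
	args = []string{"sidecar", "--prometheus.url=http://localhost:9090"}
	if !spec.ObjectStorageConfigured {
		// No uploads: skip the TSDB path, the object storage
		// configuration and the data volume mount.
		return args, false
	}
	args = append(args,
		"--tsdb.path=/prometheus",               // assumed mount path
		"--objstore.config=$(OBJSTORE_CONFIG)", // assumed placeholder
	)
	return args, true
}

func main() {
	for _, spec := range []thanosSpec{{}, {ObjectStorageConfigured: true}} {
		args, mount := sidecarArgs(spec)
		fmt.Println(mount, args)
	}
}
```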
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
This change adds 3 metrics tracking client-go requests to the Kubernetes
API:
* `prometheus_operator_kubernetes_client_http_requests_total`, counter
with a `status_code` label.
* `prometheus_operator_kubernetes_client_http_request_duration_seconds`,
summary with an `endpoint` label.
* `prometheus_operator_kubernetes_client_rate_limiter_duration_seconds`,
summary with an `endpoint` label.
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
Previously the operator would fail the reconciliation when a service
monitor was referencing a bad secret or configmap (either the object
didn't exist or the key was missing).
With this change, the operator will skip these service monitors.
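The skip-instead-of-fail behaviour could look roughly like the sketch
below. It is an illustration only: `monitor`, `resolve`, `selectValid`
and the in-memory store are hypothetical names, and the operator itself
would look the objects up through the Kubernetes API.

```go
package main

import (
	"errors"
	"fmt"
)

// monitor is a hypothetical stand-in for a service monitor that
// references a key inside a secret or configmap.
type monitor struct {
	Name string
	Ref  string // "object/key"
}

var errNotFound = errors.New("object or key not found")

// resolve looks up a reference in an in-memory store (assumption: the
// real lookup would hit the Kubernetes API).
func resolve(store map[string]string, ref string) (string, error) {
	v, ok := store[ref]
	if !ok {
		return "", fmt.Errorf("%q: %w", ref, errNotFound)
	}
	return v, nil
}

// selectValid keeps the monitors whose references resolve and returns
// the names of the skipped ones, rather than failing on the first error.
func selectValid(store map[string]string, monitors []monitor) (valid []monitor, skipped []string) {
	for _, m := range monitors {
		if _, err := resolve(store, m.Ref); err != nil {
			skipped = append(skipped, m.Name)
			continue
		}
		valid = append(valid, m)
	}
	return valid, skipped
}

func main() {
	store := map[string]string{"basic-auth/password": "s3cr3t"}
	monitors := []monitor{
		{Name: "good", Ref: "basic-auth/password"},
		{Name: "missing-key", Ref: "basic-auth/token"},
	}
	valid, skipped := selectValid(store, monitors)
	fmt.Println(len(valid), skipped)
}
```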
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
`controller-gen` does not work across package boundaries. Run it from
inside the API package directory to work around this.
Signed-off-by: Matthias Rampke <matthias@rampke.de>
This allows others to import them without incurring all the dependencies
of the operator transitively, and avoid version conflicts with other
dependencies as much as possible.
Fixed #3097.
Signed-off-by: Matthias Rampke <matthias@rampke.de>
This test generates the same configuration many times, for each
Prometheus version, to see if it is deterministic. As the compatibility
matrix grows, test times increase. Now, this sometimes fails in CI
because Travis kills jobs after 10 minutes of no output.
Run each version as a subtest, and run tests with `-v`, so that output
is produced after each version. This avoids the no-output timeout.
Parallelize testing for each Prometheus version.
When the tests are run with `-short` (as in `make test-unit`), only try
one hundred iterations. With the race detector on, as in that target, this takes
around 5 seconds. Without the race detector, short tests on this
package now run quickly enough for fast iteration in an IDE.
Add an additional target and Travis job for running the long tests, but
without the race detector. This brings the run time for the full 1000
iterations per version to under a minute.
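The test layout described above could be sketched like this. The names
`generateConfig`, `iterations` and `TestConfigDeterministic`, as well as
the version strings, are hypothetical; only the 100 and 1000 iteration
counts come from this message. In practice the test function would live
in a `_test.go` file.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"testing"
)

// generateConfig is a hypothetical stand-in for the config generator.
// Keys are sorted so map iteration order cannot leak into the output.
func generateConfig(version string, labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var b strings.Builder
	fmt.Fprintf(&b, "version=%s", version)
	for _, k := range keys {
		fmt.Fprintf(&b, ";%s=%s", k, labels[k])
	}
	return b.String()
}

// iterations returns how often to regenerate the config per version.
func iterations(short bool) int {
	if short {
		return 100
	}
	return 1000
}

// TestConfigDeterministic runs one parallel subtest per version, so
// `go test -v` prints output as each version finishes.
func TestConfigDeterministic(t *testing.T) {
	labels := map[string]string{"app": "demo", "env": "test", "team": "infra"}
	n := iterations(testing.Short())
	for _, version := range []string{"v2.18.0", "v2.19.0"} { // example versions
		version := version // capture for the parallel closure
		t.Run(version, func(t *testing.T) {
			t.Parallel()
			want := generateConfig(version, labels)
			for i := 0; i < n; i++ {
				if got := generateConfig(version, labels); got != want {
					t.Fatalf("iteration %d: got %q, want %q", i, got, want)
				}
			}
		})
	}
}

func main() {
	fmt.Println(iterations(true), iterations(false))
}
```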
Signed-off-by: Matthias Rampke <matthias@rampke.de>
Alertmanager in cluster mode resolves the DNS name of each peer and
caches its IP address, which it uses at regular intervals to 'refresh'
the connection.
In a highly dynamic environment like Kubernetes, Alertmanager pods may
come and go frequently. The default timeout value of 6h is not suitable
in that case, as Alertmanager will keep trying to reconnect to a
non-existent pod over and over until it gives up and removes that peer
from the member list. During this period of time, the cluster is
reported to be in a degraded state due to the missing member.
As such, it's best to use a lower value which will allow the
alertmanager to remove the pod from the list of peers soon
after it disappears.
Related: https://github.com/prometheus/alertmanager/issues/2250