mirror of https://github.com/prometheus-operator/prometheus-operator.git synced 2025-04-16 01:06:27 +00:00
990 commits

Author SHA1 Message Date
paulfantom
35b2954459
pkg/prometheus: remove liveness probe
Removing liveness probe to prevent killing prometheus pod during WAL
replay.

This should be reverted around kubernetes 1.21 release. At that point
startupProbe should be added.
2020-09-15 12:05:18 +02:00
Simon Pasquier
675d303ee0
pkg/prometheus: enable Thanos uploads only when needed (#3485)
When the Thanos spec doesn't configure object storage, there's no need to
configure the Thanos sidecar for block uploads and mount the
Prometheus data volume.

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2020-09-11 16:16:19 +02:00
Sergiusz Urbaniak
289ee029ef
Merge pull request #3440 from s-urbaniak/remove-mlw
remove multilistwatcher and denylistfilter
2020-09-08 07:34:39 +02:00
Sergiusz Urbaniak
c786d8ef2e pkg/informers: add missing godoc 2020-09-07 15:24:18 +02:00
Sergiusz Urbaniak
34ba8237f5 pkg/informers: fix stylistic nits
Co-authored-by: Simon Pasquier <spasquie@redhat.com>
2020-09-04 17:08:33 +02:00
Sergiusz Urbaniak
4f36b38e6c pkg/informers: add unit tests 2020-09-04 17:08:33 +02:00
Sergiusz Urbaniak
badeafdc36 pkg/informers: add godoc 2020-09-04 17:08:33 +02:00
Sergiusz Urbaniak
5e94344182 pkg/listwatch: remove multilistwatcher 2020-09-04 17:08:33 +02:00
Sergiusz Urbaniak
2379f59f6f pkg/prometheus: check error immediately after List 2020-09-04 17:08:33 +02:00
Sergiusz Urbaniak
27c1680975 pkg/*: renamings and reformatting 2020-09-04 17:08:33 +02:00
Sergiusz Urbaniak
0c9283465a pkg/thanos: remove multilistwatcher 2020-09-04 17:08:33 +02:00
Sergiusz Urbaniak
920f2490d9 pkg/alertmanager: remove multilistwatcher 2020-09-04 17:08:33 +02:00
Sergiusz Urbaniak
e9ad330bf8 pkg/prometheus: remove multilistwatcher 2020-09-04 17:08:33 +02:00
Sergiusz Urbaniak
f22fd2c7c0 pkg/listwatch: remove denylist ListerWatcher 2020-09-04 16:58:51 +02:00
Sergiusz Urbaniak
54bbe620bb pkg/informers: initial commit 2020-09-04 16:58:51 +02:00
Simon Pasquier
3b2e17d714 Instrument client-go requests
This change adds 3 metrics tracking client-go requests to the Kubernetes
API:

* `prometheus_operator_kubernetes_client_http_requests_total`, counter
  with a `status_code` label.
* `prometheus_operator_kubernetes_client_http_request_duration_seconds`,
  summary with an `endpoint` label.
* `prometheus_operator_kubernetes_client_rate_limiter_duration_seconds`,
  summary with an `endpoint` label.

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2020-09-04 16:03:13 +02:00
Simon Pasquier
053da63f0b *: pass context.Context to client-go functions
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2020-09-03 14:13:31 +02:00
Sergiusz Urbaniak
d1e9fc77e2
Merge pull request #3395 from matthiasr/mr/pkg-monitoring
Break the API types out into their own module
2020-09-02 09:35:50 +02:00
Sergiusz Urbaniak
909fc64585
Merge pull request #3445 from simonpasquier/fix-3327
pkg/prometheus: skip invalid service monitors
2020-08-31 16:56:45 +02:00
Sergiusz Urbaniak
608be1baec
Merge pull request #3436 from hwoarang/add-cluster-reconnect-timeout
pkg/alertmanager: Use lower value for --cluster.reconnect-timeout
2020-08-31 15:39:11 +02:00
Simon Pasquier
7ed47043ce Add tests for assetStore
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2020-08-31 14:51:30 +02:00
Simon Pasquier
a0a1816f4c Use cache.Store instead of custom stores 2020-08-31 10:51:09 +02:00
Simon Pasquier
caf6b9f3ce pkg/prometheus: skip invalid service monitors
Previously the operator would fail the reconciliation when a service
monitor was referencing a bad secret or configmap (either the object
didn't exist or the key was missing).

With this change, the operator will skip these service monitors.

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2020-08-31 10:51:09 +02:00
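
The behaviour described in the commit above (skipping instead of failing) can be sketched roughly as follows. This is a minimal illustration, not the operator's code: `serviceMonitor`, `checkSecretRef`, `selectValidMonitors` and the in-memory `secrets` map are hypothetical stand-ins for the real API types and Kubernetes lookups.

```go
package main

import "fmt"

// serviceMonitor is a simplified, hypothetical stand-in for the real API type.
type serviceMonitor struct {
	Name       string
	SecretName string // empty means the monitor references no secret
	SecretKey  string
}

// checkSecretRef returns an error when the referenced secret does not exist
// or does not contain the referenced key. The secrets map (name -> key -> value)
// is an assumption replacing a real Kubernetes lookup.
func checkSecretRef(m serviceMonitor, secrets map[string]map[string]string) error {
	if m.SecretName == "" {
		return nil
	}
	data, ok := secrets[m.SecretName]
	if !ok {
		return fmt.Errorf("secret %q not found", m.SecretName)
	}
	if _, ok := data[m.SecretKey]; !ok {
		return fmt.Errorf("key %q missing in secret %q", m.SecretKey, m.SecretName)
	}
	return nil
}

// selectValidMonitors keeps the monitors whose references resolve and records
// the others as skipped, instead of aborting on the first bad reference.
func selectValidMonitors(monitors []serviceMonitor, secrets map[string]map[string]string) (valid []serviceMonitor, skipped []string) {
	for _, m := range monitors {
		if err := checkSecretRef(m, secrets); err != nil {
			skipped = append(skipped, m.Name+": "+err.Error())
			continue
		}
		valid = append(valid, m)
	}
	return valid, skipped
}

func main() {
	secrets := map[string]map[string]string{"auth": {"token": "s3cr3t"}}
	monitors := []serviceMonitor{
		{Name: "ok", SecretName: "auth", SecretKey: "token"},
		{Name: "bad-key", SecretName: "auth", SecretKey: "password"},
		{Name: "bad-secret", SecretName: "missing", SecretKey: "token"},
	}
	valid, skipped := selectValidMonitors(monitors, secrets)
	fmt.Println(len(valid), "valid;", len(skipped), "skipped")
	for _, s := range skipped {
		fmt.Println("skipped", s)
	}
}
```

The point of the design is that one broken reference no longer blocks configuration generation for every other service monitor.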
Matthias Rampke
2a67feba74
Break the API types out into their own module
This allows others to import them without incurring all the dependencies
of the operator transitively, and avoid version conflicts with other
dependencies as much as possible.

Fixed #3097.

Signed-off-by: Matthias Rampke <matthias@rampke.de>
2020-08-28 13:41:46 +00:00
Matthias Rampke
76d5211a6c
Avoid CI timeouts in TestConfigGeneration (#3432)
This test generates the same configuration many times, for each
Prometheus version, to see if it is deterministic. As the compatibility
matrix grows, test times increase. Now, this sometimes fails in CI
because Travis kills jobs after 10 minutes of no output.

Run each version as a subtest, and run tests with `-v`, so that output
is produced after each version. This avoids the no-output timeout.

Parallelize testing for each Prometheus version.

When the tests are run with `-short` (as in `make test-unit`), only try
one hundred iterations. With the race detector on, as in that target, this takes
around 5 seconds. Without the race detector, short tests on this
package now run quickly enough for fast iteration in an IDE.

Add an additional target and Travis job for running the long tests, but
without the race detector. This brings the run time for the full 1000
iterations per version to under a minute.

Signed-off-by: Matthias Rampke <matthias@rampke.de>
2020-08-28 14:53:32 +02:00
Markos Chandras
86102e73e9
pkg/alertmanager: Use lower value for --cluster.reconnect-timeout
Alertmanager in cluster mode resolves the DNS name of each peer and
caches its IP address, which it uses at regular intervals to 'refresh'
the connection.

In a highly dynamic environment like Kubernetes, it's possible that
alertmanager pods come and go at frequent intervals. The default timeout
value of 6h is not suitable in that case, as alertmanager will keep
trying to reconnect to a non-existing pod over and over until it gives
up and removes that peer from the member list. During this period of
time, the cluster is reported to be in a degraded state due to the
missing member.

As such, it's best to use a lower value which will allow the
alertmanager to remove the pod from the list of peers soon
after it disappears.

Related: https://github.com/prometheus/alertmanager/issues/2250
2020-08-26 13:02:35 +03:00
Simon Pasquier
0811e8f65c pkg/alertmanager: cleanup resources via OwnerReferences
The Alertmanager controller deleted dependent resources manually while
prometheus and thanos rely on Kubernetes to do the work using
OwnerReferences.

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2020-08-20 16:21:07 +02:00
Simon Pasquier
e64718cb6b
pkg: add prometheus_operator_reconcile_operations_total metric (#3415)
* pkg: add prometheus_operator_reconcile_operations_total metric

We already have the `prometheus_operator_reconcile_errors_total` metric
to track the number of reconciliation attempts that failed, but we lack
the total number of attempts, which makes it harder to alert on failures.
With this change, we can compute the ratio of reconciliations that failed.

Signed-off-by: Simon Pasquier <spasquie@redhat.com>

* Update alert definition with new metric
2020-08-19 16:41:02 +02:00
Matthias Rampke
5c1f668c97
Fix validation logic for SecretOrConfigMap
This was flagged by
[golangci-lint](https://staticcheck.io/docs/checks#SA4022). The check
was for the address of the pointer, not the value. Add a test (failing
on master) to verify this, and fix the validation logic.

Follow-up to #2716.

Signed-off-by: Matthias Rampke <matthias@rampke.de>
2020-08-17 11:50:59 +00:00
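
The class of bug named in the commit above (comparing the address of a pointer field against nil, rather than the pointer itself) can be illustrated with a small sketch. The types below are simplified, hypothetical stand-ins, and the rule that at most one of the two fields may be set is an assumption made for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// keySelector and secretOrConfigMap are simplified, hypothetical stand-ins
// for the real API types.
type keySelector struct{ Name, Key string }

type secretOrConfigMap struct {
	Secret    *keySelector
	ConfigMap *keySelector
}

// validate rejects values that set both fields (assumed rule for this sketch).
func (c *secretOrConfigMap) validate() error {
	// Buggy form, as flagged by staticcheck SA4022:
	//   if &c.Secret != nil && &c.ConfigMap != nil { ... }
	// The address of a field is never nil, so that condition is always true.
	// The correct check compares the pointer values themselves:
	if c.Secret != nil && c.ConfigMap != nil {
		return errors.New("secretOrConfigMap cannot specify both Secret and ConfigMap")
	}
	return nil
}

func main() {
	onlySecret := secretOrConfigMap{Secret: &keySelector{Name: "tls", Key: "ca.crt"}}
	both := secretOrConfigMap{
		Secret:    &keySelector{Name: "tls", Key: "ca.crt"},
		ConfigMap: &keySelector{Name: "tls-cm", Key: "ca.crt"},
	}
	fmt.Println("only secret:", onlySecret.validate())
	fmt.Println("both set:", both.validate())
}
```

With the buggy comparison, a value that sets only one field would also be rejected, which is the kind of case a regression test can pin down.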
Sergiusz Urbaniak
54704fac8f
Merge pull request #3392 from lilic/fix-image-tag-version
pkg/operator/image.go: Adjust image path building
2020-08-11 10:43:24 +02:00
Lili Cosic
7b4a9d740d pkg/prometheus/statefulset_test.go: Adjust tests 2020-08-10 14:49:55 +02:00
Lili Cosic
49e2842c49 pkg/alertmanager,thanos,prometheus: Adjust usage 2020-08-10 14:49:55 +02:00
Lili Cosic
caed11f835 pkg/operator/image.go: Adjust image path building
The image may already contain a tag; if it does, the image is returned
unchanged. Otherwise the sha/digest takes priority over the tag, and lastly
the version is used, for historical reasons.
2020-08-10 14:49:55 +02:00
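
The precedence described in the commit above could look roughly like this. It is a sketch under assumptions, not the contents of `pkg/operator/image.go`: the function names, the `@sha256:` digest format, and the way an existing tag is detected are all hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// hasTagOrDigest reports whether the image reference already pins a tag or a
// digest. Only the part after the last "/" is checked for ":", so that a
// registry port such as "localhost:5000" is not mistaken for a tag.
func hasTagOrDigest(image string) bool {
	if strings.Contains(image, "@") {
		return true
	}
	lastSlash := strings.LastIndex(image, "/")
	return strings.Contains(image[lastSlash+1:], ":")
}

// buildImagePath is a hypothetical helper applying the described order:
// an already-tagged image wins, then sha/digest, then tag, then version.
func buildImagePath(image, version, tag, sha string) string {
	if hasTagOrDigest(image) {
		return image
	}
	switch {
	case sha != "":
		return image + "@sha256:" + sha // digest format is an assumption
	case tag != "":
		return image + ":" + tag
	case version != "":
		return image + ":" + version
	}
	return image
}

func main() {
	base := "quay.io/prometheus/prometheus"
	fmt.Println(buildImagePath(base+":v2.20.0", "v2.19.0", "", ""))
	fmt.Println(buildImagePath(base, "v2.20.0", "latest", "abc123"))
	fmt.Println(buildImagePath(base, "v2.20.0", "latest", ""))
	fmt.Println(buildImagePath(base, "v2.20.0", "", ""))
}
```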
郑佳金
d90df0a0e7 make generate 2020-08-06 16:59:29 +08:00
郑佳金
9c066705a4 feat: support special post alerts timeout 2020-08-06 16:59:15 +08:00
paulfantom
67780ccc45
repository migration to prometheus-operator organization 2020-08-05 13:13:46 +02:00
Sören Jentzsch
7778fe0239
Allow for enabling Alertmanager HA cluster mode even when running with single replica, via newly introduced forceEnableClusterMode flag.
With #3196 we lost the possibility to set up Alertmanager clusters with a single replica across multiple Kubernetes clusters.

Fixes #3337
2020-08-04 01:36:10 +02:00
Frederic Branczyk
ef0bc1c45a
Merge pull request #3377 from coderanger/patch-1
🐛 Don't overwrite __param_target
2020-08-03 11:11:06 +02:00
Noah Kantrowitz
41c2202698 🐛 Don't overwrite __param_target
It is already set above using the sd metadata, no need to overwrite it back to __address__.
2020-08-01 23:15:58 -07:00
Frederic Branczyk
6c8f7fa6b6
Merge pull request #3374 from vincent-pli/clearify-targetport-servicemonitor
Clarify targetPort in endpoint
2020-07-31 11:26:17 +02:00
pengli
2ebe8247d8 Clarify targetPort in endpoint 2020-07-30 17:39:15 -07:00
Lili Cosic
8f49757672 pkg/listwatch: Change to accept single instance of rvs 2020-07-29 11:46:06 +02:00
Michal Fojtik
7bbd81692a listwatch: do not duplicate resource versions 2020-07-29 11:45:59 +02:00
Frederic Branczyk
f6b342d3f7
Merge pull request #3364 from coreos/revert-3308-normalize-default-durations
Revert "Normalize default durations"
2020-07-27 11:38:28 +02:00
Frederic Branczyk
f1e0131c1b
Merge pull request #3358 from jbfavre/fix_prometheus_version_propagation
Propagate Prometheus image version to statefulset
2020-07-27 11:20:16 +02:00
Frederic Branczyk
024da7b667
Fix expected default probe scrape interval 2020-07-27 10:29:46 +02:00
Frederic Branczyk
1d00eeb962
Revert "Normalize default durations" 2020-07-27 07:42:21 +02:00
Jean-Baptiste Favre
c710ec3e39 Fix Go format 2020-07-24 14:14:13 +02:00
Jean-Baptiste Favre
dc2a4527c2 Improve unit tests for Version, Tag & SHA matrix 2020-07-24 14:07:58 +02:00
Simon Pasquier
2021270248 pkg: instrument resources being tracked by the operator
This change adds a new `prometheus_operator_resources` metric that keeps
track of the number of resources currently managed by the operator. The
metric is broken down by controller and type of resource.

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2020-07-24 13:39:01 +02:00