1
0
Fork 0
mirror of https://github.com/kubernetes-sigs/node-feature-discovery.git synced 2025-03-10 02:37:11 +00:00
Commit graph

199 commits

Author SHA1 Message Date
Markus Lehtonen
7a050e7cf9 nfd-master: ditch apihelper
Implement some of frequently used helper functions inpackage.

This patch also contains big changes to the nfd-master unit tests. Much
of this is about migrating from the mocked apihelper interface to fake
kubernetes client that provides a bit more apiserver'ish functionality.
At the same time there is quite a bit of renaming in the tests,
shortening and unifying naming and getting rid of the extensive usage of
"mock" everywhere.
2024-01-26 16:09:22 +02:00
Markus Lehtonen
53003cbf69 pkg/utils: move JsonPatch from pkg/apihelper 2024-01-25 17:23:14 +02:00
Markus Lehtonen
acf815fb10 pkg/utils: move GetKubeconfig from pkg/apihelper here
This change is part of an effort to remove the pkg/apihelper package.
GetKubeconfig is useful helper functionality shared accross the codebase
so move it into a "safe" location.
2024-01-24 16:10:02 +02:00
Markus Lehtonen
57b7a3c6a8 Wrap nested errors 2024-01-22 22:45:15 +02:00
Markus Lehtonen
58ae81804c go.mod: update dependencies 2024-01-15 21:29:32 +02:00
Markus Lehtonen
a053efda64 nfd-master: run a separate gRPC health server
This patch separates the gRPC health server from the deprecated gRPC
server (disabled by default, replaced by the NodeFeature CRD API) used
for node labeling requests. The new health server runs on hardcoded TCP
port number 8082.

The main motivation for this change is to make the Kubernetes' built-in
gRPC liveness probes to function if TLS is enabled (as they don't
support TLS).

The health server itself is a naive implementation (as it was before),
basically only checking that nfd-master has started and hasn't crashed.
The patch adds a TODO note to improve the functionality.
2024-01-04 13:58:26 +02:00
Markus Lehtonen
97bf841140 apis/nfd: split rule processing into a separate package
This patch tidies up the nfdv1alpha1 API package by refactoring out the
implementation of (NodeFeature)Rule evaluation into a separate package.
2023-12-20 12:52:15 +02:00
Markus Lehtonen
cb0a46ec0e Use generics for maps and slices 2023-12-13 12:09:53 +02:00
Markus Lehtonen
a77983556f nfd-master: remove default denied ns from config
These are now handled by the validate package. If we have them here in
nfd-master, the default namespace (feature.node.kubernetes.io) gets
denied.
2023-12-12 16:12:53 +02:00
Carlos Eduardo Arango Gutierrez
affb93ea50
Create a Validate pkg
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
2023-12-11 16:54:22 +01:00
Markus Lehtonen
1d012a28cd Option to stop implicitly adding default prefix to names
Add new autoDefaultNs (default is "true") config option to nfd-master.
Setting the config option to false stops NFD from automatically adding
the "feature.node.kubernetes.io/" prefix to labels, annotations and
extended resources. Taints are not affected as for them no prefix is
automatically added. The user-visible part of enabling the option change
is that NodeFeatureRules, local feature files, hooks and configuration
of the "custom" may need to be altereda (if the auto-prefixing is
relied on).

For now, the config option defaults to "true", meaning no change in
default behavior. However, the intent is to change the default to
"false" in a future release, deprecating the option and eventually
removing it (forcing it to "false").

The goal of stopping doing "auto-prefixing" is to simplify the operation
(of nfd and users). Make the naming more straightforward and easier to
understand and debug (kind of WYSIWYG), eliminating peculiar corner
cases:

1. Make validation simpler and unambiguous
2. Remove "overloading" of names, i.e. the mapping two values to the
   same actual name. E.g. previously something like

      labels:
        feature.node.kubernetes.io/foo: bar
        foo: baz

   Could actually result in node label:

     feature.node.kubernetes.io/foo: baz

3. Make the processing/usagee of the "rule.matched" and "local.labels"
   feature in NodeFeatureRules unambiguous and more understadable. E.g.
   previously you could have node label
   "feature.node.kubernetes.io/local-foo: bar" but in the NodeFeatureRule
   you'd need to use the unprefixed name "local-foo" or the fully
   prefixed name, depending on what was specified in the feature file (or
   hook) on the node(s).

NOTE: setting autoDefaultNs to false is a breaking change for users who
rely on automatic prefixing with the default feature.node.kubernetes.io/
namespace. NodeFeatureRules, feature files, hooks and custom rules
(configuration of the "custom" source of nfd-worker) will need to be
altered.  Unprefixed labels, annoations and extended resources will be
denied by nfd-master.
2023-11-24 12:48:20 +02:00
Markus Lehtonen
dc5af8be04 nfd-master: predictable handling of unprefixed names
Make the handling of unprefixed names (of labels, annotations and
extended resources) well-defined and predictable. Previously the
resulting output was not predictable in case the same name was coming in
both the unprefixed and prefixed form, say unprefixed "foo=bar" coming from
one source (be it nfd-worker or NodeFeature(Rule)) and
"feature.node.kubernetes.io/foo=baz" from a NodeFeature(Rule).
Previously the output value was randomly either "bar" or "baz".

This patch adds prefixes to all names early in the processing
"pipeline", preventing random name clashes later on.
2023-11-23 22:16:04 +02:00
Markus Lehtonen
678d7e89cb nfd-master: drop stale variables
Remove some stale variables that were leftover from the recent removal
of nfd version annotations.
2023-11-23 19:01:22 +02:00
Carlos Eduardo Arango Gutierrez
c0063be4f4
Discover node features as annotations
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Co-authored-by: bebc <mchf1990212@gmail.com>
Co-authored-by: Markus Lehtonen <markus.lehtonen@intel.com>
2023-10-25 19:58:58 +02:00
Markus Lehtonen
a9849f20ff nfd-master: fix retry of node updates
This patch addresses issues with slow node status (extended resources)
updates. Previously we did just a few retries in quick succession which
could result in the node update failing, just because node status was
updated slower than our retry window. The patch mitigates the issue by
increasing the number of tries to 15. In addition, it creates a
ratelimiter with a longer per-item (per-node) base delay.

The patch also fixes the e2e-tests to expose the issue.
2023-10-20 17:24:01 +03:00
Markus Lehtonen
5171ae0f90 Refactor metrics
Move common boilerplate code under pkg/utils.
2023-10-09 10:49:12 +03:00
Markus Lehtonen
1d8a83b045 nfd-master: stop creating NFD version annotations
We now have metrics for getting detailed information about the NFD
instances running. There should be no need to pollute the node object
with NFD version annotations.

One problem with the annotations also that they were incomplete in the
sense that they only covered nfd-master and nfd-worker but not
nfd-topology-updater or nfd-gc.

Also, there was a problem with stale annotations, giving misleading
information. E.g. there was no way to remove old/stale master.version
annotations if nfd-master was scheduled on another node where it was
previously running.
2023-10-05 14:53:29 +03:00
Markus Lehtonen
9ea0a1b420 nfd-master: correctly clean up annotations
Delete correct annotations if -instance is specified.
2023-10-05 11:10:06 +03:00
Markus Lehtonen
b09ce75c8e nfd-master: fix filtering of extended resources
Fix a bug in checking the allowed ".feature.node.kubernetes.io" ns
suffix for extended resources. Also update e2e-tests to cover this case.
2023-09-27 10:55:11 +03:00
AhmedGrati
7ab6314bdc chore: introduce a commong klog handling for cmd/nfd-*
Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>
2023-09-07 22:38:15 +01:00
AhmedGrati
b0be40aa09 feat: add logging parameters in configuration file for nfd master
Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>
2023-09-06 15:27:27 +01:00
Kubernetes Prow Robot
f852c32a55
Merge pull request #1252 from AhmedGrati/test-add-updater-pool-unit-tests
test: add node updater pool unit tests
2023-09-01 07:34:32 -07:00
Markus Lehtonen
5ad2294c14 metrics: add nfd_node_update_requests_total counter
Add a counter for total number of node update/sync requests. In
practice, this counts the number of gRPC requests received if the gRPC
API is in use. If the NodeFeature API is enabled, this counts the
requests initiated by the NFD API controller, i.e. updates triggered by
changes in NodeFeature or NodeFeatureRule objects plus updates initiated
by the controller resync period.
2023-08-07 09:37:29 +03:00
Markus Lehtonen
4b24cc1afa metrics: counters for rejected labels, extended resources and taints
Add counters for labels, extended resources and taints rejected/filtered
out by nfd-master.
2023-08-07 09:37:29 +03:00
Markus Lehtonen
a8a29e6df2 metrics: add nfd_nodefeaturerule_processing_errors_total counter
Add a counter for errors encountered when processing NodeFeatureRules.
Another simple counter without any additional prometheus labels -
nfd-master logs can provide further details.
2023-08-07 09:37:29 +03:00
Markus Lehtonen
b90f2c318e metrics: add nfd_node_update_failures_total counter
Add a new counter for tracking node update failures from nfd-master.
This tracks both normal feature updates and the --prune sub-command.
This is a simple counter without any additional labels - nfd-master logs
can be used for further diagnostics.
2023-08-07 09:37:27 +03:00
Markus Lehtonen
039378c725 nfd-master: use term node update instead of labeling
Rename symbols and reword log messages to correlate with the
functionality (we may do other updates than just modify labels
nowadays).
2023-08-01 16:42:34 +03:00
Markus Lehtonen
d8f167d8a9 nfd-master: remove one stale empty line 2023-08-01 16:38:32 +03:00
Markus Lehtonen
47f621d970 metrics: improve the node updates gauge
Rename the metric, better describe what we're measuring and better
comply with prometheus naming conventions. Also change it to represent
actual updates of the node object on the Kubernetes apiserver.
2023-07-31 19:45:22 +03:00
Markus Lehtonen
945e7fcb3f metrics: improve nfr processing time metric
Change the metric from a simple gauge (that basically was a single value
for the whole cluster) into a HistogramVec, aligning with the feature
discovery duration metric in nfd-worker. This improved metric now has
prometheus labels for the NFR name and node name, i.e. it is tracking
per-NFR metric for each node being processed. Also, change the naming to
better comply with prometheus suggested conventions.
2023-07-31 19:45:22 +03:00
Kubernetes Prow Robot
77d869c4f7
Merge pull request #1242 from ArangoGutierrez/metrics
Enable metrics via prometheus operator
2023-07-21 02:26:08 -07:00
Carlos Eduardo Arango Gutierrez
e3aedd33e2
Enable metrics via prometheus operator
Expose metrics via prometheus.monitoring.coreos.com/v1

The exposed metrics are

| Metric        | Type | Meaning |
| --------------- | ---------------- | ---------------- |
|  `nfd_master_build_info`           | Gauge | Version from which nfd-master was built. |
|  `nfd_worker_build_info`           | Gauge | Version from which nfd-worker was built. |
|  `nfd_updated_nodes`           |  Counter | Time taken to label a node |
|  `nfd_crd_processing_time`          |  Gauge | Time taken to process a NodeFeatureRule CRD |
| `nfd_feature_discovery_duration_seconds` |  HistogramVec | Time taken to discover features on a node |

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Co-authored-by: Markus Lehtonen <markus.lehtonen@intel.com>
2023-07-21 10:59:52 +02:00
AhmedGrati
8e55d78d85 test: add node updater pool unit tests
Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>
2023-07-19 12:03:35 +01:00
Markus Lehtonen
dac45be28c nfd-master: check for nil references in nfdAPIUpdateAllNodes
Just a safeguard.
2023-07-17 17:49:44 +03:00
guoguangwu
b946bcc0f5 nfd-master-internal_test.go rm pkg imported twice
Signed-off-by: guoguangwu <guoguangwu@magic-shield.com>
2023-06-21 16:53:55 +08:00
Kubernetes Prow Robot
306969a945
Merge pull request #1133 from AhmedGrati/feat-parallelize-nodes-update
feat: parallelize nodes update
2023-06-02 05:28:57 -07:00
AhmedGrati
b3cfe17392 feat: parallelize nodes update
This PR aims to optimize the process of updating nodes with
corresponding features. In fact, previously, we were updating nodes
sequentially even though they are independent from each other.
Therefore, we integrated new components: LabelersNodePool which is
responsible for spininng a goroutine whenever there's a request for
updating nodes, and a Workqueue which is responsible for holding nodes names
that should be updated.

Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>
2023-06-02 11:41:50 +01:00
AhmedGrati
08b9c3486e feat: support dynamic values for labels in the NodeFeatureRule
This PR aims to support the dynamic values for labels in the
NodeFeatureRule CRD, it would offer more flexible labeling for users.
To achieve this, we check whether label value starts with "@", and if
it's the case, we will get the value of the feature value, and update
the value of the label with the feature value.

Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>
2023-05-31 23:30:26 +01:00
Markus Lehtonen
bf670de68d pkg/utils: migrate KlogDump to structured logging
Drop the KlogDump helper in favor of klog.InfoS. However, that patch
introduces a new DelayedDumper() helper to avoid processing
(marshalling) of object unless really evaluated by the logging function.
2023-05-31 14:43:08 +03:00
Markus Lehtonen
8113d651c2 nfd-master: migrate to structured logging 2023-05-31 14:43:05 +03:00
Markus Lehtonen
2a3c7e4c93 nfd-master: add validation of label names and values
Validate labels before trying to update the node. Makes us fail early
nad prevent useless retries in case invalid labels are tried.
2023-05-29 16:54:14 +03:00
Markus Lehtonen
1809c24314 nfd-master: use close for stop channel
Simpler and more reliable (in case of multiple consumers) to just close
the channel.
2023-05-24 16:51:48 +03:00
PiotrProkop
272fd4784f Add new flag enable-leader-election for nfd-master.
It allows NFD-master to be run in active-passive way when running
multiple instances of NFD-master to prevent multiple components
from updating same custom resources.

Signed-off-by: PiotrProkop <pprokop@nvidia.com>
2023-05-15 13:30:07 +02:00
Kubernetes Prow Robot
85073525c3
Merge pull request #1185 from AhmedGrati/fix-resync-period-functionality
nfd-master: fix resync period config option
2023-05-02 11:14:16 -07:00
AhmedGrati
87c2d7e184 nfd-master: fix resync period config option
This PR fixes the resync-period configuration option of the nfd-master.
In fact, previously, changes were not reflected in the nfd-master at
runtime. e2e tests are also implemented to make sure that the fix is
already working as expected.

Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>
2023-05-02 13:17:01 +02:00
Markus Lehtonen
fb20388028 nfd-master: refactor filtering of taints 2023-04-28 18:13:54 +03:00
Markus Lehtonen
43ced0c1a1 nfd-master: refactor filtering of feature labels
More consistent error messages. Also preparation for dynamic labels
values (that '@' notation currently supported for extended resources).
2023-04-28 18:13:54 +03:00
Markus Lehtonen
6ca687fbef nfd-master: refactor filtering of extended resources
Simplify code a bit and get more consistent error messages (in addition
to fixing some of those).
2023-04-28 18:13:54 +03:00
Markus Lehtonen
131325fb2c nfd-master: refactor api-controller object handling
Split out resolving of node name (of the node to be updated) into a
separate function. Makes it possible to add unit tests. Also. do
unconditional type casting in the handler functions – that shouldn't
fail unless there is a really serious internal inconsistency in the
codebase so it should be ok to panic.
2023-04-28 17:33:33 +03:00
Markus Lehtonen
77011a775f nfd-master: log node name when processing NodeFeatureRules 2023-04-26 07:22:30 +03:00