1
0
Fork 0
mirror of https://github.com/kubernetes-sigs/node-feature-discovery.git synced 2024-12-14 11:57:51 +00:00
Commit graph

368 commits

Author SHA1 Message Date
Markus Lehtonen
b988139094 apis/nfd: validate input when matching expression
Don't assume that the fields are correct.
2023-12-01 09:22:32 +02:00
Markus Lehtonen
94bffbf645 generate: update kube code-gen to v1.28.4 2023-11-29 18:37:19 +02:00
Kubernetes Prow Robot
dfef0ebe4a
Merge pull request #1472 from marquiz/devel/typo-fix
nfd-worker: fix typo in log message
2023-11-24 16:53:49 +01:00
Markus Lehtonen
f266533a7d nfd-worker: fix typo in log message 2023-11-24 17:17:42 +02:00
Markus Lehtonen
f6c360188e Use T.Run in expression unit tests
The "better way" of running test cases, get e.g. better output in case
of errors.

Also drop some unneeded type definitions from the tests.
2023-11-24 17:14:12 +02:00
Markus Lehtonen
f489ca98b5 Reproducible output from expression matching
Fix flakyness of unit tests by adding back the sorting of matched
feature elements that was unadvisedly removed in
63c22551df. This might help debugging some
corner cases in real-life scenarios (when using templating), too.
2023-11-24 16:27:38 +02:00
Kubernetes Prow Robot
ed8898de6a
Merge pull request #1461 from marquiz/devel/no-implicit-ns
Option to stop implicitly adding default prefix to names
2023-11-24 14:53:09 +01:00
Kubernetes Prow Robot
7154458524
Merge pull request #1468 from marquiz/devel/nfr-template-fix
apis/nfd: fix multiple matcher terms targeting the same feature
2023-11-24 13:20:49 +01:00
Markus Lehtonen
1d012a28cd Option to stop implicitly adding default prefix to names
Add new autoDefaultNs (default is "true") config option to nfd-master.
Setting the config option to false stops NFD from automatically adding
the "feature.node.kubernetes.io/" prefix to labels, annotations and
extended resources. Taints are not affected as for them no prefix is
automatically added. The user-visible part of enabling the option change
is that NodeFeatureRules, local feature files, hooks and configuration
of the "custom" may need to be altereda (if the auto-prefixing is
relied on).

For now, the config option defaults to "true", meaning no change in
default behavior. However, the intent is to change the default to
"false" in a future release, deprecating the option and eventually
removing it (forcing it to "false").

The goal of stopping doing "auto-prefixing" is to simplify the operation
(of nfd and users). Make the naming more straightforward and easier to
understand and debug (kind of WYSIWYG), eliminating peculiar corner
cases:

1. Make validation simpler and unambiguous
2. Remove "overloading" of names, i.e. the mapping two values to the
   same actual name. E.g. previously something like

      labels:
        feature.node.kubernetes.io/foo: bar
        foo: baz

   Could actually result in node label:

     feature.node.kubernetes.io/foo: baz

3. Make the processing/usagee of the "rule.matched" and "local.labels"
   feature in NodeFeatureRules unambiguous and more understadable. E.g.
   previously you could have node label
   "feature.node.kubernetes.io/local-foo: bar" but in the NodeFeatureRule
   you'd need to use the unprefixed name "local-foo" or the fully
   prefixed name, depending on what was specified in the feature file (or
   hook) on the node(s).

NOTE: setting autoDefaultNs to false is a breaking change for users who
rely on automatic prefixing with the default feature.node.kubernetes.io/
namespace. NodeFeatureRules, feature files, hooks and custom rules
(configuration of the "custom" source of nfd-worker) will need to be
altered.  Unprefixed labels, annoations and extended resources will be
denied by nfd-master.
2023-11-24 12:48:20 +02:00
Markus Lehtonen
dc5af8be04 nfd-master: predictable handling of unprefixed names
Make the handling of unprefixed names (of labels, annotations and
extended resources) well-defined and predictable. Previously the
resulting output was not predictable in case the same name was coming in
both the unprefixed and prefixed form, say unprefixed "foo=bar" coming from
one source (be it nfd-worker or NodeFeature(Rule)) and
"feature.node.kubernetes.io/foo=baz" from a NodeFeature(Rule).
Previously the output value was randomly either "bar" or "baz".

This patch adds prefixes to all names early in the processing
"pipeline", preventing random name clashes later on.
2023-11-23 22:16:04 +02:00
Markus Lehtonen
678d7e89cb nfd-master: drop stale variables
Remove some stale variables that were leftover from the recent removal
of nfd version annotations.
2023-11-23 19:01:22 +02:00
Markus Lehtonen
63c22551df apis/nfd: fix multiple matcher terms targeting the same feature
Fix NodeFeatureRule templating in cases where multiple matchFeatures
terms are targeting the same feature. Previously, only matched feature
elements from the last matcher terms were used as the input to the
template. However, the input should contain all matched elements from
all matcher terms.

For example, consider the example rule snippet below:

  ...
  labelsTemplate: |
    {{ range .pci.device }}vendor.io/pci-device.{{ .class }}-{{ .device }}=exists
    {{ end }}
  matchFeatures:
    - feature: pci.device
      matchExpressions:
        class: {op: InRegexp, value: ["^03"]}
        vendor: {op: In, value: ["1234"]}
    - feature: pci.device
      matchExpressions:
        class: {op: InRegexp, value: ["^12"]}

This rule matches if both a pci device of class 03 from vendor 1234
exists and a pci device of class 12 (from any vendor) exists.
Previously, the template would only generate labels from the devices in
class 12 (as that's the last term). With this patch the template creates
device labels from devices in both classes 03 and 12.
2023-11-22 10:43:52 +02:00
Kubernetes Prow Robot
371ed3ff21
Merge pull request #1458 from marquiz/devel/logging-fix
apis/nfd: fix logging of rule expression processing
2023-11-21 12:04:59 +01:00
Markus Lehtonen
9cbe742bfb apis/nfd: fix incorrect comments of matching functions
This patch updates the comments to correspond to the actual behavior
which was changed back in 36341bf4c7.
2023-11-20 10:11:35 +02:00
Markus Lehtonen
8ec55fe8db apis/nfd: fix logging of rule expression processing 2023-11-10 09:40:54 +02:00
Carlos Eduardo Arango Gutierrez
c0063be4f4
Discover node features as annotations
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Co-authored-by: bebc <mchf1990212@gmail.com>
Co-authored-by: Markus Lehtonen <markus.lehtonen@intel.com>
2023-10-25 19:58:58 +02:00
Markus Lehtonen
a9849f20ff nfd-master: fix retry of node updates
This patch addresses issues with slow node status (extended resources)
updates. Previously we did just a few retries in quick succession which
could result in the node update failing, just because node status was
updated slower than our retry window. The patch mitigates the issue by
increasing the number of tries to 15. In addition, it creates a
ratelimiter with a longer per-item (per-node) base delay.

The patch also fixes the e2e-tests to expose the issue.
2023-10-20 17:24:01 +03:00
Markus Lehtonen
98c3b0750d nfd-gc: add metrics
Implements three metrics for nfd-gc:

- nfd_gc_build_info: version information of nfd-gc.
- nfd_gc_objects_deleted_total: total number of NodeFeature and
  NodeResourceTopology objects deleted by nfd-gc.
- nfd_gc_object_delete_failures_total: number of errors encountered when
  deleting NodeFeature and NodeResourceTopology objects.
2023-10-09 13:39:28 +00:00
Markus Lehtonen
f5c6ce2843 nfd-gc: simplify initialization 2023-10-09 11:48:49 +03:00
Markus Lehtonen
5171ae0f90 Refactor metrics
Move common boilerplate code under pkg/utils.
2023-10-09 10:49:12 +03:00
Markus Lehtonen
1d8a83b045 nfd-master: stop creating NFD version annotations
We now have metrics for getting detailed information about the NFD
instances running. There should be no need to pollute the node object
with NFD version annotations.

One problem with the annotations also that they were incomplete in the
sense that they only covered nfd-master and nfd-worker but not
nfd-topology-updater or nfd-gc.

Also, there was a problem with stale annotations, giving misleading
information. E.g. there was no way to remove old/stale master.version
annotations if nfd-master was scheduled on another node where it was
previously running.
2023-10-05 14:53:29 +03:00
Markus Lehtonen
9ea0a1b420 nfd-master: correctly clean up annotations
Delete correct annotations if -instance is specified.
2023-10-05 11:10:06 +03:00
Markus Lehtonen
dbf00dcda6 apis/nfd: drop one stale comment line
Drop a leftover "docstring" comment that wasn't removed with the type it
refers to.
2023-09-27 14:23:12 +03:00
Markus Lehtonen
b09ce75c8e nfd-master: fix filtering of extended resources
Fix a bug in checking the allowed ".feature.node.kubernetes.io" ns
suffix for extended resources. Also update e2e-tests to cover this case.
2023-09-27 10:55:11 +03:00
AhmedGrati
7ab6314bdc chore: introduce a commong klog handling for cmd/nfd-*
Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>
2023-09-07 22:38:15 +01:00
AhmedGrati
b0be40aa09 feat: add logging parameters in configuration file for nfd master
Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>
2023-09-06 15:27:27 +01:00
Kubernetes Prow Robot
19520c079c
Merge pull request #1325 from ffromani/nfd-updater-fix-events
nfd-updater: events: enable timer-only flow
2023-09-04 05:47:49 -07:00
Francesco Romani
000c919071 nfd-updater: events: enable timer-only flow
The nfd-topology-updater has state-directories notification mechanism
enabled by default.
In theory, we can have only timer-based updates, but if the option
is given to disable the state-directories event source, then all
the update mechanism is mistakenly disabled, including the
timer-based updates.

The two updaters mechanism should be decoupled.
So this PR changes this to make sure we can enable just and only
the timer-based updates.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2023-09-04 13:05:50 +02:00
Kubernetes Prow Robot
f852c32a55
Merge pull request #1252 from AhmedGrati/test-add-updater-pool-unit-tests
test: add node updater pool unit tests
2023-09-01 07:34:32 -07:00
Kubernetes Prow Robot
e1f90a233b
Merge pull request #1305 from marquiz/devel/nf-gc
Garbage collection of NodeFeature objects
2023-08-28 02:59:42 -07:00
Kubernetes Prow Robot
6d95e59cd0
Merge pull request #1290 from marquiz/devel/metrics-new
metrics: additional metrics for nfd-master
2023-08-28 02:07:42 -07:00
Markus Lehtonen
e3415ec484 nfd-gc: support garbage collection of NodeFeatures
Hook into the same logic already exercised for NodeResourceTopology
objects: GC watches for node delete events and immediately drops stale
objects (NRT and now also NF). In addition there is a periodic resync to
catch any missed node deletes, once every hour by default.
2023-08-22 21:24:26 +03:00
Markus Lehtonen
01c08d67b6 Rename nfd-topology-gc to nfd-gc
This is preparation for making it a generic garbage collector for all
nfd-managed api objects.
2023-08-21 21:46:11 +03:00
Kubernetes Prow Robot
e0c477090b
Merge pull request #1311 from marquiz/devel/refactor-gc-5
topology-gc: simplify listing of node objects
2023-08-21 11:40:05 -07:00
Markus Lehtonen
f05b0e26ea topology-gc: move initial GC out of startNodeInformer()
Small refactor. Contextually this feels more like under periodicGC().
2023-08-21 10:11:46 +03:00
Kubernetes Prow Robot
a60502a313
Merge pull request #1307 from marquiz/devel/refactor-gc
topology-gc: refactor unit tests
2023-08-21 00:09:23 -07:00
Kubernetes Prow Robot
536f9d17d0
Merge pull request #1295 from marquiz/devel/topology-updater-metrics
nfd-topology-updater: add metrics support
2023-08-20 23:25:24 -07:00
Markus Lehtonen
2e8da8849a topology-gc: simplify listing of node objects
Hopefully makes the code slightly more readable.
2023-08-21 09:13:41 +03:00
Markus Lehtonen
0b5e51bd35 topology-gc: refactor unit tests
Remove a lot of boilerplate code by defining reusable functions.

Also, test the Run() method instead of the functions callees of Run() as
it is the top level functionality that was tested in practice (we don't
have separate unit tests for the callee functions).
2023-08-21 09:10:24 +03:00
Kubernetes Prow Robot
4674bce27d
Merge pull request #1310 from marquiz/devel/refactor-gc-4
topology-gc: rename runGC to garbageCollect()
2023-08-18 11:26:34 -07:00
Kubernetes Prow Robot
f4cf4877f2
Merge pull request #1309 from marquiz/devel/refactor-gc-3
topology-gc: rename run()
2023-08-18 11:26:28 -07:00
Markus Lehtonen
ec51b29b3c topology-gc: rename runGC to garbageCollect()
One less function named run.
2023-08-18 17:57:05 +03:00
Markus Lehtonen
98b0b36b87 topology-gc: rename run()
Too many run methods here.
2023-08-18 17:52:11 +03:00
Markus Lehtonen
108d603bdc topology-gc: fix Stop
The stop channel has multiple readers to we need to close it so that all
of the readers get notified.
2023-08-18 17:46:54 +03:00
Kubernetes Prow Robot
9d61b19454
Merge pull request #1287 from freelizhun/fix-empty-hugepages
fix empty hugepages in some numa nodes caused no such file or directory errors
2023-08-08 02:50:16 -07:00
lizhun
a4ad3d4411 fix empty hugepages in some numa nodes caused no such file or directory error
Signed-off-by: lizhun <lizhun@kylinos.cn>
2023-08-08 15:14:44 +08:00
Markus Lehtonen
5ad2294c14 metrics: add nfd_node_update_requests_total counter
Add a counter for total number of node update/sync requests. In
practice, this counts the number of gRPC requests received if the gRPC
API is in use. If the NodeFeature API is enabled, this counts the
requests initiated by the NFD API controller, i.e. updates triggered by
changes in NodeFeature or NodeFeatureRule objects plus updates initiated
by the controller resync period.
2023-08-07 09:37:29 +03:00
Markus Lehtonen
4b24cc1afa metrics: counters for rejected labels, extended resources and taints
Add counters for labels, extended resources and taints rejected/filtered
out by nfd-master.
2023-08-07 09:37:29 +03:00
Markus Lehtonen
a8a29e6df2 metrics: add nfd_nodefeaturerule_processing_errors_total counter
Add a counter for errors encountered when processing NodeFeatureRules.
Another simple counter without any additional prometheus labels -
nfd-master logs can provide further details.
2023-08-07 09:37:29 +03:00
Markus Lehtonen
b90f2c318e metrics: add nfd_node_update_failures_total counter
Add a new counter for tracking node update failures from nfd-master.
This tracks both normal feature updates and the --prune sub-command.
This is a simple counter without any additional labels - nfd-master logs
can be used for further diagnostics.
2023-08-07 09:37:27 +03:00