mirror of https://github.com/kubernetes-sigs/node-feature-discovery.git synced 2024-12-14 11:57:51 +00:00
Commit graph

492 commits

Author SHA1 Message Date
googs1025
e631a52374 chore: add metrics system prefix 2024-11-28 09:57:40 +08:00
Markus Lehtonen
45f49d574a nfd-master: drop resourceLabels
Drop the resourceLabels config file option and the corresponding
-resource-labels command line flag. They were deprecated in NFD v0.13 so
it's time to let them go. NodeFeatureRule(s) should be used to manage
ERs, instead.
2024-11-07 15:16:52 +02:00
Carlos Eduardo Arango Gutierrez
62f4eddce6
Drop support for hooks
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
2024-11-04 14:50:07 +01:00
Kubernetes Prow Robot
b997ade5b3
Merge pull request #1942 from marquiz/devel/drop-grpc
nfd-master: drop stale unreachable deprecation notices
2024-11-04 11:16:31 +01:00
Kubernetes Prow Robot
1c6ce897f2
Merge pull request #1816 from marquiz/devel/gc-test-assert-msg
tests: better assertion message in nfd-gc unit tests
2024-10-31 19:33:27 +00:00
Markus Lehtonen
ca85075972 nfd-master: use Typed* workqueue types
Drop the usage of deprecated functions and types, makes linters happy.
2024-10-30 12:25:16 +02:00
Carlos Eduardo Arango Gutierrez
0bd82cf82a
Drop NFD gRPC API
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
2024-10-29 15:15:18 +01:00
Kubernetes Prow Robot
fd2893e2a5
Merge pull request #1592 from AhmedThresh/feat-configure-cr-restrictions
feat/nfd-master: configure CR restrictions
2024-10-24 12:20:54 +01:00
Markus Lehtonen
db07fe1ff4 nfd-gc: drop one duplicate import from tests 2024-09-27 15:26:18 +03:00
AhmedGrati
28b40c90b8 deploy: add CR restrictions to the helm config
Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>
Signed-off-by: AhmedThresh <ahmed.grati@insat.ucar.tn>
2024-09-16 16:02:42 +02:00
Markus Lehtonen
9fad67ee39 nfd-master: cleanup updater-pool method args
We store the work queues in the updater pool struct so we don't need to
pass those as function arguments.
2024-09-16 14:50:08 +03:00
Markus Lehtonen
02b6b7395c Drop dynamic run-time reconfiguration
Simplify the code and reduce possible error scenarios by dropping
fsnotify-based reconfiguration from nfd-master and nfd-worker. Also
eliminates repeated re-configuration in scenarios where kubelet
continuously touches (every minute) the mounted file (configmap) on the
filesystem.

Also modifies the Helm and kustomize deployments so that nfd-master,
nfd-worker and nfd-topology-updater pods are restarted on configmap
updates. In kustomize, the slight downside of this is that the name of the
config map(s) depends on the content, so every time a user customizes
the config data, the old unused configmap will be left and must be
garbage-collected manually.
2024-08-21 12:46:36 +03:00
Markus Lehtonen
2bb8a72532 nfd-master: proper shutdown of nfd api informers
Stop blocking on event channels when the api controller is stopped.
Ensures that the nfd API informer factory is properly shut down and all
resources released when stop() is called. This eliminates a memory leak
on re-configure events when leader election is enabled.
2024-08-20 12:44:08 +03:00
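The shutdown fix above is about no longer blocking on event channels once the controller is stopped. A minimal sketch of that general pattern follows, assuming an unbuffered event channel and a stop channel; the helper name sendOrStop and the event string are hypothetical and not taken from the NFD code.

```go
package main

import "fmt"

// sendOrStop tries to deliver ev on the (unbuffered) events channel, but
// gives up as soon as stop is closed. It reports whether ev was delivered.
// Hypothetical helper, not part of NFD.
func sendOrStop(events chan<- string, stop <-chan struct{}, ev string) bool {
	select {
	case events <- ev:
		return true
	case <-stop:
		return false
	}
}

func main() {
	events := make(chan string) // unbuffered, and nobody is receiving
	stop := make(chan struct{})
	close(stop) // the controller has been stopped

	// A plain "events <- ev" would block forever here; the select returns.
	fmt.Println("delivered:", sendOrStop(events, stop, "node-feature-updated"))
}
```

With every sender written this way, a stopped controller leaves no goroutine parked on a channel send, so the resources those goroutines reference can be released.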
Kubernetes Prow Robot
5a5b9e3c19
Merge pull request #1843 from marquiz/devel/master-chan
nfd-master: use only unbuffered chans in the nfd api-controller
2024-08-19 07:23:12 -07:00
Markus Lehtonen
bf6ffadf36 nfd-master: use only unbuffered chans in the nfd api-controller
There's no reason why the "update all" chans should be buffered (while
the others are not).
2024-08-19 14:02:13 +03:00
Markus Lehtonen
0d3c1ac75b nfd-master: explicit state variable for the node updater pool 2024-08-19 13:27:56 +03:00
AhmedGrati
7bad0d583c feat/nfd-master: support CR restrictions
Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>
2024-08-10 22:39:10 +02:00
Markus Lehtonen
d6c1a7e44f tests: better assertion message in nfd-gc unit tests 2024-08-02 08:23:54 +03:00
Markus Lehtonen
45164f580a nfd-gc: use paging when listing CRs
List NodeFeature and NodeResourceTopology objects in pages of 200 items.
This reduces memory consumption and eliminates timeouts (on the
apiserver side) in big clusters of thousands of nodes.
2024-08-02 08:20:17 +03:00
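The paging described above can be illustrated with a small standard-library sketch. It does not use client-go; the page and lister types, the fakeLister helper and the numeric continuation token are all assumptions made for the illustration. Only the page size of 200 comes from the commit message.

```go
package main

import (
	"fmt"
	"strconv"
)

// page is one chunk of results plus an opaque token for the next chunk.
type page struct {
	items []string
	next  string // empty when there are no more pages
}

// lister is a hypothetical stand-in for a paged list call.
type lister func(limit int, token string) page

// listAll fetches every page of at most limit items, passes each item to
// handle, and returns the number of items seen. Only one page is held in
// memory at a time.
func listAll(list lister, limit int, handle func(string)) int {
	n := 0
	token := ""
	for {
		p := list(limit, token)
		for _, it := range p.items {
			handle(it)
			n++
		}
		if p.next == "" {
			return n
		}
		token = p.next
	}
}

// fakeLister serves pages from an in-memory slice and counts the requests.
func fakeLister(all []string, calls *int) lister {
	return func(limit int, token string) page {
		*calls++
		start := 0
		if token != "" {
			start, _ = strconv.Atoi(token)
		}
		end := start + limit
		if end >= len(all) {
			return page{items: all[start:]}
		}
		return page{items: all[start:end], next: strconv.Itoa(end)}
	}
}

func main() {
	all := make([]string, 450)
	for i := range all {
		all[i] = "node-" + strconv.Itoa(i)
	}
	calls := 0
	n := listAll(fakeLister(all, &calls), 200, func(string) {})
	fmt.Printf("listed %d objects in %d requests\n", n, calls)
}
```

Processing items page by page, rather than collecting one large response, is what keeps the memory footprint bounded by the page size.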
Kubernetes Prow Robot
57f1b79856
Merge pull request #1813 from marquiz/devel/gc-metalister
nfd-gc: only fetch object metadata
2024-08-01 12:53:33 -07:00
Markus Lehtonen
54befffa94 nfd-gc: only fetch object metadata
Significantly reduce the apiserver and network load by only
listing/getting the object metadata.
2024-07-30 16:01:04 +03:00
Kubernetes Prow Robot
2d24a4bee4
Merge pull request #1811 from marquiz/devel/informer-listopts
nfd-master: tweak list options for NodeFeature informer
2024-07-30 03:56:04 -07:00
Markus Lehtonen
454d443b72 nfd-gc: check that node informer cache sync succeeded 2024-07-26 10:29:15 +03:00
Markus Lehtonen
a2068f7ce3 nfd-master: tweak list options for NodeFeature informer
Fix cache syncing problems on big clusters with thousands of NodeFeature
objects.

On the initial list (sync) the client-go cache reflector sets the
ResourceVersion to "0" (instead of leaving it empty). This causes
problems in the api server with (apiserver) logs like:

E writers.go:122] apiserver was unable to write a JSON response: http:
                  Handler timeout
E status.go:71] apiserver received an error that is not an
                metav1.Status: &errors.errorString{s:"http: Handler timeout"}:
                http: Handler timeout

On the nfd-master side we see corresponding log snippets like:

W reflector.go:547] failed to list *v1alpha1.NodeFeature: stream error
                    when reading response body, may be caused by closed
                    connection. Please retry. Original error: stream
                    error: stream ID 1521; INTERNAL_ERROR; received from
                    peer
I trace.go:236] "Reflector ListAndWatch" name:*** (***) (total time:
                61126ms): ---"Objects listed" error:stream error when
                reading response body, may be caused by closed
                connection. Please retry. Original error: stream
                error: stream ID 1521; INTERNAL_ERROR; received from
                peer 61126ms (***)

Decreasing the page size (opts.Limit) does not have any effect on the
timeouts. However, setting ResourceVersion to an empty value seems to
get the paging back on track, eliminating the timeouts.

TODO: investigate in Kubernetes upstream the root cause of the timeouts
with ResourceVersion="0".
2024-07-25 16:29:05 +03:00
Markus Lehtonen
ea3243fb00 nfd-master: check nfd api informer cache sync result
Bail out if there were errors in syncing the cache of any resource.
2024-07-25 09:58:40 +03:00
Markus Lehtonen
25e827a4c8 feature-gates: mark NodeFeatureAPI as GA
The feature gate is locked to true. That is, it is not possible to revert
back to the gRPC-based communication which makes the gRPC API ready for
removal.
2024-07-16 13:53:31 +03:00
Markus Lehtonen
522b87e325 nfd-worker: change TestRun to use NodeFeature API
Run nfd-worker with NodeFeature API enabled (against a fake apiserver)
instead of using the deprecated gRPC (against a nfd-master instance).

Expand the test to verify the features and labels that are advertised as
a NodeFeature object.
2024-07-12 09:50:09 +03:00
Markus Lehtonen
a269bf4d25 Drop the -enable-nodefeature-api flag
Was marked to be removed in v0.17.
2024-07-10 15:20:07 +03:00
Kubernetes Prow Robot
393af96a88
Merge pull request #1755 from ArangoGutierrez/1752
Use worker DS OwnerReference for NF's
2024-07-09 06:33:07 -07:00
Carlos Eduardo Arango Gutierrez
e33e68ad5b
Add optionable arguments to NewWorker
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
2024-07-09 15:08:26 +02:00
Kubernetes Prow Robot
3bb7a1caff
Merge pull request #1766 from marquiz/devel/simplify
Simplify code
2024-07-09 00:19:28 -07:00
Markus Lehtonen
b5b701fbbf Simplify code
Drop unnecessary typedefs.
2024-07-09 09:05:33 +03:00
Markus Lehtonen
ecb37c01b0 nfd-master: fix typos 2024-07-09 08:55:33 +03:00
Carlos Eduardo Arango Gutierrez
5d3ee1c51f
Use worker DS OwnerReference for NF's
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
2024-07-04 13:53:24 +02:00
IbirbyZh
2f9801b554
Fix the problem with starting the master with empty cache
We faced a problem where the master deleted some labels on startup. Sometimes it does not get NodeFeatures even though they are present in the cluster, because the informer cache is empty.
2024-06-10 18:06:14 +02:00
Kubernetes Prow Robot
814255b7f1
Merge pull request #1671 from marquiz/devel/nfd-api-multi-type-feature
apis/nfd: allow different types of features of the same name
2024-05-24 03:42:30 -07:00
Markus Lehtonen
3b448ae623 apis/nfd: allow different types of features of the same name
This patch changes the handling of NodeFeatureRules so that one feature
name (say "cpu.cpuid") can hold different types of features (flags,
attributes and/or instances). Requiring features to choose one single
type has not been a limitation of the API itself (and there has been no
validation on this) but an implementation decision.

The new evaluation logic of match expressions is such that "flags" and
"attributes" are basically evaluated as a union - they are both maps
but "flags" just don't have any value associated with the key. However,
"instances" are handled separately as that is basically an array of
maps and needs to be evaluated in a different way (loop over the array
of instances and evaluate expressions against the attributes of each).
Because of this difference care must be taken if mixing "instance"
features with "flag" and/or "attribute" features.

Note that the API types or their validation is not changed - just the
implementation of how the NodeFeatureRules are evaluated.
2024-05-24 13:18:31 +03:00
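The evaluation described above (flags and attributes treated as a union, instances looped over separately) might be sketched as follows. The feature type and the helper functions are simplified, hypothetical stand-ins and not the actual NFD API types; letting an attribute value win over a flag of the same name is a choice of this sketch only.

```go
package main

import "fmt"

// feature is a hypothetical stand-in for one named feature (say
// "cpu.cpuid") that carries all three kinds of data at once.
type feature struct {
	flags      map[string]struct{} // names only, no value
	attributes map[string]string   // name -> value
	instances  []map[string]string // list of attribute maps
}

// keyValueView evaluates flags and attributes as a union: both become
// entries of one map, with flags mapped to an empty value.
func keyValueView(f feature) map[string]string {
	out := make(map[string]string, len(f.flags)+len(f.attributes))
	for k := range f.flags {
		out[k] = ""
	}
	for k, v := range f.attributes {
		out[k] = v
	}
	return out
}

// hasAll reports whether every name exists in the flag/attribute union.
func hasAll(f feature, names ...string) bool {
	view := keyValueView(f)
	for _, n := range names {
		if _, ok := view[n]; !ok {
			return false
		}
	}
	return true
}

// matchingInstances loops over the instances and returns those whose
// attributes contain every wanted key/value pair.
func matchingInstances(f feature, want map[string]string) []map[string]string {
	var out []map[string]string
	for _, inst := range f.instances {
		ok := true
		for k, v := range want {
			if got, found := inst[k]; !found || got != v {
				ok = false
				break
			}
		}
		if ok {
			out = append(out, inst)
		}
	}
	return out
}

func main() {
	f := feature{
		flags:      map[string]struct{}{"AVX2": {}},
		attributes: map[string]string{"level": "3"},
		instances: []map[string]string{
			{"vendor": "8086", "class": "0300"},
			{"vendor": "10de", "class": "0302"},
		},
	}
	fmt.Println(hasAll(f, "AVX2", "level"))
	fmt.Println(len(matchingInstances(f, map[string]string{"vendor": "10de"})))
}
```

The two code paths make the caveat in the commit message concrete: an expression written for the key/value view says nothing about instances, so rules that mix the kinds need care.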
Kubernetes Prow Robot
5fe3433e35
Merge pull request #1713 from marquiz/devel/worker-log-fix
nfd-worker: improved log when creating NodeFeature object
2024-05-24 02:15:30 -07:00
Carlos Eduardo Arango Gutierrez
47c054e1db
Add NodeFeatureGroup CRD
The NodeFeatureGroup is an NFD-specific custom resource that is designed for
grouping nodes based on their features. NFD-Master watches for NodeFeatureGroup
objects in the cluster and updates the status of the NodeFeatureGroup object
with the list of nodes that match the feature group rules. The NodeFeatureGroup
rules follow the same syntax as the NodeFeatureRule rules.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
2024-05-23 16:34:08 +02:00
Markus Lehtonen
649036977e nfd-worker: improved log when creating NodeFeature object
Don't log an empty NodeFeature object.
2024-05-23 14:37:26 +03:00
Markus Lehtonen
560bd11d85 Re-add -enable-nodefeature-api cmdline flag
Bring back the -enable-nodefeature-api command line flag and the
corresponding enableNodeFeatureApi helm config value that were
removed without deprecation when the NodeFeatureAPI feature gate was
introduced. The thinking behind this change is to not break existing
users (without warning) unless totally unavoidable. Now the
-enable-nodefeature-api flag is marked as deprecated and slated for
removal in NFD v0.17.

The NodeFeatureAPI feature gate and the -enable-nodefeature-api flag
work together so that the NodeFeature API is disabled (gRPC is enabled,
instead) if either of them is set to false.

This patch selectively reverts parts of
06c4733bc5.
2024-05-16 10:53:49 +03:00
Markus Lehtonen
121345472d nfd-master: add DisableAutoPrefix feature gate
Now that we have support for feature gates deprecate the autoDefaultNs
config option of nfd-master and replace it with a new alpha feature gate
DisableAutoPrefix (defaults to false). Using a feature gate to handle
and communicate these kinds of changes, where the default behavior is
intended to be changed in a future release, feels much more natural than
using random flags/options.

The combined logic of the feature gate and the config option is a
logical OR over disabling auto-prefixing. That is, auto-prefixing is
disabled if either the feature gate or the config option is set to
disable it:

                       | DisableAutoPrefix (feature gate)
                       | false | true
  -------------------- | --------------------------------
  autoDefaultNs   true |  ON   | OFF
  (config opt)   false |  OFF  | OFF
2024-05-15 17:01:16 +03:00
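The truth table above reduces to a single boolean expression. A minimal sketch, assuming a hypothetical helper named autoPrefixEnabled (the two parameter names mirror the config option and the feature gate):

```go
package main

import "fmt"

// autoPrefixEnabled combines the autoDefaultNs config option with the
// DisableAutoPrefix feature gate: auto-prefixing stays on only when the
// option is true and the gate is false. Hypothetical helper, not NFD code.
func autoPrefixEnabled(autoDefaultNs, disableAutoPrefix bool) bool {
	return autoDefaultNs && !disableAutoPrefix
}

func main() {
	for _, opt := range []bool{true, false} {
		for _, gate := range []bool{false, true} {
			fmt.Printf("autoDefaultNs=%-5t DisableAutoPrefix=%-5t -> enabled=%t\n",
				opt, gate, autoPrefixEnabled(opt, gate))
		}
	}
}
```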
Markus Lehtonen
eb7a0ada5c nfd-master: import features package as nfdfeatures
Refactoring to prevent naming clash in future changes.
2024-05-15 11:27:46 +03:00
Markus Lehtonen
bad7d1fcb1 apis/nfd: increase unit test coverage
Cover error cases of the "match name" functions.
2024-04-30 16:02:13 +03:00
Kubernetes Prow Robot
ca13b4903d
Merge pull request #1669 from marquiz/devel/nfd-api-helpers-refactor
api/nfd: use varargs in the NewInstanceFeatures helper
2024-04-23 06:33:46 -07:00
Kubernetes Prow Robot
be44447dc0
Merge pull request #1670 from marquiz/devel/flag-feature-errors
apis/nfd: no error on ops that never match
2024-04-23 06:09:06 -07:00
Kubernetes Prow Robot
828acaa8cc
Merge pull request #1667 from marquiz/devel/api-tests
apis/nfd: add unit tests for match name functions
2024-04-23 06:08:48 -07:00
Markus Lehtonen
fbb7303562 apis/nfd: no error on ops that never match
Return false (i.e. "did not match") but no error when evaluating a match
expression against a "flag" type feature (which don't have any
associated value, just the name) if a MatchOp that never matches is
used.

This is preparation for supporting multi-type features, i.e. one
feature, like "cpu.cpuid", having e.g. "flag" and "attribute" type
features.
2024-04-23 11:07:49 +03:00
Markus Lehtonen
719c5186f6 api/nfd: use varargs in the NewInstanceFeatures helper
Make usage of this helper function more flexible.
2024-04-23 10:29:24 +03:00
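What a variadic helper buys the caller can be shown with a short sketch. The instance and instanceSet types and the newInstanceSet function below are hypothetical local stand-ins, not the real NFD types or the actual NewInstanceFeatures signature.

```go
package main

import "fmt"

// instance and instanceSet are simplified, hypothetical stand-ins.
type instance struct{ attributes map[string]string }

type instanceSet struct{ elements []instance }

// newInstanceSet accepts zero or more instances as variadic arguments.
func newInstanceSet(instances ...instance) instanceSet {
	return instanceSet{elements: instances}
}

func main() {
	// No arguments, several arguments, or an existing slice all work.
	empty := newInstanceSet()
	two := newInstanceSet(
		instance{attributes: map[string]string{"vendor": "8086"}},
		instance{attributes: map[string]string{"vendor": "10de"}},
	)
	existing := []instance{{attributes: map[string]string{"vendor": "1af4"}}}
	fromSlice := newInstanceSet(existing...)

	fmt.Println(len(empty.elements), len(two.elements), len(fromSlice.elements))
}
```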
TessaIO
de50ac8800 chore/nfd-master: remove warnings in nfd-master unit tests file
Signed-off-by: TessaIO <ahmedgrati1999@gmail.com>
2024-04-22 22:27:15 +02:00