1
0
Fork 0
mirror of https://github.com/kubernetes-sigs/node-feature-discovery.git synced 2024-12-14 11:57:51 +00:00
Commit graph

188 commits

Author SHA1 Message Date
Markus Lehtonen
9fad67ee39 nfd-master: cleanup updater-pool method args
We store the work queues in the updater pool struct so we don't need to
pass those as function arguments.
2024-09-16 14:50:08 +03:00
Markus Lehtonen
02b6b7395c Drop dynamic run-time reconfiguration
Simplify the code and reduce possible error scenarios by dropping
fsnotify-based reconfiguration from nfd-master and nfd-worker. Also
eliminates repeated re-configuration in scenarios where kubelet
continuosly touches the (every minute) mounted file (configmap) on the
filesystem.

Also modifies the Helm and kustomize deployments so that nfd-master,
nfd-worker and nfd-topology-updater pods are restarted on configmap
updates. In kustomize, the slght downside of this is the name of the
config map(s) depends on the content, so every time a user customizes
the config data, the old unused configmap will be left and must be
garbage-collected manually.
2024-08-21 12:46:36 +03:00
Markus Lehtonen
2bb8a72532 nfd-master: proper shutdown of nfd api informers
Stop blocking on event channels when the api controller is stopped.
Ensures that the nfd API informer factory is properly shut down and all
resources released when stop() is called. This eliminates a memory leak
on re-configure events when leader election is enabled.
2024-08-20 12:44:08 +03:00
Kubernetes Prow Robot
5a5b9e3c19
Merge pull request #1843 from marquiz/devel/master-chan
nfd-master: use only unbuffered chans in the nfd api-controller
2024-08-19 07:23:12 -07:00
Markus Lehtonen
bf6ffadf36 nfd-master: use only unbuffered chans in the nfd api-controller
There's no reason why the "update all" chans should be buffered (while
the other are not).
2024-08-19 14:02:13 +03:00
Markus Lehtonen
0d3c1ac75b nfd-master: explicit state variable for the node updater pool 2024-08-19 13:27:56 +03:00
Markus Lehtonen
a2068f7ce3 nfd-master: tweak list options for NodeFeature informer
Fix cache syncing problems on big clusters with thousands of NodeFeature
objects.

On the initial list (sync) the client-go cache reflector sets the
ResourceVersion to "0" (instead of leaving it empty). This causes
problems in the api server with (apiserver) logs like:

E writers.go:122] apiserver was unable to write a JSON response: http:
                  Handler timeout
E status.go:71] apiserver received an error that is not an
                metav1.Status: &errors.errorString{s:"http: Handler timeout"}:
                http: Handler timeout

On the nfd-master side we see corresponding log snippets like:

W reflector.go:547] failed to list *v1alpha1.NodeFeature: stream error
                    when reading response body, may be caused by closed
                    connection. Please retry. Original error: stream
                    error: stream ID 1521; INTERNAL_ERROR; received from
                    peer
I trace.go:236] "Reflector ListAndWatch" name:*** (***) (total time:
                61126ms): ---"Objects listed" error:stream error when
                reading response body, may be caused by closed
                connection. Please retry. Original error: stream
                error: stream ID 1521; INTERNAL_ERROR; received from
                peer 61126ms (***)

Decreasing the page size (opts.Limits) does not have any effect on the
timeouts. However, setting ResourceVersion to an empty value seems to
get the paging on its tracks, eliminating the timeouts.

TODO: investigate in Kubernetes upstream the root cause of the timeouts
with ResourceVersion="0".
2024-07-25 16:29:05 +03:00
Markus Lehtonen
ea3243fb00 nfd-master: check nfd api informer cache sync result
Bail out if there were errors in syncing the cache of any resource.
2024-07-25 09:58:40 +03:00
Markus Lehtonen
a269bf4d25 Drop the -enable-nodefeature-api flag
Was marked to be removed in v0.17.
2024-07-10 15:20:07 +03:00
Markus Lehtonen
ecb37c01b0 nfd-master: fix typos 2024-07-09 08:55:33 +03:00
IbirbyZh
2f9801b554
Fix the problem with starting the master with empty cache
We faced the problem when master deleted some of labels on start. Sometimes he doesn't gets NodeFeatures when they are present in cluster because of empty cache in informer
2024-06-10 18:06:14 +02:00
Carlos Eduardo Arango Gutierrez
47c054e1db
Add NodeFeatureGroup CRD
The NodeFeatureGroup is an NFD-specific custom resource that is designed for
grouping nodes based on their features. NFD-Master watches for NodeFeatureGroup
objects in the cluster and updates the status of the NodeFeatureGroup object
with the list of nodes that match the feature group rules. The NodeFeatureGroup
rules follow the same syntax as the NodeFeatureRule rules.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
2024-05-23 16:34:08 +02:00
Markus Lehtonen
560bd11d85 Re-add -enable-nodefeature-api cmdline flag
Bring back the -enable-nodefeature-api command line flag and the
corresponding enableNodeFeatureApi helm config value that were
removed without deprecation when the NodeFeatureAPI feature gate was
introduced. The thinking behind this change is to not break existing
users (without warning) unless totally unavoidable. Now the
-enable-nodefeature-api flag is marked as deprecated and slated for
removal in NFD v0.17.

The NodeFeatureAPI feature gate and the -enable-nodefeature-api flag
work together so that the NodeFeature API is disabled (gRPC is enabled,
instead) if either of them is set to false.

This patch selectively reverts parts of
06c4733bc5.
2024-05-16 10:53:49 +03:00
Markus Lehtonen
121345472d nfd-master: add DisableAutoPrefix feature gate
Now that we have support for feature gates deprecate the autoDefaultNs
config option of nfd-master and replace it with a new alpha feature gate
DisableAutoPrefix (defaults to false). Using a feature gate to handle
and communicate these kind of changes, where the default behavior is
intended to be changed in a future release, feels much more natural than
using random flags/options.

The combined logic of the feature gate and the config option is a
logical OR over disabling auto-prefixing. That is, auto-prefixing is
disabled if either the feature gate or the config options is used set to
disable it:

                       | DisableAutoPrefix (feature gate)
                       | false | true
  -------------------- | --------------------------------
  autoDefaultNs   true |  ON   | OFF
  (config opt)   false |  OFF  | OFF
2024-05-15 17:01:16 +03:00
Markus Lehtonen
eb7a0ada5c nfd-master: import features package as nfdfeatures
Refactoring to prevent naming clash in future changes.
2024-05-15 11:27:46 +03:00
TessaIO
de50ac8800 chore/nfd-master: remove warnings in nfd-master unit tests file
Signed-off-by: TessaIO <ahmedgrati1999@gmail.com>
2024-04-22 22:27:15 +02:00
Kubernetes Prow Robot
a7c58b121c
Merge pull request #1660 from marquiz/devel/master-reconfigure-change
nfd-master: stop node-updater pool before reconfiguring api-controller
2024-04-15 11:10:32 -07:00
Kubernetes Prow Robot
91d3d5a7b0
Merge pull request #1653 from marquiz/devel/master-multiple-k8sclients
nfd-master: use separate k8s api clients for each updater
2024-04-15 09:18:51 -07:00
Markus Lehtonen
8ad6210d5c nfd-master: use separate k8s api clients for each updater
Sharing the same client between updater threads virtually serializes
access, in practice making the effective parallelism close to 1.

With this patch, in my bench cluster of 300 nodes, the time taken by
updating all nodes drops from ~2 minutes to ~12 seconds (with the
default parallelism of 10 node updater threads). This demonstrates the
10-fold increased parallelism from ~1 to 10.

There might be other solutions that could be explored, e.g. caching
nodes with an indexer/lister but otoh nfd doesn't necessarily need/want
to watch every little change in each node. We only need to get the node
when something in our own CRDs change (we don't react to any changes in
the node object itself). Using multiple clients was the most obvious
choice to solve the problem for now.
2024-04-15 19:00:30 +03:00
Kubernetes Prow Robot
6b80f654d4
Merge pull request #1600 from ArangoGutierrez/e2e-not-k8s
Move NFD api to a separate go mod
2024-04-09 02:06:06 -07:00
Markus Lehtonen
ba4cebb29e nfd-master: stop node-updater pool before reconfiguring api-controller
Prevents potential race between node-updater pool and the api-controller
when re-configuring nfd-master. Reconfiguration causes a new
api-controller instance to be created so nfd api lister might change in
the midst of processing a node update (if the pool was running). No
actual issues related to this have been identified but races (like this)
should still be avoided.
2024-04-09 10:45:07 +03:00
Markus Lehtonen
8709cccf71 nfd-master: parse kubeconfig even with NoPublish set
Don't try to be too smart when kubeconfig is needed. In practice, the
nfd-master really doesn't work anymore (with the NodeFeature API
enabled) without a kubeconfig set. This patch fixes crashes happening
when NoPublish is enabled, e.g. in listing all nodes in the nfd api
handler and in getting single node objects in the node updater pool.

This patch changes the kubeconfig parsing to happen at the creation of
the nfd-master instance. We don't need to do that at reconfigure time as
none of the dynamic config options affect it. Unit tests are adjusted,
accordingly.
2024-04-08 14:25:27 +03:00
Markus Lehtonen
fcb8d3cda4 nfd-master: implement opts for modifying NfdMaster instance
This provides a more controlled way for setting up the NfdMaster
instance for testing.
2024-04-05 20:21:19 +03:00
Kubernetes Prow Robot
199d665046
Merge pull request #1656 from marquiz/devel/channel-simplify
Tidy up usage of channels for signaling
2024-04-05 07:51:34 -07:00
Carlos Eduardo Arango Gutierrez
3434557d7c
Move NFD api to a separate go mod
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
2024-04-05 16:35:47 +02:00
Kubernetes Prow Robot
cb24f7c234
Merge pull request #1657 from marquiz/devel/master-label-whitelist
nfd-master: prevent crash on empty config struct
2024-04-05 05:36:52 -07:00
Markus Lehtonen
26a80cf142 Tidy up usage of channels for signaling
This started as a small effort to simplify the usage of "ready" channel
in nfd-master. It extended into a wider simplification/unification of
the channel usage.
2024-04-05 14:39:58 +03:00
Markus Lehtonen
b27676451a nfd-master: prevent crash on empty config struct
Change the handling of LabelWhiteList config option to use a pointer to
detect when the option is unset. This doesn't fix any detected crash but
is merely general improvement and stabilization, serving easier testing.

Also, use the regexp type from the core libs for the config struct -
dropping the unmasrhalling code for our custom regexp type - as the core
regexp now implements unmarshaller itself.
2024-04-05 14:19:44 +03:00
Kubernetes Prow Robot
ad96c301a4
Merge pull request #1642 from marquiz/devel/master-updater-pool-lock
nfd-master: protect node updater pool queueing with a lock
2024-04-05 03:31:10 -07:00
Markus Lehtonen
44a5a5b4a8 nfd-master: get node object only once when updating node
Prevent excess queries of node objects from the Kubernetes apiserver.
This significantly speeds up node updates (and reduces the load on the
apiserver) as the client-side throttling (which is good) does not bite
us that hard.
2024-04-04 14:44:52 +03:00
Markus Lehtonen
bce446c5b6 nfd-master: protect node updater pool queueing with a lock
Prevents races when (re-)starting the queue. There are no reports on
issues related to this (and I haven't come up with any actual failure
path in the current code) but better to be safe and follow the best
practices.
2024-03-27 16:53:34 +02:00
Markus Lehtonen
c4e010eafd nfd-master: do nfd API scheme registration in an init function
Prevents (rare) races on nfd-master reconfigurartion. Previously the
scheme was registered at nfd API controller creation/startup time. This
caused a race with some lister/informer goroutines of the previous
(stoppped) controller still running and accessing (reading) the sceme
while we were updating (writing) it.
2024-03-27 15:26:16 +02:00
Markus Lehtonen
e7f87de6df nfd-master: retry node updates indefinitely
Treat node updates like a reconciliation loop. Keep trying on node
update as long as it fails. Node update permafailing likely indicates a
bug in the nfd code (there should be no reason for it to fail forever)
and it's better to clearly see it in the logs/metrics rather than giving
up after a few retries.
2024-03-18 18:14:24 +02:00
Kubernetes Prow Robot
4790962123
Merge pull request #1595 from marquiz/devel/master-check-node-existence
nfd-master: check if node exists before trying update
2024-03-18 04:19:57 -07:00
Carlos Eduardo Arango Gutierrez
06c4733bc5
Add FeatureGate framework to handle new features
Code inspired on https://github.com/kubernetes/component-base/tree/master/featuregate

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
2024-03-15 19:11:32 +01:00
Markus Lehtonen
70fd3757c4 nfd-master: fix memory leak in nfd api-controller
Fixes a memory leak that happened when stopping (and then re-starting)
the nfd api controller. The stop channel was not used properly which
caused the underlying informer to keep on running.
2024-03-14 15:39:10 +02:00
Markus Lehtonen
a7bd22a75b nfd-master: check if node exists before trying update
Make the node-updater-pool worker fail fast (and not retry updates) if
a node does not exist.
2024-02-20 11:04:46 +02:00
Markus Lehtonen
044fd4a3fd nfd-master: log errors on node update retries 2024-02-16 15:51:04 +02:00
Markus Lehtonen
2382c34697 nfd-master: fix node status patching
Correctly patch the "status" subresource. This got broken when
refactoring the code in 7a050e7cf9 and
wasn't even catched by the unit tests as the fake kubernetes client
doesn't handle subresources as the real apiserver does.
2024-01-26 22:00:13 +02:00
Markus Lehtonen
7a050e7cf9 nfd-master: ditch apihelper
Implement some of frequently used helper functions inpackage.

This patch also contains big changes to the nfd-master unit tests. Much
of this is about migrating from the mocked apihelper interface to fake
kubernetes client that provides a bit more apiserver'ish functionality.
At the same time there is quite a bit of renaming in the tests,
shortening and unifying naming and getting rid of the extensive usage of
"mock" everywhere.
2024-01-26 16:09:22 +02:00
Markus Lehtonen
53003cbf69 pkg/utils: move JsonPatch from pkg/apihelper 2024-01-25 17:23:14 +02:00
Markus Lehtonen
acf815fb10 pkg/utils: move GetKubeconfig from pkg/apihelper here
This change is part of an effort to remove the pkg/apihelper package.
GetKubeconfig is useful helper functionality shared accross the codebase
so move it into a "safe" location.
2024-01-24 16:10:02 +02:00
Markus Lehtonen
57b7a3c6a8 Wrap nested errors 2024-01-22 22:45:15 +02:00
Markus Lehtonen
58ae81804c go.mod: update dependencies 2024-01-15 21:29:32 +02:00
Markus Lehtonen
a053efda64 nfd-master: run a separate gRPC health server
This patch separates the gRPC health server from the deprecated gRPC
server (disabled by default, replaced by the NodeFeature CRD API) used
for node labeling requests. The new health server runs on hardcoded TCP
port number 8082.

The main motivation for this change is to make the Kubernetes' built-in
gRPC liveness probes to function if TLS is enabled (as they don't
support TLS).

The health server itself is a naive implementation (as it was before),
basically only checking that nfd-master has started and hasn't crashed.
The patch adds a TODO note to improve the functionality.
2024-01-04 13:58:26 +02:00
Markus Lehtonen
97bf841140 apis/nfd: split rule processing into a separate package
This patch tidies up the nfdv1alpha1 API package by refactoring out the
implementation of (NodeFeature)Rule evaluation into a separate package.
2023-12-20 12:52:15 +02:00
Markus Lehtonen
cb0a46ec0e Use generics for maps and slices 2023-12-13 12:09:53 +02:00
Markus Lehtonen
a77983556f nfd-master: remove default denied ns from config
These are now handled by the validate package. If we have them here in
nfd-master, the default namespace (feature.node.kubernetes.io) gets
denied.
2023-12-12 16:12:53 +02:00
Carlos Eduardo Arango Gutierrez
affb93ea50
Create a Validate pkg
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
2023-12-11 16:54:22 +01:00
Markus Lehtonen
1d012a28cd Option to stop implicitly adding default prefix to names
Add new autoDefaultNs (default is "true") config option to nfd-master.
Setting the config option to false stops NFD from automatically adding
the "feature.node.kubernetes.io/" prefix to labels, annotations and
extended resources. Taints are not affected as for them no prefix is
automatically added. The user-visible part of enabling the option change
is that NodeFeatureRules, local feature files, hooks and configuration
of the "custom" may need to be altereda (if the auto-prefixing is
relied on).

For now, the config option defaults to "true", meaning no change in
default behavior. However, the intent is to change the default to
"false" in a future release, deprecating the option and eventually
removing it (forcing it to "false").

The goal of stopping doing "auto-prefixing" is to simplify the operation
(of nfd and users). Make the naming more straightforward and easier to
understand and debug (kind of WYSIWYG), eliminating peculiar corner
cases:

1. Make validation simpler and unambiguous
2. Remove "overloading" of names, i.e. the mapping two values to the
   same actual name. E.g. previously something like

      labels:
        feature.node.kubernetes.io/foo: bar
        foo: baz

   Could actually result in node label:

     feature.node.kubernetes.io/foo: baz

3. Make the processing/usagee of the "rule.matched" and "local.labels"
   feature in NodeFeatureRules unambiguous and more understadable. E.g.
   previously you could have node label
   "feature.node.kubernetes.io/local-foo: bar" but in the NodeFeatureRule
   you'd need to use the unprefixed name "local-foo" or the fully
   prefixed name, depending on what was specified in the feature file (or
   hook) on the node(s).

NOTE: setting autoDefaultNs to false is a breaking change for users who
rely on automatic prefixing with the default feature.node.kubernetes.io/
namespace. NodeFeatureRules, feature files, hooks and custom rules
(configuration of the "custom" source of nfd-worker) will need to be
altered.  Unprefixed labels, annoations and extended resources will be
denied by nfd-master.
2023-11-24 12:48:20 +02:00