node-feature-discovery

mirror of https://github.com/kubernetes-sigs/node-feature-discovery.git synced 2024-12-14 11:57:51 +00:00

Author	SHA1	Message	Date
Markus Lehtonen	9fad67ee39	nfd-master: cleanup updater-pool method args We store the work queues in the updater pool struct so we don't need to pass those as function arguments.	2024-09-16 14:50:08 +03:00
Markus Lehtonen	02b6b7395c	Drop dynamic run-time reconfiguration Simplify the code and reduce possible error scenarios by dropping fsnotify-based reconfiguration from nfd-master and nfd-worker. Also eliminates repeated re-configuration in scenarios where kubelet continuously touches (every minute) the mounted file (configmap) on the filesystem. Also modifies the Helm and kustomize deployments so that nfd-master, nfd-worker and nfd-topology-updater pods are restarted on configmap updates. In kustomize, the slight downside of this is that the name of the config map(s) depends on the content, so every time a user customizes the config data, the old unused configmap will be left and must be garbage-collected manually.	2024-08-21 12:46:36 +03:00
Markus Lehtonen	2bb8a72532	nfd-master: proper shutdown of nfd api informers Stop blocking on event channels when the api controller is stopped. Ensures that the nfd API informer factory is properly shut down and all resources released when stop() is called. This eliminates a memory leak on re-configure events when leader election is enabled.	2024-08-20 12:44:08 +03:00
Kubernetes Prow Robot	5a5b9e3c19	Merge pull request #1843 from marquiz/devel/master-chan nfd-master: use only unbuffered chans in the nfd api-controller	2024-08-19 07:23:12 -07:00
Markus Lehtonen	bf6ffadf36	nfd-master: use only unbuffered chans in the nfd api-controller There's no reason why the "update all" chans should be buffered (while the others are not).	2024-08-19 14:02:13 +03:00
Markus Lehtonen	0d3c1ac75b	nfd-master: explicit state variable for the node updater pool	2024-08-19 13:27:56 +03:00
Markus Lehtonen	a2068f7ce3	nfd-master: tweak list options for NodeFeature informer Fix cache syncing problems on big clusters with thousands of NodeFeature objects. On the initial list (sync) the client-go cache reflector sets the ResourceVersion to "0" (instead of leaving it empty). This causes problems in the api server with (apiserver) logs like: E writers.go:122] apiserver was unable to write a JSON response: http: Handler timeout E status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}: http: Handler timeout On the nfd-master side we see corresponding log snippets like: W reflector.go:547] failed to list v1alpha1.NodeFeature: stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 1521; INTERNAL_ERROR; received from peer I trace.go:236] "Reflector ListAndWatch" name: () (total time: 61126ms): ---"Objects listed" error:stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 1521; INTERNAL_ERROR; received from peer 61126ms (**) Decreasing the page size (opts.Limits) does not have any effect on the timeouts. However, setting ResourceVersion to an empty value seems to get the paging on its tracks, eliminating the timeouts. TODO: investigate in Kubernetes upstream the root cause of the timeouts with ResourceVersion="0".	2024-07-25 16:29:05 +03:00
Markus Lehtonen	ea3243fb00	nfd-master: check nfd api informer cache sync result Bail out if there were errors in syncing the cache of any resource.	2024-07-25 09:58:40 +03:00
Markus Lehtonen	a269bf4d25	Drop the -enable-nodefeature-api flag Was marked to be removed in v0.17.	2024-07-10 15:20:07 +03:00
Markus Lehtonen	ecb37c01b0	nfd-master: fix typos	2024-07-09 08:55:33 +03:00
IbirbyZh	2f9801b554	Fix the problem with starting the master with empty cache We faced a problem where the master deleted some of the labels on start. Sometimes it doesn't get NodeFeatures when they are present in the cluster because of an empty cache in the informer	2024-06-10 18:06:14 +02:00
Carlos Eduardo Arango Gutierrez	47c054e1db	Add NodeFeatureGroup CRD The NodeFeatureGroup is an NFD-specific custom resource that is designed for grouping nodes based on their features. NFD-Master watches for NodeFeatureGroup objects in the cluster and updates the status of the NodeFeatureGroup object with the list of nodes that match the feature group rules. The NodeFeatureGroup rules follow the same syntax as the NodeFeatureRule rules. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>	2024-05-23 16:34:08 +02:00
Markus Lehtonen	560bd11d85	Re-add -enable-nodefeature-api cmdline flag Bring back the -enable-nodefeature-api command line flag and the corresponding enableNodeFeatureApi helm config value that were removed without deprecation when the NodeFeatureAPI feature gate was introduced. The thinking behind this change is to not break existing users (without warning) unless totally unavoidable. Now the -enable-nodefeature-api flag is marked as deprecated and slated for removal in NFD v0.17. The NodeFeatureAPI feature gate and the -enable-nodefeature-api flag work together so that the NodeFeature API is disabled (gRPC is enabled, instead) if either of them is set to false. This patch selectively reverts parts of `06c4733bc5`.	2024-05-16 10:53:49 +03:00
Markus Lehtonen	121345472d	nfd-master: add DisableAutoPrefix feature gate Now that we have support for feature gates, deprecate the autoDefaultNs config option of nfd-master and replace it with a new alpha feature gate DisableAutoPrefix (defaults to false). Using a feature gate to handle and communicate this kind of change, where the default behavior is intended to be changed in a future release, feels much more natural than using random flags/options. The combined logic of the feature gate and the config option is a logical OR over disabling auto-prefixing. That is, auto-prefixing is disabled if either the feature gate or the config option is set to disable it:
                    | DisableAutoPrefix (feature gate)
                    | false | true
--------------------|---------------------------------
autoDefaultNs  true | ON    | OFF
(config opt)  false | OFF   | OFF	2024-05-15 17:01:16 +03:00
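The truth table in the commit message above can be sketched in Go as follows. This is a minimal illustration of the described OR logic only; the function name `autoPrefixEnabled` and its parameters are hypothetical and are not taken from the nfd-master source.

```go
package main

import "fmt"

// autoPrefixEnabled is a hypothetical helper (not from the NFD code base).
// Per the commit message, auto-prefixing is disabled if EITHER the
// DisableAutoPrefix feature gate is true OR the autoDefaultNs config
// option is false. It therefore stays on only when neither disables it.
func autoPrefixEnabled(autoDefaultNs, disableAutoPrefixGate bool) bool {
	disabled := disableAutoPrefixGate || !autoDefaultNs
	return !disabled
}

func main() {
	// Walk all four combinations from the table.
	for _, ns := range []bool{true, false} {
		for _, gate := range []bool{false, true} {
			fmt.Printf("autoDefaultNs=%-5v DisableAutoPrefix=%-5v -> auto-prefix on: %v\n",
				ns, gate, autoPrefixEnabled(ns, gate))
		}
	}
}
```

Only the combination autoDefaultNs=true with DisableAutoPrefix=false leaves auto-prefixing on, matching the single ON cell in the table.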
Markus Lehtonen	eb7a0ada5c	nfd-master: import features package as nfdfeatures Refactoring to prevent naming clash in future changes.	2024-05-15 11:27:46 +03:00
TessaIO	de50ac8800	chore/nfd-master: remove warnings in nfd-master unit tests file Signed-off-by: TessaIO <ahmedgrati1999@gmail.com>	2024-04-22 22:27:15 +02:00
Kubernetes Prow Robot	a7c58b121c	Merge pull request #1660 from marquiz/devel/master-reconfigure-change nfd-master: stop node-updater pool before reconfiguring api-controller	2024-04-15 11:10:32 -07:00
Kubernetes Prow Robot	91d3d5a7b0	Merge pull request #1653 from marquiz/devel/master-multiple-k8sclients nfd-master: use separate k8s api clients for each updater	2024-04-15 09:18:51 -07:00
Markus Lehtonen	8ad6210d5c	nfd-master: use separate k8s api clients for each updater Sharing the same client between updater threads virtually serializes access, in practice making the effective parallelism close to 1. With this patch, in my bench cluster of 300 nodes, the time taken by updating all nodes drops from ~2 minutes to ~12 seconds (with the default parallelism of 10 node updater threads). This demonstrates the 10-fold increased parallelism from ~1 to 10. There might be other solutions that could be explored, e.g. caching nodes with an indexer/lister but otoh nfd doesn't necessarily need/want to watch every little change in each node. We only need to get the node when something in our own CRDs change (we don't react to any changes in the node object itself). Using multiple clients was the most obvious choice to solve the problem for now.	2024-04-15 19:00:30 +03:00
Kubernetes Prow Robot	6b80f654d4	Merge pull request #1600 from ArangoGutierrez/e2e-not-k8s Move NFD api to a separate go mod	2024-04-09 02:06:06 -07:00
Markus Lehtonen	ba4cebb29e	nfd-master: stop node-updater pool before reconfiguring api-controller Prevents potential race between node-updater pool and the api-controller when re-configuring nfd-master. Reconfiguration causes a new api-controller instance to be created so nfd api lister might change in the midst of processing a node update (if the pool was running). No actual issues related to this have been identified but races (like this) should still be avoided.	2024-04-09 10:45:07 +03:00
Markus Lehtonen	8709cccf71	nfd-master: parse kubeconfig even with NoPublish set Don't try to be too smart when kubeconfig is needed. In practice, the nfd-master really doesn't work anymore (with the NodeFeature API enabled) without a kubeconfig set. This patch fixes crashes happening when NoPublish is enabled, e.g. in listing all nodes in the nfd api handler and in getting single node objects in the node updater pool. This patch changes the kubeconfig parsing to happen at the creation of the nfd-master instance. We don't need to do that at reconfigure time as none of the dynamic config options affect it. Unit tests are adjusted, accordingly.	2024-04-08 14:25:27 +03:00
Markus Lehtonen	fcb8d3cda4	nfd-master: implement opts for modifying NfdMaster instance This provides a more controlled way for setting up the NfdMaster instance for testing.	2024-04-05 20:21:19 +03:00
Kubernetes Prow Robot	199d665046	Merge pull request #1656 from marquiz/devel/channel-simplify Tidy up usage of channels for signaling	2024-04-05 07:51:34 -07:00
Carlos Eduardo Arango Gutierrez	3434557d7c	Move NFD api to a separate go mod Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>	2024-04-05 16:35:47 +02:00
Kubernetes Prow Robot	cb24f7c234	Merge pull request #1657 from marquiz/devel/master-label-whitelist nfd-master: prevent crash on empty config struct	2024-04-05 05:36:52 -07:00
Markus Lehtonen	26a80cf142	Tidy up usage of channels for signaling This started as a small effort to simplify the usage of "ready" channel in nfd-master. It extended into a wider simplification/unification of the channel usage.	2024-04-05 14:39:58 +03:00
Markus Lehtonen	b27676451a	nfd-master: prevent crash on empty config struct Change the handling of LabelWhiteList config option to use a pointer to detect when the option is unset. This doesn't fix any detected crash but is merely general improvement and stabilization, serving easier testing. Also, use the regexp type from the core libs for the config struct - dropping the unmarshalling code for our custom regexp type - as the core regexp now implements unmarshaller itself.	2024-04-05 14:19:44 +03:00
Kubernetes Prow Robot	ad96c301a4	Merge pull request #1642 from marquiz/devel/master-updater-pool-lock nfd-master: protect node updater pool queueing with a lock	2024-04-05 03:31:10 -07:00
Markus Lehtonen	44a5a5b4a8	nfd-master: get node object only once when updating node Prevent excess queries of node objects from the Kubernetes apiserver. This significantly speeds up node updates (and reduces the load on the apiserver) as the client-side throttling (which is good) does not bite us that hard.	2024-04-04 14:44:52 +03:00
Markus Lehtonen	bce446c5b6	nfd-master: protect node updater pool queueing with a lock Prevents races when (re-)starting the queue. There are no reports on issues related to this (and I haven't come up with any actual failure path in the current code) but better to be safe and follow the best practices.	2024-03-27 16:53:34 +02:00
Markus Lehtonen	c4e010eafd	nfd-master: do nfd API scheme registration in an init function Prevents (rare) races on nfd-master reconfiguration. Previously the scheme was registered at nfd API controller creation/startup time. This caused a race with some lister/informer goroutines of the previous (stopped) controller still running and accessing (reading) the scheme while we were updating (writing) it.	2024-03-27 15:26:16 +02:00
Markus Lehtonen	e7f87de6df	nfd-master: retry node updates indefinitely Treat node updates like a reconciliation loop. Keep trying on node update as long as it fails. Node update permafailing likely indicates a bug in the nfd code (there should be no reason for it to fail forever) and it's better to clearly see it in the logs/metrics rather than giving up after a few retries.	2024-03-18 18:14:24 +02:00
Kubernetes Prow Robot	4790962123	Merge pull request #1595 from marquiz/devel/master-check-node-existence nfd-master: check if node exists before trying update	2024-03-18 04:19:57 -07:00
Carlos Eduardo Arango Gutierrez	06c4733bc5	Add FeatureGate framework to handle new features Code inspired on https://github.com/kubernetes/component-base/tree/master/featuregate Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>	2024-03-15 19:11:32 +01:00
Markus Lehtonen	70fd3757c4	nfd-master: fix memory leak in nfd api-controller Fixes a memory leak that happened when stopping (and then re-starting) the nfd api controller. The stop channel was not used properly which caused the underlying informer to keep on running.	2024-03-14 15:39:10 +02:00
Markus Lehtonen	a7bd22a75b	nfd-master: check if node exists before trying update Make the node-updater-pool worker fail fast (and not retry updates) if a node does not exist.	2024-02-20 11:04:46 +02:00
Markus Lehtonen	044fd4a3fd	nfd-master: log errors on node update retries	2024-02-16 15:51:04 +02:00
Markus Lehtonen	2382c34697	nfd-master: fix node status patching Correctly patch the "status" subresource. This got broken when refactoring the code in `7a050e7cf9` and wasn't even caught by the unit tests as the fake kubernetes client doesn't handle subresources as the real apiserver does.	2024-01-26 22:00:13 +02:00
Markus Lehtonen	7a050e7cf9	nfd-master: ditch apihelper Implement some of the frequently used helper functions in-package. This patch also contains big changes to the nfd-master unit tests. Much of this is about migrating from the mocked apihelper interface to fake kubernetes client that provides a bit more apiserver'ish functionality. At the same time there is quite a bit of renaming in the tests, shortening and unifying naming and getting rid of the extensive usage of "mock" everywhere.	2024-01-26 16:09:22 +02:00
Markus Lehtonen	53003cbf69	pkg/utils: move JsonPatch from pkg/apihelper	2024-01-25 17:23:14 +02:00
Markus Lehtonen	acf815fb10	pkg/utils: move GetKubeconfig from pkg/apihelper here This change is part of an effort to remove the pkg/apihelper package. GetKubeconfig is useful helper functionality shared across the codebase so move it into a "safe" location.	2024-01-24 16:10:02 +02:00
Markus Lehtonen	57b7a3c6a8	Wrap nested errors	2024-01-22 22:45:15 +02:00
Markus Lehtonen	58ae81804c	go.mod: update dependencies	2024-01-15 21:29:32 +02:00
Markus Lehtonen	a053efda64	nfd-master: run a separate gRPC health server This patch separates the gRPC health server from the deprecated gRPC server (disabled by default, replaced by the NodeFeature CRD API) used for node labeling requests. The new health server runs on hardcoded TCP port number 8082. The main motivation for this change is to make Kubernetes' built-in gRPC liveness probes work when TLS is enabled (as they don't support TLS). The health server itself is a naive implementation (as it was before), basically only checking that nfd-master has started and hasn't crashed. The patch adds a TODO note to improve the functionality.	2024-01-04 13:58:26 +02:00
Markus Lehtonen	97bf841140	apis/nfd: split rule processing into a separate package This patch tidies up the nfdv1alpha1 API package by refactoring out the implementation of (NodeFeature)Rule evaluation into a separate package.	2023-12-20 12:52:15 +02:00
Markus Lehtonen	cb0a46ec0e	Use generics for maps and slices	2023-12-13 12:09:53 +02:00
Markus Lehtonen	a77983556f	nfd-master: remove default denied ns from config These are now handled by the validate package. If we have them here in nfd-master, the default namespace (feature.node.kubernetes.io) gets denied.	2023-12-12 16:12:53 +02:00
Carlos Eduardo Arango Gutierrez	affb93ea50	Create a Validate pkg Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>	2023-12-11 16:54:22 +01:00
Markus Lehtonen	1d012a28cd	Option to stop implicitly adding default prefix to names Add new autoDefaultNs (default is "true") config option to nfd-master. Setting the config option to false stops NFD from automatically adding the "feature.node.kubernetes.io/" prefix to labels, annotations and extended resources. Taints are not affected as for them no prefix is automatically added. The user-visible part of enabling the option change is that NodeFeatureRules, local feature files, hooks and configuration of the "custom" source may need to be altered (if the auto-prefixing is relied on). For now, the config option defaults to "true", meaning no change in default behavior. However, the intent is to change the default to "false" in a future release, deprecating the option and eventually removing it (forcing it to "false"). The goal of stopping doing "auto-prefixing" is to simplify the operation (of nfd and users). Make the naming more straightforward and easier to understand and debug (kind of WYSIWYG), eliminating peculiar corner cases: 1. Make validation simpler and unambiguous 2. Remove "overloading" of names, i.e. the mapping of two values to the same actual name. E.g. previously something like labels: feature.node.kubernetes.io/foo: bar foo: baz Could actually result in node label: feature.node.kubernetes.io/foo: baz 3. Make the processing/usage of the "rule.matched" and "local.labels" feature in NodeFeatureRules unambiguous and more understandable. E.g. previously you could have node label "feature.node.kubernetes.io/local-foo: bar" but in the NodeFeatureRule you'd need to use the unprefixed name "local-foo" or the fully prefixed name, depending on what was specified in the feature file (or hook) on the node(s). NOTE: setting autoDefaultNs to false is a breaking change for users who rely on automatic prefixing with the default feature.node.kubernetes.io/ namespace. 
NodeFeatureRules, feature files, hooks and custom rules (configuration of the "custom" source of nfd-worker) will need to be altered. Unprefixed labels, annotations and extended resources will be denied by nfd-master.	2023-11-24 12:48:20 +02:00
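The name "overloading" described in the commit message above can be illustrated with a small Go sketch. It is an assumption-laden toy model, not the nfd-master implementation: the `label` type and the `addDefaultPrefix` and `normalize` functions are hypothetical, a name is treated as unprefixed when it contains no "/", and entries are processed in slice order so that the later one wins.

```go
package main

import (
	"fmt"
	"strings"
)

// defaultNs is the default prefix named in the commit message.
const defaultNs = "feature.node.kubernetes.io/"

// label is a hypothetical name/value pair; a slice keeps input order.
type label struct{ name, value string }

// addDefaultPrefix models auto-prefixing: names without a "/" are
// treated as unprefixed and get the default namespace prepended.
func addDefaultPrefix(name string) string {
	if strings.Contains(name, "/") {
		return name
	}
	return defaultNs + name
}

// normalize returns the resulting labels plus the names that were denied.
// With autoPrefix on, unprefixed names are mapped into the default
// namespace (possibly overwriting an earlier entry). With it off,
// unprefixed names are denied, as the commit message describes.
func normalize(in []label, autoPrefix bool) (map[string]string, []string) {
	out := map[string]string{}
	var denied []string
	for _, l := range in {
		switch {
		case strings.Contains(l.name, "/"):
			out[l.name] = l.value
		case autoPrefix:
			out[addDefaultPrefix(l.name)] = l.value
		default:
			denied = append(denied, l.name)
		}
	}
	return out, denied
}

func main() {
	in := []label{
		{"feature.node.kubernetes.io/foo", "bar"},
		{"foo", "baz"},
	}
	for _, auto := range []bool{true, false} {
		out, denied := normalize(in, auto)
		fmt.Printf("autoPrefix=%v labels=%v denied=%v\n", auto, out, denied)
	}
}
```

With auto-prefixing on, both input names collapse onto the same node label; with it off, the unprefixed name is rejected and the prefixed value is kept as written.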

188 commits