node-feature-discovery

mirror of https://github.com/kubernetes-sigs/node-feature-discovery.git synced 2024-12-15 17:50:49 +00:00

Author	SHA1	Message	Date
Markus Lehtonen	45164f580a	nfd-gc: use paging when listing CRs List NodeFeature and NodeResourceTopology objects in pages of 200 items. This reduces memory consumption and eliminates timeouts (on the apiserver side) in big clusters of thousands of nodes.	2024-08-02 08:20:17 +03:00
Kubernetes Prow Robot	57f1b79856	Merge pull request #1813 from marquiz/devel/gc-metalister nfd-gc: only fetch object metadata	2024-08-01 12:53:33 -07:00
Markus Lehtonen	54befffa94	nfd-gc: only fetch object metadata Significantly reduce the apiserver and network load by only listing/getting the object metadata.	2024-07-30 16:01:04 +03:00
Kubernetes Prow Robot	2d24a4bee4	Merge pull request #1811 from marquiz/devel/informer-listopts nfd-master: tweak list options for NodeFeature informer	2024-07-30 03:56:04 -07:00
Markus Lehtonen	454d443b72	nfd-gc: check that node informer cache sync succeeded	2024-07-26 10:29:15 +03:00
Markus Lehtonen	a2068f7ce3	nfd-master: tweak list options for NodeFeature informer Fix cache syncing problems on big clusters with thousands of NodeFeature objects. On the initial list (sync) the client-go cache reflector sets the ResourceVersion to "0" (instead of leaving it empty). This causes problems in the api server with (apiserver) logs like: E writers.go:122] apiserver was unable to write a JSON response: http: Handler timeout E status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}: http: Handler timeout On the nfd-master side we see corresponding log snippets like: W reflector.go:547] failed to list v1alpha1.NodeFeature: stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 1521; INTERNAL_ERROR; received from peer I trace.go:236] "Reflector ListAndWatch" name: () (total time: 61126ms): ---"Objects listed" error:stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 1521; INTERNAL_ERROR; received from peer 61126ms (**) Decreasing the page size (opts.Limits) does not have any effect on the timeouts. However, setting ResourceVersion to an empty value seems to get the paging on its tracks, eliminating the timeouts. TODO: investigate in Kubernetes upstream the root cause of the timeouts with ResourceVersion="0".	2024-07-25 16:29:05 +03:00
Markus Lehtonen	ea3243fb00	nfd-master: check nfd api informer cache sync result Bail out if there were errors in syncing the cache of any resource.	2024-07-25 09:58:40 +03:00
Markus Lehtonen	25e827a4c8	feature-gates: mark NodeFeatureAPI as GA The feature gate is locked to true. That is, it is not possible to revert back to the gRPC-based communication which makes the gRPC API ready for removal.	2024-07-16 13:53:31 +03:00
Markus Lehtonen	522b87e325	nfd-worker: change TestRun to use NodeFeature API Run nfd-worker with NodeFeature API enabled (against a fake apiserver) instead of using the deprecated gRPC (against a nfd-master instance). Expand the test to verify the features and labels that are advertised as a NodeFeature object.	2024-07-12 09:50:09 +03:00
Markus Lehtonen	a269bf4d25	Drop the -enable-nodefeature-api flag Was marked to be removed in v0.17.	2024-07-10 15:20:07 +03:00
Kubernetes Prow Robot	393af96a88	Merge pull request #1755 from ArangoGutierrez/1752 Use worker DS OwnerReference for NF's	2024-07-09 06:33:07 -07:00
Carlos Eduardo Arango Gutierrez	e33e68ad5b	Add optionable arguments to NewWorker Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>	2024-07-09 15:08:26 +02:00
Kubernetes Prow Robot	3bb7a1caff	Merge pull request #1766 from marquiz/devel/simplify Simplify code	2024-07-09 00:19:28 -07:00
Markus Lehtonen	b5b701fbbf	Simplify code Drop unnecessary typedefs.	2024-07-09 09:05:33 +03:00
Markus Lehtonen	ecb37c01b0	nfd-master: fix typos	2024-07-09 08:55:33 +03:00
Carlos Eduardo Arango Gutierrez	5d3ee1c51f	Use worker DS OwnerReference for NF's Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>	2024-07-04 13:53:24 +02:00
IbirbyZh	2f9801b554	Fix the problem with starting the master with empty cache We faced a problem where the master deleted some of the labels on start. Sometimes it does not get NodeFeatures even though they are present in the cluster, because the informer cache is empty.	2024-06-10 18:06:14 +02:00
Kubernetes Prow Robot	814255b7f1	Merge pull request #1671 from marquiz/devel/nfd-api-multi-type-feature apis/nfd: allow different types of features of the same name	2024-05-24 03:42:30 -07:00
Markus Lehtonen	3b448ae623	apis/nfd: allow different types of features of the same name This patch changes the handling of NodeFeatureRules so that one feature name (say "cpu.cpuid") can hold different types of features (flags, attributes and/or instances). Requiring features to choose one single type has not been a limitation of the API itself (and there has been no validation on this) but an implementation decision. The new evaluation logic of match expressions is such that "flags" and "attributes" are basically evaluated as a union - they are both maps but "flags" just don't have any value associated with the key. However, "instances" are handled separately as that is basically an array of maps and needs to be evaluated in a different way (loop over the array of instances and evaluate expressions against the attributes of each). Because of this difference care must be taken if mixing "instance" features with "flag" and/or "attribute" features. Note that the API types or their validation is not changed - just the implementation of how the NodeFeatureRules are evaluated.	2024-05-24 13:18:31 +03:00
Kubernetes Prow Robot	5fe3433e35	Merge pull request #1713 from marquiz/devel/worker-log-fix nfd-worker: improved log when creating NodeFeature object	2024-05-24 02:15:30 -07:00
Carlos Eduardo Arango Gutierrez	47c054e1db	Add NodeFeatureGroup CRD The NodeFeatureGroup is an NFD-specific custom resource that is designed for grouping nodes based on their features. NFD-Master watches for NodeFeatureGroup objects in the cluster and updates the status of the NodeFeatureGroup object with the list of nodes that match the feature group rules. The NodeFeatureGroup rules follow the same syntax as the NodeFeatureRule rules. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>	2024-05-23 16:34:08 +02:00
Markus Lehtonen	649036977e	nfd-worker: improved log when creating NodeFeature object Don't log an empty NodeFeature object.	2024-05-23 14:37:26 +03:00
Markus Lehtonen	560bd11d85	Re-add -enable-nodefeature-api cmdline flag Bring back the -enable-nodefeature-api command line flag and the corresponding enableNodeFeatureApi helm config value that were removed without deprecation when the NodeFeatureAPI feature gate was introduced. The thinking behind this change is to not break existing users (without warning) unless totally unavoidable. Now the -enable-nodefeature-api flag is marked as deprecated and slated for removal in NFD v0.17. The NodeFeatureAPI feature gate and the -enable-nodefeature-api flag work together so that the NodeFeature API is disabled (gRPC is enabled, instead) if either of them is set to false. This patch selectively reverts parts of `06c4733bc5`.	2024-05-16 10:53:49 +03:00
Markus Lehtonen	121345472d	nfd-master: add DisableAutoPrefix feature gate Now that we have support for feature gates, deprecate the autoDefaultNs config option of nfd-master and replace it with a new alpha feature gate DisableAutoPrefix (defaults to false). Using a feature gate to handle and communicate these kinds of changes, where the default behavior is intended to be changed in a future release, feels much more natural than using random flags/options. The combined logic of the feature gate and the config option is a logical OR over disabling auto-prefixing. That is, auto-prefixing is disabled if either the feature gate or the config option is set to disable it: auto-prefixing is ON only when autoDefaultNs (config opt) is true and DisableAutoPrefix (feature gate) is false, and OFF in the other three combinations.	2024-05-15 17:01:16 +03:00
Markus Lehtonen	eb7a0ada5c	nfd-master: import features package as nfdfeatures Refactoring to prevent naming clash in future changes.	2024-05-15 11:27:46 +03:00
Markus Lehtonen	bad7d1fcb1	apis/nfd: increase unit test coverage Cover error cases of the "match name" functions.	2024-04-30 16:02:13 +03:00
Kubernetes Prow Robot	ca13b4903d	Merge pull request #1669 from marquiz/devel/nfd-api-helpers-refactor api/nfd: use varargs in the NewInstanceFeatures helper	2024-04-23 06:33:46 -07:00
Kubernetes Prow Robot	be44447dc0	Merge pull request #1670 from marquiz/devel/flag-feature-errors apis/nfd: no error on ops that never match	2024-04-23 06:09:06 -07:00
Kubernetes Prow Robot	828acaa8cc	Merge pull request #1667 from marquiz/devel/api-tests apis/nfd: add unit tests for match name functions	2024-04-23 06:08:48 -07:00
Markus Lehtonen	fbb7303562	apis/nfd: no error on ops that never match Return false (i.e. "did not match") but no error when evaluating a match expression against a "flag" type feature (which doesn't have any associated value, just the name) if a MatchOp that never matches is used. This is preparation for supporting multi-type features, i.e. one feature, like "cpu.cpuid", having e.g. "flag" and "attribute" type features.	2024-04-23 11:07:49 +03:00
Markus Lehtonen	719c5186f6	api/nfd: use varargs in the NewInstanceFeatures helper Make usage of this helper function more flexible.	2024-04-23 10:29:24 +03:00
TessaIO	de50ac8800	chore/nfd-master: remove warnings in nfd-master unit tests file Signed-off-by: TessaIO <ahmedgrati1999@gmail.com>	2024-04-22 22:27:15 +02:00
Markus Lehtonen	6b1e9c7860	apis/nfd: add unit tests for match name functions	2024-04-22 17:20:33 +03:00
Kubernetes Prow Robot	a7c58b121c	Merge pull request #1660 from marquiz/devel/master-reconfigure-change nfd-master: stop node-updater pool before reconfiguring api-controller	2024-04-15 11:10:32 -07:00
Kubernetes Prow Robot	91d3d5a7b0	Merge pull request #1653 from marquiz/devel/master-multiple-k8sclients nfd-master: use separate k8s api clients for each updater	2024-04-15 09:18:51 -07:00
Markus Lehtonen	8ad6210d5c	nfd-master: use separate k8s api clients for each updater Sharing the same client between updater threads virtually serializes access, in practice making the effective parallelism close to 1. With this patch, in my bench cluster of 300 nodes, the time taken by updating all nodes drops from ~2 minutes to ~12 seconds (with the default parallelism of 10 node updater threads). This demonstrates the 10-fold increased parallelism from ~1 to 10. There might be other solutions that could be explored, e.g. caching nodes with an indexer/lister but otoh nfd doesn't necessarily need/want to watch every little change in each node. We only need to get the node when something in our own CRDs change (we don't react to any changes in the node object itself). Using multiple clients was the most obvious choice to solve the problem for now.	2024-04-15 19:00:30 +03:00
Kubernetes Prow Robot	624c02e1e2	Merge pull request #1633 from marquiz/devel/validate-tests apis/nfd/validate: loosen validation of feature annotations	2024-04-11 07:26:17 -07:00
Markus Lehtonen	7fbada8b86	apis/nfd/validate: more comprehensive unit tests Also add license header to the test.go file and fix one bug in MatchFeature validation.	2024-04-11 14:57:31 +03:00
Carlos Eduardo Arango Gutierrez	50d9874e72	Fix update_codegen Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>	2024-04-09 18:28:04 +02:00
Markus Lehtonen	a9167e6875	apis/nfd/validate: loosen validation of feature annotations Don't require that the annotation value must conform to the (strict) requirements of label values. In the Kubernetes API annotation values do not have other restrictions than that the total size (keys and values) of _all_ annotations combined of an object must not exceed 256kB. This patch sets a maximum size limit of 1kB for the value of a single feature annotation created by NFD. This limit is rather arbitrary but should be enough for the NFD usage scenarios (until proven wrong).	2024-04-09 13:30:22 +03:00
Kubernetes Prow Robot	6b80f654d4	Merge pull request #1600 from ArangoGutierrez/e2e-not-k8s Move NFD api to a separate go mod	2024-04-09 02:06:06 -07:00
Markus Lehtonen	ba4cebb29e	nfd-master: stop node-updater pool before reconfiguring api-controller Prevents potential race between node-updater pool and the api-controller when re-configuring nfd-master. Reconfiguration causes a new api-controller instance to be created so nfd api lister might change in the midst of processing a node update (if the pool was running). No actual issues related to this have been identified but races (like this) should still be avoided.	2024-04-09 10:45:07 +03:00
Markus Lehtonen	8709cccf71	nfd-master: parse kubeconfig even with NoPublish set Don't try to be too smart when kubeconfig is needed. In practice, the nfd-master really doesn't work anymore (with the NodeFeature API enabled) without a kubeconfig set. This patch fixes crashes happening when NoPublish is enabled, e.g. in listing all nodes in the nfd api handler and in getting single node objects in the node updater pool. This patch changes the kubeconfig parsing to happen at the creation of the nfd-master instance. We don't need to do that at reconfigure time as none of the dynamic config options affect it. Unit tests are adjusted, accordingly.	2024-04-08 14:25:27 +03:00
Markus Lehtonen	fcb8d3cda4	nfd-master: implement opts for modifying NfdMaster instance This provides a more controlled way for setting up the NfdMaster instance for testing.	2024-04-05 20:21:19 +03:00
Kubernetes Prow Robot	199d665046	Merge pull request #1656 from marquiz/devel/channel-simplify Tidy up usage of channels for signaling	2024-04-05 07:51:34 -07:00
Carlos Eduardo Arango Gutierrez	3434557d7c	Move NFD api to a separate go mod Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>	2024-04-05 16:35:47 +02:00
Kubernetes Prow Robot	cb24f7c234	Merge pull request #1657 from marquiz/devel/master-label-whitelist nfd-master: prevent crash on empty config struct	2024-04-05 05:36:52 -07:00
Markus Lehtonen	26a80cf142	Tidy up usage of channels for signaling This started as a small effort to simplify the usage of "ready" channel in nfd-master. It extended into a wider simplification/unification of the channel usage.	2024-04-05 14:39:58 +03:00
Markus Lehtonen	b27676451a	nfd-master: prevent crash on empty config struct Change the handling of LabelWhiteList config option to use a pointer to detect when the option is unset. This doesn't fix any detected crash but is merely a general improvement and stabilization, serving easier testing. Also, use the regexp type from the core libs for the config struct - dropping the unmarshalling code for our custom regexp type - as the core regexp now implements unmarshaller itself.	2024-04-05 14:19:44 +03:00
Kubernetes Prow Robot	ad96c301a4	Merge pull request #1642 from marquiz/devel/master-updater-pool-lock nfd-master: protect node updater pool queueing with a lock	2024-04-05 03:31:10 -07:00
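
The combined logic described in commit 121345472d (the autoDefaultNs config option together with the DisableAutoPrefix feature gate) can be written out as a small truth table. The following is a minimal Go sketch of that rule only. The helper name autoPrefixEnabled and its parameters are hypothetical and chosen for illustration; this is not the nfd-master code.

```go
package main

import "fmt"

// autoPrefixEnabled is a hypothetical helper (not part of NFD) that mirrors
// the rule stated in the commit message: auto-prefixing is disabled if either
// the DisableAutoPrefix feature gate is true or autoDefaultNs is false.
func autoPrefixEnabled(autoDefaultNs, disableAutoPrefix bool) bool {
	// Logical OR over "disable" is the same as AND over "enable".
	return autoDefaultNs && !disableAutoPrefix
}

func main() {
	// Print all four combinations of the config option and the feature gate.
	for _, autoDefaultNs := range []bool{true, false} {
		for _, gate := range []bool{false, true} {
			state := "OFF"
			if autoPrefixEnabled(autoDefaultNs, gate) {
				state = "ON"
			}
			fmt.Printf("autoDefaultNs=%t DisableAutoPrefix=%t auto-prefixing=%s\n",
				autoDefaultNs, gate, state)
		}
	}
}
```

Only the combination autoDefaultNs=true with DisableAutoPrefix=false yields ON, which matches the table in the commit message.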

474 commits