node-feature-discovery

mirror of https://github.com/kubernetes-sigs/node-feature-discovery.git synced 2024-12-14 11:57:51 +00:00

Author	SHA1	Message	Date
Markus Lehtonen	d8f167d8a9	nfd-master: remove one stale empty line	2023-08-01 16:38:32 +03:00
Markus Lehtonen	47f621d970	metrics: improve the node updates gauge Rename the metric, better describe what we're measuring and better comply with prometheus naming conventions. Also change it to represent actual updates of the node object on the Kubernetes apiserver.	2023-07-31 19:45:22 +03:00
Markus Lehtonen	945e7fcb3f	metrics: improve nfr processing time metric Change the metric from a simple gauge (that basically was a single value for the whole cluster) into a HistogramVec, aligning with the feature discovery duration metric in nfd-worker. This improved metric now has prometheus labels for the NFR name and node name, i.e. it is tracking per-NFR metric for each node being processed. Also, change the naming to better comply with prometheus suggested conventions.	2023-07-31 19:45:22 +03:00
Kubernetes Prow Robot	77d869c4f7	Merge pull request #1242 from ArangoGutierrez/metrics Enable metrics via prometheus operator	2023-07-21 02:26:08 -07:00
Carlos Eduardo Arango Gutierrez	e3aedd33e2	Enable metrics via prometheus operator Expose metrics via prometheus.monitoring.coreos.com/v1 The exposed metrics are \| Metric \| Type \| Meaning \| \| --------------- \| ---------------- \| ---------------- \| \| `nfd_master_build_info` \| Gauge \| Version from which nfd-master was built. \| \| `nfd_worker_build_info` \| Gauge \| Version from which nfd-worker was built. \| \| `nfd_updated_nodes` \| Counter \| Time taken to label a node \| \| `nfd_crd_processing_time` \| Gauge \| Time taken to process a NodeFeatureRule CRD \| \| `nfd_feature_discovery_duration_seconds` \| HistogramVec \| Time taken to discover features on a node \| Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com> Co-authored-by: Markus Lehtonen <markus.lehtonen@intel.com>	2023-07-21 10:59:52 +02:00
Markus Lehtonen	dac45be28c	nfd-master: check for nil references in nfdAPIUpdateAllNodes Just a safeguard.	2023-07-17 17:49:44 +03:00
Kubernetes Prow Robot	306969a945	Merge pull request #1133 from AhmedGrati/feat-parallelize-nodes-update feat: parallelize nodes update	2023-06-02 05:28:57 -07:00
AhmedGrati	b3cfe17392	feat: parallelize nodes update This PR aims to optimize the process of updating nodes with corresponding features. In fact, previously, we were updating nodes sequentially even though they are independent from each other. Therefore, we integrated new components: LabelersNodePool which is responsible for spinning a goroutine whenever there's a request for updating nodes, and a Workqueue which is responsible for holding node names that should be updated. Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>	2023-06-02 11:41:50 +01:00
AhmedGrati	08b9c3486e	feat: support dynamic values for labels in the NodeFeatureRule This PR aims to support the dynamic values for labels in the NodeFeatureRule CRD, it would offer more flexible labeling for users. To achieve this, we check whether label value starts with "@", and if it's the case, we will get the value of the feature value, and update the value of the label with the feature value. Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>	2023-05-31 23:30:26 +01:00
Markus Lehtonen	bf670de68d	pkg/utils: migrate KlogDump to structured logging Drop the KlogDump helper in favor of klog.InfoS. However, that patch introduces a new DelayedDumper() helper to avoid processing (marshalling) of object unless really evaluated by the logging function.	2023-05-31 14:43:08 +03:00
Markus Lehtonen	8113d651c2	nfd-master: migrate to structured logging	2023-05-31 14:43:05 +03:00
Markus Lehtonen	2a3c7e4c93	nfd-master: add validation of label names and values Validate labels before trying to update the node. Makes us fail early and prevent useless retries in case invalid labels are tried.	2023-05-29 16:54:14 +03:00
Markus Lehtonen	1809c24314	nfd-master: use close for stop channel Simpler and more reliable (in case of multiple consumers) to just close the channel.	2023-05-24 16:51:48 +03:00
PiotrProkop	272fd4784f	Add new flag enable-leader-election for nfd-master. It allows NFD-master to be run in active-passive way when running multiple instances of NFD-master to prevent multiple components from updating same custom resources. Signed-off-by: PiotrProkop <pprokop@nvidia.com>	2023-05-15 13:30:07 +02:00
Kubernetes Prow Robot	85073525c3	Merge pull request #1185 from AhmedGrati/fix-resync-period-functionality nfd-master: fix resync period config option	2023-05-02 11:14:16 -07:00
AhmedGrati	87c2d7e184	nfd-master: fix resync period config option This PR fixes the resync-period configuration option of the nfd-master. In fact, previously, changes were not reflected in the nfd-master at runtime. e2e tests are also implemented to make sure that the fix is already working as expected. Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>	2023-05-02 13:17:01 +02:00
Markus Lehtonen	fb20388028	nfd-master: refactor filtering of taints	2023-04-28 18:13:54 +03:00
Markus Lehtonen	43ced0c1a1	nfd-master: refactor filtering of feature labels More consistent error messages. Also preparation for dynamic labels values (that '@' notation currently supported for extended resources).	2023-04-28 18:13:54 +03:00
Markus Lehtonen	6ca687fbef	nfd-master: refactor filtering of extended resources Simplify code a bit and get more consistent error messages (in addition to fixing some of those).	2023-04-28 18:13:54 +03:00
Markus Lehtonen	77011a775f	nfd-master: log node name when processing NodeFeatureRules	2023-04-26 07:22:30 +03:00
Kubernetes Prow Robot	2356223ffc	Merge pull request #1139 from AhmedGrati/feat-configure-master-resync feat: add master resync period configurability	2023-04-24 03:49:02 -07:00
AhmedGrati	7917434d38	feat: add master resync period configurability This PR adds a config option for setting the NFD API controller resync period. The resync period is only activated when the NodeFeature API has been enabled (with -enable-nodefeature-api). Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>	2023-04-24 11:52:38 +02:00
Kubernetes Prow Robot	64fe26ed37	Merge pull request #1169 from ArangoGutierrez/i1168 nfd-master: reject malformed extended resource dynamic capacity assignment	2023-04-24 00:17:15 -07:00
Carlos Eduardo Arango Gutierrez	f5df7b658c	nfd-master: reject malformed extended resource dynamic capacity assignment Reject malformed extended resource dynamic capacity assignment capacity should be in the form of domain.feature.element, add logic at func filterExtendedResources to check if true or ignore ExtendedResource, logging as an error. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>	2023-04-22 08:43:50 +02:00
Kubernetes Prow Robot	d5bccda7c5	Merge pull request #1171 from ArangoGutierrez/foundon_typo pkg/nfd-master/nfd-master.go: Fix typo	2023-04-21 12:21:11 -07:00
Kubernetes Prow Robot	c2c1e18908	Merge pull request #1173 from marquiz/devel/fix-master nfd-master: fix a crash when processing NodeFeatureRules	2023-04-21 09:49:11 -07:00
Markus Lehtonen	9523f1e411	nfd-master: fix a crash when processing NodeFeatureRules Fix a bug where nfd-master with NodeFeature API enabled would crash when NodeFeatureRule objects were processed in the case where no NodeFeature objects existed. This was caused by trying to insert values into a non-initialized NodeFeatureSpec in the code. This patch adds two safety measures to prevent that from happening in the future. First, add a constructor function for the NodeFeatureSpec type, and second, check for uninitialized object in the function inserting new functions. TODO: add unit tests for the API helper functions.	2023-04-21 19:24:08 +03:00
Carlos Eduardo Arango Gutierrez	ae22031547	pkg/nfd-master/nfd-master.go: Fix typo Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>	2023-04-21 16:17:11 +02:00
Markus Lehtonen	37306662fe	nfd-master: don't create empty annotations Make the nfd.node.kubernetes.io/feature-labels and nfd.node.kubernetes.io/extended-resources annotations behave similarly to the taints annotation: only create the annotations if some labels or extended resources are created.	2023-04-21 16:14:17 +03:00
Markus Lehtonen	f0f6bbcf36	nfd-master: configure before prune Otherwise prune will crash because of uninitialized configuration.	2023-04-20 20:38:11 +03:00
Markus Lehtonen	32db081f3a	nfd-master: support noPublish with -prune Better this way than to crash which is what currently happens with this combination.	2023-04-19 15:58:06 +03:00
Markus Lehtonen	6b2d10753f	nfd-master: re-try on node update failures Change the NFD API handler to re-try on node update failures. Will work around transient failures, making sure that failed nodes (i.e. nodes that we failed to update) don't need to wait for the 1 hour resync period before being tried again.	2023-04-13 16:30:31 +03:00
Kubernetes Prow Robot	ad07829d0a	Merge pull request #1099 from ArangoGutierrez/extended_resources_v2 Create extended resources with NodeFeatureRule	2023-04-07 08:09:15 -07:00
Fabiano Fidêncio	250aea4741	Create extended resources with NodeFeatureRule Add support for management of Extended Resources via the NodeFeatureRule CRD API. There are usage scenarios where users want to advertise features as extended resources instead of labels (or annotations). This patch enables the discovery of extended resources, via annotation and patch of node.status.capacity and node.status.allocatable. By using the NodeFeatureRule API. Co-authored-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com> Co-authored-by: Markus Lehtonen <markus.lehtonen@intel.com> Co-authored-by: Fabiano Fidêncio <fabiano.fidencio@intel.com> Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com> Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>	2023-04-07 16:14:56 +02:00
Markus Lehtonen	f64c23968a	nfd-master: fix node update Update node status before node metadata. This fixes a problem where we lose track of NFD-managed extended resources in case patching node status fails. Previously we removed all labels and annotations (including the one listing our ERs) and only after that updated node status. If node status update failed we had lost the annotation but extended resources were still there, leaving them orphaned.	2023-04-06 22:04:35 +03:00
Markus Lehtonen	cc6c20ff5f	nfd-master: disallow unprefixed and kubernetes taints Disallow taints having a key with "kubernetes.io/" or "*.kubernetes.io/" prefix. This is a precaution to protect the user from messing up with the "official" well-known taints from Kubernetes itself. The only exception is that the "nfd.node.kubernetes.io/" prefix is allowed. However, there is one allowed NFD-specific namespace (and its sub-namespaces) i.e. "feature.node.kubernetes.io" under the kubernetes.io domain that can be used for NFD-managed taints. Also disallow unprefixed taint keys. We don't add a default prefix to unprefixed taints (like we do for labels) from NodeFeatureRules. This is to prevent unpleasant surprises to users that need to manage matching tolerations for their workloads.	2023-04-06 16:12:37 +03:00
AhmedGrati	3fff409f6d	Add master config file Similar to the nfd-worker, in this PR we want to support the dynamic run-time configurability through a config file for the nfd-master. We'll use a json or yaml configuration file along with the fsnotify in order to watch for changes in the config file. As a result, we're allowing dynamic control of logging params, allowed namespaces, extended resources, label whitelisting, and denied namespaces. Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>	2023-04-03 09:52:09 +01:00
AhmedGrati	b499799364	feat: add deny-label-ns flag which supports wildcard Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>	2023-02-15 09:47:00 +01:00
Carlos Eduardo Arango Gutierrez	9b3171bce2	nfd-master: always start gRPC server Don't register gRPC LabelServer when using the NodeFeature option, only turn the gRPC server on for Health and Readiness probes.	2023-01-16 19:33:15 +01:00
Markus Lehtonen	aa97105854	Add common utility function for getting node name	2022-12-23 09:50:15 +02:00
Markus Lehtonen	f5ae3fe2c7	Simplify usage of ObjectMeta fields No need to explicitly spell out ObjectMeta as it's embedded in the object types.	2022-12-19 17:40:10 +02:00
Kubernetes Prow Robot	28a5daa338	Merge pull request #999 from marquiz/fixes/nodefeature-missing nfd-master: update node if no NodeFeature objects are present	2022-12-19 00:39:44 -08:00
Markus Lehtonen	4c955ad72c	nfd-master: update node if no NodeFeature objects are present Correctly handle the case where no NodeFeature objects exist for certain node (and NodeFeature API has been enabled with -enable-nodefeature-api). In this case all the labels should be removed.	2022-12-19 10:22:04 +02:00
Markus Lehtonen	b9c09e6674	nfd-master: update all nodes at startup when NodeFeature API enabled We want to always update all nodes at startup. Without this patch we don't get any update event from the controller if no NodeFeature or NodeFeatureRule objects exist in the cluster. Thus all nodes would stay untouched whereas we really want to remove all labels from all nodes in this case.	2022-12-14 21:49:50 +02:00
Kubernetes Prow Robot	d1b314842c	Merge pull request #989 from marquiz/devel/nodefeature-multi-object nfd-master: handle multiple NodeFeature objects	2022-12-14 07:51:34 -08:00
Markus Lehtonen	740e3af681	nfd-master: implement ratelimiter for nfd api updates Implement a naive ratelimiter for node update events originating from the nfd API. We might get a ton of events in short interval. The simplest example is startup when we get a separate Add event for every NodeFeature and NodeFeatureRule object. Without rate limiting we run "update all nodes" separately for each NodeFeatureRule object, plus, we would run "update node X" separately for each NodeFeature object targeting node X. This is a huge amount of wasted work because in principle just running "update all nodes" once should be enough.	2022-12-14 15:45:43 +02:00
Markus Lehtonen	79ed747be8	nfd-master: handle multiple NodeFeature objects Implement handling of multiple NodeFeature objects by merging all objects (targeting a certain node) into one before processing the data. This patch implements MergeInto() methods for all required data types. With support for multiple NodeFeature objects per node, the "nfd api workflow" can be easily demonstrated and tested from the command line. Creating the following object (assuming node-n exists in the cluster): apiVersion: nfd.k8s-sigs.io/v1alpha1 kind: NodeFeature metadata: labels: nfd.node.kubernetes.io/node-name: node-n name: my-features-for-node-n spec: # Features for NodeFeatureRule matching features: flags: vendor.domain-a: elements: feature-x: {} attributes: vendor.domain-b: elements: feature-y: "foo" feature-z: "123" instances: vendor.domain-c: elements: - attributes: name: "elem-1" vendor: "acme" - attributes: name: "elem-2" vendor: "acme" # Labels to be created labels: vendor-feature.enabled: "true" vendor-setting.value: "100" will create two feature labels: feature.node.kubernetes.io/vendor-feature.enabled: "true" feature.node.kubernetes.io/vendor-setting.value: "100" In addition it will advertise hidden/raw features that can be used for custom rules in NodeFeatureRule objects. Now, creating a NodeFeatureRule object: apiVersion: nfd.k8s-sigs.io/v1alpha1 kind: NodeFeatureRule metadata: name: my-rule spec: rules: - name: "my feature rule" labels: "my-feature": "true" matchFeatures: - feature: vendor.domain-a matchExpressions: feature-x: {op: Exists} - feature: vendor.domain-c matchExpressions: vendor: {op: In, value: ["acme"]} will match the features in the NodeFeature object above and cause one more label to be created: feature.node.kubernetes.io/my-feature: "true"	2022-12-14 15:44:52 +02:00
Markus Lehtonen	9f0806593d	nfd-master: rename -featurerules-controller flag to -crd-controller Deprecate the '-featurerules-controller' command line flag as the name does not describe the functionality anymore: in practice it controls the CRD controller handling both NodeFeature and NodeFeatureRule objects. The patch introduces a duplicate, more generally named, flag '-crd-controller'. A warning is printed in the log if '-featurerules-controller' flag is encountered.	2022-12-14 10:23:45 +02:00
Markus Lehtonen	6ddd87e465	nfd-master: support NodeFeature objects Add initial support for handling NodeFeature objects. With this patch nfd-master watches NodeFeature objects in all namespaces and reacts to changes in any of these. The node which a certain NodeFeature object affects is determined by the "nfd.node.kubernetes.io/node-name" annotation of the object. When a NodeFeature object targeting certain node is changed, nfd-master needs to process all other objects targeting the same node, too, because there may be dependencies between them. Add a new command line flag for selecting between gRPC and NodeFeature CRD API as the source of feature requests. Enabling NodeFeature API disables the gRPC interface. -enable-nodefeature-api enable NodeFeature CRD API for incoming feature requests, will disable the gRPC interface (defaults to false) It is not possible to serve gRPC and watch NodeFeature objects at the same time. This is deliberate to avoid labeling races e.g. by nfd-worker sending gRPC requests but NodeFeature objects in the cluster "overriding" those changes (labels from the gRPC requests will get overridden when NodeFeature objects are processed).	2022-12-14 07:31:28 +02:00
Markus Lehtonen	079655b42c	nfd-master: add error checking for CRD controller creation	2022-12-14 00:27:27 +02:00

112 commits
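
The taint-key restrictions described in commit cc6c20ff5f above can be illustrated with a small sketch. It is written in Python for brevity and is not the nfd-master implementation (which is in Go). The function name `taint_key_allowed` and the constant names are hypothetical, and the treatment of sub-namespaces is an assumption based only on the wording of the commit message.

```python
# Hypothetical sketch of the taint-key rules from commit cc6c20ff5f.
# Not the actual nfd-master code; names and details are assumptions.

# Namespace allowed as-is under the kubernetes.io domain.
NFD_NAMESPACE = "nfd.node.kubernetes.io"
# Namespace allowed together with its sub-namespaces.
FEATURE_NAMESPACE = "feature.node.kubernetes.io"
# Domain that is otherwise reserved.
RESERVED_DOMAIN = "kubernetes.io"


def taint_key_allowed(key: str) -> bool:
    """Return True if a taint key would pass the rules sketched above."""
    prefix, sep, name = key.partition("/")

    # Unprefixed keys are rejected; no default prefix is added for taints.
    if not sep or not prefix or not name:
        return False

    # The NFD-specific namespaces are allowed.
    if prefix == NFD_NAMESPACE:
        return True
    if prefix == FEATURE_NAMESPACE or prefix.endswith("." + FEATURE_NAMESPACE):
        return True

    # Everything else under kubernetes.io or *.kubernetes.io is rejected.
    if prefix == RESERVED_DOMAIN or prefix.endswith("." + RESERVED_DOMAIN):
        return False

    # Any other prefixed key (e.g. a vendor domain) is accepted.
    return True
```

For example, `taint_key_allowed("feature.node.kubernetes.io/gpu")` and `taint_key_allowed("example.com/gpu")` are accepted in this sketch, while `taint_key_allowed("node.kubernetes.io/unreachable")` and the unprefixed `taint_key_allowed("gpu")` are rejected.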