Change the NFD API handler to retry on node update failures. This works
around transient failures and ensures that failed nodes (i.e. nodes
that we failed to update) don't need to wait for the 1-hour resync
period before being retried.
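As an illustration only, here is a minimal Go sketch of the retry idea, where updateNode is a hypothetical stand-in for nfd-master's real node update call and the retry count and backoff values are made up:

package sketch

import (
    "fmt"
    "time"
)

// updateNodeWithRetry retries a failing node update a few times with a
// growing delay, instead of leaving the node untouched until the next resync.
func updateNodeWithRetry(nodeName string, updateNode func(string) error) error {
    var err error
    delay := time.Second
    for attempt := 0; attempt < 3; attempt++ {
        if err = updateNode(nodeName); err == nil {
            return nil
        }
        time.Sleep(delay) // assume the failure is transient and back off
        delay *= 2
    }
    return fmt.Errorf("node %q: giving up after retries: %w", nodeName, err)
}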
Increase the NFD API controller resync period from 5 minutes to 1 hour.
The resync causes nfd-master to replay all NodeFeature and
NodeFeatureRule objects, effectively serving as a "big hammer, reset
everything" button. It should only be needed as insurance, to fix labels
and the like in case they have been tampered with manually (outside NFD),
and to guard against certain bugs in NFD itself. NFD is not supposed to
manage anything fast-changing, so 1 hour should be enough.
This change only affects behavior when the NodeFeature API has been
enabled (with -enable-nodefeature-api).
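For reference, a hedged sketch of where such a resync period is plugged in; client-go core informers are used here as a stand-in for NFD's generated NodeFeature/NodeFeatureRule informers:

package sketch

import (
    "time"

    "k8s.io/client-go/informers"
    "k8s.io/client-go/kubernetes"
)

// newInformerFactory shows the resync knob: every cached object is replayed
// to the event handlers once per resyncPeriod, here one hour instead of five
// minutes.
func newInformerFactory(cs kubernetes.Interface) informers.SharedInformerFactory {
    const resyncPeriod = time.Hour
    return informers.NewSharedInformerFactory(cs, resyncPeriod)
}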
Add support for management of Extended Resources via the
NodeFeatureRule CRD API.
There are usage scenarios where users want to advertise features
as extended resources instead of labels (or annotations).
This patch enables extended resources to be advertised through the
NodeFeatureRule API: the managed resources are tracked in a node
annotation and applied by patching node.status.capacity and
node.status.allocatable.
Co-authored-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Co-authored-by: Markus Lehtonen <markus.lehtonen@intel.com>
Co-authored-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
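Below is a minimal client-go sketch of the status-patch mechanism described above; the patch shape and the example resource name are illustrative rather than NFD's exact implementation:

package sketch

import (
    "context"
    "encoding/json"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/types"
    "k8s.io/client-go/kubernetes"
)

// patchExtendedResources merge-patches the node status subresource so that
// the given extended resources (name -> quantity string, e.g.
// "vendor.example/foo": "4") show up in both capacity and allocatable.
func patchExtendedResources(ctx context.Context, cs kubernetes.Interface,
    nodeName string, resources map[string]string) error {
    patch := map[string]interface{}{
        "status": map[string]interface{}{
            "capacity":    resources,
            "allocatable": resources,
        },
    }
    data, err := json.Marshal(patch)
    if err != nil {
        return err
    }
    _, err = cs.CoreV1().Nodes().Patch(ctx, nodeName,
        types.StrategicMergePatchType, data, metav1.PatchOptions{}, "status")
    return err
}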
Update node status before node metadata. This fixes a problem where we
lose track of NFD-managed extended resources if patching the node status
fails. Previously we removed all labels and annotations (including the
one listing our extended resources) and only after that updated the node
status. If the status update failed, we had already lost the annotation
while the extended resources were still in place, leaving them orphaned.
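A condensed sketch of the fixed ordering, with hypothetical patchStatus/patchMetadata helpers standing in for the real patching code:

package sketch

// updateNode patches the status subresource first and only touches labels
// and annotations (including the annotation listing NFD-managed extended
// resources) once that has succeeded.
func updateNode(patchStatus, patchMetadata func() error) error {
    if err := patchStatus(); err != nil {
        // Bail out early: the bookkeeping annotation is still intact, so the
        // extended resources are not orphaned and can be cleaned up later.
        return err
    }
    return patchMetadata()
}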
Disallow taints having a key with a "kubernetes.io/" or
"*.kubernetes.io/" prefix. This is a precaution to protect the user from
messing up the "official" well-known taints from Kubernetes itself. The
only exceptions are NFD's own namespaces: the "nfd.node.kubernetes.io/"
prefix is allowed, as is the "feature.node.kubernetes.io" namespace (and
its sub-namespaces) under the kubernetes.io domain, which can be used
for NFD-managed taints.
Also disallow unprefixed taint keys. We don't add a default prefix to
unprefixed taints from NodeFeatureRules (as we do for labels). This is
to prevent unpleasant surprises for users who need to manage matching
tolerations for their workloads.
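A rough sketch of the kind of key check described above; the allowed-namespace list follows this text and is not taken from NFD's source:

package sketch

import "strings"

// allowedTaintKey rejects unprefixed keys and keys under the kubernetes.io
// domain, except for the NFD-specific namespaces.
func allowedTaintKey(key string) bool {
    if !strings.Contains(key, "/") {
        return false // no default prefix is added for taints
    }
    ns := strings.SplitN(key, "/", 2)[0]
    if ns == "kubernetes.io" || strings.HasSuffix(ns, ".kubernetes.io") {
        return ns == "nfd.node.kubernetes.io" ||
            ns == "feature.node.kubernetes.io" ||
            strings.HasSuffix(ns, ".feature.node.kubernetes.io")
    }
    return true
}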
Similar to nfd-worker, this PR adds dynamic run-time configurability to
nfd-master through a config file. A JSON or YAML configuration file is
used together with fsnotify in order to watch for changes to it. As a
result, logging parameters, allowed namespaces, extended resources,
label whitelisting and denied namespaces can now be controlled
dynamically.
Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>
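As a hedged sketch of the mechanism described above, assuming a hypothetical config struct and option names:

package sketch

import (
    "log"
    "os"

    "github.com/fsnotify/fsnotify"
    "sigs.k8s.io/yaml"
)

// config is a hypothetical subset of the dynamically reloadable options.
type config struct {
    LabelWhiteList string   `json:"labelWhiteList"`
    DenyLabelNs    []string `json:"denyLabelNs"`
    ResourceLabels []string `json:"resourceLabels"`
}

// watchConfig re-reads and re-applies the config file whenever fsnotify
// reports that it was written to or re-created.
func watchConfig(path string, apply func(config)) error {
    w, err := fsnotify.NewWatcher()
    if err != nil {
        return err
    }
    defer w.Close()
    if err := w.Add(path); err != nil {
        return err
    }
    for e := range w.Events {
        if e.Op&(fsnotify.Write|fsnotify.Create) == 0 {
            continue
        }
        data, err := os.ReadFile(path)
        if err != nil {
            log.Printf("reading %s: %v", path, err)
            continue
        }
        var c config
        if err := yaml.Unmarshal(data, &c); err != nil { // YAML and JSON both parse
            log.Printf("parsing %s: %v", path, err)
            continue
        }
        apply(c) // dynamically apply the new settings
    }
    return nil
}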
Access to the kubelet state directory may raise concerns in some setups,
so add an option to disable it. The feature is enabled by default.
Signed-off-by: Talor Itzhak <titzhak@redhat.com>
When a message is received via the channel,
the main loop updates the `NodeResourceTopology` objects.
The notifier sends a message via the channel if:
1. It reached the sleep timeout, or
2. It detected a change in the Kubelet state files.
Signed-off-by: Talor Itzhak <titzhak@redhat.com>
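A minimal sketch of the consumer side described above, with hypothetical types; the real main loop carries more state:

package sketch

// event's payload is irrelevant for the sketch.
type event struct{}

// runMainLoop blocks on the notifier channel and refreshes the
// NodeResourceTopology objects on every message, regardless of whether it
// came from the timer or from a filesystem event.
func runMainLoop(events <-chan event, updateNRT func() error, logErr func(error)) {
    for range events {
        if err := updateNRT(); err != nil {
            logErr(err)
        }
    }
}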
On different Kubernetes flavors, like OpenShift for example,
the Kubelet state directory path is different. Make it configurable
for maximum flexibility.
Signed-off-by: Talor Itzhak <titzhak@redhat.com>
Enable reactive updates for nfd-topology-updater
by detecting changes in the Kubelet state/checkpoint files,
and signaling the main loop to update the NodeResourceTopology
objects.
This is especially valuable when scaling is an issue:
if multiple pods are deployed between two consecutive updates,
the NRT CRs may reflect incorrect resource accounting.
Example:
Time interval = 5s
t0 - New update sent to NRT CRs
t1 - Schedule guaranteed podA
t2 - Schedule guaranteed podB
The time elapsed between t0 and t2 is less than 5 seconds,
IOW the update at t0 is still the most recent one.
At t2 the resource accounting reflected by NRT
is therefore not aligned with the actual accounting, because
the NRT CRs don't reflect the change that happened at t1.
With this reactive update feature we expect an update to be triggered
between t1 and t2, so the NRT objects will reflect a more accurate
picture.
There might still be scenarios where the updates
aren't fast enough, but addressing that is a planned
future optimization.
The notifier has two event types:
1. Time based - keeps the old behavior, triggering
an update per interval.
2. FS event - triggers an update when the Kubelet state/checkpoint files are modified.
Signed-off-by: Talor Itzhak <titzhak@redhat.com>
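A hedged sketch of such a notifier, combining a ticker with an fsnotify watcher; the names and event plumbing are illustrative, not the actual nfd-topology-updater API:

package sketch

import (
    "time"

    "github.com/fsnotify/fsnotify"
)

// EventType distinguishes the two triggers described above.
type EventType int

const (
    IntervalBased EventType = iota // periodic timer fired
    FSEvent                        // kubelet state/checkpoint file changed
)

// notify sends an event downstream either when the sleep interval elapses or
// when a file under stateDir is modified.
func notify(stateDir string, interval time.Duration, out chan<- EventType) error {
    w, err := fsnotify.NewWatcher()
    if err != nil {
        return err
    }
    defer w.Close()
    if err := w.Add(stateDir); err != nil {
        return err
    }
    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for {
        select {
        case <-ticker.C:
            out <- IntervalBased
        case e := <-w.Events:
            if e.Op&(fsnotify.Write|fsnotify.Create|fsnotify.Remove) != 0 {
                out <- FSEvent
            }
        case err := <-w.Errors:
            return err
        }
    }
}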
The NodeResourceTopology (aka NRT) custom resource is used to enable NUMA-aware scheduling in Kubernetes.
As of now, node-feature-discovery daemons advertise those
resources, but there is no service responsible for removing obsolete
objects (ones without a corresponding Kubernetes node).
This patch adds a new daemon called nfd-topology-gc which removes such
stale NRTs.
Signed-off-by: PiotrProkop <pprokop@nvidia.com>
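The core of the garbage collection can be sketched as a simple set difference between node names and NRT object names (NRT objects are named after their node); listing and deletion via the real clients are omitted here:

package sketch

// staleNRTs returns the names of NodeResourceTopology objects that have no
// corresponding Kubernetes node and should therefore be garbage-collected.
// The caller is assumed to have listed both nodes and NRT objects already.
func staleNRTs(nodeNames, nrtNames []string) []string {
    nodes := make(map[string]struct{}, len(nodeNames))
    for _, n := range nodeNames {
        nodes[n] = struct{}{}
    }
    var stale []string
    for _, nrt := range nrtNames {
        if _, ok := nodes[nrt]; !ok {
            stale = append(stale, nrt)
        }
    }
    return stale
}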
Don't require features to be specified. The creator may want to create
only labels, or only some types of features; there is no need to specify
empty structs for the unused fields.
Correctly handle the case where no NodeFeature objects exist for a
certain node (and the NodeFeature API has been enabled with
-enable-nodefeature-api). In this case all labels should be removed.
We want to always update all nodes at startup. Without this patch we
don't get any update event from the controller if no NodeFeature or
NodeFeatureRule objects exist in the cluster. Thus all nodes would stay
untouched whereas we really want to remove all labels from all nodes in
this case.
Implement a naive rate limiter for node update events originating from
the NFD API. We might get a ton of events in a short interval. The
simplest example is startup, when we get a separate Add event for every
NodeFeature and NodeFeatureRule object. Without rate limiting we
run "update all nodes" separately for each NodeFeatureRule object and,
in addition, "update node X" separately for each NodeFeature object
targeting node X. This is a huge amount of wasted work because, in
principle, running "update all nodes" once should be enough.
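A naive coalescing loop along these lines illustrates the idea (channel-based and simplified; not the actual implementation):

package sketch

import "time"

// rateLimitedUpdater coalesces a burst of "update all nodes" requests into a
// single run per interval.
func rateLimitedUpdater(requests <-chan struct{}, interval time.Duration, updateAllNodes func()) {
    for range requests {
        // Drain everything that arrives during the quiet period so that a
        // flood of Add events at startup results in just one update pass.
        timeout := time.After(interval)
    drain:
        for {
            select {
            case <-requests:
            case <-timeout:
                break drain
            }
        }
        updateAllNodes()
    }
}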
Implement handling of multiple NodeFeature objects by merging all
objects (targeting a certain node) into one before processing the data.
This patch implements MergeInto() methods for all required data types.
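A simplified sketch of the merge semantics, using a made-up attribute type in place of the real NFD feature types:

package sketch

// AttributeFeatureSet is a simplified stand-in for the NFD feature types that
// gained MergeInto() methods; only the map-merge idea is shown here.
type AttributeFeatureSet struct {
    Elements map[string]string
}

// MergeInto copies the receiver's elements into out, overriding duplicates,
// so that several NodeFeature objects targeting the same node collapse into
// a single set of features before processing.
func (in *AttributeFeatureSet) MergeInto(out *AttributeFeatureSet) {
    if out.Elements == nil {
        out.Elements = make(map[string]string, len(in.Elements))
    }
    for k, v := range in.Elements {
        out.Elements[k] = v
    }
}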
With support for multiple NodeFeature objects per node, the "NFD API
workflow" can easily be demonstrated and tested from the command line.
Creating the following object (assuming node-n exists in the cluster):
apiVersion: nfd.k8s-sigs.io/v1alpha1
kind: NodeFeature
metadata:
  labels:
    nfd.node.kubernetes.io/node-name: node-n
  name: my-features-for-node-n
spec:
  # Features for NodeFeatureRule matching
  features:
    flags:
      vendor.domain-a:
        elements:
          feature-x: {}
    attributes:
      vendor.domain-b:
        elements:
          feature-y: "foo"
          feature-z: "123"
    instances:
      vendor.domain-c:
        elements:
        - attributes:
            name: "elem-1"
            vendor: "acme"
        - attributes:
            name: "elem-2"
            vendor: "acme"
  # Labels to be created
  labels:
    vendor-feature.enabled: "true"
    vendor-setting.value: "100"
will create two feature labels:
feature.node.kubernetes.io/vendor-feature.enabled: "true"
feature.node.kubernetes.io/vendor-setting.value: "100"
In addition it will advertise hidden/raw features that can be used for
custom rules in NodeFeatureRule objects. Now, creating a NodeFeatureRule
object:
apiVersion: nfd.k8s-sigs.io/v1alpha1
kind: NodeFeatureRule
metadata:
  name: my-rule
spec:
  rules:
  - name: "my feature rule"
    labels:
      "my-feature": "true"
    matchFeatures:
    - feature: vendor.domain-a
      matchExpressions:
        feature-x: {op: Exists}
    - feature: vendor.domain-c
      matchExpressions:
        vendor: {op: In, value: ["acme"]}
will match the features in the NodeFeature object above and cause one
more label to be created:
feature.node.kubernetes.io/my-feature: "true"
Deprecate the '-featurerules-controller' command line flag as the name
does not describe the functionality anymore: in practice it controls the
CRD controller handling both NodeFeature and NodeFeatureRule objects.
The patch introduces a duplicate, more generally named flag,
'-crd-controller'. A warning is printed in the log if the deprecated
'-featurerules-controller' flag is encountered.
Add initial support for handling NodeFeature objects. With this patch
nfd-master watches NodeFeature objects in all namespaces and reacts to
changes in any of them. The node that a certain NodeFeature object
affects is determined by the "nfd.node.kubernetes.io/node-name"
label of the object. When a NodeFeature object targeting a certain
node is changed, nfd-master needs to process all other objects targeting
the same node, too, because there may be dependencies between them.
Add a new command line flag for selecting between gRPC and NodeFeature
CRD API as the source of feature requests. Enabling NodeFeature API
disables the gRPC interface.
  -enable-nodefeature-api    enable NodeFeature CRD API for incoming
                             feature requests, will disable the gRPC
                             interface (defaults to false)
It is not possible to serve gRPC and watch NodeFeature objects at the
same time. This is deliberate, to avoid labeling races, e.g. nfd-worker
sending gRPC requests while NodeFeature objects in the cluster
"override" those changes (labels from the gRPC requests would get
overridden when the NodeFeature objects are processed).