1
0
Fork 0
mirror of https://github.com/kubernetes-sigs/node-feature-discovery.git synced 2024-12-14 11:57:51 +00:00
Commit graph

261 commits

Author SHA1 Message Date
Markus Lehtonen
6b2d10753f nfd-master: re-try on node update failures
Change the NFD API handler to re-try on node update failures. Will work
around transient failures, making sure that failed nodes (i.e. nodes
that we failed to update) don't need to wait for the 1 hour resync
period before being tried again.
2023-04-13 16:30:31 +03:00
Markus Lehtonen
70ac19ea66 nfd-master: increase controller resync period to 1 hour
Increase the NFD API controller resync period from 5 minutes to 1 hour.
The resync causes nfd-master to replay all NodeFeature and
NodeFeatureRule objects, being effectively a "big hammer reset all"
button. This should only be needed as an "insurance" to fix labels et al
in case they have been manually tampered (outside NFD) and against
certain bugs in nfd itself. NFD is not supposed to manage anything
fast-changing so 1 hour should be enough.

This change only affects behavior when the NodeFeature API has been
enabled (with -enable-nodefeature-api).
2023-04-12 16:38:47 +03:00
Kubernetes Prow Robot
ad07829d0a
Merge pull request #1099 from ArangoGutierrez/extended_resources_v2
Create extended resources with NodeFeatureRule
2023-04-07 08:09:15 -07:00
Fabiano Fidêncio
250aea4741
Create extended resources with NodeFeatureRule
Add support for management of Extended Resources via the
NodeFeatureRule CRD API.

There are usage scenarios where users want to advertise features
as extended resources instead of labels (or annotations).

This patch enables the discovery of extended resources, via annotation
and patch of node.status.capacity and node.status.allocatable. By using
the NodeFeatureRule API.

Co-authored-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Co-authored-by: Markus Lehtonen <markus.lehtonen@intel.com>
Co-authored-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
2023-04-07 16:14:56 +02:00
Markus Lehtonen
f64c23968a nfd-master: fix node update
Update node status before node metadata. This fixes a problem where we
lose track of NFD-managed extended resources in case patching node
status fails. Previously we removed all labels and annotations
(including the one listing our ERs) and only after that updated node
status. If node status update failed we had lost the annotation but
extended resources were still there, leaving them orphaned.
2023-04-06 22:04:35 +03:00
Markus Lehtonen
cc6c20ff5f nfd-master: disallow unprefixed and kubernetes taints
Disallow taints having a key with "kubernetes.io/" or "*.kubernetes.io/"
prefix. This is a precaution to protect the user from messing up with
the "official" well-known taints from Kubernetes itself. The only
exception is that the "nfd.node.kubernetes.io/" prefix is allowed.

However, there is one allowed NFD-specific namespace (and its
sub-namespaces) i.e. "feature.node.kubernetes.io" under the
kubernetes.io domain that can be used for NFD-managed taints.

Also disallow unprefixed taint keys. We don't add a default prefix to
unprefixed taints (like we do for labels) from NodeFeatureRules. This is
to prevent unpleasant surprises to users that need to manage matching
tolerations for their workloads.
2023-04-06 16:12:37 +03:00
Kubernetes Prow Robot
193c552b33
Merge pull request #1084 from AhmedGrati/feat-add-master-config-file
feat: add master config file
2023-04-04 10:41:40 -07:00
AhmedGrati
3fff409f6d Add master config file
Similar to the nfd-worker, in this PR we want to support the
dynamic run-time configurability through a config file for the nfd-master.

We'll use a json or yaml configuration file along with the fsnotify in
order to watch for changes in the config file. As a result, we're
allowing dynamic control of logging params, allowed namespaces,
extended resources, label whitelisting, and denied namespaces.

Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>
2023-04-03 09:52:09 +01:00
AhmedGrati
d0a6289c0f chore: add debug dump of nfd worker configuration
Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>
2023-03-18 00:49:07 +01:00
Kubernetes Prow Robot
13f92faa77
Merge pull request #1031 from k8stopologyawareschedwg/reactive_updates
topology-updater: reactive updates
2023-03-17 10:13:17 -07:00
Talor Itzhak
5c6be580f4 reactive updates: add an option to disable the feature
Access to the kubelet state directory may raise concerns in some setups, added an option to disable it.
The feature is enabled by default.

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
2023-03-16 11:53:16 +02:00
Kubernetes Prow Robot
a06e44ef0b
Merge pull request #1083 from fmuyassarov/mockery
codegen: fix code-generation
2023-03-15 06:46:16 -07:00
Markus Lehtonen
4a8fc811be pkg/utils: add UnmarshalJSON method to StringSetVal
Make it possible to specify values in yaml as an array like

  conf:
    - foo
    - bar

Instead of unwieldy map like

  conf:
    foo:
    bar:
2023-03-14 10:53:24 +02:00
Talor Itzhak
8924213d14 topology-updater: make it possible to disable sleep-interval
Especially convenient for testing porpuses and
completely harmless

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
2023-03-12 12:43:17 +02:00
Talor Itzhak
1c12876815 topology-updater: log event type that triggered update
Specify the event type as part of the log message.
In order to reduce the log volume, make it V4

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
2023-03-12 12:37:24 +02:00
Talor Itzhak
7b248ecae2 topology-updater: update CRs when notified
When a message received via the channel,
the main loop updates the `NodeResourceTopology` objects.

The notifier will send a message via the channel if:
1. It reached the sleep timeout.
2. It detected a change in Kubelet state files

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
2023-03-12 12:37:24 +02:00
Talor Itzhak
175e0c81aa topology-updater: add kubelet-state-dir flag
On different Kubernetes flavors like OpenShift for exmaple,
the Kubelet state directory path is different. make it configurable
for maximum flexability.

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
2023-03-12 12:37:24 +02:00
Talor Itzhak
0f65b87329 kubeletnotifier: introduce kubeletnotifier package
Enabling reactive update for nfd-topology-updater
by detecting changes in Kubelet state/checkpoint files,
and signaling to the main loop to update the NodeResourceTopology
objects.

This has high value when scaling is an issue.
Having multiple pods deployed in between single update instance
might reflect incorrect resource accounting in the NRT CRs.
Example:
Time Interval = 5s
t0 - New update sent to NRT CRs
t1 - Schedule guaranteed podA
t2 - Schedule guaranteed podB
time elapsed between t0-t2 < 5 seconds,
IOW the update on t0 is the recent update.

In t2 the resource accounting reflected by NRT
is not aligned with the actual accounting because
NRT CRs doesn't reflect the change happened in t1.

With this reactive update feature we expect an update to be trigger
between t1 and t2 so the NRT objects will reflect more accurate
picture.

There still might be a scenario when the updates
aren't fast enough, but this is an additional
future planned optimization.

The notifier has two event types:
1. Time based - keeping the old behavior, trigger
an update per interval.
2. FS event - trigger an update when Kubelet state/checkpoint files modified.

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
2023-03-12 12:37:24 +02:00
Muyassarov, Feruzjon
e3a856b405 update re-generated code with make-generate results
Update generated code based on the updated from re-running make
generate.

Signed-off-by: Muyassarov, Feruzjon <feruzjon.muyassarov@intel.com>
2023-03-11 22:15:11 +02:00
Jose Luis Ojosnegros Manchón
b340d112a8 topology-updater:compute pod set fingerprint
Add an option to compute the fingerprint of the current pod set on each
node.

Report this new fingerprint using an attribute in NRT object.
2023-02-22 10:22:50 +01:00
Jose Luis Ojosnegros Manchón
1a687cb286 topology-updater: Refactor Scan to expand response
We are gonna add new data to Scan response so better introduce a new
ScanResponse struct as Scan return value to make it easier.
2023-02-22 09:56:28 +01:00
Kubernetes Prow Robot
a92614c292
Merge pull request #1051 from AhmedGrati/feat-add-deny-label-ns-with-wildcard
feat: add deny-label-ns flag which supports wildcard
2023-02-15 03:42:25 -08:00
Kubernetes Prow Robot
38cc370e69
Merge pull request #1054 from PiotrProkop/use-new-nrt-api
Advertise TopologyManger policy and scope as Attributes in NRT api v1alpha2
2023-02-15 01:12:25 -08:00
AhmedGrati
b499799364 feat: add deny-label-ns flag which supports wildcard
Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>
2023-02-15 09:47:00 +01:00
PiotrProkop
f76fc5bf6b Read Kubelet configuration the same way as Kubelet to apply default values
Signed-off-by: PiotrProkop <pprokop@nvidia.com>
2023-02-15 09:27:25 +01:00
Ville Pihlava
b1c6b229fe Add discovery duration logging. 2023-02-13 12:55:57 +02:00
pprokop
5484babcb1 Advertise TopologyManger policy and scope as Attributes
Signed-off-by: pprokop <pprokop@nvidia.com>
2023-02-10 12:03:11 +01:00
Kubernetes Prow Robot
ac271b3c29
Merge pull request #1050 from VillePihlava/interval-fix
Change nfd-worker to use Ticker instead of After.
2023-02-09 07:54:22 -08:00
Ville Pihlava
2101cb20e4 Change nfd-worker to use Ticker instead of After. 2023-02-09 17:14:39 +02:00
Jose Luis Ojosnegros Manchón
2967f3307a nrt-api: move from v1alpha1 to v1alpha2 2023-02-09 12:29:54 +01:00
Carlos Eduardo Arango Gutierrez
9b3171bce2
nfd-master: always start gRPC server
Don't register gRPC LabelServer when using the NodeFeature option, only
turn the gRPC server on for Health and Readiness probes.
2023-01-16 19:33:15 +01:00
Kubernetes Prow Robot
ea921a8b14
Merge pull request #1024 from PiotrProkop/nrt-garbage-collector
Add NRT garbage collector
2023-01-11 01:59:44 -08:00
PiotrProkop
59afae50ba Add NodeResourceTopology garbage collector
NodeResourceTopology(aka NRT) custom resource is used to enable NUMA aware Scheduling in Kubernetes.
As of now node-feature-discovery daemons are used to advertise those
resources but there is no service responsible for removing obsolete
objects(without corresponding Kubernetes node).

This patch adds new daemon called nfd-topology-gc which removes old
NRTs.

Signed-off-by: PiotrProkop <pprokop@nvidia.com>
2023-01-11 10:15:21 +01:00
PiotrProkop
1bae2867e2 Release v0.0.13 of NodeResourceTopology API added missing TopologyManagerPolicy.
Expose new policies:
* RestrictedContainerLevel
* RestrictedPodLevel
* BestEffortContainerLevel
* BestEffortPodLevel

Signed-off-by: PiotrProkop <pprokop@nvidia.com>
2023-01-09 16:02:12 +01:00
Kubernetes Prow Robot
8eb6640754
Merge pull request #1020 from marquiz/devel/worker-refactor
worker: move code
2022-12-27 00:45:34 -08:00
Kubernetes Prow Robot
e97b2c1579
Merge pull request #1017 from marquiz/devel/nfd-api-optional-fields
apis/nfd: make all fields in NodeFeatureSpec optional
2022-12-27 00:45:28 -08:00
Markus Lehtonen
1026d91d12 worker: move code
Simplify code bu dropping the unnecessary base client package.
2022-12-23 11:38:21 +02:00
Markus Lehtonen
0283f68702 topology-updater: move code
Move and rename the Go package. It has nothing to do with NFD gRPC
client anymore so move it out of the nfd-client package.
2022-12-23 11:37:46 +02:00
Markus Lehtonen
aa97105854 Add common utility function for getting node name 2022-12-23 09:50:15 +02:00
Markus Lehtonen
dfda9bccad apis/nfd: update auto-generated code 2022-12-22 17:58:20 +02:00
Markus Lehtonen
a4fc15a424 apis/nfd: make all fields in NodeFeatureSpec optional
Don't require features to be specified. The creator possibly only wants
to create labels or only some types of features. No need to specify
empty structs for the unused fields.
2022-12-22 17:53:42 +02:00
Markus Lehtonen
f5ae3fe2c7 Simplify usage of ObjectMeta fields
No need to explicitly spell out ObjectMeta as it's embedded in the
object types.
2022-12-19 17:40:10 +02:00
Kubernetes Prow Robot
28a5daa338
Merge pull request #999 from marquiz/fixes/nodefeature-missing
nfd-master: update node if no NodeFeature objects are present
2022-12-19 00:39:44 -08:00
Markus Lehtonen
4c955ad72c nfd-master: update node if no NodeFeature objects are present
Correctly handle the case where no NodeFeature objects exist for certain
node (and NodeFeature API has been enabled with
-enable-nodefeature-api). In this case all the labels should be removed.
2022-12-19 10:22:04 +02:00
Markus Lehtonen
b9c09e6674 nfd-master: update all nodes at startup when NodeFeature API enabled
We want to always update all nodes at startup. Without this patch we
don't get any update event from the controller if no NodeFeature or
NodeFeatureRule objects exist in the cluster. Thus all nodes would stay
untouched whereas we really want to remove all labels from all nodes in
this case.
2022-12-14 21:49:50 +02:00
Kubernetes Prow Robot
d1b314842c
Merge pull request #989 from marquiz/devel/nodefeature-multi-object
nfd-master: handle multiple NodeFeature objects
2022-12-14 07:51:34 -08:00
Markus Lehtonen
740e3af681 nfd-master: implement ratelimiter for nfd api updates
Implement a naive ratelimiter for node update events originating from
the nfd API. We might get a ton of events in short interval. The
simplest example is startup when we get a separate Add event for every
NodeFeature and NodeFeatureRule object. Without rate limiting we
run "update all nodes" separately for each NodeFeatureRule object, plus,
we would run "update node X" separately for each NodeFeature object
targeting node X. This is a huge amount of wasted work because in
principle just running "update all nodes" once should be enough.
2022-12-14 15:45:43 +02:00
Markus Lehtonen
79ed747be8 nfd-master: handle multiple NodeFeature objects
Implement handling of multiple NodeFeature objects by merging all
objects (targeting a certain node) into one before processing the data.
This patch implements MergeInto() methods for all required data types.

With support for multiple NodeFeature objects per node, The "nfd api
workflow" can be easily demonstrated and tested from the command line.
Creating the folloiwing object (assuming node-n exists in the cluster):

    apiVersion: nfd.k8s-sigs.io/v1alpha1
    kind: NodeFeature
    metadata:
      labels:
        nfd.node.kubernetes.io/node-name: node-n
      name: my-features-for-node-n
    spec:
      # Features for NodeFeatureRule matching
      features:
        flags:
          vendor.domain-a:
            elements:
              feature-x: {}
        attributes:
          vendor.domain-b:
            elements:
              feature-y: "foo"
              feature-z: "123"
        instances:
          vendor.domain-c:
            elements:
            - attributes:
                name: "elem-1"
                vendor: "acme"
            - attributes:
                name: "elem-2"
                vendor: "acme"
      # Labels to be created
      labels:
        vendor-feature.enabled: "true"
        vendor-setting.value: "100"

will create two feature labes:

    feature.node.kubernetes.io/vendor-feature.enabled: "true"
    feature.node.kubernetes.io/vendor-setting.value: "100"

In addition it will advertise hidden/raw features that can be used for
custom rules in NodeFeatureRule objects. Now, creating a NodeFeatureRule
object:

    apiVersion: nfd.k8s-sigs.io/v1alpha1
    kind: NodeFeatureRule
    metadata:
      name: my-rule
    spec:
      rules:
        - name: "my feature rule"
          labels:
            "my-feature": "true"
          matchFeatures:
            - feature: vendor.domain-a
              matchExpressions:
                feature-x: {op: Exists}
            - feature: vendor.domain-c
              matchExpressions:
                vendor: {op: In, value: ["acme"]}

will match the features in the NodeFeature object above and cause one
more label to be created:

    feature.node.kubernetes.io/my-feature: "true"
2022-12-14 15:44:52 +02:00
Markus Lehtonen
9f0806593d nfd-master: rename -featurerules-controller flag to -crd-controller
Deprecate the '-featurerules-controller' command line flag as the name
does not describe the functionality anymore: in practice it controls the
CRD controller handling both NodeFeature and NodeFeatureRule objects.
The patch introduces a duplicate, more generally named, flag
'-crd-controller'. A warning is printed in the log if
'-featurerules-controller' flag is encountered.
2022-12-14 10:23:45 +02:00
Markus Lehtonen
6ddd87e465 nfd-master: support NodeFeature objects
Add initial support for handling NodeFeature objects. With this patch
nfd-master watches NodeFeature objects in all namespaces and reacts to
changes in any of these. The node which a certain NodeFeature object
affects is determined by the "nfd.node.kubernetes.io/node-name"
annotation of the object. When a NodeFeature object targeting certain
node is changed, nfd-master needs to process all other objects targeting
the same node, too, because there may be dependencies between them.

Add a new command line flag for selecting between gRPC and NodeFeature
CRD API as the source of feature requests. Enabling NodeFeature API
disables the gRPC interface.

 -enable-nodefeature-api   enable NodeFeature CRD API for incoming
                           feature requests, will disable the gRPC
                           interface (defaults to false)

It is not possible to serve gRPC and watch NodeFeature objects at the
same time. This is deliberate to avoid labeling races e.g. by nfd-worker
sending gRPC requests but NodeFeature objects in the cluster
"overriding" those changes (labels from the gRPC requests will get
overridden when NodeFeature objects are processed).
2022-12-14 07:31:28 +02:00