mirror of https://github.com/kubernetes-sigs/node-feature-discovery.git synced 2024-12-14 11:57:51 +00:00
Commit graph

283 commits

Author SHA1 Message Date
Kubernetes Prow Robot
d84248bc7d
Merge pull request #1190 from marquiz/devel/api-unit-tests
apis/nfd: add unit tests for Feature type
2023-04-26 23:32:15 -07:00
Markus Lehtonen
77011a775f nfd-master: log node name when processing NodeFeatureRules 2023-04-26 07:22:30 +03:00
Markus Lehtonen
dda7b195ee apis/nfd: add unit tests for Feature type 2023-04-25 19:40:35 +03:00
Kubernetes Prow Robot
54bd4c5d74
Merge pull request #1167 from PiotrProkop/fix-reactive-updates
nfd-topology-updater: fix wrong kubelet_internal_checkpoint path and compare basename to full path
2023-04-24 04:41:01 -07:00
pprokop
5a9a12151c nfd-topology-updater: fix kubelet state file notifier
- kubelet_internal_checkpoint file is in /var/lib/kubelet/device-plugins not /var/lib/kubelet
  fsWatcher doesn't watch dirs recursively
- e.Name returned from fsWatcher events is a full path not a basename

Signed-off-by: pprokop <pprokop@nvidia.com>
2023-04-24 13:21:56 +02:00
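The second point of the notifier fix above (5a9a12151c) can be sketched as follows. This is a minimal illustration, not the project's code: it assumes the watcher reports each event with a full path and that the files of interest are tracked by basename. The function name `isStateFileEvent` and the `watched` set are hypothetical.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// isStateFileEvent reports whether an event path refers to one of the watched
// state files. eventName is assumed to be a full path, so it is reduced to its
// basename before the lookup. (Hypothetical helper, for illustration only.)
func isStateFileEvent(eventName string, basenames map[string]struct{}) bool {
	_, ok := basenames[filepath.Base(eventName)]
	return ok
}

func main() {
	watched := map[string]struct{}{"kubelet_internal_checkpoint": {}}

	// Comparing the full path directly against the basename would never match.
	fmt.Println(isStateFileEvent("/var/lib/kubelet/device-plugins/kubelet_internal_checkpoint", watched))
	fmt.Println(isStateFileEvent("/var/lib/kubelet/device-plugins/some_other_file", watched))
}
```

Because the watcher is described as non-recursive, the directory that actually contains the file would also have to be watched explicitly; that part is not shown here.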
Kubernetes Prow Robot
2356223ffc
Merge pull request #1139 from AhmedGrati/feat-configure-master-resync
feat: add master resync period configurability
2023-04-24 03:49:02 -07:00
AhmedGrati
7917434d38 feat: add master resync period configurability
This PR adds a config option for setting the NFD API controller resync period.
The resync period is only activated when the NodeFeature API has been
enabled (with -enable-nodefeature-api).

Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>
2023-04-24 11:52:38 +02:00
Kubernetes Prow Robot
64fe26ed37
Merge pull request #1169 from ArangoGutierrez/i1168
nfd-master: reject malformed extended resource dynamic capacity assignment
2023-04-24 00:17:15 -07:00
Carlos Eduardo Arango Gutierrez
f5df7b658c
nfd-master: reject malformed extended resource dynamic capacity assignment
Reject malformed extended resource dynamic capacity assignments.
The capacity should be in the form domain.feature.element.
Add logic to func filterExtendedResources to check this and otherwise
ignore the ExtendedResource, logging an error.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
2023-04-22 08:43:50 +02:00
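The form check described in the commit above (f5df7b658c) might look roughly like the sketch below. It only assumes what the message states, namely that a dynamic capacity reference has the three parts domain.feature.element. The function name `validCapacityRef` is hypothetical, and whether further dots are allowed inside the element is not stated in the message; this sketch rejects them.

```go
package main

import (
	"fmt"
	"strings"
)

// validCapacityRef reports whether ref consists of exactly three non-empty,
// dot-separated parts (domain.feature.element). Hypothetical helper.
func validCapacityRef(ref string) bool {
	parts := strings.Split(ref, ".")
	if len(parts) != 3 {
		return false
	}
	for _, p := range parts {
		if p == "" {
			return false
		}
	}
	return true
}

func main() {
	for _, ref := range []string{"domain.feature.element", "domain.feature", "domain..element"} {
		if validCapacityRef(ref) {
			fmt.Printf("%q accepted\n", ref)
		} else {
			// A caller could skip the extended resource here and log an error.
			fmt.Printf("%q rejected as malformed\n", ref)
		}
	}
}
```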
Kubernetes Prow Robot
d5bccda7c5
Merge pull request #1171 from ArangoGutierrez/foundon_typo
pkg/nfd-master/nfd-master.go: Fix typo
2023-04-21 12:21:11 -07:00
Kubernetes Prow Robot
c2c1e18908
Merge pull request #1173 from marquiz/devel/fix-master
nfd-master: fix a crash when processing NodeFeatureRules
2023-04-21 09:49:11 -07:00
Markus Lehtonen
9523f1e411 nfd-master: fix a crash when processing NodeFeatureRules
Fix a bug where nfd-master with NodeFeature API enabled would crash
when NodeFeatureRule objects were processed in the case where no
NodeFeature objects existed. This was caused by trying to insert values
into a non-initialized NodeFeatureSpec in the code.

This patch adds two safety measures to prevent that from happening in
the future. First, add a constructor function for the NodeFeatureSpec
type, and second, check for an uninitialized object in the function
that inserts new values.

TODO: add unit tests for the API helper functions.
2023-04-21 19:24:08 +03:00
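The two safety measures named in the crash fix above (9523f1e411) follow a common Go pattern for types that hold maps. The sketch below uses a hypothetical, simplified `Spec` type rather than the real NodeFeatureSpec, whose fields are not shown in this log.

```go
package main

import "fmt"

// Spec is a hypothetical stand-in for a spec type that holds a map.
type Spec struct {
	Attributes map[string]string
}

// NewSpec is the constructor: it returns a Spec whose map is initialized and
// therefore safe to write to.
func NewSpec() *Spec {
	return &Spec{Attributes: make(map[string]string)}
}

// InsertAttributes copies vals into the spec. It initializes the map first in
// case the spec was created without the constructor; writing to a nil map
// would otherwise panic.
func (s *Spec) InsertAttributes(vals map[string]string) {
	if s.Attributes == nil {
		s.Attributes = make(map[string]string, len(vals))
	}
	for k, v := range vals {
		s.Attributes[k] = v
	}
}

func main() {
	var zero Spec // Attributes is nil here
	zero.InsertAttributes(map[string]string{"a": "1"})
	fmt.Println(zero.Attributes["a"], len(NewSpec().Attributes))
}
```

Either measure alone would prevent the panic; having both covers callers that bypass the constructor.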
Carlos Eduardo Arango Gutierrez
ae22031547
pkg/nfd-master/nfd-master.go: Fix typo
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
2023-04-21 16:17:11 +02:00
Markus Lehtonen
37306662fe nfd-master: don't create empty annotations
Make the nfd.node.kubernetes.io/feature-labels and
nfd.node.kubernetes.io/extended-resources annotations behave similarly to
the taints annotation: only create the annotations if some labels or
extended resources are created.
2023-04-21 16:14:17 +03:00
Markus Lehtonen
f0f6bbcf36 nfd-master: configure before prune
Otherwise prune will crash because of uninitialized configuration.
2023-04-20 20:38:11 +03:00
Markus Lehtonen
32db081f3a nfd-master: support noPublish with -prune
Better this way than to crash which is what currently happens with this
combination.
2023-04-19 15:58:06 +03:00
Markus Lehtonen
18f7bfa8e8 generate: update mockery to v2.25.1
Bump the vektra/mockery tool to the latest release.
2023-04-19 13:33:42 +03:00
Markus Lehtonen
117baac1a6 generate: update protoc to v22.3 2023-04-19 10:44:55 +03:00
Markus Lehtonen
ca7ed04a34 generate: update auto-generated code
Re-run "make generate".
2023-04-19 09:49:17 +03:00
Markus Lehtonen
e2d5ba1a2b pkg/podres: update mocked PodResourcesListerClient
Update mocked implementation of
k8s.io/kubelet/pkg/apis/podresources/v1.PodResourcesListerClient. The
mocked implementation is moved to a separate "mocks" subpackage as it's
for an external interface.

This patch also adds code for auto-generation for the mocked interface.
2023-04-18 20:51:51 +03:00
Kubernetes Prow Robot
8d71ed6755
Merge pull request #1086 from AhmedGrati/feat-support-builtin-kernel-mods
feat: support builtin kernel mods
2023-04-13 10:30:40 -07:00
Markus Lehtonen
6b2d10753f nfd-master: re-try on node update failures
Change the NFD API handler to re-try on node update failures. Will work
around transient failures, making sure that failed nodes (i.e. nodes
that we failed to update) don't need to wait for the 1 hour resync
period before being tried again.
2023-04-13 16:30:31 +03:00
AhmedGrati
109caa1f28 feat: support builtin kernel mods
This PR adds the combination of dynamic and builtin kernel modules into
one feature called `kernel.enabledmodule`. It's a superset of the
`kernel.loadedmodule` feature.

Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>
2023-04-13 10:19:24 +01:00
Markus Lehtonen
70ac19ea66 nfd-master: increase controller resync period to 1 hour
Increase the NFD API controller resync period from 5 minutes to 1 hour.
The resync causes nfd-master to replay all NodeFeature and
NodeFeatureRule objects, being effectively a "big hammer reset all"
button. This should only be needed as an "insurance" to fix labels et al
in case they have been manually tampered with (outside NFD) and against
certain bugs in NFD itself. NFD is not supposed to manage anything
fast-changing so 1 hour should be enough.

This change only affects behavior when the NodeFeature API has been
enabled (with -enable-nodefeature-api).
2023-04-12 16:38:47 +03:00
Kubernetes Prow Robot
ad07829d0a
Merge pull request #1099 from ArangoGutierrez/extended_resources_v2
Create extended resources with NodeFeatureRule
2023-04-07 08:09:15 -07:00
Fabiano Fidêncio
250aea4741
Create extended resources with NodeFeatureRule
Add support for management of Extended Resources via the
NodeFeatureRule CRD API.

There are usage scenarios where users want to advertise features
as extended resources instead of labels (or annotations).

This patch enables the discovery of extended resources through the
NodeFeatureRule API, via an annotation and a patch of
node.status.capacity and node.status.allocatable.

Co-authored-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Co-authored-by: Markus Lehtonen <markus.lehtonen@intel.com>
Co-authored-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
2023-04-07 16:14:56 +02:00
Markus Lehtonen
f64c23968a nfd-master: fix node update
Update node status before node metadata. This fixes a problem where we
lose track of NFD-managed extended resources in case patching node
status fails. Previously we removed all labels and annotations
(including the one listing our ERs) and only after that updated node
status. If node status update failed we had lost the annotation but
extended resources were still there, leaving them orphaned.
2023-04-06 22:04:35 +03:00
Markus Lehtonen
cc6c20ff5f nfd-master: disallow unprefixed and kubernetes taints
Disallow taints having a key with "kubernetes.io/" or "*.kubernetes.io/"
prefix. This is a precaution to protect the user from interfering with
the "official" well-known taints from Kubernetes itself. The only
exception is that the "nfd.node.kubernetes.io/" prefix is allowed.

However, there is one allowed NFD-specific namespace (and its
sub-namespaces) i.e. "feature.node.kubernetes.io" under the
kubernetes.io domain that can be used for NFD-managed taints.

Also disallow unprefixed taint keys. We don't add a default prefix to
unprefixed taints (like we do for labels) from NodeFeatureRules. This is
to prevent unpleasant surprises to users that need to manage matching
tolerations for their workloads.
2023-04-06 16:12:37 +03:00
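The taint key rules from the commit above (cc6c20ff5f) can be expressed as a small validation function. This is a sketch under assumptions, not the project's code: the name `validateTaintKey` is hypothetical, both "nfd.node.kubernetes.io" and "feature.node.kubernetes.io" are treated as allowed because the message mentions both, and sub-namespaces are accepted only for the latter.

```go
package main

import (
	"fmt"
	"strings"
)

// allowedNFDNamespace reports whether ns is one of the NFD namespaces that the
// commit message exempts from the kubernetes.io restriction (assumed set).
func allowedNFDNamespace(ns string) bool {
	return ns == "nfd.node.kubernetes.io" ||
		ns == "feature.node.kubernetes.io" ||
		strings.HasSuffix(ns, ".feature.node.kubernetes.io")
}

// validateTaintKey rejects unprefixed keys and keys under kubernetes.io or
// *.kubernetes.io, apart from the NFD namespaces above. Hypothetical helper.
func validateTaintKey(key string) error {
	ns, _, found := strings.Cut(key, "/")
	if !found || ns == "" {
		return fmt.Errorf("taint key %q has no prefix", key)
	}
	if allowedNFDNamespace(ns) {
		return nil
	}
	if ns == "kubernetes.io" || strings.HasSuffix(ns, ".kubernetes.io") {
		return fmt.Errorf("taint key %q uses a reserved kubernetes.io prefix", key)
	}
	return nil
}

func main() {
	keys := []string{
		"example.com/gpu",
		"gpu",
		"kubernetes.io/gpu",
		"node.kubernetes.io/gpu",
		"feature.node.kubernetes.io/gpu",
	}
	for _, k := range keys {
		fmt.Printf("%-34s error: %v\n", k, validateTaintKey(k))
	}
}
```

Note that no default prefix is added to unprefixed keys; they are simply rejected, which matches the rationale given in the message.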
Kubernetes Prow Robot
193c552b33
Merge pull request #1084 from AhmedGrati/feat-add-master-config-file
feat: add master config file
2023-04-04 10:41:40 -07:00
AhmedGrati
3fff409f6d Add master config file
Similar to the nfd-worker, in this PR we want to support the
dynamic run-time configurability through a config file for the nfd-master.

We'll use a json or yaml configuration file along with fsnotify in
order to watch for changes in the config file. As a result, we're
allowing dynamic control of logging params, allowed namespaces,
extended resources, label whitelisting, and denied namespaces.

Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>
2023-04-03 09:52:09 +01:00
AhmedGrati
d0a6289c0f chore: add debug dump of nfd worker configuration
Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>
2023-03-18 00:49:07 +01:00
Kubernetes Prow Robot
13f92faa77
Merge pull request #1031 from k8stopologyawareschedwg/reactive_updates
topology-updater: reactive updates
2023-03-17 10:13:17 -07:00
Talor Itzhak
5c6be580f4 reactive updates: add an option to disable the feature
Access to the kubelet state directory may raise concerns in some setups, so add an option to disable it.
The feature is enabled by default.

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
2023-03-16 11:53:16 +02:00
Kubernetes Prow Robot
a06e44ef0b
Merge pull request #1083 from fmuyassarov/mockery
codegen: fix code-generation
2023-03-15 06:46:16 -07:00
Markus Lehtonen
4a8fc811be pkg/utils: add UnmarshalJSON method to StringSetVal
Make it possible to specify values in yaml as an array like

  conf:
    - foo
    - bar

Instead of unwieldy map like

  conf:
    foo:
    bar:
2023-03-14 10:53:24 +02:00
Talor Itzhak
8924213d14 topology-updater: make it possible to disable sleep-interval
Especially convenient for testing purposes and
completely harmless.

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
2023-03-12 12:43:17 +02:00
Talor Itzhak
1c12876815 topology-updater: log event type that triggered update
Specify the event type as part of the log message.
In order to reduce the log volume, make it V4

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
2023-03-12 12:37:24 +02:00
Talor Itzhak
7b248ecae2 topology-updater: update CRs when notified
When a message is received via the channel,
the main loop updates the `NodeResourceTopology` objects.

The notifier will send a message via the channel if:
1. It reached the sleep timeout.
2. It detected a change in Kubelet state files

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
2023-03-12 12:37:24 +02:00
Talor Itzhak
175e0c81aa topology-updater: add kubelet-state-dir flag
On different Kubernetes flavors, like OpenShift for example,
the Kubelet state directory path is different. Make it configurable
for maximum flexibility.

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
2023-03-12 12:37:24 +02:00
Talor Itzhak
0f65b87329 kubeletnotifier: introduce kubeletnotifier package
Enabling reactive update for nfd-topology-updater
by detecting changes in Kubelet state/checkpoint files,
and signaling to the main loop to update the NodeResourceTopology
objects.

This has high value when scaling is an issue.
Multiple pods deployed between two consecutive updates
might lead to incorrect resource accounting in the NRT CRs.
Example:
Time Interval = 5s
t0 - New update sent to NRT CRs
t1 - Schedule guaranteed podA
t2 - Schedule guaranteed podB
The time elapsed between t0 and t2 is less than 5 seconds,
IOW the update at t0 is the most recent update.

At t2 the resource accounting reflected by NRT
is not aligned with the actual accounting because
the NRT CRs don't reflect the change that happened at t1.

With this reactive update feature we expect an update to be triggered
between t1 and t2 so the NRT objects will reflect a more accurate
picture.

There still might be a scenario when the updates
aren't fast enough, but this is an additional
future planned optimization.

The notifier has two event types:
1. Time based - keeping the old behavior, trigger
an update per interval.
2. FS event - trigger an update when Kubelet state/checkpoint files are modified.

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
2023-03-12 12:37:24 +02:00
Muyassarov, Feruzjon
e3a856b405 update re-generated code with make-generate results
Update generated code with the results of re-running make
generate.

Signed-off-by: Muyassarov, Feruzjon <feruzjon.muyassarov@intel.com>
2023-03-11 22:15:11 +02:00
Jose Luis Ojosnegros Manchón
b340d112a8 topology-updater: compute pod set fingerprint
Add an option to compute the fingerprint of the current pod set on each
node.

Report this new fingerprint using an attribute in NRT object.
2023-02-22 10:22:50 +01:00
Jose Luis Ojosnegros Manchón
1a687cb286 topology-updater: Refactor Scan to expand response
We are gonna add new data to Scan response so better introduce a new
ScanResponse struct as Scan return value to make it easier.
2023-02-22 09:56:28 +01:00
Kubernetes Prow Robot
a92614c292
Merge pull request #1051 from AhmedGrati/feat-add-deny-label-ns-with-wildcard
feat: add deny-label-ns flag which supports wildcard
2023-02-15 03:42:25 -08:00
Kubernetes Prow Robot
38cc370e69
Merge pull request #1054 from PiotrProkop/use-new-nrt-api
Advertise TopologyManager policy and scope as Attributes in NRT api v1alpha2
2023-02-15 01:12:25 -08:00
AhmedGrati
b499799364 feat: add deny-label-ns flag which supports wildcard
Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>
2023-02-15 09:47:00 +01:00
PiotrProkop
f76fc5bf6b Read Kubelet configuration the same way as Kubelet to apply default values
Signed-off-by: PiotrProkop <pprokop@nvidia.com>
2023-02-15 09:27:25 +01:00
Ville Pihlava
b1c6b229fe Add discovery duration logging. 2023-02-13 12:55:57 +02:00
pprokop
5484babcb1 Advertise TopologyManager policy and scope as Attributes
Signed-off-by: pprokop <pprokop@nvidia.com>
2023-02-10 12:03:11 +01:00
Kubernetes Prow Robot
ac271b3c29
Merge pull request #1050 from VillePihlava/interval-fix
Change nfd-worker to use Ticker instead of After.
2023-02-09 07:54:22 -08:00