mirror of https://github.com/kubernetes-sigs/node-feature-discovery.git synced 2024-12-14 11:57:51 +00:00
Commit graph

1617 commits

Author SHA1 Message Date
Kubernetes Prow Robot
ad07829d0a
Merge pull request #1099 from ArangoGutierrez/extended_resources_v2
Create extended resources with NodeFeatureRule
2023-04-07 08:09:15 -07:00
Fabiano Fidêncio
250aea4741
Create extended resources with NodeFeatureRule
Add support for management of Extended Resources via the
NodeFeatureRule CRD API.

There are usage scenarios where users want to advertise features
as extended resources instead of labels (or annotations).

This patch enables the discovery of extended resources through the
NodeFeatureRule API, by annotating the node and patching
node.status.capacity and node.status.allocatable.

Co-authored-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Co-authored-by: Markus Lehtonen <markus.lehtonen@intel.com>
Co-authored-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
2023-04-07 16:14:56 +02:00
Kubernetes Prow Robot
6740224a13
Merge pull request #1100 from PiotrProkop/expose-L3-num-closid
Advertise RDT L3 num_closid
2023-04-07 00:49:14 -07:00
Kubernetes Prow Robot
f2569b5694
Merge pull request #1119 from marquiz/devel/fix-nfd-master
nfd-master: fix node update
2023-04-06 12:23:35 -07:00
Markus Lehtonen
f64c23968a nfd-master: fix node update
Update node status before node metadata. This fixes a problem where we
lose track of NFD-managed extended resources in case patching node
status fails. Previously we removed all labels and annotations
(including the one listing our ERs) and only after that updated node
status. If node status update failed we had lost the annotation but
extended resources were still there, leaving them orphaned.
2023-04-06 22:04:35 +03:00
Kubernetes Prow Robot
ec014f118b
Merge pull request #1118 from marquiz/devel/taints
nfd-master: disallow unprefixed and kubernetes taints
2023-04-06 06:59:48 -07:00
Markus Lehtonen
cc6c20ff5f nfd-master: disallow unprefixed and kubernetes taints
Disallow taints having a key with "kubernetes.io/" or "*.kubernetes.io/"
prefix. This is a precaution to protect the user from messing up
the "official" well-known taints from Kubernetes itself.

The exceptions are the "nfd.node.kubernetes.io/" prefix and the
NFD-specific namespace "feature.node.kubernetes.io" (and its
sub-namespaces) under the kubernetes.io domain, both of which can be
used for NFD-managed taints.

Also disallow unprefixed taint keys. We don't add a default prefix to
unprefixed taints (like we do for labels) from NodeFeatureRules. This is
to prevent unpleasant surprises for users that need to manage matching
tolerations for their workloads.
2023-04-06 16:12:37 +03:00
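The prefix rules described in this commit message can be sketched in Go roughly as follows. This is a minimal illustration, not the nfd-master implementation: the function name `validateTaintKey` and the constant names are hypothetical, and treating only the exact "nfd.node.kubernetes.io" namespace (without sub-namespaces) as allowed is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// Namespaces named in the commit message.
const (
	kubernetesDomain = "kubernetes.io"
	nfdNs            = "nfd.node.kubernetes.io"
	featureNs        = "feature.node.kubernetes.io"
)

// validateTaintKey is a hypothetical helper, not the nfd-master code.
// It rejects unprefixed keys and keys under kubernetes.io, except for
// the NFD namespaces listed above.
func validateTaintKey(key string) error {
	ns, name, found := strings.Cut(key, "/")
	if !found || ns == "" || name == "" {
		return fmt.Errorf("taint key %q must have a prefix", key)
	}
	// Assumption: sub-namespaces are allowed only under featureNs.
	if ns == nfdNs || ns == featureNs || strings.HasSuffix(ns, "."+featureNs) {
		return nil
	}
	if ns == kubernetesDomain || strings.HasSuffix(ns, "."+kubernetesDomain) {
		return fmt.Errorf("taint key %q uses a reserved kubernetes.io prefix", key)
	}
	return nil
}

func main() {
	for _, key := range []string{
		"special",
		"node.kubernetes.io/unreachable",
		"feature.node.kubernetes.io/special",
		"example.com/special",
	} {
		fmt.Printf("%-40s error: %v\n", key, validateTaintKey(key))
	}
}
```

Rejecting unprefixed keys outright, rather than silently adding a default prefix, keeps the taint key identical to what the user wrote, so tolerations can be matched against it without guessing.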
Kubernetes Prow Robot
621823f556
Merge pull request #1117 from marquiz/devel/e2e-refactor
test/e2e: refactor nfd pod configuration
2023-04-06 05:33:48 -07:00
PiotrProkop
0e78eba40e Advertise RDT L3 num_closid
Signed-off-by: PiotrProkop <pprokop@nvidia.com>
2023-04-06 11:22:55 +02:00
Markus Lehtonen
2e85b8a914 test/e2e: refactor nfd pod configuration
Make the default master pod run with no special options. Move the
customizations of the master pod to the setup functions of the tests
that actually need it.

Also, clean up the nfd-worker configuration of some tests.
2023-04-05 21:51:25 +03:00
Kubernetes Prow Robot
60f052f086
Merge pull request #1116 from marquiz/devel/e2e-crd-deletion
test/e2e: wait for CRD deletion to complete
2023-04-05 07:31:40 -07:00
Kubernetes Prow Robot
3c0c43b9be
Merge pull request #1114 from marquiz/devel/rdt-deprecate
source/cpu: deprecate cpu-rdt.* labels
2023-04-05 06:21:40 -07:00
Kubernetes Prow Robot
f5121a5bdd
Merge pull request #1115 from marquiz/devel/e2e-fix
test/e2e: fix node cleanup function
2023-04-05 06:09:41 -07:00
Markus Lehtonen
5793207cf2 test/e2e: wait for CRD deletion to complete
Wait for the deletion of NFD CRDs to complete before trying to re-create
them. Prevents errors in case CRDs already exist on the cluster when
e2e-tests are launched.
2023-04-05 15:56:26 +03:00
Markus Lehtonen
68c3bf317b test/e2e: fix node cleanup function
The node cleanup function was not removing all NFD-labels. It omitted
NFD-originated labels that used a non-default label namespace. This
patch fixes the issue by getting all NFD-managed labels from the special
annotation (nfd.node.kubernetes.io/feature-labels).

The patch also adds the ability to clean up extended resources in a
similar way. This will be needed by future work.

Also changes the order of cleaning up CRs and the node. It is the right
order as cleaning up the CRs may still update the node.
2023-04-05 15:09:25 +03:00
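A rough Go sketch of the annotation-driven cleanup described in this commit message follows. It works on plain maps instead of a real Node object, and the helper names are hypothetical. It also assumes that the annotation value is a comma-separated list of label names in which names without a "/" belong to the default "feature.node.kubernetes.io" namespace; that format is not stated in the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	// Annotation named in the commit message.
	featureLabelsAnnotation = "nfd.node.kubernetes.io/feature-labels"
	// Assumed default namespace for names stored without a prefix.
	defaultLabelNs = "feature.node.kubernetes.io"
)

// managedLabels returns the full label keys listed in the annotation.
// Assumption: the value is a comma-separated list of label names.
func managedLabels(annotations map[string]string) []string {
	var keys []string
	for _, name := range strings.Split(annotations[featureLabelsAnnotation], ",") {
		name = strings.TrimSpace(name)
		if name == "" {
			continue
		}
		if !strings.Contains(name, "/") {
			name = defaultLabelNs + "/" + name
		}
		keys = append(keys, name)
	}
	return keys
}

// cleanupLabels removes every NFD-managed label, including labels in
// non-default namespaces, and then drops the bookkeeping annotation.
func cleanupLabels(labels, annotations map[string]string) {
	for _, key := range managedLabels(annotations) {
		delete(labels, key)
	}
	delete(annotations, featureLabelsAnnotation)
}

func main() {
	labels := map[string]string{
		"feature.node.kubernetes.io/cpu-cpuid.AVX": "true",
		"example.com/my-feature":                   "true",
		"kubernetes.io/hostname":                   "node-1",
	}
	annotations := map[string]string{
		featureLabelsAnnotation: "cpu-cpuid.AVX,example.com/my-feature",
	}
	cleanupLabels(labels, annotations)
	fmt.Println(labels)
}
```

Reading the keys from the annotation, instead of matching on a label prefix, is what lets the cleanup find labels that live outside the default namespace.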
Kubernetes Prow Robot
193c552b33
Merge pull request #1084 from AhmedGrati/feat-add-master-config-file
feat: add master config file
2023-04-04 10:41:40 -07:00
Markus Lehtonen
6cb5e99afa source/cpu: deprecate cpu-rdt.* labels
Document built-in RDT labels to be deprecated and removed in a future
release. The plan is that the default built-in RDT labels would not be
created anymore, but the RDT features would still be available for
NodeFeatureRules to consume.

The RDT labels are not very useful (e.g. they don't indicate if the
features are really enabled in the kernel or if resctrlfs is mounted).
2023-04-04 11:54:57 +03:00
Kubernetes Prow Robot
27e0788bbc
Merge pull request #1112 from marquiz/devel/readme
README: update to release v0.12.2
2023-04-03 04:47:52 -07:00
Markus Lehtonen
71261e4b5f README: update to release v0.12.2 2023-04-03 14:42:19 +03:00
AhmedGrati
3fff409f6d Add master config file
Similar to the nfd-worker, in this PR we want to support the
dynamic run-time configurability through a config file for the nfd-master.

We'll use a JSON or YAML configuration file along with fsnotify in
order to watch for changes in the config file. As a result, we're
allowing dynamic control of logging params, allowed namespaces,
extended resources, label whitelisting, and denied namespaces.

Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>
2023-04-03 09:52:09 +01:00
Kubernetes Prow Robot
285f425313
Merge pull request #1106 from marquiz/devel/deps
go.mod: update kubernetes to v1.26.3
2023-03-31 11:33:53 -07:00
Markus Lehtonen
436f679cb1 go.mod: update kubernetes to v1.26.3 2023-03-31 19:41:18 +03:00
Kubernetes Prow Robot
1d8e26f6b1
Merge pull request #1079 from fidencio/topic/tdx-expose-key-id
cpu: Expose the total number of keys for TDX
2023-03-31 00:33:56 -07:00
Fabiano Fidêncio
10672e1bba cpu: Expose the total number of keys for TDX
The total amount of keys that can be used on a specific TDX system is
exposed via the cgroups misc.capacity. See:

```
$ cat /sys/fs/cgroup/misc.capacity
tdx 31
```

The first step to properly manage the amount of keys present in a node
is exposing it via the NFD, and that's exactly what this commit does.

An example of how it ends up being exposed via the NFD:

```
$ kubectl get node 984fee00befb.jf.intel.com -o jsonpath='{.metadata.labels}'  | jq | grep tdx.total_keys
  "feature.node.kubernetes.io/cpu-security.tdx.total_keys": "31",
```

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
2023-03-31 09:12:26 +02:00
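As an illustration of the discovery step in this commit message, the sketch below parses the content of misc.capacity and extracts the value for "tdx". It is a minimal Go example under stated assumptions: the function name `parseMiscCapacity` is hypothetical, and the input is the sample content from the commit message rather than a read of /sys/fs/cgroup/misc.capacity.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMiscCapacity is a hypothetical helper that scans the
// "<resource> <capacity>" lines of a misc.capacity file and returns
// the capacity of the requested resource.
func parseMiscCapacity(content, resource string) (int64, error) {
	for _, line := range strings.Split(content, "\n") {
		fields := strings.Fields(line)
		if len(fields) == 2 && fields[0] == resource {
			return strconv.ParseInt(fields[1], 10, 64)
		}
	}
	return 0, fmt.Errorf("resource %q not found in misc.capacity", resource)
}

func main() {
	// Sample content taken from the commit message.
	keys, err := parseMiscCapacity("tdx 31\n", "tdx")
	if err != nil {
		panic(err)
	}
	fmt.Printf("feature.node.kubernetes.io/cpu-security.tdx.total_keys=%d\n", keys)
}
```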
Kubernetes Prow Robot
243c05e329
Merge pull request #1097 from ArangoGutierrez/amd_sev
cpu: expose AMD SEV support
2023-03-30 08:53:48 -07:00
Carlos Eduardo Arango Gutierrez
7171cfd4eb
cpu: expose AMD SEV support
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Co-authored-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Co-authored-by: Markus Lehtonen <markus.lehtonen@intel.com>
2023-03-30 15:19:43 +02:00
Kubernetes Prow Robot
821e042dbf
Merge pull request #1091 from AhmedGrati/feat-helm-enable-taints
feat: add enableTaints to helm chart
2023-03-21 02:59:09 -07:00
AhmedGrati
02b3b7c7e0 feat: add enableTaints to helm chart
Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>
2023-03-21 10:49:24 +01:00
Kubernetes Prow Robot
b0a45cdb36
Merge pull request #1092 from AhmedGrati/add-debug-dump-worker-config
chore: add debug dump of nfd worker configuration
2023-03-21 01:55:08 -07:00
AhmedGrati
d0a6289c0f chore: add debug dump of nfd worker configuration
Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>
2023-03-18 00:49:07 +01:00
Kubernetes Prow Robot
13f92faa77
Merge pull request #1031 from k8stopologyawareschedwg/reactive_updates
topology-updater: reactive updates
2023-03-17 10:13:17 -07:00
Talor Itzhak
5c6be580f4 reactive updates: add an option to disable the feature
Access to the kubelet state directory may raise concerns in some setups, so add an option to disable it.
The feature is enabled by default.

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
2023-03-16 11:53:16 +02:00
Talor Itzhak
727de56191 documentation: document the reactive updates feature
Signed-off-by: Talor Itzhak <titzhak@redhat.com>
2023-03-16 11:53:12 +02:00
Talor Itzhak
91daff3b59 deployment/helm: update helm charts
Adding kubelet state directory mount

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
2023-03-16 11:51:45 +02:00
Kubernetes Prow Robot
c14be60798
Merge pull request #1085 from fmuyassarov/ignore-cov-report
gitignore: ignore codecov coverage report
2023-03-16 01:39:16 -07:00
Kubernetes Prow Robot
4af31733c3
Merge pull request #1090 from ArangoGutierrez/update_prune_helm
kustomize: trim prune overlay
2023-03-15 12:51:05 -07:00
Carlos Eduardo Arango Gutierrez
355807f98c
kustomize: trim prune overlay
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
2023-03-15 20:36:45 +01:00
Kubernetes Prow Robot
a06e44ef0b
Merge pull request #1083 from fmuyassarov/mockery
codegen: fix code-generation
2023-03-15 06:46:16 -07:00
Kubernetes Prow Robot
6688a2f232
Merge pull request #1087 from marquiz/devel/strigsetval
pkg/utils: add UnmarshalJSON method to StringSetVal
2023-03-15 01:44:15 -07:00
Markus Lehtonen
4a8fc811be pkg/utils: add UnmarshalJSON method to StringSetVal
Make it possible to specify values in yaml as an array like

  conf:
    - foo
    - bar

Instead of unwieldy map like

  conf:
    foo:
    bar:
2023-03-14 10:53:24 +02:00
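The idea of accepting both forms can be sketched in Go as below. This is an illustration, not the pkg/utils code: the set is modelled as `map[string]struct{}`, the helper `parseSet` is hypothetical, and the example feeds JSON directly, assuming the YAML configuration is converted to JSON before decoding.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// StringSetVal is modelled here as a plain set of strings (assumption).
type StringSetVal map[string]struct{}

// UnmarshalJSON accepts either an array of strings or a map whose keys
// are the set members.
func (s *StringSetVal) UnmarshalJSON(data []byte) error {
	// Preferred form: ["foo", "bar"].
	var list []string
	if err := json.Unmarshal(data, &list); err == nil {
		*s = make(StringSetVal, len(list))
		for _, v := range list {
			(*s)[v] = struct{}{}
		}
		return nil
	}
	// Fallback form: {"foo": null, "bar": null}.
	var m map[string]interface{}
	if err := json.Unmarshal(data, &m); err != nil {
		return fmt.Errorf("expected an array or a map of strings: %w", err)
	}
	*s = make(StringSetVal, len(m))
	for k := range m {
		(*s)[k] = struct{}{}
	}
	return nil
}

// parseSet is a hypothetical convenience wrapper used by the demo.
func parseSet(doc string) (StringSetVal, error) {
	var s StringSetVal
	err := json.Unmarshal([]byte(doc), &s)
	return s, err
}

func main() {
	for _, doc := range []string{`["foo","bar"]`, `{"foo":null,"bar":null}`} {
		set, err := parseSet(doc)
		if err != nil {
			panic(err)
		}
		keys := make([]string, 0, len(set))
		for k := range set {
			keys = append(keys, k)
		}
		sort.Strings(keys)
		fmt.Println(keys)
	}
}
```

Both input forms decode to the same set, so existing map-style configuration keeps working while the array form becomes available.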
Muyassarov, Feruzjon
28a2be436f gitignore: ignore codecov coverage report
We don't need to keep the codecov coverage report in git.
Add it to .gitignore to avoid committing it
accidentally.

Signed-off-by: Muyassarov, Feruzjon <feruzjon.muyassarov@intel.com>
2023-03-13 12:08:32 +02:00
Talor Itzhak
6de13fe456 e2e: reactive updates test
Signed-off-by: Talor Itzhak <titzhak@redhat.com>
2023-03-12 12:43:17 +02:00
Talor Itzhak
8924213d14 topology-updater: make it possible to disable sleep-interval
Especially convenient for testing purposes and
completely harmless.

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
2023-03-12 12:43:17 +02:00
Talor Itzhak
8afd819132 deployment/topology-updater: add mount for kubelet state dir
This mount is needed for watching the state files

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
2023-03-12 12:43:13 +02:00
Talor Itzhak
1c12876815 topology-updater: log event type that triggered update
Specify the event type as part of the log message.
In order to reduce the log volume, make it V4

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
2023-03-12 12:37:24 +02:00
Talor Itzhak
7b248ecae2 topology-updater: update CRs when notified
When a message is received via the channel,
the main loop updates the `NodeResourceTopology` objects.

The notifier will send a message via the channel if:
1. It reached the sleep timeout.
2. It detected a change in Kubelet state files.

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
2023-03-12 12:37:24 +02:00
Talor Itzhak
175e0c81aa topology-updater: add kubelet-state-dir flag
On different Kubernetes flavors, like OpenShift for example,
the Kubelet state directory path is different. Make it configurable
for maximum flexibility.

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
2023-03-12 12:37:24 +02:00
Talor Itzhak
0f65b87329 kubeletnotifier: introduce kubeletnotifier package
Enabling reactive update for nfd-topology-updater
by detecting changes in Kubelet state/checkpoint files,
and signaling to the main loop to update the NodeResourceTopology
objects.

This has high value when scaling is an issue.
Having multiple pods deployed between two consecutive updates
might result in incorrect resource accounting in the NRT CRs.
Example:
Time Interval = 5s
t0 - New update sent to NRT CRs
t1 - Schedule guaranteed podA
t2 - Schedule guaranteed podB
The time elapsed between t0 and t2 is less than 5 seconds,
IOW the update at t0 is the most recent update.

At t2 the resource accounting reflected by NRT
is not aligned with the actual accounting because
the NRT CRs don't reflect the change that happened at t1.

With this reactive update feature we expect an update to be triggered
between t1 and t2 so the NRT objects will reflect a more accurate
picture.

There still might be a scenario when the updates
aren't fast enough, but this is an additional
future planned optimization.

The notifier has two event types:
1. Time based - keeping the old behavior, trigger
an update per interval.
2. FS event - trigger an update when Kubelet state/checkpoint files are modified.

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
2023-03-12 12:37:24 +02:00
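The two event types described in this commit message can be sketched as a single select loop in Go. This is a standard-library illustration, not the kubeletnotifier package: file-system events are simulated with a channel of file names (a real implementation would receive them from a watcher such as fsnotify), all identifiers are hypothetical, and the state file names are assumptions.

```go
package main

import (
	"fmt"
	"time"
)

type EventType string

const (
	IntervalBased EventType = "intervalBased"
	FSUpdate      EventType = "fsUpdate"
)

// Info tells the main loop which kind of event triggered the update.
type Info struct{ Event EventType }

// Assumed names of the Kubelet state/checkpoint files of interest.
var stateFiles = map[string]bool{
	"cpu_manager_state":           true,
	"memory_manager_state":        true,
	"kubelet_internal_checkpoint": true,
}

// send delivers an event unless the notifier is being stopped.
func send(dest chan<- Info, stop <-chan struct{}, ev EventType) bool {
	select {
	case dest <- Info{Event: ev}:
		return true
	case <-stop:
		return false
	}
}

// run notifies dest on every interval tick and on every change to a
// watched state file. An interval of 0 disables the time-based events.
func run(interval time.Duration, fsEvents <-chan string, dest chan<- Info, stop <-chan struct{}) {
	var tick <-chan time.Time // a nil channel never fires
	if interval > 0 {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		tick = ticker.C
	}
	for {
		select {
		case <-tick:
			if !send(dest, stop, IntervalBased) {
				return
			}
		case name, ok := <-fsEvents:
			if !ok {
				fsEvents = nil // watcher closed; keep the time-based events
				continue
			}
			if stateFiles[name] && !send(dest, stop, FSUpdate) {
				return
			}
		case <-stop:
			return
		}
	}
}

func main() {
	fsEvents := make(chan string, 1)
	dest := make(chan Info, 1)
	stop := make(chan struct{})
	go run(0, fsEvents, dest, stop)

	fsEvents <- "cpu_manager_state" // simulate podA changing Kubelet state
	fmt.Println((<-dest).Event)
	close(stop)
}
```

In terms of the t0 to t2 example above, the FS event fires when scheduling podA at t1 changes the state files, so the main loop can refresh the NRT objects before t2 instead of waiting for the next interval.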
Muyassarov, Feruzjon
e3a856b405 update re-generated code with make-generate results
Update generated code with the results of re-running make
generate.

Signed-off-by: Muyassarov, Feruzjon <feruzjon.muyassarov@intel.com>
2023-03-11 22:15:11 +02:00
Muyassarov, Feruzjon
99595f5fab omit go version control information (buildvcs)
Omit go version control information (buildvcs), otherwise
go command fails to obtain vcs status as shown below:

error obtaining VCS status: exit status 128
	Use -buildvcs=false to disable VCS stamping.

Signed-off-by: Muyassarov, Feruzjon <feruzjon.muyassarov@intel.com>
2023-03-11 22:14:24 +02:00