Change the NFD API handler to retry on node update failures. This works
around transient failures and ensures that failed nodes (i.e. nodes
that we failed to update) don't need to wait for the 1-hour resync
period before being retried.
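As an illustration only, here is a minimal Go sketch of the retry idea, where updateNode is a hypothetical stand-in for nfd-master's real node update call and the retry count and backoff values are made up:

package sketch

import (
    "fmt"
    "time"
)

// updateNodeWithRetry retries a failing node update a few times with a
// growing delay, instead of leaving the node untouched until the next resync.
func updateNodeWithRetry(nodeName string, updateNode func(string) error) error {
    var err error
    delay := time.Second
    for attempt := 0; attempt < 3; attempt++ {
        if err = updateNode(nodeName); err == nil {
            return nil
        }
        time.Sleep(delay) // assume the failure is transient and back off
        delay *= 2
    }
    return fmt.Errorf("node %q: giving up after retries: %w", nodeName, err)
}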
Increase the NFD API controller resync period from 5 minutes to 1 hour.
The resync causes nfd-master to replay all NodeFeature and
NodeFeatureRule objects, effectively serving as a "big hammer, reset
everything" button. It should only be needed as insurance, to fix labels
and the like in case they have been tampered with manually (outside NFD),
and to guard against certain bugs in NFD itself. NFD is not supposed to
manage anything fast-changing, so 1 hour should be enough.
This change only affects behavior when the NodeFeature API has been
enabled (with -enable-nodefeature-api).
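For reference, a hedged sketch of where such a resync period is plugged in; client-go core informers are used here as a stand-in for NFD's generated NodeFeature/NodeFeatureRule informers:

package sketch

import (
    "time"

    "k8s.io/client-go/informers"
    "k8s.io/client-go/kubernetes"
)

// newInformerFactory shows the resync knob: every cached object is replayed
// to the event handlers once per resyncPeriod, here one hour instead of five
// minutes.
func newInformerFactory(cs kubernetes.Interface) informers.SharedInformerFactory {
    const resyncPeriod = time.Hour
    return informers.NewSharedInformerFactory(cs, resyncPeriod)
}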
Add support for management of Extended Resources via the
NodeFeatureRule CRD API.
There are usage scenarios where users want to advertise features
as extended resources instead of labels (or annotations).
This patch enables extended resources to be advertised through the
NodeFeatureRule API: the managed resources are tracked in a node
annotation and applied by patching node.status.capacity and
node.status.allocatable.
Co-authored-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Co-authored-by: Markus Lehtonen <markus.lehtonen@intel.com>
Co-authored-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
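Below is a minimal client-go sketch of the status-patch mechanism described above; the patch shape and the example resource name are illustrative rather than NFD's exact implementation:

package sketch

import (
    "context"
    "encoding/json"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/types"
    "k8s.io/client-go/kubernetes"
)

// patchExtendedResources merge-patches the node status subresource so that
// the given extended resources (name -> quantity string, e.g.
// "vendor.example/foo": "4") show up in both capacity and allocatable.
func patchExtendedResources(ctx context.Context, cs kubernetes.Interface,
    nodeName string, resources map[string]string) error {
    patch := map[string]interface{}{
        "status": map[string]interface{}{
            "capacity":    resources,
            "allocatable": resources,
        },
    }
    data, err := json.Marshal(patch)
    if err != nil {
        return err
    }
    _, err = cs.CoreV1().Nodes().Patch(ctx, nodeName,
        types.StrategicMergePatchType, data, metav1.PatchOptions{}, "status")
    return err
}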
Update node status before node metadata. This fixes a problem where we
lose track of NFD-managed extended resources if patching the node status
fails. Previously we removed all labels and annotations (including the
one listing our extended resources) and only after that updated the node
status. If the status update failed, we had already lost the annotation
while the extended resources were still in place, leaving them orphaned.
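A condensed sketch of the fixed ordering, with hypothetical patchStatus/patchMetadata helpers standing in for the real patching code:

package sketch

// updateNode patches the status subresource first and only touches labels
// and annotations (including the annotation listing NFD-managed extended
// resources) once that has succeeded.
func updateNode(patchStatus, patchMetadata func() error) error {
    if err := patchStatus(); err != nil {
        // Bail out early: the bookkeeping annotation is still intact, so the
        // extended resources are not orphaned and can be cleaned up later.
        return err
    }
    return patchMetadata()
}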
Disallow taints having a key with a "kubernetes.io/" or
"*.kubernetes.io/" prefix. This is a precaution to protect the user from
messing up the "official" well-known taints from Kubernetes itself. The
only exceptions are NFD's own namespaces: the "nfd.node.kubernetes.io/"
prefix is allowed, as is the "feature.node.kubernetes.io" namespace (and
its sub-namespaces) under the kubernetes.io domain, which can be used
for NFD-managed taints.
Also disallow unprefixed taint keys. We don't add a default prefix to
unprefixed taints from NodeFeatureRules (as we do for labels). This is
to prevent unpleasant surprises for users who need to manage matching
tolerations for their workloads.
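A rough sketch of the kind of key check described above; the allowed-namespace list follows this text and is not taken from NFD's source:

package sketch

import "strings"

// allowedTaintKey rejects unprefixed keys and keys under the kubernetes.io
// domain, except for the NFD-specific namespaces.
func allowedTaintKey(key string) bool {
    if !strings.Contains(key, "/") {
        return false // no default prefix is added for taints
    }
    ns := strings.SplitN(key, "/", 2)[0]
    if ns == "kubernetes.io" || strings.HasSuffix(ns, ".kubernetes.io") {
        return ns == "nfd.node.kubernetes.io" ||
            ns == "feature.node.kubernetes.io" ||
            strings.HasSuffix(ns, ".feature.node.kubernetes.io")
    }
    return true
}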
Similar to nfd-worker, this PR adds dynamic run-time configurability to
nfd-master through a config file. A JSON or YAML configuration file is
used together with fsnotify in order to watch for changes to it. As a
result, logging parameters, allowed namespaces, extended resources,
label whitelisting and denied namespaces can now be controlled
dynamically.
Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>
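As a hedged sketch of the mechanism described above, assuming a hypothetical config struct and option names:

package sketch

import (
    "log"
    "os"

    "github.com/fsnotify/fsnotify"
    "sigs.k8s.io/yaml"
)

// config is a hypothetical subset of the dynamically reloadable options.
type config struct {
    LabelWhiteList string   `json:"labelWhiteList"`
    DenyLabelNs    []string `json:"denyLabelNs"`
    ResourceLabels []string `json:"resourceLabels"`
}

// watchConfig re-reads and re-applies the config file whenever fsnotify
// reports that it was written to or re-created.
func watchConfig(path string, apply func(config)) error {
    w, err := fsnotify.NewWatcher()
    if err != nil {
        return err
    }
    defer w.Close()
    if err := w.Add(path); err != nil {
        return err
    }
    for e := range w.Events {
        if e.Op&(fsnotify.Write|fsnotify.Create) == 0 {
            continue
        }
        data, err := os.ReadFile(path)
        if err != nil {
            log.Printf("reading %s: %v", path, err)
            continue
        }
        var c config
        if err := yaml.Unmarshal(data, &c); err != nil { // YAML and JSON both parse
            log.Printf("parsing %s: %v", path, err)
            continue
        }
        apply(c) // dynamically apply the new settings
    }
    return nil
}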
Access to the kubelet state directory may raise concerns in some setups,
so add an option to disable it. The feature is enabled by default.
Signed-off-by: Talor Itzhak <titzhak@redhat.com>
When a message is received via the channel,
the main loop updates the `NodeResourceTopology` objects.
The notifier sends a message via the channel if:
1. It reached the sleep timeout, or
2. It detected a change in the Kubelet state files.
Signed-off-by: Talor Itzhak <titzhak@redhat.com>
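A minimal sketch of the consumer side described above, with hypothetical types; the real main loop carries more state:

package sketch

// event's payload is irrelevant for the sketch.
type event struct{}

// runMainLoop blocks on the notifier channel and refreshes the
// NodeResourceTopology objects on every message, regardless of whether it
// came from the timer or from a filesystem event.
func runMainLoop(events <-chan event, updateNRT func() error, logErr func(error)) {
    for range events {
        if err := updateNRT(); err != nil {
            logErr(err)
        }
    }
}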
On different Kubernetes flavors, like OpenShift for example,
the Kubelet state directory path is different. Make it configurable
for maximum flexibility.
Signed-off-by: Talor Itzhak <titzhak@redhat.com>
Enable reactive updates for nfd-topology-updater
by detecting changes in the Kubelet state/checkpoint files,
and signaling the main loop to update the NodeResourceTopology
objects.
This is especially valuable when scaling is an issue:
if multiple pods are deployed between two consecutive updates,
the NRT CRs may reflect incorrect resource accounting.
Example:
Time interval = 5s
t0 - New update sent to NRT CRs
t1 - Schedule guaranteed podA
t2 - Schedule guaranteed podB
The time elapsed between t0 and t2 is less than 5 seconds,
IOW the update at t0 is still the most recent one.
At t2 the resource accounting reflected by NRT
is therefore not aligned with the actual accounting, because
the NRT CRs don't reflect the change that happened at t1.
With this reactive update feature we expect an update to be triggered
between t1 and t2, so the NRT objects will reflect a more accurate
picture.
There might still be scenarios where the updates
aren't fast enough, but addressing that is a planned
future optimization.
The notifier has two event types:
1. Time based - keeps the old behavior, triggering
an update per interval.
2. FS event - triggers an update when the Kubelet state/checkpoint files are modified.
Signed-off-by: Talor Itzhak <titzhak@redhat.com>
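A hedged sketch of such a notifier, combining a ticker with an fsnotify watcher; the names and event plumbing are illustrative, not the actual nfd-topology-updater API:

package sketch

import (
    "time"

    "github.com/fsnotify/fsnotify"
)

// EventType distinguishes the two triggers described above.
type EventType int

const (
    IntervalBased EventType = iota // periodic timer fired
    FSEvent                        // kubelet state/checkpoint file changed
)

// notify sends an event downstream either when the sleep interval elapses or
// when a file under stateDir is modified.
func notify(stateDir string, interval time.Duration, out chan<- EventType) error {
    w, err := fsnotify.NewWatcher()
    if err != nil {
        return err
    }
    defer w.Close()
    if err := w.Add(stateDir); err != nil {
        return err
    }
    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for {
        select {
        case <-ticker.C:
            out <- IntervalBased
        case e := <-w.Events:
            if e.Op&(fsnotify.Write|fsnotify.Create|fsnotify.Remove) != 0 {
                out <- FSEvent
            }
        case err := <-w.Errors:
            return err
        }
    }
}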
The NodeResourceTopology (aka NRT) custom resource is used to enable NUMA-aware scheduling in Kubernetes.
As of now, node-feature-discovery daemons advertise those
resources, but there is no service responsible for removing obsolete
objects (ones without a corresponding Kubernetes node).
This patch adds a new daemon called nfd-topology-gc which removes such
stale NRTs.
Signed-off-by: PiotrProkop <pprokop@nvidia.com>
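The core of the garbage collection can be sketched as a simple set difference between node names and NRT object names (NRT objects are named after their node); listing and deletion via the real clients are omitted here:

package sketch

// staleNRTs returns the names of NodeResourceTopology objects that have no
// corresponding Kubernetes node and should therefore be garbage-collected.
// The caller is assumed to have listed both nodes and NRT objects already.
func staleNRTs(nodeNames, nrtNames []string) []string {
    nodes := make(map[string]struct{}, len(nodeNames))
    for _, n := range nodeNames {
        nodes[n] = struct{}{}
    }
    var stale []string
    for _, nrt := range nrtNames {
        if _, ok := nodes[nrt]; !ok {
            stale = append(stale, nrt)
        }
    }
    return stale
}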
Don't require features to be specified. The creator may want to create
only labels, or only some types of features; there is no need to specify
empty structs for the unused fields.
Correctly handle the case where no NodeFeature objects exist for a
certain node (and the NodeFeature API has been enabled with
-enable-nodefeature-api). In this case all labels should be removed.
We want to always update all nodes at startup. Without this patch we
don't get any update event from the controller if no NodeFeature or
NodeFeatureRule objects exist in the cluster. Thus all nodes would stay
untouched whereas we really want to remove all labels from all nodes in
this case.
Implement a naive rate limiter for node update events originating from
the NFD API. We might get a ton of events in a short interval. The
simplest example is startup, when we get a separate Add event for every
NodeFeature and NodeFeatureRule object. Without rate limiting we
run "update all nodes" separately for each NodeFeatureRule object and,
in addition, "update node X" separately for each NodeFeature object
targeting node X. This is a huge amount of wasted work because, in
principle, running "update all nodes" once should be enough.
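A naive coalescing loop along these lines illustrates the idea (channel-based and simplified; not the actual implementation):

package sketch

import "time"

// rateLimitedUpdater coalesces a burst of "update all nodes" requests into a
// single run per interval.
func rateLimitedUpdater(requests <-chan struct{}, interval time.Duration, updateAllNodes func()) {
    for range requests {
        // Drain everything that arrives during the quiet period so that a
        // flood of Add events at startup results in just one update pass.
        timeout := time.After(interval)
    drain:
        for {
            select {
            case <-requests:
            case <-timeout:
                break drain
            }
        }
        updateAllNodes()
    }
}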
Implement handling of multiple NodeFeature objects by merging all
objects (targeting a certain node) into one before processing the data.
This patch implements MergeInto() methods for all required data types.
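A simplified sketch of the merge semantics, using a made-up attribute type in place of the real NFD feature types:

package sketch

// AttributeFeatureSet is a simplified stand-in for the NFD feature types that
// gained MergeInto() methods; only the map-merge idea is shown here.
type AttributeFeatureSet struct {
    Elements map[string]string
}

// MergeInto copies the receiver's elements into out, overriding duplicates,
// so that several NodeFeature objects targeting the same node collapse into
// a single set of features before processing.
func (in *AttributeFeatureSet) MergeInto(out *AttributeFeatureSet) {
    if out.Elements == nil {
        out.Elements = make(map[string]string, len(in.Elements))
    }
    for k, v := range in.Elements {
        out.Elements[k] = v
    }
}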
With support for multiple NodeFeature objects per node, the "NFD API
workflow" can easily be demonstrated and tested from the command line.
Creating the following object (assuming node-n exists in the cluster):
apiVersion: nfd.k8s-sigs.io/v1alpha1
kind: NodeFeature
metadata:
  labels:
    nfd.node.kubernetes.io/node-name: node-n
  name: my-features-for-node-n
spec:
  # Features for NodeFeatureRule matching
  features:
    flags:
      vendor.domain-a:
        elements:
          feature-x: {}
    attributes:
      vendor.domain-b:
        elements:
          feature-y: "foo"
          feature-z: "123"
    instances:
      vendor.domain-c:
        elements:
        - attributes:
            name: "elem-1"
            vendor: "acme"
        - attributes:
            name: "elem-2"
            vendor: "acme"
  # Labels to be created
  labels:
    vendor-feature.enabled: "true"
    vendor-setting.value: "100"
will create two feature labels:
feature.node.kubernetes.io/vendor-feature.enabled: "true"
feature.node.kubernetes.io/vendor-setting.value: "100"
In addition it will advertise hidden/raw features that can be used for
custom rules in NodeFeatureRule objects. Now, creating a NodeFeatureRule
object:
apiVersion: nfd.k8s-sigs.io/v1alpha1
kind: NodeFeatureRule
metadata:
  name: my-rule
spec:
  rules:
  - name: "my feature rule"
    labels:
      "my-feature": "true"
    matchFeatures:
    - feature: vendor.domain-a
      matchExpressions:
        feature-x: {op: Exists}
    - feature: vendor.domain-c
      matchExpressions:
        vendor: {op: In, value: ["acme"]}
will match the features in the NodeFeature object above and cause one
more label to be created:
feature.node.kubernetes.io/my-feature: "true"
Deprecate the '-featurerules-controller' command line flag as the name
does not describe the functionality anymore: in practice it controls the
CRD controller handling both NodeFeature and NodeFeatureRule objects.
The patch introduces a duplicate, more generally named flag,
'-crd-controller'. A warning is printed in the log if the deprecated
'-featurerules-controller' flag is encountered.
Add initial support for handling NodeFeature objects. With this patch
nfd-master watches NodeFeature objects in all namespaces and reacts to
changes in any of them. The node that a certain NodeFeature object
affects is determined by the "nfd.node.kubernetes.io/node-name"
label of the object. When a NodeFeature object targeting a certain
node is changed, nfd-master needs to process all other objects targeting
the same node, too, because there may be dependencies between them.
Add a new command line flag for selecting between gRPC and NodeFeature
CRD API as the source of feature requests. Enabling NodeFeature API
disables the gRPC interface.
  -enable-nodefeature-api    enable NodeFeature CRD API for incoming
                             feature requests, will disable the gRPC
                             interface (defaults to false)
It is not possible to serve gRPC and watch NodeFeature objects at the
same time. This is deliberate, to avoid labeling races, e.g. nfd-worker
sending gRPC requests while NodeFeature objects in the cluster
"override" those changes (labels from the gRPC requests would get
overridden when the NodeFeature objects are processed).