The feature gate is locked to true. That is, it is not possible to revert
back to the gPRC-based communication which makes the gRPC API ready for
removal.
The NodeFeatureGroup is an NFD-specific custom resource that is designed for
grouping nodes based on their features. NFD-Master watches for NodeFeatureGroup
objects in the cluster and updates the status of the NodeFeatureGroup object
with the list of nodes that match the feature group rules. The NodeFeatureGroup
rules follow the same syntax as the NodeFeatureRule rules.
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Extend the format of feature matcher terms (the elements of the
arrayspecified under under matchFeatures field) with new matchName
field. The value of this field is an expression that is evaluated
against the names of feature elements instead of their values (values
are matched with the matchExpressions field, instead).
The matchName field is useful e.g. in template rules for creating
per-feature-element labels based on feature names (instead of values)
and in non-template rules for checking if (at least) one of certain
feature element names are present.
If both matchExpressions and matchName for certain feature matcher term
is specified, they both must match in order to get an overall match.
Also, in this case the list of matched features (used in templating) is
the union of the results from matchExpressions and matchName.
An example of creating an "avx512" label if any AVX512* CPUID feature is
present:
- name: "avx wildcard rule"
labels:
avx512: "true"
matchFeatures:
- feature: cpu.cpuid
matchName: {op: InRegexp, value: ["^AVX512"]}
An example of a template rule creating a dynamic set of labels based on
the existence of certain kconfig options.
- name: "kconfig template rule"
labelsTemplate: |
{{ range .kernel.config }}kconfig-{{ .Name }}={{ .Value }}
{{ end }}
matchFeatures:
- feature: kernel.config
matchName: {op: In, value: ["SWAP", "X86", "ARM"]}
NOTE: this patch changes the corner case of nil/null match expressions
with instance features (i.e. "matchExpressions: null"). Previously, we
returned all instances for templating but now a nil match expression is
not evaluated and no instances for templating are returned.
In some occasions the node status (capacity) takes a lot of time to
update. Increase the timeout on extended resource tests. Revert the default
timeout back to 10s.
This patch addresses issues with slow node status (extended resources)
updates. Previously we did just a few retries in quick succession which
could result in the node update failing, just because node status was
updated slower than our retry window. The patch mitigates the issue by
increasing the number of tries to 15. In addition, it creates a
ratelimiter with a longer per-item (per-node) base delay.
The patch also fixes the e2e-tests to expose the issue.
We now have metrics for getting detailed information about the NFD
instances running. There should be no need to pollute the node object
with NFD version annotations.
One problem with the annotations also that they were incomplete in the
sense that they only covered nfd-master and nfd-worker but not
nfd-topology-updater or nfd-gc.
Also, there was a problem with stale annotations, giving misleading
information. E.g. there was no way to remove old/stale master.version
annotations if nfd-master was scheduled on another node where it was
previously running.
This PR aims to support the dynamic values for labels in the
NodeFeatureRule CRD, it would offer more flexible labeling for users.
To achieve this, we check whether label value starts with "@", and if
it's the case, we will get the value of the feature value, and update
the value of the label with the feature value.
Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>
Implement a new generic type nodeListPropertyMatcher, a generic Gomega
matcher for matching basically any property of a set of node objects. We
will be using it for verifying labels, annotations, extended resources
and taints for now. This moves the tests in a more Gomega'ish direction,
leveraging code re-use and providing way more informative error messages
in case of test failures.
The patch adds a new eventuallyNonControlPlaneNodes helper assertion for
asserting all (non-control-plane) nodes in the cluster, intended to
replace the ugly simplePoll() helper function.
This patch implements a matcher for node labels and converts tests to
use it instead of the old checkForNodeLabels helper function.
This PR fixes the resync-period configuration option of the nfd-master.
In fact, previously, changes were not reflected in the nfd-master at
runtime. e2e tests are also implemented to make sure that the fix is
already working as expected.
Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>
Eliminate all context.TODO() from the e2e tests and use ginkgo context
instead. This ensures that calls involving context are properly
cancelled and return fast in case the tests get aborted.
Add support for management of Extended Resources via the
NodeFeatureRule CRD API.
There are usage scenarios where users want to advertise features
as extended resources instead of labels (or annotations).
This patch enables the discovery of extended resources, via annotation
and patch of node.status.capacity and node.status.allocatable. By using
the NodeFeatureRule API.
Co-authored-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Co-authored-by: Markus Lehtonen <markus.lehtonen@intel.com>
Co-authored-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Disallow taints having a key with "kubernetes.io/" or "*.kubernetes.io/"
prefix. This is a precaution to protect the user from messing up with
the "official" well-known taints from Kubernetes itself. The only
exception is that the "nfd.node.kubernetes.io/" prefix is allowed.
However, there is one allowed NFD-specific namespace (and its
sub-namespaces) i.e. "feature.node.kubernetes.io" under the
kubernetes.io domain that can be used for NFD-managed taints.
Also disallow unprefixed taint keys. We don't add a default prefix to
unprefixed taints (like we do for labels) from NodeFeatureRules. This is
to prevent unpleasant surprises to users that need to manage matching
tolerations for their workloads.
Make the default master pod run with no special options. Move the
customizations of the master pod to the setup functions of the tests
that actually need it.
Also, cleanup the configuration of nfd-worker of some tests.
The node cleanup function was not removing all NFD-labels. It omitted
NFD-originated labels that used a non-default label namespace. This
patch fixes the issue by getting all NFD-managed labels from the special
annotation (nfd.node.kubernetes.io/feature-labels).
The patch also adds the ability to cleanup extended resources in a
similar way. This will be needed by future work.
Also changes the order of cleaning up CRs and the node. It is the right
order as cleaning up the CRs may still update the node.
Similar to the nfd-worker, in this PR we want to support the
dynamic run-time configurability through a config file for the nfd-master.
We'll use a json or yaml configuration file along with the fsnotify in
order to watch for changes in the config file. As a result, we're
allowing dynamic control of logging params, allowed namespaces,
extended resources, label whitelisting, and denied namespaces.
Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>
Reduce the wait time of nfd-worker pods to be in ready-state (before
proceeding with tests) from five to two seconds. Make tests faster to
run. Two seconds should be enough for nfd-workers to do their job and
get nodes labeled.
The docker image that used during e2e test
composed of repo and tag flags that are
passed to the test itself.
The problem is that the docker image initialized
before the flags are parsed. Hence, it will always contains
the default flags value.
Moving the variable into a separate function, fixing the issue.
Also, moving the global variables to `e2e_test.go` since
it commonly used by all tests.
Signed-off-by: Talor Itzhak <titzhak@redhat.com>