Correctly patch the "status" subresource. This got broken when
refactoring the code in 7a050e7cf9 and
wasn't even catched by the unit tests as the fake kubernetes client
doesn't handle subresources as the real apiserver does.
Implement some of frequently used helper functions inpackage.
This patch also contains big changes to the nfd-master unit tests. Much
of this is about migrating from the mocked apihelper interface to fake
kubernetes client that provides a bit more apiserver'ish functionality.
At the same time there is quite a bit of renaming in the tests,
shortening and unifying naming and getting rid of the extensive usage of
"mock" everywhere.
This change is part of an effort to remove the pkg/apihelper package.
GetKubeconfig is useful helper functionality shared accross the codebase
so move it into a "safe" location.
We need to parse kubeconfig (and initialize the apihelper) even with
-no-publish as the PodResourcesScanner accesses the k8s API even if
we're not publishing/updating NRTs.
This patch separates the gRPC health server from the deprecated gRPC
server (disabled by default, replaced by the NodeFeature CRD API) used
for node labeling requests. The new health server runs on hardcoded TCP
port number 8082.
The main motivation for this change is to make the Kubernetes' built-in
gRPC liveness probes to function if TLS is enabled (as they don't
support TLS).
The health server itself is a naive implementation (as it was before),
basically only checking that nfd-master has started and hasn't crashed.
The patch adds a TODO note to improve the functionality.
Extend the format of feature matcher terms (the elements of the
arrayspecified under under matchFeatures field) with new matchName
field. The value of this field is an expression that is evaluated
against the names of feature elements instead of their values (values
are matched with the matchExpressions field, instead).
The matchName field is useful e.g. in template rules for creating
per-feature-element labels based on feature names (instead of values)
and in non-template rules for checking if (at least) one of certain
feature element names are present.
If both matchExpressions and matchName for certain feature matcher term
is specified, they both must match in order to get an overall match.
Also, in this case the list of matched features (used in templating) is
the union of the results from matchExpressions and matchName.
An example of creating an "avx512" label if any AVX512* CPUID feature is
present:
- name: "avx wildcard rule"
labels:
avx512: "true"
matchFeatures:
- feature: cpu.cpuid
matchName: {op: InRegexp, value: ["^AVX512"]}
An example of a template rule creating a dynamic set of labels based on
the existence of certain kconfig options.
- name: "kconfig template rule"
labelsTemplate: |
{{ range .kernel.config }}kconfig-{{ .Name }}={{ .Value }}
{{ end }}
matchFeatures:
- feature: kernel.config
matchName: {op: In, value: ["SWAP", "X86", "ARM"]}
NOTE: this patch changes the corner case of nil/null match expressions
with instance features (i.e. "matchExpressions: null"). Previously, we
returned all instances for templating but now a nil match expression is
not evaluated and no instances for templating are returned.
Drop the private fields – that were supposed to be used for caching parsed
templates – from the Rule type. Keep the API typedefs cleaner and
simpler. Moreover, the caching was not even used in practice,
effectively complicating code without any benefit: the way the types
are used in nfd-master creates a local copy of Rule type storing the
cached template in the copy, wasting it from any future users.
There are also other possible caveats in caching like we tried to do it.
For example the objects returned by the api lister are supposed to be
treated as read-only - in particular if we would be to modify them there
should at least be proper locking in place as nfd-master potentially
processes the same rule (the same Go object) in parallel for multiple
nodes. If any optimization like this will be pursued it should be done
properly, probably with private type(s) at the consumer's end, not
contaminating the API types.
Drop the creation helper functions as one step in an effort to tidy up
the api package. These functions were not much used outside unit tests
anyway, the static rules of the nfd-worker custom feature source being
the only exception (and if those happened to be invalid we'd catch that
e.g. in the e2e-tests).
This patch creates a owner-dependent relationship between the
nfd-worker pod and the NodeFeature object that it creates. With this
change the orphaned NodeFeature object(s) gets automatically
garbage-collected when the nfd-worker pod goes away, without the need
for manual clean-up actions.
Drop the private field for caching parsed regexp from the
MatchExpression type. This tidies up the API type definition and not so
tied with particular implementation details. The change also elimiates
potential concurrency problems as no locking is in place in the API
types.
If caching will be desired in the future, it's better to do it properly
in a separate package, not directly in the API types.
Fix flakyness of unit tests by adding back the sorting of matched
feature elements that was unadvisedly removed in
63c22551df. This might help debugging some
corner cases in real-life scenarios (when using templating), too.
Add new autoDefaultNs (default is "true") config option to nfd-master.
Setting the config option to false stops NFD from automatically adding
the "feature.node.kubernetes.io/" prefix to labels, annotations and
extended resources. Taints are not affected as for them no prefix is
automatically added. The user-visible part of enabling the option change
is that NodeFeatureRules, local feature files, hooks and configuration
of the "custom" may need to be altereda (if the auto-prefixing is
relied on).
For now, the config option defaults to "true", meaning no change in
default behavior. However, the intent is to change the default to
"false" in a future release, deprecating the option and eventually
removing it (forcing it to "false").
The goal of stopping doing "auto-prefixing" is to simplify the operation
(of nfd and users). Make the naming more straightforward and easier to
understand and debug (kind of WYSIWYG), eliminating peculiar corner
cases:
1. Make validation simpler and unambiguous
2. Remove "overloading" of names, i.e. the mapping two values to the
same actual name. E.g. previously something like
labels:
feature.node.kubernetes.io/foo: bar
foo: baz
Could actually result in node label:
feature.node.kubernetes.io/foo: baz
3. Make the processing/usagee of the "rule.matched" and "local.labels"
feature in NodeFeatureRules unambiguous and more understadable. E.g.
previously you could have node label
"feature.node.kubernetes.io/local-foo: bar" but in the NodeFeatureRule
you'd need to use the unprefixed name "local-foo" or the fully
prefixed name, depending on what was specified in the feature file (or
hook) on the node(s).
NOTE: setting autoDefaultNs to false is a breaking change for users who
rely on automatic prefixing with the default feature.node.kubernetes.io/
namespace. NodeFeatureRules, feature files, hooks and custom rules
(configuration of the "custom" source of nfd-worker) will need to be
altered. Unprefixed labels, annoations and extended resources will be
denied by nfd-master.
Make the handling of unprefixed names (of labels, annotations and
extended resources) well-defined and predictable. Previously the
resulting output was not predictable in case the same name was coming in
both the unprefixed and prefixed form, say unprefixed "foo=bar" coming from
one source (be it nfd-worker or NodeFeature(Rule)) and
"feature.node.kubernetes.io/foo=baz" from a NodeFeature(Rule).
Previously the output value was randomly either "bar" or "baz".
This patch adds prefixes to all names early in the processing
"pipeline", preventing random name clashes later on.
Fix NodeFeatureRule templating in cases where multiple matchFeatures
terms are targeting the same feature. Previously, only matched feature
elements from the last matcher terms were used as the input to the
template. However, the input should contain all matched elements from
all matcher terms.
For example, consider the example rule snippet below:
...
labelsTemplate: |
{{ range .pci.device }}vendor.io/pci-device.{{ .class }}-{{ .device }}=exists
{{ end }}
matchFeatures:
- feature: pci.device
matchExpressions:
class: {op: InRegexp, value: ["^03"]}
vendor: {op: In, value: ["1234"]}
- feature: pci.device
matchExpressions:
class: {op: InRegexp, value: ["^12"]}
This rule matches if both a pci device of class 03 from vendor 1234
exists and a pci device of class 12 (from any vendor) exists.
Previously, the template would only generate labels from the devices in
class 12 (as that's the last term). With this patch the template creates
device labels from devices in both classes 03 and 12.
This patch addresses issues with slow node status (extended resources)
updates. Previously we did just a few retries in quick succession which
could result in the node update failing, just because node status was
updated slower than our retry window. The patch mitigates the issue by
increasing the number of tries to 15. In addition, it creates a
ratelimiter with a longer per-item (per-node) base delay.
The patch also fixes the e2e-tests to expose the issue.
Implements three metrics for nfd-gc:
- nfd_gc_build_info: version information of nfd-gc.
- nfd_gc_objects_deleted_total: total number of NodeFeature and
NodeResourceTopology objects deleted by nfd-gc.
- nfd_gc_object_delete_failures_total: number of errors encountered when
deleting NodeFeature and NodeResourceTopology objects.