Make it possible to disable kubelet state tracking with
--set topologyUpdater.kubeletStateFiles="" as the documentation
suggests.
Also, fix the documentation regarding the default value of the
topologyUpdater.kubeletStateFiles parameter.
Change the configuration so that, by default, we use a dedicated
serviceaccount for topology-updater (similar to topology-gc, nfd-master
and nfd-worker).
Fix the templates so that the serviceaccount and clusterrolebinding are
only created when topology-updater is enabled (clusterrole was already
handled this way).
This patch also correctly documents the default value of the rbac.create
parameter of topology-updater and topology-gc.
Mount the kubelet podresources socket on an independent path, not under
the kubelet state directory. Otherwise container creation may fail
on mount creation if the topologyUpdater.kubeletPodResourcesSockPath and/or
topologyUpdater.kubeletConfigPath Helm parameters are specified in a
certain way.
Volume/mount setup for the ConfigMap was erroneously inside conditionals
so it was not mounted unless TLS was enabled.
(cherry picked from commit b016def8a3)
- kubelet_internal_checkpoint file is in /var/lib/kubelet/device-plugins, not
  /var/lib/kubelet, and fsWatcher doesn't watch dirs recursively
- e.Name returned from fsWatcher events is a full path, not a basename
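
The two points above can be illustrated with a minimal sketch. It does not
use the watcher library itself; watchDir and isCheckpointEvent are
hypothetical helper names, and the only path taken from this message is the
checkpoint location. The idea is to watch the directory that directly
contains the file, and to compare event names as full paths.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// Location of the checkpoint file as stated in the commit message.
const checkpointPath = "/var/lib/kubelet/device-plugins/kubelet_internal_checkpoint"

// watchDir returns the directory that has to be watched directly. Watching
// a parent directory is not enough when the watcher is not recursive.
// (Hypothetical helper, not part of the original code.)
func watchDir() string {
	return filepath.Dir(checkpointPath)
}

// isCheckpointEvent reports whether an event name refers to the checkpoint
// file. The event name is treated as a full path, so it is compared against
// the full path rather than against the basename.
func isCheckpointEvent(eventName string) bool {
	return filepath.Clean(eventName) == checkpointPath
}

func main() {
	fmt.Println(watchDir())
	fmt.Println(isCheckpointEvent(checkpointPath))
	fmt.Println(isCheckpointEvent("kubelet_internal_checkpoint"))
}
```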
Signed-off-by: pprokop <pprokop@nvidia.com>
Reject malformed extended resource dynamic capacity assignment
Capacity should be in the form of domain.feature.element. Add logic to
func filterExtendedResources to check this; a malformed ExtendedResource
is ignored and logged as an error.
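
A minimal sketch of such a check, assuming the reference must consist of
exactly three non-empty, dot-separated parts. parseCapacityRef is a
hypothetical name and is not the code added to filterExtendedResources.

```go
package main

import (
	"fmt"
	"strings"
)

// parseCapacityRef splits a reference of the form domain.feature.element.
// It returns an error for anything that does not have exactly three
// non-empty parts. (Hypothetical helper; the exact rules are an assumption.)
func parseCapacityRef(ref string) ([3]string, error) {
	parts := strings.Split(ref, ".")
	if len(parts) != 3 {
		return [3]string{}, fmt.Errorf("invalid capacity %q: want domain.feature.element", ref)
	}
	for _, p := range parts {
		if p == "" {
			return [3]string{}, fmt.Errorf("invalid capacity %q: empty component", ref)
		}
	}
	return [3]string{parts[0], parts[1], parts[2]}, nil
}

func main() {
	for _, ref := range []string{"kernel.version.major", "kernel.version", "kernel..major"} {
		if _, err := parseCapacityRef(ref); err != nil {
			// A caller would log the error and skip this resource.
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("accepted:", ref)
	}
}
```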
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Fix a bug where nfd-master with NodeFeature API enabled would crash
when NodeFeatureRule objects were processed in the case where no
NodeFeature objects existed. This was caused by trying to insert values
into a non-initialized NodeFeatureSpec in the code.
This patch adds two safety measures to prevent that from happening in
the future. First, add a constructor function for the NodeFeatureSpec
type, and second, check for an uninitialized object in the function
that inserts new values.
TODO: add unit tests for the API helper functions.
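
The two safety measures can be sketched as follows. The struct below is a
simplified stand-in, not the real NodeFeatureSpec definition, and
InsertAttribute is a hypothetical method name used only for illustration.

```go
package main

import "fmt"

// NodeFeatureSpec is a simplified stand-in for the real API type.
type NodeFeatureSpec struct {
	Attributes map[string]map[string]string
}

// NewNodeFeatureSpec returns a spec with its map initialized, so callers
// never start from a nil map.
func NewNodeFeatureSpec() *NodeFeatureSpec {
	return &NodeFeatureSpec{Attributes: make(map[string]map[string]string)}
}

// InsertAttribute adds a value and initializes any missing map first.
// Writing to a nil map panics in Go, which is the kind of crash the
// checks guard against.
func (s *NodeFeatureSpec) InsertAttribute(feature, key, value string) {
	if s.Attributes == nil {
		s.Attributes = make(map[string]map[string]string)
	}
	if s.Attributes[feature] == nil {
		s.Attributes[feature] = make(map[string]string)
	}
	s.Attributes[feature][key] = value
}

func main() {
	// Zero value, i.e. not created through the constructor.
	var spec NodeFeatureSpec
	spec.InsertAttribute("cpu.model", "family", "6")
	fmt.Println(spec.Attributes["cpu.model"]["family"])
}
```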
This patch adds SEV ASIDs and the related (but distinct) SEV Encrypted State
(SEV-ES) IDs as two quantities to be exposed via extended resources.
In a kernel built with CONFIG_CGROUP_MISC on a suitably equipped AMD CPU, the
root control group will have a misc.capacity file that shows the number of
available IDs in each category.
The added extended resources are:
- sev.asids
- sev.encrypted_state_ids
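
A minimal sketch of reading such a file, assuming each line of misc.capacity
holds a resource name and a number separated by whitespace, and assuming the
kernel names the two categories "sev" and "sev_es". The function names and
the sample values are illustrative only.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMiscCapacity parses "name value" lines and skips anything that does
// not match that shape. (The file format is an assumption of this sketch.)
func parseMiscCapacity(content string) map[string]uint64 {
	out := make(map[string]uint64)
	for _, line := range strings.Split(content, "\n") {
		fields := strings.Fields(line)
		if len(fields) != 2 {
			continue
		}
		n, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			continue
		}
		out[fields[0]] = n
	}
	return out
}

// toExtendedResources maps the assumed kernel names to the resource names
// listed in the commit message.
func toExtendedResources(capacity map[string]uint64) map[string]uint64 {
	names := map[string]string{
		"sev":    "sev.asids",
		"sev_es": "sev.encrypted_state_ids",
	}
	out := make(map[string]uint64)
	for kernelName, resName := range names {
		if v, ok := capacity[kernelName]; ok {
			out[resName] = v
		}
	}
	return out
}

func main() {
	// Made-up sample content; real values depend on the CPU and firmware.
	sample := "sev 400\nsev_es 100\n"
	fmt.Println(toExtendedResources(parseMiscCapacity(sample)))
}
```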
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Previously we were using the default, which even if equal to 0, still
means a 10 minute timeout in practice (because we run the tests by
invoking go test directly). With the addition of the latest e2e tests we
hit the limit and got bitten by it. Set the timeout to 1 hour which
should be enough for anyone...
Change the NFD API handler to retry on node update failures. This works
around transient failures, making sure that failed nodes (i.e. nodes
that we failed to update) don't need to wait for the 1 hour resync
period before being tried again.
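
The general idea of retrying a failed update can be sketched as below. This
is a plain retry loop with a fixed delay, not the mechanism used by the
handler itself; retryUpdate and errTransient are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTransient stands in for a temporary update failure.
var errTransient = errors.New("transient update failure")

// retryUpdate calls update until it succeeds or the attempts run out,
// sleeping for delay between attempts.
func retryUpdate(attempts int, delay time.Duration, update func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		if err = update(); err == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(delay)
		}
	}
	return fmt.Errorf("update failed after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	err := retryUpdate(5, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errTransient // fail the first two attempts
		}
		return nil
	})
	fmt.Println(calls, err)
}
```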
This PR combines dynamically loaded and builtin kernel modules into
one feature called `kernel.enabledmodule`. It's a superset of the
`kernel.loadedmodule` feature.
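
The superset relation can be illustrated with a short sketch that merges two
module lists. enabledModules is a hypothetical function, and how the two
lists are discovered is outside its scope.

```go
package main

import (
	"fmt"
	"sort"
)

// enabledModules returns the sorted, de-duplicated union of loaded and
// builtin module names.
func enabledModules(loaded, builtin []string) []string {
	seen := make(map[string]struct{})
	for _, m := range loaded {
		seen[m] = struct{}{}
	}
	for _, m := range builtin {
		seen[m] = struct{}{}
	}
	out := make([]string, 0, len(seen))
	for m := range seen {
		out = append(out, m)
	}
	sort.Strings(out)
	return out
}

func main() {
	// Illustrative module names only.
	loaded := []string{"vfio", "kvm"}
	builtin := []string{"ext4", "kvm"}
	fmt.Println(enabledModules(loaded, builtin))
}
```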
Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>
Mark the -resource-labels flag (and the corresponding resourceLabels
config option) as deprecated. We now support managing extended resources
via NodeFeatureRule objects. This kludge deserves to go, eventually.