Prevents a potential race between the node-updater pool and the
api-controller when re-configuring nfd-master. Reconfiguration causes a
new api-controller instance to be created, so the NFD API lister might
change in the midst of processing a node update (if the pool was
running). No actual issues related to this have been identified, but
races like this should still be avoided.
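For illustration, a minimal sketch (with hypothetical names, not the
exact nfd-master code) of one way to avoid such a race: the updater pool
takes a snapshot of the controller reference under a lock instead of
re-reading the shared field mid-update.

```go
package master

import "sync"

// nfdController is a hypothetical stand-in for the real api-controller
// type that owns the NFD API listers.
type nfdController struct{}

type nfdMaster struct {
	mu            sync.RWMutex
	nfdController *nfdController
}

// getController returns a snapshot of the current controller, so a node
// update keeps using one consistent instance even if a concurrent
// reconfigure swaps in a new one.
func (m *nfdMaster) getController() *nfdController {
	m.mu.RLock()
	defer m.mu.RUnlock()
	return m.nfdController
}

// setController is called on reconfigure with the freshly created instance.
func (m *nfdMaster) setController(c *nfdController) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.nfdController = c
}
```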
Don't try to be too smart when kubeconfig is needed. In practice, the
nfd-master really doesn't work anymore (with the NodeFeature API
enabled) without a kubeconfig set. This patch fixes crashes happening
when NoPublish is enabled, e.g. in listing all nodes in the nfd api
handler and in getting single node objects in the node updater pool.
This patch changes the kubeconfig parsing to happen at the creation of
the nfd-master instance. We don't need to do that at reconfigure time as
none of the dynamic config options affect it. Unit tests are adjusted
accordingly.
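A rough sketch of the resulting shape (hypothetical names, simplified
error handling): the kubeconfig is resolved once in the constructor and
stored on the instance, so reconfiguration never touches it.

```go
package master

import (
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/clientcmd"
)

type nfdMaster struct {
	kubeconfig *rest.Config
}

// newNfdMaster parses the kubeconfig exactly once, at instance creation.
func newNfdMaster(kubeconfigPath string) (*nfdMaster, error) {
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfigPath)
	if err != nil {
		return nil, err
	}
	return &nfdMaster{kubeconfig: cfg}, nil
}

// reconfigure applies dynamic config options; none of them affect the
// kubeconfig, so m.kubeconfig is left untouched.
func (m *nfdMaster) reconfigure() error {
	// ... re-create the api-controller, node-updater pool, etc. ...
	return nil
}
```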
This started as a small effort to simplify the usage of the "ready"
channel in nfd-master. It extended into a wider simplification and
unification of the channel usage.
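For reference, the conventional Go pattern such simplifications usually
converge on (a generic sketch, not the exact nfd-master code): a single
close-once channel that any number of goroutines can wait on.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// A struct{} channel carries no data; closing it broadcasts
	// "ready" to every current and future receiver.
	ready := make(chan struct{})

	go func() {
		time.Sleep(100 * time.Millisecond) // stand-in for start-up work
		close(ready)                       // signal readiness exactly once
	}()

	<-ready // unblocks once the channel is closed
	fmt.Println("ready")
}
```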
Change the handling of the LabelWhiteList config option to use a pointer
to detect when the option is unset. This doesn't fix any detected crash
but is a general improvement and stabilization that also makes testing
easier. Also, use the regexp type from the core libs for the config
struct - dropping the unmarshalling code for our custom regexp type - as
the core regexp type now implements the unmarshaller interface itself.
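A minimal sketch of the idea (field name assumed from the option name):
a nil pointer distinguishes "unset" from an explicitly configured
pattern, and since Go 1.21 *regexp.Regexp implements
encoding.TextUnmarshaler, so no custom wrapper type is needed.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
)

// config is a sketch of the relevant part of the nfd-master config.
type config struct {
	// nil means the option was not set in the config file.
	LabelWhiteList *regexp.Regexp `json:"labelWhiteList,omitempty"`
}

func main() {
	var c config
	if err := json.Unmarshal([]byte(`{"labelWhiteList": "^feature[.]"}`), &c); err != nil {
		panic(err)
	}
	if c.LabelWhiteList == nil {
		fmt.Println("labelWhiteList unset, using default")
		return
	}
	fmt.Println("matches:", c.LabelWhiteList.MatchString("feature.node/foo"))
}
```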
Prevent excess queries of node objects from the Kubernetes apiserver.
This significantly speeds up node updates (and reduces the load on the
apiserver) as the client-side throttling (which is good) does not bite
us that hard.
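One common client-go way to get this effect (a sketch; the exact caching
mechanism in nfd-master may differ) is to serve node lookups from a
shared informer's local cache instead of issuing a GET per update:

```go
package master

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	listersv1 "k8s.io/client-go/listers/core/v1"
)

// newNodeLister returns a lister backed by an informer cache, so repeated
// node lookups hit local memory instead of the apiserver (and thus never
// trip the client-side rate limiter).
func newNodeLister(cs kubernetes.Interface, stop <-chan struct{}) listersv1.NodeLister {
	factory := informers.NewSharedInformerFactory(cs, 1*time.Hour)
	lister := factory.Core().V1().Nodes().Lister()
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	return lister
}
```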
Prevents races when (re-)starting the queue. There are no reports of
issues related to this (and I haven't come up with an actual failure
path in the current code), but it is better to be safe and follow best
practices.
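A sketch of the best practice referred to here (simplified, hypothetical
types): a client-go workqueue that has been shut down must not be
reused, so each (re-)start creates a fresh queue instead of sharing one
instance across restarts.

```go
package master

import "k8s.io/client-go/util/workqueue"

type nodeUpdaterPool struct {
	queue workqueue.RateLimitingInterface
}

// start creates a fresh queue on every (re-)start.
func (p *nodeUpdaterPool) start() {
	p.queue = workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	// ... launch worker goroutines draining p.queue ...
}

// stop shuts the queue down; the workers exit once it is drained.
func (p *nodeUpdaterPool) stop() {
	p.queue.ShutDown()
}
```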
Prevents (rare) races on nfd-master reconfiguration. Previously the
scheme was registered at nfd API controller creation/startup time. This
caused a race where some lister/informer goroutines of the previous
(stopped) controller were still running and accessing (reading) the
scheme while we were updating (writing) it.
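A sketch of the fix, assuming the code-generated AddToScheme for the NFD
API group: registration moves to package init(), which runs exactly
once, so no goroutine can observe the scheme mid-write.

```go
package nfdcontroller

import (
	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
	"k8s.io/client-go/kubernetes/scheme"

	nfdv1alpha1 "sigs.k8s.io/node-feature-discovery/pkg/apis/nfd/v1alpha1"
)

// Registering the NFD types once at package load time removes the write
// to the shared scheme that previously raced with lister/informer
// goroutines of an old, stopped controller instance.
func init() {
	utilruntime.Must(nfdv1alpha1.AddToScheme(scheme.Scheme))
}
```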
APIVersion and Kind are empty in the returned namespace object
and need to be set explicitly.
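A minimal sketch of the workaround (helper name hypothetical):
client-go's typed GET returns objects with empty TypeMeta, so the fields
are filled in explicitly before the object is used further.

```go
package master

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// getNamespace fetches a namespace and sets APIVersion and Kind, which
// come back empty from the typed client.
func getNamespace(ctx context.Context, cs kubernetes.Interface, name string) (*corev1.Namespace, error) {
	ns, err := cs.CoreV1().Namespaces().Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return nil, err
	}
	ns.TypeMeta = metav1.TypeMeta{APIVersion: "v1", Kind: "Namespace"}
	return ns, nil
}
```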
Signed-off-by: Oleg Zhurakivskyy <oleg.zhurakivskyy@intel.com>
Treat node updates like a reconciliation loop: keep retrying a node
update for as long as it fails. A permanently failing node update likely
indicates a bug in the nfd code (there should be no reason for it to
fail forever), and it's better to see that clearly in the logs/metrics
than to give up after a few retries.
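Sketched with a client-go workqueue (hypothetical wiring; the real
handler does the actual node update): the key is re-queued with rate
limiting on every failure, with no retry cap, and the backoff counter is
reset only on success.

```go
package master

import "k8s.io/client-go/util/workqueue"

// reconcileNode retries forever (with backoff) instead of dropping the
// key after N attempts, so a permanently failing update stays visible
// in logs and metrics.
func reconcileNode(queue workqueue.RateLimitingInterface, nodeName string, update func(string) error) {
	if err := update(nodeName); err != nil {
		queue.AddRateLimited(nodeName) // no Forget, no retry cap
		return
	}
	queue.Forget(nodeName) // success: reset the backoff counter
}
```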
Fixes a memory leak that happened when stopping (and then re-starting)
the nfd API controller. The stop channel was not used properly, which
caused the underlying informer to keep on running.
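A sketch of the corrected pattern (simplified to core informers; the
real controller uses NFD's generated ones): each controller instance
owns its stop channel and closes it when stopped, which terminates the
informer goroutines.

```go
package nfdcontroller

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
)

type controller struct {
	stopChan chan struct{}
}

// newController starts informers bound to a stop channel owned by this
// instance.
func newController(cs kubernetes.Interface) *controller {
	c := &controller{stopChan: make(chan struct{})}
	factory := informers.NewSharedInformerFactory(cs, 1*time.Hour)
	factory.Start(c.stopChan) // informers run until stopChan is closed
	return c
}

// stop closes the channel; without this, the informer goroutines (and
// their caches) of a replaced controller instance live on -- the leak
// described above.
func (c *controller) stop() {
	close(c.stopChan)
}
```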
Rewrite the generate.sh script into update_codegen, inspired by the
k8s.io/code-generator documentation.
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Correctly patch the "status" subresource. This got broken when
refactoring the code in 7a050e7cf9, and
wasn't caught by the unit tests because the fake Kubernetes client
doesn't handle subresources the way the real apiserver does.
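For reference, the shape of a correct status patch with client-go
(sketch; patch payload construction omitted): the subresource must be
named explicitly in the Patch call.

```go
package master

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// patchNodeStatus targets the "status" subresource explicitly; omitting
// it leaves .status untouched on a real apiserver, even though the fake
// client happily applies the patch either way.
func patchNodeStatus(ctx context.Context, cs kubernetes.Interface, nodeName string, patch []byte) error {
	_, err := cs.CoreV1().Nodes().Patch(ctx, nodeName, types.JSONPatchType, patch, metav1.PatchOptions{}, "status")
	return err
}
```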
Implement some frequently used helper functions in the package.
This patch also contains big changes to the nfd-master unit tests. Much
of this is about migrating from the mocked apihelper interface to the
fake Kubernetes client, which provides a bit more apiserver-like
functionality. At the same time there is quite a bit of renaming in the
tests, shortening and unifying the naming and getting rid of the
extensive usage of "mock" everywhere.
This change is part of an effort to remove the pkg/apihelper package.
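A sketch of the test style this migration enables (test and label names
invented for illustration): the fake clientset tracks objects like a
simplified apiserver, so tests exercise real client calls instead of
asserting on a mocked interface.

```go
package master

import (
	"context"
	"testing"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes/fake"
)

func TestUpdateNodeLabels(t *testing.T) {
	// Seed the fake apiserver with one node object.
	cs := fake.NewSimpleClientset(&corev1.Node{
		ObjectMeta: metav1.ObjectMeta{Name: "node-1"},
	})

	node, err := cs.CoreV1().Nodes().Get(context.TODO(), "node-1", metav1.GetOptions{})
	if err != nil {
		t.Fatal(err)
	}
	node.Labels = map[string]string{"feature.example/foo": "true"}
	if _, err := cs.CoreV1().Nodes().Update(context.TODO(), node, metav1.UpdateOptions{}); err != nil {
		t.Fatal(err)
	}
}
```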
GetKubeconfig is useful helper functionality shared across the codebase,
so move it into a "safe" location.
We need to parse the kubeconfig (and initialize the apihelper) even with
-no-publish, as the PodResourcesScanner accesses the k8s API even if
we're not publishing/updating NRTs.
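A plausible sketch of the helper (the real signature may differ): fall
back to the in-cluster config when no explicit path is given.

```go
package utils

import (
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/clientcmd"
)

// GetKubeconfig builds a client config from the given kubeconfig path,
// falling back to the in-cluster config when the path is empty.
func GetKubeconfig(path string) (*rest.Config, error) {
	if path == "" {
		return rest.InClusterConfig()
	}
	return clientcmd.BuildConfigFromFlags("", path)
}
```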
This patch separates the gRPC health server from the deprecated gRPC
server (disabled by default, replaced by the NodeFeature CRD API) used
for node labeling requests. The new health server runs on the hardcoded
TCP port 8082.
The main motivation for this change is to make Kubernetes' built-in
gRPC liveness probes function when TLS is enabled (as they don't
support TLS).
The health server itself is a naive implementation (as it was before),
basically only checking that nfd-master has started and hasn't crashed.
The patch adds a TODO note to improve the functionality.
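A minimal sketch of such a health endpoint using the stock gRPC health
service (function name hypothetical, error handling simplified): it runs
on its own plaintext listener so kubelet's gRPC probes work even when
the labeling server uses TLS.

```go
package master

import (
	"net"

	"google.golang.org/grpc"
	"google.golang.org/grpc/health"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

// startHealthServer serves the standard gRPC health service on the
// hardcoded port 8082, separate from the deprecated labeling server.
func startHealthServer() (*grpc.Server, error) {
	lis, err := net.Listen("tcp", ":8082")
	if err != nil {
		return nil, err
	}
	s := grpc.NewServer()
	// health.NewServer() reports SERVING by default, matching the naive
	// "started and not crashed" semantics described above.
	healthpb.RegisterHealthServer(s, health.NewServer())
	go func() { _ = s.Serve(lis) }()
	return s, nil
}
```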