This started as a small effort to simplify the usage of the "ready"
channel in nfd-master. It grew into a wider simplification/unification
of the channel usage.
Change the handling of the LabelWhiteList config option to use a pointer
to detect when the option is unset. This doesn't fix any detected crash
but is a general improvement and stabilization that makes testing easier.
Also, use the regexp type from the core libs for the config struct -
dropping the unmarshalling code for our custom regexp type - as the core
regexp now implements an unmarshaller itself.
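A minimal sketch of the pattern described above, not the actual nfd-master
code. The `Config` struct, `parseConfig` and `labelAllowed` are hypothetical
names, and plain `encoding/json` stands in for the real config loader. It
assumes Go 1.21 or newer, where `*regexp.Regexp` implements
`encoding.TextUnmarshaler`.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
)

// Config is a hypothetical stand-in for the nfd-master config struct.
type Config struct {
	// A nil pointer means the option was not set in the config at all.
	LabelWhiteList *regexp.Regexp `json:"labelWhiteList,omitempty"`
}

// parseConfig decodes raw JSON into a Config. No custom unmarshalling
// code is needed: encoding/json calls (*regexp.Regexp).UnmarshalText.
func parseConfig(raw string) (*Config, error) {
	c := &Config{}
	if err := json.Unmarshal([]byte(raw), c); err != nil {
		return nil, err
	}
	return c, nil
}

// labelAllowed reports whether name passes the whitelist. An unset
// whitelist (nil pointer) allows every label.
func labelAllowed(c *Config, name string) bool {
	return c.LabelWhiteList == nil || c.LabelWhiteList.MatchString(name)
}

func main() {
	for _, raw := range []string{`{}`, `{"labelWhiteList": "^cpu-"}`} {
		c, err := parseConfig(raw)
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		fmt.Printf("config=%s set=%v pci-1234=%v cpu-avx=%v\n", raw,
			c.LabelWhiteList != nil,
			labelAllowed(c, "pci-1234"), labelAllowed(c, "cpu-avx"))
	}
}
```

With a pointer, "option absent" (nil) is distinguishable from "option set to
an empty pattern", which a value-typed field cannot express.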
Prevent excess queries of node objects from the Kubernetes apiserver.
This significantly speeds up node updates (and reduces the load on the
apiserver) as the client-side throttling (which is good in itself) no
longer bites us that hard.
Problem: memory requests and limits were set for the `master` process in
PR #1631. The values do not follow best practices for setting them, but
the intention was to provide defaults for a wide variety of clusters,
including small ones.
Solution: provide solid documentation about the problems that might
happen in production environments when
`resource.memory.requests << resource.memory.limits`. Add a link to
relevant external sources, which include the advice from Tim Hockin:
> Always set memory limit == request
Signed-off-by: cmontemuino <1761056+cmontemuino@users.noreply.github.com>
Prevents races when (re-)starting the queue. There are no reports of
issues related to this (and I haven't come up with any actual failure
path in the current code) but it is better to be safe and follow best
practices.
Prevents (rare) races on nfd-master reconfiguration. Previously the
scheme was registered at nfd API controller creation/startup time. This
caused a race with some lister/informer goroutines of the previous
(stopped) controller still running and accessing (reading) the scheme
while we were updating (writing) it.
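One common way to avoid this kind of read/write race is to register shared
state exactly once per process instead of on every controller creation. The
sketch below illustrates that idea with `sync.Once`. It is an assumption about
the general technique, not the actual change made here, and `registry`,
`ensureRegistered` and `newController` are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

var (
	// registry is a hypothetical stand-in for a shared type scheme.
	registry     = map[string]string{}
	registerOnce sync.Once
)

// ensureRegistered populates the registry exactly once, no matter how
// many controllers are created during the lifetime of the process.
func ensureRegistered() {
	registerOnce.Do(func() {
		registry["NodeFeature"] = "example-group/v1" // illustrative value
	})
}

// newController simulates (re-)creating a controller. After the one-time
// registration it only ever reads the registry.
func newController() string {
	ensureRegistered()
	return registry["NodeFeature"]
}

func main() {
	var wg sync.WaitGroup
	// Simulate several controller restarts happening concurrently.
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = newController()
		}()
	}
	wg.Wait()
	fmt.Println("registered:", newController())
}
```

Because the only write happens inside `registerOnce.Do`, goroutines left over
from a previous controller can keep reading without racing against a writer.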
APIVersion and Kind are empty in the returned namespace object
and need to be set explicitly.
Signed-off-by: Oleg Zhurakivskyy <oleg.zhurakivskyy@intel.com>
Treat node updates like a reconciliation loop. Keep retrying the node
update for as long as it fails. A permanently failing node update likely
indicates a bug in the nfd code (there should be no reason for it to
fail forever) and it's better to clearly see it in the logs/metrics
rather than giving up after a few retries.
Fixes a memory leak that happened when stopping (and then re-starting)
the nfd API controller. The stop channel was not used properly, which
caused the underlying informer to keep on running.
Plan the removal of the -crd-controller flag along with the gRPC API.
This flag does not make much sense after that as all communication with
nfd-worker is based on CRDs - with the CRD controller disabled
nfd-master is virtually a functionless stub.