This patch mitigates inadvertent termination of nfd-master pods by the
liveness probe on big clusters.
With a recent change, nfd-master started to wait (block) for the informer
caches to sync before starting the main loop. As a consequence, the gRPC
health endpoint does not respond until the caches have been synced. In big
clusters, syncing the NodeFeature object cache takes a long time as the
objects are big and there is (at least) one per node in the cluster. Thus
the liveness probe kicks in and kills the nfd-master pod before it's ready.
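One way to mitigate this, shown here only as a minimal Go sketch under
assumptions (hypothetical run() wiring, arbitrary port), is to register the
gRPC health service and report SERVING before blocking on the cache sync,
so the liveness probe is answered during a long sync:

    package nfdmaster

    import (
        "fmt"
        "net"

        "google.golang.org/grpc"
        "google.golang.org/grpc/health"
        healthpb "google.golang.org/grpc/health/grpc_health_v1"
        "k8s.io/client-go/tools/cache"
    )

    // startHealthServer registers the standard gRPC health service so the
    // kubelet's liveness probe gets a response while the caches are syncing.
    func startHealthServer(addr string) (*health.Server, error) {
        lis, err := net.Listen("tcp", addr)
        if err != nil {
            return nil, err
        }
        srv := grpc.NewServer()
        h := health.NewServer()
        healthpb.RegisterHealthServer(srv, h)
        go srv.Serve(lis) // errors ignored in this sketch
        return h, nil
    }

    func run(informer cache.SharedIndexInformer, stop <-chan struct{}) error {
        h, err := startHealthServer(":8082") // port is an assumption
        if err != nil {
            return err
        }
        // Report SERVING before the potentially long cache sync so the
        // liveness probe does not kill the pod on big clusters.
        h.SetServingStatus("", healthpb.HealthCheckResponse_SERVING)

        if !cache.WaitForCacheSync(stop, informer.HasSynced) {
            return fmt.Errorf("failed to sync informer caches")
        }
        // ... main loop ...
        return nil
    }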
In some cases it's desirable to control the automatic garbage collection
of NodeFeature objects.
Add an option to disable setting the owner reference to the Pod on
NodeFeature objects.
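For illustration only, a hedged Go sketch of the mechanism (the helper name
and option plumbing are hypothetical, and the NFD API import path is an
assumption): with the owner reference in place the NodeFeature object is
garbage-collected together with the owning pod, while the new option simply
leaves it unset:

    package nfdworker

    import (
        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

        // assumed import path for the NFD API types
        nfdv1alpha1 "sigs.k8s.io/node-feature-discovery/pkg/apis/nfd/v1alpha1"
    )

    // setOwnerRef is a hypothetical helper: when enabled, the NodeFeature
    // object is owned by the nfd-worker pod and is garbage-collected when
    // the pod goes away; when disabled, the object outlives the pod.
    func setOwnerRef(nf *nfdv1alpha1.NodeFeature, pod *corev1.Pod, enable bool) {
        if !enable {
            nf.OwnerReferences = nil
            return
        }
        nf.OwnerReferences = []metav1.OwnerReference{{
            APIVersion: "v1",
            Kind:       "Pod",
            Name:       pod.Name,
            UID:        pod.UID,
        }}
    }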
Closes: 1817
Signed-off-by: Oleg Zhurakivskyy <oleg.zhurakivskyy@intel.com>
Drop the resourceLabels config file option and the corresponding
-resource-labels command line flag. They were deprecated in NFD v0.13 so
it's time to let them go. NodeFeatureRule(s) should be used to manage
extended resources (ERs) instead.
The issue is that the k3d/kind cluster created by ctlptl runs inside
containers (which serve as the virtual hosts).
Host folders that are scanned by NFD feature discovery must be mounted
into the container (the virtual host); otherwise the nfd-worker container
running inside the virtual host only sees the base image's rootfs /boot
and /lib folders, which are usually empty, leading to discovery failure.
Signed-off-by: Chaoyi Huang <joehuang.sweden@gmail.com>
We have to run our NFD workers on the host network.
We also need additional environment variables such as KUBERNETES_SERVICE_HOST
and KUBERNETES_SERVICE_PORT.
To achieve this we can simply add generic Helm values. The default behavior is not changed.
Signed-off-by: Tobias Giese <tgiese@nvidia.com>
Simplify the code and reduce possible error scenarios by dropping
fsnotify-based re-configuration from nfd-master and nfd-worker. This also
eliminates repeated re-configuration in scenarios where kubelet
continuously (every minute) touches the mounted file (configmap) on the
filesystem.
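For context, a minimal sketch, not the actual NFD code, of the kind of
fsnotify-based watcher being dropped; because kubelet periodically rewrites
the mounted configmap path, events fire even when the configuration content
has not changed:

    package nfdconfig

    import (
        "log"

        "github.com/fsnotify/fsnotify"
    )

    // watchConfig re-runs reload() on every filesystem event for the config
    // file. Periodic rewrites of the mounted ConfigMap trigger this even
    // when the content is unchanged, causing repeated re-configuration.
    func watchConfig(path string, reload func() error) error {
        watcher, err := fsnotify.NewWatcher()
        if err != nil {
            return err
        }
        defer watcher.Close()

        if err := watcher.Add(path); err != nil {
            return err
        }

        for {
            select {
            case event := <-watcher.Events:
                log.Printf("config event: %v", event)
                if err := reload(); err != nil {
                    log.Printf("re-configuration failed: %v", err)
                }
            case err := <-watcher.Errors:
                log.Printf("watcher error: %v", err)
            }
        }
    }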
Also modifies the Helm and kustomize deployments so that nfd-master,
nfd-worker and nfd-topology-updater pods are restarted on configmap
updates. In kustomize, a slight downside of this is that the name of the
configmap(s) depends on the content, so every time a user customizes the
config data, the old unused configmap is left behind and must be
garbage-collected manually.
The upstream repo (and the release downloads)
github.com/rundocs/jekyll-rtd-theme has been deleted. This broke our
docs generation as the remote theme configuration depended on
downloading the release artefact.
This patch changes the docs build to use a Ruby gem instead of the
remote theme setting. To complicate matters, the gem has a seemingly
incorrect (too strict) version dependency. To mitigate this, we now
install the bundler-override plugin to ignore this particular dependency.
The Netlify config is a hack, but I wasn't able to figure out a way to
install the bundler-override plugin without doing all the Ruby
initialization in the build command.
The link to the feature gates documentation in
master-commandline-reference.html and worker-commandline-reference.html
points to feature-gates.md; it should be updated to link to the HTML
file.
Signed-off-by: joehuang <joehuang.sweden@gmail.com>
The link to the feature gates documentation in
master-commandline-reference.md points to the parent folder; it should
be updated to link to the file in the same folder.
Signed-off-by: joehuang <joehuang.sweden@gmail.com>
The feature gate is locked to true. That is, it is no longer possible to
revert to the gRPC-based communication, which makes the gRPC API ready
for removal.
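As a hedged Go sketch of what "locked to true" means in terms of the
Kubernetes featuregate package (the gate name and wiring here are
illustrative): with LockToDefault set, any attempt to flip the gate back to
false is rejected:

    package main

    import (
        "fmt"

        "k8s.io/component-base/featuregate"
    )

    const NodeFeatureAPI featuregate.Feature = "NodeFeatureAPI"

    func main() {
        gates := featuregate.NewFeatureGate()
        // Locked to its default (true): the GA gate can no longer be disabled.
        if err := gates.Add(map[featuregate.Feature]featuregate.FeatureSpec{
            NodeFeatureAPI: {Default: true, PreRelease: featuregate.GA, LockToDefault: true},
        }); err != nil {
            panic(err)
        }

        // Attempting to revert to the gRPC-based communication fails.
        if err := gates.Set("NodeFeatureAPI=false"); err != nil {
            fmt.Println("cannot disable:", err)
        }
        fmt.Println("enabled:", gates.Enabled(NodeFeatureAPI)) // always true
    }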
Bump the required Kubernetes version to v1.24. In practice this is the
minimum Kubernetes version, as our deployments (both kustomize and Helm)
depend on the gRPC container probes feature of Kubernetes.
Disable the AVX10 feature as unnecessary; AVX10_LEVEL is better suited for
checking AVX10 compatibility. There is not yet any hardware with the
feature, so disabling it shouldn't cause problems for users.