Drop references to the gRPC API and don't suggest that NodeFeatureAPI
could be disabled.
Also update the developer guide for instructions running nfd components
outside the cluster.
This patch mitigates inadvertent termination of nfd-master pods by the
liveness probe on big clusters.
With a recent change nfd-master started to wait (block) for informer
caches to sync before starting the main loop. Consequently, this change
also made the gRPC health enpoint to not respond until the caches have
been synced. In big clusters the syncing the NodeFeature object cache
takes a long time as the objects are big and there's (at least) one per
each node in the cluster. Thus, in big clusters, the liveness probe
kicks in and kills the nfd-master pod before it's ready.
Drop the resourceLabels config file option and the corresponding
-resource-labels command line flag. They were deprecated in NFD v0.13 so
it's time to let them go. NodeFeatureRule(s) should be used to manage
ERs, instead.
We have to run our NFD workers in the host network.
Also we need additional env variables such as KUBERNETES_SERVICE_HOST and _PORT.
To achieve this we can simply add generic helm values. The default behavior is not changed.
Signed-off-by: Tobias Giese <tgiese@nvidia.com>
Bump the required Kubernetes version to v1.24. In practice this is the
minimum Kubernetes version as our deployment (both kustomize and Helm)
depend on the gRPC container probes feature of Kubernetes.
Bring back the -enable-nodefeature-api command line flag and the
corresponding enableNodeFeatureApi helm config value that were
removed without deprecation when the NodeFeatureAPI feature gate was
introduced. The thinking behind this change is to not break existing
users (without warning) unless totally unavoidable. Now the
-enable-nodefeature-api flag is marked as deprecated and slated for
removal in NFD v0.17.
The NodeFeatureAPI feature gate and the -enable-nodefeature-api flag
work together so that the NodeFeature API is disabled (gRPC is enabled,
instead) if either of them is set to false.
This patch selectively reverts parts of
06c4733bc5.
Problem: memory requests and limits has been set for `master` process in
PR #1631. It does not follow best practices for setting those values,
but the intention was provide default values for a wide variety of
clusters, including small ones.
Solution: provide solid documentation about the problems that might
happen in production environments when
`resource.memory.requests << resource.memory.limits`. Add a link to
relevant external sources, which includes the advise from Tim Hockin:
> Always set memory limit == request
Signed-off-by: cmontemuino <1761056+cmontemuino@users.noreply.github.com>
Drop the deprecated and broken sample overlay. This was an example for
enabling TLS with cert-manager. However, the overlay has been broken
(and useless) since NodeFeature API was enabled by default - and gRPC
disabled - in v0.14.
This patch adds a post-delete hook to the Helm chart that runs
"nfd-master --prune" in the cluster. This cleans up the node of labels,
annotations, taints and extended resources that were created by NFD.
The "combined" overlay, deploying nfd-master and nfd-worker in the same
pod (with a daemonset) doesn't make sense anymore as we have enabled
NodeFeature API. There is no direct communication between nfd-master and
nfd-worker anymore, Moreover, the combined deployment can be seen as
broken as there is one NodeFeature controller (i.e. nfd-master) on each
node, causing them to race against each other, all processing all
NodeFeature objects.