When the OwnerReferencesPermissionEnforcement validating admission
controller is enabled, additional permissions are required to set or
update the owner reference field. NFD worker sets the NodeFeature owner
references to the worker pod and the owning daemonset.
An owner reference can only be updated if the worker has delete
permissions for NodeFeatures.
If an owner reference has blockOwnerDeletion set (as is the case for the
daemonset owner reference), update permissions on the finalizers of the
owner are also required. To avoid this, set blockOwnerDeletion to false
for all owners referenced from the NFD worker pod when setting or
updating the NodeFeature owner references.
Signed-off-by: adrianc <adrianc@nvidia.com>
MatchStatus provides details about successful expressions and their results,
which are the matched host features. Additionally, a new flag controls
rule processing behavior: it can either stop at the first error or
continue processing all expressions and rules.
Signed-off-by: Marcin Franczyk <marcin0franczyk@gmail.com>
Drop references to the gRPC API and don't suggest that NodeFeatureAPI
could be disabled.
Also update the developer guide instructions for running NFD components
outside the cluster.
In some cases it's desirable to control automatic garbage collection
of NodeFeature objects.
Add an option to disable setting the Pod owner reference on
NodeFeature objects.
Closes: 1817
Signed-off-by: Oleg Zhurakivskyy <oleg.zhurakivskyy@intel.com>
Drop the resourceLabels config file option and the corresponding
-resource-labels command line flag. They were deprecated in NFD v0.13 so
it's time to let them go. NodeFeatureRule(s) should be used to manage
extended resources instead.
Simplify the code and reduce possible error scenarios by dropping
fsnotify-based reconfiguration from nfd-master and nfd-worker. This also
eliminates repeated re-configuration in scenarios where kubelet
continuously touches (every minute) the mounted configmap file on the
filesystem.
Also modifies the Helm and kustomize deployments so that nfd-master,
nfd-worker and nfd-topology-updater pods are restarted on configmap
updates. In kustomize, the slight downside of this is that the name of
the configmap(s) depends on the content, so every time a user customizes
the config data, the old unused configmap is left behind and must be
garbage-collected manually.
Stop blocking on event channels when the api controller is stopped.
Ensures that the nfd API informer factory is properly shut down and all
resources released when stop() is called. This eliminates a memory leak
on re-configure events when leader election is enabled.
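
The general pattern for not blocking on an event channel after shutdown can
be sketched as follows. This is an illustration only, not the NFD
implementation: the function name notify and the channel element type are
assumptions.

```go
package main

import "fmt"

// notify (hypothetical helper) tries to deliver name on the events channel
// but gives up as soon as stop is closed, so a sender never stays blocked
// after the consumer has gone away. It reports whether the event was sent.
func notify(events chan<- string, stop <-chan struct{}, name string) bool {
	select {
	case events <- name:
		return true
	case <-stop:
		return false
	}
}

func main() {
	events := make(chan string) // unbuffered, and nobody is receiving
	stop := make(chan struct{})
	close(stop) // the controller has been stopped

	// A plain "events <- name" would block forever here.
	fmt.Println("delivered:", notify(events, stop, "node-1"))
}
```

Without the stop case, every goroutine sending after shutdown would stay
blocked and keep its resources alive, which is the kind of leak described
above.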
List NodeFeature and NodeResourceTopology objects in pages of 200 items.
This reduces memory consumption and eliminates timeouts (on the
apiserver side) in big clusters of thousands of nodes.
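
A minimal sketch of paged listing is shown below. It does not use client-go:
pageFunc and fakeServer are hypothetical stand-ins, with the limit and
continue-token parameters loosely mirroring the Limit and Continue fields of
metav1.ListOptions.

```go
package main

import (
	"fmt"
	"strconv"
)

// pageSize matches the page size mentioned above.
const pageSize = 200

// pageFunc (assumption) fetches at most limit items starting at
// continueToken and returns the token for the next page, or "" when done.
type pageFunc func(limit int, continueToken string) (items []string, next string, err error)

// listAll fetches pages until the server returns an empty continue token.
// It returns all items and the number of requests that were made.
func listAll(fetch pageFunc, limit int) ([]string, int, error) {
	var all []string
	pages := 0
	token := ""
	for {
		items, next, err := fetch(limit, token)
		if err != nil {
			return nil, pages, err
		}
		all = append(all, items...)
		pages++
		if next == "" {
			return all, pages, nil
		}
		token = next
	}
}

// fakeServer returns a pageFunc serving n synthetic object names. The
// continue token is simply the offset of the next item.
func fakeServer(n int) pageFunc {
	return func(limit int, token string) ([]string, string, error) {
		start := 0
		if token != "" {
			var err error
			if start, err = strconv.Atoi(token); err != nil {
				return nil, "", err
			}
		}
		end := start + limit
		if end > n {
			end = n
		}
		items := make([]string, 0, end-start)
		for i := start; i < end; i++ {
			items = append(items, "node-"+strconv.Itoa(i))
		}
		next := ""
		if end < n {
			next = strconv.Itoa(end)
		}
		return items, next, nil
	}
}

func main() {
	items, pages, err := listAll(fakeServer(1000), pageSize)
	if err != nil {
		panic(err)
	}
	fmt.Printf("listed %d objects in %d pages\n", len(items), pages)
}
```

Each request stays small regardless of cluster size, so no single response
has to carry every object at once.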
Fix cache syncing problems on big clusters with thousands of NodeFeature
objects.
On the initial list (sync) the client-go cache reflector sets the
ResourceVersion to "0" (instead of leaving it empty). This causes
problems in the apiserver, which logs errors like:
  E writers.go:122] apiserver was unable to write a JSON response: http:
      Handler timeout
  E status.go:71] apiserver received an error that is not an
      metav1.Status: &errors.errorString{s:"http: Handler timeout"}:
      http: Handler timeout
On the nfd-master side we see corresponding log snippets like:
  W reflector.go:547] failed to list *v1alpha1.NodeFeature: stream error
      when reading response body, may be caused by closed
      connection. Please retry. Original error: stream
      error: stream ID 1521; INTERNAL_ERROR; received from
      peer
  I trace.go:236] "Reflector ListAndWatch" name:*** (***) (total time:
      61126ms): ---"Objects listed" error:stream error when
      reading response body, may be caused by closed
      connection. Please retry. Original error: stream
      error: stream ID 1521; INTERNAL_ERROR; received from
      peer 61126ms (***)
Decreasing the page size (opts.Limit) does not have any effect on the
timeouts. However, setting ResourceVersion to an empty value seems to
get paging back on track, eliminating the timeouts.
TODO: investigate in Kubernetes upstream the root cause of the timeouts
with ResourceVersion="0".
The feature gate is locked to true. That is, it is no longer possible to
revert to the gRPC-based communication, which makes the gRPC API ready
for removal.
Run nfd-worker with the NodeFeature API enabled (against a fake
apiserver) instead of using the deprecated gRPC API (against an
nfd-master instance).
Expand the test to verify the features and labels that are advertised as
a NodeFeature object.