Drop references to the gRPC API and don't suggest that NodeFeatureAPI
could be disabled.
Also update the developer guide with instructions for running NFD
components outside the cluster.
Drop the resourceLabels config file option and the corresponding
-resource-labels command line flag. They were deprecated in NFD v0.13 so
it's time to let them go. NodeFeatureRule(s) should be used to manage
extended resources, instead.
Bring back the -enable-nodefeature-api command line flag and the
corresponding enableNodeFeatureApi helm config value that were
removed without deprecation when the NodeFeatureAPI feature gate was
introduced. The thinking behind this change is to not break existing
users (without warning) unless totally unavoidable. Now the
-enable-nodefeature-api flag is marked as deprecated and slated for
removal in NFD v0.17.
The NodeFeatureAPI feature gate and the -enable-nodefeature-api flag
work together so that the NodeFeature API is disabled (gRPC is enabled,
instead) if either of them is set to false.
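The interaction of the two switches can be sketched as below. This is a
minimal illustration and not the actual NFD code: the helper name
nodeFeatureAPIEnabled and its parameters are hypothetical.

```go
package main

import "fmt"

// nodeFeatureAPIEnabled is a hypothetical helper: the NodeFeature API is
// used only when both the feature gate and the command line flag are true.
// If either one is false, the (deprecated) gRPC mode is used instead.
func nodeFeatureAPIEnabled(featureGate, cliFlag bool) bool {
	return featureGate && cliFlag
}

func main() {
	for _, gate := range []bool{true, false} {
		for _, cliFlag := range []bool{true, false} {
			mode := "gRPC"
			if nodeFeatureAPIEnabled(gate, cliFlag) {
				mode = "NodeFeature API"
			}
			fmt.Printf("gate=%v flag=%v -> %s\n", gate, cliFlag, mode)
		}
	}
}
```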
This patch selectively reverts parts of
06c4733bc5.
Plan the removal of the -crd-controller flag along with the gRPC API.
This flag does not make much sense after that as all communication with
nfd-worker is based on CRDs - with the CRD controller disabled
nfd-master is virtually a functionless stub.
This patch separates the gRPC health server from the deprecated gRPC
server (disabled by default, replaced by the NodeFeature CRD API) used
for node labeling requests. The new health server runs on hardcoded TCP
port number 8082.
The main motivation for this change is to make Kubernetes' built-in
gRPC liveness probes work when TLS is enabled (the probes do not
support TLS).
The health server itself is a naive implementation (as it was before),
basically only checking that nfd-master has started and hasn't crashed.
The patch adds a TODO note to improve the functionality.
Now that the NodeFeature API is enabled by default, the gRPC mode is
deprecated, and with it all flags and features around it.
For nfd-master, flags
-port, -key-file, -ca-file, -cert-file, -verify-node-name, -enable-nodefeature-api
are now marked as deprecated.
For nfd-worker flags
-enable-nodefeature-api, -ca-file, -cert-file, -key-file, -server, -server-name-override
are now marked as deprecated.
The deprecated flags, as well as gRPC-related code, will be removed in
future releases.
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Co-authored-by: Markus Lehtonen <markus.lehtonen@intel.com>
Expose metrics via prometheus.monitoring.coreos.com/v1
The exposed metrics are
| Metric | Type | Meaning |
| --------------- | ---------------- | ---------------- |
| `nfd_master_build_info` | Gauge | Version from which nfd-master was built. |
| `nfd_worker_build_info` | Gauge | Version from which nfd-worker was built. |
| `nfd_updated_nodes` | Counter | Number of nodes updated |
| `nfd_crd_processing_time` | Gauge | Time taken to process a NodeFeatureRule CRD |
| `nfd_feature_discovery_duration_seconds` | HistogramVec | Time taken to discover features on a node |
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Co-authored-by: Markus Lehtonen <markus.lehtonen@intel.com>
This PR optimizes the process of updating nodes with their
corresponding features. Previously, nodes were updated sequentially
even though they are independent of each other. Therefore, we add two
new components: LabelersNodePool, which is responsible for spinning up
a goroutine whenever there is a request to update nodes, and a
Workqueue, which holds the names of the nodes that should be updated.
Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>
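The idea of a queue of node names drained by a pool of goroutines can be
sketched as below. This is only an illustration under stated assumptions:
it uses a plain Go channel in place of the Workqueue, and the names
updateNodes and updateFn are hypothetical, not taken from the NFD code.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// updateNodes drains a queue of node names with the given number of worker
// goroutines and returns the sorted names that were processed. updateFn is a
// stand-in for the real per-node update logic.
func updateNodes(names []string, workers int, updateFn func(string)) []string {
	if workers < 1 {
		workers = 1 // at least one worker is needed to drain the queue
	}
	queue := make(chan string)
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		done []string
	)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Each worker picks node names until the queue is closed.
			for name := range queue {
				updateFn(name)
				mu.Lock()
				done = append(done, name)
				mu.Unlock()
			}
		}()
	}
	for _, n := range names {
		queue <- n
	}
	close(queue)
	wg.Wait()
	sort.Strings(done)
	return done
}

func main() {
	nodes := []string{"node-c", "node-a", "node-b"}
	fmt.Println(updateNodes(nodes, 2, func(string) {}))
}
```

Because the nodes are independent of each other, the order in which the
workers process them does not matter; the result is sorted here only to
make the output deterministic.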
It allows NFD-master to be run in an active-passive way when running
multiple instances of NFD-master, to prevent multiple components
from updating the same custom resources.
Signed-off-by: PiotrProkop <pprokop@nvidia.com>
This PR fixes the resync-period configuration option of nfd-master.
Previously, changes to the option were not reflected in nfd-master at
runtime. e2e tests are also added to verify that the fix works as
expected.
Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>
This PR adds a config option for setting the NFD API controller resync period.
The resync period is only activated when the NodeFeature API has been
enabled (with -enable-nodefeature-api).
Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>
Mark the -resource-labels flag (and the corresponding resourceLabels
config option) as deprecated. We now support managing extended resources
via NodeFeatureRule objects. This kludge deserves to go, eventually.
Similar to the nfd-worker, in this PR we want to support the
dynamic run-time configurability through a config file for the nfd-master.
We'll use a JSON or YAML configuration file along with fsnotify in
order to watch for changes in the config file. As a result, we're
allowing dynamic control of logging params, allowed namespaces,
extended resources, label whitelisting, and denied namespaces.
Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>
Deprecate the '-featurerules-controller' command line flag as the name
does not describe the functionality anymore: in practice it controls the
CRD controller handling both NodeFeature and NodeFeatureRule objects.
The patch introduces a duplicate, more generally named, flag
'-crd-controller'. A warning is printed in the log if
'-featurerules-controller' flag is encountered.
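One way to keep a deprecated flag as a duplicate of a new one is to bind
both names to the same variable, as sketched below with the standard
library flag package. This is an assumption about the approach, not the
actual NFD implementation; parseArgs, the default value and the warning
text are hypothetical.

```go
package main

import (
	"flag"
	"fmt"
)

// parseArgs registers -crd-controller and the deprecated duplicate
// -featurerules-controller on the same variable. It returns the resulting
// value and any deprecation warnings.
func parseArgs(args []string) (bool, []string, error) {
	fs := flag.NewFlagSet("nfd-master-sketch", flag.ContinueOnError)
	enabled := true // assumed default, for illustration only
	fs.BoolVar(&enabled, "crd-controller", enabled, "enable the CRD controller")
	fs.BoolVar(&enabled, "featurerules-controller", enabled, "DEPRECATED: use -crd-controller")
	if err := fs.Parse(args); err != nil {
		return false, nil, err
	}
	var warnings []string
	// Visit only walks the flags that were explicitly set on the command line.
	fs.Visit(func(f *flag.Flag) {
		if f.Name == "featurerules-controller" {
			warnings = append(warnings,
				"-featurerules-controller is deprecated, use -crd-controller instead")
		}
	})
	return enabled, warnings, nil
}

func main() {
	enabled, warnings, err := parseArgs([]string{"-featurerules-controller=false"})
	fmt.Println(enabled, warnings, err)
}
```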
Add initial support for handling NodeFeature objects. With this patch
nfd-master watches NodeFeature objects in all namespaces and reacts to
changes in any of these. The node which a certain NodeFeature object
affects is determined by the "nfd.node.kubernetes.io/node-name"
annotation of the object. When a NodeFeature object targeting certain
node is changed, nfd-master needs to process all other objects targeting
the same node, too, because there may be dependencies between them.
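The node lookup described above can be sketched as below. The nodeFeature
type is a simplified, hypothetical stand-in for the real NodeFeature
object, and objectsForNode is an illustrative helper rather than a
function from the NFD code base.

```go
package main

import "fmt"

// Annotation that names the node a NodeFeature object targets.
const nodeNameAnnotation = "nfd.node.kubernetes.io/node-name"

// nodeFeature is a simplified stand-in for a NodeFeature object.
type nodeFeature struct {
	Namespace   string
	Name        string
	Annotations map[string]string
}

// objectsForNode returns all objects (from any namespace) that target the
// same node as the changed object, including the changed object itself if
// it is part of the list. It returns nil if the changed object does not
// name a node.
func objectsForNode(changed nodeFeature, all []nodeFeature) []nodeFeature {
	node, ok := changed.Annotations[nodeNameAnnotation]
	if !ok || node == "" {
		return nil
	}
	var out []nodeFeature
	for _, o := range all {
		if o.Annotations[nodeNameAnnotation] == node {
			out = append(out, o)
		}
	}
	return out
}

func main() {
	all := []nodeFeature{
		{"ns-a", "obj-1", map[string]string{nodeNameAnnotation: "node-1"}},
		{"ns-b", "obj-2", map[string]string{nodeNameAnnotation: "node-1"}},
		{"ns-a", "obj-3", map[string]string{nodeNameAnnotation: "node-2"}},
	}
	for _, o := range objectsForNode(all[0], all) {
		fmt.Printf("%s/%s\n", o.Namespace, o.Name)
	}
}
```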
Add a new command line flag for selecting between gRPC and NodeFeature
CRD API as the source of feature requests. Enabling NodeFeature API
disables the gRPC interface.
    -enable-nodefeature-api   enable NodeFeature CRD API for incoming
                              feature requests, will disable the gRPC
                              interface (defaults to false)
It is not possible to serve gRPC and watch NodeFeature objects at the
same time. This is deliberate to avoid labeling races e.g. by nfd-worker
sending gRPC requests but NodeFeature objects in the cluster
"overriding" those changes (labels from the gRPC requests will get
overridden when NodeFeature objects are processed).
This commit extends the NFD master code to support adding node taints
from NodeFeatureRule CRs. We also introduce a new annotation for
taints which helps to identify whether a taint set on a node is owned
by NFD or not. When a user deletes a taint entry from a
NodeFeatureRule CR, NFD will remove the taint from the node. But
to avoid accidental deletion of taints not owned by NFD, it
needs to know the owner. The taints set by NFD are tracked in the
annotation, which is then used to filter taints by owner. Also, the
enable-taints flag is added to allow users to opt in to or out of the
node tainting feature. The flag takes precedence over taints defined
in NodeFeatureRule CRs. In other words, if enable-taints is set to
false (disabled) and the user still defines taints in the CR, NFD will
ignore those taints and not set them on the node.
Signed-off-by: Feruzjon Muyassarov <feruzjon.muyassarov@intel.com>
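The ownership filtering described above can be sketched as below. This is
a simplified illustration under stated assumptions: taints are represented
as plain strings, the list of NFD-owned taints is assumed to have been
parsed from the annotation already, and reconcileTaints is a hypothetical
helper, not a function from the NFD code base.

```go
package main

import "fmt"

// reconcileTaints computes the taints a node should have. Taints on the node
// that are not owned by NFD are always kept. Taints owned by NFD are kept
// only if they are still desired by the rules. Desired taints that are not
// yet on the node are added.
func reconcileTaints(onNode, owned, desired []string) []string {
	ownedSet := make(map[string]bool, len(owned))
	for _, t := range owned {
		ownedSet[t] = true
	}
	seen := make(map[string]bool)
	var result []string
	// Keep everything that NFD does not own.
	for _, t := range onNode {
		if !ownedSet[t] && !seen[t] {
			result = append(result, t)
			seen[t] = true
		}
	}
	// Add what the rules currently ask for.
	for _, t := range desired {
		if !seen[t] {
			result = append(result, t)
			seen[t] = true
		}
	}
	return result
}

func main() {
	onNode := []string{"team=a:NoSchedule", "nfd=old:NoSchedule"}
	owned := []string{"nfd=old:NoSchedule"}
	desired := []string{"nfd=new:NoExecute"}
	fmt.Println(reconcileTaints(onNode, owned, desired))
}
```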
Add a new command line flag for disabling/enabling the controller for
NodeFeatureRule objects. In practice, disabling the controller disables
all labels generated from rules in NodeFeatureRule objects.
The NodeResourceTopology API has been made cluster
scoped as in the current context a CR corresponds to
a Node and since Node is a cluster scoped resource it
makes sense to make NRT cluster scoped as well.
Ref: https://github.com/k8stopologyawareschedwg/noderesourcetopology-api/pull/18
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
- This patch allows exposing Resource Hardware Topology information
through CRDs in Node Feature Discovery.
- In order to do this we introduce another software component called
nfd-topology-updater in addition to the already existing software
components nfd-master and nfd-worker.
- nfd-master was enhanced to communicate with nfd-topology-updater
over gRPC followed by creation of CRs corresponding to the nodes
in the cluster exposing resource hardware topology information
of that node.
- Pin kubernetes dependency to one that includes pod resource implementation
- This code is responsible for obtaining hardware information from the system
as well as pod resource information from the Pod Resource API in order to
determine the allocatable resource information for each NUMA zone. This
information, along with costs for NUMA zones (obtained by reading NUMA distances),
is gathered by nfd-topology-updater running on all the nodes
of the cluster and propagated to master in order to populate
that information in the CRs corresponding to the nodes.
- We use GHW facilities for obtaining system information like CPUs, topology,
NUMA distances etc.
- This also includes updates made to Makefile and Dockerfile and Manifests for
deploying nfd-topology-updater.
- This patch includes unit tests
- As part of the Topology Aware Scheduling work, this patch captures
the configured Topology manager scope in addition to the Topology manager policy.
Based on the values of both attributes a single string will be populated to the CRD.
The string value will be one of the following: {SingleNUMANodeContainerLevel,
SingleNUMANodePodLevel, BestEffort, Restricted, None}
Co-Authored-by: Artyom Lukianov <alukiano@redhat.com>
Co-Authored-by: Francesco Romani <fromani@redhat.com>
Co-Authored-by: Talor Itzhak <titzhak@redhat.com>
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
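Combining the Topology manager policy and scope into a single string can
be sketched as below. The input values are the kubelet's policy names
(none, best-effort, restricted, single-numa-node) and scope names
(container, pod). The function name topologyPolicyString and the fallback
behaviour for unknown values are assumptions made for this illustration.

```go
package main

import "fmt"

// topologyPolicyString maps the kubelet Topology manager policy and scope to
// one of the string values populated to the CRD.
func topologyPolicyString(policy, scope string) string {
	switch policy {
	case "single-numa-node":
		if scope == "pod" {
			return "SingleNUMANodePodLevel"
		}
		// Assumption: any other scope is treated like "container".
		return "SingleNUMANodeContainerLevel"
	case "best-effort":
		return "BestEffort"
	case "restricted":
		return "Restricted"
	default:
		// Assumption: "none" and unknown policies both map to None.
		return "None"
	}
}

func main() {
	fmt.Println(topologyPolicyString("single-numa-node", "pod"))
	fmt.Println(topologyPolicyString("best-effort", "container"))
}
```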
This can be used to help run multiple parallel NFD deployments in
the same cluster. The flag changes the node annotation namespace to
<instance>.nfd.node.kubernetes.io, allowing different nfd-master instances
to store metadata in separate annotations.
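The effect on annotation names can be sketched as below. The helper
annotationKey and the key name example-key are hypothetical and used only
for illustration.

```go
package main

import "fmt"

// Default annotation namespace used when no instance name is given.
const annotationNs = "nfd.node.kubernetes.io"

// annotationKey builds a fully namespaced annotation key. With a non-empty
// instance name the namespace becomes <instance>.nfd.node.kubernetes.io.
func annotationKey(instance, key string) string {
	ns := annotationNs
	if instance != "" {
		ns = instance + "." + annotationNs
	}
	return ns + "/" + key
}

func main() {
	fmt.Println(annotationKey("", "example-key"))
	fmt.Println(annotationKey("dev", "example-key"))
}
```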
For historical reasons the labels in the default nfd namespace have been
internally represented without the namespace part. I.e. instead of
"feature.node.kubernetes.io/foo" we just use "foo". NFD worker uses this
representation, too, both internally and over the gRPC requests. The
same scheme has been used for annotations.
This patch changes NFD master to use fully namespaced label and
annotation names internally. This hopefully makes the code a bit more
understandable. It also addresses some corner cases making the handling
of label names consistent, making it possible to use both "truncated"
and fully namespaced names over the gRPC interface (and in the
annotations).
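Accepting both "truncated" and fully namespaced names can be sketched as
below: a name without a namespace part gets the default namespace
prepended, while an already namespaced name is left as is. The helper name
addNs is an assumption made for this illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// Default label namespace of NFD.
const labelNs = "feature.node.kubernetes.io"

// addNs returns a fully namespaced name. Names that already contain a
// namespace part (anything before a "/") are returned unchanged.
func addNs(name, ns string) string {
	if strings.Contains(name, "/") {
		return name
	}
	return ns + "/" + name
}

func main() {
	fmt.Println(addNs("foo", labelNs))
	fmt.Println(addNs("feature.node.kubernetes.io/foo", labelNs))
}
```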
Add a new sub-command-like flag for cleaning up a cluster. When --prune is
specified, nfd-master removes all NFD-related labels, annotations and
extended resources from all nodes of the cluster and exits.
This should help with undeploying NFD and be useful for development.