Add a counter for errors encountered when processing NodeFeatureRules.
Another simple counter without any additional prometheus labels -
nfd-master logs can provide further details.
Add a new counter for tracking node update failures from nfd-master.
This tracks both normal feature updates and the --prune sub-command.
This is a simple counter without any additional labels - nfd-master logs
can be used for further diagnostics.
Rename the "NodeName" prometheus label to "node", aligning with
common prometheus/kubernetes conventions. Also reconfigure the
prometheus histogram buckets (now 10ms to 1s) to better match the
expected sample range.
Rename the metric, better describe what we're measuring and better
comply with prometheus naming conventions. Also change it to represent
actual updates of the node object on the Kubernetes apiserver.
Change the metric from a simple gauge (that basically was a single value
for the whole cluster) into a HistogramVec, aligning with the feature
discovery duration metric in nfd-worker. This improved metric now has
prometheus labels for the NFR name and node name, i.e. it is tracking
per-NFR metric for each node being processed. Also, change the naming to
better comply with prometheus suggested conventions.
Expose metrics via prometheus.monitoring.coreos.com/v1
The exposed metrics are
| Metric | Type | Meaning |
| --------------- | ---------------- | ---------------- |
| `nfd_master_build_info` | Gauge | Version from which nfd-master was built. |
| `nfd_worker_build_info` | Gauge | Version from which nfd-worker was built. |
| `nfd_updated_nodes` | Counter | Time taken to label a node |
| `nfd_crd_processing_time` | Gauge | Time taken to process a NodeFeatureRule CRD |
| `nfd_feature_discovery_duration_seconds` | HistogramVec | Time taken to discover features on a node |
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Co-authored-by: Markus Lehtonen <markus.lehtonen@intel.com>
NFD is only detecting policy and scope of Topology Manager when NRT object doesn't exist.
This means that topologyManagerScope and topologyManagerPolicy attributes won't be updated
even if kubelet config was changed to use other TopologyManager policy and scope.
Signed-off-by: pprokop <pprokop@nvidia.com>
This PR aims to optimize the process of updating nodes with
corresponding features. In fact, previously, we were updating nodes
sequentially even though they are independent from each other.
Therefore, we integrated new components: LabelersNodePool which is
responsible for spininng a goroutine whenever there's a request for
updating nodes, and a Workqueue which is responsible for holding nodes names
that should be updated.
Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>
This PR aims to support the dynamic values for labels in the
NodeFeatureRule CRD, it would offer more flexible labeling for users.
To achieve this, we check whether label value starts with "@", and if
it's the case, we will get the value of the feature value, and update
the value of the label with the feature value.
Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>
Drop the KlogDump helper in favor of klog.InfoS. However, that patch
introduces a new DelayedDumper() helper to avoid processing
(marshalling) of object unless really evaluated by the logging function.
It allows NFD-master to be run in active-passive way when running
multiple instances of NFD-master to prevent multiple components
from updating same custom resources.
Signed-off-by: PiotrProkop <pprokop@nvidia.com>
This PR fixes the resync-period configuration option of the nfd-master.
In fact, previously, changes were not reflected in the nfd-master at
runtime. e2e tests are also implemented to make sure that the fix is
already working as expected.
Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>
Split out resolving of node name (of the node to be updated) into a
separate function. Makes it possible to add unit tests. Also. do
unconditional type casting in the handler functions – that shouldn't
fail unless there is a really serious internal inconsistency in the
codebase so it should be ok to panic.
- kubelet_internal_checkpoint file is in /var/lib/kubelet/device-plugins not /var/lib/kubelet
fsWatcher doesn't watch dirs recursively
- e.Name returned from fsWatcher events is a full path not a basename
Signed-off-by: pprokop <pprokop@nvidia.com>
This PR adds a config option for setting the NFD API controller resync period.
The resync period is only activated when the NodeFeature API has been
enabled (with -enable-nodefeature-api).
Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>
Reject malformed extended resource dynamic capacity assignment
capacity should be in the form of domain.feature.element,
add logic at func filterExtendedResources to check if true or ignore
ExtendedResource, logging as an error.
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Fix a a bug where nfd-master with NodeFeature API enabled would crash
when NodeFeatureRule objects were processed in the case where no
NodeFeature objects existed. This was caused by trying to insert values
into a non-initialized NodeFeatureSpec in the code.
This patch adds two safety measures to prevent that from happening in
the future. First, add a constructor function for the NodeFeatureSpec
type, and second, check for uninitialized object in the function
inserting new functions.
TODO: add unit tests for the API helper functions.