Add a counter for total number of node update/sync requests. In
practice, this counts the number of gRPC requests received if the gRPC
API is in use. If the NodeFeature API is enabled, this counts the
requests initiated by the NFD API controller, i.e. updates triggered by
changes in NodeFeature or NodeFeatureRule objects plus updates initiated
by the controller resync period.
Add a counter for errors encountered when processing NodeFeatureRules.
Another simple counter without any additional prometheus labels -
nfd-master logs can provide further details.
Add a new counter for tracking node update failures from nfd-master.
This tracks both normal feature updates and the --prune sub-command.
This is a simple counter without any additional labels - nfd-master logs
can be used for further diagnostics.
Rename the "NodeName" prometheus label to "node", aligning with
common prometheus/kubernetes conventions. Also reconfigure the
prometheus histogram buckets (now 10ms to 1s) to better match the
expected sample range.
Rename the metric, better describe what we're measuring and better
comply with prometheus naming conventions. Also change it to represent
actual updates of the node object on the Kubernetes apiserver.
Change the metric from a simple gauge (that basically was a single value
for the whole cluster) into a HistogramVec, aligning with the feature
discovery duration metric in nfd-worker. This improved metric now has
prometheus labels for the NFR name and node name, i.e. it is tracking
per-NFR metric for each node being processed. Also, change the naming to
better comply with prometheus suggested conventions.
Expose metrics via prometheus.monitoring.coreos.com/v1
The exposed metrics are
| Metric | Type | Meaning |
| --------------- | ---------------- | ---------------- |
| `nfd_master_build_info` | Gauge | Version from which nfd-master was built. |
| `nfd_worker_build_info` | Gauge | Version from which nfd-worker was built. |
| `nfd_updated_nodes` | Counter | Time taken to label a node |
| `nfd_crd_processing_time` | Gauge | Time taken to process a NodeFeatureRule CRD |
| `nfd_feature_discovery_duration_seconds` | HistogramVec | Time taken to discover features on a node |
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Co-authored-by: Markus Lehtonen <markus.lehtonen@intel.com>
NFD is only detecting policy and scope of Topology Manager when NRT object doesn't exist.
This means that topologyManagerScope and topologyManagerPolicy attributes won't be updated
even if kubelet config was changed to use other TopologyManager policy and scope.
Signed-off-by: pprokop <pprokop@nvidia.com>
Let's refactor part of the getCgroupMiscCapacity() out to its own
retrieveCgroupMiscCapacityValue(), for the legibility sake.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
We've been only considering cgroupsv2 when trying to read misc.capacity.
However, there are still a bunch of systems out there relying on
cgroupsv1.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>