This patch mitigates inadvertent termination of nfd-master pods by the
liveness probe on big clusters.
With a recent change nfd-master started to wait (block) for informer
caches to sync before starting the main loop. Consequently, this change
also made the gRPC health enpoint to not respond until the caches have
been synced. In big clusters the syncing the NodeFeature object cache
takes a long time as the objects are big and there's (at least) one per
each node in the cluster. Thus, in big clusters, the liveness probe
kicks in and kills the nfd-master pod before it's ready.
This patch separates the gRPC health server from the deprecated gRPC
server (disabled by default, replaced by the NodeFeature CRD API) used
for node labeling requests. The new health server runs on hardcoded TCP
port number 8082.
The main motivation for this change is to make the Kubernetes' built-in
gRPC liveness probes to function if TLS is enabled (as they don't
support TLS).
The health server itself is a naive implementation (as it was before),
basically only checking that nfd-master has started and hasn't crashed.
The patch adds a TODO note to improve the functionality.
Kubernetes 1.23 has introduced native health probes for gRPC which
can replace grpc_health_probe utility. This commit removes baking
in grpc_health_probe binary into the image and updates related
health checks to use k8s native gRPC.
Signed-off-by: Muyassarov, Feruzjon <feruzjon.muyassarov@intel.com>
Expose metrics via prometheus.monitoring.coreos.com/v1
The exposed metrics are
| Metric | Type | Meaning |
| --------------- | ---------------- | ---------------- |
| `nfd_master_build_info` | Gauge | Version from which nfd-master was built. |
| `nfd_worker_build_info` | Gauge | Version from which nfd-worker was built. |
| `nfd_updated_nodes` | Counter | Time taken to label a node |
| `nfd_crd_processing_time` | Gauge | Time taken to process a NodeFeatureRule CRD |
| `nfd_feature_discovery_duration_seconds` | HistogramVec | Time taken to discover features on a node |
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Co-authored-by: Markus Lehtonen <markus.lehtonen@intel.com>
Similar to the nfd-worker, in this PR we want to support the
dynamic run-time configurability through a config file for the nfd-master.
We'll use a json or yaml configuration file along with the fsnotify in
order to watch for changes in the config file. As a result, we're
allowing dynamic control of logging params, allowed namespaces,
extended resources, label whitelisting, and denied namespaces.
Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>
Implement functionality virtually replicating deployment templates for
nfd-master and nfd-worker daemonset (nfd-master.yaml.template and
nfd-worker-daemonset.yaml.template) by adding a kustomize overlay named
"default".
We split the resources into multiple bases (rbac, master and
worker-daemonset) so that relevant parts are re-usable in
other deployment scenarios added later (e.g. "one-shot job", and
"combined daemonset").
This patch adds one component (components/common) doing the required
kustomization for the example deployment.