1
0
Fork 0
mirror of https://github.com/kubernetes-sigs/node-feature-discovery.git synced 2024-12-14 11:57:51 +00:00
Commit graph

207 commits

Author SHA1 Message Date
Markus Lehtonen
b3d6282d2c api/nfd: document all undocumented fields in the types 2024-05-23 23:49:49 +03:00
Carlos Eduardo Arango Gutierrez
47c054e1db
Add NodeFeatureGroup CRD
The NodeFeatureGroup is an NFD-specific custom resource that is designed for
grouping nodes based on their features. NFD-Master watches for NodeFeatureGroup
objects in the cluster and updates the status of the NodeFeatureGroup object
with the list of nodes that match the feature group rules. The NodeFeatureGroup
rules follow the same syntax as the NodeFeatureRule rules.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
2024-05-23 16:34:08 +02:00
Markus Lehtonen
560bd11d85 Re-add -enable-nodefeature-api cmdline flag
Bring back the -enable-nodefeature-api command line flag and the
corresponding enableNodeFeatureApi helm config value that were
removed without deprecation when the NodeFeatureAPI feature gate was
introduced. The thinking behind this change is to not break existing
users (without warning) unless totally unavoidable. Now the
-enable-nodefeature-api flag is marked as deprecated and slated for
removal in NFD v0.17.

The NodeFeatureAPI feature gate and the -enable-nodefeature-api flag
work together so that the NodeFeature API is disabled (gRPC is enabled,
instead) if either of them is set to false.

This patch selectively reverts parts of
06c4733bc5.
2024-05-16 10:53:49 +03:00
Kubernetes Prow Robot
391865bbb2
Merge pull request #1651 from cmontemuino/doc-resource-limits
docs: document trade-offs in memory configuration
2024-04-25 06:41:29 -07:00
Kubernetes Prow Robot
af8a41cc02
Merge pull request #1639 from TessaIO/chore-add-prometheus-pod-monitor-interval
chore/deploy: make interval property in PodMonitor configurable
2024-04-05 03:03:26 -07:00
Carlos M
cc53b604c5
chore: include suggestions from code review
Co-authored-by: Carlos Eduardo Arango Gutierrez <arangogutierrez@gmail.com>
2024-04-05 10:01:08 +02:00
Oleg Zhurakivskyy
f2e9557a2d nfd-topology-updater: Add liveness probe
Signed-off-by: Oleg Zhurakivskyy <oleg.zhurakivskyy@intel.com>
2024-04-03 13:15:54 +03:00
cmontemuino
54b01a2576
docs: document trade-offs in memory configuration
Problem: memory requests and limits has been set for `master` process in
PR #1631. It does not follow best practices for setting those values,
but the intention was provide default values for a wide variety of
clusters, including small ones.

Solution: provide solid documentation about the problems that might
happen in production environments when
`resource.memory.requests << resource.memory.limits`. Add a link to
relevant external sources, which includes the advise from Tim Hockin:
> Always set memory limit == request

Signed-off-by: cmontemuino <1761056+cmontemuino@users.noreply.github.com>
2024-04-02 19:01:50 +02:00
Kubernetes Prow Robot
7938e81c33
Merge pull request #1631 from TessaIO/chore-add-resources-limits-and-requests
chore/deployment: add resources requests and limits for helm and Kustomize
2024-04-02 02:03:59 -07:00
TessaIO
74153e11b5 chore/deploy: make interval property in PodMonitor configurable
Signed-off-by: TessaIO <ahmedgrati1999@gmail.com>
2024-03-26 08:36:52 +01:00
TessaIO
d02414cf61 chore/deployment: add resources requests and limits for helm and Kustomize
Signed-off-by: TessaIO <ahmedgrati1999@gmail.com>
2024-03-22 14:27:44 +01:00
Markus Lehtonen
9b3d273a18 helm: fix invalid name of host-swaps volume 2024-03-20 21:15:02 +02:00
Kubernetes Prow Robot
0ad5e50f24
Merge pull request #1609 from ozhuraki/worker-health
nfd-worker: Add liveness probe
2024-03-19 06:57:23 -07:00
Oleg Zhurakivskyy
8b63d17af7 nfd-worker: Add liveness probe
Signed-off-by: Oleg Zhurakivskyy <oleg.zhurakivskyy@intel.com>
2024-03-19 15:34:53 +02:00
Kubernetes Prow Robot
7df0f17f68
Merge pull request #1602 from ozhuraki/nrt-owner-ref
Add owner reference to NRT object
2024-03-19 01:12:59 -07:00
Kubernetes Prow Robot
797fada92e
Merge pull request #1585 from kannon92/add-swap-support
add swap support in nfd
2024-03-18 04:19:48 -07:00
Carlos Eduardo Arango Gutierrez
06c4733bc5
Add FeatureGate framework to handle new features
Code inspired on https://github.com/kubernetes/component-base/tree/master/featuregate

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
2024-03-15 19:11:32 +01:00
Oleg Zhurakivskyy
c662265a47 topology-updater: Add owner reference to NRT object
Signed-off-by: Oleg Zhurakivskyy <oleg.zhurakivskyy@intel.com>
2024-03-15 16:36:27 +02:00
Markus Lehtonen
a562a6188a Update auto-generated code 2024-03-11 12:18:32 +02:00
Allen Mun
8bd52594ab add ability to use a custom issuer 2024-02-27 12:14:43 -05:00
Kevin Hannon
187f65f94e Add swap support in nfd 2024-02-19 10:20:56 -05:00
Carlos Eduardo Arango Gutierrez
75f0a14f2a
helm: add priorityClassName option
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
2024-02-15 16:29:33 +01:00
Kubernetes Prow Robot
5c99ae8343
Merge pull request #1560 from leemingeer/master
nfd-topology-updater add pods fingerprint by default
2024-01-29 00:34:44 -08:00
leemingeer
b6d8ce7a5a nfd-topology-updater add pods fingerprint by default 2024-01-26 17:55:34 +08:00
Markus Lehtonen
8adb4b38da deployment/helm: don't deploy topology-updater conf unnecessarily
Only deploy the topology-updater config if topology-updater itself (the
daemon) is deployed.
2024-01-25 16:15:58 +02:00
Kubernetes Prow Robot
4501bedd61
Merge pull request #1535 from marquiz/devel/grpc-probe
nfd-master: run a separate gRPC health server
2024-01-05 15:24:28 +01:00
Markus Lehtonen
a053efda64 nfd-master: run a separate gRPC health server
This patch separates the gRPC health server from the deprecated gRPC
server (disabled by default, replaced by the NodeFeature CRD API) used
for node labeling requests. The new health server runs on hardcoded TCP
port number 8082.

The main motivation for this change is to make the Kubernetes' built-in
gRPC liveness probes to function if TLS is enabled (as they don't
support TLS).

The health server itself is a naive implementation (as it was before),
basically only checking that nfd-master has started and hasn't crashed.
The patch adds a TODO note to improve the functionality.
2024-01-04 13:58:26 +02:00
Markus Lehtonen
09b5af74de deployment/kustomize: drop the sample cert-manager overlay
Drop the deprecated and broken sample overlay. This was an example for
enabling TLS with cert-manager. However, the overlay has been broken
(and useless) since NodeFeature API was enabled by default - and gRPC
disabled - in v0.14.
2024-01-03 21:13:15 +02:00
Markus Lehtonen
889fffd7d4 helm: add post-delete hook that cleans up the node
This patch adds a post-delete hook to the Helm chart that runs
"nfd-master --prune" in the cluster. This cleans up the node of labels,
annotations, taints and extended resources that were created by NFD.
2023-12-29 15:36:41 +02:00
Markus Lehtonen
9846dede43 deployment/kustomize: enable nfd-gc in the default overlay 2023-12-21 21:30:14 +02:00
Markus Lehtonen
84fa1ed6e1 Document the NodeFeatureRule samples and move them under deployment dir 2023-12-15 13:43:26 +02:00
Markus Lehtonen
fe412a54b9 apis/nfd: add matchName field in feature matcher terms
Extend the format of feature matcher terms (the elements of the
arrayspecified under under matchFeatures field) with new matchName
field. The value of this field is an expression that is evaluated
against the names of feature elements instead of their values (values
are matched with the matchExpressions field, instead).

The matchName field is useful e.g. in template rules for creating
per-feature-element labels based on feature names (instead of values)
and in non-template rules for checking if (at least) one of certain
feature element names are present.

If both matchExpressions and matchName for certain feature matcher term
is specified, they both must match in order to get an overall match.
Also, in this case the list of matched features (used in templating) is
the union of the results from matchExpressions and matchName.

An example of creating an "avx512" label if any AVX512* CPUID feature is
present:

  - name: "avx wildcard rule"
    labels:
        avx512: "true"
    matchFeatures:
      - feature: cpu.cpuid
        matchName: {op: InRegexp, value: ["^AVX512"]}

An example of a template rule creating a dynamic set of labels  based on
the existence of certain kconfig options.

  - name: "kconfig template rule"
    labelsTemplate: |
      {{ range .kernel.config }}kconfig-{{ .Name }}={{ .Value }}
      {{ end }}
    matchFeatures:
      - feature: kernel.config
        matchName: {op: In, value: ["SWAP", "X86", "ARM"]}

NOTE: this patch changes the corner case of nil/null match expressions
with instance features (i.e. "matchExpressions: null"). Previously, we
returned all instances for templating but now a nil match expression is
not evaluated and no instances for templating are returned.
2023-12-15 11:32:23 +02:00
Markus Lehtonen
0bc1b6c28f apis/nfd: drop creation helper functions
Drop the creation helper functions as one step in an effort to tidy up
the api package. These functions were not much used outside unit tests
anyway, the static rules of the nfd-worker custom feature source being
the only exception (and if those happened to be invalid we'd catch that
e.g. in the e2e-tests).
2023-12-14 15:54:51 +02:00
Kubernetes Prow Robot
04c4725dd1
Merge pull request #1491 from marquiz/devel/nf-owner-ref
nfd-worker: set owner reference in NodeFeature objects
2023-12-08 14:51:47 +01:00
Kubernetes Prow Robot
aa26bbf964
Merge pull request #1494 from marquiz/devel/master-service
deployment/kustomize: drop nfd-master service
2023-12-08 14:31:07 +01:00
Markus Lehtonen
34574f4211 nfd-worker: set owner reference in NodeFeature objects
This patch creates a owner-dependent relationship between the
nfd-worker pod and the NodeFeature object that it creates. With this
change the orphaned NodeFeature object(s) gets automatically
garbage-collected when the nfd-worker pod goes away, without the need
for manual clean-up actions.
2023-12-08 14:57:31 +02:00
Markus Lehtonen
9624d182ab deployment/kustomize: drop nfd-master service
Not needed anymore as we're not relying on gRPC anymore.
2023-12-08 14:53:23 +02:00
Markus Lehtonen
53f5967555 deployment/kustomize: drop default-combined overlay
The "combined" overlay, deploying nfd-master and nfd-worker in the same
pod (with a daemonset) doesn't make sense anymore as we have enabled
NodeFeature API. There is no direct communication between nfd-master and
nfd-worker anymore, Moreover, the combined deployment can be seen as
broken as there is one NodeFeature controller (i.e. nfd-master) on each
node, causing them to race against each other, all processing all
NodeFeature objects.
2023-12-08 14:42:31 +02:00
Markus Lehtonen
1d012a28cd Option to stop implicitly adding default prefix to names
Add new autoDefaultNs (default is "true") config option to nfd-master.
Setting the config option to false stops NFD from automatically adding
the "feature.node.kubernetes.io/" prefix to labels, annotations and
extended resources. Taints are not affected as for them no prefix is
automatically added. The user-visible part of enabling the option change
is that NodeFeatureRules, local feature files, hooks and configuration
of the "custom" may need to be altereda (if the auto-prefixing is
relied on).

For now, the config option defaults to "true", meaning no change in
default behavior. However, the intent is to change the default to
"false" in a future release, deprecating the option and eventually
removing it (forcing it to "false").

The goal of stopping doing "auto-prefixing" is to simplify the operation
(of nfd and users). Make the naming more straightforward and easier to
understand and debug (kind of WYSIWYG), eliminating peculiar corner
cases:

1. Make validation simpler and unambiguous
2. Remove "overloading" of names, i.e. the mapping two values to the
   same actual name. E.g. previously something like

      labels:
        feature.node.kubernetes.io/foo: bar
        foo: baz

   Could actually result in node label:

     feature.node.kubernetes.io/foo: baz

3. Make the processing/usagee of the "rule.matched" and "local.labels"
   feature in NodeFeatureRules unambiguous and more understadable. E.g.
   previously you could have node label
   "feature.node.kubernetes.io/local-foo: bar" but in the NodeFeatureRule
   you'd need to use the unprefixed name "local-foo" or the fully
   prefixed name, depending on what was specified in the feature file (or
   hook) on the node(s).

NOTE: setting autoDefaultNs to false is a breaking change for users who
rely on automatic prefixing with the default feature.node.kubernetes.io/
namespace. NodeFeatureRules, feature files, hooks and custom rules
(configuration of the "custom" source of nfd-worker) will need to be
altered.  Unprefixed labels, annoations and extended resources will be
denied by nfd-master.
2023-11-24 12:48:20 +02:00
Carlos Eduardo Arango Gutierrez
c0063be4f4
Discover node features as annotations
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Co-authored-by: bebc <mchf1990212@gmail.com>
Co-authored-by: Markus Lehtonen <markus.lehtonen@intel.com>
2023-10-25 19:58:58 +02:00
AhmedGrati
d27eb0ac6d feat: add parameters in helm to disable/enable nfd-master and nfd-worker
Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>
2023-10-11 10:50:32 +01:00
Markus Lehtonen
98c3b0750d nfd-gc: add metrics
Implements three metrics for nfd-gc:

- nfd_gc_build_info: version information of nfd-gc.
- nfd_gc_objects_deleted_total: total number of NodeFeature and
  NodeResourceTopology objects deleted by nfd-gc.
- nfd_gc_object_delete_failures_total: number of errors encountered when
  deleting NodeFeature and NodeResourceTopology objects.
2023-10-09 13:39:28 +00:00
Shiva Krishna, Merla
6237b821f6 Fix serviceaccount handling for nfd-gc to be consistent with others
Signed-off-by: Shiva Krishna, Merla <smerla@nvidia.com>
2023-10-04 15:17:32 -07:00
Carlos Eduardo Arango Gutierrez
dd8d7f6725
Helm - service to be only deployed when needed (#1389)
* Helm - service to be only deployed when needed

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

* Update deployment/helm/node-feature-discovery/templates/service.yaml

Co-authored-by: Markus Lehtonen <markus.lehtonen@intel.com>

---------

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Co-authored-by: Markus Lehtonen <markus.lehtonen@intel.com>
2023-10-04 16:00:43 +02:00
Carlos Eduardo Arango Gutierrez
3543aa22ce
Helm - Move remaining gPRC related flags to conditional
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
2023-10-03 07:32:52 +02:00
Muyassarov, Feruzjon
06036a62ce Replace gRPC health probe utility with k8s built-in health probe
Kubernetes 1.23 has introduced native health probes for gRPC which
can replace grpc_health_probe utility. This commit removes baking
in grpc_health_probe binary into the image and updates related
health checks to use k8s native gRPC.

Signed-off-by: Muyassarov, Feruzjon <feruzjon.muyassarov@intel.com>
2023-09-20 12:25:36 +03:00
Kubernetes Prow Robot
8cdedf92fd
Merge pull request #1365 from marquiz/devel/helm-fix-nf-api
deployment/helm: fix handling of enableNodeFeatureApi parameter
2023-09-19 05:33:08 -07:00
Markus Lehtonen
8b207cae1f deployment/helm: fix handling of enableNodeFeatureApi parameter 2023-09-19 14:18:03 +03:00
Markus Lehtonen
759143ea3c deployment/helm: fix namespace of nfd-worker role and rolebinding
Put nfd-worker role and rolebinding in the correct namespace if
namespaceOverride parameter is used.
2023-09-19 13:53:19 +03:00
Kubernetes Prow Robot
2e6a202218
Merge pull request #1331 from andrewjamesbrown/ajb/chart_annotations
Helm: conditionally add annotations if defined
2023-09-07 01:20:59 -07:00