mirror of https://github.com/kubernetes-sigs/node-feature-discovery.git synced 2024-12-14 11:57:51 +00:00
Commit graph

458 commits

Author SHA1 Message Date
IbirbyZh
2f9801b554
Fix the problem with starting the master with empty cache
We faced a problem where the master deleted some labels on start. Sometimes it doesn't get NodeFeatures that are present in the cluster because the informer cache is empty.
2024-06-10 18:06:14 +02:00
Kubernetes Prow Robot
814255b7f1
Merge pull request #1671 from marquiz/devel/nfd-api-multi-type-feature
apis/nfd: allow different types of features of the same name
2024-05-24 03:42:30 -07:00
Markus Lehtonen
3b448ae623 apis/nfd: allow different types of features of the same name
This patch changes the handling of NodeFeatureRules so that one feature
name (say "cpu.cpuid") can hold different types of features (flags,
attributes and/or instances). Requiring features to choose one single
type has not been a limitation of the API itself (and there has been no
validation on this) but an implementation decision.

The new evaluation logic of match expressions is such that "flags" and
"attributes" are basically evaluated as a union - they are both maps
but "flags" just don't have any value associated with the key. However,
"instances" are handled separately as that is basically an array of
maps and needs to be evaluated in a different way (loop over the array
of instances and evaluate expressions against the attributes of each).
Because of this difference care must be taken if mixing "instance"
features with "flag" and/or "attribute" features.

Note that the API types or their validation is not changed - just the
implementation of how the NodeFeatureRules are evaluated.
2024-05-24 13:18:31 +03:00
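The evaluation described in commit 3b448ae623 could be sketched roughly as follows. This is a minimal illustration, not the NFD implementation: the `Features`, `Expr` and `Op` types and the two operators are hypothetical simplifications invented for this example. It shows flags and attributes being looked up as one union of keys, instances being looped over separately, and a value-based operator against a flag returning "no match" without an error.

```go
package main

import "fmt"

// Op is a hypothetical, reduced set of match operators for this sketch.
type Op int

const (
	OpExists Op = iota // key must be present (works for flags and attributes)
	OpEquals           // key must be present with exactly this value
)

// Expr is one match expression against a single key.
type Expr struct {
	Key   string
	Op    Op
	Value string // only used by OpEquals
}

// Features is a simplified stand-in for everything published under one
// feature name such as "cpu.cpuid". It is not the real NFD API type.
type Features struct {
	Flags      map[string]struct{}
	Attributes map[string]string
	Instances  []map[string]string
}

// eval evaluates one expression. A value-based operator used against a flag
// (found, but without a value) simply does not match; it is not an error.
func eval(e Expr, value string, hasValue, found bool) bool {
	switch e.Op {
	case OpExists:
		return found
	case OpEquals:
		return found && hasValue && value == e.Value
	}
	return false
}

// MatchKeys evaluates all expressions against the union of flags and attributes.
func MatchKeys(f Features, exprs []Expr) bool {
	for _, e := range exprs {
		value, hasValue := f.Attributes[e.Key]
		_, isFlag := f.Flags[e.Key]
		if !eval(e, value, hasValue, hasValue || isFlag) {
			return false
		}
	}
	return true
}

// MatchInstances returns true if at least one instance satisfies all expressions.
func MatchInstances(f Features, exprs []Expr) bool {
	for _, inst := range f.Instances {
		matched := true
		for _, e := range exprs {
			value, ok := inst[e.Key]
			if !eval(e, value, ok, ok) {
				matched = false
				break
			}
		}
		if matched {
			return true
		}
	}
	return false
}

func main() {
	f := Features{
		Flags:      map[string]struct{}{"AVX2": {}},
		Attributes: map[string]string{"AVX10_VERSION": "1"},
		Instances:  []map[string]string{{"vendor": "8086", "class": "0300"}},
	}
	fmt.Println(MatchKeys(f, []Expr{{Key: "AVX2", Op: OpExists}, {Key: "AVX10_VERSION", Op: OpEquals, Value: "1"}}))
	fmt.Println(MatchKeys(f, []Expr{{Key: "AVX2", Op: OpEquals, Value: "1"}}))
	fmt.Println(MatchInstances(f, []Expr{{Key: "vendor", Op: OpEquals, Value: "8086"}}))
}
```

Keeping `MatchKeys` and `MatchInstances` as separate entry points mirrors the caveat in the commit message: instance features need a different evaluation path, so mixing them with flags or attributes under one name requires care.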
Kubernetes Prow Robot
5fe3433e35
Merge pull request #1713 from marquiz/devel/worker-log-fix
nfd-worker: improved log when creating NodeFeature object
2024-05-24 02:15:30 -07:00
Carlos Eduardo Arango Gutierrez
47c054e1db
Add NodeFeatureGroup CRD
The NodeFeatureGroup is an NFD-specific custom resource that is designed for
grouping nodes based on their features. NFD-Master watches for NodeFeatureGroup
objects in the cluster and updates the status of the NodeFeatureGroup object
with the list of nodes that match the feature group rules. The NodeFeatureGroup
rules follow the same syntax as the NodeFeatureRule rules.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
2024-05-23 16:34:08 +02:00
Markus Lehtonen
649036977e nfd-worker: improved log when creating NodeFeature object
Don't log an empty NodeFeature object.
2024-05-23 14:37:26 +03:00
Markus Lehtonen
560bd11d85 Re-add -enable-nodefeature-api cmdline flag
Bring back the -enable-nodefeature-api command line flag and the
corresponding enableNodeFeatureApi helm config value that were
removed without deprecation when the NodeFeatureAPI feature gate was
introduced. The thinking behind this change is to not break existing
users (without warning) unless totally unavoidable. Now the
-enable-nodefeature-api flag is marked as deprecated and slated for
removal in NFD v0.17.

The NodeFeatureAPI feature gate and the -enable-nodefeature-api flag
work together so that the NodeFeature API is disabled (gRPC is enabled,
instead) if either of them is set to false.

This patch selectively reverts parts of
06c4733bc5.
2024-05-16 10:53:49 +03:00
Markus Lehtonen
121345472d nfd-master: add DisableAutoPrefix feature gate
Now that we have support for feature gates deprecate the autoDefaultNs
config option of nfd-master and replace it with a new alpha feature gate
DisableAutoPrefix (defaults to false). Using a feature gate to handle
and communicate this kind of change, where the default behavior is
intended to be changed in a future release, feels much more natural than
using random flags/options.

The combined logic of the feature gate and the config option is a
logical OR over disabling auto-prefixing. That is, auto-prefixing is
disabled if either the feature gate or the config option is set to
disable it:

                       | DisableAutoPrefix (feature gate)
                       | false | true
  -------------------- | --------------------------------
  autoDefaultNs   true |  ON   | OFF
  (config opt)   false |  OFF  | OFF
2024-05-15 17:01:16 +03:00
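The truth table in commit 121345472d reduces to a single boolean expression. The sketch below only restates that table; the function name `AutoPrefixEnabled` and its parameter names are made up for illustration and are not taken from the nfd-master code.

```go
package main

import "fmt"

// AutoPrefixEnabled is a hypothetical helper: auto-prefixing stays ON only
// when the autoDefaultNs config option is true AND the DisableAutoPrefix
// feature gate is false. Either setting alone can turn it off.
func AutoPrefixEnabled(autoDefaultNs, disableAutoPrefix bool) bool {
	return autoDefaultNs && !disableAutoPrefix
}

func main() {
	for _, opt := range []bool{true, false} {
		for _, gate := range []bool{false, true} {
			fmt.Printf("autoDefaultNs=%v DisableAutoPrefix=%v -> enabled=%v\n",
				opt, gate, AutoPrefixEnabled(opt, gate))
		}
	}
}
```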
Markus Lehtonen
eb7a0ada5c nfd-master: import features package as nfdfeatures
Refactoring to prevent naming clash in future changes.
2024-05-15 11:27:46 +03:00
Markus Lehtonen
bad7d1fcb1 apis/nfd: increase unit test coverage
Cover error cases of the "match name" functions.
2024-04-30 16:02:13 +03:00
Kubernetes Prow Robot
ca13b4903d
Merge pull request #1669 from marquiz/devel/nfd-api-helpers-refactor
api/nfd: use varargs in the NewInstanceFeatures helper
2024-04-23 06:33:46 -07:00
Kubernetes Prow Robot
be44447dc0
Merge pull request #1670 from marquiz/devel/flag-feature-errors
apis/nfd: no error on ops that never match
2024-04-23 06:09:06 -07:00
Kubernetes Prow Robot
828acaa8cc
Merge pull request #1667 from marquiz/devel/api-tests
apis/nfd: add unit tests for match name functions
2024-04-23 06:08:48 -07:00
Markus Lehtonen
fbb7303562 apis/nfd: no error on ops that never match
Return false (i.e. "did not match") but no error when evaluating a match
expression against a "flag" type feature (which doesn't have any
associated value, just the name) if a MatchOp that never matches is
used.

This is preparation for supporting multi-type features, i.e. one
feature, like "cpu.cpuid", having e.g. "flag" and "attribute" type
features.
2024-04-23 11:07:49 +03:00
Markus Lehtonen
719c5186f6 api/nfd: use varargs in the NewInstanceFeatures helper
Make usage of this helper function more flexible.
2024-04-23 10:29:24 +03:00
TessaIO
de50ac8800 chore/nfd-master: remove warnings in nfd-master unit tests file
Signed-off-by: TessaIO <ahmedgrati1999@gmail.com>
2024-04-22 22:27:15 +02:00
Markus Lehtonen
6b1e9c7860 apis/nfd: add unit tests for match name functions
2024-04-22 17:20:33 +03:00
Kubernetes Prow Robot
a7c58b121c
Merge pull request #1660 from marquiz/devel/master-reconfigure-change
nfd-master: stop node-updater pool before reconfiguring api-controller
2024-04-15 11:10:32 -07:00
Kubernetes Prow Robot
91d3d5a7b0
Merge pull request #1653 from marquiz/devel/master-multiple-k8sclients
nfd-master: use separate k8s api clients for each updater
2024-04-15 09:18:51 -07:00
Markus Lehtonen
8ad6210d5c nfd-master: use separate k8s api clients for each updater
Sharing the same client between updater threads virtually serializes
access, in practice making the effective parallelism close to 1.

With this patch, in my bench cluster of 300 nodes, the time taken by
updating all nodes drops from ~2 minutes to ~12 seconds (with the
default parallelism of 10 node updater threads). This demonstrates the
10-fold increased parallelism from ~1 to 10.

There might be other solutions that could be explored, e.g. caching
nodes with an indexer/lister but otoh nfd doesn't necessarily need/want
to watch every little change in each node. We only need to get the node
when something in our own CRDs change (we don't react to any changes in
the node object itself). Using multiple clients was the most obvious
choice to solve the problem for now.
2024-04-15 19:00:30 +03:00
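The idea behind commit 8ad6210d5c, one API client per updater worker instead of one shared client, might look like the sketch below. The `NodeClient` interface, `fakeClient` and `RunUpdaters` are hypothetical names for this example; a real implementation would build Kubernetes clients rather than the no-op stand-in used here.

```go
package main

import (
	"fmt"
	"sync"
)

// NodeClient is a hypothetical stand-in for a Kubernetes API client.
type NodeClient interface {
	UpdateNode(name string) error
}

// fakeClient does nothing; it only lets the sketch run without a cluster.
type fakeClient struct{}

func (fakeClient) UpdateNode(name string) error { return nil }

// RunUpdaters starts `parallelism` workers. Each worker gets its own client
// from newClient, so the workers do not serialize on one shared client.
// It returns the number of successfully updated nodes.
func RunUpdaters(nodes []string, parallelism int, newClient func(worker int) NodeClient) int {
	queue := make(chan string)
	var wg sync.WaitGroup
	var mu sync.Mutex
	updated := 0

	for i := 0; i < parallelism; i++ {
		client := newClient(i) // one dedicated client per worker
		wg.Add(1)
		go func() {
			defer wg.Done()
			for name := range queue {
				if err := client.UpdateNode(name); err == nil {
					mu.Lock()
					updated++
					mu.Unlock()
				}
			}
		}()
	}

	for _, n := range nodes {
		queue <- n
	}
	close(queue)
	wg.Wait()
	return updated
}

func main() {
	nodes := []string{"node-1", "node-2", "node-3", "node-4"}
	n := RunUpdaters(nodes, 2, func(int) NodeClient { return fakeClient{} })
	fmt.Println("updated nodes:", n)
}
```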
Kubernetes Prow Robot
624c02e1e2
Merge pull request #1633 from marquiz/devel/validate-tests
apis/nfd/validate: loosen validation of feature annotations
2024-04-11 07:26:17 -07:00
Markus Lehtonen
7fbada8b86 apis/nfd/validate: more comprehensive unit tests
Also add license header to the test.go file and fix one bug in
MatchFeature validation.
2024-04-11 14:57:31 +03:00
Carlos Eduardo Arango Gutierrez
50d9874e72
Fix update_codegen
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
2024-04-09 18:28:04 +02:00
Markus Lehtonen
a9167e6875 apis/nfd/validate: loosen validation of feature annotations
Don't require that the annotation value must conform to the (strict)
requirements of label values. In the Kubernetes API annotation values do
not have other restrictions than that the total size (keys and values)
of _all_ annotations combined of an object must not exceed 256kB.

This patch sets a maximum size limit of 1kB for the value of a single
feature annotation created by NFD. This limit is rather arbitrary but
should be enough for the NFD usage scenarios (until proven wrong).
2024-04-09 13:30:22 +03:00
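A size check like the one described in commit a9167e6875 could be as small as the following sketch. The function and constant names are hypothetical, and reading "1kB" as 1024 bytes is an assumption; the commit message does not say whether the limit is 1000 or 1024.

```go
package main

import "fmt"

// maxAnnotationValueSize assumes that "1kB" means 1024 bytes.
const maxAnnotationValueSize = 1024

// ValidateAnnotationValue is a hypothetical validator: it no longer applies
// the strict label-value syntax rules and only enforces a size limit.
func ValidateAnnotationValue(value string) error {
	if len(value) > maxAnnotationValueSize {
		return fmt.Errorf("annotation value is %d bytes, exceeds the limit of %d",
			len(value), maxAnnotationValueSize)
	}
	return nil
}

func main() {
	// Characters that are not valid in a label value are fine here.
	fmt.Println(ValidateAnnotationValue("free-form text, with spaces & symbols!"))
	fmt.Println(ValidateAnnotationValue(string(make([]byte, 2048))))
}
```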
Kubernetes Prow Robot
6b80f654d4
Merge pull request #1600 from ArangoGutierrez/e2e-not-k8s
Move NFD api to a separate go mod
2024-04-09 02:06:06 -07:00
Markus Lehtonen
ba4cebb29e nfd-master: stop node-updater pool before reconfiguring api-controller
Prevents potential race between node-updater pool and the api-controller
when re-configuring nfd-master. Reconfiguration causes a new
api-controller instance to be created so nfd api lister might change in
the midst of processing a node update (if the pool was running). No
actual issues related to this have been identified but races (like this)
should still be avoided.
2024-04-09 10:45:07 +03:00
Markus Lehtonen
8709cccf71 nfd-master: parse kubeconfig even with NoPublish set
Don't try to be too smart when kubeconfig is needed. In practice, the
nfd-master really doesn't work anymore (with the NodeFeature API
enabled) without a kubeconfig set. This patch fixes crashes happening
when NoPublish is enabled, e.g. in listing all nodes in the nfd api
handler and in getting single node objects in the node updater pool.

This patch changes the kubeconfig parsing to happen at the creation of
the nfd-master instance. We don't need to do that at reconfigure time as
none of the dynamic config options affect it. Unit tests are adjusted,
accordingly.
2024-04-08 14:25:27 +03:00
Markus Lehtonen
fcb8d3cda4 nfd-master: implement opts for modifying NfdMaster instance
This provides a more controlled way for setting up the NfdMaster
instance for testing.
2024-04-05 20:21:19 +03:00
Kubernetes Prow Robot
199d665046
Merge pull request #1656 from marquiz/devel/channel-simplify
Tidy up usage of channels for signaling
2024-04-05 07:51:34 -07:00
Carlos Eduardo Arango Gutierrez
3434557d7c
Move NFD api to a separate go mod
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
2024-04-05 16:35:47 +02:00
Kubernetes Prow Robot
cb24f7c234
Merge pull request #1657 from marquiz/devel/master-label-whitelist
nfd-master: prevent crash on empty config struct
2024-04-05 05:36:52 -07:00
Markus Lehtonen
26a80cf142 Tidy up usage of channels for signaling
This started as a small effort to simplify the usage of "ready" channel
in nfd-master. It extended into a wider simplification/unification of
the channel usage.
2024-04-05 14:39:58 +03:00
Markus Lehtonen
b27676451a nfd-master: prevent crash on empty config struct
Change the handling of LabelWhiteList config option to use a pointer to
detect when the option is unset. This doesn't fix any detected crash but
is merely a general improvement and stabilization, serving easier testing.

Also, use the regexp type from the core libs for the config struct -
dropping the unmarshalling code for our custom regexp type - as the core
regexp now implements unmarshaller itself.
2024-04-05 14:19:44 +03:00
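The two ideas in commit b27676451a, a pointer field to tell "unset" from "set" and the standard library regexp doing its own unmarshalling, can be illustrated as below. The `Config` struct, the JSON key and the helper functions are hypothetical; the sketch uses `encoding/json` to stay within the standard library and relies on `regexp.Regexp` implementing `UnmarshalText`, which assumes Go 1.21 or newer.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
)

// Config is a hypothetical, reduced config struct. A nil LabelWhiteList
// means the option was not set at all.
type Config struct {
	LabelWhiteList *regexp.Regexp `json:"labelWhiteList,omitempty"`
}

// ParseConfig decodes the config; the regexp is compiled by the standard
// library's own text unmarshaller, no custom wrapper type is needed.
func ParseConfig(data []byte) (*Config, error) {
	c := &Config{}
	if err := json.Unmarshal(data, c); err != nil {
		return nil, err
	}
	return c, nil
}

// LabelAllowed treats an unset whitelist as "allow everything" and never
// dereferences a nil pointer.
func LabelAllowed(c *Config, label string) bool {
	if c == nil || c.LabelWhiteList == nil {
		return true
	}
	return c.LabelWhiteList.MatchString(label)
}

func main() {
	empty, _ := ParseConfig([]byte(`{}`))
	fmt.Println(LabelAllowed(empty, "anything"))

	cfg, err := ParseConfig([]byte(`{"labelWhiteList": "^cpu-"}`))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(LabelAllowed(cfg, "cpu-cpuid.AVX2"), LabelAllowed(cfg, "memory-numa"))
}
```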
Kubernetes Prow Robot
ad96c301a4
Merge pull request #1642 from marquiz/devel/master-updater-pool-lock
nfd-master: protect node updater pool queueing with a lock
2024-04-05 03:31:10 -07:00
Markus Lehtonen
44a5a5b4a8 nfd-master: get node object only once when updating node
Prevent excess queries of node objects from the Kubernetes apiserver.
This significantly speeds up node updates (and reduces the load on the
apiserver) as the client-side throttling (which is good) does not bite
us that hard.
2024-04-04 14:44:52 +03:00
Oleg Zhurakivskyy
f2e9557a2d nfd-topology-updater: Add liveness probe
Signed-off-by: Oleg Zhurakivskyy <oleg.zhurakivskyy@intel.com>
2024-04-03 13:15:54 +03:00
Markus Lehtonen
bce446c5b6 nfd-master: protect node updater pool queueing with a lock
Prevents races when (re-)starting the queue. There are no reports on
issues related to this (and I haven't come up with any actual failure
path in the current code) but better to be safe and follow the best
practices.
2024-03-27 16:53:34 +02:00
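Guarding a restartable queue with a lock, as commit bce446c5b6 describes, might look like the sketch below. The `UpdaterPool` type here is a hypothetical reduction written for this example and does not reproduce the nfd-master node updater pool.

```go
package main

import (
	"fmt"
	"sync"
)

// UpdaterPool is a hypothetical pool whose queue can be stopped and
// restarted. The mutex guards every access to the queue field.
type UpdaterPool struct {
	mu    sync.Mutex
	queue chan string
}

// Start creates the queue if the pool is not already running.
func (p *UpdaterPool) Start(capacity int) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.queue == nil {
		p.queue = make(chan string, capacity)
	}
}

// Stop closes the queue; calling it on a stopped pool is a no-op.
func (p *UpdaterPool) Stop() {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.queue != nil {
		close(p.queue)
		p.queue = nil
	}
}

// Enqueue adds a node name without blocking. It returns false when the pool
// is stopped or the queue is full, so it can never send on a closed channel.
func (p *UpdaterPool) Enqueue(node string) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.queue == nil {
		return false
	}
	select {
	case p.queue <- node:
		return true
	default:
		return false
	}
}

func main() {
	p := &UpdaterPool{}
	fmt.Println("before start:", p.Enqueue("node-1"))
	p.Start(2)
	fmt.Println("running:", p.Enqueue("node-1"))
	p.Stop()
	fmt.Println("after stop:", p.Enqueue("node-1"))
}
```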
Markus Lehtonen
c4e010eafd nfd-master: do nfd API scheme registration in an init function
Prevents (rare) races on nfd-master reconfiguration. Previously the
scheme was registered at nfd API controller creation/startup time. This
caused a race with some lister/informer goroutines of the previous
(stopped) controller still running and accessing (reading) the scheme
while we were updating (writing) it.
2024-03-27 15:26:16 +02:00
Oleg Zhurakivskyy
7bd27c757a topology-updater: Set APIVersion, Kind in the OwnerReference explicitly
APIVersion and Kind are empty in the returned namespace object
and need to be set explicitly.

Signed-off-by: Oleg Zhurakivskyy <oleg.zhurakivskyy@intel.com>
2024-03-20 20:09:06 +02:00
Kubernetes Prow Robot
0ad5e50f24
Merge pull request #1609 from ozhuraki/worker-health
nfd-worker: Add liveness probe
2024-03-19 06:57:23 -07:00
Oleg Zhurakivskyy
8b63d17af7 nfd-worker: Add liveness probe
Signed-off-by: Oleg Zhurakivskyy <oleg.zhurakivskyy@intel.com>
2024-03-19 15:34:53 +02:00
Kubernetes Prow Robot
c4ff25de52
Merge pull request #1596 from marquiz/devel/master-infinite-retry
nfd-master: retry node updates indefinitely
2024-03-19 04:00:50 -07:00
Kubernetes Prow Robot
7df0f17f68
Merge pull request #1602 from ozhuraki/nrt-owner-ref
Add owner reference to NRT object
2024-03-19 01:12:59 -07:00
Markus Lehtonen
e7f87de6df nfd-master: retry node updates indefinitely
Treat node updates like a reconciliation loop. Keep trying on node
update as long as it fails. Node update permafailing likely indicates a
bug in the nfd code (there should be no reason for it to fail forever)
and it's better to clearly see it in the logs/metrics rather than giving
up after a few retries.
2024-03-18 18:14:24 +02:00
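The "keep trying as long as it fails" behavior from commit e7f87de6df can be sketched as a loop that only ends on success or shutdown. `RetryForever` and `demoFlaky` are hypothetical names, and the fixed delay between attempts is an assumption; the commit message does not describe the backoff policy.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// RetryForever calls update until it succeeds. There is no retry limit; the
// only other way out is cancellation of ctx (for example on shutdown).
// It returns the number of attempts made.
func RetryForever(ctx context.Context, delay time.Duration, update func() error) (int, error) {
	attempts := 0
	for {
		attempts++
		if err := update(); err == nil {
			return attempts, nil
		}
		select {
		case <-ctx.Done():
			return attempts, ctx.Err()
		case <-time.After(delay):
			// try again
		}
	}
}

// demoFlaky simulates an update that fails `failures` times before succeeding.
func demoFlaky(failures int) (int, error) {
	remaining := failures
	update := func() error {
		if remaining > 0 {
			remaining--
			return errors.New("transient failure")
		}
		return nil
	}
	return RetryForever(context.Background(), time.Millisecond, update)
}

func main() {
	attempts, err := demoFlaky(3)
	fmt.Println("attempts:", attempts, "err:", err)
}
```

In a real controller each failed attempt would also be logged or counted in a metric, which is what makes a permanently failing update visible.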
Kubernetes Prow Robot
4790962123
Merge pull request #1595 from marquiz/devel/master-check-node-existence
nfd-master: check if node exists before trying update
2024-03-18 04:19:57 -07:00
Kubernetes Prow Robot
797fada92e
Merge pull request #1585 from kannon92/add-swap-support
add swap support in nfd
2024-03-18 04:19:48 -07:00
Carlos Eduardo Arango Gutierrez
06c4733bc5
Add FeatureGate framework to handle new features
Code inspired by https://github.com/kubernetes/component-base/tree/master/featuregate

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
2024-03-15 19:11:32 +01:00
Oleg Zhurakivskyy
c662265a47 topology-updater: Add owner reference to NRT object
Signed-off-by: Oleg Zhurakivskyy <oleg.zhurakivskyy@intel.com>
2024-03-15 16:36:27 +02:00
Kubernetes Prow Robot
52d4337004
Merge pull request #1615 from marquiz/devel/master-mem-leak
nfd-master: fix memory leak in nfd api-controller
2024-03-14 08:21:33 -07:00
Carlos Eduardo Arango Gutierrez
69dbfdfbc0
Use close to signal stop channel in worker and topology-updater
Fix stop channel management on Worker and T-updater in case of multiple callers

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
2024-03-14 15:28:39 +01:00
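Signaling stop by closing a channel, as in commit 69dbfdfbc0, lets any number of goroutines observe the stop, but a second `close` on the same channel panics. One common way to make `Stop` safe for multiple callers is `sync.Once`, shown below; whether the actual patch uses `sync.Once` or another guard is not stated in the commit message, and the `Stopper` type is hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// Stopper is a hypothetical helper wrapping a stop channel.
type Stopper struct {
	once sync.Once
	ch   chan struct{}
}

func NewStopper() *Stopper {
	return &Stopper{ch: make(chan struct{})}
}

// Stop closes the channel exactly once, so repeated or concurrent calls
// do not panic.
func (s *Stopper) Stop() {
	s.once.Do(func() { close(s.ch) })
}

// Done returns a channel that is closed when Stop has been called.
// Every receiver sees the close, unlike a single value sent on the channel.
func (s *Stopper) Done() <-chan struct{} {
	return s.ch
}

// Stopped reports whether Stop has been called, without blocking.
func (s *Stopper) Stopped() bool {
	select {
	case <-s.ch:
		return true
	default:
		return false
	}
}

func main() {
	s := NewStopper()
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			<-s.Done() // all three goroutines are released by one close
		}()
	}
	s.Stop()
	s.Stop() // second caller is harmless
	wg.Wait()
	fmt.Println("stopped:", s.Stopped())
}
```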