node-feature-discovery

mirror of https://github.com/kubernetes-sigs/node-feature-discovery.git synced 2024-12-14 11:57:51 +00:00

Author	SHA1	Message	Date
Markus Lehtonen	2382c34697	nfd-master: fix node status patching Correctly patch the "status" subresource. This got broken when refactoring the code in `7a050e7cf9` and wasn't even caught by the unit tests as the fake kubernetes client doesn't handle subresources as the real apiserver does.	2024-01-26 22:00:13 +02:00
Markus Lehtonen	8a6a731eb0	Drop pkg/apihelper The code is now unused.	2024-01-26 18:50:31 +02:00
Kubernetes Prow Robot	33858b7502	Merge pull request #1567 from marquiz/devel/apihelper-refactor-3 topology-updater: ditch apihelper	2024-01-26 17:07:13 +01:00
Markus Lehtonen	7a050e7cf9	nfd-master: ditch apihelper Implement some of the frequently used helper functions in-package. This patch also contains big changes to the nfd-master unit tests. Much of this is about migrating from the mocked apihelper interface to fake kubernetes client that provides a bit more apiserver'ish functionality. At the same time there is quite a bit of renaming in the tests, shortening and unifying naming and getting rid of the extensive usage of "mock" everywhere.	2024-01-26 16:09:22 +02:00
Markus Lehtonen	c581a25a39	topology-updater: ditch apihelper Stop using pkg/apihelper for accessing the Kubernetes API. Modify unit tests to use the fake kubernetes client instead of mocked apihelper interface.	2024-01-25 22:15:20 +02:00
Markus Lehtonen	53003cbf69	pkg/utils: move JsonPatch from pkg/apihelper	2024-01-25 17:23:14 +02:00
Markus Lehtonen	2326459d05	topology-updater: get topology api client directly Stop using apihelper for getting the noderesourcetopology-api client.	2024-01-25 16:33:34 +02:00
Markus Lehtonen	acf815fb10	pkg/utils: move GetKubeconfig from pkg/apihelper here This change is part of an effort to remove the pkg/apihelper package. GetKubeconfig is useful helper functionality shared across the codebase so move it into a "safe" location.	2024-01-24 16:10:02 +02:00
Markus Lehtonen	57b7a3c6a8	Wrap nested errors	2024-01-22 22:45:15 +02:00
Markus Lehtonen	b452ab6a5c	topology-updater: initialize properly with -no-publish We need to parse kubeconfig (and initialize the apihelper) even with -no-publish as the PodResourcesScanner accesses the k8s API even if we're not publishing/updating NRTs.	2024-01-22 14:15:12 +02:00
Kubernetes Prow Robot	3667a4d073	Merge pull request #1537 from ozhuraki/apis-nfd-test apis/nfd: Trivial typo fix in tests	2024-01-19 15:25:41 +01:00
Markus Lehtonen	58ae81804c	go.mod: update dependencies	2024-01-15 21:29:32 +02:00
Oleg Zhurakivskyy	eec05e1c7a	apis/nfd: Trivial typo fix in tests Signed-off-by: Oleg Zhurakivskyy <oleg.zhurakivskyy@intel.com>	2024-01-15 18:06:58 +02:00
Markus Lehtonen	a053efda64	nfd-master: run a separate gRPC health server This patch separates the gRPC health server from the deprecated gRPC server (disabled by default, replaced by the NodeFeature CRD API) used for node labeling requests. The new health server runs on hardcoded TCP port number 8082. The main motivation for this change is to make the Kubernetes' built-in gRPC liveness probes to function if TLS is enabled (as they don't support TLS). The health server itself is a naive implementation (as it was before), basically only checking that nfd-master has started and hasn't crashed. The patch adds a TODO note to improve the functionality.	2024-01-04 13:58:26 +02:00
Carlos Eduardo Arango Gutierrez	57b6035b71	Add kubectl-nfd kubectl-nfd is a kubectl plugin for debugging NodeFeatureRules Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>	2023-12-21 16:00:19 +01:00
Markus Lehtonen	97bf841140	apis/nfd: split rule processing into a separate package This patch tidies up the nfdv1alpha1 API package by refactoring out the implementation of (NodeFeature)Rule evaluation into a separate package.	2023-12-20 12:52:15 +02:00
Gyuho Lee	ed0418b81c	chore(nfd-worker): fix minor typo in wrong label value format error Signed-off-by: Gyuho Lee <gyuho@lepton.ai>	2023-12-19 02:29:37 +08:00
Markus Lehtonen	b28d5c1557	apis/nfd: drop unused validate function	2023-12-18 15:19:19 +02:00
Markus Lehtonen	74bc3bb2a8	apis/nfd: drop custom unmarshaller functions Not needed in the external API.	2023-12-18 15:19:19 +02:00
Kubernetes Prow Robot	884edc67eb	Merge pull request #1477 from marquiz/devel/api-cleanup apis/nfd: drop the private template caching fields	2023-12-15 15:42:31 +01:00
Markus Lehtonen	912c7dcf2c	apis/nfd: fix an error in auto-generated code Work around a bug in k8s deepcopy-gen.	2023-12-15 11:32:23 +02:00
Markus Lehtonen	fe412a54b9	apis/nfd: add matchName field in feature matcher terms Extend the format of feature matcher terms (the elements of the array specified under the matchFeatures field) with new matchName field. The value of this field is an expression that is evaluated against the names of feature elements instead of their values (values are matched with the matchExpressions field, instead). The matchName field is useful e.g. in template rules for creating per-feature-element labels based on feature names (instead of values) and in non-template rules for checking if (at least) one of certain feature element names are present. If both matchExpressions and matchName are specified for a certain feature matcher term, they both must match in order to get an overall match. Also, in this case the list of matched features (used in templating) is the union of the results from matchExpressions and matchName. An example of creating an "avx512" label if any AVX512* CPUID feature is present: - name: "avx wildcard rule" labels: avx512: "true" matchFeatures: - feature: cpu.cpuid matchName: {op: InRegexp, value: ["^AVX512"]} An example of a template rule creating a dynamic set of labels based on the existence of certain kconfig options. - name: "kconfig template rule" labelsTemplate: | {{ range .kernel.config }}kconfig-{{ .Name }}={{ .Value }} {{ end }} matchFeatures: - feature: kernel.config matchName: {op: In, value: ["SWAP", "X86", "ARM"]} NOTE: this patch changes the corner case of nil/null match expressions with instance features (i.e. "matchExpressions: null"). Previously, we returned all instances for templating but now a nil match expression is not evaluated and no instances for templating are returned.	2023-12-15 11:32:23 +02:00
Markus Lehtonen	b2d9e15a00	apis/nfd: drop the private template caching fields Drop the private fields – that were supposed to be used for caching parsed templates – from the Rule type. Keep the API typedefs cleaner and simpler. Moreover, the caching was not even used in practice, effectively complicating code without any benefit: the way the types are used in nfd-master creates a local copy of Rule type storing the cached template in the copy, wasting it from any future users. There are also other possible caveats in caching like we tried to do it. For example the objects returned by the api lister are supposed to be treated as read-only - in particular if we would be to modify them there should at least be proper locking in place as nfd-master potentially processes the same rule (the same Go object) in parallel for multiple nodes. If any optimization like this will be pursued it should be done properly, probably with private type(s) at the consumer's end, not contaminating the API types.	2023-12-15 10:48:07 +02:00
Markus Lehtonen	0bc1b6c28f	apis/nfd: drop creation helper functions Drop the creation helper functions as one step in an effort to tidy up the api package. These functions were not much used outside unit tests anyway, the static rules of the nfd-worker custom feature source being the only exception (and if those happened to be invalid we'd catch that e.g. in the e2e-tests).	2023-12-14 15:54:51 +02:00
Kubernetes Prow Robot	3ce5a1b218	Merge pull request #1482 from marquiz/devel/api-cleanup-2 apis/nfd: drop the private regexp caching field	2023-12-14 12:08:58 +01:00
Markus Lehtonen	cb0a46ec0e	Use generics for maps and slices	2023-12-13 12:09:53 +02:00
Markus Lehtonen	a77983556f	nfd-master: remove default denied ns from config These are now handled by the validate package. If we have them here in nfd-master, the default namespace (feature.node.kubernetes.io) gets denied.	2023-12-12 16:12:53 +02:00
Kubernetes Prow Robot	efe5c03071	Merge pull request #1455 from ArangoGutierrez/validation Create a Validate pkg	2023-12-12 11:04:06 +01:00
Carlos Eduardo Arango Gutierrez	affb93ea50	Create a Validate pkg Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>	2023-12-11 16:54:22 +01:00
Markus Lehtonen	34574f4211	nfd-worker: set owner reference in NodeFeature objects This patch creates an owner-dependent relationship between the nfd-worker pod and the NodeFeature object that it creates. With this change the orphaned NodeFeature object(s) get automatically garbage-collected when the nfd-worker pod goes away, without the need for manual clean-up actions.	2023-12-08 14:57:31 +02:00
Markus Lehtonen	8d40524b88	apis/nfd: drop the private regexp caching field Drop the private field for caching parsed regexp from the MatchExpression type. This tidies up the API type definition and makes it less tied to particular implementation details. The change also eliminates potential concurrency problems as no locking is in place in the API types. If caching will be desired in the future, it's better to do it properly in a separate package, not directly in the API types.	2023-12-01 15:28:55 +02:00
Markus Lehtonen	b988139094	apis/nfd: validate input when matching expression Don't assume that the fields are correct.	2023-12-01 09:22:32 +02:00
Markus Lehtonen	94bffbf645	generate: update kube code-gen to v1.28.4	2023-11-29 18:37:19 +02:00
Kubernetes Prow Robot	dfef0ebe4a	Merge pull request #1472 from marquiz/devel/typo-fix nfd-worker: fix typo in log message	2023-11-24 16:53:49 +01:00
Markus Lehtonen	f266533a7d	nfd-worker: fix typo in log message	2023-11-24 17:17:42 +02:00
Markus Lehtonen	f6c360188e	Use T.Run in expression unit tests The "better way" of running test cases, get e.g. better output in case of errors. Also drop some unneeded type definitions from the tests.	2023-11-24 17:14:12 +02:00
Markus Lehtonen	f489ca98b5	Reproducible output from expression matching Fix flakiness of unit tests by adding back the sorting of matched feature elements that was unadvisedly removed in `63c22551df`. This might help debugging some corner cases in real-life scenarios (when using templating), too.	2023-11-24 16:27:38 +02:00
Kubernetes Prow Robot	ed8898de6a	Merge pull request #1461 from marquiz/devel/no-implicit-ns Option to stop implicitly adding default prefix to names	2023-11-24 14:53:09 +01:00
Kubernetes Prow Robot	7154458524	Merge pull request #1468 from marquiz/devel/nfr-template-fix apis/nfd: fix multiple matcher terms targeting the same feature	2023-11-24 13:20:49 +01:00
Markus Lehtonen	1d012a28cd	Option to stop implicitly adding default prefix to names Add new autoDefaultNs (default is "true") config option to nfd-master. Setting the config option to false stops NFD from automatically adding the "feature.node.kubernetes.io/" prefix to labels, annotations and extended resources. Taints are not affected as for them no prefix is automatically added. The user-visible part of enabling the option change is that NodeFeatureRules, local feature files, hooks and configuration of the "custom" source may need to be altered (if the auto-prefixing is relied on). For now, the config option defaults to "true", meaning no change in default behavior. However, the intent is to change the default to "false" in a future release, deprecating the option and eventually removing it (forcing it to "false"). The goal of stopping doing "auto-prefixing" is to simplify the operation (of nfd and users). Make the naming more straightforward and easier to understand and debug (kind of WYSIWYG), eliminating peculiar corner cases: 1. Make validation simpler and unambiguous 2. Remove "overloading" of names, i.e. the mapping of two values to the same actual name. E.g. previously something like labels: feature.node.kubernetes.io/foo: bar foo: baz Could actually result in node label: feature.node.kubernetes.io/foo: baz 3. Make the processing/usage of the "rule.matched" and "local.labels" feature in NodeFeatureRules unambiguous and more understandable. E.g. previously you could have node label "feature.node.kubernetes.io/local-foo: bar" but in the NodeFeatureRule you'd need to use the unprefixed name "local-foo" or the fully prefixed name, depending on what was specified in the feature file (or hook) on the node(s). NOTE: setting autoDefaultNs to false is a breaking change for users who rely on automatic prefixing with the default feature.node.kubernetes.io/ namespace. 
NodeFeatureRules, feature files, hooks and custom rules (configuration of the "custom" source of nfd-worker) will need to be altered. Unprefixed labels, annotations and extended resources will be denied by nfd-master.	2023-11-24 12:48:20 +02:00
Markus Lehtonen	dc5af8be04	nfd-master: predictable handling of unprefixed names Make the handling of unprefixed names (of labels, annotations and extended resources) well-defined and predictable. Previously the resulting output was not predictable in case the same name was coming in both the unprefixed and prefixed form, say unprefixed "foo=bar" coming from one source (be it nfd-worker or NodeFeature(Rule)) and "feature.node.kubernetes.io/foo=baz" from a NodeFeature(Rule). Previously the output value was randomly either "bar" or "baz". This patch adds prefixes to all names early in the processing "pipeline", preventing random name clashes later on.	2023-11-23 22:16:04 +02:00
Markus Lehtonen	678d7e89cb	nfd-master: drop stale variables Remove some stale variables that were leftover from the recent removal of nfd version annotations.	2023-11-23 19:01:22 +02:00
Markus Lehtonen	63c22551df	apis/nfd: fix multiple matcher terms targeting the same feature Fix NodeFeatureRule templating in cases where multiple matchFeatures terms are targeting the same feature. Previously, only matched feature elements from the last matcher terms were used as the input to the template. However, the input should contain all matched elements from all matcher terms. For example, consider the example rule snippet below: ... labelsTemplate: | {{ range .pci.device }}vendor.io/pci-device.{{ .class }}-{{ .device }}=exists {{ end }} matchFeatures: - feature: pci.device matchExpressions: class: {op: InRegexp, value: ["^03"]} vendor: {op: In, value: ["1234"]} - feature: pci.device matchExpressions: class: {op: InRegexp, value: ["^12"]} This rule matches if both a pci device of class 03 from vendor 1234 exists and a pci device of class 12 (from any vendor) exists. Previously, the template would only generate labels from the devices in class 12 (as that's the last term). With this patch the template creates device labels from devices in both classes 03 and 12.	2023-11-22 10:43:52 +02:00
Kubernetes Prow Robot	371ed3ff21	Merge pull request #1458 from marquiz/devel/logging-fix apis/nfd: fix logging of rule expression processing	2023-11-21 12:04:59 +01:00
Markus Lehtonen	9cbe742bfb	apis/nfd: fix incorrect comments of matching functions This patch updates the comments to correspond to the actual behavior which was changed back in `36341bf4c7`.	2023-11-20 10:11:35 +02:00
Markus Lehtonen	8ec55fe8db	apis/nfd: fix logging of rule expression processing	2023-11-10 09:40:54 +02:00
Carlos Eduardo Arango Gutierrez	c0063be4f4	Discover node features as annotations Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com> Co-authored-by: bebc <mchf1990212@gmail.com> Co-authored-by: Markus Lehtonen <markus.lehtonen@intel.com>	2023-10-25 19:58:58 +02:00
Markus Lehtonen	a9849f20ff	nfd-master: fix retry of node updates This patch addresses issues with slow node status (extended resources) updates. Previously we did just a few retries in quick succession which could result in the node update failing, just because node status was updated slower than our retry window. The patch mitigates the issue by increasing the number of tries to 15. In addition, it creates a ratelimiter with a longer per-item (per-node) base delay. The patch also fixes the e2e-tests to expose the issue.	2023-10-20 17:24:01 +03:00
Markus Lehtonen	98c3b0750d	nfd-gc: add metrics Implements three metrics for nfd-gc: - nfd_gc_build_info: version information of nfd-gc. - nfd_gc_objects_deleted_total: total number of NodeFeature and NodeResourceTopology objects deleted by nfd-gc. - nfd_gc_object_delete_failures_total: number of errors encountered when deleting NodeFeature and NodeResourceTopology objects.	2023-10-09 13:39:28 +00:00
Markus Lehtonen	f5c6ce2843	nfd-gc: simplify initialization	2023-10-09 11:48:49 +03:00
Markus Lehtonen	5171ae0f90	Refactor metrics Move common boilerplate code under pkg/utils.	2023-10-09 10:49:12 +03:00
Markus Lehtonen	1d8a83b045	nfd-master: stop creating NFD version annotations We now have metrics for getting detailed information about the NFD instances running. There should be no need to pollute the node object with NFD version annotations. One problem with the annotations was also that they were incomplete in the sense that they only covered nfd-master and nfd-worker but not nfd-topology-updater or nfd-gc. Also, there was a problem with stale annotations, giving misleading information. E.g. there was no way to remove old/stale master.version annotations if nfd-master was scheduled on a node other than the one where it was previously running.	2023-10-05 14:53:29 +03:00
Markus Lehtonen	9ea0a1b420	nfd-master: correctly clean up annotations Delete correct annotations if -instance is specified.	2023-10-05 11:10:06 +03:00
Markus Lehtonen	dbf00dcda6	apis/nfd: drop one stale comment line Drop a leftover "docstring" comment that wasn't removed with the type it refers to.	2023-09-27 14:23:12 +03:00
Markus Lehtonen	b09ce75c8e	nfd-master: fix filtering of extended resources Fix a bug in checking the allowed ".feature.node.kubernetes.io" ns suffix for extended resources. Also update e2e-tests to cover this case.	2023-09-27 10:55:11 +03:00
AhmedGrati	7ab6314bdc	chore: introduce a common klog handling for cmd/nfd-* Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>	2023-09-07 22:38:15 +01:00
AhmedGrati	b0be40aa09	feat: add logging parameters in configuration file for nfd master Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>	2023-09-06 15:27:27 +01:00
Kubernetes Prow Robot	19520c079c	Merge pull request #1325 from ffromani/nfd-updater-fix-events nfd-updater: events: enable timer-only flow	2023-09-04 05:47:49 -07:00
Francesco Romani	000c919071	nfd-updater: events: enable timer-only flow The nfd-topology-updater has state-directories notification mechanism enabled by default. In theory, we can have only timer-based updates, but if the option is given to disable the state-directories event source, then the whole update mechanism is mistakenly disabled, including the timer-based updates. The two update mechanisms should be decoupled. So this PR changes this to make sure we can enable only the timer-based updates. Signed-off-by: Francesco Romani <fromani@redhat.com>	2023-09-04 13:05:50 +02:00
Kubernetes Prow Robot	f852c32a55	Merge pull request #1252 from AhmedGrati/test-add-updater-pool-unit-tests test: add node updater pool unit tests	2023-09-01 07:34:32 -07:00
Kubernetes Prow Robot	e1f90a233b	Merge pull request #1305 from marquiz/devel/nf-gc Garbage collection of NodeFeature objects	2023-08-28 02:59:42 -07:00
Kubernetes Prow Robot	6d95e59cd0	Merge pull request #1290 from marquiz/devel/metrics-new metrics: additional metrics for nfd-master	2023-08-28 02:07:42 -07:00
Markus Lehtonen	e3415ec484	nfd-gc: support garbage collection of NodeFeatures Hook into the same logic already exercised for NodeResourceTopology objects: GC watches for node delete events and immediately drops stale objects (NRT and now also NF). In addition there is a periodic resync to catch any missed node deletes, once every hour by default.	2023-08-22 21:24:26 +03:00
Markus Lehtonen	01c08d67b6	Rename nfd-topology-gc to nfd-gc This is preparation for making it a generic garbage collector for all nfd-managed api objects.	2023-08-21 21:46:11 +03:00
Kubernetes Prow Robot	e0c477090b	Merge pull request #1311 from marquiz/devel/refactor-gc-5 topology-gc: simplify listing of node objects	2023-08-21 11:40:05 -07:00
Markus Lehtonen	f05b0e26ea	topology-gc: move initial GC out of startNodeInformer() Small refactor. Contextually this feels more like under periodicGC().	2023-08-21 10:11:46 +03:00
Kubernetes Prow Robot	a60502a313	Merge pull request #1307 from marquiz/devel/refactor-gc topology-gc: refactor unit tests	2023-08-21 00:09:23 -07:00
Kubernetes Prow Robot	536f9d17d0	Merge pull request #1295 from marquiz/devel/topology-updater-metrics nfd-topology-updater: add metrics support	2023-08-20 23:25:24 -07:00
Markus Lehtonen	2e8da8849a	topology-gc: simplify listing of node objects Hopefully makes the code slightly more readable.	2023-08-21 09:13:41 +03:00
Markus Lehtonen	0b5e51bd35	topology-gc: refactor unit tests Remove a lot of boilerplate code by defining reusable functions. Also, test the Run() method instead of the callee functions of Run() as it is the top level functionality that was tested in practice (we don't have separate unit tests for the callee functions).	2023-08-21 09:10:24 +03:00
Kubernetes Prow Robot	4674bce27d	Merge pull request #1310 from marquiz/devel/refactor-gc-4 topology-gc: rename runGC to garbageCollect()	2023-08-18 11:26:34 -07:00
Kubernetes Prow Robot	f4cf4877f2	Merge pull request #1309 from marquiz/devel/refactor-gc-3 topology-gc: rename run()	2023-08-18 11:26:28 -07:00
Markus Lehtonen	ec51b29b3c	topology-gc: rename runGC to garbageCollect() One less function named run.	2023-08-18 17:57:05 +03:00
Markus Lehtonen	98b0b36b87	topology-gc: rename run() Too many run methods here.	2023-08-18 17:52:11 +03:00
Markus Lehtonen	108d603bdc	topology-gc: fix Stop The stop channel has multiple readers so we need to close it so that all of the readers get notified.	2023-08-18 17:46:54 +03:00
Kubernetes Prow Robot	9d61b19454	Merge pull request #1287 from freelizhun/fix-empty-hugepages fix empty hugepages in some numa nodes caused no such file or directory errors	2023-08-08 02:50:16 -07:00
lizhun	a4ad3d4411	fix empty hugepages in some numa nodes caused no such file or directory error Signed-off-by: lizhun <lizhun@kylinos.cn>	2023-08-08 15:14:44 +08:00
Markus Lehtonen	5ad2294c14	metrics: add nfd_node_update_requests_total counter Add a counter for total number of node update/sync requests. In practice, this counts the number of gRPC requests received if the gRPC API is in use. If the NodeFeature API is enabled, this counts the requests initiated by the NFD API controller, i.e. updates triggered by changes in NodeFeature or NodeFeatureRule objects plus updates initiated by the controller resync period.	2023-08-07 09:37:29 +03:00
Markus Lehtonen	4b24cc1afa	metrics: counters for rejected labels, extended resources and taints Add counters for labels, extended resources and taints rejected/filtered out by nfd-master.	2023-08-07 09:37:29 +03:00
Markus Lehtonen	a8a29e6df2	metrics: add nfd_nodefeaturerule_processing_errors_total counter Add a counter for errors encountered when processing NodeFeatureRules. Another simple counter without any additional prometheus labels - nfd-master logs can provide further details.	2023-08-07 09:37:29 +03:00
Markus Lehtonen	b90f2c318e	metrics: add nfd_node_update_failures_total counter Add a new counter for tracking node update failures from nfd-master. This tracks both normal feature updates and the --prune sub-command. This is a simple counter without any additional labels - nfd-master logs can be used for further diagnostics.	2023-08-07 09:37:27 +03:00
Markus Lehtonen	06b333db1e	nfd-topology-updater: add metrics support For now, add only one metric, a counter for the errors occurring while scanning pod resources on the node.	2023-08-04 16:48:37 +03:00
Markus Lehtonen	039378c725	nfd-master: use term node update instead of labeling Rename symbols and reword log messages to correlate with the functionality (we may do other updates than just modify labels nowadays).	2023-08-01 16:42:34 +03:00
Markus Lehtonen	d8f167d8a9	nfd-master: remove one stale empty line	2023-08-01 16:38:32 +03:00
Kubernetes Prow Robot	c1cb63243b	Merge pull request #1288 from marquiz/devel/metrics Improve metrics	2023-07-31 10:38:39 -07:00
Markus Lehtonen	5091fef84b	metrics: improve feature discovery duration metric Rename the "NodeName" prometheus label to "node", aligning with common prometheus/kubernetes conventions. Also reconfigure the prometheus histogram buckets (now 10ms to 1s) to better match the expected sample range.	2023-07-31 19:45:22 +03:00
Markus Lehtonen	47f621d970	metrics: improve the node updates gauge Rename the metric, better describe what we're measuring and better comply with prometheus naming conventions. Also change it to represent actual updates of the node object on the Kubernetes apiserver.	2023-07-31 19:45:22 +03:00
Markus Lehtonen	945e7fcb3f	metrics: improve nfr processing time metric Change the metric from a simple gauge (that basically was a single value for the whole cluster) into a HistogramVec, aligning with the feature discovery duration metric in nfd-worker. This improved metric now has prometheus labels for the NFR name and node name, i.e. it is tracking per-NFR metric for each node being processed. Also, change the naming to better comply with prometheus suggested conventions.	2023-07-31 19:45:22 +03:00
Kubernetes Prow Robot	01ca8cb91d	Merge pull request #1284 from marquiz/devel/generator-deps generate: bump tools to their latest versions	2023-07-31 06:32:39 -07:00
Kubernetes Prow Robot	e0f10a81de	Merge pull request #1256 from PiotrProkop/fix-topo-updater-policy-and-scope-advertisment Fix Topology Manager policy and scope not being updated after NRT creation	2023-07-28 00:25:54 -07:00
Markus Lehtonen	7e375ad1f0	generate: bump tools to their latest versions Bump tools versions and re-auto-generate files.	2023-07-27 14:29:48 +03:00
Kubernetes Prow Robot	77d869c4f7	Merge pull request #1242 from ArangoGutierrez/metrics Enable metrics via prometheus operator	2023-07-21 02:26:08 -07:00
Carlos Eduardo Arango Gutierrez	e3aedd33e2	Enable metrics via prometheus operator Expose metrics via prometheus.monitoring.coreos.com/v1 The exposed metrics are | Metric | Type | Meaning | | --------------- | ---------------- | ---------------- | | `nfd_master_build_info` | Gauge | Version from which nfd-master was built. | | `nfd_worker_build_info` | Gauge | Version from which nfd-worker was built. | | `nfd_updated_nodes` | Counter | Time taken to label a node | | `nfd_crd_processing_time` | Gauge | Time taken to process a NodeFeatureRule CRD | | `nfd_feature_discovery_duration_seconds` | HistogramVec | Time taken to discover features on a node | Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com> Co-authored-by: Markus Lehtonen <markus.lehtonen@intel.com>	2023-07-21 10:59:52 +02:00
pprokop	6d98b6150b	Fix Topology Manager policy and scope not being updated properly NFD is only detecting policy and scope of Topology Manager when NRT object doesn't exist. This means that topologyManagerScope and topologyManagerPolicy attributes won't be updated even if kubelet config was changed to use other TopologyManager policy and scope. Signed-off-by: pprokop <pprokop@nvidia.com>	2023-07-20 16:31:12 +02:00
AhmedGrati	8e55d78d85	test: add node updater pool unit tests Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>	2023-07-19 12:03:35 +01:00
Markus Lehtonen	dac45be28c	nfd-master: check for nil references in nfdAPIUpdateAllNodes Just a safeguard.	2023-07-17 17:49:44 +03:00
hang.jiang	698031fc2d	Stop ticker in time to avoid memory leak Because it will cause memory leak if we do not stop ticker when the function has completed. Signed-off-by: hang.jiang <hang.jiang@daocloud.io>	2023-07-05 18:35:01 +08:00
guoguangwu	b946bcc0f5	nfd-master-internal_test.go rm pkg imported twice Signed-off-by: guoguangwu <guoguangwu@magic-shield.com>	2023-06-21 16:53:55 +08:00
Kubernetes Prow Robot	306969a945	Merge pull request #1133 from AhmedGrati/feat-parallelize-nodes-update feat: parallelize nodes update	2023-06-02 05:28:57 -07:00
AhmedGrati	b3cfe17392	feat: parallelize nodes update This PR aims to optimize the process of updating nodes with corresponding features. In fact, previously, we were updating nodes sequentially even though they are independent from each other. Therefore, we integrated new components: LabelersNodePool which is responsible for spinning a goroutine whenever there's a request for updating nodes, and a Workqueue which is responsible for holding node names that should be updated. Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>	2023-06-02 11:41:50 +01:00
AhmedGrati	08b9c3486e	feat: support dynamic values for labels in the NodeFeatureRule This PR aims to support the dynamic values for labels in the NodeFeatureRule CRD, it would offer more flexible labeling for users. To achieve this, we check whether label value starts with "@", and if it's the case, we will get the value of the feature value, and update the value of the label with the feature value. Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>	2023-05-31 23:30:26 +01:00
Markus Lehtonen	bf670de68d	pkg/utils: migrate KlogDump to structured logging Drop the KlogDump helper in favor of klog.InfoS. However, this patch introduces a new DelayedDumper() helper to avoid processing (marshalling) of object unless really evaluated by the logging function.	2023-05-31 14:43:08 +03:00
Markus Lehtonen	4947ebf336	pkg/util: migrate to structured logging The gRPC logging interface is not compatible with structured logging so grpcLogger is left intact.	2023-05-31 14:43:08 +03:00
Markus Lehtonen	64d5af016e	apis/nfd: migrate to structured logging	2023-05-31 14:43:08 +03:00
Markus Lehtonen	6e3b181ab4	topology-updater: migrate to structured logging	2023-05-31 14:43:08 +03:00
Markus Lehtonen	7be08f9e7f	nfd-worker: migrate to structured logging	2023-05-31 14:43:08 +03:00
Markus Lehtonen	8113d651c2	nfd-master: migrate to structured logging	2023-05-31 14:43:05 +03:00
Markus Lehtonen	2a3c7e4c93	nfd-master: add validation of label names and values Validate labels before trying to update the node. Makes us fail early and prevents useless retries in case invalid labels are tried.	2023-05-29 16:54:14 +03:00
Markus Lehtonen	1809c24314	nfd-master: use close for stop channel Simpler and more reliable (in case of multiple consumers) to just close the channel.	2023-05-24 16:51:48 +03:00
PiotrProkop	272fd4784f	Add new flag enable-leader-election for nfd-master. It allows NFD-master to be run in active-passive way when running multiple instances of NFD-master to prevent multiple components from updating same custom resources. Signed-off-by: PiotrProkop <pprokop@nvidia.com>	2023-05-15 13:30:07 +02:00
Kubernetes Prow Robot	85073525c3	Merge pull request #1185 from AhmedGrati/fix-resync-period-functionality nfd-master: fix resync period config option	2023-05-02 11:14:16 -07:00
AhmedGrati	87c2d7e184	nfd-master: fix resync period config option This PR fixes the resync-period configuration option of the nfd-master. In fact, previously, changes were not reflected in the nfd-master at runtime. e2e tests are also implemented to make sure that the fix is already working as expected. Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>	2023-05-02 13:17:01 +02:00
Markus Lehtonen	fb20388028	nfd-master: refactor filtering of taints	2023-04-28 18:13:54 +03:00
Markus Lehtonen	43ced0c1a1	nfd-master: refactor filtering of feature labels More consistent error messages. Also preparation for dynamic labels values (that '@' notation currently supported for extended resources).	2023-04-28 18:13:54 +03:00
Markus Lehtonen	6ca687fbef	nfd-master: refactor filtering of extended resources Simplify code a bit and get more consistent error messages (in addition to fixing some of those).	2023-04-28 18:13:54 +03:00
Markus Lehtonen	131325fb2c	nfd-master: refactor api-controller object handling Split out resolving of node name (of the node to be updated) into a separate function. Makes it possible to add unit tests. Also, do unconditional type casting in the handler functions – that shouldn't fail unless there is a really serious internal inconsistency in the codebase so it should be ok to panic.	2023-04-28 17:33:33 +03:00
Kubernetes Prow Robot	d84248bc7d	Merge pull request #1190 from marquiz/devel/api-unit-tests apis/nfd: add unit tests for Feature type	2023-04-26 23:32:15 -07:00
Markus Lehtonen	77011a775f	nfd-master: log node name when processing NodeFeatureRules	2023-04-26 07:22:30 +03:00
Markus Lehtonen	dda7b195ee	apis/nfd: add unit tests for Feature type	2023-04-25 19:40:35 +03:00
Kubernetes Prow Robot	54bd4c5d74	Merge pull request #1167 from PiotrProkop/fix-reactive-updates nfd-topology-updater: fix wrong kubelet_internal_checkpoint path and compare basename to full path	2023-04-24 04:41:01 -07:00
pprokop	5a9a12151c	nfd-topology-updater: fix kubelet state file notifier (1) kubelet_internal_checkpoint file is in /var/lib/kubelet/device-plugins, not /var/lib/kubelet, and fsWatcher doesn't watch dirs recursively; (2) e.Name returned from fsWatcher events is a full path, not a basename. Signed-off-by: pprokop <pprokop@nvidia.com>	2023-04-24 13:21:56 +02:00
Kubernetes Prow Robot	2356223ffc	Merge pull request #1139 from AhmedGrati/feat-configure-master-resync feat: add master resync period configurability	2023-04-24 03:49:02 -07:00
AhmedGrati	7917434d38	feat: add master resync period configurability This PR adds a config option for setting the NFD API controller resync period. The resync period is only activated when the NodeFeature API has been enabled (with -enable-nodefeature-api). Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>	2023-04-24 11:52:38 +02:00
Kubernetes Prow Robot	64fe26ed37	Merge pull request #1169 from ArangoGutierrez/i1168 nfd-master: reject malformed extended resource dynamic capacity assignment	2023-04-24 00:17:15 -07:00
Carlos Eduardo Arango Gutierrez	f5df7b658c	nfd-master: reject malformed extended resource dynamic capacity assignment Reject malformed extended resource dynamic capacity assignment. Capacity should be in the form of domain.feature.element. Add logic to func filterExtendedResources to check this and otherwise ignore the ExtendedResource, logging an error. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>	2023-04-22 08:43:50 +02:00
Kubernetes Prow Robot	d5bccda7c5	Merge pull request #1171 from ArangoGutierrez/foundon_typo pkg/nfd-master/nfd-master.go: Fix typo	2023-04-21 12:21:11 -07:00
Kubernetes Prow Robot	c2c1e18908	Merge pull request #1173 from marquiz/devel/fix-master nfd-master: fix a crash when processing NodeFeatureRules	2023-04-21 09:49:11 -07:00
Markus Lehtonen	9523f1e411	nfd-master: fix a crash when processing NodeFeatureRules Fix a bug where nfd-master with NodeFeature API enabled would crash when NodeFeatureRule objects were processed in the case where no NodeFeature objects existed. This was caused by trying to insert values into a non-initialized NodeFeatureSpec in the code. This patch adds two safety measures to prevent that from happening in the future. First, add a constructor function for the NodeFeatureSpec type, and second, check for uninitialized object in the function inserting new functions. TODO: add unit tests for the API helper functions.	2023-04-21 19:24:08 +03:00
Carlos Eduardo Arango Gutierrez	ae22031547	pkg/nfd-master/nfd-master.go: Fix typo Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>	2023-04-21 16:17:11 +02:00
Markus Lehtonen	37306662fe	nfd-master: don't create empty annotations Make the nfd.node.kubernetes.io/feature-labels and nfd.node.kubernetes.io/extended-resources annotations behave similarly to the taints annotation: only create the annotations if some labels or extended resources are created.	2023-04-21 16:14:17 +03:00
Markus Lehtonen	f0f6bbcf36	nfd-master: configure before prune Otherwise prune will crash because of uninitialized configuration.	2023-04-20 20:38:11 +03:00
Markus Lehtonen	32db081f3a	nfd-master: support noPublish with -prune Better this way than to crash which is what currently happens with this combination.	2023-04-19 15:58:06 +03:00
Markus Lehtonen	18f7bfa8e8	generate: update mockery to v2.25.1 Bump the vektra/mockery tool to the latest release.	2023-04-19 13:33:42 +03:00
Markus Lehtonen	117baac1a6	generate: update protoc to v22.3	2023-04-19 10:44:55 +03:00
Markus Lehtonen	ca7ed04a34	generate: update auto-generated code Re-run "make generate".	2023-04-19 09:49:17 +03:00
Markus Lehtonen	e2d5ba1a2b	pkg/podres: update mocked PodResourcesListerClient Update mocked implementation of k8s.io/kubelet/pkg/apis/podresources/v1.PodResourcesListerClient. The mocked implementation is moved to a separate "mocks" subpackage as it's for an external interface. This patch also adds code for auto-generation for the mocked interface.	2023-04-18 20:51:51 +03:00
Kubernetes Prow Robot	8d71ed6755	Merge pull request #1086 from AhmedGrati/feat-support-builtin-kernel-mods feat: support builtin kernel mods	2023-04-13 10:30:40 -07:00
Markus Lehtonen	6b2d10753f	nfd-master: re-try on node update failures Change the NFD API handler to re-try on node update failures. Will work around transient failures, making sure that failed nodes (i.e. nodes that we failed to update) don't need to wait for the 1 hour resync period before being tried again.	2023-04-13 16:30:31 +03:00
AhmedGrati	109caa1f28	feat: support builtin kernel mods This PR adds the combination of dynamic and builtin kernel modules into one feature called `kernel.enabledmodule`. It's a superset of the `kernel.loadedmodule` feature. Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>	2023-04-13 10:19:24 +01:00
Markus Lehtonen	70ac19ea66	nfd-master: increase controller resync period to 1 hour Increase the NFD API controller resync period from 5 minutes to 1 hour. The resync causes nfd-master to replay all NodeFeature and NodeFeatureRule objects, being effectively a "big hammer reset all" button. This should only be needed as an "insurance" to fix labels et al in case they have been manually tampered with (outside NFD) and against certain bugs in nfd itself. NFD is not supposed to manage anything fast-changing so 1 hour should be enough. This change only affects behavior when the NodeFeature API has been enabled (with -enable-nodefeature-api).	2023-04-12 16:38:47 +03:00
Kubernetes Prow Robot	ad07829d0a	Merge pull request #1099 from ArangoGutierrez/extended_resources_v2 Create extended resources with NodeFeatureRule	2023-04-07 08:09:15 -07:00
Fabiano Fidêncio	250aea4741	Create extended resources with NodeFeatureRule Add support for management of Extended Resources via the NodeFeatureRule CRD API. There are usage scenarios where users want to advertise features as extended resources instead of labels (or annotations). This patch enables the discovery of extended resources, via annotation and patch of node.status.capacity and node.status.allocatable, by using the NodeFeatureRule API. Co-authored-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com> Co-authored-by: Markus Lehtonen <markus.lehtonen@intel.com> Co-authored-by: Fabiano Fidêncio <fabiano.fidencio@intel.com> Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com> Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>	2023-04-07 16:14:56 +02:00
Markus Lehtonen	f64c23968a	nfd-master: fix node update Update node status before node metadata. This fixes a problem where we lose track of NFD-managed extended resources in case patching node status fails. Previously we removed all labels and annotations (including the one listing our ERs) and only after that updated node status. If node status update failed we had lost the annotation but extended resources were still there, leaving them orphaned.	2023-04-06 22:04:35 +03:00
Markus Lehtonen	cc6c20ff5f	nfd-master: disallow unprefixed and kubernetes taints Disallow taints having a key with "kubernetes.io/" or "*.kubernetes.io/" prefix. This is a precaution to protect the user from messing up with the "official" well-known taints from Kubernetes itself. The only exception is that the "nfd.node.kubernetes.io/" prefix is allowed. However, there is one allowed NFD-specific namespace (and its sub-namespaces) i.e. "feature.node.kubernetes.io" under the kubernetes.io domain that can be used for NFD-managed taints. Also disallow unprefixed taint keys. We don't add a default prefix to unprefixed taints (like we do for labels) from NodeFeatureRules. This is to prevent unpleasant surprises to users that need to manage matching tolerations for their workloads.	2023-04-06 16:12:37 +03:00
Kubernetes Prow Robot	193c552b33	Merge pull request #1084 from AhmedGrati/feat-add-master-config-file feat: add master config file	2023-04-04 10:41:40 -07:00
AhmedGrati	3fff409f6d	Add master config file Similar to the nfd-worker, in this PR we want to support dynamic run-time configurability through a config file for the nfd-master. We'll use a json or yaml configuration file along with fsnotify in order to watch for changes in the config file. As a result, we're allowing dynamic control of logging params, allowed namespaces, extended resources, label whitelisting, and denied namespaces. Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>	2023-04-03 09:52:09 +01:00
AhmedGrati	d0a6289c0f	chore: add debug dump of nfd worker configuration Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>	2023-03-18 00:49:07 +01:00
Kubernetes Prow Robot	13f92faa77	Merge pull request #1031 from k8stopologyawareschedwg/reactive_updates topology-updater: reactive updates	2023-03-17 10:13:17 -07:00
Talor Itzhak	5c6be580f4	reactive updates: add an option to disable the feature Access to the kubelet state directory may raise concerns in some setups, added an option to disable it. The feature is enabled by default. Signed-off-by: Talor Itzhak <titzhak@redhat.com>	2023-03-16 11:53:16 +02:00
Kubernetes Prow Robot	a06e44ef0b	Merge pull request #1083 from fmuyassarov/mockery codegen: fix code-generation	2023-03-15 06:46:16 -07:00

499 commits