node-feature-discovery

mirror of https://github.com/kubernetes-sigs/node-feature-discovery.git synced 2025-03-05 16:27:05 +00:00

Author	SHA1	Message	Date
Markus Lehtonen	4b24cc1afa	metrics: counters for rejected labels, extended resources and taints Add counters for labels, extended resources and taints rejected/filtered out by nfd-master.	2023-08-07 09:37:29 +03:00
Markus Lehtonen	a8a29e6df2	metrics: add nfd_nodefeaturerule_processing_errors_total counter Add a counter for errors encountered when processing NodeFeatureRules. Another simple counter without any additional prometheus labels - nfd-master logs can provide further details.	2023-08-07 09:37:29 +03:00
Markus Lehtonen	b90f2c318e	metrics: add nfd_node_update_failures_total counter Add a new counter for tracking node update failures from nfd-master. This tracks both normal feature updates and the --prune sub-command. This is a simple counter without any additional labels - nfd-master logs can be used for further diagnostics.	2023-08-07 09:37:27 +03:00
Markus Lehtonen	06b333db1e	nfd-topology-updater: add metrics support For now, add only one metric, a counter for the errors occurring while scanning pod resources on the node.	2023-08-04 16:48:37 +03:00
Markus Lehtonen	039378c725	nfd-master: use term node update instead of labeling Rename symbols and reword log messages to correlate with the functionality (we may do other updates than just modify labels nowadays).	2023-08-01 16:42:34 +03:00
Markus Lehtonen	d8f167d8a9	nfd-master: remove one stale empty line	2023-08-01 16:38:32 +03:00
Kubernetes Prow Robot	c1cb63243b	Merge pull request #1288 from marquiz/devel/metrics Improve metrics	2023-07-31 10:38:39 -07:00
Markus Lehtonen	5091fef84b	metrics: improve feature discovery duration metric Rename the "NodeName" prometheus label to "node", aligning with common prometheus/kubernetes conventions. Also reconfigure the prometheus histogram buckets (now 10ms to 1s) to better match the expected sample range.	2023-07-31 19:45:22 +03:00
Markus Lehtonen	47f621d970	metrics: improve the node updates gauge Rename the metric, better describe what we're measuring and better comply with prometheus naming conventions. Also change it to represent actual updates of the node object on the Kubernetes apiserver.	2023-07-31 19:45:22 +03:00
Markus Lehtonen	945e7fcb3f	metrics: improve nfr processing time metric Change the metric from a simple gauge (that basically was a single value for the whole cluster) into a HistogramVec, aligning with the feature discovery duration metric in nfd-worker. This improved metric now has prometheus labels for the NFR name and node name, i.e. it is tracking per-NFR metric for each node being processed. Also, change the naming to better comply with prometheus suggested conventions.	2023-07-31 19:45:22 +03:00
Kubernetes Prow Robot	01ca8cb91d	Merge pull request #1284 from marquiz/devel/generator-deps generate: bump tools to their latest versions	2023-07-31 06:32:39 -07:00
Kubernetes Prow Robot	e0f10a81de	Merge pull request #1256 from PiotrProkop/fix-topo-updater-policy-and-scope-advertisment Fix Topology Manager policy and scope not being updated after NRT creation	2023-07-28 00:25:54 -07:00
Markus Lehtonen	7e375ad1f0	generate: bump tools to their latest versions Bump tools versions and re-auto-generate files.	2023-07-27 14:29:48 +03:00
Kubernetes Prow Robot	77d869c4f7	Merge pull request #1242 from ArangoGutierrez/metrics Enable metrics via prometheus operator	2023-07-21 02:26:08 -07:00
Carlos Eduardo Arango Gutierrez	e3aedd33e2	Enable metrics via prometheus operator Expose metrics via prometheus.monitoring.coreos.com/v1. The exposed metrics are (Metric, Type, Meaning): `nfd_master_build_info`, Gauge, Version from which nfd-master was built; `nfd_worker_build_info`, Gauge, Version from which nfd-worker was built; `nfd_updated_nodes`, Counter, Time taken to label a node; `nfd_crd_processing_time`, Gauge, Time taken to process a NodeFeatureRule CRD; `nfd_feature_discovery_duration_seconds`, HistogramVec, Time taken to discover features on a node. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com> Co-authored-by: Markus Lehtonen <markus.lehtonen@intel.com>	2023-07-21 10:59:52 +02:00
pprokop	6d98b6150b	Fix Topology Manager policy and scope not being updated properly NFD is only detecting policy and scope of Topology Manager when NRT object doesn't exist. This means that topologyManagerScope and topologyManagerPolicy attributes won't be updated even if kubelet config was changed to use other TopologyManager policy and scope. Signed-off-by: pprokop <pprokop@nvidia.com>	2023-07-20 16:31:12 +02:00
AhmedGrati	8e55d78d85	test: add node updater pool unit tests Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>	2023-07-19 12:03:35 +01:00
Markus Lehtonen	dac45be28c	nfd-master: check for nil references in nfdAPIUpdateAllNodes Just a safeguard.	2023-07-17 17:49:44 +03:00
hang.jiang	698031fc2d	Stop ticker in time to avoid memory leak The ticker leaks memory if it is not stopped when the function has completed. Signed-off-by: hang.jiang <hang.jiang@daocloud.io>	2023-07-05 18:35:01 +08:00
guoguangwu	b946bcc0f5	nfd-master-internal_test.go rm pkg imported twice Signed-off-by: guoguangwu <guoguangwu@magic-shield.com>	2023-06-21 16:53:55 +08:00
Kubernetes Prow Robot	306969a945	Merge pull request #1133 from AhmedGrati/feat-parallelize-nodes-update feat: parallelize nodes update	2023-06-02 05:28:57 -07:00
AhmedGrati	b3cfe17392	feat: parallelize nodes update This PR aims to optimize the process of updating nodes with corresponding features. Previously, we were updating nodes sequentially even though they are independent of each other. Therefore, we integrated new components: LabelersNodePool, which is responsible for spinning up a goroutine whenever there's a request for updating nodes, and a Workqueue, which is responsible for holding the names of nodes that should be updated. Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>	2023-06-02 11:41:50 +01:00
AhmedGrati	08b9c3486e	feat: support dynamic values for labels in the NodeFeatureRule This PR adds support for dynamic label values in the NodeFeatureRule CRD, offering more flexible labeling for users. To achieve this, we check whether the label value starts with "@" and, if so, look up the referenced feature value and use it as the value of the label. Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>	2023-05-31 23:30:26 +01:00
Markus Lehtonen	bf670de68d	pkg/utils: migrate KlogDump to structured logging Drop the KlogDump helper in favor of klog.InfoS. However, this patch introduces a new DelayedDumper() helper to avoid processing (marshalling) of an object unless really evaluated by the logging function.	2023-05-31 14:43:08 +03:00
Markus Lehtonen	4947ebf336	pkg/util: migrate to structured logging The gRPC logging interface is not compatible with structured logging so grpcLogger is left intact.	2023-05-31 14:43:08 +03:00
Markus Lehtonen	64d5af016e	apis/nfd: migrate to structured logging	2023-05-31 14:43:08 +03:00
Markus Lehtonen	6e3b181ab4	topology-updater: migrate to structured logging	2023-05-31 14:43:08 +03:00
Markus Lehtonen	7be08f9e7f	nfd-worker: migrate to structured logging	2023-05-31 14:43:08 +03:00
Markus Lehtonen	8113d651c2	nfd-master: migrate to structured logging	2023-05-31 14:43:05 +03:00
Markus Lehtonen	2a3c7e4c93	nfd-master: add validation of label names and values Validate labels before trying to update the node. Makes us fail early and prevents useless retries in case invalid labels are tried.	2023-05-29 16:54:14 +03:00
Markus Lehtonen	1809c24314	nfd-master: use close for stop channel Simpler and more reliable (in case of multiple consumers) to just close the channel.	2023-05-24 16:51:48 +03:00
PiotrProkop	272fd4784f	Add new flag enable-leader-election for nfd-master. It allows NFD-master to be run in an active-passive way when running multiple instances of NFD-master, to prevent multiple components from updating the same custom resources. Signed-off-by: PiotrProkop <pprokop@nvidia.com>	2023-05-15 13:30:07 +02:00
Kubernetes Prow Robot	85073525c3	Merge pull request #1185 from AhmedGrati/fix-resync-period-functionality nfd-master: fix resync period config option	2023-05-02 11:14:16 -07:00
AhmedGrati	87c2d7e184	nfd-master: fix resync period config option This PR fixes the resync-period configuration option of the nfd-master. In fact, previously, changes were not reflected in the nfd-master at runtime. e2e tests are also implemented to make sure that the fix is already working as expected. Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>	2023-05-02 13:17:01 +02:00
Markus Lehtonen	fb20388028	nfd-master: refactor filtering of taints	2023-04-28 18:13:54 +03:00
Markus Lehtonen	43ced0c1a1	nfd-master: refactor filtering of feature labels More consistent error messages. Also preparation for dynamic labels values (that '@' notation currently supported for extended resources).	2023-04-28 18:13:54 +03:00
Markus Lehtonen	6ca687fbef	nfd-master: refactor filtering of extended resources Simplify code a bit and get more consistent error messages (in addition to fixing some of those).	2023-04-28 18:13:54 +03:00
Markus Lehtonen	131325fb2c	nfd-master: refactor api-controller object handling Split out resolving of node name (of the node to be updated) into a separate function. Makes it possible to add unit tests. Also, do unconditional type casting in the handler functions – that shouldn't fail unless there is a really serious internal inconsistency in the codebase so it should be ok to panic.	2023-04-28 17:33:33 +03:00
Kubernetes Prow Robot	d84248bc7d	Merge pull request #1190 from marquiz/devel/api-unit-tests apis/nfd: add unit tests for Feature type	2023-04-26 23:32:15 -07:00
Markus Lehtonen	77011a775f	nfd-master: log node name when processing NodeFeatureRules	2023-04-26 07:22:30 +03:00
Markus Lehtonen	dda7b195ee	apis/nfd: add unit tests for Feature type	2023-04-25 19:40:35 +03:00
Kubernetes Prow Robot	54bd4c5d74	Merge pull request #1167 from PiotrProkop/fix-reactive-updates nfd-topology-updater: fix wrong kubelet_internal_checkpoint path and compare basename to full path	2023-04-24 04:41:01 -07:00
pprokop	5a9a12151c	nfd-topology-updater: fix kubelet state file notifier - kubelet_internal_checkpoint file is in /var/lib/kubelet/device-plugins, not /var/lib/kubelet, and fsWatcher doesn't watch dirs recursively - e.Name returned from fsWatcher events is a full path, not a basename Signed-off-by: pprokop <pprokop@nvidia.com>	2023-04-24 13:21:56 +02:00
Kubernetes Prow Robot	2356223ffc	Merge pull request #1139 from AhmedGrati/feat-configure-master-resync feat: add master resync period configurability	2023-04-24 03:49:02 -07:00
AhmedGrati	7917434d38	feat: add master resync period configurability This PR adds a config option for setting the NFD API controller resync period. The resync period is only activated when the NodeFeature API has been enabled (with -enable-nodefeature-api). Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>	2023-04-24 11:52:38 +02:00
Kubernetes Prow Robot	64fe26ed37	Merge pull request #1169 from ArangoGutierrez/i1168 nfd-master: reject malformed extended resource dynamic capacity assignment	2023-04-24 00:17:15 -07:00
Carlos Eduardo Arango Gutierrez	f5df7b658c	nfd-master: reject malformed extended resource dynamic capacity assignment The capacity should be in the form of domain.feature.element. Add logic to func filterExtendedResources to check this and otherwise ignore the ExtendedResource, logging an error. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>	2023-04-22 08:43:50 +02:00
Kubernetes Prow Robot	d5bccda7c5	Merge pull request #1171 from ArangoGutierrez/foundon_typo pkg/nfd-master/nfd-master.go: Fix typo	2023-04-21 12:21:11 -07:00
Kubernetes Prow Robot	c2c1e18908	Merge pull request #1173 from marquiz/devel/fix-master nfd-master: fix a crash when processing NodeFeatureRules	2023-04-21 09:49:11 -07:00
Markus Lehtonen	9523f1e411	nfd-master: fix a crash when processing NodeFeatureRules Fix a bug where nfd-master with NodeFeature API enabled would crash when NodeFeatureRule objects were processed in the case where no NodeFeature objects existed. This was caused by trying to insert values into a non-initialized NodeFeatureSpec in the code. This patch adds two safety measures to prevent that from happening in the future. First, add a constructor function for the NodeFeatureSpec type, and second, check for uninitialized object in the function inserting new functions. TODO: add unit tests for the API helper functions.	2023-04-21 19:24:08 +03:00
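
Commits 08b9c3486e and f5df7b658c above describe the "@" notation for dynamic values: a value starting with "@" refers to a feature in the form domain.feature.element, and malformed references are rejected. The Go sketch below illustrates that idea only. It is not the nfd-master code: the helper name `resolveLabelValue`, the flat `map[string]string` of feature values, and the example feature names are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// resolveLabelValue is a hypothetical helper (not part of nfd-master).
// A value without a leading "@" is returned unchanged. A value of the form
// "@domain.feature.element" is looked up in a flat map of feature values;
// the element part may itself contain dots.
func resolveLabelValue(value string, features map[string]string) (string, error) {
	if !strings.HasPrefix(value, "@") {
		return value, nil // static value, used as-is
	}
	ref := strings.TrimPrefix(value, "@")

	// Split into at most three parts: domain, feature, element.
	parts := strings.SplitN(ref, ".", 3)
	if len(parts) != 3 {
		return "", fmt.Errorf("malformed reference %q: expected @domain.feature.element", value)
	}
	for _, p := range parts {
		if p == "" {
			return "", fmt.Errorf("malformed reference %q: empty component", value)
		}
	}

	v, ok := features[ref]
	if !ok {
		return "", fmt.Errorf("feature %q not found", ref)
	}
	return v, nil
}

func main() {
	// Made-up feature data, for illustration only.
	features := map[string]string{"example.feature.element": "42"}

	for _, in := range []string{"static", "@example.feature.element", "@example.feature"} {
		v, err := resolveLabelValue(in, features)
		if err != nil {
			fmt.Printf("%s: rejected: %v\n", in, err)
			continue
		}
		fmt.Printf("%s: %s\n", in, v)
	}
}
```

Returning an error rather than silently dropping the value leaves the caller to decide whether to log and skip, which is the behavior f5df7b658c describes for extended resources.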


471 commits