mirror of https://github.com/kubernetes-sigs/node-feature-discovery.git synced 2024-12-14 11:57:51 +00:00
Commit graph

282 commits

Author SHA1 Message Date
Markus Lehtonen
a8a29e6df2 metrics: add nfd_nodefeaturerule_processing_errors_total counter
Add a counter for errors encountered when processing NodeFeatureRules.
Another simple counter without any additional prometheus labels -
nfd-master logs can provide further details.
2023-08-07 09:37:29 +03:00
Markus Lehtonen
b90f2c318e metrics: add nfd_node_update_failures_total counter
Add a new counter for tracking node update failures from nfd-master.
This tracks both normal feature updates and the --prune sub-command.
This is a simple counter without any additional labels - nfd-master logs
can be used for further diagnostics.
2023-08-07 09:37:27 +03:00
Kubernetes Prow Robot
9ed191808d
Merge pull request #1296 from marquiz/docs/metrics
docs: document -metrics flag in command line reference
2023-08-05 03:06:30 -07:00
Markus Lehtonen
4b7ee47e5f docs: document -metrics flag in command line reference
Document the -metrics command line flag in the command line reference of
nfd-master and nfd-worker.
2023-08-04 16:49:03 +03:00
Markus Lehtonen
4aa7a8f8f8 source/local: support comments in input
Lines starting with '#' are treated as comments and ignored when parsing
feature files and hook output.
2023-08-04 16:46:22 +03:00
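The comment handling described in this commit can be illustrated with a minimal sketch. It is not the actual NFD parser: the helper name `parseFeatures`, the `name=value` line format, the `"true"` default for bare names, and the whitespace trimming are all assumptions made for illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseFeatures is a hypothetical helper (not NFD's real parser).
// It reads feature lines and skips blank lines and lines starting with '#'.
// Assumed format: "name=value", or a bare "name" that defaults to "true".
func parseFeatures(input string) map[string]string {
	features := make(map[string]string)
	scanner := bufio.NewScanner(strings.NewReader(input))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		// Comments and empty lines carry no feature information.
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		name, value, found := strings.Cut(line, "=")
		if !found {
			value = "true"
		}
		features[name] = value
	}
	return features
}

func main() {
	input := "# this line is ignored\nfoo=bar\n\nbaz\n"
	fmt.Println(parseFeatures(input))
}
```

Trimming whitespace before the `#` check is a design choice of this sketch; the commit message only states that lines starting with `#` are ignored.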
Markus Lehtonen
0a8b514d67 docs: unify formatting of NOTEs 2023-08-03 15:36:56 +03:00
Markus Lehtonen
a1406767a9 docs: align metrics documentation with latest changes on naming
Also change table formatting and fix one incorrect description.
2023-08-01 15:53:06 +03:00
Kubernetes Prow Robot
65b7216313
Merge pull request #1283 from marquiz/docs/deprecation-policy
docs: deprecation policy for Helm chart params
2023-07-25 10:46:06 -07:00
Kubernetes Prow Robot
463a737b82
Merge pull request #1277 from marquiz/docs/k8s-compat
docs: describe supported Kubernetes versions
2023-07-25 08:54:06 -07:00
Markus Lehtonen
b1328b3166 docs: describe supported Kubernetes versions 2023-07-25 17:40:06 +03:00
Markus Lehtonen
b72b537261 docs: deprecation policy for Helm chart params 2023-07-24 14:06:30 +03:00
Pat Riehecky
0523257d1a Add optional labels to the podmonitor
Signed-off-by: Pat Riehecky <riehecky@fnal.gov>
2023-07-21 10:03:50 -05:00
Kubernetes Prow Robot
c9f3550237
Merge pull request #1280 from marquiz/docs/tocs
docs: remove useless TOCs
2023-07-21 06:50:15 -07:00
Kubernetes Prow Robot
ebbea564a8
Merge pull request #1278 from marquiz/docs/fixes
docs: fix toc of topology-updater and topology-gc reference
2023-07-21 06:50:08 -07:00
Markus Lehtonen
312ef308d1 docs: remove useless TOCs
Drop table of contents from short pages where it is only cluttering the
page.
2023-07-21 16:35:12 +03:00
Markus Lehtonen
f825812229 docs: document version and deprecation policy 2023-07-21 16:28:38 +03:00
Markus Lehtonen
d4d6963473 docs: fix toc of topology-updater and topology-gc reference
Exclude the main title from the TOC (with the empty line, the "no_toc"
directive had no effect).
2023-07-21 15:41:59 +03:00
Carlos Eduardo Arango Gutierrez
e3aedd33e2
Enable metrics via prometheus operator
Expose metrics via prometheus.monitoring.coreos.com/v1

The exposed metrics are

| Metric | Type | Meaning |
| --- | --- | --- |
| `nfd_master_build_info` | Gauge | Version from which nfd-master was built. |
| `nfd_worker_build_info` | Gauge | Version from which nfd-worker was built. |
| `nfd_updated_nodes` | Counter | Time taken to label a node |
| `nfd_crd_processing_time` | Gauge | Time taken to process a NodeFeatureRule CRD |
| `nfd_feature_discovery_duration_seconds` | HistogramVec | Time taken to discover features on a node |

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Co-authored-by: Markus Lehtonen <markus.lehtonen@intel.com>
2023-07-21 10:59:52 +02:00
Kubernetes Prow Robot
407a610e0c
Merge pull request #1182 from fmuyassarov/disable-hooks-by-default
hooks: disable hooks by default from v0.14
2023-06-22 04:43:40 -07:00
Carlos Eduardo Arango Gutierrez
563cc862de
Docs: Fix typo on customization-guide
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
2023-06-09 10:23:33 +02:00
Muyassarov, Feruzjon
19527be924
hooks: disable hooks by default
Hooks were deprecated in v0.12.0 but remained enabled by default.
Starting from v0.14 they are disabled by default, and we plan to
remove them fully in the near future.

Signed-off-by: Feruzjon Muyassarov <feruzjon.muyassarov@intel.com>
2023-06-07 13:04:23 +03:00
Simon Jürgensmeyer
307a865465
Fix missing apostrophe for jq 2023-06-07 09:53:02 +02:00
Hairong Chen
e8a00ba7da cpu: Discover TDX guests based on cpuid information
NFD already has the capability to discover whether baremetal / host
machines support Intel TDX.  Now, the next step is to add support for
discovering whether a node is TDX protected (as in, a virtual machine
started using Intel TDX).

In order to do so, we've decided to go for a new `cpu-security.tdx`
property, called `protected` (`cpu-security.tdx.protected`).

Signed-off-by: Hairong Chen <hairong.chen@intel.com>
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
2023-06-05 11:06:28 +02:00
Kubernetes Prow Robot
306969a945
Merge pull request #1133 from AhmedGrati/feat-parallelize-nodes-update
feat: parallelize nodes update
2023-06-02 05:28:57 -07:00
AhmedGrati
b3cfe17392 feat: parallelize nodes update
This PR aims to optimize the process of updating nodes with
corresponding features. Previously, we were updating nodes
sequentially even though they are independent of each other.
Therefore, we integrated new components: LabelersNodePool, which is
responsible for spinning up a goroutine whenever there's a request for
updating nodes, and a Workqueue, which is responsible for holding the
names of nodes that should be updated.

Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>
2023-06-02 11:41:50 +01:00
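The idea of updating independent nodes concurrently can be sketched with the standard library alone. This is an illustration under stated assumptions, not the PR's implementation: a plain channel stands in for the Workqueue, a fixed set of worker goroutines stands in for LabelersNodePool, and the names `updateNodes` and `update` are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// updateNodes is a hypothetical sketch: it feeds node names through a
// channel (standing in for a work queue) to a fixed number of worker
// goroutines and collects any errors returned by the update callback.
func updateNodes(nodes []string, workers int, update func(node string) error) []error {
	if workers < 1 {
		workers = 1 // always keep at least one worker to drain the queue
	}
	queue := make(chan string)
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		errs []error
	)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for node := range queue {
				if err := update(node); err != nil {
					mu.Lock()
					errs = append(errs, fmt.Errorf("node %s: %w", node, err))
					mu.Unlock()
				}
			}
		}()
	}
	for _, node := range nodes {
		queue <- node
	}
	close(queue) // lets the workers exit once the queue is drained
	wg.Wait()
	return errs
}

func main() {
	nodes := []string{"node-a", "node-b", "node-c"}
	errs := updateNodes(nodes, 2, func(node string) error {
		if node == "node-b" {
			return fmt.Errorf("simulated update failure")
		}
		return nil
	})
	fmt.Println("failed updates:", len(errs))
}
```

Because each node is handled by exactly one worker, one slow or failing update does not block the others.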
AhmedGrati
08b9c3486e feat: support dynamic values for labels in the NodeFeatureRule
This PR adds support for dynamic label values in the NodeFeatureRule
CRD, offering more flexible labeling for users. To achieve this, we
check whether the label value starts with "@"; if it does, we look up
the referenced feature and use its value as the value of the label.

Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>
2023-05-31 23:30:26 +01:00
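The "@" lookup described in this commit can be sketched as follows. The function name `resolveLabelValue`, the flat `map[string]string` of features, and the feature name used in `main` are assumptions for illustration; the real NFD data structures differ.

```go
package main

import (
	"fmt"
	"strings"
)

// resolveLabelValue is a hypothetical helper. A value starting with "@"
// is treated as a reference to a discovered feature and is replaced by
// that feature's value; any other value is returned unchanged.
func resolveLabelValue(value string, features map[string]string) (string, error) {
	if !strings.HasPrefix(value, "@") {
		return value, nil
	}
	ref := strings.TrimPrefix(value, "@")
	resolved, ok := features[ref]
	if !ok {
		return "", fmt.Errorf("feature %q referenced by label value %q not found", ref, value)
	}
	return resolved, nil
}

func main() {
	// The feature name below is illustrative only.
	features := map[string]string{"kernel.version.major": "5"}
	for _, v := range []string{"static-value", "@kernel.version.major", "@missing"} {
		resolved, err := resolveLabelValue(v, features)
		fmt.Printf("%q -> %q (err: %v)\n", v, resolved, err)
	}
}
```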
Kubernetes Prow Robot
d28a02c5cd
Merge pull request #1222 from vaibhav2107/kustomize-type
Fixed typo in Header under deployment/kustomize.md
2023-05-22 00:42:21 -07:00
Kubernetes Prow Robot
70d5ef477f
Merge pull request #1219 from PiotrProkop/leader-elect
Add leader election for nfd-master
2023-05-22 00:36:21 -07:00
vaibhav2107
9f7854479f Fixed typo in Header under deployment/kustomize.md 2023-05-18 14:59:54 +05:30
PiotrProkop
272fd4784f Add new flag enable-leader-election for nfd-master.
It allows NFD-master to be run in active-passive way when running
multiple instances of NFD-master to prevent multiple components
from updating same custom resources.

Signed-off-by: PiotrProkop <pprokop@nvidia.com>
2023-05-15 13:30:07 +02:00
Markus Lehtonen
1200fd05c5 topology-updater: use node IP in the default configz URI
Use a separate NODE_ADDRESS environment variable in the default value of
-kubelet-config-uri (instead of NODE_NAME that was previously used).
Also change the kustomize and Helm deployments to set this variable to
node IP address. This should make the default deployment more robust,
making it work in scenarios where the node name does not resolve to the
node IP, e.g. nodename != hostname.
2023-05-05 13:29:51 +03:00
Kubernetes Prow Robot
cd45baef8d
Merge pull request #1211 from marquiz/devel/helm
deployment/helm: improve handling of topologyUpdater.kubeletStateFiles
2023-05-05 00:17:13 -07:00
Markus Lehtonen
526aab87cf deployment/helm: use dedicated serviceaccount for topology-updater
Change the configuration so that, by default, we use a dedicated
serviceaccount for topology-updater (similar to topology-gc, nfd-master
and nfd-worker).

Fix the templates so that the serviceaccount and clusterrolebinding are
only created when topology-updater is enabled (clusterrole was already
handled this way).

This patch also correctly documents the default value of rbac.create
parameter of topology-updater and topology-gc.
2023-05-05 08:30:21 +03:00
Markus Lehtonen
9c2f268fd2 deployment/helm: improve handling of topologyUpdater.kubeletStateFiles
Make it possible to disable kubelet state tracking with
--set topologyUpdater.kubeletStateFiles="" as the documentation
suggests.

Also, fix the documentation regarding the default value of
topologyUpdater.kubeletStateFiles parameter.
2023-05-04 15:01:19 +03:00
Markus Lehtonen
9685d292a2 docs: add missing .md suffix to internal references
Commit bfbc47f55e added a lot of those and
this patch tries to cover all that we missed there. Having .md suffixes
in references to internal files makes it convenient to browse the
document locally, just as text files as the references work correctly.
2023-04-25 15:28:07 +03:00
Kubernetes Prow Robot
2356223ffc
Merge pull request #1139 from AhmedGrati/feat-configure-master-resync
feat: add master resync period configurability
2023-04-24 03:49:02 -07:00
AhmedGrati
7917434d38 feat: add master resync period configurability
This PR adds a config option for setting the NFD API controller resync period.
The resync period is only activated when the NodeFeature API has been
enabled (with -enable-nodefeature-api).

Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>
2023-04-24 11:52:38 +02:00
Carlos Eduardo Arango Gutierrez
05ef5d4e9d
cpu: expose the total number of AMD SEV ASID and ES
This patch adds SEV ASIDs and the related (but distinct) SEV Encrypted State
(SEV-ES) IDs as two quantities to be exposed via extended resources.
In a kernel built with CONFIG_CGROUP_MISC on a suitably equipped AMD CPU, the
root control group will have a misc.capacity file that shows the number of
available IDs in each category.

The added extended resources are:
- sev.asids
- sev.encrypted_state_ids

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
2023-04-17 19:34:39 +02:00
Mikko Ylinen
de1b69a8bf cpu: make SGX EPC resource available to NodeFeatureRules
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
2023-04-14 15:31:54 +03:00
Markus Lehtonen
3320c74472 source/cpu: don't create cpu-security.tdx.total_keys label
Just have that as a feature for NodeFeatureRules to consume.
2023-04-14 13:33:13 +03:00
Kubernetes Prow Robot
84c348b69f
Merge pull request #1126 from marquiz/devel/er-deprecation
nfd-master: deprecate the -resource-labels flag
2023-04-13 10:52:39 -07:00
Kubernetes Prow Robot
8d71ed6755
Merge pull request #1086 from AhmedGrati/feat-support-builtin-kernel-mods
feat: support builtin kernel mods
2023-04-13 10:30:40 -07:00
AhmedGrati
109caa1f28 feat: support builtin kernel mods
This PR adds the combination of dynamic and builtin kernel modules into
one feature called `kernel.enabledmodule`. It's a superset of the
`kernel.loadedmodule` feature.

Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>
2023-04-13 10:19:24 +01:00
Markus Lehtonen
8511980bf4 nfd-master: deprecate the -resource-labels flag
Mark the -resource-labels flag (and the corresponding resourceLabels
config option) as deprecated. We now support managing extended resources
via NodeFeatureRule objects. This kludge deserves to go, eventually.
2023-04-13 11:30:58 +03:00
Markus Lehtonen
dcbb3bc450 docs: add missing mentions of extended resources and taints
A small update to fix some missing mentions of extended resources and
taints as assets managed by NFD.
2023-04-11 20:38:21 +03:00
Kubernetes Prow Robot
ad07829d0a
Merge pull request #1099 from ArangoGutierrez/extended_resources_v2
Create extended resources with NodeFeatureRule
2023-04-07 08:09:15 -07:00
Fabiano Fidêncio
250aea4741
Create extended resources with NodeFeatureRule
Add support for management of Extended Resources via the
NodeFeatureRule CRD API.

There are usage scenarios where users want to advertise features
as extended resources instead of labels (or annotations).

This patch enables the discovery of extended resources via the
NodeFeatureRule API, by annotating the node and patching
node.status.capacity and node.status.allocatable.

Co-authored-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Co-authored-by: Markus Lehtonen <markus.lehtonen@intel.com>
Co-authored-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
2023-04-07 16:14:56 +02:00
Kubernetes Prow Robot
6740224a13
Merge pull request #1100 from PiotrProkop/expose-L3-num-closid
Advertise RDT L3 num_closid
2023-04-07 00:49:14 -07:00
Markus Lehtonen
cc6c20ff5f nfd-master: disallow unprefixed and kubernetes taints
Disallow taints having a key with "kubernetes.io/" or "*.kubernetes.io/"
prefix. This is a precaution to protect the user from interfering with
the "official" well-known taints from Kubernetes itself. The only
exception is that the "nfd.node.kubernetes.io/" prefix is allowed.

However, there is one allowed NFD-specific namespace (and its
sub-namespaces) i.e. "feature.node.kubernetes.io" under the
kubernetes.io domain that can be used for NFD-managed taints.

Also disallow unprefixed taint keys. We don't add a default prefix to
unprefixed taints (like we do for labels) from NodeFeatureRules. This is
to prevent unpleasant surprises to users that need to manage matching
tolerations for their workloads.
2023-04-06 16:12:37 +03:00
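The taint key rules listed in this commit can be expressed as a small validation function. This is a sketch written from the commit message alone, not the nfd-master code: the name `validateTaintKey` and the error texts are hypothetical, and the keys in `main` are made-up examples.

```go
package main

import (
	"fmt"
	"strings"
)

// validateTaintKey is a hypothetical helper that applies the rules from
// the commit message: keys must be prefixed, and prefixes under
// kubernetes.io are rejected except for the NFD-specific namespaces.
func validateTaintKey(key string) error {
	ns, name, found := strings.Cut(key, "/")
	if !found || ns == "" || name == "" {
		return fmt.Errorf("taint key %q must have a prefix", key)
	}
	// Prefixes outside the kubernetes.io domain are not restricted here.
	if ns != "kubernetes.io" && !strings.HasSuffix(ns, ".kubernetes.io") {
		return nil
	}
	// Allowed exceptions under kubernetes.io.
	if ns == "nfd.node.kubernetes.io" ||
		ns == "feature.node.kubernetes.io" ||
		strings.HasSuffix(ns, ".feature.node.kubernetes.io") {
		return nil
	}
	return fmt.Errorf("taint key %q uses the reserved prefix %q", key, ns)
}

func main() {
	keys := []string{
		"gpu",
		"example.com/gpu",
		"node.kubernetes.io/unreachable",
		"feature.node.kubernetes.io/gpu",
	}
	for _, k := range keys {
		fmt.Printf("%-35s err: %v\n", k, validateTaintKey(k))
	}
}
```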
PiotrProkop
0e78eba40e Advertise RDT L3 num_closid
Signed-off-by: PiotrProkop <pprokop@nvidia.com>
2023-04-06 11:22:55 +02:00