# Draining Kubernetes nodes

**If Kubernetes nodes with ArangoDB pods on them are drained without care,
data loss can occur!**

The recommended procedure is described below.

For maintenance work in k8s it is sometimes necessary to drain a k8s node,
which means removing all pods from it. Kubernetes offers a standard API
for this, and our operator supports it to the best of its ability.

Draining nodes is easy enough for stateless services, which can simply be
re-launched on any other node. For a stateful service, however, this
operation is more difficult, and as a consequence more costly, and there
are certain risks involved if the operation is not done carefully
enough. To put it simply, the operator must first move all the data
stored on the node (which could be on a locally attached disk) to
another machine before it can shut down the pod gracefully. Moving data
takes time, and even after the move, the distributed system ArangoDB has
to recover from this change, for example by ensuring data synchronicity
between the replicas in their new location.

Therefore, a systematic drain of all k8s nodes in sequence has to follow
a careful procedure, in particular to ensure that ArangoDB is ready to
move on to the next step. This is necessary to avoid catastrophic data
loss, and is simply the price one pays for running a stateful service.

## Anatomy of a drain procedure in k8s: the grace period

When a `kubectl drain` operation is triggered for a node, k8s first
checks if there are any pods with local data on disk. Our ArangoDB pods have
this property (the _Coordinators_ do use `EmptyDir` volumes, and _Agents_
and _DB-Servers_ could have persistent volumes which are actually stored on
a locally attached disk), so one has to override this with the
`--delete-local-data=true` option.

Furthermore, quite often, the node will contain pods which are managed
by a `DaemonSet` (which is not the case for ArangoDB), which makes it
necessary to override this check with the `--ignore-daemonsets=true`
option.

Finally, it is checked if the node has any pods which are not managed by
anything, either by k8s itself (`ReplicationController`, `ReplicaSet`,
`Job`, `DaemonSet` or `StatefulSet`) or by an operator. If this is the
case, the drain operation will be refused, unless one uses the option
`--force=true`. Since the ArangoDB operator manages our pods, we do not
have to use this option for ArangoDB, but you might have to use it for
other pods.

If all these checks have been overcome, k8s proceeds as follows: All
pods are notified about this event and are put into a `Terminating`
state. During this time, they have a chance to take action, or indeed
the operator managing them has. In particular, although the pods get
termination notices, they can keep running until the operator has
removed all _finalizers_. This gives the operator a chance to sort out
things, for example in our case to move data away from the pod.

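To see whether the operator is still holding a pod in `Terminating` state, you can inspect the finalizers it has set on the pod. This is a minimal sketch; the pod name is the example used later in this document and the namespace is assumed to be `default`:

```bash
# Show the finalizers the operator has set on an ArangoDB pod
# (example pod name from this document, default namespace assumed)
kubectl get pod my-arangodb-cluster-prmr-wbsq47rz-5676ed -n default \
  -o jsonpath='{.metadata.finalizers}{"\n"}'
```
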
However, there is a limit to this tolerance by k8s, and that is the
grace period. If the grace period has passed but the pod has not
actually terminated, then it is killed the hard way. If this happens,
the operator has no choice but to remove the pod and drop its persistent
volume claim and persistent volume. This will obviously lead to a
failure incident in ArangoDB and must be handled by fail-over management.
Therefore, **this event should be avoided**.

## Things to check in ArangoDB before a node drain

There are basically two things one should check in an ArangoDB cluster
before a node drain operation can be started:

1. All cluster nodes are up and running and healthy.
2. For all collections and shards all configured replicas are in sync.

#### Attention:

1) If any cluster node is unhealthy, there is an increased risk that the
   system does not have enough resources to cope with a failure situation.
2) If any shard replicas are not currently in sync, then there is a serious
   risk that the cluster is currently not as resilient as expected.

One possibility to verify these two things is via the ArangoDB web interface.
Node health can be monitored in the _Overview_ tab under _NODES_:

![Cluster Health Screen](images/HealthyCluster.png)

**Check that all nodes are green** and that there is **no node error** in the
top right corner.

As to the shards being in sync, see the _Shards_ tab under _NODES_:

![Shard Screen](images/ShardsInSync.png)

**Check that all collections have a green check mark** on the right side.
If any collection does not have such a check mark, you can click on the
collection and see the details about its shards. Please keep in
mind that this has to be done **for each database** separately!

Obviously, this might be tedious and calls for automation. Therefore, there
are APIs for this. The first one is [Cluster Health](https://docs.arangodb.com/stable/develop/http/cluster/#get-the-cluster-health):

```
GET /_admin/cluster/health
```

… which returns a JSON document looking like this:

```json
{
  "Health": {
    "CRDN-rxtu5pku": {
      "Endpoint": "ssl://my-arangodb-cluster-coordinator-rxtu5pku.my-arangodb-cluster-int.default.svc:8529",
      "LastAckedTime": "2019-02-20T08:09:22Z",
      "SyncTime": "2019-02-20T08:09:21Z",
      "Version": "3.4.2-1",
      "Engine": "rocksdb",
      "ShortName": "Coordinator0002",
      "Timestamp": "2019-02-20T08:09:22Z",
      "Status": "GOOD",
      "SyncStatus": "SERVING",
      "Host": "my-arangodb-cluster-coordinator-rxtu5pku.my-arangodb-cluster-int.default.svc",
      "Role": "Coordinator",
      "CanBeDeleted": false
    },
    "PRMR-wbsq47rz": {
      "LastAckedTime": "2019-02-21T09:14:24Z",
      "Endpoint": "ssl://my-arangodb-cluster-dbserver-wbsq47rz.my-arangodb-cluster-int.default.svc:8529",
      "SyncTime": "2019-02-21T09:14:24Z",
      "Version": "3.4.2-1",
      "Host": "my-arangodb-cluster-dbserver-wbsq47rz.my-arangodb-cluster-int.default.svc",
      "Timestamp": "2019-02-21T09:14:24Z",
      "Status": "GOOD",
      "SyncStatus": "SERVING",
      "Engine": "rocksdb",
      "ShortName": "DBServer0006",
      "Role": "DBServer",
      "CanBeDeleted": false
    },
    "AGNT-wrqmwpuw": {
      "Endpoint": "ssl://my-arangodb-cluster-agent-wrqmwpuw.my-arangodb-cluster-int.default.svc:8529",
      "Role": "Agent",
      "CanBeDeleted": false,
      "Version": "3.4.2-1",
      "Engine": "rocksdb",
      "Leader": "AGNT-oqohp3od",
      "Status": "GOOD",
      "LastAckedTime": 0.312
    },
    ... [some more entries, one for each instance]
  },
  "ClusterId": "210a0536-fd28-46de-b77f-e8882d6d7078",
  "error": false,
  "code": 200
}
```

Check that each instance has a `Status` field with the value `"GOOD"`.
Here is a shell command which makes this check easy, using the
[`jq` JSON pretty printer](https://stedolan.github.io/jq/):

```bash
curl -k https://arangodb.9hoeffer.de:8529/_admin/cluster/health --user root: | jq . | grep '"Status"' | grep -v '"GOOD"'
```

For the shards being in sync there is the
[Cluster Inventory](https://docs.arangodb.com/stable/develop/http/replication/replication-dump#get-the-cluster-collections-and-indexes)
API call:

```
GET /_db/_system/_api/replication/clusterInventory
```

… which returns a JSON body like this:

```json
{
  "collections": [
    {
      "parameters": {
        "cacheEnabled": false,
        "deleted": false,
        "globallyUniqueId": "c2010061/",
        "id": "2010061",
        "isSmart": false,
        "isSystem": false,
        "keyOptions": {
          "allowUserKeys": true,
          "type": "traditional"
        },
        "name": "c",
        "numberOfShards": 6,
        "planId": "2010061",
        "replicationFactor": 2,
        "shardKeys": [
          "_key"
        ],
        "shardingStrategy": "hash",
        "shards": {
          "s2010066": [
            "PRMR-vzeebvwf",
            "PRMR-e6hbjob1"
          ],
          "s2010062": [
            "PRMR-e6hbjob1",
            "PRMR-vzeebvwf"
          ],
          "s2010065": [
            "PRMR-e6hbjob1",
            "PRMR-vzeebvwf"
          ],
          "s2010067": [
            "PRMR-vzeebvwf",
            "PRMR-e6hbjob1"
          ],
          "s2010064": [
            "PRMR-vzeebvwf",
            "PRMR-e6hbjob1"
          ],
          "s2010063": [
            "PRMR-e6hbjob1",
            "PRMR-vzeebvwf"
          ]
        },
        "status": 3,
        "type": 2,
        "waitForSync": false
      },
      "indexes": [],
      "planVersion": 132,
      "isReady": true,
      "allInSync": true
    },
    ... [more collections following]
  ],
  "views": [],
  "tick": "38139421",
  "state": "unused"
}
```

Check that for all collections the attributes `"isReady"` and `"allInSync"`
both have the value `true`. Note that it is necessary to do this for all
databases!

Here is a shell command which makes this check easy:

```bash
curl -k https://arangodb.9hoeffer.de:8529/_db/_system/_api/replication/clusterInventory --user root: | jq . | grep '"isReady"\|"allInSync"' | sort | uniq -c
```

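The command above only covers the `_system` database. A sketch that repeats the check for every database (same example endpoint and credentials, using the `/_api/database` endpoint to enumerate databases) could look like this:

```bash
# Check "isReady"/"allInSync" for every database, not just _system
# (example endpoint and credentials as used throughout this page)
for db in $(curl -sk https://arangodb.9hoeffer.de:8529/_api/database --user root: | jq -r '.result[]'); do
  echo "== $db =="
  curl -sk "https://arangodb.9hoeffer.de:8529/_db/$db/_api/replication/clusterInventory" --user root: \
    | jq '.collections[] | select((.isReady and .allInSync) | not) | .parameters.name'
done
```

An empty output for a database means that all its collections are ready and in sync.
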
If all these checks are performed and are okay, then it is safe to
continue with the clean out and drain procedure as described below.

#### Attention:

If there are some collections with `replicationFactor` set to
1, the system is not resilient and cannot tolerate the failure of even a
single server! One can still perform a drain operation in this case, but
if anything goes wrong, in particular if the grace period is chosen too
short and a pod is killed the hard way, data loss can happen.

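To spot such collections quickly, the cluster inventory output shown above can be filtered; as before, this is only a sketch with the example endpoint and it has to be repeated for each database:

```bash
# List collections whose replicationFactor is 1 (example endpoint and credentials)
curl -sk https://arangodb.9hoeffer.de:8529/_db/_system/_api/replication/clusterInventory --user root: \
  | jq -r '.collections[] | select(.parameters.replicationFactor == 1) | .parameters.name'
```
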
If the `replicationFactor` of every collection is at least 2, then the
system can tolerate the failure of a single _DB-Server_. If you have set
the `Environment` to `Production` in the specs of the ArangoDB
deployment, you will only ever have one _DB-Server_ on each k8s node and
therefore the drain operation is relatively safe, even if the grace
period is chosen too small.

Furthermore, we recommend having one more k8s node than there are _DB-Servers_ in
your cluster, such that the deployment of a replacement _DB-Server_ can
happen quickly and not only after the maintenance work on the drained
node has been completed. However, with the necessary care described
below, the procedure should also work without this.

Finally, one should **not run a rolling upgrade or restart operation**
at the time of a node drain.

## Clean out a DB-Server manually

In this step we clean out a _DB-Server_ manually, **before issuing the
`kubectl drain` command**. Previously, we denoted this step as optional,
but for safety reasons we now consider it mandatory, since it is nearly
impossible to choose the grace period long enough in a reliable way.

Furthermore, if this step is not performed, we must choose
the grace period long enough to avoid any risk, as explained in the
previous section. However, this has a disadvantage which has nothing to
do with ArangoDB: We have observed that some k8s internal services like
`fluentd` and some DNS services will always wait for the full grace
period to finish a node drain. Therefore, the node drain operation will
always take as long as the grace period. Since we have to choose this
grace period long enough for ArangoDB to move all data on the _DB-Server_
pod away to some other node, this can take a considerable amount of
time, depending on the size of the data you keep in ArangoDB.

Therefore, it is more time-efficient to perform the clean-out operation
beforehand. One can observe its completion, and as soon as it has completed
successfully, one can issue the drain command with a relatively
small grace period and still have a nearly risk-free procedure.

To clean out a _DB-Server_ manually, we have to use this API:

```
POST /_admin/cluster/cleanOutServer
```

… and send as body a JSON document like this:

```json
{"server":"DBServer0006"}
```

The value of the `"server"` attribute should be the name of the _DB-Server_
which runs in the pod that resides on the node to be
drained next. This is the UI short name (`ShortName` in the
`/_admin/cluster/health` API); alternatively, one can use the
internal name, which corresponds to the pod name. In our example, the
pod name is:

```
my-arangodb-cluster-prmr-wbsq47rz-5676ed
```

… where `my-arangodb-cluster` is the ArangoDB deployment name, therefore
the internal name of the _DB-Server_ is `PRMR-wbsq47rz`. Note that `PRMR`
must be all capitals, since pod names are always all lower case. So, we
could use the body:

```json
{"server":"PRMR-wbsq47rz"}
```

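If you want to derive the internal name from a pod name programmatically, a small shell sketch like the following works; the pod and deployment names are the example values from above:

```bash
# Derive the DB-Server ID (e.g. PRMR-wbsq47rz) from the example pod name
POD=my-arangodb-cluster-prmr-wbsq47rz-5676ed   # example pod name
DEPLOYMENT=my-arangodb-cluster                 # example deployment name
rest=${POD#"${DEPLOYMENT}"-}                   # -> prmr-wbsq47rz-5676ed
prefix=$(echo "$rest" | cut -d- -f1 | tr 'a-z' 'A-Z')   # -> PRMR
echo "${prefix}-$(echo "$rest" | cut -d- -f2)"          # -> PRMR-wbsq47rz
```
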
You can use this command line to issue the clean-out call:

```bash
curl -k https://arangodb.9hoeffer.de:8529/_admin/cluster/cleanOutServer --user root: -d '{"server":"PRMR-wbsq47rz"}'
```

The API call will return immediately with a body like this:

```json
{"error":false,"id":"38029195","code":202}
```

The given `id` in this response can be used to query the outcome or
completion status of the clean out server job with this API:

```
GET /_admin/cluster/queryAgencyJob?id=38029195
```

… which will return a body like this:

```json
{
  "error": false,
  "id": "38029195",
  "status": "Pending",
  "job": {
    "timeCreated": "2019-02-21T10:42:14.727Z",
    "server": "PRMR-wbsq47rz",
    "timeStarted": "2019-02-21T10:42:15Z",
    "type": "cleanOutServer",
    "creator": "CRDN-rxtu5pku",
    "jobId": "38029195"
  },
  "code": 200
}
```

Use this command line to check progress:

```bash
curl -k https://arangodb.9hoeffer.de:8529/_admin/cluster/queryAgencyJob?id=38029195 --user root:
```

It indicates that the job is still ongoing (`"Pending"`). As soon as
the job has completed, the answer will be:

```json
{
  "error": false,
  "id": "38029195",
  "status": "Finished",
  "job": {
    "timeCreated": "2019-02-21T10:42:14.727Z",
    "server": "PRMR-e6hbjob1",
    "jobId": "38029195",
    "timeStarted": "2019-02-21T10:42:15Z",
    "timeFinished": "2019-02-21T10:45:39Z",
    "type": "cleanOutServer",
    "creator": "CRDN-rxtu5pku"
  },
  "code": 200
}
```

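If you prefer not to poll by hand, a small loop like the following sketch waits until the job is no longer in progress; endpoint, credentials, and job id are the example values from above:

```bash
# Poll the clean-out job until it reports "Finished" (or "Failed").
# Endpoint, credentials and job id are the example values used above.
JOB_ID=38029195
while true; do
  status=$(curl -sk "https://arangodb.9hoeffer.de:8529/_admin/cluster/queryAgencyJob?id=${JOB_ID}" --user root: | jq -r '.status')
  echo "$(date) job ${JOB_ID}: ${status}"
  case "$status" in
    Finished|Failed) break ;;
  esac
  sleep 10
done
```
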
From this moment on, the _DB-Server_ can no longer be used as a target to move
shards to. At the same time, it no longer holds any data of the
cluster.

Now the drain operation involving the node with this pod on it is
completely risk-free, even with a small grace period.

## Performing the drain

After all the above [checks before a node drain](#things-to-check-in-arangodb-before-a-node-drain)
and the [manual clean out of the DB-Server](#clean-out-a-db-server-manually)
have been done successfully, it is safe to perform the drain operation with a command similar to this:

```bash
kubectl drain gke-draintest-default-pool-394fe601-glts --delete-local-data --ignore-daemonsets --grace-period=300
```

As described above, the options `--delete-local-data` for ArangoDB and
`--ignore-daemonsets` for other services have been added. A `--grace-period` of
300 seconds has been chosen because, for this example, we are confident that all the data on our _DB-Server_ pod
can be moved to a different server within 5 minutes. Note that this is
**not saying** that 300 seconds will always be enough. Depending on how
much data is stored in the pod, your mileage may vary; moving a terabyte
of data can take considerably longer!

If the highly recommended step of
[cleaning out a DB-Server manually](#clean-out-a-db-server-manually)
has been performed beforehand, the grace period can easily be reduced to 60
seconds, at least from the perspective of ArangoDB, since the server is already
cleaned out, so it can be dropped readily and there is still no risk.

At the same time, this guarantees that the drain is completed
within approximately a minute.

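To verify that the drain actually removed the ArangoDB pods from the node, you can list the pods still scheduled there; the node name is the example from the drain command above:

```bash
# List any pods still running on the drained node (example node name from above)
kubectl get pods --all-namespaces -o wide \
  --field-selector spec.nodeName=gke-draintest-default-pool-394fe601-glts
```
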
## Things to check after a node drain

After a node has been drained, one of the
_DB-Servers_ will usually be gone from the cluster. As a replacement, another _DB-Server_ has
been deployed on a different node, if there is a different node
available. If not, the replacement can only be deployed when the
maintenance work on the drained node has been completed and the node is
uncordoned again. In this latter case, one should wait until the node is
back up and the replacement pod has been deployed there.

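Uncordoning the node and watching for the replacement _DB-Server_ pod can be done as sketched below; the node name and namespace are example values, and the `prmr` filter relies on the pod naming convention shown earlier:

```bash
# Make the node schedulable again (example node name) ...
kubectl uncordon gke-draintest-default-pool-394fe601-glts

# ... and watch the DB-Server pods (their names contain "prmr"); example namespace
kubectl get pods -n default -o wide --watch | grep prmr
```
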
After that, one should perform the same checks as described in
[things to check before a node drain](#things-to-check-in-arangodb-before-a-node-drain)
above.

Finally, it is likely that the shard distribution in the "new" cluster
is not balanced. In particular, the new _DB-Server_ is not automatically
used to store shards. We recommend
[re-balancing](https://docs.arangodb.com/stable/deploy/deployment/cluster/administration/#movingrebalancing-_shards_) the shard distribution,
either manually by moving shards or by using the _Rebalance Shards_
button in the _Shards_ tab under _NODES_ in the web interface. This redistribution can take
some time, and its progress can be monitored in the UI.

After all this has been done, **another round of checks should be done**
before proceeding to drain the next node.