* fix BITOP creating the dst key if the result is empty
* fix replicating dst with the wrong type
* make BITOP a blind update (similar to the SET command)
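A minimal sketch of the expected behavior, using redis-py against a local instance (the key names and setup are illustrative, not taken from the PR):
```python
import redis

r = redis.Redis()
r.delete("src1", "src2")
r.set("dst", "stale")                     # dst already holds an old value

# Both sources are missing, so the result is empty. Per Redis semantics the
# destination must be deleted, not created or left holding the stale value.
result_len = r.bitop("AND", "dst", "src1", "src2")
assert result_len == 0
assert r.exists("dst") == 0
```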
---------
Signed-off-by: kostas <kostas@dragonflydb.io>
* chore: Forbid replicating a replica
We do not support connecting a replica to a replica, but before this PR
we allowed doing so. This PR disables that behavior.
Fixes#3679
* `replicaof_mu_`
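A hedged sketch of the flow this PR now rejects (the commands are real; the hosts, ports and exact error text are illustrative):
```python
import redis

replica_a = redis.Redis(port=6380)
replica_b = redis.Redis(port=6381)

# Point replica_a at the master - the supported topology.
replica_a.execute_command("REPLICAOF", "localhost", "6379")

# Point replica_b at replica_a - a replica of a replica. Before this PR the
# command was accepted; now it is expected to be rejected with an error.
try:
    replica_b.execute_command("REPLICAOF", "localhost", "6380")
except redis.ResponseError as err:
    print("rejected:", err)
```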
fix: Fix flaky test `test_acl_revoke_pub_sub_while_subscribed`
The reason it failed is that, in some rare cases, the subscriber did not
get the first few messages from the publisher. This is likely due to the
timing of SUBSCRIBE and PUBLISH, which run on different connections / threads.
Given that Pub/Sub has very weak guarantees, this is probably fine as is, so I
just added a sleep to make the test pass consistently.
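A hedged sketch of the stabilization pattern described above (not the actual test code): give the subscription a moment to take effect before publishing.
```python
import asyncio
from redis import asyncio as aioredis

async def main():
    sub = aioredis.Redis()
    pub = aioredis.Redis()

    ps = sub.pubsub()
    await ps.subscribe("chan")
    await asyncio.sleep(0.5)   # let the subscription settle before publishing

    await pub.publish("chan", "hello")
    msg = await ps.get_message(ignore_subscribe_messages=True, timeout=5)
    assert msg is not None and msg["data"] == b"hello"

asyncio.run(main())
```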
* chore: some renames + fix a typo in RETURN_ON_BAD_STATUS
Renames in transaction.h - no functional changes.
Fix a typo in error.h following #3758
---------
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
fix 1: in the multi-command squasher the error message was not set, so it was logged only for EXEC and not for the relevant command; fixed by setting the last error in CapturingReplyBuilder::SendError
fix 2: cached error replies were not cleared before the command is invoked
---------
Signed-off-by: adi_holden <adi@dragonflydb.io>
Co-authored-by: kostas <kostas@dragonflydb.io>
* fix: improve BreakStalledFlowsInShard heuristic
Before this change, we wrote in a single call whatever record chunks we pulled from the channel.
This can be problematic for 1GB chunks, for example, which might take 10 seconds to write.
We recently added a replication breaker on the master side that breaks the full sync after
a predefined threshold has passed, 10 seconds by default. To improve the robustness of this
breaker, we now write chunks of up to 1MB and update last_write_time_ns_ more frequently.
We also added more logs to make replication delays on both sides more visible, including
logs when the master breaks the replication.
Unfortunately, this did not make BreakStalledFlowsInShard more robust, because the problem
moved to the replica side, which may take 20+ seconds to parse huge values.
Therefore, I increased the threshold for breaking the replication to 30s.
Finally, instrument the GetMetrics call, as it sometimes takes more than 1 second.
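The actual change lives in the C++ replication path; the Python sketch below (with illustrative names only) just shows the chunked-write idea described above: cap each write at 1MB and refresh the progress timestamp after every chunk, so the stall breaker keeps seeing forward progress on large records.
```python
import time

MAX_CHUNK_BYTES = 1 << 20  # 1MB per write, instead of one huge write

class FlowWriter:
    def __init__(self, sock_file):
        self.sock_file = sock_file
        self.last_write_time_ns = time.monotonic_ns()

    def write_record(self, blob: bytes):
        # Writing a 1GB record in one call could stall for ~10s and trip the
        # replication breaker; bounded chunks keep the timestamp fresh.
        for off in range(0, len(blob), MAX_CHUNK_BYTES):
            self.sock_file.write(blob[off:off + MAX_CHUNK_BYTES])
            self.last_write_time_ns = time.monotonic_ns()
```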
---------
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
* add(helm): add hostNetwork, topologySpreadConstraint and clusterIP support
* parameters hostNetwork and clusterIP should not be templated if they are not explicitly used
---------
Signed-off-by: Stefan Roman <elegant.frog3113@fastmail.com>
Co-authored-by: Stefan Roman <elegant.frog3113@fastmail.com>
* chore: introduce a Clone function for the dense set
We use a state machine to prefetch data in batches.
After this change, the hot spots are predominantly inside ObjectClone and
Hash methods.
All in all, benchmarks show a ~45% CPU reduction:
```
BM_Clone/elements:32000 1517322 ns 1517338 ns 2772
BM_Fill/elements:32000 841087 ns 841097 ns 4900
```
---------
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
The test assumed that any shutdown takes no more than 1s. This doesn't
always hold, and waiting a fixed 1s isn't ideal anyway, because shutdown
usually takes less than that.
Changed the test to use `assert_eventually` instead.
Fixes#3684
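A hedged sketch of the retry-until-true idea (the real test uses the repo's `assert_eventually` helper; this version is illustrative):
```python
import asyncio

async def assert_eventually(predicate, timeout=10.0, step=0.1):
    # Poll the condition instead of sleeping a fixed 1s: returns as soon as
    # the condition holds, and tolerates shutdowns slower than 1s.
    deadline = asyncio.get_running_loop().time() + timeout
    while not await predicate():
        if asyncio.get_running_loop().time() > deadline:
            raise AssertionError("condition was not met within the timeout")
        await asyncio.sleep(step)
```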
There are 2 minor issues with this test:
1. It specified `cmdstat_replconf` under `cmd_stats` instead of `cmd`;
   that's clearly a typo, as `cmd_stats` is a map with stats, while
   `replconf` is a Dragonfly command.
2. The `MULTI` command is allowed to run even when the server is in a paused
   state, see
   [here](https://github.com/dragonflydb/dragonfly/blob/main/src/server/main_service.cc#L1197):
```
// Don't interrupt running multi commands or admin connections.
```
Fixes#3675
Before this change, whenever Dragonfly was paused (either by a user or by
internal processes like takeover or slot migration finalization),
migrations and replications were also paused.
This could cause timing issues, which sometimes result in migration
failures. Specifically, when 2 nodes have migrations from one to the
other **in parallel** (A->B and B->A), the `Pause()` that happens on A
(because it's a source node) stops it from processing
incoming traffic from B (incoming because it is also a target node).
Given the right timing, A stays blocked until the pause times out, and so the
migration fails.
The fix is to prevent replications and migrations from adhering to
`Pause()`s, which I think should not have happened in the first place
because they should use the admin port anyway.
Fixes#3319