Until now, we only tested Dragonfly against Redis 6.2. It appears that
something has changed in the way Redis sends stable sync commands, and
now they also forward `MULTI` and `EXEC` as part of their replication.
Since we do not allow all commands to run under `MULTI`/`EXEC`,
specifically `SELECT`, a Dragonfly replica of such servers failed these
commands and became inconsistent with the data on the master.
The proposed fix is to simply ignore (i.e. not execute) `MULTI`/`EXEC`
coming from a Redis/Valkey master, and run the commands within those
transactions individually, like we do for other transactions.
To test this we randomly choose a redis/valkey server based on 3
available installed binaries and test against them.
* fix: Do not publish to connections without context
This is a rare case where a closed connection is kept alive while the
handling fiber yields, therefore leaving `cc_` (the connection context)
pointing to null for other fibers to see.
As far as I can see, this can only happen during server shutdown, but
there could be other cases that I have missed.
The test on its own does _not_ reproduce the crash, however with added
`ThisFiber::SleepFor()`s I could reproduce the crash:
* Right before `DispatchBrief()`
[here](e3214cb603/src/server/channel_store.cc (L154))
* Right after connection context `reset()`
[here](2ab480e160/src/facade/dragonfly_connection.cc (L750))
In any case, calling `SendPubMessageAsync()` to a connection where `cc_`
is null is a bug, and we fix that here.
* rewording
A common case is that we need to clean up a connection before we exit a test via .close() method. This is needed because otherwise the connection will raise a warning that it is left unclosed. However, remembering to call .close() at each connection at the end of the test is cumbersome! Luckily, fixtures in python can be marked as async which allow us to:
* cache all clients created by DflyInstance.client()
* clean them all at the end of the fixture in one go
---------
Signed-off-by: kostas <kostas@dragonflydb.io>
We do not allow notify_keyspace_events to be set at runtime via config set command.
* allow notify_keyspace_events in config set command
* add tests
---------
Signed-off-by: kostas <kostas@dragonflydb.io>
* fix bitiop creating the dst key if result is empty
* fix replicating dst with the wrong type
* make bitop a blind update (similar to set command)
---------
Signed-off-by: kostas <kostas@dragonflydb.io>
* chore: Forbid replicating a replica
We do not support connecting a replica to a replica, but before this PR
we allowed doing so. This PR disables that behavior.
Fixes#3679
* `replicaof_mu_`
fix: Fix flaky test `test_acl_revoke_pub_sub_while_subscribed`
The reason it failed is that, in some rare cases, the subscriber did not
get the first few messages of the publisher. This is likely due to
timing of subscribe and publish, in different connections / threads.
Given Pub/Sub has very weak guarantees, it's probably ok as is, so I
just added a sleep to get the test to pass always.
The test assumed any shutdown will take not more than 1s. This doesn't
always hold, and also waiting for 1s isn't ideal because usually it
takes less than that.
Changed to use `assert_eventually` instead.
Fixes#3684
There are 2 minor issues with this test:
1. It specified `cmdstat_replconf` as `cmd_stats` instead of `cmd`,
that's clearly a typo as `cmd_stats` is a map with stats, while
`replconf` is a Dragonfly command
2. Command `MULTI` is allowed to run even when the server is in paused
state, see
[here](https://github.com/dragonflydb/dragonfly/blob/main/src/server/main_service.cc#L1197):
```
// Don't interrupt running multi commands or admin connections.
```
Fixes#3675
We disable address space randomization when building the binary
and use addr2line to symbolize the stacktrace if it exists.
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
chore: deprecate RecordsPopper and serialize channel records during push
Records channel is redundant for DFS/replication because we have single producer/consumer
scenario and both running on the same thread. Unfortunately we need it for RDB snapshotting.
For non-rdb cases we could just pass a io sink to the snapshot producer,
so that it would use it directly instead of StringFile inside FlushChannelRecord.
This would reduce memory usage, eliminate yet another memory copy and generally would make everything simpler.
For that to work, we must serialize the order of FlushChannelRecord, and that's implemented by
this PR. Also fixes#3658.
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
* fix: crash when passing empty arguments
Fix the case where we pass an empty argument, which then is parsed as an
empty string view with null pointer. The null pointer is not handled correctly
by our low level c code, hence switch to using ""sv for that.
* chore: add more list asserts + improve test_hypothesis
---------
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
1. The offset value can be negative, in that case we should return an empty array.
2. Fix edge cases of inf*0 and -inf + inf, so they will result in 0 and non NaN (similarily to Valkey).
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
For legacy mode:
1. For mutate commands, an empty result should throw an error
2. For read commands, it returns nil if path was not found, but if it was matched
but only with a wrong types, it will throw an error.
For non-legacy mode, objlen should throw an error for non existing key.
Simplified JsonCallbackResult a bit and made sure more fakeredis tests are passing.
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
* feat: add slave_repl_offset to the replication section.
In Valkey slave_repl_offset denotes the replication offset on replica site during stable sync phase.
During fullsync phase it appears with 0 value.
In Dragonfly this field appears only after full sync has completed, thus it allows
to check whether Dragonfly reached stable sync phase. The value of this field describes the cumulative progress
of all the replication flows and it does not directly correspond to master side metrics.
In addition, this PR fixes the bug in wait_available_async() function in our replication tests.
This function is intended to wait until a replica reaches stable state and it did by sending pings until they do not
respond with LOADING error, hence the assumption is that the replica is in full sync state already.
However it can happen that master_link_status is "up" but replica has not reached full sync state, and the PING will succeed
just because wait_available_async() was called before full sync started. The whole approach of polling the state is fragile.
Now we use `slave_repl_offset` explicitly to see if the replica reaches stable state.
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
* chore: simplify wait_available_async
* chore: comments
---------
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
* chore: enable experimental_new_io by default.
It has been running for weeks with the flag on, so enabled it also for community.
---------
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
Signed-off-by: Vladislav Oleshko <vlad@dragonflydb.io>
Co-authored-by: Vladislav Oleshko <vlad@dragonflydb.io>
* fix: xreadgroup replies as a map for RESP3
Moreover, it returns data for all the strings, irrespective whether they have results or not
(unlike with XREAD)
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
* fix: properly handle xpending with 0 results
Also reject ENTRIESREAD instead of silently accepting it.
---------
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
* fix: JSON.STRAPPEND
JSON.STRAPPEND was completely broken.
First, it accepts exactly 3 arguments, i.e. a single value to append.
Secondly, the value must be a json string, not the regular string. Meaning it must be in double quotes.
So, before we parsed: `JSON.STRAPPEND key $.field bar` and now we parse:
`JSON.STRAPPEND key $.field "bar"`
In addition fixed the behavior of JSON.STRLEN to return "no such key" error in case the
json key does not exist and path is specified.
---------
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
* chore: improve compatibility of set and ping commands
smismember should return an array of longs and not array of strings.
ping in subscribe mode returns an array for resp2.
Also, fix double rounding for legacy float mode.
---------
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
1. Fix corner cases around non existing keys
2. Fix matching logic for * glob, as well as '' glob.
3. Improve SORT option parsing.
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
* feat(cluster): Allow appending RDB to existing store
The goal of this PR is to support the loadoing of multiple RDB files into a single server, like when migrating from a Valkey cluster to Dragonfly with a different number of nodes.
It makes the following changes:
* Removes `DEBUG LOAD`, as we already have `DFLY LOAD`
* Adds `APPEND` option to `DFLY LOAD` (i.e. `DFLY LOAD <filename> APPEND`) that loads an RDB without first flushing the data store, overriding existing keys
* Does not load keys belonging to unowned slots, if in cluster mode
Fixes#2840
Background
We tried to be compatible with Valkey in their support of Lua flags, but we generally failed:
We are not really compatible with Valkey because our flags are different (we reject unknown flags, and Valkey flags are unknown to us)
The #!lua syntax doesn't work with Lua (# is not a comment), so scripts written for older versions of Redis can't be used with Dragonfly (i.e. they can't add Dragonfly flags and remain compatible with older Redis versions)
Changes
Instead of the previous syntax:
#!lua flags=allow-undeclared-keys,disable-atomicity
We now use this syntax:
--!df flags=allow-undeclared-keys,disable-atomicity
It is not backwards compatible (with older versions of Dragonfly), but it should be very easy to adapt to, and doesn't suffer from the disadvantages above.
Related to #3512
* fix truncating the timeout red dots on CI failures
* fix deprecated use of with timeout warnings
* remove @pytest.mark.dbg_only as it doesn't exist
---------
Signed-off-by: kostas <kostas@dragonflydb.io>
There are some problematic flows. First we did not handle deletions, so all sorts of consistency issues could arise while calling DbSlice::Traverse() and DbSlice::Del(). Second, we did not handle FlushAll (same as before, Traverse() preempts and FlushAll() kicks in. Third we did not handle expirations.
---------
Signed-off-by: kostas <kostas@dragonflydb.io>
**Background**
In v1.21.0 we introduced support for `--announce_ip` for replicas to
announce their public IP addresses.
Like Valkey, this uses `REPLCONF IP-ADDRESS` to announce their IP
address.
**The issue**
Older Dragonfly releases (<1.21) did not support this feature. The
master side simply returned an error for such `REPLCONF` attempts,
however the replica code failed the replication, resulting in
incompatible versions.
**The fix**
The fix is simple, just log an error if the master did not respect
`REPLCONF IP-ADDRESS`. We can make this non-optional in the future
again.
However, in addition, I added a regression test to make sure we are
backwards compatible with v1.19.2. We'll bump this up every once in a
while.
* chore: add timeout fo replication sockets
Master will stop the replication flow if writes could not progress for more than K millis.
---------
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
Signed-off-by: Roman Gershman <romange@gmail.com>
Co-authored-by: Shahar Mike <chakaz@users.noreply.github.com>
os.remove(LAST_LOGS) might throw an exception if the file does not exist which we do not handle. Wrap it in try/catch block
* wrap in try/catch os.remove
---------
Signed-off-by: kostas <kostas@dragonflydb.io>
The env variables exported when regression tests timeout are not working properly and the if statement on the action step Print last log on timeout would fail to read and upload the files set in /tmp/last_log_file.txt. Furthermore, another problem is the job.timeout argument that kills the whole job/matrix before the upload log step has a chance to run. For that, we need manual timeouts on the workflow similar to what we do in regression tests action.
* remove print last log on timeout action step
* copy the logs on timeout directly within the timeout step
* replace global timeout on CI workflow with timeout command per step
---------
Signed-off-by: kostas <kostas@dragonflydb.io>