There was a bug in updates of the ACL categories when squashing was used: the parent context could be accessed in parallel by the "stub" contexts, causing a dreaded data race on the update.
This is fixed by adding a new AclUpdateMessage at the front of the dispatch queue of the connection.
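For context, a minimal repro sketch of the race, assuming a hypothetical user and a pipeline large enough to trigger squashing (all names, ports, and sizes here are illustrative, not from the PR):

```python
# Hypothetical repro sketch: a large pipeline (squashed into "stub" contexts)
# racing against a concurrent ACL category update on the parent context.
import threading
import redis

r = redis.Redis(port=6379)  # port is an assumption
r.execute_command("ACL", "SETUSER", "u1", "on", ">pass", "+@string")

def hammer():
    p = r.pipeline(transaction=False)
    for _ in range(1000):   # large pipeline so commands get squashed
        p.set("k", "v")
    p.execute()

t = threading.Thread(target=hammer)
t.start()
# Before the fix, this update could race with the stub contexts above.
r.execute_command("ACL", "SETUSER", "u1", "-@string")
t.join()
```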
1. No logic was changed during the refactoring.
2. Flipped the flag to run regression tests from now on with zset_tree=on
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
* fix DispatchCommand error reporting when the memcached protocol is used (one example is when we use the SET command on a replica -- previously we crashed, now we properly report an error)
* SendError(ErrorReply) moved to SinkReplyBuilder from RedisReplyBuilder
* SendError(OpStatus) moved to SinkReplyBuilder from RedisReplyBuilder
* added tests for SendError(ErrorReply) in RedisReplyBuilder
* feat: implement CONFIG GET command
The command returns all the matched arguments and their current values.
In addition, this PR adds mutability semantics to each config - whether it can be
changed at runtime.
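For example (connection details are assumptions):

```python
import redis

r = redis.Redis()                 # connection details are assumptions
print(r.config_get("maxmemory"))  # a single config and its current value
print(r.config_get("max*"))       # glob pattern: all matching configs
```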
Fixes #1700
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
---------
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
* feat(server): Support limiting the number of open connections (see the sketch after this list).
* Update helio after the small fix was merged to master
* Don't limit admin connections (and add a test case)
* Resolve CR comments
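A hedged test sketch of the behavior (the limit value is hypothetical, and the flag used to set it is not named here; admin-port connections are exempt per the bullet above):

```python
# Sketch: assuming the server was started with a limit of 2 connections.
import redis

conns = [redis.Redis(single_connection_client=True) for _ in range(2)]
for c in conns:
    c.ping()                      # the first two connections succeed

extra = redis.Redis(single_connection_client=True)
try:
    extra.ping()                  # the one over the limit should be refused
except redis.ConnectionError as exc:
    print("rejected as expected:", exc)
```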
1. If the first request sent to the connection was large (2KB or more),
Dragonfly closed the connection.
2. Changed server-side error reporting according to the memcached protocol
(see the example after this list):
https://github.com/memcached/memcached/blob/master/doc/protocol.txt#L172
3. Fixed the wrong casting in DispatchCommand.
4. Removed practically unused code that translated OpStatus to strings.
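For reference, the error replies the memcached text protocol defines are `ERROR`, `CLIENT_ERROR <msg>` and `SERVER_ERROR <msg>`, each terminated by `\r\n`. A minimal sketch of observing one (the port and the replica setup are assumptions):

```python
# Sketch: send a write over the memcached protocol to a replica and read the
# error line back (before the fix this could crash instead of replying).
import socket

s = socket.create_connection(("localhost", 11211))  # memcached port assumed
s.sendall(b"set foo 0 0 3\r\nbar\r\n")
print(s.makefile("rb").readline())  # e.g. b"SERVER_ERROR ...\r\n"
```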
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
* introduce `--replicaof` flag
Closes #1381.
The behaviour of `--replicaof` is similar to `REPLICAOF`: on startup, the instance continuously attempts to connect to the master. Replication can be stopped with the normal `REPLICAOF NO ONE` command.
The flag expects format `<IPv4/host>:<port>` or `[<IPv6>]:<port>`.
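For illustration (addresses and ports are placeholders):

```python
# The instance is assumed to have been started with something like:
#   dragonfly --replicaof 10.0.0.1:6379     (or --replicaof "[::1]:6379")
import redis

r = redis.Redis(port=6380)                   # the replica; port assumed
print(r.execute_command("ROLE"))             # reports the replica role
r.execute_command("REPLICAOF", "NO", "ONE")  # detach from the master
```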
---------
Signed-off-by: talbii <ido@dragonflydb.io>
Signed-off-by: talbii <41526934+talbii@users.noreply.github.com>
Requested by #1590.
Introducing a new flag --snapshot_cron, enabling users to use cronjob expressions to time snapshot saves.
Cron expressions are parsed using the third-party library croncpp.
This PR continues #1599, updating cron expressions to crontab style,
up to minutes resolution instead of seconds.
Signed-off-by: Dor Avrahami <da19965@gmail.com>
Introducing a new flag `--snapshot_cron`, which enables users
to use cron expressions to time snapshot saves.
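For example, with crontab-style expressions (five fields, minute resolution); the validation below uses the Python croniter package purely for illustration, since Dragonfly itself parses expressions with croncpp:

```python
from croniter import croniter

for expr in ("*/30 * * * *",   # every 30 minutes: --snapshot_cron "*/30 * * * *"
             "0 3 * * *"):     # daily at 03:00:   --snapshot_cron "0 3 * * *"
    assert croniter.is_valid(expr)
```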
Signed-off-by: Dor Avrahami <da19965@gmail.com>
* Fix(regression test): fix test_flushall_in_full_sync
The bug: the test checked replication using the ROLE command on the replica.
The replica updates its status to full sync when it starts the full-sync
flow, but at that point the master may not have started snapshotting yet.
The fix: check the status using the ROLE command on the master, because the master
updates the status only after snapshotting has started.
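A sketch of the fixed check (the port is an assumption; the exact ROLE reply layout is treated loosely here):

```python
import time
import redis

master = redis.Redis(port=6379)  # the master; port assumed
# Poll ROLE on the master: it reports full sync only after snapshotting starts.
while "full_sync" not in str(master.execute_command("ROLE")):
    time.sleep(0.1)
```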
Signed-off-by: adi_holden <adi@dragonflydb.io>
* sec: Adjust flag checks when using TLS.
* Trust default certificates if no specific roots are given
* Add regression tests for the different scenarios
* Validate that client connections work as well
The test sometimes fails when starting the master after killing it.
The reason is that the OS did not release the port until we started the
master again.
The fix: add a sleep after the kill.
Once pytest uses randomly selected ports, we can remove this sleep.
Signed-off-by: adi_holden <adi@dragonflydb.io>
* fix(regression_test): fixes in shutdown and replication pytests
- skip the test_gracefull_shutdown test
- fix the test_take_over_seeder test:
bug: the dbfilename was not unique, so between runs the server reloaded
the snapshot from the previous test run, which failed the test.
fix: use a random dbfilename (see the sketch after this list)
- fix the test_take_over_timeout test:
bug: the REPLTAKEOVER timeout was not small enough for the opt Dragonfly build
fix: decrease the timeout
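A minimal sketch of the unique-filename fix:

```python
import uuid

# Each test run gets its own snapshot name, so a stale dump from a previous
# run can never be loaded by accident (passed to the server as --dbfilename).
dbfilename = f"dump-{uuid.uuid4().hex}"
```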
Signed-off-by: adi_holden <adi@dragonflydb.io>
1. add the tls-ca-cert-file flag
2. add the tls-ca-cert-dir flag
3. enable redis-cli to connect over TLS without the --insecure flag by properly validating the certificate with the CA (see the sketch after this list)
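A sketch of the client side with proper CA validation (paths and port are placeholders):

```python
# The server is assumed to have been started with the CA configured via the
# new tls-ca-cert-file (or tls-ca-cert-dir) flag.
import redis

r = redis.Redis(port=6379, ssl=True, ssl_ca_certs="/path/to/ca.crt")
r.ping()  # the certificate chain is verified; no --insecure needed
```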
The issue was that, sometimes, the ID generated for one of the nodes
contained the slot ID that was used in the test (either 5259 or 5260).
This caused the test to replace the "slot" part of the ID, which in turn
caused the node to think that it no longer owned any slots.
* fix(server): Initialize ServerFamily with all listeners.
- Add a test for CLIENT LIST which is the visible result of this.
* use std::move
* feat: Implement replica takeover (see the usage sketch after this list)
* Basic test
* Address CR comments
* Write a better test. Sadly it fails
* chore: Expose AwaitDispatches for reuse in takeover
* Ensure that no commands can execute during or after a takeover
* CR progress
* Actually disable the expiration
* Improve tests coverage
* Fix the dispatch waiting code
* Improve testing coverage and fix a shutdown snapshot bug
* don't replicate a replica
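A hedged usage sketch (the timeout argument in seconds and the port are assumptions for illustration):

```python
import redis

replica = redis.Redis(port=6380)            # the replica; port assumed
replica.execute_command("REPLTAKEOVER", 5)  # wait up to 5s to catch up
print(replica.execute_command("ROLE"))      # should now report master
```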
Enables execution of global Lua scripts inside MULTI/EXEC transactions if the default script config enables global execution for scripts. This change is only a fix and does not provide any safeguards against other execution scenarios (namely enabling global execution with script flags). In the future, the proper execution mode should be determined more carefully by inspecting the scripts to be executed.
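A sketch of the now-permitted scenario, assuming the default script config enables global execution (connection details are assumptions):

```python
import redis

r = redis.Redis()
p = r.pipeline(transaction=True)  # MULTI ... EXEC
# A "global" script: it touches a key without declaring it (numkeys = 0).
p.eval("return redis.call('GET', 'some-key')", 0)
print(p.execute())
```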
Signed-off-by: Vladislav Oleshko <vlad@dragonflydb.io>
Co-authored-by: Kostas Kyrimis <kostaskyrim@gmail.com>
The test case for checking is_loading == 1 is inherently racy because
the client can connect at any time before or after the dragonfly
instance loads the snapshot.
This PR is a temporary solution for clients that are not properly
removed from the connection pool, which triggers an active-client assertion
during dragonfly instance shutdown.
fix: remove a bad check-fail in the transaction code.
Fixes #1421.
The failure reproduces for Dragonfly running with a single thread, where all the
arguments are grouped within the same ShardData.
Also, we improve verbosity levels inside reply_builder.cc.
For that we extend SinkReplyBuilder to support protocol error reporting
and we remove the ad-hoc code for this from dragonfly_connection.
This is required to track errors easily with `--vmodule=reply_builder=1`.
Finally, a pytest is added to cover the issue.
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
In this case, `redis.RedisCluster`.
To be double sure I also looked at the actual packets and saw that the
client asks for `CLUSTER SLOTS`, and then after the redistribution of
slots, following a few `MOVED` replies, it asks for the new slots again.
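A sketch of the client behavior described above, using redis-py's cluster client (the endpoint is an assumption):

```python
from redis.cluster import RedisCluster

rc = RedisCluster(host="localhost", port=6379)
rc.set("foo", "bar")  # routed by slot; a MOVED reply makes the client
print(rc.get("foo"))  # re-fetch the slot mapping, as seen in the packets
```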
This allows masters to send data of non-owned keys to their replicas,
which is useful when:
1. Config is temporarily different between master and replica
2. Preparing to take ownership of currently not-owned slots (in the upcoming migration feature)
Fixed #1319
* feat: Support ACKs from replica to master
* Rework after CR
* Split the acks into a different fiber and remove the PING loop
* const convention
* move around the order.
* revert sleep removal
* Exit ack fiber on cancellation
* Don't send ACKs if server doesn't support it
Now `SUBSCRIBE` will respond synchronously. The change is here so we:
1. Maintain the order in pipelined requests
2. Don't have a "race condition": subscribe needs to update channel store pointers on all threads. While it waits for all threads to complete the callback, some of them might have done it earlier, so they can already start sending messages before the initial ack is sent (see the sketch after this list)
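A sketch of the resulting ordering (connection details are assumptions):

```python
import redis

r = redis.Redis()
p = r.pubsub()
p.subscribe("ch")
print(p.get_message(timeout=1))  # first: the synchronous 'subscribe' ack
r.publish("ch", "hello")
print(p.get_message(timeout=1))  # only afterwards: published messages
```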
Signed-off-by: Vladislav Oleshko <vlad@dragonflydb.io>
* feat: Use journal LSNs for absolute replication offsets
* 1 - Address small CR comments
2 - Simplify the offset accounting so that we send the correct offset
in `SliceSnapshot::Stop` instead of counting in RdbLoader. This
allows us to revert the changes to slice journaling of EXEC
commands, for example.
* Store int with absl::little_endian
* Document the offset management
Move the timeout decrement to the end so we don't assert if we succeed in
the last second (it looks like this was the case in the recent regression
test failure on CI)
Clarify the comment of await_synced
Signed-off-by: ashotland <ari@dragonflydb.io>
* test(sentinel_test.py): increase timeout from 10 to 15 seconds in test_failover function
Signed-off-by: ashotland <ari@dragonflydb.io>
---------
Signed-off-by: ashotland <ari@dragonflydb.io>
* feat(server): Save snapshot on shutdown
* CR
* Change save on shutdown to be conditional on --dbfilename.
* Support SHUTDOWN [NO]SAVE and fix unit test
* Better wait for DB loading
* Fix DF format loading state bug
* Fix some fallout from auto save
* feat(server): Insert timestamp into snapshot names explicitly
* Whenever the snapshot filename contains '{timestamp}', it will be substituted with the current local time (see the sketch after this list).
* The default snapshot name is now "dump-{timestamp}"
* InferLoadFile: Modify to recognize "{timestamp}" files correctly.
* ServerFamily::Load: Change extension 'CHECK' into a non-terminating error because it's user-visible
* ServerFamily::DoSave: Add sanity check for the filename extension.
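A hedged sketch of the substitution; the exact local-time format is an assumption for illustration only:

```python
from datetime import datetime

template = "dump-{timestamp}"  # the new default --dbfilename value
name = template.replace(
    "{timestamp}", datetime.now().strftime("%Y-%m-%dT%H:%M:%S"))
print(name)  # e.g. dump-2023-07-30T12:45:00
```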
Signed-off-by: Roy Jacobson <roy@dragonflydb.io>
* resolve CR comments
* Add comment about glob sorted output
* Fix InferLoadFile and fix its tests
* Simplify filename behavior with the .dfs format
---------
Signed-off-by: Roy Jacobson <roy@dragonflydb.io>
* fix(regression-tests): Add PortManager
Add PortManager to find and use available ports in regression tests.
Use it in redis_replication_test.
---------
Signed-off-by: ashotland <ari@dragonflydb.io>
fix(regression-test): Delete dump.rdb
This file caused Redis to fail to start in redis_replication_test;
I assume it was added accidentally.
Signed-off-by: ashotland <ari@dragonflydb.io>
* Fix crash in ZPOPMIN
Crash was due to an attempt to access nullptr[-1], which is bad :)
* Add test to repro crash.
There are some leftover debugging statements; they're somewhat useful, so I
kept them, as the bug is not yet fixed.
* Copy patch by romange to solve the crash
Also re-enable (uncomment) the test in utility.py.
Signed-off-by: chakaz <chakaz@chakaz>
---------
Signed-off-by: chakaz <chakaz@chakaz>
Signed-off-by: Chaka <chakaz@users.noreply.github.com>
Co-authored-by: chakaz <chakaz@chakaz>
1. pytest extensions and fixes - allow running the tests
against an existing local server by providing its port (--existing <port>).
2. Extend the "DEBUG WATCHED" command to provide more information about the watched state.
3. Improve debug/vlog printouts around the code.
This noisy PR is a preparation for the BRPOP fix that will follow later.
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
Previously we set batch mode when the dispatch queue was not empty,
but the dispatch queue could contain other async messages related to pubsub or monitor.
Now we enable batching only if there are more pipelined commands in the queue.
In addition, fix the issue of unlimited aggregation of the batching buffer.
Fixes #935.
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
The problem happens when a publisher sends a message while a new subscriber registers.
In that case the connection sends the "subscribe" response and the publish messages,
and those sometimes interleave.
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
* Some test cases for redis replication
Most of them are skipped / commented out so they serve as a repro without failing
on CI.
---------
Signed-off-by: ashotland <ari@dragonflydb.io>
Store script parameters for each script that allow configuring its transaction multi mode. They can be configured either for a specific script with `SCRIPT CONFIG <sha> [params...]` or changed globally as defaults with `default_lua_config`.
The currently supported options are `allow-undeclared-keys` and `disable-atomicity`. Based on those flags, we determine the correct multi mode. `disable-atomicity` allows running in non-atomic mode, whereas being atomic and enabling `allow-undeclared-keys` requires the global mode.
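A usage sketch (connection details are assumptions; parameter names per the description above):

```python
import redis

r = redis.Redis()
sha = r.script_load("return redis.call('GET', KEYS[1])")
# Per-script override of the transaction multi mode:
r.execute_command("SCRIPT", "CONFIG", sha, "allow-undeclared-keys")
# Global defaults would instead be set via the default_lua_config flag.
```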
* Ditch docker, which is complex on CI, in favour of a local redis binary
Signed-off-by: ashotland <ari@dragonflydb.io>
* Fix typo
Signed-off-by: ashotland <ari@dragonflydb.io>
* Wait for sentinel termination
Signed-off-by: ashotland <ari@dragonflydb.io>
---------
Signed-off-by: ashotland <ari@dragonflydb.io>
* feat(server): Allow admin commands in multi
Needed for sentinel support (#706)
Signed-off-by: ashotland <ari@dragonflydb.io>
* feat(server): Add test coverage for multi global commands
Signed-off-by: ashotland <ari@dragonflydb.io>
* code review fixes
Signed-off-by: ashotland <ari@dragonflydb.io>
* Sentinel integration test
Signed-off-by: ashotland <ari@dragonflydb.io>
* PR code review follow-up
Have the lambda return an awaitable instead of defining a one-off async function
Signed-off-by: ashotland <ari@dragonflydb.io>
---------
Signed-off-by: ashotland <ari@dragonflydb.io>
* test: add dragonfly_db fixture to integration tests #199
Signed-off-by: Shmulik Klein <shmulik.klein@gmail.com>
* test: lint using flake8
Signed-off-by: Shmulik Klein <shmulik.klein@gmail.com>
* test: run dragonfly debug version as fallback
Signed-off-by: Shmulik Klein <shmulik.klein@gmail.com>
In rare cases a scheduled transaction is not scheduled correctly and we need
to remove it from the tx-queue in order to re-schedule it. When we pull it from the tx-queue
and it was located at the head, we must poll-execute the next txs in the queue.
1. Fix the bug.
2. Improve verbosity logging to make it easier to follow the tx flow in release mode.
3. Introduce a /txz handler that shows currently pending transactions in the queue (see the sketch after this list).
4. Fix a typo in the xdel() function.
5. Add a py-script that reproduces the bug.
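A sketch of fetching the new handler, assuming the built-in HTTP console is reachable on the main port (the port is an assumption):

```python
import urllib.request

with urllib.request.urlopen("http://localhost:6379/txz") as resp:
    print(resp.read().decode())  # currently pending transactions
```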
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
1. Found dangling transaction pointers that were left in the watch queue; fixed the state machine there.
2. Improved the transaction code a bit: merged duplicated code into the RunInShard function and got rid of RunNoop.
3. Improved BPopper::Run flow.
4. Added the 'DEBUG WATCH' command. Also, 'DEBUG OBJECT' now returns the shard id and the lock status of the object (see the sketch below).
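A usage sketch (connection details are assumptions):

```python
import redis

r = redis.Redis()
r.lpush("mylist", "x")
# DEBUG OBJECT now also reports the shard id and the lock status of the key.
print(r.execute_command("DEBUG", "OBJECT", "mylist"))
# DEBUG WATCH shows the watched-keys state added in this change.
print(r.execute_command("DEBUG", "WATCH"))
```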