For some of our commands we need to inject another transaction and another SinkReplyBuilder.
This results in error-prone injections of temporary objects into ConnectionContext.
Most commands just need Transaction and SinkReplyBuilder, so let's pass them explicitly.
The final goal will be to remove Transaction and reply_builder fields from ConnectionContext.
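For illustration, a minimal sketch of the direction (the type names and handler signature here are assumptions, not the real API):
```cpp
#include <iostream>
#include <string>

// Stand-ins for the real facade types.
struct Transaction {};
struct SinkReplyBuilder {
  void SendOk() { std::cout << "+OK\n"; }
};

// Before (roughly): handlers pulled tx/reply builder out of ConnectionContext,
// so callers had to inject temporaries there. After: both are explicit args.
void SetCmd(const std::string& key, const std::string& value, Transaction* tx,
            SinkReplyBuilder* rb) {
  (void)key, (void)value, (void)tx;  // elided: schedule tx, store the value
  rb->SendOk();
}

int main() {
  Transaction tx;
  SinkReplyBuilder rb;
  SetCmd("k", "v", &tx, &rb);  // no ConnectionContext injection needed
}
```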
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
This PR introduces the "DEBUG RECVSIZE ENABLE|DISABLE|tid"
command, which allows tracking of request sizes.
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
* chore: remove ToUpper calls in main_service
Also, test for IsPaused() first to avoid doing more checks for the common case.
---------
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
ConnectionContext.reply_builder can be injected and replaced by the service logic.
Before, dragonfly_connection accessed it via cc_->reply_builder in some places,
which led it to access the injected object. Moreover, EVAL commands can be offloaded
to another thread, and that thread could inject the object, making access to cc_->reply_builder_
non-thread-safe.
Now dragonfly_connection copies aside the reply_builder_ pointer and uses only this pointer for communicating with the client.
Also, remove redundant arguments.
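A minimal sketch of the idea with stand-in types (the real facade classes differ):
```cpp
#include <iostream>
#include <string>

// Stand-ins for the real facade types.
struct SinkReplyBuilder {
  void SendSimpleString(const std::string& s) { std::cout << s << "\n"; }
};

struct ConnectionContext {
  SinkReplyBuilder* reply_builder;  // may be swapped ("injected") by service logic
};

class Connection {
 public:
  explicit Connection(ConnectionContext* cc) : cc_(cc) {}

  void HandleRequests() {
    // Copy the pointer aside once; all client I/O goes through this copy, so a
    // concurrent injection into cc_->reply_builder cannot race with our sends.
    SinkReplyBuilder* rb = cc_->reply_builder;
    rb->SendSimpleString("+OK");
  }

 private:
  ConnectionContext* cc_;
};

int main() {
  SinkReplyBuilder rb;
  ConnectionContext cc{&rb};
  Connection(&cc).HandleRequests();
}
```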
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
Until now, we only tested Dragonfly against Redis 6.2. It appears that
something has changed in the way Redis sends stable sync commands, and
now they also forward `MULTI` and `EXEC` as part of their replication.
Since we do not allow all commands to run under `MULTI`/`EXEC`,
specifically `SELECT`, a Dragonfly replica of such servers failed these
commands and became inconsistent with the data on the master.
The proposed fix is to simply ignore (i.e. not execute) `MULTI`/`EXEC`
coming from a Redis/Valkey master, and run the commands within those
transactions individually, like we do for other transactions.
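A toy sketch of the approach (names are hypothetical, not the actual replica code):
```cpp
#include <iostream>
#include <string_view>

// The replica's stream executor skips MULTI/EXEC markers arriving from a
// Redis/Valkey master and executes the commands between them individually.
bool ShouldSkipOnReplica(std::string_view cmd) {
  return cmd == "MULTI" || cmd == "EXEC";
}

int main() {
  for (std::string_view cmd : {"MULTI", "SELECT", "SET", "EXEC"}) {
    if (ShouldSkipOnReplica(cmd))
      std::cout << "skip " << cmd << "\n";     // ignore the transaction marker
    else
      std::cout << "execute " << cmd << "\n";  // run the command individually
  }
}
```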
To test this, we randomly choose a Redis/Valkey server among 3 installed binaries and test against them.
* chore: Add `--allocator_tracker` for default tracking
Before, in order to use allocation tracker, one had to issue a `MEMORY
TRACK` command. This flag is identical to that, but allows starting
Dragonfly with certain ranges without issuing a command.
While here, fix a bug. Apparently, `absl::InlinedVector<>` has a bug in
the implementation of `max_size()`, so in practice we did not limit
the number of trackers. I switched to using `capacity()` instead, which I
tested and which works well.
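A sketch of the bound check under those assumptions (the names and the limit are illustrative):
```cpp
#include <cstddef>
#include <iostream>
#include "absl/container/inlined_vector.h"

struct TrackingRange {
  void* start;
  void* end;
};

constexpr size_t kMaxRanges = 4;
absl::InlinedVector<TrackingRange, kMaxRanges> ranges;

bool AddRange(void* start, void* end) {
  // capacity() reflects the actual inline limit; max_size() reported a huge
  // theoretical bound, so the old check never rejected anything.
  if (ranges.size() >= ranges.capacity())
    return false;
  ranges.push_back({start, end});
  return true;
}

int main() {
  int x;
  for (int i = 0; i < 6; ++i)
    std::cout << "add " << i << ": " << AddRange(&x, &x + 1) << "\n";  // last two fail
}
```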
Notes:
1. Currently the flag always adds 100% "sampling"; we can extend that in
the future if need be
2. I added the flag in `dfly_main.cc` with custom initialization,
because it's low level, and I couldn't get it reasonably working with
changes only to `allocation_tracker.cc`
* fixes
The problem - we used file write in non-direct mode when writing snapshots in epoll mode.
As a result, lots of data was cached in OS memory. But then, during the rename operation,
when we rename "xxx.dfs.tmp" into "xxx.dfs", the OS flushes the file caches and the thread
gets stuck in the rename system call for a long time.
The fix - use DIRECT mode and avoid caching the data in OS caches at all.
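A minimal POSIX sketch of the idea (filenames and sizes are illustrative; the real code goes through Dragonfly's own I/O layer):
```cpp
#define _GNU_SOURCE 1  // for O_DIRECT on Linux
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>

int main() {
  int fd = open("xxx.dfs.tmp", O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
  if (fd < 0) { perror("open"); return 1; }

  // O_DIRECT requires the buffer, offset and length to be aligned
  // (typically to 4096 bytes).
  constexpr size_t kAlign = 4096;
  void* buf = nullptr;
  if (posix_memalign(&buf, kAlign, kAlign) != 0) return 1;
  memset(buf, 'x', kAlign);

  if (write(fd, buf, kAlign) < 0) perror("write");  // bypasses the page cache
  close(fd);
  free(buf);

  // rename() is now fast: there are no dirty cached pages left to flush.
  if (rename("xxx.dfs.tmp", "xxx.dfs") != 0) perror("rename");
}
```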
Fixes #3895
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
BITPOS returns 0 for non-existent keys, according to Redis's
implementation.
BITPOS allows only 0 and 1 as the bit argument.
Signed-off-by: Denis K <kalantaevskii@gmail.com>
Use an intrusive queue that allows batching of scheduling calls instead of handling each call separately.
This optimization improves latency and throughput by 3-5%.
In addition, we expose batching statistics in the INFO transaction block.
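A toy sketch of the batching idea (not the actual transaction queue; this one drains in LIFO order):
```cpp
#include <atomic>
#include <iostream>

// Callers push intrusive nodes onto a lock-free stack; the consumer grabs the
// whole batch with a single atomic exchange and schedules it in one pass.
struct ScheduleNode {
  ScheduleNode* next = nullptr;
  int txid = 0;
};

class ScheduleQueue {
 public:
  void Push(ScheduleNode* n) {  // O(1), no allocation: the node is intrusive
    n->next = head_.load(std::memory_order_relaxed);
    while (!head_.compare_exchange_weak(n->next, n, std::memory_order_release,
                                        std::memory_order_relaxed)) {
    }
  }

  ScheduleNode* PopAll() {  // drain the entire batch at once
    return head_.exchange(nullptr, std::memory_order_acquire);
  }

 private:
  std::atomic<ScheduleNode*> head_{nullptr};
};

int main() {
  ScheduleQueue q;
  ScheduleNode a{nullptr, 1}, b{nullptr, 2};
  q.Push(&a);
  q.Push(&b);
  for (ScheduleNode* n = q.PopAll(); n; n = n->next)
    std::cout << "scheduling tx " << n->txid << "\n";  // one batch, one wakeup
}
```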
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
Today, some of the failures to load an RDB file passed via
`--dbfilename` cause Dragonfly to terminate with an error code. This is
ok and works as expected.
The problem is that the same code path is used for `DFLY LOAD`, which
means that if there's an error loading the file (such as corrupted
file), Dragonfly will exit instead of returning an error code to the
client.
This change fixes that, by only exiting in the code path which loads on
init.
Note to reviewer: apparently we can't call `Future::Get()` more than
once, as the first call resets the state of the future and drops the
previously saved value, so we use a Fiber here instead.
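By analogy, std::future behaves the same way - get() consumes the shared state:
```cpp
#include <future>
#include <iostream>

int main() {
  std::promise<int> p;
  std::future<int> f = p.get_future();
  p.set_value(42);

  std::cout << f.get() << "\n";                      // OK: 42; get() consumes the state
  std::cout << std::boolalpha << f.valid() << "\n";  // false: a second get()
                                                     // would be undefined
}
```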
* fix: Do not publish to connections without context
This is a rare case where a closed connection is kept alive while the
handling fiber yields, therefore leaving `cc_` (the connection context)
pointing to null for other fibers to see.
As far as I can see, this can only happen during server shutdown, but
there could be other cases that I have missed.
The test on its own does _not_ reproduce the crash, however with added
`ThisFiber::SleepFor()`s I could reproduce the crash:
* Right before `DispatchBrief()`
[here](e3214cb603/src/server/channel_store.cc (L154))
* Right after connection context `reset()`
[here](2ab480e160/src/facade/dragonfly_connection.cc (L750))
In any case, calling `SendPubMessageAsync()` to a connection where `cc_`
is null is a bug, and we fix that here.
* rewording
We do not acquire any locks for transactions that are executing optimistically. However, this is problematic for callbacks that need to preempt (e.g. because a journal is active).
---------
Signed-off-by: kostas <kostas@dragonflydb.io>
We currently support rdb files up to version 11. This is a blocker for people who want to migrate to dragonfly with newer versions of the format. As of now, there is only v12 and it only includes the addition of RDB_OPCODE_SLOT_INFO.
* adds support to load rdb files up to version 12
* reads and discards with a warning the contents of RDB_OPCODE_SLOT_INFO if found in the rdb file
---------
Signed-off-by: kostas <kostas@dragonflydb.io>
Today there's a cost to enabling AllocationTracker, even for rarely used
memory bands.
This PR slightly optimizes the "happy path" (i.e. allocations outside
the tracked range), and also for the case where we track 100% of the
allocations.
Also, add a unit test for this class.
We would like to stop passing MutableSlice as arguments, and removing ToUpper
is the first step toward that.
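A sketch of the alternative: compare case-insensitively instead of uppercasing the argument in place (the helper name is made up):
```cpp
#include <cctype>
#include <iostream>
#include <string_view>

// Compare a (read-only) argument against a command-table entry without
// mutating it, so no MutableSlice is required.
bool EqualsIgnoreCase(std::string_view a, std::string_view b) {
  if (a.size() != b.size()) return false;
  for (size_t i = 0; i < a.size(); ++i) {
    if (std::toupper(static_cast<unsigned char>(a[i])) !=
        std::toupper(static_cast<unsigned char>(b[i])))
      return false;
  }
  return true;
}

int main() {
  std::cout << EqualsIgnoreCase("get", "GET") << "\n";  // 1: no ToUpper needed
}
```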
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
Sometimes for large values during snapshot loading/saving we allocate a lot of extra memory. For that, we might need to manually run memory decommit for mimalloc to release memory pages back to the OS. This PR addresses that by manually running memory decommit after each shard finishes loading or saving a snapshot.
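A sketch of the decommit step, assuming mimalloc's public mi_collect() API (the wrapper function name is made up):
```cpp
#include <mimalloc.h>

// Called after a shard finishes loading or saving a snapshot: ask mimalloc to
// return freed pages to the OS instead of keeping them cached.
void DecommitShardMemory() {
  mi_collect(/*force=*/true);
}

int main() {
  void* p = mi_malloc(1 << 20);  // transient snapshot buffer
  mi_free(p);
  DecommitShardMemory();  // freed pages are decommitted, not retained
}
```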
---------
Signed-off-by: kostas <kostas@dragonflydb.io>
* chore: ClearInternal now can clear partially
Intended for future use - to deallocate large objects gradually.
Currently nothing is changed in the functionality besides some cleanups.
---------
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
* chore: Implement AddMany method
1. Fix a performance bug in Find2 that made redundant comparisons
2. Provide a method to StringSet that adds several items in a batch
3. Use AddMany inside set_family
Before:
```
BM_Add 4253939 ns 4253713 ns 991
```
After:
```
BM_Add 3482177 ns 3482050 ns 1206
BM_AddMany 3101622 ns 3101507 ns 1360
```
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
* chore: fixes
---------
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
Clean up the interface a bit. AddOrFindDense does not make much sense as a single function
because it does not provide any performance benefit - we still must perform a lookup
before inserting. AddSds should have been removed a long time ago.
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
* feat: add DenseSet::IteratorBase::SetExpiryTime
This commit is in preparation for adding FIELDEXPIRE and HEXPIRE.
* fix: 0 is a valid input for MakeSetSds
* feat(rdb_load): add support for loading huge hmaps
* feat(rdb_load): add support for loading huge zsets
* feat(rdb_load): log DFATAL when append fails
Previously, we did not allow notify_keyspace_events to be set at runtime via the CONFIG SET command.
* allow notify_keyspace_events in config set command
* add tests
---------
Signed-off-by: kostas <kostas@dragonflydb.io>
Seems that lpStringToInt64 is fairly optimized.
On c7g:
```
BM_LpString2Int/1 27383 ns 27381 ns 204149
BM_LpString2Int/2 24535 ns 24534 ns 227981
```
so SimpleAtoi gives only a ~10% improvement.
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
* chore: introduce an optimized integer compare algorithm for lists
Problem: when lists are long, OpRem spends lots of time comparing an element with records in the list.
For integer-based lists, most of the time is spent in lpCompare.
In addition, lpGet has lots of branches that penalize integer use-cases.
This PR:
1. Introduces lpGet2 - an improved version with fewer branches.
2. Benchmarks lpCompare vs an algorithm that compares records to an integer.
3. Benchmarks lpGet vs lpGet2
```
----------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------
BM_LpCompare 1187 ns 1187 ns 4715144
BM_LpCompareInt 371 ns 371 ns 15216611
BM_LpGet/1 265 ns 265 ns 21473149
BM_LpGet/2 214 ns 214 ns 26075164
```
There are no functional changes to the Dragonfly code.
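A toy illustration of the comparison idea (not the listpack API; the types are made up):
```cpp
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// A record in an integer-encoded list keeps the decoded value; comparing it
// to an integer needle needs one branch and no string round-trip.
struct Record {
  bool is_int;
  int64_t ival;
  std::string sval;  // used only for non-integer encodings
};

bool MatchesInt(const Record& r, int64_t needle) {
  return r.is_int && r.ival == needle;
}

int main() {
  std::vector<Record> list{{true, 7, {}}, {true, 42, {}}};
  for (const Record& r : list)
    if (MatchesInt(r, 42)) std::cout << "found\n";
}
```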
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
* chore: fixes
---------
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
Previously XREAD:
1. Fetched group consumer info (with a hop)
2. Looked up last stream ids (with a hop)
3. Determined if there are entries to read (on the coordinator)
4. Dispatched a hop to read entries, or resorted to blocking mode
Instead, we can merge steps (1), (2) and (3) into a single step, optionally with step (4) for single-shard operations that can avoid concluding (we had this optimization before).
1. Issue a ping upon connect and add a comment why.
2. Log an error if dfly_bench disconnects before all the requests were processed.
3. Refactor the memcache parsing code into a ParseMC function.
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
A small refactoring to improve the flow of ScheduleInternal() as well as
to prepare it for the next change that will reduce the CPU load from the shard queue.
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
* fix bitop creating the dst key if the result is empty
* fix replicating dst with the wrong type
* make bitop a blind update (similar to set command)
---------
Signed-off-by: kostas <kostas@dragonflydb.io>
* chore: Forbid replicating a replica
We do not support connecting a replica to a replica, but before this PR
we allowed doing so. This PR disables that behavior.
Fixes #3679
* `replicaof_mu_`
* chore: some renames + fix a typo in RETURN_ON_BAD_STATUS
Renames in transaction.h - no functional changes.
Fix a typo in error.h following #3758
---------
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
fix 1: in the multi command squasher the error message was not set, therefore it was not printed to the log for the relevant command, only on EXEC; fixed by setting the last error in CapturingReplyBuilder::SendError
fix 2: cached error replies were not cleared before the command is invoked
---------
Signed-off-by: adi_holden <adi@dragonflydb.io>
Co-authored-by: kostas <kostas@dragonflydb.io>
* fix: improve BreakStalledFlowsInShard heuristic
Before this change, we wrote in a single call whatever record chunks we pulled from the channel.
This can be problematic for 1GB chunks, for example, which might take 10sec to write.
Lately we added a replication breaker on the master side that breaks the full sync after
a predefined threshold has passed. By default it was 10sec. To improve the robustness of this
breaker, we now write chunks of up to 1MB and update last_write_time_ns_ more frequently.
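A sketch of the chunked-write loop (names are assumptions; stdout stands in for the replication socket):
```cpp
#include <cstdint>
#include <cstdio>
#include <ctime>

constexpr size_t kMaxChunk = 1 << 20;  // 1MB per write call

uint64_t NowNs() {
  timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return uint64_t(ts.tv_sec) * 1'000'000'000ULL + uint64_t(ts.tv_nsec);
}

void WriteChunked(const char* data, size_t len, uint64_t* last_write_time_ns) {
  while (len > 0) {
    size_t n = len < kMaxChunk ? len : kMaxChunk;
    fwrite(data, 1, n, stdout);     // stand-in for the socket write
    *last_write_time_ns = NowNs();  // progress is visible to the stall breaker
    data += n;
    len -= n;
  }
}

int main() {
  uint64_t last_ns = 0;
  const char payload[] = "record-chunk";
  WriteChunked(payload, sizeof(payload) - 1, &last_ns);
  fprintf(stderr, "\nlast write at %llu ns\n", (unsigned long long)last_ns);
}
```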
Also, we added more logs to make replication delays on both sides more visible.
We also added logs for breaking the replication on the master side.
Unfortunately, this did not help make BreakStalledFlowsInShard more robust, because now the
problem moved to the replica side, which may take 20s+ to parse huge values.
Therefore, I increased the threshold for breaking the replication to 30s.
Finally, instrument the GetMetrics call, as it sometimes takes more than 1 sec.
---------
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
* chore: introduce a Clone function for the dense set
We use a state machine to prefetch data in batches.
After this change, the hot spots are predominantly inside ObjectClone and
Hash methods.
All in all benchmarks show ~45% CPU reduction:
```
BM_Clone/elements:32000 1517322 ns 1517338 ns 2772
BM_Fill/elements:32000 841087 ns 841097 ns 4900
```
---------
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
Before this change, whenever Dragonfly was paused (either by a user or by
internal processes like takeover or slot migration finalization),
migrations and replications were also paused.
This could cause timing issues, which sometimes resulted in migration
failures. Specifically, when 2 nodes have migrations from one to the
other **in parallel** (A->B and B->A), the `Pause()` that happens on A
(which happens because it's a source node) will stop it from processing
incoming traffic from B (incoming because it is also a target node).
If timed correctly, it will be locked until it times out, and so the
migration will fail.
The fix is to prevent replications and migrations from adhering to
`Pause()`s, which I think should not have happened in the first place
because they should use the admin port anyway.
Fixes #3319
For some cases, this map can grow indefinitely.
This change makes it less detailed by making sure that the number of possible keys is bounded.
Still, it can provide a good summary of the nature of exec transactions.
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
chore: deprecate RecordsPopper and serialize channel records during push
The records channel is redundant for DFS/replication because we have a single producer/consumer
scenario, with both running on the same thread. Unfortunately, we need it for RDB snapshotting.
For non-RDB cases we could just pass an io sink to the snapshot producer,
so that it would use it directly instead of StringFile inside FlushChannelRecord.
This would reduce memory usage, eliminate yet another memory copy and generally would make everything simpler.
For that to work, we must serialize the order of FlushChannelRecord, and that's implemented by
this PR. Also fixes #3658.
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
* chore: cosmetic changes around Snapshot functions
Some renames and added comments. Refactored StartIncremental into a separate function
without any functional changes.
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
* chore: fix comments
---------
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
Also, remove the dependency on the bloated absl::TimeZone monstrosity, which was required by
the absl::FormatTime API, even though we do not actually format a timezone.
When absl::LocalTimeZone is accessed, it allocates hundreds of thousands of bytes
for each shard thread (maybe due to lack of thread safety during lazy initialization).
In the end, strftime does a great job without any shenanigans.
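A minimal sketch of the strftime-based formatting (the format string is illustrative):
```cpp
#include <cstdio>
#include <ctime>

int main() {
  time_t now = time(nullptr);
  struct tm tm_buf;
  localtime_r(&now, &tm_buf);  // thread-safe variant of localtime

  char buf[64];
  strftime(buf, sizeof(buf), "%Y-%m-%dT%H:%M:%S", &tm_buf);
  printf("%s\n", buf);  // no timezone database, no per-thread allocations
}
```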
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
fixes #3636
The problem was that instead of implementing the GT operator, we implemented
!LT, which is the GE operator. As a result, the iterators inside the sort algorithm reached
out of bounds.
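A self-contained illustration of the bug class with std::sort (not the actual Dragonfly code):
```cpp
#include <algorithm>
#include <iostream>
#include <vector>

// std::sort requires a strict weak ordering. Implementing "greater" as
// !(a < b) actually yields ">=": equal elements then compare "greater" in
// both directions, and the sort's iterators can run out of bounds.
struct BadGreater {
  bool operator()(int a, int b) const { return !(a < b); }  // this is GE, not GT
};
struct Greater {
  bool operator()(int a, int b) const { return b < a; }  // proper GT
};

int main() {
  std::vector<int> v(100, 5);  // many equal keys expose the bug
  // std::sort(v.begin(), v.end(), BadGreater{});  // undefined behavior, may crash
  std::sort(v.begin(), v.end(), Greater{});        // fine
  std::cout << v.front() << "\n";
}
```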
---------
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
* fix: crash when passing empty arguments
Fix the case where we pass an empty argument, which is then parsed as an
empty string view with a null pointer. The null pointer is not handled correctly
by our low-level C code, hence switch to using ""sv for that.
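A small demonstration of the difference (standard C++ behavior, not Dragonfly code):
```cpp
#include <cstdio>
#include <string_view>

using namespace std::literals;

int main() {
  std::string_view empty_null{};     // data() == nullptr, size() == 0
  std::string_view empty_ok = ""sv;  // data() != nullptr, size() == 0

  // C code doing strlen(arg.data()) would dereference nullptr for the first
  // one, while the second is always safe to pass down.
  printf("%d %d\n", empty_null.data() != nullptr, empty_ok.data() != nullptr);  // 0 1
}
```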
* chore: add more list asserts + improve test_hypothesis
---------
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
1. The offset value can be negative; in that case we should return an empty array.
2. Fix edge cases of inf*0 and -inf + inf, so they will result in 0 and not NaN (similarly to Valkey).
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
For legacy mode:
1. For mutate commands, an empty result should throw an error.
2. For read commands, nil is returned if the path was not found, but if it was matched
only with wrong types, an error is thrown.
For non-legacy mode, objlen should throw an error for a non-existing key.
Simplified JsonCallbackResult a bit and made sure more fakeredis tests are passing.
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
* feat: add slave_repl_offset to the replication section.
In Valkey, slave_repl_offset denotes the replication offset on the replica side during the stable sync phase.
During the full sync phase it appears with value 0.
In Dragonfly this field appears only after full sync has completed, thus it allows
checking whether Dragonfly has reached the stable sync phase. The value of this field describes the cumulative progress
of all the replication flows and does not directly correspond to master-side metrics.
In addition, this PR fixes a bug in the wait_available_async() function in our replication tests.
This function is intended to wait until a replica reaches a stable state, and it did so by sending PINGs until they no longer
respond with a LOADING error, hence the assumption is that the replica is already in the full sync state.
However, it can happen that master_link_status is "up" but the replica has not reached the full sync state, and the PING will succeed
just because wait_available_async() was called before full sync started. The whole approach of polling the state is fragile.
Now we use `slave_repl_offset` explicitly to see if the replica has reached a stable state.
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
* chore: simplify wait_available_async
* chore: comments
---------
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
* chore: enable experimental_new_io by default.
It has been running for weeks with the flag on, so we enable it for the community as well.
---------
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
Signed-off-by: Vladislav Oleshko <vlad@dragonflydb.io>
Co-authored-by: Vladislav Oleshko <vlad@dragonflydb.io>
* fix: xreadgroup replies as a map for RESP3
Moreover, it returns data for all the streams, irrespective of whether they have results or not
(unlike XREAD).
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
* fix: properly handle xpending with 0 results
Also reject ENTRIESREAD instead of silently accepting it.
---------
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
* fix: JSON.STRAPPEND
JSON.STRAPPEND was completely broken.
First, it accepts exactly 3 arguments, i.e. a single value to append.
Second, the value must be a JSON string, not a regular string, meaning it must be in double quotes.
So, before we parsed: `JSON.STRAPPEND key $.field bar` and now we parse:
`JSON.STRAPPEND key $.field "bar"`
In addition, fixed the behavior of JSON.STRLEN to return a "no such key" error in case the
json key does not exist and a path is specified.
---------
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
Stop supporting DflyVersion::VER0, which is more than a year old.
In addition, rename Metrics fields to make them clearer.
General improvements and a fix for the reconnect metric.
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
* chore: improve compatibility of set and ping commands
smismember should return an array of longs and not an array of strings.
ping in subscribe mode returns an array for RESP2.
Also, fix double rounding for legacy float mode.
---------
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
1. Fix corner cases around non-existing keys.
2. Fix matching logic for the * glob, as well as the '' glob.
3. Improve SORT option parsing.
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
1. strlen should return 0 for non-existing keys.
2. Reject specifying both EX and PX options in SET.
3. Prevent overflow of expiry times that are too large.
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
* chore: allow limiting pipelining queue by length
We already allow limiting the queue by memory usage, but it also makes sense to limit it by depth,
so that in extreme cases we provide backpressure to client connections. Otherwise, if we parse and read everything,
clients have no sense of how loaded the connection is on the server side.
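A hypothetical sketch of the combined check (names and limits are assumptions):
```cpp
#include <cstddef>
#include <cstdio>

constexpr size_t kQueueMaxLen = 256;        // new: limit by depth
constexpr size_t kQueueMaxBytes = 8 << 20;  // existing: limit by memory

// While this returns true, the connection stops reading from the socket,
// so backpressure propagates to the client.
bool ShouldPauseRead(size_t queue_len, size_t queue_bytes) {
  return queue_len >= kQueueMaxLen || queue_bytes >= kQueueMaxBytes;
}

int main() {
  printf("%d\n", ShouldPauseRead(10, 1 << 10));   // 0: keep reading
  printf("%d\n", ShouldPauseRead(300, 1 << 10));  // 1: queue too deep
  printf("%d\n", ShouldPauseRead(10, 9 << 20));   // 1: queue too large
}
```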
---------
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
1. Fix a bug where the weight index was computed incorrectly in OpInter.
2. Remove code duplication and factor out the parsing code from OpInter and OpUnion into PrepareWeightedSets.
3. Implement a TODO and support the union of zsets and sets, which was already implemented for ZINTER.
fixes #3553
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
* fix: disable ThreadLocalMutex when big value serialization is off
* refactor: address comments
---------
Co-authored-by: Ubuntu <ubuntu@ip-172-31-28-29.ec2.internal>
Co-authored-by: Borys <borys@dragonflydb.io>
* feat(cluster): Allow appending RDB to existing store
The goal of this PR is to support the loading of multiple RDB files into a single server, e.g. when migrating from a Valkey cluster to Dragonfly with a different number of nodes.
It makes the following changes:
* Removes `DEBUG LOAD`, as we already have `DFLY LOAD`
* Adds `APPEND` option to `DFLY LOAD` (i.e. `DFLY LOAD <filename> APPEND`) that loads an RDB without first flushing the data store, overriding existing keys
* Does not load keys belonging to unowned slots, if in cluster mode
Fixes #2840
Background
We tried to be compatible with Valkey in their support of Lua flags, but we generally failed:
* We are not really compatible with Valkey because our flags are different (we reject unknown flags, and Valkey flags are unknown to us).
* The #!lua syntax doesn't work with Lua (# is not a comment), so scripts written for older versions of Redis can't be used with Dragonfly (i.e. they can't add Dragonfly flags and remain compatible with older Redis versions).
Changes
Instead of the previous syntax:
`#!lua flags=allow-undeclared-keys,disable-atomicity`
We now use this syntax:
`--!df flags=allow-undeclared-keys,disable-atomicity`
It is not backwards compatible (with older versions of Dragonfly), but it should be very easy to adapt to, and doesn't suffer from the disadvantages above.
Related to #3512
1. Before - when the 15% margin of total size became larger than 256MB, we increased the file by 256MB even though
we still had an unused 256MB. Now it's fixed.
2. We attempted to grow a file even if it had reached the limit, and then returned an error.
It is wrong to handle the "reached the limit" state as an error - it's just a logical constraint,
so now we do not attempt to grow and handle this quietly.
3. There was a segfault bug in the extent tree in the case where the existing segment and the new one need to be united.
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
* chore: reduce pipelining latency by reusing existing shard fibers
To prove the benefits, run `./dfly_bench --pipeline=50 -n 20000 --ratio 0:1 --qps=0 --key_maximum=1`
Before: the average pipelining latency was 10ms
After: the average pipelining latency is 5ms.
Avg latency: pipelined_latency_usec / total_pipelined_squashed_commands
Also, improved the counting of squashed commands, to count only the ones actually squashed.
---------
Signed-off-by: Roman Gershman <roman@dragonflydb.io>