The bug:
Calling `lua_error` does not return; instead it unwinds the Lua call stack until an error handler is found or the
script exits. This led to memory leaks for objects that should release memory in their destructors.
A specific example is `absl::FixedArray<string_view, 4> args(argc);`, which allocates on the heap if `argc > 4`. That memory was never freed, causing the leak.
The fix:
Add a scope to the function so that the destructor is called before raising the error.
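A minimal sketch of the scoping pattern, assuming an illustrative command handler and error path (not the actual Dragonfly code):

```cpp
#include <string_view>

#include "absl/container/fixed_array.h"

extern "C" {
#include "lua.h"
}

// Hypothetical sketch of the fix; the handler name and error path are illustrative.
static int CallCommand(lua_State* lua) {
  int argc = lua_gettop(lua);
  bool failed = false;
  {
    // Heap-allocates when argc > 4; keep it in an inner scope so the destructor
    // runs before lua_error longjmps out of this function.
    absl::FixedArray<std::string_view, 4> args(argc);
    // ... fill args, run the command, set `failed` on error ...
  }  // args is destroyed here, releasing any heap allocation.

  if (failed) {
    lua_pushstring(lua, "command failed");
    return lua_error(lua);  // never returns: unwinds the Lua stack.
  }
  return 0;
}
```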
Signed-off-by: adi_holden <adi@dragonflydb.io>
There are actually a few failures fixed in this PR, only one of which is a test bug:
* `db_slice_->Traverse()` can yield, causing `fiber_cancelled_`'s value to change
* When a migration is cancelled, it may never finish `WaitForInflightToComplete()` because it has `in_flight_bytes_` that will never reach its destination due to the cancellation
* `IterateMap()` with numeric key/values overwrote the key's buffer with the value's buffer (see the sketch after this list)
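A hedged sketch of the buffer-aliasing problem and the fix; `FormatInt` and the buffer sizes are illustrative, not the actual map serialization code:

```cpp
#include <cstddef>
#include <cstdio>
#include <string_view>

// Illustrative helper: format an integer into the supplied buffer.
static std::string_view FormatInt(char* buf, size_t len, long long v) {
  int n = std::snprintf(buf, len, "%lld", v);
  return std::string_view(buf, static_cast<size_t>(n));
}

void EmitPairBuggy(long long k, long long v) {
  char buf[32];
  std::string_view key = FormatInt(buf, sizeof(buf), k);
  std::string_view val = FormatInt(buf, sizeof(buf), v);  // overwrites the key's bytes
  // Both views now point at the value's digits, so the key is lost.
  (void)key; (void)val;
}

void EmitPairFixed(long long k, long long v) {
  char kbuf[32], vbuf[32];  // one buffer per field
  std::string_view key = FormatInt(kbuf, sizeof(kbuf), k);
  std::string_view val = FormatInt(vbuf, sizeof(vbuf), v);
  // Key and value stay independent until the pair is emitted.
  (void)key; (void)val;
}
```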
Fixes #4207
Should work only for self-hosted runners.
The core files will be kept in /var/crash/.
We also automatically copy the dragonfly binary into /var/crash so we can debug the cores later.
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
Dragonfly responds to ASCII-based requests on the TLS port with:
`-ERR Bad TLS header, double check if you enabled TLS for your client.`
Therefore, it is now possible to test both TLS and non-TLS ports with a plain-text PING.
Fixes #4171
Also, blacklist the bloom-filter test since Dragonfly does not support bloom filters yet.
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
* feat: Huge values breakdown in cluster migration
Before this PR we used `RESTORE` commands for transferring data between
source and target nodes in cluster slots migration.
While this _works_, it has a side effect of consuming 2x memory for huge
values (i.e. if a single key's value takes 10gb, serializing it will
take 20gb or even 30gb).
With this PR we break down huge keys into multiple commands (`RPUSH`,
`HSET`, etc), respecting the existing `--serialization_max_chunk_size`
flag.
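A hedged sketch of the chunked serialization for a huge list; the `emit` callback and chunk accounting are illustrative stand-ins, not the actual serializer API:

```cpp
#include <functional>
#include <string>
#include <string_view>
#include <vector>

// Emit a huge list as a series of RPUSH commands instead of one monolithic
// RESTORE payload, flushing whenever the accumulated chunk reaches the limit
// (conceptually, --serialization_max_chunk_size).
using EmitCmd =
    std::function<void(std::string_view cmd, std::string_view key,
                       const std::vector<std::string>& args)>;

void SerializeHugeList(std::string_view key, const std::vector<std::string>& elems,
                       size_t max_chunk_bytes, const EmitCmd& emit) {
  std::vector<std::string> chunk;
  size_t chunk_bytes = 0;
  for (const std::string& e : elems) {
    chunk.push_back(e);
    chunk_bytes += e.size();
    if (chunk_bytes >= max_chunk_bytes) {
      emit("RPUSH", key, chunk);  // target applies this chunk incrementally
      chunk.clear();
      chunk_bytes = 0;
    }
  }
  if (!chunk.empty())
    emit("RPUSH", key, chunk);  // flush the remainder
}
```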
Part of #4100
1. Better logging in regtests
2. Release resources in dfly_main in a more controlled manner.
3. Switch to ignoring signals when unregistering signal handlers during shutdown (see the sketch below).
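A minimal sketch of item 3, using the standard `<csignal>` API; the exact signal list is illustrative:

```cpp
#include <csignal>

// During shutdown, switch to SIG_IGN instead of restoring the default handlers,
// so a late signal is ignored rather than terminating the process while
// resources are still being released.
void IgnoreShutdownSignals() {
  for (int sig : {SIGINT, SIGTERM}) {  // illustrative signal list
    std::signal(sig, SIG_IGN);
  }
}
```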
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
* fix: Huge entries fail to load outside RDB / replication
We have an internal utility tool that we use to deserialize values in
some use cases:
* `RESTORE`
* Cluster slot migration
* `RENAME`, if the source and target shards are different
We [recently](https://github.com/dragonflydb/dragonfly/issues/3760)
changed this area of the code, which caused this regression because the new code only
handled RDB / replication streams.
Fixes #4143
* fix: enforce load limits when loading snapshot
Prevent loading snapshots whose used memory is higher than the max memory limit.
1. Store the used memory metadata only inside the summary file
2. Load the summary file before loading anything else, and if its used-memory is higher than the limit,
abort the load (a sketch of the check follows).
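A hedged sketch of the check, with illustrative names for the summary metadata and the limit; not the actual loader code:

```cpp
#include <cstdint>
#include <iostream>

// Hypothetical summary metadata read from the snapshot's summary file.
struct SnapshotSummary {
  uint64_t used_mem_bytes = 0;
};

// Returns false when the snapshot's recorded used memory exceeds the configured
// max memory limit; the caller then aborts the load before reading data files.
bool SnapshotFitsInMemory(const SnapshotSummary& summary, uint64_t max_memory_bytes) {
  if (max_memory_bytes > 0 && summary.used_mem_bytes > max_memory_bytes) {
    std::cerr << "snapshot used memory " << summary.used_mem_bytes
              << " exceeds max memory limit " << max_memory_bytes
              << ", aborting load\n";
    return false;
  }
  return true;
}
```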
---------
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
The regression was caused by #3947 and it causes crashes in bullmq.
It was not found until now because the Python client sends commands in uppercase.
Fixes #4113
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
Co-authored-by: Kostas Kyrimis <kostas@dragonflydb.io>
Fixes #3896. Now we retry several times.
In my checks this should significantly reduce the failure probability.
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
The test is flaky because it relies on the producer (the pytest) sending a bunch of commands fast enough before they get dispatched synchronously, so I increased the load.
It appears that newer versions of the GitHub runner require more memory. Some cases of the test test_rss_used_mem_gap allocate as much as 6.5-7 GB of memory, leaving barely 0.5 GB to the runner (7.5 GB available in total), which sometimes causes the instance to run out of memory.
* fix: Fix `test_flushall_in_full_sync`
This test failed in CI many times. The issue was that we reached stable
sync too quickly and missed the full sync stage.
I changed the seeder to add 100k (instead of 30k) keys for the stage to
take longer.
* StaticSeeder
Until now, we only tested Dragonfly against Redis 6.2. It appears that
something has changed in the way Redis sends stable sync commands, and
now they also forward `MULTI` and `EXEC` as part of their replication.
Since we do not allow all commands to run under `MULTI`/`EXEC`,
specifically `SELECT`, a Dragonfly replica of such servers failed these
commands and became inconsistent with the data on the master.
The proposed fix is to simply ignore (i.e. not execute) `MULTI`/`EXEC`
coming from a Redis/Valkey master, and run the commands within those
transactions individually, like we do for other transactions.
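A minimal sketch of the filtering, assuming an illustrative helper in the replica's dispatch path; only the MULTI/EXEC handling comes from the description above:

```cpp
#include <string_view>

#include "absl/strings/match.h"  // absl::EqualsIgnoreCase

// When consuming the replication stream from a Redis/Valkey master, drop the
// MULTI/EXEC markers themselves; the wrapped commands are then executed
// individually, like other replicated commands.
bool ShouldIgnoreReplicatedCommand(std::string_view cmd) {
  return absl::EqualsIgnoreCase(cmd, "MULTI") || absl::EqualsIgnoreCase(cmd, "EXEC");
}
```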
To test this, we randomly choose a Redis/Valkey server from 3 available
installed binaries and test against it.
* fix: Do not publish to connections without context
This is a rare case where a closed connection is kept alive while the
handling fiber yields, thereby leaving `cc_` (the connection context)
pointing to null for other fibers to see.
As far as I can see, this can only happen during server shutdown, but
there could be other cases that I have missed.
The test on its own does _not_ reproduce the crash; however, with added
`ThisFiber::SleepFor()` calls I could reproduce it:
* Right before `DispatchBrief()`
[here](e3214cb603/src/server/channel_store.cc (L154))
* Right after connection context `reset()`
[here](2ab480e160/src/facade/dragonfly_connection.cc (L750))
In any case, calling `SendPubMessageAsync()` on a connection whose `cc_`
is null is a bug, and we fix that here.
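A hedged sketch of the guard; the types here are illustrative stand-ins, and only `SendPubMessageAsync` and `cc_` mirror names from the description:

```cpp
#include <string>

struct PubMessage {
  std::string channel, payload;
};

struct ConnectionContext {};

struct Connection {
  ConnectionContext* cc_ = nullptr;  // reset to null while the connection shuts down
  void SendPubMessageAsync(const PubMessage&) { /* enqueue for the socket fiber */ }
};

// A closed connection can still be visible to other fibers while its handling
// fiber yields, with cc_ already reset; skip publishing in that case.
void PublishTo(Connection* conn, const PubMessage& msg) {
  if (conn == nullptr || conn->cc_ == nullptr)
    return;  // connection is going away; publishing would touch a null context
  conn->SendPubMessageAsync(msg);
}
```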
* rewording