Another PR on my quest to a `*_path` variant for every secret. Adds two
config options `admin_token_path` and `client_secret_path` to the
experimental config under `experimental_features.msc3861`. Also includes
tests.
I tried to be a good citizen here by following `attrs` conventions and
not rewriting the corresponding non-path variants in the class, but
instead adding methods to retrieve the value.
Reading secrets from files has the security advantage of separating the
secrets from the config. It also simplifies secrets management in
Kubernetes. Also useful to NixOS users.
When purging history, we try and delete any state groups that become
unreferenced (i.e. there are no longer any events that directly
reference them). When we delete a state group that is referenced by
another state group, we "de-delta" that state group so that it no longer
refers to the state group that is deleted.
There are two bugs with this approach that we fix here:
1. There is a common pattern where we end up storing two state groups
when persisting a state event: the state before and after the new state
event, where the latter is stored as a delta to the former. When
deleting state groups we only deleted the "new" state and left (and
potentially de-deltaed) the old state. This was due to a bug/typo when
trying to find referenced state groups.
2. There are times where we store unreferenced state groups in the DB,
during the purging of history these would not get rechecked and instead
always de-deltaed. Instead, we should check for this case and delete any
unreferenced state groups rather than de-deltaing them.
The effect of the above bugs is that when purging history we'd end up
with lots of unreferenced state groups that had been de-deltaed (i.e.
stored as the full state). This can lead to dramatic increases in
storage space used.
Currently we don't really have anything that stops us from deleting
state groups when an in-flight event references it. This is a fairly
rare race currently, but we want to be able to more aggressively delete
state groups so it is important to address this to ensure that the
database remains valid.
This implements the locking, but doesn't actually use it.
See the class docstring of the new data store for an explanation for how
this works.
---------
Co-authored-by: Devon Hudson <devon.dmytro@gmail.com>
This PR changes the logic so that deactivated users are always ignored.
Suspended users were already effectively ignored as Synapse forbids a
join while suspended.
---------
Co-authored-by: Devon Hudson <devon.dmytro@gmail.com>
This also happens for rejecting an invite. Basically, any out-of-band membership transition where we first get the membership as an `outlier` and then rely on federation filling us in to de-outlier it.
This PR mainly addresses automated test flakiness, bots/scripts, and options within Synapse like [`auto_accept_invites`](https://element-hq.github.io/synapse/v1.122/usage/configuration/config_documentation.html#auto_accept_invites) that are able to react quickly (before federation is able to push us events), but also helps in generic scenarios where federation is lagging.
I initially thought this might be a Synapse consistency issue (see issues labeled with [`Z-Read-After-Write`](https://github.com/matrix-org/synapse/labels/Z-Read-After-Write)) but it seems to be an event auth logic problem. Workers probably do increase the number of possible race condition scenarios that make this visible though (replication and cache invalidation lag).
Fix https://github.com/element-hq/synapse/issues/15012
(probably fixes https://github.com/matrix-org/synapse/issues/15012 (https://github.com/element-hq/synapse/issues/15012))
Related to https://github.com/matrix-org/matrix-spec/issues/2062
Problems:
1. We don't consider [out-of-band membership](https://github.com/element-hq/synapse/blob/develop/docs/development/room-dag-concepts.md#out-of-band-membership-events) (outliers) in our `event_auth` logic even though we expose them in `/sync`.
1. (This PR doesn't address this point) Perhaps we should consider authing events in the persistence queue as events already in the queue could allow subsequent events to be allowed (events come through many channels: federation transaction, remote invite, remote join, local send). But this doesn't save us in the case where the event is more delayed over federation.
### What happened before?
I wrote some Complement test that stresses this exact scenario and reproduces the problem: https://github.com/matrix-org/complement/pull/757
```
COMPLEMENT_ALWAYS_PRINT_SERVER_LOGS=1 COMPLEMENT_DIR=../complement ./scripts-dev/complement.sh -run TestSynapseConsistency
```
We have `hs1` and `hs2` running in monolith mode (no workers):
1. `@charlie1:hs2` is invited and joins the room:
1. `hs1` invites `@charlie1:hs2` to a room which we receive on `hs2` as `PUT /_matrix/federation/v1/invite/{roomId}/{eventId}` (`on_invite_request(...)`) and the invite membership is persisted as an outlier. The `room_memberships` and `local_current_membership` database tables are also updated which means they are visible down `/sync` at this point.
1. `@charlie1:hs2` decides to join because it saw the invite down `/sync`. Because `hs2` is not yet in the room, this happens as a remote join `make_join`/`send_join` which comes back with all of the auth events needed to auth successfully and now `@charlie1:hs2` is successfully joined to the room.
1. `@charlie2:hs2` is invited and and tries to join the room:
1. `hs1` invites `@charlie2:hs2` to the room which we receive on `hs2` as `PUT /_matrix/federation/v1/invite/{roomId}/{eventId}` (`on_invite_request(...)`) and the invite membership is persisted as an outlier. The `room_memberships` and `local_current_membership` database tables are also updated which means they are visible down `/sync` at this point.
1. Because `hs2` is already participating in the room, we also see the invite come over federation in a transaction and we start processing it (not done yet, see below)
1. `@charlie2:hs2` decides to join because it saw the invite down `/sync`. Because `hs2`, is already in the room, this happens as a local join but we deny the event because our `event_auth` logic thinks that we have no membership in the room ❌ (expected to be able to join because we saw the invite down `/sync`)
1. We finally finish processing the `@charlie2:hs2` invite event from and de-outlier it.
- If this finished before we tried to join we would have been fine but this is the race condition that makes this situation visible.
Logs for `hs2`:
```
🗳️ on_invite_request: handling event <FrozenEventV3 event_id=$PRPCvdXdcqyjdUKP_NxGF2CcukmwOaoK0ZR1WiVOZVk, type=m.room.member, state_key=@user-2-charlie1:hs2, membership=invite, outlier=False>
🔦 _store_room_members_txn update room_memberships: <FrozenEventV3 event_id=$PRPCvdXdcqyjdUKP_NxGF2CcukmwOaoK0ZR1WiVOZVk, type=m.room.member, state_key=@user-2-charlie1:hs2, membership=invite, outlier=True>
🔦 _store_room_members_txn update local_current_membership: <FrozenEventV3 event_id=$PRPCvdXdcqyjdUKP_NxGF2CcukmwOaoK0ZR1WiVOZVk, type=m.room.member, state_key=@user-2-charlie1:hs2, membership=invite, outlier=True>
📨 Notifying about new event <FrozenEventV3 event_id=$PRPCvdXdcqyjdUKP_NxGF2CcukmwOaoK0ZR1WiVOZVk, type=m.room.member, state_key=@user-2-charlie1:hs2, membership=invite, outlier=True>
✅ on_invite_request: handled event <FrozenEventV3 event_id=$PRPCvdXdcqyjdUKP_NxGF2CcukmwOaoK0ZR1WiVOZVk, type=m.room.member, state_key=@user-2-charlie1:hs2, membership=invite, outlier=True>
🧲 do_invite_join for @user-2-charlie1:hs2 in !sfZVBdLUezpPWetrol:hs1
🔦 _store_room_members_txn update room_memberships: <FrozenEventV3 event_id=$bwv8LxFnqfpsw_rhR7OrTjtz09gaJ23MqstKOcs7ygA, type=m.room.member, state_key=@user-1-alice:hs1, membership=join, outlier=True>
🔦 _store_room_members_txn update room_memberships: <FrozenEventV3 event_id=$oju1ts3G3pz5O62IesrxX5is4LxAwU3WPr4xvid5ijI, type=m.room.member, state_key=@user-2-charlie1:hs2, membership=join, outlier=False>
📨 Notifying about new event <FrozenEventV3 event_id=$oju1ts3G3pz5O62IesrxX5is4LxAwU3WPr4xvid5ijI, type=m.room.member, state_key=@user-2-charlie1:hs2, membership=join, outlier=False>
...
🗳️ on_invite_request: handling event <FrozenEventV3 event_id=$O_54j7O--6xMsegY5EVZ9SA-mI4_iHJOIoRwYyeWIPY, type=m.room.member, state_key=@user-3-charlie2:hs2, membership=invite, outlier=False>
🔦 _store_room_members_txn update room_memberships: <FrozenEventV3 event_id=$O_54j7O--6xMsegY5EVZ9SA-mI4_iHJOIoRwYyeWIPY, type=m.room.member, state_key=@user-3-charlie2:hs2, membership=invite, outlier=True>
🔦 _store_room_members_txn update local_current_membership: <FrozenEventV3 event_id=$O_54j7O--6xMsegY5EVZ9SA-mI4_iHJOIoRwYyeWIPY, type=m.room.member, state_key=@user-3-charlie2:hs2, membership=invite, outlier=True>
📨 Notifying about new event <FrozenEventV3 event_id=$O_54j7O--6xMsegY5EVZ9SA-mI4_iHJOIoRwYyeWIPY, type=m.room.member, state_key=@user-3-charlie2:hs2, membership=invite, outlier=True>
✅ on_invite_request: handled event <FrozenEventV3 event_id=$O_54j7O--6xMsegY5EVZ9SA-mI4_iHJOIoRwYyeWIPY, type=m.room.member, state_key=@user-3-charlie2:hs2, membership=invite, outlier=True>
📬 handling received PDU in room !sfZVBdLUezpPWetrol:hs1: <FrozenEventV3 event_id=$O_54j7O--6xMsegY5EVZ9SA-mI4_iHJOIoRwYyeWIPY, type=m.room.member, state_key=@user-3-charlie2:hs2, membership=invite, outlier=False>
📮 handle_new_client_event: handling <FrozenEventV3 event_id=$WNVDTQrxy5tCdPQHMyHyIn7tE4NWqKsZ8Bn8R4WbBSA, type=m.room.member, state_key=@user-3-charlie2:hs2, membership=join, outlier=False>
❌ Denying new event <FrozenEventV3 event_id=$WNVDTQrxy5tCdPQHMyHyIn7tE4NWqKsZ8Bn8R4WbBSA, type=m.room.member, state_key=@user-3-charlie2:hs2, membership=join, outlier=False> because 403: You are not invited to this room.
synapse.http.server - 130 - INFO - POST-16 - <SynapseRequest at 0x7f460c91fbf0 method='POST' uri='/_matrix/client/v3/join/%21sfZVBdLUezpPWetrol:hs1?server_name=hs1' clientproto='HTTP/1.0' site='8080'> SynapseError: 403 - You are not invited to this room.
📨 Notifying about new event <FrozenEventV3 event_id=$O_54j7O--6xMsegY5EVZ9SA-mI4_iHJOIoRwYyeWIPY, type=m.room.member, state_key=@user-3-charlie2:hs2, membership=invite, outlier=False>
✅ handled received PDU in room !sfZVBdLUezpPWetrol:hs1: <FrozenEventV3 event_id=$O_54j7O--6xMsegY5EVZ9SA-mI4_iHJOIoRwYyeWIPY, type=m.room.member, state_key=@user-3-charlie2:hs2, membership=invite, outlier=False>
```
This is particularly a problem in a state reset scenario where the membership
might change without a corresponding event.
This PR is targeting a scenario where a state reset happens which causes
room membership to change. Previously, the cache would just hold onto
stale data and now we properly bust the cache in this scenario.
We have a few tests for these scenarios which you can see are now fixed
because we can remove the `FIXME` where we were previously manually
busting the cache in the test itself.
This is a general Synapse thing so by it's nature it helps out Sliding
Sync.
Fix https://github.com/element-hq/synapse/issues/17368
Prerequisite for https://github.com/element-hq/synapse/issues/17929
---
Match when are busting `_curr_state_delta_stream_cache`
Adds a query param `type` to `/_synapse/admin/v1/rooms/{room_id}/state`
that filters the state event query by state event type.
---------
Co-authored-by: Andrew Morgan <1342360+anoadragon453@users.noreply.github.com>
Fixes various `mypy` errors associated with Twisted `24.11.0`.
Hopefully addresses https://github.com/element-hq/synapse/issues/17075,
though I've yet to test against `trunk`.
Changes should be compatible with our currently pinned Twisted version
of `24.7.0`.
Supersedes https://github.com/element-hq/synapse/pull/17958.
Awkwardly, the changes made to fix the mypy errors in 1.12.1 cause
errors in 1.11.2. So you'll need to update your mypy version to 1.12.1
to eliminate typechecking errors during developing.
The existing `email.smtp_host` config option is used for two distinct
purposes: it is resolved into the IP address to connect to, and used to
(request via SNI and) validate the server's certificate if TLS is
enabled. This new option allows specifying a different name for the
second purpose.
This is especially helpful, if `email.smtp_host` isn't a global FQDN,
but something that resolves only locally (e.g. "localhost" to connect
through the loopback interface, or some other internally routed name),
that one cannot get a valid certificate for.
Alternatives would of course be to specify a global FQDN as
`email.smtp_host`, or to disable TLS entirely, both of which might be
undesirable, depending on the SMTP server configuration.
Another config option on my quest to a `*_path` variant for every
secret. This time it’s `macaroon_secret_key_path`.
Reading secrets from files has the security advantage of separating the secrets from the config. It also simplifies secrets management in Kubernetes. Also useful to NixOS users.
- Fetch the number of invites the provided user has sent after a given
timestamp
- Fetch the number of rooms the provided user has joined after a given
timestamp, regardless if they have left/been banned from the rooms
subsequently
- Get report IDs of event reports where the provided user was the sender
of the reported event
This is an implementation of MSC4190, which allows appservices to manage
their user's devices without /login & /logout.
---------
Co-authored-by: Andrew Morgan <1342360+anoadragon453@users.noreply.github.com>
Currently, when a new scheduled task is added and its scheduled time has
already passed, we set it to ACTIVE. This is problematic, because it
means it will jump the queue ahead of all other SCHEDULED tasks;
furthermore, if the Synapse process gets restarted, it will jump ahead
of any ACTIVE tasks which have been started but are taking a while to
run.
Instead, we leave it set to SCHEDULED, but kick off a call to
`_launch_scheduled_tasks`, which will decide if we actually have
capacity to start a new task, and start the newly-added task if so.
For context of why we delay read receipts, see
https://github.com/matrix-org/synapse/issues/4730.
Element Web often sends read receipts in quick succession, if it reloads
the timeline it'll send one for the last message in the old timeline and
again for the last message in the new timeline. This caused remote users
to see a read receipt for older messages come through quickly, but then
the second read receipt taking a while to arrive for the most recent
message.
There are two things going on in this PR:
1. There was a mismatch between seconds and milliseconds, and so we
ended up delaying for far longer than intended.
2. Changing the logic to reuse the `DestinationWakeupQueue` (used for
presence)
The changes in logic are:
- Treat the first receipt and subsequent receipts in a room in the same
way
- Whitelist certain classes of receipts to never delay being sent, i.e.
receipts in small rooms, receipts for events that were sent within the
last 60s, and sending receipts to the event sender's server.
- The maximum delay a receipt can have before being sent to a server is
30s, and we'll send out receipts to remotes at least at 50Hz (by
default)
The upshot is that this should make receipts feel more snappy over
federation.
This new logic should send roughly between 10%–20% of transactions
immediately on matrix.org.
To work around the fact that,
pre-https://github.com/element-hq/synapse/pull/17903, our database may
have old one-time-keys that the clients have long thrown away the
private keys for, we want to delete OTKs that look like they came from
libolm.
To spread the load a bit, without holding up other background database
updates, we use a scheduled task to do the work.
There was a bug that meant we would return the full state of the room on
incremental syncs when using lazy loaded members and there were no
entries in the timeline.
This was due to trying to use `state_filter or state_filter.all()` as a
short hand for handling `None` case, however `state_filter` implements
`__bool__` so if the state filter was empty it would be set to full.
c.f. MSC4222 and #17888
The latest Twisted release changed how they implemented `__await__` on
deferreds, which broke the machinery we used to test cancellation.
This PR changes things a bit to instead patch the `__await__` method,
which is a stable API. This mostly doesn't change the core logic, except
for fixing two bugs:
- We previously did not intercept all await points
- After cancellation we now need to not only unblock currently blocked
await points, but also make sure we don't block any future await points.
c.f. https://github.com/twisted/twisted/pull/12226
---------
Co-authored-by: Devon Hudson <devon.dmytro@gmail.com>
Currently, one-time-keys are issued in a somewhat random order. (In
practice, they are issued according to the lexicographical order of
their key IDs.) That can lead to a situation where a client gives up
hope of a given OTK ever being used, whilst it is still on the server.
Related: https://github.com/element-hq/element-meta/issues/2356
When entries insert in the end of timer queue, then unnecessary entry
inserted (with duplicated key).
This can lead to some timeouts expired early and consume memory.
Basically, if the client sets a special query param on `/sync` v2
instead of responding with `state` at the *start* of the timeline, we
instead respond with `state_after` at the *end* of the timeline.
We do this by using the `current_state_delta_stream` table, which is
actually reliable, rather than messing around with "state at" points on
the timeline.
c.f. MSC4222
The main change here is to add a helper function
`gather_optional_coroutines`, which works in a similar way as
`yieldable_gather_results` but takes a set of coroutines rather than a
function
Fixes#17823
While we're at it, makes a change where the redactions are sent as the
admin if the user is not a member of the server (otherwise these fail
with a "User must be our own" message).
Reset `sliding_sync_membership_snapshots` -> `forgotten` status when
membership changes (like rejoining a room).
Fix https://github.com/element-hq/synapse/issues/17781
### What was the problem before?
Previously, if someone used `/forget` on one of their rooms, it would
update `sliding_sync_membership_snapshots` as expected but when someone
rejoined the room (or had any membership change), the upsert didn't
overwrite and reset the `forgotten` status so it remained `forgotten`
and invisible down the Sliding Sync endpoint.