book/faqs: Talk about compression

2024-12-14 11:57:30 +00:00 · 2023-01-01 20:59:02 -07:00 · ee16664046
commit ee16664046
parent 0c1f362a62
1 changed files with 33 additions and 0 deletions
--- a/book/src/faqs.md
+++ b/book/src/faqs.md
@@ -1,5 +1,7 @@
 # FAQs

+<!-- TODO: Write more about design decisions in a separate section -->
+
 ## Does it replace [Cachix](https://www.cachix.org)?

 No, it does not.
@@ -29,6 +31,37 @@ Authentication is done via signed JWTs containing the allowed permissions.
 Each instance of `atticd --mode api-server` is stateless.
 This design may be revisited later, with option for a more stateful method of authentication.

+## How is compression handled?
+
+Uploaded NARs are compressed on the server before being streamed to the storage backend.
+We use the hash of the _uncompressed NAR_ to perform global deduplication.
+
+```
+                    ┌───────────────────────────────────►NAR Hash
+                    │
+                    │
+                    ├───────────────────────────────────►NAR Size
+                    │
+              ┌─────┴────┐  ┌──────────┐  ┌───────────┐
+ NAR Stream──►│NAR Hasher├─►│Compressor├─►│File Hasher├─►File Stream─►S3
+              └──────────┘  └──────────┘  └─────┬─────┘
+                                                │
+                                                ├───────►File Hash
+                                                │
+                                                │
+                                                └───────►File Size
+```
+
+At first glance, performing compression on the client and deduplicating the result may sound appealing, but this approach has two problems:
+
+1. Different compression algorithms and levels naturally produce different outputs, which can't be deduplicated against each other.
+2. Even with the same compression algorithm, the output is often non-deterministic: it can vary with the number of compression threads, the library version, and so on.
+
+When we compress on the server and use the hashes of uncompressed NARs for lookups, non-determinism is no longer an issue, since we only compress once.
+
+On the other hand, compressing on the server uses additional CPU, which increases compute costs and the need to scale.
+These design decisions will be revisited later.
+
 ## On what granularity is deduplication done?

 Currently, global deduplication is done on the level of NAR files.