aboutsummaryrefslogtreecommitdiff
diff options
context:
space:
mode:
authorkaotisk <kaotisk@arching-kaos.org>2024-06-14 08:42:56 +0300
committerkaotisk <kaotisk@arching-kaos.org>2024-06-14 08:42:56 +0300
commit48ba97f639e26141a51ca38691a86cb2f20aabb3 (patch)
tree613733f2f39b9c4e9c38e0f05c58f536b30ab3e3
parent66c9598126b9a80bfe8832f09ad583f063b710da (diff)
downloadarching-kaos-tools-48ba97f639e26141a51ca38691a86cb2f20aabb3.tar.gz
arching-kaos-tools-48ba97f639e26141a51ca38691a86cb2f20aabb3.tar.bz2
arching-kaos-tools-48ba97f639e26141a51ca38691a86cb2f20aabb3.zip
Added documentation about the AKFS
-rw-r--r--AKFS141
1 files changed, 141 insertions, 0 deletions
diff --git a/AKFS b/AKFS
new file mode 100644
index 0000000..a466ca6
--- /dev/null
+++ b/AKFS
@@ -0,0 +1,141 @@
+AKFS - Arching Kaos File System
+===============================
+
+> (previously named SMFS - Sha(512) Merkle File System)
+
+1. Introduction
+---------------
+This filesystem follows the merkle tree architecture. Given a SHA512 hash root,
+we can climb the branches, towards the leafs of the tree, keeping the order of
+the leaves, intact.
+
+In our case, leaves represent chunks of a splitted file.
+
+2. Analysis
+-----------
+Each file is encoded in base64, splitted in chunks of 4KB-4MB (chunk size will
+vary depending on the size of the original file) and then for each chunk, the
+SHA512 hash of it calculated.
+
+Then, starting from the first's chunk hash, we store it with the next's chunk
+hash in a text file. We move to the third and we repeat the process. The figure
+next, will help -hopefully- in understanding this process.
+
+
+ ___
+ | |\
+ .---------->|RUT'-|<---------.
+ | |HASH | |
+ | |_____| |
+ | |
+ | |
+ | |
+ ___ ___
+ | |\ | |\
+ .----->|BR0'-| |BR1'-|<-------.
+ | |HASH | |HASH | |
+ | |_____| |_____| |
+ | ^ ^ |
+ | | | |
+ | | | |
+ ___ | ___ ___ | ___
+ | |\ | | |\ | |\ | | |\
+ |CH1'-| '---|CH2'-| |CH3'-| | |CH4'-|
+ |HASH | |HASH | |HASH |------' |HASH |
+ |_____| |_____| |_____| |_____|
+
+Each "packet" of chunk hashes is called a "branch". Effectively, we repeat the
+process above for the branches as well, matching the pattern 1-2,3-4,..,(N-1)-N.
+
+We repeat this process until we have one branch as a result. This is called the
+"root" (depicted as RUT in the figure). The only note we take, is that if N is
+an odd number, we simply duplicate the last hash, for example, 1-2,3-3 and we
+ignore them when reconstructing the original file.
+
+To "merge" the leaves back to a file, we follow the process in reverse order.
+
+3. Benefits
+-----------
+Although storage nowadays is not a big concern, connectivity issues still are.
+Having small pieces of information about the file to get small and verifiable
+pieces of that file, helps with low-bandwidth connectivity and frequent
+disconnects.
+
+4. Influence
+------------
+The structure is heavily influenced by torrents and bitcoin.
+
+5. Implementation
+-----------------
+Currently, there are two bash scripts in the `bin` directory doing the file to
+hash and the hash to file operations, respectively:
+- `ak-sm-merkle-tree` -> `ak fs --add`
+- `ak-sm-merkle-tree-to-file` -> `ak fs --get`
+
+6. Specifications
+-----------------
+As part of arching-kaos, the tree is stored under the `~/.arching-kaos`
+directory.
+
+The initial file is converted to base64.
+The file is splitted depending on its size in chunks under the `ftr` directory.
+Branches are consisting of 2 hashes as strings, separated with a '\n'. The same
+applies for the root.
+
+7. Networking
+-------------
+Although networking capabilities are not part of the current implementation of
+the system, it's fairly easy to exchange branches and root hashes as 'metadata'
+and chunks/leaves as 'data'. A DHT-like implementation is under the works so
+nodes requesting whichever hash can discover nodes that have those.
+
+8. Bouquets of leaves (maps)
+----------------------------
+Based on the structure of the merkle trees we are producing here, another way is
+possible to share information about the leaves (chunks) of a file. Instead of
+packing those as two hashes of leaves in one branch, we would get all the leaves
+hashes ordered as leaf01,leaf02,...,leafN.
+
+For this kind of work see the following bash scripts:
+./bin/ak-sm-filejoiner
+./bin/ak-sm-filesplitter
+
+9. Storage
+----------
+In current implementations storage is as follows:
+- $AK_WORKDIR/fmrk branches,
+- $AK_WORKDIR/ftr leaves and
+- $AK_WORKDIR/fmp maps
+
+The files are NOT organized for now.
+
+To update this non structured into something more handy, we can use each digit
+of a given SHA512 hash as a directory name.
+
+For example given the hash:
+0d1e5cd136e004be8c9625b1037e2e908304f03998b94cd201f6ca8a125bab03385f3c9c11b3c7cb280fb6b6f1fcbf9b2877e48dad09c81fa0ff6e5e7412ad0e
+
+We should search or store or read in the following file:
+0/d/1/e/5/c/d/1/3/6/e/0/0/4/b/e/8/c/9/6/2/5/b/1/0/3/7/e/2/e/9/0/8/3/0/4/f/0/3/9/9/8/b/9/4/c/d/2/0/1/f/6/c/a/8/a/1/2/5/b/a/b/0/3/3/8/5/f/3/c/9/c/1/1/b/3/c/7/c/b/2/8/0/f/b/6/b/6/f/1/f/c/b/f/9/b/2/8/7/7/e/4/8/d/a/d/0/9/c/8/1/f/a/0/f/f/6/e/5/e/7/4/1/2/a/d/0/e
+
+In this way we would be storing up to 16 files per directory.
+
+Now, doubling the digits would get us 256 files per directory:
+0d/1e/5c/d1/36/e0/04/be/8c/96/25/b1/03/7e/2e/90/83/04/f0/39/98/b9/4c/d2/01/f6/ca/8a/12/5b/ab/03/38/5f/3c/9c/11/b3/c7/cb/28/0f/b6/b6/f1/fc/bf/9b/28/77/e4/8d/ad/09/c8/1f/a0/ff/6e/5e/74/12/ad/0e
+
+Finally, a 65536 files per directory approach would be with 3 digits per
+directory, like:
+0d1e/5cd1/36e0/04be/8c96/25b1/037e/2e90/8304/f039/98b9/4cd2/01f6/ca8a/125b/ab03/385f/3c9c/11b3/c7cb/280f/b6b6/f1fc/bf9b/2877/e48d/ad09/c81f/a0ff/6e5e/7412/ad0e
+
+But that is too much files/directory ratio.
+
+We can also reuse the original hash like:
+0d/1e/5c/d1/36/e0/04/be/8c/96/25/b1/03/7e/2e/90/83/04/f0/39/98/b9/4c/d2/01/f6/ca/8a/12/5b/ab/03/38/5f/3c/9c/11/b3/c7/cb/28/0f/b6/b6/f1/fc/bf/9b/28/77/e4/8d/ad/09/c8/1f/a0/ff/6e/5e/74/12/ad/0d1e5cd136e004be8c9625b1037e2e908304f03998b94cd201f6ca8a125bab03385f3c9c11b3c7cb280fb6b6f1fcbf9b2877e48dad09c81fa0ff6e5e7412ad0e
+
+But, on the other hand, we already know the hash from the path we walked into
+and we can recalculate the hash of the file we found, in case we want to verify
+the correctness of the path/file.
+
+To take care of all the above, a "driver" should be implemented, that it would
+be given a hash and it would return the location of the file. We can go for the
+2digit per directory approach.