AKFS - Arching Kaos File System
===============================

> (previously named SMFS - Sha(512) Merkle File System)

1. Introduction
---------------
This filesystem follows the Merkle tree architecture. Given a SHA512 root hash,
we can climb the branches towards the leaves of the tree, keeping the order of
the leaves intact.

In our case, leaves represent chunks of a split file.

2. Analysis
-----------
Each file is encoded in base64 and split into chunks of 4KB-4MB (the chunk size
varies depending on the size of the original file); the SHA512 hash of each
chunk is then calculated.

Then, starting with the first chunk's hash, we store it together with the next
chunk's hash in a text file. We move on to the third chunk and repeat the
process. The figure below will hopefully help in understanding this process.


                                 ___   
                                |   |\ 
                    .---------->|RUT'-|<---------.
                    |           |HASH |          |
                    |           |_____|          |
                    |                            |
                    |                            |
                    |                            |
                  ___                          ___   
                 |   |\                       |   |\ 
          .----->|BR0'-|                      |BR1'-|<-------.
          |      |HASH |                      |HASH |        |
          |      |_____|                      |_____|        |
          |         ^                            ^           |
          |         |                            |           |
          |         |                            |           |
        ___         |    ___         ___         |         ___   
       |   |\       |   |   |\      |   |\       |        |   |\ 
       |CH1'-|      '---|CH2'-|     |CH3'-|      |        |CH4'-|
       |HASH |          |HASH |     |HASH |------'        |HASH |
       |_____|          |_____|     |_____|               |_____|

Each "packet" of chunk hashes is called a "branch". Effectively, we repeat the
process above for the branches as well, matching the pattern 1-2,3-4,..,(N-1)-N.

We repeat this process until we are left with a single branch. This is called
the "root" (depicted as RUT in the figure). The only note to take is that if N
is an odd number, we simply duplicate the last hash (for example, 1-2,3-3) and
ignore the duplicate when reconstructing the original file.
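
The whole process can be sketched roughly in bash as follows. This is a minimal
illustration only, not the actual ak-sm-merkle-tree script; it assumes GNU
coreutils (base64, split, sha512sum), and the chunk size, temporary paths and
file names here are arbitrary:

    #!/usr/bin/env bash
    # Minimal sketch: split a file and build the branch/root structure.
    set -e

    input="$1"
    workdir="$(mktemp -d)"

    # 1. Encode the file in base64 and split it into 4KB chunks (leaves).
    base64 "$input" > "$workdir/encoded"
    split -b 4K -d "$workdir/encoded" "$workdir/chunk."

    # 2. Hash every chunk, keeping the original order.
    hashes=()
    for chunk in "$workdir"/chunk.*; do
        hashes+=("$(sha512sum "$chunk" | cut -d' ' -f1)")
    done

    # 3. Pair hashes 1-2, 3-4, ... into branches, duplicating the last
    #    hash when the count is odd, and repeat until a single branch
    #    (the root) remains. Each branch holds two hashes joined by '\n'.
    while [ "${#hashes[@]}" -gt 1 ]; do
        if [ $(( ${#hashes[@]} % 2 )) -eq 1 ]; then
            hashes+=("${hashes[-1]}")
        fi
        next=()
        for (( i = 0; i < ${#hashes[@]}; i += 2 )); do
            branch="$(printf '%s\n%s' "${hashes[i]}" "${hashes[i+1]}")"
            branch_hash="$(printf '%s' "$branch" | sha512sum | cut -d' ' -f1)"
            printf '%s' "$branch" > "$workdir/$branch_hash"
            next+=("$branch_hash")
        done
        hashes=("${next[@]}")
    done

    echo "root: ${hashes[0]}"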

To "merge" the leaves back to a file, we follow the process in reverse order.

3. Benefits
-----------
Although storage is not a big concern nowadays, connectivity issues still are.
Having small pieces of information about a file, from which we can fetch small
and verifiable pieces of that file, helps with low-bandwidth connections and
frequent disconnects.

4. Influence
------------
The structure is heavily influenced by torrents and bitcoin.

5. Implementation
-----------------
Currently, there are two bash scripts in the `bin` directory performing the
file-to-hash and hash-to-file operations, respectively:
- `ak-sm-merkle-tree` -> `ak fs --add`
- `ak-sm-merkle-tree-to-file` -> `ak fs --get`

6. Specifications
-----------------
As part of arching-kaos, the tree is stored under the `~/.arching-kaos`
directory.

The initial file is converted to base64.
Depending on its size, the file is split into chunks under the `ftr` directory.
Branches consist of 2 hashes as strings, separated by a '\n'. The same applies
to the root.
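
For instance, a branch for two hashes could be written like this (a sketch; the
$first_hash and $second_hash variables are placeholders, and storing the branch
under `fmrk` with a content-addressed file name is an assumption based on the
storage layout described further down):

    # Two hashes, one per line, make up a branch:
    printf '%s\n%s\n' "$first_hash" "$second_hash" > branch.tmp
    # Store the branch under the SHA512 of its own content:
    branch_hash="$(sha512sum branch.tmp | cut -d' ' -f1)"
    mv branch.tmp "$AK_WORKDIR/fmrk/$branch_hash"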

7. Networking
-------------
Although networking capabilities are not part of the current implementation of
the system, it is fairly easy to exchange branches and root hashes as 'metadata'
and chunks/leaves as 'data'. A DHT-like implementation is in the works so that
nodes requesting any hash can discover the nodes that have it.

8. Bouquets of leaves (maps)
----------------------------
Based on the structure of the Merkle trees we are producing here, another way
of sharing information about the leaves (chunks) of a file is possible. Instead
of packing them as two leaf hashes per branch, we can list all the leaf hashes
in order, as leaf01,leaf02,...,leafN.

For this kind of work see the following bash scripts:
./bin/ak-sm-filejoiner
./bin/ak-sm-filesplitter
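
A minimal sketch of producing such a map in bash (illustrative only; the chunk
size and file names are not necessarily what ak-sm-filesplitter uses):

    #!/usr/bin/env bash
    # Minimal sketch: split a file and emit an ordered "map" of leaf hashes.
    set -e

    input="$1"
    workdir="$(mktemp -d)"

    base64 "$input" > "$workdir/encoded"
    split -b 4K -d "$workdir/encoded" "$workdir/leaf."

    # One hash per line, in chunk order, is the whole map.
    for leaf in "$workdir"/leaf.*; do
        sha512sum "$leaf" | cut -d' ' -f1
    done > "$input.map"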

9. Storage
----------
In the current implementation, storage is laid out as follows:
- $AK_WORKDIR/fmrk for branches,
- $AK_WORKDIR/ftr for leaves and
- $AK_WORKDIR/fmp for maps.

The files are NOT organized any further for now.

To turn this unstructured layout into something more handy, we can use each
digit of a given SHA512 hash as a directory name.

For example, given the hash:
0d1e5cd136e004be8c9625b1037e2e908304f03998b94cd201f6ca8a125bab03385f3c9c11b3c7cb280fb6b6f1fcbf9b2877e48dad09c81fa0ff6e5e7412ad0e

We should search for, store, or read the following file:
0/d/1/e/5/c/d/1/3/6/e/0/0/4/b/e/8/c/9/6/2/5/b/1/0/3/7/e/2/e/9/0/8/3/0/4/f/0/3/9/9/8/b/9/4/c/d/2/0/1/f/6/c/a/8/a/1/2/5/b/a/b/0/3/3/8/5/f/3/c/9/c/1/1/b/3/c/7/c/b/2/8/0/f/b/6/b/6/f/1/f/c/b/f/9/b/2/8/7/7/e/4/8/d/a/d/0/9/c/8/1/f/a/0/f/f/6/e/5/e/7/4/1/2/a/d/0/e

In this way we would be storing up to 16 files per directory.

Now, doubling the digits would get us 256 files per directory:
0d/1e/5c/d1/36/e0/04/be/8c/96/25/b1/03/7e/2e/90/83/04/f0/39/98/b9/4c/d2/01/f6/ca/8a/12/5b/ab/03/38/5f/3c/9c/11/b3/c7/cb/28/0f/b6/b6/f1/fc/bf/9b/28/77/e4/8d/ad/09/c8/1f/a0/ff/6e/5e/74/12/ad/0e

Finally, a 65536-entries-per-directory approach would use four digits per
directory, like:
0d1e/5cd1/36e0/04be/8c96/25b1/037e/2e90/8304/f039/98b9/4cd2/01f6/ca8a/125b/ab03/385f/3c9c/11b3/c7cb/280f/b6b6/f1fc/bf9b/2877/e48d/ad09/c81f/a0ff/6e5e/7412/ad0e

But that is too high a files-per-directory ratio.

We can also reuse the original hash as the final file name, like:
0d/1e/5c/d1/36/e0/04/be/8c/96/25/b1/03/7e/2e/90/83/04/f0/39/98/b9/4c/d2/01/f6/ca/8a/12/5b/ab/03/38/5f/3c/9c/11/b3/c7/cb/28/0f/b6/b6/f1/fc/bf/9b/28/77/e4/8d/ad/09/c8/1f/a0/ff/6e/5e/74/12/ad/0d1e5cd136e004be8c9625b1037e2e908304f03998b94cd201f6ca8a125bab03385f3c9c11b3c7cb280fb6b6f1fcbf9b2877e48dad09c81fa0ff6e5e7412ad0e

But, on the other hand, we already know the hash from the path we walked, and
we can recalculate the hash of the file we found in case we want to verify the
correctness of the path/file.

To take care of all the above, a "driver" should be implemented that, given a
hash, returns the location of the file. We can go for the two-digit-per-
directory approach, as sketched below.
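
A minimal sketch of such a driver, assuming the store root defaults to
$AK_WORKDIR (the fmrk/ftr/fmp subdirectory is omitted for brevity) and that the
full hash is kept as the final file name, as in the example above:

    #!/usr/bin/env bash
    # Minimal sketch of a hash-to-location "driver" using two hex digits
    # per directory level.
    hash_to_path() {
        local hash="$1"
        # Two digits per directory for all but the last pair of digits ...
        local dirs
        dirs="$(printf '%s' "${hash:0:126}" | sed 's/../&\//g')"
        # ... and the full hash kept as the file name, as in the example
        # above. The fmrk/ftr/fmp subdirectory is omitted here.
        printf '%s/%s%s\n' "${AK_WORKDIR:-$HOME/.arching-kaos}" "$dirs" "$hash"
    }

    # Prints where the file for the example hash should live.
    hash_to_path 0d1e5cd136e004be8c9625b1037e2e908304f03998b94cd201f6ca8a125bab03385f3c9c11b3c7cb280fb6b6f1fcbf9b2877e48dad09c81fa0ff6e5e7412ad0e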