changed:
-
A P2P file sharing system to be implemented in Python following the model of
Kademlia: http://citeseer.ist.psu.edu/529075.html
For starters, a class can be implemented to do each of these things:
1. Generate a random 128 bit hex number (just like md5 output) to be our
node address and save it. Also generate a random number to be our port
number. Above 1024 etc.
2. Take a file, encrypt it with a symmetric cipher like AES or something,
return encrypted file object
3. md5sum the encrypted file, return the md5sum (known as the key)
4. Break the encrypted file up into 1k chunks while:
a. md5summing each chunk
b. Storing each chunk into a dbm type database with the md5sum as the key
c. return list of md5sums
5. Build a file of meta-data which contains a list of the md5sums and the
original filename
a. md5sum the metadata file
b. store in database
Now we have the raw data in a format ready to be introduced to the rest of
the network, once there are other nodes.
Once that much is done I'll start working on the more complicated network
communications to talk to other nodes. In more general terms this would
consist of:
1. Given another nodes address or list of addresses (ip address, port
number) connect to each and get its md5 address and give it ours. Return
the nodes info and save it.
2. To request a key
a. Look in our list of nodes we are aware of and pick the one closest
to the key in the keyspace.
b. Contact the node and ask, if it has the data it sends it over and
if it doesn't have the data it will say so and return an ip:port combo of
the closest node it is aware of in the keyspace to the data requested.
i. If it doesn't have the data start process over at step 2 using
the new ip:port pair and repeat up to x number of times (to prevent
loops) or until we find the data.
3. To insert a key
a. Look in our list of nodes we are aware of and pick the one closest
to to the key in the keyspace.
b. Contact node, check if it has key already
i. If not, send it and return success
ii. And if it does return success
c. Each node the data is sent to will then repeat this process
starting at step 3 until:
i. It hits a collision
ii. It can't find anyone closer
d. If we exceed some maximum amount of keys (and therefore disk
storage) in our database we find the key that hasn't been accessed
in the longest time (will probably be far from us in our keyspace
also) and delete it.
4. Migrate keys to ensure they are in most optimal place
a. Periodically scan our database and pick the keys furthest away
from us in the keyspace and try to find a node closer to that key
and send him that data.
b. Pick the keys closest to us in the keyspace and find another node
closest to us in the keyspace and send him our data. If we go down
our data isn't lost.
5. Each time a chunk is inserted into the db (such as from another node in
step 3 or if we generated the chunk ourselves) step 4a is called on it to
get it into the most optimal place in the network.
6. If the requested key is decrypted and found to be a metadata file step
2 is called on each key in the metadata file, the chunks are assembled,
decrypted using the decryption key in the metadata file, and saved as the
filename in the metadata file.
And with just the above functionality we should have something that is at
least as good as bittorrant and probably a lot better since it doesn't
need trackers and everyone always contributes if they want to use the
system. Nobody will know who inserted what file and nobody will know who
is requesting what file. It could be a human, it could be an automated
moving of data around the keyspace. We could proxy requests by re-issuing
the request ourselves instead of returning an ip:port combo but that is
slower and more complicated and would be a future task if anything and it
might not even be necessary.
The big complaint against this sort of distributed hash table (DHT)
network is that you know each chunk of the data is going to be on or near
the node with the same address. Unless you proxy you can find out the ip
of the node nearest that address by searching for a key and seeing which
node the search ends on. The enemy might do this for all of the chunks of
a file they don't like and them send commandos out to take out each of
those machines (would be hundreds or thousands) as well as these machines
nearest neighbors. This prevents purists like the freenet guys from
considering DHT. You would also have to take out the machines nearest
neighbors. If it ever comes down to this it's 1984 and we're all doomed so
there's not much point in worrying about it.
Then searching (the real hard part) is the last step. I don't want to
think about it right now but I know it can be done with the above
infrastructure. At first we can just pass around keys through out of band
means like bittorrent does.
I will probably use airhook.org's messaging software which really looks
optimal for this situation. The twisted module may be useful also for a web
configuration interface etc.
The goal will be a working practical, scalable, distributed, uncensorable,
and anonymous p2p system. 100% foolproof anonymity is not the goal. It
just has to be good enough and I'll get there in tiny baby steps like
those listed above making sure the fundamentals work before moving onto
bigger things. Freenet is all fudged up because their routing doesn't
work, they got really complicated before hardly anything worked right,
they chose java, and they are just unrealistic with their goals.
Python has an md5 module. Now need to find some crypto modules.