Definition - ntopper/MD5AwSum GitHub Wiki

System Definition

MD5Awsum is a checksum tool that calculates the MD5 hash of a given file and attempts to identify the file by looking the hash up in a repository of hash tables. The tool also maintains a local repository of hash tables downloaded from the internet. Alternatively, given an MD5 checksum, the tool attempts a reverse lookup against the same set of repositories. Our goal is to create a one-step method of verifying the integrity of files downloaded from the internet.
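The lookup and reverse lookup described above can be sketched in a few lines of Python. This is only an illustration, not the project's actual implementation: the function names (`md5_of_file`, `identify`, `reverse_lookup`) and the shape of `repositories` (a dict mapping a repository name to a table of checksum to file name) are assumptions made for the example.

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def reverse_lookup(checksum, repositories):
    """Return every (repository name, file name) pair matching a checksum.

    `repositories` is a hypothetical structure: {repo_name: {md5_hex: file_name}}.
    """
    checksum = checksum.lower()
    return [
        (repo_name, table[checksum])
        for repo_name, table in repositories.items()
        if checksum in table
    ]


def identify(path, repositories):
    """Hash a file and return the first matching (repository, file name), or None."""
    matches = reverse_lookup(md5_of_file(path), repositories)
    return matches[0] if matches else None
```

Because each table is a plain hash table keyed by checksum, both directions reduce to a dictionary membership test once the digest is known.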

## Relevance of our Project
In 1988, the PrairieTek 220 hard drive held just 20 megabytes. File sizes were small in the 1980s and 1990s, and smaller files meant fewer places where a 0 could be miswritten as a 1 or vice versa. Such small errors are fatal for many programs. In 2007, the first 1 terabyte hard drive was introduced. One terabyte is 8x10^12 bits, that is, 8x10^12 1's and 0's, which leaves a lot of room for error. Typical files are not on the terabyte scale, but even a 2 gigabyte Ubuntu Linux distribution file contains 1.6x10^10 bits, any one of which could be wrong. Distribution files are commonly downloaded over the internet, where errors can occur in transit or be introduced maliciously. This creates a need for an algorithm that verifies the integrity of these large files.

A hashing algorithm is a precise algorithm that reads in bits and returns a fixed-length string that is (theoretically) unique to the input. Good hashing algorithms share certain characteristics: they always return the same value for a given input, the output has a fixed size, and, when the hashing algorithm is also used as a checksum algorithm, small differences in input produce very different outputs. These qualities make checksum hashing algorithms well suited to finding errors in files. The file is run through the algorithm while it is in a pristine, working state, and a hash is generated. Later, after the file has been shared or used, or when it starts producing errors and its integrity needs to be verified, the algorithm is run on the file again and the two outputs are compared. Even a single 0 replaced with a 1 generates a completely different output, which lets us quickly verify the integrity of the file.
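The three properties above (same input gives the same output, fixed output size, and a large change in output from a one-bit change in input) can be observed directly with Python's standard `hashlib` module. The helper names `md5_hex`, `flip_bit`, and `differing_bits` are made up for this sketch.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of `data` as 32 hexadecimal characters."""
    return hashlib.md5(data).hexdigest()


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of `data` with exactly one bit inverted."""
    mutated = bytearray(data)
    mutated[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(mutated)


def differing_bits(hex_a: str, hex_b: str) -> int:
    """Count how many of the digest bits differ between two hex digests."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


original = b"The quick brown fox jumps over the lazy dog"
corrupted = flip_bit(original, 0)  # simulate a single miswritten bit

print("original :", md5_hex(original))
print("corrupted:", md5_hex(corrupted))
print("digest bits that differ:", differing_bits(md5_hex(original), md5_hex(corrupted)))
```

Running the script prints two digests of identical length that share no obvious resemblance, even though the inputs differ by a single bit.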

The idea of checksum hashing algorithms is not new; many exist and are in use today. MD5 is a common one that is well tested and widely used. Our project does not propose a better version of MD5, but rather a better way to use the algorithm. It aims to automate the process of verifying file integrity by implementing both the hash calculation and the storage of known hashes against which the result is compared. Rather than searching for the published hash of the file you are downloading and then manually comparing it to the output of running MD5 on the file, you simply run our project, and the hash is calculated and verified using our databases.
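As a rough sketch of the manual process being automated, the following Python parses a list of published hashes and checks a downloaded file against it. It assumes the published list uses the line format produced by the `md5sum` utility (digest, separator, file name); the function names `parse_md5sums` and `verify_download` and the three result strings are illustrative choices, not part of the project.

```python
import hashlib
import os


def parse_md5sums(text):
    """Parse md5sum-style lines ("<digest>  <name>") into {name: digest}."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        checksum, _, name = line.partition(" ")
        # md5sum separates fields with two spaces, or a space and "*"
        # for binary mode; strip that separator from the file name.
        table[name.lstrip(" *")] = checksum.lower()
    return table


def verify_download(path, published):
    """Compare a file's MD5 with the published digest for its base name.

    Returns "ok", "mismatch", or "unknown" (no published digest found).
    """
    expected = published.get(os.path.basename(path))
    if expected is None:
        return "unknown"
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return "ok" if digest.hexdigest() == expected else "mismatch"
```

A caller would fetch the vendor's hash list once, pass its text to `parse_md5sums`, and then call `verify_download` on each downloaded file.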

This project aims to create a standard for hash verification. The more people use our project, the safer file downloading will be. A daunting multi-step process that many people skip, either from lack of knowledge or from fear of the unknown, will be reduced to a single tool that does all the work for you. Simply download the file and run our project, and you will know whether the file you downloaded is what it should be. Along with reducing errors, this will make it much harder for outside parties to maliciously modify downloads. As use of our project becomes more widespread, software vendors and file distributors will follow suit by providing hashes for more files. Ideally, our project will lead to every file having a stored, publicly available hash to accompany it.