The cryptographic hash function is a really important concept to understand in the world of computing, and it plays a fundamental role in many different places. So what the heck is it? The ELI5 version is that you have a chunk of data. It goes into a complex mathematical algorithm and out the other side comes a fixed-length string. An example of the output is 326327CC2FF9F05379F5058C41BE6BC5E004BAA7. If the hash function is working correctly, it should be computationally infeasible to find any other chunk of data that returns that same string out the other side. When I say “computationally infeasible” what I really mean is that building enough computing power to create the same string from a different input would require all of the material in the known galaxy, for instance.
While this is true for most modern hash functions, some older ones aren’t so secure. MD5 was the old standby for a long time. It’s still non-trivial to exploit, but it has been shown that with enough computing power and the right software an attacker can craft two different inputs that produce the same output string. This is called a collision. Collisions are bad because the software that relies on the hash function almost always needs to be able to trust that a given output corresponds to exactly one known input. Being able to manipulate this gives a nefarious party a great deal of power.
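To make the “chunk of data in, string out” idea concrete, here is a minimal Python sketch using the standard library's hashlib module. The choice of SHA-1 is an assumption based on the 40-character length of the example string above, and the input strings and the sha1_hex helper name are made up for illustration.

```python
import hashlib

def sha1_hex(data: bytes) -> str:
    """Return the SHA-1 digest of data as an uppercase hex string."""
    return hashlib.sha1(data).hexdigest().upper()

# Two illustrative inputs that differ by a single character.
first = sha1_hex(b"The quick brown fox")
second = sha1_hex(b"The quick brown fix")

print(first)
print(second)
print("same length:", len(first) == len(second))  # SHA-1 is always 40 hex characters
print("same output:", first == second)            # a one-letter change gives a different string
```

Changing a single letter of the input produces a completely unrelated output of the same length, which is why you cannot work backwards from the string to the data.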
An example of this power is the very sophisticated piece of malware called Flame. Flame infected machines by exploiting an MD5 collision to forge a certificate that appeared to come from Microsoft, which was then used to sign packages posing as Windows Update. The upshot is that the computers would search out updates and an infected machine would intervene and say “Hey bro, I have a totally legitimate package for you to install. It’s even signed. Totes legit.” The naive and vulnerable computer would then download and execute that package and one more machine would be totally owned by the attacker. Microsoft no longer uses MD5 due to its susceptibility to collisions.
The amount of computing power necessary to create a collision from MD5 is easily within the reach of even a modestly funded attacker. A small network of machines using GPU calculation can create a custom collision in a matter of days. Since Amazon now sells GPU instances on AWS, a cluster can be rented, brought up, used, and torn down again in record time and for an almost shockingly small sum.
Hash functions also do a lot of good when they work as intended. Passwords, for instance, are usually kept in hashed form in a database. Since the login form hashes whatever the user types and will not accept a hash in place of a password, the end result is that everything is much more hardened against attack. Even if someone from the outside manages to exploit the machine and gain read privileges on the database, all they have is the hash output strings. In order to actually use these to log in they would have to brute-force guesses through the algorithm until one produces the stored string, which reveals the raw text of the password. Since hash functions are designed to only go one way, anyone with a moderately decent password would be safe for a substantial period of time. Certainly long enough for the operator of the exploited service to alert users to the breach and have them change their passwords.
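Here is a rough Python sketch of the store-the-hash idea. Note the assumptions: the text above does not say which algorithm to use, so this swaps in PBKDF2-HMAC-SHA256 with a random per-user salt (a common choice for passwords rather than a single bare hash), and the function names, sample passwords, and iteration count are all hypothetical.

```python
import hashlib
import hmac
import os
from typing import Tuple

ITERATIONS = 200_000  # assumed work factor; tune for your own hardware

def hash_password(password: str) -> Tuple[bytes, bytes]:
    """Return (salt, digest) to store in the database instead of the password."""
    salt = os.urandom(16)  # random salt so equal passwords get different hashes
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest

def check_password(password: str, salt: bytes, stored: bytes) -> bool:
    """Re-hash the submitted password and compare it to the stored digest."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return hmac.compare_digest(candidate, stored)  # constant-time comparison

salt, stored = hash_password("correct horse battery staple")
print(check_password("correct horse battery staple", salt, stored))  # True
print(check_password("hunter2", salt, stored))                       # False
```

The database only ever holds the salt and the digest, so someone who reads it still has to guess passwords one at a time and pay the full iteration cost for each guess.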
But all of this is just an abstraction for those who use technology instead of creating it. Even so it’s useful to know in order to be able to verify your software, for instance. How do you know that the software you just grabbed is indeed legitimate? Many pieces of software list a hash string on the downloads page so that users can verify the authenticity of the package for themselves. This protects against, say, a DNS hijack that tricks you into downloading malware. It also guards against the more common problem of data that didn’t download properly. I like a piece of software called HashTab to check out the hash strings in Windows. For Linux and Unix systems you can use md5sum or sha1sum to get an output from the terminal. Let’s check out our file below:
By right-clicking on our file and choosing Properties we can then get to the File Hashes tab and check out our hash output. The file in question is an ISO from Microsoft. We know that this is a genuine piece of Microsoft software because the SHA sum matches the one that was published for this Windows 7 RTM disc. We now know that nobody inserted anything malicious into our download and that we have the full and complete file.
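If you would rather script the check than click through a properties dialog, the same comparison can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the default of SHA-256 is an assumption, so pass whichever algorithm the download page actually lists.

```python
import hashlib

def file_hash(path: str, algorithm: str = "sha256", chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so a multi-gigabyte ISO never has to fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_published(path: str, published_hex: str, algorithm: str = "sha256") -> bool:
    """Compare the computed hash to the published one, ignoring case and surrounding whitespace."""
    return file_hash(path, algorithm) == published_hex.strip().lower()
```

You would call matches_published with the path to your download and the string copied from the vendor's page, for example matches_published("download.iso", "<published hash>", "sha1"), where both arguments are placeholders. A result of False means either a corrupted download or a file that is not what the publisher shipped.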
Hash isn’t just for the Cheech and Chong set. It can even be used by people with jobs!