Hashing it out

The cryptographic hash function is one of the most important concepts to understand in the world of computing, and it plays a fundamental role in a surprising number of places. So what the heck is it? The ELI5 version is that you take a chunk of data, feed it into a complex mathematical algorithm, and out the other side comes a fixed-length string. An example of the output is 326327CC2FF9F05379F5058C41BE6BC5E004BAA7. If the hash function is working correctly, it should be computationally infeasible to find any other chunk of data that produces that same string out the other side. When I say “computationally infeasible” what I really mean is that finding such an input by brute force would take more computing power than you could build with all the material in the known galaxy. While this holds for most modern hash functions, some older ones aren’t so secure. MD5 was the old standby for a long time. It’s still non-trivial to exploit, but it has been proven that with enough computing power and the right software an attacker can craft two different inputs that hash to the same output string. This is called a collision. Collisions are bad because software that relies on a hash almost always needs to trust that a given output corresponds to one known, legitimate input. Being able to manipulate this gives a nefarious party a great deal of power.
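
To make that concrete, here is a minimal sketch in Python. The input string is arbitrary, and SHA-1 is used only because it produces the same 40-character hex format as the example above.

```python
import hashlib

data = b"a chunk of data"  # any input: a file, a password, a software package...

# SHA-1 always produces a 160-bit digest, printed here as 40 hex characters,
# the same format as the example string above.
print(hashlib.sha1(data).hexdigest().upper())

# Flip even one byte and the new digest bears no resemblance to the old one.
print(hashlib.sha1(b"a chunk of datA").hexdigest().upper())
```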

An example of this power is the very sophisticated piece of malware called Flame. Flame spread by exploiting the MD5-based certificates used to sign packages from Windows Update: a chosen-prefix collision let its authors forge a certificate that chained up to Microsoft. The upshot is that computers would search out updates, and an infected machine would intervene and say “Hey bro, I have a totally legitimate package for you to install. It’s even signed. Totes legit.” The naive and vulnerable computer would then download and execute that package, and one more machine would be totally owned by the attacker. Microsoft no longer uses MD5 for signing due to its susceptibility to collisions.

The amount of computing power necessary to create a collision against MD5 is easily within the reach of even a modestly funded attacker. A small network of machines doing the math on GPUs can create a custom collision in a matter of days. And since Amazon sells GPU instances on AWS, a cluster can be rented, brought up, used, and torn down again in record time and for an almost shockingly small sum.

Passwords, for instance, are usually kept in hashed form in a database. The frontend users type their password into never stores or accepts the hash itself; it hashes whatever the user submits and compares the result against the stored value, which leaves the whole system much more hardened against attack. Even if someone from the outside manages to exploit the machine and gain read access to the database, all they have is the hash output strings. To actually log in with them, they would have to brute-force the hashes to recover the plaintext passwords that produced them. Since hash functions are designed to only go one way, anyone with a moderately decent password would be safe for a substantial period of time. Certainly long enough for the operator of the exploited service to alert users to the breach and have them change their passwords.
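
As a rough illustration of the idea (not a claim about how any particular site does it), here is a minimal Python sketch that stores a salted, slow hash and verifies a login attempt by re-hashing. The iteration count and salt size are arbitrary.

```python
import hashlib
import hmac
import os

def hash_password(password: str, iterations: int = 200_000) -> tuple[bytes, bytes]:
    """Return (salt, hash) for storage; only these ever hit the database."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)
    return salt, digest

def verify_password(password: str, salt: bytes, stored: bytes,
                    iterations: int = 200_000) -> bool:
    """Re-hash the submitted password and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)
    return hmac.compare_digest(candidate, stored)

salt, stored = hash_password("hunter2")
print(verify_password("hunter2", salt, stored))    # True
print(verify_password("password1", salt, stored))  # False
```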

But all of this is just an abstraction for those who use technology instead of creating it. Even so, it’s useful to know, if only to verify your software. How do you know that the software you just grabbed is legitimate? Many projects list a hash string on the downloads page so that users can verify the authenticity of the package for themselves. This protects against, say, a DNS hijack that tricks you into downloading malware. It also guards against the more common problem of a download that didn’t complete properly. I like a piece of software called HashTab for checking hash strings in Windows. On Linux and Unix systems you can use md5sum or sha1sum from the terminal. Let’s check out our file below:

[Screenshot: HashTab’s File Hashes tab for the downloaded ISO]

By right-clicking our file and choosing Properties, we can get to the File Hashes tab and check out our hash output. The file in question is an ISO from Microsoft. We know that this is a genuine piece of Microsoft software because the SHA-1 sum was published for this Windows 7 RTM disc. We now know that nobody inserted anything malicious into our download and that we have the full and complete file.
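
If you’d rather script the check than click through a GUI, a few lines of Python do the same job as sha1sum or HashTab. The filename and expected hash below are placeholders for whatever the vendor actually publishes.

```python
import hashlib

def sha1_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large ISOs don't have to fit in memory."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest().upper()

expected = "326327CC2FF9F05379F5058C41BE6BC5E004BAA7"  # published hash (placeholder)
actual = sha1_of_file("windows7_rtm.iso")               # hypothetical filename
print("OK" if actual == expected else f"MISMATCH: {actual}")
```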

Hash isn’t just for the Cheech and Chong set. It can even be used by people with jobs!

Nosce te ipsum

In computing, as in life, taking the time to fully understand what you are trying to accomplish is never wasted. Sun Tzu clearly knew a great deal about computer hardware and software when he gave us “故曰:知彼知己,百戰不殆;不知彼而知己,一勝一負;不知彼,不知己,每戰必殆” — roughly, “If you know the enemy and know yourself, you need not fear a hundred battles; if you know yourself but not the enemy, for every victory you will also suffer a defeat; if you know neither, you will lose every time.” If you know what you are trying to accomplish and you know what the hardware is capable of, you cannot lose. If you know one and not the other, you might as well be flipping a coin. Knowing neither will certainly lead to a poorly running computer, wasted money, or both.

A case in point is a computer used in a family or office setting that will never play games. For this application, modern CPUs are way beyond the point of being “fast enough,” even at the extremely low end. Even going as low as $47 won’t make a noticeable difference. Web browsers and office documents are so rarely CPU-bound that spending more gets you very little in actual performance. Even more inefficient is spending to the point where you are adding cores. Slotting a quad core into a use-case like this results in nothing more than twice as many idle cores sitting around twiddling their expensive thumbs all day. Web browsers like Chrome can thread out very well, but we have to keep in mind that your typical office worker isn’t going to be using the browser in a way that can max out even a cheapo dual core. I doubt Joe in Accounting is going to be pulling up 100 tabs of Flash-based streaming porn. And if he does, that’s more of a management problem than a computing one.

Your time spent understanding the use-case extends, perhaps most importantly, to storage. Spinning hard drives are still going to be the cheapest option. Analyze how much space your users actually consume today; most will likely never need to store more than 128GB locally. If you are a large enough business, letting anyone store things locally is a bad idea anyway, since every desktop becomes an unbacked-up point of failure for data that could well be critical to your operations. That problem is better handled with a GPO pointing user data at an Active Directory member server running RAID. It’s also worth considering whether the modest cost increase of going to an SSD will pay off. I think in most cases it will for this type of work, once you consider how expensive labor is. The combined cost of 20 workers waiting for their computers to boot, or lagging behind while the storage catches up with them, will cover that initial outlay very quickly. Consider the cost delta between a decent spinning disk and a good quality SSD: even if you somehow pay minimum wage, it still pays off quickly. Other benefits are less quantifiable, such as the reduced stress of working on fast storage. If your workers are anything like me, they get frustrated every time they have to wait for things to load.
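
To put the labor argument in numbers, here is a back-of-envelope sketch; every figure in it is an assumption you should swap for your own.

```python
# Rough SSD payback estimate; all inputs below are assumptions.
workers = 20
ssd_premium_per_machine = 60.0   # $ extra for an SSD over a spinning disk (assumed)
minutes_saved_per_day = 5        # boot + load time saved per worker per day (assumed)
hourly_wage = 15.0               # loaded cost of an hour of labor (assumed)
work_days_per_month = 21

daily_savings = workers * (minutes_saved_per_day / 60) * hourly_wage
total_premium = workers * ssd_premium_per_machine
payback_days = total_premium / daily_savings

print(f"Up-front SSD premium: ${total_premium:.0f}")
print(f"Labor recovered per day: ${daily_savings:.2f}")
print(f"Payback in roughly {payback_days:.0f} working days "
      f"(~{payback_days / work_days_per_month:.1f} months)")
```

With those made-up numbers the premium pays for itself in about two and a half months, and anything above minimum wage only shortens that.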

Another thing to consider is size. Since our office and web-browsing machines are such low-power parts, they don’t need to be very large at all. A machine running one of these very cheap Intel CPUs with solid state storage should peg out at maybe 50 watts at full tilt, and it will spend 95% of its life using half that. You can easily get by with a form factor such as mini-ITX. This uses up much less desk space and, rather informally, looks bomb. One I built recently for such a purpose is roughly the size of your typical George R.R. Martin hardcover.

[Photo: the mini-ITX build lying flat]

[Photo: the mini-ITX build standing upright]

This cute little guy is purpose built, dirt cheap, and boots and opens programs with enough speed to really blow your hair back.

Ampache

It seems like most of the world is content to access their media through streaming. For any number of reasons, I prefer to keep actual copies of books, TV, movies, and music. Other plebeians seem happy renting the media they want for a monthly fee. After acquiring some decent quality IEMs, I thought it was time to make sure I could access my audio library on the go. While I could pull the files over SSH and access them using, say, SSHFS, it took little time for me to get frustrated with that inelegant solution. I had previously used GNUMP3d on a LAMP machine to pull music over the Internet. That was a great piece of software, and it worked very well, but I decided to go a different way this time since the project hasn’t been updated since 2007 and there are better alternatives in these halcyon modern times. This led me to Ampache, which is modern, maintained, and open source.

Ampache is a web application that runs on anything with the necessary basic components: an HTTP server, PHP for scripting, and a MySQL-compatible database server. Since I am a glutton for punishment, I decided to pick the best of each of these components and see if I could hack it all together and make it work. Picking the “best” HTTP server is a largely masturbatory exercise, much like arguing the merits of other holy software such as text editors. Going by raw numbers, though, you’d have to agree that the fastest web server software around is NGINX. If NGINX is fast enough to flirt with the old limit of 10,000 concurrent connections, it is good enough for this use-case. As for PHP, there is little choice there. Happily, there is a bit of choice with the database server. Ever since Larry Ellison and his band of money-grubbing devils swallowed Sun Microsystems, MySQL has been tainted. Luckily, MariaDB came along to return our precious freedom.

[Image: freedom]

As far as Ampache goes, I have found it to be slick, fast, and intuitive, with plenty of features buried in there as well. About half of my music is stored as FLAC, which is high bitrate and quite large. I told Ampache to transcode this on the fly, and it promptly told me to install ffmpeg. After that little altercation, I can access my FLAC as more WAN-appropriate V4 MP3. The media player pops up at the bottom of the page as soon as you start playing anything, and playlists can be made and saved on the fly.
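
For the curious, the on-the-fly transcode is roughly equivalent to an ffmpeg call like the one below. The exact flags Ampache passes depend on its configuration, so treat this as an approximation with made-up filenames.

```python
import subprocess
from pathlib import Path

def flac_to_mp3_v4(src: Path, dst: Path) -> None:
    """Transcode a FLAC file to LAME VBR quality 4 ("V4", ~165 kbps average)."""
    subprocess.run(
        [
            "ffmpeg",
            "-i", str(src),           # input FLAC
            "-codec:a", "libmp3lame",  # encode with LAME
            "-q:a", "4",               # VBR quality 4
            str(dst),
        ],
        check=True,
    )

flac_to_mp3_v4(Path("track.flac"), Path("track.mp3"))  # hypothetical filenames
```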

[Screenshot: the Ampache web interface with the player docked at the bottom]

If you are looking for a solution for sharing your music over the web, I would check it out.

Opus

The question of audio in any computing environment is rife with complexity, licensing concerns, and a religious attachment to the status quo of various older formats. Storing audio for mobile usually means two things. First, you are likely going to want smaller files in order to pack more runtime onto a device with limited storage. Second, you have to consider which formats the device you plan on using can actually play. For most mobile applications you will see the old standby MP3, since it has great inertia and is consequently supported everywhere. When you aren’t constrained by mobile device requirements, things get much more interesting. Without those storage constraints, a regular computer user is free to use a lossless codec. With 3TB hard disks in the $100 range, there is little reason to archive music in a lossy format. This leaves us with three general options that work for different use cases.

The first option is totally uncompressed PCM, which is how music is stored on audio CDs. While this gets you a bit-for-bit copy of the exact data on the disc, there is really no reason not to employ some space savings. This option is objectively inferior for that reason.

Lossless codecs are able to save space with zero reduction in quality, which makes them strong contenders if the goal is to archive music. Both FLAC and Apple Lossless are good options, though FLAC is more popular and arguably superior in terms of licensing. These codecs will squeeze some space savings out of the audio without sacrificing quality in any way. For a desktop application with modern hardware, lossless codecs are going to be a superior option.
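
That “zero reduction in quality” claim is easy to verify yourself: decode the FLAC back to raw PCM and compare it to the original rip. Here is a rough sketch using ffmpeg, with hypothetical filenames; it hashes the decoded samples rather than the files themselves so container metadata doesn’t skew the comparison.

```python
import hashlib
import subprocess

def pcm_sha1(path: str) -> str:
    """Decode a file to raw 16-bit PCM with ffmpeg and hash the samples."""
    pcm = subprocess.run(
        ["ffmpeg", "-i", path, "-f", "s16le", "-codec:a", "pcm_s16le", "-"],
        check=True, capture_output=True,
    ).stdout
    return hashlib.sha1(pcm).hexdigest()

# Encode a WAV rip to FLAC (hypothetical filenames).
subprocess.run(["ffmpeg", "-i", "track.wav", "track.flac"], check=True)

# Lossless means the decoded audio comes back bit-for-bit identical.
print(pcm_sha1("track.wav") == pcm_sha1("track.flac"))  # True
```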

While true audiophiles would sneer at the thought of anything less than a lossless copy, some of the more modern lossy codecs are truly impressive pieces of software engineering. MP3 became THE lossy standard because it didn’t have much strong competition back in the mid-90s when it first took off. At high bitrates, MP3 compresses well and is nearly indistinguishable from the recording source without extremely high-end audio equipment. It has hung around for so long because it is “good enough” and has a great deal of market inertia behind it now. These days, there are simply better options. Ogg Vorbis and Opus are two open formats that both deliver better audio than MP3 at comparable bitrates. Neither is patent-encumbered, which makes them solid options for commercial applications. Vorbis came first and was designed as a higher-latency codec for music and general multimedia audio. It covers a large bitrate range (roughly 45 kbit/s to 500 kbit/s) and is simply much stronger than MP3 across that entire range. For telephony applications, for example, lower bitrates are perfectly capable of carrying voice audio, which requires less fidelity than music. The IETF saw how well Vorbis did and thought they could do better, and Opus was born. Opus has a substantially wider bitrate range than even Vorbis and can limbo all the way down to 8 kbit/s. Combined with a much lower algorithmic delay, that makes Opus ideal for Voice over IP and other interactive media. What is even more impressive is that, when the bitrate is cranked up, it is objectively superior to the alternative formats basically all the way up the bitrate scale.
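
If you want to hear the low-bitrate end for yourself, encoding the same track at several Opus bitrates is one ffmpeg call per file. A quick sketch, assuming an ffmpeg build with libopus and a hypothetical input file:

```python
import subprocess
from pathlib import Path

src = Path("input.flac")  # hypothetical source file

# Encode the same track at a range of Opus bitrates for comparison.
for kbps in (8, 16, 24, 32):
    dst = src.with_name(f"{src.stem}_{kbps}k.opus")
    subprocess.run(
        ["ffmpeg", "-i", str(src), "-codec:a", "libopus",
         "-b:a", f"{kbps}k", str(dst)],
        check=True,
    )
    print(f"{dst.name}: {dst.stat().st_size / 1024:.0f} KB")
```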

[Chart: listening-test quality comparison of Opus against other codecs across bitrates]

I did some informal testing of my own. While I do not own any truly high-end audio equipment, my Logitech G930 headset is reasonably capable and probably sits slightly above the median for computer users. I took an album from a great 80’s band and converted the lossless FLAC files to Opus at various bitrates, starting at 8 kbit/s. I selected “Alone,” which is one of my favorite tracks and has a wide range between quiet sections and blistering full-range instruments and vocals at once.

The 8 kbit sample was noticeably degraded: there was obvious background hiss and the audio was muddled in general. It was clear what the song was, but any sort of fine fidelity just wasn’t happening. At 16 kbit I could clearly make out all the vocals, but there was still slight background noise at higher volumes. 24 kbit was where things began to get interesting; I had difficulty noticing any degradation until the loudest sections came in, and no background noise could be detected at any point. I didn’t test beyond 32 kbit because I simply couldn’t tell the difference between it and the source. In random A/B testing, I guessed the 32 kbit encode correctly only slightly better than chance. That 32 kbit sample came in at a whopping 865KB, which is truly remarkable against the more than 24MB source file. Below is a chart comparing the size of each encode against the source filesize.

[Chart: file size of each Opus encode compared to the FLAC source]

From this angle, that looks pretty impressive to me.