Saturday, June 30, 2012

The Final Word on the LinkedIn Leak


As you are undoubtedly aware by now, two weeks ago the professional networking site LinkedIn became the victim of a rather unfortunate mishap: they sprang a little leak, and 6.4 million password hashes trickled out onto the internet. And in those two short weeks, hundreds of security experts the world over, of various backgrounds and with hats ranging from white to black, have been feverishly clawing their way through that list in an attempt to crack all 6.4 million passwords. However, few have made more progress in their pursuit than my associate d3ad0ne and me.

Update June 30, 2012: Per Thorsheim and his colleague Tom K. Tørrissen at EVRY have made an infographic based on my data & statistics. You can find it in this blog post.
Update July 2, 2012: Updated table to include GPU speeds for sha512crypt and bcrypt from John the Ripper 1.7.9-jumbo6.
Update July 3, 2012: Updated 'Pass Phrases' section to include the number of passwords that were at least 16 characters long.
Update May 1, 2013: Corrected GPU name in SHA1 performance table, from "HD6690" to "HD6990." It took me almost a year to catch this mistake!

Surviving on little more than furious passion for many sleepless days, we now have over 90% of the leaked passwords recovered. And while other, presumably less motivated individuals were quick to toss out rather meaningless statistics after cracking as little as one quarter of one percent of the leaked passwords, I am a bit disappointed that I am unable to provide you statistics on 100% of them. However, when leaks such as this occur, a 90% - 95% recovery rate seems to be about par.

The password hash list floating around the internet contains 6,458,020 unique password hashes. However, as many were quick to point out, the majority of the password hashes were mangled: somehow, someway, the first five digits of 3,521,180 of the SHA-1 hashes had been replaced with zeros.
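To make the mangling concrete, here is a minimal sketch of how a guess can be checked against either kind of entry. It assumes the hashes are stored as 40-character hex strings; the function names are my own illustration, not taken from any cracking tool.

```python
import hashlib

MANGLED_PREFIX = "00000"  # the five zeroed hex digits described above


def sha1_hex(password: str) -> str:
    """Return the unsalted SHA-1 digest of a password as 40 hex characters."""
    return hashlib.sha1(password.encode("utf-8")).hexdigest()


def matches_leaked_hash(password: str, leaked_hash: str) -> bool:
    """Check a guess against a leaked hash, whether intact or mangled."""
    candidate = sha1_hex(password)
    leaked = leaked_hash.lower()
    if candidate == leaked:
        return True
    # A mangled entry only preserves the last 35 hex digits of the real hash,
    # so compare those and ignore the zeroed prefix.
    return leaked.startswith(MANGLED_PREFIX) and candidate[5:] == leaked[5:]
```

Because 35 hex digits (140 bits) of each hash survive, a match on the remaining digits is still overwhelming evidence that the guess is correct.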

Initial speculation and murmurings around various hacker circles concluded that these mangled hashes must have been the ones already cracked by those who perpetrated the breach. However, such theories do not hold much water once you take into account that 670,781 of the mangled hashes are duplicates of the remaining, non-mangled hashes.

The opinion shared by several esteemed password crackers is that the hash list leaked to the internet was intended to only contain unique hashes, and something along the way didn’t quite go as planned. If the hashes were obtained through SQL injection, there are plenty of things that could have gone wrong. Or, perhaps someone muffed a sed command or similar while attempting to extract the hashes from the database dump. The fact is nobody knows for sure why over half of the hashes are mangled, except perhaps those who did the mangling – and even they may not be sure themselves.


Top 30 Passwords?

There are two important conclusions to reach if you are of the belief that the list was intended to only include unique hashes. First, it is impossible to know exactly how many accounts have been compromised or how many users share the same password. Second, it is therefore impossible to even begin to know what the “Top 30 Passwords” on LinkedIn are, or were. To state that “link” and “1234” are the most used passwords on LinkedIn would be a lie, as from what was provided that information is impossible to know.

What is possible to know, however, is how many passwords share a common base word – the word used to form the password, which usually has additional characters or numbers added to it. For example, we do know that 46,193 of the recovered passwords contain some form of “linkedin” in them. The word “link” was found an additional 12,996 times, while “linked” was found in 7,806 more passwords. The word “love” occurs 21,042 times. And, not surprisingly, some form of “password” occurs in 4,248 passwords, while “pass” occurs an additional 8,008 times.

Top 15 Base Words Used in LinkedIn Passwords
     1.   linkedin    46,193
     2.   love        21,042
     3.   link        12,996
     4.   anna         9,545
     5.   pass         8,008
     6.   linked       7,806
     7.   jack         7,258
     8.   blue         7,234
     9.   john         6,576
     10.  mark         5,525
     11.  mike         5,424
     12.  chris        5,050
     13.  nick         4,751
     14.  paul         4,499
     15.  password     4,486
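A tally like the one above could be produced along these lines. This is only a sketch under my own assumptions: matching is a case-insensitive substring test, and each password is credited to the longest base word it contains, so that “linkedin” is not also counted under “linked” and “link.” The function name and the matching rule are illustrative, not a description of the exact method used for these statistics.

```python
from collections import Counter


def count_base_words(passwords, base_words):
    """Count passwords per base word, crediting each password to the
    longest base word it contains (case-insensitive substring match)."""
    # Longest first, so "linkedin" wins over "linked" and "link".
    ordered = sorted(base_words, key=len, reverse=True)
    counts = Counter()
    for password in passwords:
        lowered = password.lower()
        for word in ordered:
            if word in lowered:
                counts[word] += 1
                break  # credit each password to one base word only
    return counts
```

A simple substring test like this cannot tell a deliberate base word from a coincidental one (“anna” inside “savannah,” for instance), so counts produced this way are best read as upper bounds.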

The base words in bold more than likely have a connection to LinkedIn, and bring up two interesting potential trends.

For years we’ve been instructing people to use a unique password for each site, but it now appears that far too many people have interpreted that advice as “use the same password for each site, but add the site name to the password to make it unique to that site.” Clever, but here’s the fatal flaw in your plan: if I find out your LinkedIn password is “LINKEDINWillem01,” I’m pretty sure I know what your Facebook and PayPal passwords are, too.

The same goes for people who are using the site’s name or URL itself as their password. By our count, nearly 200 unique passwords contain some form of “linkedin.com” in them. The most noticeable trends were “linked.comNNNN,” where NNNN is typically a year, and NAME@linkedin.com, where NAME is someone’s first or last name. A handful of people also went as far as to use passwords like “linkedinpass” or “thisismylinkedinpassword.” I’ll take one guess at what their bank account password is.

NEW RULE: Sites should not allow the use of their site name or other common base words in users’ passwords.
Twitter is already following this rule:



The other trend we’re beginning to see emerge is people basing their passwords on the primary colors of the website they’re on. For instance, LinkedIn’s primary colors are blue and gray, both of which make the Top 10 list.

Top 10 Colors Used in LinkedIn Passwords
     1.   blue      998
     2.   green     665
     3.   red       445
     4.   orange    417
     5.   purple    398
     6.   pink      304
     7.   black     281
     8.   brown     186
     9.   white     157
     10.  gray      146

The big question here is whether the color blue can somehow be connected to LinkedIn. It could simply be that blue is just a wildly popular color, but I can’t help but feel that seeing the color blue on the site made people more inclined to pick that color. Green and red are both very popular colors as well, yet there were 50% more blue-based passwords than green. Gray also surprises me – do that many people really love the dreary color gray, or were they inspired by what they saw? There were nearly three times more gray-based passwords than lime-based passwords, and that’s even including passwords where it was impossible to distinguish if the password was referring to the color or the fruit. It’s impossible to draw any hard conclusions, but this is a potential trend that should be kept an eye on in the future.

Password Re-use

One thing we know will always be true is that people tend to select the same passwords. When the social media company RockYou was compromised in December 2009, details for 32 million accounts were leaked to the internet. And out of those 32 million exposed passwords, only 14.3 million were unique.

On a seemingly unrelated note, I have a wordlist that I maintain that contains nothing but real-world passwords from actual security breaches such as RockYou. It is currently almost six gigabytes in size, containing over 500 million unique passwords from sites all over the world.

So, what would happen if I were to run my real-world password wordlist through the LinkedIn hashes? The answer is I would crack 1.4 million of the 6.4 million password hashes in a matter of seconds. 21% of LinkedIn passwords were used as-is on other sites!

If we apply some logical rules to those real-world passwords we pick up another 2 million passwords, meaning an additional 31% of the passwords on LinkedIn are nearly identical to those used on other sites. We were able to recover over 52% of the LinkedIn passwords within the first two hours without really doing any work at all, simply because people everywhere think alike.
NEW RULE: Stop thinking alike! (But you were probably already thinking that.)
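For readers who have never seen a wordlist attack, here is a minimal sketch of the idea in Python. The handful of rules in `mangle` are hypothetical examples I picked for illustration; real rule sets in tools like oclHashcat and John the Ripper are far larger, and this is not the rule set we used.

```python
import hashlib


def sha1_hex(word):
    """Unsalted SHA-1 of a candidate password, as a hex string."""
    return hashlib.sha1(word.encode("utf-8")).hexdigest()


def mangle(word):
    """Yield the word as-is, then a few simple rule-based variants."""
    yield word
    yield word.capitalize()
    yield word.upper()
    for suffix in ("1", "123", "2012", "!"):  # illustrative suffixes only
        yield word + suffix
        yield word.capitalize() + suffix


def wordlist_attack(wordlist, target_hashes):
    """Return {hash: plaintext} for every target hash that a candidate hits."""
    remaining = set(target_hashes)
    cracked = {}
    for word in wordlist:
        for candidate in mangle(word):
            digest = sha1_hex(candidate)
            if digest in remaining:
                cracked[digest] = candidate
                remaining.discard(digest)
        if not remaining:
            break  # nothing left to crack
    return cracked
```

Since the hashes are unsalted, each candidate is hashed once and looked up in the whole target set at no extra cost, which is why a large wordlist can be run against millions of hashes so quickly.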


Pass Phrases


While the overwhelming majority of the LinkedIn passwords we cracked were between six and eight characters long, nearly 14,000 were at least 16 characters long and over 200 were at least 20 characters long – many of which were phrases consisting of four or more words. Passwords of that length should be fairly secure, so how were we able to crack so many of them?

Blame LinkedIn. The entire point of storing users’ passwords using one-way hash algorithms is to protect the passwords from being discovered, and one of the primary defenses against offline password recovery is the amount of time it takes to calculate each guessed value. If the algorithm used to hash a password can be calculated very quickly, then we can make lots of guesses at what the password might be in a very short amount of time. Conversely, if the algorithm is very slow to calculate, then only a limited number of guesses can be made in a reasonable amount of time. However, against the advice of NIST, OWASP, and other authorities on the subject, LinkedIn was storing passwords using one of the fastest-computing hash algorithms available: SHA-1.

SHA-1 was never designed to be used for password storage; it was primarily designed for message authentication and data validation. As such, SHA-1 is computationally very inexpensive and hashes can be generated very quickly – which is what you want when dealing with things like SSL or IPsec, but definitely not desirable when trying to protect users’ passwords.

Another aggravating factor is that LinkedIn did not salt users’ passwords. Salting is when you add a unique, random string of characters to a user’s password before calculating the hash, so that even if two users happen to have the same password, they will have unique password hashes. Salting passwords not only defeats Rainbow Tables (large databases which contain every possible password and its hash for a particular algorithm up to a certain length), but also reduces the number of guesses we are able to make per second since each password guess has to be hashed with each unique salt.
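The effect of a salt is easy to demonstrate. The sketch below uses SHA-1 only to mirror the discussion above, not as a recommendation; the 16-byte salt length and the function names are my own assumptions.

```python
import hashlib
import os


def unsalted_sha1(password):
    """Hash a password with no salt: equal passwords give equal hashes."""
    return hashlib.sha1(password.encode("utf-8")).hexdigest()


def salted_sha1(password, salt=None):
    """Return (salt, digest). A fresh random 16-byte salt is drawn if none is given."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.sha1(salt + password.encode("utf-8")).hexdigest()
    return salt, digest
```

Two users who both choose “linkedin” get identical output from `unsalted_sha1`, but different digests from `salted_sha1`, because each call draws its own salt. The salt is stored alongside the digest so the site can recompute the hash at login.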

The last aggravating factor is that LinkedIn passwords were hashed using only one iteration of the SHA-1 algorithm. Modern password hashing algorithms typically employ thousands of iterations to make them more computationally expensive. For example, the Unix SHA512-based crypt scheme, aka sha512crypt, uses 5,000 iterations of the SHA-512 hash algorithm by default, and can be configured to use as many as one billion iterations so that it scales as computing power increases. The bcrypt algorithm -- which is vastly more computationally expensive than sha512crypt -- typically only uses 256 to 1024 iterations of the Blowfish keying algorithm by default, simply because each iteration is so expensive that it doesn't need to use more than that to be effective.
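To show what salting plus iteration looks like in code, here is a minimal sketch using PBKDF2 from Python's standard library. To be clear, PBKDF2 is a stand-in I chose because it needs no third-party packages; it is neither sha512crypt nor bcrypt, and the iteration count below is an illustrative assumption, not a tuned recommendation.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # illustrative work factor (an assumption, not a recommendation)


def hash_password(password, salt=None, iterations=ITERATIONS):
    """Return (salt, iterations, digest) for storage next to the user record."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha512", password.encode("utf-8"), salt, iterations)
    return salt, iterations, digest


def verify_password(password, salt, iterations, expected):
    """Recompute the digest with the stored salt and iteration count, then compare."""
    _, _, digest = hash_password(password, salt, iterations)
    # Constant-time comparison avoids leaking how many bytes matched.
    return hmac.compare_digest(digest, expected)
```

Storing the iteration count with each hash is what lets the work factor be raised later as hardware gets faster, without invalidating existing hashes.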

The vast majority of the LinkedIn passwords my associates and I recovered were cracked using a program called oclHashcat, which enables us to use graphics cards to crack passwords (modern graphics cards being much faster at cracking algorithms like SHA-1 than ordinary computer processors). Using four AMD Radeon HD6990 graphics cards, I am able to make about 15.5 billion guesses per second using the SHA-1 algorithm.

If that sounds like a lot, that’s because it is a lot. Even on my Intel Core i7 processor, I can crack SHA-1 at a rate of 98 million guesses per second using a program called John the Ripper, which is still very fast. Compare that to an algorithm like bcrypt, which I can crack at a rate of almost 5,000 guesses per second for hashes with a cost factor of 5 (32 iterations) using my Core i7 990X processor. Graphics cards don't help much with bcrypt either, since its design makes it very GPU-unfriendly.

Speed of SHA-1 vs. Modern Password Hashing Algorithms

Algorithm   | Iterations | Software                     | Hardware              | Guesses Per Second
SHA-1       | 1          | John the Ripper 1.7.9-jumbo6 | Intel Core i7 990X    | 98,000,000
SHA-1       | 1          | oclHashcat plus-0.09         | 4x AMD Radeon HD 6990 | 15,500,000,000
sha512crypt | 5,000      | John the Ripper 1.7.9-jumbo6 | Intel Core i7 990X    | 1,800
sha512crypt | 5,000      | John the Ripper 1.7.9-jumbo6 | ATI Radeon HD 5870    | 2,592
sha512crypt | 5,000      | John the Ripper 1.7.9-jumbo6 | Nvidia GTX 580        | 11,405
bcrypt      | 32         | John the Ripper 1.7.9-jumbo6 | Intel Core i7 990X    | 4,960
bcrypt      | 32         | John the Ripper 1.7.9-jumbo6 | ATI Radeon HD 5870    | 1,745
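To put those rates in perspective, here is a small worked calculation. The guess rates come from the table above; the keyspace (every 8-character password made of lowercase letters) is a hypothetical example of mine, not one of the attacks we ran.

```python
# Guesses per second, taken from the table above.
RATES = {
    "SHA-1, 4x HD 6990 (oclHashcat)": 15_500_000_000,
    "SHA-1, Core i7 990X (John the Ripper)": 98_000_000,
    "sha512crypt, GTX 580 (John the Ripper)": 11_405,
    "bcrypt, Core i7 990X (John the Ripper)": 4_960,
}


def seconds_to_exhaust(alphabet_size, length, guesses_per_second):
    """Time to try every password of exactly `length` symbols from the alphabet."""
    keyspace = alphabet_size ** length
    return keyspace / guesses_per_second


if __name__ == "__main__":
    # Hypothetical keyspace: 26 lowercase letters, 8 characters long.
    for name, rate in RATES.items():
        seconds = seconds_to_exhaust(26, 8, rate)
        print(f"{name}: {seconds:,.0f} seconds ({seconds / 86400:,.2f} days)")
```

Under that assumption, the four-GPU SHA-1 rig exhausts the whole keyspace in roughly 13 seconds, while the same search against bcrypt on the Core i7 would take well over a year.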

The big question you should be asking right now is whether LinkedIn developers consciously decided to use such a weak password storage scheme, or simply did not consider changing the default password storage scheme of some solution that they purchased.

Either way, the responsibility falls squarely on LinkedIn’s shoulders. A solid Systems Development Lifecycle (SDLC) risk management policy would most certainly include application security reviews, where issues such as these would be unearthed by security experts and addressed by developers. So whether LinkedIn has no security leadership, their SDLC risk management program is broken or non-existent, or they chose to accept the risk associated with using a weak password storage scheme, they were doing something wrong.

And it was because LinkedIn was doing it wrong that we were able to crack as many passwords as we did. Had LinkedIn stored their users’ passwords using a computationally expensive hashing algorithm like bcrypt, we would have had to be very selective about what kinds of attacks we ran and how long they would take to complete. But since they used a single iteration of unsalted SHA-1, we were virtually unlimited in the types of attacks we could run. We were able to throw gigabytes and gigabytes of words at the hashes, run each word through permutation filters and rules engines, and even run complex combinations of attacks without having to worry about how long each attack would take to complete. And it’s because even the most complex attacks we launched finished in a matter of hours that we were able to recover as many complex passwords as we did.
NEW RULE: Sites must only store user passwords using hashing algorithms specifically intended for storing passwords, such as bcrypt. A well-defined SDLC risk management program would benefit everyone as well.
---
 HUGE thanks to Per for backing me on this and working with me to create this writeup, @d3ad0ne_ for teaming up with me to crack the last bit of hashes, and to Tom K. Tørrissen for doing the great infographic!