Ashley Madison dump - no database necessary
Ashley Madison is a "dating" site marketed toward married people seeking an affair. They were recently hacked, and a dump of around 9 GB was posted to an Onion site. This was quickly mirrored to BitTorrent and will probably be around until the heat death of the universe.
Aside from the actual information, which is titillating, it's a fascinating amount of data. I've downloaded the dump and used it to practice my text parsing and scripting skills.
In this post, I'll share some methods of parsing huge text data sets such as this one. These methods are somewhat rudimentary. There are better, more complicated methods, but these are good basic ones that anyone can use. No database is necessary.
Linux
I've chosen to use a Bash shell on Linux. Manipulating large text files is much quicker in the Linux console. There are also mature tools that allow you to search and manipulate the data without first decompressing it, which is useful since these data sets are gzipped, saving a significant amount of space. If you don't have Linux installed, there are several versions you can boot from a CD. Knoppix is simple, but I'm a big fan of System Rescue CD.
The Dump
As mentioned above, this dump consists of several large compressed files. They are mostly gzipped csv (comma separated value) files, with some collections of files compressed into a 7z archive.
The 7z archives must be extracted in order to search them. The credit card transaction log archive, CreditCardTransactions.7z, is 278 MB compressed and over 2.5 GB uncompressed, so make sure you have space.
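A minimal sketch of checking free space before extracting, assuming the archive will be unpacked under the current directory. The 3 GB threshold is just the 2.5 GB figure rounded up for headroom, and the variable names are illustrative.

```shell
# Wanted free space in 1 KB blocks: about 3 GB (2.5 GB plus headroom, an assumption).
needed_kb=$((3 * 1024 * 1024))

# df -Pk prints POSIX-format output in 1 KB blocks; column 4 of the
# second line is the space available on the current filesystem.
avail_kb=$(df -Pk . | awk 'NR==2 {print $4}')

if [ "$avail_kb" -ge "$needed_kb" ]; then
  space_status="ok"
else
  space_status="low"
fi
echo "Available: ${avail_kb} KB, wanted: ${needed_kb} KB (${space_status})"
```

If the status comes back as low, point the extraction at a different disk or free up space first.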
Here are the commands I used to extract:
mkdir cc_data
7z x -occ_data/ CreditCardTransactions.7z
This creates a directory and extracts the data to that directory.
To search the data, I use grep. For example, this will find all records from Indiana:
grep \"IN cc_data/*
- grep searches for "IN. The \ is necessary before the " because the shell treats " as a special character
You can pipe your output to additional greps. For example, this will find all Indiana records with carmel in them:
grep \"IN cc_data/* | grep -i carmel
- this uses grep to search for "IN, then searches the results for carmel. The -i makes it NOT case sensitive. By default, grep searches are case sensitive.
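To try the chained grep without the real dump, here is a small sketch that runs on made-up rows. The sample file, its column layout, and the variable names are all hypothetical. It also quotes the pattern with single quotes, which is an alternative to escaping the " with a backslash.

```shell
# Build a tiny stand-in for the extracted CSV directory (made-up rows).
demo_dir=$(mktemp -d)
printf '%s\n' \
  '"1001","Carmel","IN","46032"' \
  '"1002","Fort Wayne","IN","46802"' \
  '"1003","CARMEL","CA","93923"' > "$demo_dir/sample.csv"

# The first grep keeps rows containing "IN; the second narrows those
# to rows containing carmel, ignoring case.
matches=$(grep '"IN' "$demo_dir"/* | grep -i carmel)
echo "$matches"

# Adding -c counts the matching lines instead of printing them.
count=$(grep '"IN' "$demo_dir"/* | grep -ic carmel)
echo "$count"

rm -rf "$demo_dir"
```

Only the first row survives both filters: the second has no carmel in it, and the third is a CA record.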
These searches can take a few seconds to complete.
Email Addresses
One of the gzipped dumps is a list of email addresses. You can search for an email address you know, or an address attached to a CC transaction log. Either way, you will find a user number that can be searched in all the remaining gzipped files to piece together a user's record.
To search a gzipped file, you need to use the z commands. Here is a great post on using z commands in Linux.
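As a rough sketch of following one user number across several gzipped files, the loop below checks each file with zcat and grep and collects the names of the files that mention it. The file names, the tuple layout, and the user number are invented for illustration and are not taken from the dump.

```shell
# Three tiny gzipped stand-ins for dump files (hypothetical names and contents).
demo_dir=$(mktemp -d)
printf '%s\n' "(12345,'someone@example.com')" | gzip -c > "$demo_dir/emails.dump.gz"
printf '%s\n' "(12345,'Carmel','IN')"         | gzip -c > "$demo_dir/profiles.dump.gz"
printf '%s\n' "(67890,'Elsewhere','CA')"      | gzip -c > "$demo_dir/other.dump.gz"

user_id=12345   # hypothetical user number
found=""
for f in "$demo_dir"/*.gz; do
  # grep -q only reports whether there was a match; -F treats the
  # pattern as a fixed string, so the "(" needs no escaping.
  if zcat "$f" | grep -qF "($user_id,"; then
    found="$found $(basename "$f")"
  fi
done
echo "Files mentioning $user_id:$found"

rm -rf "$demo_dir"
```

Anchoring the pattern on the opening "(" and trailing "," reduces false hits where the same digits appear inside a longer number.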
If you zgrep, you will get a huge block of text. Instead, I pipe zcat to tr, replacing "(" with "\n". This puts each record on its own line for readability.
Then I pipe the output to grep to find what I want. Here is an example:
zcat am_am.dump.gz |tr '(' '\n' | grep -i obama
- zcat is the equivalent of cat for compressed files. I could use zgrep, but this data set does not have good, readable line breaks. By zcat'ing the file first, I can break it into readable chunks before I grep.
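The effect of the tr step is easier to see on a small file. This sketch builds a one-line gzipped sample shaped loosely like a SQL dump, then compares the line count before and after splitting. The table name and addresses are made up, and the real dump's layout may differ.

```shell
# One long line of tuples, gzipped (made-up data).
demo_dir=$(mktemp -d)
printf '%s\n' "INSERT INTO t VALUES (1,'alice@example.com'),(2,'Bob@Example.com'),(3,'carol@example.org');" \
  | gzip -c > "$demo_dir/sample.dump.gz"

# Without splitting, the whole dump is a single line, so any match
# would print everything.
lines_before=$(zcat "$demo_dir/sample.dump.gz" | wc -l | tr -d ' ')

# Turning every "(" into a newline puts each tuple on its own line.
lines_after=$(zcat "$demo_dir/sample.dump.gz" | tr '(' '\n' | wc -l | tr -d ' ')

# Now grep returns only the tuple that matches.
hit=$(zcat "$demo_dir/sample.dump.gz" | tr '(' '\n' | grep -i bob)
echo "lines before: $lines_before, after: $lines_after"
echo "$hit"

rm -rf "$demo_dir"
```

The trade-off is that tr splits on every "(", including any that appear inside a field, so a record containing parentheses will be broken across lines.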