Wednesday, September 2, 2015

Ashley Madison Data: Life is Short. Have a look at an Affair.

If you haven't heard, the affair/dating site Ashley Madison was hacked last month, and when they didn't pay up, the data was released. When I heard the data was out, I was really excited. The amount and kind of data make this topic really seductive.

This dashboard won't allow you to look up people you know (or your spouse). It is designed to show some aggregate information about AM accounts. 

On to the nerdy details! The data was released as a bunch of MySQL dump files, so they had to be loaded into a database. The credit card transactions were daily CSV files that had to be merged (I used Pentaho). Getting this all in place was a bit of work, and then the sheer amount of data was bringing my computer to a grinding halt. I ended up creating some views to shape the data how I wanted it, then exporting each view to a text file to load into Tableau. Once in Tableau, I could aggregate further in an extract. I ended up with a few smaller aggregated datasets, and the result is a pretty small workbook.
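For anyone without Pentaho handy, the CSV merge step can be roughly sketched in plain Python. This is not what I actually ran, just a minimal illustration. It assumes every daily file shares the same header row, and the function name, file pattern, and paths are all made up for the example.

```python
import csv
import glob
import os


def merge_daily_csvs(input_dir, output_path, pattern="*.csv"):
    """Concatenate every CSV in input_dir into one file at output_path.

    Assumption: all daily files share an identical header row.
    The header is written once. Returns the number of data rows written.
    """
    out_abs = os.path.abspath(output_path)
    # Sort so the daily files are appended in filename order, and skip the
    # output file itself in case it lives in the same folder.
    paths = [
        p for p in sorted(glob.glob(os.path.join(input_dir, pattern)))
        if os.path.abspath(p) != out_abs
    ]

    header = None
    rows_written = 0
    with open(output_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for path in paths:
            with open(path, newline="", encoding="utf-8") as f:
                reader = csv.reader(f)
                file_header = next(reader, None)
                if file_header is None:
                    continue  # empty file, nothing to merge
                if header is None:
                    header = file_header
                    writer.writerow(header)
                elif file_header != header:
                    raise ValueError(f"unexpected header in {path}")
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

Calling something like `merge_daily_csvs("transactions", "all_transactions.csv")` (both names hypothetical) would give one file that can be bulk loaded into the database alongside the dumps. Streaming row by row keeps memory use flat, which matters when the data is already bringing the machine to a halt.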

The other issue was that there was no data dictionary, so some fields are a mystery. For example, I have a seeking field, but I have no idea what a 4, 6, or 2 means. Reports are that many of the female accounts are actually bots that trick men into buying more credits. Annalee Newitz at Gizmodo had a good breakdown of insights from the second dump, which included the source code. Using this, I have excluded bots from almost everything.
