Playing around with (small) Big Data.


Some friends and colleagues of mine recently started a work group to learn more about big data, and although it is a bit outside my field, I joined them. Among the things I have contributed is a Twitter analyzer tool that records tweets containing tags that you find interesting or that are important to your research. But sometimes you would like to harvest data that has already been written. Cue the hashtagCrawler :)

With the Twitter API it is very easy to mine data from hashtags or users that you find interesting. I’ll release the source code in a few days, once I have cleaned it up a bit.

But what can you actually do with this data? For the purpose of this demo I selected the currently very active Twitter feed “#TopTenTurnons”.

Yes, I’ll admit that the thought of it increasing the traffic to my blog a tiny bit did cross my mind ;-)

To get the data I used the .NET WebRequest/WebResponse objects and the Twitter API search command, and from this I generated a nice pile of Twitter data, including language type and a timestamp. (About 360 tweets were collected, far fewer than I thought there would be.) I fed the data into a simple sentence-comparison algorithm (I was going to try Hadoop but I didn’t have it set up; that might be an updated post in the future!) and made a percentage-based list of the results.
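For the curious, here is a minimal sketch of what such a counting step could look like. It is written in Python rather than the .NET used for the real tool, and it is not the actual code: the function names normalize and percentage_list, the cleanup rules and the sample tweets are all assumptions made up for illustration. It simply strips the hashtag, lowercases the text, counts identical phrases and turns the counts into percentages.

```python
from collections import Counter
import re


def normalize(text, hashtag="#toptenturnons"):
    """Lowercase a tweet, drop the hashtag and punctuation, squeeze spaces."""
    text = text.lower().replace(hashtag, " ")
    text = re.sub(r"[^a-z0-9' ]+", " ", text)  # assumed cleanup rule
    return " ".join(text.split())


def percentage_list(tweets):
    """Return (phrase, percent) pairs, most common phrase first."""
    counts = Counter(normalize(t) for t in tweets)
    counts.pop("", None)  # ignore tweets that were only the hashtag
    total = sum(counts.values())
    return [(phrase, round(100.0 * n / total, 1))
            for phrase, n in counts.most_common()]


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the collected tweets.
    sample = [
        "Guys #TopTenTurnons",
        "guys!! #toptenturnons",
        "Correct grammar #TopTenTurnons",
        "#TopTenTurnons humor",
    ]
    for phrase, percent in percentage_list(sample):
        print(f"{phrase}: {percent}%")
```

Exact matching like this is crude, since two tweets only count as the same answer if they are identical after cleanup.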

So what are people turned on by? I was not surprised to find the regular stuff like “tall boys”, “tits” and “ass” at the top of the list, but I was (happily) surprised by the almost total lack of Justin Bieber posts.

  1. guys (15%) (Many posts just plainly said guys)
  2. you (15%) (Referring to a boyfriend/girlfriend)
  3. tits (6.5%) (Including all posts of tits, but 99.9% were similar to large, bombastic etc)
  4. ass (6.5%) (Including all posts of ass, but tastes seem to vary heavily from large to tight etc)
  5. “can joke around” (5.2%)
  6. “self confident” (5.2%)
  7. “when you do things just for me” (5.2%)
  8. “strong” (3.9%)
  9. “players” (3.9%) (This was a big surprise to me)
  10. “playfully insult each other” (3.9%) (Ehm?)
  11. humor (3.9%)
  12. Correct grammar (3.9%)

Interestingly, though, 0.3% answered “when you look like a chicken McNugget”!

I am sure that my data is mostly “incorrect”, since only a very small group of people answered and all I did was check how common the sentences were. In the future I hope to retest this with better software where I can account for slight variations in spelling and sentence structure. But what I hope I have shown you is that big data is possible even on a simple laptop. (Get a cloud service and you can perhaps run with the best?)
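One possible way to handle slight spelling variations is fuzzy matching. The sketch below is only an idea, not something I have run on the real data: it uses SequenceMatcher from Python's standard difflib module, and the function names similar and group_similar, the 0.8 threshold and the sample phrases are assumptions chosen for illustration. Each phrase joins the first existing group whose representative it resembles closely enough, otherwise it starts a new group.

```python
from difflib import SequenceMatcher


def similar(a, b, threshold=0.8):
    """True if two phrases are close enough to count as the same answer."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def group_similar(phrases, threshold=0.8):
    """Greedily group near-identical phrases; return (representative, count)."""
    groups = []  # each entry is [representative phrase, count]
    for phrase in phrases:
        for group in groups:
            if similar(phrase, group[0], threshold):
                group[1] += 1
                break
        else:
            # No existing group was close enough, so start a new one.
            groups.append([phrase, 1])
    return sorted(((rep, n) for rep, n in groups), key=lambda g: -g[1])


if __name__ == "__main__":
    # Hypothetical phrases with small spelling differences.
    answers = ["self confident", "self-confident", "selfconfident",
               "humor", "humour"]
    for rep, n in group_similar(answers):
        print(rep, n)
```

The greedy grouping depends on the order of the input and on the threshold, so it would need tuning against real tweets before the percentages could be trusted.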

Are you working on any big data projects?