Thursday, June 7, 2012

Tweet2Wordlist Utility

After reading an interesting post on using Twitter to generate wordlists from the words that people tweet, I decided that it would be ideal to have a script to do this for you.

Hence I wrote Tweet2Wordlist.py.

The script is simple. It takes a depth argument that sets how many of the latest tweets to fetch from Twitter, plus minimum and maximum word lengths to filter on. You can also filter by Geo Location (coordinates), by Tweet Language, or by both.
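
The length filter described above can be sketched roughly as follows. This is a minimal illustration and not the code from Tweet2Wordlist.py: the function name words_from_tweet and the rule that a word is any run of ASCII letters are assumptions made for the example.

```python
import re


def words_from_tweet(text, min_len=1, max_len=1000):
    """Return the words in `text` whose length lies within [min_len, max_len].

    Hypothetical helper: the real script's cleanup rules may differ.
    """
    # Assumption: a "word" is any run of ASCII letters; everything else
    # (punctuation, digits, unsupported characters) acts as a separator.
    candidates = re.findall(r"[A-Za-z]+", text)
    return [word for word in candidates if min_len <= len(word) <= max_len]


print(words_from_tweet("Just landed in San Francisco!", min_len=4, max_len=8))
# prints ['Just', 'landed']
```

Here "in" and "San" are too short and "Francisco" is too long, so only two words survive the filter.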

Keep the following in mind:
  1. Twitter may throttle/block you if you try to pull too much information.
  2. The Geo Location coordinates are given in the format latitude,longitude,radius. The radius is how far outward from your coordinates, in a circle, you want to search for tweets. This only works for users who have allowed their location to be shared when tweeting. Format example: 37.781157,-122.398720,1mi searches within 1 mile of those coordinates. Use km instead of mi for kilometers.
  3. The depth does not indicate the number of words! It indicates how many tweets to request from Twitter. Those tweets may contain many or few words that meet your criteria.
  4. Twitter returns some characters that the script does not support; it simply ignores them.
  5. The language filter applies to the Tweeter's language setting, not what they actually typed into their tweet. Hence you may get words that are not in the language you are filtering. 
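
To make the Geo Location format from point 2 concrete, here is a hedged sketch of how such a string could be validated before it is sent to Twitter. The name parse_geo and the regular expression are assumptions for illustration, as is the choice to require a mi or km suffix; the script itself may handle this differently.

```python
import re

# Assumed format: latitude,longitude,radius followed by "mi" or "km".
GEO_PATTERN = re.compile(
    r"^\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*,\s*(\d+(?:\.\d+)?)(mi|km)\s*$"
)


def parse_geo(value):
    """Split a 'lat,long,radius[mi|km]' string into (lat, lon, radius, unit).

    Hypothetical helper, not taken from Tweet2Wordlist.py.
    """
    match = GEO_PATTERN.match(value)
    if match is None:
        raise ValueError("expected lat,long,radius ending in mi or km: {!r}".format(value))
    lat, lon, radius = (float(group) for group in match.groups()[:3])
    unit = match.group(4)
    # Reject coordinates that cannot exist on the globe.
    if not -90 <= lat <= 90 or not -180 <= lon <= 180:
        raise ValueError("coordinates out of range: {!r}".format(value))
    return lat, lon, radius, unit


print(parse_geo("37.781157,-122.398720,1mi"))
```

Checking the string up front gives a clear error message locally instead of an empty or rejected search.
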
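TIP - If you only want each word to appear once, add  | sort -u  to the end of your command. This pipes the output through sort, which orders the words and drops duplicates. Note that uniq on its own only compares adjacent lines, so its input has to be sorted first. (Linux)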
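
If you would rather remove duplicates in Python than in the shell, a small order-preserving helper is enough. This is a sketch only: unique_words is a hypothetical name and is not part of the script.

```python
def unique_words(words):
    """Return `words` with duplicates removed, keeping first-seen order."""
    seen = set()
    result = []
    for word in words:
        if word not in seen:
            seen.add(word)
            result.append(word)
    return result


print(unique_words(["tweet", "word", "tweet", "list", "word"]))
# prints ['tweet', 'word', 'list']
```

The comparison is case-sensitive, so "Tweet" and "tweet" count as different words; lowercase the input first if that is not what you want.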

TIP - See the cleanup list in the python code and modify it to suit your needs if you want more control over what to filter out of tweets.

Here is the help output from the script:

Usage: tweet2wordlist.py [options]

 This tool is to simplify the dumping of words from tweets into wordlists.
Additionally, I have added features such as geo-location lookup of tweets as
well as the capability to control depth and word output sizes. Please send any
comments to the email listed in the program. If you find this tool useful,
tweet me at @Bitcrack_Cyber. See the blog  post on http://i-am-
rurapenthe.blogspot.com for examples, more info etc.

Options:
  -h, --help            show this help message and exit
  -m [1-1000], --max=[1-1000]
                        maximum word length to output
  -n [1-1000], --min=[1-1000]
                        minimum word length to output
  -o filename, --output=filename
                        output words to a file
  -d 1-1000, --depth=1-1000
                        how many tweets to get from twitter.keep it reasonable
                        to avoid throttling
  -g lat,long,radius[mi]/[km], --geo=lat,long,radius[mi]/[km]
                        geographic coordinates filter for tweets and radius
  -l [EN][CN][FR] etc, --lang=[EN][CN][FR] etc
                        filter tweets based on language code in ISO 639-1 code
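
For readers who want to build something similar, the options above could be declared along these lines. This is a sketch using Python's argparse module, not the script's actual source. The defaults, the dest names, and the bounded_int helper are assumptions, since the help output does not state them.

```python
import argparse


def bounded_int(value):
    """Parse an integer and require it to be in the 1-1000 range shown in the help."""
    number = int(value)  # argparse reports a ValueError here as an invalid value
    if not 1 <= number <= 1000:
        raise argparse.ArgumentTypeError("must be between 1 and 1000")
    return number


def build_parser():
    parser = argparse.ArgumentParser(prog="tweet2wordlist.py")
    # All defaults below are assumptions made for this example.
    parser.add_argument("-m", "--max", dest="max_len", type=bounded_int,
                        default=1000, metavar="N",
                        help="maximum word length to output")
    parser.add_argument("-n", "--min", dest="min_len", type=bounded_int,
                        default=1, metavar="N",
                        help="minimum word length to output")
    parser.add_argument("-o", "--output", metavar="filename",
                        help="output words to a file")
    parser.add_argument("-d", "--depth", type=bounded_int, default=100,
                        metavar="N", help="how many tweets to get from twitter")
    parser.add_argument("-g", "--geo", metavar="lat,long,radius",
                        help="geographic coordinates filter for tweets and radius")
    parser.add_argument("-l", "--lang", metavar="CODE",
                        help="filter tweets based on ISO 639-1 language code")
    return parser


args = build_parser().parse_args(["-n", "4", "-m", "8", "-d", "50", "-l", "en"])
print(args.min_len, args.max_len, args.depth, args.lang)
# prints 4 8 50 en
```

Putting the range check into the type function means an out-of-range depth or length is rejected with a usage error before any request is made.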

You can download the script here: tweet2wordlist.py

Comments, feedback, suggestions, etc. are welcome. Please follow and/or comment on Twitter using @Bitcrack_Cyber.


Dimitri AKA Rurapenthe