Wikigen – Wikipedia Wordlist Generator

Having a good wordlist (or several) is pretty important for speeding up password cracking, especially if you wish to merge wordlists or apply masks to your attacks. Let’s say you know that the owner of the hash you’re about to crack likes a certain soccer team or perhaps a music genre. This Python script takes a Wikipedia page as an argument and extracts words and URLs from that page. After opening the initial page it puts every URL into a queue called secondlayer. When the firstlayer queue is empty (it will be after the first page, since that’s the only thing in it), the script fills the firstlayer queue with the links from secondlayer. It then repeats the process until every firstlayer link has been visited, had its words extracted and its links added to secondlayer, and so on for as many layers as needed.
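To make the two-queue idea concrete, here is a minimal sketch of the layer swapping described above. It is not the code from wwg.py. The names extract_words and crawl_layers, the fetch callable (which is assumed to return a page's text and its links), the max_layers limit and the visited set are all my own assumptions for illustration.

```python
import re
from collections import deque


def extract_words(text, min_len=6, max_len=30):
    """Return the set of alphabetic words whose length is within the bounds."""
    return {w for w in re.findall(r"[^\W\d_]+", text)
            if min_len <= len(w) <= max_len}


def crawl_layers(start_url, fetch, max_layers=2, min_len=6, max_len=30):
    """Crawl layer by layer and collect words.

    fetch(url) is a hypothetical callable returning (page_text, links).
    """
    firstlayer = deque([start_url])  # links being processed right now
    secondlayer = deque()            # links found while processing firstlayer
    visited = set()                  # assumption: skip pages already seen
    words = set()

    for _ in range(max_layers):
        while firstlayer:
            url = firstlayer.popleft()
            if url in visited:
                continue
            visited.add(url)
            text, links = fetch(url)
            words |= extract_words(text, min_len, max_len)
            secondlayer.extend(l for l in links if l not in visited)
        # firstlayer is empty: promote the collected links and start over
        firstlayer, secondlayer = secondlayer, deque()
        if not firstlayer:
            break
    return words
```

Passing fetch in as a parameter keeps the sketch free of any network code, so you can try it against a small dictionary of fake pages before pointing it at a real site.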

My first version of this script was a lot slower since it did not support threading. This version does, and each thread extracts one link. So for example if you choose 10 threads it will extract 10 links at the same time and repeat. I do not know if Wikipedia will ban users that generate too much traffic, so it might be stupid to choose too many threads. However, if you’d like, you can write a sleep function or similar. Please use this responsibly, since it’s not intended as a tool to steal all their bandwidth, nor do I condone that (although I doubt it’s possible with this script).
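If you do want a pause between rounds, something along these lines might work. This is only a sketch: it uses concurrent.futures rather than whatever threading code the script itself contains, and process_in_batches, worker and delay are hypothetical names that do not exist in wwg.py.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def process_in_batches(urls, worker, nr_threads=5, delay=1.0):
    """Run worker(url) on nr_threads links at a time, sleeping between batches."""
    results = []
    with ThreadPoolExecutor(max_workers=nr_threads) as pool:
        for i in range(0, len(urls), nr_threads):
            batch = urls[i:i + nr_threads]
            # pool.map returns results in the same order as the batch
            results.extend(pool.map(worker, batch))
            # pause before the next batch, but not after the last one
            if i + nr_threads < len(urls):
                time.sleep(delay)
    return results
```

With nr_threads=10 this handles 10 links at the same time, waits delay seconds, and then moves on to the next 10.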

Three arguments are mandatory at launch (-u, -t and -o). The other two, minimum and maximum word length, are optional and default to 6 and 30 if not provided.

-u STARTURL Wikipedia URL to use as start for the crawler
-t NRTHREADS Amount of threads
-o OUTPUTFILE File to write output to
-m MIN Minimum length of words
-M MAX Maximum length of words
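For anyone writing something similar, the options above could be declared with argparse roughly like this. This is a sketch and not necessarily how wwg.py parses its arguments. The flags and defaults come from the list above, while build_parser and the dest names are assumptions.

```python
import argparse


def build_parser():
    """Build a parser with three required options and two optional ones."""
    parser = argparse.ArgumentParser(
        prog="wwg.py", description="Wikipedia wordlist generator")
    parser.add_argument("-u", dest="starturl", required=True,
                        help="Wikipedia URL to use as start for the crawler")
    parser.add_argument("-t", dest="nrthreads", type=int, required=True,
                        help="Amount of threads")
    parser.add_argument("-o", dest="outputfile", required=True,
                        help="File to write output to")
    # Optional length bounds, defaulting to 6 and 30
    parser.add_argument("-m", dest="min", type=int, default=6,
                        help="Minimum length of words")
    parser.add_argument("-M", dest="max", type=int, default=30,
                        help="Maximum length of words")
    return parser
```

Calling build_parser().parse_args() without -m or -M leaves the length bounds at 6 and 30.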

Example usage: python wwg.py -u http://en.wikipedia.org/wiki/Europe -o wordlist.txt -t 5

Here’s the link to the script on GitHub.
The guys at ArchAssault also added it to their repository. Link
