How many times were the names "Watson" and "Sherlock" used in the Sherlock Holmes novels?
I don't expect the reader to answer this question, but if you are curious, the numbers are 3721 and 1557 respectively.
How did I count them? Well, that will be answered eventually.
It all started with a web-scraping exercise I was doing with the site http://sherlock-holm.es/ascii/.
The script worked well, and I was able to download 67 files, in short, all the novels and stories themed around Sherlock Holmes. Now, I'm not a big fan of old-time detective stories, so the collection became the source for a text mining setup instead.
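For reference, here is a minimal sketch of the kind of scraper involved. I'm assuming the page links directly to .txt files; the actual link structure of the site may differ, and the output folder name is my own placeholder.

```python
# A minimal sketch of a scraper for the site; the link structure
# (anchors pointing to .txt files) is an assumption and may differ
# from what http://sherlock-holm.es/ascii/ actually serves.
import os
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

BASE_URL = "http://sherlock-holm.es/ascii/"
OUT_DIR = "corpus"  # placeholder folder for the downloaded text files

os.makedirs(OUT_DIR, exist_ok=True)

page = requests.get(BASE_URL, timeout=30)
soup = BeautifulSoup(page.text, "html.parser")

# Collect every link that ends in .txt and download it
for anchor in soup.find_all("a", href=True):
    href = anchor["href"]
    if href.endswith(".txt"):
        url = urljoin(BASE_URL, href)
        filename = os.path.join(OUT_DIR, os.path.basename(href))
        with open(filename, "wb") as f:
            f.write(requests.get(url, timeout=30).content)
        print("saved", filename)
```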
My Architecture:
Text: source data
Python: scripting language
MongoDB: database (Community Edition)
Robomongo: a cross-platform MongoDB manager (Community Edition)
My idea was to deconstruct each of the text files into paragraphs, sentences, and words, and take it from there. This way I would be able to do a more specific analysis of the text in a particular file at a later date.
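A rough sketch of that deconstruction using NLTK's tokenizers, assuming blank lines separate paragraphs (the original script's internals may differ):

```python
import nltk

nltk.download("punkt", quiet=True)  # sentence/word tokenizer models

def deconstruct(raw_text):
    """Split one raw text file into paragraphs, sentences, and words."""
    # Assumption: paragraphs in the plain-text files are blank-line separated
    paragraphs = [p.strip() for p in raw_text.split("\n\n") if p.strip()]
    sentences = nltk.sent_tokenize(raw_text)
    words = nltk.word_tokenize(raw_text)
    return paragraphs, sentences, words
```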
I considered MongoDB because:
1. It has a Community Edition with a decent database storage limit.
2. It has inbuilt optimization for handling text data (see the sketch after this list).
3. It is dynamic and agile with data structures.
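To illustrate point 2: MongoDB can index and search string content natively. Here's a minimal sketch with PyMongo, assuming sentences are stored one per document; the database, collection, and field names are placeholders of mine:

```python
from pymongo import MongoClient, TEXT

client = MongoClient("mongodb://localhost:27017/")
db = client["sherlock"]  # database name is a placeholder

# A text index lets MongoDB tokenize, stem, and rank string content natively
db["sentences"].create_index([("sentence", TEXT)])

# Find stored sentences mentioning Watson, using the text index
for doc in db["sentences"].find({"$text": {"$search": "Watson"}}).limit(5):
    print(doc["sentence"])
```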
Let's look at the script that I put together in Python for the initial storage.
Structure of the script:
(If you are interested in the script, please feel free to use it, with a credit to me if you can.)
Link to script:
Packages imported: NLTK, PyMongo
Functions:
1. initiatemongodbconnection - initiates the MongoDB connection
2. insertcollection - inserts a document into a MongoDB collection
3. dicttolist - a custom function to convert a dictionary of k,v pairs into a list
4. uprint - deals with UTF-8 and Latin-1 conversion issues in the text files
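To give a sense of the helpers, here are minimal sketches of what each function would plausibly do. These are my reconstructions using the modern PyMongo API, not the original code:

```python
from pymongo import MongoClient

def initiatemongodbconnection(uri="mongodb://localhost:27017/", dbname="sherlock"):
    """Initiate the MongoDB connection and return a database handle.
    The URI and database name are placeholder defaults."""
    client = MongoClient(uri)
    return client[dbname]

def insertcollection(db, collection_name, document):
    """Insert a single document into the named collection."""
    return db[collection_name].insert_one(document).inserted_id

def dicttolist(d):
    """Convert a dictionary of k,v pairs into a list of {key, value}
    records, which stores naturally as an array in a MongoDB document."""
    return [{"key": k, "value": v} for k, v in d.items()]

def uprint(text):
    """Print text while tolerating UTF-8 / Latin-1 conversion issues."""
    try:
        print(text)
    except UnicodeEncodeError:
        print(text.encode("latin-1", errors="replace").decode("latin-1"))
```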
Main:
- Initiate the MongoDB connection
- Initialize the corpus directory (it can be passed as an argument with a simple change to the script if you want that functionality)
- Loop through each file to do the following (sketched below):
1. Get the word list and the paragraph list, and store them.
2. Get the sentence list and the tokens.
3. Using the tokens, create a frequency distribution of words. [This turned out to be a brilliant move later, because knowing the frequency of a particular word was a key step in improving the analysis.]
(Now you know how I counted the word "Watson".)
4. Store them.
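Here's roughly how those steps fit together in the main loop, building on the helper sketches above. The corpus directory and collection name are my own assumptions, not the originals:

```python
import os
import nltk

CORPUS_DIR = "corpus"  # placeholder directory holding the downloaded files
db = initiatemongodbconnection()

for filename in os.listdir(CORPUS_DIR):
    path = os.path.join(CORPUS_DIR, filename)
    with open(path, encoding="utf-8", errors="replace") as f:
        raw_text = f.read()

    # Steps 1 & 2: deconstruct into paragraphs, sentences, and word tokens
    paragraphs, sentences, words = deconstruct(raw_text)

    # Step 3: frequency distribution of words from the tokens
    freq = nltk.FreqDist(w.lower() for w in words if w.isalpha())

    # Step 4: store everything, one document per file
    insertcollection(db, "corpus", {
        "file": filename,
        "paragraphs": paragraphs,
        "sentences": sentences,
        "wordfreq": dicttolist(freq),
    })

    uprint("%s: 'watson' appears %d times" % (filename, freq["watson"]))
```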
Database structure:
A view using Robomongo: https://robomongo.org/
A view of the sentence corpus from one of the files:
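Since the screenshot doesn't carry over into text, here is an illustrative, made-up document showing the shape each file produces. The field names follow the sketches above; the values are invented:

```python
# An illustrative document, not an actual record from my database
example_document = {
    "_id": "ObjectId(...)",  # assigned by MongoDB on insert
    "file": "the_adventures_of_sherlock_holmes.txt",
    "paragraphs": ["To Sherlock Holmes she is always the woman. ..."],
    "sentences": ["To Sherlock Holmes she is always the woman."],
    "wordfreq": [
        {"key": "sherlock", "value": 101},  # invented counts
        {"key": "watson", "value": 81},
    ],
}
```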
After I did this, I went on to design a text mining setup using Python, MongoDB's text search features, and MapReduce techniques. I will write about them in another post.