Saturday, March 11, 2017

An Optimized Text Storage Setup with Python and MongoDB


How many times were the names "Watson" and "Sherlock" used in the Sherlock Holmes novels?

I don't expect the reader to answer this question, but if you are curious, the numbers are 3,721 and 1,557 respectively.

How did I count them? Well, that will be answered eventually.

It all started with a web-scraping exercise I was doing with the site http://sherlock-holm.es/ascii/

The script worked well, and I was able to download 67 files: in short, all the novels and stories themed around Sherlock Holmes. Now, I'm not a big fan of old-time detective stories, so instead of reading them, the download became the source material for a text mining setup.
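The scraping part itself was simple. A minimal sketch of it looks like the following, using requests and BeautifulSoup; the assumption that the index page links each story as a .txt file is just an illustration here, and the real script deals with the page layout in its own way.

import os
import requests
from bs4 import BeautifulSoup

BASE_URL = "http://sherlock-holm.es/ascii/"
OUT_DIR = "corpus"
os.makedirs(OUT_DIR, exist_ok=True)

# Fetch the index page and collect links that look like plain-text stories.
index = requests.get(BASE_URL, timeout=30)
soup = BeautifulSoup(index.text, "html.parser")

for link in soup.find_all("a", href=True):
    href = link["href"]
    if href.endswith(".txt"):                               # assumption: stories are linked as .txt files
        url = requests.compat.urljoin(BASE_URL, href)
        name = os.path.basename(href)
        with open(os.path.join(OUT_DIR, name), "w", encoding="utf-8") as out:
            out.write(requests.get(url, timeout=30).text)
        print("saved", name)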

My Architecture:


Text: Source Data
Python: Scripting Language
MongoDB: Database (Community Edition)
Robomongo: A cross-platform MongoDB manager (Community Edition)

My idea was to deconstruct each of the text files into paragraphs, sentences and words, and take it from there. This way I would be able to do a more specific analysis of the text in a particular file at a later date.
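A minimal sketch of that deconstruction with NLTK; the paragraph rule of splitting on blank lines and the sample file name are only illustrations, not necessarily what the final script does.

import nltk
nltk.download("punkt", quiet=True)   # sentence/word tokenizer models, needed once

def deconstruct(raw_text):
    # Paragraphs: split on blank lines.
    paragraphs = [p.strip() for p in raw_text.split("\n\n") if p.strip()]
    # Sentences and words via the NLTK tokenizers.
    sentences = nltk.sent_tokenize(raw_text)
    words = nltk.word_tokenize(raw_text)
    return paragraphs, sentences, words

with open("corpus/a_study_in_scarlet.txt", encoding="utf-8") as f:
    paragraphs, sentences, words = deconstruct(f.read())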

I considered MongoDB because:

1. It has a Community Edition with a decent database storage limit.

In case you need more details:


2. It has built-in optimization for handling text data (see the sketch after this list).

3. It is dynamic and flexible with data structures.
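To illustrate point 2, this is what MongoDB's text handling looks like from PyMongo; the database, collection and field names here are placeholders, not the ones the script uses.

from pymongo import MongoClient, TEXT

db = MongoClient("mongodb://localhost:27017/")["sherlock"]

# A text index lets MongoDB tokenize, stem and search string fields natively.
db.sentences.create_index([("sentence", TEXT)])

# Example: the first few sentences mentioning Watson.
for doc in db.sentences.find({"$text": {"$search": "Watson"}}).limit(3):
    print(doc["sentence"])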

Let's look at the script that I put together in Python for the initial storage.

Structure of the script:

{ If you are interested in the script, please feel free to use it from here, with a credit to me if you can. }

Link to script:



Packages imported: NLTK, PyMongo

Functions (a simplified sketch of each follows the list):

1. initiatemongodbconnection - initiates the MongoDB connection

2. insertcollection - inserts a document into a MongoDB collection

3. dicttolist - a custom function to convert a dictionary of key/value pairs into a list

4. uprint - deals with UTF-8 and Latin-1 conversion issues in the text files
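For readers who do not want to open the full script, here is a simplified sketch of what these four helpers do. The connection URI, database name and exact output shapes are simplified stand-ins, not the verbatim code.

import sys
from pymongo import MongoClient

def initiatemongodbconnection(uri="mongodb://localhost:27017/", dbname="sherlock"):
    # Initiate the MongoDB connection and hand back the database object.
    return MongoClient(uri)[dbname]

def insertcollection(db, collectionname, document):
    # Insert one document into the named collection.
    return db[collectionname].insert_one(document)

def dicttolist(d):
    # Convert a dictionary of k,v pairs into a list of small documents,
    # which stores more naturally in MongoDB than arbitrary keys.
    return [{"word": k, "count": v} for k, v in d.items()]

def uprint(*objects, sep=" ", end="\n", file=sys.stdout):
    # Print text that mixes UTF-8 and Latin-1 without the console choking on it.
    enc = file.encoding or "utf-8"
    if enc.lower().startswith("utf"):
        print(*objects, sep=sep, end=end, file=file)
    else:
        safe = (str(o).encode(enc, errors="backslashreplace").decode(enc) for o in objects)
        print(*safe, sep=sep, end=end, file=file)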

Main:

  • Initiate the MongoDB connection
  • Initialize the corpus directory (it can be passed as an argument with a simple change to the script if you want that functionality)
  • Loop through each file to do the following:
    1. Get the word list and paragraph list, and store them.

    2. Get the sentence list and the word tokens.

    3. Using the tokens, create a frequency distribution of words. [This turned out to be a brilliant move later, because knowing the frequency of a particular word was a key step in improving the analysis.]
    (Now you know how I counted the word "Watson".)

    4. Store them (a condensed sketch of this whole loop follows below).
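Put together, a condensed, self-contained sketch of that loop looks like this; the collection and field names are simplified stand-ins for the ones in the actual script.

import os
import nltk
from nltk import FreqDist
from pymongo import MongoClient

corpus_dir = "corpus"                 # directory holding the 67 downloaded files
db = MongoClient("mongodb://localhost:27017/")["sherlock"]

for filename in os.listdir(corpus_dir):
    with open(os.path.join(corpus_dir, filename), encoding="utf-8") as f:
        raw = f.read()

    # Steps 1 and 2: paragraph list, sentence list, word tokens.
    paragraphs = [p.strip() for p in raw.split("\n\n") if p.strip()]
    sentences = nltk.sent_tokenize(raw)
    tokens = nltk.word_tokenize(raw)

    # Step 3: frequency distribution of words -- the piece that later answers
    # "how many times was Watson mentioned?" with a single lookup.
    freq = FreqDist(t.lower() for t in tokens if t.isalpha())

    # Step 4: store everything, one document per file per collection.
    db.paragraphs.insert_one({"file": filename, "paragraphs": paragraphs})
    db.sentences.insert_one({"file": filename, "sentences": sentences})
    db.wordfrequency.insert_one(
        {"file": filename, "frequency": [{"word": w, "count": c} for w, c in freq.items()]}
    )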


     

Database structure:
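Since the screenshots do not carry over well in text, this is the kind of document the word-frequency collection ends up holding, together with the aggregation that sums a word across the whole corpus. The field names match the sketch above and the counts shown are placeholders, not the real numbers.

from pymongo import MongoClient

# One document per downloaded file; the counts here are placeholders.
example_document = {
    "file": "a_study_in_scarlet.txt",
    "frequency": [
        {"word": "watson", "count": 0},
        {"word": "sherlock", "count": 0},
    ],
}

# Summing a word across the whole corpus is then a single aggregation.
db = MongoClient("mongodb://localhost:27017/")["sherlock"]
pipeline = [
    {"$unwind": "$frequency"},                                    # one entry per (file, word)
    {"$match": {"frequency.word": "watson"}},
    {"$group": {"_id": None, "total": {"$sum": "$frequency.count"}}},
]
print(list(db.wordfrequency.aggregate(pipeline)))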



A view using RoboMongo: https://robomongo.org/ 



A view of the sentence corpus from one of the files: 



After I did this, I went on to design a text mining setup using Python, MongoDB's text search features, and MapReduce techniques. I will write about them in another post.





