Saturday, March 11, 2017

An Optimized Text Storage Setup with Python and MongoDB


How many times were the names "Watson" and "Sherlock" used in the Sherlock Holmes novels?

I don't expect the reader to answer this question, but if you are curious, the numbers are 3,721 and 1,557 respectively.

How did I count them? Well, that will be answered eventually.

It all started with a web-scraping exercise I was doing with the site http://sherlock-holm.es/ascii/

The script worked well, and I was able to download 67 files: in short, all the novels and stories themed around Sherlock Holmes. Now, I'm not a big fan of old-time detective stories, so instead of reading them, the download became the source material for a text mining setup.
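The scraping part itself was simple. A minimal sketch of it looks like the following, using requests and BeautifulSoup; the assumption that the index page links each story as a .txt file is just an illustration here, and the real script deals with the page layout in its own way.

import os
import requests
from bs4 import BeautifulSoup

BASE_URL = "http://sherlock-holm.es/ascii/"
OUT_DIR = "corpus"
os.makedirs(OUT_DIR, exist_ok=True)

# Fetch the index page and collect links that look like plain-text stories.
index = requests.get(BASE_URL, timeout=30)
soup = BeautifulSoup(index.text, "html.parser")

for link in soup.find_all("a", href=True):
    href = link["href"]
    if href.endswith(".txt"):                               # assumption: stories are linked as .txt files
        url = requests.compat.urljoin(BASE_URL, href)
        name = os.path.basename(href)
        with open(os.path.join(OUT_DIR, name), "w", encoding="utf-8") as out:
            out.write(requests.get(url, timeout=30).text)
        print("saved", name)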

My Architecture:


Text: Source Data
Python: Scripting Language
MongoDB: Database (Community Edition)
Robomongo: A cross-platform MongoDB manager (Community Edition)

My idea was to deconstruct each of the text files into paragraphs, sentences and words, and take it from there. This way I would be able to do a more specific analysis of the text in a particular file at a later date.
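A minimal sketch of that deconstruction with NLTK; the paragraph rule of splitting on blank lines and the sample file name are only illustrations, not necessarily what the final script does.

import nltk
nltk.download("punkt", quiet=True)   # sentence/word tokenizer models, needed once

def deconstruct(raw_text):
    # Paragraphs: split on blank lines.
    paragraphs = [p.strip() for p in raw_text.split("\n\n") if p.strip()]
    # Sentences and words via the NLTK tokenizers.
    sentences = nltk.sent_tokenize(raw_text)
    words = nltk.word_tokenize(raw_text)
    return paragraphs, sentences, words

with open("corpus/a_study_in_scarlet.txt", encoding="utf-8") as f:
    paragraphs, sentences, words = deconstruct(f.read())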

I considered MongoDB because:

1. It has a Community Edition with a decent database storage limit.

In case you need more details:


2. It has built-in optimization for handling text data (see the sketch after this list).

3. It is dynamic and flexible with data structures.
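To illustrate point 2, this is what MongoDB's text handling looks like from PyMongo; the database, collection and field names here are placeholders, not the ones the script uses.

from pymongo import MongoClient, TEXT

db = MongoClient("mongodb://localhost:27017/")["sherlock"]

# A text index lets MongoDB tokenize, stem and search string fields natively.
db.sentences.create_index([("sentence", TEXT)])

# Example: the first few sentences mentioning Watson.
for doc in db.sentences.find({"$text": {"$search": "Watson"}}).limit(3):
    print(doc["sentence"])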

Let's look at the script that I put together in Python for the initial storage.

Structure of the script:

{ If you are interested in the script, please feel free to use it from here, with a credit to me if you can. }

Link to script:



Packages imported: NLTK, PyMongo

Functions (a simplified sketch of each follows the list):

1. initiatemongodbconnection - initiates the MongoDB connection

2. insertcollection - inserts a document into a MongoDB collection

3. dicttolist - a custom function to convert a dictionary of key/value pairs into a list

4. uprint - deals with UTF-8 and Latin-1 conversion issues in the text files
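For readers who do not want to open the full script, here is a simplified sketch of what these four helpers do. The connection URI, database name and exact output shapes are simplified stand-ins, not the verbatim code.

import sys
from pymongo import MongoClient

def initiatemongodbconnection(uri="mongodb://localhost:27017/", dbname="sherlock"):
    # Initiate the MongoDB connection and hand back the database object.
    return MongoClient(uri)[dbname]

def insertcollection(db, collectionname, document):
    # Insert one document into the named collection.
    return db[collectionname].insert_one(document)

def dicttolist(d):
    # Convert a dictionary of k,v pairs into a list of small documents,
    # which stores more naturally in MongoDB than arbitrary keys.
    return [{"word": k, "count": v} for k, v in d.items()]

def uprint(*objects, sep=" ", end="\n", file=sys.stdout):
    # Print text that mixes UTF-8 and Latin-1 without the console choking on it.
    enc = file.encoding or "utf-8"
    if enc.lower().startswith("utf"):
        print(*objects, sep=sep, end=end, file=file)
    else:
        safe = (str(o).encode(enc, errors="backslashreplace").decode(enc) for o in objects)
        print(*safe, sep=sep, end=end, file=file)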

Main:

  • Initiate the MongoDB connection
  • Initialize the corpus directory (it can be passed as an argument with a simple change to the script if you want that functionality)
  • Loop through each file to do the following:
    1. Get the word list and paragraph list, and store them.

    2. Get the sentence list and the word tokens.

    3. Using the tokens, create a frequency distribution of words. [This turned out to be a brilliant move later, because knowing the frequency of a particular word was a key step in improving the analysis.]
    (Now you know how I counted the word "Watson".)

    4. Store them (a condensed sketch of this whole loop follows below).
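Put together, a condensed, self-contained sketch of that loop looks like this; the collection and field names are simplified stand-ins for the ones in the actual script.

import os
import nltk
from nltk import FreqDist
from pymongo import MongoClient

corpus_dir = "corpus"                 # directory holding the 67 downloaded files
db = MongoClient("mongodb://localhost:27017/")["sherlock"]

for filename in os.listdir(corpus_dir):
    with open(os.path.join(corpus_dir, filename), encoding="utf-8") as f:
        raw = f.read()

    # Steps 1 and 2: paragraph list, sentence list, word tokens.
    paragraphs = [p.strip() for p in raw.split("\n\n") if p.strip()]
    sentences = nltk.sent_tokenize(raw)
    tokens = nltk.word_tokenize(raw)

    # Step 3: frequency distribution of words -- the piece that later answers
    # "how many times was Watson mentioned?" with a single lookup.
    freq = FreqDist(t.lower() for t in tokens if t.isalpha())

    # Step 4: store everything, one document per file per collection.
    db.paragraphs.insert_one({"file": filename, "paragraphs": paragraphs})
    db.sentences.insert_one({"file": filename, "sentences": sentences})
    db.wordfrequency.insert_one(
        {"file": filename, "frequency": [{"word": w, "count": c} for w, c in freq.items()]}
    )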


     

Database structure:
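Since the screenshots do not carry over well in text, this is the kind of document the word-frequency collection ends up holding, together with the aggregation that sums a word across the whole corpus. The field names match the sketch above and the counts shown are placeholders, not the real numbers.

from pymongo import MongoClient

# One document per downloaded file; the counts here are placeholders.
example_document = {
    "file": "a_study_in_scarlet.txt",
    "frequency": [
        {"word": "watson", "count": 0},
        {"word": "sherlock", "count": 0},
    ],
}

# Summing a word across the whole corpus is then a single aggregation.
db = MongoClient("mongodb://localhost:27017/")["sherlock"]
pipeline = [
    {"$unwind": "$frequency"},                                    # one entry per (file, word)
    {"$match": {"frequency.word": "watson"}},
    {"$group": {"_id": None, "total": {"$sum": "$frequency.count"}}},
]
print(list(db.wordfrequency.aggregate(pipeline)))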



A view using RoboMongo: https://robomongo.org/ 



A view of the sentence corpus from one of the files: 



After I did this, I went on to design a text mining setup using Python, MongoDB's text search features, and MapReduce techniques. I will write about them in another post.





