Saturday, March 11, 2017

An Optimized Text Storage Setup with Python and MongoDB


How many times were the names "Watson" and "Sherlock" used in the Sherlock Holmes novels?

I don't expect the reader to answer this question, but if you are curious, the numbers are 3721 and 1557 respectively.

How did I count them? Well, that will be answered eventually.

It all started with a web-scraping exercise I was doing with the site http://sherlock-holm.es/ascii/

The script worked well and I was able to download 67 files, or in short all the novels and stories themed around Sherlock Holmes. Now, I'm not a big fan of old-time detective stories, so the download became the source data for a text mining setup instead.

My Architecture:


Text: source data
Python: scripting language
MongoDB: database ( community edition )
RoboMongo: a cross-platform MongoDB manager ( community edition )

My idea was to deconstruct each of the text files into [ paragraphs, sentences and words ] and take it from there. This way I would be able to do a more specific analysis of the text in a particular file at a later date.
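As a tiny illustration of that deconstruction, here is what the sentence and word split looks like with NLTK's tokenizers; this assumes NLTK is installed with the punkt tokenizer data downloaded, and the sample text is just a made-up placeholder.

from nltk.tokenize import sent_tokenize, word_tokenize

# One-time download of the sentence tokenizer model:
# import nltk; nltk.download('punkt')

sample = "Watson looked up from his paper. Sherlock said nothing at all."

print(sent_tokenize(sample))  # ['Watson looked up from his paper.', 'Sherlock said nothing at all.']
print(word_tokenize(sample))  # ['Watson', 'looked', 'up', ..., 'all', '.']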

I considered MongoDB because:

1. It has a community edition with a decent database storage limit ( check the MongoDB documentation in case you need more details ).

2. It has inbuilt optimization for handling text data ( a small PyMongo example of the text search follows after this list ).

3. It is dynamic and agile with data structures.
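Regarding point 2: MongoDB supports text indexes and the $text query operator, which is what I lean on later for searching. Here is a minimal PyMongo sketch, where the database, collection and field names are purely illustrative.

from pymongo import MongoClient, TEXT

client = MongoClient("mongodb://localhost:27017/")
collection = client["sherlock"]["sentences"]  # hypothetical database / collection

# Create a text index on the field that holds the sentence text
collection.create_index([("sentence", TEXT)])

# Find every document whose indexed text mentions "Watson"
for doc in collection.find({"$text": {"$search": "Watson"}}):
    print(doc["_id"])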

Let's look at the script that I put together in Python for the initial storage.

Structure of the script:

{ if you are interested in the script, please feel free to use it from the link below, with a credit to me if you can }

Link to script:



Packages imported: NLTK, PyMongo

Functions:

1. initiatemongodbconnection - initiates the MongoDB connection

2. insertcollection - inserts a document into a MongoDB collection

3. dicttolist - a custom function to convert a dictionary of k, v pairs into a list

4. uprint - deals with UTF-8 and Latin encoding conversion issues in the text files
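The full script is not reproduced in this post, so here is only a minimal sketch of what the first three helpers could look like with PyMongo; the connection URI, database name and return values are my assumptions rather than the original code, and uprint is left out because it only works around encoding quirks in the source files.

from pymongo import MongoClient

def initiatemongodbconnection(uri="mongodb://localhost:27017/", dbname="sherlock"):
    """Initiate the MongoDB connection and return a database handle."""
    client = MongoClient(uri)
    return client[dbname]

def insertcollection(db, collectionname, document):
    """Insert one document into the named collection and return its _id."""
    return db[collectionname].insert_one(document).inserted_id

def dicttolist(d):
    """Convert a dictionary of k, v pairs into a list of [k, v] pairs."""
    return [[k, v] for k, v in d.items()]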

Main:

  • Initiate the MongoDB connection
  • Initialize the corpus directory ( it can be passed as an argument with a simple change to the script if you want that functionality )
  • Loop through each file to do the following:
    1. Get the word list and paragraph list and store them.

    2. Get the sentence list and tokenize it.

    3. Using the tokens, create a frequency distribution of words. [ This turned out to be a brilliant move later, because knowing the frequency of a particular word was a key step in improving the analysis. ]
    ( Now you know how I counted the word "Watson". )

    4. Store them. ( A minimal sketch of this loop follows below. )
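Here is a minimal sketch of what that loop can look like, reusing the helper sketches above together with NLTK's tokenizers and FreqDist. The corpus folder, collection name, document field names and the blank-line rule for paragraphs are my assumptions for illustration, not necessarily what the original script does.

import os
from nltk.tokenize import sent_tokenize, word_tokenize  # requires the punkt tokenizer data
from nltk.probability import FreqDist

corpusdir = "corpus"  # hypothetical folder holding the downloaded text files
db = initiatemongodbconnection()

for filename in os.listdir(corpusdir):
    if not filename.endswith(".txt"):
        continue
    with open(os.path.join(corpusdir, filename), encoding="utf-8") as f:
        text = f.read()

    paragraphs = [p for p in text.split("\n\n") if p.strip()]  # rough split on blank lines
    sentences = sent_tokenize(text)
    tokens = word_tokenize(text)
    freqdist = FreqDist(t.lower() for t in tokens if t.isalpha())  # word -> count

    document = {
        "file": filename,
        "paragraphs": paragraphs,
        "sentences": sentences,
        "words": tokens,
        "wordfrequency": dicttolist(freqdist),  # list of [word, count] pairs
    }
    insertcollection(db, "sherlockcorpus", document)

Storing the frequency distribution as [word, count] pairs rather than as a raw dictionary also avoids trouble with tokens that would not be valid MongoDB field names, which is presumably part of what dicttolist is for.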


     

Database structure:



A view using RoboMongo: https://robomongo.org/ 



A view of the sentence corpus from one of the files: 



After I did this, I went on to design a text mining setup using Python with MongoDB's text search features and MapReduce techniques. I will write about them in another post.






Monday, August 15, 2016

Word count from text files

It's been a while since I found time to write. Apologies for that. But I'm back with something that will be useful for anyone working with unstructured data.

One of the most common tasks when working with a lot of text files or comments is getting the list of words they contain. I have been using Python a lot recently and thought of doing this in Python. I first went with a naive approach of reading the files, splitting the text into sentences and then into words, but it didn't handle the spaces well.

So I thought of fixing this in a better and more efficient way, and tada, I stumbled upon NLTK. I assume you know NLTK, but for the record, it's the Natural Language Toolkit for Python, and it has an excellent reader for building a corpus. Also, as I'm from a Business Intelligence background, I thought: why not build a corpus from all the text files? The result is in any case so pristine that it can be fed straight into a database.

I will quickly summarize the steps . 


1. Import the necessary libraries :

import os  # for traversing the directories
import pandas as pd  # for its nice export functions and data pivoting
import sys # error handling 
import traceback # error handling 
import numpy as np  # arithmetic and counting operations 
from nltk.corpus.reader.plaintext import PlaintextCorpusReader  # the rockstar library that does stuff


2. Set working directory: It's always, and I repeat always, a good habit to explicitly set the working directory as part of the script. I always do it.

corpusdir = "<<your path>>"

3. Change to your working directory

os.chdir(corpusdir)

4. Let's get down to business

customcorpus = PlaintextCorpusReader(corpusdir, '.*')  # create a corpus from all the text files within the folder
duplicatelist = []
uniquelist = []
for root, dirs, files in os.walk(os.getcwd()):
    for file in files:
        if file.endswith(".txt"):
            filetoopen = os.path.join(file)  # note: assumes the .txt files sit directly inside corpusdir
            if os.path.getsize(file) > 0:  # ensure that the file size is greater than 0
                wordlist = customcorpus.words(filetoopen)  # pass the file name to get the words in that file
                wordlist = list(wordlist)  # convert it into a list
                try:
                    duplicatelist = duplicatelist + wordlist  # every word, duplicates included
                    uniquelist = list(set(uniquelist + wordlist))  # distinct words only
                    printstuff = ("amended " + '%s') % (filetoopen)
                    print(printstuff)
                except AssertionError:
                    _, _, tb = sys.exc_info()
                    traceback.print_tb(tb)
                    tb_info = traceback.extract_tb(tb)
                    filename, line, func, text = tb_info[-1]
                    print('An error occurred on line {} in statement {}'.format(line, text))
                    exit(1)

df = pd.DataFrame(duplicatelist, columns=['word'])
df.to_csv("wordlist.csv", index=False)
df['counter'] = 1
dfp = pd.pivot_table(df, values='counter', index='word', aggfunc=np.sum)  # one row per unique word with its count
non_duplicate_filename = "nonDuplicatewordlist_withcount.csv"

dfp.to_csv(non_duplicate_filename, index=True, header=True)  # it's a Series, mate, not a data frame

Tada: you get two additional CSV files, wordlist.csv [ which is all the words, straight and simple ] and nonDuplicatewordlist_withcount.csv, which contains a de-duplicated version of the words along with the count of each word.
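If you want to use the count file straight away, here is a quick sketch of looking up a single word with pandas; it assumes the nonDuplicatewordlist_withcount.csv file produced by the script above.

import pandas as pd

# First column is the word, second is its count
# (the exact count header can vary with your pandas version)
counts = pd.read_csv("nonDuplicatewordlist_withcount.csv", index_col=0).iloc[:, 0]

print(counts.get("Watson", 0))                       # how often a particular word appears
print(counts.sort_values(ascending=False).head(10))  # the ten most frequent words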

I can think of a lot of ways this could be put to use, but I will save them for later.

If you find this useful, please take the time to like this post, and if you really don't mind, you can share the credit with me :)

Thanks.









Thursday, December 17, 2015

Time Series Forecasting with R - Oil Price Prediction

"Those who have knowledge, don't predict. Those who predict, don't have knowledge. "

--Lao Tzu, 6th Century BC Chinese Poet 

A thought-provoking statement by Lao Tzu.

Wikipedia states that Forecasting is the process of making predictions of the future based on past and present data and analysis of trends. A commonplace example might be estimation of some variable of interest at some specified future date. Prediction is a similar, but more general term. Both might refer to formal statistical methods employing time series, cross-sectional or longitudinal data, or alternatively to less formal judgmental methods. Usage can differ between areas of application: for example, in hydrology, the terms "forecast" and "forecasting" are sometimes reserved for estimates of values at certain specific future times, while the term "prediction" is used for more general estimates, such as the number of times floods will occur over a long period.

https://en.wikipedia.org/wiki/Forecasting 
 
Today I'm going to discuss time series forecasting. Many experts have written about this topic, my favorite being Professor Rob Hyndman: http://robjhyndman.com/hyndsight/

R is pretty neat with its graphical capabilities to aid visualisation as we go along.

Time series Forecasting:

Forecasting is almost always done alongside a time series. This is because the forecasting algorithms depend on data that captures the trends in the relevant metric over a time slice such as day, week or month (and the list is long).

Let's see how to use time series forecasting methods to predict oil prices.

Some thoughts before we proceed.

1. The metric we want to forecast should have a time-slice attached to it.
2. Forecast methods in R work with the following components:

     a. Seasonality
     b. Randomness
     c. Trend

Therefore it is advisable to have at least 48 data points to achieve decent accuracy in your prediction.

3. Even though there is no restriction on the time slice, accuracy generally starts improving when the data is at a month level. Having said that, you can still experiment with week or day level data.

Getting into the business.

Step 1: You need the following packages to proceed with forecasting.


# My Favorite Reference 
#http://a-little-book-of-r-for-time-series.readthedocs.org/en/latest/src/timeseries.html
#Library
library(TTR)
library(forecast)


If you don't have the packages you can install them using the following command.

install.packages("<packagename>")

Subsequently you will have to include them in the environment by using the library command above.

Step 2: It's my habit to set the working directory appropriately before proceeding with any analysis. This way I ensure that all my relevant work is stored in the same folder.

#set working directory
setwd("D:/DataScience/Exercises/TimeSeriesForecasting/OilPrice")

# I used oil price data from the following web page. 
# http://www.indexmundi.com/ # Please like the page in facebook. # Don't worry I have attached the data set that I have used at the end of this # blog

#get data.
oil <- read.csv("MonthlyOilPrice.csv")

#head - My habit to take a quick glance at data.
head(oil)


Step 3: Create the time series as follows. See below, where I'm creating a time series from the Price column along with an explicit declaration of the start and end.

The syntax is c(year, month). I knew my data ran from 1-Jan-1986 to 1-Nov-2015; you can edit it according to your dataset.

#create timeseries
oilts <- ts(oil$Price, start=c(1986, 1), end=c(2015, 11), frequency=12)

# A quick plot
plot.ts(oilts)







Step 4: As I said earlier, we will now try to visualize the components: seasonality, trend and randomness.


#decompose to get seasonality,observed,random and trend
oiltscomp <- decompose(oilts)

plot(oiltscomp)

 

Step 5: I am attempting to remove the seasonality factor now.


#removing seasonality easy isn't it ?
oiltscompadjust <- oilts - oiltscomp$seasonal

plot(oiltscompadjust)


 
 
Step 6: If you would like to play with the smoothing parameters, you can tweak the alpha, beta and gamma values in the HoltWinters function. By default the function estimates these values by minimising the prediction error on your data, so tweak them only if you want to see how your data responds.


#Forecasting Including Smoothing

#Apply HoltWinters smoothing

# oiltsforecast <-HoltWinters(logoilts)
oiltsforecast <-HoltWinters(oilts)
plot(oiltsforecast)







 
Step 7: Finally generate forecasts.





# Use the variable h to decide the number of periods to forecast.
# Use level to decide the confidence intervals; by default they are 80% and 95%.
oiltsforecast2 <- forecast.HoltWinters(oiltsforecast, level=c(80,95), h = 12)

# plot: the blue line shows the forecast, dark grey the 80% confidence interval, light grey the 95% confidence interval
# plot.forecast(oiltsforecast2)
plot.forecast(oiltsforecast2,type="h",main="Oil Price Forecasting",xlab="Year-Month",ylab="$s per barrel")

 Observe the following graph where the forecast values are shown in a blue line with 80% and  95% confidence intervals in two different colors.





 
Step 8: Now that we have generated the forecast, let's zoom in and take a closer look.


#Just focus on the forecasted values by setting the include argument to 0
plot.forecast(oiltsforecast2,include=0,type="h",main="Oil Price Forecasting",xlab="Year-Month",ylab="$s per barrel")
 

 
Validation of Quality of Forecast: 

There are two ways to measure the accuracy.

One is reactive and the other is proactive.

Measuring your forecast against the actual value once it arrives is the reactive way. For example, let's say you have forecast a profit of x for the month of Apr 2016; you will then have to wait until then to see its accuracy. [ Reactive, not a good idea. ] The proactive way is to validate against the data you already have, for example by comparing the model's fitted values with the observed history, before you rely on the forecast.

There are multiple methods to validate the accuracy of a forecasting method. I prefer using MAD ( Mean Absolute Deviation ).

See below :

#Mean Absolute Deviation method to validate deviation of forecast
#Note you will have to fit the forecast before validating.
#(Strictly speaking, base R's mad() computes the median absolute deviation;
# accuracy(oiltsforecast2) from the forecast package reports MAE and other error measures.)

mad(fitted(oiltsforecast2))
## [1] 16.40478

Link for dataset and code

https://drive.google.com/folderview?id=0Bw4afn-u-hxjYjFfVV9MSW1XbU0&usp=sharing

Once again, my thanks to

Professor Rob Hyndman for the forecast package, and
http://www.indexmundi.com for the dataset.