It's been a while since I found time to write, and apologies for that. But I'm back with something that should be useful for anyone working with unstructured data.
One of the most common tasks when working with a lot of text files or comments is extracting the list of words they contain. I have been using Python a lot recently, so I decided to do this in Python. My first attempt was the naive route: read the files, split the text into sentences, and then split the sentences into words. It didn't go well, mostly because of stray spaces and punctuation.
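To make the problem concrete, here is a rough sketch of the naive approach I mean (the sample text is made up purely for illustration):
text = "Hello,  world. This is a test."
sentences = text.split(".")  # crude sentence split on the full stop
words = [w for s in sentences for w in s.split(" ")]  # split each sentence on a single space
print(words)
# ['Hello,', '', 'world', '', 'This', 'is', 'a', 'test', '']
# punctuation stays glued to the words and double spaces leave empty strings behind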
So I looked for a better, more efficient way, and ta-da, I stumbled upon NLTK. I assume you already know NLTK, but for completeness it is the Natural Language Toolkit for Python, and it has an excellent way to build a corpus. Coming from a Business Intelligence background, I thought: why not build a corpus from all the text files? The result is clean enough to be fed straight into a database.
I will quickly summarize the steps.
1. Import the necessary libraries:
import os # for traversing the directories
import pandas as pd # for its handy export and pivoting functions
import sys # error handling
import traceback # error handling
import numpy as np # arithmetic and counting operations
from nltk.corpus.reader.plaintext import PlaintextCorpusReader # the rockstar class that does the heavy lifting
2. Set the working directory: it is always, and I repeat always, a good habit to explicitly set the working directory as part of the script. I always do it.
corpusdir = "<< your path>>" # replace with the folder that contains your .txt files
3. Change to your working directory
os.chdir(corpusdir)
4. Let's get down to business
customcorpus = PlaintextCorpusReader(corpusdir, '.*') # create a corpus from all the files within the folder
duplicatelist = []
uniquelist = []
for root, dirs, files in os.walk(os.getcwd()):
    for file in files:
        if file.endswith(".txt"):
            fullpath = os.path.join(root, file)
            filetoopen = os.path.relpath(fullpath, corpusdir) # file id relative to the corpus root
            if os.path.getsize(fullpath) > 0: # make sure the file is not empty
                wordlist = customcorpus.words(filetoopen) # the tokenised words of this file
                wordlist = list(wordlist) # convert the corpus view into a plain list
                try:
                    duplicatelist = duplicatelist + wordlist # running list, duplicates included
                    uniquelist = list(set(uniquelist + wordlist)) # running list, duplicates removed
                    print("amended %s" % filetoopen)
                except AssertionError:
                    _, _, tb = sys.exc_info()
                    traceback.print_tb(tb)
                    tb_info = traceback.extract_tb(tb)
                    filename, line, func, text = tb_info[-1]
                    print('An error occurred on line {} in statement {}'.format(line, text))
                    sys.exit(1)
df = pd.DataFrame(duplicatelist, columns=['word'])
df.to_csv("wordlist.csv", index=False)
df['counter'] = 1
dfp = pd.pivot_table(df, values='counter', index='word', aggfunc=np.sum) # count how many times each word appears
non_duplicate_filename = "nonDuplicatewordlist_withcount.csv"
dfp.to_csv(non_duplicate_filename, index=True, header=True) # keep the word index and the count header in the output
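As a quick cross-check on the aggregation above, pandas' value_counts gives the same word frequencies in one step (just a sketch using the same df as above, not part of the original script):
counts = df['word'].value_counts() # Series indexed by word, values are the frequencies
print(counts.head())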
Ta-da: you get two additional CSV files, wordlist.csv [which is simply every word, duplicates and all] and nonDuplicatewordlist_withcount.csv, which contains the de-duplicated words along with the count of each word.
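And since the output is clean enough to go straight into a database, here is a minimal sketch of how that could look, assuming a local SQLite file; the database and table names are illustrative, not part of the script above:
import sqlite3 # assumption: a local SQLite database is enough for this illustration
conn = sqlite3.connect("wordcounts.db") # hypothetical database file
counts = pd.read_csv(non_duplicate_filename) # the word/count file produced above
counts.to_sql("word_counts", conn, if_exists="replace", index=False) # load it into a table
conn.close()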
I can think of a lot of ways to put this to use, but I will save those for a later post.
If you find this useful, please take a moment to like this post, and if you really don't mind, you can share it and credit me :)
Thanks.