Monday, August 15, 2016

Word count from text files

It's been a while since I had time to write. Apologies for that. But I'm back with something that will be useful for anyone working with unstructured data.

One of the most common tasks while working with a lot of text files or comments is getting the list of words they contain. I have been using Python a lot lately, so I thought of doing this in Python. I went with the native approach of reading the CSV files, splitting the text into sentences, and then splitting each sentence into words. But it didn't cope well with the spaces.
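
To give a feel for the problem, here is a minimal sketch of how plain string splitting can trip over spaces. The sample sentence is made up, and this only illustrates the general issue; it is not the original script.

```python
# Hypothetical sample text with leading, trailing and doubled spaces
sentence = "  Word count  from text files "

# Splitting on a single space keeps an empty string wherever spaces repeat
naive_words = sentence.split(" ")
print(naive_words)

# Splitting with no argument collapses any run of whitespace
clean_words = sentence.split()
print(clean_words)
```

The first list is littered with empty strings that would each be counted as a "word", which is the kind of clean-up a proper tokenizer spares you.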

So I thought of fixing this in a better and more efficient way. And tada, I stumbled upon NLTK. I assume you know NLTK, but for the record, it's the Natural Language Toolkit for Python. It has an excellent way to build a corpus. Also, as I'm from a Business Intelligence background, I thought: why not build a corpus from all the text files? The result is clean enough that it can be fed straight into a database.
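
Here is a minimal sketch of what building a corpus looks like, assuming NLTK is installed. The folder, the file name sample.txt and its contents are made up for illustration.

```python
import os
import tempfile

from nltk.corpus.reader.plaintext import PlaintextCorpusReader

# Build a throwaway folder with one made-up text file (hypothetical sample data)
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "sample.txt"), "w", encoding="utf8") as handle:
        handle.write("Hello, world! Hello again.")

    # Treat every .txt file in the folder as part of the corpus
    demo_corpus = PlaintextCorpusReader(demo_dir, r".*\.txt")
    demo_fileids = demo_corpus.fileids()
    demo_words = list(demo_corpus.words("sample.txt"))

print(demo_fileids)
print(demo_words)
```

Note that the default word tokenizer returns punctuation marks as separate tokens, so they will show up in the word list alongside the actual words.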

I will quickly summarize the steps.

1. Import the necessary libraries:

import os  # for traversing the directories
import pandas as pd  # for its nice export functions and data pivoting
import sys  # error handling
import traceback  # error handling
import numpy as np  # arithmetic and counting operations
from nltk.corpus.reader.plaintext import PlaintextCorpusReader  # the rockstar library that does the real work

2. Set the working directory: It's always, and I repeat always, a good habit to explicitly set the working directory as part of the script. I always do it.

corpusdir = "<< your path>>"

3. Change to your working directory
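
A minimal sketch of this step, assuming corpusdir from step 2 points at a folder that exists. Here it is set to the current directory purely as a stand-in so the snippet runs as-is.

```python
import os

corpusdir = os.getcwd()  # stand-in only: replace with the path from step 2

os.chdir(corpusdir)  # make the corpus folder the working directory
print(os.getcwd())   # confirm where the script is now running
```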


4. Let's get down to business

customcorpus = PlaintextCorpusReader(corpusdir, '.*')  # create a corpus from all the files within the folder
duplicatelist = []  # every word from every file, duplicates included
uniquelist = []  # every distinct word seen so far
for root, dirs, files in os.walk(os.getcwd()):
    for file in files:
        if file.endswith(".txt"):
            fullpath = os.path.join(root, file)
            if os.path.getsize(fullpath) > 0:  # ensure that the file size is greater than 0
                # file ids are paths relative to corpusdir, with forward slashes
                filetoopen = os.path.relpath(fullpath, corpusdir).replace(os.sep, '/')
                try:
                    wordlist = list(customcorpus.words(filetoopen))  # the words in this file, as a list
                    duplicatelist = duplicatelist + wordlist
                    uniquelist = list(set(uniquelist + wordlist))
                    print("amended %s" % filetoopen)
                except AssertionError:
                    _, _, tb = sys.exc_info()
                    traceback.print_tb(tb)  # Fixed format
                    tb_info = traceback.extract_tb(tb)
                    filename, line, func, text = tb_info[-1]
                    print('An error occurred on line {} in statement {}'.format(line, text))
df = pd.DataFrame(duplicatelist, columns=['word'])
df.to_csv("wordlist.csv", index=False, header=True)  # all the words, straight and simple
df['counter'] = 1  # each occurrence counts once
dfp = pd.pivot_table(df, values='counter', index='word', aggfunc=np.sum)
non_duplicate_filename = "nonDuplicatewordlist_withcount.csv"

dfp.to_csv(non_duplicate_filename, index=True, header=True)  # a Series in older pandas, a DataFrame in newer versions

Tada, you get two additional CSV files: wordlist.csv (all the words, straight and simple) and nonDuplicatewordlist_withcount.csv, which contains a de-duplicated version of the words along with the count of each word.
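
If you want to sanity-check the counts without pandas, the standard library's collections.Counter is a different technique that should give the same numbers. A minimal sketch, with a made-up word list standing in for the one built from the corpus:

```python
from collections import Counter

# Hypothetical stand-in for the full word list built from the corpus
sample_words = ["data", "is", "data", ",", "text", "is", "data"]

word_counts = Counter(sample_words)  # maps each distinct word to its count

# Most frequent words first
for word, count in word_counts.most_common():
    print(word, count)
```

The number of keys in word_counts should match the number of rows in nonDuplicatewordlist_withcount.csv when both are built from the same word list.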

I can think of a lot of ways to put this to use, but I will save them for a later post.

If you find this useful, please take the time to like this post, and if you really don't mind, you can share the credit with me :)

