It's been a while since I found time to write, and apologies for that. But I'm back with something that should be useful for anyone working with unstructured data.
One of the most common tasks when working with a lot of text files or comments is extracting the list of words they contain. I have been using Python a lot recently, so I decided to do this in Python. My first attempt was the naive route: read the files, split the text into sentences, and then split the sentences into words. It didn't go well, mostly because of stray spaces and punctuation.
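To make the problem concrete, here is a rough sketch of the naive approach I mean (the sample text is made up purely for illustration):
text = "Hello,  world. This is a test."
sentences = text.split(".")  # crude sentence split on the full stop
words = [w for s in sentences for w in s.split(" ")]  # split each sentence on a single space
print(words)
# ['Hello,', '', 'world', '', 'This', 'is', 'a', 'test', '']
# punctuation stays glued to the words and double spaces leave empty strings behind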
So I looked for a better, more efficient way, and ta-da, I stumbled upon NLTK. I assume you already know NLTK, but for completeness it is the Natural Language Toolkit for Python, and it has an excellent way to build a corpus. Coming from a Business Intelligence background, I thought: why not build a corpus from all the text files? The result is clean enough to be fed straight into a database.
I will quickly summarize the steps.
1. Import the necessary libraries:
import os # for traversing the directories
import pandas as pd # for its handy export and pivoting functions
import sys # error handling
import traceback # error handling
import numpy as np # arithmetic and counting operations
from nltk.corpus.reader.plaintext import PlaintextCorpusReader # the rockstar class that does the heavy lifting
2. Set the working directory: it is always, and I repeat always, a good habit to explicitly set the working directory as part of the script. I always do it.
corpusdir = "<< your path>>" # replace with the folder that contains your .txt files
3. Change to your working directory
os.chdir(corpusdir)
4. Let's get down to business
customcorpus = PlaintextCorpusReader(corpusdir, '.*') # create a corpus from all the files within the folder
duplicatelist = []
uniquelist = []
for root, dirs, files in os.walk(os.getcwd()):
    for file in files:
        if file.endswith(".txt"):
            fullpath = os.path.join(root, file)
            filetoopen = os.path.relpath(fullpath, corpusdir) # file id relative to the corpus root
            if os.path.getsize(fullpath) > 0: # make sure the file is not empty
                wordlist = customcorpus.words(filetoopen) # the tokenised words of this file
                wordlist = list(wordlist) # convert the corpus view into a plain list
                try:
                    duplicatelist = duplicatelist + wordlist # running list, duplicates included
                    uniquelist = list(set(uniquelist + wordlist)) # running list, duplicates removed
                    print("amended %s" % filetoopen)
                except AssertionError:
                    _, _, tb = sys.exc_info()
                    traceback.print_tb(tb)
                    tb_info = traceback.extract_tb(tb)
                    filename, line, func, text = tb_info[-1]
                    print('An error occurred on line {} in statement {}'.format(line, text))
                    sys.exit(1)
df = pd.DataFrame(duplicatelist, columns=['word'])
df.to_csv("wordlist.csv", index=False)
df['counter'] = 1
dfp = pd.pivot_table(df, values='counter', index='word', aggfunc=np.sum) # count how many times each word appears
non_duplicate_filename = "nonDuplicatewordlist_withcount.csv"
dfp.to_csv(non_duplicate_filename, index=True, header=True) # keep the word index and the count header in the output
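As a quick cross-check on the aggregation above, pandas' value_counts gives the same word frequencies in one step (just a sketch using the same df as above, not part of the original script):
counts = df['word'].value_counts() # Series indexed by word, values are the frequencies
print(counts.head())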
Ta-da: you get two additional CSV files, wordlist.csv [which is simply every word, duplicates and all] and nonDuplicatewordlist_withcount.csv, which contains the de-duplicated words along with the count of each word.
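And since the output is clean enough to go straight into a database, here is a minimal sketch of how that could look, assuming a local SQLite file; the database and table names are illustrative, not part of the script above:
import sqlite3 # assumption: a local SQLite database is enough for this illustration
conn = sqlite3.connect("wordcounts.db") # hypothetical database file
counts = pd.read_csv(non_duplicate_filename) # the word/count file produced above
counts.to_sql("word_counts", conn, if_exists="replace", index=False) # load it into a table
conn.close()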
I can think of a lot of ways to put this to use, but I will save those for a later post.
If you find this useful, please take a moment to like this post, and if you really don't mind, you can share it and credit me :)
Thanks.