Text Comparing Program

April 07, 2022, at 03:20 AM

I'm making a program that is supposed to compare text files by returning a list of all the words that appear in each file and the number of times each one appears. I have to disregard a list of words called stopwords, so they are never counted. For the first part, I need to check whether each word is in the stopwords. If it is, I don't count it. If it isn't, I add a new row for that word to a DataFrame (assuming the word isn't already in the DataFrame) and increment its frequency by 1. Each text file gets its own column. I'm a little stuck on this part. I have bits of the code already, but I need to fill in the blanks. Here is what I have so far:

from tkinter.filedialog import askdirectory
import glob
import os 
import pandas as pd

def main():
    df = pd.DataFrame(columns =["TEXT FILE NAMES HERE..."])
    data_directory = askdirectory(initialdir = "/School_Files/CISC_121/Assignments/Assignment3/Data_Files")
    stopwords = open(os.getcwd() + "/" + "StopWords.txt") 

    text_files = glob.glob(data_directory + "/" + "*.txt")

    for f in text_files:
        infile = open(f, "r", encoding = "UTF-8")
        #now read the file and do all the word-counting etc...
        lines = infile.readlines()
        for line in lines:
            x = 0
            words = line.split()
            while (x < len(words)):
                """
                Check if the word is in the stopwords
                If it isn't, then add the word into a row in a dataframe, for the first occurence, then
                increment the value by 1
                Have a column for each book 
                """
                for line in infile:
                    if word in line:
                        found = True
                        word +=1 
                    else:
                        found = False
                x = x+1
main()

If anyone can help me finish this section I'd really appreciate it. Please show the change in code. Thanks in advance!

Answer 1

As I understand it, you just want to count the occurrences of each word. For that you could use a dictionary instead of a DataFrame.

As for the stopwords, read them into a list.

Try the code below.

import os

count_dictionary = {}
# Read the stopwords into a list, one word per line
with open(os.getcwd() + "/" + "StopWords.txt") as f:
    stopwords = f.read().splitlines()
#your code
while (x < len(words)):
    word = words[x]
    # Count only the words that are not stopwords
    if word not in stopwords:
        if word in count_dictionary:
            count_dictionary[word] += 1
        else:
            count_dictionary[word] = 1
    x = x + 1
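
Since the question asks for one column per text file, here is a minimal sketch of the same idea with a separate count per file. It swaps the plain dictionary for `collections.Counter`. The `count_words` helper, the file names, and the sample lines are all made up for illustration. It also lowercases every word, which is an assumption the question doesn't state, and it doesn't strip punctuation.

```python
from collections import Counter

def count_words(lines, stopwords):
    """Count each word in `lines`, skipping any word found in `stopwords`."""
    # Assumption: matching is case-insensitive, so everything is lowercased
    stop = {w.lower() for w in stopwords}  # a set gives fast membership tests
    counts = Counter()
    for line in lines:
        for word in line.split():
            word = word.lower()
            if word not in stop:
                counts[word] += 1
    return counts

# Hypothetical sample data standing in for the lines of two text files
books = {
    "book1.txt": ["The cat sat on the mat", "the cat ran"],
    "book2.txt": ["A dog sat"],
}
stopwords = ["the", "on", "a"]

# One Counter per file, keyed by file name
per_file = {name: count_words(lines, stopwords) for name, lines in books.items()}
print(per_file["book1.txt"]["cat"])  # 2
```

If you still need a DataFrame at the end, something like `pd.DataFrame(per_file).fillna(0)` should give you one row per word and one column per file, because a dict of dicts maps the outer keys to columns.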
           