Problem Formulation
💡 Problem Formulation: The goal is to determine how many times a given word appears in a text file.
Given:
- A text file (example.txt) containing a body of text.
- A specific word to search for within this text (e.g., "Python").
Goal:
- Write a Python program that reads the content of example.txt.
- Count and return the number of times the specified word ("Python") appears in the text.
- The word comparison should be case-insensitive, meaning "Python", "python", and "PYTHON" all count as occurrences of the same word.
- Words are sequences of characters separated by whitespace or punctuation marks. For instance, "Python," (with a comma) and "Python" (without a comma) should be treated as the same word.
Example: Consider the text file example.txt with the following content:
💾 example.txt
Python is an interpreted, high-level and general-purpose programming language. Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects.
Python is dynamically typed and garbage-collected. It supports multiple programming paradigms, including structured (particularly, procedural), object-oriented, and functional programming. Python is often described as a "batteries included" language due to its comprehensive standard library.
If the word to search for is "Python", the program should output a count of 4: the word appears four times in the text, including once in the possessive "Python's", where the apostrophe acts as a separator under the punctuation rule above.
Method 1: Using the split() Function
The simplest way to count a specific word in a text file is to read the file's content into a string, convert it to lowercase (to make the search case-insensitive), and then use the split() method to break the string into words. After that, you can use the list's count() method to find the occurrences of the specified word.
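def count_word_in_file(file_path, word):
    with open(file_path, 'r') as file:
        # Lowercase everything so the comparison is case-insensitive
        text = file.read().lower()
    # split() with no arguments separates on any run of whitespace
    words = text.split()
    return words.count(word.lower())

print(count_word_in_file('example.txt', 'Python'))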
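This code opens the file example.txt in read mode, reads its content, and converts it to lowercase. Then it splits the content into a list of words and counts how many times the specified word appears in the list. Because split() separates on whitespace only, tokens with attached punctuation such as "python," or "python's" are not counted, so this method returns 3 for the sample text.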
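The punctuation gap in the split() approach can be closed without regular expressions by trimming punctuation from the edges of each token. The following is a minimal sketch under that assumption; the helper name count_word_split and the sample string are hypothetical and not part of the original program.

```python
import string

def count_word_split(text, word):
    """Count `word` among whitespace-separated tokens, ignoring edge punctuation."""
    target = word.lower()
    # strip() removes punctuation only at the start and end of each token
    tokens = (tok.strip(string.punctuation) for tok in text.lower().split())
    return sum(1 for tok in tokens if tok == target)

sample = "Python is fun. I like python, and PYTHON likes me."
print(sample.lower().split().count("python"))  # 2, because "python," is a different token
print(count_word_split(sample, "Python"))      # 3
```

Note that inner punctuation is left alone, so a token like "python's" still would not match "python" with this sketch.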
Method 2: Using Regular Expressions
For more control over what constitutes a word (e.g., ignoring punctuation), you can use the re module. This approach lets you define a word more precisely with a regular expression.
import re

def count_word_in_file_regex(file_path, word):
    with open(file_path, 'r') as file:
        text = file.read().lower()
    # \b anchors the match at word boundaries; re.escape() keeps the word literal
    word_pattern = fr'\b{re.escape(word.lower())}\b'
    return len(re.findall(word_pattern, text))

print(count_word_in_file_regex('example.txt', 'Python'))
Here, the re.findall() function searches for all non-overlapping occurrences of the specified word, considering word boundaries (\b), which makes it more accurate for word matching. re.escape() is used to escape the word, making sure it's treated as a literal string in the regular expression.
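To see what the word boundaries accept and reject, here is a small sketch that works on an in-memory string. It uses the re.IGNORECASE flag instead of lowercasing the text, which is an alternative chosen for illustration; the function name count_word_regex and the sample string are hypothetical.

```python
import re

def count_word_regex(text, word):
    """Count whole-word, case-insensitive matches of `word` in `text`."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    return len(pattern.findall(text))

sample = "Python, python's, PYTHON; but not pythonic or cpython."
print(count_word_regex(sample, "Python"))  # 3
```

The matches are "Python", the "python" in "python's", and "PYTHON". Neither "pythonic" nor "cpython" counts, because a letter sits directly next to the word on one side, so there is no boundary there.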
Method 3: Using the collections.Counter Class
The collections module provides a Counter class that is well suited to counting word frequencies in a text. This method involves reading the text, splitting it into words, and then passing the list of words to Counter to get a dictionary-like object where words are keys and their counts are values.
from collections import Counter
import re

def count_word_in_file_counter(file_path, word):
    with open(file_path, 'r') as file:
        text = file.read().lower()
    # Extract runs of word characters, which drops punctuation
    words = re.findall(r'\b\w+\b', text)
    word_counts = Counter(words)
    # Counter returns 0 for words it has never seen
    return word_counts[word.lower()]

print(count_word_in_file_counter('example.txt', 'Python'))
This method uses a regular expression to extract the words from the text in a way that excludes punctuation, so "python's" contributes the tokens "python" and "s". Then it uses Counter to count occurrences of each word. Finally, it returns the count of the specified word.
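Since Counter tallies every word in one pass, the same object can answer several lookups. The sketch below illustrates this on an in-memory string; the function name word_frequencies and the sample text are hypothetical.

```python
from collections import Counter
import re

def word_frequencies(text):
    """Return a Counter mapping each lowercase word to its number of occurrences."""
    return Counter(re.findall(r"\b\w+\b", text.lower()))

sample = "Python is fun. Python's syntax is clean, and python is popular."
freqs = word_frequencies(sample)
print(freqs["python"])  # 3
print(freqs["is"])      # 3
print(freqs["java"])    # 0, missing words count as zero instead of raising KeyError
```

This is worth the extra memory when you need counts for many words; for a single word, the regular-expression method does less work.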
Method 4: Using a Loop and Dictionary
If you want to avoid importing any additional modules, you can manually count occurrences of each word using a loop and a dictionary. This method provides a good understanding of how word counting works under the hood.
def count_word_in_file_dict(file_path, word):
    word_counts = {}
    with open(file_path, 'r') as file:
        for line in file:
            # Use a separate loop variable so the `word` parameter is not overwritten
            for token in line.lower().split():
                word_counts[token] = word_counts.get(token, 0) + 1
    return word_counts.get(word.lower(), 0)

print(count_word_in_file_dict('example.txt', 'Python'))
This code reads the file line by line, splits each line into words, and uses a dictionary to keep track of word counts. The get() method is used to update counts, providing a default of 0 if the word isn't already in the dictionary.
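When only one word matters, the dictionary can be replaced by a single integer while keeping the line-by-line loop. This is a minimal sketch of that variation; the function name count_word_in_lines is hypothetical, and it accepts any iterable of lines, which includes an open file object.

```python
def count_word_in_lines(lines, word):
    """Count whitespace-separated tokens equal to `word`, ignoring case."""
    target = word.lower()
    count = 0
    for line in lines:
        for token in line.lower().split():
            if token == target:
                count += 1
    return count

lines = ["Python is fun\n", "I like python\n", "No match here\n"]
print(count_word_in_lines(lines, "PYTHON"))  # 2
```

With a real file, the call would look like count_word_in_lines(open_file, 'Python') inside a with open(...) block, so only one line is held in memory at a time.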
Method 5: Using the pandas Library
For those who are working with data analysis, the pandas library can be a powerful tool for text processing. This method involves reading the entire file into a pandas DataFrame and then using pandas methods to count the word occurrences.
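import csv
import pandas as pd

def count_word_in_file_pandas(file_path, word):
    # One row per line; QUOTE_NONE keeps double quotes in the text as literal characters
    df = pd.read_csv(file_path, sep='\t', header=None, quoting=csv.QUOTE_NONE)
    # Join all lines, lowercase, and split into a Series of words
    all_words = pd.Series(df[0].str.cat(sep=' ').lower().split())
    return all_words[all_words == word.lower()].count()

print(count_word_in_file_pandas('example.txt', 'Python'))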
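This code reads the text file as tab-separated data with no header, so each line lands in the first column. It then concatenates all lines into a single string, splits this string into words, and counts the occurrences of the specified word using pandas Series methods. If a line contains a tab character, the text after the tab goes into another column and is not counted.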
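If the words are already in a Series, value_counts() gives the frequency of every word at once. The sketch below assumes pandas is installed and works on an in-memory string; the function name word_counts_series and the set of stripped punctuation characters are illustrative choices, not part of the original method.

```python
import pandas as pd

def word_counts_series(text):
    """Return a Series of word frequencies, indexed by lowercase word."""
    words = pd.Series(text.lower().split())
    # Trim common punctuation from the edges of each token before counting
    return words.str.strip('.,;:!?"\'()').value_counts()

sample = "Python is fun. I like python, and PYTHON likes me."
counts = word_counts_series(sample)
print(int(counts.get("python", 0)))  # 3
print(int(counts.get("java", 0)))    # 0
```

Series.get() returns the supplied default for a word that never appears, which avoids a KeyError on lookup.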
Bonus One-Liner Method 6: Using Path and Method Chaining
For a succinct approach, you can read the file with the Path object from the pathlib module and chain string methods on its content. This one-liner is compact and Pythonic, though like Method 1 it splits on whitespace only.
from pathlib import Path

def count_word_in_file_oneliner(file_path, word):
    return Path(file_path).read_text().lower().split().count(word.lower())

print(count_word_in_file_oneliner('example.txt', 'Python'))
This method reads the file content as a string, lowers its case, splits it into words, and counts the occurrences of the specified word, all in one line.
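The Path approach also combines with the word-boundary pattern from Method 2 when punctuation matters. The following sketch does that and passes an explicit encoding to read_text(); the function name count_word_path, the UTF-8 assumption, and the temporary sample file are all hypothetical choices made so the example runs on its own.

```python
import re
import tempfile
from pathlib import Path

def count_word_path(file_path, word):
    """Count whole-word, case-insensitive matches of `word` in a UTF-8 text file."""
    text = Path(file_path).read_text(encoding="utf-8")
    return len(re.findall(rf"\b{re.escape(word)}\b", text, flags=re.IGNORECASE))

# Write a small sample file to a temporary directory so the sketch is self-contained
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.txt"
    sample_path.write_text("Python is fun. I like python, and PYTHON likes me.", encoding="utf-8")
    result = count_word_path(sample_path, "Python")

print(result)  # 3
```

Stating the encoding explicitly avoids depending on the platform default, which can differ between systems.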