Summary: There are four different ways to split a text into sentences:
- Using the nltk module
- Using re.split()
- Using re.findall()
- Using replace()
Minimal Example
text = "God is Great! I won a lottery."

# Method 1: nltk
from nltk.tokenize import sent_tokenize
print(sent_tokenize(text))

# Method 2: re.split()
import re
res = [x for x in re.split(r"[.!?]", text) if x != ""]
print(res)

# Method 3: re.findall()
res = re.findall(r"[^.!?]+", text)
print(res)

# Method 4: replace()
def splitter(txt, delim):
    for d in delim:
        txt = txt.replace(d, ',')
    res = txt.split(',')
    res.pop()  # drop the trailing empty string
    return res

sep = ['.', '!']
print(splitter(text, sep))

# Output of Methods 2-4: ['God is Great', ' I won a lottery']
# Method 1 keeps the punctuation: ['God is Great!', 'I won a lottery.']
Problem Formulation
Problem: Given a string containing several sentences, how will you split it into a list of sentences?
Example: Let's visualize the problem with the help of an example.
# Input
text = "This is sentence 1. This is sentence 2! This is sentence 3?"

# Output
['This is sentence 1', ' This is sentence 2', ' This is sentence 3']
Method 1: Using nltk.tokenize
Natural Language Processing (NLP) has a process known as tokenization, in which a large quantity of text is divided into smaller parts called tokens. The Natural Language Toolkit (NLTK) provides the nltk.tokenize module for this purpose, and its sent_tokenize function splits a given text into sentences.
Code:
from nltk.tokenize import sent_tokenize

text = "This is sentence 1. This is sentence 2! This is sentence 3?"
print(sent_tokenize(text))
# ['This is sentence 1.', 'This is sentence 2!', 'This is sentence 3?']
Explanation:
- Import the sent_tokenize function from the nltk.tokenize module.
- sent_tokenize parses the given text and breaks it into individual sentences at punctuation such as periods, exclamation marks, and question marks. Unlike the other methods in this article, it keeps the punctuation at the end of each sentence.
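Caution: You might get an error the first time you use the nltk package, because the tokenizer data is downloaded separately. So, here's the entire process to install nltk on your system.
Install nltk using: pip install nltk
Then go ahead and type the following in your Python shell:
import nltk
nltk.download('punkt')
If a LookupError still appears, its message names the missing resource (newer NLTK releases may ask for 'punkt_tab'); download that one the same way.
That's it! You are now ready to use sent_tokenize in your code.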
Method 2: Using re.split
The re.split(pattern, string) method matches all occurrences of the pattern in the string and divides the string along the matches, resulting in a list of the strings between the matches. For example, re.split('a', 'bbabbbab') results in the list of strings ['bb', 'bbb', 'b'].
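Approach: Split the given string on the sentence-ending punctuation marks by listing them in a character class: re.split(r"[.!?]", text). Inside square brackets each character stands for itself, so neither the either-or (|) metacharacter nor escaping is needed. A pattern such as "[//.|//!|//?]" also works on this input, but it additionally splits on / and |, because those characters are literal inside the brackets. Whenever re.split encounters any of the listed characters, it splits the string there. The condition x != "" filters out empty strings, such as the one left after the final punctuation mark.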
Code:
import re

text = "This is sentence 1. This is sentence 2! This is sentence 3?"
# The character class [.!?] matches any one sentence terminator.
res = [x for x in re.split(r"[.!?]", text) if x != ""]
print(res)
# ['This is sentence 1', ' This is sentence 2', ' This is sentence 3']
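The split above discards the punctuation and leaves a leading space on every sentence after the first. If you need the terminators kept, one option is to split on the whitespace that follows them, using a lookbehind. This is only a minimal sketch of that idea; the helper name split_keep_punct is made up for illustration and it assumes sentences are separated by whitespace.

```python
import re

def split_keep_punct(text):
    """Split on whitespace that follows '.', '!' or '?' (hypothetical helper)."""
    # (?<=[.!?]) is a lookbehind: it requires a terminator right before the
    # whitespace but does not consume it, so the terminator stays attached
    # to its sentence.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    # An empty input yields [''], so drop empty pieces.
    return [p for p in parts if p]

text = "This is sentence 1. This is sentence 2! This is sentence 3?"
print(split_keep_punct(text))
# ['This is sentence 1.', 'This is sentence 2!', 'This is sentence 3?']
```

Because only the whitespace is consumed, the result needs no extra stripping.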
Recommended Read: Python Regex Split
Method 3: Using findall
The re.findall(pattern, string) method scans the string from left to right, searching for all non-overlapping matches of the pattern. It returns a list of the matched strings in the order in which they were found.
Code:
import re

text = "This is sentence 1. This is sentence 2! This is sentence 3?"
res = re.findall(r"[^.!?]+", text)
print(res)
# ['This is sentence 1', ' This is sentence 2', ' This is sentence 3']
Explanation: In the expression re.findall(r"[^.!?]+", text), the character class [^.!?] matches any single character except '.', '!', and '?' (the leading ^ negates the class), and + repeats it one or more times. Each match is therefore a run of characters that ends just before the next punctuation mark. When a punctuation mark is reached, the current match ends, the mark is skipped, and the search for the next run of characters begins.
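If the matches should keep their closing punctuation and lose the leading space, the pattern can be extended with an optional terminator and each match can be stripped. The following is a small sketch under that assumption; find_sentences is a hypothetical helper name, not something from the re module.

```python
import re

def find_sentences(text):
    """Return sentences with their end punctuation, trimmed (hypothetical helper)."""
    # [^.!?]+ matches the sentence body; [.!?]? optionally matches one
    # terminator directly after it, so text without a final terminator
    # is still returned.
    matches = re.findall(r"[^.!?]+[.!?]?", text)
    # strip() removes the space left over after each terminator;
    # whitespace-only matches are dropped.
    return [m.strip() for m in matches if m.strip()]

text = "This is sentence 1. This is sentence 2! This is sentence 3?"
print(find_sentences(text))
# ['This is sentence 1.', 'This is sentence 2!', 'This is sentence 3?']
```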
Related Read: Python re.findall() - Everything You Need to Know
Method 4: Using replace
Approach: The idea here is to replace all the punctuation marks ('!', '?', and '.') present in the given string with a comma (,) and then split the modified string on the comma to get the list of substrings. Because the text ends with a punctuation mark, the last element returned will be an empty string. You can use the pop() method to remove this last element from the list of substrings. Note that this approach assumes the text contains no commas of its own and ends with one of the delimiters.
Code:
def splitter(txt, delim):
    # Replace every delimiter with a comma, then split on the comma.
    for d in delim:
        txt = txt.replace(d, ',')
    res = txt.split(',')
    res.pop()  # drop the empty string left after the final delimiter
    return res

sep = ['.', '!', '?']
text = "This is sentence 1. This is sentence 2! This is sentence 3?"
print(splitter(text, sep))
# ['This is sentence 1', ' This is sentence 2', ' This is sentence 3']
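The comma-based version breaks on text that already contains commas, and pop() removes a real sentence when the text does not end with a delimiter. A possible variant, sketched here with the hypothetical name splitter_safe, swaps the comma for a placeholder character that is assumed never to occur in the text and filters out empty pieces instead of popping.

```python
def splitter_safe(txt, delims, placeholder="\x00"):
    """Variant of splitter(); hypothetical name, placeholder assumed absent from txt."""
    for d in delims:
        txt = txt.replace(d, placeholder)
    # Filter empty or whitespace-only pieces instead of calling pop(), so a
    # text without a trailing delimiter keeps its last sentence.
    return [part for part in txt.split(placeholder) if part.strip()]

text = "Well, this works. So does this"
print(splitter_safe(text, [".", "!", "?"]))
# ['Well, this works', ' So does this']
```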
Related Read: Python String replace()
Conclusion
We have successfully solved the given problem using different approaches. I hope this article helped you in your Python coding journey. Please subscribe and stay tuned for more interesting articles.
Happy coding!