Python | Split Text into Sentences

✨Summary: There are four different ways to split a text into sentences:
🚀 Using the nltk module
🚀 Using re.split()
🚀 Using re.findall()
🚀 Using replace()

Minimal Example

text = "God is Great! I won a lottery."

# Method 1
from nltk.tokenize import sent_tokenize
print(sent_tokenize(text))

# Method 2
import re
res = [x for x in re.split(r"[.!?]", text) if x != ""]
print(res)

# Method 3
res = re.findall(r"[^.!?]+", text)
print(res)

# Method 4
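def splitter(txt, delim):
    # Replace every delimiter with a comma, then split on the commas
    for d in delim:
        txt = txt.replace(d, ',')
    res = txt.split(',')
    res.pop()  # drop the trailing empty string left by the final delimiter
    return res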

sep = ['.', '!']
print(splitter(text, sep))

# Output (Methods 2 to 4): ['God is Great', ' I won a lottery']
# Method 1 keeps the punctuation: ['God is Great!', 'I won a lottery.']

Problem Formulation

Problem: Given a string containing several sentences, how do you split it into a list of individual sentences?

Example: Let’s visualize the problem with the help of an example.

# Input
text = "This is sentence 1. This is sentence 2! This is sentence 3?"
# output
['This is sentence 1', ' This is sentence 2', ' This is sentence 3']

Method 1: Using nltk.tokenize

Tokenization is the Natural Language Processing (NLP) step that divides a large body of text into smaller units called tokens, such as words or sentences. The Natural Language Toolkit (NLTK) provides the nltk.tokenize package, whose sent_tokenize() function splits a given text into sentences.

Code:

from nltk.tokenize import sent_tokenize
text = "This is sentence 1. This is sentence 2! This is sentence 3?"
print(sent_tokenize(text))

# ['This is sentence 1.', 'This is sentence 2!', 'This is sentence 3?']

Explanation: 

  • Import the sent_tokenize function from nltk.tokenize.
  • sent_tokenize() breaks the given text into individual sentences at sentence boundaries marked by punctuation such as periods, exclamation marks, and question marks. Unlike the other methods in this article, it keeps the punctuation and drops the whitespace between sentences.

Caution: Installing the nltk package alone is not enough. sent_tokenize() raises a LookupError if the Punkt tokenizer data has not been downloaded. So, here’s the entire process to set up nltk on your system.

Install nltk using → pip install nltk

Then go ahead and type the following in your Python shell:

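import nltk
nltk.download('punkt')
# Newer NLTK releases may ask for 'punkt_tab' instead; the LookupError message names the resource to download.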

That’s it! You are now ready to use the sent_tokenize() function in your code.

Method 2: Using re.split

The re.split(pattern, string) method matches all occurrences of the pattern in the string and divides the string along the matches resulting in a list of strings between the matches. For example, re.split('a', 'bbabbbab') results in the list of strings ['bb', 'bbb', 'b'].
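The following minimal sketch shows how re.split() behaves when the pattern is a character class. The sample string and the variable names are made up for illustration. It highlights two details: a delimiter at the very end of the string produces a trailing empty string, and a | placed inside square brackets is treated as a literal separator, not as an either-or operator.

```python
import re

sample = "either|or. a/b!"

# Plain character class: splits only on '.', '!' and '?'
plain = re.split(r"[.!?]", sample)
print(plain)      # ['either|or', ' a/b', '']

# A '|' inside the brackets is a literal character, so it becomes a separator too
with_pipe = re.split(r"[.|!|?]", sample)
print(with_pipe)  # ['either', 'or', ' a/b', '']
```

The empty string at the end of both results is why the split output usually needs to be filtered.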


Approach: Split the given string on the sentence-ending punctuation marks by listing them in a character class: re.split(r"[.!?]", text). Inside square brackets every character is matched literally, so neither escaping nor the either-or (|) metacharacter is needed. Whenever the regex engine encounters one of the listed characters, it splits the string at that position. Because the text ends with a punctuation mark, the last element of the result is an empty string; the condition x != "" filters it out.

Code:

import re
text = "This is sentence 1. This is sentence 2! This is sentence 3?"
res = [x for x in re.split(r"[.!?]", text) if x != ""]
print(res)

# ['This is sentence 1', ' This is sentence 2', ' This is sentence 3']
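The split above drops the punctuation and leaves a leading space on every sentence after the first. If you want to keep the punctuation, one possible variation (not part of the original method) is to split on the whitespace that follows a punctuation mark, using a lookbehind assertion:

```python
import re

text = "This is sentence 1. This is sentence 2! This is sentence 3?"

# (?<=[.!?]) checks that the previous character is '.', '!' or '?' without
# consuming it, so only the whitespace after it is removed by the split.
sentences = re.split(r"(?<=[.!?])\s+", text)
print(sentences)
# ['This is sentence 1.', 'This is sentence 2!', 'This is sentence 3?']
```

Keep in mind that a purely pattern-based split like this also breaks after abbreviations such as "Dr." because it cannot tell them apart from the end of a sentence.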

🧩Recommended Read:  Python Regex Split

Method 3: Using findall

The re.findall(pattern, string) method scans the string from left to right, searching for all non-overlapping matches of the pattern. It returns a list of strings in the matching order when scanning the string from left to right.

Code:

import re
text = "This is sentence 1. This is sentence 2! This is sentence 3?"
res = re.findall(r"[^.!?]+", text)
print(res)

# ['This is sentence 1', ' This is sentence 2', ' This is sentence 3']

Explanation: In the expression re.findall(r"[^.!?]+", text), the character class [^.!?] matches any single character except ‘.’, ‘!’, and ‘?’ (the leading ^ negates the class), and the + quantifier repeats it one or more times. Each match is therefore the longest possible run of characters that contains no punctuation mark. When the regex engine reaches one of the excluded characters, the current match ends, and the search for the next match resumes after the punctuation mark. Note that the space following each punctuation mark becomes part of the next match.
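As a small sketch building on this pattern (the variable names are illustrative), you can strip the leading spaces from the matches, or extend the pattern with a trailing [.!?] so that each match also includes its closing punctuation mark:

```python
import re

text = "This is sentence 1. This is sentence 2! This is sentence 3?"

# Same pattern as above, with surrounding whitespace removed from each match
cleaned = [s.strip() for s in re.findall(r"[^.!?]+", text) if s.strip()]
print(cleaned)
# ['This is sentence 1', 'This is sentence 2', 'This is sentence 3']

# Appending [.!?] makes each match end with its punctuation mark
with_punct = [s.strip() for s in re.findall(r"[^.!?]+[.!?]", text)]
print(with_punct)
# ['This is sentence 1.', 'This is sentence 2!', 'This is sentence 3?']
```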

🧩Related Read: Python re.findall() – Everything You Need to Know

Method 4: Using replace

Approach: The idea here is to replace all the punctuation marks (‘!’, ‘?’, and ‘.’) in the given string with a comma (,) and then split the modified string on commas to get the list of substrings. Because the text ends with a punctuation mark, the last element returned is an empty string. You can use the pop() method to remove this last element from the list. Note that this approach only works as intended when the text itself contains no commas.

Code:

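def splitter(txt, delim):
    # Replace every delimiter with a comma, then split on the commas
    for d in delim:
        txt = txt.replace(d, ',')
    res = txt.split(',')
    res.pop()  # drop the trailing empty string left by the final delimiter
    return res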

sep = ['.', '!', '?']
text = "This is sentence 1. This is sentence 2! This is sentence 3?"
print(splitter(text, sep))

# ['This is sentence 1', ' This is sentence 2', ' This is sentence 3']
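If the text may contain commas of its own, a possible variation is to substitute a placeholder character that is unlikely to appear in normal text. The sketch below is an assumption-laden illustration, not the original method: the function name splitter_safe and the choice of the null character as a placeholder are hypothetical. It also filters out empty pieces, so it does not rely on the text ending with a delimiter.

```python
def splitter_safe(txt, delims, placeholder="\x00"):
    # Replace each delimiter with the placeholder (assumed absent from txt)
    for d in delims:
        txt = txt.replace(d, placeholder)
    # Split on the placeholder, trim whitespace, and skip empty pieces
    return [part.strip() for part in txt.split(placeholder) if part.strip()]

text = "Well, this works. Does it? Yes!"
print(splitter_safe(text, [".", "!", "?"]))
# ['Well, this works', 'Does it', 'Yes']
```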

🧩Related Read: Python String replace()

Conclusion

We have successfully solved the given problem using different approaches. I hope this article helped you in your Python coding journey. Please subscribe and stay tuned for more interesting articles.

Happy coding! 🐍

