We hear more and more about natural language processing. The amount of text content produced keeps growing rapidly. Before a machine learning model can process a text, we need to extract information and numerical features from it.
What Are N-Grams?
N-grams are one of the tools for processing this content by machine. You can use n-grams for autocompletion, text recognition, text mining, and much more. An n-gram of size 1 is referred to as a "unigram", size 2 is a "bigram", size 3 is a "trigram", and so on.
Definition: An n-gram is a contiguous sequence of n items (words, characters, sentences…) taken from a given text. Counting n-grams shows which items are often used together, which is very useful for automatic completion or spelling correction, for example. Of course, the result depends heavily on the size and content of the sample text.
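To make the definition concrete, here is a minimal sketch in plain Python. The helper names "word_ngrams" and "char_ngrams" and the sample sentence are made up for illustration; they are not part of any library.

```python
def word_ngrams(text, n):
    # Split on whitespace, then take every window of n consecutive words
    tokens = text.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    # Same idea, but the items are single characters
    return [text[i:i + n] for i in range(len(text) - n + 1)]


sentence = "to be or not to be"
print(word_ngrams(sentence, 1))  # unigrams
print(word_ngrams(sentence, 2))  # bigrams
print(char_ngrams("python", 3))  # character trigrams
```

A text with k items contains k - n + 1 n-grams, so the six-word sentence above yields five bigrams, and the bigram ('to', 'be') appears twice.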
There are several ways to generate n-grams in Python:
- First, we take a text stored in a variable, break it down into words, and then use pure Python to find the n-grams.
- In the second example, we use Python’s NLTK package (Natural Language Toolkit) to analyze an imported CSV file.
- The third example is similar, but here we use the TextBlob module.
Solution 1: Regex + Lists
Here is an example with pure Python and regex:
```python
import re
import collections


def generate_ngrams(text, n):
    # List of all n-grams
    ngrams = []
    # N-gram distribution (n-gram to frequency mapping)
    outcome = {}

    # Split the text into tokens
    tokens = re.split("\\s+", text)

    # Collect the n-grams: every window of n consecutive tokens
    for i in range(len(tokens) - n + 1):
        ngrams.append(" ".join(tokens[i:i + n]))

    # Frequency of n-grams with built-in functions
    for k in ngrams:
        outcome[k] = ngrams.count(k)

    # Sort results by frequency (descending)
    outcome = sorted(outcome.items(), key=lambda x: x[1], reverse=True)

    # Frequency of n-grams with the collections module
    result = collections.Counter(ngrams)

    print(outcome)
    print(100 * '-')
    print(result)


# The text we want to examine
text = '''Coding is like learning a new language.
You improve by listening to and expressing yourself in the new language.
A core language skill is to understand words quickly.
Likewise, a core programming skill is to understand code quickly.
Finxter teaches you rapid code understanding. It teaches you to see beyond the code.
When we are done with you, the meaning of a code snippet will unfold like words from your mother’s tongue.
At this point, consider yourself a code master.
Becoming a code master is what we at Finxter want for you.
Furthermore, we want you to achieve this with minimal effort only by committing to a simple process.
We know: If you are like most of our users, you can not commit full-time to learning to code.
Fortunately, this is not needed. A small habit is much better than a great event.
Learn 5 minutes every day and not a single whole weekend.
True learning is a process, not an event.
To help you reach your goal on the side, we have created the Finxter loop, a daily 5-minute habit.
Solve a puzzle a day and enjoy the release of endorphins into your brain when solving it.
Commit to this single habit and your Python skills will expand rapidly.
Solve a code puzzle now!'''

# The size of the n-grams
n = 2

generate_ngrams(text, n)
```
First, we import the necessary modules: "re" (regular expressions) for splitting the text, and "collections" for frequency counting (optional).
Then we define a function called "generate_ngrams" and pass the "text" and "n" arguments to it. "text" is the text we want to analyze, and "n" is the size of the n-grams we want to generate (for example, 2 for bigrams, 3 for trigrams, and so on).
The text is then split into words on runs of whitespace with the regex call re.split("\\s+",text). Punctuation stays attached to the words, so "language." and "language" count as different tokens.
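One detail worth knowing about this split: if the text starts or ends with whitespace, re.split returns empty strings at the edges, and these would end up inside the n-grams. Here is a small sketch of the difference, using a made-up sample string. Calling strip() first or using str.split() without arguments are shown as possible alternatives, not as what the code above does.

```python
import re

sample = "  to be or not  "  # made-up input with leading and trailing spaces

regex_tokens = re.split("\\s+", sample)
stripped_tokens = re.split("\\s+", sample.strip())
plain_tokens = sample.split()

print(regex_tokens)     # ['', 'to', 'be', 'or', 'not', '']
print(stripped_tokens)  # ['to', 'be', 'or', 'not']
print(plain_tokens)     # ['to', 'be', 'or', 'not']
```

The sample text in Solution 1 starts and ends with a word, so the problem does not show up there.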
We iterate over the tokens and append each sequence of "n" consecutive words to the list named "ngrams".
We build a dictionary where the keys are the n-grams and the values are the number of occurrences.
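Finally, we sort the n-grams by frequency. You can do this with the built-in "sorted" function and a lambda as the sort key, which returns a list of (n-gram, count) pairs. Alternatively, the "Counter" class from the "collections" module does the counting and the ordering in one step.
❗ Please note: Python versions before 3.7 do not guarantee that dictionaries keep their insertion order!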
You can find the output of the Solution 1 code at the end of this article (it’s long)!
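If you only need the most frequent n-grams, the Counter route can be written more compactly. The following is a sketch of one possible variant, not a replacement for the code above: the function name "ngram_counts" and the sample sentence are made up, and the tokens are lowercased and stripped of punctuation, which the original function does not do.

```python
import re
from collections import Counter


def ngram_counts(text, n):
    # Lowercase and keep only runs of word characters,
    # so that "code." and "code" become the same token
    tokens = re.findall(r"\w+", text.lower())
    # Zipping n shifted copies of the token list yields every window of n tokens
    windows = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(window) for window in windows)


sample = "A code master reads code. A code master writes code."
counts = ngram_counts(sample, 2)
print(counts.most_common(2))
```

The most_common() method returns the n-grams sorted by frequency, so no separate sorting step is needed.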
Solution 2: NLTK Module
There is a more compact solution: use the third-party nltk package.
As you can read on nltk.org:
"NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries."
N-grams can be used not only for text but also for numeric data. Let’s see an example:
Here we import a CSV file that contains the winning numbers of a popular gambling game named "Keno" from 1996 to 2021.
In this game, you can mark up to 10 numbers out of 80. The goal is to hit as many of the 20 winning numbers drawn each day as possible. You can also pick only three numbers, for example, in which case the prize is 15 times the stake. So, guess which trigram was the most common! :)
```python
from nltk import ngrams
import collections
import csv

# Read the winning numbers (columns 4 to 23 of each row) into one flat list
numbers = []
with open('/Finxter/NGrams/keno.csv', newline='') as csvfile:
    read_csv = csv.reader(csvfile, delimiter=';')
    for row in read_csv:
        for i in range(4, 24):
            numbers.append(row[i])

# The size of the n-grams
n = 3

# ngrams() yields tuples of n consecutive items from the sequence
trigrams = ngrams(numbers, n)
trigram_freq = collections.Counter(trigrams)
print(trigram_freq.most_common(10))
```
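Note that the code above joins all draws into one long sequence, so some trigrams span the end of one draw and the start of the next. If you want to count only trigrams within a single draw, you can build the n-grams row by row. Here is a minimal sketch of that idea using only the standard library. The helper "row_ngrams" is a made-up stand-in for nltk's ngrams function, the three rows are invented (not real draw results), and each row holds only five numbers instead of 20 to keep the example short.

```python
import collections
import csv
import io


def row_ngrams(items, n):
    # Yield every window of n consecutive items as a tuple
    return zip(*(items[i:] for i in range(n)))


# Invented rows in the layout assumed above: ';' delimiter,
# four leading columns, then the drawn numbers
sample_csv = io.StringIO(
    "2021;1;Mon;a;3;17;25;40;71\n"
    "2021;2;Tue;b;3;17;25;52;80\n"
    "2021;3;Wed;c;8;17;25;40;66\n"
)

trigram_freq = collections.Counter()
for row in csv.reader(sample_csv, delimiter=";"):
    numbers = row[4:9]  # the numbers of this one draw
    trigram_freq.update(row_ngrams(numbers, 3))

print(trigram_freq.most_common(2))
```

Each row of five numbers contributes three trigrams, so the counter holds nine trigrams in total and none of them crosses a row boundary. Also keep in mind that an n-gram only covers neighboring values: three winning numbers that are not adjacent in the row are not counted as a trigram.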
Solution 3: TextBlob Module
TextBlob is a Python library for processing text. It provides a wide range of natural language processing tools, like translation, part-of-speech tagging, spelling correction, sentiment analysis, tokenization and more.
Let’s see how we can create n-grams with it!
```python
from textblob import TextBlob

sentence = '''"All right," said Deep Thought. "The Answer to the Great Question..."
"Yes..!"
"Of Life, the Universe and Everything..." said Deep Thought.
"Yes...!"
"Is..." said Deep Thought, and paused.
"Yes...!"
"Is..."
"Yes...!!!...?"
"Forty-two," said Deep Thought, with infinite majesty and calm.
'''

ngram_object = TextBlob(sentence)
ngrams = ngram_object.ngrams(n=2)
print(ngrams)
```
Import the textblob library, then create a string variable to analyze.
Create a TextBlob object ("ngram_object"). Call its ngrams() method and specify the "n" argument, such as n=2 for bigrams or n=3 for trigrams.
Please note that this method returns a list of WordList objects (class textblob.blob.WordList), and each WordList is a list-like collection of words.
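Because each n-gram is list-like, it cannot be used directly as a dictionary key or passed to Counter. A common workaround is to convert each n-gram to a tuple first. The sketch below uses plain nested lists as a stand-in for the return value of ngrams(), so it runs without TextBlob installed; the sample bigrams are made up.

```python
from collections import Counter

# Stand-in for TextBlob(...).ngrams(n=2): a list of list-like word sequences
bigrams = [["said", "Deep"], ["Deep", "Thought"], ["Yes", "Is"], ["said", "Deep"]]

# Lists are not hashable, so convert each n-gram to a tuple before counting
freq = Counter(tuple(gram) for gram in bigrams)
print(freq.most_common(1))
```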
You can see that Python makes it very easy to create n-grams, so it is easier and faster to process the text by machine. Whether you use pure Python or the existing NLP libraries (NLTK or TextBlob), just a few lines of code will do the work.
Output of the Solution 1 Code Snippet
```
[('a code', 4), ('new language.', 2), ('skill is', 2), ('is to', 2), ('to understand', 2), ('teaches you', 2), ('you to', 2), ('day and', 2), ('Solve a', 2), ('Coding is', 1), ('is like', 1), ('like learning', 1), ('learning a', 1), ('a new', 1), ('language. You', 1), ('You improve', 1), ('improve by', 1), ('by listening', 1), ('listening to', 1), ('to and', 1), ('and expressing', 1), ('expressing yourself', 1), ('yourself in', 1), ('in the', 1), ('the new', 1), ('language. A', 1), ('A core', 1), ('core language', 1), ('language skill', 1), ('understand words', 1), ('words quickly.', 1), ('quickly. Likewise,', 1), ('Likewise, a', 1), ('a core', 1), ('core programming', 1), ('programming skill', 1), ('understand code', 1), ('code quickly.', 1), ('quickly. Finxter', 1), ('Finxter teaches', 1), ('you rapid', 1), ('rapid code', 1), ('code understanding.', 1), ('understanding. It', 1), ('It teaches', 1), ('to see', 1), ('see beyond', 1), ('beyond the', 1), ('the code.', 1), ('code. When', 1), ('When we', 1), ('we are', 1), ('are done', 1), ('done with', 1), ('with you,', 1), ('you, the', 1), ('the meaning', 1), ('meaning of', 1), ('of a', 1), ('code snippet', 1), ('snippet will', 1), ('will unfold', 1), ('unfold like', 1), ('like words', 1), ('words from', 1), ('from your', 1), ('your mother’s', 1), ('mother’s tongue.', 1), ('tongue. At', 1), ('At this', 1), ('this point,', 1), ('point, consider', 1), ('consider yourself', 1), ('yourself a', 1), ('code master.', 1), ('master. Becoming', 1), ('Becoming a', 1), ('code master', 1), ('master is', 1), ('is what', 1), ('what we', 1), ('we at', 1), ('at Finxter', 1), ('Finxter want', 1), ('want for', 1), ('for you.', 1), ('you. Furthermore,', 1), ('Furthermore, we', 1), ('we want', 1), ('want you', 1), ('to achieve', 1), ('achieve this', 1), ('this with', 1), ('with minimal', 1), ('minimal effort', 1), ('effort only', 1), ('only by', 1), ('by committing', 1), ('committing to', 1), ('to a', 1), ('a simple', 1), ('simple process.', 1), ('process. We', 1), ('We know:', 1), ('know: If', 1), ('If you', 1), ('you are', 1), ('are like', 1), ('like most', 1), ('most of', 1), ('of our', 1), ('our users,', 1), ('users, you', 1), ('you can', 1), ('can not', 1), ('not commit', 1), ('commit full-time', 1), ('full-time to', 1), ('to learning', 1), ('learning to', 1), ('to code.', 1), ('code. Fortunately,', 1), ('Fortunately, this', 1), ('this is', 1), ('is not', 1), ('not needed.', 1), ('needed. A', 1), ('A small', 1), ('small habit', 1), ('habit is', 1), ('is much', 1), ('much better', 1), ('better than', 1), ('than a', 1), ('a great', 1), ('great event.', 1), ('event. Learn', 1), ('Learn 5', 1), ('5 minutes', 1), ('minutes every', 1), ('every day', 1), ('and not', 1), ('not a', 1), ('a single', 1), ('single whole', 1), ('whole weekend.', 1), ('weekend. True', 1), ('True learning', 1), ('learning is', 1), ('is a', 1), ('a process,', 1), ('process, not', 1), ('not an', 1), ('an event.', 1), ('event. To', 1), ('To help', 1), ('help you', 1), ('you reach', 1), ('reach your', 1), ('your goal', 1), ('goal on', 1), ('on the', 1), ('the side,', 1), ('side, we', 1), ('we have', 1), ('have created', 1), ('created the', 1), ('the Finxter', 1), ('Finxter loop,', 1), ('loop, a', 1), ('a daily', 1), ('daily 5-minute', 1), ('5-minute habit.', 1), ('habit. Solve', 1), ('a puzzle', 1), ('puzzle a', 1), ('a day', 1), ('and enjoy', 1), ('enjoy the', 1), ('the release', 1), ('release of', 1), ('of endorphins', 1), ('endorphins into', 1), ('into your', 1), ('your brain', 1), ('brain when', 1), ('when solving', 1), ('solving it.', 1), ('it. Commit', 1), ('Commit to', 1), ('to this', 1), ('this single', 1), ('single habit', 1), ('habit and', 1), ('and your', 1), ('your Python', 1), ('Python skills', 1), ('skills will', 1), ('will expand', 1), ('expand rapidly.', 1), ('rapidly. Solve', 1), ('code puzzle', 1), ('puzzle now!', 1)]
----------------------------------------------------------------------------------------------------
Counter({'a code': 4, 'new language.': 2, 'skill is': 2, 'is to': 2, 'to understand': 2, 'teaches you': 2, 'you to': 2, 'day and': 2, 'Solve a': 2, 'Coding is': 1, 'is like': 1, 'like learning': 1, 'learning a': 1, 'a new': 1, 'language. You': 1, 'You improve': 1, 'improve by': 1, 'by listening': 1, 'listening to': 1, 'to and': 1, 'and expressing': 1, 'expressing yourself': 1, 'yourself in': 1, 'in the': 1, 'the new': 1, 'language. A': 1, 'A core': 1, 'core language': 1, 'language skill': 1, 'understand words': 1, 'words quickly.': 1, 'quickly. Likewise,': 1, 'Likewise, a': 1, 'a core': 1, 'core programming': 1, 'programming skill': 1, 'understand code': 1, 'code quickly.': 1, 'quickly. Finxter': 1, 'Finxter teaches': 1, 'you rapid': 1, 'rapid code': 1, 'code understanding.': 1, 'understanding. It': 1, 'It teaches': 1, 'to see': 1, 'see beyond': 1, 'beyond the': 1, 'the code.': 1, 'code. When': 1, 'When we': 1, 'we are': 1, 'are done': 1, 'done with': 1, 'with you,': 1, 'you, the': 1, 'the meaning': 1, 'meaning of': 1, 'of a': 1, 'code snippet': 1, 'snippet will': 1, 'will unfold': 1, 'unfold like': 1, 'like words': 1, 'words from': 1, 'from your': 1, 'your mother’s': 1, 'mother’s tongue.': 1, 'tongue. At': 1, 'At this': 1, 'this point,': 1, 'point, consider': 1, 'consider yourself': 1, 'yourself a': 1, 'code master.': 1, 'master. Becoming': 1, 'Becoming a': 1, 'code master': 1, 'master is': 1, 'is what': 1, 'what we': 1, 'we at': 1, 'at Finxter': 1, 'Finxter want': 1, 'want for': 1, 'for you.': 1, 'you. Furthermore,': 1, 'Furthermore, we': 1, 'we want': 1, 'want you': 1, 'to achieve': 1, 'achieve this': 1, 'this with': 1, 'with minimal': 1, 'minimal effort': 1, 'effort only': 1, 'only by': 1, 'by committing': 1, 'committing to': 1, 'to a': 1, 'a simple': 1, 'simple process.': 1, 'process. We': 1, 'We know:': 1, 'know: If': 1, 'If you': 1, 'you are': 1, 'are like': 1, 'like most': 1, 'most of': 1, 'of our': 1, 'our users,': 1, 'users, you': 1, 'you can': 1, 'can not': 1, 'not commit': 1, 'commit full-time': 1, 'full-time to': 1, 'to learning': 1, 'learning to': 1, 'to code.': 1, 'code. Fortunately,': 1, 'Fortunately, this': 1, 'this is': 1, 'is not': 1, 'not needed.': 1, 'needed. A': 1, 'A small': 1, 'small habit': 1, 'habit is': 1, 'is much': 1, 'much better': 1, 'better than': 1, 'than a': 1, 'a great': 1, 'great event.': 1, 'event. Learn': 1, 'Learn 5': 1, '5 minutes': 1, 'minutes every': 1, 'every day': 1, 'and not': 1, 'not a': 1, 'a single': 1, 'single whole': 1, 'whole weekend.': 1, 'weekend. True': 1, 'True learning': 1, 'learning is': 1, 'is a': 1, 'a process,': 1, 'process, not': 1, 'not an': 1, 'an event.': 1, 'event. To': 1, 'To help': 1, 'help you': 1, 'you reach': 1, 'reach your': 1, 'your goal': 1, 'goal on': 1, 'on the': 1, 'the side,': 1, 'side, we': 1, 'we have': 1, 'have created': 1, 'created the': 1, 'the Finxter': 1, 'Finxter loop,': 1, 'loop, a': 1, 'a daily': 1, 'daily 5-minute': 1, '5-minute habit.': 1, 'habit. Solve': 1, 'a puzzle': 1, 'puzzle a': 1, 'a day': 1, 'and enjoy': 1, 'enjoy the': 1, 'the release': 1, 'release of': 1, 'of endorphins': 1, 'endorphins into': 1, 'into your': 1, 'your brain': 1, 'brain when': 1, 'when solving': 1, 'solving it.': 1, 'it. Commit': 1, 'Commit to': 1, 'to this': 1, 'this single': 1, 'single habit': 1, 'habit and': 1, 'and your': 1, 'your Python': 1, 'Python skills': 1, 'skills will': 1, 'will expand': 1, 'expand rapidly.': 1, 'rapidly. Solve': 1, 'code puzzle': 1, 'puzzle now!': 1})
```