5 Best Ways to Use TensorFlow Text to Split UTF-8 Strings in Python

πŸ’‘ Problem Formulation: Working with text data often involves parsing and tokenizing strings, which can be especially challenging with UTF-8 encoded strings due to the variety of character sets. This article discusses how TensorFlow Text, a powerful text processing library, can be leveraged in Python to efficiently split UTF-8 strings into tokens or substrings. Imagine having a string “Β‘Hola! ΒΏCΓ³mo estΓ‘s?” and wanting to split it into tokens [“Β‘Hola!”, “ΒΏCΓ³mo”, “estΓ‘s?”].
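
As a quick preview, here is a minimal sketch (assuming tensorflow and tensorflow_text are installed) that produces exactly these tokens with the whitespace-based tokenizer covered in Method 2 below; note that non-ASCII tokens print as their raw UTF-8 bytes:

import tensorflow as tf
import tensorflow_text as tf_text

# Split the Spanish example at whitespace; accented characters stay inside their tokens
tokenizer = tf_text.WhitespaceTokenizer()
tokens = tokenizer.tokenize(tf.constant(["Β‘Hola! ΒΏCΓ³mo estΓ‘s?"]))
print(tokens.to_list())
# [[b'\xc2\xa1Hola!', b'\xc2\xbfC\xc3\xb3mo', b'est\xc3\xa1s?']]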

Method 1: Using the UnicodeScriptTokenizer

The UnicodeScriptTokenizer in TensorFlow Text splits UTF-8 strings into tokens at Unicode script boundaries. It’s particularly useful when dealing with multilingual text: the tokenizer respects the script of each Unicode character and segments the string accordingly, which is ideal for accurately tokenizing strings with mixed languages.

Here’s an example:

import tensorflow as tf
import tensorflow_text as tf_text

# Tokenize a batch of one string that mixes Latin-script Spanish with Japanese
tokenizer = tf_text.UnicodeScriptTokenizer()
example_text = tf.constant(["TensorFlow es genial! γ“γ‚“γ«γ‘γ―δΈ–η•Œ"])
tokens = tokenizer.tokenize(example_text)
print(tokens.to_list())

Output:

[[b'TensorFlow', b'es', b'genial', b'!', b'\xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf', b'\xe4\xb8\x96\xe7\x95\x8c']]

This example demonstrates how the UnicodeScriptTokenizer splits a UTF-8 string that contains both Spanish and Japanese. The tokenizer recognizes the boundaries between scripts, so the Latin-script words, the punctuation mark, the Hiragana greeting, and the Kanji word each become separate tokens; non-ASCII tokens are printed as their raw UTF-8 bytes.
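
If you also need to know where each token came from, the tokenizer provides tokenize_with_offsets, which returns byte offsets into the original string alongside the tokens. A minimal sketch (the offsets in the comments assume the shorter sentence used here):

import tensorflow as tf
import tensorflow_text as tf_text

# tokenize_with_offsets returns (tokens, start_offsets, end_offsets), measured in bytes
tokenizer = tf_text.UnicodeScriptTokenizer()
tokens, starts, ends = tokenizer.tokenize_with_offsets(
    tf.constant(["TensorFlow es genial!"]))
print(tokens.to_list())   # [[b'TensorFlow', b'es', b'genial', b'!']]
print(starts.to_list())   # [[0, 11, 14, 20]]
print(ends.to_list())     # [[10, 13, 20, 21]]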

Method 2: Using the WhitespaceTokenizer

The WhitespaceTokenizer is a simple and efficient tokenizer that splits UTF-8 strings at whitespace characters. This method is particularly suitable for languages where whitespace is used to separate tokens.

Here’s an example:

import tensorflow as tf
import tensorflow_text as tf_text

# Split an English sentence at whitespace characters
tokenizer = tf_text.WhitespaceTokenizer()
example_text = tf.constant(["Splitting strings can be easy."])
tokens = tokenizer.tokenize(example_text)
print(tokens.to_list())

Output:

[[b'Splitting', b'strings', b'can', b'be', b'easy.']]

This snippet uses the WhitespaceTokenizer to split an English sentence into its constituent words. The whitespace tokenizer is straightforward and fast, making it an excellent choice for text with clear word boundaries denoted by spaces; note that punctuation stays attached to the neighboring word (b'easy.').
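
Because the tokenizer is an ordinary TensorFlow op, it also works on batches: a rank-1 tensor of strings comes back as a RaggedTensor with one variable-length row of tokens per input string. A short sketch:

import tensorflow as tf
import tensorflow_text as tf_text

# Each input string becomes one row of the resulting RaggedTensor
tokenizer = tf_text.WhitespaceTokenizer()
batch = tf.constant(["Splitting strings can be easy.", "Or very short."])
print(tokenizer.tokenize(batch).to_list())
# [[b'Splitting', b'strings', b'can', b'be', b'easy.'], [b'Or', b'very', b'short.']]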

Method 3: Using the WordpieceTokenizer

WordpieceTokenizer is a more advanced tokenizer that splits words into subword units, which can be helpful for various neural network models. It improves a model’s ability to deal with rare or unknown words by breaking them down into known subwords. It expects input that has already been split into words and a vocabulary supplied as a lookup table (or a path to a vocab file).

Here’s an example:

import tensorflow as tf
import tensorflow_text as tf_text

# WordpieceTokenizer takes a lookup table (or a vocab file path), not a plain list
vocab = ["Tensor", "##Flow", "es", "genial", "!"]
init = tf.lookup.KeyValueTensorInitializer(vocab, list(range(len(vocab))),
                                           key_dtype=tf.string, value_dtype=tf.int64)
tokenizer = tf_text.WordpieceTokenizer(tf.lookup.StaticHashTable(init, default_value=-1),
                                       token_out_type=tf.string)
# Wordpiece operates on words, so pass pre-split words rather than a raw sentence
example_words = tf.constant(["TensorFlow", "es", "genial", "!"])
tokens = tokenizer.tokenize(example_words)
print(tokens.to_list())

Output:

[[b'Tensor', b'##Flow'], [b'es'], [b'genial'], [b'!']]

This snippet demonstrates the WordpieceTokenizer’s ability to break each word into known subwords based on a predefined vocabulary; the result is a RaggedTensor with one row of subword tokens per input word. It’s especially useful for tokenizing input for models that require a fixed vocabulary, such as BERT.
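
Models such as BERT consume integer vocabulary ids rather than strings. With the same toy lookup table, leaving token_out_type at its default of tf.int64 returns those ids directly; a minimal sketch:

import tensorflow as tf
import tensorflow_text as tf_text

# token_out_type=tf.int64 (the default) returns vocabulary ids instead of strings
vocab = ["Tensor", "##Flow", "es", "genial", "!"]
init = tf.lookup.KeyValueTensorInitializer(vocab, list(range(len(vocab))),
                                           key_dtype=tf.string, value_dtype=tf.int64)
tokenizer = tf_text.WordpieceTokenizer(tf.lookup.StaticHashTable(init, default_value=-1),
                                       token_out_type=tf.int64)
print(tokenizer.tokenize(tf.constant(["TensorFlow", "es", "genial", "!"])).to_list())
# [[0, 1], [2], [3], [4]]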

Method 4: Using regex_split

The regex_split function allows for complex tokenization rules by using regular expressions (RE2 syntax) to split UTF-8 strings. You pass a pattern that matches the delimiters and, optionally, a second pattern for delimiters that should be kept as tokens, which offers flexibility for intricate tokenization scenarios that require a custom split pattern.

Here’s an example:

import tensorflow as tf
import tensorflow_text as tf_text

# Split on whitespace and punctuation, keeping punctuation marks as their own tokens
example_text = tf.constant(["TensorFlow, BERT, and more!"])
tokens = tf_text.regex_split(input=example_text,
                             delim_regex_pattern=r"\s|\p{P}",
                             keep_delim_regex_pattern=r"\p{P}")
print(tokens.to_list())

Output:

[[b'TensorFlow', b',', b'BERT', b',', b'and', b'more', b'!']]

In this code, regex_split breaks a string into words and punctuation using RE2 patterns: characters matching delim_regex_pattern mark the split points, and delimiters that also match keep_delim_regex_pattern are emitted as tokens of their own. This provides granular control over the tokenization process, which is useful for specialized text processing tasks.
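
If the punctuation is noise rather than signal, omit keep_delim_regex_pattern and fold all delimiters into one character class; a small sketch:

import tensorflow as tf
import tensorflow_text as tf_text

# Without keep_delim_regex_pattern the delimiters are discarded entirely
tokens = tf_text.regex_split(input=tf.constant(["TensorFlow, BERT, and more!"]),
                             delim_regex_pattern=r"[\s\p{P}]+")
print(tokens.to_list())  # [[b'TensorFlow', b'BERT', b'and', b'more']]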

Bonus One-Liner Method 5: Using tf.strings.split with a delimiter

Though not strictly a tokenizer, TensorFlow’s tf.strings.split function can come in handy for quickly splitting strings on a literal delimiter. It’s a convenient one-liner for simple use cases; because the separator is literal rather than a regex, multiple delimiter characters are easiest to handle by normalizing them first with tf.strings.regex_replace.

Here’s an example:

import tensorflow as tf

# The separator is a literal string, so first replace both delimiters with a space
example_text = tf.constant(["Quick,split;this"])
normalized = tf.strings.regex_replace(example_text, r"[,;]", " ")
tokens = tf.strings.split(normalized, sep=" ")
print(tokens.to_list())

Output:

[[b'Quick', b'split', b'this']]

Because tf.strings.split only accepts a literal separator, the snippet first rewrites every comma and semicolon to a space with tf.strings.regex_replace and then splits on that space. This approach is useful when you have a known set of delimiter characters and want a quick split without the overhead of a tokenizer.
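
When there is only a single, literal delimiter, no normalization step is needed and the split really is a one-liner:

import tensorflow as tf

# A single literal separator needs no preprocessing
print(tf.strings.split(tf.constant(["a,b,c", "d,e"]), sep=",").to_list())
# [[b'a', b'b', b'c'], [b'd', b'e']]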

Summary/Discussion

  • Method 1: UnicodeScriptTokenizer. Best used for preserving linguistic integrity across different languages. May not be suitable for languages without script boundaries.
  • Method 2: WhitespaceTokenizer. Simple and fast. Does not handle languages that do not use whitespace for word segmentation.
  • Method 3: WordpieceTokenizer. Great for handling out-of-vocabulary words by breaking them into known subwords. Requires a predefined vocabulary.
  • Method 4: RegexSplitTokenizer. Highly customizable with regular expressions. Can be complex to set up for intricate tokenization rules.
  • Bonus Method 5: StringSplit Function. Quick and easy with specified delimiters. Not ideal for complex tokenization needs.