Splitting Strings by Character in Python with TensorFlow Text and Unicode


πŸ’‘ Problem Formulation: In scenarios where data needs to be tokenized, such as text preprocessing for natural language processing tasks, it’s often necessary to split strings at the character level. For instance, turning the string "hello" into ["h", "e", "l", "l", "o"]. TensorFlow provides a Unicode-aware op for this, used throughout TensorFlow Text preprocessing pipelines, which we’ll explore using various techniques.
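Before reaching for tensors, it helps to see the target behavior in plain Python, where list() already splits a str into code points; the TensorFlow op matters once you need this inside a graph or data pipeline. A quick baseline for comparison:

```python
# Python 3 strings are sequences of Unicode code points,
# so list() splits them at the character level.
print(list('hello'))      # ['h', 'e', 'l', 'l', 'o']
print(list('こんにけは'))  # ['こ', 'ん', 'に', 'け', 'は']
```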

Method 1: Using TensorFlow Text Unicode Split

TensorFlow offers a function called unicode_split, exposed as tf.strings.unicode_split, to split strings into substrings of UTF-8 characters; TensorFlow Text preprocessing pipelines build on this core string op. It handles strings with multi-byte characters correctly, ensuring that each token represents a complete character.

Here’s an example:

import tensorflow as tf

input_string = tf.constant('こんにけは')
# unicode_split lives in core TensorFlow's tf.strings module;
# TensorFlow Text tokenizers are commonly used alongside it.
character_tokens = tf.strings.unicode_split(input_string, 'UTF-8')
print(character_tokens)

Output: <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'\xe3\x81\x93', b'\xe3\x82\x93', b'\xe3\x81\xab', b'\xe3\x81\xa1', b'\xe3\x81\xaf'], dtype=object)>

This snippet creates a TensorFlow constant with the string ‘こんにけは’ (hello in Japanese) and applies unicode_split to obtain the constituent characters. The output is a TensorFlow Tensor with the byte representations of each Unicode character.

Method 2: Unicode Split with Tensor Slicing

If you wish to manipulate or analyze individual characters after splitting, you can slice or index the resulting tensor, or simply iterate over it in eager mode, to work with each character separately.

Here’s an example:

import tensorflow as tf

input_string = tf.constant('foo𝔄bar')
unicode_chars = tf.strings.unicode_split(input_string, 'UTF-8')
for char in unicode_chars:  # eager execution makes the rank-1 tensor iterable
    print(char.numpy())

Output:

b'f'
b'o'
b'o'
b'\xf0\x9d\x94\x84'
b'b'
b'a'
b'r'

The code iterates over the tensor returned by unicode_split and prints each character; you could equally slice it, e.g. unicode_chars[3] for the fourth token. The special character ‘𝔄’ is correctly kept together as a single four-byte UTF-8 sequence rather than being split into bytes.
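Why four bytes? ‘𝔄’ (U+1D504, MATHEMATICAL FRAKTUR CAPITAL A) lies outside the Basic Multilingual Plane, so UTF-8 encodes it with a four-byte sequence; a naive byte-level split would shred it. A small pure-Python check of the numbers involved:

```python
# '𝔄' is code point U+1D504, above U+FFFF, hence 4 bytes in UTF-8.
ch = '\U0001d504'
print(ch.encode('utf-8'))       # b'\xf0\x9d\x94\x84'
print(len(ch.encode('utf-8')))  # 4 bytes...
print(len('foo' + ch + 'bar'))  # ...but 7 characters, matching the 7 tokens
```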

Method 3: Batch Splitting with Unicode Split

The unicode_split op can also be applied to batched strings, returning a tf.RaggedTensor and making it easy to process multiple strings at once in a tf.data.Dataset or a batch of string tensors.

Here’s an example:

import tensorflow as tf

batch_strings = tf.constant(['hello', 'world✌️'])
batch_characters = tf.strings.unicode_split(batch_strings, 'UTF-8')
print(batch_characters.to_list())  # RaggedTensor: rows have different lengths

Output: [[b'h', b'e', b'l', b'l', b'o'], [b'w', b'o', b'r', b'l', b'd', b'\xe2\x9c\x8c', b'\xef\xb8\x8f']]

This code splits multiple strings simultaneously. It demonstrates the function’s utility in situations where many pieces of text must be tokenized, such as in batch operations during machine learning model training.
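The batched result is a tf.RaggedTensor because each string yields a different number of characters; its to_list() method returns exactly the nested Python lists you would build by hand. A pure-Python sketch of that shape, for illustration:

```python
# Hand-rolled equivalent of the ragged result: one list of UTF-8
# byte tokens per input string, with rows of unequal length.
batch = ['hello', 'world✌️']
ragged = [[c.encode('utf-8') for c in s] for s in batch]
print(ragged)
# [[b'h', b'e', b'l', b'l', b'o'],
#  [b'w', b'o', b'r', b'l', b'd', b'\xe2\x9c\x8c', b'\xef\xb8\x8f']]
```

Note that '✌️' is two code points (U+270C plus the variation selector U+FE0F), which is why it produces two tokens.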

Method 4: Splitting and Trimming Unwanted Characters

Sometimes, you might want to split the string and also remove unwanted characters like whitespace. You can combine unicode_split with TensorFlow’s string utilities, such as tf.strings.strip, to perform both operations.

Here’s an example:

import tensorflow as tf

input_string = tf.constant(' hello ')
trimmed_string = tf.strings.strip(input_string)  # drop leading/trailing whitespace
characters = tf.strings.unicode_split(trimmed_string, 'UTF-8')
print(characters.numpy().tolist())

Output: [b'h', b'e', b'l', b'l', b'o']

Before splitting, this code snippet trims the input string with tf.strings.strip to remove leading and trailing whitespace. It then proceeds with the split, yielding a tensor of characters without the surrounding spaces.
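Keep in mind that stripping only trims the ends of the string; interior whitespace would still appear as tokens. To drop all whitespace before splitting, add a replace step (tf.strings.regex_replace on the TensorFlow side; here the same idea is sketched in plain Python with re.sub):

```python
import re

s = '  hello  world  '
trimmed = s.strip()                    # drops only leading/trailing spaces
no_whitespace = re.sub(r'\s+', '', s)  # drops every whitespace run
print(list(no_whitespace))
# ['h', 'e', 'l', 'l', 'o', 'w', 'o', 'r', 'l', 'd']
```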

Bonus One-Liner Method 5: Unicode Split with Ragged Tensors

A concise approach leans on the tf.RaggedTensor that unicode_split returns for batched input: pass a plain Python list of strings directly and chain to_list() to get nested Python lists. This one-liner is particularly handy when dealing with multiple strings of varying lengths.

Here’s an example:

import tensorflow as tf

result = tf.strings.unicode_split(['hello', 'world✌️'], 'UTF-8').to_list()
print(result)

Output: [[b'h', b'e', b'l', b'l', b'o'], [b'w', b'o', b'r', b'l', b'd', b'\xe2\x9c\x8c', b'\xef\xb8\x8f']]

By applying unicode_split directly to a list and calling to_list on the result, we can efficiently split multiple strings and convert the results into Python lists, dealing with varying string lengths gracefully.
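If a downstream op needs a rectangular tensor instead, tf.RaggedTensor offers to_tensor(), which pads shorter rows with a default value. The padding rule itself is simple; here it is sketched in plain Python for illustration:

```python
# Mimics RaggedTensor.to_tensor(default_value=b''): pad every row
# out to the length of the longest row.
rows = [[b'h', b'e', b'l', b'l', b'o'], [b'h', b'i']]
width = max(len(r) for r in rows)
padded = [r + [b''] * (width - len(r)) for r in rows]
print(padded)  # [[b'h', b'e', b'l', b'l', b'o'], [b'h', b'i', b'', b'', b'']]
```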

Summary/Discussion

  • Method 1: TensorFlow Text Unicode Split. Simple and effective for splitting Unicode strings. May not be the most efficient for large datasets or when only a subset of a string is needed.
  • Method 2: Unicode Split with Tensor Slicing. Offers character-level manipulation and analysis post-split. It can get cumbersome with very large strings or batches.
  • Method 3: Batch Splitting with Unicode Split. Ideal for processing large datasets. Ensures efficient operations within a machine learning pipeline, especially for batched tensor inputs.
  • Method 4: Splitting and Trimming Unwanted Characters. Adds pre-processing steps to clean data before splitting. Useful for text data that requires trimming or cleaning.
  • Method 5: Unicode Split with Ragged Tensors. Offers a concise and scalable approach, especially when dealing with variable-length strings. The simplicity can be a double-edged sword if additional processing steps are needed.