Summary: This blog explains the various ways one can extract commonly used Emojis embedded within text.
Note: All the solutions provided below have been verified using Python 3.9.0b5.
Problem Formulation
One has a list with normal text words and emojis, all mixed together, as shown below.
orig_list = ['These 👧 Emojis 😈 are 🤖 embedded 😺 within 😻 this 🙀 text.']
How does one extract only the emojis, into a new list, as follows?
new_list = ['👧', '😈', '🤖', '😺', '😻', '🙀']
Background
The Zen Master once said “A picture is worth a thousand words”. This nugget of wisdom has held true since the cave-men roamed the earth. It continues to be true in our uber-tech world today.
In the late 1990s, Shigetaka Kurita designed an early set of emojis for the Japanese telecom company NTT DoCoMo. Since then, human pictorial expressions (emojis) in written language have assumed a life of their own.
Emojis are now used in all sorts of electronic communication. Such an explosion in use has also created the need for a way to separate emojis from text.
This blog article explores different ways, using Python, to separate emojis from text.
Method 1: The Obvious Eliminate The Text Method
The most straightforward way to separate the emojis from the text is to cut the text out of the string. Use the regex library to do this, as shown below.
$ python
Python 3.9.0b5 (default, Oct 19 2020, 11:11:59)
>>>
>>> ## Import the required Packages.
>>> import regex as re
>>>
>>> ## The original list. The Emojis are extracted from this list.
>>> orig_list = ['These 👧 Emojis, 😈 are 🤖 embedded 😺 within 😻 this 🙀 text.']
>>>
>>> ## Use a regular expression set to find and extract all emojis in orig_list
>>> new_list = re.findall(r'[^\w\s,.]', orig_list[0])
>>>
>>> new_list
['👧', '😈', '🤖', '😺', '😻', '🙀']
>>>
The re.findall() method uses a regular expression set, i.e. '[]', along with the negation character '^', to match every character that is not text. The text is represented by character classes such as '\w' and '\s', and by literal characters such as '.' or ','. Whatever is left over after excluding the text is returned as the list of emojis.
Note: Although the above method is straightforward, it has its limitations. One has to specify each text character that might appear in the original text. For example, if the text contains '=' and '+', one needs to specify r'[^\w\s,.=+]' as the regular expression set. Passing the pattern string directly to re.findall(), as coded above, is called the inline search method. Use it when a quick, short search is needed. Otherwise, use the compile method, which is shown in the next section.
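The limitation described in the note can be seen in a small sketch. It uses the standard library re module instead of regex (an assumption made only to keep the example dependency-free; the pattern syntax used here is the same), and the sample sentence is made up for illustration.

```python
import re

# Made-up sample text containing '=' and '+' in addition to three emojis.
text = "These 👧 Emojis, 😈 are = embedded + within 🤖 this text."

# The negated set from the article: excludes word characters, whitespace, ',' and '.'.
basic = re.findall(r"[^\w\s,.]", text)
print(basic)     # ['👧', '😈', '=', '+', '🤖']  ('=' and '+' leak through)

# Extending the set with '=' and '+' removes the stray symbols.
extended = re.findall(r"[^\w\s,.=+]", text)
print(extended)  # ['👧', '😈', '🤖']
```

Every new symbol that may appear in the text has to be added to the set by hand, which is why this approach suits quick, one-off searches.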
Note: This method uses the regex library. Install this library using pip, before running the example.
Method 2: The Selectively Pick-Out-The-Emojis Method
This next method shows how to search and target emojis within the string. Once found, the emojis get appended to the final list.
$ python
Python 3.9.0b5 (default, Oct 19 2020, 11:11:59)
>>>
>>> ## Import the required Packages.
>>> import emoji
>>> import regex as re
>>>
>>> ## The original list. The Emojis are extracted from this list.
>>> orig_list = ['These 👧 Emojis, 😈 are 🤖 embedded 😺 within 😻 this 🙀 text.']
>>>
>>> ## Create an iterable object. The longest emojis come first, so that emojis
>>> ## made of several characters are matched whole and not piece by piece.
>>> emojis_iter = sorted(emoji.UNICODE_EMOJI['en'].keys(), key=len, reverse=True)
>>>
>>> ## Use the iterable object to compile a regular expression set of all emojis within
>>> ## the 'en' dictionary subset.
>>> regex_set = re.compile('|'.join(re.escape(em) for em in emojis_iter))
>>>
>>> ## Use the compiled regular expression set to find and extract all emojis in
>>> ## orig_list
>>> new_list = regex_set.findall(orig_list[0])
>>> new_list
['👧', '😈', '🤖', '😺', '😻', '🙀']
This method differs from the previous one in that it searches for the emojis themselves, instead of excluding the text around them.
To do this, first collect the keys of the 'en' subset of the emoji dictionary into an iterable (i.e. emojis_iter).
Next, use the iterable to compile a regular expression set. This set now contains all the emojis extracted from the emoji dictionary. Finally, use this regular expression set to extract all the emojis from the given string.
The Python community has put forth various reasons for compiling a regular expression set. Some of these reasons are
- It makes the code concise and easier to read.
- The code uses the search expression more than a few times, so the expression is built only once.
- The search expression can be compiled at a less compute-intensive point in the application, i.e. compile it at the beginning and use it later.
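The compile-once, use-many-times idea can be sketched as follows. This is only an illustration: SAMPLE_EMOJIS is a tiny hypothetical stand-in for the full emoji dictionary, build_emoji_regex is a made-up helper name, and the standard library re module is used in place of regex.

```python
import re

# Hypothetical stand-in for the emoji dictionary keys:
# thumbs up, thumbs up with a skin-tone modifier, smiling face with horns, robot.
SAMPLE_EMOJIS = ["\U0001F44D", "\U0001F44D\U0001F3FD", "\U0001F608", "\U0001F916"]

def build_emoji_regex(emojis):
    """Compile one alternation pattern, longest emojis first.

    Alternation tries the alternatives from left to right, so a short emoji
    listed before a longer one that starts with it would win the match.
    """
    ordered = sorted(emojis, key=len, reverse=True)
    return re.compile("|".join(re.escape(e) for e in ordered))

# Compile once, up front ...
EMOJI_RE = build_emoji_regex(SAMPLE_EMOJIS)

# ... and reuse the compiled object for every string.
posts = [
    "Nice \U0001F44D\U0001F3FD work \U0001F916",
    "No emojis here",
    "\U0001F608\U0001F608",
]
found = [EMOJI_RE.findall(post) for post in posts]
print(found)
```

Here the thumbs-up with a skin-tone modifier comes back as one two-character match, because the longer alternative is tried before the plain thumbs-up.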
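Note: This method uses the emoji and the regex libraries. Install these libraries using pip, before running the example. The UNICODE_EMOJI dictionary used above belongs to the 1.x releases of the emoji library. Releases from 2.0 onward replaced it with emoji.EMOJI_DATA, so either pin the library version or adapt the lookup accordingly.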
Method 3: The All-Out Hit It With Everything Method
This method uses the Python advertools package. This package provides productivity and analysis tools. Online marketers and data scientists use advertools to understand, manipulate and manage data. They also use advertools to visualize, communicate and make decisions based on data. A reader would use advertools if they wish to do some heavy-duty research on emojis. But first things first. The steps below use the advertools package to extract emojis from orig_list.
$ python
Python 3.9.0b5 (default, Oct 19 2020, 11:11:59)
>>>
>>> ## Import the required Packages.
>>> import advertools as adv
>>>
>>> ## The original list. The Emojis are extracted from this list.
>>> orig_list = ['These 👧 Emojis, 😈 are 🤖 embedded 😺 within 😻 this 🙀 text.']
>>>
>>> ## Use advertools to process orig_list. A dictionary is returned.
>>> emoji_dict = adv.extract_emoji(orig_list)
>>>
>>> ## This dictionary is packed with all sorts of statistical information about orig_list.
>>> ## We want specific information, which is enclosed in the 'emoji' key.
>>> ## Also the returned value is a list of lists. We need only the first list.
>>> new_list = emoji_dict['emoji'][0]
>>> new_list
['👧', '😈', '🤖', '😺', '😻', '🙀']
The extract_emoji() function simplifies extraction of all the emojis. It does a lot of work under the hood to separate emojis from the surrounding text. It returns a dictionary with a lot of useful statistical information. The dictionary key 'emoji' contains a list-of-lists, with one sub-list for each string in the input list. The first sub-list contains the answer. In the simple example above, emoji_dict['emoji'][0] extracts this first list and hence the answer.
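The list-of-lists shape can be illustrated without advertools. The sketch below is not how extract_emoji() works internally. It assumes a rough Unicode-range pattern (ROUGH_EMOJI_RE, a hypothetical name) that covers many common pictographic characters but is far from a complete emoji definition, and it only mirrors the "one sub-list per input string" layout.

```python
import re

# Assumption: a rough range over common pictographic blocks, NOT a full emoji list.
ROUGH_EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def emojis_per_string(strings):
    """Return one list of matches per input string (hypothetical helper)."""
    return [ROUGH_EMOJI_RE.findall(s) for s in strings]

orig_list = ["These 👧 Emojis, 😈 are 🤖 embedded.", "Another 😀 Emoji 💛 list"]

emoji_lists = emojis_per_string(orig_list)
print(emoji_lists)   # [['👧', '😈', '🤖'], ['😀', '💛']]

# Index 0 selects the emojis of the first input string only.
first = emoji_lists[0]
print(first)         # ['👧', '😈', '🤖']
```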
Note: This method uses the advertools package. The advertools package uses the regex and the emoji libraries within it. Install this package using pip, before running the example.
Method 4+: But Wait, There Is More!!!
Remember the dictionary returned by extract_emoji()? This dictionary contains useful information about emojis in the original string. The curious reader might be eager to explore all this useful information. The initial steps are the same as shown above, i.e.
$ python
Python 3.9.0b5 (default, Oct 19 2020, 11:11:59)
>>>
>>> ## Import the required Packages.
>>> import advertools as adv
>>>
>>> ## The original list. The Emojis are extracted from this list.
>>> orig_list = ['These 👧 Emojis, 😈 are 🤖 embedded 😺 within 😻 this 🙀 text.', 'Another 😀 Emoji 💛 list']
>>>
>>> ## Use advertools to process orig_list. A dictionary is returned.
>>> emoji_dict = adv.extract_emoji(orig_list)
>>>
>>> ## This dictionary is packed with all sorts of statistical information about orig_list.
>>> ## This information is explored below...
Note that orig_list has an extra string to better explain the features below.
First get the various keys in the dictionary.
>>> emoji_dict.keys()
dict_keys(['emoji', 'emoji_text', 'emoji_flat', 'emoji_flat_text', 'emoji_counts', 'emoji_freq', 'top_emoji', 'top_emoji_text', 'top_emoji_groups', 'top_emoji_sub_groups', 'overview'])
>>>
The reader is already familiar with the 'emoji' key from the example above. Note the returned list-of-lists. Each sub-list contains the emojis from the corresponding original sub-string.
>>> emoji_dict['emoji']
[['👧', '😈', '🤖', '😺', '😻', '🙀'], ['😀', '💛']]
>>>
The 'emoji_text' key returns the English names (or descriptions) of the individual emojis. Once again, note the returned list-of-lists.
>>> emoji_dict['emoji_text']
[['girl', 'smiling face with horns', 'robot', 'grinning cat', 'smiling cat with heart-eyes', 'weary cat'], ['grinning face', 'yellow heart']]
>>>
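For a rough idea of where such descriptions can come from, the standard library offers unicodedata.name(). This is only a sketch and an assumption about a comparable lookup, not the mechanism advertools uses: the official Unicode character names do not always match the short names shown above (for instance, the robot emoji is named 'ROBOT FACE' in Unicode), and the lookup works only for emojis made of a single character. The helper name describe is made up.

```python
import unicodedata

def describe(emojis):
    """Return lowercase Unicode character names for single-character emojis."""
    return [unicodedata.name(ch, "unknown").lower() for ch in emojis]

names = describe(["👧", "😈", "💛"])
print(names)   # ['girl', 'smiling face with horns', 'yellow heart']
```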
The 'emoji_flat' key returns a flat list of all emojis in the entire original list of strings. Note that the returned value is a flat list and not a list-of-lists as in previous examples.
>>> emoji_dict['emoji_flat']
['👧', '😈', '🤖', '😺', '😻', '🙀', '😀', '💛']
>>>
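The relationship between the list-of-lists and the flat list can be sketched with itertools.chain. The data below is made up for illustration, and this is one common way to flatten nested lists, not necessarily what advertools does internally.

```python
from itertools import chain

# Made-up per-string emoji lists; the middle string contained no emojis.
emoji_lists = [["👧", "😈", "🤖"], [], ["😀", "💛"]]

# Flatten one level: the sub-lists are concatenated in their original order.
emoji_flat = list(chain.from_iterable(emoji_lists))
print(emoji_flat)   # ['👧', '😈', '🤖', '😀', '💛']
```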
The 'emoji_flat_text' key returns a flat list of all emoji descriptions for the entire original list of strings. Note again that the returned value is a flat list.
>>> emoji_dict['emoji_flat_text']
['girl', 'smiling face with horns', 'robot', 'grinning cat', 'smiling cat with heart-eyes', 'weary cat', 'grinning face', 'yellow heart']
>>>
The 'emoji_counts' key returns a list of emoji counts, one for each sub-string in the original list of strings.
>>> emoji_dict['emoji_counts']
[6, 2]
>>>
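The 'emoji_freq' key returns a list of (emoji count, number of strings) pairs, i.e. it shows how many strings contain a given number of emojis. In the output below, one string has 2 emojis and one string has 6 emojis.
>>> emoji_dict['emoji_freq']
[(2, 1), (6, 1)]
>>>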
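The per-string counts and their frequency table can be reproduced from the list-of-lists with a few lines of standard library code. This is a sketch of the idea only, with the emoji lists typed in by hand, and it makes no claim about how advertools computes these values.

```python
from collections import Counter

# The per-string emoji lists from the example, typed in by hand.
emoji_lists = [["👧", "😈", "🤖", "😺", "😻", "🙀"], ["😀", "💛"]]

# One count per original string.
emoji_counts = [len(sub) for sub in emoji_lists]
print(emoji_counts)   # [6, 2]

# (emoji count, number of strings with that count), sorted by emoji count.
emoji_freq = sorted(Counter(emoji_counts).items())
print(emoji_freq)     # [(2, 1), (6, 1)]
```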
The 'top_emoji' key lists each unique emoji together with the number of times it occurs in the entire original list of strings.
>>> emoji_dict['top_emoji']
[('👧', 1), ('😈', 1), ('🤖', 1), ('😺', 1), ('😻', 1), ('🙀', 1), ('😀', 1), ('💛', 1)]
>>>
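Since every emoji in the example occurs exactly once, the counts above are not very telling. The sketch below uses a made-up flat list with repeats to show how such (emoji, count) pairs can be produced with collections.Counter. It illustrates the counting idea under that assumption and is not the advertools implementation.

```python
from collections import Counter

# Made-up flat list of emojis with repeats.
emoji_flat = ["😀", "💛", "😀", "🤖", "😀", "💛"]

# most_common() returns (item, count) pairs, highest count first.
top_emoji = Counter(emoji_flat).most_common()
print(top_emoji)   # [('😀', 3), ('💛', 2), ('🤖', 1)]
```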
The 'top_emoji_text' key lists each unique emoji description together with the number of times it occurs.
>>> emoji_dict['top_emoji_text']
([('girl', 1), ('smiling face with horns', 1), ('robot', 1), ('grinning cat', 1), ('smiling cat with heart-eyes', 1), ('weary cat', 1), ('grinning face', 1), ('yellow heart', 1)],)
>>>
The 'top_emoji_groups' key counts the number of emojis belonging to different emoji groups.
>>> emoji_dict['top_emoji_groups']
[('Smileys & Emotion', 7), ('People & Body', 1)]
>>>
The 'top_emoji_sub_groups' key counts the number of emojis belonging to different emoji sub-groups.
>>> emoji_dict['top_emoji_sub_groups']
[('cat-face', 3), ('person', 1), ('face-negative', 1), ('face-costume', 1), ('face-smiling', 1), ('emotion', 1)]
>>>
Finally, the 'overview' key gives an overview of orig_list from the point of view of emojis.
>>> emoji_dict['overview']
{'num_posts': 2, 'num_emoji': 8, 'emoji_per_post': 4.0, 'unique_emoji': 8}
>>>
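The four overview figures follow directly from the list-of-lists. The sketch below recomputes them by hand to show what each number means. The key names are copied from the output above, and the arithmetic is an assumption about their meaning rather than code taken from advertools.

```python
# The per-string emoji lists from the example, typed in by hand.
emoji_lists = [["👧", "😈", "🤖", "😺", "😻", "🙀"], ["😀", "💛"]]

# Flatten the sub-lists to count all emojis and the distinct ones.
all_emojis = [em for sub in emoji_lists for em in sub]

overview = {
    "num_posts": len(emoji_lists),                        # number of input strings
    "num_emoji": len(all_emojis),                         # total emojis found
    "emoji_per_post": len(all_emojis) / len(emoji_lists), # average per string
    "unique_emoji": len(set(all_emojis)),                 # distinct emojis
}
print(overview)
# {'num_posts': 2, 'num_emoji': 8, 'emoji_per_post': 4.0, 'unique_emoji': 8}
```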
Finxter Academy
This blog was brought to you by Girish, a student of Finxter Academy. You can find his Upwork profile here.
Reference
All research for this blog article was done using Python Documents, the Google Search Engine and the shared knowledge-base of the Finxter Academy and the Stack Overflow Communities. Concepts and ideas were also researched from the following websites…