Problem Formulation
Given a string s, create a new string based on s with all control characters, such as '\n' and '\t', removed.
What is a Control Character?
A control character, also called a non-printing character (NPC), is a character that doesn’t represent a written symbol. Examples are the newline character '\n' and the tab character '\t'. The complement of the set of control characters is the set of printable characters.
In Unicode, control characters occupy the code points U+0000 to U+001F, U+007F, and U+0080 to U+009F.
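To make these ranges concrete, here is a minimal sketch of a check based purely on code points. The helper name is_control_code_point is hypothetical and used only for illustration; it relies on the built-in ord() and on the fact that U+007F and U+0080 to U+009F form one contiguous range.

```python
def is_control_code_point(c):
    """Return True if c lies in one of the control ranges listed above.

    The function name is illustrative; it is not part of any library.
    """
    cp = ord(c)
    # U+0000..U+001F, plus U+007F and U+0080..U+009F (contiguous: 0x7F..0x9F)
    return cp <= 0x1F or 0x7F <= cp <= 0x9F


print(is_control_code_point('\n'))  # True: newline is U+000A
print(is_control_code_point('\t'))  # True: tab is U+0009
print(is_control_code_point('a'))   # False: a letter is printable
```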
Solution Based on Unicode Category
The unicodedata module provides a function unicodedata.category(c) that returns the general category assigned to the character c as a string. The Unicode categories 'Cc' (control), 'Cf' (format), 'Cs' (surrogate), 'Co' (private use), and 'Cn' (unassigned) could all be seen as “control characters”, although you could argue that only 'Cc' is a control character. In any case, you can customize our solution below based on your preferences.
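As a quick illustration of what unicodedata.category() returns, the following sketch looks up a few sample characters. The sample characters and the names samples and categories are chosen here for illustration only.

```python
import unicodedata

# Sample characters picked for illustration
samples = {
    'newline': '\n',
    'lowercase letter': 'a',
    'zero width space': '\u200b',
    'private use': '\ue000',
}

# Map each label to the two-letter general category of its character
categories = {name: unicodedata.category(c) for name, c in samples.items()}

for name, cat in categories.items():
    print(f'{name}: {cat}')
# newline: Cc
# lowercase letter: Ll
# zero width space: Cf
# private use: Co
```

Only the categories whose first letter is 'C' are affected by the filter used in this article; the letter 'a' (category 'Ll') passes through.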
If you treat all five categories as control characters, the Python one-liner ''.join(c for c in s if unicodedata.category(c)[0] != 'C') removes all control characters from the original string s.
Here’s the final code that removes all control characters from a string:
import unicodedata

def remove_control_characters(s):
    # Keep only characters whose Unicode category does not start with 'C'
    return ''.join(c for c in s if unicodedata.category(c)[0] != 'C')

s = 'hello\nworld\tFinxters!'
print(s)
s = remove_control_characters(s)
print(s)
- The join() method combines all strings in an iterable, using the string on which it is called as the separator. In our case, we join the characters on the empty string ''.
- The generator expression c for c in s if unicodedata.category(c)[0] != 'C' yields every character of s whose category doesn’t start with the uppercase 'C'.
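The two building blocks described above can be tried in isolation. This is a small sketch with made-up sample data (the names chars and no_vowels are illustrative), showing how the separator affects join() and how a filtered generator expression feeds it.

```python
chars = ['a', 'b', 'c']
print(''.join(chars))   # abc   (empty separator)
print('-'.join(chars))  # a-b-c (dash separator)

# A generator expression with a filter, passed directly to join()
no_vowels = ''.join(c for c in 'hello world' if c not in 'aeiou')
print(no_vowels)        # hll wrld
```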
Alternatively, you can write it using a simple for loop like this:
import unicodedata

def remove_control_characters(s):
    s_new = ''
    for c in s:
        # Append the character only if its category does not start with 'C'
        if unicodedata.category(c)[0] != 'C':
            s_new = s_new + c
    return s_new

s = 'hello\nworld\tFinxters!'
print(s)
s = remove_control_characters(s)
print(s)
The output of both variants is:
# First print() statement before removal of control chars
hello
world   Finxters!
# Second print() statement after removal of control chars
helloworldFinxters!
You can see that the second output doesn’t contain any control characters.
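If you prefer a narrower definition of “control character”, the filter can be parameterized. The following is only a sketch under stated assumptions: the function remove_by_category and its categories and keep parameters are hypothetical names introduced here, not part of the unicodedata module. By default it removes only 'Cc' characters and keeps newlines and tabs.

```python
import unicodedata


def remove_by_category(s, categories=('Cc',), keep='\n\t'):
    """Drop characters whose Unicode category is in `categories`,
    except for characters explicitly listed in `keep`.

    Hypothetical helper for illustration only.
    """
    return ''.join(
        c for c in s
        if c in keep or unicodedata.category(c) not in categories
    )


s = 'hello\nworld\x00\u200bFinxters!'

# Default: remove only 'Cc', but keep '\n' and '\t'
cleaned = remove_by_category(s)

# Stricter: remove 'Cc' and 'Cf', keep nothing
strict = remove_by_category(s, categories=('Cc', 'Cf'), keep='')

print(repr(cleaned))
print(repr(strict))
```

Being selective can matter in practice: the 'Cf' category includes the zero width joiner (U+200D), which some emoji sequences rely on, so removing every 'C' category may change how such text renders.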