Python Base64 – String Encoding and Decoding [+Video]

A Short Guide to Base64’s History and Purpose

Base64 is a family of binary-to-text transcoding schemas that convert arbitrary binary content to plain text and back.

Textual content is much simpler to store and transfer over a network than raw binary content, which opens many possibilities for flexible data exchange and processing between different, heterogeneous information systems.

One of the key benefits of Base64 is the possibility of data transfer over e-mail attachments.

Why is this important?

💡 Well, as e-mail is one of the oldest and most used communication technologies, dating back to 1971, it is a very common way of conveying information between a sender and any number of receivers. The information is most often delivered as a readable text message, but it can also carry an attachment in a binary form.

E-mail servers and clients support transfers of attached binary content by converting it to plain text before sending and back to the original binary content after receiving.

Of course, there is some overhead: the conversion takes time, and the encoded text is about a third larger than the original, because every three bytes become four symbols. In return, we get the flexibility of unrestricted content exchange.

Base64 is also used to store binary content in information systems such as databases.

Base64 should not be mistaken for a cryptographic algorithm, because it does not encrypt the data. It just converts the data to plain text and the transcoding schema is very well known, so there is no secrecy involved.

Base64 Memory Addressing Schema

Before we move on, I’d like to share a couple of thoughts on memory addressing schema.

When a computer stores and handles the data in its memory, namely the Random Access Memory (RAM), it has to use an addressing scheme.

In a common computer architecture model, the memory locations are accessed by their addresses.

Although it is not the only approach, we will take a look at how the byte addressing scheme works because it is the best approximation for our explanation of the Base64 mechanism.

When a memory location is addressed, a fixed amount of data, i.e. 1 byte (equal to eight bits), is read from it. This one byte is the smallest addressable amount of data.

If we had to store less than one byte of data, let’s say only three bits, such as 101, that data would get padded with leading zeroes and would look like 0000 0101 (don’t mind the space, it is here just to make the example more readable).

In the described case, we say that the addressing scheme used is byte addressing.

In other cases, where the addressing scheme uses a multiple of one byte, such as four bytes (32 bits, which we will call one word here), any data shorter than four bytes would get padded to the full length of four bytes.
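The padding described above can be illustrated with Python's built-in format() function. This is only a sketch of the idea on the level of bit strings, not a model of how real hardware stores data; the variable names are chosen for illustration.

```python
value = 0b101  # the three bits 101

# Pads the value with leading zeroes to one byte (8 bits)
# and to a four-byte word (32 bits).
as_byte = format(value, "08b")
as_word = format(value, "032b")

print(as_byte)       # 00000101
print(len(as_word))  # 32
```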

Accordingly, if the data we are handling is longer than four bytes, it gets split across more than one memory location. The number of locations is given by the formula ceil(data_len / mem_loc_len), where

  • ceil() is a mathematical function that rounds up to the nearest integer,
  • data_len is the length of our data in bits (or bytes) and
  • mem_loc_len is the length of our memory location in the same unit.

In other words, if our data is 33 bits long and our memory location is 32 bits long, our data would occupy two memory locations because ceil(33/32) = 2.
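The formula is easy to check in Python. The following is a minimal sketch; the helper name locations_needed is made up for this illustration and is not part of any library.

```python
import math


def locations_needed(data_len, mem_loc_len):
    """Returns how many memory locations the data occupies (hypothetical helper).

    Both lengths must use the same unit, e.g. bits.
    """
    return math.ceil(data_len / mem_loc_len)


print(locations_needed(33, 32))  # 2
print(locations_needed(32, 32))  # 1
print(locations_needed(3, 8))    # 1
```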

Base64 Transcoding Table

Base64 schema uses 64 characters (hence the name), which can be encoded with only six bits, since 2^6 = 64.

Before we go into the specifics of the transcoding mechanism, we should construct our transcoding table of symbols. The standard way of doing so is defined by RFC 4648, which says that the transcoding table

  • starts with capital letters of the English alphabet A-Z,
  • followed by small letters of the English alphabet a-z,
  • followed by digits 0-9, and
  • is concluded by symbols + and /.

The padding symbol, which is not included in the table, is the equality sign =.
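As a quick illustration, the table described above can be assembled from the constants in Python's string module. The name BASE64_TABLE is an assumption made for this sketch.

```python
import string

# Table order as described above: A-Z, a-z, 0-9, then '+' and '/'.
BASE64_TABLE = (
    string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
)

print(len(BASE64_TABLE))  # 64
print(BASE64_TABLE[0])    # A
print(BASE64_TABLE[26])   # a
print(BASE64_TABLE[52])   # 0
print(BASE64_TABLE[63])   # /
```

The position of a symbol in this string is exactly its 6-bit key, so a lookup is a plain index operation.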

Base64 Transcoding Mechanism

The transcoding mechanism for binaries takes into account that our data exists in byte-sized segments, meaning that whatever data we take and split into segments of 8-bit size, every segment will be full by design (because of the memory addressing schema), without the need to do any padding.

From this point on, we will consider our data on the bit level.

The process is very simple: we take three byte-sized segments and consider them as a segment of 3 x 8 bits = 24 bits.

👇 This choice of length will be further discussed in our section on “The Least Common Multiple” below.

Our segment of 24 bits is further segmented into four 6-bit segments, which will represent our keys for the transcoding table. Each of the four keys gets transcoded to the associated symbol, i.e. one of the 64 available symbols in the transcoding table.

For instance, if our 6-bit segment is 000 111, it gets transcoded to symbol 'H', and a segment 111 000 gets transcoded to symbol '4'.
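The steps above can be sketched for a single full 24-bit segment as follows. This is an illustrative variant that uses integer bit operations instead of bit strings; the names TABLE and encode_triplet are made up for this example.

```python
import string

TABLE = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def encode_triplet(three_bytes):
    """Transcodes exactly three bytes to four Base64 symbols (hypothetical helper)."""
    if len(three_bytes) != 3:
        raise ValueError("expected exactly three bytes")
    # Joins the three bytes into one 24-bit integer.
    block = int.from_bytes(three_bytes, "big")
    # Reads the four 6-bit keys from left to right.
    keys = [(block >> shift) & 0b111111 for shift in (18, 12, 6, 0)]
    return "".join(TABLE[key] for key in keys)


print(TABLE[0b000111], TABLE[0b111000])  # H 4
print(encode_triplet(b"Fin"))            # Rmlu
```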

Base64 Special Considerations

When the final data segment is shorter than three bytes, i.e. it holds two bytes or only one byte, it needs special handling, and this is where we introduce bit padding.

First, let us discuss how to process the two-byte segment.

A two-byte segment consists of 16 bits that get split into two whole 6-bit segments (2 x 6 bits = 12 bits) and a remainder of 4 bits. These four bits are padded with two 0 bits that complete the segment to a 6-bit length.

However, now that we have three 6-bit segments, and four segments were originally needed, the transcoding mechanism will take note of that and introduce a special, character-level padding symbol to account for the last segment.

This symbol does not exist in the transcoding table and is denoted as '='.

Here is an example of such a case: the original 2-byte data segment 0011 0101 0010 1010 gets split into 001 101 010 010 101 0+00, where the bits 00 are used as segment-level padding so that we end up with three full segments.

These segments are transcoded to symbols 'NSo' and the symbol '=' is added to account for the missing 6-bit segment, resulting in 'NSo='.

Considering the given example, we can always tell that if there is only one symbol '=' at the end, two bits were added to the original data and will be removed when we transcode the Base64 string back to the original data.

In the second possible example, our original data is made up of only one byte. This byte will get split into one 6-bit segment and the remaining two bits are padded with four 0 bits that will complete the segment to a full 6-bit length.

Now that we have two 6-bit segments, and four segments were originally needed, the special symbol will be used two times to account for the missing segments: '=='.

Here is an example of such a case: the original 1-byte data segment 0011 0101 gets split into 001 101 01+0 000, where the bits 0 000 are used as segment-level padding so that we end up with two full segments. These segments are transcoded to symbols 'NQ' and two padding symbols '==' are added to account for the two missing 6-bit segments, resulting in 'NQ=='.

Considering the given example, we can always tell that if there are two symbols '=' at the end, four bits were added to the original data and will be removed when we transcode the Base64 string back to the original data.
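Both padding examples can be verified with the base64 module from Python's standard library. The snippet below is just a check of the two cases discussed above; the variable names are chosen for illustration.

```python
import base64

two_bytes = bytes([0b00110101, 0b00101010])
one_byte = bytes([0b00110101])

print(base64.b64encode(two_bytes))  # b'NSo='
print(base64.b64encode(one_byte))   # b'NQ=='

# Decoding drops the padding and restores the original bytes.
print(base64.b64decode("NSo=") == two_bytes)  # True
print(base64.b64decode("NQ==") == one_byte)   # True
```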

Base64 Encoder/Decoder Implementation in Python

A Python implementation of a Base64 encoder/decoder is presented below. The parts that may be less intuitive are commented for better understanding.

class Base64(object):
    CHARS = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/'
    # 1-byte padding symbol.
    padding_symbol = b'\x00'

    @classmethod
    def chunk(cls, data, length):
        # Creates an array of data chunks (segments).
        data_chunks = [data[i:i + length] for i in range(0, len(data), length)]
        return data_chunks

    # Encodes a bytes object to a Base64 string.
    @classmethod
    def encode(cls, data):
        padding_length = 0

        # Calculates the length of the padding.
        if len(data) % 3 != 0:
            padding_length = (3 - len(data) % 3) % 3
        data += cls.padding_symbol * padding_length

        # Splits the data in three-byte chunks.
        chunks_3_byte = cls.chunk(data, 3)

        # Generates a binary string representation of each byte in the chunk,
        # i.e. three bytes -> three binary representations per chunk.
        bin_string_repr = ""
        for chunk in chunks_3_byte:
            for byte in chunk:
                # Cuts off the '0b' string prefix and appends
                # to the 'bin_string_repr'.
                # {:0>8} stands for leading zeroes, align right, 8 places.
                bin_string_repr += "{:0>8}".format(bin(byte)[2:])

        # Splits the data in six-bit chunks.
        chunks_6_bit = cls.chunk(bin_string_repr, 6)

        base64_encoded_str = ""
        for element in chunks_6_bit:
            # Transcodes the binary representation (2) string to an integer,
            # and maps it to an alphanumeric string.
            base64_encoded_str += cls.CHARS[int(element, 2)]

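        # Encodes the ending with the padding character(s) '='.
        # The check is needed because with no padding, the slice [:-0]
        # would be empty and the whole result would be lost.
        if padding_length:
            base64_encoded_str = base64_encoded_str[:-padding_length] + '=' * padding_length
        return base64_encoded_str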

    @classmethod
    def decode(cls, data):
        # Counts the number of '=' occurrences; this way we'll know
        # how much of the padding we should trim (0, 1 or 2 chars).
        replaced = data.count('=')

        # Replaces '=' by 'A'; it will be trimmed at the end, but we
        # need it until then to retain 3-byte segments.
        data = data.replace('=', 'A')

        binstring = ''
        # Processes each character and returns its binary code.
        for char in data:
            # {:0>6b} stands for leading zeroes, align right, 6 places, binary.
            binstring += "{:0>6b}".format(cls.CHARS.index(char))

        # Splits the data in 1-byte (8-bit) chunks.
        chunks_1_byte = cls.chunk(binstring, 8)

        base64_decoded_str = b''
        for chunk in chunks_1_byte:
            # Creates the decoded byte-string.
            base64_decoded_str += bytes([int(chunk, 2)])

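        # Trims the padding bytes; with no padding, the slice [:-0]
        # would be empty, so the data is returned unchanged instead.
        return base64_decoded_str[:-replaced] if replaced else base64_decoded_str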


if __name__ == "__main__":
    b64_enc = Base64.encode(b'Finxter rules!')
    print(b64_enc)
    b64_dec = Base64.decode(b64_enc)
    print(b64_dec)
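One way to sanity-check a hand-written encoder/decoder is to compare its results with the base64 module from Python's standard library. The sketch below only shows the reference side of such a comparison; the list of samples is an assumption chosen to cover inputs with two, one, and no padding symbols, including lengths that are an exact multiple of three bytes.

```python
import base64

# Sample inputs of different lengths (chosen for illustration).
samples = [b"", b"F", b"Fi", b"Fin", b"Finxter rules!"]

for sample in samples:
    encoded = base64.b64encode(sample).decode("ascii")
    # The round trip must restore the original bytes.
    assert base64.b64decode(encoded) == sample
    print(sample, "->", encoded)
```

A custom implementation should produce the same encoded strings for the same inputs.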

The Least Common Multiple

You might have asked yourself, why are we taking exactly 24 bits as a segment for transcoding to Base64?

There is a very simple reason behind this step, and it is called “the least common multiple”, usually abbreviated as “LCM”.

Least Common Multiple (LCM): Given any two positive integers A and B, the least common multiple of A and B is the smallest positive number that is divisible by both A and B without a remainder. For instance, the least common multiple of 2 and 3 is 6, because 6 / 2 = 3 with remainder = 0, and 6 / 3 = 2 with remainder = 0.

In the case of our specific interest, numbers A and B represent the length of segments in our addressing schema (8 bits) and the length of a binary represented character in the Base64 transcoding table (6 bits -> 64 characters).

By calculating the least common multiple for 6 and 8, we get 24.

If we perform a validity check: 24 / 6 = 4 with remainder = 0, 24 / 8 = 3 with remainder = 0, we can confirm that 24 indeed is a length that should be used for segment generation to support both 6-bit and 8-bit segments.
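The same calculation can be done in Python. This minimal sketch derives the least common multiple from the greatest common divisor (math.gcd); the helper name lcm and the variable segment_bits are chosen for this illustration.

```python
import math


def lcm(a, b):
    """Least common multiple of two positive integers via their gcd."""
    return a * b // math.gcd(a, b)


segment_bits = lcm(6, 8)

print(segment_bits)       # 24
print(segment_bits // 8)  # 3 (bytes going in)
print(segment_bits // 6)  # 4 (symbols coming out)
```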

Conclusion

In this article, we learned about the Base64 transcoding mechanism. 

  • First, we explained the use of Base64 in a real-world context. 
  • Second, we touched upon the topic of memory addressing schema. 
  • Third, we got acquainted with the transcoding table. 
  • Fourth, we explained how the transcoding mechanism works.
  • Fifth, we dove into special considerations on how to handle “incomplete” data (incomplete in terms of not being a 24-bit multiple).
  • Sixth, we analyzed a Base64 implementation.
  • Seventh, we held our breath for some simple math theory on the least common multiple.