Serialize DataFrame to Markdown, Strata, HDF, LaTex, XML

This article focuses on the serialization and conversion methods of a Python DataFrame:

to_markdown(),
to_stata(),
to_hdf(),
to_latex(),
to_xml().

Let’s get started!

Preparation

Before any data manipulation can occur, four (4) new libraries will require installation.

The Pandas library enables access to/from a DataFrame.
The Tabulate library enables formatted output.
The Tables library allows formatted output (table format).
The lxml library enables writing to an XML file.

To install these libraries, navigate to an IDE terminal. At the command prompt ($), execute the code below. For the terminal used in this example, the command prompt is a dollar sign ($). Your terminal prompt may be different.

$ pip install pandas

Hit the <Enter> key on the keyboard to start the installation process.

$ pip install tabulate

Hit the <Enter> key on the keyboard to start the installation process.

$ pip install tables

Hit the <Enter> key on the keyboard to start the installation process.

$ pip install lxml

Hit the <Enter> key on the keyboard to start the installation process.

If the installations were successful, a message displays in the terminal indicating the same.

Feel free to view the PyCharm installation guide for the required libraries.

Add the following code to the top of each code snippet. This snippet will allow the code in this article to run error-free.

import pandas as pd
import tabulate
import tables
import lxml

DataFrame.to_markdown()

The to_markdown() method provides a simple character-based border surrounding the cells/rows of a table when output to the terminal.

The syntax for this method is as follows:

DataFrame.to_markdown(buf=None, mode='wt', index=True, storage_options=None, **kwargs)

Parameter	Description
`buf`	This parameter is the buffer to write. If empty, a string returns.
`mode`	This depicts the mode the file opens in. The default is ‘`wt`‘.
`index`	If `True`, add the index (row) labels to the output.
`storage_options`	This parameter contains extra options (dictionary format), such as host, port, username, etc.

Rivers Clothing had a 5-day sale on Winter Scarfs and Sweaters. The Sales Manager would like a Sales Report in this regard.

Run this code to view the formatted output.

df = pd.DataFrame({'2022-01-27': [33, 41],
                   '2022-01-28': [52, 43],
                   '2022-01-29': [48, 98],
                   '2022-01-30': [23, 23],
                   '2022-01-31': [49, 43]},
                   index=['Scarfs', 'Sweaters'])

df.loc['Total',:] = df.sum(axis=0)
print(df.to_markdown())

Line [1] creates a DataFrame from a dictionary of lists. The output saves to df.
Line [2] creates a Total row and calculates the daily sales. The output saves to df.
Line [3] outputs the formatted DataFrame to the terminal.

Output

Note that if you want to convert the Markdown table back to a CSV file, follow the instructions in this tutorial on the Finxter blog.

DataFrame.to_stata()

The to_stata() method converts a DataFrame object to a database-like format (.dat) file.

The syntax for this method is as follows:

DataFrame.to_stata(path, convert_dates=None, write_index=True, byteorder=None, time_stamp=None, data_label=None, variable_labels=None, version=114, convert_strl=None, compression='infer', storage_options=None, *, value_labels=None)

Parameter	Description
`path`	This parameter is the string path to write. If empty, a string returns.
`convert_dates`	The date conversion method. The options are: `'tc', 'td', 'tm', 'tw', 'th', 'tq', 'ty'`. By default, `'tc'` is set.
`write_index`	If `True`, write the index to the Stata dataset.
`byteorder`	This parameter can be: `'<', '>', 'little'`, or `'big'`. The default is `sys.byteorder`.
`time_stamp`	This parameter is the datetime to use as the date created. Default is the current time.
`data_label`	This is the label for the dataset. The maximum length is 80 characters.
`variable_labels`	This is a dictionary with columns as keys and labels as values. The maximum length is 80 characters.
`version`	This is the version to use in the output (`.dta`) file.
`convert_strl`	This parameter is a list containing column names to convert to Stata `StrL` format.
`compression`	If infer is selected, the options are: `'.gz', '.bz2', '.zip', '.xz',` or `'.zst'` extensions.
`storage_options`	This parameter contains extra options (dictionary format), such as host, port, username, etc.
`value_labels`	A dictionary with columns as keys and dictionaries of column values.

This example reads in the first five (5) rows of the Periodic Table CSV file to a Stata dataset. Click here to save this CSV file and move it to the current working directory.

df = pd.read_csv('PubChemElements_all.csv',
                 usecols=['AtomicNumber', 'Symbol', 'Name', 'YearDiscovered']).head()
print(df)
df.to_stata('elements.dta')

Line [1] does the following:
- reads in the first five (5) rows (head) of the CSV file
- selects the columns to display
- saves the output to the DataFrame df
Line [2] outputs the DataFrame to the terminal.
Line [3] outputs the DataFrame to a Stata dataset file.

Output

	Atomic Number	Symbol	Name	Year Discovered
0	1	H	Hydrogen	1766
1	2	He	Helium	1868
2	3	Li	Lithium	1817
3	4	Be	Beryllium	1798
4	5	B	Boron	1808

💡 Note: If you navigate to the current working directory, the elements.dta file resides in the file list.

DataFrame.to_hdf()

The to_hdf() method writes data to a Hierarchical Data Format (HDF) file. This format can hold a mixture of objects accessed individually or by a group.

The syntax for this method is as follows:

DataFrame.to_hdf(path_or_buf, key, mode='a', complevel=None, complib=None, append=False, format=None, index=True, min_itemsize=None, nan_rep=None, dropna=None, data_columns=None, errors='strict', encoding='UTF-8')

Parameter	Description
`path_or_buf`	This parameter is the file `path`/`HDFStore` to write. If empty, a string returns.
`key`	This depicts the identifier for the group in the `HDFStore`.
`mode`	The mode to use to open a file. The options are: `'a', 'w', 'r+'`. The default mode is `'a'` (append).
`complevel`	This parameter sets the compression level (0-9). Zero disables compression.
`complib`	Specifies the compression method to use: `'zlib', 'lzo', 'bzip2', 'blosc'`. The default compression is `'zlib'`.
`append`	If `True` and format is `'table'`, it appends the input data to the existing table.
`format`	The available format options are: – `'fixed'`: A fixed format that does not allow appends/searches. – `'table'`: Writes to a table. This option has appends/searches. – `None`: Falls to fixed or `pd.get_option('io.hdf.default_format')`.
`errors`	Depict how errors are determined. The default value is `'strict'`.
`min_itemsize`	A dictionary containing column names to min. string sizes.
`nan_rep`	Depicts how to represent NULL values as a string. This option is not permitted if the append parameter is `True`.
`data_columns`	This parameter is a column list for indexed data. This option is available if the format is `'table'`.
`encoding`	Depicts the encoding. The default value is `'UTF-8'`.

This example creates a DataFrame with the Host City details for the previous five (5) Summer and Winter Olympic Games.

df = pd.DataFrame(({2010: ['Vancouver', 'Canada', 'North America'],
                    2012: ['London', 'United Kingdon', 'Europe'],
                    2014: ['Sochi', 'Russia', 'Europe',],
                    2016: ['Rio de Janeiro', 'Brazil', 'South America'],
                    2018: ['Pyeongchang', 'South Korea', 'Asia']}))

df.to_hdf('olympics.h5', key='Games', mode='w', format='table')
print(pd.read_hdf('olympics.h5', 'Games'))

Line [1] creates a DataFrame from a dictionary of lists. The output saves to df.
Line [2] does the following:
- creates an h5 file
- sets the key to Games
- sets the file mode to w (write mode)
- sets the output to a table format
- saves the output to olympics.h5
Line [3] reads in and displays the contents of the olympics.h5 file.

Output

	2010	2012	2014	2016	2018
0	Vancouver	London	Sochi	Rio de Janeiro	Pyeongchang
1	Canada	United Kingdon	Russia	Brazil	South Korea
2	North America	Europe	Europe	South America	Asia

💡 Note: If you navigate to the current working directory, the olympics.h5 file resides in the file list.

DataFrame.to_latex()

The to_latex() method converts a DataFrame into a formatted LaTeX document. This document can then save as a separate file.

The syntax for this method is as follows:

DataFrame.to_latex(buf=None, columns=None, col_space=None, header=True, index=True, na_rep='NaN', formatters=None, float_format=None, sparsify=None, index_names=True, bold_rows=False, column_format=None, longtable=None, escape=None, encoding=None, decimal='.', multicolumn=None, multicolumn_format=None, multirow=None, caption=None, label=None, position=None)

Parameter	Description
`buf`	This parameter is the file path/buffer to write. If empty, a string returns.
`columns`	This parameter is the sub-set of columns to write. If empty, all columns write.
`col_space`	This depicts the length of each column.
`header`	This parameter writes out the column names.
`index`	This parameter writes out the row (index) names.
`na_rep`	This parameter represents the string value for missing data.
`formatters`	This parameter is a formatter function to apply to elements by position/name.
`float_format`	This parameter is a formatter for floating-point numbers.
`sparsify`	If True and MultiIndex, display the key for each row.
`index_names`	This parameter displays the index names.
`bold_rows`	This parameter displays the row names in bold.
`column_format`	This parameter is the column format as outlined in the LaTeX table format
`longtable`	The value of this parameter is read from the `pandas config` module. If `True`, use a `longtable` format instead of tabular.
`escape`	The value of this parameter is read from the pandas `config` module. If `False`, prevent escaping LaTeX special characters in column names.
`encoding`	A string representation of encoding. By default, UTF-8 is used.
`decimal`	This parameter is the decimal separator, comma (`,`) in Europe.
`multicolumn`	If `True`, use multi-column to enhance MultiIndex columns.
`multicolumn_format`	This parameter is the alignment for multi-columns.
`multirow`	If `True`, use multi-row to enhance MultiIndex rows.
`caption`	This parameter is a tuple containing the caption.
`label`	This parameter is the LaTeX label inside `\label{}` in the output.
`position`	This parameter is the table position. This option is placed after `\begin{}` in the output.

The owner of Finxters has decided to add some additional details for their users: their favorite color in RGB format. At present, this is just a test.

Run the code to create a LaTeX document (.tex) file containing sample users.

cols = ('Name', 'Username', 'Fav Color'); 
df = pd.DataFrame ((('Pete',  '1998_pete2', (139, 0, 139)),
                    ('Leon',  'Gar_man',    (143, 188, 143)),
                    ('Isla',  'Isla2021',   (173, 216, 230)),
                    ('Blake', 'kirbster',   (147, 112, 219))), 
                    columns=cols)

print(df)
df.to_latex('finxters.tex', index=False, caption='User Details')

Line [1] creates column names and saves them to the cols list.
Line [2] creates a DataFrame with user data and sets the columns to the cols variable created above.
Line [3] outputs the DataFrame to the terminal.
Line [4] saves the DataFrame to a LaTeX file (finxters.tex).

Output

df

	Name	Username	Fav Color
0	Pete	1998_pete2	(139, 0, 139)
1	Leon	Gar_man	(143, 188, 143)
2	Isla	Isla2021	(173, 216, 230)
3	Blake	kirbster	(147, 112, 219)

finxters.tex file

💡 Note: You can also use an online converter to convert a tex file to a pdf.

DataFrame.to_xml()

The to_xml() method converts a DataFrame object into a valid XML format.

The syntax for this method is as follows:

DataFrame.to_xml(path_or_buffer=None, index=True, root_name='data', row_name='row', na_rep=None, attr_cols=None, elem_cols=None, namespaces=None, prefix=None, encoding='utf-8', xml_declaration=True, pretty_print=True, parser='lxml', stylesheet=None, compression='infer', storage_options=None)

Parameter	Description
`path_or_buffer`	This parameter is the file/string to write. If empty, a string returns.
`index`	If `True`, includes the index in the XML document.
`root_name`	This parameter is the root name of the XML document.
`row_name`	This parameter is the name of row elements in the XML document.
`na_rep`	This is a string representation of any missing data.
`attr_cols`	This is a column list to write as row element attributes.
`elem_cols`	This is a column list to write as child-row elements.
`namespaces`	This parameter is the namespaces defined in the root element.
`prefix`	This is a prefix for the namespace for each element/attribute.
`encoding`	This is the encoding of the XML document. The default is UTF-8.
`xml_declaration`	If `True`, include the XML declaration at the top of the document.
`pretty_print`	If `True`, the XML outputs with indentation and line breaks.
`parser`	This is the parser module for the building of a tree. The `lxml` and `etree` are supported.
`stylesheet`	A URL, file, or string containing an XSLT script for formatting the XML output.
`compression`	If infer is selected, the options are: `'.gz', '.bz2', '.zip', '.xz'`, or `'.zst'` extensions.
`storage_options`	This parameter contains extra options (dictionary format), such as host, port, username, etc.

This example reads in the countries.csv file and saves the same to an XML file. Click here to save this CSV file and move it to the current working directory.

df = pd.read_csv('countries.csv')
df.to_xml('countries.xml', row_name='country', pretty_print=True)

Line [1] reads in the comma-delimited CSV file and saves it to df.
Line [2] creates an XML file with the following options:
- adds <country></country> tags around each country (row)
- prints to the XML file with the appropriate indents and line break.

Output (partial)

💡 Note: Click here to validate your XML.

Further Learning Resources

This is Part 21 of the DataFrame method series.

Part 1 focuses on the DataFrame methods abs(), all(), any(), clip(), corr(), and corrwith().
Part 2 focuses on the DataFrame methods count(), cov(), cummax(), cummin(), cumprod(), cumsum().
Part 3 focuses on the DataFrame methods describe(), diff(), eval(), kurtosis().
Part 4 focuses on the DataFrame methods mad(), min(), max(), mean(), median(), and mode().
Part 5 focuses on the DataFrame methods pct_change(), quantile(), rank(), round(), prod(), and product().
Part 6 focuses on the DataFrame methods add_prefix(), add_suffix(), and align().
Part 7 focuses on the DataFrame methods at_time(), between_time(), drop(), drop_duplicates() and duplicated().
Part 8 focuses on the DataFrame methods equals(), filter(), first(), last(), head(), and tail()
Part 9 focuses on the DataFrame methods equals(), filter(), first(), last(), head(), and tail()
Part 10 focuses on the DataFrame methods reset_index(), sample(), set_axis(), set_index(), take(), and truncate()
Part 11 focuses on the DataFrame methods backfill(), bfill(), fillna(), dropna(), and interpolate()
Part 12 focuses on the DataFrame methods isna(), isnull(), notna(), notnull(), pad() and replace()
Part 13 focuses on the DataFrame methods drop_level(), pivot(), pivot_table(), reorder_levels(), sort_values() and sort_index()
Part 14 focuses on the DataFrame methods nlargest(), nsmallest(), swap_level(), stack(), unstack() and swap_axes()
Part 15 focuses on the DataFrame methods melt(), explode(), squeeze(), to_xarray(), t() and transpose()
Part 16 focuses on the DataFrame methods append(), assign(), compare(), join(), merge() and update()
Part 17 focuses on the DataFrame methods asfreq(), asof(), shift(), slice_shift(), tshift(), first_valid_index(), and last_valid_index()
Part 18 focuses on the DataFrame methods resample(), to_period(), to_timestamp(), tz_localize(), and tz_convert()
Part 19 focuses on the visualization aspect of DataFrames and Series via plotting, such as plot(), and plot.area().
Part 20 focuses on continuing the visualization aspect of DataFrames and Series via plotting such as hexbin, hist, pie, and scatter plots.
Part 21 focuses on the serialization and conversion methods from_dict(), to_dict(), from_records(), to_records(), to_json(), and to_pickles().
Part 22 focuses on the serialization and conversion methods to_clipboard(), to_html(), to_sql(), to_csv(), and to_excel().
Part 23 focuses on the serialization and conversion methods to_markdown(), to_stata(), to_hdf(), to_latex(), to_xml().
Part 24 focuses on the serialization and conversion methods to_parquet(), to_feather(), to_string(), Styler.
Part 25 focuses on the serialization and conversion methods to_bgq() and to_coo().