Exploring Data File Structures with Python

When writing programs and scripts, it is important that you follow standards in the way you structure your data. These standards are what allows one program to generate some data, another program to consume that data. It is also used when one program needs to save some data for use later on.

This structured data can be stored in many ways. It can be stored in memory while the program is actively running. It can be stored in a database, and it can be stored as a file on your filesystem.

Today we will discuss some of the most common data structures that you might encounter when writing your applications.

Binary files

Binary files may contain structure inside of them. But this structure is not easily discernible to someone who may want to inspect them. A good example of a binary file is a MS Word document from before Word 2007. Prior to Word 2007, Microsoft saved all office documents in a binary format. This meant you could only read and write these files using Microsoft Office programs or programming libraries.

This was OK if all you wanted to do was to use the program as designed. But if you wanted to build complementary products, you had to install office to get access to dll files, which contained the programming libraries you needed to read office files.

This also proved to be a problem should you experience some sort of file corruption. Since Office files were proprietary and binary, you didn’t have much of a chance of fixing file corruption if Word, Excel, etc… were unable to resolve the problem for you.

These issues were not unique to Microsoft office. This is an issue with any application specific file such as documents from early versions of WordPerfect, or a saved game from a video game.

It is not impossible for a third party to interpret this binary data. But it is also not incredibly easy. In the next few sections we will talk through some more standardized file formats

What are XML files?

XML stands for Extensible Markup Language. It is a structured way of organizing your data. XML relies heavily on human-readable tags for structuring your data. If you are familiar with how HTML works, XML tags work very similarly. You have an opening tag, some data, then a closing tag. Below is a very basic example of what XML looks like

<?xml version="1.0" encoding="UTF-8"?>
<text>
  <line1>Hello World!</line1>
  <line2>Hello World!</line2>
</text>

In the above example, you see we have a tag where we specify that this is an XML document. We then have a pair of text tags, then some line tags with values specified inside of them. By structuring our data like this, we can write programs that can easily read, interpret, and save data from/in this format. Every major programming language has libraries for reading and writing XML files.

If we refer back to the previous section, I mentioned that before Office 2007, Microsoft used Binary formats for all of their documents. If you try to open an older word document with a basic text editor, you will see a lot of unreadable garbage information.

Starting in Office 2007, Microsoft created a new office format based on XML. That is why your Word 2007 and newer documents are called mydocument.docx . That X is for XML.

Each office document is actually a collection of XML files that together form your document. If you are interested in reading about the specification in detail, you can find the specification here.

In case are interested in using XML as a way to store the data for your application, or you have an existing XML file you need to read, here is an example of how to read and write an XML file in Python.

########### Read XML ##############
#Install Dependency
#pip install xmltodict
import xml.etree.ElementTree as xml
import xmltodict

#Read XML file
tree = xml.parse('test.xml')
myxml = tree.getroot()

######### Write XML to file ############
xmlfile = open(r"test.xml","w+") 
xmlfile.write(str(myxml))

Another very common place that XML is used is with REST and SOAP APIs. SOAP stands for Simple Object Access Protocol. In short, SOAP API’s are simple programs that you communicate over SMTP or HTTP(s) and pass XML files back and forth. REST API’s work largely the same as SOAP API’s in this context. Although, it is more common for REST API’s to use JSON these days.

What is JSON?

JSON stands for Javascript Object Notation. It has had an official specification since 2001. Although it has existed since at least 1996 in unofficial forms. If you are interested, you can see a presentation by Douglas Crockford here about the beginnings of JSON. Douglas is the guy who came up with the specification.

Even though JSON has Javascript in the name. It really has nothing to do with Javascript other than it is really easy to use with Javascript. Most/all modern languages include the capability to read and write JSON due to how prevalent it is on the web. If you interact with any REST API’s, they most likely pass JSON back and forth.

Here is an example of what JSON might look like:

[{
"Name" : "Bob",
"City" : "San Diego",
"State" : "CA"
},
{
"Name" : "Sue",
"City" : "San Francisco",
"State" : "CA"
},
{
"Name" : "Ted",
"City" : "New York",
"State" : "NY"
},
{
"Name" : "Joe",
"City" : "Miami",
"State" : "FL"
}]

In this example, we have a simple array of JSON objects containing 3 attributes (keys): Name, City, State. and values that correspond to those attributes. This is one of the more simple examples. We can also nest the JSON objects together

{
"Bob":{"City":"San Diego","State":"CA"},
"Sue":{"City":"New York","State":"NY"},
"Joe":{"City":"Miami","State":"FL"}
}

As you can see in both examples. JSON is very simple and easy to read format. It consists of a pair of curly braces to show the beginning and the end of a given object, then a word wrapped in double quotes to notate the key/attribute, a colon to separate the key from the value. Then a value wrapped in double quotes again. We separate key/value pairs with a comma. In the example above, we have a JSON object with 2 key/value pairs (City and State) nested within another JSON object.

JSON and XML both have their respective advantages. But a couple of the advantages of JSON vs XML is it is less verbose, so it takes less storage space to store it, and less bandwidth to transmit it. This translates into faster web applications. And I think it is easier to read.

For examples of how to read and write JSON files, see our article on converting JSON files to/from CSV.

As a bonus, here is an example of how you would convert JSON to XML and convert XML to JSON

########### Convert JSON to XML #################
#First we need to use pip to install dependencies
#pip install dicttoxml

#Import our Libraries
from json import loads
from dicttoxml import dicttoxml
import xml.etree.ElementTree as xml

#Declare our JSON Object
myjson = '{"Bob":{"City":"San Diego","State":"CA"},
"Sue":{"City":"New York","State":"NY"},"Joe":{"City":"Miami","State":"FL"}}'

#Convert our JSON to XML
xmldoc = dicttoxml(loads(myjson))

#Write XML document to file
xmlfile = open(r"test.xml","w+") 
xmlfile.write(str(xmldoc))

########### Convert XML to JSON #################
#Install Dependency
#pip install xmltodict
import xml.etree.ElementTree as xml
import xmltodict

#Read XML file
tree = xml.parse('test.xml')
tree = tree.getroot()

#Convert XML to string
xml_str = xml.tostring(tree).decode()

#Convert XML string into JSON Objcet
json = xmltodict.parse(xml_str)

#Print JSON Object
print(json)

When the above code runs it prints out the JSON to the screen. There are a couple of things to note about this output. First, it is somewhat hard to read. We can make it a bit easier by using Pretty Print. Pretty Print is a Python module for printing out JSON and XML in a more readable form. To use pretty print, you would do the following

import pprint
pp = pprint.PrettyPrinter()
pp.pprint()
print(json)

Now that we have pretty print, the output will be much easier to read. Here is a comparison of the output before and after pretty print:

As you can see, Pretty print is much more readable. It is also more obvious what it looks like now that we have converted our XML into JSON. Compare this ouput to some of the other JSON examples we have recently discussed and you will notice some differences.

One major difference between XML and JSON is JSON is typically just text. XML has different data types embedded into it. When we converted from XML to JSON, we brought along the extra metadata describing the datatypes that were in the XML. That is why you see type: dict and type: str.

CSV

CSV stands for Comma Separated Values. in a CSV file you will generally have a row that acts as headings, then many rows of data. Each column heading is separated by a comma, and each value is separated by a comma.

CSV is very command when you are extracting data from databases or spreadsheets. But it is also used quite often with other programs that need to do data manipulation.

The nice things about CSV are that it is easy to construct both manually and programmatically, it is fairly readable if a human needs to open it up and inspect the file. And it is very commonly used. So most applications you encounter will be able to read/write CSV files. Here is an example of a CSV file

Name, City, State
bob,San Diego,CA
sue,New York,NY
joe,Miami,FL

In the above example, we used the same data as we did in the JSON example to make it easier to compare the two.

CSV’s are even less verbose than JSON, which makes them even faster for transmitting data. They are easy to read, and they match the general table structure of spreadsheets and relational databases. However, they are pretty rigid. JSON is a lot more flexible if you need each record to have slightly different attributes.

If you are using CSV files to store your data, and you encounter instances where your data contains commas, it can be difficult to work around. There are similar files which use the same data structure. Those could be pipe delimited files ‘|’, semicolon ‘;’ or possibly tab separated files. These files can typically all be ready with the same libraries as CSV files. You just need to specify a delimiter when you import the file.

We have an article here on converting files from JSON to CSV and converting CSV files to JSON. If you would like some code examples for interacting with these two data structures, it is a good resource.

YAML/YML

YAML is Pronounced Yah-Mall. I have seen it written as both YML and YAML. According to yaml.org, it is proper to use YAML. YAML used to mean “Yet Another Markup Language”. But it was later changed to “YAML Ain’t Markup Language”. This was changed to emphasis YAML’s role as a data structure and not simply a markup language like HTML.

YAML is very commonly used for configuration files. For example, Docker very heavily uses YAML files. YAML supports hierarchal relationships of the data. And is sensitive to spaces in the files. Here is an example YAML data structure

---
Bob:
  City: San Diego
  State: CA
Sue:
  City: New York
  State: NY
Joe:
  City: Miami
  State: FL

As you can see, in the example, we declare a name for a given object, then specify the attributes on the following lines. The structure looks very similar to JSON. But you eliminate the curly braces. In this aspect, it is similar to Python. Also like Python, you use spaces in the file to denote the structure. As an application is reading the YAML, it knows that anything with two spaces in front of it should be nested under the main object name.

In the code example below, you can see how to convert from JSON into YAML, then write the YAML file to the file system

<Insert code here>#Insdtall Dependencies
#pip install pyyaml

#Import Libraries
import yaml
import json

#Declare JSON Object
myjson = '{"Bob":{"City":"San Diego","State":"CA"},
"Sue":{"City":"New York","State":"NY"},"Joe":{"City":"Miami","State":"FL"}}'

#Convert JSON to YAML
myyaml = yaml.dump(yaml.load(json.dumps(json.loads(myjson))))
print(myyaml)

#Write file
yamlfile = open(r"test.yml","w+") 
yamlfile.write(str(myyaml))

#Close file
yamlfile.close()

In the next code example, you can see how to read a YAML file from the file system, then convert that YAML data into JSON

import yaml

yamlfile = open(r"test.yml","r") 

myjson = yaml.load(yamlfile)
myyaml = yaml.dump(myjson)

print(myyaml)
print(myjson)

yamlfile.close()

In the code example above, you can see how we read the file using the regular libraries for reading a text file in Python. We then use the yaml.load function to parse the text input. You might have found my choice of variable name “myjson” unusual. I chose this variable because when we first parse the file, the resulting data structure is JSON.

We then call the yaml.dump function and pass in our JSON. This converts our input from JSON back into YAML again.

Databases

Databases are programs whose job is to store, organize, and retrieve data. Databases usually have specialized languages you use for interacting with them such as T-SQL.

In each of the previous sections, I discussed common data structures. Given that you can store all your data in JSON or CSV, why do you need a database? The answer is scale.

If you look at our article on TinyDB, you will see there are very basic database options out there like TinyDB. TinyDB stores all the data you insert into a single JSON document stored on your file system.

Contrast this with MySQL or Oracle, or MS SQL, these database engines store their files in binary format on the file system. But they give you programmatic interfaces for inserting and retrieving your information.

Once you have defined the structure of your data within the database, it is up to the database engine to figure out how to store the files on disk and keep your data consistent. All of your interactions with the data will be through the database engine interface.

See below for example code of how to connect and query a mySQL database

#Install dependencies
#pip install mysql-connector

#import module
import mysql.connector

#Setup SQL Connection String
db = mysql.connector.connect(
  host="localhost",
  user="myuserrname",
  passwd="mypassword"
)

#Create a cursor pointing at the database
cursor = db.cursor()

#Execute SQL Query and fetch results
cursor.execute("SELECT * FROM mytable")
results = cursor.fetchall()

#Print results to the screen
for result in results:
  print(result)

Summary

Today we have discussed the most common data structures you might use when reading and writing files in your python application, or when interacting with other services.

We have discussed the differences between Binary, XML, YAML, JSON, and CSV files, and databases. There are a lot of commonalities in the way these files are structured. But there are also major differences. Each file format has its benefits and weaknesses, and that is why there are so many different file formats.