It has been said that a programmer is not worth their salt until they understand how compilers work. In the spirit of self improvement, I’m taking a small step down that path by making a foray into the world of compiler frontends: lexing and parsing.
A good candidate for tiptoeing into this space is JSON, which is arguably the most popular data serialization format today. Writing a parser for JSON is not something most programmers do, as there exists a number of libraries for doing this type of thing in many of the mainstream languages. Taking a deep dive into the implementation of a technology that is often taken for granted seems like a good way of learning something important.
Here, we’ll attempt to write a simplified JSON parser that follows but does not fully implement the official JSON specification. In particular, only objects or arrays will be considered to be valid JSON text, and standalone string literals, numbers, and boolean values will not be considered valid. This constraint is not clearly specified in the original JSON specification, but we deliberately make this choice for the exercise. (However, as it turns out the implementation we arrive at below will actually be able to handle standalone literals, which is a nice side-effect).
Additionally, we will further simplify the task at hand by:
- Not handling escape characters in string literals
- Not handling scientific notation for numeric values
Boolean values and null
are also out of scope and left as an exercise for the
reader to add as extensions. Our JSON parsing task will be modeled as a single
process that combines both lexing (i.e. tokenization) and parsing. The JSON text
is to be treated as a stream of characters, and a stack data structure will be
used for storing values.
To start, let’s define the basic parsing control flow. When a bracket or brace is encountered, we should create a new array or object, respectively, and push it onto the parser stack. Once the closing bracket or brace is encountered, the active container (i.e. value at the top of the stack) can be considered complete and popped off the stack. By handling the opening and closing brackets and braces, we will have defined a significant portion of the core parsing code.
def parse(text):
stack = []
index = 0
while index < len(text):
char = text[index]
if char == '{': # Start object
stack.append({})
elif char == '}': # End object
stack.pop()
elif char == '[': # Start array
stack.append([])
elif char == ']': # End array
stack.pop()
index += 1
The stack keeps track of the arrays or objects that have been encountered thus far at any point in the stream. Importantly, it preserves the order of these arrays or objects, and will allow us to correctly handle nested structures. For this example JSON text:
[{"a":[1,2,3]},"b",["c"]]
a condensed view of the state stored by the stack looks like this:
[]
[[]]
[[], {}]
[[], {}, []]
[[], {}]
[[]]
[[], []]
[[]]
[]
The stack starts off empty ([]
), then the initial/outer array is pushed onto
the stack ([[]]
). The first element in the array is an object, which itself
contains an array as a value; this reflects the deepest part of the nesting in
the structure, which is when the stack is at its maximum size: [[], {}, []]
.
As closing brackets and braces are encountered in the stream, items are popped
off the stack accordingly.
Now that we have the basic structure, the next step is to define some logic for handling literals in the JSON text so that we can actually parse values and insert them into the container objects on the parser stack. For strings, we have made the simplifing assumption that there are no escape characters. This means we can simply look for opening and closing double quotes as the delimiters for a full string literal, and parse accordingly.
def parse_string(text, start):
"Returns a string literal and index at last character of literal (i.e. a double quotation mark)"
end = start + 1
while text[end] != '"':
end += 1
return (text[start+1:end], end)
Numeric values mostly follow the same story. Since we are ignoring scientific notation, we just have to handle a few special cases around negative numbers and decimals. Without diminishing the value of this exercise, we rely on the string to number parsing logic built into the language of our choice.
def parse_number(text, start):
"Returns a numeric literal and index at last character of literal"
end = start
has_decimal = False
while end < len(text) and (text[end] == '-' or text[end] == '.' or (text[end] >= '0' and text[end] <= '9')):
if text[end] == '.':
has_decimal = True
end += 1
val = text[start:end]
return (float(val) if has_decimal else int(val), end-1)
The parse_string
and parse_number
helper functions both effectively eagerly
advance the pointer into the character stream. At the end of parsing a literal,
the index in the original text pointing to the last character in the literal is
returned (along with the actual literal value) to be used by the core parsing
logic for advancing the pointer appropriately.
Once the parsing logic has been expanded to handle literal values, we are ready to connect the dots between values and containers. Since we are following the strict JSON specification, we can assume that there is always at least one active container during any valid parsing action. That is, after a string or numeric literal is successfully parsed, it should be added to the active container as per the semantics of the active container.
In fact, after any value is successfully parsed - be it an array, object, or literal - the value should be added to the active container. This is equivalent to adding the value to the item at the top of the stack. The important edge case is when a value is successfully parsed but there is nothing at the top of the stack (i.e. the stack is empty). When parsing valid JSON, this can happen only when the parsed value is the outermost container. In this case, the parsed value is the result to be returned in as the final parsed object.
For arrays, the underlying representation simply maps to the built-in list, vector, or equivalent sequential data structure of your choice. For objects, the important thing is to convert the container structure into a key-value structure, such as your garden variety hash table, hash map, associative array, or dictionary. The implementation here takes a shortcut by storing both types of containers as lists, and only upon creation will construct a dictionary be out of the list of elements. The list is expected to have an even number of elements in order to fulfill the key-value contract of a valid JSON object.
def parse(text):
stack = []
index = 0
result = None
while index < len(text):
char = text[index]
val = None
if char == '{': # Start object
stack.append([])
elif char == '}': # End object
val = stack.pop()
val = dict(zip(val[::2], val[1::2]))
elif char == '[': # Start array
stack.append([])
elif char == ']': # End array
val = stack.pop()
elif char == '"': # String literal
val, index = parse_string(text, index)
elif char == '-' or (char >= '0' and char <= '9'): # Numeric literal
val, index = parse_number(text, index)
index += 1
if val is not None:
if stack:
# Add parsed value to the active container at top of stack
stack[-1].append(val)
else:
# Parsed value is the outermost container
result = val
return result
For convenience (and without compromising correctness), commas, colons, and
whitespace characters are ignored by the parsing logic. Additionally, error
checking can be added in the form of assertions on properties of certain
values. For example, when a }
character is encountered and a dictionary is to
be created out of the container object at the top of the stack, a valid
assertion is that the number of items in the container (i.e. list) should be
even.
To close things out, a list of test cases for the parser:
assert parse('1.1') == 1.1
assert parse('1') == 1
assert parse('-0.3') == -0.3
assert parse('{}') == {}
assert parse('[]') == []
assert parse('""') == ""
s = '[{"a":[1,2,[{}],4]},"b",["c",{"d":6}]]'
assert parse(s) == eval(s)
s = '{"xkd":1, "kcw":2, "art":3, "hxm":4, "qrt":5, "pad":6, "hoy":7}'
assert parse(s) == eval(s)
s = '[{"a_key": 1, "b_\xe9": 2}, {"a_key": 3, "b_\xe9": 4}]'
assert parse(s) == eval(s)
That concludes our foray into the world of JSON parsing. The parser works but has certainly not been optimized for performance. A rough comparison shows that the built-in Python json library is roughly 11-12x faster than the implementation in this article. However, the hope is that the implementation can serve as an instructive baseline upon which to build faster and more fully-featured parsers.
Comments