Parsers printing rule: make sure you print what you parsed

There should be a rule for all parsers: Parsers “print” method should always render a string that can be parsed back without changing the semantics. In pseudo code, it translates to:

initial_string = "parse me"

//parse back
assert to_string(parse(initial_string)) == initial_string

//don't change the semantics
assert parse(to_string(parse(initial_string) == parse(initial_string)

If I do the same with JavaScript and JSON:

var jsonStr = '{"key1":"val1","key2":"val2"}';
JSON.stringify(JSON.parse(jsonStr)) == jsonStr;

As programmers, the less we need to think and worry, the better. Parsers following that rule can be used with confidence; they will never betray you. If that statement doesn’t convince you of the importance for consistency, let me give you a couple of examples of errors caused by inconsistent parsers/printers.

Bad idea 1: Filtering before printing

I once encountered a JSON print method that filtered the output. It striped out the keys having null or empty strings values. “They’re empty, they’re worthless anyway right?” No. What’s wrong with it is that there is a difference between an empty string value, a null value, and a non existing key. Stripping empty strings and null values make the removed key become non existing the next time the JSON will be parsed.

This breaks the second part of the rule: not changing the semantics.

Bad idea 2: Accepting invalid input

Take the definition of UUID.

In its canonical form, a UUID is represented by 32 lowercase hexadecimal digits, displayed in five groups separated by hyphens, in the form 8-4-4-4-12 for a total of 36 characters (32 alphanumeric characters and four hyphens).

Then take the following string.

123E4567-E89B-12D3-A456-426655440000

Is this a canonical UUID form? No. The definition says lowercase hexadecimal digits. This one contains uppercase. Should your parser accept it? Of course not. First, it may denote an issue with the system that provided it to you. Second, you will have a dilemma of what to do with it. Keep it as such and another parser along the way may reject it and crash. Or “correct it” by changing the upper case characters to lower case and risk that another program along the way fails to match your “corrected UUIDs” with his. Whichever route you chose, you cannot predict the outcome.

The pragmatic tip “Crash Early” applies. You can then fix the problem upstream and go in good conscience that you won’t have any related surprises along the road.

Bad idea 3: An object builder that cannot parse its own output

This one came out of an issue I fought with python. We had a script that manually generated XML. The output sometimes failed to be parsed because it contained invalid characters.

The logical solution: build the XML with a library instead of doing it manually. The output will be be parsable by the same library right? No. It seems the python elementree library, the one used for building/printing XML doesn’t respect the first part of the rule: be capable of parsing back the string representation you create.

It’s sad to see that this issue has been ongoing since 2009 and that people are still arguing about it. How many persons lost productive time dealing with it?

Posted in IT

Comments

  1. Totally agree! I call this serialization/deserialization round-tripping. A couple of thoughts:

    1) In general, we shouldn’t expect json_in==json_out after json_in->deserialize->object->serialize->json_out because in JSON (for example) the order of keys doesn’t matter. What we really want is object_in==equal object_out (for some reasonable definition of ‘==’) after object_in->serialize->json->deserialize-object_out

    2) Your point #2 is an interesting contrast with the Robustness Principle (http://en.wikipedia.org/wiki/Robustness_principle) wherein you should be liberal in your input and conservative in your output. Perhaps in certain contexts it’s better to be strict in what you accept.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.