Parsers printing rule: make sure you print what you parsed

There should be a rule for all parsers: a parser’s “print” method should always render a string that can be parsed back without changing the semantics. In pseudo code, it translates to:

initial_string = "parse me"

//parse back
assert to_string(parse(initial_string)) == initial_string

//don't change the semantics
assert parse(to_string(parse(initial_string))) == parse(initial_string)

If I do the same with JavaScript and JSON:

var jsonStr = '{"key1":"val1","key2":"val2"}';
JSON.stringify(JSON.parse(jsonStr)) === jsonStr; // true
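
Both checks can also be written in Python with the standard json module. This is only a sketch: the variable names are illustrative, and the exact-text check passes only because the input is already in compact form and compact separators are requested when printing.

```python
import json

json_str = '{"key1":"val1","key2":"val2"}'
parsed = json.loads(json_str)

# Parse back: with compact separators the printed text matches the input exactly.
compact = json.dumps(parsed, separators=(",", ":"))
assert compact == json_str

# json.dumps adds spaces after "," and ":" by default, so the text differs...
default = json.dumps(parsed)
assert default != json_str

# ...but the semantics are unchanged, which is what the rule requires.
assert json.loads(default) == parsed
```

The exact-text equality depends on formatting choices, whereas the semantic equality should hold for any well-behaved printer.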

As programmers, the less we need to think and worry, the better. Parsers following that rule can be used with confidence; they will never betray you. If that statement doesn’t convince you of the importance of consistency, let me give you a couple of examples of errors caused by inconsistent parsers/printers.

Query string white space vs plus

Trivia: What is the difference between the encoded query string parameters “a+b” and “a%20b”?

Answer: Nothing! They are both encoded representations for “a b”.

Isn’t a “+” supposed to remain a “+”? Not quite: the URL path and the query string are not encoded following the same rules. In the path, a “+” indeed remains a “+”, but in the query string a literal “+” is encoded as “%2B”, because an unencoded “+” there stands for a space. This can be misleading.
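
A short sketch with Python’s urllib.parse shows the difference. The parameter name q is made up for the illustration.

```python
from urllib.parse import parse_qs, unquote, unquote_plus, urlencode

# In a query string, "+" and "%20" both decode to a space.
plus_form = parse_qs("q=a+b")
percent_form = parse_qs("q=a%20b")

# A literal "+" has to be sent as "%2B" in a query string.
encoded_plus = urlencode({"q": "a+b"})
encoded_space = urlencode({"q": "a b"})

# Path-style decoding leaves "+" alone; form-style decoding turns it into a space.
path_decoded = unquote("a+b")
form_decoded = unquote_plus("a+b")

print(plus_form, percent_form)
print(encoded_plus, encoded_space)
print(path_decoded, form_decoded)
```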

Detecting unknown charsets

Character encoding. That and date-time formats are what I consider the two biggest wastes of programmer time when handling data. For the latter, sticking to ISO 8601 rules out the problem. (Read my Timestamps post for more information.) For the former, sticking to ASCII or UTF-8 should work all the time. However, just like for timestamps, you may not control the source and get some “unfriendly” formats. Here are my tips to detect them.
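
As a generic starting point (not necessarily one of the tips from the full post), one simple approach is to try strict decoding against a short list of candidate encodings, from most to least restrictive. The function name guess_decode and the candidate list are assumptions made for this sketch.

```python
def guess_decode(raw, candidates=("ascii", "utf-8", "cp1252")):
    """Return (text, encoding) for the first candidate that decodes without error."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a character, so this fallback cannot fail.
    return raw.decode("latin-1"), "latin-1"


print(guess_decode(b"plain text"))
print(guess_decode("café".encode("utf-8")))
print(guess_decode("café".encode("cp1252")))
```

A successful decode only shows that the bytes are valid in that encoding, not that the guess is right, so the result is still worth checking by eye.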


Python: warnings and deprecation

Alongside the logging library sits the lesser-known warnings library. The former is meant to log events related to execution, whereas the latter is meant to warn about improper module usage or deprecated functions. By default, most warnings are displayed once, so they will not clutter your logs by being shown repeatedly. However, some are “ignored by default” and hence not displayed at all. This is where the important difference from logging lies: the control you get over them at the command-line level.
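
A minimal sketch of how a deprecation warning is issued and captured follows. The function old_sum is hypothetical and exists only for the illustration.

```python
import warnings


def old_sum(a, b):
    """Hypothetical deprecated function, used only for illustration."""
    warnings.warn(
        "old_sum() is deprecated; use the + operator instead",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not to old_sum itself
    )
    return a + b


def call_and_collect():
    """Call old_sum and return its result along with the warnings it raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure nothing is filtered out
        result = old_sum(1, 2)
    return result, caught


result, caught = call_and_collect()
print(result, [str(w.message) for w in caught])
```

From the command line, the interpreter’s -W option changes the filter without touching the code: for example, python -W error::DeprecationWarning script.py turns deprecation warnings into exceptions, and python -W ignore script.py silences all warnings.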


Python counters on unhashable types

Have you ever heard of or used Python counters? They are very useful to count the number of occurrences of “simple” items. Basically:

> from collections import Counter
> colors = ['red', 'blue', 'red', 'green']
> Counter(colors)
Counter({'red': 2, 'blue': 1, 'green': 1})

However, if you try to use it on unhashable types, it doesn’t work.

> colors = [['red', 'warm'], ['blue', 'cold'], ['red', 'warm']]
> Counter(colors)
[...]
TypeError: unhashable type: 'list'

What do we do then?
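
One common workaround, which may or may not be the one the full post describes, is to convert each inner list to a tuple before counting. Tuples of hashable items are themselves hashable, so Counter accepts them.

```python
from collections import Counter

colors = [['red', 'warm'], ['blue', 'cold'], ['red', 'warm']]

# Lists are unhashable, but tuples of strings are hashable.
counts = Counter(tuple(color) for color in colors)

print(counts)
```

This only works when the inner items are hashable too; nested lists would need to be converted recursively.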


StarCluster: Streaming node addition refactoring and dupe alias fixing

About a month ago, I created the streaming node addition functionality within StarCluster. As time went by, I fixed some of its issues, but I found the code a bit messy and hard to understand, so I decided to move it to a separate file. The new version is ready and battle-tested.

Another feature that I found not to be working as expected is the handler for nodes having the same alias. I fixed it and made a clean commit that is easy to pull or cherry-pick. It’s only a matter of calling _recover_duplicate_aliases.

StarCluster: Streaming node addition

In the core version of StarCluster, when you add many nodes at once (command “addnode -n #”), StarCluster goes through three sequential checks[*] that all nodes need to fulfill in order to move forward and eventually start configuring the nodes within the cluster.

  1. Wait for the spot instance requests to propagate.
  2. Wait for all spot instance requests to become active.
  3. Wait for ssh on all those nodes to be active.

If you add a single node, that’s fine, but if you add 10, you lose some time as the first node might be ready a few minutes before the last node is. That is to say, you are wasting some computing time.
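
To put rough numbers on that waste, here is a small back-of-the-envelope sketch. The node names and readiness times are invented for the example, and the helper is not part of StarCluster.

```python
def idle_seconds_with_barrier(ready_at):
    """Seconds each node sits idle when configuration waits for the slowest node."""
    start = max(ready_at.values())
    return {node: start - ready for node, ready in ready_at.items()}


# Hypothetical times (in seconds) at which each node passes all three checks.
ready_at = {"node001": 120, "node002": 180, "node003": 300}

idle = idle_seconds_with_barrier(ready_at)
total_idle = sum(idle.values())

print(idle)
print(total_idle)
```

With a streaming approach, each node would be configured as soon as it is ready, so this idle time would drop to roughly zero.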


When is a good time to document software?

Everyone knows about the benefits of having documentation. What is harder to agree on is when to produce it.

A good development cycle should always include a documentation phase covering the new features and the new behaviors to expect. If that is not your case, though, there are two other very good moments for documenting.
