Found another difficult problem on codeeval today that I though I would use as an opportunity to explore generators in python.

The challenge can be found on codeeval at https://www.codeeval.com/open_challenges/36/.

Here is a copy of the challenge for your viewing convience:

Message Decoding

Challenge Description:

Credits: This challenge has appeared in a past ACM competition.

Some message encoding schemes require that an encoded message be sent in two parts. The first part, called the header, contains the characters of the message. The second part contains a pattern that represents the message.You must write a program that can decode messages under such a scheme.

The heart of the encoding scheme for your program is a sequence of "key" strings of 0's and 1's as follows:
0,00,01,10,000,001,010,011,100,101,110,0000,0001,. . .,1011,1110,00000, . . .

The first key in the sequence is of length 1, the next 3 are of length 2, the next 7 of length 3, the next 15 of length 4, etc. If two adjacent keys have the same length, the second can be obtained from the first by adding 1 (base 2). Notice that there are no keys in the sequence that consist only of 1's.

The keys are mapped to the characters in the header in order. That is, the first key (0) is mapped to the first character in the header, the second key (00) to the second character in the header, the kth key is mapped to the kth character in the header. For example, suppose the header is:

AB#TANCnrtXc
Then 0 is mapped to A, 00 to B, 01 to #, 10 to T, 000 to A, ..., 110 to X, and 0000 to c.

The encoded message contains only 0's and 1's and possibly carriage returns, which are to be ignored. The message is divided into segments. The first 3 digits of a segment give the binary representation of the length of the keys in the segment. For example, if the first 3 digits are 010, then the remainder of the segment consists of keys of length 2 (00, 01, or 10). The end of the segment is a string of 1's which is the same length as the length of the keys in the segment. So a segment of keys of length 2 is terminated by 11. The entire encoded message is terminated by 000 (which would signify a segment in which the keys have length 0). The message is decoded by translating the keys in the segments one-at-a-time into the header characters to which they have been mapped.

Input sample:

The input file contains several data sets. Each data set consists of a header and a message. These will all be on one line. The length of the header is limited only by the fact that key strings have a maximum length of 7 (111 in binary). If there are multiple copies of a character in a header, then several keys will map to that character. The encoded message contains only 0's and 1's, and it is a legitimate encoding according to the described scheme. That is, the message segments begin with the 3-digit length sequence and end with the appropriate sequence of 1's. The keys in any given segment are all of the same length, and they all correspond to characters in the header. The message is terminated by 000.
Your program should accept the first argument as the filename and read the contents of this file as the test data, according to the conditions above.eg.

$#**\0100000101101100011100101000

Output sample:

For each data set, your program must write its decoded message on a separate line. There should not be blank lines between messages.eg.

##*\$



Ok so there are a few different problems that need to be solved in order for this solution to work out. Here is how I broke down the problem.
  1. Generate a sequence of keys using the given pattern
  2. Separate the message from the header
  3. Map the header to the keys
  4. iterate over the encoded message
  5. Decode the message using the map

First thing then is to figure out a way to generate the sequence of keys they are speaking about. So lets take a closer look at the pattern. "The first key in the sequence is of length 1, the next 3 are of length 2, the next 7 of length 3, the next 15 of length 4, etc. If two adjacent keys have the same length, the second can be obtained from the first by adding 1 (base 2). Notice that there are no keys in the sequence that consist only of 1's."
It looks like they are implementing some sudo-binary format. To follow exactly their pattern, each set of numbers is 2^n -1 long where n is the number of characters. (That gives us the 1,3,7,15) sequence. Then each set of numbers starts with all zeros, and increments by 1 base 2 for each remaining element in that set.

To follow that pattern I built 2 generators. For those of you who dont know, a generator in python is a function that behaves as an iterator. This means a generator will return a sequence of values, which could be theoretically infinitely long. This allows us to scale the sequence up to as long as it needs to be.

My first generator is to generate the sequence of numbers for the length of each set (following the 2^n -1 pattern)
def generate_sequence():
    m = 1
    while True:
        yield 2**m -1
        m += 1
The yield keyword in python is what defines a generator. Each time you iterate over the generator, it returns the yielded value. Using while True: means this generator is an infinite generator. Call this generator with the .next() will iterate the generator as follows:
>>> gen = generate_sequence()
>>> gen.next()
1
>>> gen.next()
3
>>> gen.next()
7
>>> gen.next()
15
Working exactly as expected. Now we can use this generator inside our other generator to build our key sequence.
def generate_keys():
    length_sequence = generate_sequence() #initialize the generator object
    n_length = 1                                       #set how long each number in the set will be starting at 1
    while True:
        set_length = length_sequence.next()
        value = 0                                       #each set starts with 0
        for i in range(set_length):
            yield "{0:b}".format(value).zfill(n_length)   #cast value to binary and fill in preceding zeros until it is the proper length
            value += 1
        n_length += 1
Here we used .format() and .zfill() to translate each value into binary and fill to match the length. Again we created an infinite generator, which would allow the encoding header to be any length. Lets test it out:
>>> key = generate_keys()
>>> key.next()
'0'
>>> key.next()
'00'
>>> key.next()
'01'
>>> key.next()
'10'
>>> key.next()
'000'
And there we have the first few keys of the sequence, exactly how they are shown in the problem description. And we did it in a way that can be used for any size header.

Ok now that we have the sequence generated, we need to separate the header from the encoded message. Since the message is in 1s and 0s and the characters in the header are not, that is how we are going to separate the two. To use the codeeval syntax we are going to break out the test cases and separate each case into a mapping or into the encoded message:
for test in test_cases:
    mappings = {}
    keys = generate_keys()
    encoded = ""
    for char in test.rstrip(): #get rid of carriage returns and trialling spaces
        if char no in ['0', '1']:
            mappings[keys.next()] = char
        else:
            encoded += char
This will take care of steps 2 and 3, Both separating out the message and mapping the header to the keys.

Now all that is left is to iterate over the message and decode it. Lets look back at the instructions on iterating over the encoded message. "The encoded message contains only 0's and 1's and possibly carriage returns, which are to be ignored. The message is divided into segments. The first 3 digits of a segment give the binary representation of the length of the keys in the segment. For example, if the first 3 digits are 010, then the remainder of the segment consists of keys of length 2 (00, 01, or 10). The end of the segment is a string of 1's which is the same length as the length of the keys in the segment. So a segment of keys of length 2 is terminated by 11. The entire encoded message is terminated by 000 (which would signify a segment in which the keys have length 0). The message is decoded by translating the keys in the segments one-at-a-time into the header characters to which they have been mapped."

To break the rules out for decoding: To iterate over the message, I will create a small helper function that will return the shortened message and a given number of characters.
def read(message, n):
    return (message[n:], message[:n])
Simple, this allows us to keep track while iterating over the encoded message.

Since we know the final characters of the ecoding are three 0s, we will loop until the last 3 characters, taking out each sequence as it comes
output = ""
while len(encoded) > 3:
    encoded, bitlength = read(encoded, 3)      #read and take off the first 3 characters
    bitlength = int(bitlength,2)                      #convert to base 10
    lookup = ""
    while lookup != "1"*bitlength:                   #loop while not all "1"s
        encoded, lookup = read(encoded, bitlength)
        if lookup != "1"*bitlength:
            output += mappings[lookup]
So here we looped over the encoded message, grabbing the sequence length, and iterated over each sequence, extracting values into the output until there is all "1"s.

So to put it all together:
import sys
test_cases = open(sys.argv[1], 'r')

def generate_keys():
    length_sequence = generate_sequence() #initalize the generator object
    n_length = 1                                       #set how long each number in the set will be starting at 1
    while True:
        set_length = length_sequence.next()
        value = 0                                       #each set starts with 0
        for i in range(set_length):
            yield "{0:b}".format(value).zfill(n_length)   #cast value to binary and fill in preceding zeros until it is the proper length
            value += 1
        n_length += 1
            
def generate_sequence():
    m = 1
    while True:
        yield 2**m -1
        m += 1

def read(message, n):
    return (message[n:], message[:n])

for test in test_cases:
    mappings = {}
    keys = generate_keys()
    output = ""
    coded = test.rstrip()
    encoded = ""
    for char in coded:
        if char not in ['0','1']:
            mappings[keys.next()] = char
        else:
            encoded += char
    while len(encoded) > 3:
        encoded, bitlength = read(encoded,3)
        bitlength = int(bitlength,2)
        lookup = ""
        while lookup != "1"*bitlength:
            encoded,lookup = read(encoded,bitlength)
            if lookup != "1"*bitlength:
                output += mappings[lookup]    
    print output
test_cases.close()
And to run it against the testfileL
>>> echo $#**\0100000101101100011100101000 >> test.txt
>>> decode.py test.txt
##*\$
Success!!

WH deolr01000110110000100100011110100111001010110010110101110101011000
aeiouy Thdgfrnk0110111001110010110000100011111101101000111101010110110000101111000000111101010111000001111101101011110000011111010001100101011101111010011110000101111011110111000