Strings distributed evenly into buckets, Perl 5 and 6.

Sometimes some interesting problem comes by, last week was this Distribute strings evenly between a defined number of buckets. We end up using a hashing and modulus operation.

Hashing would take of the distribution and modulus would constrain the choices to the number we were expecting. I took the chance to remove some rust of my Perl 5 knowledge and also see how it look like in Perl 6 (yes, you read it right Perl 6, here's why I insist).

Perl 5 solution

#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw/ md5 / ;

my %histogram; 

foreach my $serial (map { "ABC-$_" } ('000'..'999')) {
  $histogram{unpack('L*', md5($serial)) % 30} .= "#";
}

foreach my $key (sort {$a <=> $b} (keys %histogram)) {
  print "$key\t", $histogram{$key}, "\n";
}

my($min, $max) = map { length($histogram{$_}) } 
  ((sort { length($histogram{$a}) <=> length($histogram{$b}) } 
      keys %histogram)[0,$#_]);

print "maximum: ",$max, "\n";
print "minimum: ",$min, "\n";

Let's explain what is happening here, you see that this script is using the MD5 (line 4) algorithm I had to use this algorithm so that I could use the same on Perl 6 which unfortunately isn't fully equipped with encryption algorithms yet.

The first foreach does the hard work, it first we use the range operation '000'..'999' which generates a sequence like 000, 001, 002, ... adding a map to it just to produce something like ABC-000, ABC-001, ABC-002, ....

On line 9 lot of things are happening. First we calculate the MD5 of the sequence above. The md5() actually gives the binary result of the calculation, that's why we unpack it into a long. Than we have the % modulus operator, which will restrict our numbers from 0 to 29.

There's a little trick here. We're using a associative array as data structure (known as hash in Perl). Each bucket (the ones between 0 and 29) will be a key and each time we repeat we concatenate the string with a #. Just so that we can print on the console a sort of graphical representation to check our hypothesis, a histogram of the distribution.

The second foreach does the printing, since hashes are not ordered we sort the keys before printing with sort {$a <=> $b} (keys %histogram). After that, comes figuring out the maximum and minimum occupancy of the buckets, and frankly this line is not pretty.

To understand, first focus on the sort until the word keys. It is very similar with the one above with the difference that we call length($histogram{$a}), which really means that we are sorting by the length of the values (remember, the values are strings of #s).

Since my Perl is rusted it took me sometime to code the rest of the expression. Look at the end and you'll see [0,$#_]. The function sort returns an array, but I'm interested only in two elements, the first and the last element of this sort which will give me the maximum (the largest string) and the minimum (the smallest string). The [0,$#_] is called a slice of an array, the $#_ means the last element, therefore of the output of that sort I get the first and the last element by slicing the resulting array.

But what sort is returning is the keys of the histogram sorted by value and I want the value. That's why you see a map, the map is processing the slice (which has only two elements) and calling length on each one which finally give us the maximum and the minimum occupancy of buckets.

Perl 6 solution

#!/usr/bin/env perl6
use v6;
use Digest::MD5;

my $digest = Digest::MD5.new;
my %histogram;

for ('000'..'999').map(-> $s {"ABC-$s"}) -> $serial {
  %histogram{:16($digest.md5_hex($serial)) % 30}~= '#';
}

for %histogram.keys.sort: -> $a, $b {$a <=> $b} -> $key {
  say "$key\t%histogram{$key}";
}

say "maximum: ", [max] %histogram.values».chars;
say "minimum: ", [min] %histogram.values».chars;

On Perl 6 things are pretty similar, the first for (which does the foreach) does the same as in the Perl 5 solution. The only difference is that the md5_hex() function returns a string instead of a binary representation of the MD5, to convert it to a decimal representation we use :16 which is part of the radix operators in Perl 6. Again, we use an associative array and concatenate # on each bucket.

The second for does the same thing again, it prints out the histogram by the order of its keys. Note that in Perl 6 everything is an object like Ruby, that's why the syntax differs, on Perl 5 we call functions most of the time.

The really different part, which is the reason of writing this post altogether, is the calculation of maximum and minimum occupancy. Here I had the chance to use some neat additions in Perl 6.

Fist the [max] and [min] are meta-operators. What they do in this case is a reduction, the [max] will return the maximum of a list and the [min] likewise the minimum. If you remember I'm looking for the max and min of the length of the values of this histogram. And this is where we can apply the » (a hyper operator). What is doing is calling .chars in each of the values, this in turn is fed to [max] which returns the maximum of these. So, that long line in the Perl 5 could be compressed in one line of Perl 6.

Output

Both scripts will give an output similar to this one. Note, Perl 6 solution works on rakudo-star 2013.05:

0   #############################################
1   #################################
2   ####################################
3   ############################
4   ###########################
5   ##########################
6   ##############################
7   ###############################
8   ###################################
9   ###############################
10  ################################
11  ########################################
12  #######################################
13  ######################
14  #####################################
15  ########################################
16  ########################
17  ###############################################
18  ##################################
19  ####################################
20  ##########################################
21  #######################
22  #################################
23  ###########################################
24  #########################
25  ############################
26  ######################################
27  ########################################
28  ################################
29  #######################
maximum: 47
minimum: 22

That's it.

Published in Jul 21, 2013