Strings distributed evenly into buckets, Perl 5 and 6.
Sometimes some interesting problem comes by, last week was this Distribute strings evenly between a defined number of buckets. We end up using a hashing and modulus operation.
Hashing would take of the distribution and modulus would constrain the choices to the number we were expecting. I took the chance to remove some rust of my Perl 5 knowledge and also see how it look like in Perl 6 (yes, you read it right Perl 6, here's why I insist).
Perl 5 solution
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw/ md5 / ;
my %histogram;
foreach my $serial (map { "ABC-$_" } ('000'..'999')) {
$histogram{unpack('L*', md5($serial)) % 30} .= "#";
}
foreach my $key (sort {$a <=> $b} (keys %histogram)) {
print "$key\t", $histogram{$key}, "\n";
}
my($min, $max) = map { length($histogram{$_}) }
((sort { length($histogram{$a}) <=> length($histogram{$b}) }
keys %histogram)[0,$#_]);
print "maximum: ",$max, "\n";
print "minimum: ",$min, "\n";
Let's explain what is happening here, you see that this script is using the MD5
(line 4
) algorithm I had to use this algorithm so that I could use the same
on Perl 6 which unfortunately isn't fully equipped with encryption algorithms
yet.
The first foreach
does the hard work, it first we use the range operation
'000'..'999'
which generates a sequence like 000, 001, 002, ...
adding a
map
to it just to produce something like ABC-000, ABC-001, ABC-002, ...
.
On line 9
lot of things are happening. First we calculate the MD5 of the
sequence above. The md5()
actually gives the binary result of the
calculation, that's why we unpack
it into a long. Than we have the %
modulus operator, which will restrict our numbers from 0
to 29
.
There's a little trick here. We're using a associative array as data structure
(known as hash in Perl). Each bucket (the ones between 0 and 29) will be a key
and each time we repeat we concatenate the string with a #
. Just so that we
can print on the console a sort of graphical representation to check our
hypothesis, a histogram of the distribution.
The second foreach
does the printing, since hashes are not ordered we sort
the keys before printing with sort {$a <=> $b} (keys %histogram)
. After that,
comes figuring out the maximum and minimum occupancy of the buckets, and frankly
this line is not pretty.
To understand, first focus on the sort
until the word keys
. It is very
similar with the one above with the difference that we call
length($histogram{$a})
, which really means that we are sorting by the length
of the values (remember, the values are strings of #
s).
Since my Perl is rusted it took me sometime to code the rest of the expression.
Look at the end and you'll see [0,$#_]
. The function sort
returns an array,
but I'm interested only in two elements, the first and the last element of this
sort
which will give me the maximum (the largest string) and the minimum (the
smallest string). The [0,$#_]
is called a slice of an array, the $#_
means
the last element, therefore of the output of that sort
I get the first and the
last element by slicing the resulting array.
But what sort
is returning is the keys
of the histogram sorted by value and
I want the value. That's why you see a map
, the map
is processing the slice
(which has only two elements) and calling length
on each one which finally
give us the maximum and the minimum occupancy of buckets.
Perl 6 solution
#!/usr/bin/env perl6
use v6;
use Digest::MD5;
my $digest = Digest::MD5.new;
my %histogram;
for ('000'..'999').map(-> $s {"ABC-$s"}) -> $serial {
%histogram{:16($digest.md5_hex($serial)) % 30}~= '#';
}
for %histogram.keys.sort: -> $a, $b {$a <=> $b} -> $key {
say "$key\t%histogram{$key}";
}
say "maximum: ", [max] %histogram.values».chars;
say "minimum: ", [min] %histogram.values».chars;
On Perl 6 things are pretty similar, the first for
(which does the foreach
)
does the same as in the Perl 5 solution. The only difference is that the
md5_hex()
function returns a string instead of a binary representation of the
MD5, to convert it to a decimal representation we use :16
which is part of the
radix operators in Perl 6. Again, we use an associative array and concatenate
#
on each bucket.
The second for
does the same thing again, it prints out the histogram by the
order of its keys. Note that in Perl 6 everything is an object like Ruby, that's
why the syntax differs, on Perl 5 we call functions most of the time.
The really different part, which is the reason of writing this post altogether, is the calculation of maximum and minimum occupancy. Here I had the chance to use some neat additions in Perl 6.
Fist the [max]
and [min]
are meta-operators. What they do in this case is a
reduction, the [max]
will return the maximum of a list and the [min]
likewise the minimum. If you remember I'm looking for the max and min of the
length of the values of this histogram. And this is where we can apply the »
(a hyper operator). What is doing is calling .chars
in each of the values,
this in turn is fed to [max]
which returns the maximum of these. So, that long
line in the Perl 5 could be compressed in one line of Perl 6.
Output
Both scripts will give an output similar to this one. Note, Perl 6 solution works on rakudo-star 2013.05:
0 #############################################
1 #################################
2 ####################################
3 ############################
4 ###########################
5 ##########################
6 ##############################
7 ###############################
8 ###################################
9 ###############################
10 ################################
11 ########################################
12 #######################################
13 ######################
14 #####################################
15 ########################################
16 ########################
17 ###############################################
18 ##################################
19 ####################################
20 ##########################################
21 #######################
22 #################################
23 ###########################################
24 #########################
25 ############################
26 ######################################
27 ########################################
28 ################################
29 #######################
maximum: 47
minimum: 22
That's it.