Managing Rich Data Structures

Feb 16, 2006 by Dave Baker

As a Perl beginner, I wrote many scripts that created text files to store snippets of data. Eventually I had hundreds of little text files in a data directory, each one holding one piece of information in the form of a text string. I decided to move from snippets to a single database. Here’s how I did it.

I didn’t go all the way to creating a relational database with SQL, in part because I had not yet learned to use Perl modules to work with such databases. Plus I wanted to try the apparently simpler technique described in Chapters 11 and 14 of Perl Cookbook, 2nd Edition, namely the use of one of the DBM libraries. I was intrigued by the idea of creating and using a database without needing to learn SQL or run a database server such as PostgreSQL or MySQL.

One of my scripts created HTML for a clickable banner ad; the HTML would be inserted into an HTML email newsletter I publish daily. For any particular issue, the script pulled data from three different text files, each of which held a different piece of information in the form of a text string on the file’s first and only line:

The URL of the advertiser’s web page, to which the reader would be taken when the banner is clicked,
The location of the .gif file (the banner), or
A one-line headline that would appear above the banner.

A different script created those files and named them according to (i) the type of data in the file and (ii) the date of the newsletter in which the ad would appear. For example, data for the December 9, 2005 ad was contained in three files named url_2005_12_09, gif_2005_12_09 and headline_2005_12_09.

A Hash of Hashes

First I considered the kind of data structure needed to hold the data. I looked at the relationship between the three text files. It became clear that I basically had a hash for each ad. The data for any particular ad consisted of three different keys and their values: URL, gif and headline. The name of any particular file would provide the key (the type of information contained in the file). The contents of the file would provide the key’s value.

I thought about finding some way to store those hashes as an array of anonymous hashes (one hash per ad), but then I realized that an array wouldn’t let me access a particular ad’s data easily. The hashes would be in the order in which I saved them into the array, but that wouldn’t translate easily to the ad for a particular date. For example, how would I know where to find the data for next Monday’s newsletter? Is it in $array[8] or $array[17]?

Hmm. Each anonymous hash could be identified by a particular date–the key (!) to locating the ad for any particular date. What kind of data structure associates a unique key with a value? A hash, of course! My data would fit nicely into a hash of hashes.

The name I chose for the hash was %data_for_ad_on. Choosing a hash name that ends in a preposition provides a more natural-reading and meaningful name; the key for data for the December 8, 2005 banner ad would be 2005_12_08, for example, and the way to access the value associated with that key would be $data_for_ad_on{2005_12_08}.

In code, this is how the data for two days of newsletters could be represented as a hash of hashes:

%data_for_ad_on = (

  '2005_12_08' => { 'url'      => 'http://roadrunners-r-us.com/index.html',
                    'gif'      => 'http://myserver.com/banners/roadrunners_banner.gif',
                    'headline' => 'Use Roadrunners R Us for speedy, reliable deliveries!',
                  },

  '2005_12_09' => { 'url'      => 'http://acme.com/index.html',
                    'gif'      => 'http://myserver.com/banners/acme_banner.gif',
                    'headline' => 'Look to Acme for quality, inexpensive widgets!',
                  },
);

The keys of the named hash are 2005_12_08 and 2005_12_09. Each key’s value is a reference to an anonymous hash that contains its own keys and values. When a hash is created using braces instead of parentheses, its value is a reference to that unnamed, “anonymous” hash. I need to use a reference because a hash is permitted to contain only scalar keys and scalar values; another hash can’t be stored as a value. A reference to that hash works, because it acts like a scalar.

Storing the Data

Now I knew what the data structure would be; but how would I store it? Being Perl, of course there’s more than one way to store data to disk. Some of those solutions involve the parsing of potentially large amounts of data and then the need to programmatically find a particular desired piece of data, however.

I came across a single-file database solution in Recipes 11.14 and 14.6 of Perl Cookbook, 2nd Edition. These describe the use of a DBM file on disk. A Perl script can read data from a DBM file as if it were contained in a hash in memory. For data having unique keys, it’s a great fit. In addition to data persistence from one run of your script to the next, using a DBM database file means your script won’t need to read all the data into memory. Instead, you pull up only data associated with a particular key. This applies especially to large data sets; a DBM database file on disk might hold data for hundreds of thousands of keys.

A straightforward but tedious way to get data into a DBM file is shown below. The code creates a database called ad_data.db and inserts data for two newsletters.

#!/usr/local/bin/perl
use strict;
use warnings;
use MLDBM qw( DB_File Storable );
use Fcntl qw( :flock O_CREAT O_RDWR );

my $db = '/www/cgi-bin/databases/ad_data.db';
my %data_for_ad_on;

tie %data_for_ad_on, 'MLDBM', $db, O_CREAT|O_RDWR, 0644
  or die "Trouble opening $db, stopped: $!";

%data_for_ad_on = (
  '2005_12_09' => { 'url'      => 'http://acme.com/index.html',
                    'gif'      => 'http://myserver.com/banners/acme_banner.gif',
                    'headline' => 'Look to Acme for quality, inexpensive widgets!',
                  },

  '2005_12_08' => { 'url'      => 'http://roadrunners-r-us.com/index.html',
                    'gif'      => 'http://myserver.com/banners/roadrunners_banner.gif',
                    'headline' => 'Use Roadrunners R Us for speedy, reliable deliveries!',
                  },
);

After the script is run, the data contained in %data_for_ad_on are saved automatically into the DBM database file named /www/cgi-bin/databases/ad_data.db.

The secret is Perl’s tie function. It associates a particular hash (here named %data_for_ad_on ) with a “class” and a file. A class that works for “complex” data (data containing references) is the MLDBM module, available from CPAN.

Note: if the values being stored are scalars only, a tied hash using the DB_File module as the class, rather than the MLDBM module, will be much easier to work with. I’m using MLDBM because my values include references to data structures. I’ll discuss some of MLDBM’s limitations later in this article.

The script shown above uses the Fcntl module to make it easier to create or update the DBM database file. The imported O_CREAT|O_RDWR parameters tell the script to create the database file if it doesn’t yet exist or to update (read/write) the file if it exists.

The DB_File and Storable parameters passed to MLDBM specify the particular kind of DBM database to use and the serialization module to use. They handle the behind-the-scenes conversion of references into strings. (A recent version of MLDBM is necessary to be able to specify such parameters.) DB_File won’t work with every system, however, because it depends upon your system’s configuration, like all DBM modules. In particular, the Berkeley DB library (a C library) must be present; see the DB_File documentation for details. You can get another DBM module from CPAN if DB_File won’t work for you.

Manually entering hash keys and their values into a script is time-consuming and error-prone, of course. I prefer to programmatically add data created or copied from existing files, and used a script to copy data from my hundreds of text files into the DBM database file.

#!/usr/local/bin/perl -T
use strict;
use warnings;
use MLDBM qw( DB_File Storable );
use Fcntl qw( :flock O_CREAT O_RDWR );

my $db = '/www/cgi-bin/databases/ad_data.db';
my %data_for_ad_on;

tie %data_for_ad_on, 'MLDBM', $db, O_CREAT|O_RDWR, 0644
  or die "Can't open $db: $!";

my $data_dir = '/www/cgi-bin/newsletter_data_text_files';

opendir my $dir_handle, $data_dir
  or die "Can't open $data_dir, stopped: $!";

# All of the data files are in the same directory; each file's name
# is url_YYYY_MM_DD, gif_YYYY_MM_DD or headline_YYYY_MM_DD (where
# YYYY is the year, MM is the month, with a leading zero if needed,
# and DD is the day of the month, with a leading zero if needed).
# No other files in that directory end with YYYY_MM_DD.

my @data_files = grep { /\d\d\d\d_\d\d_\d\d$/ } readdir $dir_handle;

close $dir_handle;

my $entry;
foreach my $file ( @data_files ) {

  # Test the filename to see if the file contains the advertiser's
  # URL...

  if ($file =~ /^url_(\d\d\d\d_\d\d_\d\d)/ ) {

      $entry = $data_for_ad_on{$1};

      open (FILE, "$data_dir/$file") or die;
      my $url = do { local $/; <FILE> };
      close FILE;

      $entry->{url}       = $url;
      $data_for_ad_on{$1} = $entry;
      print "Advertiser's URL for $1 has been saved.\n";

  # Test the filename to see if the file contains the .gif location...

  } elsif ($file =~ /^gif_(\d\d\d\d_\d\d_\d\d)/ ) {

      $entry = $data_for_ad_on{$1};

      open (FILE, "$data_dir/$file") or die;
      my $gif = do { local $/; <FILE> };
      close FILE;

      $entry->{gif}       = $gif;
      $data_for_ad_on{$1} = $entry;
      print ".gif file location for $1 has been saved.\n";

  # Test the filename to see if the file contains the headline...

  } elsif ($file =~ /^headline_(\d\d\d\d_\d\d_\d\d)/ ) {

      $entry = $data_for_ad_on{$1};

      open (FILE, "$data_dir/$file") or die;
      my $headline = do { local $/; <FILE> };
      close FILE;

      $entry->{headline}  = $headline;
      $data_for_ad_on{$1} = $entry;
      print "Headline for $1 has been saved.\n";
  }

}

A limitation of MLDBM is that you can’t work piecemeal with the data structures (the values for keys in the tied hash). You must retrieve a copy of the data structure in the form of a reference ($entry in the example); use that reference to add, update or delete parts of the data structure; and then save the reference back to the database in order for MLDBM to put the changed data structure back onto disk. This is because the data structure is no longer tied, once it has been retrieved from disk.

For example, the script above works piecemeal with the data structure: it builds one key at a time in the anonymous hash. A temporary variable called $entry holds a reference to the retrieved anonymous hash, which now exists in memory; then $entry is used to dereference the anonymous hash (and to select a named key in the hash, on the right side of the infix operator); then the script alters a part of that data structure (assigning a value to the named key); then the now-changed anonymous hash is saved back into the database file on disk by assigning $entry to the key.

As shown in earlier examples, an entire data structure can be stored as the value of a key in the tied hash without fetching the existing data structure (if any) associated with that key. In such a case, you’re overwriting any existing data structure completely, so there is no need to fetch the existing data structure first or to work with it using the temporary variable technique.

Note: a recently added CPAN module called DBM_Deep allows tied hashes of complex data as does MLDBM, but appears to avoid the need for fetching data first in order to work with it through a temporary variable.

By the way, MLDBM allows stored data structures to be arrays, not just hashes. Hence MLDBM might be a good disk-based storage solution for you if a hash of arrays works best to hold your data. Plus, whether you’re storing a hash of hashes or a hash of arrays, those stored hashes or arrays are permitted to contain values that consist of even more hashes or arrays!

Another thing to know about MLDBM is that file locking is not provided automatically. The solution presented in this article works well in a single-user environment, but data corruption is possible when more than one instance of the script (or more than one script) needs to access the disk-based data while another process is writing to it. See section 10.2.1 of CGI Programming with Perl for an example of file locking with the DB_File DBM module.

Reading HoH Data

Here is a portion of a script that prints a report for my stored newsletter data.

I reach into the %data_for_ad_on hash by specifying a key (the date of a particular newsletter); the key’s associated value is a reference to the anonymous hash holding ad data for that date. Use infix (arrow) notation to dereference the anonymous hash and also to access a specified key.

tie %data_for_ad_on, 'MLDBM', $db, O_RDWR, 0644
  or die "Can't open $db: $!";

my @newsletter_dates = sort keys %data_for_ad_on;

foreach my $date ( @newsletter_dates ) {

   my $url      = $data_for_ad_on{$date}->{url};
   my $gif      = $data_for_ad_on{$date}->{gif};
   my $headline = $data_for_ad_on{$date}->{headline};

   print "\nHere are details for the banner ad in the $date newsletter:\n";
   print "          Advertiser's URL: $url\n";
   print "        Location of banner: $gif\n";
   print "     Headline above banner: $headline\n";
}

A temporary variable to work with the data was unnecessary because I was merely reading data. I had no need to save a changed data structure back to disk after working with it. A temporary variable would be needed as a reference to the data structure, as noted earlier, if I were adding, updating or deleting something in the data structure, in order to save the changed data structure back to disk.

Nevertheless, a temporary variable is useful for storing a fetched data structure even when a tied hash is being used only to read data from disk. The MLDBM documentation points out that disk fetch occurs each time a script names a key in the tied hash; a new copy of the entire data structure for that key gets created with each fetch. For efficiency, scripts that merely read data work better if the data structure for a particular key is stored first into a temporary variable and the script works with that single copy of data in memory. For example, the lines in the preceding code that get values for $url, $gif, and $headline should be rewritten to cut the number of fetches from three to one:

   my $entry    = $data_for_ad_on{$date};
   my $url      = $entry->{url};
   my $gif      = $entry->{gif};
   my $headline = $entry->{headline};

Updating HoH Data

My next task was to update existing data. An advertiser informed me that its web page had changed. The advertiser’s banner ad in a particular upcoming newsletter needed to point to the new page rather than to the URL I had stored earlier for the ad. So I tried this:

tie %data_for_ad_on, 'MLDBM', $db, O_CREAT|O_RDWR, 0644
  or die "Can't open $db: $!";

$data_for_ad_on{'2005_12_14'}->{url}
  = 'http://roadrunners-r-us.com/new_home_page.html';   # WRONG

Despite the apparently correct syntax, the line marked “# WRONG” did not change the data in the tied hash. MLDBM can’t add, update or delete just a part of a data structure stored on disk; it must retrieve the entire data structure first. Unfortunately that doesn’t happen automatically when a part of the tied hash is used as an lvalue (the left side of an assignment).

Here is the correct syntax for updating part of a data structure:

my $entry = $data_for_ad_on{'2005_12_14'};                        # Get
$entry->{url} = 'http://roadrunners-r-us.com/new_home_page.html'; # Change
$data_for_ad_on{'2005_12_14'} = $entry;                           # Save

A different kind of update would be to add a new kind of information to an existing data structure. Until now I have been working with each ad’s data as an anonymous hash of three keys because I only had three pieces of information per ad. Here’s how I can modify the data structure for any particular ad–that is, the value of any particular key in the tied hash–to include a fourth piece of information. I’ll record whether or not a particular ad has been billed.

tie %data_for_ad_on, 'MLDBM', $db, O_CREAT|O_RDWR, 0644
  or die "Can't open $db: $!";

my $entry = $data_for_ad_on{'2005_12_02'};  # Get
$entry->{has_been_billed} = 'yes';           # Change (by creating a key and value)
$data_for_ad_on{'2005_12_02'} = $entry;      # Save

Adding a fourth key as shown is not remarkable because it’s done the same way I created the original three keys. But note how I’ve added a fourth key to just one of the ads (for the 2005_12_02 newsletter). I don’t need to define such a fourth key in each of the other anonymous hashes because MLDBM doesn’t require that the anonymous hashes have the same structure.

Now that I’ve added a fourth key to one of the ads, I want to see that key’s information when I run my report, even though the information for each of the other newsletters is incomplete. Just as with a hash in memory, the exists operator can test for the existence of a key in a tied hash, or test for the existence of a key in one of the stored anonymous hashes. The earlier reporting script can be improved:

foreach my $date ( @newsletter_dates ) {

   my $entry    = $data_for_ad_on{$date};
   my $url      = $entry->{url};
   my $gif      = $entry->{gif};
   my $headline = $entry->{headline};

   my $has_been_billed = exists ( $entry->{has_been_billed} ) ?
                         $entry->{has_been_billed}
                         : 'Billing status is unknown';

   print "\nHere are details for the banner ad in the $date newsletter:\n";
   print "            Advertiser's URL: $url\n";
   print "          Location of banner: $gif\n";
   print "       Headline above banner: $headline\n";
   print "  Has advertiser been billed? $has_been_billed\n";
}

Note how a default value is provided for $has_been_billed (the string ‘Billing status is unknown’). If $entry->{has_been_billed} had been assigned to $has_been_billed without providing a default value, the script (if it is using the warnings pragma) would report “Use of uninitialized value in concatenation (.) or string” every time it processes the “Has advertiser been billed?” statement for an anonymous hash that has no 'has_been_billed' key. The value of a nonexistent key in a hash is undefined, whether or not the hash is tied.

Because the value of a nonexistent key in a hash is undefined, remember when working with MLDBM not to assume your tied hash’s keys will return either a reference to a data structure or a reference to an empty data structure. For example, this code produces a runtime error if the ‘2005_12_32’ key does not exist in the tied hash (there is no December 32) and the script is using the strict pragma:

`` my $entry = $data_for_ad_on{2005_12_32}; my %named_hash = %{ $entry };

if ( $named_hash{url} ) {
    print "The advertiser's URL for the ad dated 2005_12_32 is $named_hash{url}.\n";

} else {
    print "We have no URL on file for the ad dated 2005_12_32.\n";
}

The runtime error would be ‘Can’t use string ("") as a HASH ref while “strict refs” in use’ on the line number of my %named_hash = %{ $entry }. The code doesn’t create a named hash that has no elements.

Conclusion

I hope this article helps you move from a swamp of weedy text files into the simplicity of a single disk-based database file. A DBM database file enables you to work with your data as if it were a single hash of keys and their values.

The MLDBM module extends the “hash on a disk” capability by letting you store a key’s value as a hash, as illustrated in this article, or store a key’s value as an array.

MLDBM requires a special technique for working piecemeal with stored hashes (or stored arrays). Also, the solution described works well in my single-user environment; file locking to protect the disk-based data would become important in a multi-user environment such as a script executed by a publicly accessible web server. Still, you get all this hash-like goodness for key-based data without running a relational database server or using SQL, without parsing large amounts of disk-based data, and without storing all the data in memory first.

Happy hashing!

Tags

data