NFS concurrency on distributed systems

Last week I had a hard time trying to figure out a way to scale an application to many machines, with many instances on each machine, using a single NFS server as the point of access. I believe I found a solution to this problem, and here I will explain some of what was done.

First let me explain the problem more clearly. These are the requirements:

  • An arbitrary number of files will be on an exported NFS file system.
  • A consumer application has to read each file and import its contents into a database.
  • Each file should be imported only once. After import the file should be moved to an archive directory.
  • No file can be ignored or skipped if it contains valid data.
  • One or more consumers will run on each machine, and one or more machines will run on the network.

We can state this as an I/O concurrency problem, but robust solutions such as Chubby or similar are not an option because of some constraints. These are the constraints:

  • I have no access to the NFS machine except through the exported file system.
  • The solution should be a simple refactoring of the current application, not a major project nor a redesign.
  • The application is written in Perl.

My first thought was to search for file locking solutions. The idea was simple: each consumer locks the file before attempting to read it, and the other consumers skip the file if it's locked.

NFS has problems with locking, as you can read in "Use of NFS Considered Harmful". To make things worse, I was told that the NFS lock daemon would not be running on the NFS server.

Digging deeper into file locking, you will find this in the Linux open() manual page:

O_EXCL is not supported on NFSv2 or on Linux before kernel 2.6; it is supported on Linux 2.6 and later, with NFSv3 or later. In environments where NFS O_EXCL support is not provided, programs that rely on it for performing locking tasks will contain a race condition. Portable programs that want to perform atomic file locking using a lockfile, and need to avoid reliance on NFS support for O_EXCL, can create a unique file on the same file system (e.g., incorporating hostname and PID), and use link(2) to make a link to the lockfile. If link(2) returns 0, the lock is successful. Otherwise, use stat(2) on the unique file to check if its link count has increased to 2, in which case the lock is also successful.
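In Perl, the manual page's suggestion translates roughly into the sketch below. This is only an illustration of the technique, not the code I used; the lock file path and the cleanup policy are my own assumptions.

#!/usr/bin/perl

use strict;
use warnings;
use Sys::Hostname;

my $lockfile = '/nfs/export/spool/.import.lock';   # assumed path
my $unique   = $lockfile . '.' . hostname . ".$$"; # hostname + PID

# Create the unique file on the same file system as the lock file.
open(my $fh, '>', $unique) or die "cannot create $unique: $!";
close($fh);

# link(2) is the atomic step: only one process can create the link.
my $locked = link($unique, $lockfile);

unless ($locked) {
    # The reply to link() may have been lost, so fall back to checking
    # the link count of the unique file, as the manual page suggests.
    my $nlink = (stat $unique)[3];
    $locked = defined $nlink && $nlink == 2;
}

if ($locked) {
    # ... do the work while holding the lock ...
    unlink($lockfile);
}
unlink($unique);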

So, trying to acquire an exclusive lock on an NFS file system will fail, and despite my best efforts the suggested workaround did not work in my tests.

After more research I found the CPAN module File::NFSLock. Some testing was done and the module worked fine with concurrent processes on a single machine: all processes respected the lock and the files were read without problems. But, as you might expect, things broke with multiple machines.
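The single-machine tests looked roughly like this (the file name is invented, and this sketch only shows the general shape of the File::NFSLock interface):

#!/usr/bin/perl

use strict;
use warnings;
use File::NFSLock;
use Fcntl qw(LOCK_EX LOCK_NB);

my $file = '/nfs/export/incoming/data-0001.txt';  # invented path

if (my $lock = File::NFSLock->new({
        file      => $file,
        lock_type => LOCK_EX | LOCK_NB,  # don't block, just skip if busy
    })) {
    # ... read the file, import it, move it to the archive ...
    $lock->unlock();
}
else {
    # Another consumer holds the lock: skip this file.
}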

Another module was tested, File::LockSimple. This module has a parameter for NFS file systems, but it still failed with concurrency across multiple machines, although, like File::NFSLock, it worked fine on a single machine.

What were the problems? The most common one was that multiple consumers failed to move the file to the archive directory. The odd thing was that these consumers thought they had the lock, but that was not true: somehow two or more consumers got every confirmation that the lock was theirs, yet the move operation failed.

As an example, let's look at the File::LockSimple code that checks whether it got the correct lock:

# Attempt to create lock
if (open(FILE, ">$lockfile")) {
    local $\ = undef;
    print FILE "$stamp\n";
    close FILE;
    open(FILE, $lockfile);      # Check lock
    my $l;
    chop($l = <FILE>);
    $locked = $l eq $stamp;
    $l = <FILE>;                # Must be EOF
    $locked = 0 if defined $l;
    close FILE;
    last if $locked;            # Lock seems to be ours
}

The idea is:

  • Write a unique stamp to our lock file.
  • Open the file again and check whether the stamp is still ours.
  • If it is, we have the lock; if not, we lost the race.

After some testing I found that when multiple machines race for the lock, sometimes there is no winner: the lock file is left orphaned and the file is skipped by every other consumer. I had to go with another approach.

The solution

One premise of file locking is that some other process will try to read or write a file while another process is using it. The solution is basically to put the process into some wait state until the owner of the lock releases it. In this line of thought there is one precious requirement: the file must stay in a known place and you cannot move it.

But that is not one of my requirements: I am allowed to move the file away, and the solution depends on exactly that.

The algorithm in plain text:

"Create a list of all files in the source directory, create a unique directory path, for each file in the list test if file still exists on the source directory, if not skip this file and go the next file. If file exists in source directory try to move it to your unique directory, if move fails skip this file, if not test file the file is in our unique directory, if it is we have the lock, if not skip this file and go to the next file."

The premise of this algorithm is:

"No file could be moved from one place to two other places at the same time."

This should hold true for any number of requests to the file system. Here is the algorithm in pseudo code:

source_dir = "some directory"

f[] = list_of_files(source_dir)

uniq_dir = create_uniq_directory()

for n = 0 while n < size_of(f[]) do {
    if (file_exists("source_dir/f[n]") == true) {
        if (move_file("source_dir/f[n]", "uniq_dir/f[n]") == true) {
            if (file_exists("uniq_dir/f[n]") == true) {
                # We have the lock
            }
        }
    }
}

Let's walk through the algorithm. First I create a list of all files in my source directory; in the real system this can be around 20,000 files. Then a unique directory is created. The directory is unique across any number of consumers: I use the machine name (hostname) plus the process ID.
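For illustration only, the file list and the unique directory could be built along these lines; the paths here are made up, and the real application gets its file list from another object.

use strict;
use warnings;
use Sys::Hostname;
use File::Path;

my $source_dir = '/nfs/export/incoming';   # assumed path

# List the files currently visible in the source directory.
opendir(my $dh, $source_dir) or die "cannot open $source_dir: $!";
my @files = grep { -f "$source_dir/$_" } readdir($dh);
closedir($dh);

# Unique for any number of consumers: hostname + process ID.
my $uniq_dir = join('/', $source_dir, 'work', hostname . ".$$");
mkpath($uniq_dir) unless -d $uniq_dir;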

For each file in our list we check whether it still exists; if it does not, we skip it, because some other consumer probably already got it.

Now we try to move the file away to our unique path. Note that we only try to move the file: even after the existence check, the move can fail if another consumer moves it first. It seems odd, but believe me, this happens.

If the move returned true, we then test whether the file really is in our unique path. The thing is, move() can return true even when it failed, so we check again to see if it is telling the truth.

Why would move() return true even when it failed? The catch is in NFS: it caches information and may report success to the caller before the call reaches the server's kernel. What can happen is that two clients issue a move() request and NFS answers OK to both, because it thinks the file is still in the source directory; but when the call reaches the server, the server detects that the file is gone and the operation fails, while the client never learns what happened. I'm not sure this is exactly what happens, or how NFS really handles such calls, but my tests point to something like the behavior I just described.

The trick relies on two things. First, the unique path. Since it is unique (directory + file name), NFS will not have cached information about it, so when I ask "does the file exist in my unique directory?" NFS has to actually check the path.

Second, the move operation is in fact a rename() operation, and rename is atomic. This guarantees that when two calls arrive to rename the same file, one of them has to fail, whether the calls come from the same machine or from different ones:

rename(file, file_new_name);    # issued by consumer A
rename(file, file_other_name);  # issued by consumer B: exactly one succeeds

The final implementation

Here is a simplified version of what was implemented:

#!/usr/bin/perl

use strict;
use warnings;
use File::Copy;
use File::Path;
use File::Spec;
use Sys::Hostname;

use constant SPOOL_DIR => "/tmp";

my @files; # get files from some object...

# Create unique spool path (concurrency algorithm)
my $file_spool_dir   =  join('/', SPOOL_DIR, hostname, $$);
if(! -d $file_spool_dir) {
    mkpath($file_spool_dir) or die(" ** Could not create spool directory ($!)");
} elsif(! -w $file_spool_dir or ! -r $file_spool_dir) {
    die(" ** Cannot write/read spool dir, check permissions");
}

FILE: foreach my $file_path (@files) {

    next FILE if ! -e $file_path; # File still exists?

    # Keep only the file name, dropping the directory part.
    my (undef, undef, $file_name) = File::Spec->splitpath($file_path);

    my $file_spool_path = join('/', $file_spool_dir, $file_name);

    # We try to move our file to a unique path where nobody will race with us.
    if(move($file_path, $file_spool_path)) {

        # Check whether the file really was moved to our unique path.
        if(-e $file_spool_path) {
            # We have the file!
        }
    }
} # end of foreach file.

I have to make a note: by using the File::Copy function move() I am in fact renaming the file. The File::Copy documentation states:

If possible, move() will simply rename the file.

As mentioned before, the rename operation is atomic, so we expect that move() will not break our premise. Note that move() only renames when the source and the destination are on the same file system; otherwise it falls back to a copy followed by an unlink, which is not atomic.
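If you want to be strict about that, a possible variant (my own sketch, not part of the original code, reusing the variable names from the listing above) is to call rename() directly. It refuses to cross file systems, so it never silently degrades to a copy followed by an unlink:

# Sketch only: insist on an atomic rename instead of File::Copy::move().
if (rename($file_path, $file_spool_path)) {
    if (-e $file_spool_path) {
        # We really have the file.
    }
    # Otherwise another consumer got it despite the "successful" rename.
}
else {
    # rename() fails (with EXDEV) if the two paths are on different
    # file systems, instead of falling back to copy + unlink.
    warn "could not claim $file_path: $!";
}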

Final considerations

I tried to make the algorithm as solid as I could; a formal proof is hard, especially for concurrent and distributed systems. But the process worked fine with 25,000 files accessed by 20 consumers on two machines: no errors, and no file was lost or skipped without a good reason.

Although I can't say this is a bulletproof algorithm, the tests suggest it holds up well.

Published on Jul 20, 2008