Wednesday, September 19, 2012

Updating Delta Files with Mahout FileDataModel

FileDataModel is a class in the Mahout project. Its Javadoc states the following:

This class will also look for update "delta" files in the same directory, with file names that start the same way (up to the first period). These files have the same format, and provide updated data that supersedes what is in the main data file. This is a mechanism that allows an application to push updates to FileDataModel without re-copying the entire data file.

In other words, every file in the same directory whose name starts with the same base name (the part before the first period) is loaded along with the main data file.

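import java.io.File;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;

// Loads a.txt, then looks in the same directory for files matching "a.*"
// that are newer than a.txt (e.g. a.txt.1, a.txt.2, ...) and applies them as deltas.
FileDataModel model = new FileDataModel(new File("a.txt"));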

This lets an application avoid re-copying the whole data file, but it also means we cannot keep backup copies in the same directory just by changing the file extension. For instance, files renamed to a.txt~ or a.(yyyyMMdd).txt that are newer than a.txt will be treated as valid "delta" files, too.
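
To see which files in a directory would be caught by this rule before handing the data file to FileDataModel, a small check like the one below may help. It is only a sketch of the matching behaviour described in this post (same prefix up to and including the first period, and a newer modification time), not Mahout's actual implementation. The class name DeltaFileCheck and its methods are made up for illustration, and it uses only java.io so it runs without Mahout on the classpath.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Mahout: approximates the "a.*" rule described above.
class DeltaFileCheck {

    // Prefix of the data file name up to and including the first period,
    // e.g. "a.txt" -> "a.". Falls back to the whole name if there is no period.
    static String prefixOf(String dataFileName) {
        int period = dataFileName.indexOf('.');
        return period < 0 ? dataFileName : dataFileName.substring(0, period + 1);
    }

    // True if candidateName shares the prefix and is not the data file itself.
    static boolean matchesByName(String dataFileName, String candidateName) {
        return !candidateName.equals(dataFileName)
                && candidateName.startsWith(prefixOf(dataFileName));
    }

    // Lists regular files next to dataFile that match by name and are newer than it.
    // Assumption: "newer" is taken to mean a strictly greater modification time.
    static List<File> findDeltaCandidates(File dataFile) {
        List<File> result = new ArrayList<>();
        File dir = dataFile.getAbsoluteFile().getParentFile();
        File[] siblings = dir == null ? null : dir.listFiles();
        if (siblings == null) {
            return result; // directory missing or unreadable
        }
        Arrays.sort(siblings); // stable, path-ordered output
        for (File f : siblings) {
            if (f.isFile()
                    && matchesByName(dataFile.getName(), f.getName())
                    && f.lastModified() > dataFile.lastModified()) {
                result.add(f);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        File dataFile = new File(args.length > 0 ? args[0] : "a.txt");
        for (File f : findDeltaCandidates(dataFile)) {
            System.out.println("would be treated as delta: " + f.getName());
        }
    }
}
```

Running it against the directory that holds a.txt lists backup files such as a.txt~ alongside the intended a.txt.1, a.txt.2, and so on. The simplest way around the problem is to keep backups in a different directory, or to give them a name that does not start with the same prefix.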