This site's content was compiled from 1993 to 2006. Beyond that, Google is your friend.
Ian Elliott
This library provides classes that compare two sequences of like items and report their differences. A sequence is an ordered collection of items where each item has an associated positive integer index.
Successive items of the sequence may be accessed and for each access the index of the item accessed increases by 1. Thus files of lines, characters, bits and other objects, arrays and linear lists of items and even traversals through structured objects such as trees and graphs may be treated as sequences. Two sequences are the same if they contain items of equal value (by is_equal) at every corresponding index, otherwise the sequences are different. The smallest index at which corresponding items differ specifies the position of the first difference.
At higher indexes the two sequences may contain subsequences of items in common i.e. they may resynchronise and further on they may differ again, and so on. The aim of the library is to provides means to identify the differences, report them and report where they occur.
A comparison algorithm is implemented for those sequences which are rewindable i.e. sequences for which reading may be re-commenced at a position already read. Such sequences include arrays and random access files. The classes are given in the Comparison cluster.
A simple "brute force" pattern matching algorithm is used and it is reasonably effective for small files. A full implementation is provided for line based ASCII files i.e. for files which are sequences of ASCII strings separated by end of line characters. These classes are in the Line_file_comparison cluster.
The program fcompare provides a simple command line interface to this implementation.
The two sequences are read an item at a time in concert. The items in each pair are compared. If the items are the same i.e. have the same value by is_equal, the next pair is read. This continues until one of the following occurs:
Considering each case in turn:
An attempt is made to re-synchronise the sequences. If synchronisation is achieved, then from this point the pair-wise comparing can continue as before.
To seek re-synchronization the section containing the different items is determined for each sequence. A record of the difference sections is maintained. Initially each difference section is given the first item which differs. The next item is read from one of the sequences. If the other difference section contains this item, it indicates the start of resynchronized sections of a length of at least 1. If the item read is not in the other’s difference section then the first sequence’s difference section is increased by this item. The next item of the other sequence is read and the same process is applied to it with respect to the first sequence. This alternating processing continues until a match is found or one of the sequences exhausts. On establishing resynchronization, pair-wise reading of the two sequences continues as before. If a sequence exhausts then searching for re-synchronization continues with the remaining sequence. A report of the difference is made.
For satisfactory resynchronization to be deemed to have occurred the sections must agree for a length which is usually greater than 1. This is required since accepting agreement of less than this length may generate spurious resynchronizations. In these cases numerous stop-start differences can be produced which obscure the true differences sought.
The comparison is complete and so processing stops.
A report that a sequence has exhausted is made and processing stops.