Project
Delta Force
Introduction
SmallMart, one of the world's largest retailers, wants to hire you to help them manage their inventory. A centralized file contains information about every item SmallMart sells. This file is updated each day. All around the world, there are devices that need to receive a copy of the file each day: computers and cash registers in the stores, vending machines, handheld devices, etc., so each day, SmallMart sends a copy of the inventory file to all these devices. There are 10 million such devices, and the inventory file is about 100 kilobytes long, so about 1 terabyte (i.e., 1000 gigabytes = 100KB * 10 million) is sent each day. Given that their internet provider charges SmallMart roughly $1 per gigabyte, it costs about $1,000 per day to provide uptodate inventory information to the devices. In addition to the inventory files, SmallMart occasionally sends updates of other files to the devices as well (for example, new versions of applications as they are updated).
Marty Small, one of SmallMart's smartest managers (and no relation to the company founder), recently had an idea. He realized that the file that SmallMart sends each day changes very little from the file it sent the day before. For example, consider these two sample inventory files sent by SmallMart to the devices on April 10 and April 11. Each file contains multiple records; each record has an item number, an item name, and the number of those items in the inventory:
April 10 version of the inventory file:
81609,Feather Duster,198,92246,Lawn Chair Set,50,03854,Carrano C++ book,183,27408,Monsters, Inc. DVD,89
April 11 version of the inventory file: 66284,Screwdriver,1000,81609,Feather Duster,195,92246,Lawn Chair
Set,50,03490,Bedspread,87,27408,Monsters, Inc. DVD,89,40411,Hair Spray,380
As you can see, both the earlier and later version of the inventory file have many similarities. There are some new additions in the April 11 version of the file, such as the entry
03490,Bedspread,87
There are also some changes to existing records between the two versions: 81609,Feather Duster,198
changes to
81609,Feather Duster,195
And some of the entries in the original file have been removed (e.g., it looks like SmallMart sold all of the Carrano C++ books in one day, since there is no entry for it in the second file).
Given the significant similarities between successive versions of the inventory file, Marty realized he could save SmallMart hundreds of dollars every day. How? Marty realized that since a given day's inventory file is so similar to the previous day's inventory file, that it doesn't make sense to ship the entire 100K inventory file to each salesperson every day. Instead, SmallMart could just send a smaller file containing the list of changes (called a delta file) to each device to indicate what has changed from yesterday's inventory file to today's file. Each device could then run an incremental update program to use this delta file to convert their existing copy of the inventory file computer to the new version of the inventory file.
Let's number each character in both of inventory files and then see an example. April 10 version of the inventory file:
1 2 3 4
01234567890123456789012345678901234567890123456789
0 81609,Feather Duster,198,92246,Lawn Chair Set,50,0
50 3854,Carrano C++ book,183,27408,Monsters, Inc. DVD 100 ,89
April 11 version of the inventory file:
1 2 3 4
01234567890123456789012345678901234567890123456789
0 66284,Screwdriver,1000,81609,Feather Duster,195,92
50 246,Lawn Chair Set,50,03490,Bedspread,87,27408,Mon
100 sters, Inc. DVD,89,40411,Hair Spray,380
Given these two versions of the inventory file, let's see what a delta file might look like. Remember, the delta file contains a set of instructions on how to convert a first version of the file (e.g., the April 10 version of the file) into a later version of the file (e.g., the April 11 version). This delta file would normally be created at SmallMart headquarters and then sent to each of the 10 million devices, which already have the full April 10 version of the inventory file:
1. Create a new, empty output file
2. Add 23 characters to the output file: 66284,Screwdriver,1000,
3. Copy 23 characters from offset 0 in the source file to the output file.
4. Add 1 character to the output file: 5
5. Copy 27 characters from offset 24 in the source file to the output file.
6. Add 16 characters to the output file: 490,Bedspread,87
7. Copy 28 characters from offset 75 in the source file to the output file.
8. Add 21 characters to the output file: ,40411,Hair Spray,380
Given this set of instructions, an incremental update program on a device could follow the above set of instructions and convert their April 10 version of the file (which is already on their computer) into the April 11 version of the file. How would this work?
After following the first instruction, the incremental update program would create an empty output file. After following the second instruction, that output file would contain
66284,Screwdriver,1000,
After following the third instruction, the output file would contain
66284,Screwdriver,1000,81609,Feather Duster,19
After following the fourth instruction, the output file would contain
66284,Screwdriver,1000,81609,Feather Duster,195
After following the fifth instruction, the output file would contain 66284,Screwdriver,1000,81609,Feather Duster,195,92246,Lawn Chair Set,50,03 And so on.
To perform this conversion, only two different commands are required: a Copy command and an Add command. The Copy command specifies that a specific number of characters should be copied from a particular offset in the first version file to the end of the output file. The Add command is used to add entirely new content to the output file when this content cannot be located and copied from the original (earlier version of the) file.
Of course, the eight delta instructions shown above are in a humanreadable format that is quite wordy. We can make our delta file much smaller by removing all of the humanreadable text and using a more compact encoding. Shown below is a 93character delta file containing all of the instructions above; note that this delta file is 33% smaller than the full April 11 version of the file:
A23:66284,Screwdriver,1000,C23,0A1:5C27,24A16:490,Bedspread,87C28,75A21:,40411,Hair Spray,380
In the example above, each add command is represented by an A, followed by the number of characters to add, followed by a colon character and the actual characters to append. Each copy command is represented by a C followed by the number of characters to copy, a comma, and the offset in the original file from which to start copying the characters. It is possible to convert any file to any other file using just these two types of commands.
So to recap, each day, in its corporate office, SmallMart can create the latest version of the full inventory file (e.g., the April 11 version). Then, SmallMart can use a special tool called a delta creator to create a delta file (made up of Add and Copy commands) that contains the instructions for converting yesterday's version (the April 10 version) of the file to today's version (the April 11 version) of the file. Then, Small Mart can send this delta file to the 10 million devices instead of sending the full (much larger) April 11 inventory file.
Upon receiving the delta file, each of the 10 million devices then runs a program called an incremental updater that takes yesterday's full April 10 inventory file (which is already on the device, produced yesterday) and the justreceived delta file, and follows the instructions in the delta file to create today's April 11 full inventory file. On April 12, the device will receive a delta file that it will use along with this full April 11 file to produce a full April 12 inventory file. Since the delta files are just a fraction of the full inventory files' sizes, this saves SmallMart considerable networking costs.
Of course, this delta approach can be used to update all types of files, not just inventory files. For instance, consider the following two files A and A', where A' is a derived version of A.
1 2 3
01234567890123456789012345678901234
A : ABCDEFGHIJBLAHPQRSTUVPQRSTUV
A': XYABCDEFGHIJBLETCHPQRSTUVPQRSTQQELF
Here's a delta file to convert A into A'. Verify for yourself that it works:
A2:XYC12,0A3:ETCC13,13A5:QQELF
Now it may not be obvious, but there are actually many possible correct delta files that can be created to convert a first version of a file A into a second version A'. For example, here's another correct delta file for
the example above:
A3:XYAC9,1A6:BLETCHC12,14A5:QQELF
And here's another correct, but not very useful, delta file:
A35:XYABCDEFGHIJBLETCHPQRSTUVPQRSTQQELF
This delta file simply contains the entire contents of A' and instructs the incremental updater to write all 35 characters of this data out to the output file. It's a bad solution because it's actually larger in size than A'! It'd be cheaper to just send A' instead of this delta file.
Your Assignment
You have been hired by the Chief Frugality Officer (CFO) of SmallMart to create two functions:
createDelta: This function takes the contents of two files, A and A', and produces a delta containing instructions for converting A into A'. Each day, SmallMart will use this function at their corporate headquarters to create a new delta file that they will then send to all the devices.
applyDelta: This function takes the content of a file A and a delta file, and will apply the instructions in the delta file to produce a new file A'. Each day, every device will use this function to update the previous day's inventory file.
The createDelta function has the following interface:
void createDelta(istream& oldf, istream& newf, ostream& deltaf);
You may name the parameters something else, but there must be three parameters of the indicated types in the indicated order. (The File I/O tutorial explains istream and ostream.) The parameters are
an alreadyopened input source (for yesterday's full file, say) an alreadyopened input source (for today's full file, say)
an alreadyopened output destination to which you will write the instructions for converting the first input to the second.
The applyDelta function has the following interface:
bool applyDelta(istream& oldf, istream& deltaf, ostream& newf);
You may name the parameters something else, but there must be three parameters of the indicated types in the indicated order. The parameters are
an alreadyopened input source (for yesterday's full file, say) an alreadyopened input source (the delta file)
an alreadyopened output destination to which you will write the file resulting from applying the delta file to the first input.
The function returns true if the operation succeeds. If it fails because the delta file is malformed (e.g., there's a character other than an A or C where a command is expected, or an offset or length is invalid), the function returns false. If the function returns false, the caller can make no assumption about what may or may not have been written to the output destination (so you're free to, for example, just return false as soon as you detect an error, without regard to what you may have already written).
Other than the output required to be written to its third parameter, neither function may write to any
destination other than cerr (presumably for debugging purposes). Our testing program will ignore anything you write to cerr.
Your functions must be able to create deltas and apply deltas for files up to 100 kilobytes (102,400 bytes) in length, and if you wish, may be able to handle longer files as well.
Your program must build and run successfully on the SEASnet Linux server. Of course, you can do your primary development in Visual C++ or Xcode, but be sure you check that it works on the Linux server.
Your createDelta function must not run for longer than 15 seconds on the SEASnet Linux server for any file of up to 100 kilobytes. Your applyDelta function must not run for longer than 5 seconds on the SEASnet Linux server for any file of up to 100 kilobytes.
Here's an example of a main routine that performs a test of the functions:
#include
#include
#include
#include using namespace std;
bool runtest(string oldname, string newname, string deltaname, string newname2)
{
ifstream oldfile(oldname); if (!oldfile)
{
cerr << "Cannot open " << oldname << endl; return false;
}
ifstream newfile(newname); if (!newfile)
{
cerr << "Cannot open " << newname << endl; return false;
}
ofstream deltafile(deltaname); if (!deltafile)
{
cerr << "Cannot create " << deltaname << endl; return false;
}
createDelta(oldfile, newfile, deltafile); deltafile.close();
oldfile.clear(); // clear the end of file condition oldfile.seekg(0); // reset back to beginning of the file ifstream deltafile2(deltaname);
if (!deltafile2)
{
cerr << "Cannot read the " << deltaname << " that was just created!" << endl; return false;
}
ofstream newfile2(newname2); if (!newfile2)
{
cerr << "Cannot create " << newname2 << endl; return false;
}
assert(applyDelta(oldfile, deltafile2, newfile2));
cout << "You must compare " << newname << " and " << newname2 << endl;
cout << "If they are not identical, you failed this test." << endl; return true;
}
int main()
{
assert(runtest("myoldfile.txt", "mynewfile.txt", "mydeltafile.txt", "mynewfile2.txt"));
}
(Note: We close the deltafile output stream to ensure that all output to the file is completed before we open it for input. Because createDelta reached the end of file on the oldfile input stream, we need to clear the end of file condition on the stream and reset the stream so that we read from the beginning again.)
Here's another main routine that tests the functions. It uses the types istringstream (an input source where the characters read come from a string instead of a file), and ostringstream (an output destination where the characters written are later available as a string).
#include
#include // for istringstream and ostringstream
#include
#include using namespace std;
void runtest(string oldtext, string newtext)
{
istringstream oldfile(oldtext); istringstream newfile(newtext); ostringstream deltafile; createDelta(oldfile, newfile, deltafile); string result = deltafile.str();
cout << "The delta length is " << result.size()
<< " and its text is " << endl; cout << result << endl;
oldfile.clear(); // clear the end of file condition oldfile.seekg(0); // reset back to beginning of the stream istringstream deltafile2(result);
ostringstream newfile2;
assert(applyDelta(oldfile, deltafile2, newfile2)); assert(newtext == newfile2.str());
}
int main()
{
runtest("There's a bathroom on the right.", "There's a bad moon on the rise.");
runtest("ABCDEFGHIJBLAHPQRSTUVPQRSTUV", "XYABCDEFGHIJBLETCHPQRSTUVPQRSTQQELF");
cout << "All tests passed" << endl;
}
We will provide you with several different old and new files for you to test your program with, in a Windows version and a Mac and Linux version. For each of pair of old and new files, your program must create delta files that are at least 5% smaller than the new version of the file. We will test your program on other files as well, so it can't work just for these sample files. Below we show how big our solution's delta file is for each of these sample pairs of files:
Windows file sizes in bytes Mac and Linux file sizes in bytes old new our delta old new our delta
Windows file sizes in bytes |
Mac and Linux file sizes in bytes |
|
Old |
New |
Our delta
|
Old |
New |
Our Delta
|
SmallMart inventory |
105
|
141
|
95
|
104
|
140
|
94
|
Weird Al's Dr. Seuss |
533
|
606
|
69
|
510
|
579
|
69
|
War and Peace |
77285
|
77333
|
399
|
76634
|
76682
|
393
|
Some strange files |
100764
|
100764
|
8746
|
100440
|
100440
|
8746
|
SmallMart inventory
Weird Al's Dr.
Seuss War and Peace
Some strange files
Algorithm Requirements
You may be wondering how to write a function to create a delta file. This is definitely a nontrivial problem, and most senior job interview candidates can't come up with a viable algorithm (i.e., a conceptual approach) within a onehour interview.
Here's a general, highlevel algorithm that can be used to build a delta file. It's not ideal, however, and you'll have to come up with improvements of your own to make it work well:
1. Read in the entire contents of the old file into a string. Read the entire contents of the new file into another string.
2. For all consecutive Ncharacter sequences in the old file's string, insert that Ncharacter sequence and the offset F where it was found in the old file's string, into a table (e.g. hash table or binary search tree). You might use 8 for N, or maybe 16.
3. Once you have filled up your table with all Nbyte sequences from the source file, start processing the new file's string, starting from offset j=0, until j reaches the end of the string.
a. Look up the Nbyte sequence which starts at offset j ([j,j+N1]) in the new file's string in the table you created from the old file's string.
b. If you find this sequence in the table, you know that that sequence is present in both the old and new versions of the file.
i. Determine how long the match goes on (it might be just N bytes long, or it might extend past the first N matching bytes to be many bytes longer).
ii. Once you have determined how long the match is (call this L), write a Copy instruction to the delta file to copy L bytes from offset F from the source file.
iii. Go back to step 3a, continuing at offset j = j + L in the new file's string.
c. If you don't find the current sequence (new file's string [j,j+N1]) in the table you created, you know that the first version of the file doesn't contain the current N byte sequence.
i. Write an instruction to the delta file to Add the current character.
ii. Increment j by one to continue past the character used in the add instruction.
iii. Go back to step 3a, where we'll try to find the next Nbyte sequence in our table.
Of course, this is a simple version of the algorithm, and many improvements are possible. For example, this version of the algorithm will result in only singlecharacter add instructions: If the new file contains the new text BLAH (not present in the old text), then the above algorithm would produce four Add instructions (A1:BA1:LA1:AA1:H) instead of a single (and more compact) Add instruction that adds all four new characters at once (A4:BLAH).
To obtain the highest score possible and create the smallest delta files, you'll have to improve on the algorithm substantially. Be creative! In addition, this naïve algorithm also has troubles with certain types of files (such as the strange*.txt sample files that we provide). You'll have to figure out why and make sure this doesn't cause problems in your solution.
For this project, we are constraining your implementation:
You are required to use either a hash table or a binary search tree (your choice of which) for your table data structure.
The only containers from the C++ standard library you may use are vector, list, stack, and queue (and string). You'll thus have to implement your BST or hash table yourself instead of using a map or unordered_map, but you can, say, use the list type to help you implement a hash table. Although we're limiting your use of containers from the library, you are free to use library algorithms (e.g., sort).
Compared to creating a delta file, the algorithm for applying a delta file is much easier. You may use the following functions to aid you in extracting commands from the delta file:
bool getInt(istream& inf, int& n)
{
char ch;
if (!inf.get(ch) || !isascii(ch) || !isdigit(ch)) return false;
inf.unget(); inf >> n; return true;
}
bool getCommand(istream& inf, char& cmd, int& length, int& offset)
{
if (!inf.get(cmd) || (cmd == '\n' && !inf.get(cmd)) )
{
cmd = 'x'; // signals end of file return true;
}
char ch; switch (cmd)
{
case 'A':
return getInt(inf, length) && inf.get(ch) && ch == ':'; case 'C':
return getInt(inf, length) && inf.get(ch) && ch == ',' && getInt(inf, offset);
}
return false;
}
Attachment:- Project.rar