1. Fixed-length and variable-length records
Recall the Oscar table from our movie database, which has the following schema:
Oscar(movie_id CHAR(7), person_id CHAR(7), type VARCHAR(23), year INTEGER)
Consider the following tuple from that table:
('2980516', '1519666', 'BEST-ACTOR', 2015)
which captures the fact that Eddie Redmayne (whose id is '1519666') won last year's Oscar for Best Actor for his performance in The Theory of Everything (which has an id of '2980516').
a. What would this tuple look like if we stored it in a fixedlength record? Your answer should be a single string of characters that specifies the contents of the record something similar to the record diagrams in the lecture notes. You should observe the following conventions:
Use a single vertical bar surrounded by spaces (' | ') to indicate where one component of the record ends and another component begins.
Use a number sign ('#') as a delimiter when it is necessary to record the end of a variable length field's value.
Use hyphens ('') for any "wasted" bytes (i.e, bytes that are part of the record's representation but are not actually part of one of the field values).
b. What is the length in bytes of the record from part a? Assume that you are using twobyte integers (both for integer field values and for any integer metadata that may be needed) and one byte characters. You should not include the verticalbar/space combinations (' | ') in the length, since they are not actually stored in the record. Show your work, and clearly indicate your final answer.
c. What would this tuple look like if we stored it in a variablelength record in which each field is preceded by its length? Use the same conventions specified for part a, and make the same assumptions stated in part b.
d. What is the length in bytes of the record from part c? Make the same assumptions stated in part b. Show your work, and clearly indicate your final answer.
e. What would this tuple look like if we stored it in a variablelength record that begins with a header of offsets? Use the same conventions specified for part a, and make the same assumptions stated in part b.
f. What is the length in bytes of the record from part e? Make the same assumptions stated in part b. Show your work, and clearly indicate your final answer.
g. Gradcredit problem (required of gradcredit students; may be completed by other students for "partial" extra credit)
Briefly discuss the pluses and minuses of using each of these three record formats for the Oscar table.
h. Gradcredit problem (required of gradcredit students; may be completed by other students for "partial" extra credit)
Are there certain workloads that would cause you to choose one of these formats over the other, or is one of these formats the best one for the Oscar table regardless of workload? By workload, we mean a particular mix of command types. For example, one workload might be readheavy (e.g., mostly SELECT commands), while another workload might be writeheavy (e.g., mostly UPDATE commands).
2. Index Structures
Let's say that you want to insert the following sequence of keys into a table that uses some form of indexing:
31, 30, 29, 21, 19, 18, 13, 11, 8, 6, 5, 3, 2
a. Insert these values into an initially empty Btree of order 2, showing the state of the tree before and after each split, as well as the final tree.
b. Insert these values into an initially empty B+tree (note the +) of order 2, showing the state of the tree before and after each split, as well as the final tree.
c. Insert these same values into a hash table that uses linear hashing. The table uses the hash function h(x) = x, and it starts out with two empty buckets. Assume that a bucket is added whenever the number of items in the table exceeds twice the number of buckets. Show the state of the table before and after each increase in the number of buckets, as well as the final state of the table.
Part II: Programming Problems Overview
In this assignment, you will begin to implement a simple relational database management system that supports a substantial subset of the SQL language. We have provided you with two of the three components of the system: (1) a SQL parser, and (2) Berkeley DB, an embedded database system that will serve as the storage engine. Your job will be to implement the "middle layer" of the system, which takes the parsed version of a SQL command and performs the necessary lowerlevel actions to execute the command. To help you, we have given you a code framework for the middle layer that already provides some of the necessary functionality.
In the current assignment, you will implement whatever additional functionality is needed to add data to relations, as well as the functionality needed to simply list all of the rows in a given relation.
Getting Started
You will complete this project using a virtual machine that we have created. This will allow you to work on your own computer and use the GUI tools of your choice. You should install the virtual machine as soon as possible, and setup your work environment within the virtual machine.
After installing the virtual machine, you should spend some time familiarizing yourself with the files that we have given you in the dbms directory, and with Berkeley DB. In particular, you should review/read the following resources:
the lecture notes on implementing a logicaltophysical mapping the overview of the code framework the API documentation of the code framework The following additional resources may also be helpful:
a guide to programming with Berkeley DB's Base Java API BDB Java API documentation extra notes on marshalling (solutions) UNIX quick reference
Codereading and design questions
Before you begin coding, we highly recommend that you answer the questions found here, which will help you to prepare for your work on the problems from this part of the assignment.
General Notes
You may see one or more warnings when compiling your code (e.g., "Note: Parser.java uses unchecked or unsafe operations."). These warnings are to be expected and should be ignored. Messages labeled as errors (not warnings) will keep your code from compiling and will need to be addressed. You shouldn't see any errors when you compile the starter code that we've given you. If you do, let us know.
The code that we've given you can be run before you make any changes. It will begin by printing the following prompt:
Enter command (q to quit):
If you enter a valid SQL command, the program will parse the command and display a summary of some of the command's components (see the notes on the DEBUG constant below for how to disable this summary). Entering a lowercase q will allow you to quit the program.
When you run the program for the first time, it will create a directory called db within your code directory. This is the home directory for the Berkeley DB environment, and it will be used to store the files that BDB creates for your database. If your program crashes for any reason, these files may be corrupted. As a result, we recommend that you remove all files from this directory after a crash.
There is a constant named DEBUG that is defined in DBMS.java. When it is set to true (as it is in the files that we have given you), the values of many of the tokens generated by the parser are printed after each SQL command is entered by the user. You may find this information helpful as you implement the various types of commands. You may also wish to add additional debugging code that is only executed when this constant is set to true. To eliminate the debugging messages, set DEBUG to false.
Problems for this Assignment
Important: Before you begin coding, make sure that you have completed the tasks listed under the Getting Started section above, and that you have answered the codereading and design questions mentioned above.
3. Adding support for the marshalling of column values
In order to insert rows into a table, your DBMS needs to be able to marshall a collection of column values into a single Berkeley DB key/data pair. In this problem, you will add support for marshalling by implementing the key method of the InsertRow class.
As you saw when completing the codereading questions, an InsertRow object is used by the execute() method for INSERT commands (the one in the InsertStatement class). That execute() method creates an InsertRow object to represent the row to be inserted, and it calls that object's marshall() method to prepare the marshalled (key, data) pair for the row.
We have already implemented some of the other methods of this class for you:
an InsertRow constructor that initializes the state of object. It takes two parameters: an already opened Table object for the table to which the row will be added, and an array of type Object containing the values in the row to be inserted. We assume that the values are in the appropriate order i.e., that element 0 of the array contains a value for the first column in the table, element 1 contains a value for for the second column in the table, etc. We also assume that the values are valid and that they have been adjusted as needed to correspond to the types of the columns.
a getKey() method that returns a DatabaseEntry object for the key portion of the marshalled (key, data) pair.
a getData() method that returns a DatabaseEntry object for the data portion of the marshalled (key, data) pair.
You will implement the marshall() method, which should take the column values of the InsertRow and turn them into a marshalled (key, data) pair. You should take whatever steps are needed to perform the marshalling, and assign the resulting DatabaseEntry objects to the key and data fields of the InsertRow object.
As suggested in the codereading questions, you may find it helpful to consult the catalog code that we have given you in Catalog.java to see how it performs its marshalling and unmarshalling. There is also an overview of this catalog code in the extra notes on marshalling. However, it is important to remember that the catalog always unmarshalls an entire data item, whereas your row format needs to support the efficient unmarshalling of a single column value. See the section of the codereading questions on marshalling and unmarshalling for more detail.
Notes:
Your marshall() method will need to use one or two TupleOutput objects (two if the table has a primary key, and one if it does not). Because TupleOutput objects fill their associated byte strings
from left to right, you will need to determine all of your offsets before you begin marshalling the values. One way to do this is to use an array for the offsets; once all of the offsets have been computed and stored in this array, you can begin the process of writing into the TupleOutput objects.
The InsertRow constructor takes a reference to the corresponding Table object as a parameter, and it stores that reference in a field called table. Your code can obtain any column information that it needs from the Table object and its associated Column objects.
The Column.getLength() method gives the actual length in bytes of all columns except VARCHARs. In the case of VARCHARs, you should determine the length by invoking the String.length() method on the actual value.
Because the column values are stored in an array of type Object, you will need to use type casts in order to treat them as objects of their actual types. For example, to treat values[i] as a String, you would need to do something like (String)values[i]. Consult the Column class for the method you should use to determine the type of a given column.
Integer values are stored in the values array as objects of Java's Integer class, and real values are stored as objects of Java's Double class. When marshalling these values, you will need to convert them to primitive values of type int and double, and you should use the Integer.intValue() and Double.doubleValue() methods to do so. For example, if you have an Integer object named val, you can convert it to an int by making the method call val.intValue().
When marshalling a String value, you should use TupleOutput.writeBytes(), not
TupleOutput.writeString().
To keep the marshall() method from getting too large, you may want to add one or more private helper methods that can be called to do part of the overall task.
Review the Table and Column classes as needed, as well as the Berkeley DB DatabaseEntry and
TupleOutput classes.
4. Completing the implementation of INSERT commands
We have given you the start of the execute() method of the InsertStatement class, which is used to carry out INSERT commands. As mentioned earlier, our provided code uses an InsertRow object to prepare the row for insertion marshalling it into a (key, data) pair. You will need to complete the execute() method by writing code that adds the (key, data) pair (which you should obtain from the InsertRow object) to the underlying BDB database.
Notes:
The method that you use to add the (key, data) pair will depend on which type of BDB database is being used for the table in question. As mentioned in the codereading questions, the choice of BDB database type (Btree or Recno) depends on whether the table has a primary key.
The insertion can fail if the table has a primary key and there is already a (key, data) pair with the specified key. In such cases, the BDB method will return a value that indicates that the key already exists. Your code should handle this return value by throwing an exception with an appropriate error message. See our CreateStatement code for examples of throwing an exception.
Review the Table and InsertRow classes as needed, as well as the Berkeley DB Database class.
5. Adding support for a table iterator, and for the unmarshalling of column values (20 points)
In order to execute a SELECT command, your DBMS needs to be able to iterate over the rows in one or more tables, and to access the values of the columns in those rows. In this problem, you will add support for a table iterator, which will be able to iterate over all or some of the rows in a single table and access the values of the columns. We can associate a WHERE clause with such an iterator, in which case it will only visit rows that satisfy the WHERE clause.
To complete this problem, you will need to implement two methods of the TableIterator class. We have
already implemented some of the other methods of this class for you:
a TableIterator constructor that takes an already opened table object and initializes the state needed by the table iterator, including a cursor for the underlying database. The constructor also examines the columns mentioned in the SQL statement for which this iterator is needed, and it associates this iterator with those columns; doing so allows the code that evaluated the WHERE clause to use the iterator to obtain the column values that it needs.
a first() method that positions the iterator on the first tuple of the table.
a getColumn() method that takes an index nand returns a Column object for the nth column in the table associated with the iterator. The leftmost column has an index of 0.
a close() method that closes the cursor associated with the iterator.
For this assignment, you should implement the following methods:
a next() method that advances the iterator to the next tuple specified by the SELECT command. In implementing the next() method, you may want to use our first() method as a model. However, whereas the first() method ignores the WHERE clause associated with the iterator, the next() method should keep advancing the iterator until it reaches a row that satisfies the WHERE clause (see the first note below). Provided that the iterator can be positioned on an appropriate tuple, the count of the number of tuples visited by the iterator (stored in the field called numTuples) should be incremented, and the method should return true. If there are no tuples left for the iterator to visit, the method should return false.
a getColumnVal() method that takes an index n and returns the value of the nth column in the tuple on which the iterator is currently positioned. To do so, it will need to unmarshall the appropriate value from the BDB key/data pair associated with that tuple, and it should use the metadata that you included when you marshalled the tuple to efficiently access the value of the specified column. See the notes below for more detail.
You do not need to implement the remaining TableIterator methods (updateRow and deleteRow) in this assignment.
Notes:
If there is no WHERE clause, or if the evalWhere parameter of the TableIterator constructor has a value of false, the constructor will set the TableIterator's where field to an instance of the TrueExpression class; its isTrue() method always returns true. As a result, your next() method can always safely invoke this.where.isTrue() to determine if the iterator should visit the current row.
Your getColumnVal() method will need to use at least one TupleInput object to unmarshall the value of the specified column. Here again, you may find it helpful to consult the catalog code that we have given you in Catalog.java and the extra notes on unmarshalling. However, whereas the catalog code always unmarshalls the entire marshalled data item, your code should not perform unnecessary reads. Rather, it should only read (1) the offset or offsets needed to determine where the column value is located and (when necessary) the length of the column value, and (2) the column value itself.
To prevent unnecessary reads from the TupleInput object, you will need to use one or more of the following methods that TupleInput inherits from Java's InputStream class:
mark() This method allows you to mark a position in the TupleInput's byte string so that you can return to that position after performing one or more reads. We recommend that you use this method immediately after you create a TupleInput object if you know that you will need to read multiple values from that TupleInput. For example, if your TupleInput variable is named dataIn, you would do the following immediately after you create the TupleInput:
dataIn.mark(0);
reset() This method returns the TupleInput to the position in the byte string at which you last called mark().
skip() This method takes an integer n and skips over n bytes in the byte string associated with the TupleInput. You can use this method to jump to the appropriate position for whatever value you need to read. If you called mark after creating the TupleInput, then once you determine the offset of a given value, you can get to that value by doing the following:
dataIn.reset(); // return to the beginning of the byte array dataIn.skip(offset); // jump to the appropriate position
When unmarshalling a String value, you should use TupleInput.readBytes(), not
TupleInput.readString().
Depending on which column value you need to unmarshall and how you perform the unmarshalling, you may need to create two TupleInput objects: one for each component of the (key, data) pair.
Avoid creating a TupleInput for the key unless you actually need it.
Review the Table, Column and ConditionalExpression classes as needed, as well as the Berkeley DB DatabaseEntry, Cursor and TupleInput classes.
6. Adding support for SELECT * commands involving a single table (10 points)
Implement the execute() method of the SelectStatement class, and any necessary helper methods. For this assignment, you will only need to support SELECT * commands involving a single table.
Your method will need to open the table associated with the SELECT command by using the open() method that we have provided for Table objects. (See the start of the execute() method for InsertStatement for an example of this.) The open() method will get the table's catalog metadata and add it to the Table object, and it will also open the underlying BDB database if it isn't already open.
Your code should then create a TableIterator for the appropriate table and invoke the printAll() method on it. This method, which we have provided, will invoke the appropriate iterator methods to obtain the table's column values and display them with appropriate formatting. Note that your SELECTstatement code does not need to advance the iterator by calling next(); printAll() already does all of the iteration
and all of the other work for you, using the TableIterator methods that you wrote.
Your execute() method should check for currently unsupported SELECT commands: those with more than one table in the FROM clause, or with one or more columns specified in the SELECT clause. In addition, it should make sure that there is an existing table with the given name; the open() method of the Table object should make it easy to do so. If there is an error, you should throw an exception with an appropriate error message. Follow the approach to handling exceptions that we have shown in our code for CREATE TABLE, DROP TABLE, and INSERT commands. If there are no errors, your method should finish by printing a message that includes the number of tuples selected.
Notes:
The TableIterator constructor takes a reference to an object of type SQLStatement. You should pass in a reference to the SelectStatement object on which the execute method was invoked, which you can do by using the implicit parameter this:
... new TableIterator(this, ...);
When you call the printAll() method, you should pass in System.out as the parameter, so that the results will be displayed on the console. (The reason that we make printAll() take a parameter for this is for added flexibility. If we wanted to, we could pass in a parameter that corresponds to a text file, and the results would be written to that file instead of to the console.)
Your method should close the iterator before returning, but it should not close the table. That way, the table will stay in the provided table cache (see the code overview) for later use.
Review the Table, TableIterator, RelationIterator, and SQLStatement classes as needed.
The printAll() method is defined in RelationIterator, and TableIterator inherits it.
Sample Interaction
To give you a sense of what your DBMS's output should look like, here is a sample interaction:
Enter command (q to quit): CREATE TABLE Course(name VARCHAR(20), enrollment INT); Created table Course.
Enter command (q to quit): SELECT * FROM Course;
| name | enrollment |
--------------------------------------- Selected 0 tuples.
Enter command (q to quit): DROP TABLE Course; Dropped table Course.
Enter command (q to quit): SELECT * FROM Course; Course: no such table
Enter command (q to quit): CREATE TABLE Course(id CHAR(5) PRIMARY KEY, name VARCHAR(20)); Created table Course.
Enter command (q to quit): INSERT INTO Course VALUES ('01000', 'CS 165'); Added 1 row to Course.
Enter command (q to quit): INSERT INTO Course VALUES ('00050', 'Math 21a'); Added 1 row to Course.
Enter command (q to quit): INSERT INTO Course VALUES ('00050', 'Physics 12'); There is an existing row with the specified primary key.
Could not insert row.
Enter command (q to quit): SELECT * FROM Course;
| id | name |
----------------------------------
| 00050 | Math 21a |
| 01000 | CS 165 |
Selected 2 tuples.
Enter command (q to quit): q
You should use Canvas to submit the following files:
InsertRow.java InsertStatement.java TableIterator.java SelectStatement.java
Attachment:- dbms.zip