Assignment
In this exercise you will create some data "assets" for use by the XYZ company in direct marketing campaigns. You will download data from the SSCC server to your local computer, create relational database tables that you'll store locally, create a "flat" file with selected customers and variables, and report on purchasing by gender. You'll document your work by providing your commented code. You'll save the new assets you created for future use and for sharing with others.
To do this assignment you'll be using the SSCC and a Postgres server running on it, Python and the pandas package, and the sqlite database. You'll use Python and pandas for data manipulation and reporting.
Getting Your Data from the SSCC
XYZ's data are in a Postgres DB server on the SSCC. To get them, log in to the dornick server. You'll need to do this using a VPN client if you are off campus. Then, connect to the Postgres database server.
The Working with the SSCC pdf document and the SSCC Cheat Sheet pdf on Canvas provide some information about how to connect to the SSCC and to get your data from the Postgres DB.
You'll find three (3) tables in the Postgres DB pilot schema that are named item, mail, and customer. Export each table as a csv file with a header record, using a temporary view to do each one. Using psql is the easiest way to do this. After you are done, delete each temporary view.
Download your three csv files to your computer so that you can work with the data in them using Python. Don't forget to log off from dornick.
You'll find documentation about XYZ's data (attached). Note that like most real world documentation, it's not "perfect." But it is the real thing.
Do These Things
Once you have your csv files on your computer, do the following five (5) things, most of which have "subthings:"
#1) Import each of the csv files you downloaded from the SSCC into a pandas DataFrame.
(a) Provide the code you used to do this.
(b) Print out the column names of your item DataFrame and the first four (4) records in it.
(c) Decribe the data types of the columns in the DataFrame.
Include your commented code for each of the above.
#2) Write each of you pandas DataFrames to a local SQLite DB named xyz.db. Include only data for active buyers in these tables. Verify that you have written the tables to your SQLite DB correctly.
(Commented code, of course.)
#3) Now, using the same data as you used for 2, above, create a new table called custSum that you also write to xyz.db, and that has the following characteristics. This table should have one row per customer record.
(a) Include on each customer's record a binary, Y or N, indicator of whether the customer is a 'heavy buyer,' where the definition of a 'heavy buyer' is a customer whose YTD purchasing in 2009 is greater than 90% of the 2009 YTD purchasing of all customers who are active buyers. Verify your coding of this new variable by crosstabulating it with an indicator of whether their 2009 YTD purchasing is at at least the 90th percentile of all 2009 YTD purchasing.
(b) Add to each customer's record whether the customer has the following credit cards: AMEX, Discover, VISA, and Mastercard, with each credit card variable codes as a Y or a N for yes or no, respectively. Document your creation of these codes by showing how they are related to the code values in the data
(c) Add to each customer's record their estimated HH income, and the genders of adults "1" and "2."
(d) Add to each customer's record their ZIP code and ZIP+4 code.
(e) Be sure to include the account number on each record in the SQL tables you create so that the tables can be related to each other, later.
(f) Provide a count of the number of records in each table.
(g) Verify that you have written this table to your SQLite DB correctly.
(Don't forget to comment your code so that a reader can understand what it is supposed to do.)
#4) Create a new pandas DataFrame of data that will be used for target maketing and write it out to a headered csv file.
(a) This DataFrame should have one row per customer. The customers included should be active buyers or lapsed buyers.
(b) The row for each customer should include the customer's account identifier, and an indicator variable (Y/N, or 1/0) for each product category the customer has made at least one purchase in.
(c) Include for each customer their buyer status, and the total dollar amount of the purchases they have made from XYZ using all data available for him or her.
(d) Write your DataFrame to a csv file, and also store it in a shelve database.
(e) Verify that the files you wrote your customer DataFrame to were written correctly.
(Commented code, of course.)
#5) Report the six (6) most frequently purchased product categories by the gender of "adult 1" using the data for the active customers. Include for these categories the total spend in dollars on each category, the total number of products purchased in these categories, and the number of adults in each gender category.
(Be sure to comment your code.)
Your Deliverables
Provide the above in up to six pages, but in no more than 7 pages, in a pdf file. Be sure that everything is readable. Address each of the five above parts in turn. Do 1 by providing your commented code and results. Then do 2 providing code + results, and so on.
Do not provide a list of code for all of the above items in a block, followed by the results of your code in a block. An assignment organized in this way will be returned ungraded. Be sure all of your code is syntactically correct, and that it approximates good Python coding style.