Final Exam: Due by 11:59 PM EST Tuesday, December 13, 2022
This can and should be a group exam. You are free to choose your groups, preferably fewer than five people.
When you are finished, you will submit three objects to eLC:
Written pdf Report.
Treat the "report" as your main deliverable to executive management. These executives are smart but non-technical and have little to no experience with machine learning. Your report needs to communicate your results and recommendations in a clear and intuitive way. A good litmus test is that you should be able to communicate the "big ideas" to other people in your lives (e.g., significant others, parents, non-technical friends).
Your analysis should be professional: i.e., well-written, clear, and concise. Figures should be incorporated in your analysis. Save your report as a pdf ("File/Save As Adobe PDF") with the naming convention 'Report_[insert last names].pdf'. For example, 'Report_Thurk.pdf'.
To provide some structure, your report should include the following sections:
- Executive Summary: Provide a concise description of the problem, your solution methodology, and why your methodology works.
- Data Description: Describe the data. Provide important figures to demonstrate key variation. The figures should look nice and professional. See the lecture on effective visualization for suggestions. Be sure to include a concise description of the figures and how the variation they depict will be important inputs in the model and eventually affect the model's results.
- Model Description: Describe your preferred model. Pictures are a good way to demonstrate how your model works.
- Results: Discuss the characteristics and performance of your model. What features are most important (and why)?
Price predictions. Submit a csv file of price predictions following the naming convention 'Predictions_[insert last names].csv'. For example, 'Predictions_Thurk.csv'.
- Your price predictions must consist of two columns, where column one (`test_id`) matches the `test_id` column in `test.tsv`. Place your predicted price in column two (`price`). This submission should have about 3.5m rows.
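The two-column layout described above can be sketched with pandas as follows. This is only an illustration under stated assumptions: the helper name `build_submission` and the toy ids and prices are hypothetical, and a real submission would use the `test_id` values from `test.tsv` together with your model's output.

```python
import pandas as pd

def build_submission(test_ids, predicted_prices):
    """Return a two-column DataFrame: test_id first, then price (hypothetical helper)."""
    submission = pd.DataFrame({"test_id": test_ids, "price": predicted_prices})
    # Each listing in test.tsv should appear once and have a prediction.
    if submission["test_id"].duplicated().any():
        raise ValueError("test_id values must be unique")
    if submission["price"].isna().any():
        raise ValueError("every listing needs a predicted price")
    return submission[["test_id", "price"]]

# Toy values standing in for real test ids and model predictions.
submission = build_submission([0, 1, 2], [12.5, 40.0, 7.99])

# With no path, to_csv returns the csv text; pass a file name such as
# "Predictions_Thurk.csv" to write the file to disk instead.
csv_text = submission.to_csv(index=False)
```

Passing `index=False` keeps the pandas row index out of the file, so the csv contains only the two required columns.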
Jupyter Notebook of your Final Model. Submit your ipynb file using the following naming convention 'Model_[insert last names].ipynb'. For example, 'Model_Thurk.ipynb'. Your notebook should be self-contained: i.e., it should load the raw data provided, load any ancillary data you collected on your own, do any data manipulation, initialize and tune your model, and test your model.
Office Hours
I will hold my regular office hours on Wednesdays and Fridays, plus by appointment. All office hours will be via the Zoom link (i.e., no in-person office hours).
Grading
The exam is worth 100 points and partial credit is as follows:
A. Report (90 points)
- Is your report professional, clear, and concise?
- Are your figures effective at contributing to the overall message?
- Are your modelling decisions reasonable to accomplish your objective?
- Is your approach creative?
B. Model testing (10 points)
- I will award points based on your overall performance, relative to the other teams, on the out-of-sample testing data.
- The top 20% of teams get 10 points, the second 20% of teams get 8 points and so on.
- I reserve the right to deviate from this scale in the event that all teams do a good job and deserve more credit.
The Competition
It can be hard to know how much something is really worth. Small details can mean big differences in pricing. For example, can you guess which one of the following sweaters costs 400 USD and which costs 9.99 USD?
Product pricing gets even harder at scale, considering just how many products are sold online. Clothing has strong seasonal pricing trends and is heavily influenced by brand names, while electronics have fluctuating prices based on product specifications.
Objective
In this competition, you will predict the sale price of a listing based on the information the seller provides for that listing. Online marketplaces like Amazon provide this kind of insight. While these are real data, they are not from Amazon. You will be given supplier-provided text descriptions of the products, including details like product category name, brand name, and item condition.
About the Data
The file `final_data.zip` is located on eLC and has all necessary data to complete this project. There are two tab-delimited data files needed for this contest: `train.tsv` and `test.tsv`. Each file contains online product listings with the following variables:
- `train_id` or `test_id`: Listing identifier.
- `name`: Listing title. The data is cleaned to remove text that looks like prices (e.g. $20) to avoid "leakage." These removed prices are represented as [rm].
- `item_condition_id`: The condition of the item as provided by the seller.
- `category_name`: Category of the listing.
- `brand_name`: Brand name of the item.
- `price`: The price that the item was sold for. This is the target variable that you will predict. The unit is USD. This column doesn't exist in `test.tsv`.
- `shipping`: Equals 1 (0) if the shipping fee is paid by the seller (buyer).
- `item_description`: Item description. The data is cleaned to remove text that looks like prices (e.g. $20) to avoid "leakage." These removed prices are represented as [rm].
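Because the files are tab-delimited, one way to load them is `pandas.read_csv` with `sep="\t"`. The sketch below reads a two-row, made-up stand-in for `train.tsv` from memory so that it runs on its own; the helper name `load_listings` and all listing values are hypothetical, and only the column names come from the list above.

```python
import io
import pandas as pd

# A made-up, two-row stand-in for train.tsv using the columns listed above.
sample_tsv = (
    "train_id\tname\titem_condition_id\tcategory_name\tbrand_name\tprice\tshipping\titem_description\n"
    "0\tBlue sweater\t2\tClothing\t\t15.0\t1\tWorn twice, paid [rm] new\n"
    "1\tWireless mouse\t1\tElectronics\tExampleBrand\t22.0\t0\tNew in box\n"
)

def load_listings(path_or_buffer):
    """Read a tab-delimited listings file into a DataFrame (hypothetical helper)."""
    return pd.read_csv(path_or_buffer, sep="\t")

# For the real data, pass the file path instead: load_listings("train.tsv")
train = load_listings(io.StringIO(sample_tsv))

# Empty fields are read as missing values, so it is worth counting them per column.
missing_brands = train["brand_name"].isna().sum()
```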
You will use `train.tsv` to develop informative figures of descriptive statistics for your final report and to develop your model. Once you've trained your model, use `test.tsv` to make price predictions. While `train.tsv` has about 700k observations, `test.tsv` has 3.5m observations. I suggest developing your model initially on a representative sample of `train.tsv` in order to speed things up.
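One simple way to draw such a sample is a uniform random draw with `DataFrame.sample`. The sketch below uses a synthetic DataFrame in place of `train.tsv`, and the 10% fraction and seed are arbitrary choices for illustration. A uniform draw is only one notion of "representative"; stratifying by category, for example, is another option.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for train.tsv; in practice use the DataFrame loaded from the file.
rng = np.random.default_rng(0)
train = pd.DataFrame({
    "train_id": np.arange(1000),
    "shipping": rng.integers(0, 2, size=1000),
    "price": rng.lognormal(mean=3.0, sigma=0.8, size=1000).round(2),
})

# Draw 10% of the rows uniformly at random without replacement.
# Fixing random_state makes the draw reproducible across notebook runs.
dev_sample = train.sample(frac=0.10, random_state=42)
```

Once the modelling pipeline works on `dev_sample`, the same code can be rerun on the full training data before predicting on `test.tsv`.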
Model Evaluation
I will follow the Kaggle competition's evaluation metric and construct the Root Mean Squared Logarithmic Error (RMSLE) for each submission:
$$ \epsilon = \sqrt{\frac{1}{N} \sum_{i=1}^N\big[\log(\hat{p}_i+1) - \log(p_i+1)\big]^2} $$
Where:
$$ \begin{align} \epsilon&\equiv \text{ is the RMSLE value (score)}\\ N&\equiv \text{ is the total number of observations in the test data set}\\ \hat{p}_i&\equiv \text{ is your price prediction for product i in test.tsv}\\ p_i &\equiv \text{ is the actual sale price for product i in test.tsv (known only to me)} \end{align} $$
The team with the lowest RMSLE ($\epsilon$) wins!
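To check your own models on held-out training data, the RMSLE can be computed with NumPy along the following lines. This is a minimal sketch, not the grader's code: the function name `rmsle` is hypothetical, and it assumes that the logarithm in the formula is the natural log.

```python
import numpy as np

def rmsle(predicted, actual):
    """Root Mean Squared Logarithmic Error between predicted and actual prices."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    if predicted.shape != actual.shape:
        raise ValueError("predicted and actual must have the same shape")
    if (predicted < 0).any() or (actual < 0).any():
        raise ValueError("prices must be non-negative")
    # log1p(x) computes log(x + 1) (natural log, assumed here).
    log_diff = np.log1p(predicted) - np.log1p(actual)
    return float(np.sqrt(np.mean(log_diff ** 2)))

perfect = rmsle([10.0, 20.0], [10.0, 20.0])  # identical prices give a score of 0
score = rmsle([9.0, 0.0], [0.0, 9.0])        # both log differences have size log(10)
```

Because the metric works on log prices, it penalizes relative errors rather than absolute ones: predicting 20 for an item that sold for 10 costs about as much as predicting 200 for an item that sold for 100.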