Spam & Ham Projects, Extended - Data 100
Class Type: Principles and Techniques of Data Science
This project series has students build a binary classifier that distinguishes spam (junk, commercial, or bulk) emails from ham (regular non-spam) emails. This extended version emphasizes the planning and delivery stages of the data science lifecycle.
Authors:
Presentations
- Slides: https://docs.google.com/presentation/d/1VUAtGDQjdyteAUpwGDWDbJhZR5We12t4FwonkA7JFM4/edit?usp=sharing
Project Source
- Github: https://github.com/jegeronimo/cs294-189
- Google drive: https://drive.google.com/drive/folders/1e_nr7HPBZMsMUK9xQIoZ3V4fCKz13I_6?usp=sharing
Introduction
In the existing Projects B1 and B2, the overarching goal is for students to create a binary classifier that can distinguish spam (junk, commercial, or bulk) emails from ham (regular non-spam) emails. To better prepare students for careers as data scientists, we propose an extended project that provides exposure to the planning, ideation, and delivery stages of data science projects.
In Project B0, students build a design document that demonstrates their high-level understanding of the data (e.g., data processing, feature engineering). In Project B1, students create a simple binary classifier that distinguishes spam emails from ham emails. Here, they engineer features from text data, use the sklearn library to process data and fit models, validate the performance of the model, and minimize overfitting. Building on the design document from Project B0, Project B1 focuses on initial analysis, feature engineering, and logistic regression.
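As a rough illustration of the Project B1 workflow described above, the following sketch turns email text into keyword-indicator features and fits a logistic regression with sklearn. The helper name `keyword_indicators`, the keyword list, and the tiny dataset are all made up for this example; they are assumptions, not the project's actual data or starter code.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression


def keyword_indicators(words, texts):
    """Return a 0/1 array where entry (i, j) is 1 if words[j] occurs in texts[i].

    Hypothetical helper for illustration only.
    """
    texts = pd.Series(texts)
    # One boolean column per keyword, stacked and transposed to (n_texts, n_words)
    return np.array(
        [texts.str.contains(w, regex=False).astype(int) for w in words]
    ).T


# Tiny made-up dataset; 1 = spam, 0 = ham
emails = pd.DataFrame({
    "email": [
        "win free money now",
        "free prize click here",
        "meeting notes attached",
        "lunch tomorrow at noon",
        "claim your free prize money",
        "project notes for the meeting",
    ],
    "spam": [1, 1, 0, 0, 1, 0],
})

# Assumed keyword list; choosing good keywords is the feature engineering step
keywords = ["free", "prize", "meeting", "notes"]
X = keyword_indicators(keywords, emails["email"])
y = emails["spam"].to_numpy()

# Fit a logistic regression on the indicator features
model = LogisticRegression()
model.fit(X, y)
train_accuracy = model.score(X, y)
print(f"Training accuracy: {train_accuracy:.2f}")
```

Training accuracy on six hand-picked emails says little about real performance; the point is only to show how text becomes a numeric feature matrix that sklearn can fit.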
Learning Goals
For Project B0:
- Formulate a clear data science question
- Understand/describe the dataset
- Plan data processing and representation
- Explore the available data
- Develop/justify feature engineering
For Project B1:
- Incorporate B0 findings to aid analysis
- Feature engineer with text data
- Use sklearn to process data and fit models
- Validate model performance and minimize overfitting
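The last two B1 goals can be sketched as follows: hold out a validation set, compare training and validation accuracy across regularization strengths, and cross-validate. This is a minimal sketch under stated assumptions; the example texts, the use of `CountVectorizer`, the candidate `C` values, and the helper `fit_and_score` are illustrative choices, not part of the project materials.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline

# Made-up corpus for illustration; 1 = spam, 0 = ham
spam_texts = [
    "win free money now",
    "free prize click here",
    "claim your free prize money",
    "limited offer buy now",
    "cheap loans apply today",
    "you won a free vacation",
]
ham_texts = [
    "meeting notes attached",
    "lunch tomorrow at noon",
    "project notes for the meeting",
    "can you review my draft",
    "see you at the lecture",
    "homework is due on friday",
]
texts = spam_texts + ham_texts
labels = [1] * len(spam_texts) + [0] * len(ham_texts)

# Hold out 25% of the emails for validation, keeping the class balance
X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)


def fit_and_score(C):
    """Fit a bag-of-words logistic regression; return (train, validation) accuracy."""
    model = make_pipeline(CountVectorizer(), LogisticRegression(C=C, max_iter=1000))
    model.fit(X_train, y_train)
    return model.score(X_train, y_train), model.score(X_val, y_val)


# Smaller C means stronger regularization; a large gap between training and
# validation accuracy is one sign of overfitting
results = {C: fit_and_score(C) for C in (0.01, 1.0, 100.0)}
for C, (train_acc, val_acc) in results.items():
    print(f"C={C}: train={train_acc:.2f}, validation={val_acc:.2f}")

# 3-fold cross-validation gives a less split-dependent estimate
cv_scores = cross_val_score(
    make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000)),
    texts,
    labels,
    cv=3,
)
print("Cross-validation accuracy per fold:", cv_scores)
```

With only twelve emails the scores are noisy, so the numbers themselves are not meaningful; the pattern of comparing training against held-out accuracy is what carries over to the full dataset.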
Student Assignment
The supplementary materials for Project B0 and Project B1 are in the README.md file of the GitHub repository. The relevant links are pasted below for convenience:
Instructor Guides
The supplementary solutions are stored in the [Google Drive] and are available only upon request.
Requirements
- Standard Data 100 libraries (e.g., pandas, numpy, otter-grader, sklearn)
- This is a partner project, so students should be able to choose or be assigned pairs, ideally through their discussion sections.
- [Suggested] Pensieve for AI-grading written components and autograding coding components.