Spam & Ham Projects, Extended - Data 100

Class Type: Principles and Techniques of Data Science

This project series has students build a binary classifier that can distinguish spam (junk, commercial, or bulk) emails from ham (regular non-spam) emails. This extended version emphasizes the planning and delivery stages of the data science lifecycle.

Authors:

Presentations

Slides: https://docs.google.com/presentation/d/1VUAtGDQjdyteAUpwGDWDbJhZR5We12t4FwonkA7JFM4/edit?usp=sharing

Project Source

GitHub: https://github.com/jegeronimo/cs294-189
Google Drive: https://drive.google.com/drive/folders/1e_nr7HPBZMsMUK9xQIoZ3V4fCKz13I_6?usp=sharing

Introduction
Learning Goals
Student Assignment
Instructor Guides
Requirements

Introduction

In the existing Project B1 and B2, the overarching goal is for students to create a binary classifier that can distinguish spam (junk, commercial, or bulk) emails from ham (regular non-spam) emails. In order to better enable students to start their careers as data scientists, we propose an extended project that provides exposure to the planning, ideation, and delivery stages of data science projects.

In Project B0, students build a design document that demonstrates their high-level understanding of the data (e.g., data processing, feature engineering). In Project B1, students create a simple binary classifier that can distinguish spam emails from ham emails. Here, they engineer features from text data, use the sklearn library to process data and fit models, validate the performance of the model, and minimize overfitting. Building on the design document from Project B0, Project B1 focuses on initial analysis, feature engineering, and logistic regression.
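The Project B1 workflow described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual starter code: the toy emails, the column names `email` and `spam`, the helper `words_in_texts`, and the chosen word list are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical toy data; the real project dataset and its column names may differ.
emails = pd.DataFrame({
    "email": [
        "WIN a FREE prize now", "free money, click here", "limited offer: buy now",
        "click to claim your free gift", "cheap meds, buy today", "you win! claim the offer",
        "meeting moved to 3pm", "notes from today's lecture", "lunch tomorrow?",
        "project draft attached", "see you at the meeting", "homework due Friday",
    ],
    "spam": [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
})

def words_in_texts(words, texts):
    """Hypothetical helper: entry (i, j) is 1 if words[j] occurs in texts[i].

    Matching is a case-insensitive substring check; words are assumed lowercase.
    """
    texts = pd.Series(texts).str.lower()
    indicators = [texts.str.contains(w, regex=False).to_numpy() for w in words]
    return np.array(indicators).T.astype(int)

# Hold out a validation set so performance is measured on unseen emails.
train, val = train_test_split(
    emails, test_size=0.25, random_state=42, stratify=emails["spam"]
)

# Feature engineering: one 0/1 column per candidate word (an illustrative list).
words = ["free", "win", "buy", "click", "offer", "meeting"]
X_train = words_in_texts(words, train["email"])
X_val = words_in_texts(words, val["email"])

# Fit a logistic regression classifier on the engineered features.
model = LogisticRegression()
model.fit(X_train, train["spam"])

# A large gap between these two accuracies is one sign of overfitting.
train_acc = model.score(X_train, train["spam"])
val_acc = model.score(X_val, val["spam"])
print(f"training accuracy: {train_acc:.2f}, validation accuracy: {val_acc:.2f}")
```

With only twelve toy emails the accuracies themselves are not meaningful; the point is the shape of the pipeline: split, engineer features, fit, then compare training and validation performance.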

Learning Goals

For Project B0:

Formulate a clear data science question
Understand/describe the dataset
Plan data processing and representation
Explore the available data
Develop/justify feature engineering

For Project B1:

Incorporate B0 findings to aid analysis
Engineer features from text data
Use sklearn to process data and fit models
Validate model performance and minimize overfitting
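One way the last two B1 goals might look in practice is cross-validating a text pipeline at several regularization strengths. The sketch below is an assumption-laden illustration rather than the project's prescribed method: it swaps in sklearn's `CountVectorizer` for feature extraction, and the toy texts, labels, and the grid of `C` values are invented for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the project's emails.
texts = [
    "WIN a FREE prize now", "free money, click here", "limited offer: buy now",
    "click to claim your free gift", "cheap meds, buy today", "you win! claim the offer",
    "meeting moved to 3pm", "notes from today's lecture", "lunch tomorrow?",
    "project draft attached", "see you at the meeting", "homework due Friday",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

results = {}
for C in [0.01, 1.0, 100.0]:  # in sklearn, smaller C means stronger regularization
    # Putting the vectorizer inside the pipeline means the vocabulary is
    # rebuilt from the training folds only, so validation folds stay unseen.
    pipeline = make_pipeline(
        CountVectorizer(binary=True),
        LogisticRegression(C=C, max_iter=1000),
    )
    scores = cross_val_score(pipeline, texts, labels, cv=3)
    results[C] = scores.mean()

for C, mean_acc in results.items():
    print(f"C={C}: mean cross-validated accuracy {mean_acc:.2f}")
```

Comparing the mean cross-validated accuracy across values of `C` gives students a concrete way to choose a regularization strength without touching a final held-out test set.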

Student Assignment

The supplementary materials for Project B0 and Project B1 can be found in the README.md file of the GitHub repository. The relevant links are pasted below for convenience:

Instructor Guides

The supplementary solutions are stored in the Google Drive folder and are available only upon request.

Requirements

Standard Data 100 libraries (e.g., pandas, numpy, otter-grader, sklearn)
This is a partner project, so students should be able to choose or be assigned partners, ideally through their discussion sections.
[Suggested] Pensieve for AI-grading written components and autograding coding components.