Spam & Ham Projects, Extended - Data 100

Class Type: Principles and Techniques of Data Science

This project series has students build a binary classifier that can distinguish spam (junk, commercial, or bulk) emails from ham (regular non-spam) emails. This extended version emphasizes the planning and delivery stages of the data science lifecycle.

Authors:

Presentations

Project Source

Table of Contents

  1. Introduction
  2. Learning Goals
  3. Student Assignment
  4. Instructor Guides
  5. Requirements

Introduction

In the existing Projects B1 and B2, the overarching goal is for students to create a binary classifier that can distinguish spam (junk, commercial, or bulk) emails from ham (regular non-spam) emails. To better prepare students for careers as data scientists, we propose an extended project that provides exposure to the planning, ideation, and delivery stages of data science projects.

In Project B0, students write a design document that demonstrates their high-level understanding of the data (e.g., data processing, feature engineering). In Project B1, students create a simple binary classifier that can distinguish spam emails from ham emails. Here, they engineer features from text data, use the sklearn library to process data and fit models, validate the performance of the model, and minimize overfitting. Building on the design document from Project B0, Project B1 focuses on initial analysis, feature engineering, and logistic regression.
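As a rough illustration of what feature engineering on text data can look like, the sketch below turns each email into a row of 0/1 indicators, one per keyword. The function name `words_in_texts`, the keyword list, and the example emails are all assumptions made up for this illustration; they are not taken from the project materials.

```python
def words_in_texts(words, texts):
    """Return one row per text; entry j is 1 if words[j] occurs in that text, else 0.

    Hypothetical helper for illustration. Matching is case-insensitive
    substring matching, so "free" would also match "freedom".
    """
    return [[int(word in text.lower()) for word in words] for text in texts]


# Made-up example emails, not drawn from the project dataset.
emails = [
    "Claim your FREE prize now",
    "Meeting moved to 3pm tomorrow",
]
keywords = ["free", "prize", "meeting"]

features = words_in_texts(keywords, emails)
print(features)  # [[1, 1, 0], [0, 0, 1]]
```

Substring matching keeps the sketch short, but it is a deliberate simplification. A design document could weigh it against alternatives such as whole-word matching or word counts.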

Learning Goals

For Project B0:

  • Formulate a clear data science question
  • Understand/describe the dataset
  • Plan data processing and representation
  • Explore the available data
  • Develop/justify feature engineering

For Project B1:

  • Incorporate B0 findings to aid analysis
  • Feature engineer with text data
  • Use sklearn to process data and fit models
  • Validate model performance and minimize overfitting
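The B1 goals above can be sketched end to end with sklearn: hold out a validation set, fit a logistic regression on word-presence features, and compare training accuracy against validation accuracy. This is a minimal sketch, not the project's solution. The emails and labels are invented, and the choices of `CountVectorizer(binary=True)`, `test_size=0.25`, and `C=1.0` are assumptions made only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy emails invented for illustration; labels: 1 = spam, 0 = ham.
emails = [
    "Win a free prize today",
    "Free money, click now",
    "Cheap loans approved instantly",
    "Claim your bonus offer now",
    "Lecture notes for Tuesday attached",
    "Can we meet after discussion section?",
    "Project partner sign-up form",
    "Lunch tomorrow at noon?",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Hold out 25% of the emails for validation, keeping the class balance.
X_train, X_val, y_train, y_val = train_test_split(
    emails, labels, test_size=0.25, random_state=42, stratify=labels
)

# Binary word-presence features feeding a logistic regression.
# A smaller C means stronger regularization, one lever against overfitting.
model = make_pipeline(
    CountVectorizer(binary=True),
    LogisticRegression(C=1.0, max_iter=1000),
)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
val_acc = model.score(X_val, y_val)
print(f"training accuracy:   {train_acc:.2f}")
print(f"validation accuracy: {val_acc:.2f}")
```

A large gap between the two accuracies is one common sign of overfitting. With a dataset this small the numbers carry no real meaning, so the sketch shows only the workflow.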

Student Assignment

The supplementary materials for Project B0 and Project B1 can be found in the README.md file of the GitHub repository. The relevant links are pasted below for convenience:

Instructor Guides

The supplementary solutions are stored in the [Google Drive] and are available only upon request.

Requirements

  • Standard Data 100 libraries (e.g., pandas, numpy, otter-grader, sklearn)
  • This is a partner project, so students should be able to choose a partner or be assigned one, ideally through their discussion sections.
  • [Suggested] Pensieve for AI-grading written components and autograding coding components.