08/21 |
Introduction |
Course Introduction and Logistics
Introduction to Part 1: Data Cleaning and ML |
Xu Chu |
08/23 |
Introduction |
Introduction to Part 2: Data Exploration and ML
Introduction to Part 3: Systems and ML
Discussions of sample course projects
|
Xu Chu |
Optional reading |
Introduction |
Data Management Challenges in Production Machine Learning
Data Management in Machine Learning: Challenges, Techniques, and Systems
|
N.A.
|
08/28 |
N.A. |
No Class (Instructor at VLDB) |
N.A. |
08/30 |
N.A. |
No Class (Instructor at VLDB) |
N.A. |
Part 1: Data Cleaning and ML |
09/04 |
ML for Data Deduplication (1)
|
Introduction to papers in this class
Interactive Deduplication using Active Learning
On active learning of record matching packages |
Xu Chu
Xiang Cheng
Alex Mueller |
09/06 |
ML for Data Deduplication (2)
|
Introduction to papers in this class
Distributed Representations of Tuples for Entity Resolution
Deep Learning for Entity Matching: A Design Space Exploration |
Xu Chu
Yuhong Wang
Omar Sharifali |
09/11 |
ML for Data Deduplication (3) |
Introduction to papers in this class
CrowdER: Crowdsourcing Entity Resolution
Distributed Data Deduplication |
Xu Chu
Alex Mueller
Thibaut Boissin |
Optional reading |
ML for Data Deduplication (4)
|
Duplicate Record Detection: A Survey |
N.A.
|
09/13 |
ML for Data Cleaning (1)
|
Introduction to papers in this class
Detecting Data Errors: Where are we and what needs to be done?
HoloClean: Holistic Data Repairs with Probabilistic Inference |
Xu Chu
Florina Dutt
Zhuoran Yu |
09/18 |
ML for Data Cleaning (2)
|
Introduction to papers in this class
Tracing Data Errors with View-Conditioned Causality
Data X-Ray: A Diagnostic Tool for Data Errors |
Xu Chu
Thibaut Boissin
Saurabh Sawlani |
Optional reading |
ML for Data Cleaning (3)
|
Data Cleaning is a ML Problem that Needs Data Systems Help |
N.A. |
09/20 |
Data Cleaning for ML (1)
|
Introduction to papers in this class
A Sample-and-Clean Framework for Fast and Accurate Query Processing on Dirty Data
ActiveClean: Interactive Data Cleaning For Statistical Modeling |
Xu Chu
Sanya Chaba
Peng Li |
09/25 |
Data Cleaning for ML (2)
|
Introduction to papers in this class
Cleaning Crowdsourced Labels Using Oracles For Supervised Learning
BoostClean: Automated Error Detection and Repair for Machine Learning |
Xu Chu
Jennifer Blase
Xinran Shi |
Optional reading |
Data Cleaning for ML (3)
|
Impacts of Dirty Data: an Experimental Evaluation
|
N.A. |
09/27 |
Data Wrangling/Transformation |
Introduction to papers in this class
Potter’s Wheel: An Interactive Data Cleaning System
Transform-Data-by-Example (TDE): Extensible Data Transformation using Functions
|
Xu Chu
Matthew Britton
Prashanth Dintyala |
10/02 |
Training Data Enrichment |
Introduction to papers in this class
Snorkel: Rapid Training Data Creation with Weak Supervision
Combining Labeled and Unlabeled Data with Co-Training |
Xu Chu
Pranshu Trivedi
Peng Li |
10/04 |
Boosting
|
Introduction to papers in this class
Multi-class AdaBoost
XGBoost: A Scalable Tree Boosting System |
Xu Chu
Jayant Prakash
Zhanhao Liu |
Part 2: Data Exploration and ML |
10/09 |
N.A. |
No Class (Fall Recess) |
N.A. |
10/11 |
Relational Data Profiling (1)
|
Introduction to papers in this class
TANE: An Efficient Algorithm for Discovering Functional and Approximate Dependencies
FastFDs: A Heuristic-Driven, Depth-First Algorithm for Mining Functional Dependencies from Relation Instances |
Xu Chu
Zhanhao Liu
Nilaksh Das |
10/16 |
Relational Data Profiling (2)
|
Introduction to papers in this class
Discovering Denial Constraints
Efficient Denial Constraint Discovery with Hydra |
Xu Chu
Yuhong Wang
Xinran Shi |
Optional reading |
Relational Data Profiling (3)
|
Functional Dependency Discovery: An Experimental Evaluation of Seven Algorithms
Profiling Relational Data – A Survey |
N.A. |
10/18 |
Model Interpretation (1)
|
Introduction to papers in this class
“Why Should I Trust You?” Explaining the Predictions of Any Classifier
Anchors: High-Precision Model-Agnostic Explanations |
Xu Chu
Yafei Zhang
Saurabh Sawlani |
10/23 |
Model Interpretation (2)
|
Introduction to papers in this class
A Unified Approach to Interpreting Model Predictions
Peeking Inside the Black Box: Visualizing Statistical Learning with Plots of Individual Conditional Expectation |
Xu Chu
Yue Hu
Yafei Zhang |
Optional reading |
Model Interpretation (3)
|
Interpretable ML Symposium
Interpretable ML by H2O |
N.A. |
Optional reading |
Visualization and ML (1)
|
Visual Exploration of Machine Learning Results using Data Cube Analysis
ACTIVIS: Visual Exploration of Industry-Scale Deep Neural Network Models |
N.A. |
Optional reading |
Visualization and ML (2)
|
Recent progress and trends in predictive visual analytics |
N.A. |
10/25 |
Feature Engineering (1)
|
Introduction to papers in this class
Deep Feature Synthesis: Towards Automating Data Science Endeavors
ExploreKit: Automatic Feature Generation and Selection |
Xu Chu
Jayant Prakash
Andrea Hu |
10/30 |
Feature Engineering (2)
|
Introduction to papers in this class
One button machine for automating feature engineering in relational databases
Feature Engineering for Predictive Modeling using Reinforcement Learning |
Xu Chu
Wendi Du
Yue Zhang |
Optional reading |
Feature Engineering (3)
|
Discover Feature Engineering, How to Engineer Features and How to Get Good at It
An Introduction to Variable and Feature Selection |
N.A. |
Part 3: Systems and ML |
11/01 |
Managing ML Pipeline (1)
|
Introduction to papers in this class
TFX: A TensorFlow-Based Production-Scale Machine Learning Platform
ProvDB: A System for Lifecycle Management of Collaborative Analysis Workflows |
Xu Chu
Jennifer Blase
Nidhi Menon |
11/06 |
Managing ML Pipeline (2)
|
Introduction to papers in this class
MODELDB: A System for Machine Learning Model Management
Towards Unified Data and Lifecycle Management for Deep Learning
|
Xu Chu
Prashanth Dintyala
Pranshu Trivedi |
Optional reading |
Managing ML Pipeline (3)
|
A Berkeley View of Systems Challenges for AI |
N.A. |
11/08 |
Training Set Debug (1)
|
Introduction to papers in this class
Training Set Debugging Using Trusted Items
Flipper: A Systematic Approach to Debugging Training Sets
|
Xu Chu
Yue Zhang
Sneha Venkatachalam |
11/13 |
Training Set Debug (2)
|
Introduction to papers in this class
Understanding Black-box Predictions via Influence Functions
Examples are not Enough, Learn to Criticize! Criticism for Interpretability |
Xu Chu
Nidhi Menon
Yue Hu |
11/15 |
Reducing Training Set
|
Introduction to papers in this class
LightGBM: A Highly Efficient Gradient Boosting Decision Tree
BlinkML: Approximate Machine Learning with Probabilistic Guarantees
|
Xu Chu
Eric Qin
Xiang Cheng |
Optional reading
|
Reducing Training Set |
DCNNs on a Diet: Sampling Strategies for Reducing the Training Set Size |
N.A.
|
11/20 |
Catastrophic forgetting |
Introduction to papers in this class
Overcoming catastrophic forgetting in neural networks
Measuring Catastrophic Forgetting in Neural Networks
Understanding Black-box Predictions via Influence Functions
|
Xu Chu
Eric Qin
Zhuoran Yu |
Optional reading |
Debug ML
|
Why is machine learning 'hard'? |
N.A. |
Course Project Presentations |
11/22 |
N.A. |
No Class (Thanksgiving) |
N.A. |
11/27 |
Final Project |
TBD |
Student |
11/29 |
Final Project |
TBD |
Student |
12/04 |
Final Project |
TBD |
Student |