Research

Our research centers on two important and practical areas of computer science: data management and machine learning. Our projects generally fall into two themes: using data mining, ML, and statistics to solve data management problems, and developing principled data processing systems to tackle ML challenges. We aim both to develop principled algorithms and techniques with theoretical guarantees and to build practical, usable, and scalable end-to-end systems.

T1: ML for Data Management

Big data management problems generally fall under the 3Vs (volume, variety, and velocity). While there is a considerable amount of research and many commercial products addressing the volume and velocity challenges (e.g., RDBMS, MapReduce, Spark), the variety challenges have far fewer products and present the biggest hurdle in data management. A survey about the state of data science and machine learning reveals that dirty data is the most common barrier faced by workers dealing with data. With the popularity of data science, it has become increasingly evident that data curation, unification, preparation, and cleaning are key enablers in unleashing the value of data, according to The New York Times. Not surprisingly, developing effective and efficient data management solutions that address the data variety challenges is an extremely timely and important topic, and one that is rife with deep theoretical and engineering problems.

A non-comprehensive list of projects we actively work on in this theme is as follows. Project websites and source code will be released when available and appropriate.

  • Automatic Fuzzy Join
  • Automated Entity Resolution
  • Data Cleaning and Integration

T2: Data Management for ML

Machine learning (ML) is increasingly used by organizations to gain insights from data and to solve a diverse set of important problems, ranging from traditional applications such as fraud detection, product recommendation, and customer churn prediction, to more challenging and modern applications such as image recognition, natural language understanding, and even health care and self-driving cars.

While research on the theoretical understanding of ML models and on better and faster training algorithms is certainly interesting and valuable, we believe the more pressing challenge in applying ML to real-world applications is composing high-quality ML pipelines, which consist of many data management steps besides model training. In fact, in an ML pipeline, model training may well be one of the easiest steps, thanks to the many available ML packages such as scikit-learn, TensorFlow, and PyTorch. On the other hand, there are surprisingly few tools for curating sufficient high-quality training data, engineering and selecting high-quality features, interpreting and evaluating models, and managing and debugging different pipelines. This is exactly the focus of our second research theme.

A non-comprehensive list of projects we actively work on in this theme is as follows. Project websites and source code will be released when available and appropriate.