
Overview

An important learning objective of this course is to get hands-on research experience. Therefore, as part of this course, you need to complete a semester-long research project. Each project team can have up to 3 students, and we expect work proportional to team size. To help you make consistent progress towards the final presentation, we have set up a few milestones over the course of the semester. Research projects will mainly be evaluated based on completeness and clarity of presentation.

Timeline

Here’s a rough timeline of the key milestones:

  • Week 3: Finalize project team
  • Week 5: Project Proposal
  • Week 10: Project Update
  • Week 15: Peer Review
  • Week 16: Final Presentation

FAQs

What counts as a research project?

The primary requirement is that the project 1) is related to course topics and 2) contains some element of research (i.e., something new). In other words, reimplementing a piece of software that someone else proposed, without any significant extension or modification, does not count as a research project.

There are many forms of novelty: a novel problem and solution, a novel solution to an existing problem, a novel application of existing solutions, a novel implementation and evaluation, or even a cool new dataset. To help you determine whether your project idea is of a reasonable scope, all teams are required to meet with Kexin at least once prior to the project proposal deadline.

If you are a graduate student or have an existing research project, “reusing” that project for this course’s topics is encouraged.

How are projects evaluated?

One of the learning objectives of the class is for you to get hands-on experience with conducting research. Research projects can vary greatly in scope and complexity and are also highly dependent on your background and skills. Therefore, from a grading perspective, we focus more on the “completeness” of the project, namely:

  • Is the problem/hypothesis well-defined and motivated?
  • Is the related work section thorough?
  • Does the evaluation have the appropriate metrics/experiments for testing the main hypothesis?
  • Is the writing overall clear and easy to follow for a technical expert in the field?

We do not evaluate the project based on the “interestingness” of the ideas.

What are different types of projects?

Projects can come in different flavors:

  • Research project: identify a new problem or task, propose or extend a solution, evaluate and report findings. The solution can be a new system, a new tool, a new interface or a new algorithm.
  • Benchmarking and analysis: extensive evaluation of algorithms, data structures, and systems that are of wide interest. The novelty in benchmarking papers comes from 1) new insights about the strengths and weaknesses of existing methods, or 2) new ways to evaluate existing methods, such as by curating new datasets and scenarios.
  • Reproduce and extend: there are many papers that describe an idea in theoretical terms, or that implement their ideas in a different context (e.g., under assumptions that no longer hold on new hardware). Thoroughly understanding a paper (or collection of papers) and reproducing the main ideas in a new context can often lead to new findings and extensions.

I need help with research problems.

Identifying a research problem can be challenging (but also rewarding!), especially if you have not done so before. Don’t know where to start? Have a fuzzy idea? Want some feedback on your current idea? Please come to office hours and we are here to help!

Unsolicited Project Ideas

The list below offers examples of possible projects. Keep in mind that this list is not exhaustive, and you are fully encouraged to come up with a project topic that interests you personally. In fact, a common source of ideas is to take your experience from another domain, and combine it with ideas from the class. Another approach is to take concepts from the papers we read, and apply them to another domain.

If any of the projects listed matches your interests, you are welcome to come discuss them during the instructor’s OH.

Evaluate Copilot in recommending data preparation steps

  • Nearest neighbor paper: Auto-Suggest: Learning-to-Recommend Data Preparation Steps Using Data Science Notebooks
  • Objective: Examine the efficiency and effectiveness of GitHub Copilot in suggesting data preparation steps by comparing its performance to the methods proposed in the Auto-Suggest paper.
  • Both GitHub Copilot and Auto-Suggest models are trained on publicly available code (e.g., the paper trained models over 4M Jupyter notebooks crawled on GitHub). How well does a generic AI-based auto-complete tool like GitHub Copilot suggest data preparation steps? Is it comparable to a more specialized model such as the one trained in the Auto-Suggest paper?
  • Related reading: Assessing the Quality of GitHub Copilot’s Code Generation
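One way to make the comparison concrete is a small evaluation harness that scores each tool's ranked suggestions against ground-truth next steps using top-k accuracy, a standard metric for this kind of recommendation task. The sketch below is purely illustrative: the operator names and rankings are made up, and a real study would extract both from actual notebooks and Copilot completions.

```python
# Hypothetical evaluation harness (all data here is illustrative): given the
# ground-truth next data-prep operator for each notebook cell and a tool's
# ranked suggestions, compute top-k accuracy.

def top_k_accuracy(ground_truth, suggestions, k=3):
    """Fraction of cases where the true operator appears in the top-k suggestions."""
    hits = sum(1 for truth, ranked in zip(ground_truth, suggestions)
               if truth in ranked[:k])
    return hits / len(ground_truth)

# Toy example: true next step per cell, and a tool's ranked guesses for each.
truth = ["dropna", "merge", "pivot"]
copilot_ranked = [["fillna", "dropna"], ["merge", "concat"], ["groupby", "melt"]]

print(top_k_accuracy(truth, copilot_ranked, k=2))  # 2 of 3 hits -> ~0.667
```

Running the same harness over both Copilot's and Auto-Suggest's outputs on a shared test set would give a directly comparable number per tool and per value of k.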

Impact of data cleaning on visualization recommendations

  • Nearest neighbor paper: Lux: always-on visualization recommendations for exploratory dataframe workflows
  • Objective: Investigate how different data cleaning methods impact visualization recommendations
  • Lux automatically recommends “interesting” visualizations from your pandas dataframe. However, it does not currently handle dirty data. Do different data cleaning methods have an effect on the visualizations recommended by Lux? If so, how and to what extent?
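A starting point for this project is to observe that different cleaning strategies can shift the very column statistics Lux's recommendations are driven by. The sketch below uses pandas only (Lux itself attaches to dataframes via `import lux`, omitted here); the toy dataframe and the two strategies compared are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Minimal sketch: apply two cleaning strategies to a toy dataframe with
# missing values and an outlier, and compare the resulting column statistics.
# (Lux attaches to pandas via `import lux`; omitted so the sketch runs standalone.)
df = pd.DataFrame({"price": [10.0, 12.0, np.nan, 11.0, 200.0],
                   "units": [1, 2, 3, np.nan, 5]})

dropped = df.dropna()                               # strategy 1: drop incomplete rows
imputed = df.fillna(df.median(numeric_only=True))   # strategy 2: median imputation

print(dropped["price"].mean())  # outlier dominates the 3 surviving rows
print(imputed["price"].mean())  # all 5 rows kept; a different distribution
```

The experiment would then record which charts Lux recommends (e.g., via `df.recommendation`) under each cleaned variant and quantify how often, and how, the recommendation sets diverge.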

Data preprocessing transferability

Sketch-based labeling interface for time series anomaly detection

Cross-modality labeling interfaces

Similarity search and beyond