Project Proposal
DUE WEDNESDAY, OCTOBER 1 AT 11:59PM.
Overview
You can choose one of two project options: Hybrid Vector Search (Option 1) or Replicating Research (Option 2).
Team Formation
You will work in teams of 3-4 people. Smaller or larger teams require instructor approval.
- Use Piazza to find teammates and form groups early in the semester.
- Once you’ve finalized your team, register it on Canvas (People → Project Groups) so other students can see who is still looking for teammates.
- Team size will be considered when setting project scope and grading expectations.
Timeline and Deliverables
| Week | Deliverable | Weight | Description |
|------|-------------|--------|-------------|
| W7 | Project Proposal | 5% | Detailed proposal outlining approach, implementation, and evaluation plan |
| W12 | Milestone Report | 5% | Progress update with implementation status and preliminary results |
| W16 | Final Report | 15% | Complete project including final report and code |
Option 1: Hybrid Vector Search
You will design and implement a hybrid search system that efficiently combines approximate nearest neighbor (ANN) search over vector embeddings with structured filtering on metadata attributes (e.g., category, price, or timestamp).
Hybrid search systems are essential in many production scenarios, such as e-commerce, enterprise search, and content discovery platforms. Here are some examples of hybrid search queries:
- Find the 10 most similar product images to this query image, but only consider items under $50 in the electronics category.
- Retrieve papers most semantically similar to this text, filtered by publication date between 2020-2023 and author affiliation with Georgia Tech.
- Search for houses similar to this description, within price range $300K-$500K, built after 2010, and in neighborhoods with rating >= 4.0.
Requirements
- Performance: Your solution must outperform the following two naive approaches. You are expected to implement these two baselines as part of your evaluation for fair comparison.
- Pre-filtering: Apply metadata filters first, then perform brute-force ANN search on the reduced subset.
- Post-filtering: Run ANN search on the full dataset followed by metadata filtering.
- Dataset Scale: We expect a reasonably large dataset with at least 100,000 data points to demonstrate the scalability benefits of your approach.
- Implementation Expectations: The project will be graded based on implementation and evaluation quality rather than algorithmic novelty. You are encouraged to draw inspiration from papers and blog posts, but the project needs significant original implementation work.
- Acceptable: Building your solution on existing ANN search indices (e.g., IVF-PQ from the faiss library).
- Not acceptable: Directly using research artifacts that contain specialized hybrid search implementations.
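As a concrete reference, the two baselines above can be sketched in a few lines of Python. This is a minimal, exact-search toy (all names are illustrative); a real implementation would use an ANN index and vectorized distance computations:

```python
import math

def l2(a, b):
    """Euclidean distance between two vectors given as plain lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pre_filter_search(query, vectors, metadata, predicate, k):
    """Baseline 1 (pre-filtering): apply the metadata filter first,
    then brute-force nearest-neighbor search on the reduced subset."""
    candidates = [i for i, m in enumerate(metadata) if predicate(m)]
    candidates.sort(key=lambda i: l2(query, vectors[i]))
    return candidates[:k]

def post_filter_search(query, vectors, metadata, predicate, k, fetch=100):
    """Baseline 2 (post-filtering): rank the full dataset by distance,
    then drop results that fail the filter. Note the recall pitfall:
    if fewer than k of the top `fetch` hits pass the filter, the
    result set comes up short."""
    ranked = sorted(range(len(vectors)), key=lambda i: l2(query, vectors[i]))
    hits = [i for i in ranked[:fetch] if predicate(metadata[i])]
    return hits[:k]
```

Your actual baselines should swap the exact scan for ANN search, but the trade-off is the same: pre-filtering pays a full scan over the filtered subset, while post-filtering wastes work on (and may miss) results that the filter would reject.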
References
- ANN Libraries (Source: ANN benchmark)
- Datasets:
- SIFT1m: Images represented by 128-dimension SIFT descriptors
- Wikipedia: Wikipedia articles
- LAION-400M: Image dataset with text descriptions
- Papers:
- AnalyticDB-V: A Hybrid Analytical Engine Towards Query Fusion for Structured and Unstructured Data
- Milvus: A Purpose-Built Vector Data Management System
- High-Throughput Vector Similarity Search in Knowledge Graphs
- VBase: Unifying Online Vector Similarity Search and Relational Queries via Relaxed Monotonicity
- ACORN: Performant and Predicate-Agnostic Search Over Vector Embeddings and Structured Data
Proposal Format
Your proposal should cover the following:
- Proposed Approach
- Provide a detailed sketch of your proposed solution. What search strategy will you implement, and under what scenarios do you expect it to outperform the baselines?
- Cite all sources referenced in developing your proposed solution.
- Implementation Plan
- What programming language and libraries do you plan to use? Clearly distinguish between what you will reuse versus implement from scratch.
- Briefly describe your baseline implementation strategy to ensure fair comparison with your proposed solution.
- Dataset and Query Workload
- Embedding Dataset: Which dataset will you use? Specify the number of vectors, dimensionality, and similarity metric (e.g., cosine, L2). If generating your own embeddings, briefly describe the data generation pipeline (source data, embedding model, chunking strategies).
- Metadata Schema: Describe the metadata schema and provide summary statistics for each column (e.g., value ranges, number of distinct values, data types). If generating synthetic metadata, explain the distributions you will use.
- Query Workload: Design queries with varying selectivity levels and multi-column filters. You can include both point queries (exact matches) and range queries.
- Evaluation Plan
- Metrics: Define key performance metrics for your system (e.g., search latency, recall@k, memory usage, index build time).
- Experimental Setup: Outline your evaluation methodology based on your chosen dataset and query workload. What defines a “success” for your project?
- Expected Results: Sketch your anticipated main results. What will the x and y axes represent? What trends do you expect in the ideal case? (Hand-drawn sketches are acceptable.)
- Timeline and Division of Work
- What do you plan to complete by the milestone report deadline? At a high level, how do you plan to divide the work among the team members?
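Two of the quantities above are easy to pin down in code: filter selectivity for characterizing your query workload, and recall@k for your evaluation metrics. A minimal sketch (names are illustrative):

```python
def selectivity(metadata, predicate):
    """Fraction of rows that pass the metadata filter.
    Vary this across queries (e.g., 0.01, 0.1, 0.5) in your workload."""
    return sum(1 for m in metadata if predicate(m)) / len(metadata)

def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true top-k neighbors (from exact brute-force
    search) that the approximate search returned; order-insensitive."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k
```

Ground-truth IDs for recall@k are typically computed once with an exact scan over the filtered dataset and cached alongside the query workload.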
Option 2: Replicating Research
You will replicate a prior result from a database research paper, following the ACM’s definitions of repeatability, reproducibility, and replicability:
- Repeatability (Same team, same setup): The original team can consistently reproduce their own results using the same methods, tools, and data.
- Reproducibility (Different team, same setup): An independent team can verify the results using the original authors’ artifacts (code, data, and methods).
- Replicability (Different team, different setup): A new team independently re-implements the experiment (with different tools/data) and confirms the conclusions.
For this project, you will focus on replicability: developing your own implementation to validate the paper’s claims.
Where to find papers?
Your selected paper must be published in one of these venues:
- PACMMOD (Proceedings of the ACM on Management of Data)
- SIGMOD (ACM Special Interest Group on Management of Data)
- PVLDB (Proceedings of the VLDB Endowment)
- ICDE (International Conference on Data Engineering)
A good first step is to look over the programs from recent SIGMOD, VLDB, and ICDE conferences. Identify sessions/topics that interest your team. Skim paper titles/abstracts before selecting one.
Here are some example papers for your inspiration:
- The Case for Learned Index Structures
- Bao: Making Learned Query Optimization Practical
- Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes
- Lux: always-on visualization recommendations for exploratory dataframe workflows
- Understanding and Benchmarking the Impact of GDPR on Database Systems
- An Empirical Evaluation of Columnar Storage Formats
- Pollock: A Data Loading Benchmark
Proposal Format
Your proposal should cover the following:
- Paper Selection & Citation
- Please give a full citation that includes the names of all authors, the title, the venue that published it, the year of publication, and a URL.
- Intellectual Contribution
- What is the relevant intellectual contribution of this paper that you will focus on? Explain this in language that is accessible to somebody who has not read the paper.
- Example: “The paper contributes a new method/technique/approach for solving the problem of A, which arises in the context of B. The new method/technique/approach is C.” Or: “This paper describes an interesting measurement/discovery/finding in the area of A, which is important in the context of B because of C. The discovery/finding is reported to be D.”
- What was the “status quo” before the paper arrived? I.e., what was the state of knowledge prior to its intellectual contributions?
- Result/Claim to Replicate
- What is the result or claim that you will replicate, related to the intellectual contribution you identified above? Identify a specific figure/table/number, cite where it appears in the paper, and include a screenshot or excerpt in your report.
- In your own words, what is the result saying? E.g., “The experiments found that the new method/technique/approach produces a 13% improvement in query time compared with the best prior work”.
- Replication Plan
- Briefly outline the steps your team will take to replicate this result. What infrastructure or software will you reuse from prior work? What will you build by following the paper’s instructions? What part(s)/lines/numbers from the figure or text you screenshotted will you attempt to replicate?
- Draw a “sketch” of what the resulting replication might look like. E.g. if you’re planning to redraw two of the lines from a graph, draw those lines freehand just to illustrate what the result of your replication might be. (Okay to do this by hand.)
- Timeline and Division of Work
- What do you plan to complete by the milestone report deadline? At a high level, how do you plan to divide the work among the team members?
Submission and Grading
Use the LaTeX template for the proposal here.
Submit a PDF with your project proposal on Canvas. Only one submission is required per group.
This assignment is worth 5% of your grade. We will evaluate it based on the clarity of the report, the completeness of the design, and the overall description of your approach.