Project Proposal
DUE WEDNESDAY, OCTOBER 1 AT 11:59PM.
Overview
You can choose one of two project options: Hybrid Vector Search (Option 1) or Replicating Research (Option 2).
Team Formation
You will work in teams of 3-4 people. Smaller or larger teams require instructor approval.
- Use Piazza to find teammates and form groups early in the semester.
- Once you’ve finalized your team, register it on Canvas (People → Project Groups) so other students can see who is still looking for teammates.
- Team size will be considered when setting project scope and grading expectations.
Timeline and Deliverables
| Week | Deliverable | Weight | Description |
|------|-------------|--------|-------------|
| W7 | Project Proposal | 5% | Detailed proposal outlining approach, implementation, and evaluation plan |
| W12 | Milestone Report | 5% | Progress update with implementation status and preliminary results |
| W16 | Final Report | 15% | Complete project including final report and code |
Option 1: Hybrid Vector Search
You will design and implement a hybrid search system that efficiently combines approximate nearest neighbor (ANN) search over vector embeddings with structured filtering on metadata attributes (e.g., category, price, or timestamp).
Hybrid search systems are essential in many production scenarios, such as e-commerce, enterprise search, and content discovery platforms. Here are some examples of hybrid search queries:
- Find the 10 most similar product images to this query image, but only consider items under $50 in the electronics category.
- Retrieve papers most semantically similar to this text, filtered by publication date between 2020-2023 and author affiliation with Georgia Tech.
- Search for houses similar to this description, within price range $300K-$500K, built after 2010, and in neighborhoods with rating >= 4.0.
Requirements
- Performance: Your solution must outperform the following two naive approaches. You are expected to implement these two baselines as part of your evaluation for fair comparison.
- Pre-filtering: Apply metadata filters first, then perform brute-force ANN search on the reduced subset.
- Post-filtering: Run ANN search on the full dataset followed by metadata filtering.
- Dataset Scale: We expect a reasonably large dataset with at least 100,000 data points to demonstrate the scalability benefits of your approach.
- Implementation Expectations: The project will be graded based on implementation and evaluation quality rather than algorithmic novelty. You are encouraged to draw inspiration from papers and blog posts, but the project needs significant original implementation work.
- Acceptable: Building your solution on existing ANN search indices (e.g., IVF-PQ from the faiss library).
- Not acceptable: Directly using research artifacts that contain specialized hybrid search implementations.
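As a concrete reference, the two baselines above can be sketched in a few lines of Python. This is a minimal, exact-search toy (all names are illustrative); a real implementation would use an ANN index and vectorized distance computations:

```python
import math

def l2(a, b):
    """Euclidean distance between two vectors given as plain lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pre_filter_search(query, vectors, metadata, predicate, k):
    """Baseline 1 (pre-filtering): apply the metadata filter first,
    then brute-force nearest-neighbor search on the reduced subset."""
    candidates = [i for i, m in enumerate(metadata) if predicate(m)]
    candidates.sort(key=lambda i: l2(query, vectors[i]))
    return candidates[:k]

def post_filter_search(query, vectors, metadata, predicate, k, fetch=100):
    """Baseline 2 (post-filtering): rank the full dataset by distance,
    then drop results that fail the filter. Note the recall pitfall:
    if fewer than k of the top `fetch` hits pass the filter, the
    result set comes up short."""
    ranked = sorted(range(len(vectors)), key=lambda i: l2(query, vectors[i]))
    hits = [i for i in ranked[:fetch] if predicate(metadata[i])]
    return hits[:k]
```

Your actual baselines should swap the exact scan for ANN search, but the trade-off is the same: pre-filtering pays a full scan over the filtered subset, while post-filtering wastes work on (and may miss) results that the filter would reject.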
References
- ANN Libraries (Source: ANN benchmark)
- Datasets:
- SIFT1m: Images represented by 128-dimension SIFT descriptors
- Wikipedia: Wikipedia articles
- LAION-400M: Image dataset with text descriptions
- Papers:
- AnalyticDB-V: A Hybrid Analytical Engine Towards Query Fusion for Structured and Unstructured Data
- Milvus: A Purpose-Built Vector Data Management System
- High-Throughput Vector Similarity Search in Knowledge Graphs
- VBase: Unifying Online Vector Similarity Search and Relational Queries via Relaxed Monotonicity
- ACORN: Performant and Predicate-Agnostic Search Over Vector Embeddings and Structured Data
Proposal Format
Your proposal should cover the following:
- Proposed Approach
- Provide a detailed sketch of your proposed solution. What search strategy will you implement, and under what scenarios do you expect it to outperform the baselines?
- Cite all sources referenced in developing your proposed solution.
- Implementation Plan
- What programming language and libraries do you plan to use? Clearly distinguish between what you will reuse versus implement from scratch.
- Briefly describe your baseline implementation strategy to ensure fair comparison with your proposed solution.
- Dataset and Query Workload
- Embedding Dataset: Which dataset will you use? Specify the number of vectors, dimensionality, and similarity metric (e.g., cosine, L2). If generating your own embeddings, briefly describe the data generation pipeline (source data, embedding model, chunking strategies).
- Metadata Schema: Describe the metadata schema and provide summary statistics for each column (e.g., value ranges, number of distinct values, data types). If generating synthetic metadata, explain the distributions you will use.
- Query Workload: Design queries with varying selectivity levels and multi-column filters. You can include both point queries (exact matches) and range queries.
- Evaluation Plan
- Metrics: Define key performance metrics for your system (e.g., search latency, recall@k, memory usage, index build time).
- Experimental Setup: Outline your evaluation methodology based on your chosen dataset and query workload. What defines a “success” for your project?
- Expected Results: Sketch your anticipated main results. What will the x and y axes represent? What trends do you expect in the ideal case? (Hand-drawn sketches are acceptable.)
- Timeline and Division of Work
- What do you plan to complete by the milestone report deadline? At a high level, how do you plan to divide the work among the team members?
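Two of the quantities above are easy to pin down in code: filter selectivity for characterizing your query workload, and recall@k for your evaluation metrics. A minimal sketch (names are illustrative):

```python
def selectivity(metadata, predicate):
    """Fraction of rows that pass the metadata filter.
    Vary this across queries (e.g., 0.01, 0.1, 0.5) in your workload."""
    return sum(1 for m in metadata if predicate(m)) / len(metadata)

def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true top-k neighbors (from exact brute-force
    search) that the approximate search returned; order-insensitive."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k
```

Ground-truth IDs for recall@k are typically computed once with an exact scan over the filtered dataset and cached alongside the query workload.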
Option 2: Replicating Research
You will replicate a prior result from a database research paper, following the ACM’s definitions of repeatability, reproducibility, and replicability:
- Repeatability (Same team, same setup): The original team can consistently reproduce their own results using the same methods, tools, and data.
- Reproducibility (Different team, same setup): An independent team can verify the results using the original authors’ artifacts (code, data, and methods).
- Replicability (Different team, different setup): A new team independently re-implements the experiment (with different tools/data) and confirms the conclusions.
For this project, you will focus on replicability: developing your own implementation to validate the paper’s claims.
Where to find papers?
Your selected paper must be published in one of these venues:
- PACMMOD (Proceedings of the ACM on Management of Data)
- SIGMOD (ACM Special Interest Group on Management of Data)
- PVLDB (Proceedings of the VLDB Endowment)
- ICDE (International Conference on Data Engineering)
A good first step is to look over the programs from recent SIGMOD, VLDB, and ICDE conferences. Identify sessions/topics that interest your team. Skim paper titles/abstracts before selecting one.
Here are some example papers for your inspiration:
- The Case for Learned Index Structures
- Bao: Making Learned Query Optimization Practical
- Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes
- Lux: always-on visualization recommendations for exploratory dataframe workflows
- Understanding and Benchmarking the Impact of GDPR on Database Systems
- An Empirical Evaluation of Columnar Storage Formats
- Pollock: A Data Loading Benchmark
Proposal Format
Your proposal should cover the following:
- Paper Selection & Citation
- Please give a full citation that includes the names of all authors, the title, the venue that published it, the year of publication, and a URL.
- Intellectual Contribution
- What is the relevant intellectual contribution of this paper that you will focus on? Explain this in language that is accessible to somebody who has not read the paper.
- Example: “The paper contributes a new method/technique/approach for solving the problem of A, which arises in the context of B. The new method/technique/approach is C.” Or: “This paper describes an interesting measurement/discovery/finding in the area of A, which is important in the context of B because of C. The discovery/finding is reported to be D.”
- What was the “status quo” before the paper arrived? I.e., what was the state of knowledge prior to its intellectual contributions?
- Result/Claim to Replicate
- What is the result or claim that you will replicate, related to the intellectual contribution you identified above? Identify a specific figure/table/number, cite where it appears in the paper, and include a screenshot or excerpt in your report.
- In your own words, what is the result saying? E.g., “The experiments found that the new method/technique/approach produces a 13% improvement in query time compared with the best prior work”.
- Replication Plan
- Briefly outline the steps your team will take to replicate this result. What infrastructure or software will you reuse from prior work? What will you build by following the paper’s instructions? What part(s)/lines/numbers from the figure or text you screenshotted will you attempt to replicate?
- Draw a “sketch” of what the resulting replication might look like. E.g. if you’re planning to redraw two of the lines from a graph, draw those lines freehand just to illustrate what the result of your replication might be. (Okay to do this by hand.)
- Timeline and Division of Work
- What do you plan to complete by the milestone report deadline? At a high level, how do you plan to divide the work among the team members?
Submission and Grading
Use the LaTeX template for the proposal here.
Submit a PDF with your project proposal on Canvas. Only one submission is required per group.
This assignment is worth 5% of your grade. We will evaluate it based on the clarity of the report, the completeness of the design, and the overall description of your approach.