A scalable search engine built from scratch over 28,000 Wikipedia documents — with a MapReduce indexing pipeline, tf-idf and PageRank scoring, three parallel index servers, and a search UI that returns results in under 200ms.
Search engines look simple from the outside — type something, get results. Under the hood they're one of the most technically demanding systems in software: distributed data pipelines, information retrieval algorithms, multi-server architectures, and real-time query processing, all working together. This project built all of it from scratch.
The system has three major components that work together in a pipeline: a MapReduce program that indexes 28,000+ Wikipedia pages into a segmented inverted index, three Flask index servers that load the index into memory and serve query results via REST API, and a search server that queries all three in parallel and returns ranked results to the user.
The architecture follows a service-oriented design. Each component is independent and communicates through well-defined interfaces — the MapReduce pipeline writes files that the index servers read, and the index servers expose a REST API that the search server calls. Separating concerns this way meant each piece could be built, tested, and scaled independently.
The system runs in two phases: build time, when the index is created once, and query time, when live requests are served.
The indexing pipeline is a chain of MapReduce jobs, each one transforming the data one step closer to a final inverted index. The first job counts documents. Subsequent jobs parse HTML, compute term frequencies, calculate tf-idf weights and normalization factors, and finally partition the output into three segments by document ID.
The output is a segmented inverted index where each line represents one term followed by its IDF score and a list of documents containing it — each with term frequency and a normalization factor precomputed for fast cosine similarity at query time.
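The write-up doesn't reproduce the literal file layout, but assuming a whitespace-separated line of term, idf, and repeated doc_id / term-frequency / normalization-factor triples, a loader could parse it with a sketch like this:

```python
def parse_index_line(line: str):
    """Parse one inverted-index line into (term, idf, postings).

    Assumed layout: term idf doc_id tf norm doc_id tf norm ...
    The real on-disk format may differ in detail.
    """
    parts = line.split()
    term, idf = parts[0], float(parts[1])
    postings = [
        (int(parts[i]), int(parts[i + 1]), float(parts[i + 2]))
        for i in range(2, len(parts), 3)
    ]
    return term, idf, postings
```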
The final MapReduce job uses a custom partitioner to split output into exactly 3 files. Each document with doc_id % 3 == i ends up in segment i, ensuring each index server loads a predictable, non-overlapping slice of the corpus.
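The routing rule itself is a one-liner; how it plugs into the framework's partitioner interface is omitted in this sketch:

```python
NUM_SEGMENTS = 3  # one output file per index server

def segment_for(doc_id: int) -> int:
    """Route a document to the index segment that will serve it."""
    return doc_id % NUM_SEGMENTS

# e.g. doc 1029 -> segment 0, doc 1030 -> segment 1, doc 1031 -> segment 2
```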
Document normalization factors are calculated during indexing — not at query time. This means cosine similarity at query time is a simple dot product and division, keeping latency low regardless of how many terms a document contains.
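A sketch of that query-time computation, assuming the stored normalization factor is the sum of squared tf-idf weights for the document (names here are illustrative):

```python
import math

def cosine_similarity(query_weights, doc_weights, doc_norm_factor):
    """Score one document against the query using its precomputed norm factor.

    query_weights and doc_weights map term -> tf-idf weight; doc_norm_factor is
    assumed to be the sum of squared document weights computed during indexing.
    """
    dot = sum(w * doc_weights.get(term, 0.0) for term, w in query_weights.items())
    query_norm = math.sqrt(sum(w * w for w in query_weights.values()))
    return dot / (query_norm * math.sqrt(doc_norm_factor))
```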
Each document is cleaned before indexing: non-alphanumeric characters are stripped, text is lowercased via casefold(), and common stop words are removed. The same cleaning runs on queries at search time so the lookup always matches.
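A sketch of that cleaning step; the exact regular expression and stop-word list are assumptions, since the write-up only names the three operations:

```python
import re

def clean(text: str, stopwords: set) -> list:
    """Normalize text identically at index time and at query time."""
    text = re.sub(r"[^a-zA-Z0-9 ]+", "", text)  # strip non-alphanumeric characters
    words = text.casefold().split()             # lowercase consistently
    return [w for w in words if w not in stopwords]
```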
Each MapReduce job reads from the previous job's output directory. This chaining keeps each step small and testable — the example crawl with 3 documents could be run and diffed against expected output before scaling to the full 28K document corpus.
Results are ranked using a weighted combination of two signals. tf-idf cosine similarity captures how relevant a document is to the specific query. PageRank captures how important a document is across the entire corpus based on link structure. A user-controlled weight parameter shifts the balance between the two.
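The write-up doesn't give the exact formula, but a linear blend consistent with that description would be:

```python
def combined_score(w: float, pagerank: float, cosine_sim: float) -> float:
    """Blend query-independent authority with query-specific relevance."""
    return w * pagerank + (1 - w) * cosine_sim
```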
Term frequency rewards documents where the query term appears often. Inverse document frequency down-weights terms that appear in many documents. Together they surface documents that are specifically relevant to the query, not just ones that use common words.
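One common formulation of that weighting (the exact variant and log base used here aren't specified):

```python
import math

def tf_idf(term_freq: int, doc_freq: int, total_docs: int) -> float:
    """Raw term frequency scaled by log inverse document frequency."""
    return term_freq * math.log10(total_docs / doc_freq)
```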
PageRank scores are precomputed and loaded into memory once at server startup. Documents linked to by many other documents rank higher. This captures the "authority" of a page independent of the specific query terms used.
The weight w is a URL parameter. The search UI exposes it as a slider, letting users tune results in real time — useful for comparing how much PageRank authority vs. query relevance should drive the ranking.
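The route and host below are placeholders, but a request with the slider at 0.3 amounts to something like:

```python
import requests

# Hypothetical search-server address and route.
resp = requests.get("http://localhost:8000/",
                    params={"q": "michigan wolverine", "w": 0.3})
```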
"Searching for 'michigan wolverine' with w=0.3 correctly surfaces the University of Michigan Wikipedia page — ranked first from a corpus of 28,000 documents."
— Verified against instructor solution

Each of the three index servers loads its segment of the inverted index into memory at startup — once. All query requests then operate entirely in memory with no disk I/O, which is the key to sub-200ms latency. The Flask REST API accepts a query string, cleans and parses it the same way as the indexer, looks up matching documents, computes scores, and returns a ranked JSON response.
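A minimal sketch of that shape. The route, file layout, and scoring below are simplified placeholders; the real ranking is the tf-idf cosine and PageRank blend described above:

```python
import flask

app = flask.Flask(__name__)

# Populated exactly once at startup; every request after that is in-memory work.
INDEX = {}       # term -> (idf, [(doc_id, tf, norm_factor), ...])
PAGERANK = {}    # doc_id -> precomputed PageRank score
STOPWORDS = set()

def load_segment(path):
    """Read this server's non-overlapping slice of the inverted index into memory."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            term, idf, *rest = line.split()
            postings = [(int(rest[i]), int(rest[i + 1]), float(rest[i + 2]))
                        for i in range(0, len(rest), 3)]
            INDEX[term] = (float(idf), postings)

# In the real server this runs once, at import time, e.g.:
# load_segment("inverted_index_0.txt")   # hypothetical filename

@app.route("/api/v1/hits/")   # illustrative route
def hits():
    query = flask.request.args.get("q", "")
    w = float(flask.request.args.get("w", 0.5))
    # Simplified cleaning; the real system applies the same pipeline as the indexer.
    terms = [t for t in query.casefold().split() if t not in STOPWORDS]

    # Accumulate a raw tf * idf score per document from the in-memory index.
    scores = {}
    for term in terms:
        idf, postings = INDEX.get(term, (0.0, []))
        for doc_id, tf, _norm in postings:
            scores[doc_id] = scores.get(doc_id, 0.0) + tf * idf

    # Blend in PageRank; the real system first normalizes the tf-idf part into
    # a cosine similarity using the precomputed norm factors.
    blended = {d: (1 - w) * s + w * PAGERANK.get(d, 0.0) for d, s in scores.items()}
    ranked = sorted(blended.items(), key=lambda kv: kv[1], reverse=True)
    return flask.jsonify({"hits": [{"docid": d, "score": s} for d, s in ranked]})
```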
The search server sits in front of all three index servers. When a query arrives, it fires requests to all three simultaneously using Python threads, then merges and re-ranks the combined result sets before returning the top 10 to the user.
The index, PageRank scores, and stop words are all loaded into memory exactly once when the Flask app initializes. Every subsequent query hits an in-memory data structure rather than reading from disk, which is the primary reason query latency stays low under concurrent load.
The search server uses Python threads to query all three index servers simultaneously rather than sequentially. Total query time is bounded by the slowest server, not the sum of all three — a simple but meaningful performance win at this scale.
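A sketch of that fan-out using a thread pool; the URLs and response shape are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

import requests

# Hypothetical index-server locations; the real hosts and ports come from configuration.
INDEX_SERVERS = [
    "http://localhost:9000/api/v1/hits/",
    "http://localhost:9001/api/v1/hits/",
    "http://localhost:9002/api/v1/hits/",
]

def fetch_hits(url, query, weight):
    """Query one index server; each server only knows its own segment."""
    resp = requests.get(url, params={"q": query, "w": weight}, timeout=5)
    return resp.json()["hits"]

def search(query, weight, top_k=10):
    """Fan out to all three index servers in parallel, then merge and re-rank."""
    with ThreadPoolExecutor(max_workers=len(INDEX_SERVERS)) as pool:
        per_server = list(pool.map(lambda u: fetch_hits(u, query, weight),
                                   INDEX_SERVERS))
    merged = [hit for hits in per_server for hit in hits]
    merged.sort(key=lambda h: h["score"], reverse=True)
    return merged[:top_k]   # total latency tracks the slowest server, not the sum
```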
A separate searchdb script parses all 28K HTML files using BeautifulSoup and stores each document's title, URL, and summary in a SQLite database. The search server queries this at render time to display human-readable results rather than raw doc IDs.
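A sketch of that script; the schema, file naming, and the way the URL is recovered from each page are assumptions:

```python
import pathlib
import sqlite3

from bs4 import BeautifulSoup

def build_search_db(html_dir="crawl", db_path="search.sqlite3"):
    """Store title, URL, and summary for every crawled page, keyed by doc id."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents "
        "(docid INTEGER PRIMARY KEY, title TEXT, url TEXT, summary TEXT)"
    )
    for path in pathlib.Path(html_dir).glob("*.html"):
        soup = BeautifulSoup(path.read_text(encoding="utf-8"), "html.parser")
        title = soup.title.get_text() if soup.title else path.stem
        first_paragraph = soup.find("p")
        summary = first_paragraph.get_text()[:250] if first_paragraph else ""
        # Assumed: the filename is the doc id and the URL sits in a meta tag.
        docid = int(path.stem)
        url_tag = soup.find("meta", attrs={"name": "url"})
        url = url_tag["content"] if url_tag else ""
        conn.execute("INSERT OR REPLACE INTO documents VALUES (?, ?, ?, ?)",
                     (docid, title, url, summary))
    conn.commit()
    conn.close()
```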
Shell scripts handle starting, stopping, restarting, and status-checking all four servers (3 index + 1 search) as background processes with log file output. The search server validates that the index is running and the database exists before it starts — failing early with a clear error message otherwise.
Every component was written in Python. No search library, no pre-built index, no shortcuts — the tf-idf calculation, the cosine similarity scoring, the MapReduce jobs, the REST API, the query parser, and the HTML document parser were all written from scratch.
The biggest latency win wasn't parallelism — it was loading the entire index into memory at startup. The decision of what to precompute (normalization factors, PageRank scores) versus what to compute at query time defined the system's speed ceiling before a single request was made.
Having three independent servers that communicate through REST made the system much easier to build and debug than a monolith would have been. Each component could be tested in isolation. If the index server returned wrong scores, the search server didn't need to change.
Writing each Map and Reduce as a standalone program that only reads from stdin and writes to stdout was a constraint that taught a lot about functional data transformation. It also meant each step was independently testable, which was invaluable when debugging a 5-job pipeline over 28,000 files.
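For example, a document-counting step in that style is just two tiny programs; this sketch assumes one document per input line rather than the project's actual record format:

```python
# map.py: emit a count of 1 for every document seen on stdin.
import sys

for _line in sys.stdin:
    print("total\t1")
```

```python
# reduce.py: sum the counts for the single "total" key.
import sys

total = sum(int(line.split("\t")[1]) for line in sys.stdin)
print(f"total\t{total}")
```

Because each program only touches stdin and stdout, it can be exercised on its own with a shell pipe before being dropped into the full pipeline.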