A scalable search engine built from scratch over 28,000 Wikipedia documents — with a MapReduce indexing pipeline, tf-idf and PageRank scoring, three parallel index servers, and a search UI that returns results in under 200ms.
Search engines look simple from the outside — type something, get results. Under the hood they're one of the most technically demanding systems in software: distributed data pipelines, information retrieval algorithms, multi-server architectures, and real-time query processing, all working together. This project built all of it from scratch.
The system has three major components that work together in a pipeline: a MapReduce program that indexes 28,000+ Wikipedia pages into a segmented inverted index, three Flask index servers that load the index into memory and serve query results via REST API, and a search server that queries all three in parallel and returns ranked results to the user.
The architecture follows a service-oriented design. Each component is independent and communicates through well-defined interfaces — the MapReduce pipeline writes files that the index servers read, and the index servers expose a REST API that the search server calls. Separating concerns this way meant each piece could be built, tested, and scaled independently.
The system runs in two phases: build time, when the index is created once, and query time, when live requests are served.
The indexing pipeline is a chain of MapReduce jobs, each one transforming the data one step closer to a final inverted index. The first job counts documents. Subsequent jobs parse HTML, compute term frequencies, calculate tf-idf weights and normalization factors, and finally partition the output into three segments by document ID.
The output is a segmented inverted index where each line represents one term followed by its IDF score and a list of documents containing it — each with term frequency and a normalization factor precomputed for fast cosine similarity at query time.
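The write-up doesn't reproduce the literal file layout, but assuming a whitespace-separated line of term, idf, and repeated doc_id / term-frequency / normalization-factor triples, a loader could parse it with a sketch like this:

```python
def parse_index_line(line: str):
    """Parse one inverted-index line into (term, idf, postings).

    Assumed layout: term idf doc_id tf norm doc_id tf norm ...
    The real on-disk format may differ in detail.
    """
    parts = line.split()
    term, idf = parts[0], float(parts[1])
    postings = [
        (int(parts[i]), int(parts[i + 1]), float(parts[i + 2]))
        for i in range(2, len(parts), 3)
    ]
    return term, idf, postings
```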
The final MapReduce job uses a custom partitioner to split output into exactly 3 files. Each document with doc_id % 3 == i ends up in segment i, ensuring each index server loads a predictable, non-overlapping slice of the corpus.
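The routing rule itself is a one-liner; how it plugs into the framework's partitioner interface is omitted in this sketch:

```python
NUM_SEGMENTS = 3  # one output file per index server

def segment_for(doc_id: int) -> int:
    """Route a document to the index segment that will serve it."""
    return doc_id % NUM_SEGMENTS

# e.g. doc 1029 -> segment 0, doc 1030 -> segment 1, doc 1031 -> segment 2
```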
Document normalization factors are calculated during indexing — not at query time. This means cosine similarity at query time is a simple dot product and division, keeping latency low regardless of how many terms a document contains.
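A sketch of that query-time computation, assuming the stored normalization factor is the sum of squared tf-idf weights for the document (names here are illustrative):

```python
import math

def cosine_similarity(query_weights, doc_weights, doc_norm_factor):
    """Score one document against the query using its precomputed norm factor.

    query_weights and doc_weights map term -> tf-idf weight; doc_norm_factor is
    assumed to be the sum of squared document weights computed during indexing.
    """
    dot = sum(w * doc_weights.get(term, 0.0) for term, w in query_weights.items())
    query_norm = math.sqrt(sum(w * w for w in query_weights.values()))
    return dot / (query_norm * math.sqrt(doc_norm_factor))
```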
Each document is cleaned before indexing: non-alphanumeric characters are stripped, text is lowercased via casefold(), and common stop words are removed. The same cleaning runs on queries at search time so the lookup always matches.
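A sketch of that cleaning step; the exact regular expression and stop-word list are assumptions, since the write-up only names the three operations:

```python
import re

def clean(text: str, stopwords: set) -> list:
    """Normalize text identically at index time and at query time."""
    text = re.sub(r"[^a-zA-Z0-9 ]+", "", text)  # strip non-alphanumeric characters
    words = text.casefold().split()             # lowercase consistently
    return [w for w in words if w not in stopwords]
```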
Each MapReduce job reads from the previous job's output directory. This chaining keeps each step small and testable — the example crawl with 3 documents could be run and diffed against expected output before scaling to the full 28K document corpus.
Results are ranked using a weighted combination of two signals. tf-idf cosine similarity captures how relevant a document is to the specific query. PageRank captures how important a document is across the entire corpus based on link structure. A user-controlled weight parameter shifts the balance between the two.
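The write-up doesn't give the exact formula, but a linear blend consistent with that description would be:

```python
def combined_score(w: float, pagerank: float, cosine_sim: float) -> float:
    """Blend query-independent authority with query-specific relevance."""
    return w * pagerank + (1 - w) * cosine_sim
```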
Term frequency rewards documents where the query term appears often. Inverse document frequency down-weights terms that appear in many documents. Together they surface documents that are specifically relevant to the query, not just ones that use common words.
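One common formulation of that weighting (the exact variant and log base used here aren't specified):

```python
import math

def tf_idf(term_freq: int, doc_freq: int, total_docs: int) -> float:
    """Raw term frequency scaled by log inverse document frequency."""
    return term_freq * math.log10(total_docs / doc_freq)
```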
PageRank scores are precomputed and loaded into memory once at server startup. Documents linked to by many other documents rank higher. This captures the "authority" of a page independent of the specific query terms used.
The weight w is a URL parameter. The search UI exposes it as a slider, letting users tune results in real time — useful for comparing how much PageRank authority vs. query relevance should drive the ranking.
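The route and host below are placeholders, but a request with the slider at 0.3 amounts to something like:

```python
import requests

# Hypothetical search-server address and route.
resp = requests.get("http://localhost:8000/",
                    params={"q": "michigan wolverine", "w": 0.3})
```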
"Searching for 'michigan wolverine' with w=0.3 correctly surfaces the University of Michigan Wikipedia page — ranked first from a corpus of 28,000 documents."
— Verified against instructor solution

Each of the three index servers loads its segment of the inverted index into memory at startup — once. All query requests then operate entirely in memory with no disk I/O, which is the key to sub-200ms latency. The Flask REST API accepts a query string, cleans and parses it the same way as the indexer, looks up matching documents, computes scores, and returns a ranked JSON response.
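A minimal sketch of that shape. The route, file layout, and scoring below are simplified placeholders; the real ranking is the tf-idf cosine and PageRank blend described above:

```python
import flask

app = flask.Flask(__name__)

# Populated exactly once at startup; every request after that is in-memory work.
INDEX = {}       # term -> (idf, [(doc_id, tf, norm_factor), ...])
PAGERANK = {}    # doc_id -> precomputed PageRank score
STOPWORDS = set()

def load_segment(path):
    """Read this server's non-overlapping slice of the inverted index into memory."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            term, idf, *rest = line.split()
            postings = [(int(rest[i]), int(rest[i + 1]), float(rest[i + 2]))
                        for i in range(0, len(rest), 3)]
            INDEX[term] = (float(idf), postings)

# In the real server this runs once, at import time, e.g.:
# load_segment("inverted_index_0.txt")   # hypothetical filename

@app.route("/api/v1/hits/")   # illustrative route
def hits():
    query = flask.request.args.get("q", "")
    w = float(flask.request.args.get("w", 0.5))
    # Simplified cleaning; the real system applies the same pipeline as the indexer.
    terms = [t for t in query.casefold().split() if t not in STOPWORDS]

    # Accumulate a raw tf * idf score per document from the in-memory index.
    scores = {}
    for term in terms:
        idf, postings = INDEX.get(term, (0.0, []))
        for doc_id, tf, _norm in postings:
            scores[doc_id] = scores.get(doc_id, 0.0) + tf * idf

    # Blend in PageRank; the real system first normalizes the tf-idf part into
    # a cosine similarity using the precomputed norm factors.
    blended = {d: (1 - w) * s + w * PAGERANK.get(d, 0.0) for d, s in scores.items()}
    ranked = sorted(blended.items(), key=lambda kv: kv[1], reverse=True)
    return flask.jsonify({"hits": [{"docid": d, "score": s} for d, s in ranked]})
```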
The search server sits in front of all three index servers. When a query arrives, it fires requests to all three simultaneously using Python threads, then merges and re-ranks the combined result sets before returning the top 10 to the user.
The index, PageRank scores, and stop words are all loaded into memory exactly once when the Flask app initializes. Every subsequent query hits an in-memory data structure rather than reading from disk, which is the primary reason query latency stays low under concurrent load.
The search server uses Python threads to query all three index servers simultaneously rather than sequentially. Total query time is bounded by the slowest server, not the sum of all three — a simple but meaningful performance win at this scale.
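A sketch of that fan-out using a thread pool; the URLs and response shape are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

import requests

# Hypothetical index-server locations; the real hosts and ports come from configuration.
INDEX_SERVERS = [
    "http://localhost:9000/api/v1/hits/",
    "http://localhost:9001/api/v1/hits/",
    "http://localhost:9002/api/v1/hits/",
]

def fetch_hits(url, query, weight):
    """Query one index server; each server only knows its own segment."""
    resp = requests.get(url, params={"q": query, "w": weight}, timeout=5)
    return resp.json()["hits"]

def search(query, weight, top_k=10):
    """Fan out to all three index servers in parallel, then merge and re-rank."""
    with ThreadPoolExecutor(max_workers=len(INDEX_SERVERS)) as pool:
        per_server = list(pool.map(lambda u: fetch_hits(u, query, weight),
                                   INDEX_SERVERS))
    merged = [hit for hits in per_server for hit in hits]
    merged.sort(key=lambda h: h["score"], reverse=True)
    return merged[:top_k]   # total latency tracks the slowest server, not the sum
```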
A separate searchdb script parses all 28K HTML files using BeautifulSoup and stores each document's title, URL, and summary in a SQLite database. The search server queries this at render time to display human-readable results rather than raw doc IDs.
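A sketch of that script; the schema, file naming, and the way the URL is recovered from each page are assumptions:

```python
import pathlib
import sqlite3

from bs4 import BeautifulSoup

def build_search_db(html_dir="crawl", db_path="search.sqlite3"):
    """Store title, URL, and summary for every crawled page, keyed by doc id."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents "
        "(docid INTEGER PRIMARY KEY, title TEXT, url TEXT, summary TEXT)"
    )
    for path in pathlib.Path(html_dir).glob("*.html"):
        soup = BeautifulSoup(path.read_text(encoding="utf-8"), "html.parser")
        title = soup.title.get_text() if soup.title else path.stem
        first_paragraph = soup.find("p")
        summary = first_paragraph.get_text()[:250] if first_paragraph else ""
        # Assumed: the filename is the doc id and the URL sits in a meta tag.
        docid = int(path.stem)
        url_tag = soup.find("meta", attrs={"name": "url"})
        url = url_tag["content"] if url_tag else ""
        conn.execute("INSERT OR REPLACE INTO documents VALUES (?, ?, ?, ?)",
                     (docid, title, url, summary))
    conn.commit()
    conn.close()
```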
Shell scripts handle starting, stopping, restarting, and status-checking all four servers (3 index + 1 search) as background processes with log file output. The search server validates that the index is running and the database exists before it starts — failing early with a clear error message otherwise.
Every component was written in Python. No search library, no pre-built index, no shortcuts — the tf-idf calculation, the cosine similarity scoring, the MapReduce jobs, the REST API, the query parser, and the HTML document parser were all written from scratch.
The biggest latency win wasn't parallelism — it was loading the entire index into memory at startup. The decision of what to precompute (normalization factors, PageRank scores) versus what to compute at query time defined the system's speed ceiling before a single request was made.
Having three independent servers that communicate through REST made the system much easier to build and debug than a monolith would have been. Each component could be tested in isolation. If the index server returned wrong scores, the search server didn't need to change.
Writing each Map and Reduce as a standalone program that only reads from stdin and writes to stdout was a constraint that taught a lot about functional data transformation. It also meant each step was independently testable, which was invaluable when debugging a 5-job pipeline over 28,000 files.
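For example, a document-counting step in that style is just two tiny programs; this sketch assumes one document per input line rather than the project's actual record format:

```python
# map.py: emit a count of 1 for every document seen on stdin.
import sys

for _line in sys.stdin:
    print("total\t1")
```

```python
# reduce.py: sum the counts for the single "total" key.
import sys

total = sum(int(line.split("\t")[1]) for line in sys.stdin)
print(f"total\t{total}")
```

Because each program only touches stdin and stdout, it can be exercised on its own with a shell pipe before being dropped into the full pipeline.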