Search Engine Design - cshamback/WinGrep GitHub Wiki

Objectives

Make WinGrep's design as simple as possible.
Make WinGrep user-friendly to Windows users, including installation.
WinGrep's search algorithm should be reasonably optimized and faster on average than Windows Search performing the same task.

Architecture

Original Architecture:

Components

Crawler, when called, starts at the root directory (or another specified directory) and explores every subdirectory and file. Every unique file it finds is added to the database.
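The crawl described above could be sketched roughly as follows. This is only an illustration, not WinGrep's actual code: the function name `crawl` and the choice to treat two paths as the same file when they resolve to the same real path are assumptions.

```python
import os

def crawl(root):
    """Walk `root` recursively and return a list of unique file paths."""
    seen = set()     # resolved paths already recorded, used to skip duplicates
    database = []    # the temporary array of file paths
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Resolve symlinks so the same file is not added twice.
            path = os.path.realpath(os.path.join(dirpath, name))
            if path not in seen:
                seen.add(path)
                database.append(path)
    return database
```

Calling `crawl("C:\\")` would start at the root of the C: drive, while passing any other directory restricts the crawl to that subtree.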

Database is a temporary array of file paths. The indexer works through it entry by entry, and the program empties it once indexing is done.

Indexer is called when the crawler finishes. It iterates through the entire database array and creates a JSON object for each distinct word found. These objects are stored in MainMap.

For readable files such as .txt and .html, the indexer extracts only the text content and removes uninformative words (such as "the" or "and") and symbols. It then stores the file in the MainMap JSON under each remaining word, along with that word's frequency in the file.
For non-readable files such as images, zip archives, and binaries, the indexer uses only the words from the file name, even if those words are not separated by spaces.
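A minimal sketch of these two indexing rules is shown below. Everything in it is an assumption made for illustration: the stop-word list, the set of readable extensions, the regex-based HTML stripping, and all function names are hypothetical, since the design does not specify them.

```python
import os
import re
from collections import Counter

# Hypothetical stop-word list; the design does not say which words are removed.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}
READABLE_EXTENSIONS = {".txt", ".html"}  # assumed set of readable formats

def strip_html(text):
    """Crude tag removal; a real indexer would likely use an HTML parser."""
    return re.sub(r"<[^>]+>", " ", text)

def words_from_text(text):
    """Lowercase the text, keep alphanumeric runs, and drop stop words."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower())
            if w not in STOP_WORDS]

def words_from_filename(path):
    """Split a file name on separators and camelCase boundaries."""
    stem = os.path.splitext(os.path.basename(path))[0]
    # Insert a space at lower-to-upper transitions: "holidayPhoto" -> "holiday Photo".
    stem = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", stem)
    return words_from_text(stem)

def word_frequencies(path):
    """Return a Counter of words for one file, following the two rules above."""
    ext = os.path.splitext(path)[1].lower()
    if ext in READABLE_EXTENSIONS:
        with open(path, encoding="utf-8", errors="ignore") as f:
            text = f.read()
        if ext == ".html":
            text = strip_html(text)
        return Counter(words_from_text(text))
    return Counter(words_from_filename(path))
```

This sketch only splits file names on punctuation, underscores, and camelCase boundaries. Names that run words together in a single case (for example `holidayphoto.zip`) would need a dictionary-based word segmenter, which is not shown here.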

MainMap is a JSON file that follows this format:

{
  "word": "example",
  "pages": [
    {"URL": "~/home/example", "freq": 3},
    {"URL": "~/home/Downloads", "freq": 1}
  ]
}
{
  "word": "help",
  "pages": [ ... ]
}
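As a rough illustration of how entries in this format might be produced and read back, the sketch below groups per-file word counts by word. The function names are hypothetical, and storing the entries as a single JSON array is an assumption made so that the file parses as one JSON document.

```python
import json

def build_main_map(file_word_counts):
    """Turn {path: {word: freq}} into a list of MainMap-style entries."""
    by_word = {}
    for path, counts in file_word_counts.items():
        for word, freq in counts.items():
            by_word.setdefault(word, []).append({"URL": path, "freq": freq})
    return [{"word": w, "pages": pages} for w, pages in sorted(by_word.items())]

def save_main_map(entries, filename):
    """Write the entries to disk as one JSON array (an assumed layout)."""
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(entries, f, indent=2)

def load_main_map(filename):
    """Load the entries and key them by word for direct lookup."""
    with open(filename, encoding="utf-8") as f:
        return {entry["word"]: entry["pages"] for entry in json.load(f)}
```

Keying the loaded data by word means a lookup does not have to scan the whole file for every query word.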

Query Engine takes queries from the Client, splits them into words, and looks up each word in the MainMap. It stores all results in a dictionary and orders them by priority: a page where a query word appears with high frequency ranks higher, and a page where all of the query's words appear with high frequency ranks highest. It then returns the ordered list.
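One possible reading of this priority rule is sketched below: pages are ordered first by how many distinct query words they contain, then by the summed frequency of those words. The function name `search` and the exact scoring are assumptions; `main_map` is assumed to be a dictionary from each word to its `pages` list, already loaded from the MainMap file.

```python
def search(query, main_map):
    """Return page URLs for `query`, best match first."""
    words = query.lower().split()
    scores = {}  # URL -> [number of query words matched, summed frequency]
    for word in set(words):
        for page in main_map.get(word, []):
            entry = scores.setdefault(page["URL"], [0, 0])
            entry[0] += 1
            entry[1] += page["freq"]
    # Pages matching more query words come first; ties break on total frequency.
    return sorted(scores,
                  key=lambda url: (scores[url][0], scores[url][1]),
                  reverse=True)
```

With this scoring, a page that contains every query word outranks a page that contains only one of them very often. Other weightings would also fit the description.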

Client is the front end. The user can bring it up with a configurable keyboard shortcut (Ctrl+F by default) or by opening the app. From there, they can enter queries in a Google-like search box, or run the crawler and indexer to re-index their files. Searching and indexing will each show a progress bar. Results will resemble Google Search results.