Writing Your Own Search Engine Using SQL Server

My coffee site INeedCoffee needed a better search engine. I had thrown some basic SQL together when the site was launched back in 1999. It did an OK job when the site didn’t have much content. Over the years, the quality of the search results got worse and worse. So I did what any coder would do, I looked for a free solution.

Google did better job searching my site than my own code, so I looked at their Google Custom Search solution. I didn’t like their free ad version and I didn’t what to pay them $100 each and every year for the non-ad version. I decided that not only could I write my own search engine that was just as fast, but I could also deliver better results to the users. After all, I knew my content better than anyone else.

Assigning a Quality Score

The first thing I noticed about Google’s search results is that the best article on a given topic often wasn’t listed first. It had no way to know quality, but I did. So I added a quality score of 1 to 5 for every article. The default was 3. The best content was rated a 4 or 5. Articles that needed better photos or improved in some way, were given a 1 or 2. Later I’d also use this quality score when assigning weight on the sitemap.

Web Form -> Server Side Code -> Stored Procedure

The HTML search form is pretty basic. A single text box and a submit button. What server side code you use to call the stored procedure is irrelevant. ASP.NET, Classic ASP, PHP – it is all good. The server side code will call the search stored procedure.

Two Temp Tables

The search stored procedure will have two temp tables: #searchWords and #searchResults. The purpose of #searchWords is to chop up any search phrase into individual words and then record their position. Later that position will be used to order search results, which more weight being placed on the first and second word in a search query. The #searchResults table are the results being returned to the web page.

CREATE TABLE #searchWords (
    word      VARCHAR(100),
    POSITION    INT
)
CREATE TABLE #searchResults (
    url        VARCHAR(100),
    title      VARCHAR(100),
    longDesc   VARCHAR(MAX),
    quality    TINYINT,
    score      INT
)

Splitting Search Phrases

For this functionality, I found some code on StackOverflow that did the job. The SplitWordList user-defined function by Terrapin works perfectly. If the user places the search term inside quotes, I do not call the SplitWordFunction and inside enter the entire phrase as one row in the #searchWords table.

INSERT INTO #searchWords SELECT word, POSITION FROM SplitWordList(@searchString)

Count String Function

For the actual search, I used the Count String Occurrence Function. The search words are compared first against the article title and then the content itself.

CREATE FUNCTION [dbo].[udfCountString](
    @InputString    VARCHAR(MAX),
    @SearchString    VARCHAR(100)
)
RETURNS INT
BEGIN
    RETURN (LEN(@InputString) -
            LEN(REPLACE(@InputString, @SearchString, ''))) /
            LEN(@SearchString)
END

I Like Cursors

The most straight forward approach I could think of for getting search results was to use two cursors. One with the content and one with the search words. Then write the hits to the #searchResults temp table. But cursors are often frowned upon for poor performance. I decided I would first code the search engine using Cursors and then if I ran into a performance problem, I’d come up with an alternate solution. But I didn’t need to, as I got rocking fast results using CURSORS.

DECLARE ContentCursor CURSOR FAST_FORWARD FOR
SELECT url, title, longDesc, quality, page
FROM Articles 
 
DECLARE SearchWordCursor CURSOR DYNAMIC FOR
SELECT word, POSITION FROM #searchWords
OPEN SearchWordCursor 
 
OPEN ContentCursor
FETCH NEXT FROM ContentCursor INTO @url, @title, @longDesc, @quality, @page
 
WHILE @@FETCH_STATUS = 0
BEGIN
    FETCH FIRST FROM SearchWordCursor INTO @word, @POSITION
    WHILE @@FETCH_STATUS = 0
    BEGIN
        -- place more weight on the first search term
        SELECT @score = CASE @POSITION
            WHEN 1 THEN 3
            WHEN 2 THEN 2
            ELSE 1
        END
        -- search the TITLE 
        SET @COUNT = dbo.udfCountString(@title, @word)
        IF @COUNT > 0
        BEGIN
            INSERT INTO #searchResults VALUES (@url, @title, @longDesc, @quality, @score * 10)
        END
        -- search the PAGE
        SET @COUNT = dbo.udfCountString(@page, @word)
        IF @COUNT > 0
        BEGIN
            INSERT INTO #searchResults VALUES (@url, @title, @longDesc, @quality, @score)
        END                    
 
        FETCH NEXT FROM SearchWordCursor INTO @word, @POSITION
    END
    FETCH NEXT FROM ContentCursor INTO @url, @title, @longDesc, @quality, @page
END
 
CLOSE ContentCursor
DEALLOCATE ContentCursor
 
CLOSE SearchWordCursor
DEALLOCATE SearchWordCursor

Working With the Results

Before dropping both temp tables, here is the query used to return the search results. If you look at the SQL above you will see that it is possible (likely) that a search hit will take place on both the title and the page content. I ran some tests and determined that a search hit against a word in the title was 10 times more important than the content, so I multiply the score time ten if there is a title match.

To flatten the results, I use a GROUP BY clause in the SQL. Then the results are returned order from highest to lowest scores.

SELECT TOP 20 S.url, S.title, S.longDesc, S.quality, SUM(S.score) AS Score
FROM #searchResults S
GROUP BY S.url, S.title, S.longDesc, S.quality
ORDER BY SUM(S.score) DESC, S.Quality DESC

Better Than Google?

I ran numerous tests comparing my search engine to Google. My hand-coded INeedCoffee search engine delivered better results at equal or faster speeds. And the best part is I don’t need to send Google a check for $100 every year.

Code

All the above code is available on GitHub.

Comments

  1. Jim says

    Came back to say thank you for the code. The code is brilliant and fast.

    Keep up the good work.

  2. Paa says

    Hello Sir,
    I’m building a basic “search engine” and i mean a true search engine such as yahoo, bing and google
    i can use crawlers, I program using python… but these search engine stuff is new to me,
    i have read a lot on the internet but still can really grasp the big picture on the server side.
    i have asked many questions concerning SQL and have received alot of different answers…
    such you don’t need SQL and you do need SQL i’m just very confused. i would seriously appreciate a very detailed
    explanation of what goes on behind the HTML page. example after you hit the enter key what program grabs the keywords and then sends it to the server and how do the server interacts with the database”if there is any” which i think there should be.

  3. says

    @Paa – This post is for people with knowledge of SQL and SQL Server. I’m sure there are many ways to write a search engine. This is but one solution, for which I am not offering support.

Trackbacks