How To Do Hit-highlighting Of Results From A SQL Server Full-text Query

We have a web application that uses SQL Server 2008 as the database. Our users are able to do full-text searches on particular columns in the database. SQL Server's full-text functionality does not seem to provide support for hit highlighting. Do we need to build this ourselves or is there perhaps some library or knowledge around on how to do this?

BTW, the application is written in C#, so a .NET solution would be ideal, but it is not necessary as we could translate.

Answer

Expanding on Ishmael's idea: this is not a final solution, but I think it's a good way to start.

First, we need the list of word forms that the full-text engine would match for the search string:

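-- @SearchString holds the user's search term
declare @SearchPattern nvarchar(1000) = 'FORMSOF (INFLECTIONAL, "' + @SearchString + '")'
-- display_term is Unicode, so store it as nvarchar
declare @SearchWords table (Word nvarchar(100), Expansion_type int)
insert into @SearchWords (Word, Expansion_type)
select distinct display_term, expansion_type
-- 1033 = US English, 0 = system stoplist, 0 = accent-insensitive
from sys.dm_fts_parser(@SearchPattern, 1033, 0, 0)
where special_term = 'Exact Match'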

There is already quite a lot one can expand on here. For example, the search pattern is quite basic, and there are probably better ways to filter out the words you don't need, but at least it gives you a list of stem words etc. that would be matched by the full-text search.
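
Before deciding how to filter, it can help to look at the unfiltered parser output. The following is only a sketch: the search term 'run' is a made-up example, and 1033 (US English) is carried over from the query above. Note that sys.dm_fts_parser requires membership in the sysadmin role.

```sql
-- Hypothetical search term, for illustration only
declare @SearchString nvarchar(100) = N'run'
declare @SearchPattern nvarchar(1000) = N'FORMSOF (INFLECTIONAL, "' + @SearchString + N'")'

-- special_term separates exact matches from noise words and end-of-sentence/paragraph markers;
-- expansion_type is 0 for the word as typed, 2 for inflectional forms, 4 for thesaurus forms
select display_term, special_term, expansion_type, source_term
from sys.dm_fts_parser(@SearchPattern, 1033, 0, 0)
order by occurrence
```

Filtering on expansion_type would then let you choose, for instance, between highlighting only the word as typed (0) and also highlighting its inflections (2).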

After you get the results you need, you can use RegEx to parse through the result set (or preferably only a subset to speed it up, although I haven't yet figured out a good way to do so). For this I simply use two while loops and a bunch of temporary tables and variables:

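-- Assumed: @PrelimResults ([UID] int, [Text] nvarchar(max)) holds the rows returned by the
-- full-text query, and @FirstSearchWord is set as described in note 2 below
declare @FinalResults table ([UID] int, [Text] nvarchar(max))
declare @TempSearchWords table (Word nvarchar(100))
declare @CurrID int, @Text nvarchar(max), @CurrWord nvarchar(100)

while (select COUNT(*) from @PrelimResults) > 0
begin
    select top 1 @CurrID = [UID], @Text = [Text] from @PrelimResults
    declare @TextLength int = LEN(@Text)
    -- Look backwards from the first hit for the full stop that ends the previous sentence
    declare @IndexOfDot int = CHARINDEX('.', REVERSE(@Text), @TextLength - dbo.RegExIndexOf(@Text, '\b' + @FirstSearchWord + '\b') + 1)
    -- Keep up to 300 characters, starting at the sentence that contains the first hit
    set @Text = SUBSTRING(@Text, case @IndexOfDot when 0 then 0 else @TextLength - @IndexOfDot + 3 end, 300)

    -- Refill the working list for each row; the inner loop empties it
    insert into @TempSearchWords select distinct Word from @SearchWords
    while (select COUNT(*) from @TempSearchWords) > 0
    begin
        select top 1 @CurrWord = Word from @TempSearchWords
        -- Assumes RegExReplace wraps .NET Regex.Replace, where $0 stands for the whole match
        set @Text = dbo.RegExReplace(@Text, '\b' + @CurrWord + '\b', '<b>$0</b>')
        delete from @TempSearchWords where Word = @CurrWord
    end

    -- Store the highlighted summary rather than the original text
    insert into @FinalResults ([UID], [Text]) values (@CurrID, @Text)
    delete from @PrelimResults where [UID] = @CurrID
end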
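
As a possible alternative to the inner loop, all search words can be folded into a single regex alternation so that each row needs only one replace call. This is a sketch under assumptions, not a tested solution: dbo.RegExReplace is assumed to be a CLR function that wraps .NET Regex.Replace (so `$0` in the replacement stands for the whole match), and the sample rows are made up for illustration.

```sql
-- Made-up sample data standing in for the parser output and the full-text results
declare @SearchWords table (Word nvarchar(100), Expansion_type int)
insert into @SearchWords values (N'run', 0), (N'running', 2), (N'ran', 2)

declare @PrelimResults table ([UID] int, [Text] nvarchar(max))
insert into @PrelimResults values (1, N'She ran home. He keeps running.')

-- Join the words with | (FOR XML PATH concatenation), then wrap them in \b( ... )\b
declare @Pattern nvarchar(max)
select @Pattern = stuff((select distinct '|' + Word
                         from @SearchWords
                         for xml path(''), type).value('.', 'nvarchar(max)'), 1, 1, '')
set @Pattern = '\b(' + @Pattern + ')\b'

-- One set-based pass over all rows; dbo.RegExReplace is the assumed CLR function
select [UID], dbo.RegExReplace([Text], @Pattern, '<b>$0</b>') as Highlighted
from @PrelimResults
```

Because every word is replaced in a single pass, a later word can no longer match inside a `<b>` tag inserted for an earlier one. The sketch highlights the whole text and does not trim a summary, and words containing regex metacharacters would still need escaping. In practice @PrelimResults would be filled from the full-text query itself; CONTAINSTABLE accepts a top_n_by_rank argument, which is one way to limit the rows that need highlighting.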

Several notes:
1. Nested while loops probably aren't the most efficient way of doing this, but nothing else comes to mind. If I were to use cursors, wouldn't it essentially be the same thing?
2. @FirstSearchWord here refers to the first instance in the text of one of the original search words, so essentially the text you are replacing is only going to be in the summary. Again, it's quite a basic method; some sort of text-cluster-finding algorithm would probably be handy.
3. To get RegEx in the first place, you need CLR user-defined functions.
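
For note 3, registering the CLR functions might look roughly like the following. The assembly name, file path, class name and parameter types are all hypothetical; they depend on how the .NET assembly that implements RegExIndexOf and RegExReplace was written and compiled.

```sql
-- Enable CLR integration on the server (needs server-level permission)
exec sp_configure 'clr enabled', 1
reconfigure
go

-- Hypothetical assembly name and path
create assembly RegExFunctions
from 'C:\Assemblies\RegExFunctions.dll'
with permission_set = safe
go

-- Hypothetical class name (UserDefinedFunctions) and parameter types
create function dbo.RegExIndexOf (@input nvarchar(max), @pattern nvarchar(4000))
returns int
as external name RegExFunctions.[UserDefinedFunctions].RegExIndexOf
go

create function dbo.RegExReplace (@input nvarchar(max), @pattern nvarchar(4000), @replacement nvarchar(4000))
returns nvarchar(max)
as external name RegExFunctions.[UserDefinedFunctions].RegExReplace
go
```

The go separators matter here because create function has to be the first statement in its batch.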

source: stackoverflow.com