Full-text retrieval fundamentals (2)

How to build an Index

1. Prepare the original Documents

file1: Researches of Chinese full-text search technologies based on word indexing is related to many fields.
file2: The Index Data Service provides basic full-text functions for storage and retrieval of terms and indexed summary documents.

2. Pass the Documents to the TOKENIZER

Split each Document into words
Remove punctuation
Remove stop words

Stop words in English include “the”, “a”, “this”…
After tokenization, the resulting Tokens are:
“Researches” “Chinese” “full” “text” “search” “technologies” “word” “indexing” “related” “many” “fields” “Index” “Data” “Service” “provides” “basic” “full” “text” “functions” “storage” “retrieval” “terms” “indexed” “summary” “documents”
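
The tokenizing step above can be sketched in a few lines of Python. This is a minimal illustration, not the tokenizer of any particular search library: the `tokenize` function and the `STOP_WORDS` set are hypothetical, and the stop-word list is chosen only so that the output matches the Token list shown above (which also drops “based”).

```python
import re

# Hypothetical stop-word list, picked to reproduce the Token list in the text.
STOP_WORDS = {"a", "and", "based", "for", "is", "of", "on", "the", "this", "to"}

def tokenize(text):
    """Split text into words, dropping punctuation and stop words."""
    # Any run of letters is a word; hyphens and periods act as separators.
    words = re.findall(r"[A-Za-z]+", text)
    return [w for w in words if w.lower() not in STOP_WORDS]

file1 = ("Researches of Chinese full-text search technologies "
         "based on word indexing is related to many fields.")
print(tokenize(file1))
```

Note that “full-text” becomes two Tokens, “full” and “text”, because the hyphen is treated as a separator.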

3. Pass the Tokens to the LINGUISTIC PROCESSOR

Convert to lowercase
Reduce words to their root form, e.g. “fields” to “field” (stemming)
Transform words to their base form, e.g. “indexed” to “index” (lemmatization)

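Stemming versus lemmatization:
- Same goal: both reduce a word to a base form
- Different approach:
  - Stemming trims the word, e.g. “fields” to “field”
  - Lemmatization transforms the word, e.g. “drove” to “drive”
- Different algorithm:
  - Stemming applies fixed suffix rules: remove “s”, replace “ing” with “e”, “ational” -> “ate”, “tional” -> “tion”
  - Lemmatization looks the word up in a dictionary of mappings, e.g. “drove” -> “drive”
- They are not mutually exclusive; they complement each other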
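
To make the contrast concrete, here is a rough sketch of both approaches. The suffix rules are the ones named in the list above; the `SUFFIX_RULES` and `LEMMA_DICT` tables and both function names are hypothetical, and a real stemmer (for example the Porter algorithm) has many more rules and conditions.

```python
# Rule-based stemming: rewrite the first matching suffix.
# Longer suffixes come first so "ational" wins over "tional".
SUFFIX_RULES = [("ational", "ate"), ("tional", "tion"), ("ing", "e"), ("s", "")]

def stem(word):
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word

# Dictionary-based lemmatization: look the word up in a mapping.
LEMMA_DICT = {"drove": "drive", "driven": "drive"}  # illustrative entries only

def lemmatize(word):
    return LEMMA_DICT.get(word, word)

print(stem("driving"), stem("relational"), stem("conditional"))
print(stem("drove"), lemmatize("drove"))
```

The suffix rules cannot turn “drove” into “drive”, while the dictionary can, which is why the two techniques are usually combined.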

The output of the linguistic processor is called a Term:
“researche” “chinese” “full” “text” “search” “technologie” “word” “index” “relate” “many” “field” “index” “data” “service” “provide” “basic” “full” “text” “function” “storage” “retrieve” “term” “index” “summary” “document”

Thanks to the linguistic processor, a search for “drove” can also find documents that contain “drive”.
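
As a sketch of the whole linguistic-processing step, the snippet below turns the Tokens of file1 into Terms. The `to_term` function and the `LEMMAS` table are assumptions made for illustration; the table is filled in only so that the result matches the Term list above, and the “strip a trailing s” rule is a deliberately crude stand-in for a real stemmer.

```python
# Hypothetical lookup table standing in for a lemmatization dictionary.
LEMMAS = {"indexing": "index", "indexed": "index",
          "related": "relate", "retrieval": "retrieve"}

def to_term(token):
    """Lowercase the token, then lemmatize it or strip a plural 's'."""
    word = token.lower()
    if word in LEMMAS:
        return LEMMAS[word]
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]  # crude stemming: "fields" -> "field"
    return word

tokens = ["Researches", "Chinese", "full", "text", "search", "technologies",
          "word", "indexing", "related", "many", "fields"]
terms = [to_term(t) for t in tokens]
print(terms)
```

The crude rule explains why the Term list contains forms such as “researche” and “technologie”: a Term only has to be consistent between indexing and searching, not a dictionary word.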

4. Pass the Terms to the INDEXER

Build a dictionary from the Terms

Term	Document ID
researche	1
chinese	1
full	1
text	1
search	1
technologie	1
word	1
index	1
relate	1
many	1
field	1
index	2
data	2
service	2
provide	2
basic	2
full	2
text	2
function	2
storage	2
retrieve	2
term	2
index	2
summary	2
document	2

Sort the dictionary alphabetically by Term

Term	Document ID
basic	2
chinese	1
data	2
document	2
field	1
full	1
full	2
function	2
index	1
index	2
index	2
many	1
provide	2
relate	1
researche	1
retrieve	2
search	1
service	2
storage	2
summary	2
technologie	1
term	2
text	1
text	2
word	1
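
The two tables above can be reproduced with a short sketch. The variable names (`terms_by_doc`, `pairs`) are hypothetical; the Terms are copied from the output of the linguistic processor.

```python
# Terms per document ID, as produced by the linguistic processor.
terms_by_doc = {
    1: ["researche", "chinese", "full", "text", "search", "technologie",
        "word", "index", "relate", "many", "field"],
    2: ["index", "data", "service", "provide", "basic", "full", "text",
        "function", "storage", "retrieve", "term", "index", "summary",
        "document"],
}

# Build the dictionary: one (term, document ID) row per occurrence.
pairs = [(term, doc_id)
         for doc_id, terms in terms_by_doc.items()
         for term in terms]

# Sort alphabetically by term; ties are ordered by document ID.
pairs.sort()

for term, doc_id in pairs:
    print(f"{term}\t{doc_id}")
```

Sorting tuples compares the Term first and the document ID second, so repeated Terms such as “index” end up next to each other, ready to be merged.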

Merge identical Terms into Posting Lists

Term-[Document Frequency] | DocumentID-Frequency
basic-1       2-1
chinese-1     1-1
data-1        2-1
document-1    2-1
field-1       1-1
full-2        1-1 -> 2-1
function-1    2-1
index-2       1-1 -> 2-2
many-1        1-1
provide-1     2-1
relate-1      1-1
researche-1   1-1
retrieve-1    2-1
search-1      1-1
service-1     2-1
storage-1     2-1
summary-1     2-1
technologie-1 1-1
term-1        2-1
text-2        1-1 -> 2-1
word-1        1-1

Document Frequency: the number of Documents that contain the Term
Frequency: the number of times the Term appears in a given Document
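
The merge step can be sketched as follows, assuming the input is the sorted list of (term, document ID) rows. `build_postings` is a hypothetical helper, and only an excerpt of the sorted table is used to keep the example short.

```python
from collections import Counter

def build_postings(sorted_pairs):
    """Merge (term, doc_id) rows into {term: [(doc_id, frequency), ...]}."""
    counts = {}
    for term, doc_id in sorted_pairs:
        # Count how many times the term occurs in each document.
        counts.setdefault(term, Counter())[doc_id] += 1
    # Each posting list is ordered by ascending document ID.
    return {term: sorted(c.items()) for term, c in counts.items()}

# Excerpt of the sorted dictionary.
pairs = [("full", 1), ("full", 2), ("index", 1), ("index", 2), ("index", 2),
         ("text", 1), ("text", 2), ("word", 1)]
postings = build_postings(pairs)

for term, plist in postings.items():
    doc_freq = len(plist)  # Document Frequency: documents containing the term
    chain = " -> ".join(f"{doc_id}-{freq}" for doc_id, freq in plist)
    print(f"{term}-{doc_freq}  {chain}")
```

Here the Document Frequency is simply the length of the posting list, and each entry carries the Term's Frequency within that Document.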

When searching, query words such as “drive”, “driving”, “drove” and “driven” go through the same processing as during index building, so they are all reduced to “drive”.
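
A minimal sketch of the query side, assuming the query word is normalized with the same mapping that was used at indexing time. The `LEMMAS` table, the tiny `index`, and its document IDs are all made up for illustration.

```python
# Hypothetical normalization table shared by indexing and searching.
LEMMAS = {"driving": "drive", "drove": "drive", "driven": "drive"}

def normalize(word):
    """Apply the same lowercase + lemma lookup used when indexing."""
    word = word.lower()
    return LEMMAS.get(word, word)

# Hypothetical index: Term -> list of document IDs (IDs are invented).
index = {"drive": [3, 7], "index": [1, 2]}

def search(query_word):
    """Return the document IDs whose Terms match the normalized query."""
    return index.get(normalize(query_word), [])

print(search("drove"))
print(search("Driven"))
```

Because both sides share `normalize`, every variant of the word lands on the same Term and therefore the same posting list.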
