How to build an Index
1. Prepare the original Documents
- file1: Researches of Chinese full-text search technologies based on word indexing is related to many fields.
- file2: The Index Data Service provides basic full-text functions for storage and retrieval of terms and indexed summary documents.
2. Pass the Documents to the TOKENIZER
- Split each Document into words
- Remove punctuation symbols
- Remove stop words, which in English are words such as “like”, “a”, “this”…
The tokenizer outputs the following Tokens:
“Researches” “Chinese” “full” “text” “search” “technologies” “word” “indexing” “related” “many” “fields” “Index” “Data” “Service” “provides” “basic” “full” “text” “functions” “storage” “retrieval” “terms” “indexed” “summary” “documents”
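
The tokenizer step can be sketched in a few lines of Python. This is only an illustration: the `STOP_WORDS` set and the `tokenize` function are hypothetical names, and the stop-word list was picked by hand so that the two sample documents yield the token list above. Real analyzers use much longer lists.

```python
import re

# Hypothetical stop-word list, chosen to match the sample documents only.
STOP_WORDS = {"of", "based", "on", "is", "to", "the", "for", "and"}

def tokenize(text):
    """Split text into words, dropping punctuation and stop words."""
    # Keep runs of letters only, so "full-text" becomes "full" and "text".
    words = re.findall(r"[A-Za-z]+", text)
    return [w for w in words if w.lower() not in STOP_WORDS]

file1 = ("Researches of Chinese full-text search technologies "
         "based on word indexing is related to many fields.")
print(tokenize(file1))
```

Note that the tokens keep their original case here; lowercasing is left to the next step.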
3. Pass the Tokens to the LINGUISTIC PROCESSOR
- Convert to lowercase
- Stemming: reduce words to their root form, e.g. “fields” to “field”
- Lemmatization: convert words to their base form, e.g. “indexed” to “index”
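The difference between stemming and lemmatization:
- Same goal: both bring a word back to a base form
- Difference in approach:
  - Stemming reduces a word by cutting parts off
  - Lemmatization converts a word into another form
- Difference in algorithm:
  - Stemming applies fixed suffix rules: delete “s”, “ing” -> “e”, “ational” -> “ate”, “tional” -> “tion”
  - Lemmatization relies on a dictionary of word forms, e.g. “drove” -> “drive”
- The two are not mutually exclusive; they complement each other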
The output of the linguistic processor is called a Term:
“researche” “chinese” “full” “text” “search” “technologie” “word” “index” “relate” “many” “field” “index” “data” “service” “provide” “basic” “full” “text” “function” “storage” “retrieve” “term” “index” “summary” “document”
Thanks to the linguistic processor, a search for “drove” also finds documents that contain “drive”.
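
As a rough sketch of this step, the Python below lowercases each token, looks it up in a small lemma table, and otherwise strips a trailing “s”. The `LEMMAS` table, `stem` and `to_term` are assumptions made for illustration; they are deliberately crude (which is why “researches” ends up as “researche”) and are not how a production stemmer or lemmatizer works.

```python
# Toy lemma table (an assumption for illustration): inflected form -> base form.
LEMMAS = {
    "indexing": "index", "indexed": "index",
    "related": "relate", "retrieval": "retrieve",
    "driving": "drive", "drove": "drive", "driven": "drive",
}

def stem(word):
    """Crude stemming: strip a single trailing 's' from longer words."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def to_term(token):
    """Lowercase, then lemmatize by lookup, else fall back to stemming."""
    word = token.lower()
    if word in LEMMAS:
        return LEMMAS[word]
    return stem(word)

tokens = ["Researches", "Chinese", "technologies", "indexing", "related", "fields"]
print([to_term(t) for t in tokens])
# ['researche', 'chinese', 'technologie', 'index', 'relate', 'field']
```

Trying the lookup first and the suffix rule second is one simple way to let the two techniques complement each other.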
4. Pass the Terms to the INDEXER
- Build a dictionary from the Terms
Term | Document ID |
---|---|
researche | 1 |
chinese | 1 |
full | 1 |
text | 1 |
search | 1 |
technologie | 1 |
word | 1 |
index | 1 |
relate | 1 |
many | 1 |
field | 1 |
index | 2 |
data | 2 |
service | 2 |
provide | 2 |
basic | 2 |
full | 2 |
text | 2 |
function | 2 |
storage | 2 |
retrieve | 2 |
term | 2 |
index | 2 |
summary | 2 |
document | 2 |
- Sort the table alphabetically by Term
Term | Document ID |
---|---|
basic | 2 |
chinese | 1 |
data | 2 |
document | 2 |
field | 1 |
full | 1 |
full | 2 |
function | 2 |
index | 1 |
index | 2 |
index | 2 |
many | 1 |
provide | 2 |
relate | 1 |
researche | 1 |
retrieve | 2 |
search | 1 |
service | 2 |
storage | 2 |
summary | 2 |
technologie | 1 |
term | 2 |
text | 1 |
text | 2 |
word | 1 |
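
The two tables above can be reproduced with a short Python sketch. The names `doc_terms` and `build_pairs` are hypothetical; the term lists are copied from the linguistic processor output, and each occurrence becomes one (term, document ID) pair.

```python
# Terms per document ID, copied from the linguistic processor output.
doc_terms = {
    1: ["researche", "chinese", "full", "text", "search", "technologie",
        "word", "index", "relate", "many", "field"],
    2: ["index", "data", "service", "provide", "basic", "full", "text",
        "function", "storage", "retrieve", "term", "index", "summary",
        "document"],
}

def build_pairs(doc_terms):
    """Return one (term, doc_id) pair per term occurrence."""
    return [(term, doc_id)
            for doc_id, terms in doc_terms.items()
            for term in terms]

pairs = build_pairs(doc_terms)   # the unsorted dictionary table
sorted_pairs = sorted(pairs)     # tuples sort by term, then by document ID

for term, doc_id in sorted_pairs[:5]:
    print(f"{term} | {doc_id} |")
```

Sorting the tuples places identical Terms next to each other, which is what makes the merge in the next step a single pass.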
- Merge identical Terms into Posting Lists
Term-[Document Frequency] | DocumentID-Frequency |
---|---|
basic-1 | 2-1 |
chinese-1 | 1-1 |
data-1 | 2-1 |
document-1 | 2-1 |
field-1 | 1-1 |
full-2 | 1-1 -> 2-1 |
function-1 | 2-1 |
index-2 | 1-1 -> 2-2 |
many-1 | 1-1 |
provide-1 | 2-1 |
relate-1 | 1-1 |
researche-1 | 1-1 |
retrieve-1 | 2-1 |
search-1 | 1-1 |
service-1 | 2-1 |
storage-1 | 2-1 |
summary-1 | 2-1 |
technologie-1 | 1-1 |
term-1 | 2-1 |
text-2 | 1-1 -> 2-1 |
word-1 | 1-1 |
- Document Frequency: the number of Documents that contain the Term
- Frequency: the number of times the Term appears in a given Document
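
The merge step, together with the two counts just defined, might look like the following in Python. `build_postings` and `sample_pairs` are hypothetical names, and the sketch assumes each posting list is kept in document ID order. The Document Frequency is simply the length of a Term's posting list.

```python
from collections import Counter, defaultdict

def build_postings(pairs):
    """Merge (term, doc_id) pairs into {term: [(doc_id, frequency), ...]}."""
    counts = Counter(pairs)  # occurrences of each (term, doc_id) pair
    postings = defaultdict(list)
    for (term, doc_id), freq in sorted(counts.items()):
        postings[term].append((doc_id, freq))
    return dict(postings)

# A few pairs taken from the sorted table, enough to show the merge.
sample_pairs = [("full", 1), ("full", 2),
                ("index", 1), ("index", 2), ("index", 2),
                ("word", 1)]

postings = build_postings(sample_pairs)
for term, plist in postings.items():
    document_frequency = len(plist)
    chain = " -> ".join(f"{doc_id}-{freq}" for doc_id, freq in plist)
    print(f"{term}-{document_frequency} {chain}")
# full-2 1-1 -> 2-1
# index-2 1-1 -> 2-2
# word-1 1-1
```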
At search time, the query words “drive”, “driving”, “drove” and “driven” go through the same linguistic processing used when building the Index, so each of them becomes “drive”.
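
A minimal sketch of that lookup, assuming a toy `normalize` function as a stand-in for the linguistic processor and a tiny hand-made `index` whose document IDs are invented for the example:

```python
def normalize(word):
    """Hypothetical stand-in for the linguistic processor."""
    lemmas = {"driving": "drive", "drove": "drive", "driven": "drive"}
    word = word.lower()
    return lemmas.get(word, word)

# Hand-made index for illustration: term -> [(doc_id, frequency), ...].
index = {"drive": [(3, 1), (7, 2)], "car": [(3, 1)]}

def search(query_word, index):
    """Return IDs of documents whose posting list matches the normalized query."""
    return [doc_id for doc_id, _freq in index.get(normalize(query_word), [])]

for query in ["drive", "driving", "drove", "driven"]:
    print(query, search(query, index))  # each query prints [3, 7]
```

The key design point is that the query and the indexed text pass through the same normalization; if they did not, “drove” would never match a Term stored as “drive”.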