INDEXER:
It is a word which the system evaluates to be in a high-priority category using position, frequency and other factors.
INTERESTING WORD:
It is a narrower category of words, containing the words which the system, after training, will present as a clickable index in the slave window.
1. Obtain training document (Display in master window).
2. Identify individual text words.
3. Use stop list to delete common words.
4. Use suffix stripping algorithms.
5. Identify the retrieved words as relevant (score=1)
and non-relevant (score=0) to the user.
6. Compute term weights of relevant words using the prescribed formula.
7. Place words and weights in the user preferences file (u.p.f.).
8. Obtain new document and repeat steps 2-4.
9. Place the keywords in a document vector.
10. Find similarity coefficient of u.p.f. and document vector.
11. Find weights of newly found terms.
12. Reformulate contents of u.p.f. using relevance feedback formula.
13. Create a clickable index of "interesting words" which are
the contents of u.p.f. (Display in slave window).
14. Return to step 8.
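Steps 2-4 of the procedure above can be sketched as follows. This is only a minimal illustration: the stop list and suffix list are hypothetical placeholders, the suffix stripper is a crude stand-in rather than the algorithm INDEXER actually uses, and the function names (tokenize, strip_suffix, extract_terms) are assumptions made for this example.

```python
import re

# Hypothetical stop list; a real system would load a much longer one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "for", "on"}

# Hypothetical suffix list, checked in this order (longest first).
# This is a crude stand-in for a real suffix stripping algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Step 2: split text into lowercase alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def strip_suffix(word):
    """Step 4: remove one known suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_terms(text):
    """Steps 2-4: tokenize, drop stop words, then strip suffixes."""
    return [strip_suffix(w) for w in tokenize(text) if w not in STOP_WORDS]
```

For example, extract_terms("The system is indexing the documents") keeps "system", reduces "indexing" to "index" and "documents" to "document", and drops the stop words.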
The term weight in step 6 is computed as
    w(i,j) = t(i,j) * log(N/d(i))
where
    w(i,j) = weight of the ith word in the jth document
    t(i,j) = frequency of the ith term in the jth document
    N      = number of documents evaluated
    d(i)   = number of documents in which word i appears
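The weight formula w(i,j) = t(i,j)*log(N/d(i)) can be illustrated with a short sketch. It assumes each document has already been reduced to a list of terms, and it assumes the natural logarithm, since the base is not stated. The function name term_weights and the dictionary layout are hypothetical.

```python
import math
from collections import Counter


def term_weights(documents):
    """Return one {term: weight} dict per document.

    documents is a list of term lists, one list per document.
    """
    n_docs = len(documents)              # N
    doc_freq = Counter()                 # d(i): documents containing term i
    for terms in documents:
        doc_freq.update(set(terms))      # count each term once per document

    weights = []
    for terms in documents:
        term_freq = Counter(terms)       # t(i,j): occurrences in this document
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return weights
```

Note that a word appearing in every evaluated document receives weight zero, because log(N/N) = 0; only words that discriminate between documents carry weight.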
The preference vector is one whose elements are the weights of the relevant words placed in order.
P = w(i,j) for all i, j
Once the system is trained to satisfaction, a new document is retrieved and its keywords are determined. The weights of the new words, d(i,j), found using the above formula are placed in a document vector.
D = d(i,j) for all i,j
The similarity coefficient of the preference vector P and the document vector D is then computed, the contents of the user preferences file are reformulated using the relevance feedback formula, and the system adjusts its parameters and makes a more precise list of "interesting" words. The system undergoes training till it can identify the "interesting" words on its own for the rest of the documents. It then displays a clickable index of "interesting" words for each of the documents.
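As an illustration of a similarity coefficient between the preference vector P and a document vector D, the sketch below assumes the cosine measure. This is one common choice, not necessarily the coefficient INDEXER uses, which is not stated here. The function name cosine_similarity and the representation of each vector as a {term: weight} dictionary are hypothetical.

```python
import math


def cosine_similarity(preferences, document):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    shared = set(preferences) & set(document)
    dot = sum(preferences[t] * document[t] for t in shared)
    norm_p = math.sqrt(sum(w * w for w in preferences.values()))
    norm_d = math.sqrt(sum(w * w for w in document.values()))
    if norm_p == 0 or norm_d == 0:
        return 0.0  # an empty or all-zero vector matches nothing
    return dot / (norm_p * norm_d)
```

Under this assumption, a document sharing no weighted terms with the preferences scores 0, and a document whose weights are proportional to the preferences scores 1.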
Future Work
1. Integrating learning capacity into INDEXER.
2. "Stemming of words" will be included as part of INDEXER and will be based on
the heuristics used by the 'SMART' system.
3. It would be nice to have a pop up window to set up selection heuristics
and display learning. But for now it will be done transparently.
4. The implementation of more advanced visual techniques to choose keywords
(TAU system, Swaminathan 1993) can be added as a further development but
may not be included in this project.
5. "Synonym analysis" can also be added as a feature of the system.