TF-IDF
Definition
- TF: Term Frequency.
- IDF: Inverse Document Frequency.
Each document (here, a sentence) is split into words, which serve as the terms.
Terms
- Capital (D): a corpus.
- Subscripted (D_i) or lowercase (d): the i-th document in the corpus.
TF
https://en.wikipedia.org/wiki/Tf–idf
This quantity does not depend on the corpus; it is defined separately for each document.
Raw count
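There are several definitions of “term frequency.”
The simplest, and the most meaningful for NLP, is the frequency form: the raw count of the term t in the document d, divided by the total number of terms in d.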
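The frequency form can be sketched in a few lines. This is only an illustration: it assumes a document is tokenized by lowercasing and splitting on whitespace, and the helper name `term_frequency` is hypothetical.

```python
from collections import Counter

def term_frequency(document):
    """Return {term: raw count / total number of terms} for one document."""
    words = document.lower().split()  # assumed tokenization: whitespace split
    counts = Counter(words)
    total = len(words)
    return {term: count / total for term, count in counts.items()}

tf = term_frequency("the cat sat on the mat")
print(tf["the"])  # "the" is 2 of the 6 words
```

Only the document itself is needed here; no other document in the corpus enters the computation.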
Logarithmically scaled
There is also a logarithmically scaled version, for example log(1 + f), where f is the raw count of the term in the document.
The preferred base of the logarithm is 2 or 10.
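A minimal sketch of the logarithmically scaled version, assuming the log(1 + raw count) variant and whitespace tokenization. The function name `log_scaled_tf` and its `base` parameter are hypothetical, chosen for this illustration.

```python
import math
from collections import Counter

def log_scaled_tf(document, base=10):
    """Return {term: log(1 + raw count)} in the given logarithm base."""
    counts = Counter(document.lower().split())
    return {term: math.log(1 + count, base) for term, count in counts.items()}

scaled = log_scaled_tf("the cat sat on the mat", base=2)
print(scaled["the"], scaled["cat"])  # log2(3) and log2(2)
```

Changing the base only rescales every value by the same constant, so it does not change the ranking of terms within a document.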
IDF
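There are also several definitions of IDF. A common one is idf(t, D) = log(N / n_t), where N is the number of documents in the corpus D and n_t is the number of documents that contain the word t.
When the word t is rare, its IDF becomes high.
In scikit-learn, the default (TfidfTransformer with smooth_idf=True) is idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents that contain t.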
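The two variants can be compared with a small pure-Python sketch. It does not call scikit-learn; `idf_smooth` only mirrors the smoothed formula that scikit-learn documents for `smooth_idf=True`. All function names and the toy corpus are hypothetical.

```python
import math

def document_frequency(term, corpus):
    """Number of documents in the corpus that contain the term."""
    return sum(1 for doc in corpus if term in doc.lower().split())

def idf_classic(term, corpus):
    """log(N / df); assumes the term occurs in at least one document."""
    return math.log(len(corpus) / document_frequency(term, corpus))

def idf_smooth(term, corpus):
    """ln((1 + N) / (1 + df)) + 1, the smoothed form."""
    n = len(corpus)
    return math.log((1 + n) / (1 + document_frequency(term, corpus))) + 1

corpus = ["the cat sat", "the dog ran", "the cat ran home"]
for word in ("the", "cat", "home"):
    print(word, idf_classic(word, corpus), idf_smooth(word, corpus))
```

In both variants the rare word "home" scores higher than "cat", which in turn scores higher than "the". The smoothed form never divides by zero and never returns zero, so a word that appears in every document is still kept in the vector.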
TFIDF
TF times IDF measures how important a word is to a document within the corpus.
We can assign an importance vector (a TF-IDF vector) to each document in the corpus.
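As a rough sketch of how such vectors could be built, the code below multiplies the frequency-form TF by log(N / df) for every word in a shared vocabulary. It assumes non-empty, whitespace-tokenized documents; `tfidf_vectors` and the toy corpus are hypothetical names for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(corpus):
    """Return (vocabulary, one TF-IDF vector per document)."""
    tokenized = [doc.lower().split() for doc in corpus]
    vocabulary = sorted({word for words in tokenized for word in words})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(word for words in tokenized for word in set(words))
    vectors = []
    for words in tokenized:
        counts = Counter(words)
        vectors.append([
            (counts[term] / len(words)) * math.log(n_docs / df[term])
            for term in vocabulary
        ])
    return vocabulary, vectors

corpus = ["the cat sat", "the dog ran", "the cat ran home"]
vocabulary, vectors = tfidf_vectors(corpus)
for doc, vector in zip(corpus, vectors):
    print(doc, [round(value, 3) for value in vector])
```

Every vector has one entry per vocabulary word, so the documents become points in the same space. The numbers will not match scikit-learn's `TfidfVectorizer`, which by default uses raw counts for TF, the smoothed IDF, and L2 normalization.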
Next
Categorization with an SVM.