TF-IDF
Definition
- TF: Term Frequency.
- IDF: Inverse Document Frequency
Each document (e.g. a sentence) is split into words.
Terms
- Capital $D$: A corpus.
- Subscripted $D_i$ or lowercase $d_i$: the $i$-th document in the corpus.
TF
https://en.wikipedia.org/wiki/Tf–idf
This quantity is independent of the rest of the corpus (it is defined for each document separately).
Raw count
There are several definitions of “term frequency.” The raw count is
$$ f_{t,d_i} := \mathrm{the\ number\ of\ occurrences\ of\ the\ word\ }t\ \mathrm{in\ the\ document\ } d_i $$
The simplest and most “meaningful” form for NLP is the relative frequency,
$$ \mathrm{tf}(t,d_i) = \frac{f_{t,d_i}} {\mathrm{the\ total\ number\ of\ words\ in\ the\ document\ }d_i} $$
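The relative-frequency form above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `term_frequency` and the lowercase, whitespace-based tokenization are assumptions made for the example.

```python
from collections import Counter


def term_frequency(document):
    """Relative term frequency: raw count divided by the document length."""
    # Assumption: naive tokenization by lowercasing and splitting on whitespace.
    words = document.lower().split()
    counts = Counter(words)  # raw counts f_{t,d}
    total = len(words)       # total number of words in the document
    return {term: count / total for term, count in counts.items()}


tf = term_frequency("the cat sat on the mat")
print(tf["the"])  # "the" occurs 2 times out of 6 words
```

Because the counts are divided by the document length, the values of one document sum to 1, which makes documents of different lengths comparable.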
Logarithmically scaled
There is also a logarithmically scaled version,
$$ \mathrm{tf}(t,d_i) = \log(1+f_{t,d_i}) $$
The base of the logarithm is usually 2 or 10; changing the base only rescales every value by a constant factor.
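As a rough sketch of the logarithmic scaling, the helper below applies $\log(1+f)$ to a raw count. The name `log_scaled_tf` and the default base of 10 are choices made for this example only.

```python
import math


def log_scaled_tf(raw_count, base=10):
    """Logarithmically scaled term frequency: log(1 + f) in the given base."""
    return math.log(1 + raw_count, base)


# A word that never occurs gets 0; growth slows as the count increases.
for count in (0, 1, 9, 99):
    print(count, log_scaled_tf(count))
```

The point of the scaling is that a word occurring 100 times is not treated as 100 times more important than a word occurring once.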
IDF
There are also several definitions of IDF. The basic one is
$$ \mathrm{idf}(t,D) = \log\left(\frac{\mathrm{the\ total\ number\ of\ documents\ in\ the\ corpus\ } D}{\mathrm{the\ number\ of\ documents\ which\ contain\ the\ word\ }t}\right) $$
When the word $t$ is rare in the corpus, its IDF becomes high. A word that appears in every document has an IDF of 0.
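In scikit-learn (TfidfTransformer with smooth_idf=False), the natural logarithm is used and 1 is added, so that words appearing in every document are not ignored entirely:
$$ \mathrm{idf}(t,D) = 1+\log\left(\frac{\mathrm{the\ total\ number\ of\ documents\ in\ the\ corpus\ } D}{\mathrm{the\ number\ of\ documents\ which\ contain\ the\ word\ }t}\right) $$
With the default smooth_idf=True, 1 is also added to both the numerator and the denominator inside the logarithm.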
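The two IDF formulas above can be compared with a small pure-Python sketch. The function name `inverse_document_frequency`, the `add_one` flag, and the use of the natural logarithm are assumptions for illustration; this does not call scikit-learn.

```python
import math


def inverse_document_frequency(term, corpus, add_one=False):
    """IDF of `term` over `corpus`, given as a list of tokenized documents.

    With add_one=True, 1 is added to the logarithm (the second formula above).
    """
    n_docs = len(corpus)
    n_containing = sum(1 for doc in corpus if term in doc)
    if n_containing == 0:
        # The ratio is undefined for a word that appears in no document.
        raise ValueError(f"term {term!r} does not occur in the corpus")
    idf = math.log(n_docs / n_containing)  # natural logarithm (assumption)
    return idf + 1 if add_one else idf


corpus = [["the", "cat"], ["the", "dog"], ["the", "bird"]]
print(inverse_document_frequency("the", corpus))   # in every document
print(inverse_document_frequency("bird", corpus))  # in one document only
```

A word found in every document gets 0 under the basic formula and 1 under the variant with the added constant, so it still contributes to the product with TF.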
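TF-IDF
The product of TF and IDF measures how important a word is to a document within the corpus:
$$ \mathrm{tfidf}(t,d_i,D) = \mathrm{tf}(t,d_i) \cdot \mathrm{idf}(t,D) $$
The value is high when the word is frequent in the document but rare in the corpus.
Collecting these values for every word in the vocabulary, we can assign an importance vector (TF-IDF vector) to each document in the corpus.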
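To make the idea of a per-document importance vector concrete, here is a small sketch that builds one vector per document from the relative-frequency TF and the basic IDF. The function name `tfidf_vectors`, the whitespace tokenization, and the natural logarithm are assumptions for this example; library implementations such as scikit-learn use different smoothing and normalization.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, vectors): one TF-IDF vector per document."""
    # Assumption: naive tokenization by lowercasing and splitting on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for doc in tokenized for word in doc})
    n_docs = len(tokenized)

    # IDF: log(total documents / documents containing the word).
    idf = {}
    for term in vocabulary:
        n_containing = sum(1 for doc in tokenized if term in doc)
        idf[term] = math.log(n_docs / n_containing)

    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        total = max(len(doc), 1)  # avoid division by zero for empty documents
        vectors.append([counts[term] / total * idf[term] for term in vocabulary])
    return vocabulary, vectors


vocabulary, vectors = tfidf_vectors(["the cat sat", "the dog sat", "the bird flew"])
for vector in vectors:
    print([round(value, 3) for value in vector])
```

In this toy corpus, "the" appears in every document and so gets weight 0 everywhere, while words unique to a single document get the largest weights.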
Next
Text classification with an SVM.