TF-IDF


Definition

  • TF: Term Frequency.
  • IDF: Inverse Document Frequency.

A document (for example, a sentence) is split into words.

Terms

  • Capital $D$: A corpus.
  • Subscripted $D_i$ or lowercase $d_i$: the i-th document in the corpus.

TF

https://en.wikipedia.org/wiki/Tf–idf

TF does not depend on the rest of the corpus; it is defined for each document separately.

Raw count

There are several definitions of “term frequency.”

$$ f_{t,d_i} := \mathrm{the\ number\ of\ occurrences\ of\ the\ word\ }t\ \mathrm{in\ the\ document\ } d_i $$

The simplest and most “meaningful” form for NLP is the relative frequency:

$$ \mathrm{tf}(t,d_i) = \frac{f_{t,d_i}} {\mathrm{the\ total\ number\ of\ words\ in\ a\ document\ }d_i} $$
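
The raw count and the frequency form can be sketched in a few lines of Python. This is only an illustration: the function names raw_counts and term_frequency are made up for this page, and splitting on whitespace after lowercasing is an assumed tokenization, not part of the definition.

```python
from collections import Counter


def raw_counts(document):
    """Return f_{t,d}: how often each word t occurs in the document."""
    # Assumption: words are lowercased and separated by whitespace.
    return Counter(document.lower().split())


def term_frequency(document):
    """Return tf(t, d) = f_{t,d} / (total number of words in d)."""
    counts = raw_counts(document)
    total = sum(counts.values())
    # An empty document yields an empty dictionary.
    return {term: count / total for term, count in counts.items()}


tf = term_frequency("the cat sat on the mat")
print(tf["the"])  # "the" occurs 2 times out of 6 words
```

With this form, the values for one document always sum to 1, so documents of different lengths become comparable.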

Logarithmically scaled

There is also a logarithmically scaled version:

$$ \mathrm{tf}(t,d) = \log(1+f_{t,d}) $$

The preferred base of the logarithm is 2 or 10.
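
A minimal sketch of the logarithmic scaling, under the same assumed whitespace tokenization. The name log_tf and the default base of 2 are choices made for this example only.

```python
import math
from collections import Counter


def log_tf(document, base=2):
    """Return tf(t, d) = log(1 + f_{t,d}) for every word t in the document."""
    # Assumption: words are lowercased and separated by whitespace.
    counts = Counter(document.lower().split())
    return {term: math.log(1 + count, base) for term, count in counts.items()}


scores = log_tf("the cat sat on the mat")
print(scores["cat"], scores["the"])
```

Doubling the raw count of a word raises its score by much less than a factor of two, which dampens the effect of very frequent words.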

IDF

There are also several definitions of IDF. A common one is:

$$ \mathrm{idf}(t,D) = \log\left(\frac{\mathrm{the\ total\ number\ of\ documents\ in\ the\ corpus\ } D}{\mathrm{the\ number\ of\ documents\ which\ contain\ the\ word\ }t}\right) $$

When the word $t$ is rare (it appears in only a few documents), its IDF becomes high. A word that appears in every document has an IDF of $\log(1) = 0$.
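
The following sketch computes this IDF for a toy corpus. The function name idf, the natural logarithm, and the whitespace tokenization are assumptions for illustration; a word that appears in no document is rejected, because the ratio would divide by zero.

```python
import math


def idf(term, corpus):
    """Return log(N / n_t): N documents in total, n_t of them contain the term."""
    n_docs = len(corpus)
    # Assumption: words are lowercased and separated by whitespace.
    n_containing = sum(1 for doc in corpus if term in doc.lower().split())
    if n_containing == 0:
        raise ValueError(f"{term!r} does not occur in the corpus")
    return math.log(n_docs / n_containing)


corpus = ["the cat sat", "the dog barked", "the bird sang"]
print(idf("the", corpus))  # present in every document
print(idf("cat", corpus))  # present in one document out of three
```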

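In scikit-learn (with smooth_idf=False), one is added to the natural logarithm, so that words appearing in every document are not ignored entirely:

$$ \mathrm{idf}(t,D) = 1+\log\left(\frac{\mathrm{the\ total\ number\ of\ documents\ in\ the\ corpus\ } D}{\mathrm{the\ number\ of\ documents\ which\ contain\ the\ word\ }t}\right) $$

With the default smooth_idf=True, one is also added to both the numerator and the denominator inside the logarithm.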
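
To compare the plus-one form with the smoothed form that scikit-learn documents for smooth_idf=True, here is a small pure-Python sketch. It does not call scikit-learn; the function names idf_plus_one and idf_smoothed are hypothetical and only restate the two formulas with the natural logarithm.

```python
import math


def idf_plus_one(n_docs, doc_freq):
    """1 + ln(N / n_t), the form documented for smooth_idf=False."""
    return 1.0 + math.log(n_docs / doc_freq)


def idf_smoothed(n_docs, doc_freq):
    """1 + ln((1 + N) / (1 + n_t)), the form documented for smooth_idf=True."""
    return 1.0 + math.log((1 + n_docs) / (1 + doc_freq))


# A word found in 1 of 3 documents, then a word found in all 3.
print(idf_plus_one(3, 1), idf_smoothed(3, 1))
print(idf_plus_one(3, 3), idf_smoothed(3, 3))
```

In both forms a word that occurs in every document gets an IDF of 1 rather than 0, so its TF still contributes to the final score.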

TF-IDF

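The product of TF and IDF measures how important a word is to a document within the corpus:

$$ \mathrm{tfidf}(t,d_i,D) = \mathrm{tf}(t,d_i) \cdot \mathrm{idf}(t,D) $$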

We can therefore assign an importance vector (TF-IDF vector) to each document in the corpus, with one component per word in the vocabulary.
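
As a rough sketch of how such vectors could be built, the code below combines the frequency form of TF with the plain logarithmic IDF. The function name tfidf_vectors, the alphabetical ordering of the vocabulary, and the whitespace tokenization are assumptions of this example, not a reference implementation.

```python
import math
from collections import Counter


def tfidf_vectors(corpus):
    """Return (vocabulary, vectors) with one TF-IDF vector per document."""
    # Assumption: words are lowercased and separated by whitespace.
    tokenized = [doc.lower().split() for doc in corpus]
    vocabulary = sorted({term for doc in tokenized for term in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))

    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        total = len(doc)
        vectors.append([
            (counts[term] / total) * math.log(n_docs / doc_freq[term]) if total else 0.0
            for term in vocabulary
        ])
    return vocabulary, vectors


vocabulary, vectors = tfidf_vectors(["the cat sat", "the dog barked"])
for term, weight in zip(vocabulary, vectors[0]):
    print(term, round(weight, 3))
```

In this toy corpus the word "the" gets weight 0 in both documents because it appears everywhere, while words unique to one document get a positive weight there and 0 in the other.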

Next

Categorization with an SVM.