TF-IDF


Definition

  • TF: Term Frequency.
  • IDF: Inverse Document Frequency.

A document (e.g. a sentence) is treated as a sequence of words.

Terms

  • Capital D: a corpus (a collection of documents).
  • Subscripted D_i or lower-case d_i: the i-th document in the corpus.

TF

https://en.wikipedia.org/wiki/Tf–idf

This quantity does not depend on the corpus; it is defined per document.

Raw count

There are several definitions of “term frequency.”

f_{t,d_i} := the number of occurrences of the term t in the document d_i

The simplest form that is meaningful for NLP is the relative-frequency form,

tf(t, d_i) = f_{t,d_i} / (the total number of words in the document d_i)
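As a minimal sketch, the relative-frequency TF above can be computed with a word counter (the tokenization by whitespace is an assumption for illustration):

```python
from collections import Counter

def tf(term, document_tokens):
    """Relative-frequency TF: raw count of the term divided by document length."""
    counts = Counter(document_tokens)
    return counts[term] / len(document_tokens)

doc = "the cat sat on the mat".split()
print(tf("the", doc))  # 2/6 ≈ 0.333
```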

Logarithmically scaled

There is also a logarithmically scaled version,

tf(t, d) = log(1 + f_{t,d})

The preferred bases of the logarithm are 2 and 10.

IDF

There are also several definitions; a common one is

idf(t, D) = log( (the total number of documents in the corpus D) / (the number of documents that contain the term t) )

When the term t is rare, its IDF becomes high.
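A minimal sketch of this IDF over a toy corpus (a list of tokenized documents). It assumes the term appears in at least one document; otherwise the ratio is undefined:

```python
import math

def idf(term, corpus):
    """IDF = log(N / n_t), where N is the corpus size and
    n_t is the number of documents containing the term.
    Assumes the term occurs in at least one document."""
    n_docs = len(corpus)
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(n_docs / n_containing)

corpus = [["the", "cat"], ["the", "dog"], ["a", "fish"]]
# "fish" is rarer than "the", so its IDF is higher.
print(idf("fish", corpus), idf("the", corpus))
```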

In scikit-learn (TfidfTransformer with smooth_idf=False), the natural logarithm is used and 1 is added:

idf(t, D) = 1 + log( (the total number of documents in the corpus D) / (the number of documents that contain the term t) )

TFIDF

The product TF × IDF measures how important a term is to a document within the corpus.

We can assign an importance vector (the TF-IDF vector) to each document in the corpus.
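Putting TF and IDF together, a sketch that builds the TF-IDF vector of a document over a fixed vocabulary (the relative-frequency TF and the plain log IDF from above; terms absent from the corpus get weight 0):

```python
import math
from collections import Counter

def tfidf_vector(document, corpus, vocabulary):
    """TF-IDF vector of one document: one weight per vocabulary term."""
    counts = Counter(document)
    n = len(corpus)
    vec = []
    for term in vocabulary:
        tf = counts[term] / len(document)
        df = sum(1 for d in corpus if term in d)
        idf = math.log(n / df) if df else 0.0  # guard against unseen terms
        vec.append(tf * idf)
    return vec

corpus = [["the", "cat"], ["the", "dog"], ["a", "fish"]]
vocab = sorted({t for d in corpus for t in d})
print(tfidf_vector(corpus[0], corpus, vocab))
```

The common "cat"-vs-"the" pattern shows up directly: "the" appears in two of the three documents, so its weight in the first document is lower than that of "cat", which appears in only one.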

For next

Text categorization with an SVM.