Article Directory
- Preface
- 1.1 The concept of the TF-IDF algorithm
- 1.1.1 TF
- 1.1.2 IDF
- 1.1.3 TF-IDF
- 1.2 Code implementation of the TF-IDF algorithm
- 1.2.1 Implementing the TF-IDF algorithm in Python
- 1.2.2 Implementing the TF-IDF algorithm with sklearn
- 1.3 Summary
Preface
This article introduces the TF-IDF algorithm and its implementation in Python.
1.1 The concept of the TF-IDF algorithm
TF-IDF (Term Frequency - Inverse Document Frequency) is a weighting technique commonly used in information retrieval and data mining, often to extract keywords from articles. It is a statistical method for evaluating how important a word is to a document in a collection or corpus.
TF-IDF consists of two parts, TF and IDF, which are explained below.
1.1.1 TF
TF (Term Frequency) is the number of times a word appears in a document, or that count expressed as a frequency. If a word appears many times in a document, it is likely to be an important word; stop words, of course, need to be excluded. The calculation formula is as follows:
$$\text{Term Frequency (TF)} = \text{the number of times a word appears in the document} \tag{1}$$
Since documents have different lengths, the word frequency is normalized to make different documents comparable. The calculation formula is as follows:
$$\text{Term Frequency (TF)} = \frac{\text{the number of times a word appears in the document}}{\text{the total number of words in the document}} \tag{2}$$
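As a minimal sketch of formula (2) (the example sentence here is made up for illustration):

# Compute normalized term frequency, i.e. formula (2).
sentence = "what is for dinner tonight"
terms = sentence.split()
tf = {term: terms.count(term) / len(terms) for term in set(terms)}
print(tf)  # each word appears once among 5 words, so every TF is 0.2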
1.1.2 IDF
IDF (Inverse Document Frequency) measures a word's "weight" across the corpus. If a word appears in only a few documents, it is a relatively rare word, and its IDF value is larger. The calculation formula is as follows:
$$\text{Inverse Document Frequency (IDF)} = \log\left(\frac{\text{total number of documents in the corpus}}{\text{number of documents containing the word} + 1}\right) \tag{3}$$
The more common a word is, the larger the denominator and the smaller its inverse document frequency, which approaches 0 (with this smoothing it can even turn slightly negative when a word appears in every document). The reason for adding 1 to the denominator is to avoid a denominator of 0 (i.e., no document contains the word); this is a smoothing technique.
Note: different libraries use slightly different smoothing methods when implementing IDF.
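For example, scikit-learn's TfidfVectorizer (with its default smooth_idf=True) adds 1 to both the numerator and the denominator, as if an extra document containing every word had been added to the corpus, and then adds 1 to the logarithm:

$$\text{IDF}(t) = \ln\left(\frac{1 + n}{1 + \text{df}(t)}\right) + 1$$

where $n$ is the total number of documents and $\text{df}(t)$ is the number of documents containing the word $t$; the resulting document vectors are also L2-normalized by default. This is why the manual implementation in Section 1.2.1 and sklearn produce different numbers for the same corpus.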
1.1.3 TF-IDF
Multiplying TF and IDF gives TF-IDF. The calculation formula is as follows:
$$\text{TF-IDF} = \text{Term Frequency (TF)} \times \text{Inverse Document Frequency (IDF)} \tag{4}$$
The importance of a word increases with the number of times it appears in the document, but decreases with the number of documents in the corpus that contain it. This weighting effectively suppresses the influence of common words and improves the correlation between the extracted keywords and the article.
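As a quick worked example (the numbers are made up for illustration): suppose a document contains 100 words and the word "cow" appears 3 times, so TF = 3/100 = 0.03. If the corpus contains 10,000,000 documents and "cow" appears in 1,000 of them, then IDF = log(10,000,000 / (1,000 + 1)) ≈ 4 (using the base-10 logarithm), and TF-IDF = 0.03 × 4 = 0.12.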
1.2 Code implementation of the TF-IDF algorithm
1.2.1 Implementing the TF-IDF algorithm in Python
The TF-IDF algorithm can be implemented by hand in Python; the specific code is as follows:
import math

class TfIdf:
    def __init__(self):
        self.num_docs = 0
        self.vocab = {}

    def add_corpus(self, corpus):
        self._merge_corpus(corpus)
        tfidf_list = []
        for sentence in corpus:
            tfidf_list.append(self.get_tfidf(sentence))
        return tfidf_list

    def _merge_corpus(self, corpus):
        """
        Scan the corpus, build the vocabulary, and count the number of documents containing each word.
        """
        self.num_docs = len(corpus)
        for sentence in corpus:
            words = sentence.strip().split()
            words = set(words)
            for word in words:
                self.vocab[word] = self.vocab.get(word, 0.0) + 1.0

    def _get_idf(self, term):
        """
        Calculate the IDF value of a term, following formula (3).
        """
        return math.log(self.num_docs / (self.vocab.get(term, 0.0) + 1.0))

    def get_tfidf(self, sentence):
        tfidf = {}
        terms = sentence.strip().split()
        terms_set = set(terms)
        num_terms = len(terms)
        for term in terms_set:
            # Calculate the TF value, following formula (2)
            tf = float(terms.count(term)) / num_terms
            # Calculate the IDF value. In practice, the IDF of every word can be precomputed once and reused.
            idf = self._get_idf(term)
            # Calculate the TF-IDF value
            tfidf[term] = tf * idf
        return tfidf

corpus = [
    "What is the weather like today",
    "what is for dinner tonight",
    "this is question worth pondering",
    "it is a beautiful day today"
]

tfidf = TfIdf()
tfidf_values = tfidf.add_corpus(corpus)
for tfidf_value in tfidf_values:
    print(tfidf_value)
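Running the code prints one dictionary per document. Because the terms are iterated over a set, the key order may vary from run to run; for the first document the values come out approximately as:

{'What': 0.1155, 'the': 0.1155, 'weather': 0.1155, 'like': 0.1155, 'today': 0.0479, 'is': -0.0372}

Note that "is" receives a negative weight: it appears in all 4 documents, so its IDF is ln(4 / (4 + 1)) < 0. This is a side effect of the +1 smoothing in the denominator, and one of the places where implementations diverge (see the note in Section 1.1.2).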
1.2.2 Implementing the TF-IDF algorithm with sklearn
When implementing TF-IDF with sklearn, you need the TfidfVectorizer class; the specific code is as follows:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "What is the weather like today",
    "what is for dinner tonight",
    "this is question worth pondering",
    "it is a beautiful day today"
]

tfidf_vec = TfidfVectorizer()
# Use fit_transform() to obtain the TF-IDF matrix
tfidf_matrix = tfidf_vec.fit_transform(corpus)
print(tfidf_matrix)
# Use get_feature_names() to get the unique words
# (removed in scikit-learn 1.2; use get_feature_names_out() there instead)
print(tfidf_vec.get_feature_names())
# Get the ID corresponding to each word
print(tfidf_vec.vocabulary_)
The following information will be output:
(0, 11) 0.3710221459250386
(0, 6) 0.47059454669821993
(0, 13) 0.47059454669821993
(0, 9) 0.47059454669821993
(0, 4) 0.24557575678403082
(0, 14) 0.3710221459250386
(1, 12) 0.506765426545092
(1, 2) 0.506765426545092
(1, 3) 0.506765426545092
(1, 4) 0.2644512224141842
(1, 14) 0.3995396830595886
(2, 7) 0.4838025881780501
(2, 15) 0.4838025881780501
(2, 8) 0.4838025881780501
(2, 10) 0.4838025881780501
(2, 4) 0.25246826075544676
(3, 1) 0.506765426545092
(3, 0) 0.506765426545092
(3, 5) 0.506765426545092
(3, 11) 0.3995396830595886
(3, 4) 0.2644512224141842
['beautiful', 'day', 'dinner', 'for', 'is', 'it', 'like', 'pondering', 'question', 'the', 'this', 'today', 'tonight', 'weather', 'what', 'worth']
{'what': 14, 'is': 4, 'the': 9, 'weather': 13, 'like': 6, 'today': 11, 'for': 3, 'dinner': 2, 'tonight': 12, 'this': 10, 'question': 8, 'worth': 15, 'pondering': 7, 'it': 5, 'beautiful': 0, 'day': 1}
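Note that fit_transform() returns a SciPy sparse matrix, which is why the output above is a list of (document, word ID) pairs with their scores. To inspect it as an ordinary dense matrix, one option is to convert it with toarray():

# Convert the sparse TF-IDF matrix to a dense NumPy array:
# each row is a document, each column a word (indexed by tfidf_vec.vocabulary_).
print(tfidf_matrix.toarray())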
1.3 Summary
The advantage of TF-IDF is that it is simple, fast, and easy to understand.
The disadvantage of TF-IDF is that measuring a word's importance purely by frequency is not always comprehensive: important words may appear only a few times. Moreover, this calculation cannot reflect a word's position in the document, nor its importance in context.