Article Directory
- Preface
- 1.1 The concept of the TF-IDF algorithm
- 1.1.1 TF
- 1.1.2 IDF
- 1.1.3 TF-IDF
- 1.2 Code implementation of the TF-IDF algorithm
- 1.2.1 Implementing the TF-IDF algorithm in Python
- 1.2.2 Implementing the TF-IDF algorithm with sklearn
- 1.3 Summary
Preface
This article introduces the TF-IDF algorithm and its implementation in Python.
1.1 The concept of the TF-IDF algorithm
TF-IDF (Term Frequency - Inverse Document Frequency) is a weighting technique commonly used in information retrieval and data mining, often to extract keywords from articles. It is a statistical method for evaluating how important a word is to a document in a collection or corpus.
TF-IDF consists of two parts, TF and IDF, which are explained below.
1.1.1 TF
TF (Term Frequency) is the number of times a word appears in a document, or that count expressed as a frequency. If a word appears many times in a document, it is likely to be an important word; stop words, of course, need to be excluded. The calculation formula is as follows:
$$\text{Term Frequency (TF)} = \text{the number of times a word appears in the document} \tag{1}$$
Since documents have different lengths, the word frequency is normalized to make different documents comparable. The calculation formula is as follows:
$$\text{Term Frequency (TF)} = \frac{\text{the number of times a word appears in the document}}{\text{the total number of words in the document}} \tag{2}$$
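As a minimal sketch of formula (2) (the example sentence here is made up for illustration):

# Compute normalized term frequency, i.e. formula (2).
sentence = "what is for dinner tonight"
terms = sentence.split()
tf = {term: terms.count(term) / len(terms) for term in set(terms)}
print(tf)  # each word appears once among 5 words, so every TF is 0.2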
1.1.2 IDF
IDF (Inverse Document Frequency) measures a word's "weight" across the corpus. If a word appears in only a few documents, it is a relatively rare word, and its IDF value is larger. The calculation formula is as follows:
$$\text{Inverse Document Frequency (IDF)} = \log\left(\frac{\text{total number of documents in the corpus}}{\text{number of documents containing the word} + 1}\right) \tag{3}$$
The more common a word is, the larger the denominator and the smaller its inverse document frequency, which approaches 0 (with this smoothing it can even turn slightly negative when a word appears in every document). The reason for adding 1 to the denominator is to avoid a denominator of 0 (i.e., no document contains the word); this is a smoothing technique.
Note: different libraries use slightly different smoothing methods when implementing IDF.
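For example, scikit-learn's TfidfVectorizer (with its default smooth_idf=True) adds 1 to both the numerator and the denominator, as if an extra document containing every word had been added to the corpus, and then adds 1 to the logarithm:

$$\text{IDF}(t) = \ln\left(\frac{1 + n}{1 + \text{df}(t)}\right) + 1$$

where $n$ is the total number of documents and $\text{df}(t)$ is the number of documents containing the word $t$; the resulting document vectors are also L2-normalized by default. This is why the manual implementation in Section 1.2.1 and sklearn produce different numbers for the same corpus.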
1.1.3 TF-IDF
Multiplying TF and IDF gives TF-IDF. The calculation formula is as follows:
$$\text{TF-IDF} = \text{Term Frequency (TF)} \times \text{Inverse Document Frequency (IDF)} \tag{4}$$
The importance of a word increases with the number of times it appears in the document, but decreases with the number of documents in the corpus that contain it. This weighting effectively suppresses the influence of common words and improves the correlation between the extracted keywords and the article.
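As a quick worked example (the numbers are made up for illustration): suppose a document contains 100 words and the word "cow" appears 3 times, so TF = 3/100 = 0.03. If the corpus contains 10,000,000 documents and "cow" appears in 1,000 of them, then IDF = log(10,000,000 / (1,000 + 1)) ≈ 4 (using the base-10 logarithm), and TF-IDF = 0.03 × 4 = 0.12.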
1.2 Code implementation of the TF-IDF algorithm
1.2.1 Implementing the TF-IDF algorithm in Python
The TF-IDF algorithm can be implemented by hand in Python; the specific code is as follows:
import math

class TfIdf:
    def __init__(self):
        self.num_docs = 0
        self.vocab = {}

    def add_corpus(self, corpus):
        self._merge_corpus(corpus)
        tfidf_list = []
        for sentence in corpus:
            tfidf_list.append(self.get_tfidf(sentence))
        return tfidf_list

    def _merge_corpus(self, corpus):
        """
        Scan the corpus, build the vocabulary, and count the number of documents containing each word.
        """
        self.num_docs = len(corpus)
        for sentence in corpus:
            words = sentence.strip().split()
            words = set(words)
            for word in words:
                self.vocab[word] = self.vocab.get(word, 0.0) + 1.0

    def _get_idf(self, term):
        """
        Calculate the IDF value of a term, following formula (3).
        """
        return math.log(self.num_docs / (self.vocab.get(term, 0.0) + 1.0))

    def get_tfidf(self, sentence):
        tfidf = {}
        terms = sentence.strip().split()
        terms_set = set(terms)
        num_terms = len(terms)
        for term in terms_set:
            # Calculate the TF value, following formula (2)
            tf = float(terms.count(term)) / num_terms
            # Calculate the IDF value. In practice, the IDF of every word can be precomputed once and reused.
            idf = self._get_idf(term)
            # Calculate the TF-IDF value
            tfidf[term] = tf * idf
        return tfidf

corpus = [
    "What is the weather like today",
    "what is for dinner tonight",
    "this is question worth pondering",
    "it is a beautiful day today"
]

tfidf = TfIdf()
tfidf_values = tfidf.add_corpus(corpus)
for tfidf_value in tfidf_values:
    print(tfidf_value)
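Running the code prints one dictionary per document. Because the terms are iterated over a set, the key order may vary from run to run; for the first document the values come out approximately as:

{'What': 0.1155, 'the': 0.1155, 'weather': 0.1155, 'like': 0.1155, 'today': 0.0479, 'is': -0.0372}

Note that "is" receives a negative weight: it appears in all 4 documents, so its IDF is ln(4 / (4 + 1)) < 0. This is a side effect of the +1 smoothing in the denominator, and one of the places where implementations diverge (see the note in Section 1.1.2).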
1.2.2 Implementing the TF-IDF algorithm with sklearn
When implementing TF-IDF with sklearn, you need the TfidfVectorizer class; the specific code is as follows:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "What is the weather like today",
    "what is for dinner tonight",
    "this is question worth pondering",
    "it is a beautiful day today"
]

tfidf_vec = TfidfVectorizer()
# Use fit_transform() to obtain the TF-IDF matrix
tfidf_matrix = tfidf_vec.fit_transform(corpus)
print(tfidf_matrix)
# Use get_feature_names() to get the unique words
# (removed in scikit-learn 1.2; use get_feature_names_out() there instead)
print(tfidf_vec.get_feature_names())
# Get the ID corresponding to each word
print(tfidf_vec.vocabulary_)
The following information will be output:
(0, 11) 0.3710221459250386
(0, 6) 0.47059454669821993
(0, 13) 0.47059454669821993
(0, 9) 0.47059454669821993
(0, 4) 0.24557575678403082
(0, 14) 0.3710221459250386
(1, 12) 0.506765426545092
(1, 2) 0.506765426545092
(1, 3) 0.506765426545092
(1, 4) 0.2644512224141842
(1, 14) 0.3995396830595886
(2, 7) 0.4838025881780501
(2, 15) 0.4838025881780501
(2, 8) 0.4838025881780501
(2, 10) 0.4838025881780501
(2, 4) 0.25246826075544676
(3, 1) 0.506765426545092
(3, 0) 0.506765426545092
(3, 5) 0.506765426545092
(3, 11) 0.3995396830595886
(3, 4) 0.2644512224141842
['beautiful', 'day', 'dinner', 'for', 'is', 'it', 'like', 'pondering', 'question', 'the', 'this', 'today', 'tonight', 'weather', 'what', 'worth']
{'what': 14, 'is': 4, 'the': 9, 'weather': 13, 'like': 6, 'today': 11, 'for': 3, 'dinner': 2, 'tonight': 12, 'this': 10, 'question': 8, 'worth': 15, 'pondering': 7, 'it': 5, 'beautiful': 0, 'day': 1}
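Note that fit_transform() returns a SciPy sparse matrix, which is why the output above is a list of (document, word ID) pairs with their scores. To inspect it as an ordinary dense matrix, one option is to convert it with toarray():

# Convert the sparse TF-IDF matrix to a dense NumPy array:
# each row is a document, each column a word (indexed by tfidf_vec.vocabulary_).
print(tfidf_matrix.toarray())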
1.3 Summary
The advantage of TF-IDF is that it is simple, fast, and easy to understand.
The disadvantage of TF-IDF is that measuring a word's importance purely by frequency is not always comprehensive: important words may appear only a few times. Moreover, this calculation cannot reflect a word's position in the document, nor its importance in context.