1. First, use jieba to segment the Chinese documents
Content of the clean_data.csv file to be processed (three columns):
/travels/1322/ Mediterranean cruise + Rome in-depth free travel
/travels/1400/ Berlin & Annecy Rat m
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
import jieba
import jieba.analyse
wf = open('clean_title.txt','w+')
for line in open('/root/clean_data/clean_data.csv'):
    item = line.strip('\n\r').split('\t')  # split the line on tabs
    # print item[1]
    tags = jieba.analyse.extract_tags(item[1])  # jieba keyword extraction
    tagsw = ",".join(tags)  # join the keywords with commas
    wf.write(tagsw + '\n')
wf.close()
Content of the resulting clean_title.txt:
cruise ship, Mediterranean, depth, Rome, free nasi, Berlin visa, walking, three days, approval of the Schengen, step by step, visa, application, how to praise, flange, cross, wine, scenery, river valley, world, European color, a, country, aquarium, Palau, seven days, God Olympia, running Santo,
Rini, ancient civilization, visit, Aegean, charm, Greece
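For reference, extract_tags also takes optional parameters such as topK (how many keywords to return) and withWeight (whether to return the TF-IDF weight as well). A minimal sketch; the sample title below is only an illustration modeled on the first row of clean_data.csv, not taken from the data:

# -*- coding: utf-8 -*-
import jieba.analyse

# sample title (an assumption, modeled on the first clean_data.csv row)
title = u"地中海邮轮+罗马深度自由行"
# topK limits the number of keywords; withWeight=True also returns each keyword's TF-IDF weight
for word, weight in jieba.analyse.extract_tags(title, topK=5, withWeight=True):
    print word, weight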
2. Count word frequencies
#!/usr/bin/python
# -*- coding:utf-8 -*-
word_lst = []
word_dict= {}
with open('/root/clean_data/clean_title.txt') as wf, open('word_freq.txt', 'w') as wf2:  # output file name assumed; it is omitted in the source
    for word in wf:
        word_lst.append(word.strip().split(','))  # split each line on commas
    for item in word_lst:
        for item2 in item:
            if item2 not in word_dict:  # count each keyword
                word_dict[item2] = 1
            else:
                word_dict[item2] += 1
    for key in word_dict:
        print key, word_dict[key]
        wf2.write(key + ' ' + str(word_dict[key]) + '\n')  # write to the output file
result:
Last 4
European Blue 1
Jimei 1
Portugal Fado 1
Construction site 1
Know the scenery of the lake and the mountains 1
Holy 7
European Girls Switzerland Plus Tour 1
Sort by count (descending, on the second column):
cat word_freq.txt | sort -nr -k 2 | more   # file name assumed, matching the output file above
Holy 7
Last 4
European Blue 1
Jimei 1
Portugal Fado 1
Construction site 1
Know the scenery of the lake and the mountains 1
European Girls Switzerland Plus Tour 1
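As an alternative to the manual dictionary plus shell sort, collections.Counter can count and sort in one pass. A minimal sketch assuming the same clean_title.txt produced in step 1:

# -*- coding: utf-8 -*-
from collections import Counter

counts = Counter()
with open('/root/clean_data/clean_title.txt') as wf:
    for line in wf:
        # feed the comma-separated keywords of each line into the counter
        counts.update(w for w in line.strip().split(',') if w)

# most_common() returns (word, count) pairs already sorted by count, descending
for word, count in counts.most_common():
    print word, count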