
Using Python to compute word frequencies in Chinese documents

1. First, use jieba to segment the Chinese text into words

The input file, clean_data.csv, has three tab-separated columns:

/travels/1322/   Mediterranean cruise + Rome in-depth free travel
/travels/1400/  Berlin & Annecy       Rat m

import sys
reload(sys)
sys.setdefaultencoding("utf-8")  # Python 2 only: make UTF-8 the default string encoding

import jieba
import jieba.analyse

wf = open('clean_title.txt', 'w+')
for line in open('/root/clean_data/clean_data.csv'):

    item = line.strip('\n\r').split('\t')  # split the line on tabs
    # print item[1]
    tags = jieba.analyse.extract_tags(item[1])  # extract keywords from the title with jieba
    tagsw = ",".join(tags)  # join the keywords with commas
    wf.write(tagsw)

wf.close()
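On Python 3 the `reload(sys)` / `sys.setdefaultencoding` idiom is not available, so the same step could be sketched roughly as below. This is an illustration rather than the original method: the helper names `extract_title_tags` and `segment_file` are hypothetical, the title is assumed to sit in the second tab-separated column (`item[1]` above), and each title's keywords are written on their own line so that the last keyword of one title cannot run into the first keyword of the next.

```python
def extract_title_tags(line, extract=None):
    """Return the comma-joined keywords for one tab-separated input line."""
    if extract is None:
        # Imported lazily so the helper can be tried out without jieba installed.
        import jieba.analyse
        extract = jieba.analyse.extract_tags
    columns = line.rstrip("\r\n").split("\t")
    if len(columns) < 2:
        return ""  # malformed row: no title column (an assumption about the data)
    return ",".join(extract(columns[1]))


def segment_file(src_path, dst_path, extract=None):
    """Write one line of comma-separated keywords per input title."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            tags = extract_title_tags(line, extract)
            if tags:
                dst.write(tags + "\n")
```

With jieba installed, a call such as `segment_file('/root/clean_data/clean_data.csv', 'clean_title.txt')` would mirror the script above. The optional `extract` argument exists only so that a different extraction function can be swapped in.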

Contents of the output file clean_title.txt:

cruise ship, Mediterranean, depth, Rome, free nasi, Berlin visa, walking, three days, approval of the Schengen, step by step, visa, application, how to praise, flange, cross, wine, scenery, river valley, world, European color, a, country, aquarium, Palau, seven days, God Olympia, running Santo,
 Rini, ancient civilization, visit, Aegean, charm, Greece

2. Count the word frequencies

#!/usr/bin/python
# -*- coding:utf-8 -*-

word_lst = []
word_dict = {}
with open('/root/clean_data/clean_title.txt') as wf, open("", 'w') as wf2:  # open the input and output files

    for word in wf:
        word_lst.append(word.split(','))  # split each line on commas

    # Count once after reading, so earlier lines are not counted again on every pass
    for item in word_lst:
        for item2 in item:
            if item2 not in word_dict:  # first occurrence of this word
                word_dict[item2] = 1
            else:
                word_dict[item2] += 1

    for key in word_dict:
        print key, word_dict[key]
        wf2.write(key + ' ' + str(word_dict[key]) + '\n')  # write "word count" to the output file
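The manual dictionary bookkeeping can also be expressed with `collections.Counter` from the standard library. Below is a minimal Python 3 sketch, assuming the input holds comma-separated words on one or more lines. The function name `count_words` is made up for illustration. Unlike the script above, it trims surrounding whitespace and skips empty tokens (for example after a trailing comma).

```python
from collections import Counter


def count_words(lines):
    """Count comma-separated words across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        # Drop the line ending, split on commas, and trim each token.
        words = [w.strip() for w in line.strip().split(",")]
        # Ignore empty tokens so a trailing comma does not create a "" entry.
        counts.update(w for w in words if w)
    return counts


# Made-up sample lines, standing in for the contents of clean_title.txt.
sample = ["Rome,cruise,Rome\n", "Greece,Rome,\n"]
for word, count in count_words(sample).items():
    print(word, count)
```

Because an open file object is itself an iterable of lines, it can be passed straight to `count_words`.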

Result:

Last 4
European Blue 1
Jimei 1
Portugal Fado 1
Construction site 1
Know the scenery of the lake and the mountains 1
Holy 7
European Girls Switzerland Plus Tour 1

Sort by count, in descending order:

cat |sort -nr -k 2|more

Holy 7
Last 4
European Blue 1
Jimei 1
Portugal Fado 1
Construction site 1
Know the scenery of the lake and the mountains 1
European Girls Switzerland Plus Tour 1
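The same ordering can be produced inside Python rather than with `sort -nr -k 2`. This is a small sketch, assuming the counts are held in a dictionary such as `word_dict` from step 2. `sort_by_count` is a hypothetical helper, and ties are broken alphabetically here, which may differ from how `sort` orders lines with equal counts.

```python
def sort_by_count(word_counts):
    """Return (word, count) pairs, highest count first, ties in alphabetical order."""
    return sorted(word_counts.items(), key=lambda pair: (-pair[1], pair[0]))


# A few entries taken from the result listing above.
counts = {"Last": 4, "Jimei": 1, "Holy": 7}
for word, count in sort_by_count(counts):
    print(word, count)
# prints:
# Holy 7
# Last 4
# Jimei 1
```

Sorting on the count itself also avoids relying on the count being the second whitespace-separated field, which `sort -k 2` assumes.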