web123456

Jieba, NLTK and other Chinese and English word segmentation tools to divide word

Purpose of the experiment:

Using the given Chinese and English text sequence (see and ), respectively, using the following given Chinese

The English word segmentation tool performs word segmentation and briefly compares and analyzes the results produced by different word segmentation tools.

 

Experimental tools:

Chinese Jieba (key point), try three word participle modes and custom dictionary functions, SnowNLP, THULAC, NLPIR, StanfordCoreNLP,

English NLTK, SpaCy, StanfordCoreNLP

 

Experimental environment:

Language: Python 3.7.0

IDE: Pycharm

Many packages need to be installed using pip, please search for related tutorials here to install

 

Experimental steps:

First, do Chinese participle:

1. Jieba

import jieba
 import re
 Chinese='CCTV 315 Gala exposed that the well-known Shendan brand and Liantian brand "local eggs" in Hubei Province were actually impersonating ordinary eggs. At the same time, they played tricks on trademarks, registering "fresh soil" and "good soil" trademarks respectively, making consumers mistakenly think it was "local eggs".  On the evening of March 15, a reporter from the Beijing News called Hubei Shendan Health Food Co., Ltd. on the matter. Its staff said that they were unaware of it and needed to understand the situation clearly. No latest response was obtained as of press time.  A reporter from the Beijing News also found that Hubei Shendan Health Food Co., Ltd. is a national key leading enterprise in agricultural industrialization and a high-tech enterprise. It was previously fined 60,000 yuan for suspected false propaganda of "China's largest egg company."  ’

 str=('[^\w]','',chinese) #Use regular symbols, and then use this str string

 seg_list=(s_list, cut_all=False) #Precise mode
 print('/'.join(seg_list))

Results of word participle:

CCTV/315/Gallery/Exposure/Hubei Province/Westfamous/Shendan/Party Lotus/Eggs/Real/Ordinary/Eggs/Impossible/Simultaneous/On/Trademark/Play/Maowei/Register/Fresh earth/Register/Good earth/Trademark/Maowei/Trademark/Mao/Consumers/Missue/Minister/Ministry/Incorrectly think that it is/Turn/Eggs/March/15/Day/Night/New/Beijing News/Reporter/Call/Call/About/This matter/Call/Hubei/Shendan/Health/Food/Co., Ltd. /It/Staff/Express/Not aware/Recognition/Security/As of the time of publication/No/Getting/Latest/Response/New/Beijing News/Reporter/Also/Inquiry/Discover/Hubei/Shendan/Health/Food/Co., Ltd./For/Agriculture/Industrialization/National/Key/Leading Enterprise/High-tech/Enterprise/Previous/Free/Free/Previous/Free/Free/Previous/Free/Free/Previous/Free/Free/Previous/Free/Free/Previous/Free/Free/Previous/Free/Free/Previous/Free/Free/Previous/Free/Free/Previous/Free/Free/Previous/Free/

Load the self-created dictionary

Use jieba.load_userdict(file)

file=open(dict,'r') # Load a dictionary, the contents of this dictionary are: Shendan card, Lotus card, Local egg, Beijing News
 jieba.load_userdict(file)
 ()
 seg_list=(str, cut_all=False) #Exact mode str is the previous string
 print('/'.join(seg_list))

result:

CCTV/315/Gallery/Exposed/Hubei Province/Westfamous/Shendan Card/Lotus Brand/Lotus Egg/Real Ordinary/Egg/Imitation/At the same time/on/Trademark/play/Maowei/Register/Fresh Earth/Register/Good Earth/Trademark/Maowei/Search/Consumer/Missing/Mindully Thinking/It is/Native Egg/March/15/Day/Evening/New Beijing News/Reporter/Call/Call/About/This/Call/Hotel/Hotel/Hotel/Food/Co., Ltd./Information/Details/Administrative/Industrialization/National/Key Leading Enterprise/High-tech/Enterprise/Previous/Free/Food/Co., Ltd./Free/Food/Egg Products/Company/Previous/Free/Food/Advanced/Free/Egg Products/Company/Advanced/Information/National/Key/Leading Enterprise/High-tech/Enterprise/Previous/Free/Previous/Free/Free/Previous/Free/Free/Previous/Free/Free/Advanced/Free/Advanced/Free/Advanced/Free/Advanced/Free/Advanced/Free/Advanced/Free/Advanced/Free/Advanced/Free/Advanced/Free/Advanced/Free/Advanced/Free/Advanced/Free/Advanced/Free/Advanced/

Obviously, after loading into the dictionary, the Divine Pill Card, Lotus Card, Local Egg, and the Beijing News were combined together.

2. SnowNlp

from snowlp import SnowNLP
 s=SnowNLP(str) #str is a Chinese string with the symbols removed before
 print() #Separate words
 print() #get pinyin
 print((3)) #Summary
 print((3)) # Get keywords
 print() #Turn traditional Chinese into simplified Chinese

The result of word participle: is a list

['CCTV', '315', 'gala', 'exposure', 'Hubei Province', 'famous', 'shendan', 'card', 'loose', 'field', 'plain', 'earth', 'earth', 'egg', 'real', 'ordinary', 'egg', 's 'should', 'on the 'trade', 'play cat', 's 's 'tween', 'trademark', 'play cat', 's 'reliable', 'respective', 'register', 'fresh', 'register', 'good', 'trademark', 'let', 'consumer', 'wrong', 'consumer', 'wrong', 'consumer', 'wrong', 'consumer', 'wrong', 'consumer', 'think', 'think', 'think', 'think', 'think', 'think', 'think', 'think', 'think', 'think', 'think', 'think', 'think', 'think', 'think', 'think', 'think', 'think', 'think', 'think', 'think', 'think', 'think', 'think', 'think', 'think', 'think', 'think', 'think', 'think', 'think', 'think', 'think', 'think', 'think', 'think', 'think', 'think', 'think' 'Xinjing', 'news', 'report', 'report', 'report', 'report', 'report', 'report', 'report', 'report', 'report', 'report', 'report', 'report', 'report', 'report', 'report', 'report', 'report', 'report', 'report', 'report', 'report', 'report', 'report', 'report', 'report', 'report', 'report', 'report', 'report', 'report', 'report', 'report', 'report', 'report', 'report', 'report', 'report', 'report', 'report', 'report', 'report', 'report', 'report', 'report', 'resource', 'report', 'report', 'resource', 'resource', 'resource', 'report', 'resource', 'resource', 'resource', 'report', 'resource', 'resource', 'resource', 'report', 'resource', 'resource', 'resource', 'report', 'resource', 'resource', 'resource', 'resource', 'report', 'resource', 'resource', 'resource', 'resource', 'report', 'resource', 'resource', 'resource', 'resource', 'report', 'resource', 'resource', 'resource', 'resource', 'report', 'key', 'leading', 'enterprise', 'high-tech', 'enterprise', 'previous', 'has been', 'false', 'propaganda', 'China', 'most', 'big', 'egg products', 'enterprise', 'enterprise', '6', '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10, '10

3. Thulac

t=() #Process word segmentation and annotation word nature
 text=(str,text=False) #Perform word segmentation and annotate part of speech. If text=True, it is str, otherwise the default mode returns the value as list
 print(text)

[['CCTV', 'v'], ['315', 'm'], ['Gate', 'n'], ['Exposure', 'v'], ['Hubei Province', 'nn'], ['Well-known', 'a'], ['The', 'u'], ['Shendan Card', 'nz'], ['Lotus Field Card', 'nz'], ['Native Egg', 'n'], ['Real', 'a'], ['For', 'v'], ['Ordinary', 'a'], ['Egg', 'n'], ['On the same time', 'd'], ['Trademark', 'n'], ['Trademark', 'n'], ['Following', 'v'], ['Type', 'v'], ['Type', 'v'], ['Caoya', 'n'], ['respective', 'd'], ['register', 'v'], ['fresh soil', 'n'], ['register', 'v'], ['good', 'a'], ['trademark', 'n'], ['let', 'v'], ['consumer', 'n'], ['wrong', 'v'], ['consumer', 'n'], ['wrong', 'd'], ['think', 'p'], ['this matter', ', 'v'], ['this matter', 'r'], ['Call', 'v'], ['Hubei', 'nns'], ['Shendan', 'nz'], ['Healthy', 'a'], ['Food', 'n'], ['Co., Ltd', 'n'], ['Attend', 'n'], ['Only', 'n'], ['Only', 'n'], ['Project', 'v'], ['No', 'd'], ['Information', 'v'], ['Demand', 'v'], ['Desired', 'u'], ['Solution', 'a'], ['Certification', 'a'], ['State', 'n'], ['As of ', 'v'], ['As of ', 'v'], ['Publication', 'v'], ['Temporary', 'd'], ['No', 'd'], ['Get', 'v'], ['Latest', 'a'], ['Response', 'v'], ['New', 'a'], ['Beijing News', 'n'], ['Reporter', 'n'], ['Also', 'd'], ['Query', 'v'], ['Discover', 'v'], ['Hubei', 'nn'], ['Healthy', 'a'], ['Food', 'n'], ['Co., Ltd.', 'n'], ['For', 'p'], ['Agriculture', 'n'], ['Industrialization', 'v'], ['National', 'm'], ['Home', 'q'], ['Key', 'n'], ['Leader', 'n'], ['enterprise', 'n'], ['high-tech', 'n'], ['enterprise', 'n'], ['previous', 't'], ['d'], ['since', 'p'], ['suspected', 'v'], ['propaganda', 'v'], ['propaganda', 'v'], ['China', 'n'], ['most', 'd'], ['big', 'a'], ['big', 'a'], ['egg', 'n'], ['company', 'n'], ['and', 'c'], ['p'], ['p'], ['p'], ['p'], ['p'], ['p'], ['p'], ['60,000', 'm'], ['yuan', 'q']]

like

t2=(seg_only=True) #Only perform word segmentation segment

Only word segmentation is performed, no word quality is marked

4. Pynlpir

()
 print((str)) #全全全全全全全全全全全全全全全全全全全全全全全全全全全全全全全全全全全全全全全全全全

[('yang', 'verb'), ('visual', 'verb'), ('315', 'numeral'), ('game', 'noun'), ('exposure', 'verb'), ('hubei Province', 'noun'), ('famous, 'adjective'), ('particle'), ('god', 'noun'), ('dian', 'disting word'), ('card', 'noun'), (', 'punctuation mark'), ('lover', 'noun'), ('field', 'noun'), ('noun'), ('"', 'punctuation mark'), ('earth', 'noun'), ('oun', 'noun'), ('"', 'punctuation mark'), ('soon', 'noun'), ('egg', 'noun'), ('"', 'punctuation mark'), ('real', 'adjective'), ('for', 'verb'), ('ordinary', 'adjective'), ('egg', 'noun'), ('prefix', 'verb'), (', 'punctuation mark'), ('same', 'conjunction'), ('preposition'), ('trademark', 'noun'), ('on', 'noun of locality'), ('play', 'verb'), ('make', 'noun'), (', 'punctuation mark'), ('respective', 'adverb'), ('register', 'verb'), ('"', 'punctuation mark'), ('fresh', 'adjective'), ('soil', 'noun'), ('"', 'punctuation mark'), (', 'punctuation mark'), (', 'verb'), ('"', 'punctuation mark'), ('re 'adjective'), ('good', 'adjective'), ('food', 'noun'), ('"', 'punctuation mark'), ('trademark', 'noun'), (', 'punctuation mark'), ('let', 'verb'), ('consumer', 'noun'), ('wrong', 'adverb'), ('think', 'verb'), ('what', 'verb'), ('think', 'verb'), ('yes', 'verb'), ('"', 'punctuation mark'), ('soil', 'noun'), ('egg', 'noun'), ('"', 'punctuation mark'), ('. ', 'punctuation mark'), ('March', 'time word'), ('15th', 'time word'), ('eventh', 'time word'), (', 'punctuation mark'), ('Beijing News', None), ('Reporter', 'noun'), ('on', 'adverb'), ('this matter', 'pronoun'), ('call', 'verb'), ('Hubei', 'noun'), ('God', 'noun'), ('Dan', 'disting word'), ('Health', 'adjective'), ('Food', 'noun'), ('Coalition', 'noun'), ('Food', 'noun'), ('Food', 'noun'), ('Food', 'noun'), ('Facilities', 'noun'), ('Facilities', 'noun'), ('Facilities', 'noun'), ('Facilities', 'noun'), ('Facilities', 'noun'), ('Facilities', 'noun'), ('Facilities', 'noun'), ('Facilities', 'noun'), ('Facilities', 'noun'), ('Facilities', 'noun'), ('Facilities', 'noun'), ('Facilities', 'noun'), ('Facilities', 'noun'), ('Facilities', 'noun'), ('Facilities', 'noun'), ('Facilities', 'noun'), ('Facilities', 'noun'), ('Facilities', 'noun'), ('Facilities', 'noun'), ('Facilities (',', 'punctuation mark'), ('it', 'pronoun'), ('work', 'verb'), ('person', 'noun'), ('representing', 'verb'), ('noun'), ('representing', 'verb'), ('noun'), ('knowing', 'verb'), ('because', 'verb'), ('re prescription', 'verb'), ('re prescription', 'verb'), ('not', 'adverb'), ('not', 'punctuation mark'), ('re prescription', 'verb'), ('not', 'adverb'), ('not', 'adverb'), ('get', 'verb'), ('latest', 'adjective'), ('response', 'verb'), ('.', 'punctuation mark'), ('Beijing News', None), ('report', 'noun'), ('re also', 'adverb'), ('query', 'verb'), ('discover', 'verb'), (', 'punctuation mark'), ('hubei', 'noun'), ('god', 'noun'), ('dian', 'disting word'), ('health', 'adjective'), ('food', 'noun'), ('collective', 'noun'), ('for', 'preposition'), ('agriculture', 'noun'), ('industrialization', 'verb'), ('country', 'noun'), ('key', 'noun'), ('leader', 'noun'), ('enterprise', 'noun'), ('enterprise', 'noun'), (', 'punctuation mark'), ('high-tech', 'noun'), ('enterprise', 'noun'), (', 'punctuation mark'), ('previous', 'time word'), ('zen', 'adverb'), ('because', 'preposition'), ('suspected', 'verb'), ('false', 'adjective'), ('promotion', 'verb'), ('"', 'punctuation mark'), ('China', 'noun'), ('most', 'adverb'), ('big', 'adjective'), ('particle'), ('egg', 'noun'), ('enterprise', 'noun'), ('"', 'punctuation mark'), ('and', 'conjunction'), ('being', 'preposition'), ('punctuation', 'verb'), ('60,000', 'numeral'), ('yuan', 'classifier'), ('. ', 'punctuation mark')]

5. StanfordCoreNLP:

nlp=StanfordCoreNLP(r'G:\\stanford-corenlp-full-2018-10-05\\stanford-corenlp-full-2018-10-05',lang='zh')
 print(nlp.word_tokenize(s_list)) #Return a list
 # print(nlp.pos_tag(str)) # Part of speech tagging
 # print((str)) #Analysis

result:

['CCTV', '315', 'gala', 'exposed', 'Hubei Province', 'famous', 'shendan', 'card', 'loose', 'field', 'brand', 'earth', 'earth', 'egg', 'real', 'ordinary', 'egg', 's 'apple', 'at the same time', 'trademark', 'play', 'make up', 'respective', 'register', 'fresh soil', 'register', 'good', 'trademark', 'let', 'consumer', 'wrong', 'consumer', 'wrong', 'consumer', 'wrong', 'whole', 'consumer', 'wrong', 'consumer', 'wrong', 'whole', 'murder', 'will be', 'earth', 'egg', 'March', '15th', 'evening', 'Beijing News', 'reporter', 're on this matter', 'Call', 'Hubei', 'Shendan', 'Healthy', 'Food', 'Limited', 'Company', 'Only', 'Work', 'People', 'Express', 'Need', 'Know', 'Clear', 'State', 'As of ', 'Promise', 'Temporary', 'No', 'Get', 'Latest', 'Response', 'New Beijing News', 'Reporter', 'Search', 'Search', 'Since', 'Since', 'Shendan', 'Healthy', 'Food', 'Limited', 'Company', 'Agriculture', 'Industrialization', 'National', 'Key', 'Leader', 'Entertainment', 'New', 'Technology', 'enterprise', 'previous', 'had been', 'finished', '60,000', 'yuan']

To perform English participle:

Englisth=‘Trump was born and raised in the New York City borough of Queens and received an economics degree from the Wharton School. He was appointed president of his family's real estate business in 1971, renamed it The Trump Organization, and expanded it from Queens and *lyn into Manhattan. The company built or renovated skyscrapers, hotels, casinos, and golf courses. Trump later started various side ventures, including licensing his name for real estate and consumer products. He managed the company until his 2017 inauguration. He co-authored several books, including The Art of the Deal. He owned the Miss Universe and Miss USA beauty pageants from 1996 to 2015, and he produced and hosted The Apprentice, a reality television show, from 2003 to 2015. Forbes estimates his net worth to be $3.1 billion.’

6. nltk:

import nltk
 import re
 english='H:\\Natural Language Processing\\Experiment2\\'
 with open(english,'r',encoding='utf-8') as file:
     u=()
 str=('[^\w ]','',u)
 print(nltk.word_tokenize(str))
 print(nltk.pos_tag(nltk.word_tokenize(str))) #About part-of-speech results of the word segmentation

result:

['Trump', 'was', 'born', 'and', 'raised', 'in', 'the', 'New', 'York', 'City', 'borough', 'of', 'Queens', 'and', 'received', 'an', 'economics', 'degree', 'from', 'the', 'Wharton', 'School', 'He', 'was', 'appointed', 'president', 'of', 'his', 'familys', 'real', 'estate', 'business', 'in', '1971', 'renamed', 'it', 'The', 'Trump', 'Organization', 'and', 'expanded', 'it', 'from', 'Queens', 'and', '*lyn', 'into', 'Manhattan', 'The', 'company', 'built', 'or', 'renovated', 'skyscrapers', 'hotels', 'casinos', 'and', 'golf', 'courses', 'Trump', 'later', 'started', 'various', 'side', 'ventures', 'including', 'licensing', 'his', 'name', 'for', 'real', 'estate', 'and', 'consumer', 'products', 'He', 'managed', 'the', 'company', 'until', 'his', '2017', 'inauguration', 'He', 'coauthored', 'several', 'books', 'including', 'The', 'Art', 'of', 'the', 'Deal', 'He', 'owned', 'the', 'Miss', 'Universe', 'and', 'Miss', 'USA', 'beauty', 'pageants', 'from', '1996', 'to', '2015', 'and', 'he', 'produced', 'and', 'hosted', 'The', 'Apprentice', 'a', 'reality', 'television', 'show', 'from', '2003', 'to', '2015', 'Forbes', 'estimates', 'his', 'net', 'worth', 'to', 'be', '31', 'billion']

7. Sparcy:

import spacy
nlp=('en_core_web_sm')
document=nlp(str)
print(())

result:

['Trump', 'was', 'born', 'and', 'raised', 'in', 'the', 'New', 'York', 'City', 'borough', 'of', 'Queens', 'and', 'received', 'an', 'economics', 'degree', 'from', 'the', 'Wharton', 'School', 'He', 'was', 'appointed', 'president', 'of', 'his', 'familys', 'real', 'estate', 'business', 'in', '1971', 'renamed', 'it', 'The', 'Trump', 'Organization', 'and', 'expanded', 'it', 'from', 'Queens', 'and', '*lyn', 'into', 'Manhattan', 'The', 'company', 'built', 'or', 'renovated', 'skyscrapers', 'hotels', 'casinos', 'and', 'golf', 'courses', 'Trump', 'later', 'started', 'various', 'side', 'ventures', 'including', 'licensing', 'his', 'name', 'for', 'real', 'estate', 'and', 'consumer', 'products', 'He', 'managed', 'the', 'company', 'until', 'his', '2017', 'inauguration', 'He', 'coauthored', 'several', 'books', 'including', 'The', 'Art', 'of', 'the', 'Deal', 'He', 'owned', 'the', 'Miss', 'Universe', 'and', 'Miss', 'USA', 'beauty', 'pageants', 'from', '1996', 'to', '2015', 'and', 'he', 'produced', 'and', 'hosted', 'The', 'Apprentice', 'a', 'reality', 'television', 'show', 'from', '2003', 'to', '2015', 'Forbes', 'estimates', 'his', 'net', 'worth', 'to', 'be', '31', 'billion']

8. StanfordcoreNLP:

nlp=StanfordCoreNLP(r'G:\\stanford-corenlp-full-2018-10-05\\stanford-corenlp-full-2018-10-05',lang='en')
print(nlp.word_tokenize(str))

result;

['Trump', 'was', 'born', 'and', 'raised', 'in', 'the', 'New', 'York', 'City', 'borough', 'of', 'Queens', 'and', 'received', 'an', 'economics', 'degree', 'from', 'the', 'Wharton', 'School', 'He', 'was', 'appointed', 'president', 'of', 'his', 'familys', 'real', 'estate', 'business', 'in', '1971', 'renamed', 'it', 'The', 'Trump', 'Organization', 'and', 'expanded', 'it', 'from', 'Queens', 'and', '*lyn', 'into', 'Manhattan', 'The', 'company', 'built', 'or', 'renovated', 'skyscrapers', 'hotels', 'casinos', 'and', 'golf', 'courses', 'Trump', 'later', 'started', 'various', 'side', 'ventures', 'including', 'licensing', 'his', 'name', 'for', 'real', 'estate', 'and', 'consumer', 'products', 'He', 'managed', 'the', 'company', 'until', 'his', '2017', 'inauguration', 'He', 'coauthored', 'several', 'books', 'including', 'The', 'Art', 'of', 'the', 'Deal', 'He', 'owned', 'the', 'Miss', 'Universe', 'and', 'Miss', 'USA', 'beauty', 'pageants', 'from', '1996', 'to', '2015', 'and', 'he', 'produced', 'and', 'hosted', 'The', 'Apprentice', 'a', 'reality', 'television', 'show', 'from', '2003', 'to', '2015', 'Forbes', 'estimates', 'his', 'net', 'worth', 'to', 'be', '31', 'billion']

The above is the word segmentation process of the eight word segmentation tools. I suggest: Chinese word segmentation is used to divide words, and English word segmentation is used to divide words.