Purpose of the experiment:
Segment the given Chinese and English text sequences (shown below) using the specified Chinese and English word segmentation tools, and briefly compare and analyze the results produced by the different tools.
Experimental tools:
Chinese: jieba (the focus; try its three segmentation modes and its custom-dictionary function), SnowNLP, THULAC, NLPIR, StanfordCoreNLP
English: NLTK, spaCy, StanfordCoreNLP
Experimental environment:
Language: Python 3.7.0
IDE: Pycharm
Most of these packages are installed with pip; search for the relevant installation tutorial for each tool if needed.
Experimental steps:
First, Chinese word segmentation:
1. Jieba
import jieba
import re

chinese = '''CCTV 315 Gala exposed that the well-known Shendan brand and Liantian brand "local eggs" in Hubei Province were actually impersonating ordinary eggs. At the same time, they played tricks on trademarks, registering "fresh soil" and "good soil" trademarks respectively, making consumers mistakenly think they were "local eggs". On the evening of March 15, a reporter from the Beijing News called Hubei Shendan Health Food Co., Ltd. about the matter. Its staff said they were unaware of it and needed to look into the situation; no response had been obtained as of press time. A Beijing News reporter also found that Hubei Shendan Health Food Co., Ltd. is a national key leading enterprise in agricultural industrialization and a high-tech enterprise, and that it was previously fined 60,000 yuan for suspected false advertising as "China's largest egg company".'''
str = re.sub(r'[^\w]', '', chinese)  # strip all punctuation with a regex; this str string is used below
seg_list = jieba.cut(str, cut_all=False)  # precise mode
print('/'.join(seg_list))
Segmentation result (precise mode):
CCTV/315/Gala/exposed/Hubei Province/well-known/Shendan/brand/Liantian/brand/local/eggs/… (the rest of the long token list is omitted; the point to note is that "Shendan brand", "Liantian brand", "local egg", and "Beijing News" are each split across multiple tokens)
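The experiment also calls for trying jieba's other two modes. A minimal sketch reusing the same str (jieba.cut with cut_all=True is full mode; jieba.cut_for_search is search-engine mode):

seg_full = jieba.cut(str, cut_all=True)  # full mode: outputs every dictionary word it can find, overlaps included
print('/'.join(seg_full))
seg_search = jieba.cut_for_search(str)  # search-engine mode: precise mode first, then re-splits long words for recall
print('/'.join(seg_search))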
Loading a custom dictionary with jieba.load_userdict(file):
file = open(dict_path, 'r', encoding='utf-8')  # dict_path is the dictionary file; its entries are: Shendan brand, Liantian brand, local egg, Beijing News
jieba.load_userdict(file)
file.close()
seg_list = jieba.cut(str, cut_all=False)  # precise mode; str is the string from before
print('/'.join(seg_list))
result:
CCTV/315/Gala/exposed/Hubei Province/well-known/Shendan brand/Liantian brand/local egg/… (the rest of the long token list is omitted; "Shendan brand", "Liantian brand", "local egg", and "Beijing News" now each come out as a single token)
Clearly, after loading the dictionary, "Shendan brand", "Liantian brand", "local egg", and "Beijing News" are each kept together as single tokens.
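For a handful of entries, jieba can also register words programmatically rather than through a dictionary file. A small sketch (the strings below stand in for the actual Chinese dictionary entries):

for w in ['Shendan brand', 'Liantian brand', 'local egg', 'Beijing News']:  # placeholders for the Chinese entries
    jieba.add_word(w)  # add one entry to the in-memory dictionary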
2. SnowNLP
from snownlp import SnowNLP
s = SnowNLP(str)  # str is the Chinese string with the punctuation already removed
print(s.words)  # word segmentation
print(s.pinyin)  # pinyin
print(s.summary(3))  # three-sentence summary
print(s.keywords(3))  # top three keywords
print(s.han)  # convert traditional characters to simplified
Segmentation result (a list):
['CCTV', '315', 'gala', 'exposure', 'Hubei Province', 'famous', 'Shendan', 'brand', …] (the rest of the long list is omitted)
3. THULAC
import thulac
t = thulac.thulac()  # default model: segmentation plus part-of-speech tagging
text = t.cut(str, text=False)  # with text=True the result is a str; the default (False) returns a list
print(text)
[['CCTV', 'v'], ['315', 'm'], ['Gala', 'n'], ['exposure', 'v'], ['Hubei Province', 'ns'], …] (the rest of the long output is omitted; each entry pairs a token with its part-of-speech tag)
Similarly,
t2 = thulac.thulac(seg_only=True)  # segmentation only
performs segmentation without marking parts of speech.
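To get the segmentation back as one space-separated string rather than a list, pass text=True; a quick sketch:

t2 = thulac.thulac(seg_only=True)  # segmentation only, no part-of-speech tags
print(t2.cut(str, text=True))  # a single string with tokens separated by spaces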
4. Pynlpir
import pynlpir
pynlpir.open()  # initialize the NLPIR engine
print(pynlpir.segment(str))  # segment with part-of-speech tagging (enabled by default)
[('yang', 'verb'), ('visual', 'verb'), ('315', 'numeral'), ('gala', 'noun'), ('exposure', 'verb'), ('Hubei Province', 'noun'), …, ('60,000', 'numeral'), ('yuan', 'classifier'), ('.', 'punctuation mark')] (the rest of the long output is omitted; NLPIR returns (token, part-of-speech) tuples with full English tag names, and note that it split "CCTV" into single characters here)
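pynlpir keeps the NLPIR engine loaded after open(). Segmentation without POS tags, and releasing the engine when done, look like this (a sketch):

print(pynlpir.segment(str, pos_tagging=False))  # plain token list, no part-of-speech tags
pynlpir.close()  # release the NLPIR engine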
5. StanfordCoreNLP:
from stanfordcorenlp import StanfordCoreNLP
nlp = StanfordCoreNLP(r'G:\stanford-corenlp-full-2018-10-05\stanford-corenlp-full-2018-10-05', lang='zh')
print(nlp.word_tokenize(str))  # returns a list
# print(nlp.pos_tag(str))  # part-of-speech tagging
# print(nlp.parse(str))  # constituency parsing
result:
['CCTV', '315', 'gala', 'exposed', 'Hubei Province', 'famous', 'Shendan', 'card', …, '60,000', 'yuan'] (the rest of the long token list is omitted)
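The stanfordcorenlp wrapper launches a Java CoreNLP server in the background, so it should be shut down once segmentation is finished:

nlp.close()  # stop the background CoreNLP server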
Next, English word segmentation:
English = "Trump was born and raised in the New York City borough of Queens and received an economics degree from the Wharton School. He was appointed president of his family's real estate business in 1971, renamed it The Trump Organization, and expanded it from Queens and *lyn into Manhattan. The company built or renovated skyscrapers, hotels, casinos, and golf courses. Trump later started various side ventures, including licensing his name for real estate and consumer products. He managed the company until his 2017 inauguration. He co-authored several books, including The Art of the Deal. He owned the Miss Universe and Miss USA beauty pageants from 1996 to 2015, and he produced and hosted The Apprentice, a reality television show, from 2003 to 2015. Forbes estimates his net worth to be $3.1 billion."
6. NLTK:
import nltk
import re

english = 'H:\\Natural Language Processing\\Experiment2\\'  # path to the file containing the English text
with open(english, 'r', encoding='utf-8') as file:
    u = file.read()
str = re.sub(r'[^\w ]', '', u)  # strip punctuation but keep the spaces
print(nltk.word_tokenize(str))
print(nltk.pos_tag(nltk.word_tokenize(str)))  # part-of-speech tags for the segmented words
result:
['Trump', 'was', 'born', 'and', 'raised', 'in', 'the', 'New', 'York', 'City', 'borough', 'of', 'Queens', 'and', 'received', 'an', 'economics', 'degree', 'from', 'the', 'Wharton', 'School', 'He', 'was', 'appointed', 'president', 'of', 'his', 'familys', 'real', 'estate', 'business', 'in', '1971', 'renamed', 'it', 'The', 'Trump', 'Organization', 'and', 'expanded', 'it', 'from', 'Queens', 'and', '*lyn', 'into', 'Manhattan', 'The', 'company', 'built', 'or', 'renovated', 'skyscrapers', 'hotels', 'casinos', 'and', 'golf', 'courses', 'Trump', 'later', 'started', 'various', 'side', 'ventures', 'including', 'licensing', 'his', 'name', 'for', 'real', 'estate', 'and', 'consumer', 'products', 'He', 'managed', 'the', 'company', 'until', 'his', '2017', 'inauguration', 'He', 'coauthored', 'several', 'books', 'including', 'The', 'Art', 'of', 'the', 'Deal', 'He', 'owned', 'the', 'Miss', 'Universe', 'and', 'Miss', 'USA', 'beauty', 'pageants', 'from', '1996', 'to', '2015', 'and', 'he', 'produced', 'and', 'hosted', 'The', 'Apprentice', 'a', 'reality', 'television', 'show', 'from', '2003', 'to', '2015', 'Forbes', 'estimates', 'his', 'net', 'worth', 'to', 'be', '31', 'billion']
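If nltk.word_tokenize or nltk.pos_tag raises a LookupError, the required NLTK data has to be downloaded first (a one-time setup):

import nltk
nltk.download('punkt')  # tokenizer models used by word_tokenize
nltk.download('averaged_perceptron_tagger')  # tagger model used by pos_tag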
7. spaCy:
import spacy

nlp = spacy.load('en_core_web_sm')
document = nlp(str)
print([token.text for token in document])
result:
['Trump', 'was', 'born', 'and', 'raised', 'in', 'the', 'New', 'York', 'City', 'borough', 'of', 'Queens', 'and', 'received', 'an', 'economics', 'degree', 'from', 'the', 'Wharton', 'School', 'He', 'was', 'appointed', 'president', 'of', 'his', 'familys', 'real', 'estate', 'business', 'in', '1971', 'renamed', 'it', 'The', 'Trump', 'Organization', 'and', 'expanded', 'it', 'from', 'Queens', 'and', '*lyn', 'into', 'Manhattan', 'The', 'company', 'built', 'or', 'renovated', 'skyscrapers', 'hotels', 'casinos', 'and', 'golf', 'courses', 'Trump', 'later', 'started', 'various', 'side', 'ventures', 'including', 'licensing', 'his', 'name', 'for', 'real', 'estate', 'and', 'consumer', 'products', 'He', 'managed', 'the', 'company', 'until', 'his', '2017', 'inauguration', 'He', 'coauthored', 'several', 'books', 'including', 'The', 'Art', 'of', 'the', 'Deal', 'He', 'owned', 'the', 'Miss', 'Universe', 'and', 'Miss', 'USA', 'beauty', 'pageants', 'from', '1996', 'to', '2015', 'and', 'he', 'produced', 'and', 'hosted', 'The', 'Apprentice', 'a', 'reality', 'television', 'show', 'from', '2003', 'to', '2015', 'Forbes', 'estimates', 'his', 'net', 'worth', 'to', 'be', '31', 'billion']
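spaCy assigns part-of-speech tags in the same pass, so NLTK's pos_tag step has a direct counterpart; a sketch on the document object from above:

print([(token.text, token.pos_) for token in document])  # (token, coarse POS tag) pairs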
8. StanfordCoreNLP:
nlp = StanfordCoreNLP(r'G:\stanford-corenlp-full-2018-10-05\stanford-corenlp-full-2018-10-05', lang='en')
print(nlp.word_tokenize(str))
result:
['Trump', 'was', 'born', 'and', 'raised', 'in', 'the', 'New', 'York', 'City', 'borough', 'of', 'Queens', 'and', 'received', 'an', 'economics', 'degree', 'from', 'the', 'Wharton', 'School', 'He', 'was', 'appointed', 'president', 'of', 'his', 'familys', 'real', 'estate', 'business', 'in', '1971', 'renamed', 'it', 'The', 'Trump', 'Organization', 'and', 'expanded', 'it', 'from', 'Queens', 'and', '*lyn', 'into', 'Manhattan', 'The', 'company', 'built', 'or', 'renovated', 'skyscrapers', 'hotels', 'casinos', 'and', 'golf', 'courses', 'Trump', 'later', 'started', 'various', 'side', 'ventures', 'including', 'licensing', 'his', 'name', 'for', 'real', 'estate', 'and', 'consumer', 'products', 'He', 'managed', 'the', 'company', 'until', 'his', '2017', 'inauguration', 'He', 'coauthored', 'several', 'books', 'including', 'The', 'Art', 'of', 'the', 'Deal', 'He', 'owned', 'the', 'Miss', 'Universe', 'and', 'Miss', 'USA', 'beauty', 'pageants', 'from', '1996', 'to', '2015', 'and', 'he', 'produced', 'and', 'hosted', 'The', 'Apprentice', 'a', 'reality', 'television', 'show', 'from', '2003', 'to', '2015', 'Forbes', 'estimates', 'his', 'net', 'worth', 'to', 'be', '31', 'billion']
The above covers the segmentation workflow for all eight tools. On this text, jieba with the custom dictionary gave the cleanest Chinese segmentation of the domain terms (the brand names and "local egg"), while the three English tools, NLTK, spaCy, and StanfordCoreNLP, produced essentially identical token lists.
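As a quick sanity check of that last claim, the three English token lists can be compared directly (a sketch, assuming the outputs of steps 6-8 were saved to the hypothetical variables nltk_tokens, spacy_tokens, and stanford_tokens):

print(nltk_tokens == spacy_tokens == stanford_tokens)  # True if the three tools tokenize identically
print(len(nltk_tokens), len(spacy_tokens), len(stanford_tokens))  # token counts for comparison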