The problem: I use a program written in Python to read data from an .htm file. At first I opened the file with fr = open("" , "r"), and the program crashed as soon as it ran. The error message was: ValueError encoding must be one of 'utf_8','big5', or 'gbk'. Following that message, I used codecs and rewrote the code as follows:
# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
import codecs
fr = codecs.open("", "r", "utf-8")
This at least stopped the program from crashing on startup.
However, when reading the contents of the file, the program crashed as soon as it reached a line containing Chinese. The content of that line is as follows:
.....-ActiveX
The error message is as follows:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb3 in position 0: invalid start byte
Reason:
The encoding declared in the file's header is as follows:
<html>
<head>
<meta http-equiv="Content-Language" content="zh-cn">
<meta name="GENERATOR" content="Microsoft FrontPage 5.0">
<meta name="ProgId" content="">
<meta http-equiv="Content-Type" content="text/html; charset=gb2312">
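Its character set is gb2312, not UTF-8. In UTF-8, a byte such as 0xb3 can only appear as a continuation byte, never as the first byte of a character, which is why the utf-8 codec reports "invalid start byte".
GBK is a superset of GB2312, so the file should be read with the gbk encoding.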
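To see the mismatch in isolation, here is a minimal sketch. The two bytes in raw are a made-up sample that starts with 0xb3, like the byte in the error message; they are not taken from the actual file, and try_decode is a hypothetical helper name.

```python
# Hypothetical sample: two bytes beginning with 0xb3, as in the error message.
raw = b"\xb3\xcc"


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# 0xb3 cannot start a UTF-8 sequence, so this attempt fails and yields None.
as_utf8 = try_decode(raw, "utf-8")

# Under GBK the same two bytes form one double-byte character.
as_gbk = try_decode(raw, "gbk")
```

The same bytes are rejected by one codec and accepted by the other, so the fix is to pick the codec that matches the file rather than to change the data.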
Solution:
The file cannot be decoded with "utf-8"; use "gbk" instead:
fr = codecs.open("", "r", "gbk")
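If the encoding is not known in advance, one option is to read it from the charset declaration in the page before decoding. The following is only a rough sketch under several assumptions: sniff_charset and read_lines are hypothetical helper names, the declaration is assumed to sit in the first 2048 bytes, and io.open is used in place of codecs.open because it behaves the same on Python 2.6+ and Python 3.

```python
import io
import re


def sniff_charset(path, default="utf-8"):
    """Look for charset=... near the top of the file (hypothetical helper)."""
    with io.open(path, "rb") as f:
        head = f.read(2048)  # assumption: the <meta> tag is within this range
    match = re.search(br'charset=["\']?([A-Za-z0-9_\-]+)', head)
    if not match:
        return default
    name = match.group(1).decode("ascii").lower()
    # GBK is a superset of GB2312, so it also covers pages labelled gb2312.
    return "gbk" if name == "gb2312" else name


def read_lines(path):
    """Decode the whole file with the sniffed encoding and return its lines."""
    encoding = sniff_charset(path)
    with io.open(path, "r", encoding=encoding) as f:
        return f.read().splitlines()
```

A page whose declared charset does not match its actual bytes will still raise UnicodeDecodeError, so this does not replace checking the file when decoding fails.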