Character encoding is an unavoidable topic in computer programming. Whether you use Python 2, Python 3, C++, Java, or another language, it is well worth getting the concept of character encoding straight. This article is divided into the following sections:
Basic concepts
Introduction to common character encoding
Python's default encoding
Character type in Python 2
Root cause of UnicodeEncodeError & UnicodeDecodeError
Basic concepts
Character
In computing and telecommunications, a character is a unit of information. It is a general term for letters and symbols of all kinds, including the characters of national scripts, punctuation marks, graphic symbols, and digits. For example, a Chinese character, an English letter, and a punctuation mark each count as one character.
Character set
A character set is a collection of characters. There are many character sets, and each contains a different number of characters. Common character sets include ASCII, GB2312, and Unicode. The ASCII character set has 128 characters, made up of printable characters (such as uppercase and lowercase English letters, Arabic numerals, and the space) and control characters (such as carriage return and line feed). The GB2312 character set is the Chinese national standard for Simplified Chinese and includes simplified Chinese characters, general symbols, and digits. The Unicode character set aims to cover the characters used in all of the world's languages.
Character encoding
Character encoding maps each character in a character set to a specific binary number so that computers can process it. Common character encodings include ASCII, UTF-8, and GBK. The terms character set and character encoding are often treated as synonyms. For example, ASCII denotes both a set of characters and the way those characters are encoded.
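To summarize the three concepts:
Character: a unit of information, such as a letter, a Chinese character, or a punctuation mark.
Character set: a collection of characters, such as ASCII, GB2312, or Unicode.
Character encoding: the rule that maps each character of a set to a specific binary number, such as ASCII, GBK, or UTF-8.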
Introduction to common character encoding
Common character encodings are ASCII encoding, GBK encoding, Unicode encoding, and UTF-8 encoding. Here, we mainly introduce ASCII, Unicode and UTF-8.
ASCII
Computers were born in the United States, where people write in English. English text is nothing more than a combination of letters, digits, and some common symbols.
In the 1960s, the United States developed a character encoding scheme that defined the mapping between English letters, digits, and some common symbols on the one hand and binary numbers on the other. It is called ASCII (American Standard Code for Information Interchange).
For example, the binary representation of the uppercase English letter A is 01000001 (decimal 65), the binary representation of the lowercase English letter a is 01100001 (decimal 97), and the binary representation of the space SPACE is 00100000 (decimal 32).
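These values can be checked directly in Python. The following is a minimal sketch written for Python 3; the helper name ascii_bits is made up for illustration and is not part of any library.

```python
def ascii_bits(ch):
    """Return the code of an ASCII character as an 8-digit binary string."""
    code = ord(ch)  # integer code of the character
    if code > 127:
        raise ValueError("not an ASCII character: %r" % ch)
    return format(code, "08b")

# Print the decimal and binary codes of the three characters discussed above
for ch in ("A", "a", " "):
    print(repr(ch), ord(ch), ascii_bits(ch))
```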
Unicode
ASCII only defines codes for 128 characters, which is sufficient in the United States. However, computers later spread to Europe, Asia, and the rest of the world, where the languages are very different, and ASCII cannot represent them. Different countries and regions therefore developed their own encoding schemes, such as GB2312 and GBK in mainland China and Shift_JIS in Japan.
Although each country or region can define its own encoding scheme, text exchanged between computers that use different schemes shows up as garbled characters (mojibake), which is undoubtedly a disaster.
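The garbling described above is easy to reproduce. The sketch below (Python 3) encodes a Chinese string as UTF-8 and then decodes the bytes with the wrong encoding. Latin-1 is chosen here only because it can decode any byte sequence, so it serves as a stand-in for whatever mismatched encoding the receiving side happens to assume.

```python
text = "中文"
data = text.encode("utf-8")        # bytes produced by the sender
garbled = data.decode("latin-1")   # receiver assumes the wrong encoding
restored = data.decode("utf-8")    # receiver assumes the right encoding

# ascii() escapes non-ASCII characters, so printing is safe on any console
print(ascii(garbled))
print(ascii(restored))
```

The bytes themselves are never damaged; only the interpretation differs, which is why decoding with the correct encoding restores the original text.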
What can be done? The idea is simple: bring all of the world's languages under a single encoding scheme. This scheme is called Unicode. It assigns a unique code to every character of every language, so that text can be processed across languages and across platforms.
Unicode version 1.0 was released in October 1991, and the standard is still being updated. Each new version adds more characters. At the time of writing, the latest version was 9.0.0, announced on June 21, 2016.
The Unicode standard writes code points as hexadecimal numbers prefixed with U+. For example, the Unicode code point of the uppercase letter "A" is U+0041, and the code point of the Chinese character "严" is U+4E25. For more characters, you can consult unicode.org or a dedicated Chinese character table.
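In Python 3, ord() returns the code point of a character and chr() goes the other way. The following sketch formats code points in the U+ notation; the function name code_point is hypothetical, and the character 严 is the one whose code point is U+4E25.

```python
def code_point(ch):
    """Format the Unicode code point of a single character as U+XXXX."""
    return "U+%04X" % ord(ch)

print(code_point("A"))
print(code_point("严"))
print(chr(0x4E25) == "严")
```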
UTF-8
Unicode seems to be perfect, and it has achieved a big unity. However, Unicode has a big problem: a waste of resources.
Why? To represent all the characters in the world, Unicode initially used two bytes per character. When two bytes turned out to be insufficient, four bytes were used. For example, the Unicode code point of the Chinese character "严" is hexadecimal 4E25, which is the fifteen-digit binary number 100111000100101, so at least two bytes are needed to represent it. Other characters may need three or four bytes.
This is where the problem arises. If ASCII characters are represented in the same way, storage space is wasted. For example, the binary code of the uppercase letter "A" is 01000001, which fits in a single byte. If Unicode uses three or four bytes for every character, the leading bytes of "A" are all zeros, which is a waste of storage space.
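The arithmetic above can be checked with a short sketch (Python 3). UTF-32 is used here only as a stand-in for a fixed four-byte representation; that choice is an assumption made for illustration.

```python
# Code point U+4E25, the Chinese character from the example above
cp = ord("严")
bits = format(cp, "b")   # binary digits without leading zeros
print(hex(cp), bits, len(bits))

# "A" stored in a fixed four-byte form (big-endian, no byte order mark)
fixed = "A".encode("utf-32-be")
print(fixed, len(fixed))
```

Three of the four bytes used for "A" are zero, which is exactly the waste the text describes.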
To solve this problem, UTF-8, UTF-16, and UTF-32 were defined as encodings of Unicode. Here we only discuss UTF-8.
UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode that uses one to four bytes per character. For example, ASCII characters keep their one-byte encoding, Arabic and Greek letters use two bytes, commonly used Chinese characters use three bytes, and so on.
This is why we say that UTF-8 is one of the implementations of Unicode. Other implementations include UTF-16 (two or four bytes per character) and UTF-32 (four bytes per character).
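The byte counts mentioned above can be measured with the sketch below (Python 3). The sample characters and the helper name encoded_lengths are chosen for illustration only.

```python
samples = ["A", "α", "严", "😀"]

def encoded_lengths(ch):
    """Return the byte counts of one character in UTF-8, UTF-16 and UTF-32."""
    # The -le variants are used so that no byte order mark is prepended.
    return (len(ch.encode("utf-8")),
            len(ch.encode("utf-16-le")),
            len(ch.encode("utf-32-le")))

for ch in samples:
    # Print the code point rather than the character to stay console-safe
    print("U+%04X" % ord(ch), encoded_lengths(ch))
```

The ASCII letter grows from one byte in UTF-8 to four bytes in UTF-32, while the character outside the Basic Multilingual Plane needs four bytes in all three encodings.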
Python's default encoding
The default encoding of Python 2 is ascii, and the default encoding of Python 3 is utf-8. It can be queried as follows:
Python 2
Python 2.7.11 (default, Feb 24 2016, 10:48:05)
[GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
Python 3
Python 3.5.2 (default, Jun 29 2016, 13:43:58)
[GCC 4.2.1 Compatible Apple LLVM 7.3.0 (clang-703.0.31)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getdefaultencoding()
'utf-8'
Character type in Python 2
Python 2 has two string types, str and unicode, whose common parent class is basestring. A str is a sequence of bytes that can be in any of several encodings, such as ascii (the default), gbk, or utf-8. A unicode string is written as u'...'. The two types are related through decode() and encode(): a str is decoded into unicode, and unicode is encoded into a str.
The conversion between the two string types is summarized as follows:
Convert the UTF-8 encoded string 'xxx' to the unicode string u'xxx' with the decode('utf-8') method:
>>> '中文'.decode('utf-8')
u'\u4e2d\u6587'
Convert u'xxx' to the UTF-8 encoded string 'xxx' with the encode('utf-8') method:
>>> u'中文'.encode('utf-8')
'\xe4\xb8\xad\xe6\x96\x87'
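For comparison, here is a sketch of the same round trip in Python 3, where the text type is str and the byte type is bytes. This is an illustration only; it assumes Python 3 and is not part of the Python 2 session above.

```python
text = "中文"                  # str: a sequence of Unicode code points
data = text.encode("utf-8")    # bytes: the UTF-8 encoded form

# ascii() shows the code points in escaped form
print(ascii(text), data)

# Decoding reverses encoding
print(data.decode("utf-8") == text)
```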
Root cause of UnicodeEncodeError & UnicodeDecodeError
When writing Python 2 programs, you will often run into UnicodeEncodeError and UnicodeDecodeError. The root cause is that when code mixes str and unicode strings, Python 2 implicitly converts between them with the default ascii codec: it either encodes the unicode string or decodes the str, and either step fails as soon as a non-ASCII character is involved.
Two common scenarios are worth keeping in mind:
When a string operation involves both str and unicode, Python 2 implicitly decodes the str into unicode with the ascii codec, which can raise UnicodeDecodeError.
Let us look at an example:
>>> s = '你好'  # str type, utf-8 encoded
>>> u = u'世界'  # unicode type
>>> s + u  # implicit conversion, i.e. s.decode('ascii') + u
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
To avoid the error, explicitly specify 'utf-8' for decoding:
>>> s = '你好'  # str type, utf-8 encoded
>>> u = u'世界'
>>>
>>> s.decode('utf-8') + u  # explicitly decode with 'utf-8'
u'\u4f60\u597d\u4e16\u754c'  # note that this is not an error; it is a unicode string
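Python 3 handles the same situation differently: it never converts between bytes and text implicitly. The sketch below assumes Python 3, with bytes playing the role of the Python 2 str and str playing the role of the Python 2 unicode.

```python
s = "你好".encode("utf-8")   # bytes, like a utf-8 encoded Python 2 str
u = "世界"                    # text, like a Python 2 unicode string

try:
    s + u                     # mixing the two types is rejected outright
    mixed_ok = True
except TypeError:
    mixed_ok = False

joined = s.decode("utf-8") + u   # decoding explicitly works in both versions
print(mixed_ok, ascii(joined))
```

Decoding explicitly at the boundary, as in the last line, is the habit that avoids the error in Python 2 as well.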
If a function or class expects a str but receives unicode, Python 2 implicitly encodes it into a str with the ascii codec, which can easily raise UnicodeEncodeError.
Let us look at an example:
>>> u_str = u'你好'
>>> str(u_str)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
In the code above, u_str is a unicode string. Because str() has to produce a str, Python 2 tries to encode u_str with the ascii codec, that is:
u_str.encode('ascii')  # u_str is a unicode string
Encoding Chinese unicode text with ascii is bound to fail.
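The failing step can be reproduced explicitly. The sketch below assumes Python 3, where the conversion has to be requested by hand, and inspects the exception to show where the "position 0-1" in the message comes from.

```python
u_str = "你好"

try:
    u_str.encode("ascii")        # the step Python 2 performs implicitly
    error = None
except UnicodeEncodeError as exc:
    error = exc
    # start is inclusive and end is exclusive, so (0, 2) covers positions 0-1
    print(exc.encoding, exc.start, exc.end)

encoded = u_str.encode("utf-8")  # an encoding that covers these characters
print(encoded)
```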
Now look at an example that uses raw_input. Note that raw_input only accepts a prompt of type str:
>>> name = raw_input('input your name: ')
input your name: ethan
>>> name
'ethan'
>>> name = raw_input('输入你的名字：')
输入你的名字：小明
>>> name
'\xe5\xb0\x8f\xe6\x98\x8e'
>>> type(name)
<type 'str'>
>>> name = raw_input(u'输入你的名字: ')  # will try u'输入你的名字: '.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)
>>> name = raw_input(u'输入你的名字: '.encode('utf-8'))  # works, but name is not a unicode string at this point
输入你的名字: 小明
>>> name
'\xe5\xb0\x8f\xe6\x98\x8e'
>>> type(name)
<type 'str'>
>>> name = raw_input(u'输入你的名字: '.encode('utf-8')).decode('utf-8')  # recommended
输入你的名字: 小明
>>> name
u'\u5c0f\u660e'
>>> type(name)
<type 'unicode'>
Now look at an example that involves redirection:
# -*- coding: utf-8 -*-
hello = u'你好'
print hello
Save the code above to a file named hello.py. Running python hello.py in a terminal prints normally, but redirecting the output to a file with python hello.py > result raises UnicodeEncodeError.
This is because when printing to the console, print uses the console's encoding, whereas when the output is redirected to a file, print does not know which encoding to use and falls back to the default ascii encoding, which causes the error.
The code should be changed as follows:
# -*- coding: utf-8 -*-
hello = u'你好'
print hello.encode('utf-8')
With this change, python hello.py > result runs without problems.
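The effect of the output stream's encoding can be simulated without a shell. The sketch below assumes Python 3 and uses an in-memory byte stream in place of the redirected file; the helper name write_text is hypothetical.

```python
import io

def write_text(text, encoding):
    """Write text through a text wrapper that uses the given encoding
    and return the bytes that reached the underlying stream."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding)
    stream.write(text)
    stream.flush()
    return raw.getvalue()

hello = "你好"
utf8_bytes = write_text(hello, "utf-8")   # succeeds

try:
    write_text(hello, "ascii")            # mimics the ascii fallback
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

print(utf8_bytes, ascii_failed)
```

The text is the same in both calls; only the encoding attached to the output stream decides whether the write succeeds.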
Summary
UTF-8 is a variable-length character encoding for Unicode and is one of its implementations.
The Unicode character set has several encoding schemes, such as UTF-8, UTF-7, and UTF-16.
When a string operation involves both str and unicode, Python 2 implicitly decodes the str into unicode using ascii.
If a function or class expects a str but receives unicode, Python 2 implicitly encodes it into a str using ascii.