第七章文本处理和正则表达式

Alan

5 年前

精通Python自动化脚本-运维人员宝典完整目录：

第一章 Python脚本概述
第二章 Python脚本调试和性能测试
第三章单元测试-单元测试框架的介绍
第四章自动化常规运维活动
第五章文件、目录和数据处理
第六章文件存档、加密和解密
第七章文本处理和正则表达式
第八章文档和报告
第九章操作各类文件
第十章网络基础 – Socket编程
第十一章使用Python脚本处理邮件
第十二章使用Telnet和SSH远程监控主机
第十三章创建图形化用户界面
第十四章处理Apache和其它的日志文件
第十五章 SOAP和REST API通讯
第十六章网络抓取 – 从网站上提取有用的信息
第十七章数据收集及报表
第十八章 MySQL和SQLite数据库管理

本章中我们将学习文本处理和正则表达式。文本处理是一个创建和修改文本的过程。Python有一个称为正则表达式的强大的库，可处理搜索和提取数据等任务。我们将学习如何对文件使用它，同时学习如何读取和写入文件。

我们将学习Python的正则表达式模块re，以及在Python中处理文件。同时学习re模块中的match(), search(), findall()和sub()函数。我们还将使用textwrap来学习Python中的文本封装。最后，我们会学习unicode字符。

本章中我们会学习如何课题：

文本封装
正则表达式
Unicode字符串

文本封装

本章中我们将学习Python的textwrap模块。该模块提供了TextWrapper来完成所有任务。textwrap模块用于格式化或封装普通文本。该模块提供了5个主要的函数：wrap(), fill(), dedent(), indent()和shorten()。下面我们将逐一来学习这些函数。

wrap()函数

wrap()函数用于将整个段落封装为单个字符串。输出结果是一个输出行列表。

语法为textwrap.wrap(text, width)：

text: 要封装的文本
width: 封装行允许的最大长度，默认值为70

下面我们来看一个wrap()的示例。创建一个脚本wrap_example.py并编写如下内容：

import textwrap

sample_string = '''Python is an interpreted high-level programming language
for general-purpose programming. Created by Guido van Rossum and first
released in 1991, Python has a design philosophy that emphasizes code
readability, notably using significant whitespace.'''

w = textwrap.wrap(text=sample_string, width=30)
print(w)

import textwrap

sample_string = '''Python is an interpreted high-level programming language

for general-purpose programming. Created by Guido van Rossum and first

released in 1991, Python has a design philosophy that emphasizes code

readability, notably using significant whitespace.'''

w = textwrap.wrap(text=sample_string, width=30)

print(w)

运行脚本，将会得到如下输出：

$ python3 wrap_example.py
['Python is an interpreted high-', 'level programming language for', 'general-purpose programming.', 'Created by Guido van Rossum', 'and first released in 1991,', 'Python has a design philosophy', 'that emphasizes code', 'readability, notably using', 'significant whitespace.']

$ python3 wrap_example.py

['Python is an interpreted high-', 'level programming language for', 'general-purpose programming.', 'Created by Guido van Rossum', 'and first released in 1991,', 'Python has a design philosophy', 'that emphasizes code', 'readability, notably using', 'significant whitespace.']

上例中我们使用了Python中的textwrap模块。首先我们创建一个名为sample_string的字符串。接着我们使用TextWrapper类指定了width。再后使用了wrap函数，字符串被封装长度30。最后我们打印出了各行。

fill()函数

fill()函数与textwrap.wrap的作用相似，只是它返回的数据是连接的、以新行分隔的字符串。该函数将输入封装为文本并返回包含封装文本的单个字符串。

该函数的语法为：textwrap.fill(text, width)

text：封装的文本
width：封装的一行允许的最大长度。默认值为70。

下面我们来看一个fill()的示例。创建一个fill_example.py脚本并编写如下内容：

import textwrap

sample_string = '''Python is an interpreted high-level programming
language.'''

w = textwrap.fill(text=sample_string, width=50)
print(w)

import textwrap

sample_string = '''Python is an interpreted high-level programming

language.'''

w = textwrap.fill(text=sample_string, width=50)

print(w)

运行脚本，我们将得到如下输出;

$ python3 fill_example.py
Python is an interpreted high-level programming
language.

$ python3 fill_example.py

Python is an interpreted high-level programming

language.

上例中我们使用了fill()函数。整个过程和我们在wrap()中的操作相同。首先，我们创建了一个字符串变量。接着我们创建了一个textwrap对象。然后，我们应用了fill()函数。最后，我们打印了输出。

dedent()函数

dedent()是textwrap模块中的另一个函数。该函数将我们文本每一行的普通前置空格进行移除。

该函数的语法如下：

textwrap.dedent(text)

其中text为取消缩进的文本。

下面我们来看一个dedent()的示例。创建一个脚本dedent_example.py并编写如下内容：

import textwrap

str1 = '''
        Hello Python World \tThis is Python 101
        Scripting language\n
        Python is an interpreted high-level programming language for general-purpose programming.
'''
print("Original: \n", str1)
print()

t = textwrap.dedent(str1)
print("Dedented: \n", t)

import textwrap

str1 = '''

Hello Python World \tThis is Python 101

Scripting language\n

Python is an interpreted high-level programming language for general-purpose programming.

'''

print("Original: \n", str1)

print()

t = textwrap.dedent(str1)

print("Dedented: \n", t)

运行脚本，我们将得到如下输出：

$ python3 dedent_example.py
Original:

	Hello Python World 	This is Python 101
	Scripting language

	Python is an interpreted high-level programming language for general-purpose programming.


Dedented:

Hello Python World 	This is Python 101
Scripting language

Python is an interpreted high-level programming language for general-purpose programming.

$ python3 dedent_example.py

Original:

Hello Python World This is Python 101

Scripting language

Python is an interpreted high-level programming language for general-purpose programming.

Dedented:

Hello Python World This is Python 101

Scripting language

Python is an interpreted high-level programming language for general-purpose programming.

上例中，我们创建了一个字符串变量str1。然后我们使用了textwrap.dedent()来移除了常见前置空白内容。制表符和空格都可视作空白内容，但两者并不等价。因此，本例中的对唯一空白内容 tag 进行了删除。

indent()函数

indent()函数用于在文本选择定行的起始处添加指定前置内容。

该函数的语法为：textwrap.indent(text, prefix)

text: 主字符串
prefix: 要添加的前置内容

创建一个脚本indent_example.py并编写如下内容：

import textwrap

str1 = "Python is an interpreted high-level programming language for general-purpose programming. Created by Guido van Rossum and first released in 1991, \n\nPython has a design philosophy that emphasizes code readability, notably using significant whitespace."

w = textwrap.fill(str1, width=30)
i = textwrap.indent(w , '*')
print(i)

import textwrap

str1 = "Python is an interpreted high-level programming language for general-purpose programming. Created by Guido van Rossum and first released in 1991, \n\nPython has a design philosophy that emphasizes code readability, notably using significant whitespace."

w = textwrap.fill(str1, width=30)

i = textwrap.indent(w , '*')

print(i)

运行脚本，我们将得到如下输出：

$ python3 indent_example.py
*Python is an interpreted high-
*level programming language for
*general-purpose programming.
*Created by Guido van Rossum
*and first released in 1991,
*Python has a design philosophy
*that emphasizes code
*readability, notably using
*significant whitespace.

$ python3 indent_example.py

*Python is an interpreted high-

*level programming language for

*general-purpose programming.

*Created by Guido van Rossum

*and first released in 1991,

*Python has a design philosophy

*that emphasizes code

*readability, notably using

*significant whitespace.

在上例中，我们使用了textwrap模块的fill()和indent()函数。首先，我们使用了fill方法将数据存储在变量w中。接着我们使用了indent方法，输出的每行都会添加一个前缀*。然后我们打印了输出。

shorten()函数

textwraps模块中的这一函数用于截断文本来适配指定的宽度。例如，我们想要创建摘要或预览时，可使用 shorten()函数。使用 shorten()时，文本中所有的空白内容都会被标准化为单个空格。

该函数的语法为：textwrap.shorten(text, width)

下面我们来看一个shorten()的示例。创建一个脚本shorten_example.py并编写如下内容：

import textwrap

str1 = "Python is an interpreted high-level programming language for general-purpose programming. Created by Guido van Rossum and first released in 1991, \n\nPython has a design philosophy that emphasizes code readability, notably using significant whitespace."

s = textwrap.shorten(str1, width=50)
print(s)

import textwrap

s = textwrap.shorten(str1, width=50)

print(s)

运行脚本，我们将得到如下输出：

$ python3 shorten_example.py
Python is an interpreted high-level [...]

1 2	$ python3 shorten_example.py Python is an interpreted high-level [...]

在上例中，我们使用了shorten()函数来截取文本来适配指定的宽度。首先，所有的空白内容都被截断为单个空格。如果结果符合指定宽度，则会将结果显示在屏幕上。否则，指定宽度的单词会显示在屏幕上，而其余的则放在占位符中。

正则表达式

这一部分中，我们将来学习Python中的正则表达式。正则表达式是一种专用编程语言，内置在Python中，用户可通过re来进行使用。我们为想要匹配的字符串集定义规则。使用正则表达式，我们可以从文件、代码、文档、电子表格等中提取指定的信息。

在Python中，正则表达式由re表示并可通过re模块来进行导入。正则表达式支持4种内容：

标识符
修饰符
空白字符
Flag标记

以下表格列出了标识符（identifier），对每一个都有相关的描述：

标识符	描述
\w	匹配字字母数字字符，包含下划线(_)
\W	匹配非字母数字字符，不包含下划线(_)
\d	匹配一个数字
\D	匹配一个非数字字符
\s	匹配一个空格
\S	匹配空格以外的字符
.	匹配一个点号 (.)，注：应为任意字符
\b	匹配任意新行以外的字符

下表中列出了修饰符（modifier）以及对应的描述：

修饰符	描述
^	匹配字符串的起始处
$	匹配字符串的结尾处
?	匹配0次或11次
*	匹配0次或多次
+	匹配1次或多次
\|	匹配两边内容之一x/y
[ ]	范围匹配
{x}	前面标识符的匹配次数

下表中列出了空白字符（whitespace）及相应描述：

字符	描述
\s	空格
\t	制表符 Tab
\n	新起一行
\e	回撤Escape
\f	换页Form feed
\r	返回

以下表格列出了标记及相应描述：

Flag 标记	描述
re.IGNORECASE	忽略大小写匹配
re.DOTALL	匹配包括新行在内的任意字符
re.MULTILINE	多行匹配
Re.ASCII	仅对ASCII字符进行转义匹配

下面我们来看看一些正则表达式的示例。我们将学习match(), search(), findall()和sub()函数。

ℹ️要使用Python中的正则表达式，必须要在脚本中导入re模块，这样才能对正则表达式使用所有的函数和方法。

下面我们就来逐一地学习这些函数。

match()函数

match()是re模块中的一个函数。该函数会将指定的re模式与字符串相匹配。如果找到了匹配，会返回一个匹配对象。匹配对象中包含匹配的相关信息。如果未找到匹配，我们得到的结果是None。match对象有两个方法：

group(num): 返回整个匹配
groups(): 在元组中返回所有匹配的子组

该函数的语法如下：

re.match(pattern, string)

下面我们来看一个re.match()的示例。创建一个脚本re_match.py并编写如下的内容：

import re

str_line = "This is python tutorial. Do you enjoy learning python ?"
obj = re.match(r'(.*) enjoy (.*?) .*', str_line)
if obj:
        print(obj.groups())

import re

str_line = "This is python tutorial. Do you enjoy learning python ?"

obj = re.match(r'(.*) enjoy (.*?) .*', str_line)

if obj:

print(obj.groups())

运行该脚本，将得到如下输出：

$ python3 re_match.py
('This is python tutorial. Do you', 'learning')

1 2	$ python3 re_match.py ('This is python tutorial. Do you', 'learning')

在以上脚本中，我们导入了re模块来在Python中使用正则表达式。然后我们创建了一个字符串str_line。接着我们创建了一个匹配对象obj并在其中存储了模式匹配的结果。本例中，(.*) enjoy (.*?) .*模式会打印出enjoy关键字之前的所有内容，以及只打出enjoy关键字之后的一个单词。然后我们使用了匹配对象的groups()方法。它会打印出元组中的所有匹配子字符串。因此，我闪将得到的输出是(‘This is python tutorial. Do you’, ‘learning’)。

search()函数

re模块中的search()函数会对字符串进行搜索。它会查找re模式中的所有位置。search()会接收一个模式和文本，并在指定的字符串中搜索匹配内容。它会在查找到匹配时会返回一个匹配对象，否则返回None。match对象有两个方法：

group(num): 返回整个匹配
groups(): 在元组中返回所有匹配的子组

该函数的语法如下：

re.search(pattern, string)

创建脚本re_search.py 并编写如下内容：

import re

pattern = ['programming', 'hello']
str_line = 'Python programming is fun'
for p in pattern:
        print('Searching for %s in %s' % (p, str_line))
        if re.search(p, str_line):
                print('Match found')
        else:
                print('No match found')

import re

pattern = ['programming', 'hello']

str_line = 'Python programming is fun'

for p in pattern:

print('Searching for %s in %s' % (p, str_line))

if re.search(p, str_line):

print('Match found')

else:

print('No match found')

运行脚本，将会得到如下输出：

$ python3 re_search.py
Searching for programming in Python programming is fun
Match found
Searching for hello in Python programming is fun
No match found

$ python3 re_search.py

Searching for programming in Python programming is fun

Match found

Searching for hello in Python programming is fun

No match found

上例中，我们使用了匹配对象的search()方法来查找正则模式。在导入re模块后，我们在列表中指定了模式。在该列表中我们编写了两个字符串：programming的hello。接着我们创建了一个字符串：Python programming is fun。我们编写了一个for循环来对指定模式进行逐一检查。如果找到了匹配内容，则执行if代码块，否则执行else代码块。

findall()函数

这是match对象的另一个方法。findall() 方法查找所有的匹配内容，然后以字符串列表的形式返回。列表中的每一个元素代表一个匹配。该方法搜索模式的匹配，并不重叠。

创建一个脚本 re_findall_example.py并编写如下内容：

import re

pattern = 'Red'
colors = 'Red, Blue, Black, Red, Green'
p = re.findall(pattern, colors)
print(p)

str_line = 'Peter Piper picked a peck of pickled peppers. How many pickled peppe
rs did Peter Piper pick?'
pt = re.findall('pe\w+', str_line)
pt1 = re.findall('pic\w+', str_line)
print(pt)
print(pt1)

line = 'Hello hello HELLO bye'
p = re.findall('he\w+', line, re.IGNORECASE)
print(p)

import re

pattern = 'Red'

colors = 'Red, Blue, Black, Red, Green'

p = re.findall(pattern, colors)

print(p)

str_line = 'Peter Piper picked a peck of pickled peppers. How many pickled peppe

rs did Peter Piper pick?'

pt = re.findall('pe\w+', str_line)

pt1 = re.findall('pic\w+', str_line)

print(pt)

print(pt1)

line = 'Hello hello HELLO bye'

p = re.findall('he\w+', line, re.IGNORECASE)

print(p)

运行脚本，将得到如下输出：

$ python3 re_findall_example.py
['Red', 'Red']
['per', 'peck', 'peppers', 'peppers', 'per']
['picked', 'pickled', 'pickled', 'pick']
['Hello', 'hello', 'HELLO']

$ python3 re_findall_example.py

['Red', 'Red']

['per', 'peck', 'peppers', 'peppers', 'per']

['picked', 'pickled', 'pickled', 'pick']

['Hello', 'hello', 'HELLO']

以上脚本中，我们使用findall() 方法编写了三个示例。第一个示例中，我们定义了一个模式和一个字符串。我们使用findall() 方法在这个字符串查找该模式，然后进行打印。第二个示例中，我们创建了一个字符串，并使用findall()查找该字符串中前面两个字母为pe的单词，然后进行打印。我们得到了一个前两个字母为pe的单词列表。

此外，我们查找前三个字母为pic的单词并进行打印。这里我们同样将得到一个字符串列表。第三个示例中，我们创建了一个字符串，包含大小写的hello以及单词bye。我们使用findall()查找前两个字母为he的单词。同时在findall()中我们使用了re.IGNORECASE标记来忽略单词的大小写并进行了打印。

sub()函数

这是re模块中最重要的函数之一。sub()用于将指定的替代文本替换掉re模式。它将以替换字符串来替换掉所有re模式发生的内容。语法如下：

re.sub(pattern, repl_str, string, count=0)

pattern: re模式
repl_str: 替换字符串
string: 主字符串
count: 进行替换的次数。默认值为0，表示替换所有发生之处。

下面我们来创建脚本re_sub.py并编写如下内容：

import re

str_line = 'Peter Piper picked a peck of pickled peppers. How many pickled peppe
rs did Peter Piper pick?'

print('Original: ', str_line)
p = re.sub('Peter', 'Mary', str_line)
print('Replaced: ', p)

p = re.sub('Peter', 'Mary', str_line, count=1)
print('Replacing only one occurence of Peter... ')
print('Replaced: ', p)

import re

str_line = 'Peter Piper picked a peck of pickled peppers. How many pickled peppe

rs did Peter Piper pick?'

print('Original: ', str_line)

p = re.sub('Peter', 'Mary', str_line)

print('Replaced: ', p)

p = re.sub('Peter', 'Mary', str_line, count=1)

print('Replacing only one occurence of Peter... ')

print('Replaced: ', p)

运行脚本，将得到如下输出：

$ python3 re_sub.py
Original:  Peter Piper picked a peck of pickled peppers. How many pickled peppers did Peter Piper pick?
Replaced:  Mary Piper picked a peck of pickled peppers. How many pickled peppers did Mary Piper pick?
Replacing only one occurence of Peter...
Replaced:  Mary Piper picked a peck of pickled peppers. How many pickled peppers did Peter Piper pick?

$ python3 re_sub.py

Original: Peter Piper picked a peck of pickled peppers. How many pickled peppers did Peter Piper pick?

Replaced: Mary Piper picked a peck of pickled peppers. How many pickled peppers did Mary Piper pick?

Replacing only one occurence of Peter...

Replaced: Mary Piper picked a peck of pickled peppers. How many pickled peppers did Peter Piper pick?

上例中，我们使用了sub()来以指定的字符串来替换re模式。我们以Mary替换了Peter。因此，所有的Peter都以Mary进行了替换。接着，我们增加了count参数。传入了count=1，表示只对Peter进行一次替换，其它的Peter都不进行替换。

下面，我们将学习re模块中的subn()函数。subn()函数和sub()作用相同，并包含更多的功能。subn() 函数返回一个包含新字符串和执行的替换次数的元组。我们来看一个subn() 的示例。创建一个脚本re_subn.py并编写如下内容：

import re

print('str1:- ')
str1 = 'Sky is blue. Sky is beautiful.'

print('Original: ', str1)
p = re.subn('beautiful', 'stunning', str1)
print('Replaced: ', p)
print()

print('str_line- :')
str_line = 'Peter Piper picked a peck of pickled peppers. How many pickled peppers did Peter Piper pick?'

print('Original: ', str_line)
p = re.subn('Peter', 'Mary', str_line)
print('Replaced: ', p)

import re

print('str1:- ')

str1 = 'Sky is blue. Sky is beautiful.'

print('Original: ', str1)

p = re.subn('beautiful', 'stunning', str1)

print('Replaced: ', p)

print()

print('str_line- :')

str_line = 'Peter Piper picked a peck of pickled peppers. How many pickled peppers did Peter Piper pick?'

print('Original: ', str_line)

p = re.subn('Peter', 'Mary', str_line)

print('Replaced: ', p)

运行脚本，我们将得到如下输出：

$ python3 re_subn.py
str1:-
Original:  Sky is blue. Sky is beautiful.
Replaced:  ('Sky is blue. Sky is stunning.', 1)

str_line- :
Original:  Peter Piper picked a peck of pickled peppers. How many pickled peppers did Peter Piper pick?
Replaced:  ('Mary Piper picked a peck of pickled peppers. How many pickled peppers did Mary Piper pick?', 2)

$ python3 re_subn.py

str1:-

Original: Sky is blue. Sky is beautiful.

Replaced: ('Sky is blue. Sky is stunning.', 1)

str_line- :

Original: Peter Piper picked a peck of pickled peppers. How many pickled peppers did Peter Piper pick?

Replaced: ('Mary Piper picked a peck of pickled peppers. How many pickled peppers did Mary Piper pick?', 2)

上例中，我们使用了subn()函数来替换RE模式。结果我们得到了一个包含替换后文本和替换次数的元组。

Unicode字符串

这一部分中我们将学习如何在Python中打印Unicode字符串。Python可以很轻易地处理Unicode字符串。字符串类型中包含的实际上是Unicode字符串，而非字节序列。

在系统中启动python3终端并编写如下内容：

student@python-scripting:~/Chapter07$ python3
Python 3.7.2 (default, Feb 26 2019, 15:56:02)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> print('\u2713')
✓
>>> print('\u2724')
✤
>>> print('\u2750')
❐
>>> print('\u2780')
➀
>>> chinese = '\u4e16\u754c\u60a8\u597d!'
>>> chinese
'世界您好!'
>>> s = '\u092E\u0941\u0902\u092C\u0908'
>>> s
'मुंबई'                   ------(印地语中“孟买”的意思)
>>> s = '\u10d2\u10d0\u10db\u10d0\u10e0\u10ef\u10dd\u10d1\u10d0'
>>> s
'გამარჯობა'                   ------(格鲁吉亚语中“Hello”的意思)
>>> s = '\u03b3\u03b5\u03b9\u03b1\u03c3\u03b1\u03c2'
>>> s
'γειασας'                   ------(希腊语中“Hello”的意思)
>>>

student@python-scripting:~/Chapter07$ python3

Python 3.7.2 (default, Feb 26 2019, 15:56:02)

[GCC 5.4.0 20160609] on linux

Type "help", "copyright", "credits" or "license" for more information.

>>> print('\u2713')

✓

>>> print('\u2724')

✤

>>> print('\u2750')

❐

>>> print('\u2780')

➀

>>> chinese = '\u4e16\u754c\u60a8\u597d!'

>>> chinese

'世界您好!'

>>> s = '\u092E\u0941\u0902\u092C\u0908'

>>> s

'मुंबई' ------(印地语中“孟买”的意思)

>>> s = '\u10d2\u10d0\u10db\u10d0\u10e0\u10ef\u10dd\u10d1\u10d0'

>>> s

'გამარჯობა' ------(格鲁吉亚语中“Hello”的意思)

>>> s = '\u03b3\u03b5\u03b9\u03b1\u03c3\u03b1\u03c2'

>>> s

'γειασας' ------(希腊语中“Hello”的意思)

>>>

Unicode代码点

这一部分中，我们将学习unicode代码点（code point）。Python有一个名为ord() 的强大内置函数，可从指定字符获取Unicode代码点。因此我们来看一个从字符获取Unicode代码点的示例，参见如下代码：

>>> str1 = u'Office'
>>> for char in str1:
... 	print('U+%04x' % ord(char))
...
U+004f
U+0066
U+0066
U+0069
U+0063
U+0065
>>> str2 = '中文'
>>> for char in str2:
... 	print('U+%04x' % ord(char))
...
U+4e2d
U+6587

>>> str1 = u'Office'

>>> for char in str1:

... print('U+%04x' % ord(char))

...

U+004f

U+0066

U+0069

U+0063

U+0065

>>> str2 = '中文'

>>> for char in str2:

... print('U+%04x' % ord(char))

...

U+4e2d

U+6587

编码

将Unicode代码点转换为字节串称为编码。我们来看一个如何对Unicode代码点编码的示例，参见如下代码：

>>> str = u'Office'
>>> enc_str = type(str.encode('utf-8'))
>>> enc_str
<class 'bytes'>

>>> str = u'Office'

>>> enc_str = type(str.encode('utf-8'))

>>> enc_str

解码

由字节串转换为Unicode代码点称为解码。下面我们来看一个如何对字节串解码获取Unicode代码点的示例，参见如下代码：

>>> str = bytes('Office', encoding='utf-8')
>>> dec_str = str.decode('utf-8')
>>> dec_str
'Office'

>>> str = bytes('Office', encoding='utf-8')

>>> dec_str = str.decode('utf-8')

>>> dec_str

'Office'

避免UnicodeDecodeError

UnicodeDecodeError在字节串无法解码为Unicode代码点时会报出。要避免这一异常，我们可以向decode中错误参数传递replace, backslashreplace或ignore，如下所示：

>>> str = b"\xaf"
>>> str.decode('utf-8', 'strict')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xaf in position 0: invalid start byte
>>> str.decode('utf-8', 'replace')
'�'       # 原文是'\ufffd'，测试了3.6和3.7均是前面的结果
>>> str.decode('utf-8', 'backslashreplace')
'\\xaf'
>>> str.decode('utf-8', 'ignore')
''

>>> str = b"\xaf"

>>> str.decode('utf-8', 'strict')

Traceback (most recent call last):

File "<stdin>", line 1, in <module>

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xaf in position 0: invalid start byte

>>> str.decode('utf-8', 'replace')

'�' # 原文是'\ufffd'，测试了3.6和3.7均是前面的结果

>>> str.decode('utf-8', 'backslashreplace')

'\\xaf'

>>> str.decode('utf-8', 'ignore')

总结

本章中我们学习了正则表达式，使用它我们可以定义一系列想要匹配字符串的规则。我们也学习了re模块中的四个函数：match(), search(), findall()和sub()。

我们学习了textwrap模块，用于对普通文本进行格式化和封装。我们还学习了textwrap模块中的wrap(), fill(), dedent(), indent()和shorten()函数。最后，我们学习了Unicode字符以及如何在Python中打印Unicode字符串。

下一章中，我们将学习Python中的标准文档和信息报告。

课后问题

Python中的正则表达式是什么？
编写一个Python程序来检查只包含指定字符集合的字符串（本例中为a–z, A–Z和0–9）。
Python中的哪个模块支持正则表达式？
a) re
b) regex
c) pyregex
d) 以上都不是
re.match函数的作用是什么？
a)在字符串起始处匹配模式
b)在字符中任意位置匹配模式
c)不存在该函数
d)以上都不对
以下的输出是什么？
语句: “we are humans”
匹配：re.match(r'(.*) (.*?) (.*)’, sentence)
a) (‘we’, ‘are’, ‘humans’)
b) (we, are, humans)
c) (‘we’, ‘humans’)
d) ‘we are humans’

扩展阅读

正则表达式: https://docs.python.org/3.2/library/re.html
Textwrap文档: https://docs.python.org/3/library/textwrap.html
Unicode文档: https://docs.python.org/3/howto/unicode.html