1. urllib.parse :
Utilities for processing URLs and their parameters.
urllib.parse.quote()
URL encoding.
https://www.baidu.com/s?ie=UTF-8&wd=%E5%91%A8%E6%9D%B0%E4%BC%A6
Letters, digits, underscores and characters such as : / ? = are kept as-is;
any other character has to be percent-encoded.
urllib.parse.unquote()
URL decoding.
[Note] Only the query parameters need to be encoded, not the whole URL.
urllib.parse.urlencode(data)
data is a dict; its keys and values are turned into query-string format and URL-encoded in one step.
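A minimal sketch of the three helpers described above, using the same Baidu search URL as sample data:

import urllib.parse

# quote(): percent-encode everything except the safe characters
encoded = urllib.parse.quote('周杰伦')
print(encoded)                                   # %E5%91%A8%E6%9D%B0%E4%BC%A6

# unquote(): reverse the encoding
print(urllib.parse.unquote(encoded))             # 周杰伦

# urlencode(): build and encode a whole query string from a dict in one step
query = urllib.parse.urlencode({'ie': 'UTF-8', 'wd': '周杰伦'})
print('https://www.baidu.com/s?' + query)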
2. Building the request object :
request = urllib.request.Request(url=url, headers=headers)
3. Simulating different types of requests :
get
Example: a Baidu search.
post
Example: Baidu Translate.
urlopen(url, data=None)
If data is passed, the request is a POST; if not, it is a GET, and the GET parameters have to be appended to the URL itself.
Handling form data:
formdata = urllib.parse.urlencode(formdata).encode('utf8')
ajax-get
Example: the Douban movie chart.
https://movie.douban.com/j/chart/top_list?type=5&interval_id=100%3A90&action=&start=40&limit=20
With 10 items per page:
page 1: start=0  limit=10
page 2: start=10 limit=10
page 3: start=20 limit=10
page n: start=(n-1)*10  limit=10
ajax-post
Example: KFC store locations.
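A compact sketch of the GET/POST distinction just described, reusing the Baidu search and Baidu Translate URLs from these notes (the two calls are left commented out because both would hit the network):

import urllib.request
import urllib.parse

# GET: the parameters are appended to the URL and data stays None
get_url = 'https://www.baidu.com/s?' + urllib.parse.urlencode({'ie': 'utf8', 'wd': '周杰伦'})
# urllib.request.urlopen(get_url)

# POST: the form data is urlencoded, encoded to bytes and passed through data=
post_data = urllib.parse.urlencode({'kw': 'baby'}).encode('utf8')
# urllib.request.urlopen('http://fanyi.baidu.com/sug', data=post_data)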
4. URLError / HTTPError :
These are exception classes that live in the urllib.error module.

URLError : raised when there is no network connection or the host does not exist, e.g. mi.com / jd.com.

Exception : the built-in base class for exceptions; every exception class inherits from it, directly or indirectly.

HTTPError : a subclass of URLError. When several except clauses are used together, put the subclass above and the parent class below.
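A minimal sketch of that ordering rule; the cnblogs URL is the same one used in error.py further down:

import urllib.request
import urllib.error

try:
    urllib.request.urlopen('https://www.cnblogs.com/fh-fendou/p/7479811.html')
except urllib.error.HTTPError as e:     # subclass first: carries .code and .reason
    print('HTTPError:', e.code, e.reason)
except urllib.error.URLError as e:      # parent class second: no network, bad host, ...
    print('URLError:', e.reason)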
5. More complex GET requests :
Baidu Tieba (forum pages)
page 1: https://tieba.baidu.com/f?kw=%E6%9D%8E%E6%AF%85&ie=utf-8&pn=0
page 2: https://tieba.baidu.com/f?kw=%E6%9D%8E%E6%AF%85&ie=utf-8&pn=50
page 3: https://tieba.baidu.com/f?kw=%E6%9D%8E%E6%AF%85&ie=utf-8&pn=100
page n: pn = (n-1) * 50
Requirement: ask the user for a tieba name plus the start and end page numbers, create a folder named after the tieba, fetch every page and save it to a "第n页.html" (page n) file inside that folder. The full implementation is in tieba.py below.
6. Handlers and custom openers :
So far we have only used urlopen() and request objects directly.
Handlers and custom openers are introduced to support more advanced features such as proxies and cookies.
The code below implements the most basic case; the advanced features follow exactly the same steps (see the cookie sketch after the code).

import urllib.request

url = 'http://www.baidu.com/'

# create the handler
handler = urllib.request.HTTPHandler()

# build an opener from the handler
opener = urllib.request.build_opener(handler)

# when sending the request, do not use urlopen(); use opener.open() instead
response = opener.open(url)

print(response.read().decode('utf8'))
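The same three steps (handler → opener → opener.open) also cover the advanced features mentioned above. As a hedged illustration that is not part of the original files, here is how cookies could be kept across requests with http.cookiejar and HTTPCookieProcessor:

import http.cookiejar
import urllib.request

# a CookieJar collects the cookies the server sends back
cookie_jar = http.cookiejar.CookieJar()
# HTTPCookieProcessor is the handler that reads from / writes to that jar
handler = urllib.request.HTTPCookieProcessor(cookie_jar)
opener = urllib.request.build_opener(handler)

response = opener.open('http://www.baidu.com/')
for cookie in cookie_jar:
    print(cookie.name, '=', cookie.value)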
7. Proxies :
Proxies in everyday life: game boosting, designated drivers, surrogacy, purchasing agents.
In a program: see the proxy diagram (not reproduced here).
Proxy providers: 快代理, 西刺代理, 芝麻代理, 阿布云代理.
(1) How to configure a proxy in the browser
(2) How to configure a proxy in code
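Besides opening requests through opener.open() as daili.py does below, the opener can also be installed globally so that a plain urlopen() call goes through the proxy; a minimal sketch, with the proxy address being only a placeholder:

import urllib.request

# ProxyHandler routes requests through the given proxy (placeholder address)
proxy_handler = urllib.request.ProxyHandler({'http': 'http://127.0.0.1:8888'})
opener = urllib.request.build_opener(proxy_handler)

# install_opener() makes this opener the default used by urllib.request.urlopen()
urllib.request.install_opener(opener)

response = urllib.request.urlopen('http://www.baidu.com/')
print(response.getcode())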
parse.py :
import urllib.parse

# url = 'https://www.baidu.com/s?ie=UTF-8&wd=周杰伦'

# string = urllib.parse.quote(url)

# string1 = urllib.parse.unquote(string)

# print(string)
# print(string1)

# urlencode
url = 'https://www.baidu.com/s?'
# put the GET parameters here
data = {
    'ie': 'utf8',
    'wd': '周杰伦'
}

query_string = urllib.parse.urlencode(data)
url += query_string
print(url)

'''
# append data to the url by hand to build the full url
# iterate over the dict and join each pair in key=value format
lt = []
for k, v in data.items():
    value = k + '=' + v
    lt.append(value)
# join the pieces with & to get the query string
query_string = '&'.join(lt)
url += query_string

print(url)
'''
request_obj.py :
import urllib.request
import ssl

# skip HTTPS certificate verification so the demo does not fail on SSL errors
ssl._create_default_https_context = ssl._create_unverified_context

url = 'http://www.baidu.com/'

# how to customise the User-Agent
# this headers dict can carry any request header, but usually only the UA needs to be set
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36'
}

# build the request object
request = urllib.request.Request(url=url, headers=headers)

# send the request by simply "opening" the request object
response = urllib.request.urlopen(request)
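Continuing from the script above, these are the standard attributes of the response object you would typically read next:

print(response.getcode())               # HTTP status code, e.g. 200
print(response.geturl())                # the URL that was actually fetched
print(response.headers)                 # the response headers
html = response.read().decode('utf8')   # the body, decoded to a string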
get_params.py :
import urllib.request
import urllib.parse

# ask the user for a search keyword
keyword = input('Enter the keyword to search for: ')
url = 'https://www.baidu.com/s?'
# GET parameters
data = {
    'ie': 'utf8',
    'wd': keyword,
}
query_string = urllib.parse.urlencode(data)

url += query_string

# send the request to the url and get the response
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.3',
}
request = urllib.request.Request(url=url, headers=headers)

response = urllib.request.urlopen(request)

# build the file name from the keyword
filename = keyword + '.html'
# write the response body to the file

with open(filename, 'wb') as fp:
    fp.write(response.read())
1- post.py :
import urllib.request
import urllib.parse

url = 'http://fanyi.baidu.com/sug'
# write the form data as a dict
formdata = {
    'kw': 'baby'
}
# the form data has to be urlencoded and turned into bytes
formdata = urllib.parse.urlencode(formdata).encode('utf8')
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.3',
}
# build the request object
request = urllib.request.Request(url=url, headers=headers)
response = urllib.request.urlopen(request, data=formdata)

print(response.read().decode('utf8'))
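The sug interface answers with JSON rather than HTML, so instead of printing the raw text it can be decoded with json.loads. A self-contained sketch, assuming the response keeps its usual shape (a 'data' list of k/v suggestion pairs):

import json
import urllib.parse
import urllib.request

formdata = urllib.parse.urlencode({'kw': 'baby'}).encode('utf8')
request = urllib.request.Request('http://fanyi.baidu.com/sug',
                                 headers={'User-Agent': 'Mozilla/5.0'})
result = json.loads(urllib.request.urlopen(request, data=formdata).read().decode('utf8'))

# assumed shape: {'errno': 0, 'data': [{'k': ..., 'v': ...}, ...]}
for item in result.get('data', []):
    print(item.get('k'), '->', item.get('v'))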

2 - post.py :

import urllib.request
import urllib.parse

# this interface signs its requests; calling it for other words means reverse-engineering sign/token
url = 'http://fanyi.baidu.com/v2transapi'

word = 'wolf'
# form data
formdata = {
    'from': 'en',
    'to': 'zh',
    'query': word,
    'transtype': 'realtime',
    'simple_means_flag': '3',
    'sign': '275695.55262',
    'token': '268ca3a468d99f5aac3a179efad0ab28',
}
# process the form data
formdata = urllib.parse.urlencode(formdata).encode('utf8')
headers = {
    # 'Accept': '*/*',
    # left commented out so the server returns an uncompressed response
    # 'Accept-Encoding': 'gzip, deflate',
    # 'Accept-Language': 'zh-CN,zh;q=0.9',
    # 'Connection': 'keep-alive',
    # left commented out so the length is computed automatically
    # 'Content-Length': '120',
    # 'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Cookie': 'BAIDUID=55279ECD6DDA84C66A41BA7CC1E6840E:FG=1; PSTM=1533627007; BIDUPSID=6F6C332F8A0E3C9949BD5D9F884F1FFB; PSINO=3; BDRCVFR[Y1-7gJ950Fn]=jCHWiyEa0lYpAN8n1msQhPEUf; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; H_PS_PSSID=1465_26963_26432_21099_26350_26925_22157; locale=zh; to_lang_often=%5B%7B%22value%22%3A%22en%22%2C%22text%22%3A%22%u82F1%u8BED%22%7D%2C%7B%22value%22%3A%22zh%22%2C%22text%22%3A%22%u4E2D%u6587%22%7D%5D; REALTIME_TRANS_SWITCH=1; FANYI_WORD_SWITCH=1; HISTORY_SWITCH=1; SOUND_SPD_SWITCH=1; SOUND_PREFER_SWITCH=1; Hm_lvt_64ecd82404c51e03dc91cb9e8c025574=1533694190; Hm_lpvt_64ecd82404c51e03dc91cb9e8c025574=1533694190; from_lang_often=%5B%7B%22value%22%3A%22zh%22%2C%22text%22%3A%22%u4E2D%u6587%22%7D%2C%7B%22value%22%3A%22en%22%2C%22text%22%3A%22%u82F1%u8BED%22%7D%5D',
    'Host': 'fanyi.baidu.com',
    'Origin': 'http://fanyi.baidu.com',
    'Referer': 'http://fanyi.baidu.com/?aldtype=16047',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
}
request = urllib.request.Request(url=url, headers=headers)
response = urllib.request.urlopen(request, data=formdata)

print(response.read().decode('utf-8'))
ajax_get.py :
import urllib.request
import urllib.parse

url = 'https://movie.douban.com/j/chart/top_list?type=5&interval_id=100%3A90&action=&'
print('10 items are shown per page')
page = int(input('Enter the page number: '))
# work out start and limit from page
start = (page-1) * 10
limit = 10
data = {
    'start': start,
    'limit': limit,
}
url += urllib.parse.urlencode(data)

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36',
}
request = urllib.request.Request(url=url, headers=headers)

response = urllib.request.urlopen(request)

print(response.read().decode('utf8'))
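The same start=(n-1)*10 formula also makes it easy to walk several pages in one run; a hedged sketch, where the page range and the movies_page_n.json file names are only illustrative choices, not part of the original script:

import urllib.parse
import urllib.request

base = 'https://movie.douban.com/j/chart/top_list?type=5&interval_id=100%3A90&action=&'
headers = {'User-Agent': 'Mozilla/5.0'}

for page in range(1, 4):                       # pages 1 to 3, purely as an example
    query = urllib.parse.urlencode({'start': (page - 1) * 10, 'limit': 10})
    request = urllib.request.Request(base + query, headers=headers)
    body = urllib.request.urlopen(request).read()
    # one JSON file per page (hypothetical naming scheme)
    with open('movies_page_%s.json' % page, 'wb') as fp:
        fp.write(body)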
ajax_post.py :
import urllib.request
import urllib.parse

url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=cname'
city = input('Enter the city to search for: ')
data = {
    'cname': city,
    'pid': '',
    'pageIndex': '1',
    'pageSize': '10'
}
data = urllib.parse.urlencode(data).encode('utf8')
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36',
}
request = urllib.request.Request(url=url, headers=headers)

response = urllib.request.urlopen(request, data=data)

print(response.read().decode('utf8'))
error.py :
import urllib.request
import urllib.error

# inheritance analogy: animal -> human -> man / woman -> one specific person

'''
url = 'http://www.maodan.com/'
# response = urllib.request.urlopen(url)
try:
    response = urllib.request.urlopen(url)
# except Exception as e:
except urllib.error.URLError as e:
    # except NameError as e:  this one would not catch the error
    print(e)

print('this line still runs; the exception was handled')
'''

url = 'https://www.cnblogs.com/fh-fendou/p/7479811.html'

try:
    response = urllib.request.urlopen(url)
# except urllib.error.HTTPError as e:
except (urllib.error.URLError, urllib.error.HTTPError) as e:
    print(e)
    print('httperror')
# except urllib.error.URLError as e:
#     print(e)
#     print('urlerror')

print('still running normally')
tieba.py :
import urllib.request
import urllib.parse
import os
import time

def main():
    baming = input('Enter the name of the tieba to crawl: ')
    start_page = int(input('Enter the first page number to crawl: '))
    end_page = int(input('Enter the last page number to crawl: '))
    url = 'https://tieba.baidu.com/f?'

    for page in range(start_page, end_page + 1):
        print('Crawling page %s......' % page)
        # build the request for this page from url and page
        request = handle_request(page, baming, url)
        # send the request, get the response and write it to the target file
        down_load(request, baming, page)
        print('Finished page %s' % page)
        time.sleep(3)

def down_load(request, baming, page):
    response = urllib.request.urlopen(request)
    # create the target folder from code
    dirname = baming
    # only create it if it does not exist yet
    if not os.path.exists(dirname):
        os.mkdir(dirname)
    # file name for this page
    filename = '第%s页.html' % page
    # full path of the file
    filepath = os.path.join(dirname, filename)
    # write the page content straight into filepath
    with open(filepath, 'wb') as fp:
        fp.write(response.read())


def handle_request(page, baming, url):
    pn = (page-1) * 50
    # build the url
    data = {
        'kw': baming,
        'ie': 'utf8',
        'pn': pn
    }
    url += urllib.parse.urlencode(data)
    # print(url)
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36',
    }
    # build the request object
    request = urllib.request.Request(url, headers=headers)
    return request

if __name__ == '__main__':
    main()
handler.py :
import urllib.request

url = 'http://www.baidu.com/'
# create the handler
handler = urllib.request.HTTPHandler()
# build an opener from the handler
opener = urllib.request.build_opener(handler)

# when sending the request, do not use urlopen(); use opener.open() instead
response = opener.open(url)

print(response.read().decode('utf8'))
daili.py :
import urllib.request

url = 'http://www.baidu.com/s?ie=UTF-8&wd=ip'
# ProxyHandler sends the request through the given proxy server
handler = urllib.request.ProxyHandler(proxies={'http': '218.60.8.98:3129'})
opener = urllib.request.build_opener(handler)

r = opener.open(url)

# save the result page (searching "ip" shows which IP the server saw)
with open('代理.html', 'wb') as fp:
    fp.write(r.read())
