Spider

1. Proxy IP pool and Abuyun usage :

See the code below and the Abuyun manual: https://www.abuyun.com/http-proxy/dyn-manual.html

pool.txt :

218.60.8.98:3129
122.72.18.34:80
124.235.208.252:443
182.88.178.229:8123
121.43.170.207:3128
1.71.188.37:3128
124.235.208.252:443
113.200.56.13:8010
114.215.95.188:3128

pool.py :

import urllib.request
import random
import time

url = 'http://www.baidu.com/s?ie=UTF-8&wd=ip'
headers = {
	'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36'
}
request = urllib.request.Request(url=url, headers=headers)

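# Read the proxy list; "with" closes the file even if reading fails
with open('pool.txt', 'r') as fp:
	string = fp.read()

# Split on line breaks: each non-empty line is one proxy server (host:port)
lt = [line.strip() for line in string.splitlines() if line.strip()]

# Stop once a proxy works or the list is exhausted
while lt:
	# Pick a random proxy from the list
	daili = random.choice(lt)
	proxy = {'http': daili}
	# Create the handler
	handler = urllib.request.ProxyHandler(proxies=proxy)
	# Create the opener
	opener = urllib.request.build_opener(handler)

	try:
		# Send the request; the timeout keeps a dead proxy from hanging forever
		response = opener.open(request, timeout=10)
		print('Proxy %s succeeded' % daili)

		with open('ip.html', 'wb') as fp:
			fp.write(response.read())
		break
	except Exception as e:
		# Remove the failed proxy from the list
		lt.remove(daili)
		print('Proxy %s failed' % daili)
	time.sleep(2)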
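
The pool handling can be separated from the network calls so that it is easy to check offline. The following is only a minimal sketch: the helper names parse_pool and next_proxy are hypothetical and do not appear in the original script.

```python
import random


def parse_pool(text):
    """Return the unique, non-empty host:port lines of a pool file, in order."""
    seen = []
    for line in text.splitlines():
        line = line.strip()
        if line and line not in seen:
            seen.append(line)
    return seen


def next_proxy(pool, failed=None):
    """Drop the proxy that just failed (if any) and pick another at random.

    Returns None when the pool is exhausted.
    """
    if failed in pool:
        pool.remove(failed)
    return random.choice(pool) if pool else None


# Sample text with a blank line and a duplicate entry
sample = '218.60.8.98:3129\n124.235.208.252:443\n\n124.235.208.252:443\n'
pool = parse_pool(sample)
print(pool)
```

Removing duplicates up front means one failed proxy is never retried under a second list entry.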

abuyun.py :

import urllib.request
import base64

user = 'HCQ4X00T441PYI5D'
pwd = 'E8C159C7668242ED'

# Join the user name and password with a colon
string = user + ':' + pwd
# Base64-encode the pair to build the Basic auth value
ret = 'Basic ' + base64.b64encode(string.encode('utf8')).decode('utf8')
# print(ret)

url = 'http://www.baidu.com/s?ie=UTF-8&wd=ip'
headers = {
	'Proxy-Authorization': ret
}
# Build the request object
request = urllib.request.Request(url=url, headers=headers)

# Route HTTP traffic through the Abuyun dynamic proxy
handler = urllib.request.ProxyHandler(proxies={'http': 'http-dyn.abuyun.com:9020'})
opener = urllib.request.build_opener(handler)

r = opener.open(request)

with open('ip.html', 'wb') as fp:
	fp.write(r.read())
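
The Proxy-Authorization value built above is ordinary HTTP Basic authentication: "Basic " followed by the Base64 encoding of user:password. A small sketch that isolates that step, using a hypothetical helper name basic_proxy_auth and made-up credentials:

```python
import base64


def basic_proxy_auth(user, pwd):
    """Build the value of a Proxy-Authorization header for Basic auth."""
    raw = ('%s:%s' % (user, pwd)).encode('utf8')
    return 'Basic ' + base64.b64encode(raw).decode('ascii')


# 'user' and 'pass' are placeholder credentials, not real ones
header_value = basic_proxy_auth('user', 'pass')
print(header_value)  # Basic dXNlcjpwYXNz
```

Base64 is an encoding, not encryption, so anyone who sees the header can recover the password.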

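2. Using cookies :

What is a cookie?
HTTP is stateless :
	client		server
	every request stands alone; the server sees no relationship between one request and the next
Logging in is one request.
Visiting a page that requires login is another request. These two requests have to be linked somehow, and cookies provide that link.

When you log in, the server's response carries a cookie (in the Set-Cookie header). The browser stores it and sends it back with every later request to that site.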
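
The round trip described above can be sketched with the standard http.cookies module: parse what the server sent in Set-Cookie, then build the Cookie header the client sends back. The helper name and the sample header value are assumptions made for illustration.

```python
from http.cookies import SimpleCookie


def cookie_header_from_set_cookie(set_cookie_value):
    """Turn a Set-Cookie value into the Cookie header a client would send back."""
    jar = SimpleCookie()
    jar.load(set_cookie_value)
    # Attributes such as Path and HttpOnly stay on the client; only name=value is sent
    return '; '.join('%s=%s' % (name, morsel.value) for name, morsel in jar.items())


cookie_header = cookie_header_from_set_cookie('sessionid=abc123; Path=/; HttpOnly')
print(cookie_header)  # sessionid=abc123
```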

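Goal : fetch a logged-in page from code, using Renren as the example
	http://www.renren.com/960481378/profile

(1) Copy the cookie from a packet capture
	Log in with the browser first. Then capture the request the browser sends when it visits the logged-in page, copy the Cookie value from the request headers, and paste it into the code.
	
(2) Simulate the login
	Approach : capture the login POST request, replay it from code, and create a CookieJar object (ck) that stores the cookies and sends them with later requests.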

cookie.py :

import urllib.request

url = 'http://www.renren.com/960481378/profile'
headers = {
	'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36',
	'Cookie': 'anonymid=jkl96t9i-7jfkjc; depovince=SC; _r01_=1; _de=F872F5698F7602B30ADE65415FC01940; ln_uact=17701256561; ln_hurl=http://head.xiaonei.com/photos/0/0/men_main.gif; jebe_key=89a1a125-709a-4912-aa70-cbaf2eddd932%7C86ba94a3b75a9848502e25ac92562959%7C1533740001336%7C1%7C1533740008302; jebecookies=f91c2331-f0e2-4c9a-a936-3299112bbed6|||||; JSESSIONID=abcEmpXxJ_WYs8adJICuw; ick_login=a3456fda-f4c3-4dde-98b7-7994763bb807; p=e456fef62e5fb2a7971b6bb34af6b17d8; first_login_flag=1; t=01e2ec7768631eae78816a83379f3f508; societyguester=01e2ec7768631eae78816a83379f3f508; id=960481378; xnsid=e374bf01; ver=7.0; loginfrom=null; wp_fold=0',
}
request = urllib.request.Request(url=url, headers=headers)

r = urllib.request.urlopen(request)

with open('renren.html', 'wb') as fp:
	fp.write(r.read())

moni.py :

import urllib.request
import urllib.parse
import http.cookiejar

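# Is there something in code that behaves like the browser: it stores cookies and sends them automatically on the next request?

# First create a CookieJar object to hold the cookies
ck = http.cookiejar.CookieJar()
# Create a handler from ck
handler = urllib.request.HTTPCookieProcessor(ck)
opener = urllib.request.build_opener(handler)

# Send every request below with opener.open() so that cookies are stored and sent automatically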

post_url = 'http://www.renren.com/ajaxLogin/login?1=1&uniqueTimestamp=2018741029602'
headers = {
	'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36',
}
request = urllib.request.Request(url=post_url, headers=headers)
formdata = {
	'email': '17701256561',
	'password': 'lizhibin666',
	'icode': '',
	'origURL': 'http://www.renren.com/home',
	'domain': 'renren.com',
	'key_id': '1',
	'captcha_type': 'web_login',
	'f': 'https%3A%2F%2Fwww.baidu.com%2Flink%3Furl%3DVwDXbx3oN5RBHzVxzj2jwbsO3z8VmHcZ1HZQTdC3enq%26wd%3D%26eqid%3D834642bf0000b410000000055b6bac87',
}
formdata = urllib.parse.urlencode(formdata).encode('utf8')

r = opener.open(request, data=formdata)

# print(r.read().decode('utf8'))
# Assuming the login succeeded, the jar now holds the session cookies

get_url = 'http://www.renren.com/960481378/profile'

request = urllib.request.Request(url=get_url, headers=headers)

r = opener.open(request)

with open('renren.html', 'wb') as fp:
	fp.write(r.read())
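
To watch the CookieJar do its job without depending on a real site, the same opener setup can be pointed at a throwaway local server. This is only a sketch: DemoHandler, its /login and /profile paths, and the cookie value are all hypothetical stand-ins, not Renren's real behaviour.

```python
import http.cookiejar
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a site with a login step."""

    def do_GET(self):
        self.send_response(200)
        if self.path == '/login':
            # The "login" response hands the client a session cookie
            self.send_header('Set-Cookie', 'session=abc123; Path=/')
            body = b'logged in'
        else:
            # Every other page echoes the Cookie header it received
            body = self.headers.get('Cookie', 'no cookie').encode('utf8')
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch_profile_with_cookiejar():
    server = HTTPServer(('127.0.0.1', 0), DemoHandler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = 'http://127.0.0.1:%d' % server.server_port

    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({}),          # talk to localhost directly
        urllib.request.HTTPCookieProcessor(jar),  # store and resend cookies
    )
    try:
        opener.open(base + '/login').read()
        body = opener.open(base + '/profile').read().decode('utf8')
    finally:
        server.shutdown()
        server.server_close()
    return body, [cookie.name for cookie in jar]


body, names = fetch_profile_with_cookiejar()
print(body, names)
```

The second request carries the cookie although the code never sets a Cookie header itself; the HTTPCookieProcessor adds it from the jar.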

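3. Regular expressions :

abc123@qq.com
text@163.com
Why regular expressions? Plain string functions are limited:
string.find('abc123@qq.com') only locates one fixed string. A regular expression can find every email address in a text, check whether an address is well formed, and match a whole class of strings, because the class follows rules. Learning regular expressions means learning those rules.
A woman's heart, a doctor's prescription, a Taoist's talisman, a programmer's regex.
1234567
\d{5,7}
Single-character rules :
	\d : any digit
	\D : anything \d does not match
	\w : digits, letters, underscore, and (for Python 3 str patterns) Chinese characters
	\W : anything \w does not match
	\s : any whitespace character, such as \n, \t, and space
	\S : anything \s does not match
	[] : [aeiou] matches exactly one of the listed characters
	. : any character except \n
	[^aeiou] : any one character not listed
Quantifiers :
	{m} : the preceding item exactly m times
	{m,} : at least m times; greedy, so it matches as many as it can
	{m,n} : at least m and at most n times
	{0,} : any number of times, same as *
	{1,} : at least once, same as +
	{0,1} : optional, same as ?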
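
A few of the rules above, tried out on made-up sample strings (the strings are assumptions chosen for illustration):

```python
import re

text = 'QQ 1234567, zip 10001, pin 1234'

# \d{5,7}: runs of 5 to 7 digits; the 4-digit pin is too short
digits = re.findall(r'\d{5,7}', text)
# [aeiou] matches one listed character, [^aeiou] one unlisted character
vowels = re.findall(r'[aeiou]', 'regex')
others = re.findall(r'[^aeiou]', 'regex')
# \w covers letters, digits, underscore and, for str patterns, Chinese characters
words = re.findall(r'\w+', 'abc_123 中文!')
# ? makes the preceding character optional
colors = re.findall(r'colou?r', 'color colour')
# {2,} is greedy: it takes every digit it can
longest = re.search(r'\d{2,}', 'a12345b').group()

print(digits, vowels, others, words, colors, longest)
```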
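Using them in Python :
	Module :
		import re 
	pattern = re.compile(r'xxx')
	Methods :
		pattern.match() matches only at the start of the string and returns the first match
		pattern.search() scans from any position and stops at the first match
			ret.group() returns the matched text
			ret.span() returns the (start, end) position of the match
		pattern.findall() returns a list of all matches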
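
The difference between the three methods shows up clearly when the pattern does not occur at the start of the string. A short sketch with a sample sentence:

```python
import re

pattern = re.compile(r'love')
string = 'i love you very love much'

at_start = pattern.match(string)   # the string does not start with 'love'
first = pattern.search(string)     # first occurrence anywhere
every = pattern.findall(string)    # all occurrences as a list of strings

print(at_start)                     # None
print(first.group(), first.span())  # love (2, 6)
print(every)                        # ['love', 'love']
```

Because match() and search() return None on failure, check the result before calling group() on it.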
Anchors :
	^ : the match must start here (beginning of the string)
	$ : the match must end here (end of the string)
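
Anchoring both ends turns a "find" pattern into a "validate" pattern. The sketch below uses a deliberately simplistic email rule (an assumption for illustration; real address syntax is far more permissive) and a hypothetical helper name looks_like_email:

```python
import re

# One or more word characters, '@', then a dotted domain; nothing before or after
EMAIL = re.compile(r'^\w+@\w+(\.\w+)+$')


def looks_like_email(candidate):
    """Return True if the whole string fits the simplistic pattern above."""
    return EMAIL.search(candidate) is not None


print(looks_like_email('abc123@qq.com'))             # True
print(looks_like_email('text@163.com'))              # True
print(looks_like_email('abc123@qq'))                 # False: no dotted domain
print(looks_like_email('mail me at abc123@qq.com'))  # False: extra text before it

# Without the anchors the same rule finds the address inside a longer string
found = re.search(r'\w+@\w+(\.\w+)+', 'mail me at abc123@qq.com').group()
print(found)  # abc123@qq.com
```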
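Groups : (an advanced regex feature)
	()
	1. Treat several characters as one unit	(a\d){5}
	2. Sub-patterns (capture groups)
	(?P<goudan>)	(?P=goudan) : name a group, then refer back to it by name
	\1	\2 : inside the pattern, the text matched by the first and second parentheses
	$1	$2 : the same groups in a replacement string in Sublime (Ctrl+H with regex mode on)
	If the pattern has sub-patterns, ret.group(1) is the text matched by the first one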
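
Groups are also what make search-and-replace useful: re.sub() accepts \1, \2 in the replacement string. A brief sketch on made-up inputs, with illustrative group names outer and inner:

```python
import re

# (a\d){3}: the parenthesised unit 'a' + digit, repeated three times
repeated = re.search(r'(a\d){3}', 'xa1a2a3y').group()

# Named groups with back-references: closing tags must mirror the opening ones
html = '<div><span>hello</span></div>'
m = re.search(r'<(?P<outer>\w+)><(?P<inner>\w+)>(.*?)</(?P=inner)></(?P=outer)>', html)

# Numbered groups reused in a replacement: swap the two numbers around the dash
swapped = re.sub(r'(\d+)-(\d+)', r'\2-\1', '10-20 and 3-4')

print(repeated)                                      # a1a2a3
print(m.group('outer'), m.group('inner'), m.group(3))  # div span hello
print(swapped)                                       # 20-10 and 4-3
```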
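Greedy matching :
	.*
	.+
	.*? non-greedy: match as little as possible
Flags :
	re.I : ignore case
	re.M : multi-line mode (^ and $ also match at each line break)
	re.S : single-line mode (. also matches \n)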
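
Flags can be combined with |, and their effect is easiest to see by running the same pattern with and without them. A sketch on made-up text:

```python
import re

text = 'Love\nlove\nLOVE'

plain = re.findall(r'^love$', text)                # ^ and $ see one big string
multi = re.findall(r'^love$', text, re.M)          # ^ and $ work per line
both = re.findall(r'^love$', text, re.I | re.M)    # per line, any letter case

block = '<p>line one\nline two</p>'
without_s = re.search(r'<p>(.*)</p>', block)       # . stops at the newline
with_s = re.search(r'<p>(.*)</p>', block, re.S)    # . crosses the newline

print(plain, multi, both)
print(without_s)        # None
print(with_s.group(1))
```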

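re.py : (to run this, save it under another name; a script called re.py shadows the standard re module, so "import re" would import the script itself)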

import re

# string = 'i love you very love much'
# pattern = re.compile(r'love')
# match() returns None on failure
# ret = pattern.match(string)

# ret = pattern.search(string)
# ret = pattern.findall(string)

# print(ret)
# group() returns the matched text; search() stops at the first match
# print(ret.group())
# print(ret.span())

'''
# Sub-patterns
string = '哈哈<div><span>天青色等烟雨,而我在等你</span></div>嘻嘻'
pattern = re.compile(r'<(\w+)><(\w+)>.*</\2></\1>')
# pattern = re.compile(r'<(?P<goudan>\w+)><(?P<maodan>\w+)>.*</(?P=maodan)></(?P=goudan)>')

ret = pattern.search(string)
print(ret.group())
print(ret.group(1))
print(ret.group(2))
'''

'''
# Greedy mode: .*? is the non-greedy form, so the match stops at the first </div>
string = '<div>啦啦啦啦啦啦，我是卖报的小行家</div></div></div></div>'
pattern = re.compile(r'<div>(.*?)</div>')
ret = pattern.search(string)

print(ret.group(1))
'''

'''
# Ignore case
string = 'love is a forever topic'
pattern = re.compile(r'LOVE', re.I)

ret = pattern.search(string)

print(ret.group())
'''
"""
# Multi-line mode: ^ matches at the start of every line ('你的' means 'your')
string = '''细思极恐
你的对手在看书
你的敌人在磨刀
你的闺蜜在减肥
隔壁老王在炼腰
'''
pattern = re.compile(r'^你的', re.M)
ret = pattern.search(string)

print(ret.group())
"""

# Single-line mode: . also matches newlines
string = '''<div>沁园春-雪
北国风光，千里冰封，万里雪飘
望长城内外，惟余莽莽
大河上下，顿失滔滔
</div>'''
pattern = re.compile(r'<div>(.*?)</div>', re.S)
ret = pattern.search(string)
print(ret.group(1))

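4. Regex case studies :

Qiushibaike funny pictures
	
Downloading images :
	Hotlink protection can make a direct urllib.request.urlretrieve() call fail.
	The server checks the Referer request header to see whether the request came from its own site. If it did, the image is served; if not, the image is refused.
	To fetch such an image from a program, set the Referer header yourself (the site's home page is enough). That means building a request object and sending it with the header attached.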
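
A minimal sketch of the Referer approach just described. The helper names build_image_request and download_image, the shortened User-Agent, and the example.com URLs are assumptions for illustration; only the request is built here, nothing is downloaded until download_image is called.

```python
import urllib.request


def build_image_request(image_url, referer):
    """Build a request that claims to come from the given page."""
    headers = {
        'User-Agent': 'Mozilla/5.0',
        'Referer': referer,  # usually the site's home page is enough
    }
    return urllib.request.Request(url=image_url, headers=headers)


def download_image(image_url, referer, filepath):
    """Fetch the image with the Referer header set and save it to filepath."""
    request = build_image_request(image_url, referer)
    with urllib.request.urlopen(request, timeout=10) as response:
        with open(filepath, 'wb') as fp:
            fp.write(response.read())


# Placeholder URLs; substitute the real image link and site home page
request = build_image_request('https://pic.example.com/a/1.jpg',
                              'https://www.example.com/')
print(request.get_header('Referer'))
```

urlretrieve() takes only a URL and cannot carry custom headers, which is why the request object is needed.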

qiutu.py :

import urllib.request
import urllib.parse
import re
import time
import os

def main():
	start_page = int(input('Enter the first page number: '))
	end_page = int(input('Enter the last page number: '))
	url = 'https://www.qiushibaike.com/pic/page/{}/'
	for page in range(start_page, end_page + 1):
		print('Downloading page %s......' % page)
		# Fill in the url and build the request object
		request = handle_request(url, page)
		# Send the request and read the response
		content = urllib.request.urlopen(request).read().decode('utf8')
		# Parse the content with a regex
		parse_content(content)
		print('Finished page %s' % page)
		time.sleep(2)

def parse_content(content):
	pattern = re.compile(r'<div class="thumb">.*?<img src="(.*?)" alt="(.*?)" />.*?</div>', re.S)
	ret = pattern.findall(content)
	# print(ret)
	# print(len(ret))
	# Make sure the output directory exists before saving into it
	dirname = 'qiutu'
	os.makedirs(dirname, exist_ok=True)
	# Walk the list and download each image in turn
	for tp in ret:
		# Image link (the page uses protocol-relative urls)
		image_src = 'https:' + tp[0]
		# Image title, used as the file name
		name = tp[1]
		# Save the image
		filename = name + '.' + image_src.split('.')[-1]
		filepath = os.path.join(dirname, filename)
		print('Downloading %s..' % filename)
		urllib.request.urlretrieve(image_src, filepath)
		print('Finished %s' % filename)
		time.sleep(2)

def handle_request(url, page):
	# Fill the page number into the url
	url = url.format(page)
	# print(url)
	headers = {
		'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36',
	}
	request = urllib.request.Request(url=url, headers=headers)
	return request

if __name__ == '__main__':
	main()
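
The parsing step can be tried offline by feeding the same pattern a small piece of HTML. The SAMPLE markup below is a hypothetical reconstruction of the page structure the pattern expects, and extract_images is an illustrative helper name:

```python
import re

# Hypothetical markup shaped like what the pattern in parse_content expects
SAMPLE = '''
<div class="thumb">
<a href="/article/1" target="_blank">
<img src="//pic.example.com/a/1.jpg" alt="first picture" />
</a>
</div>
<div class="thumb">
<a href="/article/2" target="_blank">
<img src="//pic.example.com/b/2.png" alt="second picture" />
</a>
</div>
'''

PATTERN = re.compile(
    r'<div class="thumb">.*?<img src="(.*?)" alt="(.*?)" />.*?</div>', re.S)


def extract_images(html):
    """Return (absolute url, file name) pairs for every thumbnail block."""
    results = []
    for src, alt in PATTERN.findall(html):
        url = 'https:' + src              # src is protocol-relative
        extension = url.rsplit('.', 1)[-1]
        results.append((url, alt + '.' + extension))
    return results


images = extract_images(SAMPLE)
for url, filename in images:
    print(url, filename)
```

With two groups in the pattern, findall() returns a list of tuples, one tuple per match.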

lizhi.py :

import urllib.request
import urllib.parse
import time
import re

def main():
	start_page = int(input('Enter the first page number: '))
	end_page = int(input('Enter the last page number: '))
	url = 'http://www.yikexun.cn/lizhi/qianming/list_50_{}.html'
	# Open the output file; "with" closes it even if a request fails
	with open('lizhi.html', 'w', encoding='utf8') as fp:
		for page in range(start_page, end_page + 1):
			# Build the request object
			request = handle_request(url, page)
			# Send the request and read the response
			content = urllib.request.urlopen(request).read().decode('utf8')
			# Parse the content
			parse_content(content, fp)
			time.sleep(2)

def parse_content(content, fp):
	pattern = re.compile(r'<div class="art-t">.*?<a href="(.*?)">(<b>)?(.*?)(</b>)?</a>.*?</div>', re.S)
	ret = pattern.findall(content)

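	# Walk the list and take out each title and link
	for tp in ret:
		href = 'http://www.yikexun.cn' + tp[0]
		title = tp[2]
		# Strip the <b> tag (unnecessary here: the optional groups in the pattern already exclude it)
		# title = title.strip('</b>')
		text = get_text(href)
		# Write the title and body to the already-open file
		string = '<h1>%s</h1>%s' % (title, text)
		fp.write(string)
		time.sleep(2)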


def get_text(href):
	# Build the request object
	request = handle_request(href)
	content = urllib.request.urlopen(request).read().decode('utf8')
	pattern = re.compile(r'<div class="neirong">(.*?)</div>', re.S)
	ret = pattern.search(content)
	# print(ret.group(1))
	# exit()
	# Return an empty string when the page has no "neirong" block
	return ret.group(1) if ret else ''

def handle_request(url, page=None):
	if page:
		url = url.format(page)
	headers = {
		'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36'
	}
	request = urllib.request.Request(url=url, headers=headers)
	return request

if __name__ == '__main__':
	main()

test.py :

# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
# 'Accept-Encoding': 'gzip, deflate, br',
# 'Accept-Language': 'zh-CN,zh;q=0.9',
# 'Cache-Control': 'max-age=0',
# 'Connection': 'keep-alive',
# 'Cookie': 'BIDUPSID=6F6C332F8A0E3C9949BD5D9F884F1FFB; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; BAIDUID=EFDAC6F7D747687E9C719E50A41D707F:FG=1; PSTM=1533783169; BD_UPN=12314353; delPer=0; BD_HOME=0; BD_CK_SAM=1; PSINO=3; H_PS_PSSID=1435_21117_20927; H_PS_645EC=3a04mDqnf2AOXeU2n6NNCFlEHTqg2o6UIX4PAa801GwAZ5PgQkN95DF2qY8; BDSVRTM=0',
# 'Host': 'www.baidu.com',
# 'Upgrade-Insecure-Requests': '1',
# 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',

# string = 'My name is {}, I like {}, my favorite is {}'
# string = 'My name is {2}, I like {1}, my favorite is {0}'
# string = 'My name is {name}, I like {like}, my favorite is {lala}'
# print(string.format(like='Stephen Chow', lala='Athena Chu', name='Crazy'))

# string = '<b>b有b没有那么一首b歌，可以让我跟着和</b>'
# string = string.strip('</b>')
# print(string)
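
The strip experiment above is worth spelling out: str.strip('</b>') does not remove the substring '</b>'. It removes any of the characters <, /, b and > from both ends, so a title that begins or ends with the letter b loses it as well. A short sketch with a made-up title, using re.sub() as one possible alternative:

```python
import re

title = '<b>banana bread</b>'

# strip() treats its argument as a set of characters, not as a tag
stripped = title.strip('</b>')

# Anchored alternatives remove exactly one leading <b> and one trailing </b>
cleaned = re.sub(r'^<b>|</b>$', '', title)

print(stripped)  # anana bread
print(cleaned)   # banana bread
```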

Last updated: 2018-08-09 19:23

Original link: http://yoursite.com/2018/08/09/代理池ip,阿布云代理使用,cookie使用,正则简单回顾,正则子模式,糗图正则匹配,正则抓取励志/

lronLin's Blog

Brevity is the soul of wisdom; verbosity is superficial ornament!
