login taobao - ltoddy/blog GitHub Wiki
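Background: for work I recently had to integrate with Alimama (阿里妈妈), which meant scraping data from its admin console into our own database to generate statistical reports.
I will first walk through the whole login flow and then explain how to obtain each parameter (the full code is at the bottom).
Open https://pub.alimama.com/ in a browser and right-click > Inspect. You will see that Alimama's login form is an iframe
whose src is: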
https://login.taobao.com/member/login.jhtml?redirectURL=https%3A%2F%2Fpub.alimama.com%2Fmanage%2Foverview%2Findex.htm&style=mini&full_redirect=true&newMini2=true&enup=0&qrlogin=1&keyLogin=true&sub=true&css_style=hudongcheng&from=alimama&disableQuickLogin=true
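Its domain is login.taobao.com (my guess is that this is Taobao's central registration and authentication service),
while the data we need lives on the domain pub.alimama.com.
So we first have to log in successfully on login.taobao.com, then take something like a signed token from it and exchange that at pub.alimama.com for a cookie.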
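The relationship between the two domains can be read straight off the iframe URL. The following is a minimal sketch using only the standard library; the helper name `describe_login_url` is made up for illustration.

```python
from urllib.parse import urlsplit, parse_qs

# The iframe src shown above, split across lines for readability.
IFRAME_SRC = (
    "https://login.taobao.com/member/login.jhtml"
    "?redirectURL=https%3A%2F%2Fpub.alimama.com%2Fmanage%2Foverview%2Findex.htm"
    "&style=mini&full_redirect=true&newMini2=true&enup=0&qrlogin=1"
    "&keyLogin=true&sub=true&css_style=hudongcheng&from=alimama"
    "&disableQuickLogin=true"
)


def describe_login_url(src: str) -> dict:
    """Return the host that performs the login and the host we are sent back to."""
    parts = urlsplit(src)
    query = parse_qs(parts.query)  # parse_qs also percent-decodes the values
    redirect = query["redirectURL"][0]
    return {
        "login_host": parts.netloc,
        "redirect_url": redirect,
        "redirect_host": urlsplit(redirect).netloc,
    }


info = describe_login_url(IFRAME_SRC)
print(info["login_host"], "->", info["redirect_host"])
```

The login happens on one host and the redirect target is on another, which is why a plain cookie from login.taobao.com is not enough by itself.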
Open this URL in the browser: https://login.taobao.com/member/login.jhtml?redirectURL=https%3A%2F%2Fpub.alimama.com%2Fmanage%2Foverview%2Findex.htm&style=mini&full_redirect=true&newMini2=true&enup=0&qrlogin=1&keyLogin=true&sub=true&css_style=hudongcheng&from=alimama&disableQuickLogin=true
Start by entering a deliberately wrong account and password. In Chrome DevTools you can see that
the page sends a request to https://login.taobao.com/newlogin/login.do?appName=taobao&fromSite=0.
The Content-Type field in request.headers is application/x-www-form-urlencoded.
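As a reminder of what that Content-Type means on the wire, here is a small sketch with the standard library. The three fields are a subset of the captured request in this article; the values are only examples.

```python
from urllib.parse import urlencode, parse_qsl

# A few fields in the same shape as the login form.
fields = {
    "loginId": "error",
    "keepLogin": "false",
    "returnUrl": "https://pub.alimama.com/manage/overview/index.htm",
}

# application/x-www-form-urlencoded is key=value pairs joined with '&',
# with reserved characters percent-encoded.
body = urlencode(fields)
print(body)

# Decoding gives the original mapping back.
decoded = dict(parse_qsl(body))
```

Passing a dict as `data=` to `requests.post` produces a body in this format, which is what the crawler code later in this article relies on.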
Now look at what is in the body:
loginId: error
password2: ae0a0718aee222344d1a4a044a03e5136677a55d76df17c75c8f4747bf0ed44744bc6bbe604775e0fce571da155adf79b36293f1308bb7c8434b2c1acc456a12ede793ca798ea4b72f0bbadb979f8517df67d5cf814b9a58599cdb3602deeb3d75fd708206320fc2b86c4a90b45e25107a8789a8eca373573c7d87f8009b1eab
keepLogin: false
ua: 134#dhi4nXXwXGf8eraCJXF+dJ0D3QROwKOlAOzBtZ26EXkECHUbCAsQRX3CiiEL+jv7sxI/oN2v2oH1OX8AHZv3BQ69cRCOTsD56E7AM8Mt0xXt+Tq6IgL7o6JqqH8uiNv1QWLFCqpAZtiw9RiXq/hKoGnBXcx5Qd8JL+GIqomYTvww+cx48qMsoXowqcyqqFxOocQBnXGdJ3r7/lZoG6jlZzpA71SXnkCgD5G1hPEXfDlggrqGwspGQq2v7Ii5wAjsibKu3XErdSaEhZI7v5L4R5imgtjfnyhRh87b84CwOtEVqO09Fy125nu4U2/gYcfzOKgpjPtezW6OWxuYGwnX/hVlV7QUPSYF9ty11DW+tJOAT1GQFBvsi/dPb3Mg9bGDlKSbwVqVs0dM0ZMmTiH6YVNv4ALhXgsM+Ask5fIwsOOfUcwBTMYjpjSrRWVM1wad8cV7GS4NCRlW4l5VdGNBlfUsIYOzM3XPa3C5pS/yD65Dzqz+RokhIScy8uoirg3X93uTfYhvEj0QIbQaJrsL059WJs5bRJmkxtXDdGq5ofKKPCliiJRY6Hv2rGoYagHFqzXca1p8rml2vFCtahTMkDSHO2v88JWZ34VXXTs8omdZ6M+K2F8N53w/qDR1mxsv8i6K1eoFfGBQ5YzgmX4R2LZF3nGZ6rj+HvuoJtu/pQ4CyoZCMg3XMSbL0ltmhqelTA2eR13k+wIu3NRnbHBlwr17BnBj3D7CgzC5UoCL6qpXGp9IzVLg9wPxdc8MnerZQQ5UNf2JsFFcQLACJyX8oiBoagfWL/vJctdsJsOykzqKLCERUKm2z6aCpcofPMeDQwKC2dS5UIKMJO4utnRE+Y5lyOEmMgLM9iUKu+fWDOWDWJZ0LoOTBmLAOhPbmTDvR4Hxi+JhUMZThTQYuDjS+9uiK6EbGF5LreO7+cyzf0cBXcM4ZV+O4reOTl8jchChfxB3plifu+STzXOsWFEEuCT4Kj2+oL6FJIev9L2ABFGT84beXjHYedDB1nXCWcUdUQGhwPpYCvumVx4S8hsfY8EkOWHn7a5H6oBjLp1kW0z+pPjI9XPgfHdrsP4bTLqmJvXvSqaHYoatwRGM4fdlA3idpkW1SlYPyj/5nHf/AqffEAB/ocVOzFwHA2DOpG9uS9wRVoeNjhjEssnnQUF4fHhWL6CnfNiNLPwgqEXp56Aflbt6k6LU16kozvJbeswjMIaE6J3vB9UQ2nCgo6sacFH1EyL3Qx19FsNUiloMuhEZsna01+4zfROEElX=
umidGetStatusVal: 255
screenPixel: 2048x1280
navlanguage: en-US
navUserAgent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36
navPlatform: MacIntel
sub: true
appName: taobao
css_style: hudongcheng
appEntrance: taobao_pc
_csrf_token: FhY7WREmClBSuLs0FOuXjA
umidToken: 96e3e0b38a8060f2fd89c6ac5d5b3c9fca46c1b7
hsiz: 1a46d8b237e0b8bf17aeed409748bcb8
newMini2: true
bizParams:
full_redirect: true
style: mini
appkey: 00000000
from: alimama
isMobile: false
lang: zh_CN
returnUrl: https://pub.alimama.com/manage/overview/index.htm
fromSite: 0
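There are a lot of parameters. loginId is the account and password2 is the encrypted password. Note that the encrypted password is 256 hexadecimal characters long, which corresponds to 128 bytes or 1024 bits. This is an important clue.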
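The length clue can be checked mechanically. This is a minimal sketch; `ciphertext_bits` is a hypothetical helper, and `sample` is a placeholder of the same length as password2, not a real ciphertext.

```python
def ciphertext_bits(hex_text: str) -> int:
    """Return the size in bits of a hex-encoded value."""
    raw = bytes.fromhex(hex_text)  # raises ValueError if the text is not valid hex
    return len(raw) * 8


# Placeholder with 256 hex characters, the same length as password2.
sample = "ae0a0718" * 32
print(len(sample), "hex chars =", ciphertext_bits(sample), "bits")
```

A fixed-size output that matches a common RSA modulus size, regardless of how long the password is, is what hints at RSA rather than a hash or a symmetric cipher.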
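To cut to the answer: the password is encrypted with RSA, and the ciphertext is then encoded to produce the value of password2 (the value shown above is hexadecimal). The clues can be found in index.js (I won't go into the details here). Many sites also use RSA, usually by bundling the public key directly into a JS file. Taobao does not do that; the front end builds the public key dynamically.
In other words, the back end sends the modulus and the exponent to the front end, and the front end constructs the public key from these two values. (Reference: https://www.ams.org/publicoutreach/msamhome/06-Kaliski.pdf)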
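To make "construct the public key from the modulus and exponent" concrete, here is a sketch of RSA encryption with PKCS#1 v1.5 padding in pure Python. It is not the code Taobao or the jsbn library runs. It assumes both values arrive as hexadecimal strings and that the result is hex zero-padded to the modulus length; whether the server requires those leading zeros is not confirmed here. The demo key is a throwaway built from two known Mersenne primes.

```python
import os


def pkcs1_v15_pad(message: bytes, key_len: int) -> bytes:
    """Build 0x00 0x02 <non-zero random bytes> 0x00 <message>."""
    pad_len = key_len - len(message) - 3
    if pad_len < 8:
        raise ValueError("message too long for this key size")
    padding = bytearray()
    while len(padding) < pad_len:
        # Padding bytes must be non-zero, so zero bytes are dropped and redrawn.
        padding += bytes(b for b in os.urandom(pad_len - len(padding)) if b)
    return b"\x00\x02" + bytes(padding) + b"\x00" + message


def rsa_encrypt_hex(password: str, modulus_hex: str, exponent_hex: str) -> str:
    """Encrypt with the public key (n, e) and return the ciphertext as hex."""
    n = int(modulus_hex, 16)
    e = int(exponent_hex, 16)
    key_len = (n.bit_length() + 7) // 8  # modulus size in bytes
    padded = pkcs1_v15_pad(password.encode("utf-8"), key_len)
    c = pow(int.from_bytes(padded, "big"), e, n)  # c = m^e mod n
    return format(c, f"0{key_len * 2}x")


# Demo key only (NOT Taobao's key): two Mersenne primes as p and q.
p, q = 2**521 - 1, 2**607 - 1
demo_modulus = format(p * q, "x")
ciphertext = rsa_encrypt_hex("hunter2", demo_modulus, "10001")
print(len(ciphertext))
```

The public key is nothing more than the pair (n, e), so no key file is needed once the page hands over those two numbers. Because the padding is random, encrypting the same password twice gives different ciphertexts.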
So how do we obtain these parameters?
Go back to the login page (https://login.taobao.com/member/login.jhtml?redirectURL=https%3A%2F%2Fpub.alimama.com%2Fmanage%2Foverview%2Findex.htm&style=mini&full_redirect=true&newMini2=true&enup=0&qrlogin=1&keyLogin=true&sub=true&css_style=hudongcheng&from=alimama&disableQuickLogin=true) and right-click > View Page Source. Look at the sixth script tag (the one with the most content). It defines five JS objects:
- window.PAGE_START_LOAD_TIME
- window.LOGIN_UMID_LOAD
- window.viewConfig
- window.viewData
- window._lang
We focus on window.viewConfig and window.viewData.
window.viewConfig has two fields, rsaExponent and rsaModulus, which are used to construct the public key.
window.viewData has a field loginFormData, which holds the remaining parameters.
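Pulling these objects out of the script text can be sketched with a regular expression. This assumes, as the crawler code in this article does, that each assignment sits on a single line and that the right-hand side is valid JSON. The sample script and its values are invented for illustration, and `extract_js_object` is a hypothetical helper.

```python
import json
import re

# Invented stand-in for the contents of the page's script tag.
SAMPLE_SCRIPT = """
window.PAGE_START_LOAD_TIME = 1;
window.viewConfig = {"rsaExponent": "10001", "rsaModulus": "c0ffee"};
window.viewData = {"loginFormData": {"appName": "taobao", "fromSite": "0"}};
"""


def extract_js_object(script_text: str, name: str) -> dict:
    """Find `window.<name> = {...};` on one line and parse the object as JSON."""
    pattern = re.compile(
        r"^\s*window\." + re.escape(name) + r"\s*=\s*(\{.*\})\s*;?\s*$",
        re.MULTILINE,
    )
    match = pattern.search(script_text)
    if match is None:
        raise ValueError(f"window.{name} not found")
    return json.loads(match.group(1))


view_config = extract_js_object(SAMPLE_SCRIPT, "viewConfig")
view_data = extract_js_object(SAMPLE_SCRIPT, "viewData")
print(view_config["rsaExponent"], view_data["loginFormData"])
```

If the real page escapes characters in a way that JSON does not accept, the text would need cleaning before `json.loads`; the sketch does not handle that.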
With these parameters plus a few headers (copy them from your browser), you can log in on login.taobao.com.
def login(self):
    url: str = 'https://login.taobao.com/newlogin/login.do?appName=taobao&fromSite=0'
    payload: dict = self._build_login_parameters()
    response: Response = requests.post(url, headers=self.headers, data=payload)
    print(response.headers.get('Set-Cookie'))
    print('------------------------------')
    pprint(response.json())  # requires: from pprint import pprint
Look at the response body once it has been parsed as JSON: it contains a redirectUrl field. If you open that URL in a browser, you will find that you are logged in!
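A failed login presumably will not carry that field, so it can help to read it defensively. The sketch below uses the nesting `content -> data -> redirectUrl` that the full code in this article indexes into; the sample bodies and the `find_redirect_url` helper are made up for illustration.

```python
from typing import Any, Optional


def find_redirect_url(body: Any) -> Optional[str]:
    """Return body['content']['data']['redirectUrl'] if present, else None."""
    node = body
    for key in ("content", "data"):
        node = node.get(key) if isinstance(node, dict) else None
    url = node.get("redirectUrl") if isinstance(node, dict) else None
    return url if isinstance(url, str) and url else None


# Invented examples of a successful and an unsuccessful response body.
ok_body = {"content": {"data": {"redirectUrl": "https://example.invalid/after-login"}}}
bad_body = {"content": {"data": {}}}

print(find_redirect_url(ok_body))
print(find_redirect_url(bad_body))
```

Returning `None` instead of raising `KeyError` lets the caller decide whether to retry or to report that the credentials were rejected.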
Full code:
crawler.py
import json
from typing import Tuple, Dict

import requests
from bs4 import BeautifulSoup
from bs4.element import Script
from jsbn import RSAKey
from requests.models import Response


class TaoBaoCrawler:
    def __init__(self) -> None:
        self.account = 'enter your account'
        self.password = 'enter your password'
        self.session = requests.Session()
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36',
            'Content-Type': 'application/x-www-form-urlencoded',
        }

    def login(self) -> None:
        url: str = 'https://login.taobao.com/newlogin/login.do?appName=taobao&fromSite=0'
        payload: dict = self._build_login_parameters()
        # Post through the session so that cookies set by the login response are kept.
        response: Response = self.session.post(url, headers=self.headers, data=payload)
        content: dict = response.json()
        redirect_url: str = content['content']['data']['redirectUrl']
        self.session.get(redirect_url, headers=self.headers)
        # At this point you are logged in on the pub.alimama.com domain; do whatever you need to do.

    def _build_login_parameters(self) -> Dict[str, str]:
        extra_params = TaoBaoCrawler._fetch_login_extra_parameters()
        login_form_data = extra_params['loginFormData']
        parameters = {
            'loginId': self.account,
            'password2': self._encrypt_password(),
'ua': '134#G7G4wJXwXGEPDeLsTJu/dJ0D3QROwKOlAOzBtZ26EXkECHUbI/2tCyWC624H7GV7mS+3y4LVtrBBssn7MKz3LrMN5Gi4LARlbqX5tVht+Tq6IgL7o6JqqH8uiNv1QWLFCqpAZtWtPGy8Aqn1rXoKXcpqqkujX6r2qT1AZtXw+2XaUSErSXwBLTP7/oEPYcr1ZUFJtpgX+U4K+bXiUVU3XQA0e7lJeDlPbOXUc6hUqTgBnKs1jFNIoHxMXyswr6TX1iFcRt0ZApNgVlq0tO3CRp9Joqtq8HAMXIRUtR+dTUoExx12/Av0QefEVW9Fd+f/uKnXQCczotscj+bMwgHXdLS3a+REjMRGyaHelde1+7o6MBIUUDaMBgD1n3bltW84fgS7JvP9Frchc8SA2TyMjGxn3Xq9myL4QrPv1KE2a/iAUYqwl8CRI9syfZQdlOj9e5Fi+rap/8T5Sy2rKwVi4nw92gUTfEcSs4oiJ4dFlJWQXR3XXTs8omUxSKdx9xyc5UEmq3bHEGlEl0ocqlI2sFH5iB3z/aZk2GCOcgOXx1Qu2Q5HSoq238PLSQaYdBAdYb6ebqoHmvDPeBC5MyVBCS3oQ5AcF2Gyij1IHm3UWH4zxp1P4VZrIFdsPqauwgdWp42X6QsoLC5zeyNdf7k5a62Z1TmaxIqvTeP8KqHOqPpMTf1xqKh1ihhxLJbMYyH9yMOEp62zI00KcB/6oEQmg33tanZYEH15eRTut6P7zkk/s+ukPm/YhY0VJ7VpzZyAvOp/mmynwkgHgIRqoHBGeAsA3wNtslS/qYhNbKH3mwRNWznH0bM61OoPQVNmOV2J3goobVvKge/hIXGCWX74lNgJagFKdCQSaU8AlKGE88SK/IFB1tMHK9XUV7AET8QFNuq5OTsP5THNC/5fE5c10S67tkfq5IrHbPUk0sqrARl6byxbie6j7rCeFT2vexJiI/w1XwEk5sywYGDlpYrzPJIfEzkXq5PUv20Jkl0D2bfiV9iolsxyfOW80mx4AfqTe1B2QyBUJRl+UAP9uqgxQMurBFJHHcjTlyhHyhh=',
            'navPlatform': 'MacIntel',
            'navUserAgent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36',
            'navlanguage': 'en-US',
            'screenPixel': '1792x1120',
            'umidGetStatusVal': 255,
            'returnUrl': 'https://pub.alimama.com/manage/overview/index.htm',
        }
        parameters.update(login_form_data)
        return parameters

    def _encrypt_password(self) -> str:
        modulus, exponent = TaoBaoCrawler._fetch_rsa_modulus_and_exponent()
        rsa = RSAKey()
        rsa.setPublic(modulus, exponent)
        return rsa.encrypt(self.password)

    @staticmethod
    def _fetch_rsa_modulus_and_exponent() -> Tuple[str, str]:
        # I don't write crawlers often, so this code is rather crude, but it works...
        url = 'https://login.taobao.com/member/login.jhtml?redirectURL=https://pub.alimama.com/manage/overview/index.htm&style=mini&full_redirect=true&newMini2=true&enup=0&qrlogin=1&keyLogin=true&sub=true&css_style=hudongcheng&from=alimama&disableQuickLogin=true'
        response: Response = requests.get(url)
        # I recall that the Python 3 standard library has an HTML parser, but I have never used it;
        # otherwise I could have avoided one dependency.
        soup: BeautifulSoup = BeautifulSoup(response.text, 'html.parser')
        script_tag: Script = soup.find_all('script')[6]
        content: str = script_tag.string
        for line in content.strip().split('\n'):
            line = line.strip()
            if line.startswith('window.viewConfig'):
                # Take everything after the first '='. str.strip('window.viewConfig =')
                # strips a set of characters, not a prefix, and only worked by accident.
                jsobj: str = line.split('=', 1)[1].strip()
                jsobj = jsobj.rstrip(';')
                jsobj = jsobj.replace('\\\\', '')
                view_config: dict = json.loads(jsobj)
                modulus = view_config['rsaModulus']
                exponent = view_config['rsaExponent']
                return modulus, exponent
        raise ValueError("can't acquire modulus and exponent!")

    @staticmethod
    def _fetch_login_extra_parameters() -> Dict[str, str]:
        url = 'https://login.taobao.com/member/login.jhtml?redirectURL=https://pub.alimama.com/manage/overview/index.htm&style=mini&full_redirect=true&newMini2=true&enup=0&qrlogin=1&keyLogin=true&sub=true&css_style=hudongcheng&from=alimama&disableQuickLogin=true'
        response: Response = requests.get(url)
        soup: BeautifulSoup = BeautifulSoup(response.text, 'html.parser')
        script_tag: Script = soup.find_all('script')[6]
        content: str = script_tag.string
        for line in content.strip().split('\n'):
            line = line.strip()
            if line.startswith('window.viewData'):
                # Same prefix handling as for window.viewConfig above.
                jsobj: str = line.split('=', 1)[1].strip()
                jsobj = jsobj.rstrip(';')
                jsobj = jsobj.replace('\\\\', '')
                view_data: dict = json.loads(jsobj)
                return view_data
        raise ValueError("can't acquire login extra parameters.")


def main():
    crawler = TaoBaoCrawler()
    crawler.login()


if __name__ == "__main__":
    main()
requirements.txt
beautifulsoup4
pyjsbn-rsa
requests
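The code comment in crawler.py wonders whether the standard-library HTML parser could replace the beautifulsoup4 dependency. A rough sketch of that idea with `html.parser.HTMLParser` is below. It is an alternative, not what crawler.py does; the sample HTML is invented, and `ScriptCollector` and `collect_scripts` are hypothetical names.

```python
from html.parser import HTMLParser
from typing import List


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> tag, in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.scripts: List[str] = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.scripts[-1] += data


def collect_scripts(html: str) -> List[str]:
    parser = ScriptCollector()
    parser.feed(html)
    parser.close()
    return parser.scripts


# Invented sample page with one external and one inline script.
SAMPLE_HTML = (
    '<html><head><script src="a.js"></script>'
    '<script>window.viewConfig = {"rsaExponent": "10001"};</script>'
    "</head><body><p>hi</p></body></html>"
)

scripts = collect_scripts(SAMPLE_HTML)
largest = max(scripts, key=len)  # "the script tag with the most content"
print(len(scripts), largest)
```

Selecting the largest script instead of a fixed index is one way to be less sensitive to the page adding or removing script tags, although it is still a heuristic.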
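Finally, a few things are worth noting: the request's Content-Type is x-www-form-urlencoded, and the login form is embedded through an iframe.
The URLs end in .htm (a three-character extension, a convention inherited from older systems that limited file extensions to three characters). Taken together, these points suggest an old system that probably nobody maintains,
so this approach will likely keep working for a long time.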
