python - thawk/wiki GitHub Wiki

Python

Installation

Compiling Python

If you hit a "recompile with -fPIC" error, rebuild and reinstall Python with ./configure --enable-shared.

Downloading packages with pip for offline installation

  • Download

    pip install --download DIR PACKAGE
    pip install --download DIR -r requirements.txt

    NOTE: newer pip versions replace pip install --download with pip download -d DIR

  • Install

    pip install --no-index --find-links=DIR PACKAGE
    pip install --no-index --find-links=DIR -r requirements.txt

Upgrading installed packages with pip

pip list --outdated --format=freeze | cut -d = -f 1  | xargs -n1 pip install -U

pip mirrors in China

On Linux, edit ~/.pip/pip.conf, creating it if it does not exist.

On Windows, edit %HOMEPATH%\pip\pip.ini (newer pip versions read %APPDATA%\pip\pip.ini)

With the contents:

[global]
index-url = https://pypi.douban.com/simple

Aliyun mirror

index-url = https://mirrors.aliyun.com/pypi/simple
trusted-host = mirrors.aliyun.com

NOTE: when running pip under sudo, edit /root/.pip/pip.conf instead

NOTE: you can also leave pip.conf untouched and pass -i https://pypi.douban.com/simple on the command line

Installing packages with pip on Windows

  • First, install pip
    • Install distribute
      curl http://python-distribute.org/distribute_setup.py | python
    • Install pip
      curl https://raw.github.com/pypa/pip/master/contrib/get-pip.py | python
    • Optional: add Python's Scripts directory (e.g. C:\Python33\Scripts) to PATH so the pip command is always available
  • Install other packages with pip
    pip install requests
    

Installing pyenv

Dependencies

  • Ubuntu/Debian

    sudo apt-get install -y make build-essential libssl-dev zlib1g-dev libbz2-dev \
    libreadline-dev libsqlite3-dev wget curl llvm libncurses5-dev libncursesw5-dev \
    xz-utils tk-dev libffi-dev liblzma-dev python-openssl
  • Fedora/CentOS/RHEL(aws ec2)

    sudo yum install zlib-devel bzip2 bzip2-devel readline-devel sqlite sqlite-devel \
    openssl-devel xz xz-devel libffi-devel
  • Alpine

    apk add libffi-dev ncurses-dev openssl-dev readline-dev tk-dev xz-dev zlib-dev

Installation

  • Install (macOS)

    sudo installer -pkg /Library/Developer/CommandLineTools/Packages/macOS_SDK_headers_for_macOS_10.14.pkg -target /
    brew install pyenv pyenv-virtualenv pyenv-virtualenvwrapper
  • Install (other than macOS)

    curl https://pyenv.run | bash
    export PATH="/home/tanht/.pyenv/bin:$PATH"
    eval "$(pyenv init -)"
    eval "$(pyenv virtualenv-init -)"
  • Update

    # install the pyenv-update plugin
    git clone https://github.com/pyenv/pyenv-update.git $(pyenv root)/plugins/pyenv-update
    pyenv update
  • Uninstall

    rm -fr ~/.pyenv

Setting up Python environments for neovim

  • https://github.com/deoplete-plugins/deoplete-jedi/wiki/Setting-up-Python-for-Neovim
  • neovim needs the neovim package installed via pip before it can use Python
    • python2

      pyenv install 2.7.15
      pyenv virtualenv 2.7.15 neovim2
      pyenv activate neovim2
      pip install neovim
      pyenv which python  # Note the path
    • python3

      pyenv install 3.6.8
      pyenv virtualenv 3.6.8 neovim3
      pyenv activate neovim3
      pip install neovim
      pyenv which python  # Note the path
      
      # The following is optional, and the neovim3 env is still active
      # This allows flake8 to be available to linter plugins regardless
      # of what env is currently active.  Repeat this pattern for other
      # packages that provide cli programs that are used in Neovim.
      pip install flake8
      mkdir -p ~/bin
      ln -s `pyenv which flake8` ~/bin/flake8  # Assumes that $HOME/bin is in $PATH
    • init.vim

      let g:python_host_prog = '/full/path/to/neovim2/bin/python'
      let g:python3_host_prog = '/full/path/to/neovim3/bin/python'

Syntax

for..in..

for i in range(3):
  print i

function

  • Variables inside a function are local
  • A function can read the value of a global variable, but values assigned inside the function do not escape it
  • Declaring a name with global inside a function allows assigning to the global variable
  • Python passes all arguments by object reference. Numbers and strings are immutable, so rebinding a passed-in parameter is invisible to the caller. Dictionaries and lists, on the other hand, are mutable: changes a called function makes to them persist after it returns. This is confusing and commonly leads to lists being modified by accident, but there are good reasons for the behavior, such as avoiding copies of large data sets.

Special values

None: the null value

properties

Only usable with new-style classes (those derived from object)

class C(object):

    def __init__(self):
        self.__x = 0

    def getx(self):
        return self.__x

    def setx(self, x):
        if x < 0: x = 0
        self.__x = x

    x = property(getx, setx)

The signature of property is:

property(fget=None, fset=None, fdel=None, doc=None)
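
The same class can also be written with the @property decorator (Python 2.6+/3); a small sketch that clamps negative values the way setx above does:

```python
class C(object):
    """Equivalent of the property(getx, setx) class above."""

    def __init__(self):
        self.__x = 0

    @property
    def x(self):
        return self.__x

    @x.setter
    def x(self, value):
        # clamp negative values to 0, as in setx()
        self.__x = max(0, value)

c = C()
c.x = -5
print(c.x)  # 0
```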

Built-in types

list

  • l1 = ['abc', 'def']: construct a list
  • list.extend(another_list): adds items from another list (or any sequence) to the end
  • list.append(item): append an item to the end
  • b[:]=a[:]: clone a list
  • newList=zip(list1, list2, list3...): returns a list whose nth item is (list1[n],list2[n],list3[n]); its length is that of the shortest input list
  • del aList[1]: delete element aList[1]
  • a[1:1] = ['new_item']: insert one or more items into a list
  • print len(aList): get the length of a list
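
The operations above in one runnable sketch (values are arbitrary):

```python
a = ['abc', 'def']
a.append('ghi')           # append one item
a.extend(['jkl', 'mno'])  # add items from another sequence
b = a[:]                  # clone the list
a[1:1] = ['new']          # insert at position 1
del a[0]                  # delete element 0
pairs = list(zip([1, 2, 3], ['a', 'b']))  # truncated to the shortest input
print(len(a), pairs)
```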

dict

d = { 'key' : 'value' }
if d.has_key('key'):
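
has_key only exists in Python 2 and was removed in Python 3; the portable spelling is the in operator:

```python
d = {'key': 'value'}
if 'key' in d:            # replaces d.has_key('key'), which Python 3 removed
    print(d['key'])
value = d.get('missing', 'default')  # lookup with a fallback value
```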

string

Building a repeated string

print 'abc' * 4   # abcabcabcabc

format

print u'The story of {0}, {1}, and {other}.'.format('Bill', 'Manfred', other='Georg')
"First, thou shalt count to {0}"              # References first positional argument
"My quest is {name}"                          # References keyword argument 'name'
"Weight in tons {0.weight}"                   # 'weight' attribute of first positional arg
"Units destroyed: {players[0]}"               # First element of keyword argument 'players'.
"Harold's a clever {0!s}"                     # Calls str() on the argument first
"Bring out the holy {name!r}"                 # Calls repr() on the argument first
"A man with two {0:{1}}".format("noses", 10)  # 用参数{1}指定{0}的宽度
"hex number 0x{0:x}".format(123456)           # 以十六进制输出

trim

str.lstrip([chars]), str.rstrip([chars]), str.strip([chars]); by default they remove whitespace
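
For example:

```python
s = '  xxhelloxx  '
print(s.strip())             # 'xxhelloxx': whitespace removed by default
print(s.strip().strip('x'))  # 'hello': the chars argument strips 'x' characters
print('line\n'.rstrip())     # 'line'
```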

Splitting on line breaks

  • str.splitlines(True): keep the line-ending characters
  • str.splitlines(False): drop the line-ending characters

Predicates

  • str.startswith(prefix): whether the string starts with prefix
  • str.endswith(suffix): whether the string ends with suffix

Converting a byte array to a string

import array
s = array.array('B', list_or_tuple)
s.tostring()  # the raw bytes as a str (tobytes() in Python 3)
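
A small round-trip sketch using the Python 3 spelling tobytes() (the byte values here are arbitrary):

```python
import array

s = array.array('B', [72, 105, 33])  # 'B' = unsigned bytes
data = s.tobytes()                   # .tostring() on Python 2
print(data)                          # b'Hi!'
back = array.array('B', data)        # and back again
```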

bytearray

An array of bytes

bytearray.fromhex('00 01 0203')

Reflection

obj = globals()["ClassName"]()

  • vars(obj): returns a dict containing all of the object's attributes
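
A minimal sketch with a hypothetical Greeter class:

```python
class Greeter(object):
    def __init__(self):
        self.name = 'world'

# look up the class by name in the module's global namespace
obj = globals()['Greeter']()
print(vars(obj))             # {'name': 'world'}
print(getattr(obj, 'name'))  # attribute access by name
```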

BDD testing

Use behave, optionally together with the assertions provided by PyHamcrest.

Unit testing

Directory layout

http://stackoverflow.com/questions/1896918/running-unittest-with-typical-test-directory-structure

new_project/
    antigravity/
        antigravity.py
    test/
        __init__.py # may be empty; marks the test directory as a package
        test_antigravity.py
    setup.py
    run_tests.py
    etc.
  • run_tests.py

    #!/usr/bin/python2
    # vim: set fileencoding=utf-8 foldmethod=marker:
    """ 运行所有单元测试。
    
    运行tests目录下名为test_*.py的所有单元测试代码
    """
    
    import glob
    import unittest
    
    def create_test_suite():
        test_file_strings = glob.glob('tests/test_*.py')
        module_strings = ['tests.'+s[6:len(s)-3] for s in test_file_strings]
    
        suites = [unittest.defaultTestLoader.loadTestsFromName(name) \
                  for name in module_strings]
        test_suite = unittest.TestSuite(suites)
        return test_suite
    
    test_suite = create_test_suite()
    runner = unittest.TextTestRunner().run(test_suite)

Common test patterns

# 2.7 and newer
with self.assertRaises(SomeException):
    do_something()
# 2.7 and newer
with self.assertRaises(SomeException) as cm:
    do_something()

the_exception = cm.exception
self.assertEqual(the_exception.error_code, 3)
self.assertRaises(SomeException, func, *args, **kwds)

Filesystem

Joining segments into a path

import os
path = os.path.join(os.environ['HOME'], '.abook', 'rc')

Extracting the file extension

import os
(root, ext) = os.path.splitext("path/to/file.ext")

Finding files matching a pattern

Using shell wildcards

import glob
files = glob.glob('dir/*.sh')
iterator = glob.iglob('dir/*.sh')

Recursively finding matching files in subdirectories

import fnmatch
import os

matches = []
for root, dirnames, filenames in os.walk('src'):
  for filename in fnmatch.filter(filenames, '*.c'):
    matches.append(os.path.join(root, filename))

import os
import fnmatch

path = 'C:/Users/sam/Desktop/file1'

configfiles = [os.path.join(dirpath, f)
    for dirpath, dirnames, files in os.walk(path)
    for f in fnmatch.filter(files, '*.txt')]

Creating temporary files

Standard usage

from tempfile import NamedTemporaryFile
>>> f = NamedTemporaryFile(delete=False)
>>> f
<open file '<fdopen>', mode 'w+b' at 0x384698>
>>> f.name
'/var/folders/5q/5qTPn6xq2RaWqk+1Ytw3-U+++TI/-Tmp-/tmpG7V1Y0'
>>> f.write("Hello World!\n")
>>> f.close()
>>> os.unlink(f.name)

Classes that support the with statement

import os
import shutil
import tempfile

class TemporaryDirectory(object):
    '''
    A temporary directory to be used in a with statement.
    '''
    def __init__(self, suffix='', prefix='', dir=None, keep=False):
        self.suffix = suffix
        self.prefix = prefix
        self.dir = dir
        self.keep = keep

    def __enter__(self):
        self.tdir = tempfile.mkdtemp(self.suffix, self.prefix, self.dir)
        return self.tdir

    def __exit__(self, *args):
        if not self.keep and os.path.exists(self.tdir):
            shutil.rmtree(self.tdir, ignore_errors=True)

class TemporaryFile(object):
    def __init__(self, suffix="", prefix="", dir=None, mode='w+b'):
        if prefix is None:
            prefix = ''
        if suffix is None:
            suffix = ''
        self.prefix, self.suffix, self.dir, self.mode = prefix, suffix, dir, mode
        self._file = None

    def __enter__(self):
        fd, name = tempfile.mkstemp(self.suffix, self.prefix, dir=self.dir)
        self._file = os.fdopen(fd, self.mode)
        self._name = name
        self._file.close()
        return name

    def __exit__(self, *args):
        try:
            if os.path.exists(self._name):
                os.remove(self._name)
        except:
            pass

Usage

with TemporaryDirectory('_zipfile_replace') as tdir:
  z.extractall(path=tdir)

with TemporaryFile('_pdf_set_metadata.pdf') as f:
    p.save(f)
    raw = open(f, 'rb').read()
    stream.seek(0)

shutil: copying and moving files

  • shutil.copyfile(src, dst): dst must be a full filename
  • shutil.copymode(src, dst): copy the permission bits
  • shutil.copystat(src, dst): copy the permission bits, last access time, last modification time, and flags from src to dst
  • shutil.copy(src, dst): dst may be a directory
  • shutil.copy2(src, dst): copy() + copystat()
  • shutil.move(src, dst): move or rename
  • shutil.rmtree(path, [ignore_errors[, onerror]]): remove a directory tree
  • shutil.copyfileobj(fsrc, fdst[, length]): copy file contents
    import shutil
    
    with open('output_file.txt','wb') as wfd:
        for f in ['seg1.txt','seg2.txt','seg3.txt']:
            with open(f,'rb') as fd:
                shutil.copyfileobj(fd, wfd)

re

m = re.compile("^\s*(\S+)\s*=\s*(.*)$").match(line)
if not m or not in_block:
    return []       # not a valid abook datafile

if not in_format_block:
    addr.append(AddressItem(m.group(1), m.group(2)))

uuid

>>> import uuid

# make a UUID based on the host ID and current time
>>> uuid.uuid1()
UUID('a8098c1a-f86e-11da-bd1a-00112444be1e')

# make a UUID using an MD5 hash of a namespace UUID and a name
>>> uuid.uuid3(uuid.NAMESPACE_DNS, 'python.org')
UUID('6fa459ea-ee8a-3ca4-894e-db77e160355e')

# make a random UUID
>>> uuid.uuid4()
UUID('16fd2706-8baf-433b-82eb-8c7fada847da')

# make a UUID using a SHA-1 hash of a namespace UUID and a name
>>> uuid.uuid5(uuid.NAMESPACE_DNS, 'python.org')
UUID('886313e1-3b8a-5372-9b90-0c9aee199e5d')

# make a UUID from a string of hex digits (braces and hyphens ignored)
>>> x = uuid.UUID('{00010203-0405-0607-0809-0a0b0c0d0e0f}')

# convert a UUID to a string of hex digits in standard form
>>> str(x)
'00010203-0405-0607-0809-0a0b0c0d0e0f'

# get the raw 16 bytes of the UUID
>>> x.bytes
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f'

# make a UUID from a 16-byte string
>>> uuid.UUID(bytes=x.bytes)
UUID('00010203-0405-0607-0809-0a0b0c0d0e0f')

XML

Creating a new xml document

from xml.dom import minidom

# New document
xml = minidom.Document()

# Creates user element
userElem = xml.createElement("user")

# Set attributes to user element
userElem.setAttribute("name", "Sergio Oliveira")
userElem.setAttribute("nickname", "seocam")
userElem.setAttribute("email", "[email protected]")
userElem.setAttribute("photo","seocam.png")

# Append user element in xml document
xml.appendChild(userElem)

Writing an XML file

# Print the xml code
print xml.toxml("UTF-8")
fp = open("file.xml","w")
xml.writexml(fp, "    ", "", "\n", "UTF-8")

# Method's stub
# writexml(self, writer, indent='', addindent='', newl='', encoding=None)

lxml

Passing the unicode builtin as the encoding argument of lxml.etree.tostring() makes it return a unicode string.

from lxml import etree, cssselect
select = cssselect.CSSSelector("a tag > child")
root = etree.XML("<a><b><c/><tag><child>TEXT</child></tag></b></a>")
[ el.tag for el in select(root) ]

from lxml import etree
broken_html = "<html><head><title>test<body><h1>page title</h3>"
html = etree.HTML(broken_html)
result = etree.tostring(html, pretty_print=True, method="html")
print(result)

Beautiful Soup

from BeautifulSoup import BeautifulSoup
import re

doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc))

print soup.prettify()

zipfile

Listing the contents

To list the contents of an existing archive, you can use the namelist and infolist methods. The former returns a list of filenames, the latter a list of ZipInfo instances.

Using the zipfile module to list files in a ZIP file

# File: zipfile-example-1.py

import zipfile

file = zipfile.ZipFile("samples/sample.zip", "r")

# list filenames
for name in file.namelist():
    print name,
print

# list file information
for info in file.infolist():
    print info.filename, info.date_time, info.file_size
$ python zipfile-example-1.py
sample.txt sample.jpg
sample.txt (1999, 9, 11, 20, 11, 8) 302
sample.jpg (1999, 9, 18, 16, 9, 44) 4762

Reading data from a ZIP file

To read data from an archive, simply use the read method. It takes a filename as an argument, and returns the data as a string.

Using the zipfile module to read data from a ZIP file

# File: zipfile-example-2.py

import zipfile

file = zipfile.ZipFile("samples/sample.zip", "r")

for name in file.namelist():
    data = file.read(name)
    print name, len(data), repr(data[:10])
$ python zipfile-example-2.py
sample.txt 302 'We will pe'
sample.jpg 4762 '\377\330\377\340\000\020JFIF'

Writing data to a ZIP file

Adding files to an archive is easy. Just pass the file name, and the name you want that file to have in the archive, to the write method.

The following script creates a ZIP file containing all files in the samples directory.

Using the zipfile module to store files in a ZIP file
# File: zipfile-example-3.py

import zipfile
import glob, os

# open the zip file for writing, and write stuff to it

file = zipfile.ZipFile("test.zip", "w")

for name in glob.glob("samples/*"):
    file.write(name, os.path.basename(name), zipfile.ZIP_DEFLATED)

file.close()

# open the file again, to see what's in it

file = zipfile.ZipFile("test.zip", "r")
for info in file.infolist():
    print info.filename, info.date_time, info.file_size, info.compress_size
$ python zipfile-example-3.py
sample.wav (1999, 8, 15, 21, 26, 46) 13260 10985
sample.jpg (1999, 9, 18, 16, 9, 44) 4762 4626
sample.au (1999, 7, 18, 20, 57, 34) 1676 1103
...

The third, optional argument to the write method controls what compression method to use. Or rather, it controls whether data should be compressed at all. The default is zipfile.ZIP_STORED, which stores the data in the archive without any compression at all. If the zlib module is installed, you can also use zipfile.ZIP_DEFLATED, which gives you “deflate” compression.

The zipfile module also allows you to add strings to the archive. However, adding data from a string is a bit tricky; instead of just passing in the archive name and the data, you have to create a ZipInfo instance and configure it correctly. Here’s a simple example:

Using the zipfile module to store strings in a ZIP file
# File: zipfile-example-4.py

import zipfile
import glob, os, time

file = zipfile.ZipFile("test.zip", "w")

now = time.localtime(time.time())[:6]

for name in ("life", "of", "brian"):
    info = zipfile.ZipInfo(name)
    info.date_time = now
    info.external_attr = 0777 << 16L # give full access to included file
    info.compress_type = zipfile.ZIP_DEFLATED
    file.writestr(info, name*1000)

file.close()

# open the file again, to see what's in it

file = zipfile.ZipFile("test.zip", "r")

for info in file.infolist():
    print info.filename, info.date_time, info.file_size, info.compress_size
$ python zipfile-example-4.py
life (2000, 12, 1, 0, 12, 1) 4000 26
of (2000, 12, 1, 0, 12, 1) 2000 18
brian (2000, 12, 1, 0, 12, 1) 5000 31

Creating directory entries

newEpub = ZipFile(tfile, 'w')
newEpub.writestr('mimetype', 'application/epub+zip', compression=ZIP_STORED)
newEpub.writestr('META-INF/', '', 0700)
newEpub.writestr('META-INF/container.xml', CONTAINER)

copy: copying objects

import copy

x = copy.copy(y)        # make a shallow copy of y
x = copy.deepcopy(y)    # make a deep copy of y
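
The practical difference shows up with nested containers:

```python
import copy

y = [[1, 2], [3, 4]]
shallow = copy.copy(y)
deep = copy.deepcopy(y)

y[0].append(99)
print(shallow[0])  # [1, 2, 99]: the inner lists are shared with y
print(deep[0])     # [1, 2]: fully independent copy
```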

subprocess: starting processes, with support for redirection

http://www.oreillynet.com/onlamp/blog/2007/08/pymotw_subprocess_1.html

Basic usage

import subprocess

# Simple command
subprocess.call('ls -l', shell=True)

# Command with shell expansion
subprocess.call('ls -l $HOME', shell=True)

Reading Output of Another Command

By passing different arguments for stdin, stdout, and stderr it is possible to mimic the variations of os.popen().

Reading from the output of a pipe:

print '\nread:'
proc = subprocess.Popen('echo "to stdout"',
                        shell=True,
                        stdout=subprocess.PIPE,
                        )
stdout_value = proc.communicate()[0]
print '\tstdout:', repr(stdout_value)

A simpler approach

subprocess.check_output(["echo", "Hello World!"])
# 'Hello World!\n'

subprocess.check_output("exit 1", shell=True)
# Traceback (most recent call last):
#.Writing to the input of a pipe:
print '\nwrite:'
proc = subprocess.Popen('cat -',
                        shell=True,
                        stdin=subprocess.PIPE,
                        )
proc.communicate('\tstdin: to stdin\n')

Reading and writing, as with popen2:

print '\npopen2:'

proc = subprocess.Popen('cat -',
                        shell=True,
                        stdin=subprocess.PIPE,
                        stdout=subprocess.PIPE,
                        )
stdout_value = proc.communicate('through stdin to stdout')[0]
print '\tpass through:', repr(stdout_value)

You can interact with the child process through its .stdin/.stdout members.

communicate() can only be called once, and writing to stdin directly is discouraged. If you must feed stdin repeatedly, take care to handle exceptions and to close stdin when done, to avoid deadlocks.

The following code comes from https://gist.github.com/waylan/2353749

from subprocess import Popen, PIPE
import errno

p = Popen('less', stdin=PIPE)
for x in xrange(100):
    line = 'Line number %d.\n' % x
    try:
        p.stdin.write(line)
    except IOError as e:
        if e.errno == errno.EPIPE or e.errno == errno.EINVAL:
            # Stop loop on "Invalid pipe" or "Invalid argument".
            # No sense in continuing with broken pipe.
            break
        else:
            # Raise any other error.
            raise

p.stdin.close()
p.wait()

print 'All done!' # This should always be printed below any output written to less.
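
On Python 3.5+, subprocess.run() covers most of the Popen patterns above in a single call; a sketch, assuming a Unix-like system where echo and cat are available:

```python
import subprocess

# capture a command's stdout (text=True decodes bytes to str)
out = subprocess.run(['echo', 'to stdout'], capture_output=True, text=True)
print(repr(out.stdout))

# feed stdin and read stdout in one call, replacing Popen + communicate()
echoed = subprocess.run(['cat'], input='through stdin to stdout',
                        capture_output=True, text=True)
print(repr(echoed.stdout))
```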

mechanize Stateful programmatic web browsing in Python

os

  • os.makedirs(path, mode): create path together with any missing parent directories
  • os.removedirs(path): remove the leaf directory, then prune each parent directory that has become empty (a parent that still has other children is left alone and stops the pruning)
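
A small sketch of the two calls, using a throwaway directory created with tempfile (the path names are arbitrary):

```python
import os
import tempfile

base = tempfile.mkdtemp()
path = os.path.join(base, 'a', 'b', 'c')
os.makedirs(path)    # creates a, a/b and a/b/c in one call
print(os.path.isdir(path))

os.removedirs(path)  # removes c, then b, a (and base) as each becomes empty
print(os.path.exists(path))
```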

os.path

  • os.path.relpath(path, [start_dir]): return path relative to start_dir, which defaults to os.curdir
  • os.path.realpath(__file__): get the full path of the current script
  • os.path.dirname(): return the directory part
  • os.path.basename(): return the filename part
  • os.path.expanduser(): expand ~ and ~user
  • os.path.expandvars(): expand environment variables
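
The functions above in action (the paths are made up; results assume a POSIX system):

```python
import os.path

p = os.path.join('/home/user', 'docs', 'note.txt')
print(os.path.dirname(p))    # /home/user/docs
print(os.path.basename(p))   # note.txt
root, ext = os.path.splitext(p)
print(ext)                   # .txt
print(os.path.relpath('/home/user/docs', '/home/user'))  # docs
```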

chmpy

Listing all files

NOTE: filenames inside a chm must be converted to utf-8

#!/usr/bin/python
import sys
import locale
import codecs
try:
    from pychm import chmlib
except ImportError:
    from chm import chmlib

def getfilelist(chmpath):
    '''
    get filelist of the given path chm file
    return (bool,fileurllist)
    '''
    def callback(cf,ui,lst):
        '''
        innermethod
        '''
        lst.append(ui.path)
        return chmlib.CHM_ENUMERATOR_CONTINUE

    assert isinstance(chmpath,unicode)
    chmfile=chmlib.chm_open(chmpath.encode(sys.getfilesystemencoding()))
    lst=[]
    ok=chmlib.chm_enumerate(chmfile,chmlib.CHM_ENUMERATE_ALL,callback,lst)
    chmlib.chm_close(chmfile)
    return (ok,lst)

if __name__ == "__main__":
    sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout)
    sys.stderr = codecs.getwriter(locale.getpreferredencoding())(sys.stderr)

    (ok, lst) = getfilelist(unicode(sys.argv[1], locale.getpreferredencoding()))
    print lst

Reading file contents out of a chm

#!/usr/bin/python
import sys
try:
    from pychm import chmlib
except ImportError:
    from chm import chmlib

def getfiles(chmpath):
    def callback(cf,ui,lst):
        lst.append(ui.path)
        return chmlib.CHM_ENUMERATOR_CONTINUE

    assert isinstance(chmpath,unicode)
    chmfile=chmlib.chm_open(chmpath.encode(sys.getfilesystemencoding()))
    lst=[]
    ok=chmlib.chm_enumerate(chmfile,chmlib.CHM_ENUMERATE_ALL,callback,lst)

    files = {}
    if ok:
        for l in lst:
            result, ui = chmlib.chm_resolve_object(chmfile, l)
            if (result != chmlib.CHM_RESOLVE_SUCCESS):
                print u"Failed to resolve {0}".format(l)
                continue

            size, text = chmlib.chm_retrieve_object(chmfile, ui, 0L, ui.length)
            if (size == 0):
                print u"Failed to retrieve {0}".format(l)
                continue

            files[l] = text

    chmlib.chm_close(chmfile)
    return files

bisect: binary search

Finds the position at which a new element should be inserted into a sorted list

bisect.bisect_left

bisect.bisect_right
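
A short sketch of the difference between the two, plus bisect.insort:

```python
import bisect

a = [1, 2, 2, 4]
print(bisect.bisect_left(a, 2))   # 1: insertion point before existing equal items
print(bisect.bisect_right(a, 2))  # 3: insertion point after existing equal items

bisect.insort(a, 3)               # insert while keeping the list sorted
print(a)                          # [1, 2, 2, 3, 4]
```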

email

email.Header.decode_header(mail['subject'])
import sys,email,re
for line in sys.stdin:
    m = re.match(r'^Subject:\s*(.+)$', line)
    if m:
        for str,encoding in email.Header.decode_header(m.group(1)):
            print str.decode(encoding) if encoding else str

chardet

Automatic character-encoding detection

Simple usage

>>> import urllib
>>> rawdata = urllib.urlopen('http://yahoo.co.jp/').read()
>>> import chardet
>>> chardet.detect(rawdata)
{'encoding': 'EUC-JP', 'confidence': 0.99}

Incremental detection

import urllib
from chardet.universaldetector import UniversalDetector

usock = urllib.urlopen('http://yahoo.co.jp/')
detector = UniversalDetector()
for line in usock.readlines():
    detector.feed(line)
    if detector.done: break
detector.close()
usock.close()
print detector.result

Detecting multiple files with the same detector object

import glob
from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()
for filename in glob.glob('*.xml'):
    print filename.ljust(60),
    detector.reset()
    for line in file(filename, 'rb'):
        detector.feed(line)
        if detector.done: break
    detector.close()
    print detector.result

Networking

twisted

twisted is an excellent high-performance network programming framework; with it you can develop high-performance, high-quality network servers at lightning speed.


tornado

http://www.tornadoweb.org/

Web Framework

import tornado.ioloop
import tornado.web

class MainHandler(tornado.web.RequestHandler):
    def get(self):
        self.write("Hello, world")

def make_app():
    return tornado.web.Application([
        (r"/", MainHandler),
    ])

if __name__ == "__main__":
    app = make_app()
    app.listen(8888)
    tornado.ioloop.IOLoop.current().start()

requests

import requests
import logging

# Enabling debugging at http.client level (requests->urllib3->http.client)
# you will see the REQUEST, including HEADERS and DATA, and RESPONSE with HEADERS but without DATA.
# the only thing missing will be the response.body which is not logged.
try: # for Python 3
    from http.client import HTTPConnection
except ImportError:
    from httplib import HTTPConnection
HTTPConnection.debuglevel = 1

logging.basicConfig() # you need to initialize logging, otherwise you will not see anything from requests
logging.getLogger().setLevel(logging.DEBUG)
requests_log = logging.getLogger("requests.packages.urllib3")
requests_log.setLevel(logging.DEBUG)
requests_log.propagate = True

requests.get('http://httpbin.org/headers')

comet

https://github.com/sockjs/sockjs-client

urllib/urllib2

A script that prints the first 10 links on a web page

# Scan the web looking for references

import re
import urllib

regex = re.compile(r'href="([^"]+)"')

def matcher(url, max=10):
    "Print the first several URL references in a given url."
    data = urllib.urlopen(url).read()
    hits = regex.findall(data)
    for hit in hits[:max]:
        print urllib.basejoin(url, hit)

matcher("http://python.org")

Cookie handling

  • Basic usage

    import cookielib, urllib2
    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    r = opener.open("http://example.com/")
  • Importing Netscape, Mozilla, or Lynx cookies.txt files

    This example illustrates how to open a URL using your Netscape, Mozilla, or Lynx cookies (assumes Unix/Netscape convention for location of the cookies file):

    import os, cookielib, urllib2
    cj = cookielib.MozillaCookieJar()
    cj.load(os.path.join(os.environ["HOME"], ".netscape/cookies.txt"))
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    r = opener.open("http://example.com/")
  • Using DefaultCookiePolicy

    The next example illustrates the use of DefaultCookiePolicy. Turn on RFC 2965 cookies, be more strict about domains when setting and returning Netscape cookies, and block some domains from setting cookies or having them returned:

    import urllib2
    from cookielib import CookieJar, DefaultCookiePolicy
    policy = DefaultCookiePolicy(
        rfc2965=True, strict_ns_domain=DefaultCookiePolicy.DomainStrict,
        blocked_domains=["ads.net", ".ads.net"])
    cj = CookieJar(policy)
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    r = opener.open("http://example.com/")

pcap parsing tools

pure_pcapy: a pure-Python library for reading pcap files

import pure_pcapy as pcapy
from impacket.ImpactPacket import TCP
from impacket.ImpactDecoder import EthDecoder
import struct
import pandas as pd  # used below to build timestamps

cap = pcapy.open_offline(u'test.pcap')

(header, payload) = cap.next()

orders = list()
reports = list()

while header:
    eth = EthDecoder().decode(payload)
    ip = eth.child()

    if ip.get_ip_p() != TCP.protocol:
        (header, payload) = cap.next()  # advance to the next packet before skipping
        continue

    sec, us = header.getts()
    packet_time = pd.to_datetime(sec * 1000000000 + us * 1000) + pd.Timedelta(hours=8)

    tcp = ip.child()

    data = tcp.get_data_as_string()
    # MSG_HEADER_FMT / MSG_HEADER_SIZE describe an application-specific header, defined elsewhere
    packet_no, body_len = struct.unpack(MSG_HEADER_FMT, data[:MSG_HEADER_SIZE])

    ...

    (header, payload) = cap.next()

scapy

Scapy

PyShark

A tool that decodes packets using tshark.

pypcap: parses pcap capture files (tcpdump/WireShark)

Command-line handling

optparse: a replacement for getopt

It offers features getopt lacks, such as type conversion and automatically generated help

import optparse

parser = optparse.OptionParser()
parser.add_option('-a', action="store_true", default=False)
parser.add_option('-b', action="store", dest="b")
parser.add_option('-c', action="store", dest="c", type="int")

print parser.parse_args(['-a', '-bval', '-c', '3'])

Long options are handled exactly like short options:

import optparse

parser = optparse.OptionParser()
parser.add_option('--noarg', action="store_true", default=False)
parser.add_option('--witharg', action="store", dest="witharg")
parser.add_option('--witharg2', action="store", dest="witharg2", type="int")

print parser.parse_args([ '--noarg', '--witharg', 'val', '--witharg2=3' ])

Returning the number of times -v appears:

import optparse

parser = optparse.OptionParser()
parser.add_option('-v', '--verbose', action="count", dest="verbose", default=0)

options, remainder = parser.parse_args(['-v', '--verbose', '-v'])
print options.verbose

yaspin: progress indication

https://pypi.org/project/yaspin/

import time
from yaspin import yaspin

# Context manager:
with yaspin():
    time.sleep(3)  # time consuming code

# Function decorator:
@yaspin(text="Loading...")
def some_operations():
    time.sleep(3)  # time consuming code

sqlite

Binding a list (where col in (vals))

Use ','.join('?'*len(vals)) to build a placeholder sequence like ?,?,?
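
A sketch of the pattern with the sqlite3 module (the table and column names are made up):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE t (col TEXT)')
conn.executemany('INSERT INTO t VALUES (?)', [('a',), ('b',), ('c',)])

vals = ['a', 'c']
placeholders = ','.join('?' * len(vals))   # '?,?'
sql = 'SELECT col FROM t WHERE col IN (%s) ORDER BY col' % placeholders
rows = conn.execute(sql, vals).fetchall()
print(rows)  # [('a',), ('c',)]
```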

Ghost.py -- Headless browser

A browser module with JavaScript support.

http://jeanphix.me/Ghost.py/

Installation

Requires PyQt or PySide (pyside-py27).

pip install Ghost.py

Example

from ghost import Ghost
ghost = Ghost()

# Opens the web page
ghost.open('http://www.openstreetmap.org/')
# Waits for form search field
ghost.wait_for_selector('input[name=query]')
# Fills the form
ghost.fill("#search_form", {'query': 'France'})
# Submits the form
ghost.call("#search_form", "submit")
# Waits for results (an XHR has been called here)
ghost.wait_for_selector(
    '#search_osm_nominatim .search_results_entry a')
# Clicks first result link
ghost.click(
    '#search_osm_nominatim .search_results_entry:first-child a')
# Checks if map has moved to expected latitude
lat, resources = ghost.evaluate("map.center.lat")
assert float(lat.toString()) == 5860090.806537

Graphics

Python graphics topics

numpy / pandas

Common operations

Statistics

>>> import numpy
>>> numpy.average([1,2,3])
2.0
>>> numpy.sum([1,2,3])
6
>>> numpy.std([1,2,3])
0.816496580927726

concat

# keep the original index
pandas.concat([df1, df2])
# renumber the index
pandas.concat([df1, df2], ignore_index=True)

Appending a string to a column

df['bar'] = df.bar.map(str) + " is " + df.foo

Renaming columns
>>> s = pd.Series([1, 2, 3])
>>> s
0    1
1    2
2    3
dtype: int64
>>> s.rename("my_name") # scalar, changes Series.name
0    1
1    2
2    3
Name: my_name, dtype: int64
>>> s.rename(lambda x: x ** 2)  # function, changes labels
0    1
1    2
4    3
dtype: int64
>>> s.rename({1: 3, 2: 5})  # mapping, changes labels
0    1
3    2
5    3
dtype: int64
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> df.rename(index=str, columns={"A": "a", "B": "c"})
   a  c
0  1  4
1  2  5
2  3  6

The inplace parameter of rename() controls whether the original object is modified or a new one is returned.

Renaming a single column

data.rename(columns={'gdp':'log(gdp)'}, inplace=True)
Modifying column values

w['female'] = w['female'].map({'female': 1, 'male': 0})
df['Date'] = df['Date'].str[-4:].astype(int)
df['Date'] = df['Date'].apply(lambda x: int(str(x)[-4:]))
# zero-pad Member to 6 digits
df['Member'] = df['Member'].apply("{0:06}".format)

Converting the index to columns

df.reset_index()

The level parameter selects which index levels to convert; by default all levels are converted.

Converting columns to an index

indexed_df = df.set_index(['A', 'B'])
indexed_df2 = df.set_index(['A', [0, 1, 2, 0, 1, 2]])
indexed_df3 = df.set_index([[0, 1, 2, 0, 1, 2]])

Setting the display format

Applies globally

pd.options.display.float_format = '${:,.2f}'.format

Index lookup

.loc: looks up by label

df.loc[5]   # 5 is treated as a label, never as a position
df.loc['a']
df.loc['a', 'b', 'c']
df.loc['a':'c']
df.loc[boolean_array]

.iloc: looks up by position

df.iloc[5]   # 5 is a 0-based position
df.iloc[1, 3, 5]
df.iloc[2:4]
df.iloc[boolean_array]

.ix: mixed subscripts

Looks up the label first; if not found, falls back to the position.

IMPORTANT: if the label is an integer, the position is never tried; prefer .loc or .iloc in that case

Indexing a non-default axis

df1 = df.ix[:,0:2]

MultiIndex: hierarchical indexing

http://pandas.pydata.org/pandas-docs/stable/advanced.html#advanced-hierarchical

Construction

pd.MultiIndex.from_arrays
pd.MultiIndex.from_tuples
pd.MultiIndex.from_product

Example

In [1]: arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
   ...:           ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
   ...:

In [2]: tuples = list(zip(*arrays))

In [3]: tuples
Out[3]:
[('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')]

In [4]: index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

In [5]: index
Out[5]:
MultiIndex(levels=[[u'bar', u'baz', u'foo', u'qux'], [u'one', u'two']],
           labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
           names=[u'first', u'second'])

In [6]: s = pd.Series(np.random.randn(8), index=index)

In [7]: s
Out[7]:
first  second
bar    one       0.469112
       two      -0.282863
baz    one      -1.509059
       two      -1.135632
foo    one       1.212112
       two      -0.173215
qux    one       0.119209
       two      -1.044236
dtype: float64
In [16]: df = pd.DataFrame(np.random.randn(3, 8), index=['A', 'B', 'C'], columns=index)

In [17]: df
Out[17]:
first        bar                 baz                 foo                 qux  \
second       one       two       one       two       one       two       one
A       0.895717  0.805244 -1.206412  2.565646  1.431256  1.340309 -1.170299
B       0.410835  0.813850  0.132003 -0.827317 -0.076467 -1.187678  1.130127
C      -1.413681  1.607920  1.024180  0.569605  0.875906 -2.211372  0.974466
通过pd.set_option()调整MultiIndex显示方式
In [20]: pd.set_option('display.multi_sparse', False)

In [21]: df
Out[21]:
first        bar       bar       baz       baz       foo       foo       qux  \
second       one       two       one       two       one       two       one
A       0.895717  0.805244 -1.206412  2.565646  1.431256  1.340309 -1.170299
B       0.410835  0.813850  0.132003 -0.827317 -0.076467 -1.187678  1.130127
C      -1.413681  1.607920  1.024180  0.569605  0.875906 -2.211372  0.974466

first        qux
second       two
A      -0.226169
B      -1.436737
C      -2.006747
get_level_values 取指定层的索引值
In [23]: index.get_level_values(0)
Out[23]: Index([u'bar', u'bar', u'baz', u'baz', u'foo', u'foo', u'qux', u'qux'], dtype='object', name=u'first')

In [24]: index.get_level_values('second')
Out[24]: Index([u'one', u'two', u'one', u'two', u'one', u'two', u'one', u'two'], dtype='object', name=u'second')
清除names

把对应项目改为None即可。

pv_state.index.names = [None, None]
set_levels 修改一层索引的值
In [8]: df.index.set_levels(df.index.levels[1].\
            tz_localize('Etc/GMT-1', ambiguous = 'NaT'),level=1, inplace=True)
去掉一层索引
df.index = df.index.droplevel(2)
取指定索引的值

可以指定一部分索引值

In [25]: df['bar']
Out[25]:
second       one       two
A       0.895717  0.805244
B       0.410835  0.813850
C      -1.413681  1.607920

In [26]: df['bar', 'one']
Out[26]:
A    0.895717
B    0.410835
C   -1.413681
Name: (bar, one), dtype: float64

In [27]: df['bar']['one']
Out[27]:
A    0.895717
B    0.410835
C   -1.413681
Name: one, dtype: float64
调整维度
df.reorder_levels([1, 0], axis=1)
df.sort_index()
使用slice指定多维的下标
  • 应提供所有axes

    可以用:说明指定axes选择全部。

    df.loc[(slice('A1','A3'),.....),:]
  • slice(None)代表选择所有。

    dfmi.loc[(slice('A1','A3'),slice(None), ['C1','C3']),:]
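A small self-contained sketch of the same idea; pd.IndexSlice is a more readable spelling of the slice() tuples (label slicing requires a sorted index):

```python
import numpy as np
import pandas as pd

midx = pd.MultiIndex.from_product([['A1', 'A2', 'A3'], ['C1', 'C2']])
df = pd.DataFrame(np.arange(12).reshape(6, 2), index=midx, columns=['v1', 'v2'])

# pd.IndexSlice replaces the explicit slice() calls
ix = pd.IndexSlice
sub = df.loc[ix['A1':'A2', :], :]
assert len(sub) == 4
```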

类SQL操作

merge - JOIN
A.merge(B, left_on='lkey', right_on='rkey', how='outer')
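A minimal sketch with made-up frames; how='outer' keeps unmatched rows from both sides, filling the missing side with NaN:

```python
import pandas as pd

A = pd.DataFrame({'lkey': ['foo', 'bar'], 'v1': [1, 2]})
B = pd.DataFrame({'rkey': ['foo', 'qux'], 'v2': [3, 4]})

m = A.merge(B, left_on='lkey', right_on='rkey', how='outer')
assert len(m) == 3   # foo matched; bar and qux kept with NaN fill
```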

统计分析

统计每个分组的行数
df.groupby(u'地区').size()
统计一列(序号)在分组中的排名百分比

KnockIndex是从1开始连续的编号。统计在相同OpenTime的记录中的排名百分比。

df.groupby('OpenTime')['KnockIndex'].apply(lambda x: np.round(x * 100 / x.size))
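pandas also has a built-in for in-group rank percentages, GroupBy.rank(pct=True); a sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'OpenTime': [1, 1, 1, 2, 2],
                   'KnockIndex': [1, 2, 3, 1, 2]})

# rank within each OpenTime group, as a fraction of the group size
pct = df.groupby('OpenTime')['KnockIndex'].rank(pct=True)
assert pct.iloc[2] == 1.0   # 3rd of 3 in its group
assert pct.iloc[4] == 1.0   # 2nd of 2 in its group
```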
取得指定列的不重复的值
# Create an example dataframe
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'year': [2012, 2012, 2013, 2014, 2014],
        'reports': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data)

# List unique values in the df['name'] column
df['name'].unique()
Testing membership in a set of values
df['col1'].isin(values)   # note: isin is a Series/DataFrame method, not pd.isin()
np.where()根据值的不同,返回不同值
all['地区'] = np.where(all['Location']=='1',u'托管机房',all[u'地区'])
反向cumsum
df[df.columns[::-1]].cumsum(axis=1)[df.columns]
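For example (hypothetical frame), each cell ends up holding the sum of itself and everything to its right:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# cumsum over reversed columns, then restore the original column order
rev = df[df.columns[::-1]].cumsum(axis=1)[df.columns]
assert rev.loc[0, 'a'] == 9   # 1 + 3 + 5
assert rev.loc[0, 'c'] == 5
```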
分类汇总

http://pandas.pydata.org/pandas-docs/stable/groupby.html

In [1]: df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
   ...:                           'foo', 'bar', 'foo', 'foo'],
   ...:                    'B' : ['one', 'one', 'two', 'three',
   ...:                           'two', 'two', 'one', 'three'],
   ...:                    'C' : np.random.randn(8),
   ...:                    'D' : np.random.randn(8)})
   ...:
In [2]: df
Out[2]:
     A      B         C         D
0  foo    one  0.469112 -0.861849
1  bar    one -0.282863 -2.104569
2  foo    two -1.509059 -0.494929
3  bar  three -1.135632  1.071804
4  foo    two  1.212112  0.721555
5  bar    two -0.173215 -0.706771
6  foo    one  0.119209 -1.039575
7  foo  three -1.044236  0.271860
In [56]: grouped = df.groupby('A')

In [57]: grouped['C'].agg([np.sum, np.mean, np.std])
Out[57]:
          sum      mean       std
A
bar  0.443469  0.147823  0.301765
foo  2.529056  0.505811  0.966450
In [61]: grouped.agg({'C' : 'sum', 'D' : 'std'})
Out[61]:
            C         D
A
bar  0.443469  1.490982
foo  2.529056  0.645875

时间相关

转换为datetime
pd.to_datetime(orders.TransacTime, format='%Y%m%d%H%M%S%f')
pd.to_datetime(1489541395312290, unit='us')
pd.to_datetime(1489541395312290*1000)
pd.to_datetime(1489541395312290*1000).tz_localize(tzlocal.get_localzone())
时间增减

使用 Timedelta(),如Timedelta(hours=1, milliseconds=100)

pd.to_datetime(orders.TransacTime, format='%Y%m%d%H%M%S%f')
TimeDelta转换为秒数等
td.astype('timedelta64[s]')
td.astype('timedelta64[ms]')
td.astype('timedelta64[us]')
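Note: in pandas 2.x, astype('timedelta64[s]') keeps a timedelta dtype (it only changes the resolution) instead of returning numbers; dt.total_seconds() is the version-stable way to get a numeric value:

```python
import pandas as pd

td = pd.Series([pd.Timedelta(seconds=90)])
assert td.dt.total_seconds().iloc[0] == 90.0
```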
Converting a datetime column to time
all['OpenTime'] = all['OpenTime'].dt.time   # map(pd.datetime.time) also worked, but pd.datetime was removed in pandas 2.0
Converting time to datetime
import datetime
all['datetime'] = all['time'].apply(
    lambda t: datetime.datetime.combine(datetime.date.today(), t)
    )
取时间部分
all['OpenTime'].dt.time
截断到一定的精度
all['OpenTime'].dt.round('3s')

NOTE: 还可以用floor()/ceil()
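A quick check of round/floor on 3-second buckets (made-up timestamps):

```python
import pandas as pd

t = pd.Series(pd.to_datetime(['2020-01-01 09:00:01', '2020-01-01 09:00:04']))

# round snaps to the nearest 3-second boundary, floor always goes down
assert t.dt.round('3s').iloc[0] == pd.Timestamp('2020-01-01 09:00:00')
assert t.dt.floor('3s').iloc[1] == pd.Timestamp('2020-01-01 09:00:03')
```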

时区
import pandas as pd

TIME_ZONE = 'Asia/Shanghai'
tracked['start'] = pd.to_datetime(tracked['start']).dt.tz_localize('UTC').dt.tz_convert(TIME_ZONE)
import pandas as pd
import tzlocal

# 把时间转换为UTC,但去掉时区信息 tz_convert(None)
line_time = pd.to_datetime(m.group('datetime')).tz_localize(
    tzlocal.get_localzone()).tz_convert(None)
Testing whether a timestamp is timezone-aware
if dt.tzinfo is not None:
    pass  # timezone aware
else:
    pass  # naive
取当前时间
now = pd.Timestamp.utcnow()
取当前时区
import tzlocal

tzlocal.get_localzone()

字符串处理

提取匹配串
In [42]: s = pd.Series(["a1a2", "b1", "c1"], ["A", "B", "C"])

In [43]: s
Out[43]:
A    a1a2
B      b1
C      c1
dtype: object

In [44]: two_groups = '(?P<letter>[a-z])(?P<digit>[0-9])'

In [45]: s.str.extract(two_groups, expand=True)
Out[45]:
  letter digit
A      a     1
B      b     1
C      c     1

In [46]: s.str.extractall(two_groups)
Out[46]:
        letter digit
  match
A 0          a     1
  1          a     2
B 0          b     1
C 0          c     1
测试字符串是否匹配

str.contains对应re.search(),不需要全字符串匹配。

In [54]: pattern = r'[a-z][0-9]'

In [55]: pd.Series(['1', '2', '3a', '3b', '03c']).str.contains(pattern)
Out[55]:
0    False
1    False
2    False
3    False
4    False
dtype: bool

str.match corresponds to re.match(): the pattern must match at the start of the string (it need not cover the whole string; for a whole-string match use str.fullmatch / re.fullmatch). The as_indexer parameter was removed in later pandas versions.

In [56]: pd.Series(['1', '2', '3a', '3b', '03c']).str.match(pattern)
Out[56]:
0    False
1    False
2    False
3    False
4    False
dtype: bool

pivot_table 数据透视表

指定分类顺序
df["Status"] = df["Status"].astype("category")
df["Status"].cat.set_categories(["won","pending","presented","declined"],inplace=True)

指定头几个分类的顺序,其余分类排在后面

all['KnockState'] = all['KnockState'].astype('category')
all['KnockState'].cat.set_categories(
    pd.unique(
        ['rejected','succeed','retry'] + all['KnockState'].ravel().tolist()),
    inplace=True, ordered=True)
pd.pivot_table(df,index=["Manager","Rep"],values=["Price"],
               columns=["Product"],aggfunc=[np.sum],fill_value=0)
pd.pivot_table(df,index=["Manager","Rep","Product"],
               values=["Price","Quantity"],
               aggfunc=[np.sum,np.mean],fill_value=0,margins=True)
计算百分点
df.pivot_table(columns='a', aggfunc=lambda x: np.percentile(x, 50))
调整表头
import pandas as pd
import numpy as np

np.random.seed(0)
a = np.random.randint(1, 4, 100)
b = np.random.randint(1, 4, 100)
df = pd.DataFrame(dict(A=a,B=b,Val=np.random.randint(1,100,100)))
table = pd.pivot_table(df, index='A', columns='B', values='Val', aggfunc=sum, fill_value=0, margins=True)
print(table)

调整前

B       1     2     3   All
A
1     454   649   770  1873
2     628   576   467  1671
3     376   247   481  1104
All  1458  1472  1718  4648

调整后

multi_level_column = pd.MultiIndex.from_arrays([['A', 'A', 'A', 'All_B'], [1,2,3,'']])
multi_level_index = pd.MultiIndex.from_arrays([['B', 'B', 'B', 'All_A'], [1,2,3,'']])
table.index = multi_level_index
table.columns = multi_level_column
print(table)
            A             All_B
            1     2     3
B     1   454   649   770  1873
      2   628   576   467  1671
      3   376   247   481  1104
All_A    1458  1472  1718  4648

作图

hist的每个点的数量
  • 对于离散点,可以用value_count()

    In [11]: s = pd.Series([1, 1, 2, 1, 2, 2, 3])
    
    In [12]: s.value_counts()
    Out[12]:
    2    3
    1    3
    3    1
    dtype: int64
    
  • 对于浮点数,可以取整再groupby

    s.groupby(lambda i: np.floor(2*s[i]) / 2).count()
    
  • 直接使用numpy.histogram,可以返回counts和bins

    In [4]: s = Series(randn(100))
    
    In [5]: counts, bins = np.histogram(s)
    
    In [6]: Series(counts, index=bins[:-1])
    Out[6]:
    -2.968575     1
    -2.355032     4
    -1.741488     5
    -1.127944    26
    -0.514401    23
     0.099143    23
     0.712686    12
     1.326230     5
     1.939773     0
     2.553317     1
    dtype: int32
    
  • 取bins的中点

    bins[:-1] + np.diff(bins)/2
柱状图例子
# 所有敲门委托距离平台开放时间的分布图
fig = plt.figure(figsize=FIG_SIZE, dpi=FIG_DPI)
time_diff_bins = np.unique(all['TimeDiff'].ravel())

succeed_knocks = all[(all['KnockState']=='succeed') & (all['GatewayIndex']==1)]
failed_knocks = all[(all['KnockState']=='rejected') & (all['GatewayIndex']==1)]

ax = succeed_knocks['TimeDiff'].hist(bins=time_diff_bins)
failed_knocks['TimeDiff'].hist(bins=time_diff_bins, ax=ax)

bin_width = time_diff_bins[1] - time_diff_bins[0]

# 对最大的几个值进行标注
for x, y in pd.concat([failed_knocks, succeed_knocks])['TimeDiff'].value_counts().sort_values(ascending=False).head().iteritems():
    ax.annotate(
        str(y), xy=(x + bin_width / 2, y), xycoords='data', va='bottom', ha='center')

fig.savefig('knock_time_diff.png')
分组画散点图
for area, df in knocks.groupby(u'地区'):
    ax.plot(df['GatewayNum'], df['KnockPercent'], marker='.', linestyle='', label=area)

根据某时间列过滤时间范围

DataFrame.between_time() 要求要过滤的列必须作为index,在不希望修改DataFrame的Index时,可以单独把列取出来进行过滤:

import pandas as pd
import numpy as np

N = 100
df = pd.DataFrame(
    {'date': pd.date_range('2000-1-1', periods=N, freq='H'),
     'value': np.random.random(N)})

index = pd.DatetimeIndex(df['date'])
df.iloc[index.indexer_between_time('8:00','21:00')]

四元数(Quaternion),用于计算旋转

https://github.com/moble/quaternion

import numpy as np
import quaternion

构造

  • 直接构造
    q1 = np.quaternion(1,2,3,4)
    q2 = np.quaternion(5,6,7,8)

算术运算

  • 运算符

    • 加: q1 + q2
    • 减: q1 - q2
    • 乘: q1 * q2
    • 除: q1 / q2
    • 标量乘: q1 * s == s * q1
    • 标量除: q1 / s and s / q1
    • 倒数: np.reciprocal(q1) == 1/q1
    • 指数: np.exp(q1)
    • 对数: np.log(q1)
    • 平方根: np.sqrt(q1)
    • 共轭: np.conjugate(q1) == np.conj(q1)
  • numpy ufuncs

    add, subtract, multiply, divide, log, exp, power, negative, conjugate, copysign, equal, not_equal, less, less_equal, isnan, isinf, isfinite, absolute

    The unary tests isnan and isinf return true if they would return true for any individual component; isfinite returns true if it would return true for all components.

成员函数或属性

  • 存取各组成部分的属性

    • w, x, y, z
    • i, j, k (equivalent to x, y, z)
    • scalar, vector (equivalent to w, [x, y, z])
    • real, imag (equivalent to scalar, vector)
  • 与范数相关的方法

    • abs (square-root of sum of squares of components)
    • norm (sum of squares of components)
    • modulus, magnitude (equal to abs)
    • absolute_square, abs2, mag2 (equal to norm)
    • normalized
    • inverse
  • 与数组结构相关的方法

    • ndarray (the numpy array underlying the quaternionic array)
    • flattened (all dimensions but last are flattened into one)
    • iterator (iterate over all quaternions)

转换

  • as_float_array(a)

  • as_quat_array(a)

  • from_float_array(a)

  • from_vector_part(v, vector_axis=-1)

  • as_vector_part(q)

  • as_spinor_array(a)

  • as_rotation_matrix(q)

  • from_rotation_matrix(rot, nonorthogonal=True)

  • as_rotation_vector(q)

    向量的三个下标表示旋转轴,向量的范数表示旋转的角度(弧度)。

  • from_rotation_vector(rot)

  • as_euler_angles(q)

  • from_euler_angles(alpha_beta_gamma, beta=None, gamma=None)

  • as_spherical_coords(q)

  • from_spherical_coords(theta_phi, phi=None)

  • rotate_vectors(R, v, axis=-1)

    可以一次性旋转多个向量。

  • isclose(a, b, rtol=4*np.finfo(float).eps, atol=0.0, equal_nan=False)

    返回一个数组,指示每个下标是否接近。

  • allclose(a, b, rtol=4*np.finfo(float).eps, atol=0.0, equal_nan=False, verbose=False)

    返回一个boolean,表示是否所有对应下标都接近。

网页、模板

mako模板库

http://www.makotemplates.org

使用方法
简单使用
  • 简单替换

    from mako.template import Template
    
    mytemplate = Template("hello, ${name}!")
    print mytemplate.render(name="jack")
  • 使用Context对象保存上下文

    from mako.template import Template
    from mako.runtime import Context
    from StringIO import StringIO
    
    mytemplate = Template("hello, ${name}!")
    buf = StringIO()
    ctx = Context(buf, name="jack")
    mytemplate.render_context(ctx)
    print buf.getvalue()
  • 从文件中读取模板

    from mako.template import Template
    
    mytemplate = Template(filename='/docs/mytmpl.txt')
    print mytemplate.render()

还可以把模板文件生成的.py文件缓存起来,下次再用相同参数创建模板时, 将直接转入缓存的文件。

from mako.template import Template

mytemplate = Template(filename='/docs/mytmpl.txt', module_directory='/tmp/mako_modules')
print mytemplate.render()
使用TemplateLookup处理模板间的引用
  • 使用TemplateLookup指定模板文件的路径

    from mako.template import Template
    from mako.lookup import TemplateLookup
    
    mylookup = TemplateLookup(directories=['/docs'])
    mytemplate = Template("""<%include file="header.txt"/> hello world!""", lookup=mylookup)
  • 使用TemplateLookup.get_template()直接取得模板

    from mako.template import Template
    from mako.lookup import TemplateLookup
    
    mylookup = TemplateLookup(directories=['/docs'], module_directory='/tmp/mako_modules')
    
    def serve_template(templatename, **kwargs):
        mytemplate = mylookup.get_template(templatename)
        print mytemplate.render(**kwargs)
处理Unicode

TemplateLookup的input_encoding参数可以指定模板文件编码, output_encoding参数可以指定输出编码,encoding_errors 指定了出错时的处理方式。

在Python 3下,如果指定了output_encoding,render()将返回bytes, 否则将返回string。

render_unicode()在Python 2下返回unicode,Python 3下返回string。

from mako.template import Template
from mako.lookup import TemplateLookup

mylookup = TemplateLookup(directories=['/docs'], output_encoding='utf-8', encoding_errors='replace')

mytemplate = mylookup.get_template("foo.txt")
print mytemplate.render()
异常处理

mako提供了text_error_template()/html_error_template()函数, 可以从sys.exc_info()中获取最近抛出的异常,转换为文本或html 格式。

from mako import exceptions

try:
    template = lookup.get_template(uri)
    print template.render()
except:
    print exceptions.text_error_template().render()
模板格式
表达式替换
  • 替换 ${expr}

    expr可以是简单变量名,也可以是Python表达式。

    this is x: ${x}
    pythagorean theorem:  ${pow(x,2) + pow(y,2)}
  • 转义 ${expr | escape}

    • 使用|分隔表达式和转义标志。

      ${"this is some text" | u}

      输出:this+is+some+text

    • 内置转义标志

      • u: URL转义,通过urllib.quote_plus(string.encode('utf-8'))实现
      • h: HTML转义,通过markupsafe.escape(string)实现,旧版本通过cgi.escape(string,True)实现
      • x: XML转义
      • trim: 去掉前后空白字符,通过string.strip()实现
      • entity: 把一些字符转义为HTML entity
      • unicode (str on Python 3): 产生unicode字符串(缺省使用)
      • decode.<some encoding>: 把输入按指定编码解码为unicode
      • n: 禁用所有缺省转义标志
    • 可以用逗号分隔多个转义标志

      ${" <tag>some value</tag> " | h,trim}

      将得到

      &lt;tag&gt;some value&lt;/tag&gt;
      
控制结构
  • 语法

    使用%<关键字>的格式,并使用%end<关键字>结束一个结构:

    % <name>
    ...
    % end<name>

    可以使用python标准的控制结构if/else/while/for/try/except。

    %前面可以有空白字符,不能有其它字符。不考虑缩进。

    % for a in ['one', 'two', 'three', 'four', 'five']:
        % if a[0] == 't':
        its two or three
        % elif a[0] == 'f':
        four/five
        % else:
        one
        % endif
    % endfor

    NOTE: %本身可以被转义%%

  • loop上下文

    % for结构中,可以通过loop取得循环的更多信息:

    <ul>
    % for a in ("one", "two", "three"):
        <li>Item ${loop.index}: ${a}</li>
    % endfor
    </ul>
    • loop.index: 从0开始连续递增的下标
    • loop.even: 偶数下标
    • loop.odd: 奇数下标
    • loop.first: 是否第一项
    • loop.reverse_index: 还剩余多少项(需要提供__len__)
    • loop.last: 是否最后一项(需要提供__len__)
    • loop.cycle: 轮流返回参数中的项目。loop.cycle('even', 'odd') 依次返回even-odd-even-odd...
    • loop.parent: 访问上层循环的loop对象
注释
  • 单行注释

    ## this is a comment.
    ...text ...
  • 多行注释

    <%doc>
        these are comments
        more comments
    </%doc>
换行符

在行末加入\字符,将把本行与下一行连接起来。

引入Python代码

使用<% %>可以引入Python代码。代码块可以随意缩进,但一个代码块内, 应按Python规则进行缩进。

<% %>引入的代码块将在render函数中执行。

this is a template
<%
    x = db.get_resource('foo')
    y = [z.element for z in x if x.frobnizzle==5]
%>
% for elem in y:
    element: ${elem}
% endfor

<%! %>引入的代码块将在模块级别执行,因此不能访问模板的上下文。

mako标签

mako有一些特殊的标签,格式为<%标签...>/</%标签>

  • <%page>

    定义页面的属性,如cache大小等。

  • <%include>

    包含另一个模板文件

    <%include file="header.html"/>
    
        hello world
    
    <%include file="footer.html"/>
  • <%def>

    定义模板中使用的函数。

    <%def name="myfunc(x)">
        this is myfunc, x is ${x}
    </%def>
    
    ${myfunc(7)}
  • <%block>

  • <%namespace>

  • <%inherit>

  • <%nsname:defname>

  • <%call>

  • <%doc>

  • <%text>

Jinja2模板库

Referencing the outer loop's index inside a nested loop

Unlike mako, Jinja2's loop object has no parent attribute; bind the outer loop to a name instead:

{% for i in a %}
    {% set outer = loop %}
    {% for j in i %}
        {{ outer.index }}
    {% endfor %}
{% endfor %}

PEP

PEP8 Python编程风格

  • http://www.python.org/dev/peps/pep-0008/

  • 使用4个空格进行缩进

  • 折行

    • 所有行均不应超过79个字符
    • Python对括号(小、中、大括号)内的内容会自动支持多行
    • 对于有两个操作数的运算符,应在运算符之后断行
    • 折行有两种对齐方式
      • 首行有参数,则后面的行应以首行的括号为准进行对齐
        # Aligned with opening delimiter
        foo=long_function_name(var_one, var_two,
                               var_three, var_four)
      • 首行不放参数,后面的行使用比首行更多的缩进。对于def、for、if之类的语句,可使用两个缩进,以便与后面内容的缩进区分开
        # More indentation included to distinguish this from the rest.
        def long_function_name(
                var_one, var_two, var_three,
                var_four):
            print(var_one)
  • 空行

    • 顶层函数和类之间用两个空行进行分隔
    • 类的方法之间用一个空行进行分隔
    • Python把form feed(Control-L)视为空格,而很多其它工具将这个字符视为分页符。可以利用这个特性把代码分组为多页。
  • 字符编码

    • python 2推荐使用ascii编码,而python 3以后版本推荐使用utf-8编码
    • 在实践中,特别有中文的情况下,推荐使用utf-8编码。可以在文件头几行或末几行放置一个编码声明。下面这种方式是对vim友好的:
      # vim: set fileencoding=utf-8:
  • Imports

    • 每个模块使用单独的import语句,不要用一个import导入多个模块
      import os
      import sys
    • 可以用一个import语句从一个模块中导入多个函数/类
      from subprocess import Popen, PIPE
    • import语句应放到文件最上面,仅次于模块注释和docstring,放在全局变量和常量声明之前
    • import语句应按模块类型分组,每组间以空行分隔
      1. Python标准库
      2. 第三方库
      3. 本地应用或库的模块
    • import时,应使用绝对模块路径,而非相对路径
  • 在下列情况下避免多余的空格

    • 在括号内,贴着括号字符
      Yes: spam(ham[1], {eggs: 2})
      No:  spam( ham[ 1 ], { eggs: 2 } )
      
    • 在逗号、分号、冒号前
      Yes: if x==4: print x, y; x, y=y, x
      No:  if x==4 : print x , y ; x , y=y , x
      
    • 在函数声明中的括号前
      Yes: spam(1)
      No:  spam (1)
      
    • 在数组、字典下标的括号前
      Yes: dict['key']=list[index]
      No:  dict ['key']=list [index]
      
    • 在赋值操作或其它操作符前后,多余一个空格
    • 在下列operators前后都应使用一个空格
      • 赋值运算 (=, +=, -=等)
      • 比较运算 (==, <, >, !=, <>, <=, >=, in, not in, is, is not)
      • 布尔运算 (and, or, not)
    • 算术运算符前后应有空格
    • 函数的关键字参数或参数缺省值的等号前后不应有空格
      def complex(real, imag=0.0):
          return magic(r=real, i=imag)
      
  • 注释

    • 块注释,一般用于注释紧跟其后的代码

      • 应与代码保持相同的缩进
      • 以一个#及一个空格开始
      • 注释中,多段话之间以包含一个#字符的行分隔(即空行)
    • 行内注释

      • 行内注释与前面的代码之间应以至少两个空格分隔
      • 行内注释以#及一个空格开始
  • 命名规范

    • 包及模块名称:用全小写的简短名字。必要时使用下划线
    • 类名:大写开头,CapWords
    • 异常名:异常也是类,使用类名规范,但应以Error为后缀
    • 全局变量:使用与函数名相同的规范
    • 函数名:小写字符,必要时使用下划线分隔多个单词
    • 函数参数
      • 实例方法的首个参数应为self,类方法的首个参数应为cls
      • 如果参数名称与保留字冲突,推荐在参数名后面加上下划线
    • 方法名和实例变量
      • 小写字符,必要时使用下划线
      • 私有方法和实例变量前可加下划线
      • 为免与子类命名冲突,可以使用双下划线开头。这种情况下,python会在名称前加上类名,如可以使用Foo._Foo__a来访问Foo.__a
    • 常量:大写,用下划线分隔。CONTANT_NAME
    • 对继承的考虑
      • public属性不使用前置下划线。与保留字冲突时,后面加一个下划线
      • 对于简单的数据属性,推荐直接暴露这个属性,而不是使用复杂的访问方法。在以后需要时,可以用properties来用函数处理对属性的访问
      • For attributes that should be inherited but not accessed by derived classes, consider a double leading underscore prefix

PEP333 Python Web Server Gateway Interface v1.0 (WSGI)

常用例程

判断类型

isinstance("abc", str)
print obj.__class__.__name__

声明文件的编码

#!/usr/bin/python
# vim: set fileencoding=utf-8

unicode问题

可用locale.getpreferredencoding()取得当前的encoding

encoding = locale.getpreferredencoding()

修正重定向时,sys.stdout使用ascii编码的问题

http://wiki.python.org/moin/PrintFails

可以为sys.stdout包一层正确编码的StreamWriter:

import sys
import codecs
import locale
sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout);

取字符串的宽度

使用unicodedata.east_asian_width函数,该函数的参数是一个字符,返回一个字符串:

  • F: East Asian Full-width
  • H: East Asian Half-width
  • W: East Asian Wide
  • Na: East Asian Narrow (Na)
  • A: East Asian Ambiguous (A)
  • N: Not East Asian
def string_width(text):
    """
    text必须是unicode
    """
    import unicodedata
    s = 0
    for ch in text:
        if isinstance(ch, unicode):
            if unicodedata.east_asian_width(ch) in ('F', 'W', 'A'):
                s += 2
            else:
                s += 1
        else:
            s += 1
    return s
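In Python 3 every str character is already unicode, so the isinstance branch disappears; a sketch of the same idea:

```python
import unicodedata

def string_width3(text):
    """Display width of a str: wide/fullwidth/ambiguous chars count as 2 columns."""
    return sum(2 if unicodedata.east_asian_width(ch) in ('F', 'W', 'A') else 1
               for ch in text)

assert string_width3('abc') == 3
assert string_width3('中文') == 4
```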

urldecode

def urldecode(url):
    rex = re.compile('%([0-9a-fA-F][0-9a-fA-F])', re.M)
    return rex.sub(lambda m: chr(int(m.group(1),16)), url.encode("utf-8")).decode("utf-8")

程序启动及参数处理

使用传统的getopt

import getopt
import locale

if __name__ == "__main__":
    sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout)
    sys.stderr = codecs.getwriter(locale.getpreferredencoding())(sys.stderr)

    try:
        opts, args = getopt.getopt(sys.argv[1:], u"t:a:vs", [u"title=", u"author=", u"verbose", u"silent"])
    except getopt.GetoptError:
        sys.stderr.write(usage())
        sys.exit(1)

    args = [unicode(arg, DEFAULT_ENCODING) for arg in args]
    opts = [(unicode(opt[0], DEFAULT_ENCODING), unicode(opt[1], DEFAULT_ENCODING)) for opt in opts]

    title   = ''
    author  = ''

    for opt, arg in opts:
        if opt in ('-t', '--title'):
            title = arg
        elif opt in ('-a', '--author'):
            author = arg
        elif opt in ('-v', '--verbose'):
            verbose = True
        elif opt in ('-s', '--silent'):
            silent = True

    if len(args) < 1:
        sys.stderr.write(usage())
        sys.exit(2)

使用更强大的argparse

import locale
import logging
import argparse

if __name__ == "__main__":
    sys.stdin  = codecs.getreader(locale.getpreferredencoding())(sys.stdin)
    sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout)
    sys.stderr = codecs.getwriter(locale.getpreferredencoding())(sys.stderr)

    parser = argparse.ArgumentParser(
        description="Convert chm books into epub format.")
    parser.add_argument('--version', action='version', version=VERSION)

    parser.add_argument('chm_file', action="store", help="CHM filename")
    parser.add_argument('-t', '--title',   action="store", dest="title", help="Book title. If omitted, guessed from filename")
    parser.add_argument('-a', '--author',  action="store", dest="author", help="Book author. If omitted, guessed from filename")
    parser.add_argument('-v', '--verbose', action="store_true", dest="verbose", help="Be moderately verbose")
    parser.add_argument('-s', '--silent',  action="store_true", dest="silent", help="Only show warnings and errors")

    args = parser.parse_args()

    for k in vars(args):
        if isinstance(getattr(args, k), str):
            setattr(args, k, unicode(getattr(args, k), locale.getpreferredencoding()).strip())

    if args.silent:
        logging.basicConfig(level=logging.WARNING)
    elif args.verbose:
        logging.basicConfig(level=logging.DEBUG)
    else:
        logging.basicConfig(level=logging.INFO)

    bookinfo = read_index(args.chm_file)

    if args.title:
        bookinfo.title = args.title

    if args.author:
        bookinfo.author = args.author

input

s = raw_input("prompt")   # Python 2; in Python 3 use s = input("prompt")

启动新进程

启动一个程序,代替当前的进程,不会返回

os.execl(path, arg0, arg1, ...)
os.execle(path, arg0, arg1, ..., env)
os.execlp(file, arg0, arg1, ...)
os.execlpe(file, arg0, arg1, ..., env)
os.execv(path, args)
os.execve(path, args, env)
os.execvp(file, args)
os.execvpe(file, args, env)

启动一个子进程

os.spawnl(mode, path, ...)
os.spawnle(mode, path, ..., env)
os.spawnlp(mode, file, ...)
os.spawnlpe(mode, file, ..., env)
os.spawnv(mode, path, args)
os.spawnve(mode, path, args, env)
os.spawnvp(mode, file, args)
os.spawnvpe(mode, file, args, env)

mode可以是P_NOWAIT/P_WAIT,如果是前者,函数会返回新进程的pid,后者会返回子进程的返回值或-signal,表示收到的signal。

通用参数说明

l的,参数是逐个传进去,带v的,参数作为一个数组传进去。

e的,env是一个map,表示新进程的环境变量。如果不带e,会使用当前进程的环境变量。

p的,会在环境变量PATH中查找file的全路径。对于pe,会在新的环境变量下进行查找。

出错时会抛出OSError,有errno,strerror等属性,对于chdir()、unlink()等带文件参数的调用,还会有filename属性
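In Python 3, the subprocess module is usually preferred over os.exec*/os.spawn* for launching child processes; a minimal sketch (assumes a POSIX system with an echo binary on PATH):

```python
import subprocess

# run a child process, wait for it, and capture its decoded output
result = subprocess.run(['echo', 'hello'], capture_output=True, text=True)
assert result.returncode == 0
assert result.stdout.strip() == 'hello'
```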

常用的内建函数

  • dir([obj]): list the object's attributes; with no argument, list the names in the current scope
  • help([obj]): 显示对象的文档字符串
  • len(obj): 返回对象的长度
  • repr(obj): 显示对象的内容表示
  • type(obj): 返回对象类型(type对象)
  • vars(obj): 显示对象属性的dict

让stdout直接输出,不经过buffer

python调用增加-u参数

在调用python时加上-u参数即可。可把.py文件的首行改为#!/usr/bin/python -u

替换stdout对象

class flushfile(object):
  def __init__(self, f):
    self.f = f
  def write(self, x):
    self.f.write(x)
    self.f.flush()

import sys
sys.stdout = flushfile(sys.stdout)

打印表格

import string  # Python 2: provides string.ljust/center/rjust

def toRSTtable(rows, header=True, vdelim="  ", padding=1, justify='right'):
    """ Outputs a list of lists as a Restructured Text Table

    - rows - list of lists
    - header - if True the first row is treated as a table header
    - vdelim - vertical delimiter between columns
    - padding - nr. of spaces left around the longest element in the
      column
    - justify - may be left, center, right
    """
    border = "="  # character for drawing the border
    justify = {'left': string.ljust, 'center': string.center, 'right': string.rjust}[justify.lower()]

    # calculate column widhts (longest item in each col
    # plus "padding" nr of spaces on both sides)
    cols = zip(*rows)
    colWidths = [max([len(str(item))+2*padding for item in col]) for col in cols]

    # the horizontal border needed by rst
    borderline = vdelim.join([w*border for w in colWidths])

    # outputs table in rst format
    print borderline
    for row in rows:
        print vdelim.join([justify(str(item),width) for (item,width) in zip(row,colWidths)])
        if header: print borderline; header=False
    print borderline

打印类名

print self.__class__.__name__

把html转换为文本

http://effbot.org/zone/re-sub.htm#strip-html

import re

##
# Removes HTML markup from a text string.
#
# @param text The HTML source.
# @return The plain text.  If the HTML source contains non-ASCII
#     entities or character references, this is a Unicode string.

def strip_html(text):
    def fixup(m):
        text = m.group(0)
        if text[:1] == "<":
            return "" # ignore tags
        if text[:2] == "&#":
            try:
                if text[:3] == "&#x":
                    return unichr(int(text[3:-1], 16))
                else:
                    return unichr(int(text[2:-1]))
            except ValueError:
                pass
        elif text[:1] == "&":
            import htmlentitydefs
            entity = htmlentitydefs.entitydefs.get(text[1:-1])
            if entity:
                if entity[:2] == "&#":
                    try:
                        return unichr(int(entity[2:-1]))
                    except ValueError:
                        pass
                else:
                    return unicode(entity, "iso-8859-1")
        return text # leave as is
    return re.sub("(?s)<[^>]*>|&#?\w+;", fixup, text)

unescape html entities

http://effbot.org/zone/re-sub.htm#unescape-html

import re, htmlentitydefs

##
# Removes HTML or XML character references and entities from a text string.
#
# @param text The HTML (or XML) source text.
# @return The plain text, as a Unicode string, if necessary.

def unescape(text):
    def fixup(m):
        text = m.group(0)
        if text[:2] == "&#":
            # character reference
            try:
                if text[:3] == "&#x":
                    return unichr(int(text[3:-1], 16))
                else:
                    return unichr(int(text[2:-1]))
            except ValueError:
                pass
        else:
            # named entity
            try:
                text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
            except KeyError:
                pass
        return text # leave as is
    return re.sub("&#?\w+;", fixup, text)

time_t相关转换

import time
import datetime

def string_to_time(strtime):
    t_tuple = time.strptime(strtime,"%Y-%m-%d %H:%M:%S")
    return time.mktime(t_tuple)

def string_to_time2(strtime):
    dt = datetime.datetime.strptime(strtime,"%Y-%m-%d %H:%M:%S")
    t_tuple = dt.timetuple()
    return time.mktime(t_tuple)

def time_to_string(timestamp):
    t_tuple = time.localtime(timestamp)
    dt = datetime.datetime(*t_tuple[:6])
    return dt.strftime("%Y-%m-%d %H:%M:%S")

def get_yesterday():
    d = datetime.datetime(*time.localtime()[:6])
    t = datetime.timedelta(days=1)
    return  (d-t)

打印时间段

time.strftime("%H:%M:%S", time.gmtime(elapsed_time))   # only valid for durations under 24 hours

读取进程ID

import os
print os.getpid()

读取线程ID

import thread
print thread.get_ident()

取得当前行号

import inspect

def lineno():
    """Returns the current line number in our program."""
    return inspect.currentframe().f_back.f_lineno

生成可读的随机串

#生成一个Population
pop = [chr(i) for i in xrange(33, 126 + 1)]

#sample随机选取,然后再Join
print "".join(random.sample(pop, 64))

把两个序列合并(如果一个比较短,则重复)

import itertools
a = [1,2,3,4,5]
b = [1,2,3]
print zip(a, itertools.cycle(b))
# [(1, 1), (2, 2), (3, 3), (4, 1), (5, 2)]

递归合并两个dict

import collections

def dict_merge(dct, *merge_dcts):
    """ Recursive dict merge. Inspired by :meth:``dict.update()``, instead of
    updating only top-level keys, dict_merge recurses down into dicts nested
    to an arbitrary depth, updating keys. The ``merge_dct`` is merged into
    ``dct``.
    :param dct: dict onto which the merge is executed
    :param merge_dcts: dicts merged into dct
    :return: dct
    """
    for merge_dct in merge_dcts:
        for k, v in merge_dct.items():
            if (k in dct and isinstance(dct[k], dict)
                    and isinstance(merge_dct[k], collections.Mapping)):
                dict_merge(dct[k], merge_dct[k])
            else:
                dct[k] = merge_dct[k]

    return dct

为数字加上千位分隔符

def add_thousands_sep(num):
    """把数字转换为带千位分隔符的字符串

    Args:
        num: 要转换的数字。如果是小数,则只为整数部分加上千位分隔符

    Returns:
        字符串,转换结果
    """
    parts = str(num).split('.')
    s = parts[0][::-1] # 反转
    parts[0] = ",".join([s[i:i+3] for i in range(0,len(s),3)])   # 每三个数字加一个逗号
    parts[0] = parts[0][::-1] # 再反转为正常顺序

    return ".".join(parts)
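Note that the format mini-language can do this directly (Python 2.7+):

```python
# ',' in a format spec inserts thousands separators in the integer part
assert '{:,}'.format(1234567) == '1,234,567'
assert format(1234567.891, ',.2f') == '1,234,567.89'
```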

去除序列中的重复元素,但保持元素的顺序

python 2.7及以后:

from collections import OrderedDict
items = [1, 2, 0, 1, 3, 2]
list(OrderedDict.fromkeys(items))

2.7之前

seen = set()
[x for x in items if x not in seen and not seen.add(x)]
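Since Python 3.7 a plain dict also preserves insertion order, so OrderedDict is no longer needed:

```python
items = [1, 2, 0, 1, 3, 2]

# dict keys are unique and (since 3.7) keep insertion order
assert list(dict.fromkeys(items)) == [1, 2, 0, 3]
```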

字符串转换为数字

i = int(s)
f = float(s)

字符串转换为bool

from distutils.util import strtobool
>>> strtobool('True')
1
>>> strtobool('False')
0

对一个table按几列进行排序

import operator
mytable = (
    ('Joe', 'Clark', '1989'),
    ('Charlie', 'Babbitt', '1988'),
    ('Frank', 'Abagnale', '2002'),
    ('Bill', 'Clark', '2009'),
    ('Alan', 'Clark', '1804'),
    )
# itemgetter extracts the given columns as the sort key
table = sorted(mytable, key=operator.itemgetter(2, 1))

判断一个变量是不是序列(list、tuple等)

import collections.abc
isinstance(val, collections.abc.Sequence)   # note: str is also a Sequence; Python 2 used collections.Sequence

dpkt: build and parse TCP/IP packets (can parse pcap/tcpdump captures)

捕捉Ctrl-C事件

try:
    Work()

except KeyboardInterrupt:
    # here you put any code you want to run before the program
    # exits when you press CTRL+C
    print "\n", counter # print value of counter

except:
    # this catches ALL other exceptions including errors.
    # You won't get any error messages for debugging
    # so only use it once your code is working
    print "Other error or exception occurred!"

finally:
    CleanUp()

算百分比点

  • scipy.stats.scoreatpercentile

  • http://code.activestate.com/recipes/511478/

    import math
    import functools
    
    def percentile(N, percent, key=lambda x:x):
        """
        Find the percentile of a list of values.
    
        @parameter N - is a list of values. Note N MUST BE already sorted.
        @parameter percent - a float value from 0.0 to 1.0.
        @parameter key - optional key function to compute value from each element of N.
    
        @return - the percentile of the values
        """
        if len(N) == 0:
            return None
        k = (len(N)-1) * percent
        f = math.floor(k)
        c = math.ceil(k)
        if f == c:
            return key(N[int(k)])
        d0 = key(N[int(f)]) * (c-k)
        d1 = key(N[int(c)]) * (k-f)
        return d0+d1
    
    # median is 50th percentile.
    median = functools.partial(percentile, percent=0.5)
  • np.percentile()也可以

单元测试中引入被测类

  • tests/context.py

    import os
    import sys
    sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))
    
    import sample
  • tests/test_*.py

    from .context import sample

Add a path relative to the script to sys.path

import os
import sys

sys.path.append(os.path.abspath(os.path.join(os.path.dirname(os.path.abspath(__file__)), '..', 'lib')))

import some_util

Makefile

init:
    pip install -r requirements.txt

test:
    py.test tests

.PHONY: init test

Check whether an IP address is in a given network

import socket
import struct

def ip_in_network(ip, net):
    "Is an address in a network?"
    # '!L' unpacks in network byte order; native 'L' has the wrong
    # size/order on most platforms
    ipaddr = struct.unpack('!L', socket.inet_aton(ip))[0]
    netaddr, bits = net.split('/')
    netaddr = struct.unpack('!L', socket.inet_aton(netaddr))[0]
    netmask = (0xFFFFFFFF << (32 - int(bits))) & 0xFFFFFFFF
    return ipaddr & netmask == netaddr & netmask
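On Python 3.3+ the stdlib ipaddress module does this directly (and handles IPv6 too):

```python
import ipaddress

def ip_in_network(ip, net):
    # strict=False accepts networks written with host bits set,
    # e.g. '192.168.1.5/24'
    return ipaddress.ip_address(ip) in ipaddress.ip_network(net, strict=False)

print(ip_in_network('192.168.1.5', '192.168.1.0/24'))  # True
print(ip_in_network('10.0.0.1', '192.168.1.0/24'))     # False
```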

Create a dict with a given value type

Use collections.defaultdict

from collections import defaultdict

s = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)]
d = defaultdict(list)
for k, v in s:
    d[k].append(v)
# sorted(d.items())
# [('blue', [2, 4]), ('red', [1]), ('yellow', [1, 3])]
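The same pattern with defaultdict(int) gives a simple counter, since missing keys start at 0:

```python
from collections import defaultdict

words = ['spam', 'egg', 'spam', 'spam']

# defaultdict(int) starts every missing key at 0
counts = defaultdict(int)
for w in words:
    counts[w] += 1

print(dict(counts))  # {'spam': 3, 'egg': 1}
```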

Convert a byte string to a hexdump-style hex representation

def group(a, *ns):
  for n in ns:
    a = [a[i:i+n] for i in range(0, len(a), n)]
  return a

def join(a, *cs):
  return [cs[0].join(join(t, *cs[1:])) for t in a] if cs else a

def hexdump(data):
  toHex = lambda c: '{:02X}'.format(c)
  toChr = lambda c: chr(c) if 32 <= c < 127 else '.'
  make = lambda f, *cs: join(group(list(map(f, data)), 8, 2), *cs)
  hs = make(toHex, '  ', ' ')
  cs = make(toChr, ' ', '')
  for i, (h, c) in enumerate(zip(hs, cs)):
    print('{:010X}: {:48}  {:16}'.format(i * 16, h, c))


with open('tohex.py', 'rb') as file:
  data = file.read()
  hexdump(data)

Sample output:

0000000000: 33 C0 8E D0 BC 00 7C 8E  C0 8E D8 BE 00 7C BF 00  3.....|. .....|..
0000000010: 06 B9 00 02 FC F3 A4 50  68 1C 06 CB FB B9 04 00  .......P h.......
0000000020: BD BE 07 80 7E 00 00 7C  0B 0F 85 0E 01 83 C5 10  ....~..| ........
0000000030: E2 F1 CD 18 88 56 00 55  C6 46 11 05 C6 46 10 00  .....V.U .F...F..
0000000040: B4 41 BB AA 55 CD 13 5D  72 0F 81 FB 55 AA 75 09  .A..U..] r...U.u.
0000000050: F7 C1 01 00 74 03 FE 46  10 66 60 80 7E 10 00 74  ....t..F .f`.~..t
0000000060: 26 66 68 00 00 00 00 66  FF 76 08 68 00 00 68 00  &fh....f .v.h..h.
0000000070: 7C 68 01
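For a quick one-liner without the helper above, Python 3.8+ lets bytes.hex() take a separator and group size:

```python
data = bytes(range(8))

# Python 3.8+: bytes.hex() accepts a separator and a bytes-per-group count
print(data.hex(' '))     # '00 01 02 03 04 05 06 07'
print(data.hex(' ', 2))  # '0001 0203 0405 0607'
```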

Convert a hex string back to binary

python -c "open('output_file','wb').write(bytes.fromhex(open('input_file','r').read().strip()))"
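Both directions are available in the stdlib; bytes.fromhex is the Python 3 replacement for the Python 2-only str.decode('hex'):

```python
import binascii

blob = b'\xde\xad\xbe\xef'

# bytes -> hex string
hexed = blob.hex()
print(hexed)  # 'deadbeef'

# hex string -> bytes; binascii.unhexlify is equivalent
assert bytes.fromhex(hexed) == blob
assert binascii.unhexlify(hexed) == blob
```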

FAQ

shutil.rmtree() reports access denied when deleting read-only files on Windows

On error, try adding write permission and retrying

def onerror(func, path, exc_info):
    """
    Error handler for ``shutil.rmtree``.

    If the error is due to an access error (read only file)
    it attempts to add write permission and then retries.

    If the error is for another reason it re-raises the error.

    Usage : ``shutil.rmtree(path, onerror=onerror)``
    """
    import os
    import stat
    if not os.access(path, os.W_OK):
        # Is the error an access error ?
        os.chmod(path, stat.S_IWUSR)
        func(path)
    else:
        raise
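A self-contained run of the handler; note that on POSIX a writable parent directory lets rmtree delete read-only files anyway, so the handler mainly matters on Windows (and Python 3.12 deprecates onerror in favor of onexc):

```python
import os
import shutil
import stat
import tempfile

def onerror(func, path, exc_info):
    # give ourselves write permission, then retry the failing call
    os.chmod(path, stat.S_IWUSR)
    func(path)

# build a throwaway tree containing a read-only file
root = tempfile.mkdtemp()
target = os.path.join(root, 'readonly.txt')
with open(target, 'w') as f:
    f.write('x')
os.chmod(target, stat.S_IREAD)

shutil.rmtree(root, onerror=onerror)
print(os.path.exists(root))  # False
```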

Data analysis tools

Seven Python data science tools

http://www.galvanize.com/blog/seven-python-tools-all-data-scientists-should-know-how-to-use/

  • IPython
  • GraphLab Create
  • Pandas: powerful, flexible data analysis and exploration tool
  • PuLP
  • Matplotlib: powerful data visualization and plotting library
  • Scikit-Learn: machine learning library supporting regression, classification, clustering, etc.
  • Spark
  • Scipy: matrix support and matrix-based numerical computation modules
  • StatsModels: statistical modeling and econometrics, including descriptive statistics, model estimation and inference
  • Keras: deep learning library for building neural networks and deep learning models
  • Gensim: topic modeling library, useful for text mining

ANACONDA

https://www.continuum.io/downloads

Streamlit

Web scraping: scrapy

http://scrapy.org/

Jupyter Notebook

pip install jupyter

Tips

magics

  • List available magics
    %lsmagic
    
  • %: starts a single-line magic
  • %%: starts a cell (multi-line) magic
  • !: runs a shell command
    ! pip freeze | grep pandas

Run shell and other languages in a notebook

%%HTML
<img ....>

Convert to HTML

jupyter nbconvert --to html pipelinedashboard.ipynb

Capture shell output into a DataFrame (IPython syntax; csv.n joins the captured lines with newlines):

csv = ![command]

try:
   from io import StringIO
except ImportError:
   from StringIO import StringIO

import pandas

df = pandas.read_csv(StringIO(csv.n))
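Outside IPython, the same capture works with the stdlib alone; here the "command" is just Python printing two CSV lines (a stand-in for any CSV-producing command):

```python
import csv
import io
import subprocess
import sys

# run a command and parse its stdout as CSV
out = subprocess.run(
    [sys.executable, '-c', "print('a,b'); print('1,2')"],
    capture_output=True, text=True, check=True,
).stdout

rows = list(csv.reader(io.StringIO(out)))
print(rows)  # [['a', 'b'], ['1', '2']]
```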

ipywidgets for interactivity

Common problems

On FreeBSD, import numpy fails

  • Error message

    ImportError: /lib/libgcc_s.so.1: version GCC_4.6.0 required by /usr/local/lib/gcc49/libgfortran.so.3 not found
    
  • Solution

    export LD_LIBRARY_PATH="/usr/local/lib/gcc49"

On FreeBSD, sqlite3 is broken

ImportError: No module named _sqlite3

~/.pydistutils.cfg:

## Exclusively here to allow pysqlite to compile in a venv.
[build_ext]
include_dirs=/usr/local/include
library_dirs=/usr/local/lib

References

pipenv

When using pipenv together with pyenv, if you hit TypeError: 'NoneType' object is not iterable, set an explicit version with pyenv global <version> so the system Python is not used.
