httpclientAccessWebSource - juedaiyuer/researchNote GitHub Wiki

#Accessing Web Resources with HttpClient#

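In practice, developers often use HttpClient, Apache's open-source HTTP client project, which handles most of the problems that come up with HTTP connections.

With the HttpClient jar added to the project, the following code fetches the content of a web page much as a browser such as IE would: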

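// Create a client, similar to opening a browser
HttpClient httpclient=new HttpClient();

// Create a GET method, similar to typing an address into the browser's address bar
GetMethod getMethod=new GetMethod("http://www.blablabla.com");

// Execute the request (like pressing Enter) and get the response status code
int statusCode=httpclient.executeMethod(getMethod);

// Print the response body; much more is available, such as headers and cookies
System.out.println("response=" + getMethod.getResponseBodyAsString());

// Release the connection
getMethod.releaseConnection();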

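The HTTP specification itself does not cap URL length, but browsers and servers do in practice, so a GET request cannot carry many parameters. For larger payloads, use the POST method: its parameters are set with NameValuePair and travel in the request body, so their number is limited only by what the server accepts.

By contrast, the GET method writes its parameters into the URL, so the total length of the parameters is bounded by the URL length limit.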

// Create a POST method
PostMethod postMethod = new PostMethod("http://www.saybot.com/postme");

// Pass the parameters as an array
NameValuePair[] postData = new NameValuePair[2];

// Set the parameters (the names and values are sample data)
postData[0] = new NameValuePair("武器", "枪");
postData[1] = new NameValuePair("什么枪", "神枪");
postMethod.addParameters(postData);

// Execute the request (like pressing Enter) and get the response status code
int postStatusCode=httpclient.executeMethod(postMethod);

// Print the response body; much more is available, such as headers and cookies
System.out.println("response=" + postMethod.getResponseBodyAsString());

// Release the connection
postMethod.releaseConnection();
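
To make the GET/POST difference concrete, here is a minimal sketch that encodes one set of parameters and shows where they would go in each kind of request. It uses only the JDK rather than HttpClient. The class `FormEncodingSketch`, its `encode` method, and the sample parameter names are assumptions for illustration, and UTF-8 is assumed as the charset.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of HttpClient.
class FormEncodingSketch {

    // Encodes name/value pairs as application/x-www-form-urlencoded text (UTF-8 assumed).
    static String encode(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        try {
            for (Map.Entry<String, String> e : params.entrySet()) {
                if (sb.length() > 0) {
                    sb.append('&');
                }
                sb.append(URLEncoder.encode(e.getKey(), "UTF-8"))
                  .append('=')
                  .append(URLEncoder.encode(e.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(ex);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<String, String>();
        params.put("weapon", "gun");
        params.put("kind", "magic gun");
        String encoded = encode(params);
        // GET: the parameters become part of the URL, so they count toward its length.
        System.out.println("GET  http://www.saybot.com/postme?" + encoded);
        // POST: the URL stays short and the same text travels in the request body.
        System.out.println("POST http://www.saybot.com/postme  body: " + encoded);
    }
}
```

The encoded text is identical in both cases; only its position in the request changes, which is why the URL length limit affects GET but not POST.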

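##Using a Proxy Server with HttpClient##

Sometimes the machine running the crawler cannot reach web resources directly and has to go through an HTTP proxy server.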

// Create the HttpClient that will send requests through the proxy
HttpClient httpClient=new HttpClient();

// Set the proxy server's IP address and port
httpClient.getHostConfiguration().setProxy("192.168.0.1", 9527);

// Tell httpClient to authenticate preemptively; otherwise the proxy may reject the request as unauthorized
httpClient.getParams().setAuthenticationPreemptive(true);

// MyProxyCredentialsProvider (a class you supply) returns the proxy credentials (username/password)
httpClient.getParams().setParameter(CredentialsProvider.PROVIDER,
new MyProxyCredentialsProvider());

// Set the username and password for the proxy server
httpClient.getState().setProxyCredentials(new AuthScope("192.168.0.1",
AuthScope.ANY_PORT, AuthScope.ANY_REALM),
new UsernamePasswordCredentials("username","password"));
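
Preemptive authentication means the credentials are sent with the first request instead of after the proxy's 407 challenge. For the Basic scheme this comes down to one `Proxy-Authorization` header, which the JDK-only sketch below builds by hand. The class `ProxyAuthSketch` and its method are assumptions for illustration and are not how HttpClient is implemented internally.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper for illustration; not part of HttpClient.
class ProxyAuthSketch {

    // Builds the value of a Proxy-Authorization header for the Basic scheme:
    // "Basic " followed by base64("username:password").
    static String basicProxyAuthorization(String username, String password) {
        String pair = username + ":" + password;
        String encoded = Base64.getEncoder()
                .encodeToString(pair.getBytes(StandardCharsets.UTF_8));
        return "Basic " + encoded;
    }

    public static void main(String[] args) {
        // Placeholder credentials, matching the snippet above.
        System.out.println("Proxy-Authorization: "
                + basicProxyAuthorization("username", "password"));
    }
}
```

Base64 is an encoding, not encryption, so Basic credentials sent to a proxy over plain HTTP can be read by anyone on the path.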

##Java Web Page Fetching Example##

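import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.commons.httpclient.Header;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpException;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.PostMethod;

public class RetrivePage {
    private static HttpClient httpClient = new HttpClient();
    // Configure the proxy server
    static {
        // Set the proxy server's IP address and port
        httpClient.getHostConfiguration().setProxy("172.17.18.84", 8080);
    }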

    public static boolean downloadPage(String path) throws HttpException,
            IOException {
        InputStream input = null;
        OutputStream output = null;
        // Create the POST method
        PostMethod postMethod = new PostMethod(path);
        // Set the POST parameters
        /*
         * NameValuePair[] postData = new NameValuePair[2]; postData[0] = new
         * NameValuePair("name","lietu"); postData[1] = new
         * NameValuePair("password","*****");
         * postMethod.addParameters(postData);
         */
        // Execute the request and get the status code
        int statusCode = httpClient.executeMethod(postMethod);
        // Handle the status code (for simplicity, only 200 counts as success)
        if (statusCode == HttpStatus.SC_OK) {
            input = postMethod.getResponseBodyAsStream();
            // Derive the file name from the URL
            String filename = path.substring(path.lastIndexOf('/')+1);
            // Open the output file stream
            output = new FileOutputStream(filename);
            // Write to the file; read() returns -1 at end of stream, and 0 is a valid byte
            int tempByte = -1;
            while((tempByte=input.read())!=-1){
                output.write(tempByte);
            }
            // Close the input and output streams
            if(input!=null){
                input.close();
            }
            if(output!=null){
                output.close();
            }
            // Release the connection
            postMethod.releaseConnection();
            return true;
        }
        // If the response asks for a redirect, follow it
        if ((statusCode == HttpStatus.SC_MOVED_TEMPORARILY) || (statusCode == HttpStatus.SC_MOVED_PERMANENTLY) || (statusCode == HttpStatus.SC_SEE_OTHER) || (statusCode == HttpStatus.SC_TEMPORARY_REDIRECT)) {
            // Read the new URL from the Location header
            Header header = postMethod.getResponseHeader("location");
            if(header!=null){
                String newUrl = header.getValue();
                // Fall back to the site root when the header is empty
                if(newUrl==null||newUrl.equals("")){
                    newUrl="/";
                }
                // Redirect using POST
                PostMethod redirect = new PostMethod(newUrl);
                // Send the request and process the response further...
            }
        }
        // Release the connection
        postMethod.releaseConnection();
        return false;
    }

    /**
     * Test code
     */
    public static void main(String[] args) {
        // Fetch the lietu home page and save it to a file
        try {
            RetrivePage.downloadPage("http://www.lietu.com");
        } catch (HttpException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
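
The download step above reduces to two small operations: copying a stream until `read` reports end of stream, and deriving a file name from the URL. The JDK-only sketch below isolates both so they can be tried without a network. The class `StreamCopySketch`, its methods, and the `index.html` fallback for URLs ending in `/` are assumptions for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper for illustration; not part of HttpClient.
class StreamCopySketch {

    // Copies everything from in to out and returns the number of bytes copied.
    // The loop ends on -1 (end of stream), so zero-valued bytes are copied too.
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[4096];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        return total;
    }

    // Takes the text after the last '/' as the file name.
    // Falls back to "index.html" (an assumed default) when that text is empty.
    static String fileNameFor(String url) {
        String name = url.substring(url.lastIndexOf('/') + 1);
        return name.isEmpty() ? "index.html" : name;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {1, 0, 2};
        // try-with-resources closes both streams even if copy() throws.
        try (InputStream in = new ByteArrayInputStream(data);
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            System.out.println(copy(in, out) + " bytes copied");
        }
        System.out.println(fileNameFor("http://www.lietu.com"));
        System.out.println(fileNameFor("http://www.lietu.com/"));
    }
}
```

Reading into a buffer instead of one byte at a time cuts the number of read calls, which matters for larger pages.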

##Handling HTTP Status Codes##

Accessing web resources with HttpClient involves HTTP status codes, which executeMethod returns:

int statusCode = httpClient.executeMethod(postMethod);

Status code 200 means the request succeeded; status code 404 means the requested resource does not exist.
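
Status codes fall into classes by their first digit, which a crawler can use to choose what to do next. A minimal JDK-only sketch of that classification follows. The class `StatusCodeSketch` and the labels it returns are assumptions for illustration.

```java
// Hypothetical helper for illustration; not part of HttpClient.
class StatusCodeSketch {

    // Maps a status code to its class, based on the first digit.
    static String describe(int statusCode) {
        switch (statusCode / 100) {
            case 1: return "informational";
            case 2: return "success";
            case 3: return "redirection";
            case 4: return "client error";
            case 5: return "server error";
            default: return "unknown";
        }
    }

    public static void main(String[] args) {
        int[] samples = {200, 302, 404, 500};
        for (int code : samples) {
            System.out.println(code + " -> " + describe(code));
        }
    }
}
```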

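A 3XX status code usually means the request should be redirected; the downloadPage method in the example above contains a fragment that handles redirection.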
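
A `Location` header may hold a relative path such as `/login`, so a crawler generally resolves it against the URL it just requested before following it. The JDK-only sketch below shows one way to do that with `java.net.URI`. The class `RedirectSketch` and its methods are assumptions for illustration; the four status codes are the ones downloadPage checks.

```java
import java.net.URI;

// Hypothetical helper for illustration; not part of HttpClient.
class RedirectSketch {

    // True for the redirect codes handled in downloadPage: 301, 302, 303, 307.
    static boolean isRedirect(int statusCode) {
        return statusCode == 301 || statusCode == 302
                || statusCode == 303 || statusCode == 307;
    }

    // Resolves the Location header value against the URL that was requested.
    // A missing or empty value falls back to "/", as in downloadPage.
    static String resolveLocation(String requestUrl, String location) {
        if (location == null || location.isEmpty()) {
            location = "/";
        }
        return URI.create(requestUrl).resolve(location).toString();
    }

    public static void main(String[] args) {
        String requested = "http://www.lietu.com/a/b";
        if (isRedirect(302)) {
            // A relative Location keeps the original scheme and host.
            System.out.println(resolveLocation(requested, "/login"));
            // An absolute Location replaces the URL entirely.
            System.out.println(resolveLocation(requested, "http://www.saybot.com/postme"));
        }
    }
}
```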

##Source##

HttpClient Tutorial