Reading Data with the Hadoop API - twilighthook/BigDataNote GitHub Wiki

1. Reading a file with a Hadoop URL

To read data stored in Hadoop, you can open a stream through java.net.URL.

InputStream in = null;
try {
	in = new URL("hdfs://host/path").openStream();
	//process
} catch (Exception e) {
	e.printStackTrace();
} finally {
	IOUtils.closeStream(in);
}

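For Java to recognize the hdfs URL scheme, call the static method URL.setURLStreamHandlerFactory and pass it an instance of FsUrlStreamHandlerFactory. This method can be called at most once per JVM, so it is typically invoked in a static block. IOUtils then reads the file and copies it to standard output.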

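import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

public class URLCat {

	static {
		// Teach java.net.URL the hdfs scheme; allowed only once per JVM.
		URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
	}

	public static void main(String[] args) {
		InputStream in = null;
		try {
			// args[0] is the URL of the file, e.g. hdfs://host/path
			in = new URL(args[0]).openStream();
			// Copy to stdout with a 4096-byte buffer; false = do not close the streams.
			IOUtils.copyBytes(in, System.out, 4096, false);
		} catch (Exception e) {
			e.printStackTrace();
		} finally {
			IOUtils.closeStream(in);
		}
	}

}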
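To see what the static block in URLCat achieves without a running cluster, the same registration mechanism can be sketched with the JDK alone. The "mem" scheme, the class MemUrlDemo, and its helper read are hypothetical names made up for this illustration; they are not part of Hadoop. FsUrlStreamHandlerFactory plays the role of MemHandlerFactory for Hadoop file system schemes such as hdfs.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;
import java.nio.charset.StandardCharsets;

// Hypothetical demo: registers a handler for a made-up "mem" scheme.
class MemUrlDemo {

	// Factory that only knows "mem"; returning null lets the JDK fall back
	// to its built-in handlers for every other scheme.
	static class MemHandlerFactory implements URLStreamHandlerFactory {
		@Override
		public URLStreamHandler createURLStreamHandler(String protocol) {
			if (!"mem".equals(protocol)) {
				return null;
			}
			return new URLStreamHandler() {
				@Override
				protected URLConnection openConnection(URL u) {
					return new URLConnection(u) {
						@Override
						public void connect() {
							// Nothing to connect to: the data lives in memory.
						}

						@Override
						public InputStream getInputStream() {
							// Serve the URL path itself as the "file" content.
							byte[] data = u.getPath().getBytes(StandardCharsets.UTF_8);
							return new ByteArrayInputStream(data);
						}
					};
				}
			};
		}
	}

	static {
		// Same pattern as URLCat: register once, when the class is loaded.
		URL.setURLStreamHandlerFactory(new MemHandlerFactory());
	}

	// Opens the URL and returns its content as a string.
	static String read(String spec) {
		try (InputStream in = new URL(spec).openStream()) {
			return new String(in.readAllBytes(), StandardCharsets.UTF_8);
		} catch (IOException e) {
			throw new UncheckedIOException(e);
		}
	}

	public static void main(String[] args) {
		System.out.println(read("mem://host/hello")); // prints "/hello"
	}
}
```

Without the registration, new URL("mem://host/hello") would fail with a MalformedURLException, which is the same failure an hdfs URL produces when the factory has not been set.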

2. Reading data with the FileSystem API

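FileSystem is a general-purpose file system API. Using it here requires a Configuration object, which carries the client-side or server-side settings, such as those in Hadoop's core-site.xml.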

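import java.io.IOException;
import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemCat {

	public static void main(String[] args) throws IOException {
		String uri = args[0];
		// Loads settings such as core-site.xml from the classpath.
		Configuration conf = new Configuration();
		// The URI scheme decides which FileSystem implementation is returned.
		FileSystem fs = FileSystem.get(URI.create(uri), conf);
		InputStream in = null;
		try {
			in = fs.open(new Path(uri));
			IOUtils.copyBytes(in, System.out, 4096, false);
		} catch (Exception e) {
			e.printStackTrace();
		} finally {
			IOUtils.closeStream(in);
		}
	}

}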
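URLCat and FileSystemCat both call IOUtils.copyBytes(in, System.out, 4096, false). The third argument is the buffer size in bytes, and the fourth says whether the streams should be closed when the copy finishes. A rough JDK-only sketch of that behavior for the close=false case is shown here. CopyBytesSketch and its copy method are hypothetical names for illustration, not Hadoop's implementation, and the returned byte count is an addition of this sketch.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper that mimics a buffered stream copy.
class CopyBytesSketch {

	// Copies everything from in to out through a buffer of buffSize bytes.
	// Neither stream is closed, so the caller stays responsible for them.
	static long copy(InputStream in, OutputStream out, int buffSize) throws IOException {
		byte[] buf = new byte[buffSize];
		long total = 0;
		int n;
		// read() returns -1 at end of stream.
		while ((n = in.read(buf)) != -1) {
			out.write(buf, 0, n);
			total += n;
		}
		out.flush();
		return total;
	}

	public static void main(String[] args) throws IOException {
		byte[] data = "hello hdfs\n".getBytes(StandardCharsets.UTF_8);
		copy(new ByteArrayInputStream(data), System.out, 4096);
	}
}
```

Leaving the streams open matters in the examples above: System.out must remain usable after the copy, and the input stream is closed separately in the finally block.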

3. Using seek() to move the read position

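// fs.open() returns an FSDataInputStream, which supports seek(); a plain InputStream does not.
FSDataInputStream in = null;
try {
	in = fs.open(new Path(uri));
	IOUtils.copyBytes(in, System.out, 4096, false);
	in.seek(0); // move back to the start of the file
	IOUtils.copyBytes(in, System.out, 4096, false);
} finally {
	IOUtils.closeStream(in);
}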

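Rewriting the try block of FileSystemCat as shown above moves the read position back to the start of the file, so the file is read and printed a second time. For seek() to compile, in must be declared as org.apache.hadoop.fs.FSDataInputStream (the type that fs.open() returns) rather than as a plain InputStream.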
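The read, rewind, read again pattern can be tried locally without Hadoop. The sketch below swaps in java.io.RandomAccessFile from the JDK in place of FSDataInputStream, since both offer a seek(long) method that sets the position of the next read. SeekTwiceSketch and readTwice are hypothetical names used only for this illustration.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Local stand-in for the HDFS example, using RandomAccessFile.
class SeekTwiceSketch {

	// Reads the whole file, seeks back to offset 0, and reads it again.
	static String readTwice(File file) {
		try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
			byte[] first = new byte[(int) raf.length()];
			raf.readFully(first);
			raf.seek(0); // rewind to the beginning of the file
			byte[] second = new byte[(int) raf.length()];
			raf.readFully(second);
			return new String(first, StandardCharsets.UTF_8)
					+ new String(second, StandardCharsets.UTF_8);
		} catch (IOException e) {
			throw new UncheckedIOException(e);
		}
	}

	public static void main(String[] args) throws IOException {
		File tmp = File.createTempFile("seek-demo", ".txt");
		tmp.deleteOnExit();
		Files.write(tmp.toPath(), "abc\n".getBytes(StandardCharsets.UTF_8));
		System.out.print(readTwice(tmp)); // prints "abc" on two lines
	}
}
```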