Lab 010 - jazzwang/hive_labs GitHub Wiki
- 我們將沿用 Hive (1) 的資料集。
- 接下來,我們再來把 Batting.csv (Batting statistics, 打擊率) 轉成另一個資料表。
- 先瞭解 Batting.csv 這個檔案的欄位跟資料型態。說明文件中提到
2.2 Batting Table playerID Player ID code yearID Year stint player's stint (order of appearances within a season) teamID Team lgID League G Games G_batting Game as batter AB At Bats R Runs H Hits 2B Doubles 3B Triples HR Homeruns RBI Runs Batted In SB Stolen Bases CS Caught Stealing BB Base on Balls SO Strikeouts IBB Intentional walks HBP Hit by pitch SH Sacrifice hits SF Sacrifice flies GIDP Grounded into double plays G_Old Old version of games (deprecated)
-
根據第一行的資料,我們可以決定資料表每個欄位的資料型態
欄位 範例 資料型態
playerID aardsda01 STRING yearID 2004 INT stint 1 INT teamID SFN STRING lgID NL STRING G 11 INT G_batting 11 INT AB 0 INT R 0 INT H 0 INT 2B 0 INT 3B 0 INT HR 0 INT RBI 0 INT SB 0 INT CS 0 INT BB 0 INT SO 0 INT IBB 0 INT HBP 0 INT SH 0 INT SF 0 INT GIDP 0 INT G_old 11 INT
Table: Batting 的資料型態
-
根據以上瞭解,我們使用 HiveQL 建立資料表的語法來建立 Batting 資料表
hive> create table baseball.Batting ( playerID STRING, yearID INT, stint INT, teamID STRING, lgID STRING, G INT, G_batting INT, AB INT, R INT, H INT, 2B INT, 3B INT, HR INT, RBI INT, SB INT, CS INT, BB INT, SO INT, IBB INT, HBP INT, SH INT, SF INT, GIDP INT, G_old INT ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ;
- HiveQL 自 CSV 檔案匯入資料到 Hive 資料表的語法是 `LOAD DATA LOCAL INPATH <檔案路徑> [OVERWRITE] INTO TABLE <資料表名稱>。
hive> LOAD DATA LOCAL INPATH "Batting.csv" OVERWRITE INTO TABLE baseball.Batting;
- 用
SHOW TABLES;
來查詢目前的資料庫有哪幾個資料表。確認是否有正確產生 Batting 資料表。
hive> USE baseball; SHOW TABLES; OK Time taken: 0.031 seconds OK batting master Time taken: 0.22 seconds
- 其次,我們可以檢查一下剛剛建立的 baseball.batting 資料表,內容是否正常。
hive> SELECT * FROM baseball.batting;
- JOIN SELECT 的語法是
SELECT <欄位> FROM <資料表A> JOIN <資料表B> ON <條件>
。
SELECT A.PlayerID, B.teamID, B.AB, B.R, B.H, B.2B, B.3B, B.HR, B.RBI FROM Master A JOIN BATTING B ON A.playerID = B.playerID;
- <備註> 在虛擬機器中執行這個查詢可能會有點吃力,請耐心等候實驗結果。
MapReduce Total cumulative CPU time: 12 seconds 690 msec Ended Job = job_201312211330_0031 No encryption was performed by peer. MapReduce Jobs Launched: Job 0: Map: 2 Reduce: 1 Accumulative CPU: 12.69 sec HDFS Read: 0 HDFS Write: 0 SUCESS Total MapReduce CPU Time Spent: 12 seconds 690 msec OK Time taken: 812.483 seconds
- "Build a data library with Hive", By Peter J. Jamack, Big Data Analytics Consultant, 21 May 2013, IBM developerWorks