Lab 010 - jazzwang/hive_labs GitHub Wiki

實作十

Hive (2) : JOIN SELECT 跨表查詢

資料集:魔球

  1. 我們將沿用 Hive (1) 的資料集。
  2. 接下來,我們再來把 Batting.csv (Batting statistics, 打擊率) 轉成另一個資料表。

建立 Hive 資料表

  1. 先瞭解 Batting.csv 這個檔案的欄位跟資料型態。說明文件中提到
2.2 Batting Table
playerID       Player ID code
yearID         Year
stint          player's stint (order of appearances within a season)
teamID         Team
lgID           League
G              Games
G_batting      Game as batter
AB             At Bats
R              Runs
H              Hits
2B             Doubles
3B             Triples
HR             Homeruns
RBI            Runs Batted In
SB             Stolen Bases
CS             Caught Stealing
BB             Base on Balls
SO             Strikeouts
IBB            Intentional walks
HBP            Hit by pitch
SH             Sacrifice hits
SF             Sacrifice flies
GIDP           Grounded into double plays
G_Old          Old version of games (deprecated)
  1. 根據第一行的資料,我們可以決定資料表每個欄位的資料型態

    欄位 範例 資料型態


    playerID aardsda01 STRING yearID 2004 INT stint 1 INT teamID SFN STRING lgID NL STRING G 11 INT G_batting 11 INT AB 0 INT R 0 INT H 0 INT 2B 0 INT 3B 0 INT HR 0 INT RBI 0 INT SB 0 INT CS 0 INT BB 0 INT SO 0 INT IBB 0 INT HBP 0 INT SH 0 INT SF 0 INT GIDP 0 INT G_old 11 INT

    Table: Batting 的資料型態

  2. 根據以上瞭解,我們使用 HiveQL 建立資料表的語法來建立 Batting 資料表

hive> create table baseball.Batting
( playerID STRING, yearID INT, stint INT, teamID STRING, lgID STRING,
  G INT, G_batting INT, AB INT, R INT, H INT, 2B INT, 3B INT, HR INT,
  RBI INT, SB INT, CS INT, BB INT, SO INT, IBB INT, HBP INT, SH INT,
  SF INT, GIDP INT, G_old INT ) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ;

匯入 CSV 資料到 Hive 資料表

  1. HiveQL 自 CSV 檔案匯入資料到 Hive 資料表的語法是 `LOAD DATA LOCAL INPATH <檔案路徑> [OVERWRITE] INTO TABLE <資料表名稱>。
hive> LOAD DATA LOCAL INPATH "Batting.csv" OVERWRITE INTO TABLE baseball.Batting;

檢查匯出結果

  1. SHOW TABLES; 來查詢目前的資料庫有哪幾個資料表。確認是否有正確產生 Batting 資料表。
hive> USE baseball; SHOW TABLES;
OK
Time taken: 0.031 seconds
OK
batting
master
Time taken: 0.22 seconds
  1. 其次,我們可以檢查一下剛剛建立的 baseball.batting 資料表,內容是否正常。
hive> SELECT * FROM baseball.batting;

JOIN 跨表查詢

  1. JOIN SELECT 的語法是 SELECT <欄位> FROM <資料表A> JOIN <資料表B> ON <條件>
SELECT A.PlayerID, B.teamID, B.AB, B.R, B.H, B.2B, B.3B, B.HR, B.RBI FROM Master A JOIN BATTING B ON A.playerID = B.playerID;
  1. <備註> 在虛擬機器中執行這個查詢可能會有點吃力,請耐心等候實驗結果。
MapReduce Total cumulative CPU time: 12 seconds 690 msec
Ended Job = job_201312211330_0031
No encryption was performed by peer.
MapReduce Jobs Launched: 
Job 0: Map: 2  Reduce: 1   Accumulative CPU: 12.69 sec   HDFS Read: 0 HDFS Write: 0 SUCESS
Total MapReduce CPU Time Spent: 12 seconds 690 msec
OK
Time taken: 812.483 seconds

參考資料

  1. "Build a data library with Hive", By Peter J. Jamack, Big Data Analytics Consultant, 21 May 2013, IBM developerWorks
⚠️ **GitHub.com Fallback** ⚠️