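To see which fields each CSV file in this dataset contains, refer to the readme 2012.txt documentation file. The data comes from sites such as Open Source Sports, which publish updated statistics for each season every year. The readme also mentions that the data is available in MS Access format. If needed, you can download that version instead, which makes it relatively easy to convert from MS Access to MS SQL Server and then into Hive tables, saving you the step of defining the schema yourself.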
user@master ~/baseball $ hive
Logging initialized using configuration in jar:file:/opt/hive/lib/hive-common-0.8.1-cdh4.0.1.jar!/hive-log4j.properties
Hive history file=/tmp/user/hive_job_log_user_201312231714_241463960.txt
hive>
MASTER - Player names, DOB, and biographical info
---
2.1 MASTER table
---
lahmanID Unique number assigned to each player
playerID A unique code assigned to each player. The playerID links
the data in this file with records in the other files.
managerID An ID for individuals who served as managers
hofID An ID for individuals who are in the baseball Hall of Fame
birthYear Year player was born
birthMonth Month player was born
birthDay Day player was born
birthCountry Country where player was born
birthState State where player was born
birthCity City where player was born
deathYear Year player died
deathMonth Month player died
deathDay Day player died
deathCountry Country where player died
deathState State where player died
deathCity City where player died
nameFirst Player's first name
nameLast Player's last name
nameNote Note about player's name (usually signifying that they changed
their name or played under two different names)
nameGiven Player's given name (typically first and middle)
nameNick Player's nickname
weight Player's weight in pounds
height Player's height in inches
bats Player's batting hand (left, right, or both)
throws Player's throwing hand (left or right)
debut Date that player made first major league appearance
finalGame Date that player made last major league appearance (blank if still active)
college College attended
lahman40ID ID used in Lahman Database version 4.0
lahman45ID ID used in Lahman database version 4.5
retroID ID used by retrosheet
holtzID ID used by Sean Holtz's Baseball Almanac
bbrefID ID used by Baseball Reference website
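If you load the CSV files directly rather than going through MS Access and MS SQL Server, the schema has to be written by hand. The following is a minimal sketch, not the author's method, that generates a Hive `CREATE TABLE` statement from the MASTER field names listed above. The helper names (`hive_type`, `build_create_table`) are hypothetical, and the column types are assumptions: the year, month, day, ID, and weight fields are guessed to be `INT`, height is guessed to be `DOUBLE`, and everything else is kept as `STRING`.

```python
# Field names copied from the MASTER section of the readme, in order.
MASTER_FIELDS = [
    "lahmanID", "playerID", "managerID", "hofID",
    "birthYear", "birthMonth", "birthDay",
    "birthCountry", "birthState", "birthCity",
    "deathYear", "deathMonth", "deathDay",
    "deathCountry", "deathState", "deathCity",
    "nameFirst", "nameLast", "nameNote", "nameGiven", "nameNick",
    "weight", "height", "bats", "throws",
    "debut", "finalGame", "college",
    "lahman40ID", "lahman45ID", "retroID", "holtzID", "bbrefID",
]

# Assumption: these fields hold whole numbers.
INT_FIELDS = {
    "lahmanID", "birthYear", "birthMonth", "birthDay",
    "deathYear", "deathMonth", "deathDay", "weight",
}

# Assumption: height may carry a fractional part.
DOUBLE_FIELDS = {"height"}


def hive_type(field):
    """Return the assumed Hive column type for a field name."""
    if field in INT_FIELDS:
        return "INT"
    if field in DOUBLE_FIELDS:
        return "DOUBLE"
    return "STRING"


def build_create_table(table, fields):
    """Build a CREATE TABLE statement for a comma-delimited text file."""
    columns = ",\n".join(f"  {name} {hive_type(name)}" for name in fields)
    return (
        f"CREATE TABLE {table} (\n{columns}\n)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        "STORED AS TEXTFILE;"
    )


if __name__ == "__main__":
    print(build_create_table("Master", MASTER_FIELDS))
```

Note that a plain comma-delimited table does not understand quoted CSV values, so a field that contains a comma inside quotes would be split into extra columns. It is worth checking the data for such rows before relying on this layout.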
hive> SELECT lahmanID FROM Master WHERE birthyear > 1900;
hive> SELECT COUNT( * ) FROM Master;
MapReduce Total cumulative CPU time: 2 seconds 900 msec
Ended Job = job_201312211330_0022
No encryption was performed by peer.
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 Accumulative CPU: 2.9 sec HDFS Read: 0 HDFS Write: 0 SUCESS
Total MapReduce CPU Time Spent: 2 seconds 900 msec
OK
18126
Time taken: 97.89 seconds
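A simple way to sanity-check results like the ones above is to run the same logic against the CSV file outside Hive. The sketch below mirrors the two queries (`COUNT(*)` and `birthyear > 1900`) using Python's standard `csv` module. To stay self-contained it reads a tiny inline sample whose rows are made up for illustration and are not taken from the real Master.csv. It also assumes the file starts with a header row, and the function names are hypothetical.

```python
import csv
import io

# Made-up sample rows for illustration only; NOT real Master.csv data.
SAMPLE_CSV = """lahmanID,playerID,birthYear
1,examp01,1895
2,examp02,1934
3,examp03,
4,examp04,1981
"""


def count_rows(rows):
    """Equivalent of SELECT COUNT(*): count every data row."""
    return len(rows)


def ids_born_after(rows, year):
    """Equivalent of SELECT lahmanID ... WHERE birthYear > year.

    Rows with an empty birthYear are skipped, much as a NULL
    comparison filters a row out in SQL.
    """
    result = []
    for row in rows:
        value = row["birthYear"].strip()
        if value and int(value) > year:
            result.append(int(row["lahmanID"]))
    return result


if __name__ == "__main__":
    # Assumption: the first line of the file is a header row.
    rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
    print(count_rows(rows))            # 4
    print(ids_born_after(rows, 1900))  # [2, 4]
```

To run this against the real file, replace `io.StringIO(SAMPLE_CSV)` with an opened file handle for your local copy of the CSV.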