Analyze Data - TheLadders/pipeline GitHub Wiki

Command Line

Cassandra's CQLSH

  • Query Cassandra directly from inside the Docker container
root@docker$ cqlsh

cqlsh> USE pipeline; SELECT fromuserid, touserid, batchtime, rating FROM real_time_ratings LIMIT 10;

 fromuserid | touserid | batchtime | rating
------------+----------+-----------+--------
          1 |      133 |  24671840 |      8
          1 |      720 |  24671840 |      6
          1 |      971 |  24671840 |     10
          1 |     1095 |  24673840 |      7
          1 |     1616 |  24673840 |     10
          1 |     1978 |  24673840 |      7
          1 |     2145 |  24673840 |      8
          1 |     2211 |  24673840 |      8
          1 |     3751 |  24673840 |      7
          1 |     4062 |  24673840 |      3

(10 rows)
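
To run the same query without an interactive prompt (for example from a script), something like the following sketch may work. It relies on cqlsh's `-k` (keyspace) and `-e` (execute) options; the helper name `run_ratings_query` and its row-limit argument are assumptions made for illustration, not part of the pipeline project.

```shell
#!/bin/sh
# Sketch: run the ratings query non-interactively through cqlsh.
# run_ratings_query is a hypothetical helper name, not part of the pipeline repo.
run_ratings_query() {
  limit="${1:-10}"   # number of rows to return (defaults to 10)

  # Reject anything that is not a plain positive integer before
  # interpolating it into the CQL statement.
  case "$limit" in
    ''|*[!0-9]*) echo "limit must be a positive integer" >&2; return 2 ;;
  esac

  # -k selects the keyspace, -e executes one statement and exits.
  cqlsh -k pipeline \
    -e "SELECT fromuserid, touserid, batchtime, rating FROM real_time_ratings LIMIT ${limit};"
}
```

Inside the container, `run_ratings_query 5` would then print the first five rows in the same tabular layout as above.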

Beeline's HiveQL CLI

  • Query the Hive ThriftServer directly from inside the Docker container
root@docker$ beeline -u jdbc:hive2://127.0.0.1:10000 -n hiveuser -p ''
0: jdbc:hive2://127.0.0.1:10000> SELECT id, gender FROM gender_json_file LIMIT 100;
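
For scripted use, beeline can also run a single statement and exit. The sketch below reuses the connection settings shown above and adds beeline's `-e` option; `--outputformat=csv2` is assumed to be supported by the beeline version shipped in the container, and the helper name `run_hive_query` is hypothetical.

```shell
#!/bin/sh
# Sketch: run one HiveQL statement through beeline and exit.
# run_hive_query is a hypothetical helper name, not part of the pipeline repo.
run_hive_query() {
  query="$1"   # HiveQL statement to execute

  if [ -z "$query" ]; then
    echo "usage: run_hive_query '<HiveQL statement>'" >&2
    return 2
  fi

  # Same URL, user, and empty password as the interactive example;
  # csv2 output is easier to pipe into other tools than the default table.
  beeline -u jdbc:hive2://127.0.0.1:10000 -n hiveuser -p '' \
    --outputformat=csv2 \
    -e "$query"
}
```

For example, `run_hive_query 'SELECT id, gender FROM gender_json_file LIMIT 100;'` issues the same query as the interactive session.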

Using Notebooks for Ad Hoc Data Analysis

Spark-Notebook

  • Get the IP of your Docker machine, then open Spark-Notebook on port 39000 in a browser
macosx-laptop$ docker-machine ip pipelinebythebay
macosx-laptop$ open http://<ip-from-above>:39000
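
The two steps can be combined so the IP does not have to be pasted by hand. This is a minimal sketch: `notebook_url` is a hypothetical helper name, and it assumes the machine name and port shown above.

```shell
#!/bin/sh
# Sketch: build the Spark-Notebook URL from the Docker machine's IP.
# notebook_url is a hypothetical helper name, not part of the pipeline repo.
notebook_url() {
  machine="${1:-pipelinebythebay}"   # docker-machine name (default from this page)

  # Stop early if docker-machine cannot resolve the machine.
  ip="$(docker-machine ip "$machine")" || return 1

  printf 'http://%s:39000\n' "$ip"
}
```

On macOS, `open "$(notebook_url)"` would then launch the notebook in the default browser.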