ICP Assignment 7 - MadhuriSarode/BDP GitHub Wiki

Student ID : 24 : Madhuri Sarode

Student ID : 4 : Bhargavi

Student ID : 16 : Bhavana

Independent Column-based NoSQL Tool - Cassandra

Apache Cassandra is a free and open-source, distributed, wide column store, NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.

Data in Cassandra can be distributed across servers, and a cluster can span multiple data centres located around the world. Replication across those data centres allows Cassandra to serve operations with low latency.

The hierarchy of elements in Cassandra is:

Cluster

      Data center(s)

           Rack(s)

              Server(s)

                 Node (more accurately, a vnode)

A vnode (virtual node) is the data storage layer within a server. A server is the Cassandra software. A server is installed on a machine, where a machine is either a physical server, an EC2 instance, or similar.

Cassandra is elastic and scalable. No downtime or restart is needed to reconfigure the system: whenever a node is added to the cluster, it starts operating straight away.

For each column family, there are three layers of data storage: the memtable, the commit log, and the SSTable. For efficiency, Cassandra does not repeat the names of the columns in memory or in the SSTable.

An SSTable provides a persistent, ordered immutable map from keys to values, where both keys and values are arbitrary byte strings. Memtables and SSTables are maintained per table. The commit log is shared among tables. SSTables are immutable, not written to again after the memtable is flushed. The SSTables are files stored on disk.

Data distribution is automatic and can be ordered or random. When data is written, Cassandra automatically distributes it across several data centres or a hybrid cloud. Cassandra has a true “read/write-anywhere” design, which means a client does not need to know where the data has been written. Any machine in the Cassandra ring can serve reads and writes, whether it is in a single-cluster or a multi-cluster setup.

Cassandra (CQL Command Line)

A Cassandra keyspace is created with a replication factor of 3, which means there will be three copies of the data on three different nodes.

The replication strategy determines where each replica is placed. SimpleStrategy is used when you have just one data center. It places the first replica on the node selected by the partitioner; the remaining replicas are placed on the next nodes in the clockwise direction around the node ring.
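As a sketch, a keyspace with these settings could be created with a statement like the one below. The keyspace name demo_ks is a placeholder chosen for illustration and is not taken from the assignment.

```sql
-- demo_ks is a hypothetical keyspace name used only for this example.
-- SimpleStrategy with replication_factor 3 keeps three copies of each row.
CREATE KEYSPACE IF NOT EXISTS demo_ks
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
```

For a cluster that spans several data centers, NetworkTopologyStrategy would normally be used in place of SimpleStrategy.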

The available keyspaces can be listed using the DESC keyspaces; command.

We use the command use keyspacename; to direct Cassandra to use the specified keyspace for further operations and data uploads.
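In cqlsh, these two steps might look like the following sketch. The keyspace name demo_ks is a hypothetical stand-in for keyspacename.

```sql
-- List every keyspace in the cluster (DESC is the short form of DESCRIBE).
DESCRIBE KEYSPACES;

-- Make the hypothetical keyspace demo_ks the default for later statements.
USE demo_ks;

-- Optionally confirm which tables exist in the current keyspace.
DESCRIBE TABLES;
```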

Query 1:

The employee table is created and data is inserted as follows. The full contents of the table are then displayed.
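A minimal sketch of such statements, assuming a table named employee1 (the name that appears in Query 2) with hypothetical columns employee_id, name, job, and hiredate. The column names, types, and sample rows are illustrative only and are not the assignment's actual data; the date type assumes a reasonably recent Cassandra version.

```sql
-- Hypothetical schema: employee_id is the partition key.
CREATE TABLE IF NOT EXISTS employee1 (
  employee_id int PRIMARY KEY,
  name        text,
  job         text,
  hiredate    date
);

-- Illustrative rows only.
INSERT INTO employee1 (employee_id, name, job, hiredate)
  VALUES (1, 'Employee A', 'clerk', '2000-02-18');
INSERT INTO employee1 (employee_id, name, job, hiredate)
  VALUES (2, 'Employee B', 'manager', '2001-06-01');

-- Show every row and column in the table.
SELECT * FROM employee1;
```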

Query 2:

A new column, salary, is added to the table employee1 using the alter command. The records are updated to fill in salary values, and the details of the employees whose job title is clerk are viewed using the following command.
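One way these steps could be written, assuming the hypothetical columns employee_id (primary key) and job, and an illustrative salary value:

```sql
-- Add the new column; existing rows get a null salary until updated.
ALTER TABLE employee1 ADD salary int;

-- Fill in a salary for one row (the value is illustrative).
UPDATE employee1 SET salary = 40000 WHERE employee_id = 1;

-- job is not part of the primary key in this sketch, so the query needs
-- either a secondary index on job or ALLOW FILTERING.
SELECT * FROM employee1 WHERE job = 'clerk' ALLOW FILTERING;
```

ALLOW FILTERING scans the table, so it is reasonable for a small demo table but is generally avoided on large data sets. Filtering on a regular column this way also assumes a recent Cassandra version.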

Query 3:

The records of the employees whose hiredate is February 18, 2000 can be queried as follows.
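A possible form of this query, assuming hiredate is stored in a column of type date that is not part of the primary key:

```sql
-- Date literals are written as 'yyyy-mm-dd' strings.
-- ALLOW FILTERING is needed here because hiredate is assumed to be a
-- regular, unindexed column.
SELECT * FROM employee1 WHERE hiredate = '2000-02-18' ALLOW FILTERING;
```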

Query 4:

Only a few columns can be selected from the table, rather than all of them.
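For example, with the hypothetical column names employee_id, name, and job, a projection could be sketched as:

```sql
-- Name the wanted columns instead of using *.
SELECT employee_id, name, job FROM employee1;
```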

Query 5:

A new column, manager, is added to the table to indicate who each employee's manager is. We update each employee's record with the manager's name. The records whose salary is greater than 45000 are then selected as follows.
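These steps might be written as in the sketch below. It assumes the hypothetical primary key employee_id and an int column salary; the manager name is a placeholder.

```sql
-- Add the manager column to the existing table.
ALTER TABLE employee1 ADD manager text;

-- Set the manager for one employee (placeholder value).
UPDATE employee1 SET manager = 'Manager A' WHERE employee_id = 1;

-- A range condition on a regular column requires ALLOW FILTERING.
SELECT * FROM employee1 WHERE salary > 45000 ALLOW FILTERING;
```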