CASSANDRA - praveenpoluri/Big-Data-Programing GitHub Wiki

CASSANDRA

About Cassandra:

Apache Cassandra is an open-source, highly scalable, and highly available distributed database for storing and managing large volumes of high-velocity structured data across multiple commodity servers, with no single point of failure.

A cluster can be scaled out by adding nodes to meet a sudden increase in demand while still meeting high-availability requirements. It is one of the most widely used NoSQL databases. DataStax offers a free packaged distribution of Apache Cassandra, which also includes tools such as a Windows installer, DevCenter, and the DataStax documentation.

A NoSQL database is a data store designed for data that does not fit well into the fixed tabular format of relational databases. Typical NoSQL databases can handle very large amounts of data, expose a simple API, replicate data easily, have a flexible or nearly schema-free model, and are eventually consistent rather than strictly consistent.

NoSQL technologies are designed to be simple, to scale horizontally, and to give fine-grained control over availability. The data structures used in a NoSQL database differ from those used in relational databases, which can make some operations faster.

Cassandra Characteristics:

  • It is a wide-column (column-family) database.
  • It is highly available, fault-tolerant, and scalable, with tunable consistency.
  • It was created at Facebook and was later open sourced.
  • The data model is based on Google Bigtable.
  • The distributed design is based on Amazon Dynamo.

Why Cassandra?

Cassandra is a robust NoSQL database deployed by some of the largest companies, such as Facebook, Netflix, Twitter, Cisco, and eBay. The following features make it stand out:

Support for a Wide Set of Data Structures: Cassandra supports structured, semi-structured, and unstructured data, and it allows the data structures to change dynamically as requirements change.

Linearly Scalable Architecture: A cluster can be grown simply by adding nodes. Throughput increases roughly linearly with the number of nodes, while response times stay low.

Seamless Distribution: Cassandra lets you distribute your data over multiple data centers through data replication.
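As a rough illustration of replication across data centers, a keyspace can declare how many replicas each data center should hold. This is only a sketch: the keyspace name demo_multi_dc and the data center names dc1 and dc2 are hypothetical and would have to match the names configured in an actual cluster.

```cql
-- Hypothetical keyspace replicated across two data centers.
-- dc1 and dc2 are placeholder data center names (assumptions).
CREATE KEYSPACE IF NOT EXISTS demo_multi_dc
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'dc1': 3,   -- three replicas in the first data center
    'dc2': 2    -- two replicas in the second data center
  };
```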

High Reliability: Cassandra is built to tolerate the failure of nodes in the cluster with little impact on the application, because it has no single point of failure. This is an essential feature for mission-critical applications.


Support for ACID-like Guarantees: Cassandra does not offer full ACID (atomicity, consistency, isolation, and durability) transactions in the way an RDBMS does. Instead, writes are atomic and isolated within a single partition and durable through the commit log, and consistency is tunable per operation.

Aim:

To install and set up Cassandra on Windows, create a keyspace, create a table in that keyspace, insert data into the table, and run the five given queries on it.

Tools:

  • Cassandra
  • Windows
  • Command prompt

Tasks:

Task1:

Start the Cassandra shell (cqlsh) and create a keyspace with a replication factor of 3. A keyspace in Cassandra is a namespace that defines how data is replicated across nodes; a cluster can contain multiple keyspaces.

The created keyspace was then verified with a DESCRIBE KEYSPACE query.

The created keyspace is used for table creation and the later steps. The query to switch to it is: USE keyspacename2;
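The Task 1 steps can be sketched in cqlsh roughly as follows. The keyspace name keyspacename2 and the replication factor of 3 come from the text; the SimpleStrategy replication class is an assumption, since the original command is not reproduced here.

```cql
-- Create the keyspace (SimpleStrategy is an assumed replication class).
CREATE KEYSPACE IF NOT EXISTS keyspacename2
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

-- Verify the keyspace definition (cqlsh command).
DESCRIBE KEYSPACE keyspacename2;

-- Make it the default keyspace for the following statements.
USE keyspacename2;
```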

Task2:

Created a table employees2 with columns matching the data in the given CSV file.

Verified the created table with a SELECT query.
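A minimal sketch of the table creation and check might look like this. The column names are inferred from the queries in Task 3; the column types and the choice of empid as the primary key are assumptions, because the CSV layout is not shown here.

```cql
-- Assumed schema: names inferred from the Task 3 queries, types are guesses.
CREATE TABLE IF NOT EXISTS employees2 (
  empid    int PRIMARY KEY,  -- assumed unique employee identifier
  ename    text,
  jobtitle text,
  hiredate date,
  salary   int,
  manager  int
);

-- Verify that the table exists (returns no rows until data is loaded).
SELECT * FROM employees2;
```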

Loaded the data from the given employees CSV file into the table using the COPY command with HEADER = TRUE and '|' as the delimiter.
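The load step can be sketched with the cqlsh COPY command as below. The file name employees.csv and the column list are assumptions; the column order must match the order of the fields in the actual file, and a date column such as hiredate is assumed to be stored in yyyy-mm-dd form.

```cql
-- cqlsh command: bulk-load a pipe-delimited CSV file that has a header row.
-- 'employees.csv' is a placeholder path; the column list is assumed.
COPY employees2 (empid, ename, jobtitle, hiredate, salary, manager)
  FROM 'employees.csv'
  WITH HEADER = TRUE AND DELIMITER = '|';
```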

Task3:

Queries to be executed on given data:

  • Query 1: List the empID, ename, jobtitle, and hiredate of each employee from the employee table.

  • Query 2: List the name and salary of the employees who are clerks.

We first checked manually that at least one employee has the clerk role with select * from employees1, and then ran the actual query.
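Queries 1 and 2 could be written roughly as follows. This sketch runs against employees2, the table created in Task 2, and assumes the columns empid, ename, jobtitle, hiredate, and salary, and that the job title is stored as the lowercase string 'clerk'. Because jobtitle is assumed not to be part of the primary key, the filter needs ALLOW FILTERING (or a secondary index).

```cql
-- Query 1: selected columns for every employee.
SELECT empid, ename, jobtitle, hiredate FROM employees2;

-- Query 2: name and salary of clerks.
-- Filtering on a non-key column requires ALLOW FILTERING.
SELECT ename, salary
  FROM employees2
  WHERE jobtitle = 'clerk'
  ALLOW FILTERING;
```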

  • Query 3: List the name, job, and salary of every employee who joined on 'February 18, 2000'.

  • Query 4: List the name and annual salary of all the employees.

  • Query 5: Display the names, salary, and manager values of those employees whose salary is 45000 from the EMP table using a SELECT statement.
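Queries 3 to 5 might be sketched as below, again against the assumed employees2 schema (ename, jobtitle, hiredate as a date, salary, manager). Two further assumptions apply: the salary column is taken to hold a monthly figure, and the multiplication in Query 4 relies on arithmetic operators in the SELECT clause, which are available in Cassandra 4.0 and later. On earlier releases the annual figure would have to be computed on the client side.

```cql
-- Query 3: employees hired on 18 February 2000 (non-key filter).
SELECT ename, jobtitle, salary
  FROM employees2
  WHERE hiredate = '2000-02-18'
  ALLOW FILTERING;

-- Query 4: annual salary, assuming salary is monthly (Cassandra 4.0+).
SELECT ename, salary * 12 AS annual_salary FROM employees2;

-- Query 5: employees whose salary is exactly 45000 (non-key filter).
SELECT ename, salary, manager
  FROM employees2
  WHERE salary = 45000
  ALLOW FILTERING;
```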

Drawbacks:

  • Cassandra does not provide full ACID transactions or relational features. If you have a strong requirement for ACID properties, Cassandra is not a good fit.
  • Cassandra offers only limited support for aggregates. If you need to do a lot of them, consider another database.
  • There is no join or subquery support. Workarounds exist, but they can hurt performance and add overhead.
  • Data is modeled around queries rather than around its structure, so the same data is often stored multiple times.
  • Reads are slower than writes. Cassandra was optimized from the beginning for fast writes; reads were less of a concern at first, but that changed as more use cases were considered.
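To illustrate the point about modeling data around queries, a second table can store the same employee data keyed by job title, so that the "clerks" lookup becomes a single-partition read. The table name employees_by_jobtitle and its columns are hypothetical and only meant as a sketch of the technique.

```cql
-- Hypothetical query-specific table: the same data, keyed by job title.
CREATE TABLE IF NOT EXISTS employees_by_jobtitle (
  jobtitle text,   -- partition key: one partition per job title
  empid    int,    -- clustering column: keeps rows unique within a title
  ename    text,
  salary   int,
  PRIMARY KEY ((jobtitle), empid)
);

-- The lookup now hits one partition and needs no ALLOW FILTERING.
SELECT ename, salary
  FROM employees_by_jobtitle
  WHERE jobtitle = 'clerk';
```

The trade-off is that every write has to update both tables, which is the duplication described in the list above.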

Conclusion:

Apache Cassandra is well suited to large-scale applications that need to access huge volumes of unstructured data. That said, it can still be a good choice for smaller applications, as it delivers a high level of data protection out of the box.
