Home - achaora/mongodb GitHub Wiki

# Assessment of Distributed MongoDB Using Medicare Suppliers Data

## Introduction

This report assesses the performance of a scalable, distributed MongoDB database. The assessment involved running the same performance tests on a distributed environment and on a standalone MongoDB database configured in a similar query-router-and-database-server set-up. National data from Data.gov on healthcare suppliers and Medicare reimbursements was used for the assessment. FutureSystems’ infrastructure-as-a-service environment was employed for provisioning the necessary server virtual machines (VMs). A GitHub repository with instructions and the code referenced in this report is accessible here.

## Motivation

The assessment had two goals:

a. establish whether use of a distributed database yields performance gains beyond those attainable through a standalone set-up.

b. compare the performance of a standard aggregate query versus a MapReduce query.

## Method

### Infrastructure Setup

The analysis was conducted on the India Cluster, a 128-node, 11-teraflop IBM iDataPlex cluster in the FutureSystems environment at the Digital Science Center in the School of Informatics and Computing at Indiana University.

Five Ubuntu Linux 14.04 VMs of the ‘medium’ flavor (4,096 MB RAM; 40 GB secondary storage) were used in two configurations: a query-router (mongos) accessing a standalone MongoDB database server, and the same query-router accessing a distributed (sharded) MongoDB server environment. A query-router, while not required for the standalone set-up, was included in the control to balance out any lag in query performance attributable to network connectivity between the router VM and the server VMs housing the data. Fig 1.1 below shows a schematic of the set-ups, with detailed how-to information available in the ‘Instructions’ section of the associated project GitHub repository.

Setup Diagram
Figure 1.1

### Performance Evaluation

The steps for evaluating performance involved chunking and iteratively importing the suppliers-Medicare reimbursements data, while running performance tests on the environments to collect metrics at different dataset sizes. At the core of the performance tests were timed queries (appended) that cycle through data on all US healthcare providers that submitted Medicare claims in 2012 and establish, per state, the average Medicare claim amount in dollar terms and the average reimbursement paid out through the federal program. The scripts executed three query runs per test, and calculated the average result of the runs before writing it to a collection in the queried database and to a text file in the folder containing the programs.

The timed query runs consisted of standard aggregate queries on MongoDB and a query using MongoDB’s MapReduce functionality. While query-execution time can be read more accurately from system collections for aggregate queries, no similar support is provided for MapReduce queries. To make the timing method uniform across the two query types, server timestamps taken before and after query execution were therefore used to determine the durations of the runs.
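The before/after timestamp method and the averaging over three runs can be sketched in Python as follows. This is a simplified illustration, not the repository's actual script: `run_query` is a hypothetical placeholder standing in for the real aggregate or MapReduce call issued against MongoDB.

```python
import time

def timed_runs(run_query, runs=3):
    """Time `run_query` and return the average duration over `runs` runs.

    `run_query` is a placeholder for the actual aggregate or MapReduce
    call; a timestamp is taken immediately before and after each run,
    mirroring the server-timestamp method described above.
    """
    durations = []
    for _ in range(runs):
        start = time.time()   # timestamp before query execution
        run_query()
        end = time.time()     # timestamp after query execution
        durations.append(end - start)
    return sum(durations) / len(durations)

# Example with a trivial stand-in workload:
avg_duration = timed_runs(lambda: sum(range(100000)))
print(f"average duration over 3 runs: {avg_duration:.6f} s")
```

The averaged figure is what the scripts write out to the results collection and to the text file.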

Comparisons of the performance results were then conducted; the results are presented below.

The Python programming language was used to develop the programs for chunking, importing, and running performance tests on the set-ups. The scripts, configuration files, and instructions on how to use them are accessible through the project GitHub repository.
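The chunk-and-import idea can be illustrated with a minimal Python sketch. This is not the repository's actual chunking script: the `dest_template` naming scheme and `rows_per_chunk` parameter are assumptions for illustration, and the real scripts operate on the Medicare suppliers file.

```python
import csv

def chunk_csv(src_path, rows_per_chunk, dest_template="chunk_{:03d}.csv"):
    """Split a large CSV into smaller files for iterative import.

    Each chunk file repeats the header row so it can be imported on
    its own (e.g. with mongoimport). Returns the chunk file paths.
    """
    def write_chunk(index, header, rows):
        path = dest_template.format(index)
        with open(path, "w", newline="") as out:
            writer = csv.writer(out)
            writer.writerow(header)
            writer.writerows(rows)
        return path

    paths = []
    with open(src_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        chunk, index = [], 0
        for row in reader:
            chunk.append(row)
            if len(chunk) == rows_per_chunk:
                paths.append(write_chunk(index, header, chunk))
                chunk, index = [], index + 1
        if chunk:  # remaining rows that do not fill a whole chunk
            paths.append(write_chunk(index, header, chunk))
    return paths
```

Importing one chunk at a time is what allows the performance tests to be run at successively larger dataset sizes.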
## Results

Fig 1.2 and Table 1.1 show the results.

Results Diagram Figure 1.2

Results Table Table 1.1

Findings of note were as follows:

- The MapReduce query was increasingly slower than the aggregate query on both the standalone and the distributed environments. The rate of this increase nevertheless generally decreased with the size of the dataset, dropping by an order of magnitude beyond 1,830,212 documents.

- The aggregate query on the standalone VM performed better than on the distributed environment at the lower dataset sizes of 915,039 and 1,830,212 documents.

- Beyond 1,830,212 documents, up to the full size of the dataset (9,153,271 documents), the aggregate query performed better on the distributed environment than on the standalone one.

## Discussion

The MapReduce query performing increasingly slower than the aggregate query on both environments can be explained by its need to write to temporary storage on disk in the map step: MongoDB adds an extra collection to the database for this processing. As the dataset grows larger, more data has to be written to and read from disk, which explains the lower performance compared to in-memory calculations. In theory, beyond the size of the dataset used for this evaluation, the aggregate queries will fail to yield results once the limits of the VMs’ random access memory (RAM) are reached. The MapReduce query may still function at that stage, because it writes and reads intermediary calculations to and from disk.

The better performance of the aggregate query on the standalone set-up versus the distributed one at the lower dataset sizes of 915,039 and 1,830,212 documents can be explained by the overhead the distributed set-up likely incurs in querying data spread across shards. As the dataset grows, the gains of using resources on three VMs rather than one appear to kick in, resulting in better performance on the distributed environment.

## Conclusion

For larger datasets, distributed MongoDB databases perform better on aggregate queries than databases on standalone machines. MapReduce queries are slower in execution time than standard aggregate queries, but they may continue to return results at dataset sizes where standard aggregate queries fail due to RAM constraints.

## Appendix

#### Queries underlying the performance scripts

Aggregate Query:

```javascript
db.supplier.aggregate([
  {
    $group: {
      _id: "$nppes_provider_state",
      avg_claim: {$avg: "$average_submitted_chrg_amt"},
      avg_payment: {$avg: "$average_Medicare_payment_amt"}
    }
  }
])
```

MapReduce Query:

```javascript
// Map: emit a per-document value keyed on the provider's state.
var mapFunction = function() {
  var key = this.nppes_provider_state;
  var value = {
    count: 1,
    claim: this.average_submitted_chrg_amt,
    payment: this.average_Medicare_payment_amt
  };
  emit(key, value);
};

// Reduce: sum the counts, claim amounts, and payment amounts per state.
var reduceFunction = function(keyState, countStVals) {
  var reduceVal = {count: 0, claim: 0, payment: 0};
  for (var provider = 0; provider < countStVals.length; provider++) {
    reduceVal.count += countStVals[provider].count;
    reduceVal.claim += countStVals[provider].claim;
    reduceVal.payment += countStVals[provider].payment;
  }
  return reduceVal;
};

// Finalize: divide the sums by the counts to yield per-state averages.
var finalizeFunction = function(keyState, reduceVal) {
  reduceVal.avg = {
    claim: reduceVal.claim / reduceVal.count,
    payment: reduceVal.payment / reduceVal.count
  };
  return reduceVal;
};

db.supplier.mapReduce(
  mapFunction,
  reduceFunction,
  {
    out: {merge: "map_reduce_suppliers_states"},
    finalize: finalizeFunction
  }
)
```
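The map/reduce/finalize pipeline above can be emulated in plain Python to check that it computes the same per-state averages as the aggregate query. The sample documents below are illustrative only, not drawn from the actual dataset; the field names match those the queries touch.

```python
from collections import defaultdict

# Illustrative documents mirroring the fields used by the queries.
docs = [
    {"nppes_provider_state": "IN", "average_submitted_chrg_amt": 100.0,
     "average_Medicare_payment_amt": 40.0},
    {"nppes_provider_state": "IN", "average_submitted_chrg_amt": 300.0,
     "average_Medicare_payment_amt": 60.0},
    {"nppes_provider_state": "OH", "average_submitted_chrg_amt": 200.0,
     "average_Medicare_payment_amt": 80.0},
]

# Map step: emit (state, {count, claim, payment}) per document.
emitted = defaultdict(list)
for doc in docs:
    emitted[doc["nppes_provider_state"]].append({
        "count": 1,
        "claim": doc["average_submitted_chrg_amt"],
        "payment": doc["average_Medicare_payment_amt"],
    })

# Reduce step: sum the emitted values per state.
reduced = {}
for state, values in emitted.items():
    acc = {"count": 0, "claim": 0.0, "payment": 0.0}
    for v in values:
        acc["count"] += v["count"]
        acc["claim"] += v["claim"]
        acc["payment"] += v["payment"]
    reduced[state] = acc

# Finalize step: divide the sums by the counts to get the averages.
for state, acc in reduced.items():
    acc["avg"] = {"claim": acc["claim"] / acc["count"],
                  "payment": acc["payment"] / acc["count"]}

print(reduced["IN"]["avg"])  # {'claim': 200.0, 'payment': 50.0}
```

For the two IN documents this yields an average claim of 200.0 and an average payment of 50.0, matching what the aggregate query's `$avg` operator would produce for that state.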

#### Processing Screenshots

performance_tester.py executing performance tests
