Bucket Sort - Eishaaya/GMRDataStructuresTest GitHub Wiki

Bucket Sort is a pretty simple divide and conquer algorithm, a type of algorithm which, as the name suggests, divides a problem into multiple smaller sub-problems and solves them individually. The last algorithm you made that falls under this category was probably Merge Sort. However, Bucket Sort varies vastly in execution: instead of recursively cutting the input in half, it separates the inputs into a set number of ordered buckets based on their values, then sorts each bucket individually. Once every bucket is sorted, we can just move the data back into the original array to complete the sort.

Implementation Guide

Simple Bucket Sort

Parameter: int[] or List<int>

Returns: void

In our example we're going to use four buckets to keep things easy to follow, but generally you want to aim for a 1:1 ratio between buckets and items.


So, first up, we need to move every item to its corresponding bucket. Remember to account for negatives and compress the range, just like you did in Counting Sort.
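To make this concrete, here's a minimal C# sketch of the scatter step (the class, method names, and signatures are illustrative, not the repo's actual code). It shifts every value by the minimum so negatives map to valid buckets, then scales by the range so the largest value still lands in the last bucket:

```csharp
using System.Collections.Generic;
using System.Linq;

static class BucketSortSketch
{
    // Scatter: move every value into its bucket.
    // The bucket index is (value - min) scaled onto [0, bucketCount).
    public static List<int>[] Scatter(int[] values, int bucketCount)
    {
        var buckets = new List<int>[bucketCount];
        for (int i = 0; i < bucketCount; i++)
            buckets[i] = new List<int>();

        // Compress the range just like in counting sort: subtracting min handles
        // negatives, and dividing by the full range keeps every index in bounds.
        int min = values.Min();
        int max = values.Max();
        long range = (long)max - min + 1;

        foreach (int value in values)
        {
            int index = (int)(((long)value - min) * bucketCount / range);
            buckets[index].Add(value);
        }

        return buckets;
    }
}
```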


Then we sort all the buckets individually; this is usually done with Insertion Sort. Yes, that insertion sort from ages ago. Insertion sort, despite being O(n²), is actually pretty fast when you don't have to bubble a lot. Since our buckets are pretty small, insertion sort is a great choice!
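If you need a refresher, here's a hedged sketch of insertion sort over a single bucket; adapt it to whatever version you wrote before. It's meant to live in the same illustrative class as the scatter snippet above:

```csharp
// Sorts one bucket in place with insertion sort.
public static void InsertionSort(List<int> bucket)
{
    for (int i = 1; i < bucket.Count; i++)
    {
        int current = bucket[i];
        int j = i - 1;

        // Shift larger items right until we find where current belongs.
        while (j >= 0 && bucket[j] > current)
        {
            bucket[j + 1] = bucket[j];
            j--;
        }

        bucket[j + 1] = current;
    }
}
```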


Finally, we just move the data back into the input array, and we're done.
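Putting the pieces together, a sketch of the whole simple version might look like this, assuming the hypothetical Scatter and InsertionSort helpers from the snippets above:

```csharp
// Simple bucket sort: scatter, sort each bucket, then gather back in order.
public static void BucketSort(int[] values)
{
    if (values.Length == 0) return;

    // Roughly a 1:1 ratio of buckets to items, per the guideline above.
    var buckets = Scatter(values, values.Length);

    int write = 0;
    foreach (var bucket in buckets)
    {
        InsertionSort(bucket);

        // Buckets are visited in order, so copying them back
        // one after another leaves the array fully sorted.
        foreach (int value in bucket)
            values[write++] = value;
    }
}
```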


Improvements

  • Cache efficiency:

You may already know about cache, RAM & registers at this point, but to quickly summarize: the CPU has a handful of tiny, extremely fast registers for the values it's actively working on, while everything else, your variables, arrays, and whatever happens to be currently running, lives in memory.

RAM is that main memory: basically one big physical array of storage cells. L3, L2 & L1 cache sit between RAM and the CPU; each level is smaller and faster than the last, and they hold onto frequently accessed data.

For example, if you grab the first item in an array, you probably want more of the array, so a chunk of it gets loaded into cache. Next time you access that array, the CPU finds it in one of the caches and the access is incredibly fast. This is called a cache hit. However, as more data is cached, that array gets pushed down through the cache levels, and if it's left unused for long enough it eventually has to be fetched from RAM again. This is called a cache miss.

So, when we sort each bucket individually, each bucket has to be loaded into cache individually, and we lose most of the benefit of cached data unless we compress everything into a few large buckets, which we don't want. Instead, after moving the data into their buckets, we'll dump them straight back into the array without sorting. Since each bucket only holds values from its own slice of the range, the array comes out mostly sorted. So, when we run insertion sort over the entire array, we won't be bubbling any more than we would've in each bucket individually. This gives a faster sort: it's the same number of operations, but because the array is contiguous, we'll be running on cached data!
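As a sketch of this improvement (again assuming the hypothetical Scatter helper from earlier, not the repo's actual code), the only change is that buckets are dumped back unsorted and a single insertion sort runs over the contiguous array:

```csharp
// Cache-friendlier variant: scatter, gather unsorted, then one insertion sort.
public static void CacheFriendlyBucketSort(int[] values)
{
    if (values.Length == 0) return;

    var buckets = Scatter(values, values.Length);

    // Dump each bucket back without sorting it; the array is now mostly sorted
    // because the buckets themselves are written back in order.
    int write = 0;
    foreach (var bucket in buckets)
        foreach (int value in bucket)
            values[write++] = value;

    // One insertion sort over the whole array: cross-bucket pairs are already
    // in order, so we only bubble as much as we would have inside each bucket,
    // but now we're working on one contiguous, cached array.
    for (int i = 1; i < values.Length; i++)
    {
        int current = values[i];
        int j = i - 1;
        while (j >= 0 && values[j] > current)
        {
            values[j + 1] = values[j];
            j--;
        }
        values[j + 1] = current;
    }
}
```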

<--Previous | Next-->
