Awk - gregorymorrison/euler1 GitHub Wiki

Awk is a mini language meant for text processing on Unix. It debuted in 1977, and was designed for small-but-powerful single-line uses, though it has the ability to extend it with C-style logic. Its design is pretty simple, with three sections to organize your code - a BEGIN setup, a body, and an END teardown. In my first attempt at Euler1 and first program with Awk, I found you could simply copy C idioms verbatim. Of course, this is very un-Awk-like, but it works and illustrates structure:

#! /usr/bin/awk -f
# Euler1 in AWK

BEGIN {sum = 0}
{
    for (i = 1; i<1000; i++) {
        if (i%3==0 || i%5==0) {
            sum += i;
        }
    }
}
END {print "euler1 = " sum}

Awk scripts require input to process, so to run Euler1 we need to pipe it some dummy input. Awk iterates through the input and performs an action for each line, so make sure your dummy input only has one line. Awk takes this as program arguments which I then ignore and simply use C-style constructs to print the solution to the console. Here's the result:

$ echo . | ./euler1.awk
233168

Does Awk support user-defined functions? Yup. This is good, because we'll need it for recursion later. Here is the first example again, this time written using a standalone function. The order of your functions relative to awk's BEGIN/run/END sections doesn't matter, so we'll put it first:

#! /usr/bin/awk -f
# Euler1 in AWK

function euler(n) {
    sum = 0
    for (i = 1; i < n; i++) {
        if (i%3==0 || i%5==0) {
            sum += i;
        }
    }
    return sum
}

BEGIN { print "euler1 = " euler(1000) }

Note that this time we put our call in the BEGIN section. This work is done before awk iterates through what is passed in, and there is no awk iteration here. So we do not pipe echo nonsense to awk before running it - as a matter of fact, if we did, it would lock awk in an endless loop. Likewise, if we put our code in awk's middle section and did not pass it input, it would also lock up. So this time we must run it as a standard executable:

$ ./euler1.awk
233168

Here's a functional version that uses tail recursion with an accumulator. One of the main points here is that the function euler() is deterministic - it will always return the same output for a given input. This is accomplished in part by the absence of any variable manipulation - there are instead only functions which return values. The other main point is that this recursion uses tail call optimization - it's written in such a way that an intelligent compiler can optimize its stack usage to be O(n) instead of O(n!). In English, this means that your program will probably not crash even for hugely recursive calls.

#! /usr/bin/awk -f
# Euler1 in AWK

function euler(n, acc) {
    if (n == 1) {
        return acc;
    } else if (n%3==0 || n%5==0) {
        return euler(n-1, acc+n);
    } else {
        return euler(n-1, acc);
    }
}

function euler1(n) {
    return euler(n, 0);
}

BEGIN { print "euler1 = " euler1(999) }

Here's another version using a Quicksort-based algorithm. Here we recursively break the list up in half and then reassemble it. Instead of sorting each half, though, we’ll filter the pivot value, converting it to 0 if it’s not divisible. Here, euler() returns euler() calculated on the half of the list before the pivot element, euler() calculated on the half of the list after the pivot element, and the pivot element or 0, all added together:

#! /usr/bin/gawk -f
# Euler1 in AWK

function range(size, ints,     i) {
    for (i=1; i<=size; i++)
        ints[i] = i;
}

function slice(_ints, start, end, sliced,     i, _sliced) {
    for (i in _ints) {
        if (int(i) >= int(start) && int(i) <= int(end)) {
            _sliced[i] = i;
        } else if (int(i) > int(end)) {
            break;
        }
    }
    delete sliced;
    for (i in _sliced)
        sliced[i] = i;
}

function euler(ints,     pivot, pivotVal, pre, post, first, last, i, j, postInts) {
    if (length(ints) == 0)
        return 0;

    j = 0;
    for (i in ints) {
        postInts[i] = i;
        if (int(first) == 0) {
            first = i;
        }	
        last = i;
        if (j == int(length(ints)/2))
            pivot = i;
        j += 1;
    }

    pivotVal = 0;
    if (ints[pivot]%3==0 || ints[pivot]%5==0) {
        pivotVal = ints[pivot];
    }

    slice(ints, first, pivot-1, sliced);
    if (length(sliced) > 0) {
        pre = euler(sliced);
    }

    slice(postInts, pivot+1, last, sliced);
    if (length(sliced) > 0) {
        post = euler(sliced);
    }

    return pre + pivotVal + post;
}

function euler1(size) {
    range(size, ints);
    return euler(ints);
}

BEGIN {
    print "euler1 = " euler1(999)
}

I'm not gonna lie - it took me an entire day to write this frustrating version of QuickSort. There's a lot to unpack here. Awk does have some nice built-in functions, but it doesn't seem to have range or slice functions, so I rolled my own. And arrays don't have discoverable indices - there's no first or last functions. Arrays are basically just associative maps, so the only way you can find what's in them is by iterating through them. Luckily, awk does have an array iterator. So to find the first and last elements I need for pre and post, I have to iterate through the map to discover them.

I got really tripped up by awk's idea of namespaces. Notice the space between ints and pivot in the declaration of euler() here? And note that the call to euler() only passes one variable, ints? Here, ints is a parameter passed in, while the others are local variable declarations. That's right - local variables are declared in the function definition. Any variables that are not declared here are considered global. It took me a while to figure out why the recursive calls kept writing over the values in earlier stackframes - because they're global, that's why! Dammit awk, I wasted a day on this malarkey. This illustrates the benefit of my functional recursion example above - since everything was immutable, I didn't get bitten by this, er, design flaw?

Also, awk doesn't allow you to return array types - the only way to return an array is by passing it in by reference. So, my implementation of slice returns the array of sliced ints as an empty parameter passed in and a filled parameter passed out. Notice that slice() doesn't have a return statement - it's pass-by-reference only. But this means that the parameter sliced is not protected from being overwritten - since only i and _sliced here are local - the rest are global. So I had to do this weird trick where I create a local array of ints - _sliced, then delete the param sliced, then reassign _sliced to sliced. But you can't assign one array to another, so I had to iterate through _sliced and add each one all over to sliced. Likewise, I had to protect the ints in euler() from being overwritten by the call to _eluer() by pre before post got to use them. So I had to make a local copy of ints, here stored in postInts. Whew.

Here’s one more – an elegant algorithm based on an observation by little Carl Friedrich Gauss. It operates in O(1) constant time. Don’t sweat it if this seems inscrutable; click the blog link above for an explanation:

#! /usr/bin/awk -f
# Euler1 in AWK

function mySum (n, size) {
    return (int(size/n)^2 + int(size/n)) * n / 2;
}

function euler(size) {
    return mySum(3, size) + mySum(5, size) - mySum(15, size);
}

BEGIN { print "euler1 = " euler(999) }

Although awk allows long functions, it's really made for quick one-liners. Let's look at a more idiomatic version. Here I feed it a list of integers generated by bash's seq function - 'seq 999'. Awk evaluates some output for each element in this list, and uses it to update a global variable, sum. Finally in Awk's teardown section, it outputs the result. Note here that since we are running seq and awk here with a pipe, I've changed the shebang to run bin/bash instead of awk:

#! /bin/bash
# Euler1 in AWK

seq 999 | awk '{if ($1%3==0 || $1%5==0) sum+=$1} END{print sum}'

My initial impression of this language was, 'Why do I need to learn yet another scripting language?' I had thought that Awk was just another obsolete language. And yet, this last example turned out to be the most compact version of Euler1 so far - only 57 bytes! It took me a couple of hours to write this, having no previous knowledge of Awk. 

I've been very impressed with the power I see here, and I look forward to rocking it old-skool with Awk in the future.