Analysis: Graph input ordering - pgRouting/pgrouting GitHub Wiki

Introduction

The PostgreSQL rows are a set, so the output of any algorithm should not depend on the ordering of the input rows.
Moreover, a function is a ONE to ONE mapping and not ONE to MANY, which is a relation.
However, on executing sample queries on the already-implemented function, it has been found that most of the pgRouting functions fail to satisfy this requirement.
The expected behavior is that the output of every algorithm should be the same, irrespective of the ordering of the input rows - edges_sql.

The problem arises probably because the pgRouting graph is created from the edges_sql rows "sequentially", therefore the set-like nature of the rows is lost when the graph gets created. So, for two different ordering of rows, the different graph gets created internally, and boost algorithm gives a different result for the different graph.

The solution to this problem is as follows:

Add a pgTAP test for every function that tests whether the same set of rows are returned irrespective of initial ordering. Add a todo wherever the tests fail.
Fix the code of the functions which fail the test, by creating a function insert_edges_sorted which first sorts the edges of the graph, in ascending order of id, source, target, and then inserts the edges. Therefore, the same graph will get created for any ordering of input rows.

The "Fast Fix" for this problem

Order the rows in a particular order before creating the graph, preferably in ascending order of the id column, and then in ascending order of the source column (if two or more rows have the same id) and then in ascending order of the target column (if two or more rows have the same id and source values).

The equivalent SQL statement for this ordering is:

ORDER BY id, source, target

So, the following function can be created in the include/cpp_common/pgr_base_graph.hpp file, and it can be called before inserting the edges in the graph, so that the edges always have a fixed ordering before the graph gets created:

void sort_edges(std::vector<T> &edges) {
    std::sort(edges.begin(), edges.end(),
        [](const T &lhs, const T &rhs) -> bool {
            return lhs.target < rhs.target;
        });
    std::stable_sort(edges.begin(), edges.end(),
        [](const T &lhs, const T &rhs) -> bool {
            return lhs.source < rhs.source;
        });
    std::stable_sort(edges.begin(), edges.end(),
        [](const T &lhs, const T &rhs) -> bool {
            return lhs.id < rhs.id;
        });
}

This C++ code does the sorting in the way mentioned above, equivalent to the SQL statement.

Stable sorts preserve the physical order of semantically equivalent values. Therefore, first, normal sorting is done on the target column, and then the stable sort is done on the source column (preserving the relative order of two equal values in the source column). Then, the stable sort is done on the id column (preserving the relative order of two equal values in the id column) to get the final result.

The pgTAP tests

The pgTAP tests are added per function, to check whether the same set of rows is always returned, for any ordering of the input rows.

E.g.: For the pgr_dijkstra function, a file dijkstra-rows-check.sql is created in the pgtap/dijkstra/ directory with the following tests:

The below explanation is mentioned for DIRECTED graphs. A similar test is performed for UNDIRECTED graphs.

-- Check whether the same set of rows is always returned

PREPARE expectedOutputDirected AS
SELECT * FROM pgr_dijkstra(
    'SELECT id, source, target, cost, reverse_cost
    FROM edge_table
    ORDER BY id',
    ARRAY[7, 2], ARRAY[3, 12]
);

PREPARE descendingOrderDirected AS
SELECT * FROM pgr_dijkstra(
    'SELECT id, source, target, cost, reverse_cost
    FROM edge_table
    ORDER BY id DESC',
    ARRAY[7, 2], ARRAY[3, 12]
);

PREPARE randomOrderDirected AS
SELECT * FROM pgr_dijkstra(
    'SELECT id, source, target, cost, reverse_cost
    FROM edge_table
    ORDER BY RANDOM()',
    ARRAY[7, 2], ARRAY[3, 12]
);

SELECT set_eq('expectedOutputDirected', 'descendingOrderDirected', '1: Should return same set of rows');
SELECT set_eq('expectedOutputDirected', 'randomOrderDirected', '2: Should return same set of rows');
SELECT set_eq('expectedOutputDirected', 'randomOrderDirected', '3: Should return same set of rows');
SELECT set_eq('expectedOutputDirected', 'randomOrderDirected', '4: Should return same set of rows');
SELECT set_eq('expectedOutputDirected', 'randomOrderDirected', '5: Should return same set of rows');

Then, in order to do the same test with edges having different cost, the following line is used to update the cost of the edges (according to the edge id):

UPDATE edge_table SET cost = cost + 0.001 * id * id, reverse_cost = reverse_cost + 0.001 * id * id;

and then all the 5 pgTAP tests are performed to do a simple set comparison of your result sets.

In the pgTAP tests mentioned above, three prepared statements are made with the rows ordered:

As expected (In increasing order of the edge id)
In decreasing order of the edge id
Randomly

Then, a single test is done to check whether (1) and (2) return the same set of rows, i.e., checking the result with the graph having ascending order of id and descending order of id. The last 4 tests check whether (1) and (3) return the same set of rows when the rows are ordered randomly.

These tests ensure that the OUTPUT will be the same regardless of any ordering of the INPUT (because the input rows are a set)

Why this isn't a problem?

This is because:

When the in/out edges of a vertex have different costs, then in general, it will not matter the ordering of the initial data. The resulting rows will be mostly the same irrespective of initial ordering.
The tests mainly fail when every edge has the same cost.

The sorting of the input rows has an issue that sorting takes time. It adds an extra n*log(n) term to the running time of the function. So, even though sorting isn't generally required, extra time will be taken by the algorithm to execute because of the sorting of the input rows.

As an example, if a user wants to route from his house to store and if there are 1 million rows for the whole country, the whole country will have to be loaded, and those million rows will get sorted before routing, which will add considerable time to the routing algorithm.

How a user can control which edge will be returned (when edges have the same costs)

The user can control this by ordering the rows in a particular way depending upon his requirement. The graph gets created in the sequential order of the rows. So, whichever row comes first, that edge will get added first in the graph. So, based on the algorithm, the user can determine the output and can order the edges_sql rows initially for a particular choice of an edge in the output.