# Performance Metrics for v1.2.3
*Jusas/WatneyAstrometry GitHub Wiki*
See: Performance upgrades v1.2.3 - v2.0.0
The solver is mostly dependent on the CPU. Memory consumption is moderate, depending on the image size. Disk IO is also required, as the catalog star quad data is constantly read from the database files. As many kinds of different devices are used in astrophotography, your mileage may vary. Generally though, the beefier the CPU, the faster the solves will be. It's tough to say whether the disk IO becomes a bottleneck with SD cards, as I did not have the hardware to test that out properly.
I ran some simple tests to get an idea of the performance, using three different devices/setups I have:
- Desktop PC (x64): Ryzen 3800X @ 3.9 GHz 8-core CPU, 32 GB memory, SATA SSD, Windows 10
- Udoo x86 PC (x64): Intel Pentium N3710 @ 1.6 GHz quad-core CPU, 8 GB memory, eMMC storage, Ubuntu Linux
- Raspberry Pi 4B (arm64): Broadcom BCM2711, quad-core Cortex-A72 (ARM v8) 64-bit SoC @ 1.5 GHz, 8 GB memory, Kingston 256 GB microSD card UHS-I (U3), Ubuntu Mate Linux
The benchmarking was performed using the CLI solver app, with the --benchmark switch.
A source material of 6 images was prepared:
| Image | Size (px) | Field radius (°) | Star density |
|---|---|---|---|
| m31.jpg | 4479x3375 | 1.87 | High |
| m81.png | 1128x834 | 0.46 | Low |
| ic1795.fits | 4656x3520 | 0.47 | Medium |
| m106.fits | 4656x3520 | 1.95 | Low |
| ngc7331.fits | 1392x1040 | 0.76 | Medium |
| veil1.fits | 4656x3520 | 1.94 | High |
The test image package can be downloaded here.
The material represents a wide range of properties and file formats, which allows us to make some interesting observations. Testing was performed with both the nearby and blind solve modes. Quad sampling was used in blind solves, but was ignored for nearby solves, as the performance differences are relatively insignificant when the number of search iterations is low.
Another thing to keep in mind is that there is a little overhead in running a .NET 6 self-contained program. Not much, but it is mostly noticeable in startup time, and since the program is invoked to solve a single image, the overhead is always present.
The nearby solve strategy is quite simple: given a center coordinate, a search area radius, and an assumed field radius (or an estimated low-high field radius range), we first try to solve using the center point. If that fails, we create search areas around the center point that cover the given search area radius and run the solve on each of them. Usually, if the coordinate is a good estimate and the image is good, only a few search iterations are required.
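The search-area generation can be sketched roughly as follows. This is a minimal illustration, not Watney's actual implementation: the function name, the ring spacing, and the flat-sky approximation are all assumptions made for clarity.

```python
import math

def nearby_search_areas(center_ra, center_dec, search_radius, field_radius):
    """Yield candidate search centers (in degrees): the given center first,
    then rings of points around it until the search radius is covered.
    Illustrative sketch only, not Watney's actual implementation."""
    yield (center_ra, center_dec)  # try the given estimate first
    ring = 1
    while ring * field_radius <= search_radius:
        r = ring * field_radius  # distance of this ring from the center
        # enough points per ring so that fields of view overlap
        n = max(6, math.ceil(2 * math.pi * r / field_radius))
        for i in range(n):
            angle = 2 * math.pi * i / n
            # crude small-angle approximation, ignoring spherical geometry
            yield (center_ra + r * math.cos(angle) / max(math.cos(math.radians(center_dec)), 1e-6),
                   center_dec + r * math.sin(angle))
        ring += 1

# e.g. a 1.9 degree field with a 4 degree search radius around M31's coordinates
areas = list(nearby_search_areas(10.68, 41.27, search_radius=4.0, field_radius=1.9))
```

With a good coordinate estimate the very first yielded area (the center itself) usually solves, which is why nearby solves need so few iterations.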
As very few search iterations are made in nearby solves, on slower CPUs much of the CPU time is actually spent on the initial image reading and star detection rather than on running the solve algorithms.
This said, nearby solve performance isn't much of an issue. Generally either the solution is found very quickly, or not at all.
The first chart shows the nearby solve performance. All 6 images were solved with input coordinates offset by a few degrees, on each of the three devices, and the results of the 6 images were then averaged per device:

Note the breakdown of the time spent.
On the desktop PC, only ~500 ms was actually spent in the solving process; the majority of the time went to the image reading and star detection routines. Note that FITS is by far the fastest format to read, because the data is plain uncompressed bytes, and it barely registered in the time spent. JPEG and PNG clearly take significantly more time to read and actually contribute 99% of the read time to the average here.

All in all, the process is fast and usable on all of the test platforms.
Getting blind solving to be performant is a challenge. A lot of optimizations have already been made, but there is no doubt room for further improvement. On a performant desktop the performance is OK, but when we head into SoC CPU territory like the Udoo x86 and the RPi, we can see a drastic difference. The solver was run with these parameters:
```
watney-solve -i <image> --min-radius 0.5 --max-radius 8 --sampling <1, 2, 4, 8, 16>
```
The default value of 1 was used for both --lower-density-offset and --higher-density-offset, meaning one additional lower quad density pass and one additional higher density pass (passes are sets of star quads in the database, grouped by density). The default of 1 was selected as a compromise: it helps ensure matches are found in earlier iterations, while running more quad comparisons also takes more time. In these 3-pass solver runs there are roughly 3 times as many quads to run comparisons against, which makes the run slower but also makes finding a match more likely.
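As a rough illustration of how the density offsets widen the search, here is a hypothetical sketch. The helper name and the pass numbering are made up for illustration; the defaults of 1 and 1 mirror the solver defaults described above.

```python
def passes_to_search(base_pass, lower_offset=1, higher_offset=1, n_passes=8):
    """Return the quad density pass indices to search: the pass matching
    the image's star density, plus `lower_offset` lower-density and
    `higher_offset` higher-density passes, clamped to the valid range.
    (Hypothetical numbering: 0 = sparsest pass, n_passes - 1 = densest.)"""
    lo = max(0, base_pass - lower_offset)
    hi = min(n_passes - 1, base_pass + higher_offset)
    return list(range(lo, hi + 1))

passes_to_search(3)  # with the defaults this is a 3-pass run: [2, 3, 4]
```

Raising either offset trades time for reliability: each extra pass adds another full set of quads to compare against.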
Sampling was tested using the values 1 (no sampling), 2, 4, 8 and 16. Sampling is a process that selects only a fraction of the quads formed from catalog stars and runs the comparison on them. For example, with sampling 2 the quads are split into 2 groups (1/2 and 2/2), and the comparison against the image star quads is run first with group 1, then with group 2. If a solution is found using group 1, we have just halved the number of needed comparisons, saving significant computing time.

The general observation can be made that the Udoo x86 is about 10 times slower than the Ryzen Desktop PC, and the Pi4B is pretty close to the Udoo x86 in performance. Previously the difference between the two was more noticeable, as there were more disk reads. The blind solve times are long on these SoC computers but a very significant performance boost was gained in version 1.1, dropping many blind solve times to a fraction of what they used to be.
The general rule is the higher the star density and the lower the field radius, the more time is spent in calculations.
Sampling steps in and reduces the times significantly. However, depending on the detected star count in the image and the field radius, there is a point where sampling actually starts to hamper us and increases the solve times.

With m81.png we have quite a situation: few stars were detected (189), and all of them are used for quad formation. The low field radius means we plow through the higher field radiuses first, making a lot of calculations basically for nothing, and the more time we spend on the higher field radiuses, the worse the total compute time gets. At sampling value 16, with very few quads to use for comparison, we may well not find a partial hit within our first groups, so we actually waste time trying to find one; our chances are lower because we didn't have enough stars to form enough quads. Imagine plowing through groups 1/16, 2/16, 3/16 ... until maybe at group 8/16 we find a partial hit that we can check and just barely get a match with the full catalog quads.
In the end we just waste time, and the overhead is bad enough that not using sampling at all would have produced results slightly faster. At sampling value 4, we seem to hit the sweet spot with this image.
The default "auto" sampling value is actually set to 4 because of these reasons (however for the CLI solver this is overridden from the configuration file watney-solve-config.yml, with a default value of 16. Change it there if this value does not suit your needs). It seems to be a somewhat reliable value, decreasing solve times significantly and should work well for the majority of cases.
Sampling is a method where, instead of fetching all the catalog star quads in range of our search area and comparing them to our image star quads, we only fetch a subset of the quads from the database and perform the comparisons on them. For example, with sampling set to 4, we take the first 25% of the quads the database has for that area and run our calculations on them, then move on to the next 25%, and so on. The idea is very simple: the fewer calculations we need to make, the faster the process. Since blind solves involve a heap of calculations, limiting the number of quads significantly drops the number of calculations made. This obviously comes at a cost: we are dropping a good chunk of our reference material and may not get a solution. Since we explore the full range of field radiuses, we may end up running calculations on small radiuses when our image's field radius is actually large, if the first 25% of the database quads is a full miss. This can actually increase the solve time, and is the reason why a high sampling value is not automatically better. Generally, the more stars you have in your image, the more reliably a higher sampling value can decrease the solve times.
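The grouping described above can be sketched like this. This is illustrative Python only: `matches_image` stands in for the real quad-comparison step, and the contiguous ~25% chunks are an assumption based on the description, not Watney's actual data layout.

```python
def sample_groups(catalog_quads, sampling):
    """Split an area's catalog quads into `sampling` contiguous groups,
    e.g. sampling=4 gives four groups of ~25% each."""
    size = -(-len(catalog_quads) // sampling)  # ceiling division
    return [catalog_quads[i * size:(i + 1) * size] for i in range(sampling)]

def solve_with_sampling(catalog_quads, matches_image, sampling=4):
    """Run the comparison group by group, stopping at the first hit;
    a hit in an early group skips most of the remaining comparisons."""
    for n, group in enumerate(sample_groups(catalog_quads, sampling), start=1):
        if any(matches_image(q) for q in group):
            return f"hit in group {n}/{sampling}"
    return None  # every group exhausted without a match

solve_with_sampling(list(range(100)), lambda q: q == 30)  # hit in group 2/4
```

If the match lands in group 1, only a quarter of the comparisons are ever run; if no group matches, we have paid for all the comparisons and still have no solution, which is exactly the trade-off described above.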
The process goes somewhat like this:
- For each search area defined by the search strategy, take the sampled number of quads from the database and run comparisons against the image quads. If enough quad matches are found, good: perform the preliminary solution calculation, and if it looks good, perform the improved solution calculation with the full database quad set and we have a solution. If, on the other hand, only a partial hit was found (one or more quads match), this is a potential area where a solution could be found using the full database quad set, so we mark it as such and move on for now.
- Periodically, after handling a small batch of search areas with the above logic, we inspect the partial hits by matching them against the full set of database quads to see if any of them actually produce a solution. If a solution is found, great! If not, we mark those areas as fully checked and continue.
This logic continues per database quad subset (1/4, 2/4, 3/4, 4/4) until either a solution has been found or all subsets have been checked. The sooner we find a partial hit that can be expanded into a full solve, the better.
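The two steps above could be sketched as follows. This is a simplified sketch under stated assumptions: `get_sampled_quads`, `try_match` and `full_check` are hypothetical stand-ins for the real database fetch, the sampled-quad comparison, and the full-quad-set verification, and the batch size is arbitrary.

```python
def blind_solve(search_areas, get_sampled_quads, try_match, full_check,
                sampling=4, batch_size=100):
    """Sketch of the subset / partial-hit flow described above (names are
    illustrative, not Watney's API). For each database subset, scan the
    search areas in batches; partial hits are re-checked against the full
    quad set after each batch."""
    partial_hits = []
    fully_checked = set()
    for subset in range(1, sampling + 1):          # 1/4, 2/4, 3/4, 4/4
        for start in range(0, len(search_areas), batch_size):
            for area in search_areas[start:start + batch_size]:
                if area in fully_checked:
                    continue                       # already expanded, skip
                result = try_match(area, get_sampled_quads(area, subset, sampling))
                if result == "solution":
                    return area                    # solved with sampled quads
                if result == "partial":
                    partial_hits.append(area)      # promising, check later
            # periodically expand the partial hits with the full quad set
            for area in partial_hits:
                if full_check(area):
                    return area                    # partial hit became a solve
                fully_checked.add(area)            # done, skip in later subsets
            partial_hits.clear()
    return None                                    # all subsets exhausted
```

The early return on a partial hit that survives the full check is where the speedup comes from: a lucky hit in subset 1/4 means the remaining three quarters of the database quads are never touched.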
This diagram shows the high level logic:
