meeting 2025 02 06 n315
Context
- I tried to push the CO vs. stellar comparison a bit further following our discussion from the other day, and I think I've landed on some nice results.
- On one hand, many of the worked examples I have come across caution against using Bayes factors/Bayesian evidence for model comparison, particularly when you're testing something like an "exact model" as we have here. They instead urge people to interpret confidence intervals on their parameters and judge consistency that way.
- Also, check out Figure 16.3 here, which seems to be essentially what I am recreating below: https://mspeekenbrink.github.io/sdam-book/ch-Bayes-factors.html#bayes-factors-for-general-linear-models:
- "In this case, the Bayes factor shows strong evidence (BF1>10) for a wide range of sensible values of r, and hence one might consider the test result quite robust."
- I've got some text in the paper reflecting this approach now, and I think it is essentially identical to the "gray band" we had been plotting in the past. This gives a posterior on our fitted line which, at the 68% confidence level, contains the one-to-one line, and thus we cannot rule out y = x as the true model.
- Alternatively, I think I have implemented the full Bayesian evidence calculation in dynesty using the evidences it outputs. There is a slight difference between my dynesty results and what LinMix returns, since they are marginally different algorithms, but the results are strongly consistent with one another. I've included some model diagnostics below, but the gist is:
- I've fit two models with dynesty:
- The first model is identical in spirit to our LinMix model, which simply fits for a slope, intercept, and intrinsic scatter of our relation.
- I've fit a second model with dynesty which only includes the intrinsic scatter as a parameter and fixes the slope and intercept to be 1 and 0 respectively.
- I then extract the cumulative evidence for each model and compare their ratio (technically the difference of their log-evidences), and I'm finding that the linear model has ≲3 times the evidence. From what I've read, an evidence ratio of less than 3 means the two models are consistent with one another, i.e. there is not sufficient evidence to prefer one over the other. (A minimal sketch of this two-model comparison is included after this list.)
- This consistency can be seen below in their respective 68% confidence regions, which significantly overlap:
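To make the comparison concrete, here is a minimal sketch of the two dynesty fits and the evidence difference. The data arrays, error bars, and prior ranges below are placeholders (toy values), not the actual measurements, and the real analysis code may differ in detail.

```python
# Sketch of the two-model dynesty comparison on toy data.
import numpy as np
import dynesty

rng = np.random.default_rng(42)
x = np.linspace(9.0, 11.0, 25)                     # toy "stellar" values
y = x + rng.normal(0.0, 0.2, size=x.size)          # toy "CO" values near 1:1
yerr = np.full_like(x, 0.1)                        # toy measurement errors


def loglike_free(theta):
    """Free model: slope m, intercept b, intrinsic scatter s."""
    m, b, s = theta
    var = yerr**2 + s**2
    return -0.5 * np.sum((y - (m * x + b))**2 / var + np.log(2 * np.pi * var))


def loglike_one_to_one(theta):
    """One-to-one model: slope and intercept fixed to (1, 0), scatter s free."""
    s = theta[0]
    var = yerr**2 + s**2
    return -0.5 * np.sum((y - x)**2 / var + np.log(2 * np.pi * var))


def ptform_free(u):
    """Map the unit cube to uniform priors (assumed ranges)."""
    return np.array([3.0 * u[0],           # slope in [0, 3]
                     -5.0 + 10.0 * u[1],   # intercept in [-5, 5]
                     u[2]])                # scatter in [0, 1]


def ptform_one_to_one(u):
    return np.array([u[0]])                # scatter in [0, 1]


samp_free = dynesty.NestedSampler(loglike_free, ptform_free, ndim=3)
samp_free.run_nested(print_progress=False)

samp_1to1 = dynesty.NestedSampler(loglike_one_to_one, ptform_one_to_one, ndim=1)
samp_1to1.run_nested(print_progress=False)

# Model comparison: the difference of the final log-evidences,
# i.e. the log of the evidence ratio.
dlnZ = samp_free.results.logz[-1] - samp_1to1.results.logz[-1]
print(f"ln(Z_free) - ln(Z_1to1) = {dlnZ:.2f}")
```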
Interpretation:
- I think it's much clearer to interpret the results in terms of the 68% credible regions on our parameters containing a slope of 1 and an intercept of 0, from which we say that our data cannot rule out a one-to-one relation. I'm also happy to include a summary of the analysis below (maybe even show the two bands on our final plot), which gets a bit more technical/specific to our Bayesian approach. On one hand, this is much more quantitative, but I think the results are already easily explained with our posteriors, so I could go either way.
Diagnostic Plots
- First, here are the two sets of posteriors for the two models:
- Note I've included (slope, intercept) = (1, 0) on this plot for comparison. These values are well within the 1-sigma region.
Free Model | One-to-One Model |
---|---|
images/250206/free_model_posteriors.png | images/250206/one_to_one_posteriors.png |
- Here's a comparison of the 68% confidence band for each of the models (a sketch of how such a band is computed from the posterior samples follows the table).
Posterior Predictive Check |
---|
images/250206/posterior_predictive_check.png |
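For reference, here is a hedged sketch of how a 68% band on the fitted line can be drawn from posterior samples. It continues the toy setup above (`samp_free` is the free-model sampler from the earlier sketch); the actual plotting code may differ.

```python
# 68% band on the free model's line from equally weighted posterior draws.
import numpy as np
from dynesty import utils as dyfunc

res = samp_free.results
weights = np.exp(res.logwt - res.logz[-1])              # importance weights
samples = dyfunc.resample_equal(res.samples, weights)   # equally weighted draws

x_grid = np.linspace(9.0, 11.0, 100)
lines = samples[:, 0][:, None] * x_grid[None, :] + samples[:, 1][:, None]

# 16th/84th percentiles of the predicted line at each x give the 68% band.
lo, hi = np.percentile(lines, [16, 84], axis=0)
covers_one_to_one = np.all((lo <= x_grid) & (x_grid <= hi))
print("1:1 line inside the 68% band everywhere:", covers_one_to_one)
```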
- And lastly, here are the traces of the (log) evidence as output by dynesty. The model comparison metric is the ratio of these two cumulative evidences:
Evidence |
---|
images/250206/evidence_trace.png |
Follow Up After Discussion with Emily:
- Emily and I exchanged a few emails discussing the Bayesian model comparison between the one-to-one and the free linear model. I have some additional tests here exploring that fit.
- In short, we wanted to see if our results depend strongly on how we treat the intrinsic scatter in the one-to-one model. In theory, the data should be consistent with this one-to-one model, and including the intrinsic scatter in the one-to-one model might be "too flexible" for our case here. We thought of a few ways to test this:
- First, I ran the same analysis as above, but I fixed the prior range on the intrinsic scatter for the one-to-one model to [0.282, 0.284], effectively forcing the model to have an intrinsic scatter of 0.283. The result is an essentially identical set of fits, but limiting the prior range to be so narrow does impact the resulting evidence calculation. This suggests to me that we should be careful reporting exact evidence numbers if they depend strongly on the prior range.
- I also ran a test where I excluded the intrinsic scatter entirely from the likelihood of the one-to-one model, but I think this is not the best approach. In this case, we're computing the likelihood for a VERY strict model (no scatter about the true one-to-one), so the resulting evidence is VERY low compared to the full linear model, which is allowed to have an intrinsic scatter.
- I finally ran a test where I included an intrinsic scatter in the model but fixed it to a variety of values, rather than letting the sampler fit for the best-fitting intrinsic scatter. The results broadly show what we see in the individual cases: the one-to-one model's evidence increases relative to the linear model until an intrinsic scatter of ~0.3 or so, at which point the linear model starts to become more and more favored.
- In summary, I think this broadly points to the fact that the one-to-one and linear fits cannot rule each other out; rather, the data are consistent with both fits. Due to the variance in the evidence calculation with the prior, I hesitate to quote exact evidence differences, but we can certainly quote the posterior intervals on our parameters/fits.
New Diagnostics
Fixed Intrinsic Scatter
- In this case, I've adjusted the prior range on the one-to-one model's intrinsic scatter to a narrow window around our best-fit value. This case is essentially identical to our base fit, but limiting the prior volume marginally adjusts the evidence of our one-to-one model and thus our evidence difference (a sketch of the one-line change is shown after the table):
Posterior Fits |
---|
images/250206/posterior_predictive_check_fixed_best.png |
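A sketch of the only change relative to the one-to-one setup in the first code block: the prior transform pins the scatter to the narrow [0.282, 0.284] window, which shrinks the prior volume that enters the evidence integral. The range below is the one quoted above; everything else is assumed to stay as in the earlier toy sketch.

```python
# One-to-one model with the intrinsic scatter prior narrowed to [0.282, 0.284].
import numpy as np

def ptform_one_to_one_narrow(u):
    """Map the unit cube to a uniform prior on s over [0.282, 0.284]."""
    return np.array([0.282 + (0.284 - 0.282) * u[0]])

# Plugging this in for ptform_one_to_one leaves the fit essentially unchanged
# but rescales the evidence, since Z integrates the likelihood against the
# prior density, here p(s) = 1 / 0.002 over a very small interval.
```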
Exclude Intrinsic Scatter
- In this case, I'm asserting that there is no intrinsic scatter in the one-to-one model, and this is where a strict model leads to very low evidence values (whereas the linear fit has the flexibility of an intrinsic scatter). I implement this by dropping the intrinsic scatter term and letting the variance come only from the measurement errors (see the sketch after the table).
Posterior Fits |
---|
images/250206/posterior_predictive_check_no_int_scatter.png |
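A minimal sketch of this variant, again reusing the toy `x`, `y`, `yerr` from the first code block. One way to view it: with the slope, intercept, and scatter all fixed, the model has no free parameters, so its "evidence" collapses to a single likelihood value.

```python
# One-to-one model with no intrinsic scatter: the only variance comes from
# the measurement errors (toy x, y, yerr as in the earlier sketch).
import numpy as np

def loglike_one_to_one_no_scatter():
    var = yerr**2                                    # measurement errors only
    return -0.5 * np.sum((y - x)**2 / var + np.log(2 * np.pi * var))

# With no free parameters, ln Z for this model is just
# loglike_one_to_one_no_scatter(), which drops far below ln Z_free whenever
# the data scatter exceeds the measurement errors.
```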
Line of Intrinsic Scatters
- In this case, I'm again not allowing dynesty to search for the intrinsic scatter, but instead fixing the assumed intrinsic scatter to a set of values so we can see how our evidence depends on the assumed intrinsic scatter. I'm doing this by including a fixed intrinsic scatter value in the likelihood (a sketch of the scan is included after the table below).
- This seems to be in line with what we might expect: initially, when the model is too strict, the linear model with intrinsic scatter is more consistent with the data. When both fits are allowed to have some intrinsic uncertainty (for a generous range of uncertainties), the linear model and the one-to-one appear broadly consistent, with marginal evidence for the one-to-one being better. But as we allow the one-to-one to have a large intrinsic scatter (far more than the data support), the linear model once again becomes more favored, since the intrinsic scatter is "drowning out" any signal from the data.
Evidence vs. Sigma |
---|
images/250206/evidence_vs_sigma.png |
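A hedged sketch of this scan, continuing the toy setup from the first code block (`x`, `y`, `yerr`, `samp_free` are reused; the grid of scatter values is assumed). It treats each fixed-scatter one-to-one model as having no free parameters, so its log-evidence is just its log-likelihood; the actual run may have handled the fixed parameters differently.

```python
# Evidence-vs-sigma scan: fix the one-to-one model's intrinsic scatter to each
# value on a grid and record its log-evidence (toy data from the earlier sketch).
import numpy as np

sigma_grid = np.linspace(0.05, 0.6, 12)     # assumed grid of fixed scatters
lnZ_1to1 = []
for sig in sigma_grid:
    var = yerr**2 + sig**2
    # Slope, intercept, and scatter all fixed -> no free parameters,
    # so the model's log-evidence is its log-likelihood.
    lnZ_1to1.append(-0.5 * np.sum((y - x)**2 / var + np.log(2 * np.pi * var)))

# Difference relative to the free linear model's evidence from the first sketch;
# plotting this against sigma_grid gives an evidence-vs-sigma curve like the
# figure above.
dlnZ = samp_free.results.logz[-1] - np.array(lnZ_1to1)
```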
One Note of Clarity
- One thing I was worried about: note that there is a slight difference between the evidence in the plot immediately above, around sigma ~ 0.3 or so, and the value quoted in our fiducial fit. Specifically, the evidence difference in our fiducial fit is ~2.8, compared to ~4.0 in this plot. This, I think, is simply a result of the evidence calculation having different prior volumes in these two cases: in the case immediately above, the intrinsic scatter is fixed to a specific value, whereas the fiducial case has an actual prior volume to integrate over. It's encouraging that fixing the intrinsic scatter over a limited prior volume (the first case, "fixing full model") gives evidence consistent with the fixed case, even if this differs from the fiducial case. I'm unsure if there is a way to treat the fixed-parameter prior volumes as they appear in the evidence calculation, but if we focus more on the posteriors and less on the actual evidence, we avoid this issue.