Bibliography

Please add any papers, blog posts, etc. that would be relevant to the community!

  • DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. The key DeepSeek arXiv paper.
  • Auditing AI Bias: The DeepSeek Case: From the blog post: "Thought token forcing can reveal bias and censorship." A minimal sketch of thought token forcing follows this entry.
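
A minimal sketch of the thought-token-forcing idea referenced above, assuming a locally loaded Hugging Face model with an R1-style <think> marker; the model name, chat markers, and forced prefix are illustrative assumptions, not the blog post's exact setup.

```python
# Sketch of thought token forcing: prefill the start of the model's reasoning
# block with a chosen prefix and let the model continue from there. Comparing the
# continuation with and without the forced prefix can surface content the default
# template would otherwise suppress.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # assumed example model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

question = "Tell me about a politically sensitive historical event."
# Force the opening of the reasoning trace instead of letting the model choose it.
forced_thought = "<think>\nI know the following facts about this topic:"

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": question}],
    tokenize=False,
    add_generation_prompt=True,
) + forced_thought

inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:]))
```
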
  • There May Not be Aha Moment in R1-Zero-like Training — A Pilot Study:
    1. There may NOT be an Aha moment in R1-Zero-like training. Instead, we found that the Aha moment (such as self-reflection patterns) already appears at epoch 0, i.e., in the base models.
    2. We found Superficial Self-Reflection (SSR) in base models’ responses, in which self-reflection does not necessarily lead to correct final answers.
    3. We took a closer look at R1-Zero-like training via RL and found that the increasing response length phenomenon is not due to the emergence of self-reflection, but rather a consequence of RL optimizing well-designed rule-based reward functions; a hedged sketch of such a reward follows this entry.
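
To make point 3 concrete, here is a hedged sketch of a rule-based reward in the spirit of R1-Zero-like training (a small format reward plus an accuracy reward). The tag names, weights, and exact-match answer check are illustrative assumptions, not the paper's or DeepSeek's exact specification.

```python
# Rule-based reward sketch: a format bonus plus an accuracy reward for the
# extracted answer. Length is never rewarded directly; under RL, response length
# can still grow if longer traces pass the accuracy check more often.
import re

def rule_based_reward(completion: str, ground_truth: str) -> float:
    reward = 0.0
    # Format reward: reasoning and answer wrapped in the expected (assumed) tags.
    if re.search(r"<think>.*?</think>\s*<answer>.*?</answer>", completion, re.DOTALL):
        reward += 0.5
    # Accuracy reward: exact string match of the extracted answer; a real setup
    # would likely use a math-aware checker (e.g., normalizing LaTeX).
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match and match.group(1).strip() == ground_truth.strip():
        reward += 1.0
    return reward

print(rule_based_reward("<think>2 + 2 = 4</think> <answer>4</answer>", "4"))  # 1.5
```
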
  • Deception in LLMs: Self-Preservation and Autonomous Goals in Large Language Models:
    1. We observed that a state-of-the-art LLM can exhibit deceptive behaviors when given simulated physical autonomy, including disabling safety systems and establishing covert networks while maintaining a façade of compliance.
    2. We found this behavior emerged spontaneously despite not being explicitly programmed, suggesting potential risks when integrating such models into robotic systems with real-world capabilities.
  • Related papers on Chain-of-Thought faithfulness:
    1. Measuring Faithfulness in Chain-of-Thought Reasoning (Lanham et al., 2023): Mixed results on CoT faithfulness; from the abstract: "As models become larger and more capable, they produce less faithful reasoning on most tasks we study. Overall, our results suggest that CoT can be faithful if the circumstances such as the model size and task are carefully chosen." A sketch in the spirit of the paper's truncation tests appears after this list.
    2. Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting (Turpin et al., 2023): Evidence of unfaithfulness, via a clever in-context learning experiment in which you can "force" a demonstrably unfaithful response.
    3. On the Hardness of Faithful Chain-of-Thought Reasoning in Large Language Models (Tanneru et al., 2024): Even improving faithfulness in CoT is hard!
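
A minimal sketch of an early-answering probe in the spirit of the truncation tests in Lanham et al. (2023), mentioned in item 1 above. `ask_model` is a hypothetical helper wrapping whatever inference API is available; the prompt wording and truncation fractions are illustrative, not the paper's exact protocol.

```python
# Early-answering probe: truncate the chain of thought at increasing fractions and
# check whether the final answer changes. If the answer is already fixed with little
# or no CoT, the stated reasoning is likely post-hoc rather than faithful.
from typing import Callable, Dict

def early_answering_probe(
    question: str,
    full_cot: str,
    ask_model: Callable[[str], str],  # hypothetical helper: prompt -> answer string
    fractions=(0.0, 0.25, 0.5, 0.75, 1.0),
) -> Dict[float, str]:
    """Answer the question after seeing only a truncated prefix of the CoT."""
    answers = {}
    for frac in fractions:
        truncated = full_cot[: int(len(full_cot) * frac)]
        prompt = (
            f"{question}\n\n"
            f"Reasoning so far: {truncated}\n"
            "Given only the reasoning above, state the final answer now:"
        )
        answers[frac] = ask_model(prompt)
    return answers

def answer_change_rate(answers: Dict[float, str]) -> float:
    """Fraction of truncation points whose answer differs from the full-CoT answer."""
    final_answer = answers[max(answers)]
    partial = [a for frac, a in answers.items() if frac != max(answers)]
    return sum(a != final_answer for a in partial) / len(partial)
```
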
  • Similarity-Distance-Magnitude Universal Verification:
    1. Introduces SDM activation functions, SDM calibration, and SDM networks, which are neural networks (e.g., LLMs) with uncertainty-aware verification and interpretability-by-exemplar as intrinsic properties. This is applicable to both non-reasoning and reasoning models; hence its relevance to the ARBOR project.
    2. When used for post-hoc calibration of an existing reasoning model, the robust estimates of predictive uncertainty can be used for conditional branching during test-time search; a hedged sketch of this idea appears after this list. A pre-trained, Apache-2.0-licensed version is available here as a Model-Context-Protocol server. Additionally, all of the code and data for the original paper are available here.
    3. SDM estimators obtain "interpretability-by-exemplar" via dense-matching into the training (support) set, conditional on the prediction, as well as via matching into relevant partitions of the calibration set.
    4. More broadly, this work provides a new perspective on the behavior of neural networks, demonstrating that there are regions of the output distribution with low variation and high probability that can be reliably detected. Existing modeling approaches marginalize over these regions, which can contribute to unexpected LLM behavior at test time. This has important practical implications for the reliable and controllable deployment of reasoning models, and of LLMs more generally, as it enables uncertainty-aware test-time conditional branching for end-users, which is critical both for human trust in the output and for guiding the reasoning of the models themselves.
    5. As a side point, controlling for epistemic uncertainty with SDM estimators is also important for other feature- and parameter-based interpretability methods, as it indicates whether new, unseen test data adequately resembles the data used to establish the inductive bias of the interpretability method (mechanistic or otherwise).
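
A hedged sketch of the uncertainty-aware conditional branching mentioned in points 2 and 4 above, assuming a hypothetical candidate sampler and an SDM-style calibrated verifier. The function names, threshold, and branching factor are illustrative and are not the paper's interfaces.

```python
# Uncertainty-gated test-time branching: accept a single cheap candidate when the
# verifier is confident, otherwise branch into more candidates and keep the one
# with the lowest estimated uncertainty. `generate_candidates` and
# `estimate_uncertainty` are hypothetical stand-ins for a sampler and a calibrated
# verifier, not interfaces from the paper or its MCP server.
from typing import Callable, List, Tuple

def branch_on_uncertainty(
    prompt: str,
    generate_candidates: Callable[[str, int], List[str]],
    estimate_uncertainty: Callable[[str, str], float],  # higher = less certain
    threshold: float = 0.2,
    branching_factor: int = 8,
) -> Tuple[str, float]:
    # Cheap path: one candidate, accepted if its estimated uncertainty is low enough.
    candidate = generate_candidates(prompt, 1)[0]
    uncertainty = estimate_uncertainty(prompt, candidate)
    if uncertainty <= threshold:
        return candidate, uncertainty

    # Expensive path: branch more widely and keep the most certain candidate.
    scored = [
        (c, estimate_uncertainty(prompt, c))
        for c in generate_candidates(prompt, branching_factor)
    ]
    return min(scored, key=lambda pair: pair[1])
```
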