Research Problem: The paper addresses the problem of synthesizing novel views of complex 3D scenes from a set of input images. It aims to create a continuous, volumetric representation of a scene that can be rendered from arbitrary viewpoints.
Key Contributions:
Introduced Neural Radiance Fields (NeRF), a method that represents a scene using a fully-connected deep neural network (MLP).
The network takes a 5D coordinate (spatial location and viewing direction) as input and outputs volume density and view-dependent emitted radiance (a sketch follows this list).
Achieved state-of-the-art results in novel view synthesis, outperforming prior neural rendering and view synthesis methods.
Proposed a positional encoding to enable the MLP to represent higher frequency functions.
Introduced a hierarchical sampling procedure to improve rendering efficiency.
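As a concrete illustration of the interface above, here is a minimal PyTorch sketch in which density is predicted from the encoded position alone while color also sees the encoded viewing direction. The class name and trunk depth are illustrative assumptions, not the authors' reference code; the paper's network uses 8 hidden layers of width 256 with a skip connection, truncated here for brevity.

```python
import torch
import torch.nn as nn

class NeRFMLP(nn.Module):
    """Illustrative NeRF-style MLP: density depends on position only;
    color depends on position features plus the viewing direction."""

    def __init__(self, pos_dim=60, dir_dim=24, width=256):
        # pos_dim/dir_dim are the sizes of the positionally encoded inputs
        # (L=10 bands for position and L=4 for direction give 60 and 24).
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(pos_dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(width, 1)    # volume density (view-independent)
        self.feature = nn.Linear(width, width)   # features passed to the color head
        self.color_head = nn.Sequential(
            nn.Linear(width + dir_dim, width // 2), nn.ReLU(),
            nn.Linear(width // 2, 3),            # RGB
        )

    def forward(self, x_enc, d_enc):
        h = self.trunk(x_enc)                    # position features only
        sigma = torch.relu(self.sigma_head(h))   # density >= 0
        feat = self.feature(h)
        rgb = torch.sigmoid(self.color_head(torch.cat([feat, d_enc], dim=-1)))
        return sigma, rgb
```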
Methodology/Approach:
Represent the scene as a continuous 5D function using an MLP.
Input: 5D coordinate (x, y, z, θ, φ) representing spatial location and viewing direction.
Output: Volume density (σ) and view-dependent emitted radiance (color, c).
Volume density (σ) is a function of location (x) only.
Color (c) is a function of both location (x) and viewing direction (θ, φ).
Use classical volume rendering techniques to composite the output colors and densities into an image (the quadrature is reproduced after this list).
Optimize the network weights by minimizing the photometric loss: the squared error between pixel colors rendered from the representation and the corresponding ground-truth pixels.
Employ positional encoding and hierarchical volume sampling to improve performance.
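For reference, the volume rendering quadrature from the paper: given sorted samples along a ray with densities σ_i, colors c_i, and spacings δ_i = t_{i+1} − t_i,

$$\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \left(1 - e^{-\sigma_i \delta_i}\right) \mathbf{c}_i, \qquad T_i = \exp\Big(-\sum_{j=1}^{i-1} \sigma_j \delta_j\Big),$$

where T_i is the accumulated transmittance, i.e. the probability that the ray reaches sample i without being absorbed. This expression is differentiable in σ and c, which is what lets the rendering loss train the MLP end to end.

Below is a minimal sketch of the positional encoding γ, assuming the paper's formulation (sinusoids at exponentially growing frequencies, applied to each coordinate independently); the function name and tensor layout here are our own:

```python
import math
import torch

def positional_encoding(p, num_bands=10):
    """gamma(p) = (sin(2^0 pi p), cos(2^0 pi p), ..., sin(2^(L-1) pi p), cos(2^(L-1) pi p)),
    applied elementwise; the paper uses L=10 for positions and L=4 for directions."""
    freqs = (2.0 ** torch.arange(num_bands)) * math.pi  # 2^k * pi, k = 0..L-1
    angles = p[..., None] * freqs                       # (..., dim, L)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
    return enc.flatten(start_dim=-2)                    # (..., dim * 2L)
```

With L=10 bands, a 3D position becomes a 60-dimensional feature vector, matching the MLP input sketched under Key Contributions.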
Results:
NeRF outperforms prior work on neural rendering and view synthesis, both quantitatively and qualitatively.
The method can represent complex geometry and appearance, including view-dependent effects like specularities.
The model achieves high-quality renderings with fine details.
The optimized NeRF representation is compact, requiring less storage than the input images.
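A rough back-of-envelope for the compactness claim, assuming the paper's 8 hidden layers of width 256 and ignoring the input/output layers and skip connection:

```python
# Order-of-magnitude parameter count for one NeRF MLP.
width, layers = 256, 8
params = layers * (width * width + width)   # weights + biases per hidden layer
print(f"{params:,} params ~ {params * 4 / 1e6:.1f} MB at float32")
# ~0.5M params, ~2 MB per network; the paper reports about 5 MB per scene
# for the weights (coarse + fine), less than the input images themselves.
```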
Discussion Points
Strengths:
Innovative use of a simple MLP to represent a continuous volumetric scene function.
High-quality rendering results with fine details and view-dependent effects.
Compact model representation.
The positional encoding and hierarchical sampling significantly improve performance.
The paper is well-written and the method is clearly explained.
Weaknesses:
The model is fit to a single scene and must be retrained from scratch for each new one; it does not generalize across scenes.
Training is computationally expensive, taking roughly 1-2 days per scene on a single GPU (an NVIDIA V100 in the paper).
Because density is a function of position only, the method struggles with scenes whose apparent geometry changes with viewpoint (e.g., holograms, polarized films).
The method requires posed input images: camera extrinsics and intrinsic parameters must be known or estimated (the paper uses COLMAP for its real scenes).
Key Questions:
How can the method be made more generalizable to avoid overfitting to individual scenes?
How can the training time be reduced?
How does the hierarchical sampling work, specifically the inverse transform sampling? (This was clarified during the discussion; see the sketch after this list.)
How do the coarse and fine networks interact during rendering, and why are both sets of samples used? (This was partially clarified, but some confusion remained).
How does the positional encoding relate to concepts like RoPE (rotary position embeddings)?
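A minimal sketch of the inverse transform sampling discussed in the first question above, under the standard formulation: normalize the coarse compositing weights along each ray into a PDF, build its CDF, and invert the CDF at uniform random draws, so that fine samples land where the coarse pass found content. The function name and numerical-safety details are illustrative, not the authors' reference code.

```python
import torch

def sample_pdf(bins, weights, n_fine, eps=1e-5):
    """Inverse transform sampling along a ray.
    bins:    (..., M+1) depth-bin edges from the coarse pass
    weights: (..., M)   compositing weights of the coarse samples
    Returns (..., n_fine) new depths concentrated where weight is high."""
    pdf = (weights + eps) / (weights + eps).sum(dim=-1, keepdim=True)
    cdf = torch.cumsum(pdf, dim=-1)
    cdf = torch.cat([torch.zeros_like(cdf[..., :1]), cdf], dim=-1)  # starts at 0

    u = torch.rand(*cdf.shape[:-1], n_fine)           # uniform draws in [0, 1)
    idx = torch.searchsorted(cdf, u, right=True)      # CDF bin containing each draw
    below = (idx - 1).clamp(min=0)
    above = idx.clamp(max=cdf.shape[-1] - 1)

    # Linear interpolation within the chosen bin yields a continuous sample.
    cdf_lo = torch.gather(cdf, -1, below)
    cdf_hi = torch.gather(cdf, -1, above)
    bin_lo = torch.gather(bins, -1, below)
    bin_hi = torch.gather(bins, -1, above)
    t = ((u - cdf_lo) / (cdf_hi - cdf_lo).clamp(min=eps)).clamp(0.0, 1.0)
    return bin_lo + t * (bin_hi - bin_lo)
```

On the second question: the fine network is evaluated at the union of the coarse and fine sample locations (sorted along the ray), so the final render composites all N_c + N_f samples; the coarse pass exists mainly to tell the fine pass where to look.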
Applications:
Novel view synthesis for virtual reality and augmented reality.
3D scene reconstruction from images.
Creating realistic 3D models of objects and environments.
Special effects in film and video games.
Connections:
Relates to other work on implicit neural representations and volume rendering.
The positional encoding is similar to that used in Transformers, but used for a different purpose (mapping continuous coordinates to a higher-dimensional space).
The hierarchical sampling is essentially a form of importance sampling: the coarse pass estimates where along each ray the visible content lies, and the fine pass concentrates samples there.
The discussion participants noted similarities to RoPE (rotary position embeddings).
Notes and Reflections
Interesting Insights:
The simplicity of the MLP architecture and its effectiveness in representing complex scenes.
The concept of overfitting the network to a single scene to achieve high-quality rendering.
The importance of positional encoding for capturing high-frequency details.
The effectiveness of hierarchical sampling for improving rendering efficiency.
The ability of the model to learn view-dependent effects, even though participants described this as a form of "cheating" or eye-trickery rather than true physical modeling.
Lessons Learned:
Implicit neural representations can be a powerful tool for representing 3D scenes.
Positional encoding is crucial for capturing high-frequency information.
Hierarchical sampling can significantly improve rendering efficiency.
Overfitting can be a useful technique in specific contexts, such as representing a single scene.
Future Directions:
Research on making NeRF more generalizable to avoid overfitting.
Exploring methods to reduce training time.
Investigating the use of attention mechanisms to improve performance.
Exploring how to create true "world models" that capture the underlying physics of the scene, rather than just visual appearance.
Comparing NeRF to other methods like Gaussian Splatting.