1) Sehender Raum : Seeing Space  3a1b2c3/seeingSpace Wiki
Notes about capturing, rendering and digitally reconstructing the world
When I learned about traditional computer graphics and photogrammetry, I missed the big picture of how all the pieces connect with hardware, physics and machine learning. That made it harder to understand recent research and its meaning for the field. Rendering 3D models from 2D images remains a challenging problem, but incredible progress has been made since I first became interested in the topic 20 years ago (see below).
 TLDR: Nerf in 10 slides
With neural rendering, computer graphics and vision might be heading for their "Charles Darwin" moment, where we can see and remove some limiting assumptions from the field. Some disjoint pieces may just fall into place: computer graphics and vision now have a shared framework. Exciting times.
I would love to hear from anybody in the field: https://www.linkedin.com/in/katrinschmid/
 Inverse and Differential rendering (aka "Computervision")
 Neural Rendering, ca 1990s

Imagebased rendering (IBR): Plenoptic function and capture
 The plenoptic function 8D

Static 5D and 4D Lightfields: capture and rendering (Gabriel Lippmann, 1908)

Capturing, storing and compressing static and dynamic light fields

Neural Scene representations
 Encodings comparison
 Input format: Local Light Field Fusion (LLFF), 2019
 Multi Sphere Image, Multiplane image (MPI), local layered representation format and DeepMPI representation (2.5 D), 2020
 Parametric Encoding: Acorn: Adaptive coordinate networks for neural scene representation, 2021
 multiresolution decomposition
 Conversion to mesh or voxel
 PointBased Rendering
 PlenOctrees For Realtime Rendering of Neural Radiance Fields, 2021, NeRFSH
 Plenoxels: Radiance Fields without Neural Networks, 2021
 PointNeRF: Pointbased Neural Radiance Fields, 2022
 Mixed scene representations for neural rendering, 2019
 Compression

Neural Scene representations
 Novel (virtual) 2D view synthesis from plenoptic samples
 Multiresolution Nerfs
 Relighting with 4D Incident Light Fields
 Nerf for computer vision task : Scene Labeling and Understanding with Implicit Scene Representation, 2021
 Nerf methods in comparison + Combining LIDAR and radiance fields

Capturing, storing and compressing static and dynamic light fields
 Conclusion
"Classic rendering" in computer graphics
Classical computer graphics methods approximate the physical process of image formation in the real world: light sources emit photons that interact with the objects in the scene, as a function of their geometry and material properties, before being recorded by a camera. This process is known as light transport. The process of transforming a scene definition including cameras, lights, surface geometry and material into a simulated camera image is known as rendering.
The two most common approaches to rendering are rasterization and raytracing.
 Rasterization is a feed-forward process in which geometry is transformed into the image domain, sometimes in back-to-front order, known as the painter’s algorithm.
 Raytracing is a process in which rays are cast backwards from the image pixels into a virtual scene, and reflections and refractions are simulated by recursively casting new rays from the intersections with the geometry.
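To make the raytracing idea concrete, here is a minimal sketch in plain Python that casts one primary ray per pixel and tests it against a single sphere. The scene (one sphere at z = -3), the pinhole camera at the origin and the function names are all assumptions for illustration. Recursive reflection and refraction rays are left out.

```python
import math

def intersect_sphere(origin, direction, center, radius):
    """Return the nearest positive hit distance t along the ray, or None."""
    # Solve |origin + t * direction - center|^2 = radius^2 for t.
    oc = [o - c for o, c in zip(origin, center)]
    a = sum(d * d for d in direction)
    b = 2.0 * sum(d * o for d, o in zip(direction, oc))
    c = sum(o * o for o in oc) - radius * radius
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        return None  # the ray misses the sphere
    sqrt_disc = math.sqrt(disc)
    for t in ((-b - sqrt_disc) / (2.0 * a), (-b + sqrt_disc) / (2.0 * a)):
        if t > 1e-9:
            return t
    return None

def render_mask(width, height, center=(0.0, 0.0, -3.0), radius=1.0):
    """Cast one ray per pixel from a pinhole camera at the origin looking down -z."""
    rows = []
    for j in range(height):
        row = ""
        for i in range(width):
            # Map pixel centers to an image plane at z = -1 spanning [-1, 1]^2.
            x = (i + 0.5) / width * 2.0 - 1.0
            y = 1.0 - (j + 0.5) / height * 2.0
            hit = intersect_sphere((0.0, 0.0, 0.0), (x, y, -1.0), center, radius)
            row += "#" if hit is not None else "."
        rows.append(row)
    return rows
```

Calling `print("\n".join(render_mask(16, 8)))` prints a coarse silhouette of the sphere. A full raytracer would shade each hit point and spawn new rays from it.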
The rendering equation (published in 1986)
The rendering equation describes physical light transport for a single camera or for human vision. A point in the scene is imaged by measuring the emitted and reflected light that converges on the sensor plane. Radiance (L) represents the ray strength, measuring the combined angular and spatial power densities. Radiance can be used to indicate how much of the power emitted by the light source that is reflected, transmitted or absorbed by a surface will be captured by a camera facing that surface from a specified angle of view.
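For reference, the rendering equation is commonly written in the hemispherical form below. This is the usual textbook formulation, not a quotation from the sources listed here, and the symbol names are notation assumed for this sketch.

```latex
L_o(\mathbf{x}, \omega_o) = L_e(\mathbf{x}, \omega_o)
  + \int_{\Omega} f_r(\mathbf{x}, \omega_i, \omega_o)\, L_i(\mathbf{x}, \omega_i)\, (\omega_i \cdot \mathbf{n})\, \mathrm{d}\omega_i
```

Here L_o is the outgoing radiance at surface point x in direction ω_o, L_e is the emitted radiance, f_r is the surface reflectance function (BRDF), L_i is the incoming radiance from direction ω_i, n is the surface normal, and Ω is the hemisphere above the surface.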
Source: https://www.semanticscholar.org/paper/InverseRenderingandRelightingFromMultiplePlusLiuDo/da1ba94e01596d0241d7d426b071ae9731d148b3, https://www.mdpi.com/20724292/13/13/2640, Rendering for Data Driven Computational Imaging, Tristan Swedish
Inverse and Differential rendering (aka "Computervision")
Inverse graphics attempts to take sensor data and infer 3D geometry, illumination, materials, and motions such that a graphics renderer could realistically reproduce the observed scene. Renderers, however, are designed to solve the forward process of image synthesis. To go in the other direction, an approximate differentiable renderer (DR) can be used that explicitly models the relationship between changes in model parameters and image observations. Differentiable rendering enables optimization of 3D object properties such as the geometry of a mesh. Unlike traditional rendering, differentiable rendering can backpropagate gradients from image space to 3D geometry.
Source: http://rgl.epfl.ch/publications/NimierDavidVicini2019Mitsuba2
Inverse rendering and differentiable rendering have been a topic of research for some time. However, major breakthroughs have only been made in recent years due to improved hardware and advancements in deep learning.
Neural Rendering, ca 1990s
Neural rendering is a relatively new technique that combines a classical or other 3D representation and renderer with deep neural networks that re-render the classical render into more complete and realistic views. In contrast to neural image-based rendering (NIBR), neural re-rendering does not use input views at runtime, and instead relies on the deep neural network to recover the missing details. Deepfakes are an early neural rendering technique in which a person in an existing image or video is replaced with someone else's likeness. The original approach is believed to be based on Korshunova et al. (2016), which used a convolutional neural network (CNN).
A typical neural rendering approach takes as input images corresponding to certain scene conditions (for example, viewpoint, lighting, layout, etc.), builds a "neural" scene representation from them, and "renders" this representation under novel scene properties to synthesize novel images.
The learned scene representation is not restricted by simple scene modeling approximations and can be optimized for high quality novel images. At the same time, neural rendering approaches incorporate ideas from classical graphics—in the form of input features, scene representations, and network architectures—to make the learning task easier, and the output more controllable. Neural rendering has many important use cases such as semantic photo manipulation, novel view synthesis, relighting, free viewpoint video, as well as facial and body reenactment.
Source: Advances in Neural Rendering, https://www.neuralrender.com/
Artifacts such as ghosting, blur, holes, or seams can arise due to view-dependent effects, imperfect proxy geometry or too few source images. To address these issues, NIBR methods replace the heuristics often found in classical IBR methods with learned blending functions or corrections that take into account view-dependent effects.
Source: Advances in Neural Rendering, https://www.neuralrender.com/
Imagebased rendering (IBR): Plenoptic function and capture
Computational imaging (CI) is a class of imaging systems that, starting from an imperfect physical measurement and prior knowledge about the class of objects or scenes being imaged, deliver estimates of a specific object or scene presented to the imaging system.
In contrast to classical rendering, which projects 3D content to the 2D plane, image-based rendering techniques generate novel images by transforming an existing set of images, typically by warping and compositing them together. The essence of image-based rendering is to obtain all the visual information of the scene directly through images. It is used in computer graphics and computer vision, and it is also widely used in virtual reality technology.
The plenoptic function 8D, Gabriel Lippmann, 1908
The world as we see it using our eyes is a continuous three-dimensional function of the spatial coordinates. To generate photorealistic views of a real-world scene from any viewpoint, it is necessary not only to understand the 3D scene geometry but also to model the complex viewpoint-dependent appearance that results from light transport phenomena. A photograph is a two-dimensional map of the “number of photons” that map from the three-dimensional scene.
While the rendering equation is a useful model for computer graphics some problems are easier to solve by a more generalized light model.
The plenoptic function is also inspired by multifaceted insect eyes or lens arrays
Source: https://en.wikipedia.org/wiki/Compound_eye, Rendering for Data Driven Computational Imaging, Tristan Swedish
The plenoptic function describes the degrees of freedom of a light ray with the parameters: Irradiance, position, wavelength, time, angle, phase, polarization, and bounce.
Source: Rendering for Data Driven Computational Imaging, Tristan Swedish, https://www.blitznotes.org/ib/physics/waves.html, https://courses.lumenlearning.com/boundlesschemistry/chapter/thenatureoflight/
Light has the properties of waves. Like ocean waves, light waves have crests and troughs.
 The distance between one crest and the next, which is the same as the distance between one trough and the next, is called the wavelength.
 Wave phase is the offset of a wave from a given point. When two waves cross paths, they either cancel each other out or reinforce each other, depending on their phase.
 Irradiance is the amount of light energy from one thing hitting a square meter of another each second. Photons that carry this energy have wavelengths ranging from energetic X-rays and gamma rays to visible light to the infrared and radio. The unit of irradiance is the watt per square meter.
 Polarization and Bounce are often omitted for simplicity
 The full equation is also time dependent.
Static 5D and 4D Lightfields: capture and rendering (Gabriel Lippmann, 1908)
Lightfield capture was first proposed in 1908 by Nobel laureate physicist Gabriel Lippmann (who also contributed to early color photography). If Vx, Vy, Vz are fixed, the plenoptic function describes a panorama at fixed viewpoint (Vx, Vy, Vz). A regular image with a limited field of view can be regarded as an incomplete plenoptic sample at a fixed viewpoint. As long as we stay outside the convex hull of an object or a scene, if we fix the location of the camera on a plane, we can use two parallel planes (u,v) and (s,t) to simplify the complete 5D plenoptic function to a 4D lightfield plenoptic function.
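The two-plane parameterization can be sketched in a few lines: every ray of the 4D light field is identified by where it crosses the (u, v) plane and the (s, t) plane. Placing the planes at z = 0 and z = `plane_distance` is an assumption made for this illustration, as are the function names.

```python
import math

def two_plane_ray(u, v, s, t, plane_distance=1.0):
    """Ray through (u, v) on the plane z = 0 and (s, t) on the plane z = plane_distance."""
    origin = (u, v, 0.0)
    dx, dy, dz = s - u, t - v, plane_distance
    norm = math.sqrt(dx * dx + dy * dy + dz * dz)
    return origin, (dx / norm, dy / norm, dz / norm)

def point_on_ray(origin, direction, z):
    """Point where the ray reaches depth z (assumes direction[2] != 0)."""
    k = (z - origin[2]) / direction[2]
    return tuple(o + k * d for o, d in zip(origin, direction))
```

Four numbers are enough because, outside the convex hull of the scene, radiance is constant along a ray. That is why the position along the ray can be dropped from the 5D function.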
A light field is a mathematical function of one or more variables whose range is a set of multidimensional vectors that describe the amount of light flowing in every direction through every point in space. It restricts the information to light outside the convex hull of the objects of interest. The 7D plenoptic function can, under certain assumptions and relaxations, simplify to a 4D light field, which is easier to sample and operate on. A hologram is a photographic recording of a light field, rather than an image formed by a lens. A light field is a function that describes how light transport occurs throughout a 3D volume. It describes the direction of light rays moving through every x=(x, y, z) coordinate in space and in every direction d, described either as θ and ϕ angles or a unit vector. Collectively they form a 5D feature space that describes light transport in a 3D scene.
The magnitude of each light ray is given by the radiance and the space of all possible light rays is given by the five-dimensional plenoptic function. The 4D light field has 2D spatial (x,y) and 2D angular (u,v) information that is captured by a plenoptic sensor.
 the incident light field Li(u, v, alpha, beta) describing the irradiance of light incident on objects in space
 the radiant light field Lr (u, v, alpha, beta) quantifying the irradiance created by an object
 time is an optional 5th dimension
Capturing, storing and compressing static and dynamic light fields
Light field rendering [Levoy and Hanrahan 1996] eschews any geometric reasoning and simply samples images on a regular grid so that new views can be rendered as slices of the sampled light field. Lumigraph rendering [Gortler et al. 1996] showed that using approximate scene geometry can ameliorate artifacts due to undersampled or irregularly sampled views. The plenoptic sampling framework [Chai et al. 2000] analyzes light field rendering using signal processing techniques and shows that the Nyquist view sampling rate for light fields depends on the minimum and maximum scene depths. Furthermore, they discuss how the Nyquist view sampling rate can be lowered with more knowledge of scene geometry. Zhang and Chen [2003] extend this analysis to show how nonLambertian and occlusion effects increase the spectral support of a light field, and also propose more general view sampling lattice patterns. Rendering algorithms based
One type of light field camera uses an array of microlenses placed in front of an otherwise conventional image sensor to sense intensity, color, and directional information. Multi-camera arrays are another type. Compared to a traditional photo camera that only captures the intensity of the incident light, a light field camera provides angular information for each pixel.
In principle, this additional information allows 2D images to be reconstructed at a given focal plane, and hence a depth map can be computed. While special cameras and camera arrangements have been built to capture light fields, it is also possible to capture them with a conventional camera or smartphone under certain constraints (see Crowdsampling the Plenoptic Function).
Source: Stanford light field camera; Right: Adobe (large) lens array, source https://cs.brown.edu/courses/csci1290/labs/lab_lightfields, "Lytro Illum", a discontinued commercially available light field camera
Implicit neural scene representations
Source: Advances in Neural Rendering, https://www.neuralrender.com/
* General purpose format, can store images, points, voxels, meshes, compression
* Cloudy blurry artifacts when not enough data available
* Can handle reflection and transparent objects, small details, fuzzy objects
* Not meant for survey/measuring, can generate low-res geometry via marching cubes
Source: Advances in Neural Rendering, https://www.neuralrender.com/
 If you prefer to read code, see this tiny NeRF implementation: https://github.com/MaximeVandegar/Papersin100LinesofCode/tree/main/NeRF_Representing_Scenes_as_Neural_Radiance_Fields_for_View_Synthesis
Encodings comparison
Whereas discrete signal representations like pixel images or voxels approximate continuous signals with regularly spaced samples of the signal, these neural fields approximate the continuous signal directly with a continuous, parametric function, i.e., an MLP which takes in coordinates as input and outputs a vector (such as color or occupancy).
Neural approximations of scalar and vector fields, such as signed distance functions and radiance fields, have emerged as accurate, high-quality representations. State-of-the-art results are obtained by conditioning a neural approximation with a lookup from trainable feature grids.
From Instant Neural Graphics Primitives with a Multiresolution Hash Encoding
A demonstration of the reconstruction quality of different encodings and parametric data structures for storing trainable feature embeddings. Each configuration was trained for 11 000 steps using our fast NeRF implementation (Section 5.4), varying only the input encoding and MLP size. The number of trainable parameters (MLP weights + encoding parameters), training time and reconstruction accuracy (PSNR) are shown below each image. Our encoding (e) with a similar total number of trainable parameters as the frequency encoding configuration (b) trains over 8× faster, due to the sparsity of updates to the parameters and smaller MLP. Increasing the number of parameters (f) further improves reconstruction accuracy without significantly increasing training time.
Comparison of encodings (from the Instant NeRF paper)
A practical introduction: https://keras.io/examples/vision/nerf/#setup
Input format: Local Light Field Fusion (LLFF), 2019
Used in the original NeRF paper for still images; can get light fields to the Nyquist frequency limit.
LLFF uses COLMAP to calculate the pose of each camera (the pose files), then uses a trained neural network to calculate the distance map, and from there generates the MPI, which is the output used to create the MPI videos (and the metadata).
Open questions: Does this pose recentering need to be applied to real data where the camera poses are arbitrary? Leaving rendering aside, does it affect training, for example when training on LLFF data (real data with arbitrary camera poses) without recenter_poses but with NDC? Intuitively it depends on how much the default world coordinate frame differs from poses_avg; in practice, when using COLMAP, do they differ much?
NDC makes very specific assumptions: the camera faces along z and is entirely behind the z=near plane. So if the rotation is wrong it will fail (in its current implementation). This is analogous to how a regular graphics pipeline like OpenGL works.
poses_bounds.npy contains a 3x5 pose matrix and 2 depth bounds for each image. Each pose has [R T] as the left 3x4 matrix and [H W F] as the right 3x1 matrix.
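The pose layout described above can be unpacked as follows. This is a minimal sketch in plain Python that assumes each image is stored as one flat row of 17 values: the 3x5 matrix flattened row by row, followed by the near and far depth bounds. The function name and the row-major ordering are assumptions for illustration, so check them against the loader you actually use.

```python
def split_pose_row(row):
    """Split one 17-value row into (R, T, (H, W, F), (near, far))."""
    if len(row) != 17:
        raise ValueError("expected 15 pose values + 2 depth bounds")
    # First 15 values: a 3x5 matrix, assumed to be stored row by row.
    pose = [list(row[r * 5:(r + 1) * 5]) for r in range(3)]
    rotation = [p[0:3] for p in pose]            # left 3x3 block: R
    translation = [p[3] for p in pose]           # fourth column: T
    height, width, focal = (p[4] for p in pose)  # fifth column: [H W F]
    near, far = row[15], row[16]                 # assumed order: near, then far
    return rotation, translation, (height, width, focal), (near, far)
```

Image height, width and focal length are stored per image, so each pose is self-describing even if the images in one capture differ in resolution.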
 https://github.com/Fyusion/LLFF,
 https://bmild.github.io/llff/
 https://www.youtube.com/watch?v=LY6MgDUzS3M
Multi Sphere Image, Multiplane image (MPI), local layered representation format and DeepMPI representation (2.5 D), 2020
MSI: a Multi Sphere Image. LM: a Layered Mesh with individual layer textures. LMTA: a Layered Mesh with a texture atlas.
Deep image or video generation approaches that enable explicit or implicit control of scene properties such as illumination, camera parameters, pose, geometry, appearance, and semantic structure. MPIs (rgba) have the ability to produce highquality novel views of complex scenes in real time and the view consistency that arises from a 3D scene representation (in contrast to neural rendering approaches that decode a separate view for each desired viewpoint).
Our method takes in a set of images of a static scene, promotes each image to a local layered representation (MPI), and blends local light fields rendered from these MPIs to render novel views. As a rule of thumb, you should use images where the maximum disparity between views is no more than about 64 pixels (watch the closest thing to the camera and don't let it move more than ~1/8 the horizontal field of view between images). Our datasets usually consist of 20-30 images captured handheld in a rough grid pattern. Source: https://github.com/Fyusion/LLFF
The depth-wise resolution of an MPI is limited by the number of discrete planes, and thus MPIs cannot be converted to other 3D representations such as a mesh or a point cloud.
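Rendering from an MPI comes down to alpha-compositing its RGBA planes. The sketch below blends the samples that a single pixel sees on each plane, in back-to-front order, with the standard "over" operator. It assumes the planes have already been warped into the target view, which a real MPI renderer does with one homography per plane and which is not shown here. The function name is made up for this illustration.

```python
def composite_mpi_pixel(layers):
    """Blend (r, g, b, alpha) samples for one pixel, given in back-to-front order."""
    r = g = b = 0.0
    for layer_r, layer_g, layer_b, alpha in layers:
        # Each nearer layer covers a fraction `alpha` of what lies behind it.
        r = layer_r * alpha + r * (1.0 - alpha)
        g = layer_g * alpha + g * (1.0 - alpha)
        b = layer_b * alpha + b * (1.0 - alpha)
    return (r, g, b)
```

For example, an opaque blue back plane under a half-transparent red front plane, `composite_mpi_pixel([(0, 0, 1, 1.0), (1, 0, 0, 0.5)])`, gives an even mix of red and blue.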
DeepMPI (2020) extends prior work on multiplane images (MPIs) to model viewing conditions that vary with time. Our work makes three key contributions: first, a representation, called a DeepMPI, for neural rendering that extends prior work on multiplane images (MPIs) [68] to model viewing conditions that vary with time; second, a method for training DeepMPIs on sparse, unstructured crowdsampled data that is unregistered in time
Footnote: [1] describes the plenoptic function as 7D, but we can reduce this to a 4D color lightfield supplemented by time by applying the later observations of [33].
Parametric Encoding: Acorn: Adaptive coordinate networks for neural scene representation, 2021
This parametric approach uses a tree subdivision of the domain R^d, wherein a large auxiliary coordinate encoder neural network (ACORN) [Martel et al. 2021] is trained to output dense feature grids in the leaf node around x. These dense feature grids, which have on the order of 10 000 entries, are then linearly interpolated, as in Liu et al. [2020]. This approach tends to yield a larger degree of adaptivity compared with the previous parametric encodings, albeit at greater computational cost which can only be amortized when sufficiently many inputs x fall into each leaf node.
Source: https://www.computationalimaging.org/publications/acorn
multiresolution decomposition
Conversion to mesh or voxel
Neural radiance fields (NeRF) use techniques from volume rendering to accumulate samples of this scene representation along rays to render the scene from any viewpoint.
The neural network can also be converted to a mesh in certain circumstances (https://github.com/bmild/nerf/blob/master/extract_mesh.ipynb). To do so, we first need to infer which locations are occupied by the object. This is done by first creating a grid volume in the form of a cuboid covering the whole object, then using the NeRF model to predict whether each cell is occupied or not. This is the main reason why mesh construction is only available for 360 inward-facing scenes and not for forward-facing scenes.
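The occupancy step can be sketched as below: sample a density function at the cell centers of a cuboid grid and threshold the result. `toy_density` is a hypothetical stand-in for the density output of a trained NeRF, and the grid bounds, resolution and threshold are illustrative values, not those of the linked notebook. The resulting boolean grid is what a marching-cubes routine would take as input to produce the mesh.

```python
def occupancy_grid(density_fn, lo, hi, resolution, threshold):
    """Sample density_fn at the cell centers of a cuboid grid and threshold it."""
    step = [(h - l) / resolution for l, h in zip(lo, hi)]
    grid = []
    for i in range(resolution):
        plane = []
        for j in range(resolution):
            row = []
            for k in range(resolution):
                x = lo[0] + (i + 0.5) * step[0]
                y = lo[1] + (j + 0.5) * step[1]
                z = lo[2] + (k + 0.5) * step[2]
                row.append(density_fn(x, y, z) > threshold)
            plane.append(row)
        grid.append(plane)
    return grid

def toy_density(x, y, z):
    """Hypothetical stand-in for a trained NeRF: a solid ball of radius 0.5."""
    return 100.0 if x * x + y * y + z * z < 0.25 else 0.0
```

The cuboid has to enclose the whole object, which is easy to choose when the cameras surround it.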
PointBased Rendering
ADOP: Approximate Differentiable OnePixel Point Rendering, https://t.co/npOqsAstAx https://t.co/LE4ZdckQPO
PlenOctrees For Realtime Rendering of Neural Radiance Fields, 2021, NeRFSH
The paper introduces a method to render Neural Radiance Fields (NeRFs) in real time using PlenOctrees, an octree-based 3D representation which supports view-dependent effects. Our method can render 800x800 images at more than 150 FPS, which is over 3000 times faster than conventional NeRFs. We do so without sacrificing quality while preserving the ability of NeRFs to perform free-viewpoint rendering of scenes with arbitrary geometry and view-dependent effects. Real-time performance is achieved by pre-tabulating the NeRF into a PlenOctree. In order to preserve view-dependent effects such as specularities, we factorize the appearance via closed-form spherical basis functions. Specifically, we show that it is possible to train NeRFs to predict a spherical harmonic representation of radiance, removing the viewing direction as an input to the neural network. Furthermore, we show that PlenOctrees can be directly optimized to further minimize the reconstruction loss, which leads to equal or better quality compared to competing methods. Moreover, this octree optimization step can be used to reduce the training time, as we no longer need to wait for the NeRF training to converge fully. Our real-time neural rendering approach may potentially enable new applications such as 6-DOF industrial and product visualizations, as well as next generation AR/VR systems. PlenOctrees are amenable to in-browser rendering as well: https://alexyu.net/plenoctrees/
Source https://alexyu.net/plenoctrees/
Realtime online demo: https://alexyu.net/plenoctrees/demo/?load=https://angjookanazawa.com/plenoctree_data/ficus_cams.draw.npz;https://angjookanazawa.com/plenoctree_data/ficus.npz&hide_layers=1
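The spherical harmonic idea can be illustrated with a minimal sketch: each octree leaf stores a few coefficients per color channel, and the view-dependent color is their weighted sum with the basis functions evaluated at the viewing direction, passed through a sigmoid. Only degrees 0 and 1 are used here. The coefficient layout, the sign convention of the degree-1 terms and the function names are assumptions for this illustration and may differ from the released code.

```python
import math

SH_C0 = 0.28209479177387814  # normalization constant of the degree-0 basis function
SH_C1 = 0.4886025119029199   # normalization constant of the degree-1 basis functions

def sh_basis_deg1(direction):
    """Real spherical harmonics up to degree 1 for a unit viewing direction."""
    x, y, z = direction
    return [SH_C0, -SH_C1 * y, SH_C1 * z, -SH_C1 * x]

def view_dependent_color(sh_coeffs, direction):
    """sh_coeffs holds three lists (R, G, B) of four coefficients each."""
    basis = sh_basis_deg1(direction)
    color = []
    for channel in sh_coeffs:
        raw = sum(k * b for k, b in zip(channel, basis))
        color.append(1.0 / (1.0 + math.exp(-raw)))  # sigmoid maps to (0, 1)
    return color
```

The viewing direction only enters through this cheap closed-form sum, so no network evaluation is needed at render time.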
Plenoxels: Radiance Fields without Neural Networks, 2021
Proposes a view-dependent sparse voxel model, Plenoxel (plenoptic volume element), that can optimize to the same fidelity as Neural Radiance Fields (NeRFs) without any neural networks. Our typical optimization time is 11 minutes on a single GPU, a speedup of two orders of magnitude compared to NeRF.
Source https://github.com/sxyu/svox2
Also https://github.com/naruya/VaxNeRF
PointNeRF: Pointbased Neural Radiance Fields, 2022
Volumetric neural rendering methods like NeRF generate high-quality view synthesis results but are optimized per scene, leading to prohibitive reconstruction time. On the other hand, deep multi-view stereo methods can quickly reconstruct scene geometry via direct network inference. Point-NeRF combines the advantages of these two approaches by using neural 3D point clouds, with associated neural features, to model a radiance field.
https://arxiv.org/abs/2201.08845
Mixed scene representations for neural rendering, 2019
Mixture of Volumetric Primitives for Efficient Neural Rendering
Source: https://arxiv.org/pdf/2103.01954.pdf
Instant Neural Graphics Primitives
Implementation of four neural graphics primitives: neural radiance fields (NeRF), signed distance functions (SDFs), neural images, and neural volumes. In each case, we train and render an MLP with multiresolution hash input encoding using the tiny-cuda-nn framework.
https://github.com/NVlabs/instantngp/raw/master/docs/assets_readme/fox.gif
https://github.com/NVlabs/instantngp, needs RTX graphics card
Compression
TODO
Novel (virtual) 2D view synthesis from plenoptic samples
Synthesize plenoptic slices that can be interpolated to recover local regions of the full plenoptic function. Given a dense sampling of views, photorealistic novel views can be reconstructed by simple light field sample interpolation techniques. For novel view synthesis with sparser view sampling, the computer vision and graphics communities have made significant progress by predicting traditional geometry and appearance representations from observed images. The study of image-based rendering is motivated by a simple question: how do we use a finite set of images to reconstruct an infinite set of views?
View synthesis can be approached either by explicit estimation of scene geometry and color, or by using coarser estimates of geometry to guide interpolation between captured views. One approach aims to explicitly reconstruct the surface geometry and the appearance on the surface from the observed sparse views. Other approaches adopt volume-based representations to directly model the appearance of the entire space and use volumetric rendering techniques to generate images for 2D displays. The raw samples of a light field are saved as disks. High resolution requires large amounts of data.
Volume rendering uses a technique known as ray marching. Ray marching is when you shoot out a ray from the observer (camera) through a 3D volume in space and ask a function: what is the color and opacity at this particular point in space? Neural rendering takes the next step by using a neural network to approximate this function.
Source: Advances in Neural Rendering, https://www.neuralrender.com/
Source: https://github.com/ArnePetersen/PlenopticSimulation, A System for Acquiring, Processing, and Rendering Panoramic Light Field Stills for Virtual Reality
Source: Advances in Neural Rendering, https://www.neuralrender.com/
Light field rendering pushes the latter strategy to an extreme by using dense structured sampling of the light field to make reconstruction guarantees independent of specific scene geometry. Most image-based rendering algorithms are designed to model static appearance; DeepMPI (Deep Multiplane Images) additionally captures viewing-condition-dependent appearance.
Camera calibration is often assumed to be a prerequisite, while in practice this information is rarely accessible and must be precomputed with conventional techniques such as SfM.
3d scene reconstruction and inverse and differential rendering
Inverse rendering and differential rendering: explicitly reconstructing the scene
The key concept behind neural rendering approaches is that they are differentiable. A differentiable function is one whose derivative exists at each point in the domain. This is important because machine learning is basically the chain rule with extra steps: a differentiable rendering function can be learned with data, one gradient descent step at a time. Learning a rendering function statistically through data is fundamentally different from the classic rendering methods we described above, which calculate and extrapolate from the known laws of physics.
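A toy example shows what "differentiable" buys in practice. The sketch below uses a one-parameter renderer (a 1D Gaussian blob whose only scene parameter is its center), an analytic gradient of the image error obtained with the chain rule, and plain gradient descent to recover the center from a target image. The renderer, step size and iteration count are made up for illustration and are not taken from any of the cited systems.

```python
import math

def render(center, width=16, sigma=2.0):
    """Differentiable toy renderer: a 1D Gaussian blob at `center`."""
    return [math.exp(-((i - center) ** 2) / (2.0 * sigma ** 2)) for i in range(width)]

def loss_and_grad(center, target, sigma=2.0):
    """Squared image error and its derivative with respect to `center`."""
    image = render(center, len(target), sigma)
    loss, grad = 0.0, 0.0
    for i, (pixel, goal) in enumerate(zip(image, target)):
        residual = pixel - goal
        # Chain rule: d(pixel)/d(center) = pixel * (i - center) / sigma^2
        dpixel_dcenter = pixel * (i - center) / sigma ** 2
        loss += residual ** 2
        grad += 2.0 * residual * dpixel_dcenter
    return loss, grad

def fit_center(target, start, steps=500, lr=0.5):
    """Recover the blob position by gradient descent on the image error."""
    center = start
    for _ in range(steps):
        _, grad = loss_and_grad(center, target)
        center -= lr * grad
    return center
```

With `target = render(9.0)`, calling `fit_center(target, 6.0)` moves the estimate from 6 towards 9. Real systems follow the same loop with millions of parameters and automatic differentiation in place of the hand-written derivative.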
Scene representations can be classified into explicit and implicit representations. Explicit methods describe scenes as a collection of geometric primitives, such as triangles, point-like primitives, or higher-order parametric surfaces.
Source: Advances in Neural Rendering, https://www.neuralrender.com/
One popular class of approaches uses mesh-based representations of scenes with either diffuse or view-dependent appearance. Differentiable rasterizers or pathtracers can directly optimize mesh representations to reproduce a set of input images using gradient descent. However, gradient-based mesh optimization based on image reprojection is often difficult, likely because of local minima or poor conditioning of the loss landscape. Furthermore, this strategy requires a template mesh with fixed topology to be provided as an initialization before optimization [22], which is typically unavailable for unconstrained real-world scenes.
Inverse rendering aims to estimate physical attributes of a scene, e.g., reflectance, geometry, and lighting, from image(s). Also called differentiable rendering, it promises to close the loop between computer vision and graphics.
Novel view synthesis with neural rendering: Volume Rendering with Radiance Fields
In this problem, a neural network learns to render a scene from an arbitrary viewpoint. Both of these works use a volume rendering technique known as ray marching. Ray marching is when you shoot out a ray from the observer (camera) through a 3D volume in space and ask a function: what is the color and opacity at this particular point in space? Neural rendering takes the next step by using a neural network to approximate this function.
Source: Advances in Neural Rendering, https://www.neuralrender.com/
rendering: Representing Scenes as Neural Radiance Fields, published 2020 Mildenhall
Neural Radiance Fields (NeRF) https://www.youtube.com/watch?v=nCpGStnayHk Two Minute Papers explanation
Represent a static scene as a continuous 5D function that outputs the radiance emitted in each direction (theta, phi) at each point (x; y; z) in space, and a density at each point which acts like a differential opacity controlling how much radiance is accumulated by a ray passing through (x; y; z).
Regresses from a single 5D coordinate (x, y, z, theta, phi) to a single volume density and view-dependent RGB color.
NeRF uses a differentiable volume rendering formula to train a coordinate-based multilayer perceptron (MLP) to directly predict color and opacity from 3D position and 2D viewing direction. NeRF is a recent and popular volumetric rendering technique for generating images, owing to its exceptional simplicity and its performance in synthesising high-quality images of complex real-world scenes.
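The volume rendering step can be sketched for a single ray as follows. Given the density and color the network returns at each sample along the ray, plus the spacing between samples, the pixel color is a weighted sum in which each weight is the sample's opacity multiplied by the transmittance accumulated in front of it. This follows the usual discrete quadrature and the variable names are chosen for this illustration.

```python
import math

def render_ray(densities, colors, deltas):
    """Composite samples along one ray, ordered from near to far.

    densities: volume density sigma_i at each sample
    colors:    (r, g, b) at each sample
    deltas:    distance between sample i and the next sample
    """
    transmittance = 1.0  # fraction of light that still reaches the camera
    pixel = [0.0, 0.0, 0.0]
    weights = []
    for sigma, color, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha
        weights.append(weight)
        for c in range(3):
            pixel[c] += weight * color[c]
        transmittance *= 1.0 - alpha
    return pixel, weights
```

Every operation here is differentiable with respect to the densities and colors, which is what allows the MLP to be trained directly from the error between rendered and observed pixels.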
The key idea in NeRF is to represent the entire volume space with a continuous function, parameterised by a multi-layer perceptron (MLP), bypassing the need to discretise the space into voxel grids, which usually suffers from resolution constraints. It allows synthesis of photorealistic new views, though not in real time in its original form.
NeRF is good with complex geometries and deals with occlusion well.
Source: Advances in Neural Rendering, 2021, https://www.neuralrender.com/
Source: Advances in Neural Rendering, https://www.neuralrender.com/
Neural volume rendering refers to methods that generate images or video by tracing a ray into the scene and taking an integral of some sort over the length of the ray. Typically a neural network like a multilayer perceptron (MLP) encodes a function from the 3D coordinates on the ray to quantities like density and color, which are integrated to yield an image. One of the reasons NeRF is able to render with great detail is because it encodes a 3D point and associated view direction on a ray using periodic activation functions, i.e., Fourier Features.
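The Fourier feature mapping mentioned above can be sketched in plain Python: each input coordinate is expanded into sines and cosines at frequencies that double with each step, before it is fed to the MLP. The function names are made up for this illustration, and the default of 10 frequencies is the value commonly cited for positions in the NeRF paper.

```python
import math

def positional_encoding(value, num_frequencies):
    """Map a scalar v to [sin(2^k * pi * v), cos(2^k * pi * v)] for k = 0 .. L-1."""
    encoded = []
    for k in range(num_frequencies):
        angle = (2.0 ** k) * math.pi * value
        encoded.append(math.sin(angle))
        encoded.append(math.cos(angle))
    return encoded

def encode_point(point, num_frequencies=10):
    """Encode every coordinate of a point and concatenate the results."""
    out = []
    for coordinate in point:
        out.extend(positional_encoding(coordinate, num_frequencies))
    return out
```

A 3D point with 10 frequencies becomes a 60-dimensional input vector, which makes it easier for a small MLP to represent fine detail.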
The impact of the NeRF paper lies in its brutal simplicity: just an MLP taking in a 5D coordinate and outputting density and color yields photorealistic results.
The initial model had limitations: training and rendering are slow, and it can only represent static scenes. It “bakes in” lighting. A trained NeRF representation does not generalize to other scenes or objects. All of these problems have since been addressed by follow-up work; there are even real-time NeRFs now.
A good overview can be found in "NeRF Explosion 2020", https://dellaert.github.io/NeRF/
Source: https://towardsdatascience.com/nerfandwhathappenswhengraphicsbecomesdifferentiable88a617561b5d
A deeper integration of graphics knowledge into the network is possible based on differentiable graphics modules. Such a differentiable module can, for example, implement a complete computer graphics renderer, a 3D rotation, or an illumination model. Such components add a physically inspired inductive bias to the network, while still allowing for end-to-end training via backpropagation. This can be used to analytically enforce a truth about the world in the network structure, which frees up network capacity and leads to better generalization, especially if only limited training data is available.
Source: Advances in Neural Rendering, https://www.neuralrender.com/
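As a small illustration of a differentiable graphics module, the sketch below implements a 3D rotation about the z axis together with its analytic derivative, and uses gradient descent to recover an unknown rotation angle. The setup (a single point, a single angle, the function names and the learning rate) is a hypothetical toy, not the method of any particular paper.

```python
import math

def rotate_z(point, theta):
    """Rotate a 3D point about the z axis by theta radians."""
    x, y, z = point
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y, z)

def rotate_z_grad(point, theta):
    """Analytic derivative of rotate_z with respect to theta."""
    x, y, _ = point
    c, s = math.cos(theta), math.sin(theta)
    return (-s * x - c * y, c * x - s * y, 0.0)

def fit_angle(point, target, theta=0.0, lr=0.1, steps=200):
    """Recover the angle that maps point onto target by gradient descent."""
    for _ in range(steps):
        pred = rotate_z(point, theta)
        dpred = rotate_z_grad(point, theta)
        # Chain rule for the squared error sum((pred - target)^2).
        grad = sum(2.0 * (p - t) * d for p, t, d in zip(pred, target, dpred))
        theta -= lr * grad
    return theta
```

Because the rotation is built into the structure, the optimizer only has to find one angle rather than learn what a rotation is, which is the kind of inductive bias the paragraph above describes.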
Unconstrained Images
NeRF in the Wild: Neural Radiance Fields for Unconstrained Photo Collections, Martin-Brualla, 2021
Can handle images with variable illumination (not photometrically static): cars and people may move, construction may begin or end, seasons and weather may change.
Crowdsampling the Plenoptic Function with NeRF, published 2020
Given a large number of tourist photos taken at different times of day, this machine-learning-based approach learns to construct a continuous set of light fields and to synthesize novel views capturing all-times-of-day scene appearance. It achieves convincing changes across a variety of times of day and lighting conditions. Transient objects such as people and cars are masked out during training and evaluation.
Source: https://www.semanticscholar.org/paper/CrowdsamplingthePlenopticFunctionLiXian
Source: Crowdsampling the Plenoptic Function, 2020
The model is learned in an unsupervised manner. The approach takes unstructured Internet photos spanning some range of time-varying appearance in a scene and learns how to reconstruct a plenoptic slice (a representation of the light field that respects temporal structure in the plenoptic function when interpolated over time) for each of the viewing conditions captured in our input data. By designing our model to preserve the structure of real plenoptic functions, we force it to learn time-varying phenomena like the motion of shadows according to sun position. This lets us, for example, recover plenoptic slices for images taken at different times of day and interpolate between them to observe how shadows move as the day progresses (best seen in our supplemental video).
Optimize for photometric loss: the difference between the predicted color of a pixel and the actual color of that pixel defines the photometric loss. This allows us to perform backpropagation on the MLP and minimize the loss. In effect, we learn a representation of the scene that can produce high-quality views from a continuum of viewpoints and viewing conditions that vary with time.
Source: NeRF-W (NeRF in the Wild)
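The photometric loss described above can be written in a few lines. This sketch assumes a mean squared error over RGB values, which is a common choice; the exact reduction (sum or mean) and the function name are assumptions.

```python
def photometric_loss(predicted, target):
    """Mean squared error between predicted and ground-truth RGB pixels.

    predicted, target: sequences of (r, g, b) tuples of equal length.
    """
    total = 0.0
    count = 0
    for pred_pixel, true_pixel in zip(predicted, target):
        for p, t in zip(pred_pixel, true_pixel):
            total += (p - t) ** 2
            count += 1
    return total / count
```

In training, `predicted` would hold the colors rendered for a batch of rays and `target` the colors of the corresponding pixels in the input photos.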
NeRF-W disentangles lighting from the underlying 3D scene geometry. The latter remains consistent even as the former changes.
Source: NeRF in the Wild
 Reference implementation (NeRF and NeRF in the Wild using PyTorch): https://github.com/kwea123/nerf_pl
NeROIC: Neural Object Capture and Rendering from Online Image Collection, 2022
https://formyfamily.github.io/NeROIC/
We present a novel method to acquire object representations from online image collections, capturing high-quality geometry and material properties of arbitrary objects from photographs with varying cameras, illumination, and backgrounds.
Towards Instant 3D Capture (with a cell phone): Nerfies
Source: Advances in Neural Rendering, https://www.neuralrender.com/
Neural Residual Flow Fields for Efficient Video Representations
Article: https://lnkd.in/gsMbpD7
Code Repo: https://lnkd.in/gRWvhra5
Implicit neural representation (INR) has emerged as a powerful paradigm for representing signals, such as images, videos, 3D shapes, etc. Although it has shown the ability to represent fine details, its efficiency as a data representation has not been extensively studied. In INR, the data is stored in the form of parameters of a neural network, and general-purpose optimization algorithms do not generally exploit the spatial and temporal redundancy in signals. In this paper, we suggest a novel INR approach to representing and compressing videos by explicitly removing data redundancy. Instead of storing raw RGB colors, we propose Neural Residual Flow Fields (NRFF), using motion information across video frames and residuals that are necessary to reconstruct a video. Maintaining the motion information, which is usually smoother and less complex than the raw signals, requires far fewer parameters. Furthermore, reusing redundant pixel values further improves the network parameter efficiency. Experimental results have shown that the proposed method outperforms the baseline methods by a significant margin.
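The idea of storing motion plus residuals instead of raw colors can be illustrated with a toy 1D "video". In this sketch the flow and residual are plain lists; in NRFF they would be predicted by a network. The function name, the integer flow and the border clamping are all assumptions made for illustration.

```python
def reconstruct_frame(previous_frame, flow, residual):
    """Rebuild a 1D frame from the previous frame, per-pixel motion and a residual.

    flow[i] says how many pixels to the left the value of pixel i came from;
    residual[i] corrects whatever the motion alone cannot explain.
    """
    last = len(previous_frame) - 1
    frame = []
    for i, (shift, res) in enumerate(zip(flow, residual)):
        source = min(max(i - shift, 0), last)   # clamp to the frame borders
        frame.append(previous_frame[source] + res)
    return frame
```

When most of a frame is explained by motion, the residual is mostly zero, which is why this representation can need fewer parameters than raw pixel values.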
Multiresolution Nerfs
Mip-NeRF supersampling, 2021
The rendering procedure used by neural radiance fields (NeRF) samples a scene with a single ray per pixel and may therefore produce renderings that are excessively blurred or aliased when training or testing images observe scene content at different resolutions. The straightforward solution of supersampling by rendering with multiple rays per pixel is impractical for NeRF, because rendering each ray requires querying a multilayer perceptron hundreds of times. Mip-NeRF instead represents the scene at a continuously valued scale and renders anti-aliased conical frustums instead of rays.
https://github.com/google/mipnerf
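A quick back-of-the-envelope calculation shows why brute-force supersampling is costly. The image size, the number of samples per ray and the supersampling factor below are hypothetical values chosen only for illustration.

```python
def mlp_queries(width, height, rays_per_pixel, samples_per_ray):
    """Number of MLP evaluations needed to render one image."""
    return width * height * rays_per_pixel * samples_per_ray

# Hypothetical 800 x 800 image with 256 samples along each ray.
single_ray = mlp_queries(800, 800, 1, 256)
supersampled = mlp_queries(800, 800, 16, 256)
print(single_ray, supersampled, supersampled // single_ray)
```

The cost grows linearly with the number of rays per pixel, so 16 rays per pixel means 16 times as many network evaluations for the same image.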
See also:
Semantic-Aware Implicit Neural Audio-Driven Video Portrait Generation (NeRF), 2022
Animating high-fidelity video portraits with speech audio is crucial for virtual reality and digital entertainment. While most previous studies rely on accurate explicit structural information, recent works explore the implicit scene representation of Neural Radiance Fields (NeRF) for realistic generation.
Relighting with 4D Incident Light Fields
It is possible to relight and de-light real objects illuminated by a 4D incident light field, which represents the illumination of an environment. By exploiting the richness in angular and spatial variation of the light field, objects can be relit with a high degree of realism.
Relighting with NeRF
Another dimension in which NeRF-style methods have been augmented is in how to deal with lighting, typically through latent codes that can be used to relight a scene. NeRF-W was one of the first follow-up works on NeRF, and optimizes a latent appearance code to enable learning a neural scene representation from less controlled multi-view collections.
Neural Reflectance Fields improve on NeRF by adding a local reflection model in addition to density. It yields impressive relighting results, albeit from single point light sources. NeRV uses a second “visibility” MLP to support arbitrary environment lighting and “one-bounce” indirect illumination.
Source: https://en.wikipedia.org/wiki/Light_stage
Source: Advances in Neural Rendering, https://www.neuralrender.com/
Learning to Factorize and Relight a City
A learning-based framework for disentangling outdoor scenes into temporally-varying illumination and permanent scene factors, using imagery from Google Street View, where the same locations are captured repeatedly through time.
Source: https://en.wikipedia.org/wiki/Light_stage
In-Place Scene Labelling and Understanding with Implicit Scene Representation, 2021
NeRF for computer vision tasks: scene labeling
Decomposing 3D Scenes into Objects via Unsupervised Volume Segmentation
 https://github.com/3a1b2c3/semantic_nerf
 https://nesf3d.github.io/
The intrinsic multiview consistency and smoothness of NeRF benefit semantics by enabling sparse labels to efficiently propagate.
Pose Estimation
iNeRF: Inverting Neural Radiance Fields for Pose Estimation, Yen-Chen et al., IROS 2021
A-NeRF: Surface-free Human 3D Pose Refinement via Neural Rendering, Su et al., arXiv 2021
NeRF--: Neural Radiance Fields Without Known Camera Parameters, Wang et al., arXiv 2021
iMAP: Implicit Mapping and Positioning in Real-Time, Sucar et al., arXiv 2021
GNeRF: GAN-based Neural Radiance Field without Posed Camera, Meng et al., arXiv 2021
BARF: Bundle-Adjusting Neural Radiance Fields, Lin et al., ICCV 2021
Self-Calibrating Neural Radiance Fields, Park et al., ICCV 2021
Nerf methods in comparison
Source: Advances in Neural Rendering, https://arxiv.org/abs/2111.05849, https://www.neuralrender.com/
Combining LIDAR and radiance fields
https://urbanradiancefields.github.io/
Deformable/Animation

https://github.com/hustvl/TiNeuVox
Open problems / research topics

Faster Inference and Rendering

Faster Training

View Synthesis for Dynamic Scenes/ Video

Deformable/Animation

Editing / Editable NeRFs

Generalization https://factorizeacity.github.io/weather.html

Compositionality

Object Category Modeling

Model Reconstruction

Labelling and Depth Estimation
https://github.com/3a1b2c3/seeingSpace/wiki/Relatedfields(Photogrametry,LIDAR,SLAMetc)
https://github.com/3a1b2c3/seeingSpace/wiki/Importantconcepts
https://github.com/3a1b2c3/seeingSpace/wiki/Recommendedresourcesandreading
Conclusion
Over the past year (2020), we’ve learned how to make the rendering process differentiable, and turn it into a deep learning module. This sparks the imagination, because the deep learning motto is: “If it’s differentiable, we can learn through it”. If we know how to differentiably go from 3D to 2D, it means we can use deep learning and backpropagation to go back from 2D to 3D as well.
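A toy version of going back from 2D to 3D through a differentiable forward model: a disc of known size faces a pinhole camera, and its unknown depth is recovered from its apparent size in the image by gradient descent. The scene, the function names, the learning rate and the step count are all hypothetical choices for illustration.

```python
def render_radius(depth, true_radius=1.0, focal=1.0):
    """Forward model (3D to 2D): apparent radius of a camera-facing disc."""
    return focal * true_radius / depth

def recover_depth(observed_radius, depth=1.0, lr=0.5, steps=2000,
                  true_radius=1.0, focal=1.0):
    """Inverse problem (2D to 3D): fit the depth to the observed radius."""
    for _ in range(steps):
        pred = render_radius(depth, true_radius, focal)
        # Derivative of the forward model with respect to depth.
        dpred = -focal * true_radius / (depth * depth)
        # Gradient of the squared error (pred - observed)^2.
        grad = 2.0 * (pred - observed_radius) * dpred
        depth -= lr * grad
    return depth
```

For example, `recover_depth(render_radius(2.0))` starts from a depth guess of 1.0 and moves toward the true depth of 2.0, using nothing but the 2D observation and the derivative of the renderer.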
With neural rendering we no longer need to physically model the scene and simulate the light transport, as this knowledge is now stored implicitly inside the weights of a neural network. The compute required to render an image is also no longer tied to the complexity of the scene (the number of objects, lights, and materials), but rather the size of the neural network.
Neural rendering has already enabled applications that were previously intractable, such as rendering of digital avatars without any manual modeling. Neural rendering could have a profound impact in making complex photo and video editing tasks accessible to a much broader audience.
This is no longer a neural network that is predicting physics. This is physics (or optics) plugged on top of a neural network inside a PyTorch engine. We now have a differentiable simulation of the real world (harnessing the power of computer graphics) on top of a neural representation of it.
Vincent Vanhoucke, Distinguished Scientist at Google: “This is why I call the grand challenge for perception the Inverse Video Game problem: predict not only a static scene but its functional semantics and possible futures. You should be able to take a video, run it through your computer vision model, and get a representation of a scene you can not only parse, but can roll forward in time to generate plausible future behaviors, from any viewpoint, like a video game engine would. It would respect the physics (a ball shot at a goal would follow its normal trajectory), semantics (a table would be movable, a door could be opened), and agents in the scene (people, cars would have reasonable NPC behaviors).”
Source: https://towardsdatascience.com/threegrandchallengesinmachinelearning771e1440eafc
A lot of computer vision and graphics algorithms can be defined by a closed-form solution, which allows for optimizations.
Neural computing is very computationally expensive in part because (by design) it can’t be reduced to a closed-form solution in the same way. Many such closed-form expressions contain integrals that are impossible to solve analytically. We then use approximations (like Markov chain Monte Carlo), but they take a long time to converge. It is really worth it to use a neural network to at least prototype and find the right settings before computing the final version.
Source: https://medium.com/@hurmh92/autonomousdrivingslamand3dmappingrobote3cca3c52e95
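The slow convergence of sampling-based approximations mentioned above can be seen with a very small experiment. The sketch below uses plain Monte Carlo integration (not a Markov chain variant) on an integral whose exact value is known, so the error can be measured; the integrand, seed and function name are arbitrary choices.

```python
import random

def monte_carlo_integral(f, num_samples, seed=0):
    """Estimate the integral of f over [0, 1] by averaging random samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        total += f(rng.random())
    return total / num_samples

# The exact value of the integral of x^2 over [0, 1] is 1/3.
for n in (100, 10_000, 1_000_000):
    estimate = monte_carlo_integral(lambda x: x * x, n)
    print(n, abs(estimate - 1.0 / 3.0))
```

The expected error shrinks only with the square root of the number of samples, so each additional digit of accuracy costs roughly 100 times more samples.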
ICARUS: A Specialized Architecture for Neural Radiance Field Rendering, https://arxiv.org/pdf/2203.01414.pdf