# Sehender Raum : Seeing Space (START READING HERE)

## Notes about capturing, rendering and digitally reconstructing the world

When I learned about traditional computer graphics and photogrammetry, I missed the big picture of how all the pieces connect with hardware, physics and machine learning. That made it harder to understand recent research and its meaning for the field. Rendering 3D models from 2D images remains a challenging problem, but incredible progress has been made since I first became interested in the topic 20 years ago (see below).

Catching up with newer research in image based rendering: A TLDR on how traditional computer graphics fits with computer vision, machine learning and capture hardware.

I am very interested in "neural rendering" and would love to hear from anybody in the field: https://www.linkedin.com/in/katrinschmid/

# "Classic rendering" in computer graphics

Classical computer graphics methods approximate the physical process of image formation in the real world: light sources emit photons that interact with the objects in the scene, as a function of their geometry and material properties, before being recorded by a camera. This process is known as light transport. The process of transforming a scene definition including cameras, lights, surface geometry and material into a simulated camera image is known as rendering.

The two most common approaches to rendering are rasterization and raytracing.

• Rasterization is a feedforward process in which geometry is transformed into the image domain, sometimes in back-to-front order known as painter’s algorithm.
• Raytracing is a process in which rays are cast backwards from the image pixels into a virtual scene, and reflections and refractions are simulated by recursively casting new rays from the intersections with the geometry.

## The rendering equation (published in 1986)

The rendering equation describes physical light transport for a single camera or the human vision. A point in the scene is imaged by measuring the emitted and reflected light that converges on the sensor plane. Radiance (L) represents the ray strength, measuring the combined angular and spatial power densities. Radiance can be used to indicate how much of the power emitted by the light source that is reflected, transmitted or absorbed by a surface will be captured by a camera facing that surface from a specified angle of view.
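For reference, the rendering equation is commonly written in the form below. The symbol names follow a common textbook convention and are not taken from the sources cited here: $L_o$ is the outgoing radiance at surface point $\mathbf{x}$ in direction $\omega_o$, $L_e$ the emitted radiance, $f_r$ the BRDF of the surface, $L_i$ the incoming radiance from direction $\omega_i$, $\mathbf{n}$ the surface normal and $\Omega$ the hemisphere above the surface.

```latex
L_o(\mathbf{x}, \omega_o) = L_e(\mathbf{x}, \omega_o)
  + \int_{\Omega} f_r(\mathbf{x}, \omega_i, \omega_o)\,
    L_i(\mathbf{x}, \omega_i)\, (\omega_i \cdot \mathbf{n})\, \mathrm{d}\omega_i
```

The integral is what makes rendering expensive: the incoming radiance $L_i$ at one point is itself the outgoing radiance of some other point, so the equation is recursive.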

Source: https://www.semanticscholar.org/paper/Inverse-Rendering-and-Relighting-From-Multiple-Plus-Liu-Do/da1ba94e01596d0241d7d426b071ae9731d148b3, https://www.mdpi.com/2072-4292/13/13/2640, Rendering for Data Driven Computational Imaging, Tristan Swedish

# Inverse and Differentiable rendering (aka "computer vision")

Inverse graphics attempts to take sensor data and infer 3D geometry, illumination, materials, and motions such that a graphics renderer could realistically reproduce the observed scene. Renderers, however, are designed to solve the forward process of image synthesis. To go in the other direction, we propose an approximate differentiable renderer (DR) that explicitly models the relationship between changes in model parameters and image observations.

Source: http://rgl.epfl.ch/publications/NimierDavidVicini2019Mitsuba2

Inverse rendering and differentiable rendering have been a topic of research for some time. However, major breakthroughs have only been made in recent years due to improved hardware and advancements in deep learning.

# Neural Rendering, since ca. 2016

Neural rendering is a relatively new technique that combines a classical (or other) 3D representation and renderer with deep neural networks that re-render the classical output into more complete and realistic views. In contrast to Neural Image-based Rendering (N-IBR), neural rerendering does not use input views at runtime, and instead relies on the deep neural network to recover the missing details. Deepfakes are an early neural rendering technique in which a person in an existing image or video is replaced with someone else's likeness. The original approach is believed to be based on Korshunova et al. (2016), which used a convolutional neural network (CNN).

A typical neural rendering approach takes as input images corresponding to certain scene conditions (for example, viewpoint, lighting, layout, etc.), builds a "neural" scene representation from them, and "renders" this representation under novel scene properties to synthesize novel images.

The learned scene representation is not restricted by simple scene modeling approximations and can be optimized for high quality novel images. At the same time, neural rendering approaches incorporate ideas from classical graphics—in the form of input features, scene representations, and network architectures—to make the learning task easier, and the output more controllable. Neural rendering has many important use cases such as semantic photo manipulation, novel view synthesis, relighting, free viewpoint video, as well as facial and body reenactment.

Source: Advances in Neural Rendering, https://www.neuralrender.com/

Artifacts such as ghosting, blur, holes, or seams can arise due to view-dependent effects, imperfect proxy geometry or too few source images. To address these issues, N-IBR methods replace the heuristics often found in classical IBR methods with learned blending functions or corrections that take into account view-dependent effects.

Source: Advances in Neural Rendering, https://www.neuralrender.com/

# Image-based rendering (IBR): Plenoptic function and capture

Computational imaging (CI) is a class of imaging systems that, starting from an imperfect physical measurement and prior knowledge about the class of objects or scenes being imaged, deliver estimates of a specific object or scene presented to the imaging system.

In contrast to classical rendering, which projects 3D content to the 2D plane, image-based rendering techniques generate novel images by transforming an existing set of images, typically by warping and compositing them together. The essence of image-based rendering technology is to obtain all the visual information of the scene directly through images. It is used in computer graphics and computer vision, and it is also widely used in virtual reality technology.

## The Plenoptic function (Adelson and Bergen, 1991)

The world as we see it using our eyes is a continuous three-dimensional function of the spatial coordinates. Generating photo-realistic views of a real-world scene from any viewpoint requires not only understanding the 3D scene geometry, but also modeling the complex viewpoint-dependent appearance that results from light transport phenomena. A photograph is a two-dimensional map of the "number of photons" that map from the three-dimensional scene.

While the rendering equation is a useful model for computer graphics, some problems are easier to solve with a more generalized light model.

### The plenoptic function is also inspired by multi-faceted insect eyes or lens arrays.

Source: https://en.wikipedia.org/wiki/Compound_eye, Rendering for Data Driven Computational Imaging, Tristan Swedish

The plenoptic function describes the degrees of freedom of a light ray with the parameters: Irradiance, position, wavelength, time, angle, phase, polarization, and bounce.

Source: Rendering for Data Driven Computational Imaging, Tristan Swedish, https://www.blitznotes.org/ib/physics/waves.html, https://courses.lumenlearning.com/boundless-chemistry/chapter/the-nature-of-light/

Light has the properties of waves. Like ocean waves, light waves have crests and troughs.

• The distance between one crest and the next, which is the same as the distance between one trough and the next, is called the wavelength.
• Wave phase is the offset of a wave from a given point. When two waves cross paths, they either cancel each other out or reinforce each other, depending on their phase.
• Irradiance is the amount of light energy from one thing hitting a square meter of another each second. Photons that carry this energy have wavelengths from energetic X-rays and gamma rays to visible light to the infrared and radio. The unit of irradiance is the watt per square meter.
• Polarization and bounce are often omitted for simplicity.
• The full equation is also time dependent.
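To make the phase bullet concrete, here is a minimal NumPy sketch that superposes two sine waves of equal wavelength. The function name `superpose` and the sample count are my own choices for illustration.

```python
import numpy as np

def superpose(phase_offset, n_samples=1000):
    """Sum two unit-amplitude sine waves that differ only by a phase offset (radians)."""
    x = np.linspace(0.0, 2.0 * np.pi, n_samples)
    wave_a = np.sin(x)
    wave_b = np.sin(x + phase_offset)
    return wave_a + wave_b

in_phase = superpose(0.0)        # crests align, so the amplitudes add up
out_of_phase = superpose(np.pi)  # crest meets trough, so the waves cancel

print(np.max(np.abs(in_phase)))      # close to 2
print(np.max(np.abs(out_of_phase)))  # close to 0
```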

## Static 5D and 4D Lightfields: capture and rendering (Andrey Gershun, 1936)

If Vx, Vy, Vz are fixed, the plenoptic function describes a panorama at fixed viewpoint (Vx, Vy, Vz). A regular image with a limited field of view can be regarded as an incomplete plenoptic sample at a fixed viewpoint. As long as we stay outside the convex hull of an object or a scene, if we fix the location of the camera on a plane, we can use two parallel planes (u,v) and (s,t) to simplify the complete 5D plenoptic function to a 4D lightfield plenoptic function.
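As a rough sketch of the two-plane parameterization: a ray is identified by where it crosses the (u,v) plane and the (s,t) plane. The placement of the planes at z=0 and z=`plane_gap`, and the function name `two_plane_ray`, are assumptions made for this example.

```python
import numpy as np

def two_plane_ray(u, v, s, t, plane_gap=1.0):
    """Return (origin, unit direction) of the ray through (u, v) on the
    plane z=0 and (s, t) on the parallel plane z=plane_gap."""
    origin = np.array([u, v, 0.0])
    target = np.array([s, t, plane_gap])
    direction = target - origin
    return origin, direction / np.linalg.norm(direction)

origin, direction = two_plane_ray(0.0, 0.0, 0.0, 0.0)
print(direction)  # [0. 0. 1.]: a ray perpendicular to both planes
```

Four numbers are enough because radiance is assumed constant along a ray in free space, which is exactly the "outside the convex hull" condition.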

A light field is a vector function (a mathematical function of one or more variables whose range is a set of multidimensional vectors) that describes the amount of light flowing in every direction through every point in space. It restricts the information to light outside the convex hull of the objects of interest. Under certain assumptions and relaxations, the 7D plenoptic function simplifies to a 4D light field, which is easier to sample and operate on. A hologram is a photographic recording of a light field, rather than an image formed by a lens.

Put differently, a light field describes how light transport occurs throughout a 3D volume. It describes the light rays passing through every coordinate x = (x, y, z) in space in every direction d, given either as angles θ and ϕ or as a unit vector. Together, position and direction form a 5D feature space that describes light transport in a 3D scene.

The magnitude of each light ray is given by the radiance and the space of all possible light rays is given by the five-dimensional plenoptic function. The 4D lightfield has 2D spatial (x,y) and 2D angular (u,v) information that is captured by a plenoptic sensor.

• the incident light field Li(u, v, alpha, beta) describing the irradiance of light incident on objects in space
• the radiant light field Lr (u, v, alpha, beta) quantifying the irradiance created by an object
• time is an optional 5th dimension

### Capturing, storing and compressing static and dynamic light fields

Light field rendering [Levoy and Hanrahan 1996] eschews any geometric reasoning and simply samples images on a regular grid so that new views can be rendered as slices of the sampled light field. Lumigraph rendering [Gortler et al. 1996] showed that using approximate scene geometry can ameliorate artifacts due to undersampled or irregularly sampled views. The plenoptic sampling framework [Chai et al. 2000] analyzes light field rendering using signal processing techniques and shows that the Nyquist view sampling rate for light fields depends on the minimum and maximum scene depths. Furthermore, they discuss how the Nyquist view sampling rate can be lowered with more knowledge of scene geometry. Zhang and Chen [2003] extend this analysis to show how non-Lambertian and occlusion effects increase the spectral support of a light field, and also propose more general view sampling lattice patterns. Rendering algorithms based

One type of light field camera uses an array of micro-lenses placed in front of an otherwise conventional image sensor to sense intensity, color, and directional information. Multi-camera arrays are another type. Compared to a traditional photo camera that only captures the intensity of the incident light, a light-field camera provides angular information for each pixel.

In principle, this additional information allows 2D images to be reconstructed at a given focal plane, and hence a depth map can be computed. While special cameras and camera arrangements have been built to capture light fields, it is also possible to capture them with a conventional camera or smartphone under certain constraints (see Crowdsampling the Plenoptic Function).

Source: Stanford light field camera; Right: Adobe (large) lens array, source https://cs.brown.edu/courses/csci1290/labs/lab_lightfields, "Lytro Illum", a discontinued commercially available light field camera

#### Neural Scene representations

Source: Advances in Neural Rendering, https://www.neuralrender.com/


##### Encodings comparison

Whereas discrete signal representations like pixel images or voxels approximate continuous signals with regularly spaced samples of the signal, these neural fields approximate the continuous signal directly with a continuous, parametric function, i.e., an MLP which takes in coordinates as input and outputs a vector (such as color or occupancy).
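A minimal NumPy sketch of such a coordinate network is shown below. The weights are random and untrained, and the layer sizes, ReLU activations and sigmoid output are assumptions chosen for illustration. The point is only that the function can be queried at any continuous coordinate, not just at pixel centers.

```python
import numpy as np

def init_mlp(layer_sizes, seed=0):
    """Create random (weight, bias) pairs for a small fully connected network."""
    rng = np.random.default_rng(seed)
    return [(rng.normal(0.0, 0.5, size=(n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def neural_field(params, coords):
    """Map (N, 2) continuous coordinates to (N, 3) colors in (0, 1)."""
    h = coords
    for w, b in params[:-1]:
        h = np.maximum(h @ w + b, 0.0)            # ReLU hidden layers
    w, b = params[-1]
    return 1.0 / (1.0 + np.exp(-(h @ w + b)))     # sigmoid keeps colors in range

params = init_mlp([2, 32, 32, 3])
coords = np.array([[0.25, 0.5], [0.2501, 0.5]])   # two arbitrarily close query points
colors = neural_field(params, coords)
print(colors.shape)  # (2, 3)
```

Training would fit `params` so that the output matches the observed signal at sampled coordinates.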

Neural approximations of scalar and vector fields, such as signed distance functions and radiance fields, have emerged as accurate, high-quality representations. State-of-the-art results are obtained by conditioning a neural approximation with a lookup from trainable feature grids.

From Instant Neural Graphics Primitives with a Multiresolution Hash Encoding

A demonstration of the reconstruction quality of different encodings and parametric data structures for storing trainable feature embeddings. Each configuration was trained for 11 000 steps using our fast NeRF implementation (Section 5.4), varying only the input encoding and MLP size. The number of trainable parameters (MLP weights + encoding parameters), training time and reconstruction accuracy (PSNR) are shown below each image. Our encoding (e) with a similar total number of trainable parameters as the frequency encoding configuration (b) trains over 8× faster, due to the sparsity of updates to the parameters and smaller MLP. Increasing the number of parameters (f) further improves reconstruction accuracy without significantly increasing training time.

Comparison of encodings (from the Instant NGP paper). A practical introduction to NeRF: https://keras.io/examples/vision/nerf/#setup

##### Input format: Local Light Field Fusion (LLFF), 2019

Used in the original NeRF paper for still images. It can reconstruct light fields up to the Nyquist frequency limit.

LLFF uses COLMAP to calculate the pose of each camera (the poses file), then uses a trained network to estimate the depth map, and from there generates the MPIs, which are the output used to create the MPI videos (and the metadata).

Open questions about pose recentering: does it need to be applied to real data where the camera poses are arbitrary? Leaving rendering aside, does it affect training, for example when training on LLFF data (real data with arbitrary camera poses) without recenter_poses and with NDC? Intuitively it depends on how much the default world coordinate frame differs from poses_avg. In practice, when using COLMAP, do they differ much?

NDC makes very specific assumptions: the camera faces along -z and is entirely behind the z=-near plane. So if the rotation is wrong it will fail (in its current implementation). This is analogous to how a regular graphics pipeline like OpenGL works.

poses_bounds.npy contains 3x5 pose matrices and 2 depth bounds for each image. Each pose has [R T] as the left 3x4 matrix and [H W F] as the right 3x1 matrix.
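The pose file layout can be unpacked as in the sketch below. It assumes each image is stored as one row of 17 numbers (the flattened 3x5 matrix followed by the two depth bounds). The function name `split_poses_bounds` is hypothetical, and the example row is synthetic rather than real COLMAP output.

```python
import numpy as np

def split_poses_bounds(poses_arr):
    """Split an (N, 17) array into rotations, translations, [H, W, F] and depth bounds."""
    poses = poses_arr[:, :15].reshape(-1, 3, 5)   # one 3x5 matrix per image
    bounds = poses_arr[:, 15:]                    # (N, 2): near and far depth
    rotation = poses[:, :, :3]                    # (N, 3, 3): R
    translation = poses[:, :, 3]                  # (N, 3): T
    hwf = poses[:, :, 4]                          # (N, 3): height, width, focal length
    return rotation, translation, hwf, bounds

# Synthetic example row: identity rotation, translation (1, 2, 3),
# a 480x640 image with focal length 500, depth bounds 0.5 and 10.
pose = np.hstack([np.eye(3), [[1.0], [2.0], [3.0]], [[480.0], [640.0], [500.0]]])
row = np.concatenate([pose.ravel(), [0.5, 10.0]])
rotation, translation, hwf, bounds = split_poses_bounds(row[None, :])
print(hwf[0], bounds[0])
```

With real data the input array would come from something like `np.load("poses_bounds.npy")`.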

##### Multi Sphere Image, Multi-plane image (MPI), local layered representation format and DeepMPI representation (2.5 D), 2020

MSI: a Multi Sphere Image. LM: a Layered Mesh with individual layer textures. LM-TA: a Layered Mesh with a texture atlas.

Neural rendering covers deep image or video generation approaches that enable explicit or implicit control of scene properties such as illumination, camera parameters, pose, geometry, appearance, and semantic structure. MPIs (RGBA) can produce high-quality novel views of complex scenes in real time, with the view consistency that arises from a 3D scene representation (in contrast to neural rendering approaches that decode a separate view for each desired viewpoint).

Our method takes in a set of images of a static scene, promotes each image to a local layered representation (MPI), and blends local light fields rendered from these MPIs to render novel views. As a rule of thumb, you should use images where the **maximum disparity between views is no more than about 64 pixels** (watch the closest thing to the camera and don't let it move more than ~1/8 the horizontal field of view between images). Our datasets usually consist of 20-30 images captured handheld in a rough grid pattern. Source: https://github.com/Fyusion/LLFF
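Rendering an MPI comes down to alpha compositing its RGBA planes from back to front. The sketch below shows only that compositing step with the standard "over" operator. Warping each plane into the novel view is omitted, and the function name `composite_mpi` and the array layout are assumptions of this example.

```python
import numpy as np

def composite_mpi(rgba_planes):
    """Over-composite RGBA planes ordered back to front.

    rgba_planes: (D, H, W, 4) array with values in [0, 1]."""
    height, width = rgba_planes.shape[1:3]
    image = np.zeros((height, width, 3))
    for plane in rgba_planes:
        rgb, alpha = plane[..., :3], plane[..., 3:4]
        image = rgb * alpha + image * (1.0 - alpha)   # nearer plane covers what is behind
    return image

# One-pixel example: opaque red plane behind a half-transparent green plane.
back = np.array([[[1.0, 0.0, 0.0, 1.0]]])
front = np.array([[[0.0, 1.0, 0.0, 0.5]]])
print(composite_mpi(np.stack([back, front])))  # [[[0.5 0.5 0. ]]]
```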

The depth-wise resolution of an MPI is limited by the number of discrete planes, and thus MPIs cannot be converted to other 3D representations such as mesh, point cloud, etc.

DeepMPI (2020) extends prior work on multiplane images (MPIs) to model viewing conditions that vary with time. From the paper: "Our work makes three key contributions: first, a representation, called a DeepMPI, for neural rendering that extends prior work on multiplane images (MPIs) [68] to model viewing conditions that vary with time; second, a method for training DeepMPIs on sparse, unstructured crowdsampled data that is unregistered in time". A footnote in the paper adds that [1] describes the plenoptic function as 7D, but that this can be reduced to a 4D color lightfield supplemented by time by applying the later observations of [33].

##### Parametric Encoding: Acorn: Adaptive coordinate networks for neural scene representation, 2021

This parametric approach uses a tree subdivision of the domain R^d, wherein a large auxiliary coordinate encoder neural network (ACORN) [Martel et al. 2021] is trained to output dense feature grids in the leaf node around x. These dense feature grids, which have on the order of 10 000 entries, are then linearly interpolated, as in Liu et al. [2020]. This approach tends to yield a larger degree of adaptivity compared with the previous parametric encodings, albeit at greater computational cost, which can only be amortized when sufficiently many inputs x fall into each leaf node.

Source: https://www.computationalimaging.org/publications/acorn

##### Conversion to mesh or voxel

Neural radiance fields (NeRF) use techniques from volume rendering to accumulate samples of the scene representation along rays, so that the scene can be rendered from any viewpoint.

The neural network can also be converted to a mesh in certain circumstances (see https://github.com/bmild/nerf/blob/master/extract_mesh.ipynb). To do so, we first need to infer which locations are occupied by the object. This is done by first creating a grid volume in the form of a cuboid covering the whole object, then using the NeRF model to predict whether each cell is occupied or not. This is the main reason why mesh construction is only available for 360° inward-facing scenes and not for forward-facing scenes.
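The occupancy step can be sketched as follows. Since no trained model is at hand here, `toy_density` is a hypothetical stand-in that returns the density of a solid sphere. The grid extent, resolution and threshold are likewise assumptions for illustration.

```python
import numpy as np

def toy_density(points):
    """Hypothetical stand-in for a trained NeRF's density: a solid sphere of radius 0.5."""
    return np.where(np.linalg.norm(points, axis=-1) < 0.5, 10.0, 0.0)

def occupancy_grid(density_fn, lo=-1.0, hi=1.0, resolution=32, threshold=1.0):
    """Query the density on a regular cuboid grid and threshold it into occupied cells."""
    axis = np.linspace(lo, hi, resolution)
    grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)  # (R, R, R, 3)
    sigma = density_fn(grid.reshape(-1, 3)).reshape(resolution, resolution, resolution)
    return sigma > threshold

occupied = occupancy_grid(toy_density)
print(occupied.shape, occupied.mean())  # grid shape and fraction of occupied cells
```

A marching cubes pass over the boolean (or raw density) grid would then turn the occupied region into a triangle mesh.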

##### Point-Based Rendering

ADOP: Approximate Differentiable One-Pixel Point Rendering, https://t.co/npOqsAstAx https://t.co/LE4ZdckQPO

##### PlenOctrees For Real-time Rendering of Neural Radiance Fields, 2021, NeRF-SH

PlenOctrees are an octree-based 3D representation, supporting view-dependent effects, that makes it possible to render Neural Radiance Fields (NeRFs) in real time. From the abstract: "Our method can render 800x800 images at more than 150 FPS, which is over 3000 times faster than conventional NeRFs. We do so without sacrificing quality while preserving the ability of NeRFs to perform free-viewpoint rendering of scenes with arbitrary geometry and view-dependent effects. Real-time performance is achieved by pre-tabulating the NeRF into a PlenOctree. In order to preserve view-dependent effects such as specularities, we factorize the appearance via closed-form spherical basis functions. Specifically, we show that it is possible to train NeRFs to predict a spherical harmonic representation of radiance, removing the viewing direction as an input to the neural network. Furthermore, we show that PlenOctrees can be directly optimized to further minimize the reconstruction loss, which leads to equal or better quality compared to competing methods. Moreover, this octree optimization step can be used to reduce the training time, as we no longer need to wait for the NeRF training to converge fully. Our real-time neural rendering approach may potentially enable new applications such as 6-DOF industrial and product visualizations, as well as next generation AR/VR systems. PlenOctrees are amenable to in-browser rendering as well." See https://alexyu.net/plenoctrees/
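To illustrate what a "spherical harmonic representation of radiance" buys, the sketch below evaluates a view-dependent color from stored coefficients without any network call. It is a simplified illustration and not the PlenOctrees implementation: it stops at degree 1 (four basis functions), uses one common sign convention for the real SH basis, and the sigmoid that squashes the result into [0, 1] is an assumption of this sketch.

```python
import numpy as np

SH_C0 = 0.28209479177387814   # constant (degree 0) basis function
SH_C1 = 0.4886025119029199    # magnitude of the three degree-1 basis functions

def sh_color(coeffs, view_dir):
    """Evaluate RGB for a viewing direction from per-channel SH coefficients.

    coeffs: (3, 4) array, one row of degree<=1 coefficients per color channel.
    view_dir: length-3 direction vector (normalized inside the function)."""
    d = np.asarray(view_dir, dtype=np.float64)
    x, y, z = d / np.linalg.norm(d)
    basis = np.array([SH_C0, -SH_C1 * y, SH_C1 * z, -SH_C1 * x])
    return 1.0 / (1.0 + np.exp(-(coeffs @ basis)))   # sigmoid maps to (0, 1)

coeffs = np.zeros((3, 4))
coeffs[:, 0] = 1.0      # view-independent base color
coeffs[0, 2] = 2.0      # red channel gets brighter when viewed along +z
print(sh_color(coeffs, [0.0, 0.0, 1.0]), sh_color(coeffs, [0.0, 0.0, -1.0]))
```

Because the direction only enters through the fixed basis functions, the coefficients can be stored per octree leaf and evaluated for any camera at render time.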

##### Plenoxels: Radiance Fields without Neural Networks, 2021

Proposes a view-dependent sparse voxel model, Plenoxel (plenoptic volume element), that can optimize to the same fidelity as Neural Radiance Fields (NeRFs) without any neural networks. Our typical optimization time is 11 minutes on a single GPU, a speedup of two orders of magnitude compared to NeRF.

##### Point-NeRF: Point-based Neural Radiance Fields, 2022

Volumetric neural rendering methods like NeRF generate high-quality view synthesis results but are optimized per-scene leading to prohibitive reconstruction time. On the other hand, deep multi-view stereo methods can quickly reconstruct scene geometry via direct network inference. Point-NeRF combines the advantages of these two approaches by using neural 3D point clouds, with associated neural features, to model a radiance field.

https://arxiv.org/abs/2201.08845

##### Mixed scene representations for neural rendering, 2021

Mixture of Volumetric Primitives for Efficient Neural Rendering

###### Instant Neural Graphics Primitives

Implementation of four neural graphics primitives: neural radiance fields (NeRF), signed distance functions (SDFs), neural images, and neural volumes. In each case, an MLP with multiresolution hash input encoding is trained and rendered using the tiny-cuda-nn framework.
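The idea of a multiresolution hash encoding can be sketched in plain NumPy. This is a simplified illustration, not the tiny-cuda-nn implementation: every level is hashed (the real system indexes coarse levels directly), the tables hold random instead of trained features, and `base_res`, `growth`, the table size and the number of levels are arbitrary choices. The hash itself, an XOR of the integer coordinates multiplied by large primes and taken modulo the table size, follows the paper.

```python
import numpy as np

PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)

def hash_index(corner, table_size):
    """Spatial hash of integer 3D grid coordinates into [0, table_size)."""
    c = corner.astype(np.uint64) * PRIMES
    return int((c[0] ^ c[1] ^ c[2]) % np.uint64(table_size))

def hash_encode(point, tables, base_res=16, growth=2.0):
    """Encode a point in [0, 1]^3 by trilinearly interpolating hashed features per level."""
    features = []
    for level, table in enumerate(tables):
        res = int(base_res * growth ** level)          # grid resolution of this level
        scaled = point * res
        lower = np.floor(scaled).astype(np.int64)
        frac = scaled - lower
        level_feat = np.zeros(table.shape[1])
        for offset in np.ndindex(2, 2, 2):             # 8 corners of the enclosing cell
            offset = np.array(offset)
            weight = np.prod(np.where(offset == 1, frac, 1.0 - frac))
            level_feat += weight * table[hash_index(lower + offset, len(table))]
        features.append(level_feat)
    return np.concatenate(features)                    # input vector for the small MLP

rng = np.random.default_rng(0)
tables = [rng.normal(0.0, 1e-2, size=(2 ** 14, 2)) for _ in range(4)]  # 4 levels, 2 features each
encoding = hash_encode(np.array([0.3, 0.6, 0.9]), tables)
print(encoding.shape)  # (8,)
```

During training the table entries are updated by gradient descent together with the MLP weights. Only the few entries touched by each sample receive gradients, which is the sparsity of updates the paper credits for its speed.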

https://github.com/NVlabs/instant-ngp, needs RTX graphics card

TODO

### Novel (virtual) 2D view synthesis from plenoptic samples

Synthesize plenoptic slices that can be interpolated to recover local regions of the full plenoptic function. Given a dense sampling of views, photorealistic novel views can be reconstructed by simple light field sample interpolation techniques. For novel view synthesis with sparser view sampling, the computer vision and graphics communities have made significant progress by predicting traditional geometry and appearance representations from observed images. The study of image-based rendering is motivated by a simple question: how do we use a finite set of images to reconstruct an infinite set of views?

View synthesis can be approached by either explicit estimation of scene geometry and color, or using coarser estimates of geometry to guide interpolation between captured views. One approach aims to explicitly reconstruct the surface geometry and the appearance on the surface from the observed sparse views. Other approaches adopt volume-based representations to directly model the appearance of the entire space and use volumetric rendering techniques to generate images for 2D displays. The raw samples of a light field are saved as disks, and at high resolution this results in large amounts of data.

Source: Advances in Neural Rendering, https://www.neuralrender.com/

Source: https://github.com/Arne-Petersen/Plenoptic-Simulation, A System for Acquiring, Processing, and Rendering Panoramic Light Field Stills for Virtual Reality


Light field rendering pushes the latter strategy to an extreme by using dense structured sampling of the light field to make reconstruction guarantees independent of specific scene geometry. Most image-based rendering algorithms are designed to model static appearance. DeepMPI (Deep Multiplane Images) additionally captures appearance that depends on the viewing conditions.

Camera calibration is often assumed to be a prerequisite, while in practice this information is rarely accessible and has to be pre-computed with conventional techniques such as SfM.

#### 3D scene reconstruction and inverse and differentiable rendering

##### Inverse rendering and differentiable rendering: explicitly reconstructing the scene

The key concept behind neural rendering approaches is that they are differentiable. A differentiable function is one whose derivative exists at each point in the domain. This is important because machine learning is basically the chain rule with extra steps: a differentiable rendering function can be learned with data, one gradient descent step at a time. Learning a rendering function statistically through data is fundamentally different from the classic rendering methods we described above, which calculate and extrapolate from the known laws of physics.
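A toy example of "one gradient descent step at a time": the sketch below recovers a single unknown scene parameter (an albedo) from an observed image by differentiating through a deliberately trivial renderer. The renderer, the parameter names and the learning rate are all made up for illustration. Real differentiable renderers compute such gradients for millions of parameters.

```python
import numpy as np

def render(albedo, shading):
    """Toy differentiable 'renderer': pixel value = albedo * fixed shading."""
    return albedo * shading

shading = np.array([0.2, 0.5, 0.9, 1.0])      # known per-pixel shading
target = render(0.7, shading)                 # observed image, true albedo is 0.7

albedo = 0.1                                  # initial guess
learning_rate = 0.5
for step in range(200):
    residual = render(albedo, shading) - target
    loss = np.mean(residual ** 2)             # photometric error
    grad = np.mean(2.0 * residual * shading)  # chain rule: d(loss)/d(albedo)
    albedo -= learning_rate * grad            # one gradient descent step

print(albedo)  # converges towards 0.7
```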

Scene representations can be classified into explicit and implicit representations. Explicit methods describe scenes as a collection of geometric primitives, such as triangles, point-like primitives, or higher-order parametric surfaces.

Source: Advances in Neural Rendering, https://www.neuralrender.com/

One popular class of approaches uses mesh-based representations of scenes with either diffuse or view-dependent appearance. Differentiable rasterizers or pathtracers can directly optimize mesh representations to reproduce a set of input images using gradient descent. However, gradient-based mesh optimization based on image reprojection is often difficult, likely because of local minima or poor conditioning of the loss landscape. Furthermore, this strategy requires a template mesh with fixed topology to be provided as an initialization before optimization [22], which is typically unavailable for unconstrained real-world scenes.

Inverse rendering aims to estimate physical attributes of a scene, e.g., reflectance, geometry, and lighting, from image(s). Often discussed together with differentiable rendering, it promises to close the loop between computer vision and graphics.

#### Novel view synthesis with neural rendering: Volume Rendering with Radiance Fields

In this problem, a neural network learns to render a scene from an arbitrary viewpoint. These methods use a volume rendering technique known as ray marching. Ray marching is when you shoot out a ray from the observer (camera) through a 3D volume in space and ask a function: what is the color and opacity at this particular point in space? Neural rendering takes the next step by using a neural network to approximate this function.
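Ray marching can be sketched in a few lines of NumPy. The function that is queried along the ray, `toy_field`, is a hypothetical stand-in for the neural network (a red sphere with constant density). The accumulation uses the usual volume rendering quadrature, in which each sample contributes its color weighted by its own opacity and by the transmittance of everything in front of it.

```python
import numpy as np

def toy_field(points):
    """Hypothetical scene function: a red sphere of radius 0.5 at the origin."""
    inside = np.linalg.norm(points, axis=-1) < 0.5
    density = np.where(inside, 50.0, 0.0)
    color = np.tile(np.array([1.0, 0.0, 0.0]), (len(points), 1))
    return density, color

def march_ray(origin, direction, field_fn, near=0.0, far=4.0, n_samples=128):
    """Accumulate color and opacity along one ray."""
    t = np.linspace(near, far, n_samples)
    points = origin + t[:, None] * direction
    density, color = field_fn(points)
    delta = np.append(np.diff(t), 1e10)                  # distance between samples
    alpha = 1.0 - np.exp(-density * delta)               # opacity of each segment
    transmittance = np.cumprod(np.append(1.0, 1.0 - alpha[:-1]))
    weights = transmittance * alpha
    return (weights[:, None] * color).sum(axis=0), weights.sum()

pixel, opacity = march_ray(np.array([0.0, 0.0, -2.0]), np.array([0.0, 0.0, 1.0]), toy_field)
print(pixel, opacity)  # a ray through the sphere comes out red and nearly opaque
```

Every operation here is differentiable with respect to the densities and colors, which is what allows the queried function to be trained from images.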

Source: Advances in Neural Rendering, https://www.neuralrender.com/

##### Neural Radiance Fields (NeRF) rendering: Representing Scenes as Neural Radiance Fields, published 2020 Mildenhall

Represent a static scene as a continuous 5D function that outputs the radiance emitted in each direction (theta, phi) at each point (x; y; z) in space, and a density at each point which acts like a differential opacity controlling how much radiance is accumulated by a ray passing through (x; y; z).

NeRF regresses from a single 5D coordinate (x; y; z; theta, phi) to a single volume density and view-dependent RGB color.

NeRF uses a differentiable volume rendering formula to train a coordinate-based multilayer perceptron (MLP) to directly predict color and opacity from 3D position and 2D viewing direction. It is a recent and popular volumetric rendering technique, owing to its exceptional simplicity and its performance in synthesising high-quality images of complex real-world scenes.

The key idea in NeRF is to represent the entire volume space with a continuous function, parameterised by a multi-layer perceptron (MLP), bypassing the need to discretise the space into voxel grids, which usually suffers from resolution constraints. It allows synthesis of photorealistic new views, although the original model does not render in real time.

NeRF is good with complex geometries and deals with occlusion well.

Source: Advances in Neural Rendering, 2021, https://www.neuralrender.com/


Neural volume rendering refers to methods that generate images or video by tracing a ray into the scene and taking an integral of some sort over the length of the ray. Typically a neural network like a multi-layer perceptron (MLP) encodes a function from the 3D coordinates on the ray to quantities like density and color, which are integrated to yield an image. One of the reasons NeRF is able to render with great detail is that it encodes a 3D point and the associated view direction on a ray using a sinusoidal positional encoding, i.e., Fourier features.
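A minimal sketch of such a Fourier feature encoding: each input coordinate is expanded into sines and cosines at frequencies that double from one band to the next. The default of 10 frequency bands and the ordering of the output values are choices made for this example.

```python
import numpy as np

def positional_encoding(p, num_freqs=10):
    """Map each coordinate to sines and cosines at frequencies 2^k * pi, k < num_freqs."""
    p = np.asarray(p, dtype=np.float64)
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi
    angles = p[..., None] * freqs                       # (..., dims, num_freqs)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*p.shape[:-1], -1)               # flatten per input point

encoded = positional_encoding([0.1, 0.2, 0.3])
print(encoded.shape)  # (60,)
```

The high-frequency terms let a small MLP distinguish nearby points, which a network fed raw coordinates tends to blur together.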

The impact of the NeRF paper lies in its brutal simplicity: just an MLP taking in a 5D coordinate and outputting density and color yields photorealistic results. The initial model had limitations: training and rendering are slow, it can only represent static scenes, it "bakes in" lighting, and a trained NeRF representation does not generalize to other scenes or objects. All of these problems have since been worked on in follow-up research, and there are even real-time NeRFs now.
A good overview can be found in "NeRF Explosion 2020", https://dellaert.github.io/NeRF.

https://user-images.githubusercontent.com/74843139/135747420-4d91bc80-2893-44a4-8d32-16bf7024b4f2.mp4

A deeper integration of graphics knowledge into the network is possible based on differentiable graphics modules. Such a differentiable module can for example implement a complete computer graphics renderer, a 3D rotation, or an illumination model. Such components add a physically inspired inductive bias to the network, while still allowing for end-to-end training via backpropagation. This can be used to analytically enforce a truth about the world in the network structure, frees up network capacity, and leads to better generalization, especially if only limited training data is available.

Source: Advances in Neural Rendering, https://www.neuralrender.com/

#### Unconstrained Images

###### NeRF in the Wild: Neural Radiance Fields for Unconstrained Photo Collections, Martin-Brualla, 2021

Can handle images with variable illumination (not photometrically static): cars and people may move, construction may begin or end, seasons and weather may change.

###### Crowdsampling the Plenoptic Function, published 2020

Given a large number of tourist photos taken at different times of day, this machine learning based approach learns to construct a continuous set of light fields and to synthesize novel views capturing all-times-of-day scene appearance. It achieves convincing changes across a variety of times of day and lighting conditions. Transient objects such as people and cars are masked out during training and evaluation.

Source: https://www.semanticscholar.org/paper/Crowdsampling-the-Plenoptic-Function-Li-Xian

Source: Crowdsampling the Plenoptic Function, 2020

The approach works in an unsupervised manner. It takes unstructured Internet photos spanning some range of time-varying appearance in a scene and learns how to reconstruct a plenoptic slice, a representation of the light field that respects temporal structure in the plenoptic function when interpolated over time, for each of the viewing conditions captured in our input data. By designing our model to preserve the structure of real plenoptic functions, we force it to learn time-varying phenomena like the motion of shadows according to sun position. This lets us, for example, recover plenoptic slices for images taken at different times of day and interpolate between them to observe how shadows move as the day progresses (best seen in our supplemental video).

Optimize for photometric loss: the difference between the predicted color of the pixel (shown in Figure 9) and the actual color of the pixel is the photometric loss. This eventually allows us to perform backpropagation on the MLP and minimize the loss. In effect, we learn a representation of the scene that can produce high-quality views from a continuum of viewpoints and viewing conditions that vary with time.
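In its simplest form the photometric loss is a mean squared error over pixel colors, and the PSNR figures reported in many of the papers above are a logarithmic rescaling of it. The sketch below assumes colors in [0, 1]. The pixel values are made up.

```python
import numpy as np

def photometric_loss(predicted, target):
    """Mean squared error between predicted and observed pixel colors."""
    return np.mean((predicted - target) ** 2)

def psnr(mse, max_value=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * np.log10(max_value ** 2 / mse)

predicted = np.array([[0.5, 0.5, 0.5], [0.2, 0.4, 0.6]])   # two rendered pixels (RGB)
target = np.array([[0.6, 0.5, 0.4], [0.2, 0.4, 0.6]])      # the same pixels in the photo
loss = photometric_loss(predicted, target)
print(loss, psnr(loss))
```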

Source: NeRF in the Wild (NeRF-W). NeRF-W disentangles lighting from the underlying 3D scene geometry. The latter remains consistent even as the former changes.

Source: nerf in the wild

###### NeROIC: Neural Object Capture and Rendering from Online Image Collection, 2022

https://formyfamily.github.io/NeROIC/ We present a novel method to acquire object representations from online image collections, capturing high-quality geometry and material properties of arbitrary objects from photographs with varying cameras, illumination, and backgrounds.

#### Towards Instant 3D Capture (with a cell phone): Nerfies

Source: Advances in Neural Rendering, https://www.neuralrender.com/

#### Neural Residual Flow Fields for Efficient Video Representations

Article: https://lnkd.in/gsMbpD-7

Code Repo: https://lnkd.in/gRWvhra5

Implicit neural representation (INR) has emerged as a powerful paradigm for representing signals, such as images, videos, 3D shapes, etc. Although it has shown the ability to represent fine details, its efficiency as a data representation has not been extensively studied. In INR, the data is stored in the form of parameters of a neural network and general purpose optimization algorithms do not generally exploit the spatial and temporal redundancy in signals. In this paper, we suggest a novel INR approach to representing and compressing videos by explicitly removing data redundancy. Instead of storing raw RGB colors, we propose Neural Residual Flow Fields (NRFF), using motion information across video frames and residuals that are necessary to reconstruct a video. Maintaining the motion information, which is usually smoother and less complex than the raw signals, requires far fewer parameters. Furthermore, reusing redundant pixel values further improves the network parameter efficiency. Experimental results have shown that the proposed method outperforms the baseline methods by a significant margin.
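To illustrate the "motion plus residual" idea, here is a toy sketch in which a frame is rebuilt by warping the previous frame with a flow field and adding a residual. In the paper the flow and residual come from neural networks. Here they are plain arrays with integer-valued flow, and the helper names `warp_with_flow` and `reconstruct_frame` are hypothetical.

```python
import numpy as np

def warp_with_flow(prev_frame, flow):
    """Backward warp: output[y, x] = prev_frame[y + dy, x + dx], clamped at the border."""
    h, w = prev_frame.shape[:2]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_y = np.clip(ys + flow[..., 0], 0, h - 1)
    src_x = np.clip(xs + flow[..., 1], 0, w - 1)
    return prev_frame[src_y, src_x]

def reconstruct_frame(prev_frame, flow, residual):
    """Reuse pixels from the previous frame, then correct what motion cannot explain."""
    return warp_with_flow(prev_frame, flow) + residual

# Toy video: the second frame is the first one shifted right by one pixel.
prev_frame = np.arange(16, dtype=np.float64).reshape(4, 4)
next_frame = np.roll(prev_frame, 1, axis=1)

# Every pixel looks one step to the left in the previous frame (dy = 0, dx = -1).
flow = np.zeros((4, 4, 2), dtype=np.int64)
flow[..., 1] = -1

# The residual only has to store what the warp gets wrong (here: the first column).
residual = next_frame - warp_with_flow(prev_frame, flow)
reconstructed = reconstruct_frame(prev_frame, flow, residual)
```

The flow is constant and the residual is zero outside one column. Both are far simpler signals than the raw frame, which is the redundancy the method is exploiting.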

### Multi-resolution NeRFs

#### Mip-NeRF supersampling, 2021

The rendering procedure used by neural radiance fields (NeRF) samples a scene with a single ray per pixel and may therefore produce renderings that are excessively blurred or aliased when training or testing images observe scene content at different resolutions. The straightforward solution of supersampling by rendering with multiple rays per pixel is impractical for NeRF, because rendering each ray requires querying a multilayer perceptron hundreds of times. https://github.com/google/mipnerf
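A rough back-of-the-envelope count shows why brute-force supersampling is impractical. The image size, the number of samples per ray and the 16 rays per pixel below are hypothetical figures chosen for illustration, not numbers from the paper.

```python
def mlp_queries_per_image(width, height, samples_per_ray, rays_per_pixel=1):
    """Number of MLP evaluations needed to render one image."""
    return width * height * rays_per_pixel * samples_per_ray

# Assumed setup: 800 x 800 image, 256 MLP queries along each ray.
single_ray = mlp_queries_per_image(800, 800, 256)
supersampled = mlp_queries_per_image(800, 800, 256, rays_per_pixel=16)

print(f"{single_ray:,} queries with one ray per pixel")
print(f"{supersampled:,} queries with 16 rays per pixel")
```

Under these assumptions one image already needs about 164 million network evaluations, and 4 x 4 supersampling raises that to about 2.6 billion.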

#### Building NeRF at City Scale, 2021

Instead of requiring pictures taken a few centimeters apart, this approach can handle pictures taken thousands of kilometers apart, ranging from satellite imagery to pictures taken on the road. NeRF alone fails to use such drastically different pictures to reconstruct the scenes. CityNeRF is capable of packing city-scale 3D scenes into a unified model, which preserves high-quality details across scales varying from satellite-level to ground-level.


The method first trains the neural network successively from distant viewpoints to close-up viewpoints, and also trains it on the transitions between these "levels". This was inspired by "level of detail" systems currently in use by traditional 3D computer rendering systems. "Joint training on all scales results in blurry texture in close views and incomplete geometry in remote views. Separate training on each scale yields inconsistent geometries and textures between successive scales." So the system starts at the most distant level and incorporates more and more information from the next closer level as it progresses from level to level.

The method also modifies the neural network itself at each level by adding what the authors call a "block". A block has two separate information flows, one for the more distant level and one for the closer level being trained at that moment. It is designed in such a way that a set of information called "base" information is determined for the more distant level, and then "residual" information (in the form of colors and densities) that modifies the "base" and adds detail is calculated from there.
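The base and residual idea can be sketched as follows. The additive composition and the function name `compose_output` are assumptions made for illustration. In the real system the base and residual values are predicted by network blocks rather than given as constants.

```python
import numpy as np

def compose_output(base, residuals, level):
    """Add the residuals of the first `level` blocks on top of the base prediction.

    `base` and each residual hold (r, g, b, density) for one sample point.
    """
    out = np.array(base, dtype=np.float64)
    for residual in residuals[:level]:
        out = out + residual
    return out

# Hypothetical predictions for one 3D sample point.
base = [0.5, 0.5, 0.5, 1.0]          # coarse estimate from the most distant level
residuals = [
    [0.1, 0.0, -0.1, 0.5],           # detail added by the first closer level
    [0.0, 0.05, 0.0, 0.25],          # detail added by the next closer level
]

distant_view = compose_output(base, residuals, level=0)
closest_view = compose_output(base, residuals, level=2)
```

A distant view only uses the base, while a close-up view stacks every residual on top of it, so all levels share one consistent underlying estimate.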

As current CityNeRF is built upon static scenes, it cannot handle inconsistency in the training data. We observed that, in Google Earth Studio [1], objects with slender geometry, such as a lightning rod, flicker as the camera pulls away. Artifacts like flickering moiré patterns in the windows of skyscrapers, and differences in detail manifested as distinct square regions on the globe are also observed in the rendered images served as the ground truths. Such defects lead to unstable rendering results around certain regions and bring about inconsistencies. A potential remedy is to treat it as a dynamic scene and associate each view with an appearance code that is jointly optimized as suggested in [6, 10]. Another potential limitation is on computation. The progressive strategy naturally takes longer training time, hence requires more computational resources.

#### Block-NeRF, 2022

Block-NeRF is a method that enables large-scale scene reconstruction by representing the environment using multiple compact NeRFs that each fit into memory. At inference time, Block-NeRF seamlessly combines renderings of the relevant NeRFs for the given area. In this example, we reconstruct the Alamo Square neighborhood in San Francisco using data collected over 3 months. Block-NeRF can update individual blocks of the environment without retraining on the entire scene, as demonstrated by the construction on the right.
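One way to combine the renderings of neighbouring blocks is to weight each by the inverse distance between the camera and the block centre. The sketch below assumes that scheme. The text above only says the renderings are combined seamlessly, so the function name `blend_block_renderings` and the `power` exponent are hypothetical.

```python
import numpy as np

def blend_block_renderings(camera_pos, block_centers, renderings, power=4.0):
    """Blend per-block images with inverse-distance weights (closer blocks count more)."""
    camera_pos = np.asarray(camera_pos, dtype=np.float64)
    centers = np.asarray(block_centers, dtype=np.float64)
    images = np.asarray(renderings, dtype=np.float64)      # shape (N, H, W, 3)

    dist = np.linalg.norm(centers - camera_pos, axis=1)
    weights = 1.0 / np.maximum(dist, 1e-8) ** power         # guard against division by zero
    weights /= weights.sum()
    blended = np.tensordot(weights, images, axes=1)         # weighted sum over the N blocks
    return blended, weights

# Two hypothetical blocks that rendered the same view as all-black and all-white.
centers = [[1.0, 0.0], [-1.0, 0.0]]
images = [np.zeros((2, 2, 3)), np.ones((2, 2, 3))]

midway, w_mid = blend_block_renderings([0.0, 0.0], centers, images)
near_first, w_near = blend_block_renderings([0.5, 0.0], centers, images)
```

Halfway between the blocks both renderings contribute equally. As the camera moves towards one block centre, that block's rendering dominates, which avoids a visible seam at block boundaries.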

Video results can be found on the project website waymo.com/research/block-nerf.

#### Mega-NeRF: Scalable Construction of Large-Scale NeRFs for Virtual Fly-Throughs

We explore how to leverage neural radiance fields (NeRFs) to build interactive 3D environments from large-scale visual captures spanning buildings or even multiple city blocks collected primarily from drone data. In contrast to the single object scenes against which NeRFs have been traditionally evaluated, this setting poses multiple challenges including (1) the need to incorporate thousands of images with varying lighting conditions, all of which capture only a small subset of the scene, (2) prohibitively high model capacity and ray sampling requirements beyond what can be naively trained on a single GPU, and (3) an arbitrarily large number of possible viewpoints that make it unfeasible to precompute all relevant information beforehand (as real-time NeRF renderers typically do). https://arxiv.org/abs/2112.10703

#### Semantic-Aware Implicit Neural Audio-Driven Video Portrait Generation (NeRF), 2022

Animating high-fidelity video portraits with speech audio is crucial for virtual reality and digital entertainment. While most previous studies rely on accurate explicit structural information, recent works explore the implicit scene representation of Neural Radiance Fields (NeRF) for realistic generation.

### Relighting with 4D Incident Light Fields

It is possible to re-light and de-light real objects illuminated by a 4D incident light field, representing the illumination of an environment. By exploiting the richness in angular and spatial variation of the light field, objects can be relit with a high degree of realism.


#### Relighting with NeRF

Another dimension in which NeRF-style methods have been augmented is in how to deal with lighting, typically through latent codes that can be used to re-light a scene. NeRF-W was one of the first follow-up works on NeRF, and optimizes a latent appearance code to enable learning a neural scene representation from less controlled multi-view collections.

Neural Reflectance Fields improve on NeRF by adding a local reflection model in addition to density. It yields impressive relighting results, albeit from single point light sources. NeRV uses a second “visibility” MLP to support arbitrary environment lighting and “one-bounce” indirect illumination.

Source: https://en.wikipedia.org/wiki/Light_stage


Source: Advances in Neural Rendering, https://www.neuralrender.com/

##### Learning to Factorize and Relight a City

A learning-based framework for disentangling outdoor scenes into temporally-varying illumination and permanent scene factors, using imagery from Google Street View, where the same locations are captured repeatedly through time.

Source: https://en.wikipedia.org/wiki/Light_stage

##### NeRD: Neural Reflectance Decomposition from Image Collections

NeRD is a method that can decompose image collections from multiple views taken under varying or fixed illumination conditions. The object can be rotated, or the camera can turn around the object. The result is a neural volume with an explicit representation of the appearance and illumination in the form of the BRDF and Spherical Gaussian (SG) environment illumination.
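For reference, a spherical Gaussian lobe is commonly written as G(v) = mu * exp(lambda * (v . xi - 1)), with lobe axis xi, sharpness lambda and RGB amplitude mu. The sketch below evaluates a sum of such lobes for one direction. It assumes this common parameterisation, and the function name `eval_spherical_gaussians` and the lobe values are hypothetical, not taken from NeRD.

```python
import numpy as np

def eval_spherical_gaussians(direction, axes, sharpness, amplitudes):
    """Incoming RGB light from `direction` as a sum of spherical Gaussian lobes.

    axes: (K, 3) lobe directions, sharpness: (K,), amplitudes: (K, 3) RGB.
    """
    direction = np.asarray(direction, dtype=np.float64)
    direction = direction / np.linalg.norm(direction)
    axes = np.asarray(axes, dtype=np.float64)
    axes = axes / np.linalg.norm(axes, axis=1, keepdims=True)

    cos_angle = axes @ direction                              # (K,)
    lobes = np.exp(np.asarray(sharpness) * (cos_angle - 1.0))  # equals 1 on the lobe axis
    return lobes @ np.asarray(amplitudes, dtype=np.float64)    # (3,)

# One hypothetical lobe: a warm light shining from straight above.
axes = [[0.0, 0.0, 1.0]]
sharpness = [10.0]
amplitudes = [[1.0, 0.5, 0.25]]

from_above = eval_spherical_gaussians([0.0, 0.0, 1.0], axes, sharpness, amplitudes)
from_below = eval_spherical_gaussians([0.0, 0.0, -1.0], axes, sharpness, amplitudes)
```

A handful of such lobes gives a compact, smooth and differentiable description of environment lighting, which suits optimisation by gradient descent.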

### NeRF for computer vision tasks: Scene Labelling and Understanding with Implicit Scene Representation, 2021

The intrinsic multi-view consistency and smoothness of NeRF benefit semantics by enabling sparse labels to efficiently propagate.

#### Pose Estimation

``````
iNeRF: Inverting Neural Radiance Fields for Pose Estimation, Yen-Chen et al., IROS 2021
A-NeRF: Surface-free Human 3D Pose Refinement via Neural Rendering, Su et al., arXiv 2021
NeRF--: Neural Radiance Fields Without Known Camera Parameters, Wang et al., arXiv 2021
iMAP: Implicit Mapping and Positioning in Real-Time, Sucar et al., arXiv 2021
GNeRF: GAN-based Neural Radiance Field without Posed Camera, Meng et al., arXiv 2021
BARF: Bundle-Adjusting Neural Radiance Fields, Lin et al., ICCV 2021
Self-Calibrating Neural Radiance Fields, Park et al., ICCV 2021
``````


# Conclusion

Over the past year (2020), we’ve learned how to make the rendering process differentiable, and turn it into a deep learning module. This sparks the imagination, because the deep learning motto is: “If it’s differentiable, we can learn through it”. If we know how to differentiably go from 3D to 2D, it means we can use deep learning and backpropagation to go back from 2D to 3D as well.

With neural rendering we no longer need to physically model the scene and simulate the light transport, as this knowledge is now stored implicitly inside the weights of a neural network. The compute required to render an image is also no longer tied to the complexity of the scene (the number of objects, lights, and materials), but rather the size of the neural network.

Neural rendering has already enabled applications that were previously intractable, such as rendering of digital avatars without any manual modeling. Neural rendering could have a profound impact in making complex photo and video editing tasks accessible to a much broader audience.

This is no longer a neural network that is predicting physics. This is physics (or optics) plugged on top of a neural network inside a PyTorch engine. We now have a differentiable simulation of the real world (harnessing the power of computer graphics) on top of a neural representation of it.

Vincent Vanhoucke, Distinguished Scientist at Google (https://towardsdatascience.com/three-grand-challenges-in-machine-learning-771e1440eafc):

"This is why I call the grand challenge for perception the Inverse Video Game problem: predict not only a static scene but its functional semantics and possible futures. You should be able to take a video, run it through your computer vision model, and get a representation of a scene you can not only parse, but can roll forward in time to generate plausible future behaviors, from any viewpoint, like a video game engine would. It would respect the physics (a ball shot at a goal would follow its normal trajectory), semantics (a table would be movable, a door could be opened), and agents in the scene (people, cars would have reasonable NPC behaviors)."

A lot of computer vision and graphics algorithms can be defined in a closed form solution, which therefore allows for optimizations.

Neural computing is very computationally expensive, in part because (by design) it can’t be reduced to a closed-form solution in the same way. Many such formulations contain infinite integrals that are impossible to solve analytically. We then use approximations (like Markov chain Monte Carlo), but they take a long time to converge. It is really worth using a neural network to at least prototype and find the right settings before computing the final version.
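The slow convergence of sampling-based approximations can be seen in a toy example. The sketch below uses plain Monte Carlo integration (not Markov chain Monte Carlo) on an integral whose exact value is known, so that the error can be measured. The integrand, sample counts and seed are arbitrary choices for illustration.

```python
import numpy as np

def monte_carlo_integral(f, n_samples, rng):
    """Estimate the integral of f over [0, 1] by averaging f at uniform random points."""
    x = rng.random(n_samples)
    return float(np.mean(f(x)))

rng = np.random.default_rng(0)
exact = 1.0 / 3.0                      # integral of x^2 over [0, 1]

estimates = {n: monte_carlo_integral(np.square, n, rng) for n in (100, 10_000, 1_000_000)}
errors = {n: abs(est - exact) for n, est in estimates.items()}

for n in estimates:
    print(f"{n:>9} samples: estimate {estimates[n]:.5f}, error {errors[n]:.5f}")
```

The expected error shrinks only with the square root of the sample count, so every extra digit of accuracy costs about a hundred times more samples.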

Source: https://medium.com/@hurmh92/autonomous-driving-slam-and-3d-mapping-robot-e3cca3c52e95