Sehender Raum: Seeing Space

Notes about capturing, rendering and digitally reconstructing the world

When I learned about traditional computer graphics and photogrammetry, I missed the big picture of how all the pieces connect with hardware, physics and machine learning. That made it harder to understand recent research and its meaning for the field. Reconstructing and rendering 3D models from 2D images remains a challenging problem, but incredible progress has been made since I first became interested in the topic 20 years ago (see below).

With neural rendering, computer graphics and vision might be heading for their "Charles Darwin" moment, where we can see and remove some limiting assumptions from the field. Some disjoint pieces may just fall into place: computer graphics and vision now have a shared framework. Exciting times.

I would love to hear from anybody in the field: https://www.linkedin.com/in/katrinschmid/

Table of contents generated with markdown-toc

"Classic rendering" in computer graphics

Classical computer graphics methods approximate the physical process of image formation in the real world: light sources emit photons that interact with the objects in the scene, as a function of their geometry and material properties, before being recorded by a camera. This process is known as light transport. The process of transforming a scene definition including cameras, lights, surface geometry and material into a simulated camera image is known as rendering.

The two most common approaches to rendering are rasterization and raytracing.

  • Rasterization is a feedforward process in which geometry is transformed into the image domain, sometimes in back-to-front order known as painter’s algorithm.
  • Raytracing is a process in which rays are cast backwards from the image pixels into a virtual scene, and reflections and refractions are simulated by recursively casting new rays from the intersections with the geometry.
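As a toy illustration of the ray-casting idea in the second bullet, the sketch below shoots one primary ray per pixel against a single hard-coded sphere, with no shading, reflections or recursion; the scene, camera model and resolution are illustrative assumptions, not any particular renderer's API.

```python
import numpy as np

def intersect_sphere(origin, direction, center, radius):
    """Return the distance to the nearest ray-sphere intersection, or None if missed."""
    oc = origin - center
    b = 2.0 * np.dot(direction, oc)
    c = np.dot(oc, oc) - radius ** 2
    disc = b * b - 4.0 * c          # unit-length direction, so a = 1
    if disc < 0:
        return None
    t = (-b - np.sqrt(disc)) / 2.0
    return t if t > 0 else None

# Cast one primary ray per pixel from a pinhole camera at the origin, looking along -z
width, height = 64, 64
sphere_center, sphere_radius = np.array([0.0, 0.0, -3.0]), 1.0
image = np.zeros((height, width))
for j in range(height):
    for i in range(width):
        # Map the pixel to a point on an image plane at z = -1
        x = (i + 0.5) / width * 2.0 - 1.0
        y = 1.0 - (j + 0.5) / height * 2.0
        direction = np.array([x, y, -1.0])
        direction /= np.linalg.norm(direction)
        hit = intersect_sphere(np.zeros(3), direction, sphere_center, sphere_radius)
        image[j, i] = 1.0 if hit is not None else 0.0   # white where the sphere is hit
print(image.sum(), "pixels covered by the sphere")
```

A real raytracer would recursively spawn reflection and refraction rays at each intersection and shade the hit points using light sources and materials.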

The rendering equation (published in 1986)

The rendering equation describes physical light transport for a single camera or the human visual system. A point in the scene is imaged by measuring the emitted and reflected light that converges on the sensor plane. Radiance (L) represents the ray strength, measuring the combined angular and spatial power densities. Radiance can be used to indicate how much of the power emitted by the light source, and then reflected, transmitted or absorbed by a surface, will be captured by a camera facing that surface from a specified angle of view.
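Written out, the rendering equation has the standard form:

```latex
L_o(\mathbf{x}, \omega_o) = L_e(\mathbf{x}, \omega_o)
  + \int_{\Omega} f_r(\mathbf{x}, \omega_i, \omega_o)\, L_i(\mathbf{x}, \omega_i)\, (\omega_i \cdot \mathbf{n})\, \mathrm{d}\omega_i
```

Here L_o is the outgoing radiance at surface point x in direction ω_o, L_e is the emitted radiance, f_r is the BRDF describing the material, L_i is the incoming radiance from direction ω_i, and the cosine term (ω_i · n) accounts for the angle between the incoming ray and the surface normal n; the integral runs over the hemisphere Ω above the surface.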

Source: https://www.semanticscholar.org/paper/Inverse-Rendering-and-Relighting-From-Multiple-Plus-Liu-Do/da1ba94e01596d0241d7d426b071ae9731d148b3, https://www.mdpi.com/2072-4292/13/13/2640, Rendering for Data Driven Computational Imaging, Tristan Swedish

Inverse and differentiable rendering (aka "computer vision")

Inverse graphics attempts to take sensor data and infer 3D geometry, illumination, materials, and motion such that a graphics renderer could realistically reproduce the observed scene. Renderers, however, are designed to solve the forward process of image synthesis. To go in the other direction, an approximate differentiable renderer (DR) explicitly models the relationship between changes in model parameters and image observations. Differentiable rendering enables optimization of 3D object properties like the geometry of a mesh. Unlike traditional rendering, differentiable rendering can backpropagate gradients from image space to 3D geometry.

Source: http://rgl.epfl.ch/publications/NimierDavidVicini2019Mitsuba2

Inverse rendering and differentiable rendering have been a topic of research for some time. However, major breakthroughs have only been made in recent years due to improved hardware and advancements in deep learning.
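As a toy sketch of what "differentiable" buys you: below, a scene parameter is optimized by backpropagating an image-space loss through a differentiable "renderer". The renderer here is a made-up soft circle, not a real graphics pipeline; only the optimize-through-the-renderer pattern is the point.

```python
# Toy differentiable rendering: optimize a scene parameter so the rendered image
# matches a target image, by backpropagating the image loss through the renderer.
import torch

def render_soft_circle(radius, size=64):
    """Differentiable placeholder renderer: a blurry disk of the given radius."""
    ys, xs = torch.meshgrid(torch.arange(size), torch.arange(size), indexing="ij")
    dist = ((xs - size / 2) ** 2 + (ys - size / 2) ** 2).float().sqrt()
    return torch.sigmoid(radius - dist)   # smooth edge keeps gradients informative

target = render_soft_circle(torch.tensor(20.0))    # "observed" image
radius = torch.tensor(5.0, requires_grad=True)     # initial guess for the scene parameter
optimizer = torch.optim.Adam([radius], lr=0.5)

for step in range(200):
    optimizer.zero_grad()
    image = render_soft_circle(radius)
    loss = ((image - target) ** 2).mean()          # photometric (MSE) loss in image space
    loss.backward()                                # gradients flow from pixels to the parameter
    optimizer.step()

print(float(radius))   # converges towards the target radius of 20
```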

Neural Rendering

Neural rendering is a relatively new technique that combines a classical (or other) 3D representation and renderer with deep neural networks that re-render the classical render into more complete and realistic views. In contrast to Neural Image-based Rendering (N-IBR), neural rerendering does not use input views at runtime, and instead relies on the deep neural network to recover the missing details. Deepfakes are an early neural rendering technique in which a person in an existing image or video is replaced with someone else's likeness. The original approach is believed to be based on Korshunova et al. (2016), which used a convolutional neural network (CNN).

A typical neural rendering approach takes as input images corresponding to certain scene conditions (for example, viewpoint, lighting, layout, etc.), builds a "neural” scene representation from them, and "renders” this representation under novel scene properties to synthesize novel images.

The learned scene representation is not restricted by simple scene modeling approximations and can be optimized for high quality novel images. At the same time, neural rendering approaches incorporate ideas from classical graphics—in the form of input features, scene representations, and network architectures—to make the learning task easier, and the output more controllable. Neural rendering has many important use cases such as semantic photo manipulation, novel view synthesis, relighting, free viewpoint video, as well as facial and body reenactment.

Source: Advances in Neural Rendering, https://www.neuralrender.com/

Artifacts such as ghosting, blur, holes, or seams can arise due to view-dependent effects, imperfect proxy geometry or too few source images. To address these issues, N-IBR methods replace the heuristics often found in classical IBR methods with learned blending functions or corrections that take into account view-dependent effects.

Source: Advances in Neural Rendering, https://www.neuralrender.com/

Image-based rendering (IBR): Plenoptic function and capture

Computational imaging (CI) is a class of imaging systems that, starting from an imperfect physical measurement and prior knowledge about the class of objects or scenes being imaged, deliver estimates of a specific object or scene presented to the imaging system.

In contrast to classical rendering, which projects 3D content onto the 2D image plane, image-based rendering techniques generate novel images by transforming an existing set of images, typically by warping and compositing them together. The essence of image-based rendering is to obtain all the visual information of the scene directly from images. It is used in computer graphics and computer vision, and it is also widely used in virtual reality technology.

The plenoptic function 8D, Gabriel Lippmann, 1908

The world as we see it with our eyes is a continuous three-dimensional function of the spatial coordinates. Generating photo-realistic views of a real-world scene from any viewpoint requires not only understanding the 3D scene geometry, but also modeling complex viewpoint-dependent appearance resulting from light transport phenomena. A photograph is a two-dimensional map of the number of photons arriving from the three-dimensional scene.

While the rendering equation is a useful model for computer graphics, some problems are easier to solve with a more general light model.

The plenoptic function is also inspired by multi-faceted insect eyes or lens arrays

Source: https://en.wikipedia.org/wiki/Compound_eye, Rendering for Data Driven Computational Imaging, Tristan Swedish

The plenoptic function describes the degrees of freedom of a light ray with the parameters: Irradiance, position, wavelength, time, angle, phase, polarization, and bounce.

Source: Rendering for Data Driven Computational Imaging, Tristan Swedish, https://www.blitznotes.org/ib/physics/waves.html, https://courses.lumenlearning.com/boundless-chemistry/chapter/the-nature-of-light/

Light has the properties of waves. Like ocean waves, light waves have crests and troughs.

  • The distance between one crest and the next, which is the same as the distance between one trough and the next, is called the wavelength.
  • Wave phase is the offset of a wave from a given point. When two waves cross paths, they either cancel each other out or complement each other, depending on their phase.
  • Irradiance is the amount of light energy from one thing hitting a square meter of another each second. Photons that carry this energy have wavelengths from energetic X-rays and gamma rays to visible light to the infrared and radio. The unit of irradiance is the watt per square meter.
  • Polarization and bounce are often omitted for simplicity.
  • The full equation is also time dependent (written out below).
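Dropping phase, polarization and bounce as noted above, but keeping time, a commonly used form of the plenoptic function is:

```latex
P = P(x, y, z, \theta, \phi, \lambda, t)
```

i.e. the radiance arriving at position (x, y, z) from direction (θ, φ) at wavelength λ and time t. Fixing time and wavelength gives the 5D function discussed below, and restricting to rays outside the convex hull of the scene gives the 4D light field.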

Static 5D and 4D Lightfields: capture and rendering (Gabriel Lippmann, 1908)

Light-field capture was first proposed in 1908 by Nobel laureate physicist Gabriel Lippmann (who also contributed to early color photography). If Vx, Vy, Vz are fixed, the plenoptic function describes a panorama at fixed viewpoint (Vx, Vy, Vz). A regular image with a limited field of view can be regarded as an incomplete plenoptic sample at a fixed viewpoint. As long as we stay outside the convex hull of an object or a scene, if we fix the location of the camera on a plane, we can use two parallel planes (u,v) and (s,t) to simplify the complete 5D plenoptic function to a 4D lightfield plenoptic function.

A light field is a mathematical function of one or more variables whose range is a set of multidimensional vectors that describe the amount of light flowing in every direction through every point in space. It restricts the information to light outside the convex hull of the objects of interest. The 7D plenoptic function can, under certain assumptions and relaxations, be simplified to a 4D light field, which is easier to sample and operate on. A hologram is a photographic recording of a light field, rather than an image formed by a lens. A light field is a function that describes how light transport occurs throughout a 3D volume: it describes the direction of light rays moving through every coordinate x = (x, y, z) in space and in every direction d, described either as θ and φ angles or as a unit vector. Collectively they form a 5D feature space that describes light transport in a 3D scene.

The magnitude of each light ray is given by the radiance and the space of all possible light rays is given by the five-dimensional plenoptic function. The 4D lightfield has 2D spatial (x,y) and 2D angular (u,v) information that is captured by a plenoptic sensor.

  • the incident light field Li(u, v, alpha, beta) describing the irradiance of light incident on objects in space
  • the radiant light field Lr (u, v, alpha, beta) quantifying the irradiance created by an object
  • time is an optional 5th dimension
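A minimal sketch of the two-plane (u,v)/(s,t) parameterization described above; the plane positions, axis conventions and function name are illustrative assumptions, not a fixed standard.

```python
import numpy as np

def ray_to_two_plane(origin, direction, z_uv=0.0, z_st=1.0):
    """Parameterize a ray by its intersections with two parallel planes.

    The (u, v) plane sits at z = z_uv and the (s, t) plane at z = z_st, which is
    the common two-plane light field parameterization L(u, v, s, t).
    """
    direction = direction / np.linalg.norm(direction)
    if abs(direction[2]) < 1e-8:
        raise ValueError("ray is parallel to the parameterization planes")
    t_uv = (z_uv - origin[2]) / direction[2]   # ray parameter at the first plane
    t_st = (z_st - origin[2]) / direction[2]   # ray parameter at the second plane
    u, v = (origin + t_uv * direction)[:2]
    s, t = (origin + t_st * direction)[:2]
    return u, v, s, t

# Example: a ray leaving a camera at (0.1, 0.2, -1) pointing roughly along +z
print(ray_to_two_plane(np.array([0.1, 0.2, -1.0]), np.array([0.0, 0.1, 1.0])))
```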

Capturing, storing and compressing static and dynamic light fields

Light field rendering [Levoy and Hanrahan 1996] eschews any geometric reasoning and simply samples images on a regular grid so that new views can be rendered as slices of the sampled light field. Lumigraph rendering [Gortler et al. 1996] showed that using approximate scene geometry can ameliorate artifacts due to undersampled or irregularly sampled views. The plenoptic sampling framework [Chai et al. 2000] analyzes light field rendering using signal processing techniques and shows that the Nyquist view sampling rate for light fields depends on the minimum and maximum scene depths. Furthermore, they discuss how the Nyquist view sampling rate can be lowered with more knowledge of scene geometry. Zhang and Chen [2003] extend this analysis to show how non-Lambertian and occlusion effects increase the spectral support of a light field, and also propose more general view sampling lattice patterns.

One type uses an array of micro-lenses placed in front of an otherwise conventional image sensor to sense intensity, color, and directional information. Multi-camera arrays are another type. Compared to a traditional photo camera that only captures the intensity of the incident light, a light-field camera provides angular information for each pixel.

In principle, this additional information allows 2D images to be reconstructed at a given focal plane, and hence a depth map can be computed. While special cameras and camera arrangements have been built to capture light fields, it is also possible to capture them with a conventional camera or smartphone under certain constraints (see Crowdsampling the Plenoptic Function).

Source: Stanford light field camera; Right: Adobe (large) lens array, source https://cs.brown.edu/courses/csci1290/labs/lab_lightfields, "Lytro Illum", a discontinued commercially available light field camera

Implicit neural scene representations

Source: Advances in Neural Rendering, https://www.neuralrender.com/

  • General purpose format, can store images, points, voxels, meshes, compression
  • Cloudy, blurry artifacts when not enough data is available
  • Can handle reflection and transparent objects, small details, fuzzy objects
  • Not meant for survey/measuring; can generate low-res geometry via marching cubes

Source: Advances in Neural Rendering, https://www.neuralrender.com/

Encodings comparison

Whereas discrete signal representations like pixel images or voxels approximate continuous signals with regularly spaced samples of the signal, these neural fields approximate the continuous signal directly with a continuous, parametric function, i.e., an MLP which takes in coordinates as input and outputs a vector (such as color or occupancy).
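A minimal sketch of such a coordinate-based neural field; the layer sizes, activations and outputs (RGB plus a density/occupancy value) are arbitrary illustrative choices.

```python
import torch
import torch.nn as nn

class CoordinateField(nn.Module):
    """Tiny neural field: a continuous function from 3D coordinates to (RGB, density)."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),            # 3 color channels + 1 density value
        )

    def forward(self, xyz):
        out = self.net(xyz)
        rgb = torch.sigmoid(out[..., :3])    # colors in [0, 1]
        density = torch.relu(out[..., 3:])   # non-negative density
        return rgb, density

field = CoordinateField()
rgb, density = field(torch.rand(1024, 3))    # query 1024 continuous 3D points
print(rgb.shape, density.shape)              # (1024, 3), (1024, 1)
```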

Neural approximations of scalar and vector fields, such as signed distance functions and radiance fields, have emerged as accurate, high-quality representations. State-of-the-art results are obtained by conditioning a neural approximation with a lookup from trainable feature grids.

From Instant Neural Graphics Primitives with a Multiresolution Hash Encoding

A demonstration of the reconstruction quality of different encodings and parametric data structures for storing trainable feature embeddings. Each configuration was trained for 11 000 steps using our fast NeRF implementation (Section 5.4), varying only the input encoding and MLP size. The number of trainable parameters (MLP weights + encoding parameters), training time and reconstruction accuracy (PSNR) are shown below each image. Our encoding (e) with a similar total number of trainable parameters as the frequency encoding configuration (b) trains over 8× faster, due to the sparsity of updates to the parameters and smaller MLP. Increasing the number of parameters (f) further improves reconstruction accuracy without significantly increasing training time.

Comparison of encodings (from the Instant-NGP paper).

A practical introduction: https://keras.io/examples/vision/nerf/#setup

Input format: Local Light Field Fusion (LLFF), 2019

Used in the original NeRF paper for still images; can sample light fields up to the Nyquist frequency limit.

LLFF uses COLMAP to calculate the position of each camera (pose files), then uses a trained network to calculate the depth map, and from there it generates the MPI, which is the output used to create the MPI videos (and the metadata). So pose recentering needs to be applied to real data where the camera poses are arbitrary, is that correct? Leaving rendering aside, does it have an impact on training, e.g. training on LLFF (real data with arbitrary camera poses) without recenter_poses and with NDC? Intuitively it depends on how much the default world coordinates differ from poses_avg; in practice, when using COLMAP, do they differ much?

NDC makes very specific assumptions: that the camera is facing along -z and is entirely behind the z = -near plane. So if the rotation is wrong it will fail (in its current implementation). This is analogous to how a regular graphics pipeline like OpenGL works. poses_bounds.npy contains 3x5 pose matrices and 2 depth bounds for each image. Each pose has [R T] as the left 3x4 matrix and [H W F] as the right 3x1 column.
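Following the layout just described, unpacking such a file could look roughly like this (the file name poses_bounds.npy and the 3x5 + 2 layout are taken from the text above; verify against the LLFF repository before relying on it):

```python
import numpy as np

# poses_bounds.npy: one row per image, 17 values = flattened 3x5 pose matrix + 2 depth bounds
data = np.load("poses_bounds.npy")            # shape (num_images, 17)
poses = data[:, :15].reshape(-1, 3, 5)        # 3x5 pose matrix per image
bounds = data[:, 15:]                         # near and far depth bounds

rotation = poses[:, :, :3]                    # [R]: 3x3 camera-to-world rotation
translation = poses[:, :, 3]                  # [T]: camera position
hwf = poses[:, :, 4]                          # [H W F]: image height, width, focal length

print(rotation.shape, translation.shape, hwf.shape, bounds.shape)
```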

Multi Sphere Image, Multi-plane image (MPI), local layered representation format and DeepMPI representation (2.5 D), 2020

MSI: a Multi Sphere Image. LM: a Layered Mesh with individual layer textures. LM-TA: a Layered Mesh with a texture atlas.

Deep image or video generation approaches that enable explicit or implicit control of scene properties such as illumination, camera parameters, pose, geometry, appearance, and semantic structure. MPIs (rgba) have the ability to produce high-quality novel views of complex scenes in real time and the view consistency that arises from a 3D scene representation (in contrast to neural rendering approaches that decode a separate view for each desired viewpoint).

The method takes in a set of images of a static scene, promotes each image to a local layered representation (MPI), and blends local light fields rendered from these MPIs to render novel views. As a rule of thumb, you should use images where the maximum disparity between views is no more than about 64 pixels (watch the closest thing to the camera and don't let it move more than ~1/8 the horizontal field of view between images). Datasets usually consist of 20-30 images captured handheld in a rough grid pattern. https://github.com/Fyusion/LLFF

Its depth-wise resolution is limited by the number of discrete planes, and thus MPIs cannot easily be converted to other 3D representations such as meshes or point clouds.
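To make the layered representation concrete, here is a minimal sketch of compositing an MPI's RGBA planes back to front with the standard "over" operator; the plane count and image sizes are placeholders.

```python
import numpy as np

def composite_mpi(rgba_planes):
    """Composite multiplane-image layers back to front with the 'over' operator.

    rgba_planes: array of shape (num_planes, H, W, 4), ordered back (far) to front (near),
    with straight (non-premultiplied) alpha.
    """
    output = np.zeros(rgba_planes.shape[1:3] + (3,))
    for plane in rgba_planes:                       # far planes first, near planes last
        rgb, alpha = plane[..., :3], plane[..., 3:]
        output = rgb * alpha + output * (1.0 - alpha)
    return output

mpi = np.random.rand(32, 64, 64, 4)                 # 32 random RGBA planes as a stand-in
image = composite_mpi(mpi)
print(image.shape)                                  # (64, 64, 3)
```

Novel views are produced by warping each plane according to the new viewpoint (a homography per plane) before this compositing step.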

Source: https://www.semanticscholar.org/paper/ACORN%3A-Adaptive-Coordinate-Networks-for-Neural-Martel-Lindell/2d0c07aa97b5b422c1ac512b1c184f412a19f28e/

DeepMPI (2020) extends prior work on multiplane images (MPIs) to model viewing conditions that vary with time. Its key contributions are: first, a representation, called a DeepMPI, for neural rendering that extends prior work on multiplane images to model viewing conditions that vary with time; second, a method for training DeepMPIs on sparse, unstructured, crowdsampled data that is unregistered in time. (The plenoptic function is often described as 7D, but it can be reduced to a 4D color light field supplemented by time.) See Crowdsampling the Plenoptic Function.

Parametric Encoding: Acorn: Adaptive coordinate networks for neural scene representation, 2021

The parametric approach uses a tree subdivision of the domain ℝ^d, wherein a large auxiliary coordinate encoder neural network (ACORN) [Martel et al. 2021] is trained to output dense feature grids in the leaf node around x. These dense feature grids, which have on the order of 10,000 entries, are then linearly interpolated, as in Liu et al. [2020]. This approach tends to yield a larger degree of adaptivity compared with the previous parametric encodings, albeit at greater computational cost, which can only be amortized when sufficiently many inputs x fall into each leaf node.

Source: https://www.computationalimaging.org/publications/acorn

Multi-resolution decomposition

Conversion to mesh or voxel

Neural radiance field (NeRF) methods use techniques from volume rendering to accumulate samples of the scene representation along rays to render the scene from any viewpoint.

The neural network can also be converted to a mesh in certain circumstances (see https://github.com/bmild/nerf/blob/master/extract_mesh.ipynb). We first need to infer which locations are occupied by the object. This is done by first creating a grid volume in the form of a cuboid covering the whole object, then using the NeRF model to predict whether each cell is occupied or not. This is the main reason why mesh extraction is only available for 360° inward-facing scenes and not for forward-facing scenes.
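A rough sketch of that occupancy-grid idea: sample a density function on a cubic grid and run marching cubes on it. The fake_density function below stands in for a trained NeRF's density query, and the grid resolution and threshold are arbitrary; scikit-image is assumed for marching cubes.

```python
import numpy as np
from skimage import measure   # marching cubes; scikit-image is an assumption here

def extract_mesh(density_fn, resolution=128, bound=1.0, threshold=25.0):
    """Sample a NeRF-style density function on a cubic grid and run marching cubes."""
    xs = np.linspace(-bound, bound, resolution)
    grid = np.stack(np.meshgrid(xs, xs, xs, indexing="ij"), axis=-1)   # (R, R, R, 3)
    density = density_fn(grid.reshape(-1, 3)).reshape(resolution, resolution, resolution)
    verts, faces, normals, _ = measure.marching_cubes(density, level=threshold)
    return verts, faces, normals

# Stand-in density: a solid sphere of radius 0.5 (a trained NeRF MLP would go here)
def fake_density(points):
    return np.where(np.linalg.norm(points, axis=-1) < 0.5, 100.0, 0.0)

verts, faces, _ = extract_mesh(fake_density)
print(len(verts), "vertices,", len(faces), "faces")
```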

Point-Based Rendering

ADOP: Approximate Differentiable One-Pixel Point Rendering, https://t.co/npOqsAstAx https://t.co/LE4ZdckQPO

PlenOctrees For Real-time Rendering of Neural Radiance Fields, 2021, NeRF-SH

PlenOctrees render Neural Radiance Fields (NeRFs) in real time using an octree-based 3D representation which supports view-dependent effects. The method can render 800x800 images at more than 150 FPS, which is over 3000 times faster than conventional NeRFs, without sacrificing quality and while preserving the ability of NeRFs to perform free-viewpoint rendering of scenes with arbitrary geometry and view-dependent effects. Real-time performance is achieved by pre-tabulating the NeRF into a PlenOctree. In order to preserve view-dependent effects such as specularities, the appearance is factorized via closed-form spherical basis functions. Specifically, NeRFs can be trained to predict a spherical harmonic representation of radiance, removing the viewing direction as an input to the neural network. Furthermore, PlenOctrees can be directly optimized to further minimize the reconstruction loss, which leads to equal or better quality compared to competing methods. This octree optimization step can also be used to reduce the training time, as there is no longer a need to wait for the NeRF training to converge fully. The approach may enable new applications such as 6-DOF industrial and product visualizations, as well as next generation AR/VR systems. PlenOctrees are amenable to in-browser rendering as well; https://alexyu.net/plenoctrees/

Source https://alexyu.net/plenoctrees/

Realtime online demo: https://alexyu.net/plenoctrees/demo/?load=https://angjookanazawa.com/plenoctree_data/ficus_cams.draw.npz;https://angjookanazawa.com/plenoctree_data/ficus.npz&hide_layers=1
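To make the spherical-harmonics factorization described above concrete, here is a minimal sketch of evaluating a degree-2 real SH basis and turning per-point SH coefficients into a view-dependent color. The constants follow the common real-SH convention used in graphics code; check them against the reference implementation before reuse.

```python
import numpy as np

def sh_basis_deg2(dirs):
    """Real spherical harmonics basis up to degree 2 (9 values) for unit direction(s)."""
    x, y, z = dirs[..., 0], dirs[..., 1], dirs[..., 2]
    return np.stack([
        0.28209479 * np.ones_like(x),                        # l = 0
        -0.48860251 * y, 0.48860251 * z, -0.48860251 * x,    # l = 1
        1.09254843 * x * y, -1.09254843 * y * z,
        0.31539157 * (3.0 * z * z - 1.0),                    # valid for unit directions
        -1.09254843 * x * z, 0.54627421 * (x * x - y * y),   # l = 2
    ], axis=-1)

def sh_color(sh_coeffs, view_dir):
    """View-dependent RGB from per-point SH coefficients, as in NeRF-SH / PlenOctrees.

    sh_coeffs: (3, 9) coefficients per color channel; view_dir: unit 3-vector.
    """
    basis = sh_basis_deg2(view_dir)                    # (9,)
    return 1.0 / (1.0 + np.exp(-(sh_coeffs @ basis)))  # sigmoid keeps colors in [0, 1]

coeffs = np.random.randn(3, 9) * 0.1                   # placeholder coefficients
print(sh_color(coeffs, np.array([0.0, 0.0, 1.0])))
```

Because the viewing direction only enters through this cheap basis evaluation, the expensive MLP can be baked out into the octree while view-dependent effects survive.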

Plenoxels: Radiance Fields without Neural Networks, 2021

Proposes a view-dependent sparse voxel model, Plenoxel (plenoptic volume element), that can be optimized to the same fidelity as Neural Radiance Fields (NeRFs) without any neural networks. The typical optimization time is 11 minutes on a single GPU, a speedup of two orders of magnitude compared to NeRF.

Source https://github.com/sxyu/svox2

Also https://github.com/naruya/VaxNeRF

Point-NeRF: Point-based Neural Radiance Fields, 2022

Volumetric neural rendering methods like NeRF generate high-quality view synthesis results but are optimized per-scene leading to prohibitive reconstruction time. On the other hand, deep multi-view stereo methods can quickly reconstruct scene geometry via direct network inference. Point-NeRF combines the advantages of these two approaches by using neural 3D point clouds, with associated neural features, to model a radiance field.

https://arxiv.org/abs/2201.08845

Mixed scene representations for neural rendering, 2019

Mixture of Volumetric Primitives for Efficient Neural Rendering

Source: https://arxiv.org/pdf/2103.01954.pdf

Instant Neural Graphics Primitives

See also https://github.com/3a1b2c3/seeingSpace/wiki/NVIDIA-instant-Nerf-on-google-colab,-train-a-nerf-without-a-massive-gpu

An implementation of four neural graphics primitives: neural radiance fields (NeRF), signed distance functions (SDFs), neural images, and neural volumes. In each case, an MLP with a multiresolution hash input encoding is trained and rendered using the tiny-cuda-nn framework.

https://github.com/NVlabs/instant-ngp/raw/master/docs/assets_readme/fox.gif

https://github.com/NVlabs/instant-ngp, needs RTX graphics card
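A minimal sketch of the spatial-hash lookup behind the multiresolution hash encoding used here. The primes are the ones listed in the Instant-NGP paper; everything else (table size, feature width, nearest-vertex lookup instead of trilinear interpolation over the 8 surrounding vertices) is a simplifying assumption for illustration.

```python
import numpy as np

# Primes from the Instant-NGP spatial hash; the table size T is a power of two.
PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)

def hash_grid_lookup(points, table, level_resolution):
    """Look up per-vertex features for one level of a multiresolution hash grid.

    points: (N, 3) coordinates in [0, 1); table: (T, F) trainable feature vectors.
    This sketch fetches only the nearest grid vertex; the real method trilinearly
    interpolates the 8 surrounding vertices and concatenates features across levels.
    """
    T = np.uint64(table.shape[0])
    vertices = np.floor(points * level_resolution).astype(np.uint64)  # integer grid coords
    h = vertices * PRIMES                                             # per-axis multiply
    index = (h[:, 0] ^ h[:, 1] ^ h[:, 2]) % T                         # XOR-fold, then mod T
    return table[index]

table = np.random.randn(2 ** 14, 2).astype(np.float32)    # 16384 entries, 2 features each
features = hash_grid_lookup(np.random.rand(5, 3), table, level_resolution=64)
print(features.shape)                                      # (5, 2)
```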

Compression

TODO

Novel (virtual) 2D view synthesis from plenoptic samples

The goal is to synthesize plenoptic slices that can be interpolated to recover local regions of the full plenoptic function. Given a dense sampling of views, photorealistic novel views can be reconstructed by simple light field sample interpolation techniques. For novel view synthesis with sparser view sampling, the computer vision and graphics communities have made significant progress by predicting traditional geometry and appearance representations from observed images. The study of image-based rendering is motivated by a simple question: how do we use a finite set of images to reconstruct an infinite set of views?

View synthesis can be approached either by explicit estimation of scene geometry and color, or by using coarser estimates of geometry to guide interpolation between captured views. One approach aims to explicitly reconstruct the surface geometry and the appearance on the surface from the observed sparse views; other approaches adopt volume-based representations to directly model the appearance of the entire space and use volumetric rendering techniques to generate images for 2D displays. The raw samples of a light field are saved as disk images; at high resolution this requires large amounts of data.

Volume rendering commonly uses a technique known as ray marching: a ray is shot from the observer (camera) through a 3D volume in space, and a function is queried at sample points along the ray for the color and opacity at that particular point in space. Neural rendering takes the next step by using a neural network to approximate this function.
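A minimal numpy sketch of that ray-marching quadrature, assuming per-sample colors and densities are already available (in a NeRF they would come from querying the MLP along the ray):

```python
import numpy as np

def volume_render_ray(colors, densities, deltas):
    """Numerically integrate color and opacity along one ray (the NeRF-style quadrature).

    colors: (S, 3) RGB at the S samples; densities: (S,) non-negative sigma values;
    deltas: (S,) distances between consecutive samples.
    """
    alphas = 1.0 - np.exp(-densities * deltas)                      # per-sample opacity
    transmittance = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = transmittance * alphas                                # contribution of each sample
    rgb = (weights[:, None] * colors).sum(axis=0)                   # accumulated pixel color
    return rgb, weights

samples = 64
rgb, weights = volume_render_ray(
    colors=np.random.rand(samples, 3),
    densities=np.random.rand(samples) * 5.0,
    deltas=np.full(samples, 1.0 / samples),
)
print(rgb, weights.sum())   # the weights sum to the ray's total opacity (at most 1)
```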

Source: Advances in Neural Rendering, https://www.neuralrender.com/

Source: https://github.com/Arne-Petersen/Plenoptic-Simulation, A System for Acquiring, Processing, and Rendering Panoramic Light Field Stills for Virtual Reality

Source: Advances in Neural Rendering, https://www.neuralrender.com/

Light field rendering pushes the latter strategy to an extreme by using dense structured sampling of the lightfield to make reconstruction guarantees independent of specific scene geometry. Most image-based rendering algorithms are designed to model static appearance; DeepMPI (Deep Multiplane Images) further captures viewing-condition-dependent appearance.

Camera calibration is often assumed to be a prerequisite, while in practice this information is rarely available and has to be pre-computed with conventional techniques such as SfM.

3D scene reconstruction and inverse and differentiable rendering

Inverse rendering and differentiable rendering: explicitly reconstructing the scene

The key concept behind neural rendering approaches is that they are differentiable. A differentiable function is one whose derivative exists at each point in the domain. This is important because machine learning is basically the chain rule with extra steps: a differentiable rendering function can be learned with data, one gradient descent step at a time. Learning a rendering function statistically through data is fundamentally different from the classic rendering methods we described above, which calculate and extrapolate from the known laws of physics.

They can be classified into explicit and implicit representations. Explicit methods describe scenes as a collection of geometric primitives, such as triangles, point-like primitives, or higher-order parametric surfaces.

Source: Advances in Neural Rendering, https://www.neuralrender.com/

One popular class of approaches uses mesh-based representations of scenes with either diffuse or view-dependent appearance. Differentiable rasterizers or pathtracers can directly optimize mesh representations to reproduce a set of input images using gradient descent. However, gradient-based mesh optimization based on image reprojection is often difficult, likely because of local minima or poor conditioning of the loss landscape. Furthermore, this strategy requires a template mesh with fixed topology to be provided as an initialization before optimization [22], which is typically unavailable for unconstrained real-world scenes.

Inverse rendering aims to estimate physical attributes of a scene, e.g., reflectance, geometry, and lighting, from image(s). Also called differentiable rendering, it promises to close the loop between computer vision and graphics.

Novel view synthesis with neural rendering: Volume Rendering with Radiance Fields

In this problem, a neural network learns to render a scene from an arbitrary viewpoint. These methods use the ray-marching volume rendering technique described above: rays are shot from the camera through the volume, and a neural network approximates the function that returns color and opacity at each sample point.

Source: Advances in Neural Rendering, https://www.neuralrender.com/

Neural Radiance Fields (NeRF) rendering: Representing Scenes as Neural Radiance Fields, published 2020, Mildenhall et al.

Represents a static scene as a continuous 5D function that outputs the radiance emitted in each direction (θ, φ) at each point (x, y, z) in space, and a density at each point which acts like a differential opacity controlling how much radiance is accumulated by a ray passing through (x, y, z).

NeRF regresses from a single 5D coordinate (x, y, z, θ, φ) to a single volume density and a view-dependent RGB color.

NeRF uses a differentiable volume rendering formula to train a coordinate-based multilayer perceptron (MLP) to directly predict color and opacity from 3D position and 2D viewing direction. It is a recent and popular volumetric rendering technique for generating images, due to its exceptional simplicity and performance for synthesising high-quality images of complex real-world scenes.

The key idea in NeRF is to represent the entire volume space with a continuous function, parameterised by a multi-layer perceptron (MLP), bypassing the need to discretise the space into voxel grids, which usually suffers from resolution constraints. It allows synthesis of photorealistic novel views, although the original formulation is far from real time (see below).

NeRF is good with complex geometries and deals with occlusion well.

Source: Advances in Neural Rendering, 2021, https://www.neuralrender.com/

Neural volume rendering refers to methods that generate images or video by tracing a ray into the scene and taking an integral of some sort over the length of the ray. Typically a neural network like a multi-layer perceptron (MLP) encodes a function from the 3D coordinates on the ray to quantities like density and color, which are integrated to yield an image. One of the reasons NeRF is able to render with great detail is because it encodes a 3D point and associated view direction on a ray using periodic activation functions, i.e., Fourier Features.
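A minimal sketch of that Fourier-feature (positional) encoding; the number of frequencies is a free choice, and a NeRF applies it separately to the position and the viewing direction.

```python
import numpy as np

def positional_encoding(x, num_frequencies=10):
    """NeRF-style Fourier-feature encoding of coordinates.

    Maps each coordinate p to [sin(2^0 pi p), cos(2^0 pi p), ..., sin(2^(L-1) pi p),
    cos(2^(L-1) pi p)], which lets a small MLP represent high-frequency detail.
    """
    frequencies = 2.0 ** np.arange(num_frequencies) * np.pi          # 2^k * pi
    angles = x[..., None] * frequencies                              # (..., D, L)
    encoded = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return encoded.reshape(*x.shape[:-1], -1)                        # (..., D * 2L)

points = np.random.rand(4, 3)                 # four 3D points
print(positional_encoding(points).shape)      # (4, 60) for 10 frequencies
```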

The impact of the NeRF paper lies in its brutal simplicity: just an MLP taking in a 5D coordinate and outputting density and color yields photorealistic results. The initial model had limitations: training and rendering are slow, it can only represent static scenes, it "bakes in" lighting, and a trained NeRF representation does not generalize to other scenes or objects. All of these limitations have since been addressed by follow-up work; there are even real-time NeRFs now.
A good overview can be found in "NeRF Explosion 2020", https://dellaert.github.io/NeRF.

https://user-images.githubusercontent.com/74843139/135747420-4d91bc80-2893-44a4-8d32-16bf7024b4f2.mp4

https://dellaert.github.io/NeRF/ Source: https://towardsdatascience.com/nerf-and-what-happens-when-graphics-becomes-differentiable-88a617561b5d

A deeper integration of graphics knowledge into the network is possible based on differentiable graphics modules. Such a differentiable module can for example implement a complete computer graphics renderer, a 3D rotation, or an illumination model. Such components add a physically inspired inductive bias to the network, while still allowing for end-to-end training via backpropagation. This can be used to analytically enforce a truth about the world in the network structure, frees up network capacity, and leads to better generalization, especially if only limited training data is available.

Source: Advances in Neural Rendering, https://www.neuralrender.com/

Unconstrained Images

NeRF in the Wild: Neural Radiance Fields for Unconstrained Photo Collections, Martin-Brualla, 2021

Can handle images with variable illumination (not photometrically static): cars and people may move, construction may begin or end, seasons and weather may change.

Crowdsampling the Plenoptic Function with NeRF, published 2020

Given a large number of tourist photos taken at different times of day, this machine learning based approach learns to construct a continuous set of light fields and to synthesize novel views capturing all-times-of-day scene appearance. It achieves convincing changes across a variety of times of day and lighting conditions, and masks out transient objects such as people and cars during training and evaluation.

Source: https://www.semanticscholar.org/paper/Crowdsampling-the-Plenoptic-Function-Li-Xian

Source: Crowdsampling the Plenoptic Function, 2020

The approach is trained in an unsupervised manner. It takes unstructured Internet photos spanning some range of time-varying appearance in a scene and learns how to reconstruct a plenoptic slice, a representation of the light field that respects temporal structure in the plenoptic function when interpolated over time, for each of the viewing conditions captured in the input data. By designing the model to preserve the structure of real plenoptic functions, it is forced to learn time-varying phenomena like the motion of shadows according to sun position. This makes it possible, for example, to recover plenoptic slices for images taken at different times of day and interpolate between them to observe how shadows move as the day progresses (best seen in the supplemental video).

Source: https://www.pyimagesearch.com/2021/11/17/computer-graphics-and-deep-learning-with-nerf-using-tensorflow-and-keras-part-2/

Optimize for Photometric Loss: The difference between the predicted color of the pixel (shown in Figure 9) and the actual color of the pixel makes the photometric loss. This eventually allows us to perform backpropagation on the MLP and minimize the loss. In effect, we learn a representation of the scene that can produce high-quality views from a continuum of viewpoints and viewing conditions that vary with time.
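As a tiny sketch, the photometric loss is just the mean squared error between the rendered pixel colors and the ground-truth pixels (the tensor shapes below are placeholders):

```python
import torch

def photometric_loss(predicted_rgb, target_rgb):
    """Mean squared error between rendered pixel colors and ground-truth pixels."""
    return ((predicted_rgb - target_rgb) ** 2).mean()

predicted = torch.rand(1024, 3, requires_grad=True)   # stands in for colors rendered by the model
target = torch.rand(1024, 3)                          # ground-truth pixel colors
loss = photometric_loss(predicted, target)
loss.backward()                                       # gradients flow back towards the MLP weights
print(float(loss))
```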

Source: NeRF-W (NeRF in the Wild). NeRF-W disentangles lighting from the underlying 3D scene geometry; the latter remains consistent even as the former changes.

Source: NeRF in the Wild

NeROIC: Neural Object Capture and Rendering from Online Image Collection, 2022

https://formyfamily.github.io/NeROIC/ We present a novel method to acquire object representations from online image collections, capturing high-quality geometry and material properties of arbitrary objects from photographs with varying cameras, illumination, and backgrounds.

Towards Instant 3D Capture (with a cell phone): Nerfies

Source: Advances in Neural Rendering, https://www.neuralrender.com/

Neural Residual Flow Fields for Efficient Video Representations

Article: https://lnkd.in/gsMbpD-7 Code repo: https://lnkd.in/gRWvhra5 Implicit neural representation (INR) has emerged as a powerful paradigm for representing signals, such as images, videos, 3D shapes, etc. Although it has shown the ability to represent fine details, its efficiency as a data representation has not been extensively studied. In INR, the data is stored in the form of parameters of a neural network and general purpose optimization algorithms do not generally exploit the spatial and temporal redundancy in signals. In this paper, we suggest a novel INR approach to representing and compressing videos by explicitly removing data redundancy. Instead of storing raw RGB colors, we propose Neural Residual Flow Fields (NRFF), using motion information across video frames and residuals that are necessary to reconstruct a video. Maintaining the motion information, which is usually smoother and less complex than the raw signals, requires far fewer parameters. Furthermore, reusing redundant pixel values further improves the network parameter efficiency. Experimental results have shown that the proposed method outperforms the baseline methods by a significant margin.

Multi-resolution Nerfs

Mip-NeRF supersampling, 2021

The rendering procedure used by neural radiance fields (NeRF) samples a scene with a single ray per pixel and may therefore produce renderings that are excessively blurred or aliased when training or testing images observe scene content at different resolutions. The straightforward solution of supersampling by rendering with multiple rays per pixel is impractical for NeRF, because rendering each ray requires querying a multilayer perceptron hundreds of times. https://github.com/google/mipnerf

See also:

Semantic-Aware Implicit Neural Audio-Driven Video Portrait Generation,(NeRF), 2022

Animating high-fidelity video portraits with speech audio is crucial for virtual reality and digital entertainment. While most previous studies rely on accurate explicit structural information, recent works explore the implicit scene representation of Neural Radiance Fields (NeRF) for realistic generation.

Relighting with 4D Incident Light Fields

It is possible to re-light and de-light real objects illuminated by a 4D incident light field, representing the illumination of an environment. By exploiting the richness in angular and spatial variation of the light field, objects can be relit with a high degree of realism.


Relighting with NeRF

Another dimension in which NeRF-style methods have been augmented is in how to deal with lighting, typically through latent codes that can be used to re-light a scene. NeRF-W was one of the first follow-up works on NeRF, and optimizes a latent appearance code to enable learning a neural scene representation from less controlled multi-view collections.

Neural Reflectance Fields improve on NeRF by adding a local reflection model in addition to density. It yields impressive relighting results, albeit from single point light sources. NeRV uses a second “visibility” MLP to support arbitrary environment lighting and “one-bounce” indirect illumination.

Source: https://en.wikipedia.org/wiki/Light_stage


Source: Advances in Neural Rendering, https://www.neuralrender.com/

Learning to Factorize and Relight a City

A learning-based framework for disentangling outdoor scenes into temporally-varying illumination and permanent scene factors, using imagery from Google Street View, where the same locations are captured repeatedly through time.


Source: https://en.wikipedia.org/wiki/Light_stage

NeRF for computer vision tasks: Scene Labeling and Understanding with Implicit Scene Representation, 2021

The intrinsic multi-view consistency and smoothness of NeRF benefit semantics by enabling sparse labels to efficiently propagate.

Pose Estimation

iNeRF: Inverting Neural Radiance Fields for Pose Estimation, Yen-Chen et al. IROS 2021 | bibtex
A-NeRF: Surface-free Human 3D Pose Refinement via Neural Rendering, Su et al. Arxiv 2021 | bibtex
NeRF--: Neural Radiance Fields Without Known Camera Parameters, Wang et al., Arxiv 2021 | github | bibtex
iMAP: Implicit Mapping and Positioning in Real-Time, Sucar et al., Arxiv 2021 | bibtex
GNeRF: GAN-based Neural Radiance Field without Posed Camera, Meng et al., Arxiv 2021 | bibtex
BARF: Bundle-Adjusting Neural Radiance Fields, Lin et al., ICCV 2021 | bibtex
Self-Calibrating Neural Radiance Fields, Park et al., ICCV 2021 | github | bibtex

NeRF methods in comparison

Source: Advances in Neural Rendering, https://arxiv.org/abs/2111.05849, https://www.neuralrender.com/

Combining LIDAR and radiance fields

https://urban-radiance-fields.github.io/

Deformable/Animation


https://github.com/3a1b2c3/seeingSpace/wiki/Related-fields-(Photogrametry,-LIDAR,-SLAM-etc) https://github.com/3a1b2c3/seeingSpace/wiki/Important-concepts https://github.com/3a1b2c3/seeingSpace/wiki/Recommended-resources-and-reading

Conclusion

Over the past year (2020), we’ve learned how to make the rendering process differentiable, and turn it into a deep learning module. This sparks the imagination, because the deep learning motto is: “If it’s differentiable, we can learn through it”. If we know how to differentially go from 3D to 2D, it means we can use deep learning and backpropagation to go back from 2D to 3D as well.

With neural rendering we no longer need to physically model the scene and simulate the light transport, as this knowledge is now stored implicitly inside the weights of a neural network. The compute required to render an image is also no longer tied to the complexity of the scene (the number of objects, lights, and materials), but rather the size of the neural network.

Neural rendering has already enabled applications that were previously intractable, such as rendering of digital avatars without any manual modeling. Neural rendering could have a profound impact in making complex photo and video editing tasks accessible to a much broader audience.

This is no longer a neural network that is predicting physics. This is physics (or optics) plugged on top of a neural network inside a PyTorch engine. We have now a differentiable simulation of the real world (harnessing the power of computer graphics) on top of a neural representation of it.

https://towardsdatascience.com/three-grand-challenges-in-machine-learning-771e1440eafc Vincent Vanhoucke, Distinguished Scientist at Google This is why I call the grand challenge for perception the Inverse Video Game problem: predict not only a static scene but its functional semantics and possible futures. You should be able to take a video, run it through your computer vision model, and get a representation of a scene you can not only parse, but can roll forward in time to generate plausible future behaviors, from any viewpoint, like a video game engine would. It would respect the physics (a ball shot at a goal would follow its normal trajectory), semantics (a table would be movable, a door could be opened), and agents in the scene (people, cars would have reasonable NPC behaviors).

A lot of computer vision and graphics algorithms can be defined in a closed form solution, which therefore allows for optimizations.

Neural computing is very computationally expensive, in part because (by design) it cannot be reduced to a closed-form solution in the same way. Many such closed-form expressions contain integrals that are impossible to solve analytically. We then use approximations (like Markov Chain Monte Carlo), but they take a long time to converge. It is really worth it to use a neural network to at least prototype and find the right settings before computing the final version.

Source: https://medium.com/@hurmh92/autonomous-driving-slam-and-3d-mapping-robot-e3cca3c52e95

ICARUS: A Specialized Architecture for Neural Radiance Field Rendering, https://arxiv.org/pdf/2203.01414.pdf
