Combining Volume Rendering with Adaptive Ray Marching in Neural Radiance Fields

Yanke Song1, Zelin Li2
1Harvard University, 2MIT Sloan School of Management

Abstract

In this project, we explore a more efficient way of performing volume rendering given a neural scene representation. Specifically, given the locally conditioned Neural Radiance Field introduced in PixelNeRF [6], we replace the evenly sampled points in its volume rendering procedure with adaptively sampled points to reduce the time cost of rendering. We obtain the adaptively sampled points by first predicting the approximate intersection between the query ray and the scene via the differentiable ray-marching technique [4] using an LSTM, and then evenly sampling points with tiny steps around that intersection.

Related Work

Our work is inspired by recent work in the field that focuses on predicting neural scene representation with one or few input images. We review the related work behind the key techniques in our project.

Local Conditioning on Neural Radiance Fields PixelNeRF [6] augments Neural Radiance Field (NeRF) [2], a neural-network-based 3D scene representation, with a fully convolutional image encoder that learns spatial image features shared across scenes, as well as a local-conditioning technique that extracts local features by projecting query points onto the input images. Another line of work [1, 3] introduced similar techniques with different ways of aggregating local features, referring to this kind of technique as Warp-Conditioned Embedding (WCE). Reizenstein et al. [3] also incorporate WCE into Scene Representation Networks (SRN) [4], which yields a decent improvement over vanilla SRN and in fact produces the best depth estimation results in their experiments.
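To illustrate the local-conditioning idea, the sketch below projects 3D query points into an input view and bilinearly samples encoder features at the projected pixels. Function names, tensor shapes, and the camera conventions are our own illustrative assumptions rather than PixelNeRF's actual interface.

```python
import torch
import torch.nn.functional as F

def sample_local_features(feat_map, points, K, cam2world):
    """Project 3D query points into an input view and bilinearly sample
    local image features (illustrative sketch, not PixelNeRF's exact code).

    feat_map:  (1, C, H, W) feature map from the image encoder
    points:    (N, 3) query points in world coordinates
    K:         (3, 3) camera intrinsics
    cam2world: (4, 4) camera-to-world pose of the input view
    """
    # Transform points into the camera frame of the input view.
    world2cam = torch.inverse(cam2world)
    pts_h = torch.cat([points, torch.ones_like(points[:, :1])], dim=-1)  # (N, 4)
    pts_cam = (world2cam @ pts_h.T).T[:, :3]                             # (N, 3)

    # Perspective projection to pixel coordinates.
    pts_pix = (K @ pts_cam.T).T
    uv = pts_pix[:, :2] / pts_pix[:, 2:3].clamp(min=1e-8)                # (N, 2)

    # Normalize to [-1, 1] for grid_sample and gather local features.
    H, W = feat_map.shape[-2:]
    grid = torch.stack([2 * uv[:, 0] / (W - 1) - 1,
                        2 * uv[:, 1] / (H - 1) - 1], dim=-1)
    grid = grid.view(1, -1, 1, 2)                                        # (1, N, 1, 2)
    feats = F.grid_sample(feat_map, grid, align_corners=True)            # (1, C, N, 1)
    return feats[0, :, :, 0].T                                           # (N, C)
```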

Aggregating projected features with a Transformer When we apply WCE to multi-view inputs, we need to aggregate local features extracted from multiple query points across multiple views. PixelNeRF simply processes each query point independently and aggregates by averaging over the views. NerFormer [3] instead uses a Transformer to process these features and produces better results. The authors argue that a Transformer (1) can learn to aggregate features from multiple views in a smarter way and (2) is capable of spatial reasoning over multiple query points.
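The following is a simplified sketch of Transformer-based aggregation across views, in the spirit of NerFormer but not its actual architecture; the class name, feature dimension, and use of mean pooling after attention are assumptions.

```python
import torch
import torch.nn as nn

class ViewAggregator(nn.Module):
    """Aggregate per-view projected features with a Transformer instead of
    simple averaging (a simplified sketch, not NerFormer's actual model)."""

    def __init__(self, feat_dim=512, n_heads=8, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, view_feats):
        # view_feats: (N_points, N_views, feat_dim) local features gathered
        # from each source view for every query point.
        attended = self.encoder(view_feats)   # attend across views
        return attended.mean(dim=1)           # (N_points, feat_dim)

# Simple averaging, as in PixelNeRF, would instead be: view_feats.mean(dim=1)
```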

Differentiable Ray Marching Both PixelNeRF [6] and NerFormer [3] use volume rendering with evenly spaced points along the query ray, many of which may fall in empty space. Since volume rendering is a relatively slow rendering technique, this wastes computation. SRN [4], on the other hand, uses a differentiable ray-marching technique, which employs an LSTM to adaptively step along the query ray until it finds the intersection of the ray with the scene. SRN then passes this intersection point through a pixel generator network that directly outputs the color, instead of performing volume rendering. Reizenstein et al. [3] show that NerFormer outperforms SRN+WCE in many categories, suggesting that volume rendering, although more expensive, produces better results than ray marching.
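For concreteness, the baseline behavior, volume rendering along a single ray with evenly spaced samples, can be sketched as follows. This is a minimal NeRF-style quadrature; the `radiance_field` callable, tensor shapes, and default bounds are placeholders rather than PixelNeRF's actual code.

```python
import torch

def render_ray_uniform(radiance_field, origin, direction,
                       near=0.5, far=2.0, n_samples=64):
    """Standard volume rendering with evenly spaced samples along one ray.

    radiance_field(pts, dirs) is assumed to return (rgb, sigma) per point.
    """
    t = torch.linspace(near, far, n_samples)                 # even depths
    pts = origin + t[:, None] * direction                    # (n_samples, 3)
    dirs = direction.expand(n_samples, 3)

    rgb, sigma = radiance_field(pts, dirs)                   # (n, 3), (n,)

    # Quadrature weights: alpha compositing along the ray.
    deltas = torch.cat([t[1:] - t[:-1], torch.tensor([1e10])])
    alpha = 1.0 - torch.exp(-sigma * deltas)
    trans = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alpha + 1e-10]), dim=0)[:-1]
    weights = alpha * trans                                   # (n_samples,)

    return (weights[:, None] * rgb).sum(dim=0)                # composited color
```

Note how every ray spends the same budget of samples regardless of where the surface actually is, which is the inefficiency our method targets.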

Method

We propose an adapted version of the differentiable ray-marching algorithm that first uses a ray-marching long short-term memory network (RM-LSTM) along the query ray to learn adaptive step lengths for the tracing procedure. The goal of the RM-LSTM is to land in a neighborhood of the actual scene surface so that the neural radiance field can learn the output there. At this point, we can either directly output a color or, as a better alternative, sample points around the terminal point predicted by the RM-LSTM with much smaller step sizes and perform volume rendering over those points.

We run our experiments by modifying the existing PixelNeRF architecture. Instead of the standard volume rendering procedure, we extract the latent features learned by the CNN encoder of PixelNeRF and feed them as input into the RM-LSTM shown in the architecture figure below. Essentially, the RM-LSTM predicts the step length along the ray of interest and updates the depth. The RM-LSTM has 512 feature channels, and we set the number of steps to 10 in our experiments due to computational constraints. We then pass the output through the full PixelNeRF network again to obtain RGB values. To guide the initial convergence of the LSTM, we add a regularization term to the usual MSE reconstruction loss that penalizes any final depth outside 0.5 (near plane) and 2.0 (far plane). In our experiments, we use a pretrained ResNet-34 model for the CNN encoder. We denote this structure as Raymarcher.
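A minimal sketch of the RM-LSTM loop and the depth-bound regularizer is shown below. The `extract_features` callback stands in for querying PixelNeRF's encoder features at the current 3D point, and the softplus step parameterization and exact penalty form are illustrative assumptions rather than our exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RayMarchLSTM(nn.Module):
    """Sketch of the RM-LSTM: predict a step length at each iteration and
    advance the per-ray depth (illustrative, not the exact implementation)."""

    def __init__(self, feat_dim=512, n_steps=10, near=0.5, far=2.0):
        super().__init__()
        self.cell = nn.LSTMCell(feat_dim, feat_dim)
        self.to_step = nn.Linear(feat_dim, 1)   # predicted step length
        self.n_steps, self.near, self.far = n_steps, near, far

    def forward(self, origins, dirs, extract_features):
        B = origins.shape[0]
        depth = torch.full((B, 1), self.near, device=origins.device)
        h = c = torch.zeros(B, self.cell.hidden_size, device=origins.device)

        for _ in range(self.n_steps):
            pts = origins + depth * dirs                   # current points on rays
            h, c = self.cell(extract_features(pts), (h, c))
            depth = depth + F.softplus(self.to_step(h))    # always march forward
        return depth                                       # final predicted depth

def depth_bound_penalty(depth, near=0.5, far=2.0):
    # Regularizer added to the MSE loss: penalize depths outside [near, far].
    return (F.relu(near - depth) + F.relu(depth - far)).mean()
```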

For our extension, we follow the same procedure as above up to the point where the LSTM outputs the final depth. Instead of extracting RGB values from PixelNeRF at this single point, we sample n = 10 points within a ±0.05 neighborhood of the predicted depth and use the standard volume rendering process to compute the final RGB value. We denote this structure as AdaptiveVolumeRenderer (AVR).
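The AVR sampling step can be sketched as follows; `render_fn` stands in for a standard volume-rendering routine like the one sketched earlier and is an assumption, not our actual code.

```python
import torch

def adaptive_volume_render(render_fn, origins, dirs, pred_depth,
                           n_samples=10, radius=0.05):
    """AVR sketch: sample n points in a ±radius window around the depth
    predicted by the RM-LSTM, then reuse a standard volume-rendering routine
    (`render_fn`, assumed to composite given per-ray sample depths)."""
    offsets = torch.linspace(-radius, radius, n_samples, device=pred_depth.device)
    z_vals = pred_depth + offsets[None, :]                            # (B, n) depths
    pts = origins[:, None, :] + z_vals[..., None] * dirs[:, None, :]  # (B, n, 3)
    return render_fn(pts, z_vals, dirs)                               # composited RGB per ray
```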

Model Architecture

Results

In our experiments, we compare our two models, Raymarcher and Adaptive Volume Renderer (AVR), against a baseline that uses standard volume rendering with PixelNeRF. We train the Raymarcher and the AVR simultaneously as two networks that share the same PixelNeRF encoder. We use the pretrained weights from PixelNeRF and fine-tune for 200k steps.

We evaluate our models on the ShapeNet cars dataset, where the training set contains 2150 cars at resolution 32 with 8 views per scene, one of which is randomly selected as the input/source view. We then test on 352 cars for single-view reconstruction, using an informative input view (view 64). We compare performance using PSNR, SSIM [5], LPIPS [7], rendering speed, and videos of the produced output. The results are shown below.
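For reference, PSNR on images with values in [0, 1] can be computed with the minimal sketch below; SSIM [5] and LPIPS [7] rely on their standard reference implementations.

```python
import torch

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio between rendered and ground-truth images."""
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```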

Method                          PSNR    SSIM     LPIPS   Rendering speed
Volume Rendering (baseline)     23.34   0.9048   0.111   0.4 fps
Raymarcher                      23.02   0.9013   0.137   34.9 fps
Adaptive Volume Renderer (AVR)  23.22   0.9056   0.108   4.2 fps

Volume Rendering Output (Baseline)

Raymarcher Output

AVR Output

As we can see, compared with the standard volume rendering procedure, the Adaptive Volume Renderer achieves comparable results at roughly 10 times the rendering speed (4.2 vs. 0.4 fps). Raymarcher achieves an even larger speedup of more than 80 times, with a slight decrease in rendering quality.

Conclusion and Future Work

In this project, we proposed and implemented two more efficient rendering procedures for a locally conditioned NeRF. AVR achieves comparable quality with a noticeable speedup, whereas Raymarcher achieves a considerable speedup with a slight decrease in quality. For future work, we want to investigate whether these techniques can be conveniently combined with other approaches that aim to speed up NeRF.

References

  1. Philipp Henzler, Jeremy Reizenstein, Patrick Labatut, Roman Shapovalov, Tobias Ritschel, Andrea Vedaldi, and David Novotny. Unsupervised learning of 3d object categories from videos in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4700–4709, 2021
  2. Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021
  3. Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10901–10911, 2021
  4. Vincent Sitzmann, Michael Zollhöfer, and Gordon Wetzstein. Scene representation networks: Continuous 3d-structure-aware neural scene representations. Advances in Neural Information Processing Systems, 32, 2019
  5. Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004
  6. Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelnerf: Neural radiance fields from one or few images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4578–4587, 2021
  7. Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018