I know I said I would take a break from AI-powered stuff, but things are developing at breakneck speed, and it seems like every week there is something more impressive that I feel needs to be addressed.
While ChatGPT is getting a lot of attention, and rightfully so (it is beyond impressive, and anyone who has not yet tried it doesn’t know what they are missing), there are other developments that deserve a review.
One such project comes, again, from NVIDIA: Instant Neural Graphics Primitives with a Multiresolution Hash Encoding.
A neural radiance field (NeRF) is a type of artificial neural network that can create new views of a 3D scene from a set of 2D images. It is trained with a special kind of loss function called a rendering loss: the network renders images, compares them against the input photos, and adjusts itself until they match. Essentially, the NeRF algorithm interpolates between the input images to build a single, complete representation of the scene, which makes the approach highly effective for generating synthetic images from real-world data.
A NeRF network takes in the location and viewing direction of a point in the scene (5 dimensions of input) and outputs the color and opacity at that point (4 dimensions of output). It then uses a technique called volume rendering to composite those outputs into new views of the scene. Because the algorithm is computationally intensive, rendering complex scenes can take a long time, sometimes hours or even days, although newer algorithms, Instant NGP among them, improve performance dramatically.
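To make the 5-in, 4-out description concrete, here is a minimal PyTorch sketch of the idea, not NVIDIA’s or Nerfstudio’s actual implementation: a small MLP maps position plus view direction to color plus density, and volume rendering composites samples along a camera ray into one pixel. Real NeRFs also use positional encoding and smarter sampling, which I’m skipping here.

import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(5, hidden), nn.ReLU(),   # input: (x, y, z, theta, phi)
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),              # output: (r, g, b, sigma)
        )

    def forward(self, x):
        out = self.mlp(x)
        rgb = torch.sigmoid(out[..., :3])      # colors squashed into [0, 1]
        sigma = torch.relu(out[..., 3])        # density must be non-negative
        return rgb, sigma

def render_ray(rgb, sigma, deltas):
    # Classic volume rendering: how much light each sample absorbs,
    # how much survives to reach it, then a weighted sum of colors.
    alpha = 1.0 - torch.exp(-sigma * deltas)
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=0)
    trans = torch.cat([torch.ones(1), trans[:-1]])  # transmittance *before* each sample
    weights = alpha * trans
    return (weights[:, None] * rgb).sum(dim=0)

# Toy usage: 64 random samples along a single ray.
model = TinyNeRF()
samples = torch.rand(64, 5)            # sample positions + view directions
deltas = torch.full((64,), 0.01)       # spacing between consecutive samples
rgb, sigma = model(samples)
pixel = render_ray(rgb, sigma, deltas) # one RGB pixel value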
For this approach of recreating real-world visual data in 3D to work, NeRF needs to know the location and orientation of your camera. For now, the iPhone is the only smartphone that captures that out of the box, but there are other solutions for the rest of us mortals.
To get the camera positional data that NeRF needs, we can use COLMAP, a popular open-source structure-from-motion (SfM) and multi-view stereo (MVS) package. SfM is a technique for estimating the 3D structure of a scene, along with the camera poses, from a set of 2D images, and MVS is a method for generating a dense 3D point cloud or model from multiple images. COLMAP can automatically reconstruct the 3D geometry of a scene from a set of images and produces high-quality 3D models usable in a variety of applications, such as 3D printing, virtual reality, and more.
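If you are curious what that step looks like under the hood, here is a rough sketch using the pycolmap Python bindings, following the high-level pipeline from their documentation (Nerfstudio drives the COLMAP binaries itself, and the paths below are placeholders):

import pycolmap

database_path = "colmap/database.db"   # placeholder paths
image_dir = "data/video/frames"
output_path = "colmap/sparse"

pycolmap.extract_features(database_path, image_dir)  # detect keypoints in each frame
pycolmap.match_exhaustive(database_path)             # match keypoints across frames
maps = pycolmap.incremental_mapping(database_path, image_dir, output_path)
maps[0].write(output_path)                           # camera poses + sparse point cloud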
NeRF and COLMAP are similar in that they both generate a “3D view” of a scene, with COLMAP being more precise but lacking NeRF’s ability to “see” behind obstacles and generate missing views. In other words, NeRF will guess when information is missing, while COLMAP will not and, by design, cannot. Also, COLMAP is built to recreate 3D models from a set of images and needs a large amount of visual data, whereas a NeRF can be made from just a couple of photos. On the other hand, NeRF is not optimized for 3D model generation and needs certain voodoo magic, like marching cubes, to extract a 3D model out of a radiance field.
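That voodoo magic is less scary than it sounds. The usual trick is to sample the trained network’s density field on a regular 3D grid and run marching cubes over it; here is a minimal sketch, with a hand-made sphere standing in for a real trained NeRF:

import numpy as np
from skimage import measure

# Stand-in density field: high inside a sphere, fading to zero outside.
# With a real NeRF you would query the network at each grid point instead.
n = 64
coords = np.linspace(-1.0, 1.0, n)
x, y, z = np.meshgrid(coords, coords, coords, indexing="ij")
density = np.clip(1.0 - np.sqrt(x**2 + y**2 + z**2), 0.0, None)

# Extract the triangle mesh of the isosurface where density crosses the threshold.
verts, faces, normals, values = measure.marching_cubes(density, level=0.5)
print(f"mesh with {len(verts)} vertices and {len(faces)} triangles")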
Usually, people go with Getting Started with NVIDIA Instant NeRFs, but I wanted to check out what other developers have made so far. Spoiler: it’s too early for any truly useful application, but at the speed this is developing, I’m sure we will see full-blown apps that use NeRF soon.
So I went with Nerfstudio.
Installation is documented on their GitHub. They tested it with CUDA 11.3, but I also tested it on 11.8 and it works. If you are installing CUDA on Windows, NVIDIA has a handy guide on their website, or you can go to their download page and get a prebuilt binary for a hassle-free install.
For the first test, I found a glam cam 360 video made from photos of some Hollywood events. I downloaded it from YouTube and placed it in the /data/ folder under /nerfstudio. To get positional camera information, we need to install COLMAP. Nerfstudio documents an installation process for COLMAP on their site, but it involves recompiling the source code, which is a hassle for a quick test; luckily, COLMAP can be installed without issue from the binaries found on its website.
Now, this is close to the best-case scenario, as there are enough photos to recreate the scene even with older tools, but I wanted to give it the best shot first. The worst case will be explored in subsequent posts, so stay tuned.
Nerfstudio has a built-in script that runs COLMAP over a video or a set of images and processes the results into data it can use.
$> ns-process-data video --data .\data\video\JulieBenz-4x_BSRGAN.mp4 --output-dir .\data\video\Julie4x\
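Assuming it matches the Instant-NGP-style layout Nerfstudio uses (the exact keys may differ between versions), the script drops the extracted frames plus a transforms.json file holding the camera intrinsics and a pose matrix per frame, which you can sanity-check with a few lines of Python:

import json

with open("data/video/Julie4x/transforms.json") as f:
    transforms = json.load(f)

frames = transforms["frames"]
print(f"{len(frames)} registered frames")
print("first image:", frames[0]["file_path"])
print("camera-to-world matrix:")
for row in frames[0]["transform_matrix"]:
    print(row)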
Once the Nerfstudio script is done with COLMAP, we can run the training script with
$> ns-train nerfacto --data .\data\video\Julie4x\
We can watch the training process using the Nerfstudio viewer. If we want to use the viewer locally, we need to install it, and we do want that: I found that their remote version was not working, at least for me, and in the end I want everything running on my machine.
The viewer is a Node.js app managed with Yarn, so from the nerfstudio/nerfstudio/viewer/app folder, run yarn start and it will compile the app and start the local server.
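Spelled out (the yarn install step is my addition; it should only be needed the first time, to pull in the dependencies):

$> cd nerfstudio/nerfstudio/viewer/app
$> yarn install
$> yarn start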
Then go to
http://localhost:3000/
or whichever local address it prints, in your browser, and voilà, there is your NeRF preview in real time while it’s training.
While in the Nerfstudio viewer, you can add camera paths and render them out.
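Training writes its checkpoints and a config.yml under the outputs/ folder, and the viewer’s camera path editor exports the path as a JSON file; rendering then boils down to pointing ns-render at both. The exact flags have changed between Nerfstudio versions, so treat the line below as illustrative (the paths are placeholders) and check ns-render --help on your install:

$> ns-render --load-config outputs\Julie4x\nerfacto\<timestamp>\config.yml --camera-path-filename camera_path.json --output-path renders\julie.mp4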
It’s early days for NeRF, but even now the potential is immense, and it will only grow with a little more polish to the user interface and with some actions, like the COLMAP processing of the video, hidden away from users who aren’t really interested in them.
The marriage between standard photogrammetry and neural networks enables some amazing feats: looking behind corners, reducing the number of input images required, and improving on the signal-to-noise ratio of standard photogrammetry approaches, which try to assimilate every speck of dust into the reconstruction, sometimes making cleanup of the 3D models a pain.
This was just a quick and dirty introduction, and there’s a lot left to explore here: how it handles moving scenes, what the lower limit of input is for a good result (we already know that camera/video quality matters a lot), other software implementations, and so on.
So, plenty of content for future posts.