Tips for easy 3D scanning (using the hardware you already have).

Goals of this article: getting you up to speed on the latest developments in 3D scanning so that you can reliably output the meshes you need for free or at low cost, no matter your hardware.

Level: Intermediate (I will assume some familiarity with Android/iOS and Windows/Linux/macOS computers, and that you've heard of 3D scanning before).

Whether you process the data in the cloud or locally, and whether you capture it with photos or with structured light (LIDAR, infrared FaceID), what you want to acquire is always points, LOTS AND LOTS OF POINTS. Let's help you capture the ideal dataset, whatever hardware you have.

Part 1: Smartphones

As we approach the end of 2021, Android and its main OS developer, Google, still haven't provided anything as comprehensive as the Object Capture/ARKit tooling Apple offers on iOS. However, Apple likes everything to be “walled in”, so you've also got to process those iOS captures… on macOS. Meh.

Cloud 3D reconstruction changed the game: you can let someone else run an energy-efficient M1 Mac mini or one large Mac Pro, just send them the data, and rent the compute as a service. The introduction of Polycam Web lets any Android or quadcopter drone user take advantage of this framework by using “photo mode” and uploading photos from outside the iOS ecosystem.

Best apps for iOS: Polycam, Trnio.

Best apps for Android: OpenCamera (1-2 second interval HQ jpeg capture with all filtering disabled)

How to acquire the ideal dataset: Since photogrammetry works by tracking 2D features to create depth maps, you need some level of overlap between photos. This is exactly the same idea as “panorama mode” or panorama stitching, but in 3D this time. The other thing you want is sharpness, since 2D features cannot be tracked in the presence of motion blur or out-of-focus areas (which makes very small objects difficult to scan). Last but not least, you want coverage (seeing every side).

To achieve overlap, you want about 20% of the frame to remain similar while you move or rotate, and you want your object to have plenty of grain or texture to capture. You'll never be able to capture a chrome ball with today's algorithms: reflective surfaces are view-dependent, meaning they change optically depending on your own position in each photo. If overlap is not achieved, most 3D reconstruction software will fail to produce a continuous result and start making wild guesses about how far you traveled between each picture. The easiest method to maximize both coverage and overlap is a low+high two-pass orbit: you first rotate around the object from your highest height, holding the phone up and keeping the object centered in the frame while circling it, then you do a second pass from a low angle to capture the undersides. Optionally, you can then break from the orbit and gradually get closer to a detail you want to make sure is captured correctly. Remember: if you don't have coverage, to the computer it's just an unknown blob. Even with the latest machine learning, reality is so rich and surprisingly complex that computers don't make great guesses yet, at least not in 3D.
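
As a rough rule of thumb, you can estimate how many photos one full orbit needs for a given amount of shared frame. This is my own back-of-the-envelope geometry (assuming a ~70° horizontal field of view and ignoring changes in distance), not something the apps expose:

```python
import math

# Back-of-the-envelope sketch (my assumption: simple pinhole camera,
# object-centered orbit): how many photos does a 360-degree orbit need
# so that consecutive frames still share a given fraction of the view?
def photos_per_orbit(hfov_deg: float = 70.0, shared: float = 0.2) -> int:
    max_step_deg = (1.0 - shared) * hfov_deg   # largest allowed angular step
    return math.ceil(360.0 / max_step_deg)

print(photos_per_orbit(shared=0.2))   # ~7 photos if 20% of the frame stays similar
print(photos_per_orbit(shared=0.8))   # ~26 photos if you want 80% overlap
```

In practice you'll shoot more than the minimum, especially on the low pass and on detail close-ups, but it gives you a feel for why a sparse orbit falls apart.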

Side note: It’s as much a philosophical debate as it is about algorithm design; filling the void reveals biases in the training data.

Finally, let's expand a bit on why sharpness matters, and how it intersects with lighting, sensor size, and ISO noise (the sensor's sensitivity). A smartphone generally achieves photographic quality despite its small sensor by doing all sorts of fancy computational tricks to make the picture look good (this is why cameras that make a physical shutter sound are bigger and heavier, and capture raw photos instead). Unfortunately, the tricks that make pics look good for social media also make them less-than-ideal candidates for photogrammetry. Your aim is to get as close as possible to a “straight out of the sensor” jpeg from your smartphone, because when a smartphone removes ISO noise using an algorithm, it introduces a 2D error, which will propagate into a 3D error. You can imagine that any other computational photography trick will propagate errors too.

Example: many smartphones introduced “night mode”, which works by shooting a burst of high-ISO photos and recombining around 10 of them to eliminate the noise. Any imperceptible alignment error between those 10 photos introduces errors too.
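
Here's a toy simulation of that trade-off (synthetic 1D data, not a real night-mode pipeline): stacking perfectly aligned frames really does cut the random noise, but the same averaging with tiny misalignments smears the real detail, and that smear is exactly the kind of 2D error that later shows up in 3D.

```python
import numpy as np

rng = np.random.default_rng(0)
truth = np.zeros(100)
truth[40:60] = 1.0                                   # a sharp edge in the "scene"

frames = [truth + rng.normal(0, 0.3, truth.shape) for _ in range(10)]
stacked = np.mean(frames, axis=0)                    # perfectly aligned burst

shifted = [np.roll(f, rng.integers(-2, 3)) for f in frames]
stacked_misaligned = np.mean(shifted, axis=0)        # burst with 1-2 px alignment errors

print("single frame error :", np.std(frames[0] - truth).round(3))           # ~0.3
print("aligned stack error:", np.std(stacked - truth).round(3))             # ~0.1, noise / sqrt(10)
print("misaligned error   :", np.std(stacked_misaligned - truth).round(3))  # higher: the edge gets smeared
```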

So, to capture the ideal dataset, we want to avoid any 2D filtering in the smartphone and find a balance between shutter speed and ISO noise. Shutter speed is a measure of how long you let light in, and therefore of motion blur. Since you're rotating around your object, if you don't pause and click for each photo (tedious), you will see blur in any low-light situation. The thing is, you're also dealing with a small sensor (it doesn't capture lots of light) AND you usually want to scan on overcast days to achieve an “unlit” look that lets your scan work under any new ray-traced/game lighting conditions.

Shutter speed is not something smartphone users think about much, but it's an obsession for cinematographers and photographers alike: people move, you move, and 1/100 is usually the minimum to achieve sharp results. Due to the small sensor size, your smartphone will often drop as low as 1/30th of a second, so if you're doing the two-pass orbit I recommended above, you might see that the center object is sharp but the background has rotation blur. That's bad. 2D feature tracking for 3D reconstruction works when it can recognize features, and the parallax between the object you're scanning and the background is essential (parallax: how fast things appear to move relative to each other, like when you're in a car or train, your eyes lock onto an object and follow it, and the background suddenly seems to move counter to your direction of travel). If the background is blurred, the algorithms lack a ground plane/reference frame to guess your camera position in 3D space. So you have to boost ISO to reach a shutter speed of 1/60 or higher if you're scanning fast. Thankfully, on iOS, Polycam and Trnio handle this automatically, taking pictures at the moment when blur is lowest; even at slow shutter speeds they will only take a picture when it's sharp.
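
To put rough numbers on that, here's a pinhole-camera approximation with example values of my own (adjust the field of view and resolution to match your phone); it converts orbit speed and shutter speed into pixels of background smear:

```python
# Rough sketch: how many pixels of smear does an orbiting capture produce?
def blur_px(orbit_deg_per_s: float, shutter_s: float,
            hfov_deg: float = 70.0, width_px: int = 4000) -> float:
    degrees_moved = orbit_deg_per_s * shutter_s      # rotation during the exposure
    return degrees_moved / hfov_deg * width_px       # converted to image pixels

print(round(blur_px(10, 1/30), 1))    # ~19 px of smear: features are unusable
print(round(blur_px(10, 1/100), 1))   # ~5.7 px: much more trackable
```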

Ideally, someone could make an Android app that does this for dataset capture too! You don't want to waste time deleting blurry pics; let the smartphone's CPU decide when to take the picture by estimating how much you're moving from the gyroscope's angular velocity data (lowest = better).
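
A minimal sketch of what that trigger logic could look like, using a simulated stream of gyroscope magnitudes rather than the real Android sensor API (the threshold and window size are arbitrary placeholders):

```python
from collections import deque

STEADY_RAD_S = 0.05   # assumed "steady enough" rotation rate
WINDOW = 10           # consecutive samples that must stay under the threshold

def capture_triggers(gyro_samples):
    """gyro_samples: iterable of |angular velocity| magnitudes in rad/s."""
    recent = deque(maxlen=WINDOW)
    for i, omega in enumerate(gyro_samples):
        recent.append(omega)
        if len(recent) == WINDOW and max(recent) < STEADY_RAD_S:
            yield i          # a good moment to fire the shutter
            recent.clear()   # then wait for the next steady window

# toy stream: moving, briefly pausing, then moving again
stream = [0.4] * 20 + [0.02] * 15 + [0.5] * 20
print(list(capture_triggers(stream)))   # -> [29]: triggers once, during the pause
```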

In the meantime, Android users can decide whether they prefer taking pictures manually or using the 1-2 second interval capture mode of the OpenCamera app.

Get a feel for it! There's a sort of rhythm to starting and stopping to capture sharp pictures while scanning fast on this kind of interval. Using this method and a two-pass high/low orbit, you can scan a medium-sized rock or a tree trunk in less than 3 minutes.

Smartphones with ARM CPUs are amazing for acquiring datasets because they're lightweight and always in your pocket, but since the small sensor can get in the way, let's explore other options for taking hundreds of photos with ease before we talk about how to do cloud and local 3D reconstruction.

My pick: a used iPhone SE II with a cracked screen (you're not paying for anything extra; this is a work device for me, not a toy). Add tempered glass protection (like the previous owner should have done) and a silicone case, and you're good to go!

Part 2: DSLRs, micro 4/3, or the cheapest, sharpest cameras

The larger the sensor, the more light gets in, and the higher the shutter speed you can achieve at lower noise. Noise prevents good 2D feature tracking and leads to 3D errors. So a bigger sensor and a wider aperture (a small f-number like f/1.7) will lead to fewer errors and faster, more confident scanning.

Problem: DSLRs are super expensive. Solution: because smartphones became good enough for social media, plenty of people are selling used micro 4/3 cameras, which occupy a niche in sensor size between the heavyweight full-frame sensors (like a Canon 5D Mark II), APS-C, and the very small sensors of smartphones and point-and-shoot cameras (which are pretty useless now lol).

So that means you can get good used cameras with micro 4/3 sensors for decent prices. These cameras are extremely solid, and micro 4/3 bodies often have a “silent shutter” mode, aka an electronic shutter, where the mechanical curtain does not physically move, extending the operational lifetime of the camera by many, many years (moving pieces = failure risk). Look for a camera with 10 megapixels or more of resolution and good ISO performance (there are good websites that publish comparative image-quality scores for each model). To achieve sharpness and avoid motion blur, a camera with in-body or lens stabilisation is ideal.

My pick: a Panasonic GX85 with the 12-35 kit lens (dual stabilisation and no AA filter).

JPEG is usually enough; don't bother with RAW, because it will just clog up your RAM (if processing locally) or your Wi-Fi (if going to the cloud). The exception is if you're using RAW to create an “unlit” texture look by lifting the shadows in something like Lightroom or Affinity Photo's batch mode, producing an “ideal HQ jpeg” from the RAW data.

Peak sharpness for most lenses is roughly halfway between the highest and lowest aperture, usually around f/5.6 on small sensors because of optical diffraction effects. You want the highest f-number below f/8 that still gives you your desired shutter speed above 1/60th of a second and a tolerable amount of ISO noise (ISO 100 to 1600). Experiment with ISO and f-number; you'll find, for example, that for small objects it's worth boosting the ISO to get a higher f-number (smaller aperture) and make sure there is no depth-of-field blur (blurry background). This will ensure continuity. In other cases the complete opposite is true, such as capturing large objects in low light, where a wide aperture (small f-number) is ideal.
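
If you want to reason about these trades numerically, standard exposure reciprocity is enough. A small sketch with example values of my own, keeping the overall exposure constant while swapping aperture and ISO:

```python
# Exposure stays constant when shutter * ISO / f_number^2 stays constant.
def new_shutter(base_shutter_s: float, base_f: float, new_f: float,
                base_iso: int, new_iso: int) -> float:
    return base_shutter_s * (new_f / base_f) ** 2 * (base_iso / new_iso)

# Metered at f/2.8, ISO 200, 1/250 s. Stopping down to f/5.6 for more depth
# of field costs two stops; bumping ISO to 800 buys them straight back:
print(1 / new_shutter(1/250, 2.8, 5.6, 200, 800))   # ~250 -> still about 1/250 s
# The same aperture change without touching ISO drops you to ~1/62 s:
print(1 / new_shutter(1/250, 2.8, 5.6, 200, 200))   # ~62.5
```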

Generally, the wider the lens the better, but any barrel distortion (“GoPro effect”) will introduce optical errors into the reconstruction. I bought a 7.5mm fisheye lens and it proved to be a waste of money, since I don't have the patience to undistort in 2D or to shoot in RAW and pre-process. Wide-angle, high-aperture lenses with no barrel distortion are very expensive because they are optically complex objects with precise construction needs (usually German or Japanese lenses). Investing in one could prove useful for scanning faster if you're scanning for a business. Lenses are good investments, while camera bodies drop in value over time. A micro 4/3 mount like the one used by Panasonic is extremely popular and gives you a good lens selection. Lenses for full-frame sensors quickly get astronomically expensive and really heavy (it's a matter of geometry: you've got to have lots and lots of glass if you want a high aperture ratio).

Part 3: LIDAR, FaceID, and other structured light hardware.

What if you have access to something other than just capturing incoming photons? What if consumer hardware designed to make AR better happened to carry structured light equipment similar to something that used to cost 30K USD a few decades ago?

The basic principle: a portable device shoots structured light at the scene and analyzes the way it bounces off the surface to generate the points directly, instead of relying on 2D tracking. This has a few pros and cons compared to pure photogrammetry:

Pros: 

- Works at night (no texture though) since your capture device becomes a light!

- Reconstruction can happen locally and at extremely low CPU cost. All it has to do is merge the various 2.5D slices you're sending it (see the small sketch after this list). Most devices now do realtime reconstruction. This means that if you have a successful capture method, you could capture hundreds of objects a day. Industrial-grade tech!

- Stability: it can capture people's faces even if they are moving slightly; since each 2.5D slice has lots of data, alignment noise is lower.

- Instant scanning of indoor walls! This is what it's designed for. Measure anything, AR-enhance your living room, etc.
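
A minimal sketch of what one of those 2.5D slices is, under a simple pinhole camera assumption (toy intrinsics and a fake depth map, not any vendor's SDK): each frame is a depth map that unprojects into a partial point cloud, and the device keeps merging those clouds as you move.

```python
import numpy as np

def unproject(depth, fx, fy, cx, cy):
    """Turn an H x W depth map (meters) into an H x W x 3 array of 3D points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1)

depth = np.full((480, 640), 1.5)                     # fake flat wall, 1.5 m away
cloud = unproject(depth, fx=500, fy=500, cx=320, cy=240).reshape(-1, 3)
print(cloud.shape)                                   # (307200, 3): one 2.5D slice
```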

Cons:

- Range: a major limitation is that the maximum range of structured light follows the inverse square law, so intensity, and therefore precision, drops off quadratically as you get farther away (see the short sketch after this list). This means scanning large or gigantic objects is out of the question, as the max range is about 3 m for IR and 10-20 m for LIDAR.

- Alignment errors: due to the range and precision limitations, you will often find that you don't have enough background information to track your movement correctly, leading to wild reconstruction errors where the model duplicates itself instead of recognizing that it's you, the observer/camera, who moved.

- Resolution: LIDAR isn't super high-res, especially on the geometry side; it's kind of blocky and disappointing. Most folks I talked to were let down by the promises of iPad LIDAR, for example.
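
Here's the short sketch referenced in the range point above; it's just the idealized inverse-square fall-off, ignoring sensor and atmospheric effects:

```python
# Signal left at distance d, relative to what you had at 1 m (idealized emitter).
for d in (1, 2, 3, 5, 10):
    print(f"{d} m: {1 / d**2:.1%} of the signal at 1 m")
# 1 m: 100.0%, 2 m: 25.0%, 3 m: 11.1%, 5 m: 4.0%, 10 m: 1.0%
```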

My pick: any iPhone with FaceID, the Scandy Pro app, and the Lookout accessory.

The Lookout accessory lets you mirror/flip the direction of IR scanning, not to recognize your face but to do full-body scans of friends who are standing still, or of small objects. Pretty cool!

Note: scanning to vertex color instead of a texture gives you great flexibility for importing into Nomad Sculpt, resulting in an offline, on-device pipeline for scanning and creative recombination.

Part 4: Reconstruction energy/cost considerations and trends

The Degrowth perspective is that we should all attempt to produce fewer devices, make them more energy efficient, use the cloud less, and spend less time on energy-hungry desktops. How does this relate to everything I mentioned above? Let's imagine various scanning scenarios and make some assumptions:

Scenario 1: Phone scanning, cloud reconstruction (efficient cloud machines built on ARM CPUs consume around 30 watts per reconstruction node, but the network transfer is inefficient and contributes to ICT energy demand growth, which is all too often met with fossil fuels that should rather stay in the ground. Avoid 4G and 5G; prefer Wi-Fi to reduce your carbon footprint and your phone bill).

Scenario 2: IR local scanning (the most efficient, with roughly 5-watt realtime reconstruction on an iPad or Surface Pro ARM CPU).

Scenario 3: DSLR and local reconstruction (the camera consumes next to no energy, but you are responsible for the energy efficiency of the reconstruction. If you're not careful you could end up burning 250 watts for no good reason, reconstructing for hours and hours, unable to use your PC for anything else).
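
To put rough numbers on those three scenarios (the wattages are the ones quoted above; the durations per scan are my own guesses, adjust them to your workflow):

```python
def kwh(watts: float, hours: float) -> float:
    return watts * hours / 1000

print(kwh(30, 0.5))    # cloud ARM node, ~30 min per job   -> 0.015 kWh
print(kwh(5, 0.25))    # on-device IR reconstruction       -> ~0.001 kWh
print(kwh(250, 3.0))   # desktop photogrammetry marathon   -> 0.75 kWh
```

Network transfer and idle time aren't counted here, which is precisely the hidden part of Scenario 1.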

Recommendation: consider using as few photos as possible for the cloud, using the most efficient ARM CPUs, or processing locally during the midday peak, when solar panels and other renewables are generating the most. Processing the scans at night means some nuclear plant, battery bank, or gas-fired thermal plant has to run and use resources to power that work. The Degrowth perspective is that the excess generation of renewables is still very generous, and it's kind of SolarPunk to time your computing needs to the Sun.

For large scans (over 100 photos), there are diminishing returns on resolution; batch-resizing jpegs from 4K and above down to 2048 px horizontal (using Affinity Photo for iPad, for example) ensures the fastest upload/reconstruction time while minimizing your ICT footprint.
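
If you'd rather script that resize than run it through Affinity Photo, a minimal Pillow sketch does the same job (the folder names are placeholders):

```python
from pathlib import Path
from PIL import Image

src, dst = Path("scan_photos"), Path("scan_photos_2048")
dst.mkdir(exist_ok=True)

for photo in src.glob("*.jpg"):
    with Image.open(photo) as img:
        img.thumbnail((2048, 2048))             # cap the longest side at 2048 px
        img.save(dst / photo.name, quality=90)  # re-encode as a lighter jpeg
```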

Best cloud option: Polycam (subscription, but you get 2 free scans per month as a trial).

Local option: RealityCapture PPI (pay-per-input: you license per megapixel of the dataset, as low as $0.30 per scan).

Open source alternative: Meshroom is free 3D reconstruction software built on the AliceVision nodal framework (depth-map generation is super slow as of 2021, though, and unless you're running it on ARM it's a big waste of energy). The Degrowth point of view here is that we need to swap the depth-map generation for something much faster that still runs on ARM CPUs and isn't RAM-hungry. If the issue of temporal coherence can be resolved, something like the BoostingMonocularDepth algorithm could be ideal.

Parting thoughts: I hope this article helped. Some of you might be wondering why I think blogs are still worth investing in instead of making videos on YouTube; it's about accessibility, portability, and ownership. I have this article saved offline and can publish it on a self-hosted device like my Raspberry Pi; it's screen-reader friendly, so it's more inclusive than most YouTube videos, and it also uses less data, which is great for inclusion and degrowth. I prefer packing the maximum amount of information into the least amount of data. Plus, it's easy to make a PDF guide out of it and send it to friends.
