Autonomous Vehicle (AV) are progressing at a rapid pace (notwithstanding COVID 19 constraints). The past 2 months have experienced significant corporate events. These include Amazon’s acquisition of Zoox, Volkswagen’s investments in Argo, Yandex plans to spin-out its AV joint venture with Uber, and LiDAR unicorns (LUnicorns) Velodyne and Luminar announcing plans to go public at multi-B$ valuations (more than Zoox!).
On the deployment front, the pandemic has provided an opportunity for China-based AV companies to aggressively deploy ride-hailing, whereas trucking automation is making significant strides in the US (Waymo, Daimler, Ike, TuSimple, Aurora). Elon Musk has yet again announced that Tesla will have basic functionality for Level 5 AVs (no humans required in the car) by the end of 2020. Topping all this, General Motors is re-organizing its vaunted Corvette engineering team to support EVs (electric vehicles) and AVs. The focus is primarily on EVs, but the article mentions plans to upgrade GM Cruise’s AV platform (Origin) to Corvette like handling and comfort levels. Who says AVs are destined to become a boring utility?
Artificial Intelligence (AI) based systems are required for replacing a human driver. Continued innovation and testing of these systems have driven the need for richer sensor data, either through the use of many sensors per AV or higher sensor capabilities – range, accuracy, speed, visibility of FoV (Field of View), resolution and data rates. Paradoxically, the increased sophistication of sensors raises barriers for deployment – higher sensor and compute costs, increased power consumption and thermal issues, reliability and durability concerns, higher time to decision making (latency), and possibly more confusion and errors. It also increases requirements for data transmission bandwidth, memory and computing capabilities (all driving up power, heat, and $$$s).
Multiple directions can be pursued to thin down the sensor stack and focus on what matters in a driving environment (lean sensing). The rest of the article covers four approaches to realizing this: learning based sensor design, event-based sensing, Region of Interest (ROI) scanning, and semantic sensing.
1. Optimizing Sensor Design Through AI-Based Learning
Using complex and sophisticated sensor “instruments” and compute fabrics is fine during the development phase as the AI develops and trains itself (learning) to replace the human driver. Successful machine learning should be able to identify the features that are important in the deployment phase. Analysis of the neuron behavior in DNN (Deep Neural Networks) can reveal the aspects of sensor data that are important versus those that are superfluous (similar to DNN neurons processing 2d vision information). This in turn can help thin down sensor and compute specifications for deployment. One of the goals of machine learning during the AV development phase should be to specify sensor suites that provide actionable data at the right time with the optimal level of complexity – to enable timely and efficient decision making and driving decisions.
A previous article argued that AV players (like Waymo, Uber, Aurora, Cruise, Argo, Yandex) chose to control and own LiDAR sensor technology to ensure tighter coupling with the AI software stack. This coupling can also help understand which LiDAR performance features are critical for deployment. Working with multiple sensor modalities helps identify individual sensor features that are critical in different driving situations, eliminates duplicate and redundant information, and reduces unneeded sensor complexity. Tesla’s anti-LIDAR stance and Elon Musk’s “Lidar is a crutch” comment is an extreme case – where presumably the data and machine learning based on radar and camera data from over 0.5M cars deployed in the field has convinced Tesla that LiDAR is not required in AVs.
Human drivers sense a tremendous amount of information through different modalities – visual, audio, smell, haptic, etc. An inexperienced driver absorbs all this data, initially assuming that all of it is relevant. With practice and training, expert drivers can filter out the irrelevant and focus on the relevant information, both in time and space. This enables them to react quickly in the short term (braking for a sudden obstacle on the road or safely navigating out of traffic in the event of vehicle malfunction) and longer-term (changing lanes to avoid a slower moving vehicle). Machines trying to simulate human intelligence should be able to follow a similar model – initially acquire a large amounts of sensor data and train on this, but become more discriminating once the training achieves a certain level. Learning should allow a computer to select, sense, and act on the relevant data to ensure timely and efficient decision making.
Arriving at an optimally lean or thin sensor stack design is a function of AI and machine learning. Assuming this is done, the sensor system needs to decide what data to collect (event-based, ROI based) and how to process this data (semantic sensing).
2. Event-Based Sensing
Event-based sensing has existed in the military domain where one sensor (say a radar) can be used to detect an incoming threat, and cue another sensor (a camera or LiDAR) to pay more attention and devote more resources in that region (to recognize whether it is a friend or foe, for example). However, other techniques rely solely on the individual sensor itself to identify the event.
Prophesee (“predicting and seeing where the action is”) is a French company that specializes in developing event-based cameras. Their thrust is to emulate human or neuromorphic vision where the receptors in the retina react to dynamic information and the brain focuses on processing changes in the scene (especially for dynamic tasks like driving).
The basic idea is to use camera and pixel architectures that detect changes in light intensity over a threshold (an event), and providing only this data to the compute stack for further processing. Relative to a high-resolution framed camera, an event-based camera registers and transmits data for only 10-30% of the pixels experiencing intensity changes. The advantages are significant – lower data bandwidth, decision latency, storage, and power consumption.
In the scene below where the car and camera are stationary, the imagery collected by a standard framed HD camera provides a nice visual – but most of the data is not relevant for the immediate driving task (for example the buildings). A human driver easily filters out stationary objects like buildings and trees and focuses on moving pedestrians and cars to decide on the driving action. Prophesee’s event-based camera emulates this and captures pixel locations with intensity changes (events).
When the car is moving, pixels corresponding to surfaces with relative motion are activated. Certain pixels corresponding to large areas of intensity uniformity (like the sky or large surfaces of a stationary building) do not experience an intensity change and do not activate.
An event-based sensor requires significant innovation in the pixel circuitry. The pixels work asynchronously and at much higher speeds since they do not have to integrate photons like in a conventional frame-based camera. The interconnection technologies to enable this is being worked in collaboration with Sony. They recently announced a project to jointly develop a stacked event-based vision sensor with and industry leading pixel size (< 5 μm) and dynamic range (124 dB).
I spoke with Luca Verre, CEO of Prophesee regarding challenges and possibilities for event-based cameras, and by extension to event-based LiDAR.
SR: Couldn’t a fast frame-based camera do what you do? Difference consecutive frames and locate events, but at the same time also have intensity level information on the whole scene?
LV: No. Frame-differencing requires FPGA or SoC resources, whereas Prophesee’s cameras deliver events natively. Typically, only 10-30% of the data collected by framed cameras is relevant for driving control. Using such a camera creates unnecessary data overload, cost, and complexity, as opposed to our event-based cameras which identify and transmit only the useful information. Finally, achieving temporal precision of sub-milliseconds would require a frame-based camera running at 1 Khz frame rates,(difficult because of the required exposure times). Event-based cameras achieve this and high dynamic range performance since pixels asynchronously adjust exposure time according to the lighting conditions in the scene.
SR: In many situations, stationary objects are also relevant – like a pedestrian or parked car close to the car. Would an event-based camera suppress such information?
LV: Event-based cameras detect dynamic information. As long as the camera or objects in front of the camera do not move in a relative sense, there are no events and it is a safe situation. If relative motion occurs, it translates into an intensity change, and registers as an event.
SR: Can your cameras work in light-starved situations?
LV: Yes, see the figure above for a moving car. Since there is no exposure time, each pixel accumulates photons and triggers new events only when the light intensity changes by a certain amount (sensitivity threshold). Because of this, event-based cameras perform well in sub-lux lighting conditions. The trade-off is the time necessary to read-out the information as required by the application.
SR: Are you working on event-based LiDAR – in terms of looking for changes in intensity and depth information at a pixel?
LV: We are supporting initiatives with LIDAR partners that are fusing events and LiDAR information. Prophesee’s cameras identify the events and the LiDAR grabs 3D information from the ROI.
SR: Are event-based cameras expensive to make?
LV: The sensors used by our cameras are manufactured using conventional processes and rely on standard communication interfaces. Costs for the sensor and the camera system are comparable to conventional cameras.
3. Region of Interest (ROI) Scanning
Aeye (California based LiDAR company) promotes IDAR™ (Intelligent Detection and Ranging) – a 1550 nm wavelength LiDAR using Time of Flight (ToF) techniques to extract depth and intensity information in the scene. Per Aeye’s website, the IDAR™ is “the world’s first solid-state, leading-edge artificial perception system for autonomous vehicles that leverages biomimicry, never misses anything, understands that all objects are not created equal and does everything in real-time”.
Aeye believes in the idea of “saliency” – and argues that the goal of a perception system in an AV is to detect and react to surprises (if it not surprising, it is boring!) The IDAR™ is designed to be agile – by judiciously choosing locations in the scene that it transmits to and receives photons from (rather than spraying them uniformly across the FoV). These decisions are guided by information from the LiDAR itself or other sensors like a high-resolution camera, and intelligence (the “I” in the IDAR™). The goal is to inject photons in a region likely to return salient laser returns (surprises!). Like Prophesee, dynamic regions in the scene are of more interest and likely to create more surprises.
The mechanics of rapidly pointing the LiDAR towards regions of interest is achieved by a combination of intelligence and a fast, high resonant frequency, 2D Micro-Electro-Mechanical Mirror System (MEMs). The advantages of this approach are increased performance (accuracy, latency, resolution, range, speed) in the ROI, and reduced system complexity, cost, and power consumption. The figure below shows an example of a scanning pattern that can be created by AEye’s software-definable LiDAR. It demonstrates precise control of the MEMs and laser, and the ability to dynamically generate high-resolution shot patterns in a specific ROI, ensuring that LiDAR and computing resources are deployed efficiently.
Allan Steinhardt is the Chief Scientist at Aeye. We discussed IDAR™, MEMs, and the driving principles of information relevance or saliency.
SR: Does the IDAR™ have to be sold as a bundled camera-LiDAR system with its software?
AS: We sell bundled or unbundled systems, depending on customer preference. The LiDAR can cue itself independently or can be cued by a camera (if the camera frame rate is fast enough to keep up). Many traffic situations benefit from having a camera cue the LiDAR (for example, approaching car headlights at night).
The software for our system is like what most other LiDARs have – firmware and embedded SW for sensor control. We supply SDK (Software Development Kit) to customers to experiment with the adaptive control of LiDAR scan patterns. We also supply a library of scan patterns that can be used by our customers for different driving environments (the figure above is an example of such a scan pattern).
SR: Would your system work better if the LiDAR were cued by an event-based camera?
AS: This is customer dependent – some want an event-based camera, some do not. The latency for an event camera is lower, although we can essentially make a regular camera work like an event-based camera (for example, by viewing the road horizon and extracting events in software which we can then use to cue the lidar).
SR: How do you handle fast surprises? You need to determine where to focus the LiDAR, position it, fire the laser, analyze returns – is all this possible within ms time frames?
AS: Our LiDAR adapts in real-time on a scan line to scan line basis. We can determine within 40 micro-seconds whether an object is present or not. Based on this, we update the sensor control and modify laser shot timing to generate the required resolution in the ROI.
SR: How is the MEMs scanner able to respond that fast?
AS: We use a 2 mirror MEMs system, scanning 2 orthogonal axes. One of the MEMs operates in a resonant mode (no active position control), the other can MEMs position can be actively controlled through a step scan. A full frame scan spanning a 120° HFoV can be accomplished in 40 microseconds. This is possible based on the ability of our system to use small MEMs (1 mm diameter) effectively, an important part of Aeye’s secret sauce.
SR: But don’t the MEMs suffer from vibration issues and affect system sensitivity because of the small size?
AS: Small MEMs have significant advantages – high resonant frequencies (1 Khz, making the system immune to lower frequency road vibrations), higher scanning speeds and angular ranges, and high reliability of the MEMs flexure structures (in terms of fatigue). Typically, small MEMs translate into lower system optical efficiency. Aeye avoids this through its patented bi-static optical system design that provides SNRs capable of 1 km visibility.
4. Semantic Sensing
Event-based and ROI sensing seem like logical directions to enable sensor “thinning” and make them practical for AV deployment. There are opposing views, however. According to Raul Bravo, President of Outsight (France based LiDAR and 3D sensing software company), relying on dynamically creating higher resolution in the event zone in real-time is problematic – because if you know where the event is, you should already be acting, and if you have to search for the event and then interrogate it, then you are too late to act anyway.
Outsight’s real-time software works with other commercial ToF LiDARs, processing past and current data raw point clouds intelligently at the edge to generate a semantic understanding of the scene. Rather than providing raw point cloud data, Outsight’s software provides classified point data.
The basic premise behind Outsight’s approach is that in simplistic terms, the human brain approaches risk/response in two fundamental ways. The first demands an immediate response – it is intuitive and instinctive (System 1). This function is performed by a portion of the brain called the Amygdala (or the Reptilian Brain). It does not reason, it acts quickly and impulsively. The Neocortex, on the other hand, takes a more logical and deliberate approach to understanding and deciding on an action. It needs as much data as possible (System 2).
Outsight solves the System 1 problem (reacting quickly to short term surprises in the driving environment) by providing an artificial Amygdala, and using this to facilitate the long-term Neocortex like functions. The basis of the semantic information that supports the short term decision making is a SLAM (Simultaneous Location and Mapping) on-chip approach which uses the past and present raw point cloud data to create relevant and actionable point-clouds and object detection. The SLAM information includes relative object locations and velocities in the car’s environment.
My discussions with Raul on Outsight’s approach are summarized below.
SR: Can you give an example of how System 1 functions?
RB: System 1 process data in real-time based on past and current information, delivering only meaningful points to enable more efficient point-wide classification and obstacle detection. Getting rid of 90% of the raw data points of LiDAR that are not relevant to dynamic object detection (points of the road surface, the vegetation, the sky, the static environment) allows you to feed the object identification layer with only relevant information (for example, moving or moveable objects for object tracking or road markings for lane-keeping). This improves the identification and control process (robust, fast, lower bandwidth requirements).
SR: It is not clear how Systems 1 and 2 interact, can you clarify?
RB: System 1 is deterministic and works without any type of machine learning. It functions in parallel with System 2 (which is the “neocortex’ or neural network, focused on longer-term functions like object identification). Outsight’s software ensures the management of critical low-level processing (is there an object or not?), while the higher abstraction layers of machine learning (System 2) focus on reasoning (is this object a dog or a cat?).
SR: Does this work with any LiDAR – ToF, FMCW, etc.?
RB: Yes, our system works with any 3D sensor output, including LiDAR.
SR: But surely the LiDAR wavelength matters? Or whether it is a flash or scanning LIDAR?
RB: Our software is wavelength agnostic. It can improve on certain wavelength related effects. For example, certain wavelengths are less robust to raindrops and can get confused. Outsight’s software filters these out since they are not relevant objects. In terms of scanning, the software corrects for motion distortion effects inherent in scanning LiDARs.
SR: Does Outsight’s software enable better system performance with LiDARs that have lower resolution and range specifications?
RB: Since the processing software integrates information from past events and data, lower resolution LiDAR can indeed be leveraged much better than traditional methods. For example, in the figure below, the raw LiDAR creates very few returns from a low reflectivity obstacle on the road like a tire (less than 5 points per frame, sometimes zero points). By integrating data over time, this number increases dramatically, providing higher SNR (signal to noise ratio).
Prediction: As AVs approach reality, practical deployment constraints (costs, size, heat, durability, decision making speed, hardware and software reliability) will force sensor and perception providers to focus on what is needed for field deployment rather than what is achievable in a lab or development environment. Superfluous requirements and specifications will continue to get eliminated, resulting in “thin” or lean sensing. In many ways, this is a circular game – a pull by AV providers to make sensors leaner and deployable, and a push by sensor providers to make AVs a reality.