Smart Video Surveillance Might Be Listening

Article By: Youval Nachum

In security applications, you want to be able to ignore normal behaviors and be alerted to anomalies, so there is increasing use of Artificial Intelligence.

There’s a lot of enthusiasm and demand for smart video surveillance, driven by heightened concerns over acts of terror, by the need to protect us from more traditional threats in our homes, offices, and city areas, and by the desire to reduce pilfering and larger-scale theft or damage in stores, warehouses, and other areas. The smartness has to be local (in-camera) because it is impractical to upload video feeds from all these cameras for constant review. In security applications, you want to be able to ignore normal behaviors and be alerted to anomalies, so there is increasing use of Artificial Intelligence in these systems. The market is expected to grow from $22.B in 2016 to over $55B in 2023, a 13.6% CAGR (ABI Research).

Surveillance
Source: CEVA

But there are a couple of fundamental limitations in vision-only surveillance. The first is field-of-view (FOV) coverage, which is obviously limited for a fixed camera. The second is that anomaly detection is limited to visual cues. You might hear a gunshot, but you probably won’t see it. Sound, by contrast, is a cue you can pick up from any direction, provided the source is in range.

Solutions to the Fundamental Limitations of Vision-Only Based Surveillance

The obvious solution is to combine video and pan-tilt-zoom control with smart audio detection. The audio side should support multiple microphones with beam-forming for direction-of-arrival detection; this is already quite common in smart speakers. An AI stage is then trained to detect anomalous noises such as a gunshot, screaming, or a breaking window. Multiple mics provide 360-degree coverage, and analysis of the source provides a direction in which to point the camera.
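To make the direction-of-arrival idea concrete, here is a minimal Python sketch, not a description of any particular product. It assumes just two microphones a known distance apart, a far-field source, and plain cross-correlation in place of a full beam-former. The function names, the 343 m/s speed of sound, and the angle convention (0 degrees straight ahead, positive toward the left mic) are all illustrative assumptions.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value in air (assumption)


def estimate_delay_samples(left, right, max_lag):
    """Return the lag (in samples) at which `right` best matches `left`.

    A positive lag means the sound reached the right mic later.
    Lags are tried in order of increasing magnitude, so silence yields 0.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in sorted(range(-max_lag, max_lag + 1), key=abs):
        score = 0.0
        for i, sample in enumerate(left):
            j = i + lag
            if 0 <= j < len(right):
                score += sample * right[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def direction_of_arrival_deg(left, right, sample_rate, mic_spacing_m):
    """Convert the inter-mic delay into an angle off the array's broadside."""
    max_delay_s = mic_spacing_m / SPEED_OF_SOUND
    max_lag = int(math.ceil(max_delay_s * sample_rate))
    lag = estimate_delay_samples(left, right, max_lag)
    # Clamp so rounding in the lag estimate cannot push asin out of range.
    ratio = max(-1.0, min(1.0, (lag / sample_rate) / max_delay_s))
    return math.degrees(math.asin(ratio))
```

With mics 10 cm apart sampled at 16 kHz, a two-sample delay maps to roughly 25 degrees off-axis. A real array would use more mics, and typically finer delay estimation, to resolve the full 360 degrees and steer the pan-tilt-zoom head.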

Another benefit of this approach is that the camera could stay in standby, burning very little power, until audio trigger detection (which intrinsically needs much less power than an active video camera) wakes it up to analyze a scene. Teaming audio and video detection could therefore be very effective in remote locations where battery-powered operation may be essential.
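The wake-on-audio behavior can be sketched as a small state machine. This is only an illustration under stated assumptions: a simple loudness threshold stands in for the trained audio detector, and the class name, the threshold, and the hold time are hypothetical choices rather than anything specified above.

```python
import math


def rms(frame):
    """Root-mean-square level of one audio frame (a list of samples)."""
    return math.sqrt(sum(s * s for s in frame) / len(frame)) if frame else 0.0


class AudioWakeController:
    """Keeps the camera asleep until an audio frame crosses a threshold.

    The camera then stays awake for `hold_frames` further frames after the
    last trigger, so a brief pause in the noise does not put it back to sleep.
    """

    def __init__(self, threshold, hold_frames):
        self.threshold = threshold
        self.hold_frames = hold_frames
        self._frames_left = 0

    @property
    def camera_awake(self):
        return self._frames_left > 0

    def process(self, frame):
        """Feed one audio frame; return True while the camera should be on."""
        if rms(frame) >= self.threshold:
            self._frames_left = self.hold_frames  # (re)start the hold timer
        elif self._frames_left > 0:
            self._frames_left -= 1
        return self.camera_awake
```

In a real device the threshold test would be replaced by the anomaly detector's decision, while the hold timer keeps the power-hungry video path on only for as long as the scene needs analyzing.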

Opportunities extend beyond using audio as a trigger to guide the camera and then letting vision-based ML take over. Even when we’ve figured out what we want to look at, we humans continue to integrate what we hear with what we see to draw conclusions. If you can only watch two people having an obviously animated conversation, you don’t know if they’re simply debating last night’s football game or if they’re in a disagreement that might lead to a fight. You have to not only watch them but also hear what they are saying. This doesn’t have to reach the level of natural language processing; checking volume, pitch, and keywords may be enough.
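As a rough illustration of the cheaper cues mentioned here, the sketch below measures volume as an RMS level and estimates pitch with a basic autocorrelation search. It is a simplified example, not a production speech front end: the function names and the 80 to 400 Hz search range are assumptions, and keyword spotting is left out because it needs a trained model.

```python
import math


def volume_rms(frame):
    """Loudness of a frame as its root-mean-square amplitude."""
    return math.sqrt(sum(s * s for s in frame) / len(frame)) if frame else 0.0


def estimate_pitch_hz(frame, sample_rate, min_hz=80.0, max_hz=400.0):
    """Estimate the fundamental frequency by picking the autocorrelation peak.

    Returns None when no lag in the search range has positive correlation
    (for example, on silence).
    """
    min_lag = int(sample_rate / max_hz)
    max_lag = min(int(sample_rate / min_hz), len(frame) - 1)
    best_lag, best_score = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else None
```

Tracking how these two numbers change over a conversation (rising volume, rising pitch) gives a downstream classifier a hint of escalation without having to understand the words themselves.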

AI as Part of the Solution: The First Step

The goal in this and other cases is not to guarantee detection of anomalous behavior but to filter down to likely anomalies that should be sent upstream for review and/or recording. AI training for this class of detection would naturally benefit from integration, using test cases that combine audio and video streams. This does not feel like a huge step; vision-based and audio-based AI have each evolved quite significantly but mostly independently. Combining the two should be a natural next step.
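One simple way to picture the filtering step is late fusion: score audio and video separately, combine the scores, and forward only events above a threshold. The sketch below is a hypothetical illustration. The `Event` fields, the equal weighting, and the 0.6 threshold are assumptions, and a jointly trained audio-video model, as suggested above, would replace the weighted sum.

```python
from dataclasses import dataclass


@dataclass
class Event:
    timestamp_s: float
    audio_score: float  # anomaly likelihood from the audio model, 0..1
    video_score: float  # anomaly likelihood from the vision model, 0..1


def fused_score(event, audio_weight=0.5):
    """Weighted average of the two per-modality anomaly scores."""
    return audio_weight * event.audio_score + (1 - audio_weight) * event.video_score


def select_for_review(events, threshold=0.6, audio_weight=0.5):
    """Keep only the events likely enough to be worth sending upstream."""
    return [e for e in events if fused_score(e, audio_weight) >= threshold]
```

Lowering the threshold sends more footage upstream and misses fewer incidents; raising it saves bandwidth and reviewer time. That trade-off is what the filtering goal above comes down to.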

All of this, of course, depends on being able to add smart audio to your smart video camera. You probably already know how you want to manage your camera but may be a little less familiar with the audio side. In a typical solution, you’ll position multiple mics (which can be very small) around your device. Those feed into beam-forming and active noise management hardware/software, followed by an AI stage for trigger word detection, for voice biometric and voice command detection if those are important for your application, and of course for anomalous event detection. Products of this type are already available. It can only be a matter of time before combining smart audio with smart video becomes widespread.
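For readers less familiar with the audio side, the beam-forming stage can be illustrated with the classic delay-and-sum technique: shift each mic's signal so that sound from the chosen direction lines up, then average. This is a minimal sketch assuming whole-sample, non-negative delays that are already known. The function name is hypothetical, and commercial front ends add noise management and far more refined filtering.

```python
def delay_and_sum(channels, delays):
    """Align each channel by its delay (in samples) and average them.

    `channels` is a list of equal-rate sample lists, one per microphone.
    `delays[k]` is how many samples later the target sound arrives at mic k.
    Sound from the target direction adds up coherently; sound from other
    directions is misaligned and partially cancels.
    """
    if len(channels) != len(delays):
        raise ValueError("need exactly one delay per channel")
    if any(d < 0 for d in delays):
        raise ValueError("delays must be non-negative")
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[i + d] for ch, d in zip(channels, delays)) / len(channels)
        for i in range(length)
    ]
```

The output is a single cleaned-up channel "pointed" at one direction, which is what the AI stage for trigger words and anomalous events would then listen to.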

– Youval Nachum, Senior Product Marketing Manager, CEVA

Join the Conversation

  1. Logan Bell says:

    Thanks for this article. It’s now apparent that video surveillance is at such a level that video analytics and advanced AI are needed to ensure that it is being used to its full potential by organisations.