How do antiviruses scan for thousands of malware signatures in a short time?

Antivirus detection is a feature-extraction and classification problem.

A great analogy is the game of 20 questions, where the goal is to identify an arbitrary object by asking 20 seemingly unrelated yes/no questions. The idea behind the game is that each answer eliminates half of the remaining objects, so 20 binary features can theoretically distinguish 2^20 (1,048,576) objects.
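To make the arithmetic concrete, here is a minimal Python sketch (the answers themselves are invented) that packs 20 yes/no answers into a single integer indexing one of the 2^20 possible buckets:

    # Each yes/no answer contributes one bit; 20 bits address 2**20 buckets.
    answers = [True, False, True] + [False] * 17   # 20 hypothetical binary features

    bucket = 0
    for i, answer in enumerate(answers):
        if answer:
            bucket |= 1 << i

    print(bucket)      # an index in [0, 2**20)
    print(2 ** 20)     # 1048576 distinguishable objects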

A different analogy is how the visual cortex processes visual information. The brain has very simple and fast hardware for detecting and classifying a practically unlimited variety of images. Only six layers of neurons (the number of neurons is estimated at 140 million) are used to extract progressively more complex features and pass them on to the next layer. The layers feed information back and forth to each other to produce abstract notions that can be checked against memory.

Antivirus engines store many features of known malware in the definition file, and when they scan a new file they optimize the extraction and classification (matching) of those features. Storing features also makes detection more robust, so that small changes in a piece of malware won't thwart it. Feature extraction is also done in parallel so that resources are fully utilized.

Most features are designed by humans, but some do not make sense by themselves, like having a null byte at the end of the file or the ratio between file size and printable text size. Those nonsensical or unintuitive features are randomly generated and then tested by data mining vast quantities of files. In the end the file is described, and classified, by the combination of its features. As a side note, the best predictor of a question being closed on Stack Exchange is whether its first letter is lower case.
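The two oddball features just mentioned are cheap to compute, which is part of their appeal. A minimal Python sketch, with the feature vector invented for illustration:

    def ends_with_null(data: bytes) -> bool:
        # Meaningless on its own, yet useful in combination with other features.
        return data.endswith(b"\x00")

    def printable_ratio(data: bytes) -> float:
        # Share of printable ASCII bytes relative to file size.
        printable = sum(32 <= b < 127 for b in data)
        return printable / max(len(data), 1)

    def feature_vector(data: bytes):
        # The file is described by the combination of features, not any single one.
        return (ends_with_null(data), printable_ratio(data), len(data))

    print(feature_vector(b"hello world\x00"))   # (True, 0.9166..., 12)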

So when a new file is scanned, it is quickly classified into finer and finer categories and then matched against a small set of signatures. Each step excludes a large number of clean files and dictates which features should be extracted next. The first steps cost very little in computing resources, but they determine which more expensive steps are taken later.
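As a rough sketch of that idea (the stages and the signature are invented, not any vendor's actual pipeline), the engine can be modeled as a chain of increasingly expensive checks with early exits:

    import os

    def stage_size(path):
        # Very cheap: a single stat() call.
        return os.path.getsize(path) > 0

    def stage_magic(path):
        # Cheap: a few bytes decide whether this even looks like an executable.
        with open(path, "rb") as f:
            return f.read(2) == b"MZ"              # PE header, as an example rule

    def stage_signatures(path):
        # Expensive: only files that survived the cheap stages get here.
        with open(path, "rb") as f:
            data = f.read()
        return any(sig in data for sig in [b"EVIL_PAYLOAD"])   # invented signature

    STAGES = [stage_size, stage_magic, stage_signatures]

    def scan(path):
        for stage in STAGES:
            if not stage(path):
                return "clean"                     # early exit: most files stop here
        return "suspicious"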

By using only a few disk reads and CPU cycles, the engine can determine the file type. Let's say it is a JAR file. Using this information, it starts collecting features of the JAR file. If it's signed, the scan is aborted. If it isn't importing any functions, the scan is aborted (I'm oversimplifying here). Is it using any tricky functionality? Then more features are extracted. Does it use known vulnerable functions? Then it should be thoroughly checked against known Java exploit signatures.
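That first determination is typically a magic-number check: one small read of the file header. A JAR, for instance, is recognizable as a ZIP container from its first bytes. A minimal sketch, with a deliberately abbreviated table:

    MAGIC = {
        b"MZ":         "Windows executable (PE)",
        b"\x7fELF":    "Linux executable (ELF)",
        b"PK\x03\x04": "ZIP container (JAR, APK, DOCX, ...)",
        b"%PDF":       "PDF document",
    }

    def sniff(path):
        with open(path, "rb") as f:
            head = f.read(8)                       # a single small disk read
        for magic, name in MAGIC.items():
            if head.startswith(magic):
                return name
        return "unknown"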

On-access scanning works on the same principle, but it also acts like a gatekeeper: each action (usually an API call) taken by a process is checked and either allowed or denied. Likewise, each suspicious action triggers more filters and more checks. During the checks the process or thread waits for the operation to complete, and sometimes the whole process is actively suspended. This might look like significant overhead, but once a specific action is verified, the verdict is cached, so later the action completes very quickly or is skipped altogether. The result is performance degradation comparable to running a machine a couple of percentage points slower. Published PCMark scores for 20 AV products illustrate this.
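The caching step can be sketched as a verdict table keyed by something that changes whenever the file does; here, path, size, and modification time stand in for whatever real engines actually key on:

    import os

    _verdicts = {}    # (path, size, mtime) -> "allowed" / "blocked"

    def on_access(path, deep_scan):
        st = os.stat(path)
        key = (path, st.st_size, st.st_mtime)      # invalidated when the file changes
        if key not in _verdicts:
            _verdicts[key] = deep_scan(path)       # first access pays the full cost
        return _verdicts[key]                      # later accesses are a dict lookup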

So the speed optimization comes from very little work being performed on clean-looking files, which constitute the overwhelming majority of scanned files. The heavy lifting is done only on suspicious, malware-looking files, for which the AV might take whole seconds to emulate the process or even send it to the cloud for analysis.

The magic is in the progressive classification.


Malware signatures are unique values that indicate the presence of malicious code. Simply speaking, when an anti-virus program scans your computer, it calculates a signature for each file (say, a hash), then compares that signature/hash to a list of known-bad signatures.

Calculating a single hash of a file and then comparing it against a list of millions of hashes is far easier than searching that file for each individual malware signature.
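A minimal sketch of that comparison (the blocklist entry is a placeholder, not a real malware hash): hashing the file costs one pass over its bytes, while membership testing in a set stays effectively constant-time even with millions of entries:

    import hashlib

    # Hypothetical blocklist; real definition files hold millions of entries.
    BAD_SHA256 = {"0" * 64}                        # placeholder hash

    def is_known_bad(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)                    # hash the file once, in chunks...
        return h.hexdigest() in BAD_SHA256         # ...then a single O(1) set lookup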

The signature could represent a series of bytes in the file. It could also be a cryptographic hash of the file or of its sections. Each AV vendor does it a little differently.
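A byte-series signature reduces to a substring search. A toy single-pattern version (the pattern is invented); real engines match thousands of patterns in one pass using multi-pattern algorithms such as Aho-Corasick:

    SIGNATURE = bytes.fromhex("deadbeef90909090")  # invented byte pattern

    def contains_signature(path):
        with open(path, "rb") as f:
            return SIGNATURE in f.read()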

There are usually settings for the 'performance vs depth' trade-off of antivirus scans. These settings essentially control how much of a file's code is processed before the scanner concludes whether the file is safe.

It should be noted that anti-virus techniques have improved, and more recent technologies aren't just signature-based. The combination of these techniques, together with optimizations that reduce performance impact, is what makes AVs considerably faster than they used to be.

Guaranteed, if you run your AV at its most detailed inspection level against your primary OS drive, you'll notice a performance hit as the files are processed. It's not usually a loss of CPU performance but rather a loss of disk performance (obviously SSDs have an advantage here).

Some time-saving techniques used by AVs:

  • Ignoring non-executable files
  • Ignoring large files (e.g. >500 MB)
  • Ignoring files with a checksum matching a known 'legitimate' file
  • Reading certain parts of files and ignoring the rest
  • Gracefully yielding resources when it detects a user present at the computer
  • Looking for specific system calls that represent risky behaviour
  • Generating a baseline at first and then only scanning new/modified files (see the sketch after this list)
  • etc..
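As an example, the baseline technique from the list above can be sketched as recording a hash per file and rescanning only files whose hash has changed (file name and storage format are invented for illustration):

    import hashlib, json, os

    BASELINE = "baseline.json"                     # invented storage location

    def file_hash(path):
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    def files_to_scan(root):
        known = json.load(open(BASELINE)) if os.path.exists(BASELINE) else {}
        changed = []
        for dirpath, _, names in os.walk(root):
            for name in names:
                path = os.path.join(dirpath, name)
                digest = file_hash(path)
                if known.get(path) != digest:      # new or modified since last run
                    changed.append(path)
                    known[path] = digest
        with open(BASELINE, "w") as f:
            json.dump(known, f)
        return changed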

Additionally, common AV solutions use the following techniques for static detection:

  • String Scanning method: Searches for a sequence of bytes (a string) that is typical of a specific virus but unlikely to be found in other programs.
  • Wildcards method: Allows the scanner to skip bytes or byte ranges. For example, "?" characters are skipped, and the wildcard % means the scanner will try to match the next byte (see the sketch after this list).
  • Mismatches method: Allows any given number of bytes in a string to take arbitrary values, regardless of their position.
  • Generic Detection method: This technique uses one common string to detect several or all known variants of a family of viruses.
  • Bookmarks method: calculates the distance between the start of the virus body and the detection string.
  • Smart Scanning: Skips junk instructions, such as NOPs, in the host file and does not store them in the virus signature. To improve the likelihood of detecting related variants of a virus, an area of the virus body is selected that has no references to data or other subroutines.
  • Skeleton Detection: The scanner parses the statements of the virus line by line and drops all nonessential statements. What is left is the skeleton of the body, containing only the essential macro code common to a macro virus.
  • Heuristics Analysis: An expert-based analysis that determines a system's susceptibility to a particular threat/risk using various decision rules or weighing methods. Multi-criteria analysis (MCA) is one such weighing method.
  • Virus-specific detection: There are cases when the standard algorithm of the virus scanner cannot deal with a virus. In such cases, new detection code must be introduced to implement a virus-specific detection algorithm. This method includes Filtering, Decryptor Detection and X-Ray scanning.

Source: Computer Virus Strategies and Detection Methods

Hope that helps!

Further Reading: Wikipedia