How are anti-viruses so fast?

Disclosure: I work for an anti-virus vendor.

Because most anti-virus engines were born protecting endpoints, and endpoint protection is still a major part of the business for many of them, modern anti-virus engines are optimized for scanning endpoints. This optimization includes many things, such as:

  • Not scanning files which couldn't contain infections capable of infecting your computer;

  • Remembering which files were scanned, and not scanning them again unless they have been modified (a minimal sketch of such a cache follows this list);

  • Optimizing the scan for the file type where possible - for example, when scanning an executable, only certain parts of the file need to be scanned. This minimizes disk reads and improves performance.
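To make the caching point concrete, here is a minimal sketch of such a scan cache - my own illustration, not any vendor's actual code. It records the size and modification time of files found clean and skips them on later scans; real engines hook the kernel rather than polling timestamps, and scan_file here is just a hypothetical callback:

```python
# Minimal sketch (not any vendor's actual code): skip files that were already
# scanned and have not changed since, using size + mtime as a cheap
# "has this been modified?" check.
import os

class ScanCache:
    def __init__(self):
        # path -> (size, mtime) recorded when the file was last found clean
        self._clean = {}

    def needs_scan(self, path: str) -> bool:
        st = os.stat(path)
        return self._clean.get(path) != (st.st_size, st.st_mtime_ns)

    def mark_clean(self, path: str) -> None:
        st = os.stat(path)
        self._clean[path] = (st.st_size, st.st_mtime_ns)

def scan_tree(root: str, cache: ScanCache, scan_file) -> None:
    """Scan only files the cache does not already know to be clean."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            # scan_file is a hypothetical callback returning True if the file is clean
            if cache.needs_scan(path) and scan_file(path):
                cache.mark_clean(path)
```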

A common misconception is that AV engines use hashes. They generally do not, for three reasons:

  1. First, getting around hash detection is very easy and doesn't require modifying the actual malicious code at all;
  2. Second, hashes don't allow you to implement any kind of proactive protection - you will only be able to detect malware which you have already seen;
  3. Third, calculating a hash requires the whole file to be read, while for some files (such as executables) this is not necessary. Reading a whole file on a non-SSD hard drive is an expensive operation - most AV engines can scan a large clean executable faster than they could hash it (see the sketch after this list).
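As a rough illustration of point 3 - my own sketch, not how any particular engine parses executables, and the offsets below are placeholders rather than a real PE parser - hashing has to read every byte, while a targeted scan can get away with reading a few small regions such as the headers and the area around the entry point:

```python
import hashlib

def hash_whole_file(path: str) -> str:
    """Hashing forces a read of every single byte of the file."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def read_interesting_regions(path: str, regions=((0, 4096),)) -> bytes:
    """Read only a few (offset, length) regions - kilobytes instead of the whole file.
    The default region is a placeholder, not real executable-format knowledge."""
    data = []
    with open(path, "rb") as f:
        for offset, length in regions:
            f.seek(offset)
            data.append(f.read(length))
    return b"".join(data)
```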

From the question: "The common anti-virus (to my knowledge) uses a kind of brute-force method where it gets the hash of the file and compares it to thousands of known virus hashes. Is it just that they have servers with super-fast SSDs that they upload the hashes to and search really fast, or am I completely wrong in my understanding of how they work?"

I think you're considering those online "antiviruses" which only do hashing, and are quite poor in comparison to the (much more efficient and advanced) heuristic- and rule-based antiviruses SmokeDispenser mentioned.

Hashing is a good choice for the one use case where an identical virus file is sent to a lot of people, and the AV engine you use is updated so often that the hash will be received and stored before you encounter the virus. In that one case, for example, you can immediately flag an email attachment as virus/phishing/ransomware/trojan etc., and you can do so with minimal computational expense. If you have to read the whole file anyway and you have a good connection (both of which hold if, for example, you are a mail server), then the additional I/O expense is negligible.
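In code, that check amounts to little more than this toy sketch (the set of known-bad hashes would be populated by the vendor's frequent updates; it is empty here):

```python
import hashlib

# Filled from the vendor's signature updates in a real product.
KNOWN_BAD_SHA256: set[str] = set()

def attachment_is_known_bad(data: bytes) -> bool:
    """Hash the attachment once and look the digest up in the known-bad set."""
    return hashlib.sha256(data).hexdigest() in KNOWN_BAD_SHA256
```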

Since this is a minority of all use cases, (good) antiviruses don't (just) use hashes.

However, checking for a hash is really very, very fast. You don't need SSDs or any sophisticated hardware at all.

Imagine (even if it does not work exactly that way) that you have a hexadecimal hash such as 'baadc0debaadbaad' and want to verify whether it is present or not in an archive holding one hundred billion of them.

When you created the archive, you created two hundred fifty-six directories on a hard disk and named them from '00' to 'ff' (0 to 255 in decimal). In every one of them, you created 256 directories with names again from '00' to 'ff'. And in each of those you put 256 directories again, and in each third-level directory of the form AB/CD/EF you placed all the hashes that began with 'abcdef', divided into 256 files from "00.txt" to "ff.txt" according to the fourth byte.

So the hash 'baadc0debaadbaad' ends up in a file "C:/ba/ad/c0/de.txt" which reads:

...
baadc0de81f872ac Ebola virus
baadc0debaadbaad Dark Avenger virus
baadc0debf31fe11 known clean file
...

How many hashes are in that file? If hashes are evenly distributed, just divide one hundred billion by the number of files. There are 256 folders with 256 subfolders with 256 subfolders with 256 files, so the number of files is 256*256*256*256 = 4294967296.

One hundred billion divided by that number is around twenty-three. That's how many hashes (on average) are expected to be in that one file.

So if you were given a hash, and asked to find it among those hundred billion hashes by hand, you would just need to click on the appropriate first-level folder, click on the second, click on the third, open the appropriate file and read about twenty-three hashes. You'd be done in probably less than thirty seconds, having "apparently" examined over three billion hashes per second. By hand. By yourself. It seems impossible; it's just good data organisation.

A computer can do the same thing even more efficiently.
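Here is the same trick in code - a sketch only, assuming the directory layout described above with lines of the form "<hash> <name>" in each bucket file. A real engine would use an in-memory hash table or a database index rather than directories, but the cost profile is the same idea:

```python
import os

def lookup_hash(root: str, hex_hash: str):
    """Walk the bucket hierarchy: the first four bytes of the hash pick one tiny
    file, and only the ~two dozen entries inside it are actually compared."""
    d1, d2, d3, fname = hex_hash[0:2], hex_hash[2:4], hex_hash[4:6], hex_hash[6:8]
    bucket = os.path.join(root, d1, d2, d3, fname + ".txt")
    if not os.path.exists(bucket):
        return None                          # empty bucket: hash is unknown
    with open(bucket) as f:
        for line in f:
            stored_hash, _, name = line.strip().partition(" ")
            if stored_hash == hex_hash:
                return name                  # e.g. "Dark Avenger virus"
    return None

# lookup_hash("C:/", "baadc0debaadbaad") -> "Dark Avenger virus" in the example above
```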

Rule-based scan

A more complicated case is how to scan for a known and analysed virus inside a file of which you know (initially) nothing.

In this case, what comes to our help is the fact that the operating system must execute the virus, so the virus can't be just anywhere - it must be where the operating system will actively look for it. Typically, a virus will infect an executable file by hiding somewhere (e.g. in a data area) and replacing the beginning of its executable area with a call to itself. This means that in most cases you need only scan the executable area and ensure it's clean (of course, viruses pop up all the time with clever ways of hiding in unlikely places, so reality is a bit more complicated than what I just said).

The antivirus looks in the same places, and in this case performs a rule-based analysis on the file. This is similar to the hash search: you check the first byte you encounter in the sensitive area, and based on its value you look at another byte somewhere else. The algorithm doing this kind of operation is sometimes called an automaton. For example, to see whether a text string is either "Hello, World!" or "Attack at dawn", you would check the first letter; if it were an 'H', you would skip all the checks for "Attack at dawn".
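A toy version of such an automaton (my own sketch, far simpler than a real signature engine) is a character-by-character decision tree over the known strings; after the first character, whole branches of signatures are never looked at again:

```python
SIGNATURES = ["Hello, World!", "Attack at dawn"]

def build_trie(patterns):
    """Build a decision tree (trie): each node maps the next character to a subtree."""
    root = {}
    for p in patterns:
        node = root
        for ch in p:
            node = node.setdefault(ch, {})
        node["$match"] = p          # marks the end of a complete pattern
    return root

def first_match(trie, text: str):
    """Try every starting offset; a branch dies as soon as a character disagrees."""
    for start in range(len(text)):
        node = trie
        for ch in text[start:]:
            if ch not in node:
                break               # e.g. an 'H' rules out "Attack at dawn" instantly
            node = node[ch]
            if "$match" in node:
                return node["$match"]
    return None

# first_match(build_trie(SIGNATURES), "say: Attack at dawn!") -> "Attack at dawn"
```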

With programs, this is made much more complicated by polymorphism and self-encryption - the fact that two different pieces of code (say, 2+2 and 2*2) might yield the same result, so you need more checks ("If the first word is Hello, Howdy, or Good...") in order to establish that you've found a variation of the dreaded virus Salutations, Cosmos!. For self-encrypting code you would probably either need to find an algorithm to break the encryption (if that is feasible in very little time) or concentrate on recognizing the self-decryption routine, without which the virus would not be able to make itself executable by the system.

However, with a careful selection of the choice tree, you will be able to quickly converge either to a single possibility (the virus name) or to the confidence that no known virus is present.

Heuristics

This is the most complicated (and slowest) kind of analysis, but also the most powerful, because it allows you to strongly suspect the presence of a hitherto unknown virus. You can think of heuristics as lots of small rule-based recognizers; being small grants a speed advantage that offsets their being many. Every recognizer identifies a piece of code performing an action, such as decrypting code, attempting to gain privileges on the system, opening and writing to files, hooking system calls, rewriting code in memory, infiltrating the operating system and so on. Many legitimate programs do one or more of these things, or appear to do so (for example, decryption and decompression look very similar to one another, and some programs are encrypted to protect intellectual property, not to hide malicious code). For each operation you establish a "score", and executable code above a certain threshold may be regarded as suspicious.
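A very stripped-down sketch of the scoring idea (the behaviour names, weights and threshold are invented for illustration, not taken from any real product):

```python
# Each small recognizer reports a weight when it thinks it has seen a behaviour;
# the sum is compared to a threshold. All numbers below are made up.
SUSPICION_WEIGHTS = {
    "self_decrypting_code": 40,
    "requests_privilege_escalation": 30,
    "hooks_system_calls": 35,
    "rewrites_code_in_memory": 45,
    "writes_to_other_executables": 50,
}
THRESHOLD = 100

def heuristic_verdict(observed_behaviours: set[str]) -> str:
    score = sum(SUSPICION_WEIGHTS.get(b, 0) for b in observed_behaviours)
    return "suspicious" if score >= THRESHOLD else "probably clean"

# A packer that only decrypts itself scores 40 -> "probably clean";
# add code rewriting and syscall hooking and it crosses the threshold.
```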

The exact details of the heuristic engines are of course much more complicated than this; they need to balance speed, reliability (when they say it's suspicious, it has to be suspicious) and sensitivity (when it is suspicious, they need to be able to say it is suspicious).

Actually, they're complicated enough that they're usually closely guarded secrets of antivirus companies, and often touted as 'Intelligent' all the way to 'Artificial Intelligences'.

The need to continuously update the rules (it makes little sense to keep looking for a virus that went extinct ten years ago) and, less urgently but no less importantly, to tune the heuristics, is the reason why an essential measure of an antivirus's quality is the frequency of its updates (even if, up to a point, an outstanding heuristic engine might allow less frequent updates, and an impossibly perfect heuristic engine would need none at all).

Optimizations

If you are an antivirus, and you are able to load into the operating system kernel deeply enough to intercept any disk-write operation, and you know for sure that there is no way of booting the operating system without activating your code (i.e. you can't be blue-pilled), then it is true that no file on disk can be modified without your knowledge. At that point, you can keep track of which files have been scanned and found clean, and only scan files that were modified since the last scan, which is a really small fraction of the total. You might even ignore disk writes performed by non-infected processes, as long as you allow for the fact that there are ways in which a virus written in e.g. JavaScript could ask another process such as WSCRIPT to write to the disk on its behalf.
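Grossly simplified, the bookkeeping looks like this (a sketch only: the write-interception hook is in reality a kernel filesystem filter driver, and scan_file is a hypothetical callback doing the real scan):

```python
class IncrementalScanner:
    """Rescan only the paths that a (hypothetical) write-interception hook reported."""

    def __init__(self, scan_file):
        self.scan_file = scan_file       # hypothetical callback doing the real scan
        self.dirty: set[str] = set()

    def on_write(self, path: str) -> None:
        """Called by the write-interception hook for every disk write."""
        self.dirty.add(path)

    def rescan_modified(self) -> None:
        """Only the small fraction of files that changed gets scanned again."""
        for path in sorted(self.dirty):
            self.scan_file(path)
        self.dirty.clear()
```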


While anti-virus software might upload samples for analysis (see below), this is not how detection works.

Also, signature-based systems are getting less common (as they can be easily evaded). They also do not work the way you might expect them to.

Signatures are built over a specific region or parts of a binary. If the location is not known, the signature window is slid across the binary. So this is more complex, and generates far more intermediate comparisons than you might think.
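In its crudest form, "sliding" a signature just means checking it at every offset - a sketch; real engines use much smarter string-search algorithms and anchor signatures to structures such as the entry point wherever they can:

```python
def sliding_match(data: bytes, signature: bytes) -> int:
    """Check a byte signature at every possible offset in the scanned data."""
    for offset in range(len(data) - len(signature) + 1):
        if data[offset:offset + len(signature)] == signature:
            return offset          # found: report where the signature sits
    return -1                      # signature not present
```

In practice even data.find(signature) does the same job faster; the point is that every byte of the file may have to be visited for every unanchored signature.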

Nonetheless, searching for those signatures is not slow, but not exactly fast either; there is often a noticeable delay when an antivirus is in use.

Still, doing so does not require enormous amounts of disk space or RAM (though it needs some) - there is a lot more going on when a 4K video is decoded for display.

Besides signatures, there are also heuristics and behavior analysis: the binary in question gets sandboxed and its behavior is analyzed. If the actions taken are suspicious, such binaries might be blocked - those are the times the samples may get sent to the AV vendor for further analysis.

To conclude:

It’s not that fast. But the operations are not so complicated that they would need servers and couldn’t be done by a workstation on its own.

Such a server infrastructure would have to be huge, by the way. If the work were too hard for a workstation, a strong server could maybe serve a few customers, leaving a vendor with perhaps 300 million installed copies having to pay for and operate several million servers.