Exploring the VirusTotal Dataset | An Analyst's Guide to Effective Threat Research

VirusTotal stores a vast collection of files, URLs, domains, and IPs submitted by users worldwide. It features a variety of functionalities and integrates third-party detection engines and tools to analyze the maliciousness of submitted artifacts and gather relevant related information, such as file properties, domain registrars, and execution behaviors.

The VirusTotal dataset, the backbone of the platform, structures artifact-related information into objects and represents relevant relationships between them, providing contextual links between various artifacts. This makes VirusTotal a valuable resource for threat research, enabling users to perform activities such as clustering artifacts related to specific threat actors or campaigns, tracking malicious activities, and analyzing trends in the threat landscape.

In this post, part of a collaborative effort between VirusTotal and SentinelLabs, we explore how to effectively use VirusTotal’s wide range of querying capabilities, highlight scenarios in which these capabilities return informative results, and discuss factors that may impact the completeness or relevance of the data.

The content is aimed at VirusTotal users seeking to better understand the fundamental inner workings of the platform and how to effectively use it as part of their investigations. This contribution complements the comprehensive VirusTotal documentation by discussing certain aspects in greater detail along with a summary of relevant context and usage information, and demonstrating how VirusTotal capabilities are applied in real-world cases.

Overview

The VirusTotal platform analyzes files and network-related artifacts (URLs, domains, and IPs) submitted to the platform to detect maliciousness. The platform aggregates results from third-party detection engines, web scanners, and other tools to provide thorough analysis overviews.

VirusTotal stores submitted artifacts as well as information related to each artifact in a dataset, which we refer to as the VirusTotal dataset. The artifact-related information is extensive and diverse, including, for example, file properties such as filename, file type, digital signatures, and hashes, as well as URL components such as domains, URL paths, and URL query parameters.

VirusTotal provides interfaces for users to interact with the platform and search filters for querying it. The search filters allow for retrieving and pivoting through artifact-related information. Additionally, they enable clustering multiple artifacts and identifying newly submitted ones based on user-defined queries aimed at capturing overlaps in content or related information. This has many use cases in threat intelligence, such as identifying trends in the threat landscape, commonalities between different threats, and tracking specific threat groups or campaigns.

Below, we first provide an overview of the VirusTotal dataset structure and the different interfaces available for interacting with it, with a focus on querying the dataset using search filters. We then delve into search modifiers, a specific type of VirusTotal search filter, highlighting modifiers for querying data generated by artificial intelligence (AI) technology. Next, we discuss factors that may impact the relevance or completeness of results when querying using search modifiers, including an example of using search modifiers in an actual threat research investigation.

The VirusTotal Dataset

The VirusTotal dataset is the backbone of the VirusTotal platform. Current records indicate that it stores a vast amount of submitted artifacts and related information, including over 50 billion files, 6 billion URLs, and 4 billion domains. The data is stored in a structured and hierarchical manner.

Artifact-related information is structured into objects, which have an ID, a type, and attributes. Optionally, objects may also have one or multiple relationships.

The object ID uniquely identifies an object. An object can be directly related to a submitted artifact: a file, URL, domain, or IP address. In this case, the object’s ID is derived from the artifact itself — the SHA-256 hash for files, the IP address or domain itself for IPs or domains, and the SHA-256 hash or the Base-64 encoded form for URLs.

An object type indicates the type of information stored by an object. For example, an object of type file stores file-specific information about submitted files, such as filenames, file hashes, and the file extension, whereas an object of type domain stores domain-specific information about submitted domains, such as the domain’s registrar. The objects of type file, url, ip, or domain are directly related to submitted artifacts, while the rest, such as threat_actor or reference, are not.

An attribute is a data item that stores information related to an object and can be of a primitive or a complex data type, such as an array or a structure. For example, the file object (an object of type file) includes the string attribute sha256. This attribute stores the SHA-256 hash of a submitted file. Additionally, it may contain the attribute lnk_info, specifically present for Windows Shortcut (LNK) files. lnk_info is a structure containing information specific to LNK files and extracted from the submitted file, such as the date at which the shortcut had been created.

Relationships signify connections between objects of the same or different types, making them particularly useful for describing scenarios involving multiple artifacts. For instance, the malicious file config.lnk file (SHA-256 hash 85b317bb4463a93ecc4d25af872401984d61e9ddcee4c275ea1f1d9875b5fa61) communicates with the IP address 149.51.230[.]198 to download a payload.

In the context of VirusTotal, this relationship is represented through the communicating_files relationship of the ip_address object, which is directly related to 149.51.230[.]198. communicating_files stores information about all files, in the form of file objects, observed to communicate with the IP address during sandbox execution. VirusTotal executes submitted executable files in sandboxes to capture behaviors and artifacts visible only while the file is executing, such as started processes, network communications, changes to the file system, or strings present in process memory.

Top-level collections group all objects of the same type. The top-level collections that VirusTotal currently implements are files (a set of all objects of type file), urls (a set of all objects of type url), ips (a set of all objects of type ip), domains (a set of all objects of type domain), collections (a set of all objects of type collection), threat actors (a set of all objects of type threat_actor), and references (a set of all objects of type reference).

The object of type collection is not to be confused with top-level collections. This object groups multiple objects of the same or different types given a user-specified context, such as a threat actor, a malicious campaign, or a malware family.

The top-level collections enable operations that relate to the set of all objects of a given type. Such an operation is submitting a new file for analysis, which adds an object of type file to the files collection.

Querying VirusTotal

VirusTotal exposes two interfaces for interacting with its dataset: the platform’s graphical user interface (GUI) for manual interaction and the application program interface (API) for programmatic interaction.

The VirusTotal GUI

The VirusTotal GUI is the web interface of the platform. To query VirusTotal using the GUI, users enter a search query into the search field. A search query is composed of one or multiple search filters.

A search filter can be a value uniquely identifying a submitted artifact – a URL, domain, file hash (MD5, SHA-1, or SHA-256), or an IP address. This filter cannot be combined with other filters and is used for retrieving information related to only a single submitted artifact.

When a user inputs a value, VirusTotal retrieves the corresponding artifact and related information from its dataset and displays this data to the user as a web analysis report. To retrieve the artifact and its related information, VirusTotal searches its dataset for an object that has an ID or an attribute (for MD5 or SHA-1 hashes) that matches the user-provided value. The platform also explores relationships to and from this object.

Web analysis report (search query: 85b317bb4463a93ecc4d25af872401984d61e9ddcee4c275ea1f1d9875b5fa61 — Web analysis report (search query: *85b317bb4463a93ecc4d25af872401984d61e9ddcee4c275ea1f1d9875b5fa61*)

Querying using value uniquely identifying a submitted artifact

A search filter can also be a search modifier in the format modifier:value, where value may be a predefined or a user-specified search criterion.

Each modifier is mapped to one or more of the top-level collections files, urls, ips, domains, and collections, forming sets of file-, URL-, IP-, domain-, and collection-specific search modifiers. Further information on these modifiers can be found in the official VirusTotal documentation.

Multiple modifiers can be combined into more complex search queries using the logical operators AND, OR, and NOT. Parentheses can be used to group modifiers and logical operators, allowing for more precise queries by controlling the order of operations. All modifiers within a search query must be mapped to a single top-level collection.

A special modifier is entity, which defines the top-level collection to which the search query is applied. For example, entity:file applies the search query to the files collection, and entity:url applies the query to the urls collection. If a user does not explicitly specify the entity modifier, the platform defaults to entity:file.

Structured overview of commonly used file-specific search modifiers (entity:file) — Structured overview of commonly used file-specific search modifiers (*entity:file*)

For each modifier, VirusTotal searches for and retrieves objects within the collection that the modifier is mapped to. These objects have an ID or attribute, or a relationship to another object with an ID or attribute, that meets the search criterion. For example, entity:domain AND domain:test instructs VirusTotal to retrieve all objects of type domain whose IDs (the domains themselves) contain the string test. An exception is the content modifier, which instructs the platform to search through the content of submitted files.

In the case of a complex search query combining multiple modifiers using AND, OR, and/or NOT, VirusTotal combines the retrieved objects for each modifier into a resulting set that meets the combined criteria.

The platform then displays an overview of the resulting set of objects to the user in the form of a list of web analysis results. When a user clicks on an item in the list, the platform generates a web analysis report based on the corresponding object’s ID, as described previously.

List of web analysis results (search query: entity:domain AND domain:test) — List of web analysis results (search query: *entity:domain AND domain:test*)

In addition to scoping searches to a specific top-level collection, VirusTotal also uses the entity modifier to disambiguate between search modifiers mapped to more than one collection, such as fs (first submission date). For example, entity:file AND fs:2024-07-15 instructs VirusTotal to search the files collection for file objects where the first_submission_date attribute is set to July 15, 2024. In contrast, entity:url AND fs:2024-07-15 directs VirusTotal to search the urls collection for url objects where first_submission_date is set to the same date.

The VirusTotal API

A basic way to query VirusTotal using the API is to issue HTTP GET requests to API endpoints exposed by the platform and specify search filters as part of the request URLs. VirusTotal implements multiple endpoints, such as the following:

/api/v3/intelligence/search?query={query}: This endpoint allows querying VirusTotal in the same manner as the GUI, using a value that uniquely identifies a submitted artifact (a URL, domain, IP address, or file hash) or search modifiers. Example request URLs are
```
https://www.virustotal.com/api/v3/intelligence/search?query=test.com
https://www.virustotal.com/api/v3/intelligence/search?query=entity:domain+and+domain:test
```
/api/v3/files/{hash}: This endpoint retrieves the file object whose md5, sha1, or sha256 attribute matches the user-provided value. An example request URL is
```
https://www.virustotal.com/api/v3/files/e6adf40a959308ea9de69699c58d2f25
```

Querying using the /api/v3/files/{hash} API endpoint — Querying using the */api/v3/files/{hash}* API endpoint

VirusTotal returns JSON-formatted data in response to API requests. Users can parse this data and use it for additional actions, such as further querying and pivoting through the VirusTotal dataset.

Web requests can be issued using various methods, such as HTTP client libraries, command-line tools, or custom scripts. vt-py, the official Python library for the VirusTotal API, simplifies the process of sending web requests to endpoints and handling the responses, enabling users to perform various tasks programmatically.

The API vs. The GUI

There are several key differences between the VirusTotal GUI and API, particularly regarding scalability and the scope of available information.

The programmatic use of the API enables users to conduct large-scale querying of VirusTotal, which is not achievable through manual use of the GUI. For example, retrieving the names of the processes started by all Windows Shortcut files submitted to VirusTotal over 2024 is a task that is practically feasible only using the API.

Further, not all data stored in the VirusTotal dataset can be used as part of search queries using the GUI, which limits its querying capacity. For example, there are no search modifiers allowing users to query for URLs constructed in process memory during sandbox execution that submitted files have not contacted, such as secondary C2 URLs, which are contacted if communication with the primary C2 server fails. Although such search modifiers can be useful during investigations, the VirusTotal GUI displays these URLs in the Memory pattern URLs section of web analysis reports without providing a method to query for them directly.

The API endpoint /api/v3/files/{hash}/behaviours retrieves sandbox-generated data for files specified by their hash value. URLs discovered in process memory are stored in the memory_pattern_urls field returned by the endpoint. For files that meet other search criteria, users can programmatically extract the URLs and keep the file in consideration if a URL aligns with a specific search requirement.

memory_pattern_urls values — *memory_pattern_urls* values

The API may provide more information than what is visible to users in the GUI. For example, there can be discrepancies in sandbox-generated data provided to users through the GUI and the API. For example, the /api/v3/files/{hash}/behaviours endpoint retrieves all data generated by the sandbox CAPE for the file with the user-specified hash, including details on the suspicious behavior rules triggered by the file during execution. This information is not provided to users in the GUI.

CAPE suspicious behavior rules retrieved by api/v3/files/{hash}/behaviours — CAPE suspicious behavior rules retrieved by *api/v3/files/{hash}/behaviours*

AI Search Modifiers

VirusTotal leverages artificial intelligence (AI) to generate natural language summaries of the functionalities of code in executable files submitted to the platform, such as scripts, Microsoft Office documents, or binary files. This feature is particularly beneficial for malware analysis, assisting analysts in understanding the capabilities of malware under investigation.

VirusTotal integrates AI engines into its pipeline for analyzing submitted files. These engines use large language models (LLMs) trained on programming languages, which enable them to analyze and translate code into natural language summaries of its functionalities. Some AI engines also generate verdicts, which are labels categorizing analyzed code as benign, suspicious, or malicious. VirusTotal supports two types of engines: Code Insight and Crowdsourced AI.

Code Insight is VirusTotal’s in-house AI engine implementation, based on Google’s Gemini. Crowdsourced AI is a collection of third-party AI engines contributed by the community and is continuously enriched with new additions.

Depending on their training and design, the AI engines specialize in analyzing specific file types. For example, the ByteDefend AI engine is designed to analyze macro code in Microsoft Office files, including Word, Excel, and PowerPoint documents.

The Code Insight engine focuses on script files, such as PowerShell, Python, and Ruby scripts. It excludes from analysis any script files that exceed a set file size or similarity threshold (these values are currently undocumented). The VirusTotal platform compares each script’s code with scripts previously analyzed by the AI engine and calculates a similarity value. This value is then evaluated against the similarity threshold.

While we were writing this post, Google announced that Code Insight will also support Windows Portable Executable (PE) binary files. This feature is enabled by three interconnected phases:

Unpacking binaries using the malware analysis service Mandiant Backscatter: This step reveals the underlying code of potentially obfuscated (packed) malicious binaries submitted to VirusTotal, which is the intended subject of analysis.
Decompilation of the unpacked binaries using Hex-Rays IDA Pro decompilers: This step translates the assembly code of the unpacked binaries into decompiled code written in a higher-level programming language (pseudocode). Gemini analyzes decompiled code with greater efficiency compared to assembly code due to its conciseness.
Analysis of the decompiled code with Gemini: This step generates a natural language summary of the decompiled code’s functionalities.

The summaries generated by Code Insight and Crowdsourced AI engines for a given file are stored in the analysis fields of the crowdsourced_ai_results attribute of the corresponding object of type file. The optional verdict fields of this attribute store the verdicts, while the source fields indicate the AI engines that have analyzed the code within the file.

The crowdsourced_ai_results attribute — The *crowdsourced_ai_results* attribute

Users can query VirusTotal for specific verdicts or content in AI-generated summaries using search modifiers designed for that purpose. These modifiers are mapped to the files top-level collection.

For each AI search modifier, VirusTotal searches the platform’s dataset for file objects whose analysis and/or verdict fields of the crowdsourced_ai_results attribute meet the user-specified criterion. In addition, based on the source field, each AI search modifier focuses the search on summaries or verdicts generated by either Code Insight, specific Crowdsourced AI engines, or all available AI engines.

Search modifier	Usage and search scope
codeinsight	codeinsight:[text] Searches for text in summaries generated by Code Insight.
crowdsourced_ai_analysis	crowdsourced_ai_analysis:[text] Searches for text in summaries generated by Code Insight and all Crowdsourced AI engines.
crowdsourced_ai_verdict	crowdsourced_ai_verdict:[benign\|suspicious\|malicious] Searches for benign, suspicious or malicious verdicts generated by Code Insight and all Crowdsourced AI engines.
[ENGINE]_ai_analysis	[ENGINE]_ai_analysis:[content] Searches for text in summaries generated by a single Crowdsourced AI engine. [ENGINE] is the identifier for a specific engine, such as hispasec (hispasec_ai_analysis).
[ENGINE]_ai_verdict	[ENGINE]_ai_verdict:[benign\|suspicious\|malicious] Searches for benign, suspicious or malicious verdicts generated by a single Crowdsourced AI engine.

VirusTotal introduces new engine-specific search modifiers ([ENGINE]_ai_analysis and [ENGINE]_ai_verdict) as new engines are incorporated into Crowdsourced AI. For example, with the addition of the ByteDefend engine, the platform released two new search modifiers: bytedefend_ai_analysis and bytedefend_ai_verdict.

The AI search modifiers can be combined with other AI search modifiers or with any other modifiers supported by VirusTotal using the logical operators AND, OR, and NOT. For example, the search query crowdsourced_ai_analysis:"inject" AND crowdsourced_ai_analysis:"explorer.exe" can be used to identify files that perform injection involving the explorer.exe process. The results returned from VirusTotal include the PowerShell script da.ps1, which injects code from an external file into this process. This functionality of the script is documented in the summary generated by the Code Insight AI engine.

da.ps1 injects code into explorer.exe — *da.ps1* injects code into *explorer.exe*

Code Insight analysis of da.ps1 — Code Insight analysis of *da.ps1*

Another example is the search query crowdsourced_ai_analysis:"Shell.Run" AND behavior_created_processes:"powershell.exe". This query can be used to identify files that invoke the Run function of the Windows Script Host Shell object to execute the PowerShell process powershell.exe for conducting further activities. The results returned from VirusTotal include the Visual Basic script 297641663, which executes a PowerShell command using the Run function to download a payload from a remote server.

297641663 executes powershell.exe — *297641663* executes *powershell.exe*

Code Insight analysis of 297641663 — Code Insight analysis of *297641663*

Although the AI engines integrated into VirusTotal provide valuable insights, they should be used as tools to assist in malware analysis efforts, as part of a broader analysis strategy. AI engines are designed and trained to analyze code based on historical data, and therefore may not always accurately interpret novel techniques or highly obfuscated code in malware implementations. As a result, the summaries they generate may sometimes lack sufficient or useful information for analysts.

Clustering With Search Modifiers

The extensive number of VirusTotal search modifiers enables analysts to query the platform in a practical and precise way. This allows for retrieving submitted artifacts and related information that are relevant to specific threats under investigation. However, false positives (where retrieved data is not related to the investigated threat) and false negatives (where relevant data is missing) can impact the relevance and completeness of search results.

The way in which queries are formulated is important for addressing or alleviating the impact of these challenges. Combining search modifiers using the logical operators AND, OR, and NOT and refining search queries helps reduce the likelihood of false positives and false negatives. This is an iterative process where analysts may integrate information obtained from multiple sources into their query formulations.

For example, malware analysis may provide characteristics suspected to be unique to the investigated activity cluster, such as specific file names, hashes, registry keys, network indicators, code signatures, strings or functions used by the malware, or distinct patterns of behavior. Additionally, information from previous reports documenting activities potentially related to the current investigation can also be beneficial. Upon reviewing the accuracy and completeness of query results, analysts may adjust the queries to further improve their relevance and precision. To illustrate these concepts, we provide an example from an actual threat research investigation.

Clustering Scenario

In 2023, SentinelLabs conducted an investigation into suspected China-nexus actors targeting Southeast Asian gambling companies. The investigation led to the AdventureQuest.exe Windows PE executable, which had been submitted to VirusTotal on May 11, 2023. Analysis of the file revealed it to be a malware loader implemented using the .NET framework, deploying further executables on compromised systems. These executables download archive files from attacker-controlled servers. The archives contain sideloading capabilities, including malicious DLLs sideloaded by legitimate executables to deploy the Cobalt Strike backdoor.

AdventureQuest.exe is signed with a certificate issued to the Ivacy VPN vendor PMG PTE LTD (certificate serial number: 0E3E037C57A5447295669A3DB1A28B8A). It is probable that the PMG PTE LTD signing key has been stolen, a tactic often used by suspected Chinese threat actors to sign their malware. Based on overlaps in code and functionalities with malware observed in Operation ChattyGoblin, AdventureQuest.exe is likely part of the same activity cluster.

The certificate’s serial number provides a starting point for identifying any other malware loaders submitted to VirusTotal that are signed with the same certificate and share implementation characteristics with AdventureQuest.exe, suggesting a potential link to the same threat actor or campaign.

VirusTotal uses the Sigcheck tool to extract digital signature information from submitted Windows PE files, including the serial numbers of code signing certificates. After extraction, this information is stored in the signature_info attribute of the corresponding file objects. Users can query VirusTotal for specific signature information using the signature search modifier.

The signature_info attribute (in the file object for AdventureQuest.exe) — The *signature_info* attribute (in the *file* object for *AdventureQuest.exe*)

The signature_info attribute (in the web analysis report for AdventureQuest.exe) — The *signature_info* attribute (in the web analysis report for *AdventureQuest.exe*)

The query signature: "0E3E037C57A5447295669A3DB1A28B8A" searches for submitted files that have the serial number 0E3E037C57A5447295669A3DB1A28B8A in their signature information. The query returns 94 results, including both Windows PE executables like AdventureQuest.exe and other file types, such as Windows DLLs.

VirusTotal attempts to determine the type of each submitted file using third-party tools that search for magic numbers (byte sequences) and other types of signatures that identify specific file types, such as 0x4D 0x5A for Windows PE executables. These tools include the Unix utility file, Detect-it-Easy, and AI engines. The platform then inserts keywords indicating the file type in the type_tags attribute of the corresponding file object. Users can query VirusTotal for specific keywords stored in type_tags using the type search modifier.

The type_tags attribute (in the file object for AdventureQuest.exe) — The *type_tags* attribute (in the *file* object for *AdventureQuest.exe*)

The type_tags attribute (in the web analysis report for AdventureQuest.exe)

Building on the previous search, the query signature: "0E3E037C57A5447295669A3DB1A28B8A" AND type:"peexe" narrows the results to submitted Windows PE executables. The query returns 31 results, some of which, like AdventureQuest.exe, are implemented using the .NET framework, while others are not.

The file and Detect-it-Easy tools may provide VirusTotal with information about the environments in which submitted executables are built. The platform stores the output from these tools in the magic and detectiteasy attributes of the corresponding file objects. Users can query VirusTotal for specific content in these attributes using the magic and detectiteasy search modifiers.

The magic attribute (in the file object for AdventureQuest.exe) — The *magic* attribute (in the *file* object for *AdventureQuest.exe*)

The magic attribute (in the web analysis report for AdventureQuest.exe)

The detectiteasy attribute (in the file object for AdventureQuest.exe) — The *detectiteasy* attribute (in the *file* object for *AdventureQuest.exe*)

The detectiteasy attribute (in the web analysis report for AdventureQuest.exe)

Building on the previous search, the query signature: "0E3E037C57A5447295669A3DB1A28B8A" AND type:"peexe" AND magic:".NET" further narrows the results to submitted executables built using the .NET framework. The query returns 13 results.

Closer examination of the resulting files shows that most have PDB paths, such as Ivacy.pdb. However, AdventureQuest.exe does not have a PDB path, which is typical for malware, as malware authors often strip executables of debug information. This suggests that the files with PDB paths may not be associated with the investigated activity.

VirusTotal extracts information from the header of each submitted Windows PE executable and stores this information in the pe_info attribute of the corresponding file object. Users can query VirusTotal for specific content in this attribute using the metadata search modifier.

The pe_info attribute (in the file object for AdventureQuest.exe) — The *pe_info* attribute (in the *file* object for *AdventureQuest.exe*)

Further, the resulting files with PDB paths had been residing at file paths in the Ivacy VPN installation directory, such as C:\Program Files (x86)\Ivacy\IvacyService.exe. For each submitted file, VirusTotal records the names under which the file has been submitted, which may be full file paths rather than just filenames. The platform stores this information in the names attribute of the corresponding file object. Users can query VirusTotal for specific content in this attribute using the name search modifier.

The names attribute (in the file object for AdventureQuest.exe) — The *names* attribute (in the *file* object for *AdventureQuest.exe*)

The names attribute (in the web analysis report for AdventureQuest.exe) — The *names* attribute (in the web analysis report for *AdventureQuest.exe*)

Based on our insights and previous research on Operation ChattyGoblin, we know that the threat actors do not disguise their malware as Ivacy VPN components. This suggests that the files that had been located in the Ivacy VPN installation directory before submission to VirusTotal may be false positives. An analysis of some of these files using a .NET decompiler revealed that they are indeed legitimate Ivacy VPN components.

Building on the previous search, the query signature:"0E3E037C57A5447295669A3DB1A28B8A" AND tag:"peexe" AND magic:".NET" AND (NOT metadata:".pdb") AND (NOT name:"Program Files (x86)\\ivacy") further narrows the results to submitted executables that do not have PDB paths and had not been located in the Ivacy VPN installation directory before submission. The query returns one result, AdventureQuest.exe.

This suggests that VirusTotal does not host other malware loaders, which are signed with the same certificate as AdventureQuest.exe and are likely linked to the investigated threat cluster. However, the extensive number of VirusTotal search modifiers allows for the identification of such loaders based on characteristics beyond the used code signing certificate. For example, querying VirusTotal for a code segment specific to AdventureQuest.exe using the content modifier leads to further malware that is likely part of the same activity cluster. We leave this as an exercise for the reader.

Clustering With Search Modifiers | Limitations

Certain aspects of how VirusTotal collects information on submitted artifacts, which users can query using search modifiers, may increase the likelihood of missing relevant findings in some search scenarios. This is particularly relevant given the third-party tools and functionalities that VirusTotal uses for collecting this information, such as sandboxes and detection engines. Each of these tools has specific limitations, which affect the quality and quantity of information VirusTotal collects and stores in its dataset. In this section, we highlight some of these limitations to help users understand how they impact querying VirusTotal with search modifiers.

As mentioned earlier, VirusTotal executes submitted executable files (executables and scripts) in sandboxes to capture behaviors and artifacts visible only during execution. Additionally, most of the sandboxes VirusTotal integrates can identify MITRE ATT&CK techniques exhibited during execution. This is accomplished through a set of rules that map observed behaviors to MITRE ATT&CK techniques.

For each submitted file, the sandboxes generate a report documenting captured activities, which are accessible to VirusTotal users. To facilitate systematic searching of sandbox-generated data, VirusTotal stores this data in an object of type file_behaviour. This object has a relationship to the file object that is directly related to the submitted file. Users can query sandbox-generated data using a variety of search modifiers, such as behavior_created_processes (searches for a name of a created process), behavior_files (searches for a name or path of an opened, written, deleted, or dropped file), or attack_technique (searches for a MITRE ATT&CK technique ID).

Sandbox-generated data in a file_behaviour object — Sandbox-generated data in a *file_behaviour* object

VirusTotal’s sandboxes may not always capture relevant behaviors of executable files, for example, due to execution conditions that must be met or techniques intentionally implemented by malware authors to evade sandbox analysis. This includes command-line parameters, library or platform dependencies, or external configuration or data files. In contrast to private submissions, VirusTotal automatically executes the vast volume of executable files continuously submitted to the platform’s public corpus in sandboxes, without customizing their execution or execution environments. As a result, search queries with modifiers applied to sandbox-generated data, like behavior_files or attack_technique, may return incomplete results.

For example, the BlackCat ransomware requires operators to provide an execution password as a command-line parameter (referred to as an ‘access token’) for the malware to initiate encryption. For the BlackCat sample veros3.exe, the CAPE sandbox has not captured any file system activities, such as deleting, creating, or modifying files. When the correct access token is provided, this sample enumerates the files and folders on the filesystem of a compromised system and encrypts files as specified in embedded configuration data.

The CAPE sandbox report for veros3.exe — The CAPE sandbox report for *veros3.exe*

In addition to running executable files in sandboxes to capture their behaviors, VirusTotal is capable of extracting malware configurations from these files. To achieve this, the platform uses the Mandiant Backscatter malware analysis service, which implements configuration extraction modules to automatically extract configurations based on known implementation patterns. Users can search through extracted configuration data using the malware_config search modifier. However, the automated extraction might not work when the analyzed malware uses new or changed methods for storing its configuration data, which are not covered by the existing modules. As a result, search queries involving malware_config may return incomplete results.

It is also important to note that some information related to submitted artifacts may change over time. For example, the last_analysis_stats attribute of the file object stores the number of third-party detection engines that have labeled the corresponding submitted file as malicious. Users can use the positives search modifier to narrow searches based on whether this number is less than, greater than, or equal to a user-specified value. For example, positives:20+ narrows searches to files that have been labeled as malicious by more than 20 engines.

last_analysis_stats attribute — *last_analysis_stats* attribute

Setting positives to a relatively high number is a way to focus searches on files that are likely to be malicious. However, the returned results may not include malicious files for which an insufficient number of third-party detection engines have developed detections. The development of detections is fully in control of the engines’ vendors and may depend on a variety of factors.

A common factor prompting vendors to develop a detection for a specific malware implementation is the public release of a threat research report listing files that implement the malware. For example, on September 21, 2023, SentinelLabs released a report on the Sandman APT group, identifying the UpdateCheck.dll file as malware used by this group. Prior to this date, on March 15, 22, and 29, 2023, the number of engines detecting the file as malware was 5, 6, and 7, respectively. Shortly after the release of the report, this number spiked to 17 and reached 53 by September 29, 2023.

Number of engines detecting the UpdateCheck.dll malware — Number of engines detecting the *UpdateCheck.dll* malware

Conclusions

Effectively using VirusTotal for threat research requires a good understanding of the platform’s wide range of querying capabilities, the scenarios in which these capabilities return informative results beneficial to investigations, and the factors that may impact the completeness or relevance of the data returned.

While the GUI provides an agile and user-friendly way to query VirusTotal, the API enables large-scale querying, offers expanded querying capabilities, and allows for retrieving more extensive information. Additionally, the AI engines that VirusTotal integrates can significantly speed up malware analysis efforts; however, their outputs should be considered as part of a broader analysis strategy as they may lack sufficient or useful information due to limitations in design or training data. Moreover, the extensive set of search modifiers provides flexible search capabilities, but the relevance and completeness of results can be impacted by false positives and false negatives.

SentinelLabs and VirusTotal are committed to sharing information and insights that help new users gain a solid understanding of the platform’s capabilities, enabling them to make full use of the available VirusTotal features and conduct thorough investigations.

Exploring the VirusTotal Dataset | An Analyst’s Guide to Effective Threat Research

Overview

The VirusTotal Dataset