How Scanner Achieves Fast Queries
Data lakes excel at storage and long-term retention, but queries are often slow. SQL engines like Athena and Presto use partitioning to reduce the amount of data scanned—organizing files by date or account allows queries to skip irrelevant partitions. Converting raw logs to columnar formats like Parquet can further improve performance by enabling column pruning and better compression.
However, even with well-partitioned Parquet files, certain query patterns remain expensive. Searching deeply nested JSON fields (like requestParameters.bucketName in CloudTrail logs) or performing substring searches across text fields typically requires scanning entire tables. Partitioning helps limit the scope, but within each partition, engines must still parse and filter every record.
Scanner takes a different approach: it builds inverted indexes on top of data lake storage. Rather than optimizing scans, Scanner avoids most scanning entirely by indexing field values at ingestion time.
Query Performance: The Challenge and How Scanner Addresses It
Security investigations require iterative querying. Incident responders typically need to answer questions like: Did anyone modify S3 bucket policies? Which API keys accessed sensitive data? When did this suspicious IP first appear? Each answer informs the next query. When queries take 30+ minutes, investigation velocity drops significantly, limiting how many hypotheses can be tested within critical response timeframes.
Scanner solves this by indexing at ingestion time. During data import, Scanner parses the JSON once and builds an inverted index: a lookup table mapping field values (like "PutBucketPolicy" or "prod-bucket-1") to the files containing them. This index is stored in a separate S3 bucket. When you query, Lambda workers read the index files in parallel, find the files matching your criteria, and scan only those files.
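The ingestion-time step can be illustrated with a minimal Python sketch. This is not Scanner's actual implementation, just the core idea: parse each JSON event once and record, for every (field, value) pair, which files contain it. All names here (`build_inverted_index`, the sample file contents) are illustrative.

```python
import json
from collections import defaultdict

def build_inverted_index(files):
    """Map (field, lowercased value) -> set of files containing that value.

    `files` maps a file name to its raw JSON event lines. Values are
    lowercased to mirror case-insensitive lookups.
    """
    index = defaultdict(set)
    for file_name, lines in files.items():
        for line in lines:
            event = json.loads(line)
            for field, value in event.items():
                index[(field, str(value).lower())].add(file_name)
    return index

files = {
    "file_08": ['{"eventName": "PutBucketPolicy", "eventSource": "s3.amazonaws.com"}'],
    "file_12": ['{"eventName": "GetObject", "eventSource": "s3.amazonaws.com"}'],
}
index = build_inverted_index(files)
print(index[("eventName", "putbucketpolicy")])  # {'file_08'}
```

The expensive JSON parsing happens exactly once, at ingestion; queries then become dictionary lookups instead of full scans.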
How the Inverted Index Works
The inverted index maps every unique value in a field to the files containing it. During query execution, Scanner looks up each search term, gets a list of files, and finds the intersection—files matching all conditions.
Inverted Index Lookup Example
For the query: eventName:(PutBucketPolicy) eventSource:"s3.amazonaws.com" requestParameters.bucketName:"prod-*"
Scanner performs three lookups:

1. S3 bucket policy changes: "putbucketpolicy" → file_08, file_23, file_56
2. S3 API calls: "s3" → file_08, file_12, file_23, file_34, file_41, file_56
3. Production bucket names: "prod" → file_08, file_12, file_23, file_34, file_41

Intersection (files appearing in all three lists): file_08 ✓, file_23 ✓
Those are the only files that matter. Scanner scans 2 files instead of 1000+. This is the architecture: build the index once at ingestion, use it to skip the irrelevant data at query time.
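The lookup-and-intersect step can be sketched in a few lines of Python. The posting lists below mirror the example above; the function name and term keys are illustrative, not Scanner's actual API.

```python
# Posting lists from the example above: one set of files per indexed term.
index = {
    "eventName:putbucketpolicy": {"file_08", "file_23", "file_56"},
    "eventSource:s3": {"file_08", "file_12", "file_23", "file_34", "file_41", "file_56"},
    "bucketName:prod": {"file_08", "file_12", "file_23", "file_34", "file_41"},
}

def files_to_scan(index, terms):
    """Intersect posting lists; only files matching every term get scanned."""
    postings = [index[t] for t in terms]
    return set.intersection(*postings)

print(sorted(files_to_scan(index, list(index))))  # ['file_08', 'file_23']
```

Set intersection is cheap relative to scanning: the work is proportional to the size of the posting lists, not the size of the underlying log data.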
Parallel Query Execution with Lambda
When you submit a query, Scanner spawns Lambda workers (one per index file, up to a maximum) to read the indexes in parallel. Each worker identifies which portions of your log files contain matching data, then passes this information to a fleet of scanning workers that process the data simultaneously. Once all workers finish and results are merged, the Lambda functions terminate immediately. You only pay for the compute time you actually use—there's no persistent infrastructure sitting idle.
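The fan-out/merge pattern can be sketched with a thread pool standing in for the Lambda fleet. This is a simplified model, assuming one worker per index shard and an in-memory shard format; real shards would live in S3.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-shard indexes: term -> set of matching files.
index_shards = [
    {"putbucketpolicy": {"file_08"}},
    {"putbucketpolicy": {"file_23"}},
    {"putbucketpolicy": set()},
]

def lookup(shard, term):
    """One 'worker' reads one index shard and reports its matching files."""
    return shard.get(term, set())

# Fan out one worker per shard, then merge the partial results.
with ThreadPoolExecutor(max_workers=len(index_shards)) as pool:
    partials = pool.map(lookup, index_shards, ["putbucketpolicy"] * len(index_shards))
matching = set().union(*partials)
print(sorted(matching))  # ['file_08', 'file_23']
```

Because each worker touches only its own shard, wall-clock time stays roughly flat as the number of shards grows, which is what makes the pay-per-use, no-idle-infrastructure model work.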
Performance Comparison
Let's compare how the same query executes on traditional data lakes versus Scanner. The query searches 6 months of CloudTrail logs for S3 bucket policy changes on production buckets—a common security investigation task.
Traditional Data Lake (Athena/Presto) - Scans All Files
Here's the SQL query in Athena:
SELECT *
FROM cloudtrail_logs
WHERE eventname IN ('PutBucketPolicy', 'DeleteBucketPolicy', 'PutBucketEncryption',
'DeleteBucketEncryption', 'PutBucketPublicAccessBlock',
'DeleteBucketPublicAccessBlock')
AND eventsource = 's3.amazonaws.com'
AND (
json_extract_scalar(requestparameters, '$.bucketName') LIKE 'prod-%'
)
AND eventtime >= DATE_ADD('month', -6, CURRENT_DATE)
  AND CAST(year AS INTEGER) >= YEAR(DATE_ADD('month', -6, CURRENT_DATE))
  AND (
    CAST(year AS INTEGER) > YEAR(DATE_ADD('month', -6, CURRENT_DATE))
    OR CAST(month AS INTEGER) >= MONTH(DATE_ADD('month', -6, CURRENT_DATE))
  )
);
The query filters on deeply nested JSON (requestParameters.bucketName) and specific event types. Partition pruning on year and month narrows the scan to the 6-month window, but within that window Athena has no way to skip files: it must read every file, parse the JSON in every record, and apply the filters at runtime.
📁 AWSLogs/o-abc123/111122223333/CloudTrail/us-east-1/2024/10/30/
🟢🟢🟢🟢🟢🟢🟢🟢🟢🟢🟢🟢 ...
~2.8 GB
📁 AWSLogs/o-abc123/111122223333/CloudTrail/us-east-1/2024/10/31/
🟢🟢🟢🟢🟢🟢🟢🟢🟢🟢🟢🟢 ...
~2.7 GB
📁 AWSLogs/o-abc123/111122223333/CloudTrail/us-east-1/2024/11/01/
🟢🟢🟢🟢🟢🟢🟢🟢🟢🟢🟢🟢 ...
~2.9 GB
📁 AWSLogs/o-abc123/111122223333/CloudTrail/us-east-1/2024/11/02/
🟢🟢🟢🟢🟢🟢🟢🟢🟢🟢🟢🟢 ...
~3.0 GB
... (175+ more daily folders)
TOTAL
~10-15 TB
Legend: 🟢 = File scanned (ALL files must be read and parsed)
Result: 10-15 TB scanned
Scanner with Inverted Index - Scans Only Relevant Files
Here's the same query in Scanner:
%ingest.source_type:"aws:cloudtrail"
eventSource:"s3.amazonaws.com"
eventName:(PutBucketPolicy DeleteBucketPolicy PutBucketEncryption
DeleteBucketEncryption PutBucketPublicAccessBlock
DeleteBucketPublicAccessBlock)
requestParameters.bucketName:"prod-*"

Time Range: Last 6 months (dynamically calculated from the current date)
For each day in the time range, Scanner reads a small index file that maps field values to log files. This index acts like a card catalog: look up "PutBucketPolicy" and get the file list, look up "s3.amazonaws.com" and get another list, then intersect them to find files matching all conditions. Files with no matching events are never opened.
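The day-level pruning described above can be sketched as follows. The per-day index structure and the `plan` function are illustrative assumptions, not Scanner internals; the point is that days whose intersection comes up empty are never opened at all.

```python
# Hypothetical per-day indexes: day -> {term: set of files containing it}.
daily_indexes = {
    "2024/10/30": {"putbucketpolicy": {"f3"}, "prod": {"f3"}},
    "2024/10/31": {"putbucketpolicy": set(), "prod": {"f7"}},
    "2024/11/01": {"putbucketpolicy": {"f1", "f4"}, "prod": {"f1", "f4", "f5"}},
}

def plan(daily_indexes, terms):
    """Per day, intersect the term posting lists; drop days with no matches."""
    scan_plan = {}
    for day, idx in daily_indexes.items():
        files = set.intersection(*(idx.get(t, set()) for t in terms))
        if files:  # an empty intersection means the day is skipped entirely
            scan_plan[day] = files
    return scan_plan

result = plan(daily_indexes, ["putbucketpolicy", "prod"])
for day in result:
    print(day, sorted(result[day]))
# 2024/10/30 ['f3']
# 2024/11/01 ['f1', 'f4']
```

Here 2024/10/31 is pruned without reading a single log file, which is why most daily folders in the listing below contribute 0 GB to the scan.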
📁 tenant_abc/index_cloudtrail/2024/10/30/
⚪⚪🟢⚪⚪⚪⚪⚪⚪⚪⚪⚪ ...
0.31 GB
📁 tenant_abc/index_cloudtrail/2024/10/31/
⚪⚪⚪⚪⚪🟢⚪⚪⚪⚪⚪⚪ ...
0.28 GB
📁 tenant_abc/index_cloudtrail/2024/11/01/
🟢⚪⚪🟢🟢⚪⚪⚪⚪⚪⚪⚪ ...
0.89 GB
📁 tenant_abc/index_cloudtrail/2024/11/02/
⚪⚪⚪🟢🟢⚪⚪⚪⚪⚪⚪⚪ ...
0.67 GB
... (170+ more daily folders with mostly 0 GB)
TOTAL
~2-5 GB
Legend: 🟢 = File scanned | ⚪ = File skipped (index says no matches here)
Result: 2-5 GB scanned
Performance Impact
Athena scans everything: 10-15 TB of raw logs, parses JSON for every record, filters at the end. Scanner scans only what matters: the index identifies which 2-5 GB contain matching events.
Data Scanned: 10-15 TB → 2-5 GB (~3,000-5,000x less)
Query Time: 30 min → 1-3 sec (600-1,800x faster)
Cost per Query: $75-100 → $0.01-0.10 (750-10,000x cheaper)
A query that costs $100 and takes 30 minutes with traditional tools costs $0.05 and takes 2 seconds with Scanner.
Why Scanner's query is simpler:
No JSON parsing: Fields are indexed at ingestion, not parsed at query time
No manual partitioning: The index finds matching files automatically
Access nested fields directly: Use dot notation like requestParameters.bucketName without extraction functions
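One plausible way to support dot notation is to flatten nested JSON into dotted keys at ingestion, so nested fields index and query exactly like top-level ones. This is a minimal sketch of that idea, not Scanner's actual flattening logic.

```python
def flatten(event, prefix=""):
    """Flatten nested JSON into dot-notation keys for indexing."""
    flat = {}
    for key, value in event.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, f"{path}."))
        else:
            flat[path] = value
    return flat

event = {
    "eventName": "PutBucketPolicy",
    "requestParameters": {"bucketName": "prod-bucket-1"},
}
print(flatten(event))
# {'eventName': 'PutBucketPolicy', 'requestParameters.bucketName': 'prod-bucket-1'}
```

Once flattened, requestParameters.bucketName is just another indexed field, so no json_extract_scalar-style extraction is needed at query time.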
The Trade-off: Storage for Speed
Building indexes requires additional storage (see storage ratio in the main architecture docs). This is a deliberate trade-off: some extra disk space in exchange for query speeds that make data lakes actually usable for investigations. Without fast queries, iterative investigation workflows are impractical—waiting 30+ minutes per query makes it infeasible to pursue multiple lines of inquiry during an incident.
Speed Determines Investigation Outcomes
During a security incident, investigation is inherently iterative. An API key accessing sensitive S3 buckets from an unknown IP raises questions: When did this start? What else has this key accessed? Are other keys compromised? Who generated this key? Each answer leads to several more queries.
With traditional tools:
Query 1 (find first use of the key): 45 minutes
Query 2 (check other buckets): 38 minutes
Query 3 (find related access patterns): 52 minutes
Total after 3 queries: 2 hours 15 minutes
After three queries and over two hours, the investigation has barely begun. Meanwhile, the time window for containment continues to shrink.
With Scanner:
Query 1: 8 seconds
Query 2: 5 seconds
Query 3: 12 seconds
Queries 4-20: Another 3 minutes combined
Total after 20 queries: 4 minutes
Same investigation, same data. Twenty pivots instead of three. The key has been traced back to a compromised CI/CD pipeline, every affected resource has been identified, lateral movement has been ruled out, and systems have been isolated.
The difference isn't just speed. It's the ability to ask every relevant question without rationing queries. Fast queries change what's possible during an investigation—more pivots, deeper analysis, faster containment.