How Scanner Achieves Fast Queries
Data lakes excel at storage and long-term retention, but queries are often slow. SQL engines like Athena and Presto use partitioning to reduce the amount of data scanned—organizing files by date or account allows queries to skip irrelevant partitions. Converting raw logs to columnar formats like Parquet can further improve performance by enabling column pruning and better compression.
However, even with well-partitioned Parquet files, certain query patterns remain expensive. Searching deeply nested JSON fields (like requestParameters.bucketName in CloudTrail logs) or performing substring searches across text fields typically requires scanning entire tables. Partitioning helps limit the scope, but within each partition, engines must still parse and filter every record.
Scanner takes a different approach: it builds inverted indexes on top of data lake storage. Rather than optimizing scans, Scanner avoids most scanning entirely by indexing field values at ingestion time.
Query Performance: The Challenge and How Scanner Addresses It
Security investigations require iterative querying. Incident responders typically need to answer questions like: Did anyone modify S3 bucket policies? Which API keys accessed sensitive data? When did this suspicious IP first appear? Each answer informs the next query. When queries take 30+ minutes, investigation velocity drops significantly, limiting how many hypotheses can be tested within critical response timeframes.
Scanner solves this by indexing at ingestion time. During data import, Scanner parses the JSON once and builds an inverted index: a lookup table mapping field values (like "PutBucketPolicy" or "prod-bucket-1") to the files containing them. This index is stored in a separate S3 bucket. When you query, Lambda workers read the index files in parallel, find the files matching your criteria, and scan only those files.
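The ingestion-time step can be illustrated with a minimal Python sketch. This is not Scanner's actual implementation, just the core idea: parse each JSON event once and record, for every (field, value) pair, which files contain it. All names here (`build_inverted_index`, the sample file contents) are illustrative.

```python
import json
from collections import defaultdict

def build_inverted_index(files):
    """Map (field, lowercased value) -> set of files containing that value.

    `files` maps a file name to its raw JSON event lines. Values are
    lowercased to mirror case-insensitive lookups.
    """
    index = defaultdict(set)
    for file_name, lines in files.items():
        for line in lines:
            event = json.loads(line)
            for field, value in event.items():
                index[(field, str(value).lower())].add(file_name)
    return index

files = {
    "file_08": ['{"eventName": "PutBucketPolicy", "eventSource": "s3.amazonaws.com"}'],
    "file_12": ['{"eventName": "GetObject", "eventSource": "s3.amazonaws.com"}'],
}
index = build_inverted_index(files)
print(index[("eventName", "putbucketpolicy")])  # {'file_08'}
```

The expensive JSON parsing happens exactly once, at ingestion; queries then become dictionary lookups instead of full scans.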
How the Inverted Index Works
The inverted index maps every unique value in a field to the files containing it. During query execution, Scanner looks up each search term, gets a list of files, and finds the intersection—files matching all conditions.
Inverted Index Lookup Example
For the query: eventName:(PutBucketPolicy) eventSource:"s3.amazonaws.com" requestParameters.bucketName:"prod-*"
Scanner performs three lookups:

1. S3 bucket policy changes: "putbucketpolicy" → file_08, file_23, file_56
2. S3 API calls: "s3" → file_08, file_12, file_23, file_34, file_41, file_56
3. Production bucket names: "prod" → file_08, file_12, file_23, file_34, file_41

Intersection (files appearing in all three lists): file_08 ✓, file_23 ✓
Those are the only files that matter. Scanner scans 2 files instead of 1000+. This is the architecture: build the index once at ingestion, use it to skip the irrelevant data at query time.
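The lookup-and-intersect step can be sketched in a few lines of Python. The posting lists below mirror the example above; the function name and term keys are illustrative, not Scanner's actual API.

```python
# Posting lists from the example above: one set of files per indexed term.
index = {
    "eventName:putbucketpolicy": {"file_08", "file_23", "file_56"},
    "eventSource:s3": {"file_08", "file_12", "file_23", "file_34", "file_41", "file_56"},
    "bucketName:prod": {"file_08", "file_12", "file_23", "file_34", "file_41"},
}

def files_to_scan(index, terms):
    """Intersect posting lists; only files matching every term get scanned."""
    postings = [index[t] for t in terms]
    return set.intersection(*postings)

print(sorted(files_to_scan(index, list(index))))  # ['file_08', 'file_23']
```

Set intersection is cheap relative to scanning: the work is proportional to the size of the posting lists, not the size of the underlying log data.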
Parallel Query Execution with Lambda
When you submit a query, Scanner spawns Lambda workers (one per index file, up to a maximum) to read the indexes in parallel. Each worker identifies which portions of your log files contain matching data, then passes this information to a fleet of scanning workers that process the data simultaneously. Once all workers finish and results are merged, the Lambda functions terminate immediately. You only pay for the compute time you actually use—there's no persistent infrastructure sitting idle.
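The fan-out/merge pattern can be sketched with a thread pool standing in for the Lambda fleet. This is a simplified model, assuming one worker per index shard and an in-memory shard format; real shards would live in S3.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-shard indexes: term -> set of matching files.
index_shards = [
    {"putbucketpolicy": {"file_08"}},
    {"putbucketpolicy": {"file_23"}},
    {"putbucketpolicy": set()},
]

def lookup(shard, term):
    """One 'worker' reads one index shard and reports its matching files."""
    return shard.get(term, set())

# Fan out one worker per shard, then merge the partial results.
with ThreadPoolExecutor(max_workers=len(index_shards)) as pool:
    partials = pool.map(lookup, index_shards, ["putbucketpolicy"] * len(index_shards))
matching = set().union(*partials)
print(sorted(matching))  # ['file_08', 'file_23']
```

Because each worker touches only its own shard, wall-clock time stays roughly flat as the number of shards grows, which is what makes the pay-per-use, no-idle-infrastructure model work.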
Performance Comparison
Let's compare how the same query executes on traditional data lakes versus Scanner. The query searches 6 months of CloudTrail logs for S3 bucket policy changes on production buckets—a common security investigation task.
Traditional Data Lake (Athena/Presto) - Scans All Files
Here's the SQL query in Athena:
SELECT *
FROM cloudtrail_logs
WHERE eventname IN ('PutBucketPolicy', 'DeleteBucketPolicy', 'PutBucketEncryption',
'DeleteBucketEncryption', 'PutBucketPublicAccessBlock',
'DeleteBucketPublicAccessBlock')
AND eventsource = 's3.amazonaws.com'
AND (
json_extract_scalar(requestparameters, '$.bucketName') LIKE 'prod-%'
)
AND eventtime >= DATE_ADD('month', -6, CURRENT_DATE)
  AND CAST(year AS INTEGER) >= YEAR(DATE_ADD('month', -6, CURRENT_DATE))
  AND (
    CAST(year AS INTEGER) > YEAR(DATE_ADD('month', -6, CURRENT_DATE))
    OR CAST(month AS INTEGER) >= MONTH(DATE_ADD('month', -6, CURRENT_DATE))
  )
);
The query filters on deeply nested JSON (requestParameters.bucketName) and specific event types. Partition pruning on year and month narrows the scan to the 6-month window, but within that window Athena has no way to skip files: it must read every file, parse the JSON in every record, and apply the filters at runtime.
📁 AWSLogs/o-abc123/111122223333/CloudTrail/us-east-1/2024/10/30/
🟢🟢🟢🟢🟢🟢🟢🟢🟢🟢🟢🟢 ...
~2.8 GB
📁 AWSLogs/o-abc123/111122223333/CloudTrail/us-east-1/2024/10/31/
🟢🟢🟢🟢🟢🟢🟢🟢🟢🟢🟢🟢 ...
~2.7 GB
📁 AWSLogs/o-abc123/111122223333/CloudTrail/us-east-1/2024/11/01/
🟢🟢🟢🟢🟢🟢🟢🟢🟢🟢🟢🟢 ...
~2.9 GB
📁 AWSLogs/o-abc123/111122223333/CloudTrail/us-east-1/2024/11/02/
🟢🟢🟢🟢🟢🟢🟢🟢🟢🟢🟢🟢 ...
~3.0 GB
... (175+ more daily folders)
TOTAL
~10-15 TB
Legend: 🟢 = File scanned (ALL files must be read and parsed)
Result: 10-15 TB scanned
Scanner with Inverted Index - Scans Only Relevant Files
Here's the same query in Scanner:
%ingest.source_type:"aws:cloudtrail"
eventSource:"s3.amazonaws.com"
eventName:(PutBucketPolicy DeleteBucketPolicy PutBucketEncryption
DeleteBucketEncryption PutBucketPublicAccessBlock
DeleteBucketPublicAccessBlock)
requestParameters.bucketName:"prod-*"

Time Range: Last 6 months (dynamically calculated from the current date)
For each day in the time range, Scanner reads a small index file that maps field values to log files. This index acts like a card catalog: look up "PutBucketPolicy" and get the file list, look up "s3.amazonaws.com" and get another list, then intersect them to find files matching all conditions. Files with no matching events are never opened.
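The day-level pruning described above can be sketched as follows. The per-day index structure and the `plan` function are illustrative assumptions, not Scanner internals; the point is that days whose intersection comes up empty are never opened at all.

```python
# Hypothetical per-day indexes: day -> {term: set of files containing it}.
daily_indexes = {
    "2024/10/30": {"putbucketpolicy": {"f3"}, "prod": {"f3"}},
    "2024/10/31": {"putbucketpolicy": set(), "prod": {"f7"}},
    "2024/11/01": {"putbucketpolicy": {"f1", "f4"}, "prod": {"f1", "f4", "f5"}},
}

def plan(daily_indexes, terms):
    """Per day, intersect the term posting lists; drop days with no matches."""
    scan_plan = {}
    for day, idx in daily_indexes.items():
        files = set.intersection(*(idx.get(t, set()) for t in terms))
        if files:  # an empty intersection means the day is skipped entirely
            scan_plan[day] = files
    return scan_plan

result = plan(daily_indexes, ["putbucketpolicy", "prod"])
for day in result:
    print(day, sorted(result[day]))
# 2024/10/30 ['f3']
# 2024/11/01 ['f1', 'f4']
```

Here 2024/10/31 is pruned without reading a single log file, which is why most daily folders in the listing below contribute 0 GB to the scan.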
📁 tenant_abc/index_cloudtrail/2024/10/30/
⚪⚪🟢⚪⚪⚪⚪⚪⚪⚪⚪⚪ ...
0.31 GB
📁 tenant_abc/index_cloudtrail/2024/10/31/
⚪⚪⚪⚪⚪🟢⚪⚪⚪⚪⚪⚪ ...
0.28 GB
📁 tenant_abc/index_cloudtrail/2024/11/01/
🟢⚪⚪🟢🟢⚪⚪⚪⚪⚪⚪⚪ ...
0.89 GB
📁 tenant_abc/index_cloudtrail/2024/11/02/
⚪⚪⚪🟢🟢⚪⚪⚪⚪⚪⚪⚪ ...
0.67 GB
... (170+ more daily folders with mostly 0 GB)
TOTAL
~2-5 GB
Legend: 🟢 = File scanned | ⚪ = File skipped (index says no matches here)
Result: 2-5 GB scanned
Performance Impact
Athena scans everything: 10-15 TB of raw logs, parses JSON for every record, filters at the end. Scanner scans only what matters: the index identifies which 2-5 GB contain matching events.
Data Scanned: 10-15 TB → 2-5 GB (~3,000-5,000x less)
Query Time: 30 min → 1-3 sec (600-1,800x faster)
Cost per Query: $75-100 → $0.01-0.10 (750-10,000x cheaper)
A query that costs $100 and takes 30 minutes with traditional tools costs $0.05 and takes 2 seconds with Scanner.
Why Scanner's query is simpler:
No JSON parsing: Fields are indexed at ingestion, not parsed at query time
No manual partitioning: The index finds matching files automatically
Access nested fields directly: Use dot notation like requestParameters.bucketName without extraction functions
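One plausible way to support dot notation is to flatten nested JSON into dotted keys at ingestion, so nested fields index and query exactly like top-level ones. This is a minimal sketch of that idea, not Scanner's actual flattening logic.

```python
def flatten(event, prefix=""):
    """Flatten nested JSON into dot-notation keys for indexing."""
    flat = {}
    for key, value in event.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, f"{path}."))
        else:
            flat[path] = value
    return flat

event = {
    "eventName": "PutBucketPolicy",
    "requestParameters": {"bucketName": "prod-bucket-1"},
}
print(flatten(event))
# {'eventName': 'PutBucketPolicy', 'requestParameters.bucketName': 'prod-bucket-1'}
```

Once flattened, requestParameters.bucketName is just another indexed field, so no json_extract_scalar-style extraction is needed at query time.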
The Trade-off: Storage for Speed
Building indexes requires additional storage (see storage ratio in the main architecture docs). This is a deliberate trade-off: some extra disk space in exchange for query speeds that make data lakes actually usable for investigations. Without fast queries, iterative investigation workflows are impractical—waiting 30+ minutes per query makes it infeasible to pursue multiple lines of inquiry during an incident.
Speed Determines Investigation Outcomes
During a security incident, investigation is inherently iterative. An API key accessing sensitive S3 buckets from an unknown IP raises questions: When did this start? What else has this key accessed? Are other keys compromised? Who generated this key? Each answer leads to several more queries.
With traditional tools:
Query 1 (find first use of the key): 45 minutes
Query 2 (check other buckets): 38 minutes
Query 3 (find related access patterns): 52 minutes
Total after 3 queries: 2 hours 15 minutes
After three queries and over two hours, the investigation has barely begun. Meanwhile, the time window for containment continues to shrink.
With Scanner:
Query 1: 8 seconds
Query 2: 5 seconds
Query 3: 12 seconds
Queries 4-20: Another 3 minutes combined
Total after 20 queries: 4 minutes
Same investigation, same data. Twenty pivots instead of three. The key has been traced back to a compromised CI/CD pipeline, every affected resource has been identified, lateral movement has been ruled out, and systems have been isolated.
The difference isn't just speed. It's the ability to ask every relevant question without rationing queries. Fast queries change what's possible during an investigation—more pivots, deeper analysis, faster containment.