Detection Rule Engine

Scanner's detection engine runs hundreds of detection rules simultaneously against massive log volumes—without repeatedly scanning the same data. Traditional scheduled queries become cost-prohibitive at scale, but Scanner's streaming architecture makes continuous detection both fast and economical.

The Challenge: Scheduled Queries Don't Scale

In traditional data lake environments, detection rules are implemented as scheduled queries that re-scan logs on a fixed interval. This approach has fundamental limitations:

  • Each detection rule scans the full dataset every time it runs

  • 100 detection rules × 1 query/minute = 100 full scans per minute

  • 1,000 detection rules × 1 query/minute = 1,000 full scans per minute

  • Most scans find nothing (redundant work)

  • Query costs scale linearly with the number of rules

  • Becomes economically infeasible as rule count or log volume increases

Scanner solves this with a streaming detection engine that caches intermediate query results in a specialized data structure, enabling hundreds of detection rules to run efficiently even at massive scale.

How Scanner's Detection Engine Works

Scanner's detection engine operates in two distinct phases:

Phase 1: Indexing Time (Building the Cache)

As logs are indexed, they flow through a Detection Engine state machine that executes all detection rule queries simultaneously.

What happens during indexing:

  1. Concurrent query execution: All detection rule queries (hundreds) run against incoming logs simultaneously

  2. Filter matching: Only logs that match the query filters (the portion before aggregation) are processed further

  3. Partial execution: Matching logs are run through the query up to the first aggregation, and the resulting aggregation values are stored in a time-based rollup tree data structure

  4. Efficient storage: The rollup tree is written to the Scanner Index files S3 bucket

This happens automatically during the normal indexing process—no additional scanning required.
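
To make the indexing phase concrete, here is a minimal Python sketch of the idea, not Scanner's actual implementation: every rule's filter runs once per incoming log, and matching logs update a partial aggregate in a per-minute bucket. The rule name, filter fields, and data layout are hypothetical.

from collections import defaultdict

# Hypothetical rule: a filter predicate. For a query prefix like `* | count`,
# the stored aggregation is a simple running count.
RULES = {
    "console_login_failures": {
        "filter": lambda log: log.get("eventName") == "ConsoleLogin"
        and "errorMessage" in log,
    },
}

# rollup[rule_name][minute] -> partial count for that 1-minute leaf node
rollup = defaultdict(lambda: defaultdict(int))

def index_log(log: dict) -> None:
    # Each log is processed once, against every rule's filter.
    minute = log["timestamp"] // 60
    for name, rule in RULES.items():
        if rule["filter"](log):
            rollup[name][minute] += 1  # partial execution up to the aggregation

index_log({"timestamp": 120, "eventName": "ConsoleLogin", "errorMessage": "Failed"})
print(dict(rollup["console_login_failures"]))  # {2: 1} -> one match in minute 2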

Learn more: For details on Scanner's indexing pipeline, see How it Works → Stage 2: Indexing

Phase 2: Detection Checking (Querying the Cache)

Separately from indexing, detection workers (continuous ECS tasks) run detection rules against fixed-size windows of time, determined by the run_frequency_s and time_range_s parameters. For example, if a rule has a run_frequency_s of 60 seconds and a time_range_s of 3600 seconds (1 hour), the rule will run against one-hour time ranges every minute, i.e. it will run on the time windows [00:00..01:00], [00:01..01:01], [00:02..01:02], etc.
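
As a hypothetical illustration of these two parameters, the snippet below enumerates the first few windows a rule would be checked against (times in seconds); successive runs slide the window forward by run_frequency_s:

def detection_windows(run_frequency_s=60, time_range_s=3600, runs=3):
    # Each run evaluates the most recent `time_range_s` seconds.
    for i in range(runs):
        end = time_range_s + i * run_frequency_s
        yield (end - time_range_s, end)

print(list(detection_windows()))
# [(0, 3600), (60, 3660), (120, 3720)] -> [00:00..01:00], [00:01..01:01], ...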

What happens during detection checking:

  1. Query the rollup tree: For each detection rule, query the rollup tree data structure using the rule's configured time range

  2. Execute remaining query: Complete the query by running the portions after the first aggregation

  3. Evaluate results: Check if the result set is non-empty (i.e., detection triggered)

  4. Send alerts: If triggered, send alerts to configured destinations (Slack, PagerDuty, webhooks, SOAR tools)

  5. Log detection events: Store detection events in the special _detections index for investigation
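
As a self-contained sketch of one check cycle, the snippet below uses an in-memory dictionary as a stand-in for the rollup tree and a simple count-with-threshold rule; the alerting and event-logging steps are reduced to a print statement, and all names are hypothetical.

# Cached 1-minute partial aggregates: minute -> count of matching logs.
rollup_nodes = {0: 4, 1: 9, 2: 3, 3: 0, 4: 2}

def check_rule(window_start: int, window_end: int, threshold: int = 10):
    # 1. Query the rollup tree for the rule's configured time window.
    partials = [rollup_nodes.get(m, 0) for m in range(window_start, window_end)]
    # 2. Execute the remainder of the query in memory: combine the partial
    #    counts, then apply the threshold condition.
    total = sum(partials)
    # 3-5. A non-empty result set means the detection triggered.
    if total > threshold:
        print(f"ALERT: count={total} in window [{window_start}, {window_end})")

check_rule(0, 5)  # 4 + 9 + 3 + 0 + 2 = 18 > 10, so the alert fires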

The Rollup Tree Data Structure

The rollup tree is a hierarchical, time-based data structure that stores aggregated query results at multiple time granularities.

Tree Structure

Each detection query's aggregations are stored in a tree with nodes at different time resolutions:

24 hours (root)
├── 6 hours
│   ├── 1 hour
│   │   ├── 15 minutes
│   │   │   ├── 5 minutes
│   │   │   │   ├── 1 minute

How it works:

As a running example for this section, consider the query: * | count | where @q.count > 10

  • For each detection rule, we determine the largest prefix of its query whose aggregation operation is associative and has an identity element. For the example query, this prefix is * | count, and the aggregation operation is simple addition.

    • By "associative", we mean that the order of operations does not matter. E.g. (1 + 2) + 3 = 1 + (2 + 3)

    • By "identity", we mean that there is a neutral element that does not change the result. E.g. 1 + 0 = 1

  • Each node contains the aggregation values of that query prefix for its time window. For the example query, the values we store are simply the total count for each time window, so each node contains a single numeric value.

  • Because the stored aggregation values form a monoid (an associative operation with an identity element), nodes can be combined to answer queries over arbitrary time ranges; only the smallest disjoint set of nodes covering the queried range needs to be read. For the example query, nodes are combined via simple addition.

  • The remainder of the query (after the cached prefix) is executed in memory. For the example query, this is simply applying the condition | where @q.count > 10, a trivial operation.
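
As a concrete sketch of the monoid idea for the example query, counts stored in nodes of different sizes can be folded together in any grouping, starting from the identity element. The node values below are hypothetical:

# Hypothetical cached counts for the range 01:00..02:35.
nodes = [
    ("01:00..02:00", 120),  # 1-hour node
    ("02:00..02:15", 31),   # 15-minute node
    ("02:15..02:30", 12),   # 15-minute node
    ("02:30..02:35", 5),    # 5-minute node
]

identity = 0                    # the count of an empty time range
combine = lambda a, b: a + b    # associative: (a + b) + c == a + (b + c)

total = identity
for _, count in nodes:
    total = combine(total, count)

print(total)  # 168 events matched the filter in 01:00..02:35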

Querying the tree:

The rollup tree is a segment tree data structure that efficiently answers range queries by reading the minimal set of nodes needed to cover the requested time range.

For example:

  • Best case: If a 24-hour query's time range aligns exactly with a 24-hour node boundary, Scanner reads just the root node (a single read)

  • Typical case: If the requested range doesn't align with node boundaries, Scanner reads multiple smaller nodes that tile together to cover it exactly. For the time range 00:43..02:35, it would read the nodes for these time ranges:

    • 00:43..00:44

    • 00:44..00:45

    • 00:45..01:00

    • 01:00..02:00

    • 02:00..02:15

    • 02:15..02:30

    • 02:30..02:35

The segment tree structure guarantees that any arbitrary time range can be decomposed into a relatively small number of disjoint segments, minimizing S3 reads while ensuring complete coverage of the requested time window.
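
A greedy decomposition captures the idea: starting at the beginning of the range, repeatedly take the largest node that is aligned to its own size and still fits. This sketch is an illustration rather than Scanner's exact code, but it reproduces the 00:43..02:35 example above:

SIZES_MIN = [1440, 360, 60, 15, 5, 1]  # node sizes: 24h, 6h, 1h, 15m, 5m, 1m

def decompose(start: int, end: int):
    # Tile [start, end) in minutes with the largest aligned nodes possible.
    tiles, t = [], start
    while t < end:
        size = next(s for s in SIZES_MIN if t % s == 0 and t + s <= end)
        tiles.append((t, t + size))
        t += size
    return tiles

fmt = lambda m: f"{m // 60:02d}:{m % 60:02d}"
for a, b in decompose(43, 155):  # 00:43..02:35 expressed in minutes
    print(f"{fmt(a)}..{fmt(b)}")
# Prints the seven ranges listed above, 00:43..00:44 through 02:30..02:35.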

Storage Characteristics

Location: Scanner Index files S3 bucket (same location as standard indexes)

Size: The rollup tree only stores aggregations for logs that match detection rule filters. Storage is typically 0.01% - 0.05% of the original log size, depending on filter selectivity.

Example:

  • 10 TB of logs per day

  • ~10% of logs match detection rule filters across all rules

  • Matching logs after filtering: ~1 TB

  • Rollup tree cache (compressed aggregations): 1-5 GB per day

Constraints and Limits

Node size limit: Each rollup tree node stores up to 64 MB of data; data beyond this limit is truncated.

Rules without aggregations: For detection rules that have no aggregation (e.g., only filters and table), the rollup tree stores up to 64 MB of raw matching log events per time node. Best practice: Always include aggregations (stats, groupbycount, etc.) to minimize storage overhead and stay well under the 64 MB limit.

Retention period: Rollup tree nodes are retained for 1 week (7 days) to support late-arriving data. After 1 week, nodes are automatically deleted from S3.

Allowed detection values: Both the run_frequency_s and time_range_s parameters must use one of the following values (specified in seconds): 60 (1 min), 300 (5 min), 900 (15 min), 3600 (1 hour), 21600 (6 hours), or 86400 (1 day).

Rule count: There is no maximum number of detection rules per tenant. However, compute usage and associated costs increase with the number of active rules.

Lifecycle Management

Rule updates: When a detection rule is modified (query, time range, frequency, etc.), the changes apply to new log data going forward only. Existing rollup tree nodes are not backfilled with the updated logic.

Rule deletion: When a detection rule is deleted, its rollup tree cache nodes are removed from S3. This cleanup happens asynchronously.

Rule disable/enable: When a detection rule is disabled, new rollup tree nodes are not created during indexing, and detection workers stop checking the rule. When re-enabled, the rule begins processing new logs going forward as if it were newly activated; historical data from the disabled period is not backfilled.

Detection events retention: Detection events stored in the _detections index have unlimited retention by default. Retention can be configured to a shorter period if desired via Scanner's data retention settings.

Late-Arriving and Out-of-Order Data

Scanner handles late-arriving logs gracefully for up to 1 week. When logs arrive out of order or are indexed late:

  1. The detection engine updates the relevant rollup tree nodes with the new data

  2. Updated nodes are flagged for re-evaluation

  3. Detection workers re-check all affected time windows (i.e. those which intersect the new data) on their next run

  4. If the late data causes a detection rule to trigger, alerts are sent and detection events are created

This ensures detection accuracy even when log delivery is delayed or arrives out of chronological order.

Note that logs arriving more than 1 week late will not be processed by the detection engine since rollup tree nodes older than 1 week are deleted.
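
For illustration, a late event affects every check window that contains its timestamp. The helper below (hypothetical, not a Scanner API) enumerates those windows for a rule's run_frequency_s and time_range_s:

def affected_windows(event_ts: int, run_frequency_s: int, time_range_s: int):
    # Window ends fall on multiples of run_frequency_s; a half-open window
    # [end - time_range_s, end) contains the event iff
    # event_ts < end <= event_ts + time_range_s.
    end = (event_ts // run_frequency_s + 1) * run_frequency_s
    while end <= event_ts + time_range_s:
        yield (end - time_range_s, end)
        end += run_frequency_s

# A log stamped 3,650s into the day, for a rule checked every 60s over a
# 1-hour range, lands in 60 windows that the workers will re-check.
print(len(list(affected_windows(3650, 60, 3600))))  # 60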

Example: S3 Data Exfiltration Detection

Here's how a detection rule with aggregations executes through the two-phase system:

name: High Volume S3 Data Exfiltration
query_text: |
  %ingest.source_type="aws:cloudtrail"
  eventSource="s3.amazonaws.com"
  eventName="GetObject"
  | stats sum(bytesTransferred) as total_bytes by userIdentity.arn
  | eval gbTransferred = total_bytes / (1024 * 1024 * 1024)
  | where gbTransferred > 100
time_range_s: 300 # Look back 5 minutes
run_frequency_s: 60 # Check every minute

At indexing time: S3 GetObject events matching the filters have bytesTransferred summed by userIdentity.arn and stored in 1-minute nodes of the rollup tree.

At detection checking time: Every 60 seconds, the detection worker queries the rollup tree for the last 5 minutes, applies the eval and where clauses, and checks if any user exceeds 100 GB. If Alice transferred 150 GB, an alert is sent and a detection event is created.
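
Here is a self-contained sketch of that detection-time work, assuming hypothetical 1-minute nodes that each map a userIdentity.arn to a partial sum of bytesTransferred (the query prefix up to stats sum(...) by):

GB = 1024 ** 3

# Five hypothetical cached 1-minute nodes covering the 5-minute window.
minute_nodes = [
    {"arn:aws:iam::123:user/alice": 40 * GB},
    {"arn:aws:iam::123:user/alice": 55 * GB, "arn:aws:iam::123:user/bob": 2 * GB},
    {},  # no matching events this minute
    {"arn:aws:iam::123:user/alice": 55 * GB},
    {"arn:aws:iam::123:user/bob": 1 * GB},
]

# Combine nodes: per-key addition is the monoid operation for `stats sum by`.
totals = {}
for node in minute_nodes:
    for arn, nbytes in node.items():
        totals[arn] = totals.get(arn, 0) + nbytes

# Remainder of the query: the `eval` and `where` clauses, run in memory.
for arn, nbytes in totals.items():
    if nbytes / GB > 100:
        print(f"ALERT: {arn} transferred {nbytes / GB:.0f} GB in 5 minutes")
# -> alerts on alice (150 GB); bob (3 GB) stays under the threshold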

Query performance: Typically under 100 milliseconds, even over time ranges up to 24 hours, because we're reading cached aggregations instead of scanning raw logs.

Example: Detection Rule Without Aggregations

Detection rules can be written without aggregations, though this is less efficient:

name: S3 Bucket Deletion
query_text: |
  %ingest.source_type="aws:cloudtrail"
  eventName="DeleteBucket"
  | table timestamp, requestParameters.bucketName, userIdentity.arn
time_range_s: 60
run_frequency_s: 60

Since this rule has no aggregation (stats, groupbycount, etc.), the rollup tree stores up to 64 MB of raw matching events per time node. For high-volume events, this is less storage-efficient than using aggregations. Best practice: Use aggregations whenever possible to reduce rollup tree storage.

Query Examples and Detection Rules

Detection rules use the same query syntax as ad-hoc searches. Any query can become a detection rule.


Performance at Scale

Scanner's streaming detection engine makes it economically feasible to run extensive detection coverage, even at massive log volumes.

Comparison

Metric | Traditional Scheduled Queries | Scanner Streaming Detection
--- | --- | ---
Detection rules supported | 10-50 (practical limit) | Hundreds
Detection latency | Minutes to hours | Minutes (as fast as 1 min)
Data scanned per check | Full dataset per rule | Cached aggregations only
Cost scaling | Linear with rules × volume | Sublinear (shared indexing cost)

Example at scale (500 rules, 10 TB/day logs):

  • Traditional: 500 rules × 60 checks/hour × 24 hours × 10 TB = 7,200 PB scanned daily — Prohibitively expensive

  • Scanner: 500 rules × 60 checks/hour × 24 hours × 5 GB = 3.6 PB queried daily — ~2,000x less expensive

Key Benefits

  1. Massive scale: Run hundreds of detection rules without cost explosion

  2. Fast alerts: Detection latency measured in minutes (as fast as 1 minute), not hours or days

  3. Efficient time ranges: Lookback windows up to 24 hours are just as fast as 1-minute windows

  4. No redundant scanning: Each log is processed once during indexing, then queried via cached aggregations

  5. Shared indexing infrastructure: Aggregations happen during normal indexing—no additional log scanning required

  6. Independent detection workers: Detection checking runs on continuous ECS tasks that scale separately from log ingestion

This architectural choice makes comprehensive, real-time detection economically viable at any scale—from gigabytes to petabytes per day.

