Understanding Tokens and Query Performance

How you structure your queries affects how quickly Scanner can return results. This page covers key patterns for writing efficient queries.

For background on Scanner's architecture, see How Scanner Achieves Fast Queries.

How the Index Works

Scanner builds an inverted index that maps field values to the files containing them. The index keys are tokens stored in sorted order.

Think of it like a phone book: you can quickly find all names starting with "Sm" because names are sorted alphabetically. But finding all names ending with "son" requires reading every entry. This is why trailing wildcards are fast (prefix lookup) and leading wildcards are slow (full scan).
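The phone book analogy can be sketched in a few lines of Python. This is only an illustration of why sorted keys favor prefix lookups; the token list and function names are hypothetical and do not reflect Scanner's actual storage format.

```python
from bisect import bisect_left

# Hypothetical sorted index keys (tokens), standing in for a real index.
SORTED_TOKENS = sorted([
    "admin", "administrator", "backup", "error",
    "jackson", "smith", "smythe", "user",
])

def prefix_lookup(tokens, prefix):
    """Binary-search to the first key >= prefix, then read keys while they match."""
    start = bisect_left(tokens, prefix)
    matches = []
    for token in tokens[start:]:
        if not token.startswith(prefix):
            break  # sorted order guarantees no later key can match
        matches.append(token)
    return matches

def suffix_scan(tokens, suffix):
    """Sorted order does not help with suffixes: every key must be checked."""
    return [token for token in tokens if token.endswith(suffix)]

print(prefix_lookup(SORTED_TOKENS, "sm"))   # touches only the neighboring keys
print(suffix_scan(SORTED_TOKENS, "son"))    # touches every key
```

The prefix lookup stops as soon as it passes the matching range, while the suffix scan has no way to skip anything.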

Token Matching: The Most Important Concept

Scanner breaks field values into tokens and indexes each token separately. You can search for individual tokens directly without wildcards.

What Makes a Token

Tokens consist of alphanumeric characters (a-z, A-Z, 0-9) and underscores. Everything else is a boundary.

  • Part of tokens: Letters, numbers, underscores (_)

  • Boundaries: Spaces, hyphens, dots, colons, slashes, brackets, quotes, equals, commas, and all other punctuation

For example, user_admin_backup is a single token, but user-admin-backup splits into three tokens: user, admin, backup.

Search is case-insensitive by default. Searching for error, Error, or ERROR all match the same tokens.

IP addresses are special: Scanner indexes both the full IP address and each octet separately. So 192.168.1.100 generates tokens for 192.168.1.100, 192, 168, 1, and 100.

Examples

The value scnr-QueryWorker-95a7410[$LATEST] becomes tokens scnr, QueryWorker, 95a7410, LATEST. Searching field: "QueryWorker" matches.

The value /var/log/application/error.log becomes tokens var, log, application, error, log. Searching field: "error" matches.

The value user-admin-backup becomes tokens user, admin, backup (hyphens are boundaries). Searching field: "admin" matches.

The value 192.168.1.100 becomes tokens 192.168.1.100, 192, 168, 1, 100 (IP addresses get special handling). Searching field: "192.168.1.100" matches, and so does field: "168" since individual octets are searchable.

The value user_admin_backup stays as one token, user_admin_backup (underscores are NOT boundaries). Searching field: "admin" does not match.
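The rules above can be approximated with a short Python sketch. It is an assumption-laden model of the described behavior, not Scanner's tokenizer: the regular expressions, the function names tokenize and matches_token, and the way IPv4 addresses are detected are all hypothetical.

```python
import re

# Tokens: runs of letters, digits, and underscores. Everything else is a boundary.
TOKEN_RE = re.compile(r"[A-Za-z0-9_]+")
# Assumed IPv4 detection: four dot-separated groups of 1-3 digits.
IPV4_RE = re.compile(r"(?<![\w.])\d{1,3}(?:\.\d{1,3}){3}(?![\w.])")

def tokenize(value):
    """Return full IP addresses first, then the ordinary tokens."""
    tokens = [m.group(0) for m in IPV4_RE.finditer(value)]
    tokens += TOKEN_RE.findall(value)
    return tokens

def matches_token(value, term):
    """Case-insensitive whole-token match, like field: "term" without wildcards."""
    term = term.lower()
    return any(token.lower() == term for token in tokenize(value))

print(tokenize("scnr-QueryWorker-95a7410[$LATEST]"))
print(tokenize("192.168.1.100"))
print(matches_token("user-admin-backup", "admin"))
print(matches_token("user_admin_backup", "admin"))
```

Running it on the example values on this page reproduces the token lists shown, including the case where user_admin_backup does not match a search for admin.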

Why This Matters

Consider searching for Lambda functions whose names contain "QueryWorker". You could write this with wildcards, as field: "*QueryWorker*", or as a plain token search, field: "QueryWorker".

Both queries find the same results. The token search is faster because QueryWorker is a complete token within values like scnr-QueryWorker-95a7410[$LATEST], so Scanner can look it up directly in the index.

When You Still Need Wildcards

Wildcards are necessary when there are no token boundaries:

  • Prefix matching within a token: admin* matches "administrator"

  • Substring within a token: searching for "Processor" within AdminActionProcessorWorker requires *Processor* (slow)

If possible, restructure substring searches as prefix searches. AdminActionProcessor* is fast; *Processor* is slow.

Wildcard Performance

Trailing Wildcards are Fast

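Trailing wildcards (value*) use index prefix matching on sorted keys. For example, user: "admin*" finds every token that starts with admin by reading one contiguous range of the sorted index.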

Leading Wildcards are Slow

Leading wildcards (*value) cannot use the index to narrow the search. Scanner must scan all data in the specified index and time range, which can be significantly slower on large datasets.

Avoid Exact Match with Wildcards

Scanner has two operators for column queries (see Query Syntax):

  • : (contains) - searches for a token within the field value (uses the index)

  • = (exact match) - matches the entire raw field value exactly

Both operators are fast when used correctly. The problem is combining = with wildcards, as in path = "*api*".

The : operator is fast because it looks up tokens in the index. The = operator with wildcards is slow because it must scan every raw field value to check if it matches the pattern.
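The difference in work can be illustrated with a small Python model. It is a sketch under stated assumptions: the sample values, the dictionary-based token index, and the function names are hypothetical, and Python's fnmatch stands in for whatever pattern matching Scanner performs on raw values.

```python
import re
from fnmatch import fnmatchcase

# Hypothetical raw values of a "path" field, one per log event.
RAW_VALUES = ["/api/v1/users", "/health", "/static/api-docs", "/login"]

def build_token_index(values):
    """Map each lowercased token to the set of rows that contain it."""
    index = {}
    for row, value in enumerate(values):
        for token in re.findall(r"[A-Za-z0-9_]+", value):
            index.setdefault(token.lower(), set()).add(row)
    return index

def contains(index, token):
    """Model of path: "api": a single dictionary lookup."""
    return sorted(index.get(token.lower(), set()))

def exact_with_wildcards(values, pattern):
    """Model of path = "*api*": every raw value is tested against the pattern."""
    checked = 0
    hits = []
    for row, value in enumerate(values):
        checked += 1
        if fnmatchcase(value, pattern):
            hits.append(row)
    return hits, checked

index = build_token_index(RAW_VALUES)
print(contains(index, "api"))
print(exact_with_wildcards(RAW_VALUES, "*api*"))
```

Both calls return the same rows, but the wildcard version has to inspect every value to get there, and that cost grows with the amount of data in the time range.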

If you're coming from Splunk, note that the = "*value*" pattern doesn't translate well. Use : "value" instead.

When to Use Each Operator

| Goal | Operator | Example | Performance |
| --- | --- | --- | --- |
| Find token within field | : | message: "error" | Fast (index lookup) |
| Exact raw field value | = | status = "success" | Fast (direct match) |
| Token prefix match | : with * | user: "admin*" | Fast (index prefix scan) |
| Raw value pattern match | = with \*...\* | path = "*api*" | Slow (scans all values) |
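As a quick self-check, the guidance on this page can be encoded as a small Python helper. The function classify is hypothetical and is not part of Scanner; it restates the rules described here and reports anything they do not cover.

```python
def classify(operator, pattern):
    """Estimate query cost for one `field <operator> "pattern"` term."""
    if operator not in (":", "="):
        raise ValueError("operator must be ':' or '='")
    if "*" not in pattern:
        # No wildcards: both operators are fast.
        return "fast (index lookup)" if operator == ":" else "fast (direct match)"
    if operator == "=":
        # Exact match combined with wildcards must test every raw value.
        return "slow (scans all values)"
    if pattern.startswith("*"):
        # Leading wildcard: the sorted index cannot narrow the search.
        return "slow (leading wildcard, full scan)"
    if pattern.endswith("*") and "*" not in pattern[:-1]:
        # Single trailing wildcard: prefix scan over sorted index keys.
        return "fast (index prefix scan)"
    return "not covered by this page"

print(classify(":", "error"))
print(classify("=", "success"))
print(classify(":", "admin*"))
print(classify("=", "*api*"))
print(classify(":", "*Processor*"))
```

A helper like this could be run over saved queries to flag the = "*value*" pattern before it reaches a large index.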

Index Selection

Queries run against a specific index using the @index= syntax (see examples in Data Exploration). If you don't specify an index, Scanner runs the query across all indexes you have permission to query.

Smaller, focused indexes are faster to query than large ones containing all your data. When Scanner executes a query, it scans data within the selected index and time range. A smaller index means less data to scan, especially for queries that can't fully leverage the token index (like leading wildcard searches).

See Index Organization for strategies on structuring your indexes.

Best Practices Summary

  1. Understand token boundaries - you often don't need wildcards at all

  2. Use contains (:) for "find within" queries - avoid = "*value*" patterns

  3. Prefer trailing wildcards (value*) over leading wildcards (*value) when wildcards are needed

  4. Use exact match (=) for exact values - it's fast for direct lookups like status = "success"

  5. Apply filters before aggregations to reduce the amount of data scanned
