# Understanding Tokens and Query Performance

How you structure your queries affects how quickly Scanner can return results. This page covers key patterns for writing efficient queries.

For background on Scanner's architecture, see [How Scanner Achieves Fast Queries](https://docs.scanner.dev/scanner/what-and-why/how-it-works/how-scanner-achieves-fast-queries).

## How the Index Works

Scanner builds an [inverted index](https://docs.scanner.dev/scanner/what-and-why/how-it-works/how-scanner-achieves-fast-queries#how-the-inverted-index-works) that maps field values to the files containing them. The index keys are **tokens** stored in **sorted order**.

Think of it like a phone book: you can quickly find all names starting with "Sm" because names are sorted alphabetically. But finding all names *ending* with "son" requires reading every entry. This is why **trailing wildcards are fast** (prefix lookup) and **leading wildcards are slow** (full scan).
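The phone book analogy can be sketched in a few lines of Python. This is only an illustration of why sorted keys favor prefix lookups; the key list and the function names `prefix_lookup` and `suffix_scan` are hypothetical and do not reflect Scanner's actual implementation.

```python
from bisect import bisect_left

# Hypothetical sorted index keys (not real Scanner data).
SORTED_KEYS = sorted(["anderson", "johnson", "smith", "smythe", "smolder", "taylor"])

def prefix_lookup(keys, prefix):
    """Binary-search to the first key >= prefix, then read forward while keys match."""
    start = bisect_left(keys, prefix)
    matches = []
    for key in keys[start:]:
        if not key.startswith(prefix):
            break  # sorted order guarantees no later key can match
        matches.append(key)
    return matches

def suffix_scan(keys, suffix):
    """Sorted order does not help with suffixes: every key has to be inspected."""
    return [key for key in keys if key.endswith(suffix)]

sm_names = prefix_lookup(SORTED_KEYS, "sm")   # touches only the "sm..." neighborhood
son_names = suffix_scan(SORTED_KEYS, "son")   # touches every key
```

The prefix lookup stops as soon as it leaves the matching range, while the suffix scan always reads the whole key list, which is the same asymmetry as between trailing and leading wildcards.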

## Token Matching: The Most Important Concept

Scanner breaks field values into **tokens** and indexes each token separately. You can search for individual tokens directly without wildcards.

### What Makes a Token

Tokens consist of **alphanumeric characters (a-z, A-Z, 0-9) and underscores**. Everything else is a boundary.

* **Part of tokens:** Letters, numbers, underscores (`_`)
* **Boundaries:** Spaces, hyphens, dots, colons, slashes, brackets, quotes, equals, commas, and all other punctuation

For example, `user_admin_backup` is a single token, but `user-admin-backup` splits into three tokens: `user`, `admin`, `backup`.

**Search is case-insensitive** by default: `error`, `Error`, and `ERROR` all match the same tokens.

**IP addresses are special:** Scanner indexes both the full IP address and each octet separately. So `192.168.1.100` generates tokens for `192.168.1.100`, `192`, `168`, `1`, and `100`.

### Examples

The value `scnr-QueryWorker-95a7410[$LATEST]` becomes tokens `scnr`, `QueryWorker`, `95a7410`, `LATEST`. Searching `field: "QueryWorker"` **matches**.

The value `/var/log/application/error.log` becomes tokens `var`, `log`, `application`, `error`, `log`. Searching `field: "error"` **matches**.

The value `user-admin-backup` becomes tokens `user`, `admin`, `backup` (hyphens are boundaries). Searching `field: "admin"` **matches**.

The value `192.168.1.100` becomes tokens `192.168.1.100`, `192`, `168`, `1`, `100` (IP addresses get special handling). Searching `field: "192.168.1.100"` **matches**, and so does `field: "168"` since individual octets are searchable.

The value `user_admin_backup` stays as one token `user_admin_backup` (underscores are NOT boundaries). Searching `field: "admin"` does **not match**.
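To check your intuition about token boundaries, the rules above can be approximated with a short Python sketch. The helpers `tokenize` and `contains` are hypothetical names. Two parts are assumptions made for illustration rather than a description of Scanner's real tokenizer: lowercasing is used to model case-insensitive search, and the IPv4 pattern models the special handling of IP addresses.

```python
import re

# Tokens are runs of ASCII letters, digits, and underscores; everything else is a boundary.
TOKEN_RE = re.compile(r"[A-Za-z0-9_]+")
# Assumption: a simple dotted-quad pattern stands in for Scanner's IP address detection.
IPV4_RE = re.compile(r"(?<![\w.])(?:\d{1,3}\.){3}\d{1,3}(?![\w.])")

def tokenize(value):
    """Return the approximate tokens for a field value, lowercased."""
    tokens = [t.lower() for t in TOKEN_RE.findall(value)]
    # A full IP address is kept as an extra token alongside its octets.
    tokens.extend(IPV4_RE.findall(value))
    return tokens

def contains(value, term):
    """Model `field: "term"`: true when the term is one of the value's tokens."""
    return term.lower() in tokenize(value)

print(tokenize("scnr-QueryWorker-95a7410[$LATEST]"))
print(contains("user-admin-backup", "admin"))   # hyphens are boundaries
print(contains("user_admin_backup", "admin"))   # underscores are not
```

Running a few of your own field values through a helper like this is a quick way to see whether a plain token search will match or whether a wildcard is needed.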

### Why This Matters

Consider searching for Lambda functions containing "QueryWorker":

```python
# Slow - wildcards force a scan of all index keys:
log_stream: "*QueryWorker*"

# Fast - token lookup uses the sorted index:
log_stream: "QueryWorker"
```

For values like `scnr-QueryWorker-95a7410[$LATEST]`, both queries find the same results. The second is faster because `QueryWorker` is a complete token within those values, so Scanner can look it up directly in the index.
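The difference between the two queries can be illustrated with a toy inverted index in Python. This is a simplified model, not Scanner's implementation: the file names, the sample values, and the functions `build_index`, `token_lookup`, and `wildcard_scan` are all hypothetical.

```python
import re
from collections import defaultdict
from fnmatch import fnmatchcase

# Hypothetical log_stream values, keyed by the file that contains them.
FILES = {
    "file_a": "scnr-QueryWorker-95a7410[$LATEST]",
    "file_b": "scnr-IndexWorker-12bc9f0[$LATEST]",
    "file_c": "scnr-QueryWorker-77d0e21[$LATEST]",
}

def build_index(files):
    """Map each lowercased token to the set of files containing it."""
    index = defaultdict(set)
    for name, value in files.items():
        for token in re.findall(r"[A-Za-z0-9_]+", value):
            index[token.lower()].add(name)
    return index

INDEX = build_index(FILES)

def token_lookup(index, term):
    """One direct lookup, no matter how many keys the index holds."""
    return set(index.get(term.lower(), set()))

def wildcard_scan(index, pattern):
    """Test every key against the pattern; also return how many keys were checked."""
    hits, checked = set(), 0
    for key, names in index.items():
        checked += 1
        if fnmatchcase(key, pattern.lower()):
            hits |= names
    return hits, checked

direct = token_lookup(INDEX, "QueryWorker")
scanned, keys_checked = wildcard_scan(INDEX, "*QueryWorker*")
```

In this model both calls return the same files, but the wildcard version has to examine every key in the index to get there.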

### When You Still Need Wildcards

Wildcards are necessary when the text you are searching for is only part of a token, with no token boundary on one or both sides:

* **Prefix matching within a token**: `admin*` matches "administrator"
* **Substring within a token**: searching for "Processor" within `AdminActionProcessorWorker` requires `*Processor*` (slow)

If possible, restructure substring searches as prefix searches. `AdminActionProcessor*` is fast; `*Processor*` is slow.

## Wildcard Performance

### Trailing Wildcards are Fast

Trailing wildcards (`value*`) use index prefix matching on sorted keys:

```python
user_id: "admin*"           # Fast: matches tokens starting with "admin"
                            # Finds: admin, admin123, administrator, adminUser

eventName: "Describe*"      # Fast: matches tokens starting with "Describe"
                            # Finds: DescribeInstances, DescribeSecurityGroups
```

### Leading Wildcards are Slow

Leading wildcards (`*value`) cannot use the index to narrow down the search. Scanner must directly scan all data in the time range and index specified, which can be significantly slower on large datasets:

```python
# Slow - avoid these patterns:
user_id: "*admin"           # Must scan all data directly
message: "*error"           # Must scan all data directly
path: "*api"                # Must scan all data directly
```

## Avoid Exact Match with Wildcards

Scanner has two operators for column queries (see [Query Syntax](https://docs.scanner.dev/scanner/using-scanner-complete-feature-reference/query-syntax#column-queries)):

* **`:` (contains)** - searches for a **token** within the field value (uses the index)
* **`=` (exact match)** - matches the **entire raw field value** exactly

Both operators are fast when used correctly. The problem is combining `=` with wildcards:

```python
# Slow - scans all raw field values looking for the pattern:
message = "*error*"

# Fast - looks up the token "error" in the index:
message: "error"
```

The `:` operator is fast because it looks up tokens in the index. The `=` operator with wildcards is slow because it must scan every raw field value to check if it matches the pattern.

If you're coming from Splunk, note that the `= "*value*"` pattern doesn't translate well. Use `: "value"` instead.
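The two operators also differ in what they match, which a small Python model can make concrete. This is a sketch under stated assumptions: the sample messages and the functions `contains_op` and `exact_op` are hypothetical, `*` is treated as a simple glob wildcard, and both operators are assumed to compare case-insensitively.

```python
import re
from fnmatch import fnmatchcase

# Hypothetical raw field values.
MESSAGES = [
    "Connection error: timeout",
    "success",
    "Retrying after ERROR in worker-3",
    "errors_total=5",
]

def tokens(value):
    """Lowercased tokens: runs of letters, digits, and underscores."""
    return {t.lower() for t in re.findall(r"[A-Za-z0-9_]+", value)}

def contains_op(values, term):
    """Model `field: "term"`: the term must appear as a whole token."""
    return [v for v in values if term.lower() in tokens(v)]

def exact_op(values, pattern):
    """Model `field = "pattern"`: the pattern must match the entire raw value."""
    return [v for v in values if fnmatchcase(v.lower(), pattern.lower())]

token_hits = contains_op(MESSAGES, "error")
pattern_hits = exact_op(MESSAGES, "*error*")
exact_hits = exact_op(MESSAGES, "success")
```

Here the pattern match also picks up `errors_total=5`, where `error` is only part of a larger token, while the token search does not. If you rely on that kind of within-token match, a trailing wildcard such as `message: "error*"` is usually the cheaper way to get it.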

### When to Use Each Operator

| Goal                    | Operator         | Example              | Performance              |
| ----------------------- | ---------------- | -------------------- | ------------------------ |
| Find token within field | `:`              | `message: "error"`   | Fast (index lookup)      |
| Exact raw field value   | `=`              | `status = "success"` | Fast (direct match)      |
| Token prefix match      | `:` with `*`     | `user: "admin*"`     | Fast (index prefix scan) |
| Raw value pattern match | `=` with `*...*` | `path = "*api*"`     | Slow (scans all values)  |

## Index Selection

Queries run against a specific index using the `@index=` syntax (see examples in [Data Exploration](https://docs.scanner.dev/scanner/using-scanner-complete-feature-reference/querying-and-analysis/data-exploration)). If you don't specify an index, Scanner runs the query across **all indexes you have permission to query**.

Smaller, focused indexes are faster to query than large ones containing all your data. When Scanner executes a query, it scans data within the selected index and time range. A smaller index means less data to scan, especially for queries that can't fully leverage the token index (like leading wildcard searches).

See [Index Organization](https://docs.scanner.dev/scanner/using-scanner-complete-feature-reference/data-ingestion/index-organization) for strategies on structuring your indexes.

## Best Practices Summary

1. **Understand token boundaries** - you often don't need wildcards at all
2. **Use contains (`:`) for "find within" queries** - avoid `= "*value*"` patterns
3. **Prefer trailing wildcards** (`value*`) over leading wildcards (`*value`) when wildcards are needed
4. **Use exact match (`=`) for exact values** - it's fast for direct lookups like `status = "success"`
5. **Apply filters before aggregations** to reduce the amount of data scanned
