Scanner Collect

Learn what Scanner Collect is, why it helps you build a scalable S3-based log data lake, and how it simplifies ingestion, indexing, and detection across your SaaS and cloud logs.

Scanner Collect helps you build a log data lake in Amazon S3 with minimal setup and no custom pipeline code. In a single afternoon, you can integrate dozens of log sources.

Logs are delivered into your S3 buckets as gzipped JSON files, partitioned by date (e.g. s3://your-bucket/logs/okta/2025/07/27/). Scanner can then optionally index these logs to enable full-text search and continuous detections.
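For example, reading one of these delivered files back is straightforward. The sketch below uses boto3 and Python's standard gzip module; the bucket name and object key are hypothetical, following the date-partitioned layout above.

```python
import gzip
import json

import boto3  # pip install boto3

s3 = boto3.client("s3")

# Hypothetical bucket and key, following the date-partitioned layout above.
obj = s3.get_object(
    Bucket="your-bucket",
    Key="logs/okta/2025/07/27/events-0001.json.gz",
)

# Each file is gzipped, newline-delimited JSON: one log event per line.
with gzip.open(obj["Body"], mode="rt") as f:
    for line in f:
        event = json.loads(line)
        print(event.get("eventType"))
```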

Why Build a Log Data Lake?

Most traditional logging tools and SIEMs rely on stateful search clusters, which are costly and difficult to scale. Moving to a data lake architecture in S3 offers key advantages:

  • Orders-of-magnitude lower storage cost

  • Easy to scale from GBs to PBs

  • Native support for long-term retention

  • Your logs stay in your storage and under your control

Scanner Collect simplifies the ingestion layer of this architecture, handling API polling, file formatting, and delivery to S3, and enabling accelerated full-text search and detections.

The Modern Security Perimeter Is SaaS and Cloud

Security monitoring today means more than just collecting logs from endpoints or firewalls. The modern attack surface is defined by SaaS tools, identity providers, and cloud platforms—and each of them emits valuable audit logs.

Most of these tools provide audit logs in one of three ways:

  • Pull via API — e.g., Okta System Logs, Microsoft 365 Audit Logs, Slack Audit Logs

  • Push via Webhook or Log Forwarder — e.g., Wiz alert webhooks, Fluentd or Logstash pushing application logs

  • Drop to S3 — e.g., Cloudflare HTTP/DNS, CrowdStrike FDR, AWS CloudTrail, GitHub

Collecting audit logs from the services you rely on should be simple, fast, and repeatable. That’s the goal of Scanner Collect: remove the boilerplate and let your team connect log sources in minutes, not weeks, so you can focus on detection, investigation, and incident response.

Log Source Types

Scanner Collect supports multiple ingestion methods, depending on how your source system makes logs available:

1. API Pull Sources

Scanner integrates with API-based log sources and periodically pulls logs into S3. It handles API pagination, authentication, and deduplication internally.

Examples:

  • Okta System Logs (via /api/v1/logs)

  • Google Workspace Admin Activity Reports (via Reports API)

  • Slack Audit Logs (via the api.slack.com/audit/v1/logs endpoint)

Behavior:

  • Logs are fetched periodically (typically every 1–5 minutes)

  • Delivered to S3 as .json.gz files under a daily partitioned prefix

  • Each file is newline-delimited JSON (NDJSON), with one log event per line
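As a rough illustration of the polling pattern Scanner automates, the sketch below pulls Okta System Log events and writes them to S3 as gzipped NDJSON under a daily prefix. The domain, token, bucket, and file naming are all hypothetical, and the deduplication Scanner performs internally is omitted.

```python
import gzip
import io
import json
from datetime import datetime, timezone

import boto3     # pip install boto3
import requests  # pip install requests

# Hypothetical credentials and destination; Scanner manages these internally.
OKTA_DOMAIN = "your-org.okta.com"
OKTA_TOKEN = "..."  # an Okta API token with log read access
BUCKET = "your-bucket"

def poll_okta_logs(since_iso: str) -> list[dict]:
    """Pull Okta System Log events since a timestamp, following
    Okta's Link-header pagination until results are exhausted."""
    events, url = [], f"https://{OKTA_DOMAIN}/api/v1/logs"
    params = {"since": since_iso, "limit": 1000}
    while url:
        resp = requests.get(
            url,
            params=params,
            headers={"Authorization": f"SSWS {OKTA_TOKEN}"},
            timeout=30,
        )
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break
        events.extend(batch)
        url = resp.links.get("next", {}).get("url")  # None ends the loop
        params = None  # the next URL already carries its own query string
    return events

def write_partition(events: list[dict]) -> None:
    """Write events as a gzipped NDJSON file under a daily prefix."""
    now = datetime.now(timezone.utc)
    key = f"logs/okta/{now:%Y/%m/%d}/events-{now:%H%M%S}.json.gz"
    buf = io.BytesIO()
    with gzip.open(buf, "wt") as f:
        for event in events:
            f.write(json.dumps(event) + "\n")
    boto3.client("s3").put_object(Bucket=BUCKET, Key=key, Body=buf.getvalue())
```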

2. HTTP Push Sources

Scanner can accept logs over HTTP, which is useful for tools that emit webhook events or for forwarding from log shippers.

Examples:

  • Alert webhooks from Wiz, Google Alert Center, Tines, Torq

  • Logs pushed from Fluent Bit, Logstash, or Vector

Behavior:

  • Scanner exposes a secure HTTP endpoint per source

  • Incoming payloads are batched and written to S3 in gzip-compressed JSON

  • Custom parsing/enrichment pipelines can be configured as needed
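The receiving side of this behavior can be pictured with a small sketch: a hypothetical HTTP endpoint that buffers incoming JSON payloads and flushes them to S3 as gzip-compressed, newline-delimited JSON. This illustrates the batching behavior only, not Scanner's actual endpoint implementation; the bucket, route, and flush threshold are assumptions.

```python
import gzip
import io
import json
import threading
from datetime import datetime, timezone

import boto3
from flask import Flask, request  # pip install flask

app = Flask(__name__)
BUCKET = "your-bucket"   # hypothetical destination bucket
FLUSH_THRESHOLD = 500    # events per S3 object (assumed batch size)
_buffer, _lock = [], threading.Lock()

def _flush(events: list) -> None:
    """Write a batch of events to S3 as gzipped NDJSON."""
    now = datetime.now(timezone.utc)
    key = f"logs/webhooks/{now:%Y/%m/%d}/batch-{now:%H%M%S%f}.json.gz"
    buf = io.BytesIO()
    with gzip.open(buf, "wt") as f:
        for event in events:
            f.write(json.dumps(event) + "\n")
    boto3.client("s3").put_object(Bucket=BUCKET, Key=key, Body=buf.getvalue())

@app.post("/ingest")
def ingest():
    # Buffer each payload; flush to S3 once the batch is large enough.
    with _lock:
        _buffer.append(request.get_json(force=True))
        if len(_buffer) >= FLUSH_THRESHOLD:
            _flush(_buffer.copy())
            _buffer.clear()
    return {"status": "accepted"}, 202
```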

3. Custom S3 Sources

If logs are already being delivered to S3 (e.g., via third-party pipelines or vendor tooling), Scanner can index them in place.

Examples:

  • Cloudflare DNS and HTTP logs via S3

  • CrowdStrike Falcon Data Replicator

  • Sublime Security email logs

Behavior:

  • Scanner receives s3:ObjectCreated:* event notifications when new files are written to S3

  • Files must be one of the supported types:

    • JSON

    • Parquet

    • CSV

    • Plaintext

  • Scanner indexes these files for search and detection
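Under the hood, this relies on standard S3 event notifications. Below is a minimal sketch of wiring them up with boto3; the bucket name, SNS topic ARN, and key prefix are hypothetical, and Scanner's setup flow provides the actual notification destination.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket, topic ARN, and prefix; substitute the values
# provided during Scanner's source setup.
s3.put_bucket_notification_configuration(
    Bucket="your-bucket",
    NotificationConfiguration={
        "TopicConfigurations": [
            {
                "TopicArn": "arn:aws:sns:us-east-1:123456789012:scanner-ingest",
                # Fire on every object-creation event (PUT, multipart, copy).
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {
                        "FilterRules": [
                            {"Name": "prefix", "Value": "logs/cloudflare/"}
                        ]
                    }
                },
            }
        ]
    },
)
```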

Forward Elsewhere (Optional)

Scanner itself does not forward logs to other systems after ingestion. However, because your logs are stored in your own S3 bucket in a clean, consistent format, you’re free to set up your own forwarding pipelines.

Common approaches include:

  • Triggering AWS Lambda functions on new S3 object creation

  • Streaming from S3 to Kinesis, then into another SIEM (e.g., Splunk, Datadog)

  • Using AWS Glue or other ETL tools to load logs into downstream systems

This flexibility is intentional. You retain full ownership of your log data and can integrate it into any part of your security stack without vendor lock-in.
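As one example of the first approach, an AWS Lambda handler triggered by s3:ObjectCreated:* events might look like the sketch below. The downstream SIEM endpoint is hypothetical, and a production version would batch events rather than POST them one at a time.

```python
import gzip
import json
import urllib.parse

import boto3
import requests  # bundle with the Lambda deployment package

s3 = boto3.client("s3")
SIEM_URL = "https://siem.example.com/ingest"  # hypothetical downstream endpoint

def handler(event, context):
    """Triggered by s3:ObjectCreated:* notifications; reads each new
    gzipped NDJSON object and forwards its log events downstream."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        body = s3.get_object(Bucket=bucket, Key=key)["Body"]
        with gzip.open(body, mode="rt") as f:
            for line in f:
                # One POST per event keeps the sketch simple; batch in practice.
                requests.post(SIEM_URL, json=json.loads(line), timeout=10)
```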
