Scanner Collect
Learn what Scanner Collect is, why it helps you build a scalable S3-based log data lake, and how it simplifies ingestion, indexing, and detection across your SaaS and cloud logs.
Scanner Collect helps you build a log data lake in Amazon S3 with minimal setup and no custom pipeline code. In a single afternoon, you can integrate dozens of log sources.
Logs are delivered into your S3 buckets as gzipped JSON files, partitioned by date (e.g. s3://your-bucket/logs/okta/2025/07/27/). Scanner can then optionally index these logs to enable full-text search and continuous detections.
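As a sketch, the daily partition prefix above can be computed like this. The bucket name and the `logs/<source>/` layout are taken from the example path, not a guaranteed scheme:

```python
from datetime import date

def partition_prefix(bucket: str, source: str, day: date) -> str:
    """Build the date-partitioned S3 prefix used for delivery,
    e.g. s3://your-bucket/logs/okta/2025/07/27/."""
    return f"s3://{bucket}/logs/{source}/{day:%Y/%m/%d}/"

print(partition_prefix("your-bucket", "okta", date(2025, 7, 27)))
# → s3://your-bucket/logs/okta/2025/07/27/
```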
Why Build a Log Data Lake?
Most traditional logging tools and SIEMs rely on stateful search clusters, which are costly and difficult to scale. Moving to a data lake architecture in S3 offers key advantages:
Orders-of-magnitude lower storage cost
Easy to scale from GBs to PBs
Native support for long-term retention
Your logs stay in your storage and under your control
Scanner Collect simplifies the ingestion layer of this architecture, handling API polling, file formatting, delivery to S3, and accelerated full-text search and detections.
The Modern Security Perimeter Is SaaS and Cloud
Security monitoring today means more than just collecting logs from endpoints or firewalls. The modern attack surface is defined by SaaS tools, identity providers, and cloud platforms—and each of them emits valuable audit logs.
Most of these tools provide audit logs in one of three ways:
Pull via API — e.g., Okta System Logs, Microsoft 365 Audit Logs, Slack Audit Logs
Push via Webhook or Log Forwarder — e.g., Wiz alert webhooks, Fluentd or Logstash pushing application logs
Drop to S3 — e.g., Cloudflare HTTP/DNS, CrowdStrike FDR, AWS CloudTrail, GitHub
Collecting audit logs from the services you rely on should be simple, fast, and repeatable. That’s the goal of Scanner Collect: to remove the boilerplate and let your team connect log sources in minutes—not weeks—so you can focus on what matters: detection, investigation, and incident response.
Log Source Types
Scanner Collect supports multiple ingestion methods, depending on how your source system makes logs available:
1. API Pull Sources
Scanner integrates with API-based log sources and periodically pulls logs into S3. It handles API pagination, authentication, and deduplication internally.
Examples:
Okta System Logs (via /api/v1/logs)
Google Workspace Admin Activity Reports (via the Reports API)
Slack Audit Logs (via the auditlogs.slack.com API)
Behavior:
Logs are fetched periodically (typically every 1–5 minutes)
Delivered to S3 as .json.gz files under a daily partitioned prefix
Each file is newline-delimited JSON, where each line is a single log event
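Reading one of these delivered files back is straightforward. A minimal sketch (the event field names below are invented for illustration, not Okta's actual schema):

```python
import gzip
import json

def parse_ndjson_gz(data: bytes) -> list[dict]:
    """Decompress a .json.gz file and parse one JSON event per line."""
    text = gzip.decompress(data).decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line.strip()]

# Simulate a delivered file containing two events (hypothetical fields).
raw = b'{"eventType": "user.session.start"}\n{"eventType": "user.logout"}\n'
blob = gzip.compress(raw)

events = parse_ndjson_gz(blob)
print(len(events))  # → 2
```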
2. HTTP Push Sources
Scanner can accept logs over HTTP, useful for tools that emit webhook events or for forwarding from log shippers.
Examples:
Alert webhooks from Wiz, Google Alert Center, Tines, Torq
Logs pushed from Fluent Bit, Logstash, or Vector
Behavior:
Scanner exposes a secure HTTP endpoint per source
Incoming payloads are batched and written to S3 in gzip-compressed JSON
Custom parsing/enrichment pipelines can be configured as needed
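A push client typically batches events as newline-delimited JSON and gzip-compresses the body before posting. The sketch below builds such a request payload; the content-type and auth header are assumptions about what a generic HTTP log sink accepts, not Scanner's documented API:

```python
import gzip
import json

def build_push_request(events: list[dict]) -> tuple[bytes, dict]:
    """Batch events as newline-delimited JSON and gzip-compress the body,
    mirroring how shippers like Fluent Bit or Vector post to an HTTP sink."""
    body = "\n".join(json.dumps(e) for e in events).encode("utf-8")
    headers = {
        "Content-Type": "application/x-ndjson",
        "Content-Encoding": "gzip",
        # Auth scheme is hypothetical; use whatever your endpoint requires.
        "Authorization": "Bearer <token>",
    }
    return gzip.compress(body), headers

payload, headers = build_push_request([{"alert": "test"}])
```

The compressed payload can then be sent with any HTTP client (e.g. `urllib.request`) to the per-source endpoint Scanner exposes.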
3. Custom S3 Sources
If logs are already being delivered to S3 (e.g., via third-party pipelines or vendor tooling), Scanner can index them in place.
Examples:
Cloudflare DNS and HTTP logs via S3
CrowdStrike Falcon Data Replicator
Sublime Security email logs
Behavior:
Scanner receives s3:ObjectCreated notifications when new files are written to S3
Files must be one of the supported types:
JSON
Parquet
CSV
Plaintext
Scanner indexes these files for search and detection.
Forward Elsewhere (Optional)
Scanner itself does not forward logs to other systems after ingestion. However, because your logs are stored in your own S3 bucket in a clean, consistent format, you’re free to set up your own forwarding pipelines.
Common approaches include:
Triggering AWS Lambda functions on new S3 object creation
Streaming from S3 to Kinesis, then into another SIEM (e.g., Splunk, Datadog)
Using AWS Glue or other ETL tools to load logs into downstream systems
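The Lambda-based approach above can be sketched as follows. The handler parses the standard s3:ObjectCreated notification structure; the `forward` callable is a placeholder for whatever downstream client you use (e.g. a Kinesis producer or a SIEM's HTTP API):

```python
def handler(event, context=None, forward=print):
    """AWS Lambda handler triggered by s3:ObjectCreated notifications.

    Extracts the bucket and key of each newly created object and hands
    the S3 URI to `forward` — a stand-in for your real downstream call.
    """
    uris = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        uri = f"s3://{bucket}/{key}"
        uris.append(uri)
        forward(uri)  # e.g. kinesis.put_record(...) or an HTTP POST
    return uris
```

Invoked with a sample notification, the handler returns the full S3 URIs of the new log files, which your pipeline can then fetch and forward.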
This flexibility is intentional. You retain full ownership of your log data and can integrate it into any part of your security stack without vendor lock-in.