When to use it

Once high-volume log sources start to increase costs dramatically, we recommend moving these logs to a data lake in S3 - and indexing them with Scanner for fast search.

Problem - modern log scale can become unsustainable

Scanner was designed to solve the problem of modern log scale. In our opinion, traditional log management tools and SIEMs become far too expensive once logs reach high volume.

If you are ingesting 100GB of logs per day into a traditional SIEM, you might be spending on the order of $100k per year. This is somewhat pricey, but not too terrible.

However, as your company grows, it's very easy to reach the point where you are generating 1TB of logs per day. At the same rough rate (roughly $1,000 per year for each GB/day of ingestion), this can cost $1M per year in traditional SIEM tools. This is extremely painful.

At this scale, teams often split their logs into two categories: low volume log sources and high volume log sources.

In many environments, roughly half of the total ingestion volume comes from only 3-5 high volume log sources, such as web application firewall logs, VPC flow logs, CloudTrail logs, and Cloudflare DNS and HTTP logs.

These high volume logs tend to be less critical, but they are still incredibly helpful for investigations and detecting threats.

Solution - Move high volume logs to a data lake, and index the data lake with Scanner

Here's what we propose. Teams can continue to ingest their low volume log sources into their traditional SIEM, but they should move their high volume logs to a data lake in S3.

They can then use Scanner to index their data lake for fast search from Scanner's UI.
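As a rough illustration of the data lake side, here is a minimal sketch (in Python with boto3) that writes a batch of high volume log events to S3 as gzipped JSON lines. The bucket name, key prefix, and event fields are hypothetical placeholders; in practice, the object layout should match the S3 import rules you configure in Scanner.

```python
# Minimal sketch: ship a batch of high volume log events to an S3 data lake
# as a gzipped JSON-lines object. Bucket name and key prefix are hypothetical;
# align them with the S3 import rules you configure in Scanner.
import datetime
import gzip
import json

import boto3

s3 = boto3.client("s3")

def ship_batch(events, bucket="my-log-data-lake"):
    """Write a batch of log events as one gzipped JSON-lines object in S3."""
    now = datetime.datetime.utcnow()
    key = f"vpc-flow/{now:%Y/%m/%d}/{now:%H%M%S}.json.gz"  # hypothetical prefix
    body = gzip.compress(
        "\n".join(json.dumps(event) for event in events).encode("utf-8")
    )
    s3.put_object(Bucket=bucket, Key=key, Body=body)
    return key

# Example usage:
# ship_batch([{"src_addr": "10.0.0.1", "dst_addr": "10.0.0.2", "action": "ACCEPT"}])
```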

Cost improvement

Before Scanner

Here is how the costs change. Before Scanner, the bill for ingesting 1TB of logs per day into the SIEM is around $1M/year: low volume logs account for $500k of the cost, and high volume logs account for the other $500k.

|  | # of log sources | Ingest volume | Ingest cost |
| --- | --- | --- | --- |
| Low volume log sources in Traditional SIEM | 25-100 log sources | 500GB/day | $500k/year |
| High volume log sources in Traditional SIEM | 3-5 log sources | 500GB/day | $500k/year |
| Total |  | 1TB/day | $1M/year |

After Scanner

After moving high volume logs to an S3 data lake and indexing them with Scanner, the cost of high volume logs drops by 80% to $100k per year, reducing the overall cost from $1M per year down to $600k per year.

|  | # of log sources | Ingest volume | Ingest cost |
| --- | --- | --- | --- |
| Low volume log sources in Traditional SIEM | 25-100 log sources | 500GB/day | $500k/year |
| High volume log sources in data lake indexed by Scanner | 3-5 log sources | 500GB/day | $100k/year |
| Total |  | 1TB/day | $600k/year |

By moving high volume logs to a data lake and indexing them with Scanner, overall costs are reduced by 40%, which can free up meaningful budget for other projects.
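To make the arithmetic explicit, here is a small sketch that reproduces the figures above; the dollar amounts are the illustrative estimates from the tables, not output from a pricing calculator.

```python
# Reproduce the cost comparison above using the illustrative figures from the tables.
before = {"low_volume_siem": 500_000, "high_volume_siem": 500_000}
after = {"low_volume_siem": 500_000, "high_volume_scanner": 100_000}

total_before = sum(before.values())   # $1,000,000/year
total_after = sum(after.values())     # $600,000/year

high_volume_savings = 1 - after["high_volume_scanner"] / before["high_volume_siem"]
overall_savings = 1 - total_after / total_before

print(f"High volume cost reduction: {high_volume_savings:.0%}")  # 80%
print(f"Overall cost reduction: {overall_savings:.0%}")          # 40%
```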

What are the tradeoffs?

Moving high volume log sources out of a traditional SIEM and into an S3 data lake indexed by Scanner can produce significant cost savings while keeping search in Scanner fast, but there are some practical tradeoffs to consider.

  • Queries are limited to what Scanner's query language supports, which may differ from the query language of your prior log tool. For more information on the kinds of queries Scanner supports, see:

    • Query Syntax

    • Aggregation Functions
