stats()

stats([...fs]) by [...cols] produces an aggregation table.

stats computes the aggregations specified by [...fs] for each group of [...cols].

Any function that returns a single row and fixed number of columns can be used as an argument in fs:

  • avg()

  • count()

  • countdistinct()

  • max()

  • min()

  • percentile()

  • sum()

  • var()

Each f in fs can also be aliased with the as keyword, which will rename it in the output. If an argument is not renamed, it will be provided with a default name for legibility. This default name is not guaranteed to be stable, and queries should not depend on it. If a column needs to be referenced in a later function in the pipeline, please explicitly alias it using as.

fs may be empty. Regardless of whether any fs are provided, stats always calculates the count, e.g. stats by foo will aggregate counts by foo.

cols may be empty. If it is, stats will aggregate over the entire input datastream. For example, stats max(a), max(b) will calculate a single maximum of column a and a single maximum of column b across the whole dataset, producing a table with a single row.

Technical Notes

  • If the total size of the result set is less than 128MB, stats() will return all the groups, their exact counts, and the statistics for them.

  • If the total size of the result set is over 128MB, stats()will return a sampling of the groups, weighted for more-frequent groups, such that the returned result set is under 128MB.

    • Not all groups are guaranteed to be in the result; however, every group that is in the result is guaranteed to have its correct final value.

    • Note that the 128MB limitation is on the size of the in-memory representation, and may not correspond exactly to e.g. the size of the returned JSON in the API.

Returns

A table with one row for each distinct value of [...cols], with the following columns:

  • One column for each col of [..cols] provided, each named col and containing the value of col.

  • @q.count, containing the number of occurrences of that value.

  • At least one column for each f of [...fs], named according to the as field, or, if no as was used, named the same as f.

    • If f is a column name, or a function that returns only one column (e.g. sum, avg, count), then that column is added to the result

    • If f is a function that can return multiple columns (e.g. max), then all of those columns are added to the result

Examples

# a table with columns named `hostname`, `@q.count`, `elapsed_ms`, and `max(timestamp)`
* | stats max(elapsed_ms) as elapsed_ms, max(timestamp) by hostname

# a table with columns named `@q.count`, `mean`, `p50`, and `p90`
userIdentity.type: "IAMUser" and eventSource: "s3.amazonaws.com"
| stats count() as numReqs, userIdentity.arn by userIdentity.arn
| stats
    avg(numReqs) as mean,
    percentile(50, numReqs) as p50,
    percentile(90, numReqs) as p90

Last updated

Was this helpful?