# stats()

`stats([...fs]) by [...cols]` produces an aggregation table.

`stats` computes the aggregations specified by `[...fs]` for each group of `[...cols]`.

Any function that returns a single row and fixed number of columns can be used as an argument in `fs`:

* `avg()`
* `count()`
* `countdistinct()`
* `max()`
* `min()`
* `percentile()`
* `sum()`
* `var()`

Each `f` in `fs` can also be aliased with the `as` keyword, which will rename it in the output. If an argument is *not* renamed, it will be provided with a default name for legibility. <mark style="background-color:yellow;">**This default name is**</mark><mark style="background-color:yellow;">**&#x20;**</mark>*<mark style="background-color:yellow;">**not**</mark>*<mark style="background-color:yellow;">**&#x20;**</mark><mark style="background-color:yellow;">**guaranteed to be stable, and queries should not depend on it.**</mark> If a column needs to be referenced in a later function in the pipeline, please explicitly alias it using `as`.

`fs` may be empty. Regardless of whether any `fs` are provided, `stats` *always* calculates the count, e.g. `stats by foo` will aggregate counts by `foo`.

`cols` may be empty. If it is, `stats` will aggregate over the entire input datastream. For example, `stats max(a), max(b)` will calculate a single maximum of column `a` and a single maximum of column `b` across the whole dataset, producing a table with a single row.

## Technical Notes

* If the total size of the result set is less than 128MB (or `max_bytes` if using the API), `stats` will return all the groups, their exact counts, and the statistics for them.
* If the total size of the result set is over 128MB (or `max_bytes` if using the API), `stats`will return a sampling of the groups, weighted for more-frequent groups, such that the returned result set is under 128MB.
  * Not all groups are guaranteed to be in the result; however, every group that *is* in the result is guaranteed to have its correct final value.
  * Note that the 128MB limitation is on the size of the in-memory representation, and may not correspond exactly to e.g. the size of the returned JSON in the API.

## Returns

A table with one row for each distinct value of `[...cols]`, with the following columns:

* One column for each `col` of `[..cols]` provided, each named `col` and containing the value of `col`.
* `@q.count`, containing the number of occurrences of that value.
* At least one column for each `f` of `[...fs]`, named according to the `as` field, or, if no `as` was used, named the same as `f`.
  * If `f` is a column name, or a function that returns only one column (e.g. `sum`, `avg`, `count`), then that column is added to the result
  * If `f` is a function that can return multiple columns (e.g. `max`), then all of those columns are added to the result

## Examples

```python
# a table with columns named `hostname`, `@q.count`, `elapsed_ms`, and `max(timestamp)`
* | stats max(elapsed_ms) as elapsed_ms, max(timestamp) by hostname

# a table with columns named `@q.count`, `mean`, `p50`, and `p90`
userIdentity.type: "IAMUser" and eventSource: "s3.amazonaws.com"
| stats count() as numReqs, userIdentity.arn by userIdentity.arn
| stats
    avg(numReqs) as mean,
    percentile(50, numReqs) as p50,
    percentile(90, numReqs) as p90
```
