# Ad hoc queries

## What is an ad hoc query?

An **ad hoc query** is a search query with a **start\_time**, an **end\_time**, a **max\_rows**, and **query**. It runs asynchronously in the background, and you can poll it periodically to check for results.

An ad hoc query is basically analogous to a query you make in the **Search** tab in Scanner.

* The results of an ad hoc query are tabular, consisting of columns and rows.
* The results table is limited in size to **max\_bytes** bytes, which is 128MB by default. Some aggregation functions also have additional memory-bounding behavior, which is documented on a per-function basis. Note that this refers to its internal memory representation, not necessarily the size of the returned JSON blob, which will be larger in most cases.
* The results table can have at most **max\_rows** rows, but may have fewer due to the memory size limitation. Note that aggregations like `groupbycount` and `stats` do not support this limit; use **max\_bytes** if you need to bound the size of your result set in such an aggregation.
  * If a table exceeds **max\_bytes**, then rows may be dropped to allow it to fit, regardless of the value of **max\_rows**. *This means that any paging queries should not check if the number of rows returned is equal to **max\_rows** when determining if there are additional pages in a time range; they should instead only terminate if a returned page is empty, or the range is exhausted.*
* If the query is over log events with no aggregations, the returned log events are always guaranteed to be contiguous, and also to be either the latest or earliest log events, depending on the value of **scan\_back\_to\_front**.
* The table will contain some internal metadata under the **@scnr** namespace, e.g. `@scnr.context_fields`, `@scnr.timestamp`, etc. These fields are *not* guaranteed to be stable, and may change without notice.

There are two ways to execute an ad hoc query: asynchronous and blocking.

## How to execute an asynchronous ad hoc query

<mark style="color:green;">**`POST`**</mark> `/v1/start_query`

To execute an asynchronous ad hoc query, you first create it via `POST /v1/start_query` request. The Scanner API will return the id of the query, which you can use to poll its status with `GET /v1/query_progress` requests.

**Body**

| Name                                                  | Type    | Description                                                                                                                |
| ----------------------------------------------------- | ------- | -------------------------------------------------------------------------------------------------------------------------- |
| `query` <mark style="color:red;">required</mark>      | string  | Query text                                                                                                                 |
| `start_time` <mark style="color:red;">required</mark> | string  | Start timestamp for the query (inclusive). The format of the timestamp is RFC 3339                                         |
| `end_time` <mark style="color:red;">required</mark>   | string  | End timestamp for the query (exclusive). The format of the timestamp is RFC 3339                                           |
| `max_rows`                                            | number  | Maximum number of rows to return. Default is 1000, max is 100000                                                           |
| `max_bytes`                                           | number  | Maximum number of bytes to allocate in memory for this query. Default and max are 134217728 (128MB), min is 1048576 (1MB). |
| `scan_back_to_front`                                  | boolean | Whether to scan from back (latest) to front (earliest). Default is `true`                                                  |

**Example**

```bash
curl $API_BASE/v1/start_query \
-H "Authorization: Bearer $SCANNER_API_KEY" \
-H "Content-Type: application/json" \
-X POST \
-d '{
  "query": "%ingest.source_type: \"aws:cloudtrail\" and sourceIPAddress: 174.23.51.122",
  "start_time": "2024-02-04T01:00:00.000Z",
  "end_time": "2024-02-04T01:30:00.000Z"
}'
```

**Response**

{% tabs %}
{% tab title="Success" %}
When the ad hoc query has been completed successfully, the response HTTP status code will be `200`, and the result will contain the ID of the ad hoc query that was just created.

```json
{ "qr_id": "37ccf932-42e7-4e2e-b21e-e9f67384bea7" }
```

{% endtab %}

{% tab title="Error" %}
If Scanner was unable to create the ad hoc query because the query parameters were invalid, the response HTTP status code will be `400`, and the response body will contain some information about the reason the query was rejected.

```json
{
  "error": "Failed to parse query: Type error at 4-7: Function missing arguments"
}
```

{% endtab %}
{% endtabs %}

## Check query progress

<mark style="color:blue;">**`GET`**</mark> `/v1/query_progress/{qr_id}`

Gets the current progress of the query with the supplied `qr_id`.

Users are expected to run `GET` requests periodically to check for query results. We recommend checking every 1 second.

**Parameters**

| Name                        | Type    | Description                                                |
| --------------------------- | ------- | ---------------------------------------------------------- |
| `show_intermediate_results` | boolean | Whether to return intermediate results. Defaults to `true` |

**Example**

<pre class="language-bash"><code class="lang-bash"><strong>curl $API_BASE/v1/query_progress/37ccf932-42e7-4e2e-b21e-e9f67384bea7?show_intermediate_results=false \
</strong>-H "Authorization: Bearer $SCANNER_API_KEY" \
-H "Content-Type: application/json"
</code></pre>

**Response**

{% tabs %}
{% tab title="In progress" %}
When the query is still in progress, the response HTTP status code will be 200, and the `is_completed` field will be `false`:

```json
{
  "is_completed": false,
  "results": {
    "column_ordering": [],
    "rows": []
  },
  "metadata": {
    "n_bytes_scanned": 8716223
  }
}
```

{% endtab %}

{% tab title="Complete" %}
When the query has completed successfully, the response HTTP status code will be 200, and the `is_completed` field will be `true`.

The `results` field will contain information you can use to render a table of results. The `columns` field is an array of the names of the columns in the results table, and the `rows` field is an array of JSON objects representing the rows.

```json
{
  "is_completed" true,
  "results": {
    "column_ordering": ["time", "@index", "raw_event"],
    "rows": [
      { "time": "2024-02-04T01:02:12.210Z", "@index": "global-cloudtrail", "raw_event": "..." },
      { "time": "2024-02-04T01:12:45.761Z", "@index": "global-cloudtrail", "raw_event": "..." },
      { "time": "2024-02-04T01:12:45.761Z", "@index": "global-cloudtrail", "raw_event": "..." },
      ...
    ]
  },
  "metadata": {
    "n_bytes_scanned": 90184761
  }
}
```

{% endtab %}
{% endtabs %}

## Cancel a query

<mark style="color:green;">**`POST`**</mark> `/v1/cancel_query/{qr_id}`

Cancels a previously-started query with the supplied `qr_id`. This stops the query from continuing to execute and frees up any resources associated with it.

**Example**

```bash
curl $API_BASE/v1/cancel_query/37ccf932-42e7-4e2e-b21e-e9f67384bea7 \
-H "Authorization: Bearer $SCANNER_API_KEY" \
-X POST
```

**Response**

{% tabs %}
{% tab title="Success" %}
Returns `204` with no body. This endpoint is idempotent; cancelling an already-cancelled, completed, or nonexistent query will also return `204`.
{% endtab %}
{% endtabs %}

## How to execute a blocking ad hoc query

<mark style="color:green;">**`POST`**</mark> `/v1/blocking_query`

To execute a blocking ad hoc query, you just issue a `POST /v1/blocking_query` request. The Scanner API will hold the request open until the query completes, or it will time out if the query takes longer than 300 seconds.

**Body**

| Name                                                  | Type    | Description                                                                                                                |
| ----------------------------------------------------- | ------- | -------------------------------------------------------------------------------------------------------------------------- |
| `query` <mark style="color:red;">required</mark>      | string  | Query text                                                                                                                 |
| `start_time` <mark style="color:red;">required</mark> | string  | Start timestamp for the query (inclusive). The format of the timestamp is RFC 3339                                         |
| `end_time` <mark style="color:red;">required</mark>   | string  | End timestamp for the query (exclusive). The format of the timestamp is RFC 3339                                           |
| `max_rows`                                            | number  | Maximum number of rows to return. Default is 1000, max is 100000                                                           |
| `max_bytes`                                           | number  | Maximum number of bytes to allocate in memory for this query. Default and max are 134217728 (128MB), min is 1048576 (1MB). |
| `scan_back_to_front`                                  | boolean | Scan from back (latest) to front (earliest). Default is `true`                                                             |

**Example**

```bash
curl $API_BASE/v1/blocking_query \
-H "Authorization: Bearer $SCANNER_API_KEY" \
-H "Content-Type: application/json" \
-X POST \
-d '{
  "query": "%ingest.source_type: \"aws:cloudtrail\" and sourceIPAddress: 174.23.51.122",
  "start_time": "2024-02-04T01:00:00.000Z",
  "end_time": "2024-02-04T01:30:00.000Z"
}'
```

**Response**

{% tabs %}
{% tab title="Success" %}
When the query has completed successfully, the response HTTP status code will be 200, and the `is_completed` field will be `true`.

The `results` field will contain information you can use to render a table of results. The `columns` field is an array of the names of the columns in the results table, and the `rows` field is an array of JSON objects representing the rows.

```json
{
  "is_completed": true,
  "results": {
    "column_ordering": ["time", "@index", "raw_event"],
    "rows": [
      {
        "row_id": { "t": "1707010142000000000", "s": "s": 4100182917 },
        "columns": { "time": "2024-02-04T01:02:12.210Z", "@index": "global-cloudtrail", "raw_event": "..." }
      },
      {
        "row_id": { "t": "1707010143000000002", "s": 1041072064 },
        "columns": { "time": "2024-02-04T01:12:45.761Z", "@index": "global-cloudtrail", "raw_event": "..." },
      },
      {
        "row_id": { "t": "1707010143000000000", "s": 2125240946 },
        "columns": { "time": "2024-02-04T01:12:45.761Z", "@index": "global-cloudtrail", "raw_event": "..." },
      },
      ...
    ]
  },
  "metadata": {
    "n_bytes_scanned": 90184761
  }
}
```

{% endtab %}

{% tab title="Timed out" %}
When the query times out, the response HTTP status code will be 504.
{% endtab %}
{% endtabs %}
