Pipeline Stages#

Stage $group#

Example {"$group":{"group_cxs": ["c0"], "c1": "sum"}}

The $group stage will merge multiple rows together that have the same (compound-)key. Different operators are available to define how the individual values should be merged. Note that this can also be used to add new columns to the row.

  • Parameter group_cxs: (list of strings) defines the compound key columns .e.g [“c0”] or [“c0”, “c1”]

  • Parameter c0-99: (string (operator)) defines how the values should be merged, if omitted a single value of a random row will be used.

  • Merge Operator first: the default operator take the value form the first matched row (rows are not sorted).

  • Merge Operator sum: creates a sum of all number values

  • Merge Operator avg: creates a average of all number values

  • Merge Operator count: counts the merged rows

  • Merge Operator min: provides the minimum of all number values

  • Merge Operator max: provides the maximum of all number values

  • Merge Operator bool_and: and operator, all values in the column must be bools or null

  • Merge Operator bool_or: or operator, all values in the column must be bools or null

  • Merge Operator list: builds a list of the values

  • Merge Operator list_unique: builds a list of the values, but values in the list will be unique

  • Merge Operator merge: for subtables, merges the subtables into one

Stage $filter#

Example {"$filter":{"c2": "summer"}}

A filter as known. Syntax is checked, only c0-99 are allowed as keys. Allowed operators are "$eq", "$in", "$gt", "$gte", "$lt", "$lte", "$regex", "$ne"

  • Parameter match_any (bool) if set, all defined conditions in the filter are combined with an $or operator.

    • Example: match all c0 smaller than -10 or larger than +10

    • {"$filter":{"match_any": true, "c0": {"$gt": 10, "$lt": -10}}

The $filter stage allows the usage of placeholders. Placeholders are defined in the filter with double $$ signs. The pipeline_placeholders parameter must be specified to define the values:

{
    "pipeline_placeholders": [
        {
            "display_name": "area codes",
            "description": "all the area codes to export",
            "display_type": "list",
            "placeholder": "$$area_codes",
            "value": ["534", "523"]
        }
    ]
}

A filter example that uses a placeholder could look like this: {"$filter":{"c0": {"$in": "$$area_codes"}}}

  • Note: only the placeholder and value properties are used by the get_rows api, the other fields are recommendations for storage in exporter presets.

Stage $group_subtable#

This is very similar to the regular $group stage, but operates on subtables instead. The only additional parameter is:

  • Additional Parameter subtable_cx (string) the cx where the subtable ist stored.

  • Parameter subtable_col_count: (integer) The number of columns in the subtable. REQUIRED IF the subtable_cx is NOT an actual subtable of the main table, but the result of a previous stage.

Example: {"$group_subtable": {"subtable_cx": "c3", "group_cxs": ["c0", "c3"], "c1": "sum"}}

Stage $filter_subtable#

The same as a regular $filter stage, but can remove rows from subtables instead.

  • Additional Parameter subtable_cx (string) the cx where the subtable ist stored.

Example (with placeholder too): {"$filter_subtable": {"subtable_cx": "c3", "filter": {"c2": "$$product_type"}}}

Stage $expand#

Example: {"$expand": {"subtable_cx": "c4"}}

Produces ‘main rows’ from subtable rows. Takes all Subtablerows from all input rows, and returns one row per subtable row. All main row data is discarded and further processing can be done on the subtable rows as if they were regular rows.

  • Parameter subtable_cx (string) the cx of the subtable to expand.

  • Parameter subtable_col_count: (integer) The number of columns in the subtable. REQUIRED IF the subtable_cx is NOT an actual subtable of the main table, but the result of a previous stage.

Stage $collapse#

Example: {"$collapse": {"c1": "sum"}}

Very similar to $group, but will merge all input rows into one. All parameters and operators are the same, except that group_cxs doesn’t exist in this stage.

Stage $project#

Keep only some of the columns for further processing. Can be used to improve performance.

Example {"$project": {"keep": ["c0", "c1", "c2", "c3", "c4"]}}

  • Parameter keep (list of cx), these columns well be kept as-is, all other will be removed.

Stage $lookup#

Looking up documents from another table. You can define a column where the output of the lookup should reside.

Example {"$lookup": {"c3": {"foreign_table_id": "66BDC0A1336741D0C05F350311F269D9", "foreign_cx": "c0", "local_cx": "c1"}}}

  • Parameter cx: where to put the result

  • Parameter foreign_table_id: id of the other table

  • Parameter local_cx: cx in the main table that contains a key

  • Parameter foreign_cx: cx in the other table that has to match the key

  • Parameter filter: same style filter as the other filter stages, will be applied to the foreign table before matching objects to the local documents

Stage $unpack#

Retrieve a single value from a subtable and put it in the main row. The first entry in the subtable is used to extract the value. This is most useful when the subtable contains only one row and you want to use a value from that in the main row.

Example {"$unpack": {"c4":{"subtable_cx": "c99", "unpack_cx": "c2"}}} takes value c2 from the subtable’s (c99) first row and puts it in c4

  • Parameter cx: where to put the result

  • Parameter subtable_cx: where the subtable resides

  • Parameter unpack_cx: which column to take form the subtable

Stage $addFields#

Specify a source and target cx, so new field(s) will be added to the rows. Source cx may be from a main row, while target cx can be either a main row cx, or with dot notation a subtable field (cx.cy). When a subtable field is used as target, all rows in the subtable will receive the identical new value. More than one target/source pair can be added in a single stage.

Example 1 {"$addFields": {"c4.c3": "c2"} adds new field c3 to subtable at c4 with value from main rows c2

  • Parameter cx (1): where to put the result

  • Parameter cx (2): where to get the value from

addFields may also accept any mongodb expression inside an object instead of the c2 of the example. E.g. the following is valid:

Example 2:

{
    "$addFields": {
        "c4": {
            "$cond": { "if": { "$gte": [ "$c1", 250 ] }, "then": true, "else": false }
        }
    }
}

This will set c4 to true or false depending on the value of c1

Stages $multiply, $divide, $subtract, $add, $min, $max, $avg, $size, $sum#

Note

These will be executed in mongodb as opposed to the final_compute stage.

Note

Pay attention to the two different command input syntax styles, and which command allow which style.

Usage style list of colunns as input:

  • Example: {"$sum": {"c99": ["c4", "c5"]}}

  • assuming c4 and c5 are numbers, builds the sum and then puts result in c99

Usage style single column as input (this column should be a list of numbers):

  • Example: {"$sum": {"c98": "c56"}}

  • assuming c56 is an array of numbers, builds the sum of them and puts the result in c98

$multiply: exactly two values

  • Only list of colmns style is allowed. Exactly two columns must be specified.

  • Example: {"$multiply": {"c99": ["c4", "c5"]}}

$divide: exactly two values

  • Only list of colmns style is allowed. Exactly two columns must be specified.

  • Example: {"$divide": {"c99": ["c4", "c5"]}}

$subtract: exactly two values

  • Only list of colmns style is allowed. Exactly two columns must be specified.

  • Example: {"$subtract": {"c99": ["c4", "c5"]}}

$add: exactly two values

  • Only list of colmns style is allowed. Exactly two columns must be specified.

  • Example: {"$add": {"c99": ["c4", "c5"]}}

$min: finds minimum value in the input

  • BOTH types list of columns and single column are supported.

  • Example 1: {"$min": {"c99": ["c4", "c5"]}}

  • Example 2: {"$min": {"c99": "c6"}}

$max: finds maximum value in the input

  • BOTH types list of columns and single column are supported.

  • Example 1: {"$max": {"c99": ["c4", "c5"]}}

  • Example 2: {"$max": {"c99": "c6"}}

$avg: finds average value of all input numbers

  • BOTH types list of columns and single column are supported.

  • Example 1: {"$avg": {"c99": ["c4", "c5"]}}

  • Example 2: {"$avg": {"c99": "c6"}}

$sum: build the sum over all input numbers

  • BOTH types list of columns and single column are supported.

  • Example 1: {"$sum": {"c99": ["c4", "c5"]}}

  • Example 2: {"$sum": {"c99": "c6"}}

$size: returns the number of elements in a list

  • ONLY single column style supported.

  • Example: {"$size": {"c99": "c6"}}

$concatArrays: performs a horizontal merge of subtables into a new column

  • Only list of colmns style is allowed. At least two columns must be specified.

  • Example 1: {"$concatArrays": {"c99": ["c4", "c5"]}}

  • Note: c4 and c5 in the example would contain a subtable

  • Note: The target column c99 must be different that the source columns

  • Note: It’s best practice to merge only subtables of the same type or the result data will be difficult to work with

Note

In the above examples assume that: c4/c5 = numbers and c6 = list of numbers

Stage $sort_limit#

Does what you would expect. It is also possible to define the apis limit and skip parameters instead, they will be added as a pipeline step at the end of the mongodb pipeline.

Example {"$sort_limit": {"sort_by": "c0", "sort_dir": 1, "limit": 12}}

  • Parameter sort_by: sort by this cx

  • Parameter sort_dir: 1 or -1

  • Parameter limit: only return this number of documents

Stage $final_sectionize#

Build sections from the rows. The concept is very similar to the $group stage with more customizable group-key and of course rows are not removed, but a header row is added which may contain aggregated information. Is executed on the API server instead of mongodb, so it must be one of the last stages. Example:

{
    "$final_sectionize": {
        "header_position": "top",
        "group_key_expr": "c0 + $substr(c1, '[:-2]')",
        "header_row_def":{
            "c0":  {"compute": "$$headline.c3"},
            "c10": {"op": "first", "source_cx": "c0"},
            "c11": {"op": "first", "source_cx": "c1"},
            "c2":  {"op": "sum",   "source_cx": "c2"},
            "c13": {"compute": "$$group_key"},
            "c14": {"op": "count", "compute": "'Rows in this section: ' + c14"}
        },
        "process_filter": {"c41": true},
        "headline_placeholder": {
            "source_table_id": "63D3C13503252AEAAD660B877C66F171"
            "filter": {"c123": true},
            "group_key_expr": "c0 + c1"
        }
    }
}
  • Parameter header_position: either ‘top’ or ‘bottom’

  • Parameter group_key_expr: an expression, as used e.g. in the $final_compute stage. Result is used as a key to group rows together in one section.

    • Note that you can use the expression to combine multiple columns of the row to a single key

  • Parameter header_row_def: Construction of the header row. Operators are similar to the group stage:

    • op: operator defines how the rows in this section should be merged for each cx individually. (optional if compute parameter is used)

      • Operator first: take the value of the first row in the section

      • Operator sum: build the sum over a column for all rows in the section

      • Operator count: count the rows in the section

    • source_cx: take this cx from the rows in the section, merge with op. (optional if compute parameter is used)

    • compute: (optional) acts like the final_compute. Is applied after all the merging of the sections rows is completed, so any cx of the header row can be used in computations.

      • Additional Placeholder $$headline.cx this placeholder comes from the headline_placeholder definition.

      • Additional Placeholder $$group_key the group key of this section

      • Note: the target cx can be used here as well.

  • Parameter process_filter: a filter with the same syntax as the $filter stage. If a row in the section doesn’t match the filter it will be ignored by the merge operators.

    • Note: when using multiple stages this is a typical pitfall that may cause the header positions to be ‘wrong’. The table rows will be processed top-down and whenever a new group key is found a header is added. When a row is ignored the group key will not be built, so the header will be added after that row (in bottom mode). (which is correct)

  • Parameter headline_placeholder: Provides the additional placeholder $$headline for the final_compute_header parameter. A row from another table will be loaded for each section header. The rows will be matched to the individual header rows through their group key, so it acts a little similar to the lookup stage.

    • source_table_id: Table where the headline rows should be loaded from

    • filter: Mongodb filter to apply to the table.

    • group_key_expr: How the group key should be built for these headline rows. Results should match with the regular group keys.

    • This provides a placeholder $$headline which can then be used in the final_compute_header

Note

If you want to build multiple sections (sub-sub sections/main-sections) you have to add this stage multiple times to the pipeline. You may want to add a bool constant to each header with final_compute_header, so you can use this for filtering with the merge operators.

Note

The resulting header rows are treated as regular rows in any further pipeline stage.

Stage $final_compute#

This is not translated to a mongodb stage, since it is difficult to build and all the neccessary data has been loaded into the python environment anyway it is done there. The drawback is, that these stages have to be the last in the pipeline.

  • allow computations on the row data

  • can be used to add new columns

  • works on each row as a data-set (although additional data lookup operators may be implemented too)

Example {"$final_compute": {"c6": "$sum($col($filter(c3 + c4, {'c0':'spoon'}), c1))"}}

This example combines the subtables c3 and c4, applies a filter to retrieve only the items of type spoon, then takes only the c1 col as list, which is an amount, and finally builds the sum over this list.

Note

See Operator: column-values for a list of available operators for this stage.

Example for $final_compute#

Consider a table structure like this (but assume we have a Box2 in c4):

../../../_images/box_with_spoons.png
  • Question: How many spoons are in Box1 and Box2 combined?

    • $final_compute: "$sum($col($filter(c3 + c4, {'c0':'spoon'}), c1))"

  • Question: How many tableware positions are in Box1

    • $final_compute: "$count($col($filter(c3, {'c3':'tableware'}), c0))"

Stage $final_map#

Allow easy re-ordering of the result.

  • Parameter c0-99: (cx) defines mapping with source-cx: target-cx

  • Parameter exclusive: (bool, optional, default false) if true, only the mentioned colunmns will be in the output.

Example {"$final_map": {"c1": "c2", "c2": "c5", "exclusive": true}}

This example produces a c1 from previous c2 and a c2 from c5. All other columns will be excluded, due to the exclusive parameter.