RFC: Semistructured Columns #54864

Closed

alexey-milovidov opened this issue Sep 21, 2023 · 32 comments · Fixed by #66444

@alexey-milovidov
Member

alexey-milovidov commented Sep 21, 2023

(This is an alternative to the experimental JSON data type, aiming to replace it and address its drawbacks).

Implementation proposal

1. Dynamic columns.

A table (IStorage) can declare that it supports dynamic columns (bool canHaveDynamicColumns()).
If this is the case, a query can reference columns or subcolumns not necessarily contained in the table's definition.
The table should provide a method (e.g., getColumnType) to dynamically obtain the type of a column or subcolumn (or expression?).

As a bonus and a demo, we can allow a variant of Merge table that will dynamically extend if new tables appear.
This Merge table can even have an empty table structure (no fixed columns).
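
For illustration, a hypothetical sketch of this bonus feature (the empty-structure form and the table names are illustrative, not existing syntax):

create table all_events engine=Merge(currentDatabase(), '^events_');
-- No fixed columns: referencing a column triggers dynamic resolution
-- via getColumnType() over the currently matching tables.
select user_id from all_events;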

2. Variant data type and column.

A new column type that works as a discriminated union of nested columns. For example, in Variant(Int8, Array(String)), every value is either Int8 or Array(String). It is serialized into multiple streams: a stream for every option and a stream with option numbers (the discriminator).

This column can be accessed in the following ways:

  • with type conversion: c::Int8: we will read every subcolumn and convert where needed.
  • with tuple-like access: c.Int8: we will read only the requested subcolumn, as Nullable (the Nullable wrapper is arguably unnecessary).
  • with implicit conversion: c: if the column is requested as is, we will find the least common type of the variant types (and throw an exception if it does not exist) and convert to it.
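
For example, the three access modes could look like this (a minimal sketch, assuming a column c Variant(Int8, Array(String)) in table t):

select
    c::Int8,  -- read every subcolumn and convert to Int8
    c.Int8,   -- read only the Int8 subcolumn, as Nullable(Int8)
    c         -- convert to the least common type of the variants, if it exists
from t;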

3. JSON data type.

A column with JSON data type works as an opaque placeholder for dynamic subcolumns.

It is serialized into multiple streams: the first one is an index, enumerating every unique path in JSON, along with their types, and streams for every variant type for every path.

A user can insert JSON as a String; we then convert it to JSON by deriving the data types of all paths as variants.
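
A minimal sketch of that flow (table and column names are illustrative):

create table logs (json JSON) engine=MergeTree order by tuple();
-- The String value is parsed, and the types of paths a and b.c are derived as variants.
insert into logs values ('{"a" : 1, "b" : {"c" : "str"}}');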

Different types inside variants are not converted on merges. Merges can only extend the variant data types by collecting more types inside them. The types are only converted on INSERT and SELECT.

If there is a JSON data type in a table, the table should declare that it supports dynamic columns. Querying any column not present in the static schema will trigger reading the "index" stream of every JSON column during query analysis. This is done lazily (on demand), but tables can subsequently cache this information in the in-memory metadata of data parts.

Distributed and Merge tables, as well as Views, can trigger additional calls when asked about dynamic columns.

Column-level RBAC does not apply to subcolumns.

4. Explicit conversions.

When a user references a subcolumn from JSON with explicit type conversion, e.g., SELECT json.CounterID::UInt32, the table will report that the column json.CounterID::UInt32 exists and has type UInt32, without looking inside the data. If a user references it at table creation, e.g., CREATE TABLE ... ORDER BY json.CounterID::UInt32, the expression should be successfully resolved.

If the column json.CounterID::UInt32 is requested, it should be propagated to reading, and the table will return the data already converted to UInt32, or throw an exception.
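
A sketch of both cases (table and column names are illustrative):

-- Resolved as UInt32 without looking inside the data:
select json.CounterID::UInt32 from hits;

-- The expression must resolve successfully at table creation:
create table hits (json JSON) engine=MergeTree order by json.CounterID::UInt32;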

5. Limitations on the number of subcolumns.

We will not introduce additional part formats and will use the wide or compact format as usual.

But if the number of paths in JSON is larger than the limit specified in a table setting (e.g., 1000), the remaining path prefixes will be recollected into JSON and written as String. The first paths are selected in first-come order.

We should implement an optimization for vertical merge to support it for subcolumns, and most likely, it will be a special case in the code.

6. Hints.

The JSON data type can have parameters, specifying hints for data types of particular subpaths, or the fact that a particular subpath should not be written, e.g., JSON(http.version.major UInt8, SKIP body.raw, SKIP tls.handshake.random). We could also implement support for ALTERs.
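
For example, a table definition with hints could look like this (a sketch of the proposed syntax; names are illustrative):

create table requests
(
    json JSON(http.version.major UInt8, SKIP body.raw, SKIP tls.handshake.random)
)
engine=MergeTree order by tuple();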

Motivation

If we implement only №1 and №2 and fail to implement the JSON data type, we will still gain a lot.

The current implementation of the JSON data type is experimental, and it is not too late to remove it. It has the following drawbacks:

  • eager creation of subcolumns: all subcolumns have to be present during query analysis, instead of only the queried ones, which slows down queries; having dynamic subcolumns should solve this problem;
  • eager type unification: for example, if a field was an integer in JSON and you inserted a single object where it is a string, the table will be poisoned forever; having the Variant data type and support for explicit type specification should solve this problem;
  • eager type unification leads to corner cases with empty objects, nulls, unification between arrays of different dimensions, etc.; these cases are either not supported or have bugs;
  • it's not possible to use subcolumns from JSON in table indices or reference them in other columns;
  • if the number of unique paths in JSON grows too much, it becomes unusable;
  • it duplicates the logic of schema inference, although we could reuse more of it.

See also

Amazon ION data format and PartiQL query language.
Sneller query engine and its zION data format.

@Avogar
Member

Avogar commented Oct 10, 2023

I will start working on Variant data type.

@candiduslynx
Contributor

I'd like to point out one of the use cases we had in #46190, too.

Specifically, it would be great to be able to perform

INSERT INTO tbl VALUES ('b', '[{"x":{"y":"z"}}]')

Additionally, since a JSON value can represent both a map and a struct, it would be interesting to see more details on how these two will be reconciled. Will it be both types at the same time?

And if we already have the value [{"x":{"y":"z"}}] in the table, will writing [{"x":{"a":"b"}}] extend the struct option for x?

@ernado
Contributor

ernado commented Nov 30, 2023

Please also consider supporting a binary representation of JSON (as a column type with related functions for manipulation), like PostgreSQL's jsonb, YTsaurus's YSON, or MongoDB's BSON.

This is slightly off-topic, but somewhat relevant, and could probably improve efficiency for unstructured data.

@alexey-milovidov
Member Author

  • YSON is not a binary format for JSON; it exists only because they needed their own, different data model instead of JSON's. Frankly speaking, their decision is reasonable (JSON has no binary strings and no date/time data types, but that is another story).
  • BSON from MongoDB does not improve processing efficiency, or does so only marginally.
  • PostgreSQL's JSONB improves data processing in some cases (long JSONs), but it is not relevant when you still have to decompress, load the entire values into memory, and move the data around.

There is already prior work on that:
#6814
#7426

TL;DR: implementing a binary JSON format is out of scope for this task.

@Avogar
Member

Avogar commented Dec 18, 2023

I have a proposal for a slightly different implementation of the JSON type. In my opinion, it will simplify the implementation and be quite natural:

  1. Implement a new type called Dynamic.

This new type will represent a single column with a dynamic type. It means that we can insert values of any type into a Dynamic column. For example:

create table test (d Dynamic) engine=...;
insert into test select 'Hello, World!';
insert into test select 42;
insert into test select [1,2,3];
...

Inside the Dynamic column, we will store the current data type and a single Variant column. Initially it will contain an empty Variant() that can store only NULL values (in my implementation of Variant, it stores NULL values under a separate special discriminator). During insert into a Dynamic column, we will update the current type and extend this Variant column with new variants. This extension will be free; we won't need to rewrite any data (my Variant implementation supports free extension). During merges we will also just extend the inner Variant column with new variants. If the number of variants exceeds the limit (currently 255), we will cast the rarest types into a single String column (but I think in real use cases we won't have more than 255 different data types).

The Dynamic column will support subcolumns like the Variant column:
dynamic.UInt32, with type Nullable(UInt32), will read the .UInt32 subcolumn from all inner variants. If none of the variants have this subcolumn, we will return a Nullable(UInt32) column filled with NULLs.
As with Variant, we will be able to use CAST:
dynamic::UInt32, in which case we will just apply CAST to each inner variant column.
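
A small sketch of both access modes on the table above:

select
    d.UInt32,  -- Nullable(UInt32): NULL where d holds another type
    d::String  -- CAST applied to each inner variant
from test;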

For this new type of column we won't need to know all the inner Variant types during query analysis, because:

  • When we select a Dynamic column as is or use CAST, we just use the Dynamic type as is.
  • When we read subcolumns like dynamic.UInt32, we know the type of this subcolumn by its name - Nullable(UInt32).

And this new Dynamic type can be a separate feature, not related to JSON (but we can use it in JSON implementation, see below).

  2. Use Dynamic type in JSON type instead of Variant.

In the JSON type, we can just use a Dynamic column for each path by default instead of the Variant type. So in the JSON column we will store just a unique set of paths (maybe 2 sets: the paths that are stored separately, and the paths that are stored together in a String column as JSON; we need to think about how we are going to store paths) and a Dynamic column for each path. When we request some path from the JSON type, it will be just a column with Dynamic type, so we can request its subcolumns or use CAST.
What are the advantages of using Dynamic type:

  • We won't need to know the Variant types of paths, because they will be hidden in the Dynamic type, so we don't need to look at all parts during query analysis.
  • It simplifies the implementation of the JSON type a lot, because all the logic of tracking path types, merging different Variants, inserting data with different types, etc. will be inside the Dynamic type. The only thing the JSON type will be responsible for is tracking unique paths and converting them to String when we exceed the maximum number of paths. Also, sets of unique paths can be serialized in parts similar to LowCardinality dictionaries (and we don't need to think about how to store path types; the Dynamic column will do it).
  • I think we won't need any special logic in merges for JSON columns.
  • We will have one more nice feature: the Dynamic column (but we can make it an internal type if we don't want to allow users to use it).

I don't see any big disadvantages in this approach, and at the same time I think it will simplify the implementation. There will be one difference: if we request a JSON path that doesn't exist in any part, we will not throw an exception saying that this path doesn't exist, but return a column filled with NULLs. But since we are dealing with dynamic data, I think this may be even more natural.
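
A sketch of that difference (the path name is hypothetical):

-- No exception, even if no part contains this path:
select json.path.that.was.never.inserted from t;  -- a Dynamic column filled with NULLs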

@alexey-milovidov WDYT?

@UnamedRus
Contributor

The only thing JSON type will be responsible for - tracking unique paths and converting them to String when we exceed the maximum number of paths.

It's quite possible (and common) that JSON has a lot of unique paths (for example, when a unique user id is used as a key).

@CurtizJ
Member

CurtizJ commented Dec 20, 2023

With such an implementation, I see two problems for now:

  1. It's unclear how to select the whole JSON column. It should actually be possible by reassembling it from the existing paths, but that may be complicated. It's even less clear how to select an intermediate object from a JSON column, but we can afford not to support this case.
  2. Storing Nested arrays will be inefficient because we will store the sizes of arrays for each element (path in JSON).

Also, from your description it's unclear how to store and merge data types in the Dynamic column, but I think I understood.

@Avogar
Member

Avogar commented Dec 20, 2023

It's unclear how to select the whole JSON column. It should actually be possible by reassembling it from the existing paths, but that may be complicated. It's even less clear how to select an intermediate object from a JSON column, but we can afford not to support this case.

Can you explain it in more detail? I don't fully understand what you mean by selecting the whole JSON column and selecting the intermediate object.
If you mean just selecting the column with type JSON, like select json from my_table, then what is unclear? We will have a representation of this column in memory, we will have serialization for it, so we can display this column, and maybe we will have some functions working with the JSON type. And how is it different from the case when we use just Variant instead of the Dynamic type? Maybe I don't understand something.

Storing Nested arrays will be inefficient because we will store the sizes of arrays for each element (path in JSON).

I guess you mean arrays of nested JSON objects. Actually, I think we can just not store an array of JSON objects as a set of nested paths stored as arrays; we can store the path to this array as a Dynamic column, and it will internally have type Array(JSON) inside a Variant. So, if we have data {"obj" : {"arr" : [{"a" : 1, "b" : 2}, {"a" : 3, "c" : 4}]}}, path obj.arr will have type Dynamic, and its subcolumns can be selected using obj.arr.Array(JSON).a. I also think it will be better because path obj.arr can sometimes be not an array of objects but, for example, an array of strings or numbers:

{"obj" : {"arr" : [{"a" : 1, "b" : 2}, {"a" : 3, "c" : 4}]}}
{"obj" : {"arr" : [1, 2, 3]}}

In this case it will be more natural to have path obj.arr and a Dynamic type with inner type Variant(Array(Int64), Array(JSON)).
Maybe it will not be as optimal, but it is much easier to implement, so we will have fewer bugs (as we had in the current Object implementation). Maybe we can improve it later.

UPD: even if we want to store arrays of objects as paths with arrays that share offsets, what is the difference between using the Variant and Dynamic types? In the first case we will have an array of Variants, in the second an array of Dynamic columns; I don't see the problem here. I guess if we want to support inserting an array of JSON objects into a JSON column, we should do it this way, with shared offsets, but I don't know yet how we are going to implement it, because as we don't use Tuples, we will need to track shared offsets inside the JSON column somehow during serialization/deserialization.

Also, from your description it's unclear how to store and merge data types in the Dynamic column, but I think I understood.

The type will be the same across a single part, so we can just serialize its name in serializeBinaryBulkStatePrefix.
Merging will be quite easy; it will be done inside the column implementation. When we insert something into a column with Dynamic type from another Dynamic column, we compare their types, extend our type if needed, and then insert. Sorry if my explanation is bad.

So, I would say the only difference from the approach with just the Variant type for paths is that we won't need to know the Variant type and all the paths in advance to be able to read subcolumns, and that we won't need to deal with different Variant types inside the JSON implementation.

@Avogar
Member

Avogar commented Jan 10, 2024

As Variant type is implemented and can be merged soon (I hope), let's discuss implementation details of the JSON type.

Working with a large number of paths.

One of the problems we want to solve with the new JSON implementation is being able to store JSON objects with a large number of different paths. As the number of paths can be arbitrarily large, we want to store only some of them as subcolumns and the rest inside a separate String column as a JSON string. So here we need to think about 2 topics:

What paths we will store and how we will serialize/deserialize them in parts.

The first important note is that we cannot store the set of all unique paths in JSON, because this number can be even larger than the number of rows (for example, if unique ids are used as keys in each row). So, we can only store the set of paths that are stored separately as subcolumns (because we can control the number of these paths and limit it).
There are 2 possible ways we can store/serialize/deserialize paths:

  1. We can have a global set of paths for the part, so all granules will share this set. It can be implemented similarly to the shared dictionary in the LowCardinality column. So we will write/read this set of paths once during part serialization/deserialization. But we need to keep in mind that during merges we cannot ensure that all resulting columns will have the same set of paths separated into subcolumns (at least without changing the code of merges). So, here we have 2 scenarios:

    1.1. Allow resulting columns to have different sets of paths; for the shared set in the part, we will use the set from the first column and dynamically expand it when new columns with new sets are serialized. Advantages: no need to change the code of merges. Disadvantages: need to expand the global set of paths and possibly rewrite new columns to match it; during serialization new substreams can appear, and we need to support that somehow.

    1.2. Change the code of merges and implement special logic for choosing the empty resulting column using all source columns. It can be done as a new static method in the IColumn interface, like cloneSmth, that takes an array of columns and constructs a new empty column. In this case we can choose some algorithm for picking the resulting set of paths so that it will always be the same for all resulting columns during a merge. Advantages: all columns will have the same set of paths during serialization; no need to expand/rewrite anything; we can also use some statistics when choosing the resulting set of paths. Disadvantages: need to modify the code of merges, but it should not be a big problem.

  2. We can have a separate set of paths separated into subcolumns for each granule, but some granules can share their set of paths with other granules. Advantages: each resulting column can have a different set of paths; no need to modify the code of merges. Disadvantages: we can store more separated paths than the limit, as we don't control the total number of paths in the part; during deserialization of granules with different sets of paths we will need to rewrite the column; and we need to support dynamic substreams during serialization.

The better option is to have a shared set of paths for the part and modify the code of merges a bit by implementing a special way of choosing the resulting empty column, as described in 1.2.

How we will choose which paths will be stored separately and which paths will be stored in the String column.

We should choose which paths will be stored as subcolumns in 2 cases:

  • During insert from the input data.

Here we can have 2 options:

  1. Select paths in first-come order. Advantages: simple to implement, no finalization, no additional logic, no memory overhead. Disadvantages: it's possible that we will choose quite rare and useless paths for subcolumns if they come first in the read block.
  2. During insert, store all paths as subcolumns in memory and use some finalization to aggregate statistics on which paths are better to store separately (for example, which paths are more frequent than others). Advantages: we will choose better paths for subcolumns based on statistics. Disadvantages: need additional column finalization after reading, additional logic for statistics, conversion of rare subcolumns to the String column on finalization, memory overhead for a large number of paths.

I would prefer the first option, as the second is error-prone and its disadvantages exceed its advantages in my opinion. If there are other options/ideas, feel free to add them.

  • During merge of JSON columns

During a merge, we will have multiple JSON columns from different parts with possibly different sets of paths stored as subcolumns, and the total number of paths can exceed the limit. In this case we should decide which paths should remain as subcolumns and which paths should be serialized into the String column. This logic depends on how we are going to store and serialize paths. Let's say we use option 1.2, where we have a shared set of paths per part and modify the code of merges a bit by implementing special choosing of the resulting empty column. How can we choose, from N columns with N different sets of paths, what the resulting set of paths will be:

  • We can just take the first limit/N paths from each set. Similar to the first-come-order logic.
  • We can use local statistics of the source columns (the frequency of each path in the column) and take the most frequent paths into the resulting set.
  • Along with the global set of paths in the part, we can store a global statistic (the frequency of each path) across the whole part (we can calculate it during serialization and then serialize it at the end, together with the global set of paths). During deserialization we can store this statistic inside each column (together with the set of paths, as it is also the same for all deserialized columns). And we can use this global statistic to determine the resulting set of paths during a merge (so we will take the most frequent paths into the resulting set).

The third option would be the best, and it doesn't seem difficult to implement.

Querying subcolumns of JSON.

Originally we wanted to use Variant for JSON subcolumns and obtain the type of a subcolumn dynamically during query analysis. But the problem is that we are not able to store the set of all unique paths we have in a JSON column (as it can be larger than the total number of rows), so we will store only the subset of paths that are separated into subcolumns, and the rest of the paths will be collected into a single String column and stored as a JSON string. To dynamically obtain the type for any subcolumn, in the worst case we would have to do a full scan of this String column to find out whether we have this subcolumn or not. To solve this problem, I proposed a new type called Dynamic (see the proposal above). Let's briefly describe how it will be implemented and how it will work.

New Dynamic type

This type will represent a single column with a dynamic type, so it can store values of different types and extend its type dynamically. We can think of it as a dynamic Variant type. Initially it will store a column with the empty type Variant(), and during inserts or merges it will dynamically extend it. The column will store its current type and a column with type Variant inside, and during extension new variants will be added to the internal Variant column. During serialization we will serialize its type name and the internal Variant column. Similar to JSON, we have 2 options here: use a common global type name for the whole part, or serialize this type name per granule. The first will require the same modification of merges as described for JSON; the second will require dynamic adding of substreams.

The Dynamic column will support subcolumns as type names, similar to the Variant type: dynamic.Type, with type Nullable(Type), will read the .Type subcolumn from all inner variants. If none of the variants have this subcolumn, it will contain just NULL values. The type of such a subcolumn can be determined by its name, so there is no need to read any data during query analysis. Dynamic can be queried as is; it will support text serializations, so it can be displayed to the user. It will also support CAST to any type; it will just CAST the inner Variant type to the destination type.

Using Dynamic type for JSON subcolumns

We can say that any subcolumn requested from a JSON column will have type Dynamic, so we won't need to obtain its real type during query analysis. So if the user requests json.a.b.c, we will say that it has type Dynamic, and the user can do the following three things with this column:

  • CAST it to some type: select json.a.b.c::UInt32 will cast all inner variants to UInt32 and throw an exception if it can't.
  • Select its subcolumn as a type: select json.a.b.c.UInt32 will read only the UInt32 variant from all inner variants.
  • Select it as is: select json.a.b.c will just be displayed to the user in some text format.

If the user tries to use subcolumn json.a.b.c in any other scenario, they will just get an exception that type Dynamic is not supported for this operation.

It may happen that the user requests a JSON path that doesn't exist in the table; in this case we will just return a Dynamic column filled with NULL values.
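
A sketch of these three operations (names are illustrative):

select
    json.a.b.c::UInt32,  -- CAST all inner variants to UInt32, throw on failure
    json.a.b.c.UInt32,   -- read only the UInt32 variant, as Nullable(UInt32)
    json.a.b.c           -- returned as Dynamic and rendered in a text format
from t;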

But here we have a problem: as we use the Dynamic type for any JSON path without checking whether it really exists, we cannot distinguish whether json.a.b.UInt64 is a JSON path ending with key UInt64 or the UInt64 type subcolumn of path json.a.b.

Possible solutions:

  • Use another syntax for requesting type subcolumns from Variant. Examples: variant:::Type, variant.:Type, variant.::Type.
  • Consider all JSON subcolumns requested with . as paths, and allow extracting type subcolumns using a separate function like jsonExtractType(json.a.b, 'UInt64'). We can use the "rewrite functions to subcolumns" optimization to internally read only the required subcolumn. So we can keep the syntax with . for Variant (and maybe for Dynamic), but not for JSON (as this problem can happen only with JSON). Or we can use another syntax only for the JSON type.
  • As the problem happens only with JSON, keep the .Type syntax for Variant and Dynamic, but for JSON use another syntax, like json.a.b.:UInt64. So subcolumns separated with . will be treated as a path, and a type subcolumn can be read only using the new syntax. And we will successfully resolve something like json.a.b.UInt64.:UInt64 as the UInt64 type subcolumn of path json.a.b.UInt64.

I would prefer the third variant: as this problem can happen only with JSON, we should solve it only for JSON.
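
Under the third option, the disambiguation could look like this (a sketch of the proposed, not final, syntax):

select
    json.a.b.UInt64,         -- the JSON path a.b.UInt64 (a key literally named "UInt64")
    json.a.b.:UInt64,        -- the UInt64 type subcolumn of path a.b
    json.a.b.UInt64.:UInt64  -- the UInt64 type subcolumn of path a.b.UInt64
from t;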

An important question: if we implement the Dynamic type, should we allow users to use it as a separate type, or will we use it only internally for the JSON implementation?

Handling arrays of JSON objects.

In JSON data, some paths can contain arrays of JSON objects, and we should be able to handle that.
Let’s take an example:

{"obj" : {"arr" : [[{"a" : 42}], [{"a" : [1, 2, 3]}]]}}
{"obj" : {"arr" : [{"a" : 24}, {"a" : "String"}]}}
{"obj" : {"arr" : {"a" : "Some string 1"}}}

There are 2 ways we can handle arrays of JSON objects and access nested paths:

  • Treat it as one of the variants for the path.

So in this case we will have 2 paths: obj.arr.a with type Variant(String) and obj.arr with type Variant(Array(JSON), Array(Array(JSON))).
So if we want to access the value 24 of "a" from the second row, we will have to write something like select obj.arr.Array(JSON).a.UInt64.

Advantages: no need to add extra logic for extracting paths from objects inside arrays; ability to work with arrays of objects and not with separate paths.
Disadvantages: usage of nested JSON columns, so we should be careful with merges (but I think it should work); we should also control the number of subcolumns in nested JSON columns (we can use a lower limit for nested JSON); we can't handle an array of objects at the top level ("obj" : [{…}, {…}]).

  • We can extract all paths inside nested objects in Arrays into separate paths.

In this case we have 2 suboptions:

    • Reflect array level in result path.

So we will have paths: obj.arr.a with type Variant(String) (from row 3), obj.arr.[].a with type Array(Variant(UInt64, String)) (from row 2), and obj.arr.[].[].a with type Array(Array(Variant(UInt64, Array(UInt64)))) (from row 1). The indicator of the array level can be different; [] is used as an example.

Advantages: no need to use nested JSON columns; ability to separately access paths in objects from arrays at different levels; we can handle an array of objects at the top level.
Disadvantages: need separate logic for extracting paths from arrays during insert; we have types Array(Variant(…)), and it's not clear how to handle them, especially together with the Dynamic type; this will require additional logic or usage of something like Variant(Array(…Variant(…)…)).

    • Don’t reflect array level in result path.

So we will have a single path obj.arr.a with type Variant(String, Array(Variant(UInt64, String)), Array(Array(Variant(UInt64, Array(UInt64))))).
So to access the value 24 from the second row, we will have to write something ugly like select obj.arr.a.Array(Variant(UInt64, String)).UInt64.

Advantages: no need to use nested JSON columns; we will have a single path, so there should be no problem handling its type; we can handle an array of objects at the top level.
Disadvantages: need separate logic for extracting paths from arrays during insert; large resulting Variant type for the path; not convenient to extract type subcolumns, as the types are large and not trivial.

I am actually not sure which option is the best here. I like the first one because it should be easier to implement and it looks more natural to me, but we need to discuss it.

@alexey-milovidov @CurtizJ @KochetovNicolai please add comments/questions/ideas.

@qusijun

qusijun commented Feb 4, 2024

@Avogar hi, what is the current progress on the Dynamic data type and the new JSON? Is there an ETA?

@Avogar
Member

Avogar commented Feb 4, 2024

I am working on the Dynamic type right now. ETA: end of February/start of March. After Dynamic is done, I will start working on JSON, but I can't give an ETA yet.

UPD: due to other tasks, the ETA for Dynamic is moving to the end of March.

@orloffv
Contributor

orloffv commented Apr 2, 2024

@Avogar any updates here?

@Avogar
Member

Avogar commented Apr 2, 2024

The Dynamic implementation is almost ready, but the PR is delayed, because during March we put most of our priority and effort into fixing bugs/fuzzing issues/flaky tests, so I didn't have time to finish it. But don't worry, the PR with Dynamic will be ready soon, and I will start working on the JSON type after that.

For now, I can show a demo of the Dynamic type:

select d, dynamicType(d) from format(JSONEachRow, 'd Dynamic', $$
{"d" : [1, 2, 3]}
{"d" : "Hello"}
{"d" : 42}
{"d" : "2020-01-01"}
{"d" : {"a" : 1, "b" : 2}}
{"d" : null}
$$)
   ┌─d──────────┬─dynamicType(d)──────────┐
1. │ [1,2,3]    │ Array(Int64)            │
2. │ Hello      │ String                  │
3. │ 42         │ Int64                   │
4. │ 2020-01-01 │ Date                    │
5. │ (1,2)      │ Tuple(a Int64, b Int64) │
6. │ ᴺᵁᴸᴸ       │ None                    │
   └────────────┴─────────────────────────┘
create table test (d Dynamic) engine=MergeTree order by tuple();
insert into test select * from format(JSONEachRow, 'd Dynamic', $$
{"d" : [1, 2, 3]}
{"d" : "Hello"}
{"d" : 42}
{"d" : "2020-01-01"}
{"d" : {"a" : 1, "b" : 2}}
{"d" : null}
$$);
select d, d.Int64, d.`Tuple(a Int64, b Int64)`.a from test;
   ┌─d──────────┬─d.Int64─┬─d.Tuple(a Int64, b Int64).a─┐
1. │ [1,2,3]    │    ᴺᵁᴸᴸ │                        ᴺᵁᴸᴸ │
2. │ Hello      │    ᴺᵁᴸᴸ │                        ᴺᵁᴸᴸ │
3. │ 42         │      42 │                        ᴺᵁᴸᴸ │
4. │ 2020-01-01 │    ᴺᵁᴸᴸ │                        ᴺᵁᴸᴸ │
5. │ (1,2)      │    ᴺᵁᴸᴸ │                           1 │
6. │ ᴺᵁᴸᴸ       │    ᴺᵁᴸᴸ │                        ᴺᵁᴸᴸ │
   └────────────┴─────────┴─────────────────────────────┘
create table test (map Map(String, Array(Dynamic))) engine=MergeTree order by tuple();
insert into test select map('key1', ['str'::Dynamic, 42::Int64::Dynamic], 'key2', ['2020-01-01'::Date::Dynamic, [1,2,3]::Array(UInt64)::Dynamic]);
insert into test select map('key3', [43::Int64::Dynamic], 'key4', [tuple(1, 2)::Tuple(a Int64, b Int64)::Dynamic, 'str_2'::Dynamic]);
select map.values.`Int64`, map.values.`String`, map.values.`Tuple(a Int64, b Int64)`.a from test;
   ┌─map.values.Int64────────┬─map.values.String──────────┬─map.values.Tuple(a Int64, b Int64).a─┐
1. │ [[NULL,42],[NULL,NULL]] │ [['str',NULL],[NULL,NULL]] │ [[NULL,NULL],[NULL,NULL]]            │
   └─────────────────────────┴────────────────────────────┴──────────────────────────────────────┘
   ┌─map.values.Int64───┬─map.values.String───────┬─map.values.Tuple(a Int64, b Int64).a─┐
2. │ [[43],[NULL,NULL]] │ [[NULL],[NULL,'str_2']] │ [[NULL],[1,NULL]]                    │
   └────────────────────┴─────────────────────────┴──────────────────────────────────────┘

@Avogar
Member

Avogar commented May 21, 2024

As the Dynamic type is ready and almost merged, let's discuss some implementation details of the JSON type.

Handling arrays of JSON objects.

We still haven't discussed how we want to handle arrays of JSON objects. See possible implementations in one of the comments above.

Schema inference

During parsing of a JSON object, we will need to infer the data type for each path, and we need to determine which data types we want to support. Here I see 2 possible ways:

  1. Support only types Bool, Int64, Float64, String, Array(...) and maybe unnamed Tuple for arrays with different values.
  2. In addition to the types above, also try to infer other types from String, like Date, DateTime, IPv4, IPv6, UUID.

I would do 1 by default, and all the other types from 2 under settings enabled by default (very similar to schema inference from input formats; maybe we can even reuse those settings).
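
A sketch of how that could look, assuming the existing format schema-inference settings are reused (their applicability to the JSON type is an assumption here):

set input_format_try_infer_dates = 1;  -- existing format-inference setting; its reuse here is assumed
insert into t values ('{"d" : "2020-01-01"}');
-- Path "d" would be inferred as Date; with the setting disabled it would stay String.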

It would also be good to discuss how we want to handle arrays of values with different data types: should we use an unnamed Tuple for them, as is done in JSON data formats, or handle them differently (for example, as an array of Variant or Dynamic types)?

How to store paths and values in the String column when the maximum number of separately stored paths is exceeded

One of the main goals of the new implementation is to be able to handle a large number of different paths in JSON. To implement this, we will limit the number of separately stored paths, and when this limit is exceeded, all other paths will be stored in a single String column. We need to decide in what format we want to store these paths with their values in the String column. Here I see several possible ways:

Store paths as a flattened JSON object

A very simple implementation is to store all paths as a flattened JSON object in string representation. Let's see an example:

Let's have these JSON objects:

{"a" : 1, "b" : {"c" : [1], "d" : "str_1"}, {"e" : {"f" : {"g" : true}}}, "h" : 1.1}
{"a" : 2, "b" : {"c" : [1, 2], "d" : "str_2"}, {"e" : {"f" : {"h" : false}}}, "i" : 2.2}
{"a" : 3, "b" : {"c" : [1, 2, 3], "d" : "str_3"}, {"e" : {"f" : {"i" : true}}}, "k" : 3.3}

And suppose we have limit=2 on separately stored paths. Let's see what the storage can look like:

                                    JSON column
           /                             |                             \
(path "a", Int64 column)   (path "b.c", Array(Int64) column)   (other paths, String column)
         1                           [1]                   '{"b.d":"str_1","e.f.g":true,"h":1.1}'
         2                           [1, 2]                '{"b.d":"str_2","e.f.h":false,"i":2.2}' 
         3                           [1, 2, 3]             '{"b.d":"str_3","e.f.i":true,"k":3.3}'

Advantage: parsing and insertion from input formats are easy to implement.
Disadvantage: reading a separate path from such a string will not be efficient (for each row we will need to parse the JSON, find the requested path, infer the type of the value, parse the value according to the type, and insert it into the resulting column).

Use a special data structure for fast extraction of separate paths

These ideas are outdated.

To optimize the reading of separate paths from the single String column, we can store the paths in a special way that makes finding and reading separate paths faster. Here I can suggest several data structures (from the simplest to the most complex):

1. Store paths first and values second

<number_of_paths><path_1>...<path_N><index_of_value_1>...<index_of_value_N><value_1>...<value_N>

In this case, when we need to read some path, we will do the following:
Read the list of paths and determine whether this structure contains our path or not. If it doesn't, insert the default value into the destination column and don't read anything else. If it does, read the corresponding index of the value, read the value by the index, infer its data type, parse it, and insert it into the destination column.

If we take the above example, it can look like this:

<b.d><e.f.g><h><0><7><11><"str_1"><true><1.1>
<b.d><e.f.h><i><0><7><12><"str_2"><false><2.2>
<b.d><e.f.i><k><0><7><11><"str_3"><true><3.3>

Paths and values can probably be stored as binary strings (size + data), or as JSON text with , as a separator.

Advantages:

  • More or less fast determination of whether a row contains the path (we will read only the list of paths).
  • If a row contains the requested path, we will read only the corresponding value, by index.

Disadvantages:

  • Complicates data parsing and insertion: we have to create this structure for each row.
  • We will store additional data: the indexes of the values.

2. Store hashes of paths first, then pairs (path, value)

<number_of_paths><hash_of_path_1>...<hash_of_path_N><index_of_path_1>...<index_of_path_N><path_and_value_1>...<path_and_value_N>

In this case, when we need to read some path, we will do the following:
Calculate the hash of the requested path in advance. Read the list of hashes and check whether it contains our hash or not. If not, insert the default value into the destination column and don't read anything else. If it does, read the corresponding index and the path, and check whether the path is the same as the requested one (because collisions are possible). If they are not the same, continue reading hashes; if they are the same, read the corresponding value, determine its data type, parse it, and insert it into the destination column.

If we take the above example, it can look like this:

<hash_of_b.d><hash_of_e.f.g><hash_of_h><0><13><19><"b.d":"str_1"><"e.f.g":true><"h":1.1>
<hash_of_b.d><hash_of_e.f.h><hash_of_i><0><13><20><"b.d":"str_2"><"e.f.h":false><"i":2.2>
<hash_of_b.d><hash_of_e.f.i><hash_of_k><0><13><19><"b.d":"str_3"><"e.f.i":true><"k":3.3>

Advantages:

  • Faster determination of whether a row contains the path (we will read only the list of hashes); it should be faster than reading and comparing the paths themselves.
  • If a row contains the requested path, we will read only the corresponding value, by index.

Disadvantages:

  • Complicates data parsing and insertion: we have to create this structure for each row.
  • We will store additional data: hashes of paths and indexes of the (path, value) pairs.

3. Store sorted hashes and use binary search

We can improve on the previous approach and store the hashes in sorted order, so we can take this chunk of memory, reinterpret it as UInt64 *, and use binary search to find whether it contains the requested hash or not:

<number_of_paths><first_hash_in_sorted_order>...<N-th_hash_in_sorted_order><index_of_path_corresponding_to_first_hash_in_sorted_order>...<index_of_path_corresponding_to_N-th_hash_in_sorted_order><path_and_value_1>...<path_and_value_N>

If we take the above example, it can look like this:

<hash_of_e.f.g><hash_of_h><hash_of_b.d><13><19><0><"b.d":"str_1"><"e.f.g":true><"h":1.1>
<hash_of_b.d><hash_of_i><hash_of_e.f.h><0><20><13><"b.d":"str_2"><"e.f.h":false><"i":2.2>
<hash_of_e.f.i><hash_of_b.d><hash_of_k><13><0><19><"b.d":"str_3"><"e.f.i":true><"k":3.3>

Advantages:

  • Even faster determination of whether a row contains the path (binary search over hashes). I don't know if it can get faster than that.
  • If a row contains the requested path, we will read only the corresponding value, by index.

Disadvantages:

  • The same as previous approach.

4. Store types of values together with values

Determining the data type of each value while reading a path can be quite expensive; instead, we can determine the type of each value during parsing and insertion and store the type together with the value:

<number_of_paths><first_hash_in_sorted_order>...<N-th_hash_in_sorted_order><index_of_path_corresponding_to_first_hash_in_sorted_order>...<index_of_path_corresponding_to_N-th_hash_in_sorted_order><path_type_and_value_1>...<path_type_and_value_N>

If we take the above example, it can look like this:

<hash_of_e.f.g><hash_of_h><hash_of_b.d><13><19><0><"b.d","String","str_1"><"e.f.g","Bool",true><"h","Float64",1.1>
<hash_of_b.d><hash_of_i><hash_of_e.f.h><0><20><13><"b.d","String","str_2"><"e.f.h","Bool",false><"i","Float64",2.2>
<hash_of_e.f.i><hash_of_b.d><hash_of_k><13><0><19><"b.d","String","str_3"><"e.f.i","Bool",true><"k","Float64",3.3>

As we know the data types, we can store values not in JSON text format but in binary format. Also, we can avoid storing data types as type names (which is quite expensive) and instead create a binary encoding for our data types and use it (one byte for simple types, 2 bytes for Array(SimpleType), etc.).

So, in my opinion, the most efficient will be the following structure:

<number_of_paths><first_hash_in_sorted_order>...<N-th_hash_in_sorted_order><index_of_path_corresponding_to_first_hash_in_sorted_order>...<index_of_path_corresponding_to_N-th_hash_in_sorted_order><path_1><encoded_type_1><binary_value_1>...<path_N><encoded_type_N><binary_value_N>

Advantages:

  • Fast determination of whether a row contains the path (binary search over hashes).
  • If a row contains the requested path, we will read only the corresponding value, by index.
  • The types of the values are known; no need to use schema inference during reading.
  • Values are stored in binary format, so parsing is faster and it uses less space.

Disadvantages:

  • Complicates data parsing and insertion: we have to create this structure for each row, determine the type of each value, parse it, and convert it to the binary representation.
  • We will store additional data: hashes of paths, indexes, and data types.

Maybe it's overcomplicated, but I think reading a separate path from the single String column should be optimized, and we can sacrifice some efficiency of data parsing and insertion. Maybe the suggested structures don't make sense and won't improve anything. So please add your comments/opinions about it.

Use a Map column for paths and values

We can store paths and values in a Map column with sorted paths. So in each row we will have an array of paths and an array of corresponding values. The paths will be sorted, so we can use binary search to check whether a path exists in a specific row. The values will be stored in binary format, together with binary-encoded data types. This is how it will be stored in memory and in compact parts in MergeTree.

For wide parts in MergeTree there will be one addition: before each granule, we will write the sorted list of all paths in this granule and replace the column with paths by a column with indexes into this sorted list (so we will be able to skip a granule by looking at its list of paths and won't iterate over each row to find the requested path; it will be efficient especially when we read more than one path and can use the substreams cache).
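
Conceptually, for the earlier example with limit=2, the shared data of the first row would look like a sorted Map (a sketch; values are shown as text here, while the real values would be binary encoded together with their types):

select map('b.d', 'str_1', 'e.f.g', 'true', 'h', '1.1') as shared_data_row_1;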

@bvt123

bvt123 commented May 21, 2024

Why not consider the approach used for the Map type: store paths and values in two separate array columns? In many cases, we probably don't need to read the values column.

@debugmiller

If we want to allow querying of parent paths, hashing may not work. In your example, consider a query for path b. It should return values for both b.c and b.d. But if it searches for the hash of b, it will find nothing, and it will not know to search specifically for the hash of b.d.

An alternative could be to store the paths in sorted order, which would still allow binary searching.

If we want to use hashes, we could also hash each part of the path instead, for example <hash of b>.<hash of d> instead of <hash of b.d>.

@Avogar
Member

Avogar commented May 22, 2024

If we want to allow for querying of parent paths

I think we won't allow querying parent paths as subcolumns, because it would lead to a full scan on each subcolumn request. We don't know all the paths in a part in advance, only the paths that were separated into subcolumns (and this number will be limited), which means we can't say whether a requested path is a parent path or not, so we would have to check all paths with that prefix both in the set of separated paths and among all other paths stored together in some data structure. So, when a user requests json.a.b.c, we will return only data from path a.b.c, not including data from a.b.c.d, etc.

For reading parent paths we can implement special function.

Or we can think of introducing special syntax for reading sub-paths as JSON columns, like json.^a.b, which would return a column with type JSON containing all nested paths without the a.b prefix.
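
A sketch of what that could look like (the syntax is not final):

select json.^a.b from t;
-- Returns a JSON column containing all nested paths under a.b, with the a.b prefix stripped.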

@Avogar
Member

Avogar commented May 22, 2024

We also think that we will implement 2 data structures for paths stored together: one for the in-memory representation and one for serialization in granules in MergeTree. The second one can be more complex and collect data for the whole granule; maybe for each granule we will write something like a tree with all the paths in this granule. We are discussing it. My simple structure with hashes may be used for the in-memory representation, or we will come up with something better.

@abraithwaite

Thanks for sharing this RFC! It sounds like you're working towards the right use cases.

For us, we want to use this feature for querying semistructured JSON logs. As a result, the implementation detail that's most important to us is to be able to efficiently filter rows at query time, so storing the type with the type-specific binary row format is highly useful to our use case.

I can't speak for others, but querying parents of subcolumns is not a high priority for us.

The way we plan to use this feature is to have a JSON column that infers the data types and can be used for fast filtering, while still storing the raw data in a separate column as a string, so if we want to query a substructure instead of a leaf of the JSON object, we can fall back to that raw JSON string column.

Thanks again for sharing the direction and thoughts on this! Very excited for this feature! 🎉

@debugmiller

So, when a user requests json.a.b.c, we will return only data from path a.b.c, not including data from a.b.c.d, etc.

It would be nice to not have special syntax to select objects. For example, if I inserted the following data:

{"d" : 42}
{"d" : "2020-01-01"}
{"d" : {"a" : 1, "b" : 2}}

It would feel weird to me if SELECT json.d returned

42
"2020-01-01"
NULL

I would only know I am missing data if I instead queried SELECT json.^d. And unless I can guarantee through some other means that my data does not have nested columns, I would always have to query that way.

@debugmiller

debugmiller commented May 23, 2024

The syntax is less important; what I am trying to say is that the case of querying for a parent and also retrieving the children should not be ignored when considering the data structure.

@Avogar
Member

Avogar commented May 23, 2024

It would feel weird to me if SELECT json.d returned

It may feel weird, yes, but it will work like this. The whole point of the new JSON implementation is to be able to query separate paths quickly, without reading most of the data. Reading all child paths of a requested path would require a full scan of all the data in the worst case, even if the path doesn't have any children (because we won't know all the paths stored in a part in advance, only the paths that are stored as separate subcolumns in separate streams).

In your example:

{"d" : 42}
{"d" : "2020-01-01"}
{"d" : {"a" : 1, "b" : 2}}

We will store it as 3 subcolumns (some of them may be stored as separate subcolumns with an individual stream for fast reading; some may be stored together with other paths in a single stream in some data structure):

"d" : [42, "2020-01-01", NULL]
"d.a" : [NULL, NULL, 1]
"d.b" : [NULL, NULL, 2] 

And in this approach, reading "d" path as [42, "2020-01-01", NULL] will be quite fast, and that's the point. Maybe later we will introduce a new mode or syntax for reading of subcolumns that will also check for all childrens, but in this mode reading will be not so efficient, so it won't the default behaviour for sure

@debugmiller

For my use case this will be less helpful, but I can understand that it is a development trade-off you are making, and this is all experimental anyway :)

@debugmiller

Reading all child paths of a requested path would require a full scan of all the data in the worst case.

If you store the paths sorted, you only need to binary-search them, which of course is worst-case O(log n).

@ahmed-adly-khalil

It may feel weird, yes, but it will work like this. The whole point of the new JSON implementation is to be able to query separate paths quickly, without reading most of the data. […]

@Avogar Thanks for the detailed explanation. Is it possible to share some ETA for the new JSON type? Thanks!

@Avogar
Member

Avogar commented Jun 10, 2024

@ahmed-adly-khalil Hi!

is it possible to share some ETA around the new JSON type?

Right now the ETA is the first half of July, so most likely the new JSON type will be available in version 24.7.

@alexey-milovidov
Member Author

alexey-milovidov commented Jun 16, 2024

How to store paths and values in String column when max number of separately stored paths is exceeded

As I understand, we need to be able to quickly check whether a path exists.
For this purpose, we can use Map(String, ...) where the key is the path.
It could be sorted.

As a value, we can use String with the serialized content.

@Avogar
Member

Avogar commented Jun 16, 2024

For this purpose, we can use Map(String, ...) where the key is path.
It could be sorted.

Yes, I already understood that my suggestion is overcomplicated, and a Map with sorted paths and binary-serialized values will be enough. But with one addition: for serialization in wide parts, before each granule I will write the sorted list of all paths in this granule and replace the column with paths by a column with indexes into this sorted list (so we will be able to skip a granule by looking at its list of paths and won't iterate over each row to find the requested path; it will be efficient especially when we read more than one path and can use the substreams cache).

And I am already working on it.

@wilberding

@Avogar I saw your timeline for JSON Object is the first half of July. Do we have an updated timeline?

@Avogar
Member

Avogar commented Jun 25, 2024

The timeline didn't change. I plan to create a PR in around 2 weeks.

@melvynator
Member

melvynator commented Jul 3, 2024

There is currently an alias JSON for Object('JSON'). Given that the new data type's name will also be JSON, we might need to put it behind a compatibility setting.

cc: @Avogar

@Avogar
Member

Avogar commented Jul 31, 2024

The feature is postponed to 24.8. The review process is taking longer than usual because we are putting a lot of effort into stabilizing the CI and fixing existing bugs right now.
