Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature request: enabling grammar typing for sequences via deserialize #578

Open
Seamooo opened this issue Mar 15, 2023 · 4 comments
Open
Labels
enhancement serde Issues related to mapping from Rust types to XML

Comments

@Seamooo
Copy link

Seamooo commented Mar 15, 2023

This issue is to gauge interest in being able to validate the variadic layout of a sequence via tuple elements that are Option

I have a use case to generate schema definitions with sequences defined by a grammar. Given that it's already possible to enforce order, and serializing has builtin flattening of options and sequences, there are two key features for deserializing missing that would fully enable building types to validate a sequence grammar.

Deserializing end of sequence for a tuple should try to deserialize as visiting None for each remaining element of the tuple sequence.

Deserializing a sequence of sequences (of sequences...*) should recursively flatten.

Both these features are already present implicitly in the serializer, this would just be making the deserializer match.

This would allow for examples like below to work

use quick_xml::{de::from_str, se::to_string};
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize, PartialEq, Eq, Debug)]
enum ABChoice {
    A,
    B,
}
#[derive(Serialize, Deserialize, PartialEq, Eq, Debug)]
enum CDChoice {
    C,
    D,
}
#[derive(Serialize, Deserialize, PartialEq, Eq, Debug)]
struct Root {
    #[serde(rename = "$value")]
    content: (Vec<ABChoice>, Vec<CDChoice>, Option<Vec<ABChoice>>),
}
let root = Root {
    content: (
        vec![ABChoice::A, ABChoice::B],
        vec![CDChoice::C],
        Some(vec![ABChoice::A]),
    ),
};
let serialized = to_string(&root).unwrap();
dbg!(&serialized);
let deserialized = from_str(&serialized).unwrap();
assert_eq!(root, deserialized);

Currently serializing this produces the string

<Root><A/><B/><C/><A/></Root>

Mirroring this in the deserializer is desirable.

I've managed to get a hacky implementation of this to work sending all cases that would stop iteration to first try and deserialize as the DeserializeSeed with a deserializer that can only call visit_none, however this would lead to the case where a vector of Options produces an infinite loop so there's still some design work to be done there.

With regards to recursive sequence flattening, I suspect the offender is <SeqItemDeserializer as de::Deserializer>::deserialize_seq, however just making this recursive seems to have other potential ramifications associated. As such, design input would be appreciated there as well.

I'm very happy to do the work myself. However, I would like to consult, firstly, if this would be something that can be upstreamed, and secondly if there are any pitfalls in this area I should be aware of from a design standpoint, particularly that would prevent it getting upstreamed.

@Mingun Mingun added enhancement serde Issues related to mapping from Rust types to XML labels Mar 15, 2023
@Mingun
Copy link
Collaborator

Mingun commented Mar 15, 2023

I'm always open in increasing consistency of serializer and deserializer. I suspect, however, that theoretical constructions are very hard to understand, so the best way is to check your theories by trying to implement your suggestions. I must warn that it will not always be possible.

quick-xml already has a large set of cases (which I was able to imagine) in serde-de and serde-se tests. If you think, that you could improve some of them and give a rationale why we should do that, go ahead. I recommend also imagine how the same structs would be serialized in JSON and how the result XML would look if you would convert from serialized JSON to XML. The other important thing is to keep composability: usually the result shouldn't depend on the wrapping structs. I other words: serialization result of a wrapper around type should be equal to wrapping of serialization result of a type, as you could see in this example:

struct X;
struct Wrap {
  element: X,
}
<wrap>
  <!-- serialized X (with name "element") inserted into <wrap> element -->
  <element/>
</wrap>

<!-- serialized Wrap (with name "wrap") -->
<wrap>
  <element/>
</wrap>

@Seamooo
Copy link
Author

Seamooo commented Mar 15, 2023

The other important thing is to keep composability: usually the result shouldn't depend on the wrapping structs. I other words: serialization result of a wrapper around type should be equal to wrapping of serialization result of a type, as you could see in this example

The parts of the deserializer I'm looking to touch are specific to tags on the same level in a sequence. Composability should still be done in the same way via the non-sequence deserializer associated functions, but for untagged collections I'm of the opinion that they should be deserialized from the same level, as is required by the implementation, and done by the serializer. The plan is to extend this behaviour to be a recursive flattening for untagged sequences.

I suspect, however, that theoretical constructions are very hard to understand, so the best way is to check your theories by trying to implement your suggestions.

Absolutely agreed. It's far easier to talk to an example implementation rather than in theory what it could do. Mainly what I'm looking for from this issue are cases of undesirable behaviour in this area as I'd noticed places like:

quick-xml/src/de/map.rs

Lines 783 to 785 in 642de0a

// This is a sequence element. We cannot treat it as another flatten
// sequence if type will require `deserialize_seq` We instead forward
// it to `xs:simpleType` implementation
Acknowledging that there is a reason that it's done this way, and it is intentional, so getting the background on that intention would be very helpful.

I don't expect to be able to exhaustively cover much until a PR is submitted, but any general criteria for expected behaviour / behaviour that would yield a rejection would be very useful.

An example reusing the preamble in the original issue description

#[derive(Serialize, Deserialize, PartialEq, Eq, Debug)]
struct Root2 {
    #[serde(rename = "$value")]
    content: (Vec<ABChoice>, Vec<ABChoice>)
}
let root2 = Root2 {
    content: (
        vec![ABChoice::A],
        vec![ABChoice::B, ABChoice::A],
    ),
};
let serialized2 = to_string(&root).unwrap();
let deserialized2 = from_str(&serialized).unwrap();

Could deserialize to

Root2 {
    content: (
        vec![ABChoice::A, ABChoice::B],
        vec![ABChoice::A],
    ),
}

If there's been a previous design decision on ambiguities such as this, it would be helpful to have some context for it.

@Mingun
Copy link
Collaborator

Mingun commented Mar 16, 2023

Acknowledging that there is a reason that it's done this way, and it is intentional, so getting the background on that intention would be very helpful.

The reason is very simple: in XML you cannot nest one sequence into another directly except the case when the second sequence is an xs:list type. Nesting 3 and more lists are not possible at all. This is because there are not dedicated syntax for sequences in XML, like [] in JSON. We try to be as more explicit as possible to prevent non-obvious mapping.

Could deserialize to

I would say, that if that worked that way, the only possible output would

Root2 {
    content: (
        vec![ABChoice::A, ABChoice::B, ABChoice::A],
        vec![], // always empty
    ),
}

As for your initial example (side note: the playground with quick-xml), the deserialization error is due to #567. If worked, I think, is should be deserialized to (when overlapped-lists feature is active)

Root {
    content: (
        vec![ABChoice::A, ABChoice::B, ABChoice::A],
        vec![CDChoice::C],
        None, // not sure, maybe will be Some(vec![])
    ),
};

This is fundamental restriction of XML -- not all roundtrips are possible.

So probably your desired behavior is already implemented, but with bugs 😃 I did some preliminary investigation of that bug, I think that the problem in passing the root deserializer in the line 2723 here:

quick-xml/src/de/mod.rs

Lines 2715 to 2725 in 642de0a

fn next_element_seed<T>(&mut self, seed: T) -> Result<Option<T::Value>, Self::Error>
where
T: DeserializeSeed<'de>,
{
match self.peek()? {
DeEvent::Eof => Ok(None),
// Start(tag), End(tag), Text
_ => seed.deserialize(&mut **self).map(Some),
}
}

@Seamooo
Copy link
Author

Seamooo commented Mar 17, 2023

The reason is very simple: in XML you cannot nest one sequence into another directly except the case when the second sequence is an xs:list type. Nesting 3 and more lists are not possible at all. This is because there are not dedicated syntax for sequences in XML, like [] in JSON. We try to be as more explicit as possible to prevent non-obvious mapping.

I think this is effectively the crux of what I wanted to discuss by opening this issue. The feature request is a proposal to standardise how this could be represented, drawing from how it is represented by the serializer as both reference and inspiration.

XML, without a tag for the collection, can only represent such a collection as a single sequence, as it has no reserved marker for marking collection start and end points. As such, serializing such a collection with arbitrary, nested, untagged collections will likely be inherently lossy. However, there is a way to be potentially more robust about this in the form of using untagged collections as a form of grammar specification for the sequence, where tuples indicate required order, options indicate an optional match for the grammar, and vectors represent non-zero length, unbounded, variable length sequences of the nested type.

Potentially for the above vectors can be read as anything that implements FromIterator / IntoIterator, but this may break determinism for the serialized xml string.

A further addition to the robustness of specifying these sequences by grammar, could be to add a proc_macro derive as an opt-in feature to check the robustness of the grammar such that it would create a non-public const that runs a compile-time check for grammar ambiguities, but I believe that to be well beyond the scope of this issue. I would envision that this would effectively solve the problem of the deserializer not always being able to mirror the serializer, if this check passes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement serde Issues related to mapping from Rust types to XML
Projects
None yet
Development

No branches or pull requests

2 participants