Refactoring serde_arrow - Christopher Prohm

Currently, I am in the midst of a major refactoring effort for serde_arrow. Prompted by this comment by v1gnesh on the related PR, I would like to summarize the ideas behind the changes and outline future plans.

Although the public API of serde_arrow has remained quite stable over recent releases, the underlying implementation has been changing rapidly. First, serde_arrow has begun using the Serde API directly, without intermediate representations. Second, serde_arrow has introduced a common array abstraction to separate the use of the arrow API from the core logic. In the following, I will discuss these changes and the reasoning behind them.

Initially, serde_arrow interpreted Rust objects and Arrow arrays in terms of a stream of events, modelled after JSON. For example, consider the following Rust struct:

#[derive(Serialize, Deserialize)]
struct Example {
    float: f32,
    int: i64,
}

A sequence of instances of this struct was translated into a series of events:

[
    Event::StartSequence,
    Event::StartStruct,
    Event::Str("float"),
    Event::F32(1.0),
    Event::Str("int"),
    Event::I64(2),
    Event::EndStruct,
    // ..
    Event::EndSequence,
]

When serializing Rust objects into Arrow arrays, serde_arrow generated this event sequence and then interpreted it to construct the corresponding arrays. This approach, inspired by Serde’s docs on writing data formats, used JSON as a model for the intermediate representation.
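As a rough illustration, the two steps (generate events, then interpret them) can be sketched as follows. This is a simplified stand-in and not the actual serde_arrow code: the `Event` enum only contains the variants shown above, and `Columns`, `to_events`, and `interpret` are hypothetical names introduced for this sketch.

```rust
// Simplified stand-in for the event type; only the variants shown above.
#[derive(Debug, Clone, PartialEq)]
enum Event {
    StartSequence,
    StartStruct,
    Str(&'static str),
    F32(f32),
    I64(i64),
    EndStruct,
    EndSequence,
}

// Hypothetical column buffers for the two fields of `Example`.
#[derive(Debug, Default, PartialEq)]
struct Columns {
    float: Vec<f32>,
    int: Vec<i64>,
}

// Step 1: turn the Rust values into a flat event stream.
fn to_events(rows: &[(f32, i64)]) -> Vec<Event> {
    let mut events = vec![Event::StartSequence];
    for &(float, int) in rows {
        events.extend([
            Event::StartStruct,
            Event::Str("float"),
            Event::F32(float),
            Event::Str("int"),
            Event::I64(int),
            Event::EndStruct,
        ]);
    }
    events.push(Event::EndSequence);
    events
}

// Step 2: interpret the events and fill the column buffers.
fn interpret(events: &[Event]) -> Result<Columns, String> {
    let mut columns = Columns::default();
    // Name of the field whose value is expected next, if any.
    let mut current_field: Option<&str> = None;
    for event in events {
        match (event, current_field) {
            (Event::Str(name), None) => current_field = Some(*name),
            (Event::F32(value), Some("float")) => {
                columns.float.push(*value);
                current_field = None;
            }
            (Event::I64(value), Some("int")) => {
                columns.int.push(*value);
                current_field = None;
            }
            // Markers carry no data in this flat, two-column sketch.
            (
                Event::StartSequence
                | Event::EndSequence
                | Event::StartStruct
                | Event::EndStruct,
                None,
            ) => {}
            (event, field) => {
                return Err(format!("unexpected {event:?} for field {field:?}"));
            }
        }
    }
    Ok(columns)
}
```

Every value passes through the `Vec<Event>` before it reaches a buffer, which is the extra work the intermediate representation implies.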

Following this comment by Ten0 on GitHub, I investigated whether maintaining such an intermediate representation is truly beneficial. It turned out that it both complicated the code and slowed it down by about a factor of two. Starting with version 0.11, serde_arrow uses the Serializer and Deserializer traits of Serde directly, with calls being forwarded to the relevant array builders. Notably, nested structures are implemented by using child array builders as serializers for their corresponding child fields.
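The forwarding idea can be sketched without the event stream. To keep the example self-contained, it does not implement the real `serde::Serializer` trait (which has many more methods); the `ValueSink` trait, the builder types, and the hand-written `serialize` method are hypothetical stand-ins that only mirror the shape of the approach.

```rust
// Hypothetical, much-reduced stand-in for the relevant parts of a serializer.
// By default every value type is rejected; builders override what they accept.
trait ValueSink {
    fn put_f32(&mut self, _value: f32) -> Result<(), String> {
        Err("unexpected f32".into())
    }
    fn put_i64(&mut self, _value: i64) -> Result<(), String> {
        Err("unexpected i64".into())
    }
}

#[derive(Default)]
struct F32Builder {
    values: Vec<f32>,
}

#[derive(Default)]
struct I64Builder {
    values: Vec<i64>,
}

impl ValueSink for F32Builder {
    fn put_f32(&mut self, value: f32) -> Result<(), String> {
        self.values.push(value);
        Ok(())
    }
}

impl ValueSink for I64Builder {
    fn put_i64(&mut self, value: i64) -> Result<(), String> {
        self.values.push(value);
        Ok(())
    }
}

// The struct builder owns one child builder per field and hands the child
// out as the sink for that field's value.
#[derive(Default)]
struct ExampleBuilder {
    float: F32Builder,
    int: I64Builder,
}

impl ExampleBuilder {
    fn field(&mut self, name: &str) -> Result<&mut dyn ValueSink, String> {
        match name {
            "float" => Ok(&mut self.float as &mut dyn ValueSink),
            "int" => Ok(&mut self.int as &mut dyn ValueSink),
            other => Err(format!("unknown field {other}")),
        }
    }
}

struct Example {
    float: f32,
    int: i64,
}

impl Example {
    // Roughly the sequence of calls a derived `Serialize` impl would make.
    fn serialize(&self, builder: &mut ExampleBuilder) -> Result<(), String> {
        builder.field("float")?.put_f32(self.float)?;
        builder.field("int")?.put_i64(self.int)?;
        Ok(())
    }
}
```

Each value goes straight from the struct into the buffer of its child builder, and no intermediate event is allocated along the way.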

The second significant change, soon to be released, introduces a common array abstraction. This change aims to decouple serde_arrow's array builders from the construction of Arrow arrays.

In version 0.11, builders are directly converted into Arrow array data, for example:

impl Utf8Builder {
    fn into_arrow(self) -> Result<arrow::array::ArrayData> {
        // ..
    }
}

Starting with version 0.12, serde_arrow introduces an internal Array struct that is constructed first before converting it into Arrow array data:

impl Utf8Builder {
    fn into_array(self) -> Result<serde_arrow::internal::Array> {
        // ..
    }
}

impl serde_arrow::internal::Array {
    fn into_arrow(self) -> Result<arrow::array::ArrayData> {
        // ..
    }
}

This additional indirection separates the array-building process into two phases: first, filling the necessary buffers, and then constructing the array objects from these buffers.
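A minimal sketch of the two phases for a string array is shown below. The names and layout are assumptions for illustration only: `Array` here is a hypothetical enum with a single variant, `mock_arrow::ArrayData` is a stand-in for `arrow::array::ArrayData`, validity handling is left out, and `into_array` returns the array directly instead of a `Result` to keep the example short.

```rust
use std::convert::TryFrom;

// Hypothetical internal array: plain buffers only, no Arrow types involved.
#[derive(Debug, Clone, PartialEq)]
enum Array {
    Utf8 { offsets: Vec<i32>, data: Vec<u8> },
}

// Stand-in for the Arrow crate, so the sketch has no external dependencies.
mod mock_arrow {
    #[derive(Debug, PartialEq)]
    pub struct ArrayData {
        pub data_type: &'static str,
        pub len: usize,
        pub buffers: Vec<Vec<u8>>,
    }
}

struct Utf8Builder {
    offsets: Vec<i32>,
    data: Vec<u8>,
}

impl Utf8Builder {
    fn new() -> Self {
        // The offsets buffer always starts with 0.
        Self { offsets: vec![0], data: Vec::new() }
    }

    fn push(&mut self, value: &str) -> Result<(), String> {
        self.data.extend_from_slice(value.as_bytes());
        let end = i32::try_from(self.data.len())
            .map_err(|_| "offset overflow".to_string())?;
        self.offsets.push(end);
        Ok(())
    }

    // Phase 1: hand over the filled buffers.
    fn into_array(self) -> Array {
        Array::Utf8 { offsets: self.offsets, data: self.data }
    }
}

impl Array {
    // Phase 2: construct the (mock) Arrow object from the buffers.
    fn into_arrow(self) -> mock_arrow::ArrayData {
        match self {
            Array::Utf8 { offsets, data } => {
                let offset_bytes: Vec<u8> = offsets
                    .iter()
                    .flat_map(|offset| offset.to_le_bytes())
                    .collect();
                mock_arrow::ArrayData {
                    data_type: "Utf8",
                    len: offsets.len().saturating_sub(1),
                    buffers: vec![offset_bytes, data],
                }
            }
        }
    }
}
```

Only `into_arrow` knows about the Arrow types; the builder and the `Array` value can be tested and reused without them.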

This change aims to achieve two objectives:

Support additional methods for buffer construction from Rust objects
Support multiple versions of an Arrow crate simultaneously

In particular, supporting different Arrow versions simultaneously makes it possible to address a long-standing issue with the serde_arrow API: the crate features are non-additive. Currently, it’s not possible to use the crate with both arrow=50 and arrow=51 at the same time. This behavior deviates from Rust’s guidelines for crate features, but was previously preferable due to the large amount of code required to support different Arrow versions. The new approach essentially reduces the conversion code to copying buffers into the relevant structs of the Arrow crate.
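To illustrate why a shared array type makes additive features feasible, here is a small sketch under stated assumptions. The modules `arrow_v50` and `arrow_v51` are stand-ins for two major versions of the Arrow crate, `Int64Array` is a hypothetical version-independent array, and the feature names in the comments are illustrative rather than a statement about the planned API.

```rust
// Stand-ins for two major versions of the Arrow crate. Their types are
// distinct even though they look identical, as with two real crate versions.
mod arrow_v50 {
    #[derive(Debug, PartialEq)]
    pub struct ArrayData {
        pub len: usize,
        pub values: Vec<i64>,
    }
}

mod arrow_v51 {
    #[derive(Debug, PartialEq)]
    pub struct ArrayData {
        pub len: usize,
        pub values: Vec<i64>,
    }
}

// Hypothetical version-independent array holding only the raw buffer.
#[derive(Debug, Clone, PartialEq)]
struct Int64Array {
    values: Vec<i64>,
}

// In a real crate this impl could sit behind its own feature flag, e.g.
// `#[cfg(feature = "arrow-50")]` (illustrative name).
impl From<Int64Array> for arrow_v50::ArrayData {
    fn from(array: Int64Array) -> Self {
        Self { len: array.values.len(), values: array.values }
    }
}

// Likewise, e.g. `#[cfg(feature = "arrow-51")]` (illustrative name).
impl From<Int64Array> for arrow_v51::ArrayData {
    fn from(array: Int64Array) -> Self {
        Self { len: array.values.len(), values: array.values }
    }
}
```

Because each conversion is a small, independent impl that only moves buffers, enabling one does not exclude the other, and both can be compiled into the same build.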

While the 0.12 release will introduce this new array abstraction and refine the overall API, additive features will not be included until the 0.13 release. If the lack of additive features is a concern for you today, please feel free to guide the design by commenting on the design issue.