Currently, I am in the midst of a major refactoring effort for serde_arrow
.
Prompted by this comment by v1gnesh on the related PR, I would like to
summarize the ideas behind the changes and outline future plans.
Although the public API of serde_arrow
has remained quite stable over recent releases, the
underlying implementation has been changing rapidly: First, serde_arrow
has begun using the
Serde API directly without intermediate representations. Second, serde_arrow
has
introduced a common array abstraction to separate the use of the
arrow
API from the core logic. In the following, I will discuss into these changes and
the reasoning behind them.
Initially, serde_arrow
interpreted Rust objects and Arrow arrays in terms of a stream of events,
modelled after JSON. For example, consider the following Rust struct:
#[derive(Serialize, Deserialize)]
struct Example {
float: f32,
int: i64,
}
A sequence of instances of this struct was translated into a series of events:
[
Event::StartSequence,
Event::StartStruct,
Event::Str("float"),
Event::F32(1.0),
Event::Str("int"),
Event::I64(2),
Event::EndStruct,
// ..
Event::EndSequence,
]
When serializing Rust objects into Arrow arrays, serde_arrow
generated this event sequence and
then interpreted them to construct the corresponding arrays. This approach, inspired by Serde’s
docs on writing data formats, used JSON as a model for the intermediate
representation.
Following this comment by Ten0 on GitHub, I investigated whether maintaining
such an intermediate representation is truly beneficial. It turns out it both complicated the code
and decreased performance by about 2x. Starting with version 0.11
, serde_arrow
now uses the
Serializer and Deserializer traits of Serde directly, with
calls being forwarded to the relevant array builders. Notably, nested structures are implemented by
using child array builders as serializers for their corresponding child fields.
The second significant change, soon to be released, introduces a common array
abstraction. This change aims to decouple serde_arrow
's array builders from the
construction of Arrow arrays.
In version 0.11
, builders are directly converted into Arrow array data, for
example:
impl Utf8Builder {
fn into_arrow(self) -> Result<arrow::array::ArrayData> {
// ..
}
}
Starting with version 0.12
, serde_arrow
introduces an internal Array
struct that is
constructed first before converting it into Arrow array data:
impl Utf8Builder {
fn into_array(self) -> Result<serde_arrow::internal::Array> {
// ..
}
}
impl serde_arrow::internal::Array {
fn into_arrow(self) -> Result<arrow::array::ArrayData> {
// ..
}
}
This additional indirection separates the array-building process into two phases: first, filling the necessary buffers, and then constructing the array objects from these buffers.
This change aims to achieve two objectives:
- Support additional methods for buffer construction from Rust objects
- Support multiple versions of an Arrow crate simultaneously
In particular, supporting different Arrow versions simultaneously allows to address a long-standing
issue with the serde_arrow
API: the crate features are non-additive. Currently,
it’s not possible to use the crate with both arrow=50
and arrow=51
simultaneously. This behavior
deviates from Rust’s guidelines for crate features but was previously
preferrable due to the large amount of code required to support different Arrow versions. The new
approach reduces the conversion code in essence to copying buffers to the relevant structs of the
Arrow crate.
While the 0.12
release will introduce this new array abstraction and refine the overall API,
additive features will not be included until the 0.13
release. If the lack of additive features is
a concern for you today, please feel free to guide the design by commenting on the design
issue.