Skip to content

fix(arrow/array): Parity with Arrow C++ when reading/writing null to/from JSON#833

Open
serramatutu wants to merge 5 commits into
apache:mainfrom
serramatutu:serramatutu/JSON-nulls-v2
Open

fix(arrow/array): Parity with Arrow C++ when reading/writing null to/from JSON#833
serramatutu wants to merge 5 commits into
apache:mainfrom
serramatutu:serramatutu/JSON-nulls-v2

Conversation

@serramatutu

@serramatutu serramatutu commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

Rationale for this change

I raised an issue at the Arrow Community meeting regarding how Arrow Go allows the user to construct invalid record batches by reading JSON with nulls in it even if the schema says it's not null. It is also possible to construct invalid batches by calling AppendNull() in the builders, and currently there is no way to validate that. I was instructed to look at how Arrow C++/PyArrow do it, and replicate it here.

TL;DR:

  1. Arrow C++ raises errors when reading JSON data that does not conform to the schema
  2. Arrow C++ never JSON-encodes values as null if the parent field is non-nullable

Here's a link to my full investigation, including the code so you can run it yourself: https://github.com/serramatutu/arrow-internal-nulls

What changes are included in this PR?

You can review commit by commit.

To address Issue 1:

  • Make RecordBuilder.UnmarshalOne() return an error if a null is present for a non-nullable field, or if the value is not given.
  • Same as above for StructBuilder.UnmarshalOne(), so that it works recursively.
  • RecordBuilder.UnmarshalOne() now validates first and appends later, so that even if it needs to raise an error, the user can keep trying to read new records without compromising the state of the reader. Before it could be the case that it would only partially append a row, leaving the builder in an undefined state before reading the next row. I didn't do the validate then append change for StructBuilder since it relies on field.UnmarshalOne(), and there is no way of unmarshaling without appending. This is a TODO for improvement.

To address Issue 2:

  • Added a check for dtype.Field(i).Nullable in the Struct and Record serializers to encode things as nil if arr.IsNull(i).
  • PROBLEM: most GetOneForMarshal() implementations do a if a.IsNull(I { return nil } check, so this is somewhat pointless. I could change GetOneForMarshal() to mean "get one for marshal assuming it's non-nullable", but it changes one of the invariants of using this public method.

I had to change a bunch of tests that implicitly depended on fields being nullable, even if they didn't declare the fields as such.

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes, this makes the JSON decoder less lenient and validation stricter. So this could be a breaking change for users who are relying on the leniency of the JSON decoder.

@serramatutu serramatutu requested a review from zeroshade as a code owner June 1, 2026 12:13
@serramatutu serramatutu changed the title Parity with Arrow C++ when reading/writing null to/from JSON fix(arrow/array): Parity with Arrow C++ when reading/writing null to/from JSON Jun 1, 2026
@zeroshade

Copy link
Copy Markdown
Member

Is it possible for us to do this without a breaking change? I'd rather avoid the breaking change of changing a public API method if at all possible.

This commit introduces 2 changes to `RecordBuilder.UnmarshalOne()`:
- Now it checks for integrity first and only appends to internal
  builders last. This makes it possible to get an error when
  deserializing one JSON row without corrupting the state of the
  builder. I.e it is possible to get an error for a specific row and
  keep building the batch without that row.
- When dealing with required fields, it checks for explicit `null`s
  in the input, and for missing required fields. This is in line with
  Arrow C++.
@serramatutu serramatutu force-pushed the serramatutu/JSON-nulls-v2 branch 2 times, most recently from aaacba9 to b8f92fa Compare June 4, 2026 05:38

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 6 comments.

Comments suppressed due to low confidence (1)

arrow/array/record.go:424

  • The UnmarshalOne docstring says it “receives an already-configured json.Decoder, so options such as UseNumber set by the caller are honored”, but the implementation now buffers values as RawMessage and re-decodes them with a fresh decoder that always calls UseNumber. This no longer honors caller decoder configuration (and in particular forces UseNumber even if the caller didn’t enable it).
// UnmarshalOne reads one row (a JSON object) from the supplied decoder and
// appends a value to each field in the RecordBuilder.
//
// Unlike UnmarshalJSON, this method receives an already-configured
// json.Decoder, so options such as UseNumber set by the caller are honored
// for nested field decoding. This is critical for preserving large integer
// values (>2^53) that cannot be represented exactly as float64.

Comment thread arrow/array/struct.go
Comment on lines 209 to 218
tmp := make(map[string]interface{})
fieldList := a.data.dtype.(*arrow.StructType).Fields()
dtype := a.data.dtype.(*arrow.StructType)
fieldList := dtype.Fields()
for j, d := range a.fields {
tmp[fieldList[j].Name] = d.GetOneForMarshal(i)
if dtype.Field(j).Nullable && a.IsNull(i) {
tmp[fieldList[j].Name] = nil
} else {
tmp[fieldList[j].Name] = d.GetOneForMarshal(i)
}
}
Comment thread arrow/array/struct.go
Comment on lines +489 to 496
if bytes.Equal(next, []byte("null")) && !dtype.Field(idx).Nullable {
return fmt.Errorf("field '%s' is non-nullable but got null", dtype.Field(idx).Name)
}

valDec := json.NewDecoder(bytes.NewReader(next))
valDec.UseNumber()
if err := b.fields[idx].UnmarshalOne(valDec); err != nil {
return err
Comment on lines +525 to +526
err = b.UnmarshalJSON([]byte(`{"f1-i32": 6, "f2-f64-notnull": 6.6, "map": [{"key": "4": "value": "d"}]}`))
assert.NoError(t, err)
Comment on lines +561 to +564
roundtripped, _, err := array.FromJSON(mem, arr.DataType(), bytes.NewReader(jsonStr))
defer roundtripped.Release()
assert.NoError(t, err)
assert.Truef(t, array.Equal(arr, roundtripped), "JSON round trip returns different array: got=%q, want=%d", arr, roundtripped)
Comment on lines +505 to 519
defer func() {
e := recover()
if e == nil && tc.panicErr != nil {
t.Fatalf("did not panic, expected panic: %v", tc.panicErr)
} else if e != nil && tc.panicErr == nil {
t.Fatalf("unexpected panic: %v", e)
} else if e != nil && tc.panicErr != nil && fmt.Errorf("%s", e).Error() != tc.panicErr.Error() {
t.Fatalf("invalid error. got=%v, want=%v", e, tc.panicErr.Error())
}
}()

err := sb.UnmarshalJSON([]byte(tc.jsonInput))
if err != nil {
t.Fatal(err)
panic(err)
}
Comment thread arrow/array/record.go
Comment on lines +481 to +490
// at this point we know there are no integrity errors, append values to field builders
for key, val := range keylist {
valDec := json.NewDecoder(bytes.NewReader(val))
valDec.UseNumber()

indices := b.schema.FieldIndices(key)
if err := b.fields[indices[0]].UnmarshalOne(valDec); err != nil {
return err
}
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants