fix(arrow/array): Parity with Arrow C++ when reading/writing null to/from JSON#833
Open
serramatutu wants to merge 5 commits into
Open
fix(arrow/array): Parity with Arrow C++ when reading/writing null to/from JSON#833serramatutu wants to merge 5 commits into
null to/from JSON#833serramatutu wants to merge 5 commits into
Conversation
null to/from JSONnull to/from JSON
Member
|
Is it possible for us to do this without a breaking change? I'd rather avoid the breaking change of changing a public API method if at all possible. |
This commit introduces 2 changes to `RecordBuilder.UnmarshalOne()`: - Now it checks for integrity first and only appends to internal builders last. This makes it possible to get an error when deserializing one JSON row without corrupting the state of the builder. I.e it is possible to get an error for a specific row and keep building the batch without that row. - When dealing with required fields, it checks for explicit `null`s in the input, and for missing required fields. This is in line with Arrow C++.
aaacba9 to
b8f92fa
Compare
b8f92fa to
d15839d
Compare
Contributor
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 15 out of 15 changed files in this pull request and generated 6 comments.
Comments suppressed due to low confidence (1)
arrow/array/record.go:424
- The UnmarshalOne docstring says it “receives an already-configured json.Decoder, so options such as UseNumber set by the caller are honored”, but the implementation now buffers values as RawMessage and re-decodes them with a fresh decoder that always calls UseNumber. This no longer honors caller decoder configuration (and in particular forces UseNumber even if the caller didn’t enable it).
// UnmarshalOne reads one row (a JSON object) from the supplied decoder and
// appends a value to each field in the RecordBuilder.
//
// Unlike UnmarshalJSON, this method receives an already-configured
// json.Decoder, so options such as UseNumber set by the caller are honored
// for nested field decoding. This is critical for preserving large integer
// values (>2^53) that cannot be represented exactly as float64.
Comment on lines
209
to
218
| tmp := make(map[string]interface{}) | ||
| fieldList := a.data.dtype.(*arrow.StructType).Fields() | ||
| dtype := a.data.dtype.(*arrow.StructType) | ||
| fieldList := dtype.Fields() | ||
| for j, d := range a.fields { | ||
| tmp[fieldList[j].Name] = d.GetOneForMarshal(i) | ||
| if dtype.Field(j).Nullable && a.IsNull(i) { | ||
| tmp[fieldList[j].Name] = nil | ||
| } else { | ||
| tmp[fieldList[j].Name] = d.GetOneForMarshal(i) | ||
| } | ||
| } |
Comment on lines
+489
to
496
| if bytes.Equal(next, []byte("null")) && !dtype.Field(idx).Nullable { | ||
| return fmt.Errorf("field '%s' is non-nullable but got null", dtype.Field(idx).Name) | ||
| } | ||
|
|
||
| valDec := json.NewDecoder(bytes.NewReader(next)) | ||
| valDec.UseNumber() | ||
| if err := b.fields[idx].UnmarshalOne(valDec); err != nil { | ||
| return err |
Comment on lines
+525
to
+526
| err = b.UnmarshalJSON([]byte(`{"f1-i32": 6, "f2-f64-notnull": 6.6, "map": [{"key": "4": "value": "d"}]}`)) | ||
| assert.NoError(t, err) |
Comment on lines
+561
to
+564
| roundtripped, _, err := array.FromJSON(mem, arr.DataType(), bytes.NewReader(jsonStr)) | ||
| defer roundtripped.Release() | ||
| assert.NoError(t, err) | ||
| assert.Truef(t, array.Equal(arr, roundtripped), "JSON round trip returns different array: got=%q, want=%d", arr, roundtripped) |
Comment on lines
+505
to
519
| defer func() { | ||
| e := recover() | ||
| if e == nil && tc.panicErr != nil { | ||
| t.Fatalf("did not panic, expected panic: %v", tc.panicErr) | ||
| } else if e != nil && tc.panicErr == nil { | ||
| t.Fatalf("unexpected panic: %v", e) | ||
| } else if e != nil && tc.panicErr != nil && fmt.Errorf("%s", e).Error() != tc.panicErr.Error() { | ||
| t.Fatalf("invalid error. got=%v, want=%v", e, tc.panicErr.Error()) | ||
| } | ||
| }() | ||
|
|
||
| err := sb.UnmarshalJSON([]byte(tc.jsonInput)) | ||
| if err != nil { | ||
| t.Fatal(err) | ||
| panic(err) | ||
| } |
Comment on lines
+481
to
+490
| // at this point we know there are no integrity errors, append values to field builders | ||
| for key, val := range keylist { | ||
| valDec := json.NewDecoder(bytes.NewReader(val)) | ||
| valDec.UseNumber() | ||
|
|
||
| indices := b.schema.FieldIndices(key) | ||
| if err := b.fields[indices[0]].UnmarshalOne(valDec); err != nil { | ||
| return err | ||
| } | ||
| } |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Rationale for this change
I raised an issue at the Arrow Community meeting regarding how Arrow Go allows the user to construct invalid record batches by reading JSON with nulls in it even if the schema says it's not null. It is also possible to construct invalid batches by calling
AppendNull()in the builders, and currently there is no way to validate that. I was instructed to look at how Arrow C++/PyArrow do it, and replicate it here.TL;DR:
nullif the parent field is non-nullableHere's a link to my full investigation, including the code so you can run it yourself: https://github.com/serramatutu/arrow-internal-nulls
What changes are included in this PR?
You can review commit by commit.
To address Issue 1:
RecordBuilder.UnmarshalOne()return an error if anullis present for a non-nullable field, or if the value is not given.StructBuilder.UnmarshalOne(), so that it works recursively.RecordBuilder.UnmarshalOne()now validates first and appends later, so that even if it needs to raise an error, the user can keep trying to read new records without compromising the state of the reader. Before it could be the case that it would only partially append a row, leaving the builder in an undefined state before reading the next row. I didn't do the validate then append change forStructBuildersince it relies onfield.UnmarshalOne(), and there is no way of unmarshaling without appending. This is a TODO for improvement.To address Issue 2:
dtype.Field(i).Nullablein theStructandRecordserializers to encode things asnilifarr.IsNull(i).GetOneForMarshal()implementations do aif a.IsNull(I { return nil }check, so this is somewhat pointless. I could changeGetOneForMarshal()to mean "get one for marshal assuming it's non-nullable", but it changes one of the invariants of using this public method.I had to change a bunch of tests that implicitly depended on fields being nullable, even if they didn't declare the fields as such.
Are these changes tested?
Yes.
Are there any user-facing changes?
Yes, this makes the JSON decoder less lenient and validation stricter. So this could be a breaking change for users who are relying on the leniency of the JSON decoder.