feat: support binary arguments for StringConcat operator#21883
feat: support binary arguments for StringConcat operator#21883theirix wants to merge 5 commits intoapache:mainfrom
Conversation
Binary arguments are supported for concat UDFs, but not for the pipe operator. Support them by providing specialized kernels for pure binary operations. Avoid support of mixed string/binary arguments as it doesn't match behaviour of major DBs, except of Postgres.
| /// 1. At least one side of lhs and rhs should be string type (Utf8 / LargeUtf8) | ||
| /// 2. Data type of the other side should be able to cast to string type | ||
| /// 3. Binary and string types cannot be mixed | ||
| fn string_concat_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option<DataType> { |
There was a problem hiding this comment.
Before the change, Binary || Utf8 returned a Utf8 string, and Binary || Binary also returned Utf8. Now the first errors out, and the second returns Binary, wouldn't this be a breaking change for users who are hitting this in a query?
There was a problem hiding this comment.
It is worth noting, as we decided to ban mixing according to this analysis.
I'll add api change label and put an upgrade notice shortly.
I intend to put a separate PR to harmonise concat UDF behaviour (it allows mixing) with the pipe operator's behaviour to avoid confusion
| /// | ||
| /// # Errors | ||
| /// - Returns an error if the input arrays have different lengths. | ||
| /// - Panics if any concatenated string exceeds `T::Offset::MAX` in length. |
There was a problem hiding this comment.
shoudl we move it to Panic section? also the kernel below concat_elements_binary_view_array returns an error on overflow while this one panics. Same overflow class, different behavior. Either pick one and apply it to both, or add a comment explaining why they differ.
| for dt in [DataType::Utf8, DataType::LargeUtf8, DataType::Utf8View] { | ||
| assert!( | ||
| BinaryTypeCoercer::new(&DataType::Binary, &Operator::StringConcat, &dt,) | ||
| .get_input_types() | ||
| .is_err(), | ||
| "{}", | ||
| dt | ||
| ); | ||
| assert!( | ||
| BinaryTypeCoercer::new(&dt, &Operator::StringConcat, &DataType::Binary,) | ||
| .get_input_types() | ||
| .is_err(), | ||
| "{}", | ||
| dt | ||
| ); |
There was a problem hiding this comment.
it only loops over string types paired with DataType::Binary. The coercion rule actually covers 9 combinations (Binary/LargeBinary/BinaryView × Utf8/LargeUtf8/Utf8View). Could you extend the loop to cover all of them? Something like a nested loop over both lists would do it. Right now LargeBinary || Utf8 and BinaryView || LargeUtf8 are not directly tested. wdyt?
There was a problem hiding this comment.
Agree. By the way, it tests only coercion, the actual operation concat, as Jeffrey spotted for LargeBinary, is done via slt
| buffer.clear(); | ||
| buffer.extend_from_slice(l); | ||
| buffer.extend_from_slice(r); | ||
| // No try-version of append_value because it panics on overflow |
There was a problem hiding this comment.
nit: we could improve the comment
|
|
||
| # Mixed binary and text provide actual UTF-8 sequence, not '\xc3a9' as in PostgreSQL | ||
| query T | ||
| SELECT concat(x'c3a9', 'hello'); |
There was a problem hiding this comment.
concat(x'c3a9', 'hello') works and returns éhello, whereas x'c3a9' || 'hello' would errors, should we track this in an issue?
There was a problem hiding this comment.
concat is more capable as in #20787. As per the other comment, concat could be harmonised.
| ---- | ||
| 636166c3a968656c6c6f | ||
|
|
||
| # Byte pipe operator is forbidden for mixed binary and text |
There was a problem hiding this comment.
What about mixed binary?
1. query failed: DataFusion error: Error during planning: Cannot infer common string type for string concat operation Binary || LargeBinary
[SQL] SELECT x'636166c3a9' || arrow_cast(x'68656c6c6f', 'LargeBinary');
at /Users/jeffrey/Code/datafusion/datafusion/sqllogictest/test_files/binary.slt:331Is this intentional?
There was a problem hiding this comment.
No, I slipped it, thanks!. Added unit and SLT tests
| /// # Errors | ||
| /// - Returns an error if the input arrays have different lengths. | ||
| /// - Panics if any concatenated string exceeds `T::Offset::MAX` in length. | ||
| pub fn concat_elements_binary_array<T: OffsetSizeTrait>( |
There was a problem hiding this comment.
We can follow what we do for strings and just use concat_element_binary
datafusion/datafusion/physical-expr/src/expressions/binary.rs
Lines 944 to 951 in 22bb4e6
https://docs.rs/arrow/latest/arrow/compute/kernels/concat_elements/fn.concat_element_binary.html
There was a problem hiding this comment.
Thanks for pointing that out. I am reinventing the wheel. Changed to these implementations, leaving only view concats. Is there any reason we don't have StringViewArray/BinaryViewArray/GenericByteViewArray in arrow-string? PR if so?
There was a problem hiding this comment.
No particular reason I think, just haven't added support for it
There was a problem hiding this comment.
Got it. If you don't mind, I will add it
| SELECT split_part(CAST(binary AS VARCHAR), 'o', 2) FROM t WHERE binary = X'466f6f'; | ||
| ---- | ||
| (empty) | ||
|
|
There was a problem hiding this comment.
Could we add a test with fixedsizebinary too?
Which issue does this PR close?
BYTEA,Binary) concatenation #12709.Rationale for this change
Binary arguments are supported for concat UDFs, but not for the pipe operator (
||), which supports only text.What changes are included in this PR?
concat_elements_binary_view_arraykernelbinary_coercionto support symmetric BinaryLike + BinaryLike - required for the new codeflowConcat UDFs are out of scope and supported separately.
Are these changes tested?
binary.sltAre there any user-facing changes?
Concatenation
||operator now allows binary+binary concatenation (SELECT x'636166c3a9' || x'68656c6c6f'), but denies mixed string+binary concatenationSELECT x'636166c3a9' || 'hello'