Skip to content

feat: support binary arguments for StringConcat operator#21883

Open
theirix wants to merge 5 commits intoapache:mainfrom
theirix:binary-concat-fixes
Open

feat: support binary arguments for StringConcat operator#21883
theirix wants to merge 5 commits intoapache:mainfrom
theirix:binary-concat-fixes

Conversation

@theirix
Copy link
Copy Markdown
Contributor

@theirix theirix commented Apr 27, 2026

Which issue does this PR close?

Rationale for this change

Binary arguments are supported for concat UDFs, but not for the pipe operator (||), which supports only text.

What changes are included in this PR?

  • Support binary concat by providing specialised kernels for pure binary operations. Avoid support of mixed string/binary arguments as it doesn't match the behaviour of major DBs, except for Postgres (see the table in the linked ticket).
  • Add concat_elements_binary_view_array kernel
  • Refactor private binary_coercion to support symmetric BinaryLike + BinaryLike - required for the new codeflow

Concat UDFs are out of scope and supported separately.

Are these changes tested?

  • Existing SLTs
  • Moved a few tests to a more appropriate binary.slt
  • Added new unit tests

Are there any user-facing changes?

Concatenation || operator now allows binary+binary concatenation (SELECT x'636166c3a9' || x'68656c6c6f'), but denies mixed string+binary concatenation SELECT x'636166c3a9' || 'hello'

Binary arguments are supported for concat UDFs,
but not for the pipe operator. Support them
by providing specialized kernels for pure binary
operations. Avoid support of mixed string/binary arguments
as it doesn't match behaviour of major DBs, except of Postgres.
@github-actions github-actions Bot added logical-expr Logical plan and expressions physical-expr Changes to the physical-expr crates sqllogictest SQL Logic Tests (.slt) labels Apr 27, 2026
@theirix theirix marked this pull request as ready for review April 27, 2026 21:26
/// 1. At least one side of lhs and rhs should be string type (Utf8 / LargeUtf8)
/// 2. Data type of the other side should be able to cast to string type
/// 3. Binary and string types cannot be mixed
fn string_concat_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option<DataType> {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before the change, Binary || Utf8 returned a Utf8 string, and Binary || Binary also returned Utf8. Now the first errors out, and the second returns Binary, wouldn't this be a breaking change for users who are hitting this in a query?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is worth noting, as we decided to ban mixing according to this analysis.

I'll add api change label and put an upgrade notice shortly.

I intend to put a separate PR to harmonise concat UDF behaviour (it allows mixing) with the pipe operator's behaviour to avoid confusion

///
/// # Errors
/// - Returns an error if the input arrays have different lengths.
/// - Panics if any concatenated string exceeds `T::Offset::MAX` in length.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shoudl we move it to Panic section? also the kernel below concat_elements_binary_view_array returns an error on overflow while this one panics. Same overflow class, different behavior. Either pick one and apply it to both, or add a comment explaining why they differ.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This went away

Comment on lines +944 to +958
for dt in [DataType::Utf8, DataType::LargeUtf8, DataType::Utf8View] {
assert!(
BinaryTypeCoercer::new(&DataType::Binary, &Operator::StringConcat, &dt,)
.get_input_types()
.is_err(),
"{}",
dt
);
assert!(
BinaryTypeCoercer::new(&dt, &Operator::StringConcat, &DataType::Binary,)
.get_input_types()
.is_err(),
"{}",
dt
);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it only loops over string types paired with DataType::Binary. The coercion rule actually covers 9 combinations (Binary/LargeBinary/BinaryView × Utf8/LargeUtf8/Utf8View). Could you extend the loop to cover all of them? Something like a nested loop over both lists would do it. Right now LargeBinary || Utf8 and BinaryView || LargeUtf8 are not directly tested. wdyt?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. By the way, it tests only coercion, the actual operation concat, as Jeffrey spotted for LargeBinary, is done via slt

buffer.clear();
buffer.extend_from_slice(l);
buffer.extend_from_slice(r);
// No try-version of append_value because it panics on overflow
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we could improve the comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This went away


# Mixed binary and text provide actual UTF-8 sequence, not '\xc3a9' as in PostgreSQL
query T
SELECT concat(x'c3a9', 'hello');
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

concat(x'c3a9', 'hello') works and returns éhello, whereas x'c3a9' || 'hello' would errors, should we track this in an issue?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

concat is more capable as in #20787. As per the other comment, concat could be harmonised.

Comment thread datafusion/sqllogictest/test_files/spark/string/concat.slt Outdated
----
636166c3a968656c6c6f

# Byte pipe operator is forbidden for mixed binary and text
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about mixed binary?

1. query failed: DataFusion error: Error during planning: Cannot infer common string type for string concat operation Binary || LargeBinary
[SQL] SELECT x'636166c3a9' || arrow_cast(x'68656c6c6f', 'LargeBinary');
at /Users/jeffrey/Code/datafusion/datafusion/sqllogictest/test_files/binary.slt:331

Is this intentional?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I slipped it, thanks!. Added unit and SLT tests

/// # Errors
/// - Returns an error if the input arrays have different lengths.
/// - Panics if any concatenated string exceeds `T::Offset::MAX` in length.
pub fn concat_elements_binary_array<T: OffsetSizeTrait>(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can follow what we do for strings and just use concat_element_binary

DataType::Utf8 => Arc::new(concat_elements_utf8(
left.as_string::<i32>(),
right.as_string::<i32>(),
)?),
DataType::LargeUtf8 => Arc::new(concat_elements_utf8(
left.as_string::<i64>(),
right.as_string::<i64>(),
)?),

https://docs.rs/arrow/latest/arrow/compute/kernels/concat_elements/fn.concat_element_binary.html

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing that out. I am reinventing the wheel. Changed to these implementations, leaving only view concats. Is there any reason we don't have StringViewArray/BinaryViewArray/GenericByteViewArray in arrow-string? PR if so?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No particular reason I think, just haven't added support for it

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. If you don't mind, I will add it

SELECT split_part(CAST(binary AS VARCHAR), 'o', 2) FROM t WHERE binary = X'466f6f';
----
(empty)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add a test with fixedsizebinary too?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

logical-expr Logical plan and expressions physical-expr Changes to the physical-expr crates sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Binary string (BYTEA, Binary) concatenation

3 participants