@jhrotko jhrotko commented Oct 22, 2025

What's Changed

This PR simplifies extension type writer creation by moving from a factory-based pattern to a type-based pattern. Instead of passing ExtensionTypeWriterFactory instances through multiple API layers, extension types now provide their own writers via a new getNewFieldWriter() method on ArrowType.ExtensionType.

  • Added getNewFieldWriter(ValueVector) abstract method to ArrowType.ExtensionType
  • Removed ExtensionTypeWriterFactory interface and all implementations
  • Removed factory parameters from ComplexCopier, PromotableWriter, and TransferPair APIs
  • Updated UnionWriter to support extension types (previously threw UnsupportedOperationException)
  • Simplified extension type implementations (UuidType, OpaqueType)

The factory pattern didn't scale well. Each new extension type required creating a separate factory class and passing it through multiple API layers. This was especially painful for external developers who had to maintain two classes per extension type and manage factory parameters everywhere.

The new approach follows the same pattern as MinorType, where each type knows how to create its own writer. This reduces boilerplate, simplifies the API, and makes it easier to implement custom extension types outside arrow-java.

Closes #891.

@jhrotko jhrotko force-pushed the GH-891 branch 2 times, most recently from 67334a6 to 7eba2c1 Compare October 22, 2025 21:09
@jhrotko jhrotko marked this pull request as ready for review October 22, 2025 21:13
jhrotko commented Oct 23, 2025

Hello, @lidavidm! Could you take a look at this PR? Also, I don't have permission to change the label.

@lidavidm lidavidm added the enhancement label Oct 23, 2025
@github-actions github-actions bot added this to the 18.4.0 milestone Oct 23, 2025
@jbonofre
Member

@jhrotko I will take a look at this one as soon as the CI is green (it should be good very soon).

@jhrotko jhrotko requested a review from laurentgo October 30, 2025 09:39
Contributor

@laurentgo laurentgo left a comment

I'm not really familiar with Arrow vectors, to be honest, but I wonder why writers aren't discovered at the same time the extension is registered as a type. Wouldn't that make things simpler from an API/usability perspective?


</#list></#list>

public void copyAsValue(StructWriter writer, ExtensionTypeWriterFactory writerFactory) {

I guess this is okay to remove a public method because there has been no release yet?

Maybe we should discuss it on the mailing list, as it seems we haven't found the right pattern yet.

jhrotko commented Nov 7, 2025

This PR changes how we handle extension type writers in Arrow Java. Instead of using factories that get passed around everywhere, we now let the ArrowType.ExtensionType itself provide the writer implementation. This makes the API simpler and easier to work with, especially if you're implementing custom extension types outside arrow-java.

Problem

In Arrow's type system, each MinorType (INT, FLOAT, VARCHAR, etc.) has its own writer implementation. Extension types are trickier though, because they all share the same MinorType.EXTENSIONTYPE, but each extension type (UUID, Opaque, custom types) needs its own writer implementation. We needed some way to figure out which writer to use for a given extension type.

The previous implementation (commits 34060eb4, 7a7e4edd, 7fe36d70, 8663ffc6) used an ExtensionTypeWriterFactory pattern:

// Usage in ComplexCopier
writer.addExtensionTypeWriterFactory(extensionTypeWriterFactory);
writer.writeExtension(value);

In this pattern, each extension type had a separate factory class (like UuidWriterFactory) that was passed around as a parameter for copy methods. The custom extension writers stored these factories and used them to create the appropriate writer.
The TransferPair interface implementations also needed to carry factories, which polluted otherwise unrelated ValueVector classes such as IntVector.

Why the factory pattern wasn't working well

The factory pattern had several issues that made it difficult to scale, especially if you wanted to mix the extension types shipped with arrow-java with extension types defined outside of it, which is likely to happen more often in the future.

For developers implementing extension types outside of arrow-java, the situation was even more painful. You had to create and manage two separate classes: one for the type itself (MyCustomType extends ExtensionType) and another for the factory (MyCustomWriterFactory implements ExtensionTypeWriterFactory).

The API also got cluttered with factory parameters. Methods like ComplexCopier.copy(reader, writer, extensionTypeWriterFactory), writer.addExtensionTypeWriterFactory(factory), and TransferPair.makeTransferPair(target, factory) all needed these extra parameters. This made the API harder to use and understand.

Finally, the factory pattern created tight coupling between the type definition, the writer implementation, the factory that connects them, and all the code that needs to pass factories around. This made it harder to change any one piece without affecting the others.
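
To make that cost concrete, here is a minimal, self-contained sketch of the shape of the old pattern. All of the types below (OldStyleWriter, OldStyleWriterFactory, FactoryStyleCopier, and the StringBuilder standing in for a vector) are hypothetical stand-ins, not the real arrow-java classes, and the real interface signatures may differ.

```java
// Stand-in types: simplified placeholders, not the real arrow-java classes.
interface OldStyleWriter {
    void write(Object value);
}

// Hypothetical shape of a writer factory; the real interface may differ.
interface OldStyleWriterFactory {
    OldStyleWriter newWriter(StringBuilder target);
}

// Class 1 of 2: the extension type itself. It cannot build a writer on its own.
final class MyCustomType {
    final String name = "my-custom";
}

// Class 2 of 2: a factory that exists only to build the matching writer.
final class MyCustomWriterFactory implements OldStyleWriterFactory {
    @Override
    public OldStyleWriter newWriter(StringBuilder target) {
        return value -> target.append("custom:").append(value).append(';');
    }
}

final class FactoryStyleCopier {
    // Every caller must thread the factory through, even though the
    // type being copied already identifies which writer is needed.
    static void copy(MyCustomType type, Object value, StringBuilder target,
                     OldStyleWriterFactory factory) {
        if (factory == null) {
            throw new IllegalArgumentException("Must provide a writer factory");
        }
        factory.newWriter(target).write(value);
    }
}
```

In this sketch the type parameter carries no behavior at all, which is the coupling problem in miniature: forgetting the factory is a runtime error rather than something the type could prevent.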

The new approach: Let types provide their own writers

I added one abstract method to ArrowType.ExtensionType:

public abstract class ExtensionType extends ArrowType {
    // NEW METHOD
    public abstract FieldWriter getNewFieldWriter(ValueVector vector);

    // Other methods...
}

public class UuidType extends ExtensionType {
    @Override
    public FieldWriter getNewFieldWriter(ValueVector vector) {
        return new UuidWriterImpl((UuidVector) vector);
    }
    
    // Other methods...
}

The new approach is simpler because you only need one class per extension type now, not two. The type knows how to create its own writer. This also means the API is cleaner since there are no more factory parameters cluttering everything. For example, ComplexCopier.copy(reader, writer) and writer.writeExtension(value, type) are much more straightforward, and the type provides the writer internally through extensionType.getNewFieldWriter(vector).

This approach is also consistent with how MinorType already works. The existing pattern for MinorType has each enum constant override getNewFieldWriter() to return its specific writer implementation. Extension types now follow the same pattern:

// MinorType enum (existing pattern)
public enum MinorType {
    INT(new Int(...)) {
        @Override
        public FieldWriter getNewFieldWriter(ValueVector vector) {
            return new IntWriterImpl((IntVector) vector);
        }
    },
    // ...
}

// ExtensionType (new pattern - same idea)
public class UuidType extends ExtensionType {
    @Override
    public FieldWriter getNewFieldWriter(ValueVector vector) {
        return new UuidWriterImpl((UuidVector) vector);
    }
}
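
The MinorType snippet above elides its constructor arguments, so it does not compile on its own. As a runnable illustration of the same Java mechanism (constant-specific method bodies on an enum), here is a small sketch. SketchWriter and SketchMinorType are hypothetical names used only for this example and do not exist in arrow-java.

```java
// Stand-in writer type; not the real arrow-java FieldWriter.
interface SketchWriter {
    String describe();
}

// Each constant supplies its own body for the abstract method,
// so no switch statement or external lookup table is needed.
enum SketchMinorType {
    INT {
        @Override
        SketchWriter newWriter() {
            return () -> "int writer";
        }
    },
    VARCHAR {
        @Override
        SketchWriter newWriter() {
            return () -> "varchar writer";
        }
    };

    // Every constant is forced by the compiler to implement this.
    abstract SketchWriter newWriter();
}
```

Because the method is abstract, adding a new constant without a writer is a compile error, which is the same guarantee an abstract method on ExtensionType gives for subclasses.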

Finally, there's less coupling overall. Writers don't need to store or manage factories anymore, TransferPair implementations are simpler, and the type information just flows naturally through the ArrowType object.

ComplexCopier got simpler

// OLD: Required factory parameter
case EXTENSIONTYPE:
    if (extensionTypeWriterFactory == null) {
        throw new IllegalArgumentException("Must provide ExtensionTypeWriterFactory");
    }
    writer.addExtensionTypeWriterFactory(extensionTypeWriterFactory);
    writer.writeExtension(value);
    break;

// NEW: Type provides the writer
case EXTENSIONTYPE:
    if (reader.isSet()) {
        Object value = reader.readObject();
        if (value != null) {
            writer.writeExtension(value, reader.getField().getType());
        }
    }
    break;
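
For comparison with the factory-based flow, here is a minimal sketch of the type-based flow, assuming simplified stand-in types. SketchValueSink, SketchExtensionType, SketchUuidType, SketchExternalType, and TypeStyleCopier are hypothetical and only mirror the idea of getNewFieldWriter; a StringBuilder stands in for the target vector.

```java
// Stand-in types: simplified placeholders, not the real arrow-java classes.
interface SketchValueSink {
    void write(Object value);
}

abstract class SketchExtensionType {
    // Mirrors the idea of getNewFieldWriter: the type builds its own writer.
    abstract SketchValueSink newWriter(StringBuilder target);
}

// Plays the role of a type shipped with the library.
final class SketchUuidType extends SketchExtensionType {
    @Override
    SketchValueSink newWriter(StringBuilder target) {
        return value -> target.append("uuid:").append(value).append(';');
    }
}

// Plays the role of a type defined by an external project.
final class SketchExternalType extends SketchExtensionType {
    @Override
    SketchValueSink newWriter(StringBuilder target) {
        return value -> target.append("external:").append(value).append(';');
    }
}

final class TypeStyleCopier {
    // No factory parameter: the writer comes from the type that travels with the value.
    static void copy(Object value, SketchExtensionType type, StringBuilder target) {
        if (value != null) { // null values are skipped, as in the NEW branch above
            type.newWriter(target).write(value);
        }
    }
}
```

In this sketch the library-provided type and the external type go through the same copy call with no registration step, which is the mixing scenario the factory approach made awkward.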

jhrotko commented Nov 7, 2025

@lidavidm @xxlaykxx could you also take a look?

Labels

breaking-change, enhancement

Development

Successfully merging this pull request may close these issues.

Add ExtensionTypeWriterFactory to TransferPair