
[native_doc_dartifier] Public Signature Extraction Can Be Too Large #2396

@marshelino-maged


The extracted public signatures can be very large, sometimes including the entire generated code for a full package. However, translating a Java snippet usually needs only a small portion of those signatures.


Approach 1: Use Retrieval-Augmented Generation (RAG)

Steps:

  1. Split the public AST into smaller units, such as methods or classes.
  2. Generate embeddings for each unit and store them in a vector database.
  3. For any Java snippet to be translated:
    • Generate embeddings for that snippet.
    • Search the vector database using a similarity function (e.g., cosine similarity).
    • Return the top N (e.g., 3) most similar classes or methods.
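
A minimal sketch of the retrieval steps above, to make the flow concrete. Everything here is an assumption for illustration: `embed` is a toy character-trigram counter standing in for a real embedding model, the in-memory list stands in for a vector database, and the unit strings are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: counts of character trigrams."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_n(snippet: str, units: list, n: int = 3) -> list:
    """Return the n units most similar to the snippet."""
    query = embed(snippet)
    ranked = sorted(units, key=lambda unit: cosine(query, embed(unit)), reverse=True)
    return ranked[:n]


# Hypothetical units split out of the generated bindings.
units = [
    "int add$1(int a, int b)",
    "void close()",
    "String readLine()",
    "int plus(int a, int b)",
]
best = top_n("int sum = calculator.add(2, 3);", units, n=2)
print(best)
```

Note that a trigram counter only captures surface overlap. Matching `add()` to `plus()` on meaning alone would need a real semantic embedding model, which this sketch does not include.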

Pros:

  • Can detect similar or renamed functions (e.g., add() will match add$1(), plus(), etc.).
  • Works even if methods/classes are renamed, as long as the names are semantically similar.

Cons:

  • Might fail if the names are changed too much or have no semantic similarity.

Approach 2: Ask an LLM to Extract Used Signatures (Hossein's idea)

Steps:

  1. Provide the Java snippet to an LLM.
  2. Ask the LLM to return a list of all used Java method/class signatures in the snippet.
  3. Search the public AST to retrieve only those relevant parts.
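
A rough sketch of step 3, assuming the LLM's answer has already been parsed into a list of strings and that the public signatures are available as a mapping from signature to generated code unit. The function names, the sample mapping, and the LLM output are all hypothetical, and no real LLM call is made.

```python
def normalize(signature: str) -> str:
    """Collapse whitespace so cosmetic differences do not break exact matching."""
    return " ".join(signature.split())


def select_units(used_signatures, signature_to_unit):
    """Split the LLM's signatures into matched code units and unmatched signatures."""
    index = {normalize(sig): unit for sig, unit in signature_to_unit.items()}
    found, missing = [], []
    for sig in used_signatures:
        key = normalize(sig)
        if key in index:
            found.append(index[key])
        else:
            missing.append(sig)
    return found, missing


# Hypothetical index built from the generated code.
signature_to_unit = {
    "public void <init>(java.lang.String string, java.lang.String string1)":
        "static final _id_new$1 = ...",
}

# Hypothetical LLM output; the second entry has no counterpart in the index.
llm_output = [
    "public void  <init>(java.lang.String string,  java.lang.String string1)",
    "public int size()",
]

found, missing = select_units(llm_output, signature_to_unit)
print(found, missing)
```

Returning the unmatched signatures separately lets the caller decide what to do when the LLM output does not match exactly, for example falling back to the full public context.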

Ways to extract signatures from the generated code:

  • From comments above each method/field/class:
      /// from: `public void <init>(java.lang.String string, java.lang.String string1)`
  • From id variables:
      static final _id_new$1 = _class.constructorId(
          r'(Ljava/lang/String;Ljava/lang/String;)V',
      );
  • From a structured JSON file (that will be available soon).
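
The first two options could be prototyped with regular expressions over the generated source. This is only a sketch based on the two shapes shown above. It assumes the `/// from:` comment always wraps the signature in backticks and that the id variable uses `_class.constructorId(` with a raw string as its first argument. Other id kinds are not handled here because their exact shape is not shown in this issue.

```python
import re

# Matches: /// from: `<java signature>`
FROM_COMMENT = re.compile(r"///\s*from:\s*`([^`]+)`")

# Matches: static final _id_xxx = _class.constructorId( r'<descriptor>'
ID_VARIABLE = re.compile(
    r"static\s+final\s+(_id_[\w$]+)\s*=\s*_class\.constructorId\(\s*r'([^']*)'"
)

generated = """
/// from: `public void <init>(java.lang.String string, java.lang.String string1)`
static final _id_new$1 = _class.constructorId(
    r'(Ljava/lang/String;Ljava/lang/String;)V',
);
"""

comments = FROM_COMMENT.findall(generated)  # list of Java signatures
ids = ID_VARIABLE.findall(generated)        # list of (variable name, descriptor)

print(comments)
print(ids)
```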

Pros:

  • Produces a very compact and precise public context.

Cons:

  • More complex logic is required to extract and map the signatures from the generated code.
  • The LLM must output every signature exactly right, because the lookup relies on exact string matching.
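
One possible way to make the exact match less brittle is to compare on types only, by decoding the JVM descriptor stored in the id variables into Java type names. The sketch below follows the standard JVM descriptor format. The function names are made up for illustration, and it is not something the tool currently does.

```python
PRIMITIVES = {
    "Z": "boolean", "B": "byte", "C": "char", "S": "short",
    "I": "int", "J": "long", "F": "float", "D": "double", "V": "void",
}


def parse_type(desc: str, pos: int):
    """Decode one type starting at pos; return (java type name, next position)."""
    dims = 0
    while desc[pos] == "[":  # each '[' adds one array dimension
        dims += 1
        pos += 1
    if desc[pos] == "L":  # object type: Lpkg/Name;
        end = desc.index(";", pos)
        name = desc[pos + 1:end].replace("/", ".")
        pos = end + 1
    else:  # single-letter primitive or void
        name = PRIMITIVES[desc[pos]]
        pos += 1
    return name + "[]" * dims, pos


def parse_method_descriptor(desc: str):
    """Return (parameter types, return type) for a descriptor like '(II)I'."""
    if not desc.startswith("("):
        raise ValueError(f"not a method descriptor: {desc!r}")
    pos = 1
    params = []
    while desc[pos] != ")":
        java_type, pos = parse_type(desc, pos)
        params.append(java_type)
    return_type, _ = parse_type(desc, pos + 1)
    return params, return_type


params, return_type = parse_method_descriptor("(Ljava/lang/String;Ljava/lang/String;)V")
print(params, return_type)
```

With this, the LLM would only need to get the parameter types right, not the parameter names (`string`, `string1`) that appear in the `/// from:` comments. Nested classes keep their `$` separator here and are not converted to dotted names.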
