Open
Description
The extracted public signatures can be very large, sometimes including the entire generated code for a full package. However, when translating a Java snippet, we usually need only a small portion of those public signatures.
Approach 1: Use Retrieval-Augmented Generation (RAG)
Steps:
- Split the public AST into smaller units, such as methods or classes.
- Generate embeddings for each unit and store them in a vector database.
- For any Java snippet to be translated:
  - Generate embeddings for that snippet.
  - Search the vector database using a similarity function (e.g., cosine similarity).
  - Return the top N (e.g., 3) most similar classes or methods.
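The steps above can be sketched roughly as follows. This is only an illustration: `embed` is a toy character-trigram embedding standing in for a real embedding model, `SignatureIndex` is a hypothetical in-memory replacement for a vector database, and the sample signatures are made up.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: bag of character trigrams (stand-in for a real model)."""
    cleaned = re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
    padded = f" {cleaned} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse trigram-count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class SignatureIndex:
    """Hypothetical in-memory stand-in for a vector database."""

    def __init__(self) -> None:
        self._entries: list[tuple[str, Counter]] = []

    def add(self, unit: str) -> None:
        # Store each unit (method or class signature) with its embedding.
        self._entries.append((unit, embed(unit)))

    def search(self, snippet: str, top_n: int = 3) -> list[str]:
        # Rank all stored units by similarity to the snippet, keep the top N.
        query = embed(snippet)
        ranked = sorted(
            self._entries,
            key=lambda entry: cosine_similarity(query, entry[1]),
            reverse=True,
        )
        return [unit for unit, _ in ranked[:top_n]]


index = SignatureIndex()
for unit in [
    "int add$1(int a, int b)",
    "void close()",
    "String readLine()",
    "int size()",
]:
    index.add(unit)

results = index.search("int total = calc.add(2, 3);", top_n=2)
print(results)
```

With this toy embedding, the renamed `add$1` still ranks first for a snippet that calls `add`, because the two share most of their trigrams. A real embedding model would be needed to match names that are related only semantically.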
Pros:
- Can detect similar or renamed functions (e.g., `add()` will match `add$1()`, `plus()`, etc.).
- Works even if methods/classes are renamed, as long as the names are semantically similar.
Cons:
- Might fail if the names are changed too much or have no semantic similarity.
Approach 2: Ask an LLM to Extract Used Signatures (Hossein's idea)
Steps:
- Provide the Java snippet to an LLM.
- Ask the LLM to return a list of all used Java method/class signatures in the snippet.
- Search the public AST to retrieve only those relevant parts.
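A minimal sketch of this flow, assuming the public AST has already been indexed by original Java signature. The names `select_context`, `fake_llm`, and the contents of `public_ast` are hypothetical; `fake_llm` returns a canned answer in place of a real LLM call.

```python
from typing import Callable


def select_context(
    snippet: str,
    extract_signatures: Callable[[str], list[str]],
    public_ast: dict[str, str],
) -> tuple[dict[str, str], list[str]]:
    """Keep only the bindings whose signatures the LLM reports as used."""
    found: dict[str, str] = {}
    missing: list[str] = []
    for signature in extract_signatures(snippet):
        key = " ".join(signature.split())  # collapse stray whitespace
        if key in public_ast:
            found[key] = public_ast[key]
        else:
            missing.append(key)  # reported by the LLM but not in the index
    return found, missing


def fake_llm(snippet: str) -> list[str]:
    # Stand-in for a real LLM call (assumption): returns a fixed answer.
    return [
        "public void <init>(java.lang.String string, java.lang.String string1)",
        "public int size()",
    ]


# Hypothetical index: original Java signature -> generated binding text.
public_ast = {
    "public void <init>(java.lang.String string, java.lang.String string1)":
        "<generated binding for the two-String constructor>",
    "public void close()": "<generated binding for close()>",
}

found, missing = select_context(
    'new Example("a", "b").size();', fake_llm, public_ast
)
print(sorted(found), missing)
```

Returning the unmatched signatures separately makes it easy to see when the LLM output did not line up with the index, which is the main risk of this approach.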
Ways to extract signatures from the generated code:
- From comments above each method/field/class:
/// from: `public void <init>(java.lang.String string, java.lang.String string1)`
- From id variables:
static final _id_new$1 = _class.constructorId(
r'(Ljava/lang/String;Ljava/lang/String;)V',
);
- From a structured JSON file (that will be available soon).
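The first two options could be prototyped with regular expressions, as in the sketch below. It assumes the generated code looks exactly like the two examples above (a `/// from:` comment and a `constructorId` call with a single raw string); other id forms may need a different pattern. `parse_descriptor` follows the standard JNI type descriptor encoding, and all function and variable names are hypothetical.

```python
import re

# Matches the original Java signature inside a "/// from: `...`" comment.
FROM_COMMENT = re.compile(r"///\s*from:\s*`([^`]+)`")
# Matches an id variable and the raw-string JNI descriptor passed to it.
ID_VARIABLE = re.compile(
    r"static\s+final\s+(_id_[\w$]+)\s*=\s*_class\.\w+\(\s*r'([^']+)'"
)

PRIMITIVES = {
    "Z": "boolean", "B": "byte", "C": "char", "S": "short", "I": "int",
    "J": "long", "F": "float", "D": "double", "V": "void",
}


def parse_type(descriptor: str, pos: int) -> tuple[str, int]:
    """Decode one JNI type starting at pos; return (java_name, next_pos)."""
    dims = 0
    while descriptor[pos] == "[":  # each '[' adds one array dimension
        dims += 1
        pos += 1
    if descriptor[pos] == "L":  # object type: Lpkg/Name;
        end = descriptor.index(";", pos)
        name = descriptor[pos + 1:end].replace("/", ".")
        pos = end + 1
    else:
        name = PRIMITIVES[descriptor[pos]]
        pos += 1
    return name + "[]" * dims, pos


def parse_descriptor(descriptor: str) -> tuple[list[str], str]:
    """Turn '(Ljava/lang/String;I)V' into (['java.lang.String', 'int'], 'void')."""
    if not descriptor.startswith("("):
        raise ValueError(f"not a method descriptor: {descriptor!r}")
    pos = 1
    params: list[str] = []
    while descriptor[pos] != ")":
        java_type, pos = parse_type(descriptor, pos)
        params.append(java_type)
    return_type, _ = parse_type(descriptor, pos + 1)
    return params, return_type


generated = """\
/// from: `public void <init>(java.lang.String string, java.lang.String string1)`
static final _id_new$1 = _class.constructorId(
  r'(Ljava/lang/String;Ljava/lang/String;)V',
);
"""

from_comments = FROM_COMMENT.findall(generated)
id_variables = ID_VARIABLE.findall(generated)
decoded = [parse_descriptor(desc) for _, desc in id_variables]
print(from_comments, id_variables, decoded)
```

The comment form keeps the method name and parameter names, while the descriptor form carries only types, so the two may be most useful in combination.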
Pros:
- Produces a very compact and precise public context.
Cons:
- More complex logic is required to extract and map the signatures from the generated code.
- The LLM must output every signature exactly, because retrieval relies on exact string matching.
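One possible way to soften the exact-match requirement is to compare normalized keys instead of raw strings. The sketch below is an assumption, not part of the proposal: the hypothetical `normalize_signature` drops modifiers, the return type, and parameter names, leaving only `name(type,type,...)`. It splits parameters on commas, so generic types such as `Map<K, V>` are not handled.

```python
import re


def normalize_signature(signature: str) -> str:
    """Reduce a Java signature to 'name(type,type,...)' for matching."""
    match = re.search(r"([\w$<>.]+)\s*\((.*)\)", signature)
    if match is None:
        raise ValueError(f"no parameter list found in {signature!r}")
    name, raw_params = match.groups()
    param_types: list[str] = []
    for param in raw_params.split(","):
        # Keep the type (first token) and drop the parameter name, if any.
        tokens = [token for token in param.split() if token != "final"]
        if tokens:
            param_types.append(tokens[0])
    return f"{name}({','.join(param_types)})"


from_generated = normalize_signature(
    "public void <init>(java.lang.String string, java.lang.String string1)"
)
from_llm = normalize_signature("<init>(java.lang.String, java.lang.String)")
print(from_generated, from_generated == from_llm)
```

Parameter names like `string1` appear in the generated comments but cannot be known from the Java snippet alone, so matching on name and parameter types may be more forgiving than matching the full string.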