
Conversation

yuvimittal
Member

Pull Request description

Adds casting numeric type to string type.

#85

How to test these changes

  • ...

Pull Request checklists

This PR is a:

  • bug-fix
  • new feature
  • maintenance

About this PR:

  • it includes tests.
  • the tests are executed on CI.
  • the tests generate log file(s) (path).
  • pre-commit hooks were executed locally.
  • this PR requires a project documentation update.

Author's checklist:

  • I have reviewed the changes and they contain no misspellings.
  • The code is well commented, especially in the parts that contain more
    complexity.
  • New and old tests passed locally.

Additional information

Reviewer's checklist

Copy and paste this template for your review's note:

## Reviewer's Checklist

- [ ] I managed to reproduce the problem locally from the `main` branch
- [ ] I managed to test the new changes locally
- [ ] I confirm that the issues mentioned were fixed/resolved.


OSL ChatGPT Reviewer

NOTE: This is generated by an AI program, so some comments may not make sense.

src/irx/builders/llvmliteir.py

Good progress adding numeric-to-string casting and centralizing sprintf/format-string handling. A few important issues and suggested improvements:

Correctness and safety

  • Returning pointers to stack-allocated buffers: You alloca a 64-byte array and return its i8* (buf_ptr). If the cast result escapes the current function or is stored for later use, this becomes a dangling pointer. If your STRING_TYPE is a managed/string object (not a raw char*), returning i8* is likely wrong as well. Recommendations:
    • Restrict this path to ASCII_STRING_TYPE (if that truly is char* semantics) and raise for STRING_TYPE, or
    • Convert the buffer into your runtime string object (copy into a heap-allocated string) before returning when target_type is STRING_TYPE, or
    • Allocate in a memory space with sufficient lifetime (e.g., heap via a runtime allocator), or insert lifetime.start/lifetime.end intrinsics if you keep stack buffers and ensure uses are strictly local.
  • Use snprintf instead of sprintf. sprintf is unsafe; if future format strings become dynamic or longer, 64 bytes may overflow. Prefer:
    • int snprintf(char* str, size_t size, const char* fmt, ...);
    • Pass the buffer length explicitly to prevent overflow.
  • Integer width/sign handling:
    • Truncating >32-bit integers to i32 will silently corrupt values. For 64-bit integers, pass i64 and use the correct length modifier.
    • Signedness is ignored. You use sext for <32-bit and "%d" unconditionally. If the source is actually unsigned, negative values will be printed incorrectly. LLVM IntType doesn’t encode signedness, so you’ll need language-level type info to choose between "%d"/"%u" and between sext/zext.
    • Suggested approach:
      • If width <= 32: extend to i32 (sext or zext depending on signedness) and use "%d" or "%u".
      • If width <= 64: cast to i64 and use "%lld" or "%llu".
      • For wider ints (i128), either implement "%lld" splitting or route to a runtime helper.
  • Format string naming via Python hash(): Python’s hash() is randomized per process (PYTHONHASHSEED), so names will be non-deterministic. Use a stable scheme:
    • e.g., use hashlib.sha1(fmt.encode()).hexdigest()[:8] or a module-local monotonic counter, and include the length to avoid collisions: f"__fmt.{len(fmt)}.{digest}".
  • Potential type mismatch for STRING_TYPE: You return i8*, not a string object. Ensure target_type fits the produced value. If STRING_TYPE is a different LLVM type, this is a bug.
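
The stable-naming suggestion above can be sketched in plain Python. This is only an illustration, not the PR's code: `format_global_name` is a hypothetical helper, and the `__fmt.<len>.<digest>` layout simply follows the scheme proposed in this review.

```python
import hashlib


def format_global_name(fmt: str) -> str:
    """Return a process-independent symbol name for a format string.

    Hypothetical helper: unlike the built-in hash(), a SHA-1 digest of the
    UTF-8 bytes does not depend on PYTHONHASHSEED.
    """
    digest = hashlib.sha1(fmt.encode("utf-8")).hexdigest()[:8]
    return f"__fmt.{len(fmt)}.{digest}"


if __name__ == "__main__":
    for fmt in ("%d", "%u", "%lld", "%.17g"):
        print(fmt, "->", format_global_name(fmt))
```

An 8-character digest prefix keeps names short but is not collision-proof, so a lookup keyed by the full format string is still the safer way to decide whether a global can be reused.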

API/IR best practices

  • Prefer declaring snprintf over sprintf:
    • size_t depends on target; if you have a SIZE_T_TYPE in your LLVM helpers, use it. Otherwise, derive it from the data layout or use pointer-sized int.
  • Add unnamed_addr = True on the format string globals; it helps the optimizer deduplicate and treat them as address-insensitive constants.
  • Consider adding explicit alignment to the global format strings (e.g., align 1) and optionally to the stack buffer if your target requires it.
  • Use lifetime intrinsics for stack buffers for better optimization:
    • llvm.lifetime.start/end around buf_ptr with size 64.
  • Varargs promotions:
    • Your float path correctly promotes float/half to double. Good.
    • For small ints you manually promote to i32. Good. Fix signedness as noted.

Code structure and style

  • Avoid duplication by factoring helpers:
    • _alloca_cstr_buffer(size: int, name: str) -> (buf, buf_ptr).
    • _get_or_create_format_ptr(fmt: str) -> i8* (return the gep directly), to avoid repeating GEP constants.
  • Magic numbers:
    • Replace 64 with a named constant (e.g., NUM_TO_STR_BUF_SIZE) at class/module level, rather than keeping a function-local INT32_WIDTH constant for the width alongside bare literals for the buffer size.
  • Use consistent constant construction:
    • Consider helpers for common constants (e.g., const_i32(0)) to avoid repeated ir.Constant calls.
  • Docstrings:
    • _get_or_create_format_global: clarify that it returns the array GlobalVariable (not the i8* pointer) and that callers must GEP to element 0.
    • Add a brief comment in the Cast visitor branch documenting the lifetime of the returned pointer and the constraints (local use only vs converted to owning string).

Portability/behavior

  • Locale: "%f" is locale-dependent; if you need locale-invariant formatting, consider using "%f" under C locale or a dedicated runtime helper.
  • Consider supporting precision control or scientific notation for floats as needed.

Illustrative changes (sketch):

  • Use stable format symbol name:
    • name = f"__fmt.{len(fmt)}.{hashlib.sha1(fmt.encode()).hexdigest()[:8]}"
  • Use snprintf declaration:
    • int snprintf(char*, size_t, const char*, ...);
  • Width/signedness aware formatting:
    • For <=32-bit signed: sext to i32 and "%d"
    • For <=32-bit unsigned: zext to i32 and "%u"
    • For <=64-bit: sext to i64 and "%lld" when signed, zext to i64 and "%llu" when unsigned
  • Return proper type for STRING_TYPE by constructing a runtime string object, not a raw i8*.

Tests to add

  • Casting:
    • 8/16/32-bit signed/unsigned extremes to string (including negatives).
    • 64-bit integers near limits.
    • Floats: inf, -inf, NaN, large/small magnitudes.
    • Verify no truncation and correct sign.
  • Lifetime:
    • Ensure no use-after-return if string escapes the block/function; add a test that copies into an owning string.

If you address the width/signedness handling, use snprintf, ensure proper lifetime/ownership, and stabilize the format global naming, this will be robust.



OSL ChatGPT Reviewer

NOTE: This is generated by an AI program, so some comments may not make sense.

src/irx/builders/llvmliteir.py

Thanks for the update—this is a useful addition. A few important correctness and robustness issues to address, plus some style and portability improvements.

Major correctness and safety issues

  • Stack-allocated buffer lifetime: You alloca a fixed-size array and return a pointer to it. If the cast result escapes the current function (e.g., returned, stored in a longer-lived structure, or used after the current function returns), this becomes a dangling pointer and undefined behavior. If your language semantics require the string to outlive the current expression, allocate on the heap or in a managed string object (via runtime) and ensure proper ownership.
  • Use of sprintf with fixed buffer: Using sprintf is unsafe and may overflow the 64-byte buffer for large integers/floats. Prefer snprintf to bound writes. Ideally:
    • First call snprintf(NULL, 0, fmt, ...) to get the exact length needed.
    • Allocate len + 1 bytes (via malloc or language runtime allocator).
    • Call snprintf(buf, len + 1, fmt, ...).
      This also avoids choosing a fixed magic buffer size.
  • Integer width and signedness correctness:
    • Always formatting with "%d" and forcing/truncating to i32 can change values when the original is wider than 32 bits, and it loses unsignedness semantics. For 64-bit ints, use "%lld"/"%llu" and pass i64. For smaller-than-int, promote to 32-bit with the correct extension:
      • Signed small ints: sign-extend to i32 and use "%d".
      • Unsigned small ints: zero-extend to i32 and use "%u".
    • llvmlite IntType has no signedness; you likely need to carry signedness from the source AST/type system to choose between sext/zext and "%d"/"%u".
  • Float formatting and precision:
    • Using "%f" can produce very long strings and is not round-trip safe. Consider "%.17g" for doubles (and "%.9g" for floats, since 7 significant digits are not enough to round-trip a 32-bit float) to balance precision and length. At minimum, allow a configurable precision.
  • Varargs promotion:
    • You correctly promote float/half to double. For integers, ensure small widths are promoted to int (i32) as you do, but also select the correct format string/type for >32-bit integers as above.
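
The float-formatting point can be checked without any IR. The sketch below uses Python's `%` operator as a stand-in for the C formatter, which is an assumption that holds for the plain "%f" and "%g" conversions used here; the helper names are hypothetical.

```python
import struct


def roundtrips(fmt: str, value: float) -> bool:
    """True if formatting with `fmt` and parsing back recovers `value` exactly."""
    return float(fmt % value) == value


def as_float32(value: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def roundtrips32(fmt: str, value: float) -> bool:
    """Round-trip check for a value that is already exactly a binary32 number."""
    return as_float32(float(fmt % value)) == value


if __name__ == "__main__":
    x = 0.1 + 0.2
    print("%f" % x, roundtrips("%f", x))        # six decimals lose information
    print("%.17g" % x, roundtrips("%.17g", x))  # 17 significant digits do not
    # "%f" on the largest double is far longer than a 64-byte buffer.
    print(len("%f" % 1.7976931348623157e308))
    y = as_float32(1.0 + 2**-23)                # next binary32 value after 1.0
    print(roundtrips32("%.7g", y), roundtrips32("%.9g", y))
```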

Portability and linkage

  • Prefer snprintf over sprintf to avoid buffer overflow risks. You’ll need a declaration for snprintf:
    • int snprintf(char *str, size_t size, const char *fmt, ...);
    • Choose size_t type per target data layout (32-bit vs 64-bit). If you have a SIZE_T_TYPE in your LLVM wrapper, use that; otherwise derive from target data (pointer size).
  • On Windows/MSVC, symbol differences can matter. Declaring snprintf is generally portable, but be mindful if you later need _snprintf or other platform-specific adjustments.

Global format string creation

  • Deterministic global names: abs(hash(fmt)) is not stable across Python processes and may lead to non-deterministic modules and potential collisions. Use a stable hash:
    • import hashlib; digest = hashlib.sha1(fmt.encode("utf8")).hexdigest()[:12]; name = f"_fmt{digest}"
  • Consider setting unnamed_addr = True to allow the optimizer to merge identical constants:
    • gv.unnamed_addr = True
  • Minor: The docstring says “Create a constant global format string” but method name is _get_or_create... Consider clarifying docstring to reflect the “get or create” behavior.

Alloca placement and optimization

  • It’s usually best practice to place allocas for fixed-size buffers in the function entry block (helps optimizers and avoids dynamic stack pointer movement mid-block). If you have a helper to create allocas at entry, use it. Otherwise, temporarily set the insertion point to the entry block for these allocas and restore it afterward.

Type matching for returned value

  • You append buf_ptr (i8*). Ensure this matches the expected LLVM type of ASCII_STRING_TYPE/STRING_TYPE. If those are different pointer types, insert a bitcast to the exact target type to avoid mismatches.

Code duplication and style

  • Factor common code between integer and float branches:
    • Helper to get i8* pointer to the first element of an alloca’d array (GEP 0,0).
    • Helper to get constant i32 0 to avoid repeating ir.Constant(self._llvm.INT32_TYPE, 0).
    • Single helper that takes (fmt, value, value_ir_type) and handles promotion and call.
  • Replace magic number 64 with a named constant (e.g., DEFAULT_NUM_TO_STR_BUF = 64) if you keep the fixed buffer approach (though snprintf-based sizing is strongly recommended).
  • Consider extending float handling to other types if relevant (e.g., FP128/X86FP80) or explicitly document unsupported types.

Suggested code adjustments (sketch)

  • Stable format globals:
    • Use hashlib to generate deterministic names.
    • gv.unnamed_addr = True
  • snprintf declaration:
    • Use size_t from target layout; set var_arg=True.
  • Safe conversion flow:
    • For ints:
      • Choose format and value width dynamically:
        • width <= 32: promote to i32 (sext/zext based on signedness), fmt "%d"/"%u".
        • width <= 64: zext/sext to i64 (as needed), fmt "%lld"/"%llu".
      • If width > 64: raise or route to BigInt formatting if supported.
    • For floats:
      • Promote to double.
      • Use "%.17g" (or configurable).
    • Size calculation + allocation:
      • len = snprintf(NULL, 0, fmt, value_promoted)
      • buf = malloc(len + 1)
      • snprintf(buf, len + 1, fmt, value_promoted)
      • Push buf as the result
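
The two-pass sizing flow above can be modelled in plain Python to show the contract it relies on. This is a sketch under stated assumptions: `snprintf_model` is a hypothetical stand-in for C `snprintf` (it returns the would-be length and writes at most `size - 1` bytes plus a NUL), and Python's `%` operator plays the role of the C formatter.

```python
def snprintf_model(size: int, fmt: str, *args) -> tuple[bytes, int]:
    """Model of C snprintf: return (bytes written incl. NUL, would-be length)."""
    full = (fmt % args).encode("ascii")
    if size == 0:
        return b"", len(full)  # like snprintf(NULL, 0, ...): nothing is written
    return full[: size - 1] + b"\0", len(full)


def format_exact(fmt: str, *args) -> bytes:
    """Two-pass sizing: measure first, then format into exactly len + 1 bytes."""
    _, needed = snprintf_model(0, fmt, *args)               # pass 1: length only
    buf, written = snprintf_model(needed + 1, fmt, *args)   # pass 2: real write
    assert written == needed
    return buf


if __name__ == "__main__":
    # A fixed 64-byte buffer truncates a large "%f" result; n >= size reports it.
    buf, n = snprintf_model(64, "%f", 1e300)
    print(len(buf), n, n >= 64)
    print(format_exact("%d", -42))
```

In generated IR the second buffer would come from malloc or a runtime allocator, so ownership and freeing still have to be decided by the language runtime.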

Example changes for _get_or_create_format_global:

  • Use stable hash and unnamed_addr:
    • name = f"_fmt{hashlib.sha1(fmt.encode('utf8')).hexdigest()[:12]}"
    • gv.unnamed_addr = True

Error handling

  • Improve exception messages to include whether integer is unsigned/signed and width (if available), so debugging is easier.

Tests to add

  • Casting various integer widths (8/16/32/64, signed/unsigned) to string; verify correct output and no truncation/sign issues.
  • Float edge cases (NaN, inf, very large/small values); precision expectations.
  • Ensure no buffer overflow with very long numbers if still using fixed buffer (or switch to snprintf-based sizing which eliminates this).
  • Verify lifetime: result remains valid after call sites where the casted string is consumed.

Overall, the direction is good, but please fix the lifetime and safety concerns (sprintf + stack buffer) and handle width/signedness correctly for integers. Once those are addressed, the implementation will be much more robust and portable.


@yuvimittal yuvimittal marked this pull request as draft September 25, 2025 12:06
@yuvimittal yuvimittal force-pushed the feature-castingIntToString branch from 760fdb7 to e98401d Compare September 29, 2025 08:08
@yuvimittal yuvimittal marked this pull request as ready for review September 29, 2025 09:11

OSL ChatGPT Reviewer

NOTE: This is generated by an AI program, so some comments may not make sense.

src/irx/builders/llvmliteir.py

No content returned by model.


src/irx/system.py

No content returned by model.


tests/test_cast.py

No content returned by model.


@xmnlab xmnlab marked this pull request as draft September 29, 2025 10:59
@xmnlab xmnlab marked this pull request as ready for review September 29, 2025 10:59

OSL ChatGPT Reviewer

NOTE: This is generated by an AI program, so some comments may not make sense.

src/irx/builders/llvmliteir.py

No content returned by model.


src/irx/system.py

  • Broadening message from LiteralUTF8String to AST is too permissive and can admit non-expression nodes (e.g., statements), causing runtime failures when constructing a print call and increasing attack surface if arbitrary nodes are accepted. Restrict to astx.Expr and fail fast with a runtime check. (L.21, L.24)

Suggested change:
    def __init__(self, message: astx.Expr) -> None:
        """Validate and store the expression to be printed."""
        if not isinstance(message, astx.Expr):
            raise TypeError(f"PrintExpr.message must be an astx.Expr, got {type(message).__name__}")
        self.message = message

  • Align the attribute annotation with the constructor to maintain invariants. (L.21)

Suggested change:

    # message node to be printed
    message: astx.Expr


tests/test_cast.py

No content returned by model.


@xmnlab
Contributor

xmnlab commented Sep 29, 2025

@yuvimittal, thanks for working on that. I will review it later today.

In the meantime, this is the output from the AI reviewer executed locally.
I hope that next time it will work on CI. Please rebase your branch on top of upstream main so we can check the results here as well.

src/irx/builders/llvmliteir.py

  • Critical: The cast-to-string returns a pointer to a stack-allocated 64-byte buffer (alloca) as the string value. If that pointer escapes the expression (e.g., stored in a variable, returned, or captured), it becomes dangling after function return. You likely need an owned/heap-managed string or a runtime helper that allocates and returns an owned buffer. Apply fix in both int and float branches (e.g., replace stack buffer with a heap/RT-allocated buffer) (L.1575, L.1630).

  • Security/Correctness: Using sprintf on a fixed 64-byte buffer risks overflow for large values; replace with snprintf and pass the buffer length, and handle truncation or size reallocation if needed (L.1604, L.1646).

  • Correctness: For integers you always normalize to i32 and print with "%d". This truncates i64 (or wider) and misrepresents unsigned values. Choose format specifier and cast based on bit-width (and signedness if your type system distinguishes it), e.g., i64 -> "%lld" and extend/trunc accordingly; unsigned -> "%u"/"%llu" and zext instead of sext (L.1589-L.1603).

  • Correctness: _get_or_create_format_global uses abs(hash(fmt)) for naming. Python hash is salted per process and can collide across different format strings, causing incorrect reuse. Use a deterministic map from fmt->GlobalVariable or a monotonic counter stored on the builder (L.1478).

  • Potential breaking change: Allocating "string" as "stringascii" silently changes the representation. If STRING_TYPE is not ABI-compatible with ASCII_STRING_TYPE, this will break existing code paths. Consider inserting an explicit conversion or asserting equivalence to avoid subtle bugs (L.1455).
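
The "deterministic map or monotonic counter" alternative mentioned above could look roughly like this. `FormatGlobalNames` is a hypothetical registry, not existing IRx code; a real builder would store the created GlobalVariable instead of just its name.

```python
import itertools


class FormatGlobalNames:
    """Hypothetical per-builder registry mapping each format string to one name."""

    def __init__(self, prefix: str = "__fmt") -> None:
        self._prefix = prefix
        self._counter = itertools.count()
        self._names: dict[str, str] = {}

    def name_for(self, fmt: str) -> str:
        """Return the existing name for `fmt`, or allocate the next one."""
        if fmt not in self._names:
            self._names[fmt] = f"{self._prefix}.{next(self._counter)}"
        return self._names[fmt]


if __name__ == "__main__":
    names = FormatGlobalNames()
    print(names.name_for("%d"), names.name_for("%lld"), names.name_for("%d"))
```

Because the lookup key is the full format string, two different formats can never share a global, and names depend only on the order in which formats are first requested.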


src/irx/system.py

  • Broadening message from LiteralUTF8String to AST allows non-expression nodes (e.g., stmt/Module), which can later crash or generate invalid code. Restrict to expressions and validate at runtime. (L.18, L.21–22)
    def __init__(self, message: astx.Expr) -> None:
        """Initialize the PrintExpr."""
        if not isinstance(message, astx.Expr):
            raise TypeError("message must be an expression node")
        self.message = message
  • This change also introduces a behavioral/security shift: non-literal expressions can now be evaluated with side effects. If this isn’t intended, either keep literals only or add explicit gating/documentation. At minimum, update the attribute annotation to astx.Expr (not astx.AST) to avoid accepting statements. (L.18, L.21)

tests/test_cast.py

LGTM!

@xmnlab
Contributor

xmnlab commented Sep 29, 2025

I didn't have time to read the output from the AI reviewer, so be careful with it; some of the suggestions may be wrong.

@yuvimittal
Member Author

OSL ChatGPT Reviewer

NOTE: This is generated by an AI program, so some comments may not make sense.

src/irx/builders/llvmliteir.py

No content returned by model.

src/irx/system.py

  • Broadening message from LiteralUTF8String to AST is too permissive and can admit non-expression nodes (e.g., statements), causing runtime failures when constructing a print call and increasing attack surface if arbitrary nodes are accepted. Restrict to astx.Expr and fail fast with a runtime check. (L.21, L.24)

Suggested change:
    def __init__(self, message: astx.Expr) -> None:
        """Validate and store the expression to be printed."""
        if not isinstance(message, astx.Expr):
            raise TypeError(f"PrintExpr.message must be an astx.Expr, got {type(message).__name__}")
        self.message = message

  • Align the attribute annotation with the constructor to maintain invariants. (L.21)

Suggested change:

    # message node to be printed
    message: astx.Expr

tests/test_cast.py

No content returned by model.

These are the comments I got, and I made the changes accordingly. It was a small issue: I changed the annotation of message from astx.AST to astx.Expr.

@yuvimittal yuvimittal marked this pull request as draft September 30, 2025 08:16

OSL ChatGPT Reviewer

NOTE: This is generated by an AI program, so some comments may not make sense.

src/irx/builders/llvmliteir.py

  • Critical: You’re returning a pointer to a stack-allocated buffer from a numeric-to-string cast. The alloca’d 64-byte array will be invalid after the function returns, leading to UAF if the string escapes (L.1589, L.1631). Allocate on the heap or use a language/runtime string object.

  • Critical: Using sprintf into a fixed 64-byte stack buffer can overflow for large integers/floats, causing memory corruption (L.1586, L.1628). Use snprintf and size-based allocation.

  • Correctness: 64-bit integers are truncated to 32-bit before formatting with "%d", producing wrong results (L.1577-L.1584). Choose width-appropriate formats and argument types (e.g., "%lld" with i64).

  • Correctness: Format-global name uses abs(hash(fmt)) which is non-deterministic across processes and can collide, returning the wrong format string if two different formats hash to the same value (L.1473). Use a deterministic collision-resistant scheme.

  • Potential type mismatch: You store an initializer typed as “string” into an alloca typed as “stringascii” (L.1456-L.1463). Ensure the initializer and the alloca element type are identical to avoid IR verifier errors.

Suggested changes:

  • Replace sprintf/buffer logic with width-safe, heap-backed snprintf. Example helpers:
    def _create_snprintf_decl(self) -> ir.Function:
        """Declare the external snprintf function (varargs)."""

    def _malloc_i8(self, size: ir.Value) -> ir.Value:
        """malloc size bytes and return i8*."""

    def _num_to_cstring(self, value: ir.Value, is_float: bool) -> ir.Value:
        """Format numeric value with snprintf to a heap buffer and return i8*."""

Apply in the int/float cast branches to:

  • First call snprintf(NULL, 0, fmt, ...) to get required length.

  • malloc(n + 1).

  • Call snprintf(buf, n + 1, fmt, ...).

  • Return buf (i8*), not a stack pointer. (L.1567-L.1640)

  • Handle integer widths correctly:
    def _int_fmt_and_arg(self, val: ir.Value) -> tuple[str, ir.Value]:
        """Return printf-style format and properly extended arg for val."""
    Use "%d" + i32 for <=32-bit, "%lld" + i64 for >32-bit. (L.1577-L.1584)

  • Make format global naming deterministic and collision-free:
    def _get_or_create_format_global(self, fmt: str) -> ir.GlobalVariable:
        """Create a constant global format string."""
        import hashlib
        name = "fmt_" + hashlib.sha1(fmt.encode("utf8")).hexdigest()  # (L.1473)

  • Ensure initializer type matches storage for InlineVariableDeclaration:
    def visit(self, node: astx.InlineVariableDeclaration) -> None:
        """Fix initializer type to match storage type."""
        storage_type_str = "stringascii" if type_str == "string" else type_str
        init_val = ir.Constant(self._llvm.get_data_type(storage_type_str), 0)
        alloca = self.create_entry_block_alloca(node.name, storage_type_str)
        # ... (L.1452-L.1463)


src/irx/system.py

  • This widens the accepted type and likely breaks any downstream code that accessed message as a LiteralUTF8String (e.g., using .value). To avoid runtime AttributeError and keep consistent semantics, normalize non-literal inputs to a string-producing expression and never rely on literal-only APIs. Replace the direct assignment with a normalizer (L.22):

    def _normalize_message(self, message: astx.Expr) -> astx.Expr:
        """Coerce non-string expressions into a string-producing expression."""
        if isinstance(message, astx.LiteralUTF8String):
            return message
        # Ensure downstream never expects .value from a literal
        return astx.Call(func=astx.Name(id="str"), args=[message], keywords=[])

    def __init__(self, message: astx.Expr) -> None:
        """Initialize the PrintExpr."""
        self.message = self._normalize_message(message)
        self._name = f"print_msg_{next(PrintExpr._counter)}"

  • If any consumer still needs to know when the message is a literal, expose a safe accessor to prevent type errors (L.15):

    @property
    def message_literal(self) -> typing.Optional[astx.LiteralUTF8String]:
        """Return the message if it is a literal; otherwise None."""
        return self.message if isinstance(self.message, astx.LiteralUTF8String) else None


tests/test_cast.py

  • Potential flakiness: exact stdout assertions for floats are locale- and implementation-dependent. If PrintExpr uses the C library, decimal separator and precision can vary. Consider forcing the C locale in the test to stabilize output (e.g., after module creation) or make the assertion format-agnostic. For example:
    • Add at the start of each test: import locale; locale.setlocale(locale.LC_ALL, "C") (L.128, L.177).
  • Verify whether PrintExpr appends a newline. If it does, expected_output should include it to avoid brittle comparisons: "42\n" and "42.000000\n" (L.156, L.205).

@yuvimittal yuvimittal force-pushed the feature-castingIntToString branch from d935088 to 16bcee1 Compare September 30, 2025 08:17
@xmnlab xmnlab marked this pull request as ready for review September 30, 2025 21:39

OSL ChatGPT Reviewer

NOTE: This is generated by an AI program, so some comments may not make sense.

src/irx/builders/llvmliteir.py

  • Security/correctness: Using sprintf into a fixed 64-byte stack buffer can overflow for floats with large magnitude when using "%f" (can emit hundreds of digits) and will silently truncate 64-bit integers to 32-bit (you truncate/sext to i32 for "%d"). Replace with snprintf and pass the buffer length; also use a safer format for floats to bound output length (e.g., "%.17g"). (L.1468, L.1559, L.1624)
  • Correctness: Casting integers wider than 32 bits to string currently truncates to 32 bits. Use 64-bit formatting when width > 32 (e.g., "%lld"/"%llu") and avoid lossy truncation. Also consider unsigned vs signed handling; sign-ext is wrong for unsigned. (L.1547-L.1558)
  • Correctness/lifetime: You return a pointer to a stack-allocated buffer (alloca) as the string value. If the result escapes the expression (stored or returned), this becomes a dangling pointer. For managed STRING_TYPE, you should materialize a runtime-managed string and copy the buffer into it. For ASCII_STRING_TYPE, ensure the pointer never escapes or allocate on the heap. (L.1565, L.1630)

Suggested changes:

  • Replace sprintf with snprintf and safer float formatting:
    def _create_snprintf_decl(self) -> ir.Function:
        """Declare (or return) the external snprintf function (varargs)."""
        name = "snprintf"
        if name in self._llvm.module.globals:
            return self._llvm.module.get_global(name)
        size_t_ty = ir.IntType(64)
        snprintf_ty = ir.FunctionType(
            self._llvm.INT32_TYPE,
            [
                self._llvm.INT8_TYPE.as_pointer(),
                size_t_ty,
                self._llvm.INT8_TYPE.as_pointer(),
            ],
            var_arg=True,
        )
        fn = ir.Function(self._llvm.module, snprintf_ty, name=name)
        fn.linkage = "external"
        return fn
    (L.1468)

    • Use "%.17g" for floats and pass the buffer length constant when calling:
      buf_len = ir.Constant(ir.IntType(64), 64)
      self._llvm.ir_builder.call(snprintf, [buf_ptr, buf_len, fmt_ptr, value_prom])
      (L.1624)
  • Avoid lossy int-to-i32 conversion; pick format by width (and signedness if available):
    def _int_to_string(self, value: ir.Value) -> tuple[ir.Value, ir.Value]:
        """Return (fmt_ptr, casted_value) for integer-to-string formatting."""
        width = value.type.width
        if width <= 32:
            fmt_gv = self._get_or_create_format_global("%d")
            casted = self._llvm.ir_builder.sext(value, self._llvm.INT32_TYPE)
        elif width <= 64:
            fmt_gv = self._get_or_create_format_global("%lld")
            casted = self._llvm.ir_builder.sext(value, ir.IntType(64))
        else:
            raise Exception("Integer width > 64 bits not supported for string cast")
        fmt_ptr = self._llvm.ir_builder.gep(
            fmt_gv, [ir.Constant(self._llvm.INT32_TYPE, 0), ir.Constant(self._llvm.INT32_TYPE, 0)], inbounds=True
        )
        return fmt_ptr, casted
    (L.1534)

    • Use the helper above instead of trunc/sext to i32 and "%d". (L.1547-L.1558)
  • Prevent dangling stack pointers for STRING_TYPE by constructing a managed string:
    def _string_from_cstr(self, cstr_ptr: ir.Value) -> ir.Value:
        """Wrap a C string (i8*) into a runtime-managed STRING_TYPE."""
        # Example: declare and call external runtime helper: irx_string_from_cstr(i8*) -> STRING_TYPE
        fn = self._llvm.get_or_declare_runtime_fn(
            "irx_string_from_cstr",
            ir.FunctionType(self._llvm.STRING_TYPE, [self._llvm.INT8_TYPE.as_pointer()]),
        )
        return self._llvm.ir_builder.call(fn, [cstr_ptr])
    (L.1565)

    • In the cast branch, when target is STRING_TYPE, call _string_from_cstr(buf_ptr) and push that instead of buf_ptr. (L.1565, L.1630)

src/irx/system.py

  • Broadening message to astx.Expr is a behavior change that can introduce side effects at print time and may break any downstream codegen that assumed a literal (e.g., accessing .value or hoisting into a constant via _name). If non-literals are ever hoisted or cached, this can change evaluation timing and duplicate side effects.

  • To make this safe and explicit for codegen, track whether the message is a literal and avoid treating non-literals as constants. Suggested changes:

    • After assignment in init, add a literal flag and only assign a constant name for literals (L.23–L.24):
      def __init__(self, message: astx.Expr) -> None:
          """Initialize the PrintExpr."""
          self.message = message
          self._is_literal: bool = isinstance(message, astx.LiteralUTF8String)
          self._name: str | None = f"print_msg_{next(PrintExpr._counter)}" if self._is_literal else None

    • Add a helper to guide codegen (place inside the class after init) (L.27):
      def needs_runtime_eval(self) -> bool:
          """Return True if the message is not a literal and must be evaluated at runtime."""
          return not self._is_literal

  • If any existing code assumes the message is a literal (e.g., uses message.value), add an assertion to fail fast (L.23):
    def __init__(self, message: astx.Expr) -> None:
        """Initialize the PrintExpr."""
        self.message = message
        if hasattr(self.message, "value") and not isinstance(self.message, astx.LiteralUTF8String):
            raise TypeError("PrintExpr.message was expected to be a LiteralUTF8String in this context")


tests/test_cast.py

  • Potential mismatch in test oracle: prior tests used expected_output to assert the program exit code; these new tests rely on stdout while returning 0. If check_result still interprets expected_output as exit code, these will fail or be misleading. Consider updating check_result to accept explicit expected_stdout and expected_exit_code (or ensure it prioritizes stdout when present) to avoid ambiguity and future regressions.
  • Be careful with trailing newline/whitespace: PrintExpr may emit a newline. If check_result compares raw stdout, assert accordingly or normalize output to avoid brittle failures.
  • Float formatting stability: asserting exactly "42.000000" can be locale/implementation-dependent. If formatting relies on C/printf defaults, this could be flaky. Prefer enforcing a deterministic, locale-independent format in the implementation or relax the assertion (e.g., normalized formatting or a regex match).
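
One way to relax the stdout assertions described above is to normalise the output before comparing. The helpers below are hypothetical and independent of the repository's `check_result`; they only illustrate a newline-tolerant comparison and a numeric comparison that accepts either decimal separator.

```python
import re

# Matches an optional sign, digits, a '.' or ',' separator, and more digits.
_DECIMAL = re.compile(r"^-?\d+[.,]\d+$")


def stdout_matches(actual: str, expected: str) -> bool:
    """Compare program output while ignoring trailing newlines and whitespace."""
    return actual.rstrip() == expected.rstrip()


def float_output_matches(actual: str, expected: float, tol: float = 1e-9) -> bool:
    """Accept '.' or ',' as the decimal separator and compare numerically."""
    text = actual.strip()
    if not _DECIMAL.match(text):
        return False
    return abs(float(text.replace(",", ".")) - expected) <= tol


if __name__ == "__main__":
    print(stdout_matches("42\n", "42"))
    print(float_output_matches("42,000000\n", 42.0))
```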

Contributor

@xmnlab xmnlab left a comment


@yuvimittal , I am still reviewing your PR, but just sharing a few notes here for now.

init_val = ir.Constant(self._llvm.get_data_type(type_str), 0)

alloca = self.create_entry_block_alloca(node.name, type_str)
if type_str == "string":
Contributor


is this really necessary?

could we just keep the previous code?

alloca = self.create_entry_block_alloca(node.name, type_str)

why do we need to make a special case for string?


self.result_stack.append(init_val)

def _create_sprintf_decl(self) -> ir.Function:
Contributor


@yuvimittal, to be honest I don't have expertise in this part, so I checked it with GPT, and the results make sense. Please take a look at the review:


the risk isn’t in the declaration itself, but in how you use it. Declaring sprintf is fine, but calling it on a fixed stack buffer is unsafe and the review is right about two separate issues:

  1. Buffer overflow with %f on large-magnitude floats (unbounded output).
  2. Width mismatch when printing 64-bit integers with "%d" (32-bit) and/or promoting args incorrectly for C varargs.

Below is a tight, actionable plan for IRx with llvmlite.

What to change

1) Declare and use snprintf, not sprintf

Add a declaration for snprintf and switch all formatting call sites to it.

def _create_snprintf_decl(self) -> ir.Function:
    """Declare (or return) the external snprintf (varargs)."""
    name = "snprintf"
    if name in self._llvm.module.globals:
        return self._llvm.module.get_global(name)

    # int snprintf(char *str, size_t size, const char *fmt, ...);
    size_t_ty = getattr(self._llvm, "SIZE_T_TYPE", None)
    if size_t_ty is None:
        # Fallback: assume LP64/LLP64 by pointer width
        size_t_ty = (
            self._llvm.INT64_TYPE if self._llvm.POINTER_BITS == 64
            else self._llvm.INT32_TYPE
        )

    snprintf_ty = ir.FunctionType(
        self._llvm.INT32_TYPE,
        [
            self._llvm.INT8_TYPE.as_pointer(),  # char *str
            size_t_ty,                           # size_t size
            self._llvm.INT8_TYPE.as_pointer(),  # const char *fmt
        ],
        var_arg=True,
    )
    fn = ir.Function(self._llvm.module, snprintf_ty, name=name)
    fn.linkage = "external"
    return fn

Then, wherever you previously called sprintf(buf, fmt, …), do:

snprintf(buf, buf_len, fmt, …)

and check the return value (n >= buf_len means truncation).
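The return-value contract can be illustrated with a small pure-Python model. This is not a binding to the C function: `snprintf_model` and `format_checked` are hypothetical names, and Python's `%` operator stands in for the C formatter. The model only mirrors the rule that at most `cap - 1` characters are stored and that the return value is the length the full output would have needed.

```python
from typing import Tuple


def snprintf_model(cap: int, fmt: str, *args: object) -> Tuple[str, int]:
    """Model C snprintf: return (stored text, length the full output needs)."""
    full = fmt % args
    # C stores at most cap - 1 characters plus the terminating NUL and
    # returns the length the untruncated output would have had.
    stored = full[: max(cap - 1, 0)]
    return stored, len(full)


def format_checked(cap: int, fmt: str, *args: object) -> str:
    """Format into `cap` bytes, retrying once with an exact-size buffer."""
    text, n = snprintf_model(cap, fmt, *args)
    if n >= cap:  # truncated: n + 1 bytes are needed, including the NUL
        text, n = snprintf_model(n + 1, fmt, *args)
    return text
```

In generated IR the same check would be a comparison of the call result against the buffer capacity, followed by either an error path or a second call with a larger buffer.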

2) Use width-correct format specifiers and ABI-correct promotions

When building varargs calls in LLVM IR you must mirror C’s default promotions:

  • i8/i16 → sign-extend or zero-extend to i32 (match signedness).
  • float → fpext to double.
  • i32 → pass as i32.
  • i64 → pass as i64, and use "%lld"/"%llu" (portable across LP64/LLP64).
  • Pointers/strings: pass as i8*.

Map your format strings to the IR types you’re passing. For 64-bit ints do not use "%ld" (breaks on Windows LLP64); prefer "%lld"/"%llu" universally.
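The mapping in this step can be written down as a small lookup. This is a sketch under stated assumptions: `VarArg` and `plan_vararg` are hypothetical names, IR types are represented as plain strings rather than llvmlite objects, and `"%.17g"` is chosen for doubles to match the bounded specifier recommended in the next step.

```python
from typing import NamedTuple


class VarArg(NamedTuple):
    cast: str  # LLVM cast to apply before the call ("" means pass unchanged)
    ir_type: str  # IR type actually passed to the variadic callee
    spec: str  # matching printf conversion


def plan_vararg(ir_type: str, signed: bool = True) -> VarArg:
    """Return the promotion and conversion for one printf-style argument."""
    if ir_type in ("i8", "i16"):
        # Narrow integers are promoted to int, preserving signedness.
        cast = "sext" if signed else "zext"
        return VarArg(cast, "i32", "%d" if signed else "%u")
    if ir_type == "i32":
        return VarArg("", "i32", "%d" if signed else "%u")
    if ir_type == "i64":
        # "%lld"/"%llu" work on both LP64 and LLP64 targets.
        return VarArg("", "i64", "%lld" if signed else "%llu")
    if ir_type == "float":
        # C promotes float to double in variadic calls.
        return VarArg("fpext", "double", "%.17g")
    if ir_type == "double":
        return VarArg("", "double", "%.17g")
    if ir_type == "i8*":
        return VarArg("", "i8*", "%s")
    raise ValueError(f"unsupported vararg type: {ir_type}")
```

Keeping the cast and the conversion in one place makes it harder for a format string and the argument actually passed to drift apart.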

3) Bound float output

Prefer a bounded specifier for doubles to avoid huge strings:

  • General: "%.17g" (round-trip-safe for double, compact, bounded).
  • If fixed decimals are desired: "%.6f" (or configurable precision).

This mitigates pathological lengths even with snprintf, and reduces truncation.
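The length difference is easy to check outside the compiler. The sketch below uses Python's `%` operator, which follows printf conventions for these conversions; it is only an approximation of what a given C library prints (locale handling, for example, is not modeled), and `formatted_len`, `BUF_CAP`, and `MAX_G17` are names introduced here for illustration.

```python
def formatted_len(fmt: str, value: float) -> int:
    """Length of `value` rendered with a printf-style conversion."""
    return len(fmt % value)


BUF_CAP = 64  # capacity of the stack buffer discussed in this review

# "%f" expands the whole integer part, so large magnitudes exceed 64 bytes.
assert formatted_len("%.6f", 1e308) > BUF_CAP

# "%.17g" is bounded: sign, 17 digits, '.', 'e', exponent sign, 3 digits.
MAX_G17 = 1 + 17 + 1 + 1 + 1 + 3
for v in (1e308, -1.7976931348623157e308, 1e-308, 5e-324, 0.1, 42.0):
    assert formatted_len("%.17g", v) <= MAX_G17 < BUF_CAP
```

With the bounded conversion, a 64-byte buffer always has room; with `"%.6f"` it only does for moderate magnitudes, so the snprintf return value still has to be checked.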

4) Consider a tiny C runtime helper (optional but cleanest)

If you need an exact-sized string (no truncation and no guesswork), add a
runtime function you link against and call from IR:

// irx_runtime.c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

char* irx_format(const char* fmt, ...) {
    va_list ap;
    va_start(ap, fmt);
    int n = vsnprintf(NULL, 0, fmt, ap);
    va_end(ap);
    if (n < 0) return NULL;

    char* s = (char*)malloc((size_t)n + 1);
    if (!s) return NULL;

    va_start(ap, fmt);
    vsnprintf(s, (size_t)n + 1, fmt, ap);
    va_end(ap);
    return s;
}

Then expose a typed declaration in IR and avoid manual varargs at codegen call
sites. This sidesteps promotions/width issues entirely.

Minimal tests to add

  • Large-magnitude doubles (e.g., 1e308, 1e-308) with "%.17g": no crash, no UB, sensible string.
  • i64 min/max values printed with "%lld"/"%llu": full value round-trips.
  • Truncation path: small buffer, confirm return >= capacity and that you handle it.
  • Mixed types in one call (int, double, string) verifying all promotions.

@yuvimittal yuvimittal marked this pull request as draft October 2, 2025 15:53
@yuvimittal yuvimittal force-pushed the feature-castingIntToString branch from 1c9828a to d34d4ca Compare October 15, 2025 06:22
@yuvimittal yuvimittal marked this pull request as ready for review October 16, 2025 15:38

OSL ChatGPT Reviewer

NOTE: This is generated by an AI program, so some comments may not make sense.

src/irx/builders/llvmliteir.py

  • ABI/correctness: SIZE_T_TYPE and POINTER_BITS are hardcoded to 64-bit. On 32-bit targets this misdeclares snprintf(size_t) and will corrupt arguments at call sites. Derive these from the module/target data instead of hardcoding (L.163-165).
    Suggested change:
    def _init_target_data(self) -> None:
        """Initialize target-dependent integer sizes."""
        import llvmlite.binding as llvm

        triple = llvm.get_default_triple()
        tm = llvm.Target.from_triple(triple).create_target_machine()
        # tm.target_data is already a TargetData object; query it directly.
        dl = tm.target_data
        ptr_bits = self._llvm.INT8_TYPE.as_pointer().get_abi_size(dl) * 8
        self._llvm.POINTER_BITS = ptr_bits
        self._llvm.SIZE_T_TYPE = (
            self._llvm.INT64_TYPE if ptr_bits == 64 else self._llvm.INT32_TYPE
        )

    Call _init_target_data() in initialize() after creating the module (L.150-170).

  • Correctness: Using abs(hash(fmt)) for global names is non-deterministic across runs and collision-prone. Use a deterministic encoding (e.g., hex of bytes) (L.1476).
    Suggested change:
    def _mangle_const_name(self, prefix: str, data: bytes) -> str:
        """Deterministic name for constant data."""
        import hashlib

        h = hashlib.sha1(data).hexdigest()[:16]
        return f"{prefix}_{h}"

    And in _get_or_create_format_global():
    data = (fmt + "\0").encode("utf8")
    name = self._mangle_const_name("fmt", data)

  • Safety/lifetime: Casting number->string returns a pointer to a stack alloca in the current block. If the cast result escapes the function (returned/stored globally), this is a dangling pointer. At minimum, allocate in the entry block; ideally, forbid escape or heap-allocate when it does (L.1571).
    Suggested change:
    def _alloca_in_entry(self, ty: ir.Type, name: str) -> ir.instructions.AllocaInstr:
        """Alloca in entry block for full function lifetime."""
        ib = self._llvm.ir_builder
        fn = ib.function
        with ib.goto_entry_block():
            return ib.alloca(ty, name=name)

    Replace buf = self._llvm.ir_builder.alloca(...) with buf = self._alloca_in_entry(buf_arr_ty, "fmt_buf") (L.1571).

  • Correctness: Integer sign handling in formatting. You use sext for widths < 32, which will print small unsigned values incorrectly (e.g., 0xFF -> -1). You need to respect the source signedness (L.1592-L.1604).
    Suggested change:
    def _zext_or_sext_to_i32(self, val: ir.Value, is_signed: bool) -> ir.Value:
        """Extend integer to i32 preserving signedness."""
        if val.type.width >= 32:
            return self._llvm.ir_builder.trunc(val, self._llvm.INT32_TYPE) if val.type.width > 32 else val
        return (self._llvm.ir_builder.sext(val, self._llvm.INT32_TYPE) if is_signed
                else self._llvm.ir_builder.zext(val, self._llvm.INT32_TYPE))

    Use is_signed from the AST/type system when preparing arg for %d (L.1592-L.1604).

  • Robustness: 64-byte buffer with "%.6f" can truncate large magnitudes (hundreds of digits). At least check snprintf’s return and grow if needed (L.1618, L.1644).
    Suggested change:
    def _ensure_fmt_capacity(self, snprintf_fn: ir.Function, buf_ptr: ir.Value, cap: ir.Value, fmt_ptr: ir.Value, arg: ir.Value) -> ir.Value:
        """Call snprintf; if needed size > cap, allocate larger buffer and retry."""
        n = self._llvm.ir_builder.call(snprintf_fn, [buf_ptr, cap, fmt_ptr, arg])
        # If n >= cap, allocate n+1 and call again (left as TODO if you lack control flow utilities)
        return n

    At minimum, assert n < cap or switch to "%g" for floats to cap length.

  • Portability: On some Windows toolchains, snprintf may be named _snprintf. Consider weak-declaring both or honoring the target triple to pick the correct symbol (L.1460).
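On the sign-handling point in the list above, the `0xFF -> -1` effect can be reproduced with plain integer arithmetic. The sketch models what `sext` and `zext` do to an 8-bit pattern before it is printed with `%d`; the helper names are local to this example and are not part of llvmlite or IRx.

```python
def sext(value: int, from_bits: int) -> int:
    """Sign-extend a from_bits-wide bit pattern to a wider signed integer."""
    value &= (1 << from_bits) - 1
    sign_bit = 1 << (from_bits - 1)
    # Flipping and subtracting the sign bit maps 0x80..0xFF to -128..-1.
    return (value ^ sign_bit) - sign_bit


def zext(value: int, from_bits: int) -> int:
    """Zero-extend a from_bits-wide bit pattern: the upper bits become 0."""
    return value & ((1 << from_bits) - 1)


# The same 8-bit pattern prints differently depending on the extension used.
as_signed = "%d" % sext(0xFF, 8)
as_unsigned = "%d" % zext(0xFF, 8)
```

Here `as_signed` holds the text an int8 source should produce and `as_unsigned` the text a uint8 source should produce, which is why the extension has to follow the source type's signedness.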


src/irx/system.py

  • High-risk correctness: Broadening message from LiteralUTF8String to Expr will crash later if codegen/lowering still assumes string-literal-specific fields. Add a fail-fast guard (or coercion) in __init__ until non-string lowering is implemented. (L.22)

Suggested patch:
def __init__(self, message: astx.Expr) -> None:
    """Initialize the PrintExpr."""
    if not isinstance(message, astx.LiteralUTF8String):
        raise TypeError("PrintExpr.message must be a LiteralUTF8String or implement Expr->string lowering")
    self.message = message
    self.name = f"print_msg{next(PrintExpr._counter)}"


tests/test_cast.py

LGTM!


@yuvimittal yuvimittal marked this pull request as draft October 16, 2025 15:43