Skip to content

Translate C integer types to portable Rust types #1266

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 48 commits into
base: master
Choose a base branch
from

Conversation

fw-immunant
Copy link
Contributor

@fw-immunant fw-immunant commented Jul 5, 2025

This substantially improves our fidelity of type translation for common integral types. We now recognize types like size_t and uint8_t and convert them to the appropriate platform-independent Rust types. This is nontrivial because C-the-language does not have fixed-size types and instead defines them via platform-dependent mappings to types with platform-dependent sizes, such as unsigned long and unsigned char. We have to preëmpt Clang to introduce a distinction between fixed-size types and "native" C ones when we convert the AST after exporting; to make effective use of this type information in translation we need to pass the expected type of the translated expression downward through conversion of expressions and statements.

@fw-immunant
Copy link
Contributor Author

Looks like I need to learn how to update snapshot tests.

@fw-immunant fw-immunant force-pushed the fw/typedefs branch 5 times, most recently from b2a0b2f to bd83003 Compare July 7, 2025 00:07
Copy link
Contributor

@kkysen kkysen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like I need to learn how to update snapshot tests.

cargo test (or cargo test -p c2rust-transpile for now, as it skips a bunch of slow tests) and then cargo insta review (or INSTA_UPDATE=always cargo test and then review the *.rs diffs from git).

@fw-immunant fw-immunant force-pushed the fw/typedefs branch 2 times, most recently from 2457c6c to ae1aacf Compare July 7, 2025 05:53
@fw-immunant fw-immunant linked an issue Jul 7, 2025 that may be closed by this pull request
@fw-immunant fw-immunant force-pushed the fw/typedefs branch 3 times, most recently from 1ca5371 to 79a07c9 Compare July 9, 2025 21:02
@fw-immunant
Copy link
Contributor Author

I'm seeing testsuite failures (type errors caused by differing primitive types in arithmetic) that I can't reproduce locally or on donna. I assume this has to do with type translation failing on Ubuntu 22.04 for some reason.

Copy link
Contributor

@kkysen kkysen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm seeing testsuite failures (type errors caused by differing primitive types in arithmetic) that I can't reproduce locally or on donna. I assume this has to do with type translation failing on Ubuntu 22.04 for some reason.

Are they using different LLVM/clang versions?

@fw-immunant fw-immunant force-pushed the fw/typedefs branch 6 times, most recently from e218703 to 3056379 Compare July 10, 2025 06:36
@fw-immunant fw-immunant force-pushed the fw/typedefs branch 3 times, most recently from 0da2644 to de5a710 Compare July 10, 2025 19:03
@fw-immunant
Copy link
Contributor Author

fw-immunant commented Jul 10, 2025

The remaining testsuite failures appear to be panics from (expected) arithmetic overflows that get caught in debug mode.

EDIT: solved by #1273, will rebase out of this series when it lands.

fw-immunant and others added 14 commits July 25, 2025 17:03
by moving child generation into the trait, we avoid explicit reference to the TypedAstContext

this is simpler because the only impl of this trait already stores the TypedAstContext itself, and by avoiding such aliasing we enable later implementations of mutable visitors

this also allows impls of the trait to choose between all-types and only-exprs node children
this is necessary to normalize AST types between Clang 15 and below, which resolve typedefs in expression types, and Clang 16 and above, which preserve typedefs in these positions

because we reify some typedef types such as `size_t`, we always want the latter behavior
this clarifies which variables refer to the same quantity in different functions

this commit should have no functional changes
@fw-immunant
Copy link
Contributor Author

@kkysen @ahomescu Can you give this another round of review when you get the chance? I think I've addressed everything from the first round.

@kkysen kkysen self-requested a review July 25, 2025 21:55
Comment on lines +1873 to +1880
// Avoid getting params from an implicit decl if a subsequent non-implicit decl exists.
// Implicit decls will not have names for params, but more importantly, they will never
// reference header-declared typedefs, so we would miss the fact that e.g. malloc is
// declared to accept `size_t` in its stdlib.h declaration, while its implicit declaration
// accepts the built-in `unsigned long`.
if (FD->isImplicit()) {
paramsFD = FD->getMostRecentDecl();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this implicit declaration the untyped declaration that you get when you don't declare a function prototype? Or is clang doing something special for functions like malloc here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latter; in the Clang AST we see a declaration node that has types but does not use any of the typedefs that the standard headers might define.

auto targetDependentMacro = [&]() -> std::optional<std::string> {
TypeSourceInfo *TSI = D->getTypeSourceInfo();
if (!TSI) {
return {};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: std::nullopt is probably a bit clearer here, and for the other return {}s.

Comment on lines +489 to +491
if let kind @ CDeclKind::Function { .. } = &located_kind.kind {
let new_kind = self.typed_context.fn_decl_ty_with_declared_args(kind);
if self.typed_context.type_for_kind(&new_kind).is_none() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Would be a bit clearer with early continues instead of extra indentation.

Comment on lines +483 to +490
// Function declarations' types look through typedefs, but we want to use the types with
// typedefs intact in some cases during translation. To ensure that these types exist in the
// `TypedAstContext`, iterate over all function decls, compute their adjusted type using
// argument types from arg declarations, and then ensure the existence of both this type and
// the type of pointers to it.
for (_decl_id, located_kind) in self.typed_context.c_decls.iter() {
if let kind @ CDeclKind::Function { .. } = &located_kind.kind {
let new_kind = self.typed_context.fn_decl_ty_with_declared_args(kind);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to be clear, are we referring to both parameters and arguments as the same thing here?

Comment on lines +723 to +726
let pointer_ty = CTypeKind::Pointer(CQualTypeId::new(CTypeId(new_id)));
let new_id = self.id_mapper.fresh_id();
self.add_type(new_id, not_located(pointer_ty));
self.processed_nodes.insert(new_id, OTHER_TYPE);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's this doing exactly? Adding a fn ptr type? And how come a fresh new_id is created here shadowing the original one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I'll add a comment explaining what/why. The second new_id is because we're also adding a second type; each type needs its own ID.

let mut typ = self.visit_qualified_type(typ_old);

// Clang injects definitions of the form `#define SOME_MACRO unsigned int` into
// compilation units based on the target. (See lib/Frontend/InitPreprocessor.cpp
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sometimes we could refer to a particular implementation of something in clang, in which case a permalink is more useful. It's useful to know if that's the intent or were purposefully targeting multiple versions at once (I know we always do that, but some things in clang we're not expecting to change, and would need to patch things if they did).

Comment on lines +1990 to +2014
typ = target_dependent_macro
.as_deref()
.and_then(|macro_name| {
let kind = match macro_name {
// Match names in the order Clang defines them.
"__INTMAX_TYPE__" => CTypeKind::IntMax,
"__UINTMAX_TYPE__" => CTypeKind::UIntMax,
"__PTRDIFF_TYPE__" => CTypeKind::PtrDiff,
"__INTPTR_TYPE__" => CTypeKind::IntPtr,
"__SIZE_TYPE__" => CTypeKind::Size,
"__WCHAR_TYPE__" => CTypeKind::WChar,
// __WINT_TYPE__ is defined by Clang but has no obvious translation
// __CHARn_TYPE__ for n ∈ {8, 16, 32} also lack obvious translation
"__UINTPTR_TYPE__" => CTypeKind::UIntPtr,
_ => {
log::debug!("Unknown target-dependent macro {macro_name}!");
return None;
}
};
log::trace!("Selected kind {kind} for typedef {name}");
Some(CQualTypeId::new(
self.typed_context.type_for_kind(&kind).unwrap(),
))
})
.unwrap_or(typ);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd help readability to pull a bunch of these out into separate functions. This function is 2000 lines long.

/// Try to match the target-dependent macro to a platform-independent type.
///
/// Clang injects definitions of the form `#define __SIZE_TYPE__ unsigned int`
/// into compilation units based on the target.
/// (See `lib/Frontend/InitPreprocessor.cpp` in Clang).
///
/// Later, headers contain defns like: `typedef __SIZE_TYPE__ size_t;`.
///
/// We detect these `typedef`s and alter them by replacing the type to which the macro
/// expands with a synthetic, portable `CTypeKind` chosen from the macro's name.
///
/// This allows us to generate platform-independent Rust types like [`u64`] or [`usize`]
/// despite the C side internally working with a target-specific type.
fn platform_independent_type_of_target_dependent_macro(macro_name: &str) -> Option<CTypeKind> {
    let kind = match macro_name {
        // Match names in the order Clang defines them.
        "__INTMAX_TYPE__" => CTypeKind::IntMax,
        "__UINTMAX_TYPE__" => CTypeKind::UIntMax,
        "__PTRDIFF_TYPE__" => CTypeKind::PtrDiff,
        "__INTPTR_TYPE__" => CTypeKind::IntPtr,
        "__SIZE_TYPE__" => CTypeKind::Size,
        "__WCHAR_TYPE__" => CTypeKind::WChar,
        // __WINT_TYPE__ is defined by Clang but has no obvious translation
        // __CHARn_TYPE__ for n ∈ {8, 16, 32} also lack obvious translation
        "__UINTPTR_TYPE__" => CTypeKind::UIntPtr,
        _ => {
            log::debug!("Unknown target-dependent macro {macro_name}!");
            return None;
        }
    };
    Some(kind)
}

/// Try to match the `typedef` to a platform-independent type.
///
/// Other platform-independent types are defined by
/// standard headers and their transitive includes without special compiler involvement
/// (see [`platform_independent_type_of_target_dependent_macro`]).
/// In these contexts, we recognize these `typedef`s by their name.
fn platform_independent_type_of_typedef(typedef: &str) -> Option<CTypeKind> {
    let kind = match typedef {
        "intmax_t" => CTypeKind::IntMax,
        "uintmax_t" => CTypeKind::UIntMax,
        "intptr_t" => CTypeKind::IntPtr,
        "uintptr_t" => CTypeKind::UIntPtr,
        // unlike `size_t`, `ssize_t` does not have a clang-provided `#define`.
        "ssize_t" => CTypeKind::SSize,
        // macOS defines `ssize_t` from `__darwin_ssize_t` in many headers,
        // so we should handle `__darwin_ssize_t` itself.
        "__darwin_ssize_t" => CTypeKind::SSize,
        "__uint8_t" => CTypeKind::UInt8,
        "__uint16_t" => CTypeKind::UInt16,
        "__uint32_t" => CTypeKind::UInt32,
        "__uint64_t" => CTypeKind::UInt64,
        "__uint128_t" => CTypeKind::UInt128,
        "__int8_t" => CTypeKind::Int8,
        "__int16_t" => CTypeKind::Int16,
        "__int32_t" => CTypeKind::Int32,
        "__int64_t" => CTypeKind::Int64,
        "__int128_t" => CTypeKind::Int128,
        // macOS stdint.h typedefs [u]intN_t directly:
        "int8_t" => CTypeKind::Int8,
        "int16_t" => CTypeKind::Int16,
        "int32_t" => CTypeKind::Int32,
        "int64_t" => CTypeKind::Int64,
        "uint8_t" => CTypeKind::UInt8,
        "uint16_t" => CTypeKind::UInt16,
        "uint32_t" => CTypeKind::UInt32,
        "uint64_t" => CTypeKind::UInt64,
        _ => {
            log::debug!("Unknown platform-independent typedef {typedef}!");
            return None;
        }
    };
    Some(kind)
}

/// Check if `src` is a standard header that contains platform-independent `typedef`s.
fn is_platform_independent_type_header(src: &SrcFile) -> bool {
    let filename = src
        .path
        .as_ref()
        .and_then(|path| path.file_name())
        .and_then(|filename| filename.to_str())
        .unwrap_or("");
    // `_types.h`, `_int*`, `_uint*` are for macOS.
    matches!(filename, "stdint.h" | "types.h" | "_types.h")
        || filename.starts_with("__stddef__")
        || filename.starts_with("_int")
        || filename.starts_with("_uint")
}

/// Try to recognize a platform-independent type given the `typedef` name,
/// its `target_dependent_macro` (see [`platform_independent_type_of_target_dependent_macro`]),
/// and the `src` file it was defined in.
fn platform_independent_type(
    typedef: &str,
    target_dependent_macro: Option<&str>,
    src: &SrcFile,
) -> Option<CTypeKind> {
    if let Some(macro_name) = target_dependent_macro {
        if let Some(ty) = platform_independent_type_of_target_dependent_macro(macro_name) {
            return Some(ty);
        }
    }
    if is_platform_independent_type_header(src) {
        if let Some(ty) = platform_independent_type_of_typedef(typedef) {
            return Some(ty);
        }
    }
    None
}

and then use it like this:

                    let target_dependent_macro: Option<String> = from_value(node.extras[2].clone())
                        .expect("Expected to find optional target-dependent macro name");

                    let file = self
                        .typed_context
                        .files
                        .get(self.typed_context.file_map[node.loc.fileid as usize])
                        .expect("missing SrcFile for typedef");
                    if let Some(kind) =
                        platform_independent_type(&name, target_dependent_macro.as_deref(), file)
                    {
                        log::trace!("Selected kind {kind} for typedef {name}");
                        typ = CQualTypeId::new(self.typed_context.type_for_kind(&kind).unwrap());
                    }

Also, I think platform-independent is a more accurate descriptor than fixed-width. Some of these types aren't fixed-width, like size_t. What we care about is that they are platform-independent but that C doesn't treat them internally as such.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR isn't the place for general refactors, and the comments here already explain what's happening as well as splitting out separate functions would.

With respect to naming, this handling covers both fixed-width, platform-independent types like uint16_t and ones that are platform-dependent and not fixed width like ssize_t. So platform-independent isn't really accurate for describing the types themselves (though none of the types handled by the first block here are fixed-size/platform-independent). We are adding logic to ensure that these types are translated in a platform-independent way, and to names that are not platform-dependent. But I don't know if it's worthwhile to bikeshed names here too much.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR isn't the place for general refactors

Not sure how this is a general refactor. This is all new code.

and the comments here already explain what's happening as well as splitting out separate functions would.

That's more subjective, and I don't think it does. I think it's better to err on the side of being clearer and the reviewer when this doesn't really cost anything.

With respect to naming, this handling covers both fixed-width, platform-independent types like uint16_t and ones that are platform-dependent and not fixed width like ssize_t. So platform-independent isn't really accurate for describing the types themselves (though none of the types handled by the first block here are fixed-size/platform-independent). We are adding logic to ensure that these types are translated in a platform-independent way, and to names that are not platform-dependent. But I don't know if it's worthwhile to bikeshed names here too much.

ssize_t isn't technically platform-dependent, but it basically is, and is for platforms we support.

I think consistent naming is really critical to being able to understand code. Referring to the same and different things with slightly different names all over the place is very confusing. And fixed-width is definitely inaccurate for many of these types, while platform-independent is largely correct and captures the intent of what we're trying to do: emit code that is platform-independent.

/// Visit nodes in pre-order traversal. Returns true if we should traverse
/// Return the immediate child nodes of a given node.
///
/// Generally implemented via `immediate_children[_all_types]`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Generally implemented via `immediate_children[_all_types]`.
/// Generally implemented via `immediate_children`
/// or [`immediate_children_all_types`].

Comment on lines +434 to +435
fn children(&mut self, _id: SomeId) -> Vec<SomeId>;
/// Visits nodes in pre-order traversal. Returns true if we should traverse
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you leave newlines between the methods if they have docs?

Suggested change
fn children(&mut self, _id: SomeId) -> Vec<SomeId>;
/// Visits nodes in pre-order traversal. Returns true if we should traverse
fn children(&mut self, _id: SomeId) -> Vec<SomeId>;
/// Visits nodes in pre-order traversal. Returns true if we should traverse

Comment on lines +467 to +471
pub fn type_for_kind(&self, kind: &CTypeKind) -> Option<CTypeId> {
self.c_types
.iter()
.find_map(|(id, k)| if kind == &k.kind { Some(*id) } else { None })
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it a good idea to be doing an O(n) reverse search for this? Can we just build a reverse c_type_ids: HashMap<CType, CTypeKind> once? Perhaps at the end of fn ConversionContext::new or fn ConversionContext::into_typed_context?

for p in parameters {
match self.index(*p).kind {
CDeclKind::Variable { typ, .. } => param_tys.push(CQualTypeId::new(typ.ctype)),
_ => return None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this ever fail? What does it mean if it does? One call is just .unwrap()ed and the other doesn't seem to ever be None in our tests.

If we're not sure yet how to handle this case, it'd probably be better to panic (like the assert! in fn function_declref_decl.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it panics instead, we can also rewrite this more simply with an iterator:

    pub fn tys_of_params(&self, parameters: &[CDeclId]) -> Vec<CQualTypeId> {
        parameters
            .iter()
            .map(|p| match &self.index(*p).kind {
                CDeclKind::Variable { typ, .. } => CQualTypeId::new(typ.ctype),
                _ => panic!(),
            })
            .collect()
    }


/// Return the id of the most precise possible type for the function referenced by the given
/// expression.
pub fn function_declref_ty_with_declared_args(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pub fn function_declref_ty_with_declared_args(
pub fn fn_declref_ty_with_declared_args(

For consistency with fn_decl_ty_with_declared_args.

@@ -486,6 +492,71 @@ impl TypedAstContext {
self.index(resolved_typ_id)
}

/// Extract decl of referenced function.
/// Looks for ImplicitCast(FunctionToPointerDecay, DeclRef(function_decl))
pub fn function_declref_decl(&self, func_expr: CExprId) -> Option<&CDeclKind> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pub fn function_declref_decl(&self, func_expr: CExprId) -> Option<&CDeclKind> {
pub fn fn_declref_decl(&self, fn_expr: CExprId) -> Option<&CDeclKind> {

For consistency with fn_decl_ty_with_declared_args.

Comment on lines +3513 to +3521
if let Some(func_decl) = self.ast_context.function_declref_decl(expr) {
let kind_with_declared_args =
self.ast_context.fn_decl_ty_with_declared_args(func_decl);
let func_ty = self
.ast_context
.type_for_kind(&kind_with_declared_args)
.unwrap_or_else(|| {
panic!("no type for kind {kind_with_declared_args:?}")
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this the same as fn function_declref_ty_with_declared_args? Can we just call that here instead?

&self,
func_expr: CExprId,
) -> Option<CQualTypeId> {
if let Some(func_decl @ CDeclKind::Function { .. }) = self.function_declref_decl(func_expr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fn function_declref_decl already assert!s that this is a CDeclKind::Function. We don't have variant types in Rust, but we shouldn't need the None branch here, which is just confusing if it's unreachable. Cleanest would be to have a FunctionDecl type in CDeclKind::Function instead of the fields directly, but I'm not sure how much churn that would cause elsewhere. Maybe just an assert! or unreachable! suffices here.

Comment on lines +755 to +759
/// In Clang 15 and below, the Clang AST resolves typedefs in the expression type of unary and
/// binary expressions. For example, a BinaryExpr node adding two `size_t` expressions will be
/// given an `unsigned long` type rather than the `size_t` typedef type. This behavior changed
/// in Clang 16. This method adjusts AST node types to match those produced by Clang 16 and
/// newer; on these later Clang versions, it should have no effect.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we not need any of this if we just didn't support clang <= 15?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, though it wasn't obvious that this would be the case initially. Clang doesn't make any documented promises about behavior here w/r/t typedefs, so it was possible we would need logic like this in all cases. It seems that Clang 16 and later automatically do all the bubbling-up of typedef types we need, but it's possible there are edge cases where even new versions of Clang do not.

@@ -0,0 +1,2 @@
#include <stdint.h>
uint32_t var;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How come there's a separate test for just uint32_t since it's covered in types.c already?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to unify the Linux and macOS versions? This looks good enough for now, but I'm wondering how difficult do you think it'd be to unify them? For, say uint32_t, can we make that always typedef to u32 instead of whatever clang is saying it is (often an OS-dependent intermediate)?

Also, maybe this should be named platform_independent_types.c?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also test off_t, which caused issues on rav1d (memorysafety/rav1d#24)?

Also, we should test function declarations like fseeko, etc, whose signatures will get platform-dependent types. I'm not sure if this fixes that (hopefully it does 🙏).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

size_t is translated to libc::c_ulong instead of libc::size_t
3 participants