-
Notifications
You must be signed in to change notification settings - Fork 270
Translate C integer types to portable Rust types #1266
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Looks like I need to learn how to update snapshot tests. |
b2a0b2f
to
bd83003
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like I need to learn how to update snapshot tests.
cargo test
(or cargo test -p c2rust-transpile
for now, as it skips a bunch of slow tests) and then cargo insta review
(or INSTA_UPDATE=always cargo test
and then review the *.rs
diffs from git
).
2457c6c
to
ae1aacf
Compare
1ca5371
to
79a07c9
Compare
I'm seeing testsuite failures (type errors caused by differing primitive types in arithmetic) that I can't reproduce locally or on donna. I assume this has to do with type translation failing on Ubuntu 22.04 for some reason. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm seeing testsuite failures (type errors caused by differing primitive types in arithmetic) that I can't reproduce locally or on donna. I assume this has to do with type translation failing on Ubuntu 22.04 for some reason.
Are they using different LLVM/clang versions?
e218703
to
3056379
Compare
0da2644
to
de5a710
Compare
The remaining testsuite failures appear to be panics from (expected) arithmetic overflows that get caught in debug mode. EDIT: solved by #1273, will rebase out of this series when it lands. |
by moving child generation into the trait, we avoid explicit reference to the TypedAstContext this is simpler because the only impl of this trait already stores the TypedAstContext itself, and by avoiding such aliasing we enable later implementations of mutable visitors this also allows impls of the trait to choose between all-types and only-exprs node children
this is necessary to normalize AST types between Clang 15 and below, which resolve typedefs in expression types, and Clang 16 and above, which preserve typedefs in these positions because we reify some typedef types such as `size_t`, we always want the latter behavior
this clarifies which variables refer to the same quantity in different functions this commit should have no functional changes
Co-authored-by: Khyber Sen <[email protected]>
// Avoid getting params from an implicit decl if a subsequent non-implicit decl exists. | ||
// Implicit decls will not have names for params, but more importantly, they will never | ||
// reference header-declared typedefs, so we would miss the fact that e.g. malloc is | ||
// declared to accept `size_t` in its stdlib.h declaration, while its implicit declaration | ||
// accepts the built-in `unsigned long`. | ||
if (FD->isImplicit()) { | ||
paramsFD = FD->getMostRecentDecl(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this implicit declaration the untyped declaration that you get when you don't declare a function prototype? Or is clang doing something special for functions like malloc here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The latter; in the Clang AST we see a declaration node that has types but does not use any of the typedefs that the standard headers might define.
auto targetDependentMacro = [&]() -> std::optional<std::string> { | ||
TypeSourceInfo *TSI = D->getTypeSourceInfo(); | ||
if (!TSI) { | ||
return {}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: std::nullopt
is probably a bit clearer here, and for the other return {}
s.
if let kind @ CDeclKind::Function { .. } = &located_kind.kind { | ||
let new_kind = self.typed_context.fn_decl_ty_with_declared_args(kind); | ||
if self.typed_context.type_for_kind(&new_kind).is_none() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Would be a bit clearer with early continue
s instead of extra indentation.
// Function declarations' types look through typedefs, but we want to use the types with | ||
// typedefs intact in some cases during translation. To ensure that these types exist in the | ||
// `TypedAstContext`, iterate over all function decls, compute their adjusted type using | ||
// argument types from arg declarations, and then ensure the existence of both this type and | ||
// the type of pointers to it. | ||
for (_decl_id, located_kind) in self.typed_context.c_decls.iter() { | ||
if let kind @ CDeclKind::Function { .. } = &located_kind.kind { | ||
let new_kind = self.typed_context.fn_decl_ty_with_declared_args(kind); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to be clear, are we referring to both parameters and arguments as the same thing here?
let pointer_ty = CTypeKind::Pointer(CQualTypeId::new(CTypeId(new_id))); | ||
let new_id = self.id_mapper.fresh_id(); | ||
self.add_type(new_id, not_located(pointer_ty)); | ||
self.processed_nodes.insert(new_id, OTHER_TYPE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's this doing exactly? Adding a fn ptr type? And how come a fresh new_id
is created here shadowing the original one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I'll add a comment explaining what/why. The second new_id
is because we're also adding a second type; each type needs its own ID.
let mut typ = self.visit_qualified_type(typ_old); | ||
|
||
// Clang injects definitions of the form `#define SOME_MACRO unsigned int` into | ||
// compilation units based on the target. (See lib/Frontend/InitPreprocessor.cpp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sometimes we could refer to a particular implementation of something in clang, in which case a permalink is more useful. It's useful to know if that's the intent or were purposefully targeting multiple versions at once (I know we always do that, but some things in clang we're not expecting to change, and would need to patch things if they did).
typ = target_dependent_macro | ||
.as_deref() | ||
.and_then(|macro_name| { | ||
let kind = match macro_name { | ||
// Match names in the order Clang defines them. | ||
"__INTMAX_TYPE__" => CTypeKind::IntMax, | ||
"__UINTMAX_TYPE__" => CTypeKind::UIntMax, | ||
"__PTRDIFF_TYPE__" => CTypeKind::PtrDiff, | ||
"__INTPTR_TYPE__" => CTypeKind::IntPtr, | ||
"__SIZE_TYPE__" => CTypeKind::Size, | ||
"__WCHAR_TYPE__" => CTypeKind::WChar, | ||
// __WINT_TYPE__ is defined by Clang but has no obvious translation | ||
// __CHARn_TYPE__ for n ∈ {8, 16, 32} also lack obvious translation | ||
"__UINTPTR_TYPE__" => CTypeKind::UIntPtr, | ||
_ => { | ||
log::debug!("Unknown target-dependent macro {macro_name}!"); | ||
return None; | ||
} | ||
}; | ||
log::trace!("Selected kind {kind} for typedef {name}"); | ||
Some(CQualTypeId::new( | ||
self.typed_context.type_for_kind(&kind).unwrap(), | ||
)) | ||
}) | ||
.unwrap_or(typ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it'd help readability to pull a bunch of these out into separate functions. This function is 2000 lines long.
/// Try to match the target-dependent macro to a platform-independent type.
///
/// Clang injects definitions of the form `#define __SIZE_TYPE__ unsigned int`
/// into compilation units based on the target.
/// (See `lib/Frontend/InitPreprocessor.cpp` in Clang).
///
/// Later, headers contain defns like: `typedef __SIZE_TYPE__ size_t;`.
///
/// We detect these `typedef`s and alter them by replacing the type to which the macro
/// expands with a synthetic, portable `CTypeKind` chosen from the macro's name.
///
/// This allows us to generate platform-independent Rust types like [`u64`] or [`usize`]
/// despite the C side internally working with a target-specific type.
fn platform_independent_type_of_target_dependent_macro(macro_name: &str) -> Option<CTypeKind> {
let kind = match macro_name {
// Match names in the order Clang defines them.
"__INTMAX_TYPE__" => CTypeKind::IntMax,
"__UINTMAX_TYPE__" => CTypeKind::UIntMax,
"__PTRDIFF_TYPE__" => CTypeKind::PtrDiff,
"__INTPTR_TYPE__" => CTypeKind::IntPtr,
"__SIZE_TYPE__" => CTypeKind::Size,
"__WCHAR_TYPE__" => CTypeKind::WChar,
// __WINT_TYPE__ is defined by Clang but has no obvious translation
// __CHARn_TYPE__ for n ∈ {8, 16, 32} also lack obvious translation
"__UINTPTR_TYPE__" => CTypeKind::UIntPtr,
_ => {
log::debug!("Unknown target-dependent macro {macro_name}!");
return None;
}
};
Some(kind)
}
/// Try to match the `typedef` to a platform-independent type.
///
/// Other platform-independent types are defined by
/// standard headers and their transitive includes without special compiler involvement
/// (see [`platform_independent_type_of_target_dependent_macro`]).
/// In these contexts, we recognize these `typedef`s by their name.
fn platform_independent_type_of_typedef(typedef: &str) -> Option<CTypeKind> {
let kind = match typedef {
"intmax_t" => CTypeKind::IntMax,
"uintmax_t" => CTypeKind::UIntMax,
"intptr_t" => CTypeKind::IntPtr,
"uintptr_t" => CTypeKind::UIntPtr,
// unlike `size_t`, `ssize_t` does not have a clang-provided `#define`.
"ssize_t" => CTypeKind::SSize,
// macOS defines `ssize_t` from `__darwin_ssize_t` in many headers,
// so we should handle `__darwin_ssize_t` itself.
"__darwin_ssize_t" => CTypeKind::SSize,
"__uint8_t" => CTypeKind::UInt8,
"__uint16_t" => CTypeKind::UInt16,
"__uint32_t" => CTypeKind::UInt32,
"__uint64_t" => CTypeKind::UInt64,
"__uint128_t" => CTypeKind::UInt128,
"__int8_t" => CTypeKind::Int8,
"__int16_t" => CTypeKind::Int16,
"__int32_t" => CTypeKind::Int32,
"__int64_t" => CTypeKind::Int64,
"__int128_t" => CTypeKind::Int128,
// macOS stdint.h typedefs [u]intN_t directly:
"int8_t" => CTypeKind::Int8,
"int16_t" => CTypeKind::Int16,
"int32_t" => CTypeKind::Int32,
"int64_t" => CTypeKind::Int64,
"uint8_t" => CTypeKind::UInt8,
"uint16_t" => CTypeKind::UInt16,
"uint32_t" => CTypeKind::UInt32,
"uint64_t" => CTypeKind::UInt64,
_ => {
log::debug!("Unknown platform-independent typedef {typedef}!");
return None;
}
};
Some(kind)
}
/// Check if `src` is a standard header that contains platform-independent `typedef`s.
fn is_platform_independent_type_header(src: &SrcFile) -> bool {
let filename = src
.path
.as_ref()
.and_then(|path| path.file_name())
.and_then(|filename| filename.to_str())
.unwrap_or("");
// `_types.h`, `_int*`, `_uint*` are for macOS.
matches!(filename, "stdint.h" | "types.h" | "_types.h")
|| filename.starts_with("__stddef__")
|| filename.starts_with("_int")
|| filename.starts_with("_uint")
}
/// Try to recognize a platform-independent type given the `typedef` name,
/// its `target_dependent_macro` (see [`platform_independent_type_of_target_dependent_macro`]),
/// and the `src` file it was defined in.
fn platform_independent_type(
typedef: &str,
target_dependent_macro: Option<&str>,
src: &SrcFile,
) -> Option<CTypeKind> {
if let Some(macro_name) = target_dependent_macro {
if let Some(ty) = platform_independent_type_of_target_dependent_macro(macro_name) {
return Some(ty);
}
}
if is_platform_independent_type_header(src) {
if let Some(ty) = platform_independent_type_of_typedef(typedef) {
return Some(ty);
}
}
None
}
and then use it like this:
let target_dependent_macro: Option<String> = from_value(node.extras[2].clone())
.expect("Expected to find optional target-dependent macro name");
let file = self
.typed_context
.files
.get(self.typed_context.file_map[node.loc.fileid as usize])
.expect("missing SrcFile for typedef");
if let Some(kind) =
platform_independent_type(&name, target_dependent_macro.as_deref(), file)
{
log::trace!("Selected kind {kind} for typedef {name}");
typ = CQualTypeId::new(self.typed_context.type_for_kind(&kind).unwrap());
}
Also, I think platform-independent is a more accurate descriptor than fixed-width. Some of these types aren't fixed-width, like size_t
. What we care about is that they are platform-independent but that C doesn't treat them internally as such.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR isn't the place for general refactors, and the comments here already explain what's happening as well as splitting out separate functions would.
With respect to naming, this handling covers both fixed-width, platform-independent types like uint16_t
and ones that are platform-dependent and not fixed width like ssize_t
. So platform-independent isn't really accurate for describing the types themselves (though none of the types handled by the first block here are fixed-size/platform-independent). We are adding logic to ensure that these types are translated in a platform-independent way, and to names that are not platform-dependent. But I don't know if it's worthwhile to bikeshed names here too much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR isn't the place for general refactors
Not sure how this is a general refactor. This is all new code.
and the comments here already explain what's happening as well as splitting out separate functions would.
That's more subjective, and I don't think it does. I think it's better to err on the side of being clearer and the reviewer when this doesn't really cost anything.
With respect to naming, this handling covers both fixed-width, platform-independent types like
uint16_t
and ones that are platform-dependent and not fixed width likessize_t
. So platform-independent isn't really accurate for describing the types themselves (though none of the types handled by the first block here are fixed-size/platform-independent). We are adding logic to ensure that these types are translated in a platform-independent way, and to names that are not platform-dependent. But I don't know if it's worthwhile to bikeshed names here too much.
ssize_t
isn't technically platform-dependent, but it basically is, and is for platforms we support.
I think consistent naming is really critical to being able to understand code. Referring to the same and different things with slightly different names all over the place is very confusing. And fixed-width is definitely inaccurate for many of these types, while platform-independent is largely correct and captures the intent of what we're trying to do: emit code that is platform-independent.
/// Visit nodes in pre-order traversal. Returns true if we should traverse | ||
/// Return the immediate child nodes of a given node. | ||
/// | ||
/// Generally implemented via `immediate_children[_all_types]`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// Generally implemented via `immediate_children[_all_types]`. | |
/// Generally implemented via `immediate_children` | |
/// or [`immediate_children_all_types`]. |
fn children(&mut self, _id: SomeId) -> Vec<SomeId>; | ||
/// Visits nodes in pre-order traversal. Returns true if we should traverse |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you leave newlines between the methods if they have docs?
fn children(&mut self, _id: SomeId) -> Vec<SomeId>; | |
/// Visits nodes in pre-order traversal. Returns true if we should traverse | |
fn children(&mut self, _id: SomeId) -> Vec<SomeId>; | |
/// Visits nodes in pre-order traversal. Returns true if we should traverse |
pub fn type_for_kind(&self, kind: &CTypeKind) -> Option<CTypeId> { | ||
self.c_types | ||
.iter() | ||
.find_map(|(id, k)| if kind == &k.kind { Some(*id) } else { None }) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it a good idea to be doing an O(n)
reverse search for this? Can we just build a reverse c_type_ids: HashMap<CType, CTypeKind>
once? Perhaps at the end of fn ConversionContext::new
or fn ConversionContext::into_typed_context
?
for p in parameters { | ||
match self.index(*p).kind { | ||
CDeclKind::Variable { typ, .. } => param_tys.push(CQualTypeId::new(typ.ctype)), | ||
_ => return None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this ever fail? What does it mean if it does? One call is just .unwrap()
ed and the other doesn't seem to ever be None
in our tests.
If we're not sure yet how to handle this case, it'd probably be better to panic (like the assert!
in fn function_declref_decl
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it panics instead, we can also rewrite this more simply with an iterator:
pub fn tys_of_params(&self, parameters: &[CDeclId]) -> Vec<CQualTypeId> {
parameters
.iter()
.map(|p| match &self.index(*p).kind {
CDeclKind::Variable { typ, .. } => CQualTypeId::new(typ.ctype),
_ => panic!(),
})
.collect()
}
|
||
/// Return the id of the most precise possible type for the function referenced by the given | ||
/// expression. | ||
pub fn function_declref_ty_with_declared_args( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pub fn function_declref_ty_with_declared_args( | |
pub fn fn_declref_ty_with_declared_args( |
For consistency with fn_decl_ty_with_declared_args
.
@@ -486,6 +492,71 @@ impl TypedAstContext { | |||
self.index(resolved_typ_id) | |||
} | |||
|
|||
/// Extract decl of referenced function. | |||
/// Looks for ImplicitCast(FunctionToPointerDecay, DeclRef(function_decl)) | |||
pub fn function_declref_decl(&self, func_expr: CExprId) -> Option<&CDeclKind> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pub fn function_declref_decl(&self, func_expr: CExprId) -> Option<&CDeclKind> { | |
pub fn fn_declref_decl(&self, fn_expr: CExprId) -> Option<&CDeclKind> { |
For consistency with fn_decl_ty_with_declared_args
.
if let Some(func_decl) = self.ast_context.function_declref_decl(expr) { | ||
let kind_with_declared_args = | ||
self.ast_context.fn_decl_ty_with_declared_args(func_decl); | ||
let func_ty = self | ||
.ast_context | ||
.type_for_kind(&kind_with_declared_args) | ||
.unwrap_or_else(|| { | ||
panic!("no type for kind {kind_with_declared_args:?}") | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this the same as fn function_declref_ty_with_declared_args
? Can we just call that here instead?
&self, | ||
func_expr: CExprId, | ||
) -> Option<CQualTypeId> { | ||
if let Some(func_decl @ CDeclKind::Function { .. }) = self.function_declref_decl(func_expr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fn function_declref_decl
already assert!
s that this is a CDeclKind::Function
. We don't have variant types in Rust, but we shouldn't need the None
branch here, which is just confusing if it's unreachable. Cleanest would be to have a FunctionDecl
type in CDeclKind::Function
instead of the fields directly, but I'm not sure how much churn that would cause elsewhere. Maybe just an assert!
or unreachable!
suffices here.
/// In Clang 15 and below, the Clang AST resolves typedefs in the expression type of unary and | ||
/// binary expressions. For example, a BinaryExpr node adding two `size_t` expressions will be | ||
/// given an `unsigned long` type rather than the `size_t` typedef type. This behavior changed | ||
/// in Clang 16. This method adjusts AST node types to match those produced by Clang 16 and | ||
/// newer; on these later Clang versions, it should have no effect. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we not need any of this if we just didn't support clang <= 15?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, though it wasn't obvious that this would be the case initially. Clang doesn't make any documented promises about behavior here w/r/t typedefs, so it was possible we would need logic like this in all cases. It seems that Clang 16 and later automatically do all the bubbling-up of typedef types we need, but it's possible there are edge cases where even new versions of Clang do not.
@@ -0,0 +1,2 @@ | |||
#include <stdint.h> | |||
uint32_t var; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How come there's a separate test for just uint32_t
since it's covered in types.c
already?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way to unify the Linux and macOS versions? This looks good enough for now, but I'm wondering how difficult do you think it'd be to unify them? For, say uint32_t
, can we make that always typedef
to u32
instead of whatever clang is saying it is (often an OS-dependent intermediate)?
Also, maybe this should be named platform_independent_types.c
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also test off_t
, which caused issues on rav1d
(memorysafety/rav1d#24)?
Also, we should test function declarations like fseeko
, etc, whose signatures will get platform-dependent types. I'm not sure if this fixes that (hopefully it does 🙏).
This substantially improves our fidelity of type translation for common integral types. We now recognize types like
size_t
anduint8_t
and convert them to the appropriate platform-independent Rust types. This is nontrivial because C-the-language does not have fixed-size types and instead defines them via platform-dependent mappings to types with platform-dependent sizes, such asunsigned long
andunsigned char
. We have to preëmpt Clang to introduce a distinction between fixed-size types and "native" C ones when we convert the AST after exporting; to make effective use of this type information in translation we need to pass the expected type of the translated expression downward through conversion of expressions and statements.