Reading CSV > Avoid redefining column types #1751

sami-eljabali · 2026-03-16T15:08:11Z

Hi team

Thanks for the library! Big fan 😃

We're facing a pain point when reading CSVs: column data types have to be defined explicitly more than once, otherwise we risk them being parsed incorrectly.

Take for instance the following:

@DataSchema
data class BusinessEmployee(
    @ColumnName("ID")
    val id: String,

    @ColumnName("FIRST_NAME")
    val firstName: String,

    @ColumnName("LAST_NAME")
    val lastName: String,
)

DataFrame
    .readCsv(
        inputStream = ByteArrayInputStream(data),
        colTypes = mapOf("ID" to ColType.String)
    )
    .cast<BusinessEmployee>()
    .toList()

Notice how we define ID as type String twice.

We ran into this because when a CSV contains ID values that look like integers, e.g. 0123, they are parsed as integers and the leading zeros are dropped.
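
To make the leading-zero problem concrete, here is a minimal standard-library sketch. It does not use the DataFrame reader at all; the helper names `roundTripAsInt` and `roundTripAsString` are made up for illustration and only mimic what happens when an ID column is inferred as an integer versus kept as text.

```kotlin
// Parsing an ID as a number and writing it back loses the leading zero.
fun roundTripAsInt(rawId: String): String = rawId.toInt().toString()

// Keeping the ID as text preserves it exactly.
fun roundTripAsString(rawId: String): String = rawId

fun main() {
    println(roundTripAsInt("0123"))    // prints 123
    println(roundTripAsString("0123")) // prints 0123
}
```

This is why the ID column has to be pinned to String before parsing rather than fixed up afterwards: once the value has been parsed as an integer, the original text can no longer be recovered.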

Is there a way to avoid redefining the colTypes? We'd like columns and their types to be defined once, in the DataSchema file. This would avoid code duplication and the risk of devs forgetting to add the secondary definitions.

Thanks again!

koperagen · 2026-03-16T15:28:30Z

Maintainer

Hi! Thanks :)
I'd recommend defining your own readCsv shortcut that accepts the DataSchema type as a reified T and creates colTypes from the class properties.

4 replies

koperagen Mar 16, 2026
Maintainer

I need to think about it; maybe such a function should exist in the library.
Most of the time readers just take values from underlying parser structures with proper types (json, arrow, excel, jdbc). There's no place for a DataSchema to influence reading, except for converting values after they have been read. So it would essentially combine read+convertTo into a shortcut, which is not so useful in general.
But readCsv seems unique: a DataSchema can affect which type it tries to parse values as. Interesting perspective.

koperagen Mar 16, 2026
Maintainer

Roughly something like this should do it

val KProperty<*>.columnName: String
    get() = findAnnotation<ColumnName>()?.name ?: name

inline fun <reified T> DataFrame.Companion.readCsv(): DataFrame<*> {
    // Build the colTypes map from the properties of the schema class.
    val colTypes = T::class.members
        .filterIsInstance<KProperty<*>>()
        .mapNotNull { property -> property.returnType.toColType()?.let { property.columnName to it } }
        .toMap()
    return DataFrame.readCsv(..., colTypes)
}

    
fun KType.toColType(): ColType? =
    when (this.classifier) {
        Int::class -> ColType.Int
        Long::class -> ColType.Long
        Double::class -> ColType.Double
        Boolean::class -> ColType.Boolean
        BigDecimal::class -> ColType.BigDecimal
        BigInteger::class -> ColType.BigInteger
        LocalDate::class -> ColType.LocalDate
        LocalTime::class -> ColType.LocalTime
        LocalDateTime::class -> ColType.LocalDateTime
        String::class -> ColType.String
        DeprecatedInstant::class -> ColType.DeprecatedInstant
        StdlibInstant::class -> ColType.StdlibInstant
        Duration::class -> ColType.Duration
        URL::class -> ColType.Url
        DataFrame::class -> ColType.JsonArray
        DataRow::class -> ColType.JsonObject
        Char::class -> ColType.Char
        else -> null
    }
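
The reflection part of the snippet above can be tried without the DataFrame dependency. The following is a minimal sketch under stated assumptions: `CsvName`, `Employee`, `csvName` and `columnTypeNames` are hypothetical stand-ins (for `@ColumnName`, a schema class, `columnName` and the colTypes map respectively), and the map values are plain type names instead of `ColType` entries. It needs kotlin-reflect on the classpath.

```kotlin
import kotlin.reflect.KClass
import kotlin.reflect.KProperty
import kotlin.reflect.full.findAnnotation
import kotlin.reflect.full.memberProperties

// Hypothetical stand-in for the library's @ColumnName, so the sketch runs on its own.
@Target(AnnotationTarget.PROPERTY)
annotation class CsvName(val name: String)

// Hypothetical schema class; "age" deliberately has no annotation.
data class Employee(
    @CsvName("ID") val id: String,
    @CsvName("FIRST_NAME") val firstName: String,
    val age: Int,
)

// Annotated name if present, otherwise the Kotlin property name.
val KProperty<*>.csvName: String
    get() = findAnnotation<CsvName>()?.name ?: name

// Maps each CSV column name to the simple name of the property's declared type.
inline fun <reified T : Any> columnTypeNames(): Map<String, String?> =
    T::class.memberProperties.associate { property ->
        property.csvName to (property.returnType.classifier as? KClass<*>)?.simpleName
    }

fun main() {
    // Iteration order of memberProperties is not something to rely on.
    println(columnTypeNames<Employee>())
}
```

The sketch uses `memberProperties` instead of filtering `members`, which is only a convenience; both give access to the annotated properties.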

sami-eljabali Mar 16, 2026
Author

Most of the time readers just take values from underlying parser structures with proper types (json, arrow, excel, jdbc)

Yes, exactly my thinking.

Roughly something like this should do it

Thank you @koperagen! I'll look more closely into this. A fuller example would help.

That said, is it possible to add a new method that works alongside readCsv(), like readCsvAsDataSchema()?
Or have readCsv() take a DataSchema as an argument?

I believe it'd help everyone, per your point above.

Thank you again for the quick turnaround!

Jolanrensen Mar 16, 2026
Maintainer

I can give you a slight improvement over the example of @koperagen, using our already built-in schema reading functions:

public fun KType.toColType(): ColType =
    when (this.withNullability(false)) {
        typeOf<Int>() -> ColType.Int
        typeOf<Long>() -> ColType.Long
        typeOf<Double>() -> ColType.Double
        typeOf<Boolean>() -> ColType.Boolean
        typeOf<java.math.BigDecimal>() -> ColType.BigDecimal
        typeOf<java.math.BigInteger>() -> ColType.BigInteger
        typeOf<kotlinx.datetime.LocalDate>() -> ColType.LocalDate
        typeOf<kotlinx.datetime.LocalTime>() -> ColType.LocalTime
        typeOf<kotlinx.datetime.LocalDateTime>() -> ColType.LocalDateTime
        typeOf<String>() -> ColType.String
        typeOf<kotlinx.datetime.Instant>() -> ColType.DeprecatedInstant
        typeOf<kotlin.time.Instant>() -> ColType.StdlibInstant
        typeOf<kotlin.time.Duration>() -> ColType.Duration
        typeOf<java.net.URL>() -> ColType.Url
        typeOf<DataFrame<*>>() -> ColType.JsonArray
        typeOf<DataRow<*>>() -> ColType.JsonObject
        typeOf<Char>() -> ColType.Char
        else -> throw IllegalArgumentException("Unknown KType: $this")
    }

public inline fun <reified T> DataFrame.Companion.readCsvAs(inputStream: InputStream): DataFrame<*> {
    val schema = DataFrame.emptyOf<T>().schema()
    val colTypes = schema.columns.mapValues { (_, colSchema) ->
        colSchema.type.toColType()
    }
    return DataFrame.readCsv(inputStream, colTypes = colTypes)
}
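
A note on the `withNullability(false)` call in `toColType()`: `typeOf<Int?>()` and `typeOf<Int>()` are different `KType` values, so without stripping nullability a nullable schema property would fall through to the `else` branch. The sketch below illustrates this with a simplified, hypothetical `typeLabel()` that returns a string label instead of a `ColType`; it assumes kotlin-reflect is on the classpath.

```kotlin
import kotlin.reflect.KType
import kotlin.reflect.full.withNullability
import kotlin.reflect.typeOf

// Simplified stand-in for the toColType() mapping: returns a label instead of a ColType.
fun KType.typeLabel(): String =
    when (this.withNullability(false)) {
        typeOf<Int>() -> "Int"
        typeOf<String>() -> "String"
        else -> "unknown"
    }

fun main() {
    // Nullable and non-null types are not equal as KTypes...
    println(typeOf<Int?>() == typeOf<Int>())
    // ...but after stripping nullability both map to the same label.
    println(typeOf<Int?>().typeLabel())
    println(typeOf<Int>().typeLabel())
}
```

Note that stripping nullability only affects which parser is chosen; whether empty cells end up as nulls is a separate concern handled by the reader.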

It seemed to correctly pick up all provided types when I used it like this:

@DataSchema
data class BusinessEmployee(
    @ColumnName("ID")
    val id: String,

    @ColumnName("FIRST_NAME")
    val firstName: String,

    @ColumnName("LAST_NAME")
    val lastName: String,
)

// byteInputStream() replaces the deprecated java.io.StringBufferInputStream
DataFrame.readCsvAs<BusinessEmployee>(
    """
    ID,FIRST_NAME,LAST_NAME
    1,John,Smith
    2,Jane,Doe
    """.trimIndent().byteInputStream()
)

Seems like a very good addition to the library in time :)

Jolanrensen · 2026-03-16T20:38:25Z

Maintainer

Another approach could be to "disable" automatic CSV type parsing for all columns, for example by setting the default type of every column to String (https://kotlin.github.io/dataframe/read.html#provide-a-default-type-for-all-columns).

DataFrame.readCsv(
    inputStream = ByteArrayInputStream(data),
    colTypes = mapOf(ColType.DEFAULT to ColType.String),
)

and then call cast<BusinessEmployee>() or the slightly more flexible convertTo<BusinessEmployee>() (which can also do some basic type conversions, like String -> Int). This is the read+convertTo method @koperagen mentions above.

One downside of this approach is that if you have a column of, say, type Int, the whole column is first read as String and only then converted to Int, because it's a two-step process.
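
Put together, the two-step variant could look roughly like the sketch below. Treat it as an illustration under assumptions rather than a reference: the import paths are what I'd expect for the library's CSV module, and `EmployeeWithAge`, its `AGE` column and the helper `readThenConvert` are hypothetical additions to show a String -> Int conversion.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.annotations.ColumnName
import org.jetbrains.kotlinx.dataframe.annotations.DataSchema
import org.jetbrains.kotlinx.dataframe.api.convertTo
import org.jetbrains.kotlinx.dataframe.io.ColType
import org.jetbrains.kotlinx.dataframe.io.readCsv

// Hypothetical schema with one String and one Int column.
@DataSchema
data class EmployeeWithAge(
    @ColumnName("ID") val id: String,
    @ColumnName("AGE") val age: Int,
)

// Step 1: read every column as String. Step 2: convert columns to the schema's types.
fun readThenConvert(csv: String): DataFrame<EmployeeWithAge> =
    DataFrame.readCsv(
        inputStream = csv.byteInputStream(),
        colTypes = mapOf(ColType.DEFAULT to ColType.String),
    ).convertTo<EmployeeWithAge>()

fun main() {
    val df = readThenConvert("ID,AGE\n0123,42\n")
    println(df["ID"][0])  // the ID stays a String, leading zero included
    println(df["AGE"][0]) // AGE was read as String, then converted to Int
}
```

The types live only in the schema class here, at the cost of the intermediate String column described above.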

With the solution of @koperagen, this is not the case, as supplying colTypes directly influences the extremely efficient underlying Deephaven CSV reader. Funnily enough, the exact same idea was mentioned today in #1748.

We might actually have a need for a DataFrame.readCsvAs<Schema>()-family of functions :) I believe the same could also work for DataFrame.readJsonAs<>() (because it also calls our parser)

0 replies
