Applying canu's trio binning function using assembly's pseudohaplotypes #2387

@tododge

Hi Canu team,

I’m trying to assign individual long reads to each parental (pseudo-)haplotype without parental data.

Standard approaches based on small variants (e.g., whatshap haplotag) don’t seem to work well for complex or highly unbalanced structural variation. Intuitively, a read-binning strategy using haplotype-specific unique k-mers should perform better across all variant types (SNVs + SVs).

I’m wondering whether splitHaplotype is appropriate for this use case, and if so, whether you have recommendations for parameter tuning or database construction.

So far I've tried a few variations of the commands below. hap1.fa and hap2.fa are pseudohaplotypes generated by hifiasm for a diploid species.

At this stage, I’m not very concerned about haplotype switch errors in the assemblies.

Reads are PacBio HiFi.

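# Count 63-mers in each pseudohaplotype assembly
meryl count k=63 hap1.fa output hap1.k63.meryl
meryl count k=63 hap2.fa output hap2.k63.meryl

# Keep only the k-mers found in one pseudohaplotype and absent from the other
meryl difference hap1.k63.meryl hap2.k63.meryl output hap1.k63.only.meryl
meryl difference hap2.k63.meryl hap1.k63.meryl output hap2.k63.only.meryl

# Bin the HiFi reads using the haplotype-specific k-mers; reads not assigned to a haplotype go to -A
splitHaplotype -R hifi.fq.gz \
  -H hap1.k63.only.meryl 1 canu.k63.only.hap1.fq.gz \
  -H hap2.k63.only.meryl 1 canu.k63.only.hap2.fq.gz \
  -A canu.k63.only.unk.fq.gz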

The output files (hap1 and hap2) both contain reads, but they appear to be approximately 50:50 mixtures of both haplotypes.
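
For reference, the size of each bin can be tallied with something like the sketch below. `count_reads` is a hypothetical helper written for this post, not part of Canu or meryl. It assumes the outputs are gzip-compressed and are either FASTA or four-line FASTQ. It only reports how many reads landed in each file; it does not show which haplotype a read actually came from.

```shell
# Hypothetical helper (not part of Canu or meryl): count records in a gzipped FASTA/FASTQ file.
# Assumes FASTA headers start with ">" or that FASTQ records span exactly four lines.
count_reads() {
  gzip -cd "$1" | awk '
    NR == 1 { fasta = (substr($0, 1, 1) == ">") }   # detect the format from the first line
    fasta && /^>/ { n++ }                           # FASTA: one header line per record
    END { if (fasta) print n + 0; else print int(NR / 4) }
  '
}

# Report the read count for each splitHaplotype output that exists.
for f in canu.k63.only.hap1.fq.gz canu.k63.only.hap2.fq.gz canu.k63.only.unk.fq.gz; do
  if [ -f "$f" ]; then
    printf '%s\t%s\n' "$f" "$(count_reads "$f")"
  fi
done
```

Comparing these counts against the total number of input reads shows how many reads were binned at all versus left unassigned.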

Any guidance would be greatly appreciated. Thank you!
