Skip to content

ConvertToRefFlat fails while running Drop-seq_tools-2.4.0 #113

@geshtin

Description

@geshtin

I have problems creating the refflat from the gtf file. It has been reported more often, I followed all suggestion, and as far as I can see the gtfs's look OK. I have one where it works:

1	gramene	gene	44289	49837	.	+	.	gene_id "Zm00001d027230"; gene_source "gramene"; gene_biotype "protein_coding"; gene_name "Zm00001d027230";
1	gramene	transcript	44289	49837	.	+	.	gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1	gramene	exon	44289	44947	.	+	.	gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "1"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; exon_id "Zm00001d027230_T001.exon1"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1	gramene	CDS	44351	44947	.	+	0	gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "1"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; protein_id "Zm00001d027230_P001"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1	gramene	start_codon	44351	44353	.	+	0	gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "1"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1	gramene	exon	45666	45803	.	+	.	gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "2"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; exon_id "Zm00001d027230_T001.exon2"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1	gramene	CDS	45666	45803	.	+	0	gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "2"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; protein_id "Zm00001d027230_P001"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1	gramene	exon	45888	46133	.	+	.	gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "3"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; exon_id "Zm00001d027230_T001.exon3"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1	gramene	CDS	45888	46133	.	+	0	gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "3"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; protein_id "Zm00001d027230_P001"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1	gramene	exon	46229	46342	.	+	.	gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "4"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; exon_id "Zm00001d027230_T001.exon4"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1	gramene	CDS	46229	46342	.	+	0	gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "4"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; protein_id "Zm00001d027230_P001"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1	gramene	exon	46451	46633	.	+	.	gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "5"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; exon_id "Zm00001d027230_T001.exon5"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1	gramene	CDS	46451	46633	.	+	0	gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "5"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; protein_id "Zm00001d027230_P001"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1	gramene	exon	47045	47262	.	+	.	gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "6"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; exon_id "Zm00001d027230_T001.exon6"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1	gramene	CDS	47045	47262	.	+	0	gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "6"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; protein_id "Zm00001d027230_P001"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1	gramene	exon	47650	48111	.	+	.	gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "7"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; exon_id "Zm00001d027230_T001.exon7"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1	gramene	CDS	47650	47992	.	+	1	gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "7"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; protein_id "Zm00001d027230_P001"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1	gramene	stop_codon	47993	47995	.	+	0	gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "7"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1	gramene	exon	48200	49247	.	+	.	gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "8"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; exon_id "Zm00001d027230_T001.exon8"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1	gramene	exon	49330	49837	.	+	.	gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "9"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; exon_id "Zm00001d027230_T001.exon9"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1	gramene	five_prime_utr	44289	44350	.	+	.	gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1	gramene	three_prime_utr	47996	48111	.	+	.	gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1	gramene	three_prime_utr	48200	49247	.	+	.	gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1	gramene	three_prime_utr	49330	49837	.	+	.	gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";

and one, where it fails:
1	araport11	gene	3631	5899	.	+	.	gene_id "AT1G01010"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding";
1	araport11	transcript	3631	5899	.	+	.	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; transcript_name "AT1G01010.1";
1	araport11	exon	3631	3913	.	+	.	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon1"; transcript_name "AT1G01010.1";
1	araport11	CDS	3760	3913	.	+	0	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1"; transcript_name "AT1G01010.1";
1	araport11	start_codon	3760	3762	.	+	0	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; transcript_name "AT1G01010.1";
1	araport11	exon	3996	4276	.	+	.	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "2"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon2"; transcript_name "AT1G01010.1";
1	araport11	CDS	3996	4276	.	+	2	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "2"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1"; transcript_name "AT1G01010.1";
1	araport11	exon	4486	4605	.	+	.	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "3"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon3"; transcript_name "AT1G01010.1";
1	araport11	CDS	4486	4605	.	+	0	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "3"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1"; transcript_name "AT1G01010.1";
1	araport11	exon	4706	5095	.	+	.	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "4"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon4"; transcript_name "AT1G01010.1";
1	araport11	CDS	4706	5095	.	+	0	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "4"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1"; transcript_name "AT1G01010.1";
1	araport11	exon	5174	5326	.	+	.	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "5"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon5"; transcript_name "AT1G01010.1";
1	araport11	CDS	5174	5326	.	+	0	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "5"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1"; transcript_name "AT1G01010.1";
1	araport11	exon	5439	5899	.	+	.	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "6"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon6"; transcript_name "AT1G01010.1";
1	araport11	CDS	5439	5627	.	+	0	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "6"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1"; transcript_name "AT1G01010.1";
1	araport11	stop_codon	5628	5630	.	+	0	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "6"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; transcript_name "AT1G01010.1";
1	araport11	five_prime_utr	3631	3759	.	+	.	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; transcript_name "AT1G01010.1";
1	araport11	three_prime_utr	5631	5899	.	+	.	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; transcript_name "AT1G01010.1";

with the following error:

INFO    2020-10-12 16:13:41     ConvertToRefFlat

********** NOTE: Picard's command line syntax is changing.
**********
********** For more information, please see:
********** https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)
**********
********** The command line looks like this in the new syntax:
**********
**********    ConvertToRefFlat -ANNOTATIONS_FILE /data/annotations/resources/genomes_rvd/Arabidopsis_thaliana/TAIR10/Arabidopsis_thaliana.TAIR10.48_rvd3b.Pt.Mt.gtf -SEQUENCE_DICTIONARY /data/annotations/resources/genomes_rvd/Arabidopsis_thaliana/TAIR10/TAIR10_chr_all.ShortId.DS.dict -OUTPUT /data/annotations/resources/genomes_rvd/Arabidopsis_thaliana/TAIR10/Arabidopsis_thaliana.TAIR10.48_rvd1b.refFlat
**********


16:13:42.607 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/data/annotations/tools/SingleCell/Drop-seq_tools-2.4.0/jar/lib/picard-2.20.5.jar!/com/intel/gkl/native/libgkl_compression.so
16:13:42.618 WARN  NativeLibraryLoader - Unable to load libgkl_compression.so from native/libgkl_compression.so (No such file or directory)
16:13:42.628 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/data/annotations/tools/SingleCell/Drop-seq_tools-2.4.0/jar/lib/picard-2.20.5.jar!/com/intel/gkl/native/libgkl_compression.so
16:13:42.628 WARN  NativeLibraryLoader - Unable to load libgkl_compression.so from native/libgkl_compression.so (No such file or directory)
[Mon Oct 12 16:13:42 CEST 2020] ConvertToRefFlat ANNOTATIONS_FILE=/data/annotations/resources/genomes_rvd/Arabidopsis_thaliana/TAIR10/Arabidopsis_thaliana.TAIR10.48_rvd3b.Pt.Mt.gtf SEQUENCE_DICTIONARY=/data/annotations/resources/genomes_rvd/Arabidopsis_thaliana/TAIR10/TAIR10_chr_all.ShortId.DS.dict OUTPUT=/data/annotations/resources/genomes_rvd/Arabidopsis_thaliana/TAIR10/Arabidopsis_thaliana.TAIR10.48_rvd1b.refFlat    VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Mon Oct 12 16:13:42 CEST 2020] Executing as rvd@VM.local on Linux 4.15.0-23-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_171-8u171-b11-0ubuntu0.18.04.1-b11; Deflater: Jdk; Inflater: Jdk; Provider GCS is not available; Picard version: 2.4.0(3d2b3d8_1600201514)
INFO    2020-10-12 16:13:43     GTFParser       Seen many non-increasing record positions. Printing Read-names as well.
[Mon Oct 12 16:13:45 CEST 2020] org.broadinstitute.dropseqrna.annotation.ConvertToRefFlat done. Elapsed time: 0.05 minutes.
Runtime.totalMemory()=2194669568
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
        at org.broadinstitute.dropseqrna.annotation.AnnotationUtils.parseOptionalFields(AnnotationUtils.java:386)
        at org.broadinstitute.dropseqrna.annotation.GTFParser.parseLine(GTFParser.java:114)
        at org.broadinstitute.dropseqrna.annotation.GTFParser.next(GTFParser.java:88)
        at org.broadinstitute.dropseqrna.annotation.GTFParser.next(GTFParser.java:39)
        at htsjdk.samtools.util.PeekableIterator.advance(PeekableIterator.java:71)
        at htsjdk.samtools.util.PeekableIterator.next(PeekableIterator.java:57)
        at org.broadinstitute.dropseqrna.utils.FilteredIterator.next(FilteredIterator.java:92)
        at java.util.Iterator.forEachRemaining(Iterator.java:116)
        at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
        at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
        at org.broadinstitute.dropseqrna.annotation.GeneFromGTFBuilder.gatherByGeneName(GeneFromGTFBuilder.java:205)
        at org.broadinstitute.dropseqrna.annotation.GeneFromGTFBuilder.<init>(GeneFromGTFBuilder.java:48)
        at org.broadinstitute.dropseqrna.annotation.GTFReader.load(GTFReader.java:79)
        at org.broadinstitute.dropseqrna.annotation.GTFReader.load(GTFReader.java:74)
        at org.broadinstitute.dropseqrna.annotation.GeneAnnotationReader.loadGTFFile(GeneAnnotationReader.java:67)
        at org.broadinstitute.dropseqrna.annotation.GeneAnnotationReader.loadAnnotationsFile(GeneAnnotationReader.java:51)
        at org.broadinstitute.dropseqrna.annotation.GeneAnnotationReader.loadAnnotationsFile(GeneAnnotationReader.java:61)
        at org.broadinstitute.dropseqrna.annotation.ConvertToRefFlat.doWork(ConvertToRefFlat.java:71)
        at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:305)
        at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
        at org.broadinstitute.dropseqrna.cmdline.DropSeqMain.main(DropSeqMain.java:42)

Both gf's have the same fields in column 9.
I see: "GTFParser Seen many non-increasing record positions"
what does this exactly mean?
So, what is wrong here? Is the order of the gene_id, transcript_id, gene_name, transcript_name important.
I assume the gene lines do not need a transcript id (btw adding this information to the gene lines did not solve my problems).
For transcript_name I simply used transcript_id
Also the one entry where there is a ";" in the gene name was removed.
And finally, there was a mismatch in the fasta and the gtf wrt the scaffold names (for Mt and Pt) this was also corrected.

Any suggestion is greatly appreciated,
Raymond

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions