I have problems creating the refflat from the gtf file. It has been reported more often, I followed all suggestion, and as far as I can see the gtfs's look OK. I have one where it works:
1 gramene gene 44289 49837 . + . gene_id "Zm00001d027230"; gene_source "gramene"; gene_biotype "protein_coding"; gene_name "Zm00001d027230";
1 gramene transcript 44289 49837 . + . gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1 gramene exon 44289 44947 . + . gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "1"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; exon_id "Zm00001d027230_T001.exon1"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1 gramene CDS 44351 44947 . + 0 gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "1"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; protein_id "Zm00001d027230_P001"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1 gramene start_codon 44351 44353 . + 0 gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "1"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1 gramene exon 45666 45803 . + . gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "2"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; exon_id "Zm00001d027230_T001.exon2"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1 gramene CDS 45666 45803 . + 0 gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "2"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; protein_id "Zm00001d027230_P001"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1 gramene exon 45888 46133 . + . gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "3"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; exon_id "Zm00001d027230_T001.exon3"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1 gramene CDS 45888 46133 . + 0 gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "3"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; protein_id "Zm00001d027230_P001"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1 gramene exon 46229 46342 . + . gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "4"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; exon_id "Zm00001d027230_T001.exon4"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1 gramene CDS 46229 46342 . + 0 gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "4"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; protein_id "Zm00001d027230_P001"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1 gramene exon 46451 46633 . + . gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "5"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; exon_id "Zm00001d027230_T001.exon5"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1 gramene CDS 46451 46633 . + 0 gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "5"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; protein_id "Zm00001d027230_P001"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1 gramene exon 47045 47262 . + . gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "6"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; exon_id "Zm00001d027230_T001.exon6"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1 gramene CDS 47045 47262 . + 0 gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "6"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; protein_id "Zm00001d027230_P001"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1 gramene exon 47650 48111 . + . gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "7"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; exon_id "Zm00001d027230_T001.exon7"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1 gramene CDS 47650 47992 . + 1 gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "7"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; protein_id "Zm00001d027230_P001"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1 gramene stop_codon 47993 47995 . + 0 gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "7"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1 gramene exon 48200 49247 . + . gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "8"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; exon_id "Zm00001d027230_T001.exon8"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1 gramene exon 49330 49837 . + . gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "9"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; exon_id "Zm00001d027230_T001.exon9"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1 gramene five_prime_utr 44289 44350 . + . gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1 gramene three_prime_utr 47996 48111 . + . gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1 gramene three_prime_utr 48200 49247 . + . gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1 gramene three_prime_utr 49330 49837 . + . gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
and one, where it fails:
1 araport11 gene 3631 5899 . + . gene_id "AT1G01010"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding";
1 araport11 transcript 3631 5899 . + . gene_id "AT1G01010"; transcript_id "AT1G01010.1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; transcript_name "AT1G01010.1";
1 araport11 exon 3631 3913 . + . gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon1"; transcript_name "AT1G01010.1";
1 araport11 CDS 3760 3913 . + 0 gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1"; transcript_name "AT1G01010.1";
1 araport11 start_codon 3760 3762 . + 0 gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; transcript_name "AT1G01010.1";
1 araport11 exon 3996 4276 . + . gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "2"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon2"; transcript_name "AT1G01010.1";
1 araport11 CDS 3996 4276 . + 2 gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "2"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1"; transcript_name "AT1G01010.1";
1 araport11 exon 4486 4605 . + . gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "3"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon3"; transcript_name "AT1G01010.1";
1 araport11 CDS 4486 4605 . + 0 gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "3"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1"; transcript_name "AT1G01010.1";
1 araport11 exon 4706 5095 . + . gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "4"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon4"; transcript_name "AT1G01010.1";
1 araport11 CDS 4706 5095 . + 0 gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "4"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1"; transcript_name "AT1G01010.1";
1 araport11 exon 5174 5326 . + . gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "5"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon5"; transcript_name "AT1G01010.1";
1 araport11 CDS 5174 5326 . + 0 gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "5"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1"; transcript_name "AT1G01010.1";
1 araport11 exon 5439 5899 . + . gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "6"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon6"; transcript_name "AT1G01010.1";
1 araport11 CDS 5439 5627 . + 0 gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "6"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1"; transcript_name "AT1G01010.1";
1 araport11 stop_codon 5628 5630 . + 0 gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "6"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; transcript_name "AT1G01010.1";
1 araport11 five_prime_utr 3631 3759 . + . gene_id "AT1G01010"; transcript_id "AT1G01010.1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; transcript_name "AT1G01010.1";
1 araport11 three_prime_utr 5631 5899 . + . gene_id "AT1G01010"; transcript_id "AT1G01010.1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; transcript_name "AT1G01010.1";
with the following error:
INFO 2020-10-12 16:13:41 ConvertToRefFlat
********** NOTE: Picard's command line syntax is changing.
**********
********** For more information, please see:
********** https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)
**********
********** The command line looks like this in the new syntax:
**********
********** ConvertToRefFlat -ANNOTATIONS_FILE /data/annotations/resources/genomes_rvd/Arabidopsis_thaliana/TAIR10/Arabidopsis_thaliana.TAIR10.48_rvd3b.Pt.Mt.gtf -SEQUENCE_DICTIONARY /data/annotations/resources/genomes_rvd/Arabidopsis_thaliana/TAIR10/TAIR10_chr_all.ShortId.DS.dict -OUTPUT /data/annotations/resources/genomes_rvd/Arabidopsis_thaliana/TAIR10/Arabidopsis_thaliana.TAIR10.48_rvd1b.refFlat
**********
16:13:42.607 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/data/annotations/tools/SingleCell/Drop-seq_tools-2.4.0/jar/lib/picard-2.20.5.jar!/com/intel/gkl/native/libgkl_compression.so
16:13:42.618 WARN NativeLibraryLoader - Unable to load libgkl_compression.so from native/libgkl_compression.so (No such file or directory)
16:13:42.628 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/data/annotations/tools/SingleCell/Drop-seq_tools-2.4.0/jar/lib/picard-2.20.5.jar!/com/intel/gkl/native/libgkl_compression.so
16:13:42.628 WARN NativeLibraryLoader - Unable to load libgkl_compression.so from native/libgkl_compression.so (No such file or directory)
[Mon Oct 12 16:13:42 CEST 2020] ConvertToRefFlat ANNOTATIONS_FILE=/data/annotations/resources/genomes_rvd/Arabidopsis_thaliana/TAIR10/Arabidopsis_thaliana.TAIR10.48_rvd3b.Pt.Mt.gtf SEQUENCE_DICTIONARY=/data/annotations/resources/genomes_rvd/Arabidopsis_thaliana/TAIR10/TAIR10_chr_all.ShortId.DS.dict OUTPUT=/data/annotations/resources/genomes_rvd/Arabidopsis_thaliana/TAIR10/Arabidopsis_thaliana.TAIR10.48_rvd1b.refFlat VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Mon Oct 12 16:13:42 CEST 2020] Executing as rvd@VM.local on Linux 4.15.0-23-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_171-8u171-b11-0ubuntu0.18.04.1-b11; Deflater: Jdk; Inflater: Jdk; Provider GCS is not available; Picard version: 2.4.0(3d2b3d8_1600201514)
INFO 2020-10-12 16:13:43 GTFParser Seen many non-increasing record positions. Printing Read-names as well.
[Mon Oct 12 16:13:45 CEST 2020] org.broadinstitute.dropseqrna.annotation.ConvertToRefFlat done. Elapsed time: 0.05 minutes.
Runtime.totalMemory()=2194669568
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
at org.broadinstitute.dropseqrna.annotation.AnnotationUtils.parseOptionalFields(AnnotationUtils.java:386)
at org.broadinstitute.dropseqrna.annotation.GTFParser.parseLine(GTFParser.java:114)
at org.broadinstitute.dropseqrna.annotation.GTFParser.next(GTFParser.java:88)
at org.broadinstitute.dropseqrna.annotation.GTFParser.next(GTFParser.java:39)
at htsjdk.samtools.util.PeekableIterator.advance(PeekableIterator.java:71)
at htsjdk.samtools.util.PeekableIterator.next(PeekableIterator.java:57)
at org.broadinstitute.dropseqrna.utils.FilteredIterator.next(FilteredIterator.java:92)
at java.util.Iterator.forEachRemaining(Iterator.java:116)
at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
at org.broadinstitute.dropseqrna.annotation.GeneFromGTFBuilder.gatherByGeneName(GeneFromGTFBuilder.java:205)
at org.broadinstitute.dropseqrna.annotation.GeneFromGTFBuilder.<init>(GeneFromGTFBuilder.java:48)
at org.broadinstitute.dropseqrna.annotation.GTFReader.load(GTFReader.java:79)
at org.broadinstitute.dropseqrna.annotation.GTFReader.load(GTFReader.java:74)
at org.broadinstitute.dropseqrna.annotation.GeneAnnotationReader.loadGTFFile(GeneAnnotationReader.java:67)
at org.broadinstitute.dropseqrna.annotation.GeneAnnotationReader.loadAnnotationsFile(GeneAnnotationReader.java:51)
at org.broadinstitute.dropseqrna.annotation.GeneAnnotationReader.loadAnnotationsFile(GeneAnnotationReader.java:61)
at org.broadinstitute.dropseqrna.annotation.ConvertToRefFlat.doWork(ConvertToRefFlat.java:71)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:305)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
at org.broadinstitute.dropseqrna.cmdline.DropSeqMain.main(DropSeqMain.java:42)
Both gf's have the same fields in column 9.
I see: "GTFParser Seen many non-increasing record positions"
what does this exactly mean?
So, what is wrong here? Is the order of the gene_id, transcript_id, gene_name, transcript_name important.
I assume the gene lines do not need a transcript id (btw adding this information to the gene lines did not solve my problems).
For transcript_name I simply used transcript_id
Also the one entry where there is a ";" in the gene name was removed.
And finally, there was a mismatch in the fasta and the gtf wrt the scaffold names (for Mt and Pt) this was also corrected.
Any suggestion is greatly appreciated,
Raymond
I have problems creating the refflat from the gtf file. It has been reported more often, I followed all suggestion, and as far as I can see the gtfs's look OK. I have one where it works:
with the following error:
Both gf's have the same fields in column 9.
I see: "GTFParser Seen many non-increasing record positions"
what does this exactly mean?
So, what is wrong here? Is the order of the gene_id, transcript_id, gene_name, transcript_name important.
I assume the gene lines do not need a transcript id (btw adding this information to the gene lines did not solve my problems).
For transcript_name I simply used transcript_id
Also the one entry where there is a ";" in the gene name was removed.
And finally, there was a mismatch in the fasta and the gtf wrt the scaffold names (for Mt and Pt) this was also corrected.
Any suggestion is greatly appreciated,
Raymond