buildJunctions in mapping pipeline produces invalid file

The buildJunctions function in the mapping pipeline builds a bed4 file of junctions:
contig, start, end, strand

to pass to various mappers. It does this by taking a bundle of gffs from a transcript iterator, sorting them into start order, and then walking along, creating intervals from the end of one entry to the start of the next. 

https://github.com/cgat-developers/cgat-flow/blob/87fcdbc71be79c8964a2af243417becc787ecea2/cgatpipelines/tools/pipeline_mapping.py#L619-L635

Unfortunately this fails to take account of the fact that the gffs list will contain things other than exons (it will include both CDS and transcript entries). In particular the transcript entries will often become the first entry when sorted on starts, but their end is after the end of all other entries. This means you get invalid entries where the start is at a higher coordinate than the end. And the interval doesn't refer to a junction. 

This file is used by mapReadsWithTopHat2 and mapReadsWithHisat. One assumes no one has used TopHat2 for years . However it leads Hisat2 to produce and invalid BAM file. See:

https://github.com/DaehwanKimLab/hisat2/issues/365

PR incoming. 


	for gffs in GTF.transcript_iterator(
	GTF.iterator(iotools.open_file(infile, "r"))):

	gffs.sort(key=lambda x: x.start)
	end = gffs[0].end
	for gff in gffs[1:]:
	# subtract one: these are not open/closed coordinates but
	# the 0-based coordinates
	# of first and last residue that are to be kept (i.e., within the
	# exon).
	outf.write("%s\t%i\t%i\t%s\n" %
	(gff.contig, end - 1, gff.start, gff.strand))
	end = gff.end
	njunctions += 1

	outf.close()

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

buildJunctions in mapping pipeline produces invalid file #157

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

buildJunctions in mapping pipeline produces invalid file #157

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions