agat_sp_extract_sequences.pl does not incorporate CDS feature ID in headers #450

jdcla · 2024-04-08T19:02:49Z

Describe the bug
According to the documentation, the headers created by the script are formatted as follows:

ID gene=gene_ID name=NAME seq_id=Chromosome_ID type=cds 5'extra=VALUE

However, when this script is used to extract the sequences of CDS features, the header IDs contain the ID of the mRNA feature rather than that of the selected CDS feature.

e.g.
>transcript:ENST00000399012 gene=gene:ENSG00000182378 seq_id=X type=cds
instead of
>CDS:ENSP00000431562 gene=gene:ENSG00000182378 seq_id=X type=cds

General (please complete the following information):
v1.4
Singularity
Ubuntu Linux

To Reproduce
Simply run the script on any GFF3 file whose CDS features carry an ID attribute.

E.g., using https://ftp.ensembl.org/pub/release-111/gff3/homo_sapiens/.
agat_sp_extract_sequences.pl -g Homo_sapiens.GRCh38.110.gff3 -f Homo_sapiens.GRCh38.dna.primary_assembly.fa -o cdss.fa -t cds
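To see which kind of ID ended up in the output, the header prefixes in the resulting FASTA file can be tallied. The following is a minimal Python sketch, assuming Ensembl-style IDs of the form `prefix:accession`; the function name `count_header_prefixes` is hypothetical and not part of AGAT.

```python
from collections import Counter

def count_header_prefixes(fasta_lines):
    """Count the 'prefix' part of each FASTA header ID (e.g. 'transcript', 'CDS').

    Assumption: IDs look like 'prefix:accession', as in Ensembl GFF3 files.
    """
    counts = Counter()
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # skip sequence lines
        fields = line[1:].split()
        if not fields:
            continue  # skip an empty header
        record_id = fields[0]
        prefix = record_id.split(":", 1)[0] if ":" in record_id else "(no prefix)"
        counts[prefix] += 1
    return counts

# Header copied from the report above; the sequence line is a placeholder.
example = [
    ">transcript:ENST00000399012 gene=gene:ENSG00000182378 seq_id=X type=cds",
    "ATGGCC",
]
prefix_counts = count_header_prefixes(example)
```

For a real run, pass an open file handle instead of the example list, for instance `count_header_prefixes(open("cdss.fa"))`. A result dominated by `transcript` rather than `CDS` matches the behavior described in this report.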

Expected behavior
Use the CDS ID in the header rather than the transcript/mRNA ID.

Additional context
Somewhat off-topic, but I was trying to apply this tool to GFF3 files containing multiple CDS IDs per mRNA (multicistronic). It seems this is currently not supported.
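Whether a given GFF3 file contains such multicistronic cases can be checked before running the extraction. The sketch below is one possible approach in Python, not AGAT code: it assumes standard 9-column GFF3 lines, reads the `ID` and `Parent` attributes of each CDS line, and reports parents that have more than one distinct CDS ID. The helper names and the IDs in the example are made up for illustration.

```python
from collections import defaultdict

def parse_attributes(column9):
    """Turn a GFF3 attribute column ('ID=x;Parent=y') into a dict."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs

def transcripts_with_multiple_cds_ids(gff_lines):
    """Return {parent_id: sorted CDS IDs} for parents with more than one CDS ID."""
    cds_ids = defaultdict(set)
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives, and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        attrs = parse_attributes(cols[8])
        if "ID" in attrs and "Parent" in attrs:
            # GFF3 allows several comma-separated parents per feature.
            for parent in attrs["Parent"].split(","):
                cds_ids[parent].add(attrs["ID"])
    return {p: sorted(ids) for p, ids in cds_ids.items() if len(ids) > 1}

# Made-up example: mrna_1 carries two different CDS IDs, mrna_2 only one.
example = [
    "\t".join(["1", "demo", "CDS", "100", "200", ".", "+", "0", "ID=cds_A;Parent=mrna_1"]),
    "\t".join(["1", "demo", "CDS", "300", "400", ".", "+", "1", "ID=cds_A;Parent=mrna_1"]),
    "\t".join(["1", "demo", "CDS", "600", "700", ".", "+", "0", "ID=cds_B;Parent=mrna_1"]),
    "\t".join(["1", "demo", "CDS", "900", "950", ".", "+", "0", "ID=cds_C;Parent=mrna_2"]),
]
multi = transcripts_with_multiple_cds_ids(example)
```

The sketch does not decode percent-escaped attribute values, which real GFF3 files may contain.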

Juke34 · 2024-04-08T20:15:15Z

Sounds fair.
I would suggest keeping the transcript ID as well, because with isoforms it would otherwise be difficult to tell which transcript the CDS comes from:

>CDS:ENSP00000431562 transcript=ENST00000399012 gene=gene:ENSG00000182378 seq_id=X type=cds
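The proposed layout can be sketched as a pair of small Python helpers, one that builds such a header and one that reads it back. This is only an illustration of the format suggested above, not how AGAT constructs its headers; the function names `build_cds_header` and `parse_header` are hypothetical.

```python
def build_cds_header(cds_id, transcript_id, gene_id, seq_id):
    """Build a FASTA header that leads with the CDS ID and keeps the transcript ID."""
    fields = [
        cds_id,
        f"transcript={transcript_id}",
        f"gene={gene_id}",
        f"seq_id={seq_id}",
        "type=cds",
    ]
    return ">" + " ".join(fields)

def parse_header(header):
    """Split a header into its leading record ID and its key=value fields."""
    parts = header.lstrip(">").split()
    record_id = parts[0]
    fields = dict(p.split("=", 1) for p in parts[1:] if "=" in p)
    return record_id, fields

# Values taken from the example header in this thread.
header = build_cds_header(
    "CDS:ENSP00000431562", "ENST00000399012", "gene:ENSG00000182378", "X"
)
record_id, fields = parse_header(header)
```

Keeping the transcript as a `key=value` field means downstream tools that only read the first token still get the CDS ID, while the isoform remains recoverable from the description.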

As for the multicistronic problem, this has never been taken into account... Please open another issue so that it can be discussed (at the very least, other users will then be aware of this AGAT limitation).

Juke34 · 2024-05-02T19:56:37Z

CDS chunks may share the same identifier. In that case, how do we differentiate the extracted CDS chunks from one another?
I guess we should add the chunk number, or something like that, to the description. What do you think about it, @jdcla?

jdcla · 2024-05-15T19:21:49Z

I'm not entirely sure what exactly "CDS chunks" refers to. Do you mean the parts of a CDS that lie on different exons?

Juke34 · 2024-05-16T15:42:49Z

Yes.
A CDS is a single feature that can span multiple genomic locations (in the case of multi-exon genes). So several CDS features (lines in the GFF) may be needed to build the biological CDS feature.
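To make the idea of numbered chunks concrete, here is a small Python sketch that groups CDS lines sharing an ID and numbers them. It is an illustration under stated assumptions, not AGAT's implementation: chunks are numbered by ascending start coordinate (a minus-strand CDS might instead be numbered in reverse), and the function name, IDs, and coordinates are made up.

```python
from collections import defaultdict

def number_cds_chunks(cds_rows):
    """Group CDS rows by ID and number each chunk.

    cds_rows: iterable of (cds_id, start, end) tuples, one per GFF CDS line.
    Returns {cds_id: [(chunk_number, start, end), ...]}.
    Assumption: chunks are numbered by ascending genomic start.
    """
    grouped = defaultdict(list)
    for cds_id, start, end in cds_rows:
        grouped[cds_id].append((start, end))
    numbered = {}
    for cds_id, spans in grouped.items():
        spans.sort()
        numbered[cds_id] = [(i, s, e) for i, (s, e) in enumerate(spans, start=1)]
    return numbered

# Made-up example: one CDS split over three exons, listed out of order.
rows = [
    ("cds_A", 300, 400),
    ("cds_A", 100, 200),
    ("cds_A", 600, 700),
]
chunks = number_cds_chunks(rows)
```

With numbering like this, a per-chunk header could carry the shared CDS ID plus a field such as `chunk=2`, which would keep the extracted chunks distinguishable; the field name is only a suggestion.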

jdcla · 2024-05-16T18:40:30Z

OK. I'm not familiar enough with annotation formats and conventions to know what the best approach is if it's important to list which chunks a CDS is constructed from. I was simply thinking that it would make sense to list the CDS identifier in the header of the FASTA file whenever such identifiers are present in the GFF file.

Juke34 added the "enhancement" (New feature or request) label · Apr 8, 2024

jdcla mentioned this issue · Apr 8, 2024: agat_sp_extract_sequences.pl: support for multicistronic transcripts. #451 (Open)
