agat_sp_extract_sequences.pl does not incorporate CDS feature ID in headers #450

Open
jdcla opened this issue Apr 8, 2024 · 5 comments
Labels
enhancement New feature or request

Comments

@jdcla

jdcla commented Apr 8, 2024

Describe the bug
According to the documentation, the headers created by the script are formatted as:

ID gene=gene_ID name=NAME seq_id=Chromosome_ID type=cds 5'extra=VALUE

However, when the script is used to extract sequences of CDS features, the header IDs contain the ID of the parent mRNA feature rather than that of the selected CDS feature.

e.g.
>transcript:ENST00000399012 gene=gene:ENSG00000182378 seq_id=X type=cds
instead of
>CDS:ENSP00000431562 gene=gene:ENSG00000182378 seq_id=X type=cds
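
To make the difference between the two headers easy to check on an output file, here is a minimal Python sketch that splits a header into its leading ID and its key=value fields. The function name `parse_header` is my own, and the sketch assumes that no field value contains spaces, which may not hold for every header the script writes.

```python
def parse_header(line):
    """Split a FASTA header into its leading ID and its key=value fields."""
    tokens = line.lstrip(">").split()
    seq_id, fields = tokens[0], {}
    for token in tokens[1:]:
        # Assumption: every remaining token has the form key=value.
        key, _, value = token.partition("=")
        fields[key] = value
    return seq_id, fields


observed = ">transcript:ENST00000399012 gene=gene:ENSG00000182378 seq_id=X type=cds"
expected = ">CDS:ENSP00000431562 gene=gene:ENSG00000182378 seq_id=X type=cds"

for header in (observed, expected):
    seq_id, fields = parse_header(header)
    # The prefix before the first colon shows which feature the ID belongs to.
    print(seq_id.split(":")[0], fields["type"])
```

Both headers report type=cds, but only the second one carries an ID with the CDS prefix.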

General (please complete the following information):
v1.4
Singularity
Ubuntu Linux

To Reproduce
Simply run the script on any GFF3 file whose CDS features carry an ID attribute.

E.g., using https://ftp.ensembl.org/pub/release-111/gff3/homo_sapiens/.
agat_sp_extract_sequences.pl -g Homo_sapiens.GRCh38.110.gff3 -f Homo_sapiens.GRCh38.dna.primary_assembly.fa -o cdss.fa -t cds

Expected behavior
Use the CDS ID in the header rather than the transcript/mRNA ID.

Additional context
Somewhat off-topic, but I was trying to apply this tool to GFF3 files containing multiple CDS IDs per mRNA (multicistronic). It seems this is currently not supported.
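
For reference, a rough Python sketch of how such multicistronic transcripts could be spotted in a GFF3 file before running the extraction. It is not part of AGAT; the function name and the toy input lines are made up for illustration, and it assumes plain tab-separated GFF3 with ID and Parent attributes on the CDS lines.

```python
from collections import defaultdict


def multicistronic_transcripts(gff_lines):
    """Return {transcript_id: [cds_ids]} for transcripts with more than one CDS ID."""
    cds_ids = defaultdict(set)
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        # Comment lines and malformed lines do not have nine columns.
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        attrs = dict(p.split("=", 1) for p in cols[8].split(";") if "=" in p)
        if "ID" in attrs and "Parent" in attrs:
            # A Parent attribute may list several comma-separated parents.
            for parent in attrs["Parent"].split(","):
                cds_ids[parent].add(attrs["ID"])
    return {t: sorted(ids) for t, ids in cds_ids.items() if len(ids) > 1}


# Toy input with invented IDs and coordinates.
example = [
    "##gff-version 3",
    "1\texample\tCDS\t100\t201\t.\t+\t0\tID=CDS:P1;Parent=transcript:T1",
    "1\texample\tCDS\t400\t501\t.\t+\t0\tID=CDS:P2;Parent=transcript:T1",
    "1\texample\tCDS\t900\t1001\t.\t+\t0\tID=CDS:P3;Parent=transcript:T2",
]
print(multicistronic_transcripts(example))
```

Only transcript:T1 is reported here, because it is the only parent with two distinct CDS IDs.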

Juke34 added the enhancement label on Apr 8, 2024
@Juke34
Collaborator

Juke34 commented Apr 8, 2024

Sounds fair.
I would suggest keeping the transcript ID, because with isoforms it would otherwise be difficult to tell which transcript the CDS comes from:

>CDS:ENSP00000431562 transcript=ENST00000399012 gene=gene:ENSG00000182378 seq_id=X type=cds
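
As a rough illustration of that layout (a Python sketch, not AGAT's actual Perl code), the header could be assembled from a CDS line as below. The function names are hypothetical, the coordinates in the sample line are placeholders, and the gene ID is passed in by hand because it has to be looked up through the transcript's own Parent attribute.

```python
def parse_attributes(column9):
    """Turn the ninth GFF3 column into a dict of attribute -> value."""
    return dict(pair.split("=", 1) for pair in column9.strip().split(";") if "=" in pair)


def cds_header(gff_line, gene_id):
    """Build a FASTA header that leads with the CDS ID and keeps the transcript ID."""
    cols = gff_line.rstrip("\n").split("\t")
    attrs = parse_attributes(cols[8])
    # Assumption: the CDS line has both an ID and a single Parent (the transcript).
    return (
        f">{attrs['ID']} transcript={attrs['Parent']} "
        f"gene={gene_id} seq_id={cols[0]} type=cds"
    )


# Placeholder coordinates; only the IDs come from the example above.
line = "X\texample\tCDS\t1000\t1200\t.\t+\t0\tID=CDS:ENSP00000431562;Parent=transcript:ENST00000399012"
print(cds_header(line, "gene:ENSG00000182378"))
```

Note that this sketch copies the Parent value verbatim, so the transcript field keeps its `transcript:` prefix.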

The multicistronic case has never been taken into account... Please open another issue so that it can be discussed (at the very least, other users will then be aware of this AGAT limitation).

@Juke34
Collaborator

Juke34 commented May 2, 2024

CDS chunks may share the same identifier; in that case, how do we differentiate the different extracted CDS chunks?
I guess we should add the chunk number or something like that to the description. What do you think, @jdcla?

@jdcla
Author

jdcla commented May 15, 2024

I'm not entirely sure what exactly "CDS chunks" refers to. Are you referring to the chunks located on different exons?

@Juke34
Collaborator

Juke34 commented May 16, 2024

Yes.
A CDS is a single feature that can span multiple genomic locations (in the case of multi-exon genes). So several CDS features (lines in the GFF) can be needed to build the biological CDS feature.
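
A small Python sketch of that idea, assuming tab-separated GFF3 lines where all chunks of one CDS share the same ID (falling back to Parent when no ID is present). The function name and the toy lines are invented for the example, and chunks are simply sorted by start coordinate, which ignores the reversed order needed on the minus strand.

```python
from collections import defaultdict


def group_cds_chunks(gff_lines):
    """Collect the (start, end) chunks of each CDS, keyed by its shared ID."""
    chunks = defaultdict(list)
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        attrs = dict(p.split("=", 1) for p in cols[8].split(";") if "=" in p)
        key = attrs.get("ID") or attrs.get("Parent")
        chunks[key].append((int(cols[3]), int(cols[4])))
    # Sort by start so the chunk order follows the genome (plus strand only).
    return {key: sorted(spans) for key, spans in chunks.items()}


# Two GFF lines, one biological CDS (invented IDs and coordinates).
lines = [
    "X\texample\tCDS\t300\t400\t.\t+\t0\tID=CDS:P1;Parent=transcript:T1",
    "X\texample\tCDS\t100\t201\t.\t+\t0\tID=CDS:P1;Parent=transcript:T1",
]
for cds_id, spans in group_cds_chunks(lines).items():
    # A chunk number like this one could go into the description if needed.
    for number, (start, end) in enumerate(spans, start=1):
        print(cds_id, f"chunk={number}", start, end)
```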

@jdcla
Author

jdcla commented May 16, 2024

Ok. I'm not familiar enough with annotation formats and conventions to know what the best approach is if it's important to list which chunks a CDS is constructed from. I was simply thinking that it would make sense to list the CDS identifier in the header of the FASTA file whenever one is present in the GFF file.
