500K Gene Models with Many Short Sequences: Valid AGAT Output or Command Error? #495

Open
Vijithkumar2020 opened this issue Sep 26, 2024 · 1 comment

Comments

@Vijithkumar2020

This concerns a recently assembled de novo plant genome. I used AGAT's feature extraction tool to extract the gene models predicted by AUGUSTUS. The repeat-masked genome is 2.6 Gb, and the FASTA file produced by AGAT's feature extraction is ~600 MB, comprising 500K gene models. The command I used is shown below. I would like to know whether this is the right command to use, because my output file contains far too many short sequences.

agat_sp_extract_sequences.pl \
--gff /output_file.gff \
--fasta /media/masked.fasta \
--output /out.fasta \
-t gene --split
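
One way to put a number on "too many short sequences" is to summarize the record lengths in the output FASTA. The following is a minimal sketch, not part of AGAT: the function names `fasta_lengths` and `summarize` are hypothetical, and the `short_cutoff` of 300 bases is an arbitrary illustrative threshold, not a biological standard. It assumes a plain FASTA file with unique record IDs.

```python
from statistics import median


def fasta_lengths(lines):
    """Return {record_id: sequence_length} for FASTA text given as an iterable of lines."""
    lengths = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: use the first whitespace-delimited token as the ID.
            parts = line[1:].split()
            current = parts[0] if parts else ""
            lengths[current] = 0
        elif current is not None:
            # Sequence line: sequences may be wrapped over several lines.
            lengths[current] += len(line)
    return lengths


def summarize(lengths, short_cutoff=300):
    """Summarize record lengths; short_cutoff is an assumed, illustrative threshold."""
    values = sorted(lengths.values())
    if not values:
        return {"records": 0, "min": None, "median": None, "max": None, "short": 0}
    return {
        "records": len(values),
        "min": values[0],
        "median": median(values),
        "max": values[-1],
        "short": sum(1 for v in values if v < short_cutoff),
    }
```

For example, `summarize(fasta_lengths(open("out.fasta")))` would report how many records fall below the cutoff, which makes it easier to describe the problem when asking for help.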
@Juke34
Collaborator

Juke34 commented Sep 26, 2024

Have you checked the help? https://nbisweden.github.io/AGAT/tools/agat_sp_extract_sequences/#briefly-in-pictures
I guess --split is useless here.
If you want to extract everything from the start of the gene to its end (so the sequence contains UTR + exon + intron), -t gene is correct.
To check what is in your file before running agat_sp_extract_sequences.pl, and to be sure that the input GFF really contains 500K genes, run agat_sq_stat_basic.pl prior to your analysis.
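
As a rough cross-check alongside agat_sq_stat_basic.pl, the feature counts can also be tallied directly from the GFF. This is only a sketch under stated assumptions: the function name `gff_feature_summary` is hypothetical, the input is assumed to be a tab-separated 9-column GFF, and gene features are assumed to be labelled exactly `gene` in column 3. If -t gene extracts from gene start to gene end as described above, the gene lengths computed here should correspond to the lengths of the extracted sequences, so many short entries would point to short gene models in the GFF itself.

```python
from collections import Counter


def gff_feature_summary(lines):
    """Count feature types (column 3) and collect the lengths of 'gene' features."""
    counts = Counter()
    gene_lengths = []
    for line in lines:
        # Skip comments, directives, and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a well-formed 9-column GFF row
        ftype, start, end = cols[2], int(cols[3]), int(cols[4])
        counts[ftype] += 1
        if ftype == "gene":
            # GFF coordinates are 1-based and inclusive.
            gene_lengths.append(end - start + 1)
    return counts, gene_lengths
```

Calling `gff_feature_summary(open("output_file.gff"))` returns the per-type counts and the list of gene lengths, which can then be compared with the number and lengths of records in the extracted FASTA.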
