From dbe4eb882bdb9914f8311c3991a474ffef3411bd Mon Sep 17 00:00:00 2001 From: Charles Plessy Date: Tue, 21 May 2024 11:46:01 +0900 Subject: [PATCH] Ran `pre-commit run --all-files` by hand. --- README.md | 183 +++++++++++++++++++++---------------------- docs/output.md | 2 - modules.json | 50 +++--------- nextflow_schema.json | 27 ++----- 4 files changed, 107 insertions(+), 155 deletions(-) diff --git a/README.md b/README.md index ddf6f64..50d84ca 100644 --- a/README.md +++ b/README.md @@ -19,124 +19,120 @@ ## Introduction -**nf-core/pairgenomealign** is a bioinformatics pipeline that aligns a single or set of query genomes in csv format with a target genome to make a pairwise representation in dotplots. +**nf-core/pairgenomealign** is a bioinformatics pipeline that aligns a single or set of query genomes in csv format with a target genome to make a pairwise representation in dotplots. -This pipeline usually takes in as an input a sample sheet in csv format which contain this set of queries or single query and align it pairwise with atarget genome in fasta or fa.gz format to make a dotplots representation of the paired alignment or alignments in case of multiple queries. +This pipeline usually takes in as an input a sample sheet in csv format which contain this set of queries or single query and align it pairwise with atarget genome in fasta or fa.gz format to make a dotplots representation of the paired alignment or alignments in case of multiple queries. ## Outputs -For each _query_ genome, this pipeline will align it to the _target_genome, post-process the alignments and produce dot plots visualisations at different steps of the workflow. Each file contains a name suffix that indicates in which order they were created. +For each _query_ genome, this pipeline will align it to the _target_ genome, post-process the alignments and produce dot plots visualisations at different steps of the workflow. 
Each file contains a name suffix that indicates in which order they were created. - - `.train` is the alignment parameters computed by `last-train` (optional) - - `m2m_aln` is the _**many-to-many**_ alignment between _target_ and _query_ genomes. (optional through the `--m2m` option) - - `m2m_plot` (optional) - - `m2o_aln` is the _**many-to-one**_ alignment regions of the _target_ genome are matched at most once by the _query_ genome. - - `m2o_plot` (optional) - - `o2o_aln` is the _**one-to-one**_ alignment between the _target_ and _query_ genomes. - - `o2o_plot` (optional) - - `o2m_aln` is the _**one-to-many**_ alignment between the _target_ and _query_ genomes (optional). - - `o2m_plot` (optional) +- `.train` is the alignment parameters computed by `last-train` (optional) +- `m2m_aln` is the _**many-to-many**_ alignment between _target_ and _query_ genomes. (optional through the `--m2m` option) +- `m2m_plot` (optional) +- `m2o_aln` is the _**many-to-one**_ alignment regions of the _target_ genome are matched at most once by the _query_ genome. +- `m2o_plot` (optional) +- `o2o_aln` is the _**one-to-one**_ alignment between the _target_ and _query_ genomes. +- `o2o_plot` (optional) +- `o2m_aln` is the _**one-to-many**_ alignment between the _target_ and _query_ genomes (optional). +- `o2m_plot` (optional) ## Mandatory parameters - * `--target`: path or URL to one genome file in FASTA format. It will be indexed. +- `--target`: path or URL to one genome file in FASTA format. It will be indexed. - * `--input`: path to a sample sheet in comma-separated format with one header line`sample, fasta`, and one row per genome (ID and path or URL to FASTA file). - - — or — - - `--query`: path or URL to one genome file in FASTA format. +- `--input`: path to a sample sheet in comma-separated format with one header line`sample, fasta`, and one row per genome (ID and path or URL to FASTA file). + — or — + `--query`: path or URL to one genome file in FASTA format. 
## Options - * `--seed` selects the name of the [LAST seed][] The default (`YASS`) searches for “_long-and-weak similarities_” that “_allow for mismatches but not gaps_”. Among alternatives, there are `NEAR` for “_short-and-strong (near-identical) similarities_ … _with many gaps (insertions and deletions)_”, `MAM8` to find _“weak - similarities with high sensitivity, but low speed and high memory usage”_ - or `RY128` that “_reduces run time and memory use, by only seeking seeds at - ~1/128 of positions in each sequence_”, which is useful when the purpose of - running this pipeline is only to generate whole-genome dotplots, or when - sensitivity for tiny fragments may be unnecessary or undesirable. Setting - the seed to `PSEUDO` triggers protein-to-DNA alignment mode (experimental). - - * `--lastal_args` defaults to `-C2` and is applied to both - the calls to `last-train` and `lastal`, like in the [LAST cookbook][] - and the [last-genome-alignments][] tutorial. - - * `--lastal_extr_args` (default: `-D1e9`) is only passed to `lastal` and - can be used for arguments that are not recognised by `last-train`. - - * `--lastal_params`: path to a file containing alignment parameters - computed by [`last-train`][] or a [scoring matrix][]. If this option - is not used, the pipeline will run `last-train` for each query. - - * `--m2m`: (default: false) Compute and output the many-to-many alignment. - This adds time and can comsume considerable amount of space; use only - if you need that data. - - * `--o2m`: (default: false) Also compute the _**one-to-many**_ alignments - and dotplots. This is sometimes useful when troubleshooting the - preparation of diploid assemblies. - - * `--one_to_one_only`: do not copy the other alignments to the results - folder, thus saving disk space. - - * By default, `last-split` runs with `-m1e-5` to omit alignments with - mismap probability > 10−5, but this can be overriden with - the `--last_split_mismap` option. 
- - * `--last_split_args` defaults to empty value and is not very useful at the - moment, but is kept for backwards compatibility. It can be used to pass - options to `last-split`. Note that if you used `--m2m false` (which is - the default), the split parameters have to be passed in - `--lastal_extra_args` and have different names (see _split options_ in the - [lastal documentation][]). - - * The dotplots can be modified by overriding defaults and passing new - arguments via the `--dotplot_options` argument. Defaults and available - options can be seen on the manual page of the [`last-dotplot`][] program. - By default in this pipeline, the sequences of the _query_ genome are - sorted and oriented by their alignment to the _target_ genome - (`--sort2=3 --strands2=1`). For readability, their names are written - horizontally (`--rot2=h`). - - * Use `--skip_dotplot_m2m`, `--skip_dotplot_m2o`, `--skip_dotplot_o2o` - `--skip_dotplot_o2m` to skip the production of the dot plots that can be - computationally expensive and visually uninformative on large genomes with - shared repeats. File suffixes (see above) will not change. - - * By default the LAST index is named `target` and the ouput files are named - from the query IDs. Use the `--targetName` option to provide a name - that will be used for the LAST index and that will be prefixed to the - query IDs with a `___` separator. 
- - - [`lastal`]: https://gitlab.com/mcfrith/last/-/blob/main/doc/lastal.rst - [`last-dotplot`]: https://gitlab.com/mcfrith/last/-/blob/main/doc/last-dotplot.rst - [LAST seed]: https://gitlab.com/mcfrith/last/-/blob/main/doc/last-seeds.rst - [LAST cookbook]: https://gitlab.com/mcfrith/last/-/blob/main/doc/last-cookbook.rst - [`last-train`]: https://gitlab.com/mcfrith/last/-/blob/main/doc/last-train.rst - [LAST tuning]: https://gitlab.com/mcfrith/last/-/blob/main/doc/last-tuning.rst - [scoring matrix]: https://gitlab.com/mcfrith/last/-/blob/main/doc/last-matrices.rst - [lastal documentation]: https://gitlab.com/mcfrith/last/-/blob/main/doc/lastal.rst - [last-genome-alignments]: https://github.com/mcfrith/last-genome-alignments +- `--seed` selects the name of the [LAST seed][] The default (`YASS`) searches for “_long-and-weak similarities_” that “_allow for mismatches but not gaps_”. Among alternatives, there are `NEAR` for “_short-and-strong (near-identical) similarities_ … _with many gaps (insertions and deletions)_”, `MAM8` to find _“weak + similarities with high sensitivity, but low speed and high memory usage”_ + or `RY128` that “_reduces run time and memory use, by only seeking seeds at + ~1/128 of positions in each sequence_”, which is useful when the purpose of + running this pipeline is only to generate whole-genome dotplots, or when + sensitivity for tiny fragments may be unnecessary or undesirable. Setting + the seed to `PSEUDO` triggers protein-to-DNA alignment mode (experimental). + +- `--lastal_args` defaults to `-C2` and is applied to both + the calls to `last-train` and `lastal`, like in the [LAST cookbook][] + and the [last-genome-alignments][] tutorial. + +- `--lastal_extr_args` (default: `-D1e9`) is only passed to `lastal` and + can be used for arguments that are not recognised by `last-train`. + +- `--lastal_params`: path to a file containing alignment parameters + computed by [`last-train`][] or a [scoring matrix][]. 
If this option + is not used, the pipeline will run `last-train` for each query. + +- `--m2m`: (default: false) Compute and output the many-to-many alignment. + This adds time and can comsume considerable amount of space; use only + if you need that data. + +- `--o2m`: (default: false) Also compute the _**one-to-many**_ alignments + and dotplots. This is sometimes useful when troubleshooting the + preparation of diploid assemblies. + +- `--one_to_one_only`: do not copy the other alignments to the results + folder, thus saving disk space. + +- By default, `last-split` runs with `-m1e-5` to omit alignments with + mismap probability > 10−5, but this can be overriden with + the `--last_split_mismap` option. + +- `--last_split_args` defaults to empty value and is not very useful at the + moment, but is kept for backwards compatibility. It can be used to pass + options to `last-split`. Note that if you used `--m2m false` (which is + the default), the split parameters have to be passed in + `--lastal_extra_args` and have different names (see _split options_ in the + [lastal documentation][]). + +- The dotplots can be modified by overriding defaults and passing new + arguments via the `--dotplot_options` argument. Defaults and available + options can be seen on the manual page of the [`last-dotplot`][] program. + By default in this pipeline, the sequences of the _query_ genome are + sorted and oriented by their alignment to the _target_ genome + (`--sort2=3 --strands2=1`). For readability, their names are written + horizontally (`--rot2=h`). + +- Use `--skip_dotplot_m2m`, `--skip_dotplot_m2o`, `--skip_dotplot_o2o` + `--skip_dotplot_o2m` to skip the production of the dot plots that can be + computationally expensive and visually uninformative on large genomes with + shared repeats. File suffixes (see above) will not change. + +- By default the LAST index is named `target` and the ouput files are named + from the query IDs. 
Use the `--targetName` option to provide a name + that will be used for the LAST index and that will be prefixed to the + query IDs with a `___` separator. + +[`lastal`]: https://gitlab.com/mcfrith/last/-/blob/main/doc/lastal.rst +[`last-dotplot`]: https://gitlab.com/mcfrith/last/-/blob/main/doc/last-dotplot.rst +[LAST seed]: https://gitlab.com/mcfrith/last/-/blob/main/doc/last-seeds.rst +[LAST cookbook]: https://gitlab.com/mcfrith/last/-/blob/main/doc/last-cookbook.rst +[`last-train`]: https://gitlab.com/mcfrith/last/-/blob/main/doc/last-train.rst +[LAST tuning]: https://gitlab.com/mcfrith/last/-/blob/main/doc/last-tuning.rst +[scoring matrix]: https://gitlab.com/mcfrith/last/-/blob/main/doc/last-matrices.rst +[lastal documentation]: https://gitlab.com/mcfrith/last/-/blob/main/doc/lastal.rst +[last-genome-alignments]: https://github.com/mcfrith/last-genome-alignments ## Fixed arguments (taken from the [LAST cookbook][] and the [LAST tuning][] manual) - * The `lastdb` step soft-masks simple repeats by default, (`-c -R01`).It indexes both strands (`-S2`), which increases speed at the expense of memory usage. +- The `lastdb` step soft-masks simple repeats by default, (`-c -R01`).It indexes both strands (`-S2`), which increases speed at the expense of memory usage. - * The `last-train` commands runs with `--revsym` as the DNA strands play equivalent roles in the studied genomes, unless the `--read_align` option is selected. +- The `last-train` commands runs with `--revsym` as the DNA strands play equivalent roles in the studied genomes, unless the `--read_align` option is selected. - * `last-split` runs with `-fMAF+` to make it show per-base mismap probabilities, except in read alignment mode (see below). +- `last-split` runs with `-fMAF+` to make it show per-base mismap probabilities, except in read alignment mode (see below). 
## Usage > [!NOTE] > If you are new to Nextflow and nf-core, please refer to [this page](https://nf-co.re/docs/usage/installation) on how to set-up Nextflow. Make sure to [test your setup](https://nf-co.re/docs/usage/introduction#how-to-run-a-pipeline) with `-profile test` before running the workflow on actual data. - First, prepare a samplesheet with your input data that looks as follows: `samplesheet.csv`: @@ -145,12 +141,11 @@ First, prepare a samplesheet with your input data that looks as follows: sample,fasta Query_1,AEG588A1_S1_L002_R1_001.fasta ``` -Each row represents a fasta file, this can also contain multiple rows to accomodate multiple query genomes in fasta format. +Each row represents a fasta file, this can also contain multiple rows to accomodate multiple query genomes in fasta format. Now, you can run the pipeline using: - ```bash nextflow run nf-core/pairgenomealign \ -profile \ @@ -188,7 +183,7 @@ For further information or help, don't hesitate to get in touch on the [Slack `# If you use this pipeline, please cite: -Extreme genome scrambling in marine planktonic Oikopleura dioica cryptic species_. Charles Plessy, Michael J. Mansfield, Aleksandra Bliznina, Aki Masunaga, Charlotte West, Yongkai Tan, Andrew W. Liu, Jan Grašič, María Sara del Río Pisula, Gaspar Sánchez-Serna, Marc Fabrega-Torrus, Alfonso Ferrández-Roldán, Vittoria Roncalli, Pavla Navratilova, Eric M. Thompson, Takeshi Onuma, Hiroki Nishida, Cristian Cañestro, Nicholas M. Luscombe. Genome Res. 2024. 34: 426-440; doi:[10.1101/2023.05.09.539028](https://doi.org/10.1101/gr.278295.123). PubMed ID: [38621828](https://pubmed.ncbi.nlm.nih.gov/38621828/) +Extreme genome scrambling in marine planktonic Oikopleura dioica cryptic species. Charles Plessy, Michael J. Mansfield, Aleksandra Bliznina, Aki Masunaga, Charlotte West, Yongkai Tan, Andrew W. 
Liu, Jan Grašič, María Sara del Río Pisula, Gaspar Sánchez-Serna, Marc Fabrega-Torrus, Alfonso Ferrández-Roldán, Vittoria Roncalli, Pavla Navratilova, Eric M. Thompson, Takeshi Onuma, Hiroki Nishida, Cristian Cañestro, Nicholas M. Luscombe. Genome Res. 2024. 34: 426-440; doi:[10.1101/2023.05.09.539028](https://doi.org/10.1101/gr.278295.123). PubMed ID: [38621828](https://pubmed.ncbi.nlm.nih.gov/38621828/) [OIST research news article](https://www.oist.jp/news-center/news/2024/4/25/oikopleura-who-species-identity-crisis-genome-community) diff --git a/docs/output.md b/docs/output.md index 531575a..7b3b0ec 100644 --- a/docs/output.md +++ b/docs/output.md @@ -6,7 +6,6 @@ This document describes the output produced by the pipeline. Most of the plots a The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory. - ## Pipeline overview The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data using the following steps: @@ -14,7 +13,6 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d - [MultiQC](#multiqc) - Aggregate report describing results and QC from the whole pipeline - [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution - ### MultiQC
diff --git a/modules.json b/modules.json
index 8cf7b1a..926c7ad 100644
--- a/modules.json
+++ b/modules.json
@@ -8,65 +8,47 @@
                     "assemblyscan": {
                         "branch": "master",
                         "git_sha": "3f5420aa22e00bd030a2556dfdffc9e164ec0ec5",
-                        "installed_by": [
-                            "modules"
-                        ]
+                        "installed_by": ["modules"]
                     },
                     "gfastats": {
                         "branch": "master",
                         "git_sha": "3f5420aa22e00bd030a2556dfdffc9e164ec0ec5",
-                        "installed_by": [
-                            "modules"
-                        ]
+                        "installed_by": ["modules"]
                     },
                     "last/dotplot": {
                         "branch": "master",
                         "git_sha": "3f5420aa22e00bd030a2556dfdffc9e164ec0ec5",
-                        "installed_by": [
-                            "modules"
-                        ]
+                        "installed_by": ["modules"]
                     },
                     "last/lastal": {
                         "branch": "master",
                         "git_sha": "3f5420aa22e00bd030a2556dfdffc9e164ec0ec5",
-                        "installed_by": [
-                            "modules"
-                        ]
+                        "installed_by": ["modules"]
                     },
                     "last/lastdb": {
                         "branch": "master",
                         "git_sha": "3f5420aa22e00bd030a2556dfdffc9e164ec0ec5",
-                        "installed_by": [
-                            "modules"
-                        ]
+                        "installed_by": ["modules"]
                     },
                     "last/mafswap": {
                         "branch": "master",
                         "git_sha": "3f5420aa22e00bd030a2556dfdffc9e164ec0ec5",
-                        "installed_by": [
-                            "modules"
-                        ]
+                        "installed_by": ["modules"]
                     },
                     "last/split": {
                         "branch": "master",
                         "git_sha": "3f5420aa22e00bd030a2556dfdffc9e164ec0ec5",
-                        "installed_by": [
-                            "modules"
-                        ]
+                        "installed_by": ["modules"]
                     },
                     "last/train": {
                         "branch": "master",
                         "git_sha": "3f5420aa22e00bd030a2556dfdffc9e164ec0ec5",
-                        "installed_by": [
-                            "modules"
-                        ]
+                        "installed_by": ["modules"]
                     },
                     "multiqc": {
                         "branch": "master",
                         "git_sha": "b7ebe95761cd389603f9cc0e0dc384c0f663815a",
-                        "installed_by": [
-                            "modules"
-                        ]
+                        "installed_by": ["modules"]
                     }
                 }
             },
@@ -75,26 +57,20 @@
                     "utils_nextflow_pipeline": {
                         "branch": "master",
                         "git_sha": "5caf7640a9ef1d18d765d55339be751bb0969dfa",
-                        "installed_by": [
-                            "subworkflows"
-                        ]
+                        "installed_by": ["subworkflows"]
                     },
                     "utils_nfcore_pipeline": {
                         "branch": "master",
                         "git_sha": "92de218a329bfc9a9033116eb5f65fd270e72ba3",
-                        "installed_by": [
-                            "subworkflows"
-                        ]
+                        "installed_by": ["subworkflows"]
                     },
                     "utils_nfvalidation_plugin": {
                         "branch": "master",
                         "git_sha": "5caf7640a9ef1d18d765d55339be751bb0969dfa",
-                        "installed_by": [
-                            "subworkflows"
-                        ]
+                        "installed_by": ["subworkflows"]
                     }
                 }
             }
         }
     }
-}
\ No newline at end of file
+}
diff --git a/nextflow_schema.json b/nextflow_schema.json
index 81d70a1..8620e2e 100644
--- a/nextflow_schema.json
+++ b/nextflow_schema.json
@@ -26,13 +26,7 @@
             "properties": {
                 "seed": {
                     "type": "string",
-                    "enum": [
-                        "YASS",
-                        "NEAR",
-                        "MAM8",
-                        "RY128",
-                        "PSEUDO"
-                    ],
+                    "enum": ["YASS", "NEAR", "MAM8", "RY128", "PSEUDO"],
                     "help_text": "--seed selects the name of the LAST seed The default (YASS) searches for \u201clong-and-weak similarities\u201d that \u201callow for mismatches but not gaps\u201d. Among alternatives, there are NEAR for \u201cshort-and-strong (near-identical) similarities \u2026 with many gaps (insertions and deletions)\u201d, MAM8 to find \u201cweak similarities with high sensitivity, but low speed and high memory usage\u201d or RY128 that \u201creduces run time and memory use, by only seeking seeds at ~1/128 of positions in each sequence\u201d, which is useful when the purpose of running this pipeline is only to generate whole-genome dotplots, or when sensitivity for tiny fragments may be unnecessary or undesirable. Setting the seed to PSEUDO triggers protein-to-DNA alignment mode (experimental).",
                     "description": "The default (YASS) searches for \u201clong-and-weak similarities\u201d that \u201callow for mismatches but not gaps\u201d.",
                     "default": "YASS"
@@ -75,11 +69,7 @@
             "type": "object",
             "fa_icon": "fas fa-terminal",
             "description": "Define where the pipeline should find input data and save output data.",
-            "required": [
-                "input",
-                "target",
-                "outdir"
-            ],
+            "required": ["input", "target", "outdir"],
             "properties": {
                 "input": {
                     "type": "string",
@@ -258,14 +248,7 @@
                     "description": "Method used to save pipeline results to output directory.",
                     "help_text": "The Nextflow `publishDir` option specifies which intermediate files should be saved to the output directory. This option tells the pipeline what method should be used to move these files. See [Nextflow docs](https://www.nextflow.io/docs/latest/process.html#publishdir) for details.",
                     "fa_icon": "fas fa-copy",
-                    "enum": [
-                        "symlink",
-                        "rellink",
-                        "link",
-                        "copy",
-                        "copyNoFollow",
-                        "move"
-                    ],
+                    "enum": ["symlink", "rellink", "link", "copy", "copyNoFollow", "move"],
                     "hidden": true
                 },
                 "email_on_fail": {
@@ -376,7 +359,7 @@
                 },
                 "last_split_mismap": {
                     "type": "string",
-                    "default": 1e-05,
+                    "default": 1e-5,
                     "description": "By default, last-split runs with -m1e-5 to omit alignments with mismap probability > 10\u22125, but this can be overriden with the --last_split_mismap option."
                 }
             }
@@ -411,4 +394,4 @@
             "$ref": "#/definitions/new_group_1"
         }
     ]
-}
\ No newline at end of file
+}