Skip to content

Module: Alignments

Marie Hoffmann edited this page Jun 27, 2018 · 40 revisions

Tasks

  • Setup the repo structure
  • Implement the aligned_sequence_concept - use SeqAn2 Gaps as a baseline and discuss needed interfaces.
  • Implement the aligned_sequence_linear_access -> former seqan2::ArrayGaps
  • Implement the aligned_sequence_logarithmic_access -> former seqan2::AnchorGaps (Marie)
  • Implement the aligned_sequence_constant_access -> requires sdsl [optional] (Marie)
  • Implement the alignment data structure (JΓΆrg)
  • Implement the static/dynamic configuration model
  • Implement the new public align interface

Module Structure

alignment/detail/...                 // The core implementation of the alignment code.

alignment/representation.hpp         // Required only if output format is not score only.
alignment/representation/...         // Data structures and utility functions to represent and work with alignments.

alignment/pairwise.hpp
alignment/pairwise/...             // Defining the public API for calling the standard pairwise alignment algorithms.

alignment/msa.hpp
alignment/msa/...                  // Defining the public API for calling the msa algorithms.

alignment/seed_extend.hpp          // special alignment implementation for seed extension.
alignment/seed_extend/...          

alignment/split.hpp                // special alignment implementation for split breakpoint calculation.
alignment/split/...

alignment/banded_chain.hpp
alignment/banded_chain/...         // special alignment implementation for banded_chain alignment.

Alignment Representation

In SeqAn 2.x we have several possibilities to represent an alignment. In the following we introduce a new concept called aligned_sequence to replace the gaps data structures. We will use then an alignment object to replace the Align-object which had many shortcomings.

Gap Sequence

template <typename t>
concept bool aligned_sequence_concept(t g)
{
   requires sequence_concept<t>;

   typename t::underlying_type;

   // Element access
   { g[] const; } -> typename t::value_type;

   // Manipulation
   { g.insert_gap(0, 4) }       -> bool;
   { g.insert_gap(0, 4, hint) } -> bool;
   { g.remove_gap(0, 2) }       -> bool;
   { g.remove_gap(0, 2, hint) } -> bool;

   { g.map_to_aligned_position(0)     } -> typename t::size_type;
   { g.map_to_underlying_position(0)  } -> typename t::size_type;
};

Open questions:

  • although it implements the view concept, it is not a lightweight object, maybe an action_concept<>???

from the range-proposal:

By specifying that Ranges do not own their elements, and further specifying that range adaptors operate on and produce Ranges (not Containers), we are able to answer these questions in a clear and consistent way. The result of a chain of range adaptors is always a lightweight object that is cheap to copy and assign (O(1) as opposed to O(N)), and that refers to elements whose lifetime is managed by some other object. Mutating an element through the resulting Range object mutates the underlying sequence. Copies of the resulting range are aliases to the same elements, and mutations to the elements are visible through all the aliased ranges.

  • is the projection of the positions a find or map operation?

aligned sequence implementations:

Array Gaps Uses simple interleaved blocks structure, where the first value gives the number of gaps and the second the number of sequence chars that follow. Requires linear scan of the block structure for random access.

template <typename underlying_t> 
  requires container_concept<underlying_t>
class aligned_sequence_adaptor_linear_access
{
public:
protected:
private:   
};

Anchor Gaps Using sorted sequence anchors, which allow binary search for random access.

template <typename host_t>
class aligned_sequence_adaptor_logarithmic_access
{
public:
protected:
private:
};

Bitvector Gaps

  • using rank-support query data structure to lookup a position in view space.
    • sdsl::rank_support_v5? (+0.0625n bits)
    • sdsl::rank_support_il? (+128 bits)
    • all compressed sdsl bit vector data structures have to be rebuilt in case of modification, i.e. insertion/deletion of gaps!
    • therefore, constant runtime only when reading
template <typename host_t>
class aligned_sequence_adaptor_constant_access
{
public:
protected:
private:
};

Alignment View

template <typename first_t, typename ...remaining_ts>
    requires aligned_sequence_concept<first_t> &&
             (aligned_sequence_concept<remaining_ts> && ...)
class alignment : public std::vector<std::conditional_t<sizeof...(remaining_ts) == 0, 
                                                        first_t,
                                                        std::variant<first_t, remaining_ts...>>
{
    // TODO: Define the interface.
};

What else do we want to do with the alignment view? Utility functions:

  • merge two alignments.
  • print/stream/serialize the alignment.
  • compute statistics.
  • ...

Alignment Algorithms

We will offer in general 4 public pairwise sequence alignment interfaces. The 4 apis are split into, static configuration or dynamic configuration, and an align(...) and an align_and_then(...) interface. The former is used to wait on the result, the latter can be used in asynchronous function calls, where a callback function is called at the end of the alignment.

Compile-time configuration

We use a config object which must fulfil the align_config_static_concept. This defines a set of static constexpr parameters which are evaluated at compile time.

template <typename exec_policy_t, typename seq1_t, typename seq2_t, 
          typename align_config_t,
          typename scoring_scheme_t>
   requires is_execution_policy<exec_policy_t>::value &&
            container_concept<seq1_t> &&
            container_concept<seq2_t> &&
            align_config_static_concept<align_config_static_t>
inline auto 
align(exec_policy_t const & exec_policy,
      seq1_t const & seq1,
      seq2_t const & seq2,
      align_config_t const & config,
      align_parameters<scoring_scheme_t> const & params)
{
   // delegate to corresponding functions.
   // depending on configuration we can return several different things, like the alignment and score or just the score or the trace matrix or what ever we want.
   return {score, std::optional<output_format_t>{}};
}

// Continuation with callback function.
template <typename exec_policy_t, typename seq1_t, typename seq2_t, 
          typename align_config_t,
          typename scoring_scheme_t,
          typename callback_f>
   requires is_execution_policy<exec_policy_t>::value &&
            container_concept<seq1_t> &&
            container_concept<seq2_t> &&
            align_config_static_concept<align_config_t>
inline void
align_and_then(exec_policy_t const & exec_policy,
               seq1_t const & seq1,
               seq2_t const & seq2,
               align_config_t const & config,
               align_parameters<scoring_scheme_t> const & params)
               callback_f && callback)
{
   // delegate to corresponding functions.
   // depending on configuration we can callback several different things, like the alignment and score or just the score or the trace matrix or what ever we want.
   // calls callback(score, std::optional<output_format_t>{});
}

Runtime configuration

The runtime configuration creates a chain of continuators to create a static config object. The config type must fulfil the align_config_dynamic_concept. Since we cannot choose the output format at runtime to select a different return type, we cannot specify the output format in the dynamic config object.

template <typename exec_policy_t, 
          typename seq1_t, 
          typename seq2_t, 
          typename align_config_t,
          typename scoring_scheme_t>
   requires is_execution_policy<exec_policy_t>::value &&
            container_concept<seq1_t> &&
            container_concept<seq2_t> &&   
            align_config_dynamic_concept<align_config_t>
inline auto 
align(exec_policy_t const & exec_policy, seq1_t const & seq1, seq2_t const & seq2,
      align_config_t const & config,
      align_parameters<scoring_scheme_t> const & params)
{
   // delegate to corresponding functions.
   // depending on configuration we can return several different things, like the alignment and score or just the score or the trace matrix or what ever we want.
   return {score, std::optional<aligned_sequence_view_constant_access>{}};
}

template <typename exec_policy_t, typename seq1_t, typename seq2_t, 
          typename align_config_t,
          typename scoring_scheme_t,
          typename callback_f>
   requires is_execution_policy<exec_policy_t>::value &&
            container_concept<seq1_t> &&
            container_concept<seq2_t> &&
            align_config_dynamic_concept<align_config_t>
inline void
align_and_then(exec_policy_t const & exec_policy,
               seq1_t const & seq1,
               seq2_t const & seq2,
               align_config_t const & config,
               align_parameters<scoring_scheme_t> const & params)
               callback_f && callback)
{
   // delegate to corresponding functions.
   // depending on configuration we can callback several different things, like the alignment and score or just the score or the trace matrix or what ever we want.
   // calls callback(score, std::optional<aligned_sequence_view_constant_access>{});
}

The following gist shows an implementation of the configs and the parameter settings.

https://gist.github.com/rrahn/aec7f42dfdb5c0dfc574b3a060628d9b

https://gist.github.com/marehr/2b638e36e4526e0c195269025768fe06

Clone this wiki locally