A server side application determining bacterial relatedness

davidhwyllie/findNeighbour

Abstract

FindNeighbour is a server application for investigating bacterial relatedness. Accessible via RESTful webservices, FindNeighbour maintains an in-memory distance matrix on the server for a sequence collection, which is automatically cached to disc. It supports incremental addition of samples and, for a given sample, allows queries identifying similar sequences with millisecond response times.

The inputs to the service are strings containing DNA sequence information, typically generated by mapping and basecalling, followed by storage in FASTA or other formats. The service can be queried with a string containing DNA sequence information and a single nucleotide polymorphism (SNP) threshold; it returns a list of similar samples. The software is designed for, and has been extensively tested with, mapped data from bacterial genome sequencing.
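Because the service takes the sequence as a plain string, a FASTA file has to be flattened before it is submitted. The following is a minimal sketch of one way to do that, assuming a FASTA file holding a single record; the function name `read_fasta_sequence` is hypothetical and is not part of findNeighbour.

```python
def read_fasta_sequence(path):
    """Return the sequence in a single-record FASTA file as one upper-case string.

    Assumption: the file holds one record. With several records, this
    sketch would concatenate them, which is not what you want.
    """
    parts = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith(">"):
                continue  # skip blank lines and the '>' header line
            parts.append(line.upper())
    return "".join(parts)
```

The returned string can then be passed as the sequence argument when inserting a sample.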

Requirements

FindNeighbour comprises several components.

  • Python code, built on web.py, handles API calls. The component doing this is webservice-server.py.
  • A C++ daemon which is called by webservice-server.py. This code is findNeighbour.cpp.

Tools are also provided to launch one or more instances of the webservice.

findNeighbour.cpp can be compiled and run on Linux (using gcc). We have not tested it on Windows, but expect it would work. It uses the C++ Standard Library 14. OpenMP is required for parallelisation.
Server memory requirements depend on the number of samples stored, the amount of variation between them, and their length. We have tested it with bacterial genomes. Approximate server memory requirements for mapped M. tuberculosis data (4.4 million bases/genome) are 2 GB for 400 samples and 20 GB for 4,000 samples.

Performance

Hosted on an Ubuntu Linux server with 128 GB of RAM, using 16 threads, the server takes about 1 second to add a sample to a collection of 1,000 M. tuberculosis samples. Addition time scales linearly with the number of samples in the sequence store.

Queries requesting samples similar to a given sample return in about 50 ms.
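If addition time grows linearly with the size of the store, the total time to load a collection one sample at a time grows roughly quadratically. The sketch below works through that arithmetic. It assumes the cost is strictly proportional to the current store size and extrapolates from the single figure quoted above (about 1 second at 1,000 samples on that hardware), so treat the results as rough estimates only. Both function names are hypothetical.

```python
def estimated_add_seconds(n_existing, seconds_per_1000=1.0):
    """Estimated time to add one sample to a store holding n_existing samples."""
    return seconds_per_1000 * n_existing / 1000.0


def estimated_total_load_seconds(n_samples, seconds_per_1000=1.0):
    """Estimated time to insert n_samples one by one into an empty store.

    This is the arithmetic series sum of k/1000 for k = 0 .. n_samples - 1.
    """
    return seconds_per_1000 * n_samples * (n_samples - 1) / 2000.0
```

Under these assumptions, `estimated_total_load_seconds(1000)` gives 499.5 seconds, and `estimated_total_load_seconds(4000)` gives 7998 seconds.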

Setting up the server

0 Prepare the server

First, check that the system has the gcc compiler and the OpenMP library. The examples below are for Linux. Compilation has also been tested on Windows with DevC++ and Visual Studio 15.

Linux compilation instructions:

1. Check the OpenMP library: echo | cpp -fopenmp -dM | grep -i open
2. Check the gcc compiler version: gcc --version
3. Install zlib (needed for linking with -lz): sudo apt-get install zlib1g-dev

Install a version of gcc and g++ that supports -std=c++11 (GCC 4.7 or later): sudo apt-get install gcc g++

Install OpenMP: sudo apt-get install libgomp1

http://openmp.org/wp/openmp-compilers/ https://huseyincakir.wordpress.com/2009/11/05/installing-openmp-in-linux-debian/

Install the web.py library: http://webpy.org/install

1 Compile the application

First, compile the C++ component.

make clean
make

Alternatively, run the compiler directly. Internally, the makefile does this:

g++ -std=c++11 -fopenmp -O3 findNeighbour.cpp -lz -o findNeighbour

2 Optionally, you can interact directly with the findNeighbour daemon.

We do not recommend that you do this. You can skip this step and go to step 3.

To start the daemon, do one of:

./findNeighbour

./findNeighbour -t 8

./findNeighbour --threads 8 --name /path/to/writable/directory

  • --threads sets the number of threads used when processing samples.
  • --name sets the directory in which the daemon stores its files. It must be writable.

By default, threads is 8, recovery is 0, and name is the current working directory.

The findNeighbour daemon will now be running. It accepts several commands, including the following:

| Command | Possible responses |
|---|---|
| INSERT id_sample fasta_sample | Err or OK |
| GETVALUE IDS id_sample threshold | Err, or a list containing ids of samples within threshold SNPs of id_sample: ['id_sample1',..,'id_sampleN'] |
| GETVALUE SNP id_sample threshold | Err, or a list containing pairs of samples including id_sample, and their pairwise distances: [['id_sample1',snp],..,['id_sampleN',snpN]] |
| GETALLVALUES IDS threshold | Err, or a list of all ids in the store: ['id_sample1',..,'id_sampleN'] |
| GETALLVALUES SNP threshold | Err, or a list containing all pairs of samples and their pairwise distances: [['id_sample1',snp],..,['id_sampleN',snpN]] |
| BACKUP | Err or OK |
| EXIT | Exits |

Examples of use
# insert four sequences into the server
INSERT 1 ACCTGNCCTG
INSERT 2 ACAAGNCTCG
INSERT 3 ACCTGNNNAG
INSERT 4 ANANTNNNGG

# get pairs of samples, which include id 1, and have pairwise distance with id 1 <= 10 SNP
GETVALUE SNP 1 10

# get ids of samples which have pairwise distance with id 1 <= 10 SNP
GETVALUE IDS 1 10

# get all pairs of samples with SNP distance <= 10
GETALLVALUES SNP 10

# get all the ids which have neighbours with SNP distances <=10
GETALLVALUES IDS 10

# save the contents
BACKUP

# exit
EXIT
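
To make the idea of a SNP threshold concrete, here is a small pure-Python sketch that computes pairwise distances for the four example sequences. It is not the daemon's algorithm. It assumes that a position counts as a difference only when both bases are called (neither is N) and differ, and that the query sample is excluded from its own neighbour list. The daemon may treat N or other characters differently. The names `snp_distance` and `neighbours` are hypothetical.

```python
def snp_distance(seq_a, seq_b, uncalled="N"):
    """Count positions where both bases are called and differ (assumed rule)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length")
    return sum(
        1
        for a, b in zip(seq_a, seq_b)
        if a != uncalled and b != uncalled and a != b
    )


# The four sequences inserted in the example above.
sequences = {
    "1": "ACCTGNCCTG",
    "2": "ACAAGNCTCG",
    "3": "ACCTGNNNAG",
    "4": "ANANTNNNGG",
}


def neighbours(sample_id, threshold):
    """Return [id, distance] pairs for samples within threshold of sample_id."""
    query = sequences[sample_id]
    result = []
    for other_id in sorted(sequences):
        if other_id == sample_id:
            continue  # assumption: a sample is not reported as its own neighbour
        distance = snp_distance(query, sequences[other_id])
        if distance <= threshold:
            result.append([other_id, distance])
    return result
```

Under the stated rule, sample 1 is 4 SNPs from sample 2, 1 SNP from sample 3 and 3 SNPs from sample 4, so `neighbours("1", 3)` yields samples 3 and 4 only.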

3 Start the findNeighbour web service

Server:

python webservice-server.py ip port path_to_store_files

Example:

python webservice-server.py localhost 8185 R00000039

On the client:

python webservice-client.py # this will run some queries against the server

Client

# example use of the FindNeighbour web server.
# these commands are based on those in webservice-client.py
try:
    import xmlrpclib  # Python 2
except ImportError:
    import xmlrpc.client as xmlrpclib  # Python 3

client = xmlrpclib.ServerProxy("http://localhost:8185")  # or wherever your server is running

# insert four sequences, each comprising 10 nucleotides
print(client.insert('1', 'ACCTGNCCTG'))
print(client.insert('2', 'ACAAGNCTCG'))
print(client.insert('3', 'ACCTGNNNAG'))
print(client.insert('4', 'ANANTNNNGG'))

# query the server
print(client.query_get_value_ids('1', '5'))
print(client.query_get_value_snp('1', '5'))
print(client.query_get_values_ids('5'))
print(client.query_get_values_snp('5'))

# force save all results
print(client.save())

# stop the service
# print(client.exit())
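
When loading more than a handful of samples, it can be convenient to wrap the insert call in a loop. The sketch below assumes only the `insert(sample_id, sequence)` method shown above; the helper name `insert_all` is hypothetical, and the sketch makes no assumption about the format of the server's response.

```python
def insert_all(client, sequences):
    """Insert each sample and return the server's responses keyed by sample id.

    client    : any object with an insert(sample_id, sequence) method,
                such as the XML-RPC proxy created above
    sequences : dict mapping sample id to sequence string
    """
    responses = {}
    for sample_id, sequence in sequences.items():
        # ids are sent as strings, as in the example client
        responses[str(sample_id)] = client.insert(str(sample_id), sequence)
    return responses
```

With the client from the example, `insert_all(client, {'1': 'ACCTGNCCTG', '2': 'ACAAGNCTCG'})` would send both samples and collect whatever the server returns for each.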

This completes the process for launching a single server. Various scripts are included as examples of how to launch multiple services programmatically, for the purpose of demonstrating the sharding functionality we describe in the paper.

For example:

  • push_samples: recovers FASTA files and loads them into a findNeighbour instance.
  • create_fn_branches.py: creates multiple server instances.
  • webservice-populate-branches.py: loads samples into the various branches, depending on their classification (which is computed by external scripts).
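
As a rough illustration of launching several instances, the sketch below builds one command line per instance using the `python webservice-server.py ip port path_to_store_files` form shown in step 3. It is not taken from the provided scripts. The function name, the use of consecutive ports, and the directory names in the usage note are assumptions made for illustration.

```python
def build_server_commands(host, first_port, store_dirs,
                          interpreter="python", script="webservice-server.py"):
    """Return one argument list per instance, each on its own port.

    Assumption: instances are given consecutive ports starting at first_port.
    """
    return [
        [interpreter, script, host, str(first_port + offset), store_dir]
        for offset, store_dir in enumerate(store_dirs)
    ]
```

For instance, `build_server_commands("localhost", 8185, ["branch_a", "branch_b"])` produces commands for ports 8185 and 8186. Each argument list could then be passed to `subprocess.Popen` to start the corresponding server.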
