TribeMCL
From Biocourse.org
Abstract
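TribeMCL is a tool for clustering protein sequences; that is, it groups related sequences into protein families.
It analyzes the similarity patterns between the proteins in a given dataset and assigns them to groups of similar sequences.
TribeMCL uses the Markov Clustering (MCL) method, which is useful for eliminating the problems that normally hinder clustering.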
These problems include:
multi-domain proteins, peptide fragments and proteins which possess domains which are very widespread (promiscuous domains).
The efficiency of the method makes it applicable to the clustering of very large datasets. We routinely use the algorithm to cluster datasets as large as 500,000 peptides.
Install tribe-mcl
./configure --prefix=/usr/local/mcl --enable-tribe
make
make install
If tribe-matrix fails with a segmentation fault on a 64-bit system, overwriting it with a tribe-matrix binary compiled on a 32-bit system makes it run.
Method
Protein sequence similarities are represented as a graph where nodes represent proteins and edges represent protein sequence similarities detected using a method such as BLAST.
This graph is weighted according to the -log(E-value) detected by BLAST for each sequence similarity.
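As a minimal sketch of this weighting, the function below computes -log(E-value) from the two numbers used in the parsed input (a number and an exponent). The function name and the choice of base 10 are assumptions made for illustration; the exact transformation inside TRIBE-MCL may differ.

```python
import math


def edge_weight(mantissa: int, exponent: int) -> float:
    """Return -log10(E) for an E-value written as mantissa x 10^-exponent.

    Assumption: base-10 logarithm; the source only says "-log(E-value)".
    """
    if mantissa <= 0:
        # An E-value of 0 has no finite -log, so it is rejected here.
        raise ValueError("mantissa must be positive")
    # -log10(m * 10^-e) = e - log10(m)
    return exponent - math.log10(mantissa)


print(edge_weight(1, 133))  # E-value 1x10^-133
print(edge_weight(6, 88))   # E-value 6x10^-88
```

Smaller E-values therefore give larger edge weights, so stronger similarities pull proteins together more strongly.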
The graph is then transformed into a Markov matrix, which represents the transition probability between all nodes of the graph based on connectivities and weights.
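One way to picture this transformation is to normalise the edge weights leaving each node so that they sum to 1. The sketch below does that with plain dictionaries. The function name, the symmetric treatment of edges, and keeping the larger weight for duplicate pairs are assumptions made for the example, not a description of what tribe-matrix does internally.

```python
from collections import defaultdict


def build_markov(edges):
    """Turn weighted edges into per-node transition probabilities.

    edges: iterable of (protein_a, protein_b, weight) with weight > 0.
    Returns {node: {neighbour: probability}}; each inner dict sums to 1.
    """
    weights = defaultdict(dict)
    for a, b, w in edges:
        # Assumption: similarity is symmetric, so store both directions and
        # keep the larger weight if a pair is reported more than once.
        weights[a][b] = max(w, weights[a].get(b, 0.0))
        weights[b][a] = max(w, weights[b].get(a, 0.0))

    markov = {}
    for node, neighbours in weights.items():
        total = sum(neighbours.values())
        markov[node] = {n: w / total for n, w in neighbours.items()}
    return markov


example = build_markov([("A", "B", 3.0), ("A", "C", 1.0)])
print(example["A"])
```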
This matrix is passed through iterative rounds of expansion (raising the matrix to a power) and inflation (rescaling transition probabilities after expansion).
Matrix expansion corresponds to computing the probabilities of random walks of higher lengths through the graph, while inflation promotes and demotes the probabilities of paths in the graph, allowing convergence. Iterative rounds of expansion and inflation are carried out on the Markov matrix until no change can be detected. The Markov matrix at this point is then interpreted as a clustering.
This clustering is used to infer protein family relationships from the initial input set.
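The expansion and inflation rounds described above can be sketched in plain Python. This is a toy illustration on a small dense matrix, not the implementation used by the mcl program; the function names, the added self-loops, the convergence tolerance, and the way clusters are read from the final matrix are assumptions made for the example.

```python
def normalize_columns(m):
    """Rescale every column of a square matrix so that it sums to 1."""
    n = len(m)
    for j in range(n):
        total = sum(m[i][j] for i in range(n))
        for i in range(n):
            m[i][j] /= total
    return m


def expand(m):
    """Expansion: square the matrix (random walks of twice the length)."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, r):
    """Inflation: raise each entry to the power r, then rescale columns."""
    return normalize_columns([[x ** r for x in row] for row in m])


def mcl(adjacency, inflation=2.0, max_iter=100, tol=1e-9):
    n = len(adjacency)
    # Assumption: add self-loops before normalising, a common MCL practice.
    m = [[float(adjacency[i][j]) + (1.0 if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    m = normalize_columns(m)
    for _ in range(max_iter):
        nxt = inflate(expand(m), inflation)
        delta = max(abs(nxt[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = nxt
        if delta < tol:  # no detectable change: stop iterating
            break
    # Interpret the result: each non-empty row lists the nodes of one cluster.
    clusters = {frozenset(j for j, x in enumerate(row) if x > 1e-6)
                for row in m}
    return sorted(sorted(c) for c in clusters if c)


# Two separate triangles of mutually similar proteins (made-up data).
graph = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
print(mcl(graph))
```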
For more details, please refer to the citation above; to learn more about the MCL algorithm itself, see http://micans.org/mcl/index.html
Input
A parsed set of sequence similarities from BLAST.
A simple Perl script is provided to parse raw NCBI BLAST 2.0 output and produce the corresponding input file.
For protein sequence clustering, one needs to take a set of proteins to be clustered, and use BLAST to detect all similarity relationships between proteins in the set.
The results from this analysis are fed into TRIBE-MCL and a clustering result is obtained.
The example below shows the input format for TRIBE-MCL.
Each line shows a similarity between two proteins, and a BLAST E-Value for that similarity.
The first line, for example, shows that protein HINF-KW2-000030 is similar to protein ECOL-RIM-000672 from our initial input set with a BLAST E-value of 1x10^-133.
Parsed Input Format
HINF-KW2-000030 ECOL-RIM-000672 1 133
HINF-KW2-000030 ECOL-EDL-000668 1 133
HINF-KW2-000030 ECOL-MG1-000624 1 133
HINF-KW2-000030 PAER-PAO-004002 1 111
HINF-KW2-000030 XFAS-9A5-001309 6 88
HINF-KW2-000030 CCRE-XXX-001538 2 81
HINF-KW2-000030 RPRO-MAD-000268 1 69
HINF-KW2-000030 AAEO-VF5-000017 1 61
HINF-KW2-000030 BHAL-C12-002566 1 53
HINF-KW2-000030 CJEJ-NCT-001206 1 51
Column 1: Protein 1
Column 2: Protein 2
Column 3: Number
Column 4: Exponent (e.g. 1 and 51 denote 1x10^-51)
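To make the four-column layout concrete, the following sketch reads parsed lines into small records and rebuilds the E-value from the number and exponent columns. The names Hit and parse_hits are hypothetical, and splitting on any whitespace is an assumption about the file's separator.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    protein1: str
    protein2: str
    number: int
    exponent: int

    @property
    def evalue(self) -> float:
        # number x 10^-exponent, e.g. 1 and 133 -> 1e-133
        return float(f"{self.number}e-{self.exponent}")


def parse_hits(lines):
    """Parse 'protein1 protein2 number exponent' lines into Hit records."""
    hits = []
    for line in lines:
        fields = line.split()  # assumption: whitespace-separated columns
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        p1, p2, number, exponent = fields
        hits.append(Hit(p1, p2, int(number), int(exponent)))
    return hits


sample = [
    "HINF-KW2-000030 ECOL-RIM-000672 1 133",
    "HINF-KW2-000030 XFAS-9A5-001309 6 88",
]
for hit in parse_hits(sample):
    print(hit.protein1, hit.protein2, hit.evalue)
```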
Output
The output file shows each protein from the initial input set together with the cluster it has been detected in.
Typically proteins in the same protein family are in the same cluster in this file. The output format can be described as follows:
Column 1: Cluster No.
Column 2: Protein ID
8926 HINF-KW2-000019
8926 HPYL-J99-000130
8926 NMEN-MC5-000452
8926 NMEN-Z24-001861
8926 P39414
8926 P75763
8926 P77405
8926 Q07252
8926 Q41364
8926 Q57048
8926 SAUR-MU5-000690
8926 SAUR-MU5-002694
8926 SAUR-MU5-002695
8926 SAUR-N13-000644
8926 SAUR-N13-002484
8927 BSUB-168-001848
8927 P80241
8928 BSUB-168-001849
8929 BSUB-168-001850
8930 BSUB-168-001856
8931 BHAL-C12-002749
8931 BSUB-168-001857
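For downstream analysis it is often convenient to group this two-column listing by cluster number. The sketch below does so; the function name read_clusters is hypothetical, and it assumes whitespace-separated columns with a numeric cluster number in the first one.

```python
from collections import defaultdict


def read_clusters(lines):
    """Group 'cluster_no protein_id' lines into {cluster_no: [protein_id, ...]}."""
    clusters = defaultdict(list)
    for line in lines:
        fields = line.split()
        # Skip anything that is not a 'number identifier' pair.
        if len(fields) != 2 or not fields[0].isdigit():
            continue
        cluster_no, protein_id = fields
        clusters[int(cluster_no)].append(protein_id)
    return dict(clusters)


sample = [
    "8927 BSUB-168-001848",
    "8927 P80241",
    "8928 BSUB-168-001849",
]
for cluster_no, members in read_clusters(sample).items():
    print(cluster_no, len(members), members)
```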
Parameter Settings
The main parameter setting is part of the core MCL algorithm, and influences the granularity (or size) of the output clusters.
For very small or 'tight' protein families an inflation value setting of 4.0 or 5.0 is fine.
For larger (broader) protein families, settings of 1.1, 2.0, and 3.0 can be used.
This parameter is described further below and in more detail at: http://micans.org/mcl/index.html
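The effect of the inflation value on granularity can be seen on a single column of transition probabilities. The sketch below uses made-up probabilities and a hypothetical helper, inflate_column, that raises each entry to the inflation power and rescales; higher values concentrate the probability on the strongest connection, which is why they tend to yield smaller, tighter clusters.

```python
def inflate_column(column, inflation):
    """Raise each probability to the power `inflation`, then rescale to sum to 1."""
    powered = [p ** inflation for p in column]
    total = sum(powered)
    return [p / total for p in powered]


column = [0.5, 0.3, 0.2]  # made-up transition probabilities
for inflation in (1.1, 2.0, 4.0):
    sharpened = inflate_column(column, inflation)
    print(inflation, [round(p, 3) for p in sharpened])
```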
Usage Example
After installation, the executables are located in /usr/local/mcl by default. To use Tribe-MCL, follow the simple protocol below.
1) First, prepare the dataset to be clustered by running a sequence similarity tool such as BLAST all-against-all.
blastall -p blastp -d database -i query.file -o blast-output.file -e 0.01
>> specify an appropriate E-value threshold
2) Convert the result of step 1) into the standard TRIBE-MCL format.
If you used BLAST, use the following command:
tribe-parse blast-output.file > blast.mclparsed
3) Generate the Markov matrix from the result file of step 2) with the following command:
tribe-matrix blast.mclparsed
This produces the Markov matrix file matrix.mci and an index file named proteins.index.
tribe-matrix options
-help -> Show some help
-ind somefile -> output the index to 'somefile' instead of 'proteins.index'
-out somefile -> output the markov matrix to 'somefile' not 'matrix.mci'
-chunk X -> Set the memory allocation chunksize (default 20MB)
This should be increased for very large jobs.
4) Run the core MCL algorithm. See the Parameter Settings section above (important).
mcl matrix.mci -options....
mcl options
-I X -> Set the inflation value to X
(must be a real number greater than 1.0)
-progress 100 -> Show a progress bar with a dot for every 1%
complete for every iteration
-o file.out -> Place results in the file called 'file.out'
-help -> Show all options and some help
5) Use the result file from step 4) and the index file from step 3) to generate the final result file, the Clusters File.
tribe-families file.out proteins.index > clusters
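The five steps can also be chained from a small script. The sketch below only uses the commands and options shown in the protocol above. It assumes the executables are on the PATH, and the helper names build_steps and run_steps, the intermediate file names, and the default inflation value are choices made for this example.

```python
import subprocess


def build_steps(query, database, inflation=2.0, evalue=0.01):
    """Return (argv, stdout_file) pairs; stdout_file is None when not redirected."""
    return [
        (["blastall", "-p", "blastp", "-d", database, "-i", query,
          "-o", "blast-output.file", "-e", str(evalue)], None),
        (["tribe-parse", "blast-output.file"], "blast.mclparsed"),
        # Writes matrix.mci and proteins.index by default.
        (["tribe-matrix", "blast.mclparsed"], None),
        (["mcl", "matrix.mci", "-I", str(inflation), "-o", "file.out"], None),
        (["tribe-families", "file.out", "proteins.index"], "clusters"),
    ]


def run_steps(steps):
    """Run each command in order, stopping at the first failure."""
    for argv, stdout_file in steps:
        if stdout_file is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_file, "w") as out:
                subprocess.run(argv, check=True, stdout=out)


if __name__ == "__main__":
    # Print the planned commands; call run_steps(...) to execute them.
    for argv, stdout_file in build_steps("query.file", "database"):
        suffix = f" > {stdout_file}" if stdout_file else ""
        print(" ".join(argv) + suffix)
```

Keeping the command construction separate from execution makes it easy to inspect the plan or change the inflation value before anything is run.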



