############
Introduction
############

‘Omics’ technologies yield large datasets, which are commonly subjected to cluster analysis in order to group them into comprehensible communities, i.e. co-regulated groups, which might be functionally related (Si et al., 2011). A critical step in cluster analysis is cluster validation (Handl et al., 2005), the most stringent form of validation being the assessment of exact reproducibility of a cluster in the light of the uncertainty of the data.
This issue is addressed by pyGCluster, an algorithm that works in two steps. First, it creates many agglomerative hierarchical clusterings (AHCs) of the input data by injecting noise based on the uncertainty of the data and clustering each noise-injected data set with different distance linkage combinations (DLCs). Second, pyGCluster creates a meta-clustering, i.e. a clustering of the resulting highly reproducible clusters into communities, to obtain the most complete representation of common patterns in the data. Communities are defined as sets of clusters with a specific pairwise overlap.

******************
Algorithm Workflow
******************

The workflow of pyGCluster can be divided into:

    - iterative steps
        - re-sampling of the data based on mean and standard deviation
        - clustering of the data using different distance linkage combinations (DLCs)
    - meta-clustering of highly reproducible clusters into communities, i.e. sets of clusters with a specific overlap
    - visualization of the results via node maps, expression maps and expression profiles

========================
Re-sampling & clustering
========================

For each iteration, a new data set is generated by invoking the re-sampling routine. By default, pyGCluster uses a noise injection function that generates a new data set by drawing from normal distributions defined by each data point, i.e. object o in condition l is defined by μ\ :sub:`ol`\ ± σ\ :sub:`ol`\.
Clustering is then performed using SciPy or fastcluster routines.
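
The re-sampling and clustering of a single iteration can be sketched as follows. This is a minimal illustration under stated assumptions, not the pyGCluster implementation: the helper names ``resample`` and ``cluster_once``, the toy data and the choice to cut the tree into a fixed number of clusters are all hypothetical, and only SciPy routines are used::

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import pdist


    def resample(means, stds, rng):
        """Draw one noise-injected data set: each value ~ N(mean_ol, std_ol)."""
        return rng.normal(loc=means, scale=stds)


    def cluster_once(data, metric="euclidean", method="complete", n_clusters=2):
        """Cluster the rows of ``data`` with one distance linkage combination."""
        distances = pdist(data, metric=metric)
        tree = linkage(distances, method=method)
        # Assumption for this sketch: cut the tree into a fixed number of clusters.
        labels = fcluster(tree, t=n_clusters, criterion="maxclust")
        # Represent each cluster as a frozenset of row (object) indices.
        return {
            frozenset(np.flatnonzero(labels == k).tolist())
            for k in np.unique(labels)
        }


    # Hypothetical toy data: 4 objects (rows) in 3 conditions (columns).
    means = np.array([[1.0, 2.0, 3.0],
                      [1.1, 2.1, 2.9],
                      [-1.0, -2.0, -3.0],
                      [-1.2, -1.9, -3.1]])
    stds = np.full(means.shape, 0.05)

    rng = np.random.default_rng(0)
    noisy = resample(means, stds, rng)
    clusters = cluster_once(noisy, metric="euclidean", method="complete")

Repeating these two calls for every iteration and every DLC, e.g. ``metric="correlation"`` with ``method="average"``, yields the collection of clusters whose frequencies are evaluated afterwards. The sketch relies on ``numpy.random.default_rng`` and therefore needs a recent NumPy on Python 3.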

======================
Community construction
======================

Communities are created after the iterations by a meta-clustering of the most frequent clusters, i.e. the top X% or the top Y clusters. Community construction is performed iteratively through an AHC approach with a specifically developed distance metric (see publication) and complete linkage. Complete linkage was chosen because it ensures that all clusters or meta-clusters have overlapping objects. The customized distance metric ensures that a) smaller clusters are merged earlier in the hierarchy (closer to the bottom) and b) clusters whose overlap with each other is smaller than the threshold merge only after the root, i.e. never into the same branch. After each iteration, very closely related clusters (in terms of their object content) are merged in the hierarchy, forming one branch or community starting from the root. The final node map shows these iterations and where meta-clusters are merged into the community. The node closest to the root in the final node map is the last iteration, in which no change to the community composition was detected. This approach reduces the number of final clusters or communities that have to be considered and analyzed.
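
The selection of the most frequent clusters that precedes community construction can be sketched as follows. This is only an illustration: the function names, the toy input and the ``overlap`` measure (shared objects relative to the smaller cluster) are assumptions made for this example, and the customized distance metric of the publication is not reproduced here::

    from collections import Counter


    def most_frequent_clusters(iterations, top_n):
        """Count how often each cluster occurs and keep the top_n most frequent."""
        counts = Counter()
        for clusters in iterations:
            counts.update(clusters)
        return counts.most_common(top_n)


    def overlap(a, b):
        """Hypothetical overlap: shared objects relative to the smaller cluster."""
        return len(a & b) / float(min(len(a), len(b)))


    # Hypothetical result of three iterations; a cluster is a frozenset of object ids.
    iterations = [
        {frozenset({0, 1}), frozenset({2, 3})},
        {frozenset({0, 1}), frozenset({2, 3, 4})},
        {frozenset({0, 1}), frozenset({2, 3})},
    ]
    top = most_frequent_clusters(iterations, top_n=2)

Storing clusters as frozensets makes them hashable, so identical clusters found in different iterations or with different DLCs are counted as the same cluster, which corresponds to the exact reproducibility described above.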

===================================
Node map and expression map example
===================================

The figures show an example of a node map and an expression map generated by pyGCluster. The node map illustrates the data set of Höhner et al. (2013). The node shapes indicate whether a cluster was found using Euclidean distance (squares), correlation distance (circles) or both (triangles). The node color indicates the community membership. The strength of pyGCluster is shown in the green community, in which Euclidean and correlation distance identified high-frequency clusters (see arrow). Both distance metrics were required to identify all clusters. The black triangle in the middle represents the root node. Since the community construction is performed iteratively, the different iteration steps are visible in the node map. For each community, the node closest to the root is the last iteration, in which no change in the composition of the communities was detected.

.. figure:: images/_Fig_1_revision.*
   :width: 500 px
   :align: center

.. figure:: images/expressionmaps/EXPMRH.*
   :width: 500 px
   :align: center

The example expression map is taken from Höhner et al. (2013).

.. note::
   Some text was copied from the original publication and is hereby marked as a citation.

*******************
General information
*******************

Copyright 2011-2013 by:

    | D. Jaeger,
    | J. Barth,
    | A. Niehues,
    | C. Fufezan

The latest documentation was generated on: |today|

===================
Contact information
===================

Please refer to:

    | Dr. Christian Fufezan
    | Institute of Plant Biology and Biotechnology
    | Schlossplatz 8, R 110.105
    | University of Muenster
    | Germany
    | eMail: christian@fufezan.net
    | Tel: +049 251 83 24861
    |
    | http://www.uni-muenster.de/Biologie.IBBP.AGFufezan


**************
Implementation
**************

pyGCluster requires Python 2.7 or higher, is freely available at http://pyGCluster.github.io and is published under the MIT license.

pyGCluster dependencies are:
    | numpy
    | scipy
    | fastcluster (optional)
    | rpy2 (optional)
    | graphviz (optional)

fastcluster (Müllner, 2013) offers a significant speed increase compared to the equivalent SciPy routines.
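
Whether the Python dependencies are importable can be checked with a short script such as the following. This is a sketch and not part of pyGCluster; the function name ``check_dependencies`` is hypothetical, and graphviz is left out on the assumption that it is installed as external software rather than as a Python module::

    import importlib


    def check_dependencies(required, optional):
        """Return (missing_required, missing_optional) lists of module names."""
        def missing(names):
            absent = []
            for name in names:
                try:
                    importlib.import_module(name)
                except ImportError:
                    absent.append(name)
            return absent
        return missing(required), missing(optional)


    missing_required, missing_optional = check_dependencies(
        required=["numpy", "scipy"],
        optional=["fastcluster", "rpy2"],
    )
    print("missing required modules:", missing_required)
    print("missing optional modules:", missing_optional)

An empty list of missing required modules indicates that the mandatory dependencies listed above are available in the current Python environment.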

********
Download
********

Get the latest version via GitHub
    | https://github.com/pygcluster/pyGCluster

or the latest package at
    | http://pyGCluster.github.com/dist/pyGCluster.tar.bz2
    | http://pyGCluster.github.com/dist/pyGCluster.zip

The complete documentation is available as a PDF
    | http://pyGCluster.github.com/dist/pyGCluster.pdf


********
Citation
********

Please cite us when using pyGCluster in your work.

Jaeger, D., Barth, J., Niehues, A. and Fufezan, C. (2013) pyGCluster, a novel hierarchical clustering approach


The original publication can be found here:
    | http://bioinformatics.oxfordjournals.org...
    | http://bioinformatics.oxfordjournals.org...


************
Installation
************

Please execute the following command in the pyGCluster folder::

    sudo python setup.py install


==================
Installation notes
==================

If Windows XP (SP3) is used, please make sure to install SciPy version 0.10.0.


====================
Functionality check
====================

After installation, please run the script test_pyGCluster.py from the
exampleScripts folder to check if pyGCluster was installed properly.


.. automodule:: test_pyGCluster


**********
References
**********

    | Bréhélin,L. et al. (2008) Using repeated measurements to validate hierarchical gene clusters. Bioinformatics, 24, 682-688.
    | Gansner,E.R. and North,S.C. (2000) An open graph visualization system and its applications to software engineering. Software Pract. Exper., 30, 1203-1233.
    | Handl,J. et al. (2005) Computational cluster validation in post-genomic data analysis. Bioinformatics, 21, 3201-3212.
    | Höhner,R. et al. (2013) The metabolic status drives acclimation of iron deficiency responses in Chlamydomonas reinhardtii as revealed by proteomics based hierarchical clustering and reverse genetics. Mol. Cell. Proteomics, in press.
    | Müllner,D. (2013) fastcluster: fast hierarchical agglomerative clustering routines for R and Python. J. Stat. Softw., 53, 1-18.
    | Saeed,A.I. et al. (2003) TM4: A free, open-source system for microarray data management and analysis. Biotechniques, 34, 374-378.
    | Shannon,P. et al. (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res., 13, 2498-2504.
    | Si,J. et al. (2011) Model-based clustering for RNA-seq data. Joint statistical meeting, July 30 - August 4, Florida.