pyGCluster is a clustering approach that focuses on noise injection for subsequent cluster validation. By requiring identical cluster composition across the re-sampled datasets, the reproducibility of a large number of clusters obtained with agglomerative hierarchical clustering (AHC) is assessed. Furthermore, a multitude of different distance-linkage combinations (DLCs) is evaluated. Finally, highly reproducible clusters are grouped into associations called communities. Graphical representation of the results as node maps and expression maps is implemented.
The pyGCluster class
In order to work with the default noise-injection function as well as plot expression maps correctly, the data-dict has to have the following structure.
Example:
>>> data = {
... Identifier1 : {
... condition1 : ( mean11, sd11 ),
... condition2 : ( mean12, sd12 ),
... condition3 : ( mean13, sd13 ),
... },
... Identifier2 : {
... condition2 : ( mean22, sd22 ),
... condition3 : ( mean23, sd23 ),
... },
... }
>>> import pyGCluster
>>> ClusterClass = pyGCluster.Cluster(data=data, verbosity_level=1, working_directory=...)
Note
If any condition for an identifier in the "nested_data_dict"-dict is missing, this entry is discarded, i.e. not imported into the Cluster class. This is because pyGCluster does not implement any missing value estimation. One possible solution is to replace missing values with a mean value and a standard deviation that are representative for the complete data range in the given condition.
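One way to implement such a replacement is sketched below; this is not part of pyGCluster and simply re-uses, per condition, the average mean and average standard deviation as a placeholder for missing entries (assuming the data-dict structure shown above with numeric values):
>>> import numpy
>>> conditions = [ 'condition1', 'condition2', 'condition3' ]
>>> fallback = {}
>>> for condition in conditions:
...     observed = [ data[ identifier ][ condition ] for identifier in data
...                  if condition in data[ identifier ] ]
...     # representative placeholder: average mean and average sd of this condition
...     fallback[ condition ] = ( numpy.mean( [ mean for mean, sd in observed ] ),
...                               numpy.mean( [ sd for mean, sd in observed ] ) )
>>> for identifier in data:
...     for condition in conditions:
...         data[ identifier ].setdefault( condition, fallback[ condition ] )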
pyGCluster inherits from the regular Python Dictionary object. Hence, the attributes of pyGCluster can be accessed as Python Dictionary keys.
A selection of the most important attributes / keys are:
>>> # general
>>> ClusterClass[ 'Working directory' ]
... # this is the directory where all pyGCluster results
... # (pickle objects, expression maps, node map, ...) are saved into.
/Users/Shared/moClusterDirectory
>>> # original data can be accessed via
>>> ClusterClass[ 'Data' ]
... # this collections.OrderedDict contains the data that has been
... # or will be clustered (see also below).
... plenty of data ;)
>>> ClusterClass[ 'Conditions' ]
... # sorted list of all conditions that are defined in the "Data"-dictionary
[ 'condition1', 'condition2', 'condition3' ]
>>> ClusterClass[ 'Identifiers' ]
... # sorted tuple of all identifiers, i.e. ClusterClass[ 'Data' ].keys()
( 'Identifier1', 'Identifier2' , ... 'IdentifierN' )
>>> # re-sampling parameters
>>> ClusterClass[ 'Iterations' ]
... # the number of datasets that were clustered.
1000000
>>> ClusterClass[ 'Cluster 2 clusterID' ]
... # dictionary with clusters as keys, and their respective row index
... # in the "Cluster count"-matrix (= clusterID) as values.
{ ... }
>>> ClusterClass[ 'Cluster counts' ]
... # numpy.uint32 matrix holding the counts for each
... # distance-linkage combination of the clusters.
>>> ClusterClass[ 'Distance-linkage combinations' ]
... # sorted list containing the distance-linkage combinations
... # that were evaluated in the re-sampling routine.
>>> # Communities
>>> ClusterClass[ 'Communities' ]
... # see function pyGCluster.Cluster.build_nodemap for further information.
>>> # Visualization
>>> ClusterClass[ 'Additional labels' ]
... # dictionary with an identifier of the "Data"-dict as key,
... # and a list of additional information (e.g. annotation, GO terms) as value.
{
'Identifier1' :
['Photosynthesis related' , 'zeroFactor: 12.31' ],
'Identifier2' : [ ... ] ,
...
}
>>> ClusterClass[ 'for IO skip clusters bigger than' ]
... # Default = 100. Since some clusters are really large
... # (with sizes close to the root (the cluster holding all objects)),
... # clusters with more objects than this value
... # are not plotted as expression maps or expression profile plots.
pyGCluster offers the possibility to save the analysis (e.g. after re-sampling) via pyGCluster.Cluster.save() and to continue it later via pyGCluster.Cluster.load().
Initializes the pyGCluster.Cluster class.
Classically, users start the multiprocessing clustering routine with multiple distance-linkage combinations via the pyGCluster.Cluster.do_it_all() function. This function allows updating the pyGCluster class with all user parameters before it calls pyGCluster.Cluster.resample(). The main advantage of calling pyGCluster.Cluster.do_it_all() is that all general plotting functions, i.e. pyGCluster.Cluster.build_nodemap(), pyGCluster.Cluster.write_dot(), pyGCluster.Cluster.draw_community_expression_maps() and pyGCluster.Cluster.draw_expression_profiles(), are called afterwards as well.
Alternatively, one can manually update the parameters (setting the key-value pairs in pyGCluster) and then invoke pyGCluster.Cluster.resample() with the appropriate parameters, as shown in the sketch below. This is useful if certain memory-intensive distance-linkage combinations are to be clustered on a specific computer.
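A hedged usage sketch of both entry points; the keyword names used here (distances, linkages, iter_max, cpus_2_use) follow the re-sampling parameters described below, but should be checked against the signature of the installed pyGCluster version:
>>> # full pipeline: re-sampling plus all general plotting functions
>>> ClusterClass.do_it_all(
...     distances  = [ 'euclidean', 'correlation' ],
...     linkages   = [ 'complete', 'average', 'ward' ],
...     iter_max   = 250000,
...     cpus_2_use = 4,
... )
>>> # or: re-sampling only, e.g. restricted to a single memory-intensive DLC
>>> ClusterClass.resample(
...     distances = [ 'euclidean' ],
...     linkages  = [ 'ward' ],
...     iter_max  = 250000,
... )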
Note
The Cluster class can be initialized empty and filled using pyGCluster.Cluster.load()
Construction of communities from a set of most frequent clusters. This set is obtained via pyGCluster.Cluster._get_most_frequent_clusters(), to which the first three parameters are passed. These clusters are then subjected to AHC with complete linkage. The distance matrix is calculated via pyGCluster.Cluster.calculate_distance_matrix(). The combination of complete linkage and this distance matrix ensures that all clusters in a community exhibit at least the "starting_min_overlap" with each other. From the resulting cluster tree, a "first draft" of communities is obtained. These "first" communities are then themselves treated as clusters and subjected to AHC again, until the community assignment of clusters remains constant. In this way, clusters can be inserted into a target community even if they did not initially overlap with each cluster inside that community, as long as they overlap with the clusters of the target community combined into a single cluster. Thereby the degree of stringency is reduced; the clusters fit into a community in a broader sense. For further information on the community construction, see the pyGCluster publication.
>>> name = ( cluster, level )
... # internal name of the community.
... # The first element in the tuple ("cluster") contains the indices
... # of the objects that comprise a community.
... # The second element gives the level,
... # or iteration when the community was formed.
>>> self[ 'Communities' ][ name ][ 'children' ]
... # list containing the clusters that build the community.
>>> self[ 'Communities' ][ name ][ '# of nodes merged into community' ]
... # the number of clusters that build the community.
>>> self[ 'Communities' ][ name ][ 'index 2 obCoFreq dict' ]
... # an OrderedDict in which each index is assigned its obCoFreq.
... # Negative indices correspond to "placeholders",
... # which are required for the insertion of black lines into expression maps.
... # Black lines in expression maps separate the individual clusters
... # that form a community, sorted by when
... # they were inserted into the community.
>>> self[ 'Communities' ][ name ][ 'highest obCoFreq' ]
... # the highest obCoFreq encountered in a community.
>>> self[ 'Communities' ][ name ][ 'cluster ID' ]
... # the ID of the cluster containing the object with the highest obCoFreq.
Of the following parameters, the first three are passed to pyGCluster.Cluster._get_most_frequent_clusters():
Return type: none
The overlap between a pair of clusters is relative, i.e. defined as the size of the overlap divided by the size of the larger of the two clusters.
The resulting condensed distance matrix is not returned, but rather stored in self[ 'Nodemap - condensed distance matrix' ].
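As a minimal illustration of this definition (not pyGCluster's internal code), the relative overlap of two clusters given as sets of object indices can be computed like this:
>>> def relative_overlap( cluster_a, cluster_b ):
...     # size of the overlap divided by the size of the larger cluster
...     return len( cluster_a & cluster_b ) / max( len( cluster_a ), len( cluster_b ) )
>>> relative_overlap( { 0, 1, 2, 3 }, { 2, 3, 4 } )
0.5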
Return type: none
Checks whether the re-sampling routine may be terminated because the number of most frequent clusters remains almost constant. This is done by examining a plot of the number of most frequent clusters vs. the number of iterations. Convergence is declared once the median normalized slope in a given window of iterations is equal to or below "iter_tol". For further information see the Supplementary Material of the corresponding publication.
Return type: boolean
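A rough sketch of this criterion; the window handling and the normalization of the slope are simplified assumptions, the exact procedure is described in the Supplementary Material:
>>> import statistics
>>> def looks_converged( n_most_frequent_clusters, window = 10, iter_tol = 0.01 ):
...     # n_most_frequent_clusters: counts recorded after consecutive blocks of iterations
...     recent = n_most_frequent_clusters[ -window : ]
...     slopes = [ ( b - a ) / max( b, 1 ) for a, b in zip( recent, recent[ 1: ] ) ]
...     return statistics.median( slopes ) <= iter_tol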
Simple check whether any value of the data_tuples (i.e. any mean) is below zero. A value below zero indicates that the input data was log2-transformed.
Return type: boolean
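A minimal sketch of such a check, assuming the data-dict structure shown above (illustrative only, not pyGCluster's internal code):
>>> def any_mean_below_zero( data ):
...     # True if any mean is negative, hinting at log2-transformed input
...     return any( mean < 0
...                 for conditions in data.values()
...                 for mean, sd in conditions.values() )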
Creates a two-sided PDF file containing the full picture of the convergence plot, as well as a zoomed view of it. The convergence plot illustrates the development of the number of most frequent clusters vs. the number of iterations. The dotted line in these plots represents the normalized slope, which is used for the internal convergence determination.
If rpy2 cannot be imported, a CSV file is created instead.
Parameters: filename (string) – the filename of the PDF (or CSV) file.
Return type: none
Returns a list of rainbow colors. Colors are expressed as hexcodes of RGB values.
Parameters: n_colors (int) – number of rainbow colors.
Return type: list
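A small sketch of how such a list can be produced with the standard library (colorsys), assuming evenly spaced hues; pyGCluster may generate its palette differently:
>>> import colorsys
>>> def rainbow_colors( n_colors ):
...     colors = []
...     for i in range( n_colors ):
...         r, g, b = colorsys.hsv_to_rgb( i / float( n_colors ), 1.0, 1.0 )
...         colors.append( '#{0:02X}{1:02X}{2:02X}'.format( int( r * 255 ), int( g * 255 ), int( b * 255 ) ) )
...     return colors
>>> rainbow_colors( 3 )
['#FF0000', '#00FF00', '#0000FF']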
Resets all variables holding any result of the re-sampling process. This includes the convergence determination as well as the community structure. Does not delete the data that is intended to be clustered.
Return type: None
Evokes all necessary functions which constitute the main functionality of pyGCluster. This is AHC clustering with noise injection and a variety of DLCs, in order to identify highly reproducible clusters, followed by a meta-clustering of highly reproducible clusters into so-called ‘communities’.
The functions that are called are pyGCluster.Cluster.resample(), pyGCluster.Cluster.build_nodemap(), pyGCluster.Cluster.write_dot(), pyGCluster.Cluster.draw_community_expression_maps() and pyGCluster.Cluster.draw_expression_profiles().
For a complete list of possible distance-matrix metrics see http://docs.scipy.org/doc/scipy/reference/spatial.distance.html; for the available linkage methods see http://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html
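As a minimal illustration of how distance-linkage combinations are evaluated with SciPy (this is a sketch, not pyGCluster's internal code; data_matrix is a placeholder for a noise-injected dataset with objects as rows and conditions as columns):
>>> import itertools
>>> import scipy.spatial.distance
>>> import scipy.cluster.hierarchy
>>> distances = [ 'euclidean', 'correlation' ]
>>> linkages  = [ 'complete', 'average' ]
>>> trees = {}
>>> for metric, method in itertools.product( distances, linkages ):
...     # condensed distance matrix for this metric ...
...     condensed = scipy.spatial.distance.pdist( data_matrix, metric = metric )
...     # ... clustered with this linkage method
...     trees[ ( metric, method ) ] = scipy.cluster.hierarchy.linkage( condensed, method = method )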
Note
If memory is of concern (e.g. for a large dataset, > 5000 objects), cpus_2_use should be kept low.
Note
If alphabet contains ',', this character is removed from alphabet, because the indices comprising a cluster are saved comma-separated.
Return type: None
For more information on each parameter, please refer to pyGCluster.Cluster.resample() and the subsequent functions: pyGCluster.Cluster.build_nodemap(), pyGCluster.Cluster.write_dot(), pyGCluster.Cluster.draw_community_expression_maps(), pyGCluster.Cluster.draw_expression_profiles().
Plots the expression map for each community showing its object composition.
The parameters are passed on to pyGCluster.Cluster.draw_expression_map().
Return type: none
Draws an expression map as SVG.
Return type: none
>>> data = {
... fastaID1 : {
... cond1 : ( mean, sd ) , cond2 : ( mean, sd ), ...
... },
... fastaID2 : {
... cond1 : ( mean, sd ) , cond2 : ( mean, sd ), ...
... }
... }
Plots an expression map for a given cluster. Either the parameter "clusterID" or "cluster" can be defined. This function is useful to plot a user-defined cluster, e.g. a knowledge-based cluster (TCA cluster, glycolysis cluster, ...). In this case, the parameter "cluster" should be defined.
The remaining parameters are passed on to pyGCluster.Cluster.draw_expression_map().
Return type: none
Plots the expression map for a given “community cluster”: Any cluster in the community node map is internally represented as a tuple with two elements: “cluster” and “level”. Those objects are stored as keys in self[ ‘Communities’ ], from where they may be extracted and fed into this function.
The remaining parameters are passed on to pyGCluster.Cluster.draw_expression_map().
Return type: none
Draws an expression profile plot (SVG) for each community, illustrating the main "expression pattern" of a community. Each line in this plot represents an object. The "grey cloud" illustrates the range of the standard deviation of the mean values. The plot filenames are prefixed with "exProf", followed by the community name as it is shown in the node map.
Return type: none
Returns a tuple with (i) the cFreq and (ii) a collections.defaultdict containing the DLC : frequency pairs for either an identifier (e.g. "JGI4|Chlre4|123456"), a clusterID or a cluster. Returns 'None' if the identifier is not part of the data set, or if the clusterID or cluster was not found during the iterations.
Example:
>>> cFreq, dlc_freq_dict = cluster.frequencies( identifier = 'JGI4|Chlre4|123456' )
>>> dlc_freq_dict
... defaultdict(<type 'float'>,
... {'average-correlation': 0.0, 'complete-correlation': 0.0,
... 'centroid-euclidean': 0.0015, 'median-euclidean': 0.0064666666666666666,
... 'ward-euclidean': 0.0041333333333333335, 'weighted-correlation': 0.0,
... 'complete-euclidean': 0.0014, 'weighted-euclidean': 0.0066333333333333331,
... 'average-euclidean': 0.0020333333333333332})
Return type: tuple
Prints some information about the clustering via pyGCluster:
- number of genes/proteins clustered
- number of conditions defined
- number of distance-linkage combinations
- number of iterations performed
as well as some information about the communities, the legend for the shapes of nodes in the node map and the way the functions were called.
Return type: none
Fills a pyGCluster.Cluster object with the session saved as “filename”. If “filename” is not a complete path, e.g. “example.pkl” (instead of “/home/user/Desktop/example.pkl”), the directory given by self[ ‘Working directory’ ] is used.
Note
>>> LoadedClustering = pyGCluster.Cluster()
>>> LoadedClustering.load( "/home/user/Desktop/example.pkl" )
Parameters: filename (string) – may be either a simple file name ("example.pkl") or a complete path (e.g. "/home/user/Desktop/example.pkl").
Return type: none
Returns the median from a list of numeric values.
Parameters: _list (list) – list of numeric values.
Return type: int / float
Plots the frequencies of each cluster as an expression map: which cluster was found by which distance-linkage combination, and with what frequency? The plot's filename is prefixed with 'clusterFreqsMap', followed by the values of the parameters, e.g. 'clusterFreqsMap_minSize4_top0clusters_top10promille.svg'. Clusters are sorted by size.
Note
If top_X_clusters is set to zero (0), this filter is switched off (it is switched off by default).
Return type: None
Creates a density plot of mean values for each condition via rpy2.
Return type: none
- node label = nodeID internally used for self[ 'Nodemap' ] (not the same as clusterID!)
- node border color is white if the node is a close2root cluster (i.e. larger than self[ 'for IO skip clusters bigger than' ])
- edge label = distance between parent and children
Parameters: tree_filename (string) – name of the Graphviz DOT file containing the dendrogram of the AHC of most frequent clusters. Best given with ".dot"-extension!
Return type: none
Routine for the assessment of cluster reproducibility (re-sampling routine). To this end, a high number of noise-injected datasets is created, which are subsequently clustered by AHC. These datasets are created via pyGCluster.function_2_generate_noise_injected_datasets() (default = usage of Gaussian distributions). Each 'simulated' dataset is then subjected to AHC x times, where x equals the number of distance-linkage combinations resulting from all possible combinations of "distances" and "linkages". In order to speed up the re-sampling routine, it is distributed to multiple processes if cpus_2_use > 1.
The re-sampling routine stops once either convergence (see below) is detected or iter_max iterations have been performed. Eventually, only clusters with a maximum frequency of at least min_cluster_freq_2_retain are stored; all others are discarded.
In order to visually inspect convergence, a convergence plot is created. For more information about the convergence estimation, see Supplementary Material of pyGCluster’s publication.
For a complete list of possible distance-matrix metrics see http://docs.scipy.org/doc/scipy/reference/spatial.distance.html; for the available linkage methods see http://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html
Note
If memory is of concern (e.g. for a large dataset, > 5000 objects), cpus_2_use should be kept low.
Note
If alphabet contains ',', this character is removed from alphabet, because the indices comprising a cluster are saved comma-separated.
Return type: None
Saves the current pyGCluster.Cluster object in a Pickle object.
Parameters: filename (string) – may be either a simple file name ("example.pkl") or a complete path (e.g. "/home/user/Desktop/example.pkl"). In the former case, the pickle is stored in pyGCluster's working directory.
Return type: none
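Example usage, assuming the keyword matches the parameter listed above:
>>> ClusterClass.save( filename = 'example.pkl' )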
Writes a Graphviz DOT file representing the cluster composition of communities. Herein, each node represents a cluster. Its name is a combination of the cluster's ID, followed by the level / iteration at which it was inserted into the community:
- The node’s size reflects the cluster’s cFreq.
- The node’s shape illustrates by which distance metric the cluster was found (if the shape is a point, this illustrates that this cluster was not among the most_frequent_clusters, but only formed during AHC of clusters).
- The node’s color shows the community membership; except for clusters which are larger than self[ ‘for IO skip clusters bigger than’ ], those are highlighted in grey.
- The node connecting all clusters is the root (the cluster holding all objects), which is highlighted in white.
The DOT file may be rendered with Graphviz or further processed with other appropriate programs such as Gephi. If Graphviz is available, the DOT file is eventually rendered with Graphviz's dot algorithm.
In addition, an expression map for each cluster of the node map is created (via pyGCluster.Cluster.draw_expression_map_for_community_cluster()).
Those are saved in the sub-folder “communityClusters”.
This function also calls pyGCluster.Cluster.write_legend(), which creates a TXT file containing the object composition of all clusters, as well as their frequencies.
Return type: none
Creates a legend for the community node map as a TXT file. Herein, the object composition of each cluster of the node map as well as its frequencies are recorded. Since this function is internally called by pyGCluster.Cluster.write_dot(), it is typically not necessary to call this function.
Parameters: filename (string) – name of the legend TXT file, best given with extension ".txt".
Return type: none
Returns the default alphabet which is used to save clusters in a less memory-intensive form: instead of saving e.g. a cluster containing identifiers with indices 1, 20 and 30 as "1,20,30", the indices are converted to a base-X system -> "1,k,u".
>>> import string
>>> string.printable.replace( ',', '' )
Return type: string
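A minimal sketch of the idea behind this encoding (the helper name encode_index is illustrative; pyGCluster's internal implementation may differ):
>>> import string
>>> alphabet = string.printable.replace( ',', '' )
>>> def encode_index( index ):
...     # convert a decimal row index into its base-len(alphabet) representation
...     base = len( alphabet )
...     digits = ''
...     while True:
...         index, remainder = divmod( index, base )
...         digits = alphabet[ remainder ] + digits
...         if index == 0:
...             break
...     return digits
>>> ','.join( encode_index( i ) for i in ( 1, 20, 30 ) )
'1,k,u'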
This is the function that is called by each process that is spawned internally by pyGCluster during the re-sampling routine. Agglomerative hierarchical clustering is performed for each distance-linkage combination (DLC) on each of the "iterations" datasets. Clusters from each hierarchical tree are extracted, and their counts are saved in a temporary cluster-count matrix. After "iterations" iterations, clusters are filtered according to min_cluster_freq_2_retain. These clusters, together with their respective counts among all DLCs, are returned. The return value is a list containing tuples with two elements: cluster (string) and counts (one-dimensional numpy.array).
Return type: list
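A rough numpy sketch of the filtering step described above (the matrix, the number of iterations and the threshold are illustrative, not taken from pyGCluster's internals):
>>> import numpy
>>> # rows = clusters, columns = distance-linkage combinations
>>> cluster_counts = numpy.array( [ [ 120, 95,  0 ],
...                                 [   3,  1,  2 ],
...                                 [  88, 90, 70 ] ], dtype = numpy.uint32 )
>>> iterations = 1000
>>> min_cluster_freq_2_retain = 0.01
>>> max_freq = cluster_counts.max( axis = 1 ) / float( iterations )
>>> retained_rows = numpy.where( max_freq >= min_cluster_freq_2_retain )[ 0 ]
>>> retained_rows
array([0, 2])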
All processes given in "processes" are terminated.
Parameters: processes (list) – list containing multiprocessing.Process objects.
Return type: none
Generator yielding a re-sampled dataset with each iteration. A re-sampled dataset is created by re-drawing each data point from the normal distribution given by its associated mean and standard deviation. See the example in the Supplementary Material of pyGCluster's publication for how to define a custom noise function (e.g. uniform noise).
Return type: none
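A minimal sketch of such a generator under the default assumption of Gaussian noise (the function name is illustrative; see the Supplementary Material for how pyGCluster's default function is defined):
>>> import numpy
>>> def gaussian_noise_injected_datasets( data, iterations ):
...     # data: { identifier : { condition : ( mean, sd ) } } as shown above
...     for _ in range( iterations ):
...         yield {
...             identifier : {
...                 condition : numpy.random.normal( mean, sd )
...                 for condition, ( mean, sd ) in conditions.items()
...             }
...             for identifier, conditions in data.items()
...         }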