2. Module pyGCluster

pyGCluster is a clustering algorithm focusing on noise injection for subsequent cluster validation. By requiring identical cluster composition across the noise-injected datasets, the reproducibility of the large number of clusters obtained with agglomerative hierarchical clustering (AHC) is assessed. Furthermore, a multitude of different distance-linkage combinations (DLCs) is evaluated. Finally, associations of highly reproducible clusters, called communities, are created. Graphical representation of the results as node maps and expression maps is implemented.

The pyGCluster module contains the main class pyGCluster.Cluster and several module-level functions.
class pyGCluster.Cluster(data=None, working_directory=None, verbosity_level=1)[source]

The pyGCluster class

Parameters:
  • working_directory (string) – directory in which all results are written (requires write-permission!).
  • verbosity_level (int) – either 0, 1 or 2.
  • data (dict) – Dictionary containing the data which is to be clustered.

In order to work with the default noise-injection function and to plot expression maps correctly, the data-dict has to have the following structure.

Example:

>>> data = {
...            Identifier1 : {
...                            condition1 :  ( mean11, sd11 ),
...                            condition2 :  ( mean12, sd12 ),
...                            condition3 :  ( mean13, sd13 ),
...             },
...            Identifier2 : {
...                            condition2 :  ( mean22, sd22 ),
...                            condition3 :  ( mean23, sd23 ),
...                            condition1 :  ( mean21, sd21 ),
...             },
... }
>>> import pyGCluster
>>> ClusterClass = pyGCluster.Cluster(data=data, verbosity_level=1, working_directory=...)

Note

If any condition for an identifier in the data-dict is missing, this entry is discarded, i.e. not imported into the Cluster class. This is because pyGCluster does not implement any missing value estimation. One possible solution is to replace missing values with a mean value and a standard deviation that are representative of the complete data range in the given condition.
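A minimal sketch of such a replacement, assuming the data-dict structure shown above; the helper name and the chosen fallback statistics are illustrative, not part of pyGCluster:

import statistics

def impute_missing_conditions(data, conditions):
    # Hypothetical helper: fill missing conditions with the mean of all observed
    # means for that condition and a deliberately large standard deviation,
    # so the imputed value carries little weight during clustering.
    # Assumes every condition has been observed for at least one identifier.
    for condition in conditions:
        observed = [values[condition] for values in data.values() if condition in values]
        fallback = (
            statistics.mean(mean for mean, sd in observed),
            max(sd for mean, sd in observed),
        )
        for identifier, values in data.items():
            values.setdefault(condition, fallback)
    return data

data = impute_missing_conditions(data, ['condition1', 'condition2', 'condition3'])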

pyGCluster.Cluster inherits from the built-in Python dictionary (dict). Hence, the attributes of pyGCluster can be accessed as dictionary keys.

A selection of the most important attributes / keys:

>>> # general
>>> ClusterClass[ 'Working directory' ]
...     # this is the directory where all pyGCluster results
...     # (pickle objects, expression maps, node map, ...) are saved into.
/Users/Shared/moClusterDirectory
>>> # original data can be accessed via
>>> ClusterClass[ 'Data' ]
...     # this collections.OrderedDict contains the data that has been
...     # or will be clustered (see also below).
... plenty of data ;)
>>> ClusterClass[ 'Conditions' ]
...     # sorted list of all conditions that are defined in the "Data"-dictionary
[ 'condition1', 'condition2', 'condition3' ]
>>> ClusterClass[ 'Identifiers' ]
...     # sorted tuple of all identifiers, i.e. ClusterClass[ 'Data' ].keys()
( 'Identifier1', 'Identifier2' , ... 'IdentifierN' )
>>> # re-sampling parameters
>>> ClusterClass[ 'Iterations' ]
...     # the number of datasets that were clustered.
1000000
>>> ClusterClass[ 'Cluster 2 clusterID' ]
...     # dictionary with clusters as keys, and their respective row index
...     # in the "Cluster count"-matrix (= clusterID) as values.
{ ... }
>>> ClusterClass[ 'Cluster counts' ]
...     # numpy.uint32 matrix holding the counts for each
...     # distance-linkage combination of the clusters.
>>> ClusterClass[ 'Distance-linkage combinations' ]
...     # sorted list containing the distance-linkage combinations
...     # that were evaluated in the re-sampling routine.
>>> # Communities
>>> ClusterClass[ 'Communities' ]
...     # see function pyGCluster.Cluster.build_nodemap for further information.
>>> # Visualization
>>> ClusterClass[ 'Additional labels' ]
...     # dictionary with an identifier of the "Data"-dict as key,
...     # and a list of additional information (e.g. annotation, GO terms) as value.
{
    'Identifier1' :
                ['Photosynthesis related' , 'zeroFactor: 12.31' ],
    'Identifier2' : [ ... ] ,
     ...
}
>>> ClusterClass[ 'for IO skip clusters bigger than' ]
...     # Default = 100. Since some clusters are really large
...     # (with sizes close to the root (the cluster holding all objects)),
...     # clusters with more objects than this value
...     # are not plotted as expression maps or expression profile plots.

pyGCluster offers the possibility to save the analysis (e.g. after re-sampling) via pyGCluster.Cluster.save() and to continue it later via pyGCluster.Cluster.load().

Classically, users start the multiprocessing clustering routine with multiple distance-linkage combinations via the pyGCluster.Cluster.do_it_all() function. This function allows updating the pyGCluster class with all user parameters before it calls pyGCluster.Cluster.resample(). The main advantage of calling pyGCluster.Cluster.do_it_all() is that all general plotting functions are called afterwards as well, i.e. pyGCluster.Cluster.build_nodemap(), pyGCluster.Cluster.write_dot(), pyGCluster.Cluster.draw_community_expression_maps() and pyGCluster.Cluster.draw_expression_profiles().

If one chooses, one can manually update the parameters (setting the key, value pairs in pyGCluster) and then invoke pyGCluster.Cluster.resample() with the appropriate parameters. This is useful if certain memory-intensive distance-linkage combinations are to be clustered on a specific computer.

Note

The Cluster class can be initialized empty and filled using pyGCluster.Cluster.load().
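A typical workflow, sketched with the example data-dict defined above; the working directory, distance metrics and linkage methods are illustrative choices, not requirements:

import pyGCluster

ClusterClass = pyGCluster.Cluster(
    data=data,                                    # nested data-dict as described above
    working_directory='/tmp/pyGCluster_results',  # illustrative path
    verbosity_level=1,
)
# re-sample, cluster, build communities and create the default plots
ClusterClass.do_it_all(
    distances=['euclidean', 'correlation'],
    linkages=['complete', 'average', 'ward'],
    cpus_2_use=4,
)
# persist the session so it can later be restored via load()
ClusterClass.save(filename='pyGCluster.pkl')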

build_nodemap(min_cluster_size=4, top_X_clusters=0, threshold_4_the_lowest_max_freq=0.01, starting_min_overlap=0.1, increasing_min_overlap=0.05)[source]

Construction of communities from a set of most frequent clusters. This set is obtained via pyGCluster.Cluster._get_most_frequent_clusters(), to which the first three parameters are passed. These clusters are then subjected to AHC with complete linkage. The distance matrix is calculated via pyGCluster.Cluster.calculate_distance_matrix(). The combination of complete linkage and this distance matrix assures that all clusters in a community exhibit at least the “starting_min_overlap” to each other. From the resulting cluster tree, a “first draft” of communities is obtained. These “first” communities are then themselves considered as clusters and subjected to AHC again, until the community assignment of clusters remains constant. In this way, a cluster is inserted into a target community even if it initially did not overlap with each cluster inside that community, as long as it overlaps with the clusters of the target community combined into a single cluster. This relaxes the degree of stringency; the clusters fit into a community in a broader sense. For further information on the community construction, see the publication of pyGCluster. An example call is given after the parameter list below.

Internal structure of communities:
>>> name = ( cluster, level )
...         # internal name of the community.
...         # The first element in the tuple ("cluster") contains the indices
...         # of the objects that comprise a community.
...         # The second element gives the level,
...         # or iteration when the community was formed.
>>> self[ 'Communities' ][ name ][ 'children' ]
...         # list containing the clusters that build the community.
>>> self[ 'Communities' ][ name ][ '# of nodes merged into community' ]
...         # the number of clusters that build the community.
>>> self[ 'Communities' ][ name ][ 'index 2 obCoFreq dict' ]
...         # an OrderedDict in which each index is assigned its obCoFreq.
...         # Negative indices correspond to "placeholders",
...         # which are required for the insertion of black lines into expression maps.
...         # Black lines in expression maps separate the individual clusters
...         # that form a community, sorted by when
...         # they were inserted into the community.
>>> self[ 'Communities' ][ name ][ 'highest obCoFreq' ]
...         # the highest obCoFreq encountered in a community.
>>> self[ 'Communities' ][ name ][ 'cluster ID' ]
...         # the ID of the cluster containing the object with the highest obCoFreq.

Of the following parameters, the first three are passed to pyGCluster.Cluster._get_most_frequent_clusters():

Parameters:
  • min_cluster_size (int) – clusters smaller than this threshold are not considered for the community construction.
  • top_X_clusters (int) – form communities from the top X clusters sorted by their maximum frequency.
  • threshold_4_the_lowest_max_freq (float) – [0, 1[ form communities from clusters whose maximum frequency is at least this value.
  • starting_min_overlap (float) – ]0, 1[ minimum required relative overlap between clusters so that they are assigned the same community. The relative overlap is defined as the size of the overlap between two clusters, divided by the size of the larger cluster.
  • increasing_min_overlap (float) – defines the increase of the required overlap between communities
Return type:

none
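An example call with the default values shown in the signature above (assuming re-sampling has already been performed):

# form communities from all clusters with at least 4 objects whose
# maximum DLC frequency is at least 1 %
ClusterClass.build_nodemap(
    min_cluster_size=4,
    threshold_4_the_lowest_max_freq=0.01,
    starting_min_overlap=0.1,
    increasing_min_overlap=0.05,
)
# the resulting communities are stored in ClusterClass[ 'Communities' ]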

calculate_distance_matrix(clusters, min_overlap=0.25)[source]
Calculates the specifically developed distance matrix for the AHC of clusters:
  1. Clusters that do not share the minimum overlap are attributed a distance of “self[ ‘Root size’ ]” (i.e. len( self[ ‘Data’ ] ) ).
  2. Clusters are attributed a distance of “self[ ‘Root size’ ] - 1” to the root cluster.
  3. Clusters sharing the minimum overlap are attributed a distance of “size of the larger of the two clusters minus size of the overlap”.

The overlap between a pair of clusters is relative, i.e. defined as the size of the overlap divided by the size of the larger of the two clusters.

The resulting condensed distance matrix is not returned, but rather stored in self[ ‘Nodemap - condensed distance matrix’ ]. A minimal sketch of the three distance rules is given after the parameter list below.

Parameters:
  • clusters (list of clusters. Clusters are represented as tuples consisting of their object’s indices.) – the most frequent clusters whose “distance” is to be determined.
  • min_overlap (float) – ]0, 1[ threshold value to determine if the distance between two clusters is calculated according to (1) or (3).
Return type:

none
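A minimal sketch of the distance rules above, assuming clusters are given as tuples of object indices; the function name and the way the root cluster is detected are illustrative:

def cluster_distance(cluster_a, cluster_b, root_size, min_overlap=0.25):
    # rule (2): any pairing with the root cluster gets a fixed, near-maximal distance
    if len(cluster_a) == root_size or len(cluster_b) == root_size:
        return root_size - 1
    larger = max(len(cluster_a), len(cluster_b))
    overlap = len(set(cluster_a) & set(cluster_b))
    # rule (1): clusters not sharing the minimum relative overlap -> maximal distance
    if overlap / float(larger) < min_overlap:
        return root_size
    # rule (3): size of the larger cluster minus the size of the overlap
    return larger - overlap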

check4convergence()[source]

Checks whether the re-sampling routine may be terminated because the number of most frequent clusters remains almost constant. This is done by examining the development of the number of most frequent clusters vs. the number of iterations. Convergence is declared once the median normalized slope in a given window of iterations is equal to or below “iter_tol”. For further information see the Supplementary Material of the corresponding publication.

Return type:boolean
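A conceptual sketch of such a criterion; this is not pyGCluster's internal implementation, and the exact normalization of the slopes is an assumption:

def has_converged(counts, iter_step, iter_window, iter_tol=1e-07):
    # counts: number of most frequent clusters, recorded every iter_step iterations
    points = max(2, iter_window // iter_step)
    if len(counts) < points:
        return False  # not enough check points inside the sliding window yet
    window = counts[-points:]
    # slope between consecutive check points, normalized by the current count
    slopes = sorted(
        abs(window[i + 1] - window[i]) / float(iter_step * max(window[i + 1], 1))
        for i in range(len(window) - 1)
    )
    return slopes[len(slopes) // 2] <= iter_tol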
check_if_data_is_log2_transformed()[source]

Simple check whether any value of the data tuples (i.e. any mean) is below zero. A value below zero indicates that the input data was log2-transformed.

Return type:boolean
convergence_plot(filename='convergence_plot.pdf')[source]

Creates a two-page PDF file containing the full picture of the convergence plot, as well as a zoomed view of it. The convergence plot illustrates the development of the number of most frequent clusters vs. the number of iterations. The dotted line in these plots represents the normalized slope, which is used for the internal convergence determination.

If rpy2 cannot be imported, a CSV file is created instead.

Parameters:filename (string) – the filename of the PDF (or CSV) file.
Return type:none
create_rainbow_colors(n_colors=10)[source]

Returns a list of rainbow colors. Colors are expressed as hexcodes of RGB values.

Parameters:n_colors (int) – number of rainbow colors.
Return type:list
delete_resampling_results()[source]

Resets all variables holding any result of the re-sampling process. This includes the convergence determination as well as the community structure. Does not delete the data that is intended to be clustered.

Return type:None
do_it_all(working_directory=None, distances=None, linkages=None, function_2_generate_noise_injected_datasets=None, min_cluster_size=4, alphabet=None, force_plotting=False, min_cluster_freq_2_retain=0.001, pickle_filename='pyGCluster_resampled.pkl', cpus_2_use=None, iter_max=250000, iter_tol=1e-07, iter_step=5000, iter_top_P=0.001, iter_window=50000, iter_till_the_end=False, top_X_clusters=0, threshold_4_the_lowest_max_freq=0.01, starting_min_overlap=0.1, increasing_min_overlap=0.05, color_gradient='1337', box_style='classic', min_value_4_expression_map=None, max_value_4_expression_map=None, additional_labels=None)[source]

Invokes all necessary functions which constitute the main functionality of pyGCluster. This is AHC clustering with noise injection and a variety of DLCs, in order to identify highly reproducible clusters, followed by a meta-clustering of highly reproducible clusters into so-called ‘communities’.

The functions that are called are pyGCluster.Cluster.resample(), pyGCluster.Cluster.build_nodemap(), pyGCluster.Cluster.write_dot(), pyGCluster.Cluster.draw_community_expression_maps() and pyGCluster.Cluster.draw_expression_profiles().

For a complete list of possible distance metrics see http://docs.scipy.org/doc/scipy/reference/spatial.distance.html; for the available linkage methods see http://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html

Note

If memory is of concern (e.g. for a large dataset, > 5000 objects), cpus_2_use should be kept low.

Parameters:
  • distances (list) – list of distance metrics, given as strings, e.g. [ ‘correlation’, ‘euclidean’ ]
  • linkages (list) – list of linkage methods, given as strings, e.g. [ ‘average’, ‘complete’, ‘ward’ ]
  • function_2_generate_noise_injected_datasets (function) – function to generate noise-injected datasets. If None (default), Gaussian distributions are used.
  • min_cluster_size (int) – minimum size of a cluster, so that it is included in the assessment of cluster reproducibilities.
  • alphabet (string) – alphabet used to convert decimal indices to characters to save memory. Defaults to string.printable, without ‘,’.

Note

If alphabet contains ‘,’, this character is removed from alphabet, because the indices comprising a cluster are saved comma-separated.

Parameters:
  • force_plotting (boolean) – if True, the convergence plot is created after every iter_step iterations (otherwise only when convergence is detected).
  • min_cluster_freq_2_retain (float) – ]0, 1[ minimum frequency (only the maximum of the DLC frequencies matters here) a cluster has to exhibit in order to be stored in pyGCluster once all iterations are finished.
  • cpus_2_use (int) – number of threads that are evoked in the re-sampling routine.
  • iter_max (int) – maximum number of re-sampling iterations.

Convergence determination:

Parameters:
  • iter_tol (float) – ]0, 1e-3[ value for the threshold of the median of normalized slopes, in order to declare convergence.
  • iter_step (int) – number of iterations each multiprocess performs and simultaneously the interval in which to check for convergence.
  • iter_top_P (float) – ]0, 1[ for the convergence estimation, the number of most frequent clusters is examined. This is the threshold for the minimum frequency of a cluster to be included.
  • iter_window (int) – size of the sliding window in iterations. The median is obtained from the normalized slopes inside this window; it should be a multiple of iter_step.
  • iter_till_the_end (boolean) – if set to True, the convergence determination is switched off; hence, re-sampling is performed until iter_max is reached.

Output/Plotting:

Parameters:
  • pickle_filename (string) – Filename of the output pickle object
  • top_X_clusters (int) – Plot of the top X clusters in the sorted list (by freq) of clusters having a maximum cluster frequency of at least threshold_4_the_lowest_max_freq (clusterfreq-plot is still sorted by size).
  • threshold_4_the_lowest_max_freq (float) – ]0, 1[ Clusters must have a maximum frequency of at least threshold_4_the_lowest_max_freq to appear in the plot.
  • min_value_4_expression_map (float) – lower bound for color coding of values in the expression map. Remember that log2-values are expected, i.e. this value should be < 0!
  • max_value_4_expression_map (float) – upper bound for color coding of values in the expression map.
  • color_gradient (string) – name of the color gradient used for plotting the expression map. Currently supported are default, Daniel, barplot, 1337, BrBG, PiYG, PRGn, PuOr, RdBu, RdGy, RdYlBu, RdYlGn and Spectral
  • expression_map_filename (string) – file name for expression map. .svg will be added if required.
  • legend_filename (string) – file name for legend .svg will be added if required.
  • box_style (string) – the way the relative standard deviation is visualized in the expression map. Currently supported are ‘modern’, ‘fusion’ or ‘classic’.
  • starting_min_overlap (float) – ]0, 1[ minimum required relative overlap between clusters so that they are assigned the same community. The relative overlap is defined as the size of the overlap between two clusters, divided by the size of the larger cluster.
  • increasing_min_overlap (float) – defines the increase of the required overlap between communities
  • additional_labels (dict) – dictionary, where additional labels can be defined which will be added in the expression map plots to the gene/protein names
Return type:

None

For more information to each parameter, please refer to pyGCluster.Cluster.resample(), and the subsequent functions: pyGCluster.Cluster.build_nodemap(), pyGCluster.Cluster.write_dot(), pyGCluster.Cluster.draw_community_expression_maps(), pyGCluster.Cluster.draw_expression_profiles().

draw_community_expression_maps(min_value_4_expression_map=None, max_value_4_expression_map=None, color_gradient='1337', box_style='classic')[source]

Plots the expression map for each community showing its object composition.

The following parameters are passed to pyGCluster.Cluster.draw_expression_map():

Parameters:
  • min_value_4_expression_map (float) – lower bound for color coding of values in the expression map. Remember that log2-values are expected, i.e. this value should be < 0!
  • max_value_4_expression_map (float) – upper bound for color coding of values in the expression map.
  • color_gradient (string) – name of the color gradient used for plotting the expression map. Currently supported are default, Daniel, barplot, 1337, BrBG, PiYG, PRGn, PuOr, RdBu, RdGy, RdYlBu, RdYlGn and Spectral
  • box_style (string) – name of box style used in SVG. Currently supported are classic, modern, fusion.
Return type:

none

draw_expression_map(identifiers=None, data=None, conditions=None, additional_labels=None, min_value_4_expression_map=None, max_value_4_expression_map=None, expression_map_filename=None, legend_filename=None, color_gradient=None, box_style='classic')[source]

Draws expression map as SVG

Parameters:
  • min_value_4_expression_map (float) – lower bound for color coding of values in the expression map. Remember that log2-values are expected, i.e. this value should be < 0!
  • max_value_4_expression_map (float) – upper bound for color coding of values in the expression map.
  • color_gradient (string) – name of the color gradient used for plotting the expression map. Currently supported are default, Daniel, barplot, 1337, BrBG, PiYG, PRGn, PuOr, RdBu, RdGy, RdYlBu, RdYlGn and Spectral
  • expression_map_filename (string) – file name for expression map. .svg will be added if required.
  • legend_filename (string) – file name for legend .svg will be added if required.
  • box_style (string) – the way the relative standard deviation is visualized in the expression map. Currently supported are ‘modern’, ‘fusion’ or ‘classic’.
  • additional_labels (dict) – dictionary, where additional labels can be defined which will be added in the expression map plots to the gene/protein names
Return type:

none

Data has to be a nested dict in the following format:
>>> data = {
...         fastaID1 : {
...                 cond1 : ( mean, sd ) , cond2 : ( mean, sd ), ...
...         },
...         fastaID2 : {
...                 cond1 : ( mean, sd ) , cond2 : ( mean, sd ), ...
...         },
... }
The parameters identifiers, data and conditions are optional; if they are not given, the required values are extracted from self[ ‘Data’ ], self[ ‘Identifiers’ ] and self[ ‘Conditions’ ], respectively.
draw_expression_map_for_cluster(clusterID=None, cluster=None, filename=None, min_value_4_expression_map=None, max_value_4_expression_map=None, color_gradient='default', box_style='classic')[source]

Plots an expression map for a given cluster. Either the parameter “clusterID” or “cluster” can be defined. This function is useful to plot a user-defined cluster, e.g. a knowledge-based cluster (TCA cluster, glycolysis cluster, ...). In this case, the parameter “cluster” should be defined; an example call is given below.

Parameters:
  • clusterID (int) – ID of a cluster (those are obtained e.g. from the plot of cluster frequencies or the node map)
  • cluster (tuple) – tuple containing the indices of the objects describing a cluster.
  • filename (string) – name of the SVG file for the expression map.

The following parameters are passed to pyGCluster.Cluster.draw_expression_map():

Parameters:
  • min_value_4_expression_map (float) – lower bound for color coding of values in the expression map. Remember that log2-values are expected, i.e. this value should be < 0!
  • max_value_4_expression_map (float) – upper bound for color coding of values in the expression map.
  • color_gradient (string) – name of the color gradient used for plotting the expression map. Currently supported are default, Daniel, barplot, 1337, BrBG, PiYG, PRGn, PuOr, RdBu, RdGy, RdYlBu, RdYlGn and Spectral
  • box_style (string) – name of box style used in SVG. Currently supported are classic, modern, fusion.
Return type:

none
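For example, a hand-picked cluster can be plotted like this; the indices and the file name are illustrative:

# indices refer to positions in ClusterClass[ 'Identifiers' ]
tca_cluster = ( 3, 17, 42, 108 )
ClusterClass.draw_expression_map_for_cluster(
    cluster=tca_cluster,
    filename='TCA_cluster_expression_map.svg',
    color_gradient='default',
    box_style='classic',
)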

draw_expression_map_for_community_cluster(name, min_value_4_expression_map=None, max_value_4_expression_map=None, color_gradient='1337', sub_folder=None, min_obcofreq_2_plot=None, box_style='classic')[source]

Plots the expression map for a given “community cluster”: Any cluster in the community node map is internally represented as a tuple with two elements: “cluster” and “level”. Those objects are stored as keys in self[ ‘Communities’ ], from where they may be extracted and fed into this function.

Parameters:
  • name (tuple) – “community cluster” -> best obtain from self[ ‘Communities’ ].keys()
  • min_obcofreq_2_plot (float) – minimum obCoFreq of a cluster’s object to be shown in the expression map.

The following parameters are passed to pyGCluster.Cluster.draw_expression_map():

Parameters:
  • min_value_4_expression_map (float) – lower bound for color coding of values in the expression map. Remember that log2-values are expected, i.e. this value should be < 0!
  • max_value_4_expression_map (float) – upper bound for color coding of values in the expression map.
  • color_gradient (string) – name of the color gradient used for plotting the expression map. Currently supported are default, Daniel, barplot, 1337, BrBG, PiYG, PRGn, PuOr, RdBu, RdGy, RdYlBu, RdYlGn and Spectral
  • box_style (string) – name of box style used in SVG. Currently supported are classic, modern, fusion.
  • sub_folder (string) – if specified, the expression map is saved in this folder, rather than in pyGCluster’s working directory.
Return type:

none

draw_expression_profiles(min_value_4_expression_map=None, max_value_4_expression_map=None)[source]

Draws an expression profile plot (SVG) for each community, illustrating the main “expression pattern” of a community. Each line in this plot represents an object. The “grey cloud” illustrates the range of the standard deviation of the mean values. The plot files are prefixed with “exProf”, followed by the community name as it is shown in the node map.

Parameters:
  • min_value_4_expression_map (int) – minimum of the y-axis (since data should be log2 values, this value should typically be < 0).
  • max_value_4_expression_map (int) – maximum for the y-axis.
Return type:

none

frequencies(identifier=None, clusterID=None, cluster=None)[source]

Returns a tuple with (i) the cFreq and (ii) a collections.defaultdict containing the DLC:frequency pairs for either an identifier (e.g. “JGI4|Chlre4|123456”), a clusterID or a cluster. Returns None if the identifier is not part of the data set, or if the clusterID or cluster was not found during the iterations.

Example:

>>> cFreq, dlc_freq_dict = cluster.frequencies( identifier = 'JGI4|Chlre4|123456' )
>>> dlc_freq_dict
... defaultdict(<type 'float'>,
... {'average-correlation': 0.0, 'complete-correlation': 0.0,
... 'centroid-euclidean': 0.0015, 'median-euclidean': 0.0064666666666666666,
... 'ward-euclidean': 0.0041333333333333335, 'weighted-correlation': 0.0,
... 'complete-euclidean': 0.0014, 'weighted-euclidean': 0.0066333333333333331,
... 'average-euclidean': 0.0020333333333333332})
Parameters:
  • identifier (string) – search frequencies by identifier input
  • clusterID (int) – search frequencies by cluster ID input
  • cluster (tuple) – search frequencies by cluster (tuple of ints) input
Return type:

tuple

info()[source]

Prints some information about the clustering via pyGCluster:

  • number of genes/proteins clustered
  • number of conditions defined
  • number of distance-linkage combinations
  • number of iterations performed

as well as some information about the communities, the legend for the shapes of nodes in the node map and the way the functions were called.

Return type:none
load(filename)[source]

Fills a pyGCluster.Cluster object with the session saved as “filename”. If “filename” is not a complete path, e.g. “example.pkl” (instead of “/home/user/Desktop/example.pkl”), the directory given by self[ ‘Working directory’ ] is used.

Note

Loading of pyGCluster has to be performed as a two-step procedure:
>>> LoadedClustering = pyGCluster.Cluster()
>>> LoadedClustering.load( "/home/user/Desktop/example.pkl" )
Parameters:filename (string) – may be either a simple file name (“example.pkl”) or a complete path (e.g. “/home/user/Desktop/example.pkl”).
Return type:none
median(_list)[source]

Returns the median from a list of numeric values.

Parameters:_list (list) –
Return type:int / float
plot_clusterfreqs(min_cluster_size=4, top_X_clusters=0, threshold_4_the_lowest_max_freq=0.01)[source]

Plots the frequencies of each cluster as an expression map: which cluster was found by which distance-linkage combination, and with what frequency? The plot’s filename is prefixed by ‘clusterFreqsMap’, followed by the values of the parameters, e.g. ‘clusterFreqsMap_minSize4_top0clusters_top10promille.svg’. Clusters are sorted by size.

Parameters:
  • min_cluster_size (int) – only clusters with a size equal or greater than min_cluster_size appear in the plot of the cluster freqs.
  • threshold_4_the_lowest_max_freq (float) – ]0, 1[ Clusters must have a maximum frequency of at least threshold_4_the_lowest_max_freq to appear in the plot.
  • top_X_clusters (int) – Plot of the top X clusters in the sorted list (by freq) of clusters having a maximum cluster frequency of at least threshold_4_the_lowest_max_freq (clusterfreq-plot is still sorted by size).

Note

If top_X_clusters is set to zero ( 0 ), this filter is switched off (this is the default).

Return type:None
plot_mean_distributions()[source]

Creates a density plot of mean values for each condition via rpy2.

Return type:none
plot_nodetree(tree_filename='tree.dot')[source]
Plots the dendrogram for the clustering of the most_frequent_clusters.
  • node label = nodeID internally used for self[‘Nodemap’] (not the same as clusterID!).

  • node border color is white if the node is a close2root-cluster (i.e. larger than self[ ‘for IO skip clusters bigger than’ ] ).

  • edge label = distance between parent and children.

  • edge - color codes:
    • black = default; highlights child which is not a most_frequent_cluster but was created during formation of the dendrogram.
    • green = children are connected with the root.
    • red = highlights child which is a most_frequent_cluster.
    • yellow = most_frequent_cluster is directly connected with the root.
Parameters:tree_filename (string) – name of the Graphviz DOT file containing the dendrogram of the AHC of most frequent clusters. Best given with ”.dot”-extension!
Return type:none
resample(distances, linkages, function_2_generate_noise_injected_datasets=None, min_cluster_size=4, alphabet=None, force_plotting=False, min_cluster_freq_2_retain=0.001, pickle_filename='pyGCluster_resampled.pkl', cpus_2_use=None, iter_tol=1e-07, iter_step=5000, iter_max=250000, iter_top_P=0.001, iter_window=50000, iter_till_the_end=False)[source]

Routine for the assessment of cluster reproducibility (re-sampling routine). To this end, a large number of noise-injected datasets is created, which are subsequently clustered by AHC. Those are created via pyGCluster.function_2_generate_noise_injected_datasets() (default = usage of Gaussian distributions). Each ‘simulated’ dataset is then subjected to AHC x times, where x equals the number of distance-linkage combinations that result from all possible combinations of “distances” and “linkages”. In order to speed up the re-sampling routine, it is distributed to multiple threads if cpus_2_use > 1.

The re-sampling routine stops once either convergence (see below) is detected or iter_max iterations have been performed. Eventually, only clusters with a maximum frequency of at least min_cluster_freq_2_retain are stored; all others are discarded.

In order to visually inspect convergence, a convergence plot is created. For more information about the convergence estimation, see Supplementary Material of pyGCluster’s publication.

For a complete list of possible distance metrics see http://docs.scipy.org/doc/scipy/reference/spatial.distance.html; for the available linkage methods see http://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html

Note

If memory is of concern (e.g. for a large dataset, > 5000 objects), cpus_2_use should be kept low.

Parameters:
  • distances (list) – list of distance metrics, given as strings, e.g. [ ‘correlation’, ‘euclidean’ ]
  • linkages (list) – list of linkage methods, given as strings, e.g. [ ‘average’, ‘complete’, ‘ward’ ]
  • function_2_generate_noise_injected_datasets (function) – function to generate noise-injected datasets. If None (default), Gaussian distributions are used.
  • min_cluster_size (int) – minimum size of a cluster, so that it is included in the assessment of cluster reproducibilities.
  • alphabet (string) – alphabet used to convert decimal indices to characters to save memory. Defaults to string.printable, without ‘,’.

Note

If alphabet contains ‘,’, this character is removed from alphabet, because the indices comprising a cluster are saved comma-separated.

Parameters:
  • force_plotting (boolean) – if True, the convergence plot is created after every iter_step iterations (otherwise only when convergence is detected).
  • min_cluster_freq_2_retain (float) – ]0, 1[ minimum frequency (only the maximum of the DLC frequencies matters here) a cluster has to exhibit in order to be stored in pyGCluster once all iterations are finished.
  • cpus_2_use (int) – number of threads that are evoked in the re-sampling routine.
  • iter_max (int) – maximum number of re-sampling iterations.

Convergence determination:

Parameters:
  • iter_tol (float) – ]0, 1e-3[ value for the threshold of the median of normalized slopes, in order to declare convergence.
  • iter_step (int) – number of iterations each multiprocess performs and simultaneously the interval in which to check for convergence.
  • iter_top_P (float) – ]0, 1[ for the convergence estimation, the number of most frequent clusters is examined. This is the threshold for the minimum frequency of a cluster to be included.
  • iter_window (int) – size of the sliding window in iterations. The median is obtained from the normalized slopes inside this window; it should be a multiple of iter_step.
  • iter_till_the_end (boolean) – if set to True, the convergence determination is switched off; hence, re-sampling is performed until iter_max is reached.
Return type:

None
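As noted above, resample() can also be called directly, e.g. to run only a memory-intensive DLC subset on a dedicated machine; the chosen values are illustrative:

# cluster only the Ward / Euclidean combination, using two processes
ClusterClass.resample(
    distances=['euclidean'],
    linkages=['ward'],
    cpus_2_use=2,
    iter_max=100000,
    min_cluster_freq_2_retain=0.001,
)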

save(filename='pyGCluster.pkl')[source]

Saves the current pyGCluster.Cluster object in a Pickle object.

Parameters:filename (string) – may be either a simple file name (“example.pkl”) or a complete path (e.g. “/home/user/Desktop/example.pkl”). In the former case, the pickle is stored in pyGCluster’s working directory.
Return type:none
write_dot(filename, scaleByFreq=True, min_obcofreq_2_plot=None, n_legend_nodes=5, min_value_4_expression_map=None, max_value_4_expression_map=None, color_gradient='1337', box_style='classic')[source]

Writes a Graphviz DOT file representing the cluster composition of communities. Herein, each node represents a cluster. Its name is a combination of the cluster’s ID, followed by the level / iteration it was inserted into the community:

  • The node’s size reflects the cluster’s cFreq.
  • The node’s shape illustrates by which distance metric the cluster was found (if the shape is a point, this illustrates that this cluster was not among the most_frequent_clusters, but only formed during AHC of clusters).
  • The node’s color shows the community membership; except for clusters which are larger than self[ ‘for IO skip clusters bigger than’ ], those are highlighted in grey.
  • The node connecting all clusters is the root (the cluster holding all objects), which is highlighted in white.

The DOT file may be rendered with “Graphviz” or further processed with other appropriate programs such as “Gephi”. If “Graphviz” is available, the DOT file is eventually rendered with Graphviz’s dot algorithm.

In addition, an expression map for each cluster of the node map is created (via pyGCluster.Cluster.draw_expression_map_for_community_cluster()).

Those are saved in the sub-folder “communityClusters”.

This function also calls pyGCluster.Cluster.write_legend(), which creates a TXT file containing the object composition of all clusters, as well as their frequencies.

Parameters:
  • filename (string) – file name of the Graphviz DOT file representing the node map, best given with extension ”.dot”.
  • scaleByFreq (boolean) – switch to either scale nodes (= clusters) by cFreq or apply a constant size to each node (the latter may be useful to put emphasis on the nodes’ shapes).
  • min_obcofreq_2_plot (float) – if defined, clusters with lower cFreq than this value are skipped, i.e. not plotted.
  • n_legend_nodes (int) – number of nodes representing the legend for the node sizes. The node sizes themselves encode for the cFreq. “Legend nodes” are drawn as grey boxes.
  • min_value_4_expression_map (float) – lower bound for color coding of values in the expression map. Remember that log2-values are expected, i.e. this value should be < 0.
  • max_value_4_expression_map (float) – upper bound for color coding of values in the expression map.
  • color_gradient (string) – name of the color gradient used for plotting the expression map.
  • box_style (string) – the way the relative standard deviation is visualized in the expression map. Currently supported are ‘modern’, ‘fusion’ or ‘classic’.
Return type:

none
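A typical call; if the automatic rendering does not run, the DOT file can be rendered manually with Graphviz (the file names are illustrative):

ClusterClass.write_dot(
    filename='communities.dot',
    scaleByFreq=True,
    n_legend_nodes=5,
)
# manual rendering with Graphviz, if needed:
#   dot -Tsvg communities.dot -o communities_nodemap.svg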

write_legend(filename='legend.txt')[source]

Creates a legend for the community node map as a TXT file. Herein, the object composition of each cluster of the node map as well as its frequencies are recorded. Since this function is internally called by pyGCluster.Cluster.write_dot(), it is typically not necessary to call this function.

Parameters:filename (string) – name of the legend TXT file, best given with extension ”.txt”.
Return type:none
pyGCluster.create_default_alphabet()[source]

Returns the default alphabet which is used to save clusters in a less memory-intensive form: instead of saving e.g. a cluster containing identifiers with indices of 1,20,30 as “1,20,30”, the indices are converted to a baseX system -> “1,k,u”.

The default alphabet that is returned is:
>>> string.printable.replace( ',', '' )
Return type:string
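A minimal sketch of such a base conversion, assuming the default alphabet; the function name is illustrative and not part of pyGCluster's API:

import string

def index_to_baseX(index, alphabet=string.printable.replace(',', '')):
    # repeated division by the alphabet size; the remainders select the characters
    base = len(alphabet)
    if index == 0:
        return alphabet[0]
    digits = []
    while index > 0:
        index, remainder = divmod(index, base)
        digits.append(alphabet[remainder])
    return ''.join(reversed(digits))

# the cluster with indices (1, 20, 30) is then stored as '1,k,u':
print(','.join(index_to_baseX(i) for i in (1, 20, 30)))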
pyGCluster.resampling_multiprocess(DataQ=None, data=None, iterations=5000, alphabet=None, dlc=None, min_cluster_size=4, min_cluster_freq_2_retain=0.001, function_2_generate_noise_injected_datasets=None)[source]

This is the function that is called for each multiprocess that is evoked internally in pyGCluster during the re-sampling routine. Agglomerative hierarchical clustering is performed for each distance-linkage combination (DLC) on each of the iterations noise-injected datasets. Clusters from each hierarchical tree are extracted, and their counts are saved in a temporary cluster-count matrix. After iterations iterations, clusters are filtered according to min_cluster_freq_2_retain. These clusters, together with their respective counts among all DLCs, are returned. The return value is a list containing tuples with two elements: cluster (string) and counts (one-dimensional np.array).

Parameters:
  • DataQ (multiprocessing.Queue()) – data queue which is used to pipe the re-sampling results back to pyGCluster.
  • data (collections.OrderedDict()) – dictionary ( OrderedDict! ) holding the data to be clustered -> passed through to the noise-function.
  • iterations (int) – the number of iterations this multiprocess is going to perform.
  • alphabet (string) – in order to save memory, the indices describing a cluster are converted to a specific alphabet (rather than decimal system).
  • dlc (list) – list of the distance-linkage combinations that are going to be evaluated.
  • min_cluster_size (int) – minimum size of a cluster to be considered in the re-sampling routine (smaller clusters are discarded)
  • min_cluster_freq_2_retain (float) – once all iterations are performed, clusters are filtered according to 50% (because typically forwarded from pyGCluster) of this threshold.
  • function_2_generate_noise_injected_datasets (function) – function to generate re-sampled datasets.
Return type:

list

pyGCluster.seekAndDestry(processes)[source]

Any multiprocesses given by processes are terminated.

Parameters:processes (list) – list containing multiprocess.Process()
Return type:none
pyGCluster.yield_noisejected_dataset(data, iterations)[source]

Generator yielding a re-sampled dataset with each iteration. A re-sampled dataset is created by re-sampling each data point from the normal distribution given by its associated mean and standard deviation value. See the example in the Supplementary Material of pyGCluster’s publication for how to define a custom noise function (e.g. uniform noise).

Parameters:
  • data (collections.OrderedDict()) – dictionary ( OrderedDict! ) holding the data to be re-sampled.
  • iterations (int) – the number of re-sampled datasets this generator will yield.
Return type:

none
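A sketch of a custom noise function with the same generator interface, here with uniform instead of Gaussian noise; the exact structure of the yielded dataset is an assumption and should be checked against the publication’s Supplementary Material:

import collections
import random

def yield_uniform_noise_dataset(data, iterations):
    # data: OrderedDict of identifier -> { condition : ( mean, sd ) }
    for _ in range(iterations):
        noisy = collections.OrderedDict()
        for identifier, conditions in data.items():
            noisy[identifier] = {}
            for condition, (mean, sd) in conditions.items():
                # draw uniformly from [ mean - sd, mean + sd ] instead of a Gaussian
                noisy[identifier][condition] = random.uniform(mean - sd, mean + sd)
        yield noisy

# passed e.g. via resample( ..., function_2_generate_noise_injected_datasets=yield_uniform_noise_dataset )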
