Variability-Based Neighbour Clustering

Hi all, as part of a corpus study I’m interested in using clustering techniques to group observations. Thanks to another discussion here I came across VNC as a possible technique for this.

I’m hoping to use @folgert’s code for this, but I’m having trouble running it on my version of Python (3.7.4). It installed fine, but when I run the VNC notebook I get this error in the fifth cell:

    ~/Downloads/diachronic-text-analysis-master/HACluster/ in ete_tree(self, labels)
        140             from ete2 import Tree, NodeStyle, TreeStyle
        141         elif sys.version_info[0] == 3:
    --> 142             from ete3 import Tree, NodeStyle, TreeStyle
        143         else:
        144             raise ValueError('Your version of Python is not supported.')

    ModuleNotFoundError: No module named 'ete3'

a) Does anyone have any thoughts on this problem (what version of Python was this written in)?
b) Does anyone know of related code that does similar things (other than @mike.kestemont’s code for his Beckett project)?
c) More broadly, does anyone have particular opinions on style-based clustering? Obviously, producing chronologically contiguous clusters is helpful on one level but hardly exhaustive, and I wonder what techniques others have used. I’ve tried some basic K-Means (although this often ends up producing chronologically contiguous clusters anyway if corpus position is one of the variables), but not much beyond that.



Hi! Python 3.7 should probably work, but you never know… The error here seems to be a missing package, the ETE Toolkit (ete3), which is probably missing from the script’s requirements. With pip you can install it using:

    pip install ete3
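If you want to confirm that the package is visible to the same interpreter the notebook uses, a quick check along these lines should do it. This is only a sketch: `has_module` is a helper name I made up for illustration, not part of the VNC code.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module called `name` can be found."""
    # find_spec returns None when the module is not installed
    # for the interpreter running this code.
    return importlib.util.find_spec(name) is not None


if not has_module("ete3"):
    print("ete3 is missing; install it with: pip install ete3")
```

Running this in a notebook cell tests the notebook's own kernel, which is not always the Python that pip installs into.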

I’m curious what other people have to say about b) and c)!
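On b), in case it helps to see the general idea without any dependencies, here is a minimal sketch of neighbour-constrained agglomerative clustering. It is not the notebook's implementation: the function name `vnc_clusters`, the one-number-per-period input, and the use of the absolute difference between cluster means as the distance are all assumptions made for illustration.

```python
from statistics import mean


def vnc_clusters(values, n_clusters):
    """Merge chronologically adjacent clusters until n_clusters remain.

    values: one number per time-ordered observation (hypothetical input).
    Returns a list of clusters, each a list of indices into values.
    """
    if not 1 <= n_clusters <= len(values):
        raise ValueError("n_clusters must be between 1 and len(values)")
    # Start with every observation in its own cluster, in time order.
    clusters = [[i] for i in range(len(values))]
    while len(clusters) > n_clusters:
        # Distance between neighbours: absolute difference of cluster means
        # (an assumption; another variability measure could be swapped in).
        gaps = [
            abs(mean(values[i] for i in clusters[k])
                - mean(values[i] for i in clusters[k + 1]))
            for k in range(len(clusters) - 1)
        ]
        k = gaps.index(min(gaps))  # closest adjacent pair (first on ties)
        # Only neighbours may merge, so clusters stay contiguous in time.
        clusters[k:k + 2] = [clusters[k] + clusters[k + 1]]
    return clusters


print(vnc_clusters([10, 11, 12, 50, 51, 90], 3))
```

Because only adjacent clusters can merge, the result is contiguous by construction, which is exactly the property (and the limitation) mentioned in c).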

Ah, of course there’s also the original R code written by Stefan Gries and Martin Hilpert.

Great, combined with the updated code that’s working well for me now. Thanks!