Visualising outputs of the R package Stylo when using large datasets

e_smith · March 18, 2021, 2:20pm

Dear all,
I’m currently working on an authorship verification project using the R package Stylo. The issue I’m having is that my corpus contains 700+ fragments, so the visualisations produced by the Stylo package are unreadable.

I have tried using the different visualisation ‘flavours’ in Stylo to minimise the information represented on the graphs but the output is still too crowded to be interpretable. I am currently able to run the Stylo experiments in both R and Python (using rpy2). I have also tried using a couple of different tools in Python to plot the outputs but I haven’t been able to find anything that works well for this volume of data so far.

What packages or tools would you recommend, in either R or Python, for visualising, clustering or otherwise interpreting the pairwise distance matrix produced by my experiments in a clear way?

folgert · March 21, 2021, 9:48am

Hi!

I have had good experiences with the software package FigTree, which is essentially a graphical editor for cluster trees and allows folding branches in trees. Might be worth checking out.

ash · March 22, 2021, 5:52pm

Another solution for visualizing big textual datasets is to use distance-based networks and give up on hierarchical clustering (“stylo” also generates an EDGES table that can be used to build network objects or loaded directly into Gephi, etc.).
More details: https://academic.oup.com/dsh/article/32/1/50/2957386
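
For anyone who only has the pairwise distance matrix at hand, here is a rough sketch of the network idea. It is a simplified stand-in, not the procedure stylo uses for its own edges table: it just links each fragment to its k nearest neighbours and writes a CSV with Source/Target/Weight columns, which Gephi can import as an edge list. The function names, the choice of k, and the 1/(1+d) similarity weight are all assumptions made for illustration.

```python
import csv
import io

def nearest_neighbour_edges(labels, dist, k=3):
    """Link every item to its k nearest neighbours (undirected, no duplicates)."""
    edges = {}
    for i, row in enumerate(dist):
        # Rank all other items by their distance to item i.
        neighbours = sorted((j for j in range(len(row)) if j != i),
                            key=lambda j: row[j])
        for j in neighbours[:k]:
            key = tuple(sorted((i, j)))  # store each pair only once
            edges[key] = row[j]
    return [(labels[a], labels[b], d) for (a, b), d in sorted(edges.items())]

def edges_to_csv(edges):
    """Serialise edges as CSV text with Source, Target, Weight columns."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for source, target, d in edges:
        # Assumed weighting: turn a distance into a similarity in (0, 1].
        writer.writerow([source, target, 1.0 / (1.0 + d)])
    return buf.getvalue()

# Toy 4 x 4 distance matrix standing in for the real 700 x 700 one.
labels = ["A", "B", "C", "D"]
dist = [
    [0, 1, 4, 5],
    [1, 0, 2, 6],
    [4, 2, 0, 3],
    [5, 6, 3, 0],
]
edges = nearest_neighbour_edges(labels, dist, k=1)
csv_text = edges_to_csv(edges)
```

Keeping k small is what makes the graph readable: 700 fragments with k=3 give at most 2,100 edges instead of roughly 245,000 pairwise ones.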

ryan.heuser · March 23, 2021, 11:11am

By the way, do you have any code for moving from a Python hclust object to a FigTree tree? I found this Stack Overflow answer (“python - Save dendrogram to newick format”), but when I open the output file in FigTree, I get a duplicate leaf node error.

Also, @ash, great idea: I’ve often had more luck with network diagrams than with dendrograms or t-SNE/dimensionality reduction.

folgert · March 23, 2021, 11:29am

I think that code is correct. You could try to add a unique ID to each label.
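
A minimal sketch of that suggestion, assuming the clustering is available as a SciPy-style linkage matrix (rows of `[child_a, child_b, height, size]`, with merged clusters numbered from `n` upwards). This is not the Stack Overflow code itself; `unique_labels` and `linkage_to_newick` are hypothetical helper names. Each label gets its position as a prefix, so repeated fragment names can no longer collide, and characters that are special in Newick are replaced with underscores.

```python
import re

def unique_labels(labels):
    """Prefix each label with its index and strip Newick-special characters."""
    return [f"{i}_" + re.sub(r"[\s()\[\],:;']", "_", label)
            for i, label in enumerate(labels)]

def linkage_to_newick(linkage, labels):
    """Convert a SciPy-style linkage matrix into a Newick string."""
    n = len(labels)
    names = unique_labels(labels)
    # Map node id -> (Newick fragment, height of that node in the tree).
    nodes = {i: (names[i], 0.0) for i in range(n)}
    for step, (a, b, height, *_rest) in enumerate(linkage):
        parts = []
        for child in (int(a), int(b)):
            fragment, child_height = nodes.pop(child)
            # Branch length = parent height minus child height.
            parts.append(f"{fragment}:{height - child_height:.4f}")
        nodes[n + step] = ("(" + ",".join(parts) + ")", height)
    root, _ = next(iter(nodes.values()))
    return root + ";"

# Two fragments share the label "a", which would trip up FigTree as-is.
labels = ["a", "a", "b"]
linkage = [[0, 1, 1.0, 2], [3, 2, 3.0, 3]]
newick = linkage_to_newick(linkage, labels)
```

The resulting string can be written to a `.nwk` file and opened in FigTree; the numeric prefixes also make it easy to map leaves back to rows of the distance matrix.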

andreskarjus · March 26, 2021, 8:12am

If you are using R (as mentioned in the post), visualizing the entire distance matrix is easy with reshape2::melt(data) %>% ggplot(aes(Var1, Var2, fill = value)) + geom_tile(). Ordering the axes often makes for a more informative plot too, e.g. reduce the distance matrix to one dimension using MDS or t-SNE and use that to order the axes (the 700 fragments).
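
For the Python side, the axis-ordering step could look roughly like the sketch below. It is a dependency-free illustration of classical (Torgerson) MDS reduced to one dimension: the squared distances are double-centred and the leading eigenvector is found by power iteration. `mds_1d_order` is a hypothetical name, and the approach assumes the distances are close to Euclidean, so that the dominant eigenvalue is positive. In practice NumPy or scikit-learn would be faster for 700 fragments.

```python
import math

def mds_1d_order(dist, iterations=200):
    """Return row indices sorted by their 1-D classical MDS coordinate."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    # Double centring: B = -1/2 * J * D^2 * J (dist is assumed symmetric).
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    # Power iteration towards the dominant eigenvector of B.
    v = [math.sin(i + 1) for i in range(n)]  # deterministic, non-constant start
    for _ in range(iterations):
        w = [sum(bij * vj for bij, vj in zip(row, v)) for row in b]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    return sorted(range(n), key=lambda i: v[i])

def reorder(dist, order):
    """Apply the same permutation to rows and columns."""
    return [[dist[i][j] for j in order] for i in order]

# Toy example: four points on a line at positions 0, 10, 2 and 7.
positions = [0, 10, 2, 7]
dist = [[abs(p - q) for q in positions] for p in positions]
order = mds_1d_order(dist)
ordered = reorder(dist, order)
```

The sign of an eigenvector is arbitrary, so the order may come out reversed; for a heatmap that only flips the plot. The reordered matrix can then be handed to any heatmap function, so that similar fragments sit next to each other on both axes.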