Imbalanced Training / Validation / Test Split

Thanks @Nanne, good points! I should have been more careful and elaborate in my comments.

I agree with you that this is a data problem, but I think the problem is deeper than ‘it requires more data’ and has to do with the nature of these skewed/long-tailed distributions.

The skewness suggests that if the initial ground-truth classification process had been done with more data, it probably would have resulted in more categories than there are now, with the categories that are currently small getting more data, but other/new categories ending up just as small.

Indeed, students get taught about recall/precision/F-score, but those are usually only calculated on the whole set, and rarely analysed in a breakdown over categories. What we rarely teach our students (and rarely do ourselves in papers) is to analyse whether any improvements we see from tweaking models or using different algorithms are the result of getting the big categories more correct, or of getting all or most (or just the tail) of the categories more correct. Moreover, we rarely discuss what it means that the category distribution is skewed or long-tailed.
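To make this concrete: a per-category breakdown takes one extra line with e.g. scikit-learn. A minimal sketch with made-up labels on a skewed toy distribution, not tied to any particular dataset:

```python
from sklearn.metrics import classification_report

# Toy example: a heavily skewed label distribution (90 / 7 / 3).
y_true = ["big"] * 90 + ["mid"] * 7 + ["rare"] * 3
# A classifier that mostly predicts the majority class still gets
# high overall accuracy, but the per-class rows expose the tail.
y_pred = ["big"] * 90 + ["big"] * 5 + ["mid"] * 2 + ["big"] * 3

# Prints precision / recall / F1 per class, plus macro and weighted averages.
print(classification_report(y_true, y_pred, zero_division=0))
```

Comparing the macro average (every class weighted equally) with the weighted average (dominated by the head) in that report is a quick way to see whether an improvement comes from the big categories or from the whole distribution.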

@marieke pointed me to a few really interesting papers by Filip Ilievski and others (she was involved herself as well, I think) studying evaluation and performance of Entity Linking for long-tailed distributions [1,2], showing that pretty much all algorithms submitted to the SemEval evaluation campaign mainly do well on frequent entities for which there is lots of information, but fail on the rarer entities. Yet before them, virtually no one had ever looked at that. The same applies to text mining, information retrieval, recommender systems, NER, text recognition, and probably a whole bunch of other fields.

So I guess my real point is that I encourage multiple analyses and breakdowns of evaluation results, to get a better understanding of how models differ and how that relates to the nature of the task they’re supposed to perform. And to think more deeply about the nature of such distributions and how that relates to the tasks we’re working on.

[1] https://www.aclweb.org/anthology/C18-1056.pdf
[2] https://www.aclweb.org/anthology/S18-1009.pdf


Fully agree that this is super valuable. There are some efforts in the Fairness, Accountability, and Transparency (FAT) ML area that overlap with this, and I’ve seen a number of works come to the same conclusion as the papers you referred to, so we can definitely do better!

I try to discuss it a little in my course, but my impression is that it requires quite a deep understanding of data to really grasp the issues at play. If you know of any literature on it that would be accessible for first-year AI / Information Science students, feel free to share!


@Nanne @mjlavin80 Thanks for the suggestions. I am documenting my project, which is about a dataset. Would you recommend any methods for evaluating datasets (not GAN models)? In my paper, I use both qualitative and quantitative methods to validate the dataset. For the quantitative part, I show the generator loss, discriminator loss, and L1 loss, to show whether the dataset trains properly with the GAN model. For the qualitative part, I show output images every 20 epochs, along with results for other images. I would appreciate your further feedback! :slight_smile: @melvin.wevers too.

@elibooklover I think using qualitative methods for this use case makes a lot of sense. Really, what you’re trying to score is whether the colorized images look like the kind of images a human would produce, right? So how about a task where users are shown 50% human-colorized and 50% machine-colorized images and are asked to guess which ones are which? The “best” images are the ones that get the most votes for being human-made. You can also compare the recall rates of human- vs. machine-colorized images to get a sense of how different they are. Of course, with the machine-colorized images, the lower the recall score, the better, since that would mean you’re fooling more people. If the two sets are indistinguishable from one another, both recall rates would converge on 50%.
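If you log each guess, the scoring itself is only a few lines. A rough sketch of what I mean, with purely hypothetical data and field names:

```python
# Hypothetical guessing-task results: one row per (participant, image),
# recording the true source of the image and the participant's guess.
guesses = [
    {"source": "machine", "guess": "human"},
    {"source": "machine", "guess": "machine"},
    {"source": "human", "guess": "human"},
    {"source": "human", "guess": "machine"},
    # ... many more rows in a real study
]

def recall(rows, source):
    """Fraction of images from `source` correctly identified as such."""
    relevant = [r for r in rows if r["source"] == source]
    correct = [r for r in relevant if r["guess"] == source]
    return len(correct) / len(relevant)

# Lower recall on machine-colorized images means more people were fooled;
# if both values hover around 0.5, the two sets are indistinguishable.
print("machine recall:", recall(guesses, "machine"))
print("human recall:  ", recall(guesses, "human"))
```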

I’m not entirely sure what you mean by evaluating the dataset. The losses you mention are for evaluating the training of a model; if you train the same model on two datasets and get lower loss values for one of them, I don’t think you can conclude that that dataset is ‘better’. The combination of that model with that dataset gives you a lower loss, but that might be due to the model, or even the type of loss you’ve chosen.

It depends on the task, but from @mjlavin80’s post it seems you’re doing colourisation, and from the L1 loss you report it seems you have some ground truth. In that case I would report PSNR / SSIM. But the better evaluation is, as @mjlavin80 also suggested, a user study, since plausible colourisation and matching the ground truth aren’t the same thing.
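Computing those is straightforward with scikit-image. A minimal sketch, assuming 8-bit RGB images of the same size and a recent scikit-image version (where channel_axis replaced the old multichannel flag), with placeholder file names:

```python
from skimage import io
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# Placeholder file names -- swap in your ground truth and model output.
ground_truth = io.imread("ground_truth.png")  # original colour image
colourised = io.imread("colourised.png")      # model output, same size

psnr = peak_signal_noise_ratio(ground_truth, colourised, data_range=255)
ssim = structural_similarity(ground_truth, colourised,
                             channel_axis=-1, data_range=255)

print(f"PSNR: {psnr:.2f} dB, SSIM: {ssim:.3f}")  # higher = closer to ground truth
```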

It might be worth starting a new thread on this if you have more questions, and perhaps describing what you’re doing in a little more detail, to provide some context!


@elibooklover Also, if your images aren’t quite at the level where they’d fool anyone, you could have users rate quality on a scale, so that you can look for patterns in which ones seem to be the most accurate. (You might need to do an IRB application to run this study, but it could be conducted somewhere like Amazon Mechanical Turk for little cost.)


I’m actually getting quite good results using the mixup method, which generates more training data and controls overfitting in my case. Have you used it, and if so, what are your experiences?

That’s cool to hear! I never tried mixup myself; I only became aware of it when CutMix was getting attention. A colleague who has tried these kinds of tricks extensively describes CutMix as the ‘proper way’ to do mixup.

He recommends using CutMix with ~50% of the images mixed, and not combining it with mixup. There’s obviously no guarantee that CutMix will beat mixup for your problem, but it might be worth giving it a try as well!
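For reference, the core of both tricks is only a few lines each. A rough sketch, not the reference implementations, assuming PyTorch-style image batches x of shape [B, C, H, W] and one-hot label batches y:

```python
import numpy as np
import torch

def mixup(x, y, alpha=0.2):
    """Blend whole images and their one-hot labels pairwise."""
    lam = float(np.random.beta(alpha, alpha))
    perm = torch.randperm(x.size(0))
    return lam * x + (1 - lam) * x[perm], lam * y + (1 - lam) * y[perm]

def cutmix(x, y, alpha=1.0):
    """Paste a random rectangle from a shuffled batch; mix labels by area."""
    lam = float(np.random.beta(alpha, alpha))
    perm = torch.randperm(x.size(0))
    h, w = x.size(2), x.size(3)
    # Rectangle whose area is roughly (1 - lam) of the image.
    cut_h, cut_w = int(h * np.sqrt(1 - lam)), int(w * np.sqrt(1 - lam))
    cy, cx = np.random.randint(h), np.random.randint(w)
    y1, y2 = np.clip(cy - cut_h // 2, 0, h), np.clip(cy + cut_h // 2, 0, h)
    x1, x2 = np.clip(cx - cut_w // 2, 0, w), np.clip(cx + cut_w // 2, 0, w)
    x_mixed = x.clone()
    x_mixed[:, :, y1:y2, x1:x2] = x[perm][:, :, y1:y2, x1:x2]
    # Adjust lambda to the exact pasted area after clipping.
    lam = 1 - float((y2 - y1) * (x2 - x1)) / (h * w)
    return x_mixed, lam * y + (1 - lam) * y[perm]
```

The ‘~50% of images mixed’ recommendation then just means applying cutmix to roughly half of your batches (e.g. a coin flip per batch) and passing the rest through unchanged.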

I saw CutMix popping up as well; I’ll check that one out too. Thanks!