I’m training a model for a computer vision classification task on a labeled dataset. I’m starting from an existing model and using transfer learning so that it covers both new and existing categories.
I’m not sure how to proceed with the training/validation/test split. Normally, I would go for a 60/20/20 split. However, in this case the number of tagged examples per category is heavily skewed: some categories have over 400 items, while others have only 10.
I could hold out a fixed number of validation and/or test items per category (e.g. 10), but that would exclude the smallest categories. Alternatively, I could hold out a percentage of each category’s items, but then the number of validation and training items would differ from category to category. I could also upsample the small categories so that every category has at least an equal number of training/test items, but this would decrease the variation in the images, which I could partially counter by transforming/augmenting them.
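To make the percentage and upsampling options concrete, this is roughly what I have in mind. It is only a minimal standard-library Python sketch, not my real pipeline: the function names `stratified_split` and `oversample`, the 20% validation and 20% test fractions, the `min_holdout` floor and the toy `labels` list are all placeholders I made up for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_frac=0.2, test_frac=0.2, min_holdout=1, seed=0):
    """Split item indices into train/val/test separately within each category."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for label, idxs in by_class.items():
        rng.shuffle(idxs)
        n = len(idxs)
        # Percentage per category, with a floor so small categories still get evaluated.
        n_val = max(min_holdout, round(n * val_frac))
        n_test = max(min_holdout, round(n * test_frac))
        if n_val + n_test >= n:
            raise ValueError(f"category {label!r} has too few items ({n}) to split")
        val += idxs[:n_val]
        test += idxs[n_val:n_val + n_test]
        train += idxs[n_val + n_test:]
    return train, val, test


def oversample(train_idx, labels, seed=0):
    """Repeat training items of small categories until every category matches the largest.

    Applied to the training indices only, after splitting, so that no duplicated
    image ends up in both the training set and the validation/test sets.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx in train_idx:
        by_class[labels[idx]].append(idx)
    target = max(len(idxs) for idxs in by_class.values())

    balanced = []
    for idxs in by_class.values():
        balanced += idxs
        # Draw the missing items with replacement from the same category.
        balanced += rng.choices(idxs, k=target - len(idxs))
    return balanced


# Hypothetical labels mirroring the skew described above: 400 items vs. 10 items.
labels = ["common"] * 400 + ["rare"] * 10
train, val, test = stratified_split(labels)
balanced_train = oversample(train, labels)
```

With this, each category contributes to validation and test in proportion to its size, and the duplicated items would only ever be seen (ideally with augmentation) during training.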
Any ideas on how to approach this issue?