Multiple models trained on your data perform surprisingly poorly, despite having decent metrics on the validation set. The code seems fine, so you decide to take a closer look at your training data. You check a random sample - the label is wrong. So is the next. Your stomach sinks and you start looking through your data in batches*. Thirty minutes later, you realize that x% of your data is incorrect.
Unfortunately, I’ve been in this situation a few too many times. Creating datasets is hard, even for relatively simple tasks like document classification or sentiment analysis. Not only do you have to worry about things like annotator bias, but label noise is insidious. You can measure label noise with annotator agreement metrics**, but even a relatively high agreement score can allow for ~10% of your data to be wrong. A good example of this is SST5, where classification performance is significantly worse than on SST2.
This problem can be even worse if you don’t have clearly defined, orthogonal classes. It’s relatively easy to come up with a labelling scheme that has a large amount of overlap between classes, and this normally results in a lot of variance between annotators, as each annotator will overrepresent a single class. For example, if you’re building sentiment analysis models for survey results, how do you deal with the fact that 30% of your data might have customers complaining about your after-sales support but also complimenting the fact that you’re the most affordable option on the market? Do you switch to an aspect-based sentiment approach? If not, do you mark that sample as neutral? Irrespective of the judgement call that you make, how do you ensure that the annotators are aligned to make similar judgement calls?
A lot of companies I’ve done contract work for default to MTurk for annotations. I think that this is a particularly bad idea - MTurk can be very unreliable and needs a lot of effort to ensure that you get high quality annotations. With MTurk, you don’t have much ability to train the annotators and align them with your requirements, and while you can restrict the regions your workers come from, you can’t be sure that they have the knowledge that you need without directly testing it. Finally, if you do decide to test it, how do you ensure that the care and attention the annotators pay to the qualification tests will be maintained through every sample that they annotate?
In my experience, the best thing to do is hire a set of annotators. That’s hard too, but once you eventually figure out the hiring, here are some things that I’ve learned from managing annotators.
- Spend time training annotators. In particular, task-specific training is super useful. The best way I’ve found to do this is to walk them through some sample annotations, give them a set to do independently and then discuss what you think they could have done differently. I often need to do multiple iterations of this. Additionally, spending time communicating the overall project idea and helping them understand just how important the task is (despite the peanuts they’re paid) leads to better results.
- Put a lot of effort into your annotation guidelines. The easiest way to do this is to annotate a few hundred samples manually and document your thought process. I’ve found that drawing decision trees for a set of samples really helps communicate the thought process.
- Prepare to do annotations in cycles. Very often, getting a minimal set of data annotated and trying to build models on it will help you target the specific kinds of annotated data that you need. Even bad classifiers can help filter through a lot of data and save you annotation time. Annotating a lot of data all at once limits the kinds of things you can do with the data.
- In cases where you can’t directly calculate agreement, figure out ways to test annotators at irregular intervals. For example, Napoles et al. used human annotators to rate how well an error correction system was working, and they would present samples that they knew had errors in them at regular intervals to check if the annotators were paying enough attention.
- Frequent deadlines/followups. I used to expect a throughput of N samples/week. This led to significantly lower quality, as the annotators would rush to complete a week’s worth of samples in a day or less and make a lot of mistakes. Frequent check-ins every two to three days help. I’ve toyed with the idea of having the annotators use a web app and tracking database insertions to flag large spikes in activity - but for the moment it seems a little Orwellian.
- Pick the right tooling. I’ve been a big fan of just using Google Docs/Excel etc. because it seemed like the most straightforward solution. But when you have multiple annotators hopping on and off the project, a large pool of unlabelled data that needs to be stratified along multiple factors, and a specific overlap to maintain between annotators, handling all the different sheets is a nightmare. A lot of annotation tools exist - they all mostly do the same thing and they’re all relatively inflexible. Roll your own if you have to; it’s relatively easy with whatever web framework you’re familiar with.
- Annotator overlap. Roit et al. recommend having groups of three annotators, where two annotators create the annotations and a third annotator consolidates them to create the golden set. Unfortunately this didn’t work for our lab in practice - we found that the individual we had consolidating annotations was rather unreliable.
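The overlap bookkeeping from the tooling point above is the kind of thing that is painful in spreadsheets but short in code. Here is a minimal sketch, assuming you only need one shared subset that every annotator labels (so agreement can be computed on it) and that everything else goes to exactly one annotator. The function name `assign_with_overlap`, the `overlap_fraction` parameter and the annotator names are all made up for illustration, and stratification is left out.

```python
import random


def assign_with_overlap(sample_ids, annotators, overlap_fraction=0.1, seed=0):
    """Split samples across annotators; a shared subset goes to everyone."""
    rng = random.Random(seed)
    ids = list(sample_ids)
    rng.shuffle(ids)

    n_shared = int(len(ids) * overlap_fraction)
    shared, rest = ids[:n_shared], ids[n_shared:]

    # Every annotator labels the shared subset, so agreement can be measured on it.
    assignments = {name: list(shared) for name in annotators}

    # The remaining samples are dealt out round-robin, one annotator each.
    for i, sample_id in enumerate(rest):
        assignments[annotators[i % len(annotators)]].append(sample_id)

    # Shuffle each queue so the shared items are not recognisable by position.
    for queue in assignments.values():
        rng.shuffle(queue)
    return assignments, set(shared)


# Hypothetical usage: 100 samples, three annotators, 10% labelled by everyone.
assignments, shared = assign_with_overlap(
    range(100), ["ann_a", "ann_b", "ann_c"], overlap_fraction=0.1
)
```

When annotators hop on and off the project, you would re-run something like this on the still-unlabelled pool rather than editing sheets by hand.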
*This is a decent way to get an idea of your data: you essentially count the number of errors per batch and evaluate as many batches as you need until you get a stable average value. If you want to be very thorough, you can then run a significance test to make sure that the average is representative, but it usually is.
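As a rough sketch of that batch check: the code below keeps a running error rate over manually reviewed batches and stops once adding a batch moves it by less than a tolerance. The stopping rule, the `tolerance` value and the example counts are my own assumptions, not a standard recipe. Instead of a formal significance test, it uses a plain normal-approximation confidence interval as a quick sanity check on the estimate.

```python
import math


def running_error_rate(errors_per_batch, batch_size, tolerance=0.01):
    """Return (batches_used, error_rate) once the running rate stops moving."""
    total_errors = 0
    previous_rate = None
    for n_batches, errors in enumerate(errors_per_batch, start=1):
        total_errors += errors
        rate = total_errors / (n_batches * batch_size)
        # Stop when one more batch changes the estimate by less than `tolerance`.
        if previous_rate is not None and abs(rate - previous_rate) < tolerance:
            return n_batches, rate
        previous_rate = rate
    # Never stabilised: report the estimate over everything we were given.
    return len(errors_per_batch), previous_rate


def normal_interval(rate, n_samples, z=1.96):
    """Approximate 95% interval for a proportion (normal approximation)."""
    half_width = z * math.sqrt(rate * (1 - rate) / n_samples)
    return max(0.0, rate - half_width), min(1.0, rate + half_width)


# Made-up counts: wrong labels found in successive batches of 20 samples.
batches_used, rate = running_error_rate([3, 5, 4, 4, 4], batch_size=20)
low, high = normal_interval(rate, batches_used * 20)
```

A single small step can be a fluke, so in practice you may want the rate to stay put for a couple of batches before you trust it.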
**Agreement metrics essentially evaluate what proportion of your labels annotators agree about after adjusting for agreement based on pure chance.
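To make the chance adjustment concrete, here is a small sketch of Cohen's kappa, one common agreement metric for two annotators (I'm picking it as an example; the post isn't tied to a specific metric). It compares the observed agreement with the agreement you would expect if both annotators labelled at random according to their own label frequencies. The labels below are made up.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same samples."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Proportion of samples where the two annotators gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Agreement expected by chance, from each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


ann_a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg", "pos", "neg"]
ann_b = ["neg", "pos", "pos", "neg", "pos", "neg", "pos", "neg", "pos", "neg"]
# Observed agreement is 0.8, chance agreement is 0.5, so kappa is about 0.6.
kappa = cohens_kappa(ann_a, ann_b)
```

Note how 80% raw agreement shrinks to a kappa of roughly 0.6 once chance is accounted for, which is why raw agreement alone can look better than it is.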