How to augment data for NLP
Today’s article is a more technical one, for those interested in enhancing their hands-on experience with natural language processing. Deep neural networks for NLP typically need large amounts of labeled data to reach satisfactory prediction accuracy, because heavily parameterized models overfit small training sets. When collecting more data is impractical, a variety of data augmentation techniques let you create multiple modified versions of each input, effectively multiplying your existing small dataset.
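To make this concrete, two of the simplest token-level augmentation techniques, random swap and random deletion, can be sketched in plain Python. This is a minimal illustration, not code from the survey; the function names and parameters are my own.

```python
import random

def random_swap(words, n, rng):
    # Swap the positions of two randomly chosen words, n times.
    words = words[:]
    for _ in range(n):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, p, rng):
    # Drop each word independently with probability p,
    # keeping at least one word so the example stays non-empty.
    kept = [w for w in words if rng.random() > p]
    return kept if kept else [rng.choice(words)]

def augment(sentence, n_copies=4, seed=0):
    # Produce several perturbed variants of one input sentence.
    rng = random.Random(seed)
    words = sentence.split()
    variants = []
    for _ in range(n_copies):
        v = random_swap(words, n=1, rng=rng)
        v = random_deletion(v, p=0.1, rng=rng)
        variants.append(" ".join(v))
    return variants

variants = augment("data augmentation multiplies a small labeled dataset")
for v in variants:
    print(v)
```

Each variant keeps the sentence’s words (minus occasional deletions) while perturbing their order, so a classifier sees several slightly different training examples per original sentence. Stronger techniques from the literature, such as synonym replacement or back-translation, follow the same pattern but require a thesaurus or a translation model.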
Researchers at Cornell have recently compiled methods, motivations and a list of available techniques in a concise paper, along with a GitHub repository. The latter contains additional papers grouped by topics such as: text classification, translation, summarization, question answering, sequence tagging, parsing, grammatical error correction, generation, dialogue, multimodal, mitigating bias, mitigating class imbalance, and adversarial examples.
GitHub repository: https://github.com/styfeng/DataAug4NLP