makcedward/nlpaug

nlpaug

This Python library helps you augment text, audio, and spectrogram data for machine learning projects. Visit this introduction to learn about data augmentation in NLP. An Augmenter is the basic unit of augmentation, while a Flow is a pipeline that orchestrates multiple augmenters.
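To make the term "augmenter" concrete, here is a minimal standalone sketch. It does not use nlpaug: `SwapCharAug` and its `augment` method are hypothetical names chosen for illustration, and the library's own classes may differ in both signature and return type.

```python
import random


class SwapCharAug:
    """Hypothetical stand-in for an augmenter (not an nlpaug class).

    It swaps one randomly chosen pair of adjacent characters.
    """

    def __init__(self, seed=None):
        # A private random generator keeps runs reproducible when seeded.
        self.rng = random.Random(seed)

    def augment(self, text):
        # Texts shorter than two characters have nothing to swap.
        if len(text) < 2:
            return text
        i = self.rng.randrange(len(text) - 1)
        chars = list(text)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
        return "".join(chars)


original = "The quick brown fox"
aug = SwapCharAug(seed=0)
augmented = aug.augment(original)
print(augmented)
```

The output always contains the same characters as the input, only in a slightly different order, which is the kind of small, label-preserving perturbation augmentation relies on.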

Features

  • Generate synthetic data for improving model performance without manual effort
  • Simple, easy-to-use and lightweight library. Augment data in a few lines of code
  • Plug and play with common machine learning and neural network frameworks
  • Support textual, audio, and spectrogram inputs
  • Python 3.12-ready V2 baseline with offline-first tests and GitHub Actions coverage

Textual Data Augmentation Example


Acoustic Data Augmentation Example


| Section | Description |
| --- | --- |
| Quick Demo | How to use this library |
| Augmenter | Introduction to all available augmentation methods |
| Installation | How to install this library |
| Recent Changes | Latest enhancements |
| Extension Reading | More real-life examples and research |
| Reference | External resources such as data and models |

Quick Demo

Augmenter

| Type | Target | Augmenter | Action | Description |
| --- | --- | --- | --- | --- |
| Textual | Character | KeyboardAug | substitute | Simulate keyboard distance errors |
| Textual | Character | OcrAug | substitute | Simulate OCR engine errors |
| Textual | Character | RandomAug | insert, substitute, swap, delete | Apply augmentation randomly |
| Textual | Word | AntonymAug | substitute | Substitute a word with its opposite according to WordNet antonyms |
| Textual | Word | ContextualWordEmbsAug | insert, substitute | Feed surrounding words to a BERT, DistilBERT, RoBERTa or XLNet language model to find the most suitable word for augmentation |
| Textual | Word | RandomWordAug | swap, crop, delete | Apply augmentation randomly |
| Textual | Word | SpellingAug | substitute | Substitute a word according to a spelling mistake dictionary |
| Textual | Word | SplitAug | split | Split one word into two words randomly |
| Textual | Word | SynonymAug | substitute | Substitute a similar word according to WordNet/PPDB synonyms |
| Textual | Word | TfIdfAug | insert, substitute | Use TF-IDF to decide how a word should be augmented |
| Textual | Word | WordEmbsAug | insert, substitute | Leverage word2vec, GloVe or fasttext embeddings to apply augmentation |
| Textual | Word | BackTranslationAug | substitute | Leverage two translation models for augmentation |
| Textual | Word | ReservedAug | substitute | Replace reserved words |
| Textual | Sentence | ContextualWordEmbsForSentenceAug | insert | Insert a sentence according to XLNet, GPT2 or DistilGPT2 prediction |
| Textual | Sentence | AbstSummAug | substitute | Summarize an article with an abstractive summarization method |
| Textual | Sentence | LambadaAug | substitute | Use a language model to generate text, then a classification model to retain high-quality results |
| Signal | Audio | CropAug | delete | Delete a segment of the audio |
| Signal | Audio | LoudnessAug | substitute | Adjust the audio's volume |
| Signal | Audio | MaskAug | substitute | Mask a segment of the audio |
| Signal | Audio | NoiseAug | substitute | Inject noise |
| Signal | Audio | PitchAug | substitute | Adjust the audio's pitch |
| Signal | Audio | ShiftAug | substitute | Shift the time dimension forward/backward |
| Signal | Audio | SpeedAug | substitute | Adjust the audio's speed |
| Signal | Audio | VtlpAug | substitute | Change vocal tract |
| Signal | Audio | NormalizeAug | substitute | Normalize audio |
| Signal | Audio | PolarityInverseAug | substitute | Swap positive and negative for audio |
| Signal | Spectrogram | FrequencyMaskingAug | substitute | Set a block of values to zero along the frequency dimension |
| Signal | Spectrogram | TimeMaskingAug | substitute | Set a block of values to zero along the time dimension |
| Signal | Spectrogram | LoudnessAug | substitute | Adjust volume |
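The two spectrogram masking rows describe the same idea applied along different axes. The following is a standalone sketch of that idea in plain Python, not the library's implementation: `mask_block`, its parameters, and the list-of-lists spectrogram layout (rows are frequency bins, columns are time frames) are assumptions made for illustration.

```python
import random


def mask_block(spectrogram, axis, width, rng):
    """Return a copy with `width` consecutive rows (axis=0, frequency)
    or columns (axis=1, time) set to zero. Hypothetical helper."""
    n_freq, n_time = len(spectrogram), len(spectrogram[0])
    size = n_freq if axis == 0 else n_time
    width = min(width, size)
    # Pick a start index so the whole block fits inside the axis.
    start = rng.randrange(size - width + 1)
    masked = [row[:] for row in spectrogram]  # leave the input untouched
    for f in range(n_freq):
        for t in range(n_time):
            idx = f if axis == 0 else t
            if start <= idx < start + width:
                masked[f][t] = 0.0
    return masked


rng = random.Random(0)
spec = [[1.0] * 6 for _ in range(4)]  # 4 frequency bins x 6 time frames
freq_masked = mask_block(spec, axis=0, width=2, rng=rng)
time_masked = mask_block(spec, axis=1, width=3, rng=rng)

for row in freq_masked:
    print(row)
```

Masking two frequency bins zeroes two full rows, while masking three time frames zeroes three full columns; the original spectrogram is left unchanged.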

Flow

| Type | Flow | Description |
| --- | --- | --- |
| Pipeline | Sequential | Apply a list of augmentation functions sequentially |
| Pipeline | Sometimes | Apply some augmentation functions randomly |
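The difference between the two flows can be sketched with plain functions. This is an illustration of the concept only, not nlpaug's API: `sequential`, `sometimes`, `drop_first_word`, and `upper` are hypothetical names, and applying each augmenter independently with probability `prob` is an assumption about what "randomly" means here.

```python
import random


def sequential(augmenters):
    """Build a flow that applies every augmenter in order."""
    def run(text):
        for aug in augmenters:
            text = aug(text)
        return text
    return run


def sometimes(augmenters, prob, rng):
    """Build a flow that applies each augmenter with probability `prob`."""
    def run(text):
        for aug in augmenters:
            if rng.random() < prob:
                text = aug(text)
        return text
    return run


# Two toy augmenters used only to make the flows observable.
def drop_first_word(text):
    return " ".join(text.split()[1:])


def upper(text):
    return text.upper()


always = sequential([drop_first_word, upper])
print(always("the quick brown fox"))  # QUICK BROWN FOX

maybe = sometimes([drop_first_word, upper], prob=0.5, rng=random.Random(0))
print(maybe("the quick brown fox"))
```

With `prob=1.0` the random flow behaves like the sequential one, and with `prob=0.0` it returns the input unchanged.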

Installation

The library targets Python 3.12+.

Install the core package:

pip install nlpaug

Install feature extras as needed:

pip install "nlpaug[transformers]"
pip install "nlpaug[nltk]"
pip install "nlpaug[word-embs]"
pip install "nlpaug[audio]"
pip install "nlpaug[lambada]"
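To check which optional dependencies are already importable before picking extras, a small standard-library sketch like the one below may help. The mapping from extra name to module name is an assumption (the modules behind each extra are not listed here, and `lambada` is left out for that reason); adjust it to match the package metadata.

```python
from importlib.util import find_spec

# Assumed mapping from extra name to a module it is expected to provide.
# These pairs are guesses for illustration, not taken from the package.
EXTRA_MODULES = {
    "transformers": "transformers",
    "nltk": "nltk",
    "word-embs": "gensim",
    "audio": "librosa",
}


def available_extras(mapping=EXTRA_MODULES):
    """Return {extra: True/False} depending on whether its module is importable."""
    return {
        extra: find_spec(module) is not None
        for extra, module in mapping.items()
    }


for extra, ok in available_extras().items():
    print(f"{extra}: {'installed' if ok else 'missing'}")
```

`find_spec` only looks a module up without importing it, so the check stays fast even when heavy packages are installed.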

Install the latest GitHub version:

pip install "git+https://github.com/makcedward/nlpaug.git"

If you use WordEmbsAug (word2vec, glove or fasttext), download the pretrained assets first:

from nlpaug.util.file.download import DownloadUtil
DownloadUtil.download_word2vec(dest_dir='.')
DownloadUtil.download_glove(model_name='glove.6B', dest_dir='.')
DownloadUtil.download_fasttext(model_name='wiki-news-300d-1M', dest_dir='.')

If you use SynonymAug with PPDB, download the language pack from:

http://paraphrase.org/#/download

Testing

Run the default offline suite:

python -m pytest

Run the optional integration suite:

python -m pytest -m integration

uv workflow

Create a Python 3.12 environment and run the default suite:

make test

Install the common optional extras and run the fuller local suite:

make test-full

Install the heaviest optional extras and run only integration tests:

make test-integration

If you prefer direct uv commands:

uv venv --python 3.12
uv pip install -p .venv/bin/python -e ".[dev]"
uv run --python .venv/bin/python pytest

Or use the repo scripts directly:

./scripts/setup_uv.sh core
./scripts/test_uv.sh core
./scripts/setup_uv.sh full
./scripts/test_uv.sh full
./scripts/setup_uv.sh integration
./scripts/test_uv.sh integration

Recent Changes

2.0.0 Jun 2026

  • Upgrade runtime baseline to Python 3.12+
  • Refresh major optional dependencies, including transformers 5.9, gensim 4.4, librosa 0.11, and NumPy 2.x
  • Add uv-based setup and test scripts for core, full, and integration workflows
  • Modernize offline-first tests so the default suite runs without downloading real models
  • Mock transformer-backed augmenters in tests and add broader regression coverage
  • Add GitHub Actions coverage reporting and local coverage scripts
  • Refactor shared augmenter hot paths and sentence generation internals for better readability and performance

1.1.11 Jul 6, 2022

See changelog for more details.

Extension Reading

Reference

This library draws on external data (e.g. collected from the internet), research (e.g. the ideas behind individual augmenters), and models (e.g. pre-trained models). See data source for more details.

Citation

@misc{ma2019nlpaug,
  title={NLP Augmentation},
  author={Edward Ma},
  howpublished={https://github.com/makcedward/nlpaug},
  year={2019}
}

This package is cited by many books, workshops, and academic research papers (70+). Here are some examples; you may visit here to get the full list.

Workshops citing nlpaug

Books citing nlpaug

Research papers citing nlpaug

Contributions


sakares saengkaew


Binoy Dalal


Emrecan Çelik