Images in this NLP blog are borrowed from lecture note of Prof. Bal Krishna Bal.
The end-to-end structure that orchestrates the flow of data into and output from a NLP model is known as a natural language processing pipeline (or set of multiple models). It covers the input of the raw data, the features, the outputs, the machine learning model and model parameters, and the outputs of the predictions. NLP pipeline is also define as the step by step process of text.
Natural Language Processing Pipeline
- Data acquisition
- Text cleaning
- Feature engineering
- Monitoring and model updating
The process of sampling signals that gauge actual physical circumstances and transforming the resulting samples into digital numeric values that a computer can manipulate is known as data acquisition. In simple meaning data acquisition is the process of collecting, prepare data in the computer readable format. There are variety of data are in various sources. So, at first we should scrap data and pull them in the machine readable format and this process is known as data acquisition.
Synonym substitution (add texts with words replaced by their synonyms).
Back translation (adding to the source texts that are the target translation's back translation).
TF-IDF based word substitution (use TF-IDF results to replace words in the text)
The Bigram flip (add texts with flipped order of bigrams) / Swapping out entities (add texts with replaced entities)
Making data noisier (add words that have similar spellings)
Snorkel (which uses heuristics to generate synthetic data by altering current data and generating new data samples) (builds massive training data sets automatically without manual labeling). NLPAug and Simple Data Augmentation (libraries for creating synthetic samples for NLP)
Active Education (algorithm interactively queries a data point and gets its label)
Text Extraction and Cleanup
Process of separating the raw text from the other non-textual information, such as markup, metadata, etc., and converting the text to the necessary encoding format from the input data is called text extraction.
HTML Parsing and Cleanup:
- Writing a HTML parser
- Using existing libraries – Beautiful Soup, Scrapy
- Text encoding and conversion to machine-readable forms (such as emojis).
- There are various encoding schemes, and the default encoding for various operating systems may vary.
System-specific error correction(outputs of ASR or OCR that requires correction or post-editing)
Sentence segmentation and word tokenization: Process of splitting paragraph into sentence level or sentence into word level is called tokenization or word segmentation.
- Stop word removal, stemming and lemmatization, removing digits/punctuation, lowercasing, etc.
- Normalization, language detection, code mixing, transliteration, etc.
- POS tagging, parsing, coreference resolution, etc.
Stemming and Lemmatization
Stemming: Technique of stripping a word of its suffixes and reducing it to a basic form so that all of the word's variations can be represented by the same form.
Lemmatization: The process of mapping every word form to its root word, or lemma, is known as lemmatization.
Stemming and Lemmatization
Other Preprocessing Steps
- Text normalization
- Lowercase to uppercase conversion
- Text to digit conversion
- Abbreviation expansion
- Language recognition
- the usage of libraries like Polyglot
- Transliteration and code mixing
- Advanced pre-processing
- A group of techniques that convert textual traits into a numeric vector that machine learning algorithms can comprehend.
Classical NLP/ML Pipeline
The raw data is transformed into a machine-readable format during the feature engineering process. In the traditional ML pipeline, the transformation functions are often handmade and tailored to the task at hand.
The task at hand and subject expertise serve as major sources of inspiration for features.
A major benefit of handcrafted features is that they maintain the model's interpretability, making it easy to quantify the contribution of each feature to the model's prediction.
- The primary flaw in traditional ML models is feature engineering.
- Handcrafted feature engineering causes a bottleneck in the model development cycle and its performance.
After pre-processing, the raw data is delivered straight to a model in the DL pipeline. Although the model is capable of "learning" features from the data, the model lacks interpretability because all these features are taught via model parameters.
- Start with simple heuristics
- Building your model Pre-process your input to the ML model (use available NLP techniques before feeding your input to the ML model) by creating a feature from the heuristic.
- Model construction
- ensemble and stacking
- Improved feature engineering
- Transfer learning
- Reapplication of heuristics
- Measuring the quality of the model we've developed is an important stage in the NLP process.
- A measure of how well the model performs on unobserved data.
- Using the proper evaluation metric and adhering to the proper evaluation method are essential for success.
- There are two different types of evaluation 1.intrinsic 2. extrinsic. The former focuses on measuring the system's performance using precision and recall, while the latter evaluates performance in relation to the goal.
Post modeling phases
- Once our model has been tried and tested, we move on to the post modeling phase,
3) model updating