If you are a data scientist, or aspire to be one, investing your time in learning natural language processing (NLP) is an investment in your future. 2020 saw a surge of activity in the field of natural language processing. In this blog post you will discover 5 popular NLP libraries and their applications.
Preprocessing is a crucial step in any machine learning pipeline. If you are building a language model, you first have to turn raw text into word vectors, which typically involves removing stop words and reducing each word to its root form.
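To make the two preprocessing steps concrete, here is a minimal sketch in plain Python. The stop-word list, the suffix rules, and the helper names `naive_stem` and `preprocess` are illustrative assumptions, not part of any library; the libraries covered in this post use much larger word lists and proper linguistic rules.

```python
import re

# Tiny illustrative stop-word list (an assumption; real libraries ship larger ones).
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "were", "in", "on", "of", "and", "to"}

# Suffixes stripped by this toy stemmer, checked in this order.
SUFFIXES = ("ing", "ed", "s")


def naive_stem(word):
    """Strip one common suffix; a crude stand-in for a real stemmer or lemmatizer."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left untouched.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenize, drop stop words, and reduce each token to a rough root."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(token) for token in tokens if token not in STOP_WORDS]


print(preprocess("The cats are walking in the garden"))
# prints ['cat', 'walk', 'garden']
```

Suffix stripping like this is quick but crude. A lemmatizer maps each word to a real dictionary form instead, which is why production pipelines usually rely on a library.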
spaCy is a popular Python library for tokenization, sentence segmentation, and lemmatization. (It does not ship a stemmer; it reduces words to their root form through lemmatization instead.) It is an industry-grade library that can be used for text preprocessing and for training deep learning based text classifiers.
Getting started with spaCy: Named Entity Recognition (NER) is an important task in natural language processing. NER helps in extracting important entities such as locations, organization names, etc.
    import spacy

    # python -m spacy download en_core_web_sm
    nlp = spacy.load('en_core_web_sm')

    sentences = ['Stockholm is a beautiful city', 'Mumbai is a vibrant city']

    for sentence in sentences:
        doc = nlp(sentence)
        for entity in doc.ents:
            print(entity.text, entity.label_)
            print(spacy.explain(entity.label_))
The above code processes the two sentences and prints every named entity it finds, together with the entity's label and a short explanation of that label.
In the output, the code extracts Stockholm and Mumbai and associates them with the GPE label, which indicates countries, cities, or states.
NLTK is another popular Python library for text preprocessing. It was started as an academic project and soon became very popular amongst researchers and academicians.
Let us see how we can do part-of-speech tagging using NLTK. Part-of-speech tagging labels each word in a sentence with its grammatical category, such as noun, pronoun, adverb, or adjective.
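    import nltk

    # Requires the NLTK data packages for tokenization, tagging, and chunking,
    # installed once via nltk.download(...); resource names vary by NLTK version.
    sentence = "Python is a beautiful programming language."

    tokens = nltk.word_tokenize(sentence)   # split the sentence into word tokens
    tagged = nltk.pos_tag(tokens)           # attach a part-of-speech tag to each token
    entities = nltk.chunk.ne_chunk(tagged)  # group tagged tokens into named-entity chunks
    print(entities)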
The output lists each token of the sentence together with the part-of-speech tag assigned to it.
A popular application of NLP is to categorize a document into a given set of labels. There are a number of Python libraries that can help you train deep learning based models for topic modeling, text summarization, sentiment analysis, and similar tasks. Let us have a look at some of these popular libraries.
Most deep learning based NLP models rely on pretrained language models through a process called transfer learning: a model is first trained on a huge corpus of documents, and that model is then fine-tuned for a specific domain. Some popular libraries that help in using pretrained models and building industry-grade NLP applications are described below.
FARM is a popular open source package developed by a Berlin-based company. It makes the life of developers easier by providing features such as experiment tracking, multitask learning, and parallelized processing of documents.
Flair is a popular PyTorch-based framework that helps developers build state-of-the-art NLP applications such as named entity recognition, part-of-speech tagging, word sense disambiguation, and classification.
Transformers is a popular Python library that gives easy access to pretrained models and supports both PyTorch and TensorFlow. If you want to build an entire NLP pipeline using pretrained models for natural language understanding and generation tasks, Transformers will make your life easier.
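As a rough illustration of how little code a pretrained model needs, the sketch below wraps the library's `pipeline` helper for sentiment analysis. The wrapper name `classify_sentiment` is hypothetical, and the example assumes `transformers` and one of its backends are installed and that the default model can be downloaded on first use.

```python
def classify_sentiment(texts):
    """Return a (label, score) pair for each input text."""
    # Imported inside the function because transformers is a large dependency
    # and must be installed separately (together with PyTorch or TensorFlow).
    from transformers import pipeline

    # With no model argument the library picks a default pretrained model
    # for the task and downloads it on first use.
    classifier = pipeline("sentiment-analysis")
    return [(result["label"], result["score"]) for result in classifier(texts)]
```

Calling `classify_sentiment(["I love this library"])` returns one label and confidence score per input. For anything beyond a quick experiment you would probably pass an explicit model name to `pipeline` so that results stay reproducible.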
Gensim is another popular Python library, widely used for topic modelling. It provides an easy-to-use interface to popular algorithms such as word2vec, which can be used to find words with similar meanings.
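The idea behind finding similar words with word vectors can be sketched without any library: each word is a vector, and the closest vectors by cosine similarity are treated as the most similar words. The vectors and the helper names `cosine_similarity` and `nearest_words` below are made up for illustration and are not Gensim's API; a trained word2vec model learns its vectors from a large corpus.

```python
import math

# Hand-made 3-dimensional vectors, purely illustrative; a trained word2vec
# model learns vectors with many more dimensions from a large corpus.
VECTORS = {
    "king": [0.90, 0.80, 0.10],
    "queen": [0.85, 0.82, 0.15],
    "apple": [0.10, 0.20, 0.90],
    "banana": [0.12, 0.18, 0.95],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: values near 1.0 mean very similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_words(word, vectors, topn=1):
    """Rank every other word by its cosine similarity to `word`."""
    scores = [
        (other, cosine_similarity(vectors[word], vector))
        for other, vector in vectors.items()
        if other != word
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]


print(nearest_words("king", VECTORS)[0][0])   # prints queen
print(nearest_words("apple", VECTORS)[0][0])  # prints banana
```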