NLP tools in Python

Libraries

I need a few libraries for NLP, and each of them is very powerful. I installed all of them with pip, using pip install -U {package}. The last section summarizes the libraries in a requirements file so that I can install them all at once later.

  • spaCy: Open source NLP library.
  • NLTK: Natural Language ToolKit. It is older than spaCy (spaCy 2015~, NLTK 2001~).
  • gensim: Topic modeling and document embedding library. I installed it for Doc2Vec.
  • TensorFlow: For building custom machine learning models; it includes Keras.
  • Scikit-learn: Machine learning library.
  • SciPy: For fast and easy linear algebra operations.
  • Pandas: Easy data manipulations.
  • Jupyter notebook: Interactive notebooks. Covered in another post.
  • PyPDF2: Extracts text from PDF files. Covered in another post.

Notes: Anaconda, Conda, pip, and pyenv

If you are familiar with Python, I recommend using pyenv to manage your Python environment. If you are a beginner, however, I highly recommend Anaconda instead. Otherwise you may get bogged down in environment setup and lose sight of your actual goal (learning NLP).

Anaconda and pip can sometimes conflict. If you want to know the details, go to this page.

Summary

The contents of pip_install_nlp37.txt:

jupyter
matplotlib
scipy
numpy
PyPDF2
spacy
scikit-learn
gensim
tensorflow
pandas
nltk

Install them with pip install -r pip_install_nlp37.txt.
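
After the install finishes, it can be handy to confirm that everything is actually available in the current environment. The following is only a minimal sketch: the PACKAGES mapping and the missing_packages helper are my own illustrative names, and using jupyter_core as the module to look up for the jupyter package is an assumption. Note that the pip name and the import name differ for some packages (scikit-learn is imported as sklearn).

```python
import importlib.util

# pip package name -> module name used in `import` (they sometimes differ).
# "jupyter_core" is an assumed stand-in for the jupyter metapackage.
PACKAGES = {
    "jupyter": "jupyter_core",
    "matplotlib": "matplotlib",
    "scipy": "scipy",
    "numpy": "numpy",
    "PyPDF2": "PyPDF2",
    "spacy": "spacy",
    "scikit-learn": "sklearn",
    "gensim": "gensim",
    "tensorflow": "tensorflow",
    "pandas": "pandas",
    "nltk": "nltk",
}


def missing_packages(packages):
    """Return the pip names whose module cannot be found in this environment."""
    return [
        pip_name
        for pip_name, module_name in packages.items()
        if importlib.util.find_spec(module_name) is None
    ]


if __name__ == "__main__":
    missing = missing_packages(PACKAGES)
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All packages are installed.")
```

The check only looks the modules up without importing them, so it stays fast even for heavy packages such as tensorflow.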

I also installed a few spaCy models.

python -m spacy download en_core_web_sm
python -m spacy download de_core_news_sm
python -m spacy download en_core_web_md
python -m spacy download en_core_web_lg
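
spaCy models are installed as ordinary Python packages, so one way to see which of the downloads above succeeded is to look them up by name. This is a rough sketch rather than an official recipe: installed_models and tag_tokens are hypothetical helper names, and the sample sentence is made up. The spacy.load call and the token.text and token.pos_ attributes are part of spaCy's regular API.

```python
import importlib.util

# The model names downloaded above.
MODELS = ["en_core_web_sm", "de_core_news_sm", "en_core_web_md", "en_core_web_lg"]


def installed_models(names):
    """Return the model names that are importable as packages."""
    return [name for name in names if importlib.util.find_spec(name) is not None]


def tag_tokens(text, model_name="en_core_web_sm"):
    """Load a spaCy model and return (token, part-of-speech) pairs."""
    import spacy  # imported lazily so the check above works without spaCy

    nlp = spacy.load(model_name)
    return [(token.text, token.pos_) for token in nlp(text)]


if __name__ == "__main__":
    available = installed_models(MODELS)
    print("Installed spaCy models:", available)
    if "en_core_web_sm" in available:
        print(tag_tokens("spaCy is an open source NLP library."))
```

The sm, md, and lg suffixes are the small, medium, and large variants of a model, so the small one is usually enough for a quick smoke test like this.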