nlp

Set up Ubuntu 20.04 server with Nvidia GPU (and spaCy)

After OS install - basic update

Upgrade software:

sudo su
apt update
apt upgrade -y

SSH hardening:

ssh user@newserver.com
mkdir .ssh
chmod 700 .ssh
# copy id_ecdsa.pub into .ssh/authorized_keys
chmod 600 .ssh/authorized_keys

In /etc/ssh/sshd_config, add the following configurations:

PermitRootLogin no
PubkeyAuthentication yes
PasswordAuthentication no

After configuration, restart sshd and change the root password:

sudo systemctl restart sshd
sudo su
passwd  # change the root password

Firewall - ufw

https://ubuntu.com/engage/20.04-webinars

This comes through the constant security patching process and new features like the Ubuntu Server Live installer, iptables to nftables migration, and more resilient boot loader.

TF-IDF

Definition

TF: Term Frequency. IDF: Inverse Document Frequency. A document (sentence) is split into words.

Terms

Capital $D$: a corpus. Subscripted $D_i$ or lowercase $d_i$: the i-th document in the corpus.

TF

https://en.wikipedia.org/wiki/Tf–idf

This term does not depend on the corpus (it is defined per document).

Raw count

There are several definitions of "term frequency." The raw count is

$$ f_{t,d_i} := \mathrm{the\ number\ of\ occurrences\ of\ the\ term\ } t \mathrm{\ in\ the\ document\ } d_i $$

The simplest and most "meaningful" form for NLP is the frequency form,
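As a small illustration of these definitions, here is a minimal sketch that computes a TF-IDF score directly from the raw count above. The toy corpus, the function names, the frequency-form TF (raw count divided by document length), and the particular IDF variant log(N / df) are my own assumptions for the example, not the only possible choices.

```python
import math
from collections import Counter

# Toy corpus: each document is a list of words (contents are illustrative).
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "ate", "my", "homework"],
    ["the", "cat", "ate", "the", "fish"],
]

def term_frequency(term, document):
    """Frequency-form TF: raw count of `term` divided by the document length."""
    counts = Counter(document)
    return counts[term] / len(document)

def inverse_document_frequency(term, corpus):
    """IDF as log(N / df), where df is the number of documents containing `term`
    (assumes `term` appears in at least one document)."""
    n_docs = len(corpus)
    df = sum(1 for doc in corpus if term in doc)
    return math.log(n_docs / df)

def tf_idf(term, document, corpus):
    return term_frequency(term, document) * inverse_document_frequency(term, corpus)

# "the" appears in every document, so its IDF (and TF-IDF) is 0;
# "cat" appears in only two documents and gets a positive score.
print(tf_idf("the", corpus[0], corpus))  # 0.0
print(tf_idf("cat", corpus[2], corpus))  # ~0.081
```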

MeCab

Install

Basic install:

pip install -U mecab-python3

Download dictionary

https://pypi.org/project/mecab-python3/

These wheels include an internal (statically linked) copy of the MeCab library, and a copy of the mecab-ipadic dictionary (using UTF-8 text encoding), which is automatically used by default. If you wish to use a different dictionary, you will need to install it yourself, write a mecabrc file directing MeCab to use it, and set the environment variable MECABRC to point to this file.
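For reference, a minimal usage sketch after the install above, assuming the wheel ships the bundled mecab-ipadic dictionary described in the quote (the example sentence is arbitrary):

```python
import MeCab  # provided by the mecab-python3 wheel

# With no arguments, the default (bundled) dictionary is used.
tagger = MeCab.Tagger()

# Parse a Japanese sentence: one token per line with its features.
print(tagger.parse("すもももももももものうち"))

# "-Owakati" outputs space-separated tokens only (word segmentation).
wakati = MeCab.Tagger("-Owakati")
print(wakati.parse("すもももももももものうち").split())
```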

Word2vec

This is a draft. The most important topic in NLP, I think. In 2020, NLP is the study of trying to understand human languages with computers. From my common sense, it is pragmatically impossible for humans to "understand" our languages in numbers. Yes, of course, proponents of scientism could say "our world consists of quantum mechanical particles, and if we had enough machine power we could simulate a human, so language is understandable with numbers.

Install GiNZA on macOS (April 2020)

Install on 3.8.1 - failed

As of April 2020, I tried to install GiNZA on my macOS laptop with Python 3.8.1 (pyenv-virtualenv):

pip install -U ginza

but it failed and returned an error like,

...
Collecting ja_ginza_dict<3.2.0,>=3.1.0
  Using cached ja_ginza_dict-3.1.0-1.tar.gz (44.8 MB)
    ERROR: Command errored out with exit status 1:
     command: /Users/atlex/.pyenv/versions/3.8.1/envs/nlp/bin/python3.8 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/hf/kz2h5f215jx73h3vl_d423pr0000gn/T/pip-install-_gi9fc1z/ja-ginza-dict/setup.py'"'"'; __file__='"'"'/private/var/folders/hf/kz2h5f215jx73h3vl_d423pr0000gn/T/pip-install-_gi9fc1z/ja-ginza-dict/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /private/var/folders/hf/kz2h5f215jx73h3vl_d423pr0000gn/T/pip-install-_gi9fc1z/ja-ginza-dict/pip-egg-info
     cwd: /private/var/folders/hf/kz2h5f215jx73h3vl_d423pr0000gn/T/pip-install-_gi9fc1z/ja-ginza-dict/
    Complete output (19 lines):
    Traceback (most recent call last):
      File "/Users/atlex/.

Py2PDF - read text from PDF file

Concept

There are many useful PDF documents on the internet. These are very useful when we collect data for training. PyPDF2 is a library for manipulating PDF files via Python.

PyPDF2 Official Documentation

Install

You can install PyPDF2 via pip.

pip install PyPDF2

How to use - read a PDF file

PdfFileReader Class - Official PyPDF2 document

We should open the file in mode rb and read it with PyPDF2.PdfFileReader(file_object).

import PyPDF2
with open("sample.
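A minimal sketch of the read step, assuming the classic (pre-3.0) PyPDF2 API named above and a placeholder file sample.pdf:

```python
import PyPDF2

# Open the PDF in binary mode ("rb"), then read it with PdfFileReader.
with open("sample.pdf", "rb") as f:       # "sample.pdf" is a placeholder filename
    reader = PyPDF2.PdfFileReader(f)
    print(reader.getNumPages())           # number of pages in the PDF

    # Extract the text of the first page.
    first_page = reader.getPage(0)
    print(first_page.extractText())
```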

SpaCy

What is spaCy

In the README of the GitHub project, there is a description of what spaCy is.

spaCy: Industrial-strength NLP

spaCy is a library for advanced Natural Language Processing in Python and Cython. It's built on the very latest research, and was designed from day one to be used in real products. spaCy comes with pretrained statistical models and word vectors, and currently supports tokenization for 50+ languages. It features state-of-the-art speed, convolutional neural network models for tagging, parsing and named entity recognition and easy deep learning integration.
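As a small illustration of those features, a minimal sketch assuming the en_core_web_sm English model has been downloaded (the example sentence is arbitrary):

```python
import spacy

# Load the small English pipeline; install it first with:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

# Tokenization, part-of-speech tagging, and dependency labels
for token in doc:
    print(token.text, token.pos_, token.dep_)

# Named entity recognition
for ent in doc.ents:
    print(ent.text, ent.label_)
```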

String manipulation with Python (for NLP)

f-strings

Concept

The official name of f-strings is "formatted string literal." An f-string is a "modern" way to put the values of variables into strings (as of Feb. 2020). Before f-strings appeared, we used the format method. For me, f-strings are much more intuitive than the format method.

Formatted string literals - Python official document
format - Python official document

Simple example

Both print lines in the following code print the string "It's me, Mario".
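One possible version of such a pair of print lines (the variable name is my own choice for the example); both produce the same output:

```python
name = "Mario"

# Older style: str.format fills the {} placeholder with the argument.
print("It's me, {}".format(name))

# f-string (Python 3.6+): the expression inside {} is evaluated in place.
print(f"It's me, {name}")
```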

NLP tools in Python

Libraries

I need a few libraries for NLP, and each of them is very powerful. I installed all of these libraries via pip, like pip install -U {package}. In the last section, I summarized the libraries so I can install them all at once later.

spaCy: Open-source NLP library.
NLTK: Natural Language ToolKit. It is older than spaCy (spaCy 2015~, NLTK 2001~).
gensim: NLP tools. I installed it for Doc2Vec.
TensorFlow: For custom models of machine learning, including Keras.
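A quick sanity check after installing them: a minimal sketch assuming only that the four packages listed above are importable and expose __version__:

```python
# Verify that the libraries installed via pip can be imported, and print their versions.
import spacy
import nltk
import gensim
import tensorflow as tf

for name, module in [("spaCy", spacy), ("NLTK", nltk), ("gensim", gensim), ("TensorFlow", tf)]:
    print(f"{name}: {module.__version__}")
```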