NLP

Enabling TensorFlow using GPU

We can install TensorFlow easily via pip, but enabling GPU support takes a little more care.
Requirements
https://www.tensorflow.org/install/gpu#software_requirements
Here is how I set up my NVIDIA GPU environment.
Install prerequisites
    sudo apt-get install libcupti-dev    # already installed in my case
    echo 'export LD_LIBRARY_PATH=/usr/local/cuda/extras/CUPTI/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
    source ~/.bashrc
Install cuDNN
Download a compatible version from https://developer.nvidia.com/rdp/cudnn-download.
    tar -xzvf cudnn-10.2-linux-x64-v8.0.1.13.tgz
    sudo cp cuda/include/cudnn*.h /usr/local/cuda/include
    sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
    sudo chmod a+r /usr/local/cuda/include/cudnn*.
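Once the libraries are in place, it helps to confirm that TensorFlow actually sees the GPU. The following is a minimal sketch, assuming TensorFlow 2.1 or later (where `tf.config.list_physical_devices` is available). The helper name `gpu_device_names` is hypothetical, and it returns an empty list when TensorFlow is not installed.

```python
def gpu_device_names():
    """Return the names of the GPUs visible to TensorFlow (empty list if none)."""
    try:
        # Imported lazily so the check fails softly when TensorFlow is missing.
        import tensorflow as tf
    except ImportError:
        return []
    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    names = gpu_device_names()
    print("GPUs visible to TensorFlow:", names if names else "none")
```

An empty result after a GPU install usually means the CUDA or cuDNN libraries were not found on the library path, so the exports above are the first thing to re-check.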

TensorBoard

We can easily visualize neural networks written in TensorFlow as a graph with TensorBoard (it can actually do more than that).
https://www.tensorflow.org/tensorboard/get_started
Install
As of 2020/07/09, TensorBoard is installed along with TensorFlow when you install TensorFlow with pip.
    pip install -U tensorboard
There is no need to run the command above: TensorBoard is already present, and installing it again can conflict with that copy and cause problems.
Simple sample code
First, create a simple model.
    mnist = tf.keras.datasets.mnist
    (x_train, y_train),(x_test, y_test) = mnist.load_data()
    x_train, x_test = x_train / 255.
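To connect a Keras model to TensorBoard, training needs a log directory and a callback that writes to it. The sketch below assumes the timestamped `logs/fit/...` layout used in the official get_started guide. The helper names `make_log_dir` and `make_tensorboard_callback` are hypothetical, and TensorFlow is imported only when the callback is actually built.

```python
import datetime
import os


def make_log_dir(base="logs/fit", now=None):
    """Build a timestamped log directory path such as logs/fit/20200709-120000."""
    now = now or datetime.datetime.now()
    return os.path.join(base, now.strftime("%Y%m%d-%H%M%S"))


def make_tensorboard_callback(log_dir):
    """Create a Keras callback that writes TensorBoard logs to log_dir."""
    import tensorflow as tf  # lazy import: only needed when training

    return tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)


# Usage (assumes a compiled Keras model and training data already exist):
#   callback = make_tensorboard_callback(make_log_dir())
#   model.fit(x_train, y_train, epochs=5, callbacks=[callback])
# Then start the UI from a shell with: tensorboard --logdir logs/fit
```

A separate directory per run keeps earlier runs visible side by side in the TensorBoard UI.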

TensorFlow architecture

Official
https://github.com/tensorflow/docs/blob/master/site/en/r1/guide/extend/architecture.md
TensorFlow has master, client, and worker components. If this makes you think of a distributed system, you are right: TensorFlow is designed to run as a cluster.
Distributed TensorFlow
Here is the official document about distributed TensorFlow, with sample code.
https://github.com/tensorflow/examples/blob/master/community/en/docs/deploy/distributed.md
Deprecated: the link has expired.
Another sample
Here is sample cluster code by IONOS (one of the biggest German ISPs).
https://www.ionos.de/community/server-cloud-infrastructure/tensorflow/einrichten-eines-verteilten-tensorflow-clusters-auf-cloud-servern/
You can see that there are parameter servers and worker servers.
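The split into parameter servers and workers can be written down as a cluster description. The sketch below builds the JSON layout that TensorFlow reads from the `TF_CONFIG` environment variable, assuming one `ps` job and one `worker` job. The host names and the helper name `build_tf_config` are hypothetical, not taken from the linked documents.

```python
import json


def build_tf_config(ps_hosts, worker_hosts, task_type, task_index):
    """Return a TF_CONFIG JSON string describing the cluster and this node's role."""
    cluster = {"ps": list(ps_hosts), "worker": list(worker_hosts)}
    if task_type not in cluster:
        raise ValueError("task_type must be 'ps' or 'worker'")
    if not 0 <= task_index < len(cluster[task_type]):
        raise ValueError("task_index is out of range for this job")
    return json.dumps(
        {"cluster": cluster, "task": {"type": task_type, "index": task_index}}
    )


if __name__ == "__main__":
    # Hypothetical hosts: one parameter server and two workers.
    config = build_tf_config(
        ["ps0.example.com:2222"],
        ["worker0.example.com:2222", "worker1.example.com:2222"],
        task_type="worker",
        task_index=1,
    )
    print(config)
```

Every machine gets the same `cluster` section and only the `task` section differs, which is how each process learns its own role.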

TensorFlow - step by step

Intro - Official quickstart for beginners
https://www.tensorflow.org/tutorials/quickstart/beginner
Import the TensorFlow library and load the official MNIST dataset.
    import tensorflow as tf
    mnist = tf.keras.datasets.mnist
Split the MNIST dataset into training and test sets, and normalize the pixel values to the range 0 to 1.
    (x_train, y_train), (x_test, y_test) = mnist.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0
The meaning of the values is quoted below.
https://conx.readthedocs.io/en/latest/MNIST.html
The MNIST digits are grayscale images, with each pixel represented as a single intensity value in the range 0 (black) to 1 (white).
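The scaling step above is a plain division of every pixel by 255. The sketch below shows the same arithmetic on a tiny hand-made image using plain Python lists, so it runs without downloading MNIST. The function name `normalize_pixels` and the sample values are hypothetical; with NumPy arrays, `x_train / 255.0` performs the same element-wise division.

```python
def normalize_pixels(image, max_value=255.0):
    """Scale integer pixel intensities (0..255) to floats in the range 0..1."""
    return [[pixel / max_value for pixel in row] for row in image]


# A hypothetical 2x3 grayscale image: 0 is black, 255 is white.
tiny_image = [[0, 128, 255], [64, 192, 32]]
scaled = normalize_pixels(tiny_image)

for row in scaled:
    print(["{:.3f}".format(value) for value in row])
```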

Set up Ubuntu 20.04 server with Nvidia GPU (and spaCy)

After OS install - basic update
Upgrade software
    sudo su
    apt update
    apt upgrade -y
ssh hardening
    ssh user@newserver.com
    mkdir .ssh
    chmod 700 .ssh
    # copy id_ecdsa.pub into .ssh/authorized_keys
    chmod 600 .ssh/authorized_keys
In /etc/ssh/sshd_config, add the following settings.
    PermitRootLogin no
    PubkeyAuthentication yes
    PasswordAuthentication no
After changing the configuration, restart sshd.
    sudo systemctl restart sshd
    sudo su
    passwd    # change the root password
Firewall - ufw
https://ubuntu.com/engage/20.04-webinars
This comes through the constant security patching process and new features like the Ubuntu Server Live installer, iptables to nftables migration, and more resilient boot loader.
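Before restarting sshd, it is worth checking that the three hardening settings are really active, because a commented-out line is easy to miss. The sketch below parses sshd_config text and reports which settings are missing. The helper names are hypothetical, and it assumes a flat file: `Match` blocks and `Include` directives are not handled.

```python
REQUIRED_SETTINGS = {
    "permitrootlogin": "no",
    "pubkeyauthentication": "yes",
    "passwordauthentication": "no",
}


def parse_sshd_config(text):
    """Map lowercased keywords to values, ignoring comments and blank lines."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) == 2:
            # sshd uses the first value it sees for a keyword, so keep the first.
            settings.setdefault(parts[0].lower(), parts[1].strip())
    return settings


def missing_hardening(text):
    """Return the required keywords that are absent or set to another value."""
    settings = parse_sshd_config(text)
    return sorted(
        key for key, value in REQUIRED_SETTINGS.items() if settings.get(key) != value
    )


if __name__ == "__main__":
    sample = "PermitRootLogin no\n#PasswordAuthentication no\nPubkeyAuthentication yes\n"
    print("Missing or wrong:", missing_hardening(sample))
```

On a real server the text would come from reading /etc/ssh/sshd_config, and keeping the current ssh session open while testing a new login avoids locking yourself out.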

MeCab

Install
Basic install
    pip install -U mecab-python3
Download dictionary
https://pypi.org/project/mecab-python3/
These wheels include an internal (statically linked) copy of the MeCab library, and a copy of the mecab-ipadic dictionary (using UTF-8 text encoding), which is automatically used by default. If you wish to use a different dictionary, you will need to install it yourself, write a mecabrc file directing MeCab to use it, and set the environment variable MECABRC to point to this file.
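The custom-dictionary route quoted above comes down to two steps: write a mecabrc file with a `dicdir` entry, then point `MECABRC` at it. The sketch below does both with the standard library only. The dictionary path and the helper name `write_mecabrc` are hypothetical placeholders, and the variable must be set before a MeCab tagger is created.

```python
import os
import tempfile


def write_mecabrc(dic_dir, rc_path):
    """Write a minimal mecabrc pointing at dic_dir and export MECABRC."""
    with open(rc_path, "w", encoding="utf-8") as rc_file:
        rc_file.write("dicdir = {}\n".format(dic_dir))
    # MeCab reads this variable when a tagger is created, so set it first.
    os.environ["MECABRC"] = rc_path
    return rc_path


if __name__ == "__main__":
    # Hypothetical dictionary location; replace with the real install path.
    rc = write_mecabrc(
        "/opt/mecab/dic/my-dictionary",
        os.path.join(tempfile.gettempdir(), "mecabrc"),
    )
    print("MECABRC set to", rc)
```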

Word2vec

This is a draft.
Most important in NLP, I think
In 2020, NLP is the study of trying to understand human languages with computers. From my common sense, it is practically impossible for humans to “understand” our languages in numbers. Yes, of course, proponents of scientism could say, “Our world consists of quantum mechanical particles, and if we have enough machine power we can simulate a human. So language is understandable with numbers.

Install GiNZA on macOS (April 2020)

Install on 3.8.1 - failed
As of April 2020, I tried to install ginza on my macOS laptop in Python 3.8.1 (pyenv-virtualenv).
    pip install -U ginza
It failed with an error like this:
    ...
    Collecting ja_ginza_dict<3.2.0,>=3.1.0
    Using cached ja_ginza_dict-3.1.0-1.tar.gz (44.8 MB)
    ERROR: Command errored out with exit status 1:
    command: /Users/atlex/.pyenv/versions/3.8.1/envs/nlp/bin/python3.8 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/hf/kz2h5f215jx73h3vl_d423pr0000gn/T/pip-install-_gi9fc1z/ja-ginza-dict/setup.py'"'"'; __file__='"'"'/private/var/folders/hf/kz2h5f215jx73h3vl_d423pr0000gn/T/pip-install-_gi9fc1z/ja-ginza-dict/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /private/var/folders/hf/kz2h5f215jx73h3vl_d423pr0000gn/T/pip-install-_gi9fc1z/ja-ginza-dict/pip-egg-info
    cwd: /private/var/folders/hf/kz2h5f215jx73h3vl_d423pr0000gn/T/pip-install-_gi9fc1z/ja-ginza-dict/
    Complete output (19 lines):
    Traceback (most recent call last):
    File "/Users/atlex/.

PyPDF2 - read text from PDF file

Concept
There are many useful PDF documents on the internet. They are very useful as a source of training data. PyPDF2 is a library for manipulating PDF files in Python.
PyPDF2 Official Documentation
Install
You can install PyPDF2 via pip.
    pip install PyPDF2
How to use - read a PDF file
PdfFileReader Class - Official PyPDF2 document
We should open the file in rb mode, then read it with PyPDF2.PdfFileReader(file_object).
    import PyPDF2
    with open("sample.
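Reading text out of a PDF follows the same pattern as above: open the file in binary mode, wrap it in a reader, and collect the text page by page. The sketch below is written against the newer `PdfReader` interface, assuming PyPDF2 2.0 or later, where `PdfFileReader` is deprecated in favor of it. The helper names and the file name are hypothetical, and PyPDF2 is imported only when a file is actually read.

```python
def collect_text(pages):
    """Join the text of each page; every page needs an extract_text() method."""
    return "\n".join(page.extract_text() or "" for page in pages)


def read_pdf_text(path):
    """Extract the text of every page of the PDF at path."""
    from PyPDF2 import PdfReader  # lazy import; assumes PyPDF2 >= 2.0

    with open(path, "rb") as pdf_file:
        reader = PdfReader(pdf_file)
        # Extract while the file is still open, since pages are read lazily.
        return collect_text(reader.pages)


# Usage (the file name is a placeholder):
#   text = read_pdf_text("sample.pdf")
#   print(text[:200])
```

Text extraction quality depends heavily on how the PDF was produced, so scanned documents without a text layer will come back empty.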

spaCy

What is spaCy
The README of the GitHub project describes what spaCy is.
spaCy: Industrial-strength NLP
spaCy is a library for advanced Natural Language Processing in Python and Cython. It’s built on the very latest research, and was designed from day one to be used in real products. spaCy comes with pretrained statistical models and word vectors, and currently supports tokenization for 50+ languages. It features state-of-the-art speed, convolutional neural network models for tagging, parsing and named entity recognition and easy deep learning integration.