Word2vec

This is a draft.

The most important part of NLP, I think

As of 2020, NLP is the study of trying to understand human languages with computers. My common sense says it is pragmatically impossible to “understand” our languages in numbers.

Yes, of course, people who believe in scientism could say, “our world consists of quantum-mechanical particles, and if we have enough machine power we can simulate a human. So language is understandable with numbers.” But think for a moment about how many resources you would need to compute a human brain. This is why I said “pragmatically impossible.”

But this challenge is very, very interesting to me! I started to learn NLP because I want to see how this challenge develops. I’m not a specialist in this field at all, but at the very first step of learning I recognized that the most important part of the challenge lies in this word-to-number translation.

As of April 2020, I’ve seen on the Internet that Word2Vec is a common way to do that, so I started to investigate it.

The idea is so simple that I thought it was invented more than 20 years ago, but according to the Wikipedia page, it was published in 2013.

My first Word2Vec

I could get an overview of Word2Vec from this video while lying in bed. Awesome.

https://www.youtube.com/watch?v=64qSgA66P-8

A fun practice.

https://www.youtube.com/watch?v=zFScws0mb7M

His code.

https://github.com/SmokinCaterpillar/doc2vec_user_comments

I want to learn Doc2Vec, and the following page is pretty awesome.

https://shuzhanfan.github.io/2018/08/understanding-word2vec-and-doc2vec/

About Negative sampling

https://aegis4048.github.io/optimize_computational_efficiency_of_skip-gram_with_negative_sampling
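As I understand it, the trick the article above explains is to replace the full softmax over the vocabulary with a handful of binary classifications: one positive (center, context) pair against a few randomly sampled “negative” words. A rough sketch of that loss in plain Python (the function and variable names here are my own assumptions, and the negative-word sampling itself is not shown):

```python
# Sketch of the skip-gram negative-sampling objective.
# Assumption: word vectors are tiny plain lists; `neg_vecs` holds the
# embeddings of a few randomly drawn "negative" words.
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def neg_sampling_loss(center, context, neg_vecs):
    """-log p(pair is real) minus sum of log p(each negative pair is fake)."""
    loss = -math.log(sigmoid(dot(center, context)))
    for neg in neg_vecs:
        loss -= math.log(sigmoid(-dot(center, neg)))
    return loss
```

The point of the trick: each update only touches the context word and the few sampled negatives, which is far cheaper than a softmax over every word in the vocabulary.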

RNN for POS tagging https://www.youtube.com/watch?v=2AuMgtw-z6s

(To be added.)

Note.

Word2Vec is a single-layer neural network. It comes in two flavors:

  • Skip-gram: predict the surrounding context words from the center word.
  • CBOW: predict the center word from its surrounding context words.
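A minimal sketch of how the two architectures frame their training examples, assuming a toy pre-tokenized sentence and a window size of 2 (the function names are mine, not from any library):

```python
# Sketch: how Skip-gram and CBOW slice a sentence into training examples.
# Assumption: a toy pre-tokenized sentence; no real library involved.

def skipgram_pairs(tokens, window=2):
    """Skip-gram: each (center, context) word pair is one training example."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """CBOW: all context words together predict the single center word."""
    examples = []
    for i, center in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        examples.append((context, center))
    return examples

sentence = ["the", "cat", "sat", "on", "the", "mat"]
print(skipgram_pairs(sentence)[:3])
# → [('the', 'cat'), ('the', 'sat'), ('cat', 'the')]
```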

Doc2Vec

Original Paper of Doc2Vec.

https://arxiv.org/abs/1405.4053

spaCy + Doc2Vec

https://www.shanelynn.ie/word-embeddings-in-python-with-spacy-and-gensim/

Attention

https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html

Sense2Vec

https://explosion.ai/blog/sense2vec-with-spacy

TensorFlow implementation

https://www.youtube.com/watch?v=64qSgA66P-8

(Needs to be checked.)

  1. Prepare a corpus (a list of sentences).
  2. Remove stop words from the sentences.
  3. Extract the unique words and encode each as a one-hot vector.
  4. Skip-gram: make “neighbor” (center, context) pairs, depending on the window size.
  5. Build a model.
  6. Train it.
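The six steps above can be sketched end-to-end in plain NumPy rather than TensorFlow. Everything here — the tiny corpus, the stop-word list, the layer sizes — is an illustrative assumption of mine, not the video’s code:

```python
import numpy as np

# 1. Prepare a corpus (a list of sentences).
corpus = ["the cat sat on the mat", "the dog sat on the rug"]

# 2. Remove stop words (toy stop-word list, an assumption).
stop_words = {"the", "on"}
sentences = [[w for w in s.split() if w not in stop_words] for s in corpus]

# 3. Extract unique words; each word's index defines its one-hot vector.
vocab = sorted({w for s in sentences for w in s})
idx = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

def one_hot(word):
    v = np.zeros(V)
    v[idx[word]] = 1.0
    return v

# 4. Skip-gram: make (center, neighbor) pairs within the window.
window = 1
pairs = [(s[i], s[j])
         for s in sentences
         for i in range(len(s))
         for j in range(max(0, i - window), min(len(s), i + window + 1))
         if j != i]

# 5. Make a model: one linear hidden layer (the embeddings) plus a softmax output.
rng = np.random.default_rng(0)
dim = 8
W_in = rng.normal(scale=0.1, size=(V, dim))   # embedding matrix
W_out = rng.normal(scale=0.1, size=(dim, V))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# 6. Train it with plain gradient descent on cross-entropy loss.
lr = 0.1
for _ in range(200):
    for center, neighbor in pairs:
        x, y = one_hot(center), one_hot(neighbor)
        h = x @ W_in                  # hidden layer = the center word's embedding
        p = softmax(h @ W_out)        # predicted distribution over neighbors
        W_out -= lr * np.outer(h, p - y)
        W_in -= lr * np.outer(x, W_out @ (p - y))

# Each row of W_in is now the learned vector for one vocabulary word.
print(vocab, W_in.shape)
```

After training, the model should assign “sat” a high probability as a neighbor of “cat”, since that is the only pair the toy corpus ever shows it.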