spaCy

What is spaCy

The README of the GitHub project describes what spaCy is:

spaCy: Industrial-strength NLP

spaCy is a library for advanced Natural Language Processing in Python and Cython. It’s built on the very latest research, and was designed from day one to be used in real products. spaCy comes with pretrained statistical models and word vectors, and currently supports tokenization for 50+ languages. It features state-of-the-art speed, convolutional neural network models for tagging, parsing and named entity recognition and easy deep learning integration. It’s commercial open-source software, released under the MIT license.

Install

pip install -U spacy

Before using the library, download a language model first. In my case, I downloaded the English model as follows.

python -m spacy download en_core_web_sm

I also added the following language models.

python -m spacy download de_core_news_sm
python -m spacy download en_core_web_md
python -m spacy download en_core_web_lg

To enable GPU support, install spaCy with the CUDA extra matching your CUDA version (here, CUDA 10.0).

pip install -U spacy[cuda100]
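After that, you can ask spaCy to use the GPU before loading a model. A minimal sketch (spacy.prefer_gpu() returns True if a GPU was found and activated):

import spacy

# Use the GPU if one is available and configured; otherwise fall back to the CPU
is_using_gpu = spacy.prefer_gpu()
print(is_using_gpu)

nlp = spacy.load("en_core_web_sm")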

spaCy usage

Number of words in the dictionary of the loaded model

TBW
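Until this is written up, here is a minimal sketch: len(nlp.vocab) returns the number of lexemes currently stored in the model's Vocab. Note that the Vocab grows lazily as new strings are processed, so this number is not fixed.

import spacy
nlp = spacy.load("en_core_web_sm")

# Number of lexemes currently in the vocabulary
# (this grows as the model sees new strings)
print(len(nlp.vocab))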

Basic step - Tokenization

Tokenization is the process of breaking up the original text into component pieces (=tokens).

import spacy
nlp = spacy.load("en_core_web_sm")

doc = nlp("spaCy comes with pretrained statistical models and word vectors, and currently supports tokenization for 50+ languages.")

for token in doc:
    print(token.text)

Get part of speech (POS)

Each Token contains a POS tag, and you can check it as follows.

import spacy
nlp = spacy.load("en_core_web_sm")

doc = nlp("spaCy comes with pretrained statistical models and word vectors, and currently supports tokenization for 50+ languages.")

for token in doc:
    # print(token.pos) <- token.pos returns an integer ID; each ID corresponds to a POS label.
    print(token.pos_)

I don’t include a snippet here, but token.dep_ returns the syntactic dependency relation.

Note to self: token.tag_ and spacy.explain(token.tag_) should be covered here.
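Until that note is written, here is a minimal sketch: token.tag_ holds the fine-grained POS tag, and spacy.explain() returns a human-readable description of a tag.

import spacy
nlp = spacy.load("en_core_web_sm")

doc = nlp("spaCy comes with pretrained statistical models and word vectors.")

for token in doc:
    # token.pos_ = coarse-grained POS, token.tag_ = fine-grained tag
    print(f"{token.text}\t{token.pos_}\t{token.tag_}\t{spacy.explain(token.tag_)}")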

Pipeline object

The loaded model (nlp in the examples above) has a pipeline object, which shows how the model processes text step by step.

[Image] Processing pipeline - spaCy official document
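You can inspect the pipeline from Python. A minimal sketch (the component names depend on the loaded model):

import spacy
nlp = spacy.load("en_core_web_sm")

# Names of the pipeline components, in processing order
print(nlp.pipe_names)
# e.g. ['tagger', 'parser', 'ner']

# List of (name, component) tuples
print(nlp.pipeline)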

Extract sentences

You can iterate over the sentences of a document via doc.sents.

import spacy
nlp = spacy.load("en_core_web_sm")

doc = nlp("spaCy comes with pretrained statistical models and word vectors, and currently supports tokenization for 50+ languages.")

for sentence in doc.sents:
    print(sentence)

Extract entities

Extract entities (named entity recognition, NER).

import spacy
nlp = spacy.load("en_core_web_sm")

doc = nlp("spaCy comes with pretrained statistical models and word vectors, and currently supports tokenization for 50+ languages.")

for entity in doc.ents:
    print(entity)
    print(entity.label_) # <- Each entity has its own label.

Here is the list of labels.

https://spacy.io/api/annotation#named-entities
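You can also look up a label description programmatically with spacy.explain().

import spacy

# Returns a short description of an annotation label
print(spacy.explain("GPE"))
# Countries, cities, states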

Get noun chunks

Noun chunks are “base noun phrases” and can be iterated over via doc.noun_chunks.

import spacy
nlp = spacy.load("en_core_web_sm")

doc = nlp("spaCy comes with pretrained statistical models and word vectors, and currently supports tokenization for 50+ languages.")

for chunk in doc.noun_chunks:
    print(chunk)

Tokenization - a few details

In the first step of the pipeline, spaCy divides the text into components called “tokens”. This process is called tokenization. A token can be regarded as a “meaningful segment”.

This is not a simple processor that just splits the text on spaces. It does split on whitespace first, but it then separates other components such as punctuation, quotation marks, etc.

The official documentation has more details about tokenization in spaCy.
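A quick check of this behavior, using the example sentence from the official documentation:

import spacy
nlp = spacy.load("en_core_web_sm")

# The contraction and the punctuation become separate tokens,
# while the abbreviation "N.Y." is kept together
doc = nlp("Let's go to N.Y.!")
print([token.text for token in doc])
# ['Let', "'s", 'go', 'to', 'N.Y.', '!']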

Tip

Extract tokens which are not stop words.

tokens = [token.text for token in doc if not token.is_stop]

Cool tool - displaCy

https://spacy.io/usage/visualizers

If you are using Jupyter Notebook, try the snippet below.

import spacy
nlp = spacy.load("en_core_web_sm")

doc = nlp("spaCy comes with pretrained statistical models and word vectors, and currently supports tokenization for 50+ languages.")

spacy.displacy.render(doc, style="dep", jupyter=True)

As of April 2020, style can be either "dep" or "ent".

You can also save a figure (To Be Written).

https://spacy.io/usage/visualizers#html
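Until that is written, here is a minimal sketch based on the page above: outside Jupyter, displacy.render() returns the markup as a string, which you can write to a file (the file name here is arbitrary).

import spacy
from spacy import displacy
from pathlib import Path

nlp = spacy.load("en_core_web_sm")
doc = nlp("spaCy comes with pretrained statistical models and word vectors.")

# Render to an SVG string instead of displaying it inline
svg = displacy.render(doc, style="dep", jupyter=False)
Path("dependency_plot.svg").write_text(svg, encoding="utf-8")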

Stemming

“Stem” here means “word stem”. Stem is a linguistic term, and there is a similar concept, “root”.

https://en.wikipedia.org/wiki/Root_(linguistics)

Inflectional roots are often called stems,

https://en.wikipedia.org/wiki/Word_stem

Lemmatization

A lemma is easily understood as the dictionary form of a word.

To Be Written…

Stemming and lemmatization in spaCy

There is no stemmer in spaCy! Instead, there is a lemmatizer.

Why? -> https://github.com/explosion/spaCy/issues/327

https://spacy.io/api/lemmatizer

import spacy
nlp = spacy.load("en_core_web_sm")

text = "Stay aware of the latest information on the COVID-19 outbreak, available on the WHO website and through your national and local public health authority."
doc = nlp(text)

for token in doc: 
    print(f"{token.text}\t{token.lemma}\t{token.lemma_}")

Each lemma has its own hash: token.lemma is the hash, and token.lemma_ is the string.
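A minimal sketch: the hash and the string are linked through the StringStore (nlp.vocab.strings), which converts in both directions.

import spacy
nlp = spacy.load("en_core_web_sm")

doc = nlp("Stay aware of the latest information.")
token = doc[0]  # "Stay"

print(token.lemma, token.lemma_)
print(nlp.vocab.strings[token.lemma])   # hash -> string
print(nlp.vocab.strings[token.lemma_])  # string -> hash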

To be written…

Stop words

“Stop words” is not a linguistics term but a computer-science one. Stop words can be defined as “the most common words in a language”. In this definition, “common” can be interpreted in many ways, so the actual list strongly depends on the tool.

Check stop words

We can also find the stop word list in the spaCy library.

import spacy
nlp = spacy.load("en_core_web_sm")

# Warning! This will print a set with more than 300 elements
print(nlp.Defaults.stop_words)

# Check how many words are listed in stop words of spaCy as of April 2020
len(nlp.Defaults.stop_words)
# 326

# Check the word "me" is in stop words or not
nlp.vocab["me"].is_stop
# True

Add and delete stop words

When changing the default stop word set, also update the is_stop flag on the vocab entry so that token.is_stop reflects the change.

import spacy
nlp = spacy.load("en_core_web_sm")

print(len(nlp.Defaults.stop_words))
# 326
print(nlp.vocab["Volvic"].is_stop)
# False

# Add a stop word
nlp.Defaults.stop_words.add("Volvic")
nlp.vocab["Volvic"].is_stop = True

# Check whether it is added
print(len(nlp.Defaults.stop_words))
# 327 (incremented!!)
print(nlp.vocab["Volvic"].is_stop)
# True!!

# Delete the word "put" from stop words
nlp.Defaults.stop_words.remove("put")
nlp.vocab["put"].is_stop = False

# Check whether it is removed
print(len(nlp.Defaults.stop_words))
# 326 (decremented!!)
print(nlp.vocab["put"].is_stop)
# False!!

Matcher

Matcher is a more powerful pattern-matching tool. It can match not only single words but multi-token patterns at once.

https://spacy.io/usage/rule-based-matching

Here is the official sample.

import spacy
nlp = spacy.load("en_core_web_sm")

# Import the Matcher lib
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
matcher.add("HelloWorld", None, pattern)


doc = nlp("Hello, world! Hello world!")
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(match_id, string_id, start, end, span.text)

In a pattern we can use regular expressions.

https://spacy.io/usage/rule-based-matching#regex
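A minimal sketch, using the misspelling example from the page above (the REGEX operator applies to the text of a single token):

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# Match several misspellings of "definitely" on the token text
pattern = [{"TEXT": {"REGEX": "deff?in[ia]tely"}}]
matcher.add("DEFINITELY", None, pattern)

doc = nlp("I definately think so.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)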

Without regex, operators (the OP key) can perform similar things.

https://spacy.io/usage/rule-based-matching#quantifiers
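A minimal sketch: with "OP": "?" the punctuation token becomes optional, so the pattern matches both “Hello, world” and “Hello world”.

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# "OP": "?" makes the punctuation token optional (0 or 1 occurrences)
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True, "OP": "?"}, {"LOWER": "world"}]
matcher.add("HelloWorld", None, pattern)

doc = nlp("Hello, world! Hello world!")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)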

About attributes: https://spacy.io/usage/rule-based-matching#adding-patterns-attributes

Architecture of spaCy

[Image] Architecture - spaCy official document