String manipulation with Python (for NLP)

f-strings

Concept

The official name of f-strings is “formatted string literals.” An f-string is the “modern” way to put the values of variables into strings (as of Feb. 2020). Before f-strings appeared, we used the format method. To me, f-strings are much more intuitive than the format method.

Formatted string literals - Python official document

format - Python official document

Simple example

Both print lines in the following code print the string “It’s me, Mario!”.

name = "Mario!"

# Format method
print("It's me, {}".format(name))

# f-strings method
print(f"It's me, {name}")

The only thing you have to remember about f-strings is to put the variables inside curly braces.
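The braces accept more than bare variable names. As a minimal sketch (the values below are made up for illustration), any expression can go inside them, optionally followed by a conversion such as !r or by a format spec after a colon:

```python
name = "Mario"
coins = 7
price = 3.14159

# Any expression is allowed inside the braces
greeting = f"It's me, {name.upper()}!"
total = f"{coins} coins + 3 = {coins + 3}"

# A format spec after the colon controls the output, e.g. two decimals
rounded = f"{price:.2f}"

# !r applies repr(), which is handy when debugging strings
debug = f"name={name!r}"

print(greeting)  # It's me, MARIO!
print(total)     # 7 coins + 3 = 10
print(rounded)   # 3.14
print(debug)     # name='Mario'
```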

Padding

Population_tuple = [("Germany", 82790000), ("U.S.", 327200000), ("France", 66990000)]

# Unformatted
for country,population in Population_tuple:
    print(f"{country} {population}") 

Germany 82790000
U.S. 327200000
France 66990000

# Space padded
for country,population in Population_tuple:
    print(f"{country:{10}}{population:{12}}") 

Germany       82790000
U.S.         327200000
France        66990000

# Align to left (<) and right (>)
# Padding with dots (.)
for country,population in Population_tuple:
    print(f"{country:.<{10}}{population:.>{12}}") 

Germany.......82790000
U.S..........327200000
France........66990000
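The part after the colon follows the order fill, align, width. The nested braces in {10} are only needed when the width comes from a variable; a literal width can be written directly, as in {country:<10}. Here is a small sketch of the variable case. The names name_width and pop_width are hypothetical, and the thousands separator is an extra that the examples above do not use:

```python
population_list = [("Germany", 82790000), ("U.S.", 327200000), ("France", 66990000)]

name_width = 10  # hypothetical column widths, chosen for this example
pop_width = 14

rows = []
for country, population in population_list:
    # "<" left-aligns, ">" right-aligns, "," inserts thousands separators
    rows.append(f"{country:<{name_width}}{population:>{pop_width},}")

for row in rows:
    print(row)

# "^" centers the value; "*" is the fill character here
centered = f"{'NLP':*^9}"
print(centered)  # ***NLP***
```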

Regular expression

Regular expressions are almost a mandatory skill[1] for NLP. Python has a standard regular expression library, re.

re - Python official document

In addition to the usual regular expression syntax, the re library has some patterns of its own.

Here is a snippet that uses the library.

import re

# Set up sample text and pattern
text = "This is a sample text. My local IP is 192.168.150.5."
pattern = r"[0-9]+(?:\.[0-9]+){3}"

# Find the pattern and print it
my_match = re.search(pattern,text)
print(my_match)
# <re.Match object; span=(38, 51), match='192.168.150.5'>
# ↑
# The first match starts at index 38 and ends just before index 51 (0-based),
# i.e. it covers the 39th through the 51st character.
# The matched string is 192.168.150.5

# Find all patterns in the text
text = "This is a sample text. My local IP is 192.168.150.5, and the Gateway IP is 192.168.150.1"
my_matches = re.findall(pattern, text)
print(my_matches)
# ['192.168.150.5', '192.168.150.1']
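When the position of every match matters, re.finditer returns match objects instead of plain strings. The sketch below also uses a named group, (?P<name>...), which is one of the Python-specific extensions. The group name last_octet is only an illustration:

```python
import re

text = "My local IP is 192.168.150.5, and the Gateway IP is 192.168.150.1"
# Same idea as the IP pattern above, with the last number in a named group
pattern = r"[0-9]+(?:\.[0-9]+){2}\.(?P<last_octet>[0-9]+)"

results = []
for match in re.finditer(pattern, text):
    # group(0) is the whole match; span() gives (start, end), end exclusive
    results.append((match.group(0), match.group("last_octet"), match.span()))

for ip, last_octet, span in results:
    print(ip, last_octet, span)
```

Slicing the text with a span, text[start:end], gives back exactly the matched string.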

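The r prefix on the pattern marks it as a raw string, not as a special “regular expression” type: backslashes inside it are kept as they are instead of being treated as escape characters. The official documentation puts it this way: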

https://docs.python.org/3/library/re.html#module-re

The solution is to use Python’s raw string notation for regular expression patterns; backslashes are not handled in any special way in a string literal prefixed with 'r'. So r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Usually patterns will be expressed in Python code using this raw string notation.
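The difference is easy to check in the interpreter. A minimal sketch (the sample string "Feb. 2020" is just an illustration):

```python
import re

# Without the r prefix, "\n" is a single newline character
print(len("\n"))   # 1
# With the r prefix, the backslash is kept: two characters
print(len(r"\n"))  # 2

# For a pattern such as \d+, both spellings give the same string,
# but the raw string is easier to read
plain = "\\d+"
raw = r"\d+"
print(plain == raw)                  # True
print(re.findall(raw, "Feb. 2020"))  # ['2020']
```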

I also keep a separate note on regular expressions for myself.

PyPDF2

Concept

There are many useful PDF documents on the internet. They are a good source of training data.

PyPDF2 is a library for manipulating PDF files via Python.

PyPDF2 Official Documentation

I cover PyPDF2 in another post.


  1. Not a skill but rather knowledge or tips. ↩︎