Pandas - cheatsheet

Basic information about dataframe

df.info()      # basic information about the dataframe (dtypes, non-null counts)
len(df.index)  # return the number of rows (data)
df.count()     # return the number of non-NaN values in each column
df.head()      # first 5 rows
df.tail()      # last 5 rows
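As a small illustration of how `len(df.index)` and `df.count()` differ, here is a sketch on a made-up DataFrame. The data is hypothetical; only the column names follow the later sections.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, not from any real file.
df = pd.DataFrame({
    "Product": ["A", "B", "A", "C"],
    "Score": [5.0, np.nan, 3.0, 4.0],
})

n_rows = len(df.index)   # number of rows, rows with NaN included
non_nan = df.count()     # per-column count of non-NaN values

print(n_rows)            # 4
print(non_nan["Score"])  # 3, because one Score is NaN
print(df.head(2))        # first two rows
print(df.tail(2))        # last two rows
```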

Count the data in a column

In this example, the column is “Product”.

df["Product"].value_counts()
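A minimal sketch of what `value_counts()` returns, using made-up product names. The `normalize=True` variant is an extra that is not used elsewhere in these notes.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"Product": ["A", "B", "A", "C", "A"]})

counts = df["Product"].value_counts()                # sorted by count, descending
shares = df["Product"].value_counts(normalize=True)  # fractions instead of counts

print(counts["A"])   # 3
print(shares["A"])   # 0.6
```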

Get the unique values of a column

df["Product"].unique()
# the result type is numpy.ndarray
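A short sketch with made-up values: `unique()` keeps the order of first appearance rather than sorting, and `nunique()` gives only the number of distinct values.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"Product": ["B", "A", "B", "C"]})

values = df["Product"].unique()     # distinct values, in order of first appearance
n_unique = df["Product"].nunique()  # number of distinct values

print(list(values))  # ['B', 'A', 'C']
print(n_unique)      # 3
```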

Check the distribution in a graph

# Check the data distribution
# The column is Score
ax = df["Score"].value_counts().plot(kind='bar')
fig = ax.get_figure()
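The snippet above creates the figure but does not save it. A possible complete version is sketched below. It assumes matplotlib is installed, and the scores and the output filename `score_distribution.png` are made up.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs without a display
import pandas as pd

# Hypothetical scores for illustration.
df = pd.DataFrame({"Score": [5, 4, 5, 1, 3, 5, 4]})

# sort_index() orders the bars by score value instead of by frequency.
ax = df["Score"].value_counts().sort_index().plot(kind="bar")
ax.set_xlabel("Score")
ax.set_ylabel("Count")

fig = ax.get_figure()
fig.savefig("score_distribution.png")  # hypothetical output filename
```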

Tips - inplace

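Code is often quicker to write without inplace=True: each method then returns a new object, so calls can be chained, whereas with inplace=True the method returns None.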

https://stackoverflow.com/questions/43893457/understanding-inplace-true
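A minimal sketch of the difference, on a made-up DataFrame: without `inplace` the calls chain, while `inplace=True` modifies the object and returns `None`.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"Score": [3, None, 1]})

# Without inplace, each method returns a new object, so calls can be chained.
cleaned = df.dropna().sort_values("Score").reset_index(drop=True)

# With inplace=True the method modifies df itself and returns None.
result = df.dropna(inplace=True)

print(result)                     # None
print(cleaned["Score"].tolist())  # [1.0, 3.0]
```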

Tips - check duplicates in a list

https://stackoverflow.com/questions/1541797/how-do-i-check-if-there-are-duplicates-in-a-flat-list

len(your_list) != len(set(your_list))
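The one-liner can be wrapped in a helper, as in the sketch below. It assumes the list elements are hashable. The function names are made up, and `find_duplicates` is an optional extra that reports which values repeat.

```python
from collections import Counter


def has_duplicates(items):
    """Return True if any element occurs more than once (elements must be hashable)."""
    return len(items) != len(set(items))


def find_duplicates(items):
    """Return the elements that occur more than once, in order of first appearance."""
    return [item for item, n in Counter(items).items() if n > 1]


print(has_duplicates([1, 2, 3]))           # False
print(has_duplicates([1, 2, 2]))           # True
print(find_duplicates([1, 2, 2, 3, 3, 3])) # [2, 3]
```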

Import CSV to DataFrame

import pandas as pd
df = pd.read_csv("/path/to/csvfile.csv", sep=",")

The first row in the CSV file should contain the column names.
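To try this without a file on disk, `pd.read_csv` also accepts a file-like object. The sketch below uses `io.StringIO` with made-up contents and a `;` separator to show what `sep` does.

```python
import io
import pandas as pd

# In-memory stand-in for a CSV file; the contents are made up for illustration.
csv_text = "Text;Score\ngood;5\nbad;1\n"

df = pd.read_csv(io.StringIO(csv_text), sep=";")

print(list(df.columns))  # ['Text', 'Score'], taken from the first row
print(len(df.index))     # 2
```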

Drop NaN data

Keep only the Text and Score columns, and delete every row where either Text or Score is NaN.

df = df[['Text','Score']].dropna()
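By default `dropna()` uses `how="any"`, so a row is dropped if either column is NaN. The sketch below contrasts this with `how="all"` on made-up data. The `Id` column is hypothetical and only shows that the column selection discards it.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "Text": ["good", None, "bad", None],
    "Score": [5.0, 4.0, np.nan, np.nan],
})

either = df[["Text", "Score"]].dropna()           # drop if Text OR Score is NaN
both = df[["Text", "Score"]].dropna(how="all")    # drop only if Text AND Score are NaN

print(len(either))  # 1
print(len(both))    # 3
```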

Replace data

Convert the scores to binary labels in place: 1, 2, and 3 become 0; 4 and 5 become 1.

df['Score'].replace([1,2,3], 0, inplace=True)
df['Score'].replace([4,5], 1, inplace=True)
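The same mapping can be written without `inplace` by assigning the result back to the column. This is a sketch on made-up scores. On recent pandas versions, an inplace call on a selected column such as `df['Score']` may not update `df` itself, so the assignment form is the safer choice.

```python
import pandas as pd

# Hypothetical scores for illustration.
df = pd.DataFrame({"Score": [1, 2, 3, 4, 5]})

# Same order as the inplace version: first 1-3 -> 0, then 4-5 -> 1.
df["Score"] = df["Score"].replace([1, 2, 3], 0).replace([4, 5], 1)

print(df["Score"].tolist())  # [0, 0, 0, 1, 1]
```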

Truncate data

Keep only the rows with index 0 to 100,000 (both ends are included).

df = df.truncate(before=0, after=100000)
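`truncate` works on index labels and keeps both endpoints, unlike positional slicing. A small sketch on a made-up 10-row DataFrame:

```python
import pandas as pd

# Hypothetical data with the default integer index 0..9.
df = pd.DataFrame({"Score": range(10)})

head = df.truncate(before=0, after=3)

print(len(head))          # 4: labels 0, 1, 2 and 3
print(len(df.iloc[0:3]))  # 3: positional slicing excludes the end
```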

Plot the data distribution

ax = df.Score.value_counts().plot(kind='bar')
fig = ax.get_figure()

Tips - split data with scikit-learn

from sklearn.model_selection import train_test_split
train_data, dev_data = train_test_split(preprocessed_data, test_size=0.33, random_state=42)
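A runnable sketch of the split. It assumes scikit-learn is installed, and `preprocessed_data` here is a made-up 100-row stand-in for the real data. With `test_size=0.33`, about a third of the rows go to `dev_data`, and a fixed `random_state` makes the split reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the preprocessed data.
preprocessed_data = pd.DataFrame({
    "Text": [f"review {i}" for i in range(100)],
    "Score": [i % 2 for i in range(100)],
})

train_data, dev_data = train_test_split(
    preprocessed_data, test_size=0.33, random_state=42
)

# Every row ends up in exactly one of the two parts.
print(len(train_data), len(dev_data))
print(set(train_data.index).isdisjoint(dev_data.index))  # True
```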

4. Mar. 2021 - more comprehensive

I realize that this is not a good reference for me. I can’t use this manual effectively.

CSV data

$ head data.csv
Number,delay
1,0.02
26,0.07
127,0.08
21732,0.09
595,0.1
146,0.11
105,0.12
1242,0.13
219,0.14

We need the header line at the top, because pd.read_csv reads the first line as the column labels.
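What happens without the header line can be seen in the sketch below. It uses the first two data rows of the sample above as in-memory text. `header=None` with `names=` is one way to supply the labels when the file has no header.

```python
import io
import pandas as pd

with_header = "Number,delay\n1,0.02\n26,0.07\n"
without_header = "1,0.02\n26,0.07\n"

df = pd.read_csv(io.StringIO(with_header))
print(list(df.columns))   # ['Number', 'delay']

# Without a header line, the first data row is consumed as the labels.
bad = pd.read_csv(io.StringIO(without_header))
print(list(bad.columns))  # ['1', '0.02']
print(len(bad.index))     # 1, one data row was lost

# header=None plus names= supplies the labels explicitly.
fixed = pd.read_csv(
    io.StringIO(without_header), header=None, names=["Number", "delay"]
)
print(len(fixed.index))   # 2
```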

Pandas

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("./delay-new.csv", sep=",")
print(df.keys())

## df.plot(kind='scatter',x='delay',y='Number',color='blue')
## plt.savefig('sample.png')

$ python plot-delay.py
Index(['Number', 'delay'], dtype='object')