df.info() #basic information about dataframe
len(df.index) #return the number of rows (data)
df.count() #return the number of values which are non-NaN on each column
df.head()
df.tail()
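As a quick sketch of what these inspection calls return, here is a small example. The sample data is made up for illustration and is not from the original dataset.

```python
import pandas as pd

# Hypothetical sample data, not from the original notes.
df = pd.DataFrame({
    "Product": ["A", "B", "A", "C"],
    "Score": [5.0, None, 3.0, 4.0],
})

df.info()                 # prints dtypes, non-null counts and memory usage
n_rows = len(df.index)    # number of rows, rows with NaN included
non_null = df.count()     # Series with the non-NaN count of each column
first_two = df.head(2)    # first 2 rows (the default is 5)
last_two = df.tail(2)     # last 2 rows (the default is 5)
```

Note that len(df.index) counts every row, while df.count() skips NaN, so the two can differ per column.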
In this example, the column is “Product”.
df["Product"].value_counts()
Get the unique values of a Series.
df["Product"].unique()
# the return type is numpy.ndarray
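A minimal sketch of the difference between the two calls. It uses a made-up integer Series rather than the real "Product" column, because the ndarray return type is only certain for NumPy-backed columns.

```python
import numpy as np
import pandas as pd

# Hypothetical integer column for illustration.
scores = pd.Series([5, 4, 5, 1, 5], name="Score")

counts = scores.value_counts()  # Series: value -> number of occurrences
uniques = scores.unique()       # distinct values in order of first appearance

top_count = counts.loc[5]       # how many times the value 5 appears
```

value_counts() is handy for checking class balance, while unique() only tells you which values exist.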
# Check the data distribution
# The column is Score
ax = df["Score"].value_counts().plot(kind='bar')
fig = ax.get_figure()
inplace
We can develop our code faster without inplace.
https://stackoverflow.com/questions/43893457/understanding-inplace-true
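A small sketch of why code without inplace is easier to work with: each call returns a new object, so calls can be chained and the original stays available for re-running a step. The sample data is hypothetical.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({"Score": [3.0, None, 5.0]})

# Without inplace: each call returns a new object, so calls chain
# and the original df is left untouched.
cleaned = df.dropna().reset_index(drop=True)

# With inplace=True: the object itself is modified, so keep a copy
# if the original is still needed. Do not chain on the return value.
work = df.copy()
work.dropna(inplace=True)
```

Here df still has all three rows, while cleaned and work both have two.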
https://stackoverflow.com/questions/1541797/how-do-i-check-if-there-are-duplicates-in-a-flat-list
len(your_list) != len(set(your_list))
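The one-liner above can be wrapped in a helper. The function name has_duplicates is my own, and the approach assumes every element is hashable.

```python
def has_duplicates(items):
    """Return True if any element appears more than once.

    Assumes every element is hashable, because set() requires it.
    """
    return len(items) != len(set(items))


with_dupes = has_duplicates([1, 2, 2, 3])
without_dupes = has_duplicates(["a", "b", "c"])
```

A set drops repeated elements, so a length mismatch means at least one duplicate exists.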
import pandas as pd
df = pd.read_csv("/path/to/csvfile.csv", sep=",")
The first row in the CSV file should be the names of the columns.
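A minimal sketch of how the header row is handled. It reads from an in-memory string instead of a real file, and the column names and values are made up.

```python
import io

import pandas as pd

# With a header row: the first line becomes the column labels.
with_header = pd.read_csv(io.StringIO("Text,Score\ngood,5\nbad,1\n"), sep=",")

# Without a header row: pass header=None and supply the names,
# otherwise the first data row would be consumed as labels.
no_header = pd.read_csv(
    io.StringIO("good,5\nbad,1\n"),
    sep=",",
    header=None,
    names=["Text", "Score"],
)
```

Both frames end up with the columns Text and Score and two data rows.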
Drop a row if either Text or Score is NaN.
df = df[['Text','Score']].dropna()
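To make the dropna() behavior concrete, here is a sketch on made-up data. By default dropna() uses how="any", so a row is removed when at least one of the selected columns is NaN.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    "Text": ["good", None, "bad", None],
    "Score": [5.0, 4.0, None, None],
    "Other": [1, 2, 3, 4],
})

# Default how="any": drop a row if Text OR Score is missing.
any_missing = df[["Text", "Score"]].dropna()

# how="all": drop a row only if Text AND Score are both missing.
all_missing = df[["Text", "Score"]].dropna(how="all")
```

Only the first row survives the default call, while how="all" keeps the first three rows.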
For the remaining (non-NaN) scores, replace the values using inplace.
df['Score'].replace([1,2,3], 0, inplace=True)
df['Score'].replace([4,5], 1, inplace=True)
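As an alternative sketch, the same mapping can be written by assigning the result back to the column. As far as I know, calling replace with inplace=True on a selected column can emit warnings or leave df unchanged in newer pandas versions, so the assignment form is a safer assumption. The sample data is made up.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({"Score": [1, 2, 3, 4, 5]})

# Map scores 1-3 to label 0 and scores 4-5 to label 1.
# The result is assigned back instead of relying on inplace=True.
df["Score"] = df["Score"].replace([1, 2, 3], 0).replace([4, 5], 1)

labels = df["Score"].tolist()
```

The order of the two replace calls is safe here because the first one only produces 0, which the second one does not touch.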
Truncate the data to index 0 through 100,000.
df = df.truncate(before=0, after=100000)
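One thing to keep in mind: truncate() works on index labels and includes both end points, so after dropna() has left gaps in the index it may keep fewer rows than expected. A sketch on made-up data, with iloc shown as a positional alternative:

```python
import pandas as pd

# Hypothetical frame whose index has gaps, as happens after dropna().
df = pd.DataFrame({"Score": [1, 0, 1, 1, 0]}, index=[0, 2, 5, 7, 9])

# Label based: keeps rows whose index label is between 0 and 5, inclusive.
by_label = df.truncate(before=0, after=5)

# Position based: keeps the first 4 rows regardless of their labels.
by_position = df.iloc[:4]
```

If an exact number of rows is wanted, the positional form is probably the better fit.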
ax = df.Score.value_counts().plot(kind='bar')
fig = ax.get_figure()
from sklearn.model_selection import train_test_split
train_data, dev_data = train_test_split(preprocessed_data, test_size=0.33, random_state=42)
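A self-contained sketch of the split. The notes do not define preprocessed_data, so here it is assumed to be a DataFrame with Text and Score columns, filled with made-up rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Assumption: preprocessed_data is a DataFrame like the cleaned df above.
preprocessed_data = pd.DataFrame({
    "Text": [f"review {i}" for i in range(100)],
    "Score": [i % 2 for i in range(100)],
})

# Hold out about a third of the rows as the dev set.
# random_state fixes the shuffle so the split is reproducible.
train_data, dev_data = train_test_split(
    preprocessed_data, test_size=0.33, random_state=42
)

# Row labels shared by both parts; should be empty.
overlap = set(train_data.index) & set(dev_data.index)
```

Every row ends up in exactly one of the two parts, and the original index labels are kept.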
I realize that this is not a good reference for me. I can’t use this manual effectively.
$ head data.csv
Number,delay
1,0.02
26,0.07
127,0.08
21732,0.09
595,0.1
146,0.11
105,0.12
1242,0.13
219,0.14
We need the header line at the top because pd.read_csv reads the first line as column labels.
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv("./delay-new.csv", sep=",")
print(df.keys())
## df.plot(kind='scatter',x='delay',y='Number',color='blue')
## plt.savefig('sample.png')
$ python plot-delay.py
Index(['Number', 'delay'], dtype='object')
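For reference, a sketch of the scatter plot with the commented-out lines enabled. It reads a few rows of the data above from an in-memory string, uses the non-interactive Agg backend, and writes to a temporary directory. Those three choices are my assumptions, not part of the original script.

```python
import io
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no display needed
import matplotlib.pyplot as plt
import pandas as pd

# A few rows from the CSV shown above, read from memory instead of a file.
csv_text = "Number,delay\n1,0.02\n26,0.07\n127,0.08\n21732,0.09\n"
df = pd.read_csv(io.StringIO(csv_text), sep=",")

# Scatter plot of Number against delay, using the header names as axes.
ax = df.plot(kind="scatter", x="delay", y="Number", color="blue")

# Assumed output location: a temporary directory.
out_path = os.path.join(tempfile.gettempdir(), "sample.png")
ax.get_figure().savefig(out_path)
plt.close(ax.get_figure())
```

The x and y arguments must match the labels read from the header line, which is why the header is required.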