Pandas DataFrame について

Pythonによるデータ分析入門 ―NumPy、pandasを使ったデータ処理
- 作者: Wes McKinney,小林儀匡,鈴木宏尚,瀬戸山雅人,滝口開資,野上大介
- 出版社/メーカー: オライリージャパン
- 発売日: 2013/12/26
- メディア: 大型本
- この商品を含むブログ (12件) を見る
PandasのDataFrameは行列形式のデータ構造で、indexやcolumnにkeyが指定できたり*1、indexやcolumnをまとめたものにnameを付けられたりする。DataFrameに対する集計関数や条件選択などはある程度直感的に操作できるが、Python標準のデータ構造から階層的なindexを持ったDataFrameを作成する場合や、階層的なindexを持ったDataFrameをどのように自由に変換できるのか、などが分かりづらかったので色々試してみた。
Nested dictを元にDataFrameを作成する場合、一番外側のkeyがcolumn、次に内側のkeyがindex、となってDataFrameを作成してくれる。ただし、tupleをkeyに使ったdictを元にDataFrameを作成する場合は、indexが無いことになるので、indexを指定しないとValueError: If using all scalar values, you must pass an indexというエラーになる。そのため、indexを指定するか、あるいは値をscalarではなくlistとするか、が必要になる(df5とdf6の違い)。*2
$ python dataframe_run.py
df1
col1 col2 col3
idx1 0 4 8
idx2 1 5 9
idx3 2 6 10
idx4 3 7 11
df2
col1 col2 col3
idx1 0 4 8
idx2 1 5 9
idx3 2 6 10
idx4 3 7 11
df1 == df2
col1 col2 col3
idx1 True True True
idx2 True True True
idx3 True True True
idx4 True True True
df3(MultiIndex) #1
col1 col2 col3
idx1-1 idx2-1 idx3-1 0 1 2
idx1-2 idx2-2 idx3-2 3 4 5
idx2-3 idx3-3 6 7 8
idx3-4 9 10 11
df3(MultiIndex) #2: Append column
col1 col2 col3 col4
idx1-1 idx2-1 idx3-1 0 1 2 100
idx1-2 idx2-2 idx3-2 3 4 5 100
idx2-3 idx3-3 6 7 8 100
idx3-4 9 10 11 100
df3(MultiIndex) #3: Access to scalar value
10
df3(MultiIndex) #4: Access to scalar by ix property
11
df3(MultiIndex) #5: Access to record by ix with condtions
col1 col3
idx1-2 idx2-2 idx3-2 3 5
idx2-3 idx3-3 6 8
df3(MultiIndex) #6: Access to record by ix with conds specified by isin method
col2 col3
idx1-1 idx2-1 idx3-1 1 2
idx1-2 idx2-3 idx3-3 7 8
df3(MultiIndex) #7: Pivot table with fillna method
col3 2 5 8 11
col2
1 0 -1 -1 -1
4 -1 3 -1 -1
7 -1 -1 6 -1
10 -1 -1 -1 9
df3(MultiInex) #8: Stack method
idx1-1 idx2-1 idx3-1 col1 0
col2 1
col3 2
col4 100
idx1-2 idx2-2 idx3-2 col1 3
col2 4
col3 5
col4 100
idx2-3 idx3-3 col1 6
col2 7
col3 8
col4 100
idx3-4 col1 9
col2 10
col3 11
col4 100
dtype: int64
df3(MultiIndex) #9: Stack and unstack
col1 col2 col3 col4
idx1-1 idx2-1 idx3-1 0 1 2 100
idx1-2 idx2-2 idx3-2 3 4 5 100
idx2-3 idx3-3 6 7 8 100
idx3-4 9 10 11 100
df3(MultiIndex) #10: Unstack method
col1 col2 col3 \
idx3-1 idx3-2 idx3-3 idx3-4 idx3-1 idx3-2 idx3-3 idx3-4 idx3-1
idx1-1 idx2-1 0 NaN NaN NaN 1 NaN NaN NaN 2
idx1-2 idx2-2 NaN 3 NaN NaN NaN 4 NaN NaN NaN
idx2-3 NaN NaN 6 9 NaN NaN 7 10 NaN
col4
idx3-2 idx3-3 idx3-4 idx3-1 idx3-2 idx3-3 idx3-4
idx1-1 idx2-1 NaN NaN NaN 100 NaN NaN NaN
idx1-2 idx2-2 5 NaN NaN NaN 100 NaN NaN
idx2-3 NaN 8 11 NaN NaN 100 100
df3(MultiIndex) #11: Nested unstacks
col1 \
idx3-1 idx3-2 idx3-3 idx3-4
idx2-1 idx2-2 idx2-3 idx2-1 idx2-2 idx2-3 idx2-1 idx2-2 idx2-3 idx2-1
idx1-1 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
idx1-2 NaN NaN NaN NaN 3 NaN NaN NaN 6 NaN
... col4 \
... idx3-1 idx3-2 idx3-3 idx3-4
... idx2-3 idx2-1 idx2-2 idx2-3 idx2-1 idx2-2 idx2-3 idx2-1 idx2-2
idx1-1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN
idx1-2 ... NaN NaN 100 NaN NaN NaN 100 NaN NaN
idx2-3
idx1-1 NaN
idx1-2 100
[2 rows x 48 columns]
df3(MultiIndex) #12: Swaplevel method
col1 col2 col3 col4
idx3-1 idx2-1 idx1-1 0 1 2 100
idx3-2 idx2-2 idx1-2 3 4 5 100
idx3-3 idx2-3 idx1-2 6 7 8 100
idx3-4 idx2-3 idx1-2 9 10 11 100
df4(Nested dict)
a1 a2 \
b1 {'c3': 2, 'c1': 0, 'c2': 1} {'c3': 11, 'c1': 9, 'c2': 10}
b2 {'c3': 5, 'c1': 3, 'c2': 4} {'c3': 14, 'c1': 12, 'c2': 13}
b3 {'c3': 8, 'c1': 6, 'c2': 7} {'c3': 17, 'c1': 15, 'c2': 16}
a3
b1 {'c3': 20, 'c1': 18, 'c2': 19}
b2 {'c3': 23, 'c1': 21, 'c2': 22}
b3 {'c3': 26, 'c1': 24, 'c2': 25}
df5(Dict with tuple key) #1
a1 a2 ... a3
b1 b2 b3 b1 ... b3 b1 b2 b3
c1 c2 c3 c1 c2 c3 c1 c2 c3 c1 ... c3 c1 c2 c3 c1 c2 c3 c1 c2 c3
0 1 2 3 4 5 6 7 8 9 ... 17 18 19 20 21 22 23 24 25 26
[1 rows x 27 columns]
df5(Dict with tuple key) #2: Stack method
a1 a2 a3
b1 b2 b3 b1 b2 b3 b1 b2 b3
c1 0 3 6 9 12 15 18 21 24
c2 1 4 7 10 13 16 19 22 25
c3 2 5 8 11 14 17 20 23 26
df5(Dict with tuple key) #3: Nested stacks
c1 b1 a1 0
a2 9
a3 18
b2 a1 3
a2 12
a3 21
b3 a1 6
a2 15
a3 24
c2 b1 a1 1
a2 10
a3 19
b2 a1 4
a2 13
a3 22
b3 a1 7
a2 16
a3 25
c3 b1 a1 2
a2 11
a3 20
b2 a1 5
a2 14
a3 23
b3 a1 8
a2 17
a3 26
dtype: int64
df5(Dict with tuple key) #4: Nested stacks and swaplevel
a1 b1 c1 0
a2 b1 c1 9
a3 b1 c1 18
a1 b2 c1 3
a2 b2 c1 12
a3 b2 c1 21
a1 b3 c1 6
a2 b3 c1 15
a3 b3 c1 24
a1 b1 c2 1
a2 b1 c2 10
a3 b1 c2 19
a1 b2 c2 4
a2 b2 c2 13
a3 b2 c2 22
a1 b3 c2 7
a2 b3 c2 16
a3 b3 c2 25
a1 b1 c3 2
a2 b1 c3 11
a3 b1 c3 20
a1 b2 c3 5
a2 b2 c3 14
a3 b2 c3 23
a1 b3 c3 8
a2 b3 c3 17
a3 b3 c3 26
dtype: int64
df5(Dict with tuple key) #5: Nested stacks, swaplevel and unstack
c1 c2 c3
a1 b1 0 1 2
b2 3 4 5
b3 6 7 8
a2 b1 9 10 11
b2 12 13 14
b3 15 16 17
a3 b1 18 19 20
b2 21 22 23
b3 24 25 26
df5(Dict with tuple key) #6: Swaplevel with axis=1
c1 c2 c3 c1 c2 c3 c1 c2 c3 c1 ... c3 c1 c2 c3 c1 c2 c3 c1 c2 c3
b1 b1 b1 b2 b2 b2 b3 b3 b3 b1 ... b3 b1 b1 b1 b2 b2 b2 b3 b3 b3
a1 a1 a1 a1 a1 a1 a1 a1 a1 a2 ... a2 a3 a3 a3 a3 a3 a3 a3 a3 a3
0 1 2 3 4 5 6 7 8 9 ... 17 18 19 20 21 22 23 24 25 26
[1 rows x 27 columns]
df6(Dict with tuple key and list values)
a1 a2 ... a3
b1 b2 b3 b1 ... b3 b1 b2 b3
c1 c2 c3 c1 c2 c3 c1 c2 c3 c1 ... c3 c1 c2 c3 c1 c2 c3 c1 c2 c3
0 0 1 2 3 4 5 6 7 8 9 ... 17 18 19 20 21 22 23 24 25 26
[1 rows x 27 columns]