Pandas DataFrame について
Pythonによるデータ分析入門 ―NumPy、pandasを使ったデータ処理
- 作者: Wes McKinney,小林儀匡,鈴木宏尚,瀬戸山雅人,滝口開資,野上大介
- 出版社/メーカー: オライリージャパン
- 発売日: 2013/12/26
- メディア: 大型本
- この商品を含むブログ (12件) を見る
PandasのDataFrameは行列形式のデータ構造で、indexやcolumnにkeyが指定できたり*1、indexやcolumnをまとめたものにnameを付けられたりする。DataFrameに対する集計関数や条件選択などはある程度直感的に操作できるが、Python標準のデータ構造から階層的なindexを持ったDataFrameを作成する場合や、階層的なindexを持ったDataFrameをどのように自由に変換できるのか、などが分かりづらかったので色々試してみた。
Nested dictを元にDataFrameを作成する場合、一番外側のkeyがcolumn、次に内側のkeyがindex、となってDataFrameを作成してくれる。ただし、tupleをkeyに使ったdictを元にDataFrameを作成する場合は、indexが無いことになるので、indexを指定しないとValueError: If using all scalar values, you must pass an index
というエラーになる。そのため、indexを指定するか、あるいは値をscalarではなくlistとするか、が必要になる(df5とdf6の違い)。*2
$ python dataframe_run.py df1 col1 col2 col3 idx1 0 4 8 idx2 1 5 9 idx3 2 6 10 idx4 3 7 11 df2 col1 col2 col3 idx1 0 4 8 idx2 1 5 9 idx3 2 6 10 idx4 3 7 11 df1 == df2 col1 col2 col3 idx1 True True True idx2 True True True idx3 True True True idx4 True True True df3(MultiIndex) #1 col1 col2 col3 idx1-1 idx2-1 idx3-1 0 1 2 idx1-2 idx2-2 idx3-2 3 4 5 idx2-3 idx3-3 6 7 8 idx3-4 9 10 11 df3(MultiIndex) #2: Append column col1 col2 col3 col4 idx1-1 idx2-1 idx3-1 0 1 2 100 idx1-2 idx2-2 idx3-2 3 4 5 100 idx2-3 idx3-3 6 7 8 100 idx3-4 9 10 11 100 df3(MultiIndex) #3: Access to scalar value 10 df3(MultiIndex) #4: Access to scalar by ix property 11 df3(MultiIndex) #5: Access to record by ix with condtions col1 col3 idx1-2 idx2-2 idx3-2 3 5 idx2-3 idx3-3 6 8 df3(MultiIndex) #6: Access to record by ix with conds specified by isin method col2 col3 idx1-1 idx2-1 idx3-1 1 2 idx1-2 idx2-3 idx3-3 7 8 df3(MultiIndex) #7: Pivot table with fillna method col3 2 5 8 11 col2 1 0 -1 -1 -1 4 -1 3 -1 -1 7 -1 -1 6 -1 10 -1 -1 -1 9 df3(MultiInex) #8: Stack method idx1-1 idx2-1 idx3-1 col1 0 col2 1 col3 2 col4 100 idx1-2 idx2-2 idx3-2 col1 3 col2 4 col3 5 col4 100 idx2-3 idx3-3 col1 6 col2 7 col3 8 col4 100 idx3-4 col1 9 col2 10 col3 11 col4 100 dtype: int64 df3(MultiIndex) #9: Stack and unstack col1 col2 col3 col4 idx1-1 idx2-1 idx3-1 0 1 2 100 idx1-2 idx2-2 idx3-2 3 4 5 100 idx2-3 idx3-3 6 7 8 100 idx3-4 9 10 11 100 df3(MultiIndex) #10: Unstack method col1 col2 col3 \ idx3-1 idx3-2 idx3-3 idx3-4 idx3-1 idx3-2 idx3-3 idx3-4 idx3-1 idx1-1 idx2-1 0 NaN NaN NaN 1 NaN NaN NaN 2 idx1-2 idx2-2 NaN 3 NaN NaN NaN 4 NaN NaN NaN idx2-3 NaN NaN 6 9 NaN NaN 7 10 NaN col4 idx3-2 idx3-3 idx3-4 idx3-1 idx3-2 idx3-3 idx3-4 idx1-1 idx2-1 NaN NaN NaN 100 NaN NaN NaN idx1-2 idx2-2 5 NaN NaN NaN 100 NaN NaN idx2-3 NaN 8 11 NaN NaN 100 100 df3(MultiIndex) #11: Nested unstacks col1 \ idx3-1 idx3-2 idx3-3 idx3-4 idx2-1 idx2-2 idx2-3 idx2-1 idx2-2 idx2-3 idx2-1 idx2-2 idx2-3 idx2-1 idx1-1 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN idx1-2 NaN NaN NaN NaN 3 NaN NaN NaN 6 NaN ... col4 \ ... idx3-1 idx3-2 idx3-3 idx3-4 ... idx2-3 idx2-1 idx2-2 idx2-3 idx2-1 idx2-2 idx2-3 idx2-1 idx2-2 idx1-1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN idx1-2 ... NaN NaN 100 NaN NaN NaN 100 NaN NaN idx2-3 idx1-1 NaN idx1-2 100 [2 rows x 48 columns] df3(MultiIndex) #12: Swaplevel method col1 col2 col3 col4 idx3-1 idx2-1 idx1-1 0 1 2 100 idx3-2 idx2-2 idx1-2 3 4 5 100 idx3-3 idx2-3 idx1-2 6 7 8 100 idx3-4 idx2-3 idx1-2 9 10 11 100 df4(Nested dict) a1 a2 \ b1 {'c3': 2, 'c1': 0, 'c2': 1} {'c3': 11, 'c1': 9, 'c2': 10} b2 {'c3': 5, 'c1': 3, 'c2': 4} {'c3': 14, 'c1': 12, 'c2': 13} b3 {'c3': 8, 'c1': 6, 'c2': 7} {'c3': 17, 'c1': 15, 'c2': 16} a3 b1 {'c3': 20, 'c1': 18, 'c2': 19} b2 {'c3': 23, 'c1': 21, 'c2': 22} b3 {'c3': 26, 'c1': 24, 'c2': 25} df5(Dict with tuple key) #1 a1 a2 ... a3 b1 b2 b3 b1 ... b3 b1 b2 b3 c1 c2 c3 c1 c2 c3 c1 c2 c3 c1 ... c3 c1 c2 c3 c1 c2 c3 c1 c2 c3 0 1 2 3 4 5 6 7 8 9 ... 17 18 19 20 21 22 23 24 25 26 [1 rows x 27 columns] df5(Dict with tuple key) #2: Stack method a1 a2 a3 b1 b2 b3 b1 b2 b3 b1 b2 b3 c1 0 3 6 9 12 15 18 21 24 c2 1 4 7 10 13 16 19 22 25 c3 2 5 8 11 14 17 20 23 26 df5(Dict with tuple key) #3: Nested stacks c1 b1 a1 0 a2 9 a3 18 b2 a1 3 a2 12 a3 21 b3 a1 6 a2 15 a3 24 c2 b1 a1 1 a2 10 a3 19 b2 a1 4 a2 13 a3 22 b3 a1 7 a2 16 a3 25 c3 b1 a1 2 a2 11 a3 20 b2 a1 5 a2 14 a3 23 b3 a1 8 a2 17 a3 26 dtype: int64 df5(Dict with tuple key) #4: Nested stacks and swaplevel a1 b1 c1 0 a2 b1 c1 9 a3 b1 c1 18 a1 b2 c1 3 a2 b2 c1 12 a3 b2 c1 21 a1 b3 c1 6 a2 b3 c1 15 a3 b3 c1 24 a1 b1 c2 1 a2 b1 c2 10 a3 b1 c2 19 a1 b2 c2 4 a2 b2 c2 13 a3 b2 c2 22 a1 b3 c2 7 a2 b3 c2 16 a3 b3 c2 25 a1 b1 c3 2 a2 b1 c3 11 a3 b1 c3 20 a1 b2 c3 5 a2 b2 c3 14 a3 b2 c3 23 a1 b3 c3 8 a2 b3 c3 17 a3 b3 c3 26 dtype: int64 df5(Dict with tuple key) #5: Nested stacks, swaplevel and unstack c1 c2 c3 a1 b1 0 1 2 b2 3 4 5 b3 6 7 8 a2 b1 9 10 11 b2 12 13 14 b3 15 16 17 a3 b1 18 19 20 b2 21 22 23 b3 24 25 26 df5(Dict with tuple key) #6: Swaplevel with axis=1 c1 c2 c3 c1 c2 c3 c1 c2 c3 c1 ... c3 c1 c2 c3 c1 c2 c3 c1 c2 c3 b1 b1 b1 b2 b2 b2 b3 b3 b3 b1 ... b3 b1 b1 b1 b2 b2 b2 b3 b3 b3 a1 a1 a1 a1 a1 a1 a1 a1 a1 a2 ... a2 a3 a3 a3 a3 a3 a3 a3 a3 a3 0 1 2 3 4 5 6 7 8 9 ... 17 18 19 20 21 22 23 24 25 26 [1 rows x 27 columns] df6(Dict with tuple key and list values) a1 a2 ... a3 b1 b2 b3 b1 ... b3 b1 b2 b3 c1 c2 c3 c1 c2 c3 c1 c2 c3 c1 ... c3 c1 c2 c3 c1 c2 c3 c1 c2 c3 0 0 1 2 3 4 5 6 7 8 9 ... 17 18 19 20 21 22 23 24 25 26 [1 rows x 27 columns]