Pandas DataFrame について

Pythonによるデータ分析入門 ―NumPy、pandasを使ったデータ処理

Pythonによるデータ分析入門 ―NumPy、pandasを使ったデータ処理

PandasのDataFrameは行列形式のデータ構造で、indexやcolumnにkeyが指定できたり*1、indexやcolumnをまとめたものにnameを付けられたりする。DataFrameに対する集計関数や条件選択などはある程度直感的に操作できるが、Python標準のデータ構造から階層的なindexを持ったDataFrameを作成する場合や、階層的なindexを持ったDataFrameをどのように自由に変換できるのか、などが分かりづらかったので色々試してみた。

Nested dictを元にDataFrameを作成する場合、一番外側のkeyがcolumn、次に内側のkeyがindex、となってDataFrameを作成してくれる。ただし、tupleをkeyに使ったdictを元にDataFrameを作成する場合は、indexが無いことになるので、indexを指定しないとValueError: If using all scalar values, you must pass an indexというエラーになる。そのため、indexを指定するか、あるいは値をscalarではなくlistとするか、が必要になる(df5とdf6の違い)。*2

gist.github.com

$ python dataframe_run.py 
df1
      col1  col2  col3
idx1     0     4     8
idx2     1     5     9
idx3     2     6    10
idx4     3     7    11

df2
      col1  col2  col3
idx1     0     4     8
idx2     1     5     9
idx3     2     6    10
idx4     3     7    11

df1 == df2
      col1  col2  col3
idx1  True  True  True
idx2  True  True  True
idx3  True  True  True
idx4  True  True  True

df3(MultiIndex) #1
                      col1  col2  col3
idx1-1 idx2-1 idx3-1     0     1     2
idx1-2 idx2-2 idx3-2     3     4     5
       idx2-3 idx3-3     6     7     8
              idx3-4     9    10    11

df3(MultiIndex) #2: Append column
                      col1  col2  col3  col4
idx1-1 idx2-1 idx3-1     0     1     2   100
idx1-2 idx2-2 idx3-2     3     4     5   100
       idx2-3 idx3-3     6     7     8   100
              idx3-4     9    10    11   100

df3(MultiIndex) #3: Access to scalar value
10

df3(MultiIndex) #4: Access to scalar by ix property
11

df3(MultiIndex) #5: Access to record by ix with condtions
                      col1  col3
idx1-2 idx2-2 idx3-2     3     5
       idx2-3 idx3-3     6     8

df3(MultiIndex) #6: Access to record by ix with conds specified by isin method
                      col2  col3
idx1-1 idx2-1 idx3-1     1     2
idx1-2 idx2-3 idx3-3     7     8

df3(MultiIndex) #7: Pivot table with fillna method
col3  2   5   8   11
col2                
1      0  -1  -1  -1
4     -1   3  -1  -1
7     -1  -1   6  -1
10    -1  -1  -1   9

df3(MultiInex) #8: Stack method
idx1-1  idx2-1  idx3-1  col1      0
                        col2      1
                        col3      2
                        col4    100
idx1-2  idx2-2  idx3-2  col1      3
                        col2      4
                        col3      5
                        col4    100
        idx2-3  idx3-3  col1      6
                        col2      7
                        col3      8
                        col4    100
                idx3-4  col1      9
                        col2     10
                        col3     11
                        col4    100
dtype: int64

df3(MultiIndex) #9: Stack and unstack
                      col1  col2  col3  col4
idx1-1 idx2-1 idx3-1     0     1     2   100
idx1-2 idx2-2 idx3-2     3     4     5   100
       idx2-3 idx3-3     6     7     8   100
              idx3-4     9    10    11   100

df3(MultiIndex) #10: Unstack method
                col1                        col2                        col3  \
              idx3-1 idx3-2 idx3-3 idx3-4 idx3-1 idx3-2 idx3-3 idx3-4 idx3-1   
idx1-1 idx2-1      0    NaN    NaN    NaN      1    NaN    NaN    NaN      2   
idx1-2 idx2-2    NaN      3    NaN    NaN    NaN      4    NaN    NaN    NaN   
       idx2-3    NaN    NaN      6      9    NaN    NaN      7     10    NaN   

                                     col4                       
              idx3-2 idx3-3 idx3-4 idx3-1 idx3-2 idx3-3 idx3-4  
idx1-1 idx2-1    NaN    NaN    NaN    100    NaN    NaN    NaN  
idx1-2 idx2-2      5    NaN    NaN    NaN    100    NaN    NaN  
       idx2-3    NaN      8     11    NaN    NaN    100    100  

df3(MultiIndex) #11: Nested unstacks
         col1                                                                 \
       idx3-1               idx3-2               idx3-3               idx3-4   
       idx2-1 idx2-2 idx2-3 idx2-1 idx2-2 idx2-3 idx2-1 idx2-2 idx2-3 idx2-1   
idx1-1      0    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
idx1-2    NaN    NaN    NaN    NaN      3    NaN    NaN    NaN      6    NaN   

        ...     col4                                                          \
        ...   idx3-1 idx3-2               idx3-3               idx3-4          
        ...   idx2-3 idx2-1 idx2-2 idx2-3 idx2-1 idx2-2 idx2-3 idx2-1 idx2-2   
idx1-1  ...      NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
idx1-2  ...      NaN    NaN    100    NaN    NaN    NaN    100    NaN    NaN   

               
               
       idx2-3  
idx1-1    NaN  
idx1-2    100  

[2 rows x 48 columns]

df3(MultiIndex) #12: Swaplevel method
                      col1  col2  col3  col4
idx3-1 idx2-1 idx1-1     0     1     2   100
idx3-2 idx2-2 idx1-2     3     4     5   100
idx3-3 idx2-3 idx1-2     6     7     8   100
idx3-4 idx2-3 idx1-2     9    10    11   100

df4(Nested dict)
                             a1                              a2  \
b1  {'c3': 2, 'c1': 0, 'c2': 1}   {'c3': 11, 'c1': 9, 'c2': 10}   
b2  {'c3': 5, 'c1': 3, 'c2': 4}  {'c3': 14, 'c1': 12, 'c2': 13}   
b3  {'c3': 8, 'c1': 6, 'c2': 7}  {'c3': 17, 'c1': 15, 'c2': 16}   

                                a3  
b1  {'c3': 20, 'c1': 18, 'c2': 19}  
b2  {'c3': 23, 'c1': 21, 'c2': 22}  
b3  {'c3': 26, 'c1': 24, 'c2': 25}  

df5(Dict with tuple key) #1
 a1                         a2 ...      a3                                
 b1       b2       b3       b1 ...  b3  b1          b2          b3        
 c1 c2 c3 c1 c2 c3 c1 c2 c3 c1 ...  c3  c1  c2  c3  c1  c2  c3  c1  c2  c3
  0  1  2  3  4  5  6  7  8  9 ...  17  18  19  20  21  22  23  24  25  26

[1 rows x 27 columns]

df5(Dict with tuple key) #2: Stack method
    a1        a2          a3        
    b1 b2 b3  b1  b2  b3  b1  b2  b3
 c1  0  3  6   9  12  15  18  21  24
 c2  1  4  7  10  13  16  19  22  25
 c3  2  5  8  11  14  17  20  23  26

df5(Dict with tuple key) #3: Nested stacks
  c1  b1  a1     0
          a2     9
          a3    18
      b2  a1     3
          a2    12
          a3    21
      b3  a1     6
          a2    15
          a3    24
  c2  b1  a1     1
          a2    10
          a3    19
      b2  a1     4
          a2    13
          a3    22
      b3  a1     7
          a2    16
          a3    25
  c3  b1  a1     2
          a2    11
          a3    20
      b2  a1     5
          a2    14
          a3    23
      b3  a1     8
          a2    17
          a3    26
dtype: int64

df5(Dict with tuple key) #4: Nested stacks and swaplevel
  a1  b1  c1     0
  a2  b1  c1     9
  a3  b1  c1    18
  a1  b2  c1     3
  a2  b2  c1    12
  a3  b2  c1    21
  a1  b3  c1     6
  a2  b3  c1    15
  a3  b3  c1    24
  a1  b1  c2     1
  a2  b1  c2    10
  a3  b1  c2    19
  a1  b2  c2     4
  a2  b2  c2    13
  a3  b2  c2    22
  a1  b3  c2     7
  a2  b3  c2    16
  a3  b3  c2    25
  a1  b1  c3     2
  a2  b1  c3    11
  a3  b1  c3    20
  a1  b2  c3     5
  a2  b2  c3    14
  a3  b2  c3    23
  a1  b3  c3     8
  a2  b3  c3    17
  a3  b3  c3    26
dtype: int64

df5(Dict with tuple key) #5: Nested stacks, swaplevel and unstack
        c1  c2  c3
 a1 b1   0   1   2
    b2   3   4   5
    b3   6   7   8
 a2 b1   9  10  11
    b2  12  13  14
    b3  15  16  17
 a3 b1  18  19  20
    b2  21  22  23
    b3  24  25  26

df5(Dict with tuple key) #6: Swaplevel with axis=1
 c1 c2 c3 c1 c2 c3 c1 c2 c3 c1 ...  c3  c1  c2  c3  c1  c2  c3  c1  c2  c3
 b1 b1 b1 b2 b2 b2 b3 b3 b3 b1 ...  b3  b1  b1  b1  b2  b2  b2  b3  b3  b3
 a1 a1 a1 a1 a1 a1 a1 a1 a1 a2 ...  a2  a3  a3  a3  a3  a3  a3  a3  a3  a3
  0  1  2  3  4  5  6  7  8  9 ...  17  18  19  20  21  22  23  24  25  26

[1 rows x 27 columns]

df6(Dict with tuple key and list values)
  a1                         a2 ...      a3                                
  b1       b2       b3       b1 ...  b3  b1          b2          b3        
  c1 c2 c3 c1 c2 c3 c1 c2 c3 c1 ...  c3  c1  c2  c3  c1  c2  c3  c1  c2  c3
0  0  1  2  3  4  5  6  7  8  9 ...  17  18  19  20  21  22  23  24  25  26

[1 rows x 27 columns]

*1:たぶんこちらをlabelと呼ぶのだと思うがnameとlabelが混同するのを避ける意味でkeyとしている

*2:data5には隠れindexがあるので、swaplevel()でのlevelの指定が直感より1つ大きい