2016-03-10

Pandas DataFrame について

作者: Wes McKinney,小林儀匡,鈴木宏尚,瀬戸山雅人,滝口開資,野上大介
出版社/メーカー: オライリージャパン
発売日: 2013/12/26
メディア: 大型本
この商品を含むブログ (12件) を見る

PandasのDataFrameは行列形式のデータ構造で、indexやcolumnにkeyが指定できたり*1、indexやcolumnをまとめたものにnameを付けられたりする。DataFrameに対する集計関数や条件選択などはある程度直感的に操作できるが、Python標準のデータ構造から階層的なindexを持ったDataFrameを作成する場合や、階層的なindexを持ったDataFrameをどのように自由に変換できるのか、などが分かりづらかったので色々試してみた。

Nested dictを元にDataFrameを作成する場合、一番外側のkeyがcolumn、次に内側のkeyがindex、となってDataFrameを作成してくれる。ただし、tupleをkeyに使ったdictを元にDataFrameを作成する場合は、indexが無いことになるので、indexを指定しないとValueError: If using all scalar values, you must pass an indexというエラーになる。そのため、indexを指定するか、あるいは値をscalarではなくlistとするか、が必要になる（df5とdf6の違い）。*2

gist.github.com

$ python dataframe_run.py 
df1
      col1  col2  col3
idx1     0     4     8
idx2     1     5     9
idx3     2     6    10
idx4     3     7    11

df2
      col1  col2  col3
idx1     0     4     8
idx2     1     5     9
idx3     2     6    10
idx4     3     7    11

df1 == df2
      col1  col2  col3
idx1  True  True  True
idx2  True  True  True
idx3  True  True  True
idx4  True  True  True

df3(MultiIndex) #1
                      col1  col2  col3
idx1-1 idx2-1 idx3-1     0     1     2
idx1-2 idx2-2 idx3-2     3     4     5
       idx2-3 idx3-3     6     7     8
              idx3-4     9    10    11

df3(MultiIndex) #2: Append column
                      col1  col2  col3  col4
idx1-1 idx2-1 idx3-1     0     1     2   100
idx1-2 idx2-2 idx3-2     3     4     5   100
       idx2-3 idx3-3     6     7     8   100
              idx3-4     9    10    11   100

df3(MultiIndex) #3: Access to scalar value
10

df3(MultiIndex) #4: Access to scalar by ix property
11

df3(MultiIndex) #5: Access to record by ix with condtions
                      col1  col3
idx1-2 idx2-2 idx3-2     3     5
       idx2-3 idx3-3     6     8

df3(MultiIndex) #6: Access to record by ix with conds specified by isin method
                      col2  col3
idx1-1 idx2-1 idx3-1     1     2
idx1-2 idx2-3 idx3-3     7     8

df3(MultiIndex) #7: Pivot table with fillna method
col3  2   5   8   11
col2                
1      0  -1  -1  -1
4     -1   3  -1  -1
7     -1  -1   6  -1
10    -1  -1  -1   9

df3(MultiInex) #8: Stack method
idx1-1  idx2-1  idx3-1  col1      0
                        col2      1
                        col3      2
                        col4    100
idx1-2  idx2-2  idx3-2  col1      3
                        col2      4
                        col3      5
                        col4    100
        idx2-3  idx3-3  col1      6
                        col2      7
                        col3      8
                        col4    100
                idx3-4  col1      9
                        col2     10
                        col3     11
                        col4    100
dtype: int64

df3(MultiIndex) #9: Stack and unstack
                      col1  col2  col3  col4
idx1-1 idx2-1 idx3-1     0     1     2   100
idx1-2 idx2-2 idx3-2     3     4     5   100
       idx2-3 idx3-3     6     7     8   100
              idx3-4     9    10    11   100

df3(MultiIndex) #10: Unstack method
                col1                        col2                        col3  \
              idx3-1 idx3-2 idx3-3 idx3-4 idx3-1 idx3-2 idx3-3 idx3-4 idx3-1   
idx1-1 idx2-1      0    NaN    NaN    NaN      1    NaN    NaN    NaN      2   
idx1-2 idx2-2    NaN      3    NaN    NaN    NaN      4    NaN    NaN    NaN   
       idx2-3    NaN    NaN      6      9    NaN    NaN      7     10    NaN   

                                     col4                       
              idx3-2 idx3-3 idx3-4 idx3-1 idx3-2 idx3-3 idx3-4  
idx1-1 idx2-1    NaN    NaN    NaN    100    NaN    NaN    NaN  
idx1-2 idx2-2      5    NaN    NaN    NaN    100    NaN    NaN  
       idx2-3    NaN      8     11    NaN    NaN    100    100  

df3(MultiIndex) #11: Nested unstacks
         col1                                                                 \
       idx3-1               idx3-2               idx3-3               idx3-4   
       idx2-1 idx2-2 idx2-3 idx2-1 idx2-2 idx2-3 idx2-1 idx2-2 idx2-3 idx2-1   
idx1-1      0    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
idx1-2    NaN    NaN    NaN    NaN      3    NaN    NaN    NaN      6    NaN   

        ...     col4                                                          \
        ...   idx3-1 idx3-2               idx3-3               idx3-4          
        ...   idx2-3 idx2-1 idx2-2 idx2-3 idx2-1 idx2-2 idx2-3 idx2-1 idx2-2   
idx1-1  ...      NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
idx1-2  ...      NaN    NaN    100    NaN    NaN    NaN    100    NaN    NaN   

               
               
       idx2-3  
idx1-1    NaN  
idx1-2    100  

[2 rows x 48 columns]

df3(MultiIndex) #12: Swaplevel method
                      col1  col2  col3  col4
idx3-1 idx2-1 idx1-1     0     1     2   100
idx3-2 idx2-2 idx1-2     3     4     5   100
idx3-3 idx2-3 idx1-2     6     7     8   100
idx3-4 idx2-3 idx1-2     9    10    11   100

df4(Nested dict)
                             a1                              a2  \
b1  {'c3': 2, 'c1': 0, 'c2': 1}   {'c3': 11, 'c1': 9, 'c2': 10}   
b2  {'c3': 5, 'c1': 3, 'c2': 4}  {'c3': 14, 'c1': 12, 'c2': 13}   
b3  {'c3': 8, 'c1': 6, 'c2': 7}  {'c3': 17, 'c1': 15, 'c2': 16}   

                                a3  
b1  {'c3': 20, 'c1': 18, 'c2': 19}  
b2  {'c3': 23, 'c1': 21, 'c2': 22}  
b3  {'c3': 26, 'c1': 24, 'c2': 25}  

df5(Dict with tuple key) #1
 a1                         a2 ...      a3                                
 b1       b2       b3       b1 ...  b3  b1          b2          b3        
 c1 c2 c3 c1 c2 c3 c1 c2 c3 c1 ...  c3  c1  c2  c3  c1  c2  c3  c1  c2  c3
  0  1  2  3  4  5  6  7  8  9 ...  17  18  19  20  21  22  23  24  25  26

[1 rows x 27 columns]

df5(Dict with tuple key) #2: Stack method
    a1        a2          a3        
    b1 b2 b3  b1  b2  b3  b1  b2  b3
 c1  0  3  6   9  12  15  18  21  24
 c2  1  4  7  10  13  16  19  22  25
 c3  2  5  8  11  14  17  20  23  26

df5(Dict with tuple key) #3: Nested stacks
  c1  b1  a1     0
          a2     9
          a3    18
      b2  a1     3
          a2    12
          a3    21
      b3  a1     6
          a2    15
          a3    24
  c2  b1  a1     1
          a2    10
          a3    19
      b2  a1     4
          a2    13
          a3    22
      b3  a1     7
          a2    16
          a3    25
  c3  b1  a1     2
          a2    11
          a3    20
      b2  a1     5
          a2    14
          a3    23
      b3  a1     8
          a2    17
          a3    26
dtype: int64

df5(Dict with tuple key) #4: Nested stacks and swaplevel
  a1  b1  c1     0
  a2  b1  c1     9
  a3  b1  c1    18
  a1  b2  c1     3
  a2  b2  c1    12
  a3  b2  c1    21
  a1  b3  c1     6
  a2  b3  c1    15
  a3  b3  c1    24
  a1  b1  c2     1
  a2  b1  c2    10
  a3  b1  c2    19
  a1  b2  c2     4
  a2  b2  c2    13
  a3  b2  c2    22
  a1  b3  c2     7
  a2  b3  c2    16
  a3  b3  c2    25
  a1  b1  c3     2
  a2  b1  c3    11
  a3  b1  c3    20
  a1  b2  c3     5
  a2  b2  c3    14
  a3  b2  c3    23
  a1  b3  c3     8
  a2  b3  c3    17
  a3  b3  c3    26
dtype: int64

df5(Dict with tuple key) #5: Nested stacks, swaplevel and unstack
        c1  c2  c3
 a1 b1   0   1   2
    b2   3   4   5
    b3   6   7   8
 a2 b1   9  10  11
    b2  12  13  14
    b3  15  16  17
 a3 b1  18  19  20
    b2  21  22  23
    b3  24  25  26

df5(Dict with tuple key) #6: Swaplevel with axis=1
 c1 c2 c3 c1 c2 c3 c1 c2 c3 c1 ...  c3  c1  c2  c3  c1  c2  c3  c1  c2  c3
 b1 b1 b1 b2 b2 b2 b3 b3 b3 b1 ...  b3  b1  b1  b1  b2  b2  b2  b3  b3  b3
 a1 a1 a1 a1 a1 a1 a1 a1 a1 a2 ...  a2  a3  a3  a3  a3  a3  a3  a3  a3  a3
  0  1  2  3  4  5  6  7  8  9 ...  17  18  19  20  21  22  23  24  25  26

[1 rows x 27 columns]

df6(Dict with tuple key and list values)
  a1                         a2 ...      a3                                
  b1       b2       b3       b1 ...  b3  b1          b2          b3        
  c1 c2 c3 c1 c2 c3 c1 c2 c3 c1 ...  c3  c1  c2  c3  c1  c2  c3  c1  c2  c3
0  0  1  2  3  4  5  6  7  8  9 ...  17  18  19  20  21  22  23  24  25  26

[1 rows x 27 columns]

*1:たぶんこちらをlabelと呼ぶのだと思うがnameとlabelが混同するのを避ける意味でkeyとしている

*2:data5には隠れindexがあるので、swaplevel()でのlevelの指定が直感より1つ大きい

2016-03-02

PythonでPandasのPlotを試す

qiita.com

こちらの記事のオマージュです。環境は、Windows/Python3.4です。pandasなどはpipでインストールします。groupbyによる集計や、日本語表記などに対応しました。

gist.github.com

f:id:nishidy:20160309002816p:plain f:id:nishidy:20160309002820p:plain f:id:nishidy:20160309002826p:plain f:id:nishidy:20160309002832p:plain f:id:nishidy:20160309002842p:plain f:id:nishidy:20160309002851p:plain f:id:nishidy:20160309002859p:plain f:id:nishidy:20160309002903p:plain

補記：可変長引数・キーワード可変長引数

$ cat kwargs.py 
def func(x,a=1,z=0,zz=100,*args,**kwargs):
    print("x=",x)
    print("=====")
    print("a=",a)
    print("=====")
    print("args=",args) 
    print("=====")
    print("kwargs",kwargs) 
    print("=====")
    print("z=",z) 
    print("=====")
    print("zz=",zz) 

args = [10,20,30,40,50]
kargs = {"a":10,"b":20,"c":30,"d":40,"e":50}

print("-----func(args)-----")
func(args)
print("-----func(*args)-----")
func(*args)
print("-----func(1,**kargs)-----")
func(1,**kargs)
print("-----func(1,*kargs)-----")
func(1,*kargs)

$ python kwargs.py 
-----func(args)-----
x= [10, 20, 30, 40, 50]
=====
a= 1
=====
args= ()
=====
kwargs {}
=====
z= 0
=====
zz= 100
-----func(*args)-----
x= 10
=====
a= 20
=====
args= (50,)
=====
kwargs {}
=====
z= 30
=====
zz= 40
-----func(1,**kargs)-----
x= 1
=====
a= 10
=====
args= ()
=====
kwargs {'d': 40, 'c': 30, 'e': 50, 'b': 20}
=====
z= 0
=====
zz= 100
-----func(1,*kargs)-----
x= 1
=====
a= e
=====
args= ('a', 'd')
=====
kwargs {}
=====
z= c
=====
zz= b

実引数の辞書に*を付けて関数に渡すと、辞書のキーが仮引数に渡される。仮引数のデフォルト引数への割り当ては、仮引数の宣言順序とは関係なく実施される。以下のように、実行するたびに割り当てが変わる。

-----func(1,*kargs)-----
x= 1
=====
a= b
=====
args= ('c', 'd')
=====
kwargs {}
=====
z= e
=====
zz= a

splitlines()

splitlines()は最後の空白要素を削除するが、split("\n")は削除しない。Trueを与えると改行を付けたまま分割して返す。

>>> a="aaa\nbbb\nccc\n"
>>> a.splitlines()
['aaa', 'bbb', 'ccc']
>>> a.split("\n")
['aaa', 'bbb', 'ccc', '']
>>> a="aaa\nbbb\nccc\n\n\n\n"
>>> a.splitlines()
['aaa', 'bbb', 'ccc', '', '', '']
>>> a.split("\n")
['aaa', 'bbb', 'ccc', '', '', '', '']
>>> a="aaa\nbbb\nccc"
>>> a.splitlines()
['aaa', 'bbb', 'ccc']
>>> a.split("\n")
['aaa', 'bbb', 'ccc']
>>> a.splitlines(True)
['aaa\n', 'bbb\n', 'ccc']

2016-02-26

Cythonによるパフォーマンス向上

Cython ―Cとの融合によるPythonの高速化

作者: Kurt W. Smith,中田秀基,長尾高弘
出版社/メーカー: オライリージャパン
発売日: 2015/06/19
メディア: 大型本
この商品を含むブログ (3件) を見る

Pythonのような書きやすいインタープリタ型言語で書き始めて結果を取得できるところまでいくと、その後にはパフォーマンスを上げるためにC++で書きなおそうか、と思ったりする。かといって最初からC++で書いていては、きちんと動くものを書き切るのに間違いなく時間がかかるし、とにかくまず結果が欲しいときには取るべき有効な手段ではないと思う。

そこでCythonの出番。Pythonのコードはそのままで有効なCythonのコードでもあるので、cythonをインストールして拡張子をpyxに変更し、import pyximport; pyximport.install()をつけるだけで速度改善の効果が望める。もちろん、静的な型指定などのチューニングを施すことによって、Pythonのセマンティクスを保ったまま、よりCに近い速度を達成することができる。これだけ簡単に導入できるので、もしCPUバウンドな処理であることが分かっているなら、Cythonを使わない手はないだろう。

どのような場合にどの程度の速度改善が望めるか、以下のようにPythonのコードを実行した結果と、PythonのコードをそのままCythonとしてコンパイルして実行した結果、Cythonとして書き直したコードを実行した結果、を比較した。プログラムの内容は、単にソートを実施しているだけだが、再帰呼び出しの有り・無し、でそれぞれプログラムを変えている。Cythonでは、stdlib.hのqsortを使っている。

再帰呼び出し有り、では関数呼び出しのオーバーヘッドが大きくなるため、PythonのコードをそのままCythonとしてコンパイルして実行した結果、が、Pythonのコードを実行した結果に比べて、速度改善率が高い。

再帰呼び出し無し、では、ソートするリストの長さを1000倍しており、Cythonのコードの速度改善率がかなり高い。

このようにある特定のモジュールをCython化するだけでも、かなり速度改善が望めることが分かる。Cそのもののコードを書かずにPython スクリプトの書きやすさを保ったままパフォーマンス向上が可能となる。あとは、プロファイラーを使ってCython化すべき箇所を特定する作業が必要になるだろう。

再帰呼び出し有り

$ cat pyqsort.py 
def pyqsort(x):
    if len(x)==1:
        return
    array = sorted(x)
    pyqsort(x[:-1])

$ cat cyqsort.pyx 
def cyqsort(x):
    if len(x)==1:
        return
    array = sorted(x)
    cyqsort(x[:-1])

$ cat cqsort.pyx 
cdef extern from "stdlib.h":
    void qsort(void *array,size_t count,size_t size,int (*compare)(const void*,const void*))
    void *malloc(size_t size)
    void free(void *ptr)

def cqsort(list x):
    cdef:
        int *array
        int i, N
    
    N = len(x)
    if N==1:
        return
    array = <int*>malloc(sizeof(int) * N)
    if array == NULL:
        raise MemoryError("Unable to allocate array.")

    for i in range(N):
        array[i] = x[i]
        
    qsort(<void*>array, <size_t>N, sizeof(int), int_compare)
   
    free(array)

    cqsort(x[:-1])

cdef int int_compare(const void *a, const void *b):
    cdef int ia, ib
    ia = (<int*>a)[0]
    ib = (<int*>b)[0]
    return ia-ib

$ ipython --no-banner

In [1]: from pyqsort import pyqsort

In [2]: from random import shuffle

In [3]: x=list(range(100))

In [4]: shuffle(x)

In [5]: %timeit pyqsort(x)
1000 loops, best of 3: 642 µs per loop

In [6]: import pyximport; pyximport.install()
Out[6]: (None, <pyximport.pyximport.PyxImporter at 0x106dc1eb8>)

In [7]: from cyqsort import cyqsort

In [8]: %timeit cyqsort(x)
1000 loops, best of 3: 555 µs per loop

In [9]: from cqsort import cqsort

In [10]: %timeit cqsort(x)
1000 loops, best of 3: 349 µs per loop

In [11]:

再帰呼び出し無し

$ cat pyqsort_once.py 
def pyqsort(x):
    array = sorted(x)

$ cat cyqsort_once.pyx
def cyqsort(x):
    array = sorted(x)

$ cat cqsort_once.pyx
cdef extern from "stdlib.h":
    void qsort(void *array,size_t count,size_t size,int (*compare)(const void*,const void*))
    void *malloc(size_t size)
    void free(void *ptr)

def cqsort(list x):
    cdef:
        int *array
        int i, N
    
    N = len(x)
    if N==1:
        return
    array = <int*>malloc(sizeof(int) * N)
    if array == NULL:
        raise MemoryError("Unable to allocate array.")

    for i in range(N):
        array[i] = x[i]
        
    qsort(<void*>array, <size_t>N, sizeof(int), int_compare)
   
    free(array)

cdef int int_compare(const void *a, const void *b):
    cdef int ia, ib
    ia = (<int*>a)[0]
    ib = (<int*>b)[0]
    return ia-ib

$ ipython --no-banner

In [1]: from pyqsort_once import pyqsort

In [2]: from random import shuffle

In [3]: x=list(range(1000000))

In [4]: shuffle(x)

In [5]: %timeit pyqsort(x)
1 loop, best of 3: 760 ms per loop

In [6]: import pyximport; pyximport.install()
Out[6]: (None, <pyximport.pyximport.PyxImporter at 0x10c9bbc88>)

In [7]: from cyqsort_once import cyqsort

In [8]: %timeit cyqsort(x)
1 loop, best of 3: 759 ms per loop

In [9]: from cqsort_once import cqsort

In [10]: %timeit cqsort(x)
1 loop, best of 3: 193 ms per loop

In [11]:

ループで呼び出し

$ ipython --no-banner

In [1]: from pyqsort_once import pyqsort

In [2]: from random import shuffle

In [3]: x=list(range(1000))

In [4]: shuffle(x)

In [5]: %timeit for i in range(1000): pyqsort(x)
1 loop, best of 3: 218 ms per loop

In [6]: import pyximport; pyximport.install()
Out[6]: (None, <pyximport.pyximport.PyxImporter at 0x10dd583c8>)

In [7]: from cyqsort_once import cyqsort

In [8]: %timeit for i in range(1000): cyqsort(x)
1 loop, best of 3: 218 ms per loop

In [9]: from cqsort_once import cqsort

In [10]: %timeit for i in range(1000): cqsort(x)
10 loops, best of 3: 84.8 ms per loop

In [11]:

2016-02-23

Pythonの正規表現でバックスラッシュを含む文字列とマッチする場合

Pythonのreモジュールでsearchなどの関数に正規表現を与える場合、rubyやperlなどのように正規表現を直接与えることができず、文字列として与える必要があるため、WindowsのPath区切りにマッチさせたいときなどに正規表現内にバックスラッシュを使いたい場合、一つのバックスラッシュに対して"\\\\"とする必要がある。rプレフィックスを用いることで、文字列としてのエスケープは行われなくなるので、正規表現としての一つのバックスラッシュをr"\\"と書ける。ただし文字列の最後にr"\"とは書けないことにも注意が必要。

reモジュールのsearch関数の第1引数は正規表現、第2引数はマッチ対象となる文字列だ。

>>> import re
>>> re.search("abc","abc")
<_sre.SRE_Match object; span=(0, 3), match='abc'>
>>> re.search(r"abc\\abc","abc\\abc")
<_sre.SRE_Match object; span=(0, 7), match='abc\\abc'>
>>> re.search("abc\\\\abc","abc\\abc")
<_sre.SRE_Match object; span=(0, 7), match='abc\\abc'>
>>> re.search(r"abc\\\\abc",r"abc\\abc")
<_sre.SRE_Match object; span=(0, 8), match='abc\\\\abc'>
>>> re.search(r"abc\\abc",r"abc\abc")
<_sre.SRE_Match object; span=(0, 7), match='abc\\abc'>
>>> re.search(r"abc\\abc\\",r"abc\abc\")
  File "<stdin>", line 1
    re.search(r"abc\\abc\\",r"abc\abc\")
                                       ^
SyntaxError: EOL while scanning string literal

>>> re.search(r"abc\\abc\\","abc\\abc\\")
<_sre.SRE_Match object; span=(0, 8), match='abc\\abc\\'>

以下のように正規表現の文字列として"\\"を与えてしまうと、文字列内でエスケープした結果として\が正規表現として与えられてしまいエラーとなる。\は単独では正規表現として有効ではない。

>>> re.search("abc\\","abc\\")
Traceback (most recent call last):
  File "/usr/local/Cellar/python3/3.4.3_2/Frameworks/Python.framework/Versions/3.4/lib/python3.4/sre_parse.py", line 206, in __next
    c = self.string[self.index + 1]
IndexError: string index out of range
...

>>> re.search("\\","")
Traceback (most recent call last):
  File "/usr/local/Cellar/python3/3.4.3_2/Frameworks/Python.framework/Versions/3.4/lib/python3.4/sre_parse.py", line 206, in __next
    c = self.string[self.index + 1]
IndexError: string index out of range
...

バックスラッシュを含む文字列が変数に代入されており、それを正規表現として扱いたい場合、文字列中のバックスペースは二回エスケープされることになるので、予め"\\\\"とするか、あるいはrプレフィックスを与えてr"\\"としておく必要がある。

>>> a="abc\\\\abc"
>>> a
'abc\\\\abc'
>>> re.search(a,"abc\\abc")
<_sre.SRE_Match object; span=(0, 7), match='abc\\abc'>
>>> a=r"abc\\abc"
>>> a
'abc\\\\abc'
>>> re.search(a,"abc\\abc")
<_sre.SRE_Match object; span=(0, 7), match='abc\\abc'>

以下は誤ったやり方になる。上記の通り、rプレフィックスは文字列としてのエスケープを行わない、という指示になるので、新たにrプレフィックスを付けた文字列に変数をフォーマットしても期待通りには動かない。また、%rでフォーマットし直してもうまくいかない。どうにかして（正規表現として使われることを意識してしない）元の変数をsearchの第1引数に与えようと、replやevalを試してみてもうまくいかない。

>>> a="abc\\abc"
>>> re.search(a,"abc\\abc")
>>> re.search(a,"abc\abc")
<_sre.SRE_Match object; span=(0, 6), match='abc\x07bc'>
>>> r"{0}".format(a)
'abc\\abc'
>>> "%r"%a
"'abc\\\\abc'"
>>> re.search("%r"%a,"abc\\abc")
>>>

>>>repr(a)
"'abc\\\\abc'"
>>> eval(repr(a))
'abc\\abc'

そのため、以下のようにsub関数による置換が必要になるかもしれない。

>>> a="abc\\abc"
>>> a=re.sub(r"\\",r"\\\\",a)
>>> a
'abc\\\\abc'
>>> re.search(a,"abc\\abc")
<_sre.SRE_Match object; span=(0, 7), match='abc\\abc'>

2016-02-10

stroll(nicer walk)のC++バージョンを書いた

ディレクトリを幅優先探索で掘っていき、ディレクトリやファイルに対する条件を適用し、それらを与えた条件（関数）によってソートし、見つけたファイルパスを返すライブラリを書いた。ちょうどPythonのos.walkを高機能にしたもので、所望のファイルだけを取り出したいときに使える。

boost不要で、C++11/C++14も不要なので、古いバージョンのg++(clang)でもコンパイルできる（はず）。Cでファイル処理を書くことはほとんど無いはずなので、C++で使えれば良いかなと。CとC++がごっちゃになっていてなんとなく居心地が悪いけどこんなもんなのだろうか。ちなみに、directory_iteratorは#incude <filesystem>が必要でMac(clang)だとコンパイルできないようなので、Cを使うことにした。

正規表現マッチをする際に、matchとsearchの違いを忘れてしまいがち。matchは先頭からマッチするが、searchはどこでもマッチする。

gist.github.com

$ cat main.cpp
#include <iostream>
#include "stroll.cpp"

using namespace std;
using namespace stroll;

void ex1(){
    cout<<"=== ex1 (mtime) ==="<<endl;
    Stroll stroll = Stroll("basedir");
    stroll.set_sortby(sortByMtime);

    while(stroll.read()){
        cout<< stroll.get() << endl;
    }
}

void ex2(){
    cout<<"=== ex2 (mtime/reverse) ==="<<endl;
    Stroll stroll = Stroll("basedir");
    stroll.set_sortby(sortByMtimeDesc);

    while(stroll.read()){
        cout<< stroll.get() << endl;
    }
}

void ex3(){
    cout<<"=== ex3 (no_file) ==="<<endl;
    vstr no_file;
    no_file.push_back("1$");
    no_file.push_back("2$");

    Stroll stroll = Stroll("basedir");
    stroll.set_no_file(no_file);

    while(stroll.read()){
        cout<< stroll.get() << endl;
    }
}

void ex4(){
    cout<<"=== ex4 (yes_path) ==="<<endl;
    vstr yes_path;
    yes_path.push_back("1$");

    Stroll stroll = Stroll("basedir");
    stroll.set_yes_path(yes_path);

    while(stroll.read()){
        cout<< stroll.get() << endl;
    }
}

void ex5(){
    cout<<"=== ex5 (no_root/len/reverse) ==="<<endl;
    vstr no_root;
    no_root.push_back("2$");

    Stroll stroll = Stroll("basedir");
    stroll.set_no_root(no_root);
    stroll.set_sortby(sortByLenDesc);

    while(stroll.read()){
        cout<< stroll.get() << endl;
    }
}

void ex6(){
    cout<<"=== ex6 (yes_file/yes_root) ==="<<endl;
    vstr yes_file;
    yes_file.push_back("2$");

    vstr yes_root;
    yes_root.push_back("2$");

    Stroll stroll = Stroll("basedir");
    stroll.set_yes_file(yes_file);
    stroll.set_yes_root(yes_root);

    while(stroll.read()){
        cout<< stroll.get() << endl;
    }
}

int main(int argc, char* argv[]){

    ex1();
    ex2();
    ex3();
    ex4();
    ex5();
    ex6();

    return 0;
}

$ cat Makefile 

main:main.o
    g++ -L./ -lstroll -o main main.o

main.o:libstroll.a
    g++ -c main.cpp

libstroll.a:stroll.o
    ar r libstroll.a stroll.o

stroll.o:stroll.cpp
    g++ -Wall -O2 -c stroll.cpp

.PHONY: clean
clean:
    rm -f libstroll.a stroll.o main

$ ./main 
=== ex1 (mtime) ===
basedir/file0
basedir/dir1/file11
basedir/dir1/file12
basedir/dir1/file13
basedir/dir1/file111
basedir/dir1/file1111
basedir/dir2/file21
basedir/dir2/file22
basedir/dir2/file23
basedir/dir2/file222
basedir/dir2/file2222
basedir/dir3/file31
basedir/dir3/file32
basedir/dir3/file33
basedir/dir3/file333
basedir/dir3/file3333
=== ex2 (mtime/reverse) ===
basedir/file0
basedir/dir3/file3333
basedir/dir3/file333
basedir/dir3/file33
basedir/dir3/file32
basedir/dir3/file31
basedir/dir2/file2222
basedir/dir2/file222
basedir/dir2/file23
basedir/dir2/file21
basedir/dir2/file22
basedir/dir1/file1111
basedir/dir1/file111
basedir/dir1/file13
basedir/dir1/file12
basedir/dir1/file11
=== ex3 (no_file) ===
basedir/file0
basedir/dir1/file13
basedir/dir2/file23
basedir/dir3/file33
basedir/dir3/file333
basedir/dir3/file3333
=== ex4 (yes_path) ===
basedir/file0
basedir/dir1/file11
basedir/dir1/file111
basedir/dir1/file1111
basedir/dir1/file12
basedir/dir1/file13
=== ex5 (no_root/len/reverse) ===
basedir/file0
basedir/dir1/file1111
basedir/dir1/file111
basedir/dir1/file11
basedir/dir1/file12
basedir/dir1/file13
basedir/dir3/file3333
basedir/dir3/file333
basedir/dir3/file31
basedir/dir3/file32
basedir/dir3/file33
=== ex6 (yes_file/yes_root) ===
basedir/dir2/file22
basedir/dir2/file222
basedir/dir2/file2222

2016-02-06

os.walkを便利に使える関数strollを書いた

リスト指定と正規表現マッチに対応し、ディレクトリパスとファイル名を返すようにして、関数名をnicewalkからstroll*1に変更（短い単語の方が良かった）
ディレクトリの扱いを明確化。rootはカレントディレクトリのディレクトリ名で、pathはカレントディレクトリのパス。ignorecaseを追加。
Python 3.5以上ではglobにrecursiveオプションが付いたため、"**"とrecursive=Trueを与えることで再帰的に辿ることができるようになった。そのため、以下はPython3.4以下が対象。

os.walk()はディレクトリを再帰的に辿りながら、ファイルを読み込んだり書き込む処理をする際に非常に便利だ。globだとディレクトリを再帰的に辿ることができず、ディレクトリ階層を意識しないといけない*2し、ディレクトリとファイルの違いをワイルドカードで示す必要があるので柔軟さに欠ける。またos.walk()では、取得したディレクトリやファイルのリストに対して、インプレースでの変更が可能になっているため、任意の条件で自由にソートやフィルタを行うことができる。そこで、各ディレクトリ・ファイルに対する条件を簡単に盛り込めるstrollを書いた。

reduceのこの使い方は好きで、リストをループでチェックするよりワンラインで書ける方が見やすくて良い。一度慣れてしまえば、それほど考えこまなくても流用できるので、いろんなところで多用している。リストが空の場合には、reduceに与える初期値がそのまま判定結果になってしまうが、それが嫌ならリストを三項演算で判定して初期値を与えれば良い。

gist.github.com

$ python --version
Python 3.4.3

$ cat main.py 
from stroll import stroll
import os

print()
print("=== ex1 (mtime) ===")
for filepath in stroll("basedir",sortby=os.path.getmtime):
    print( filepath )

print()
print("=== ex2 (mtime/reverse) ===")
for filepath in stroll("basedir",sortby=os.path.getmtime,reverse=True):
    print( filepath )

print()
print("=== ex3 (no_file) ===")
for filepath in stroll("basedir",no_file=["1$","2$"],):
    print( filepath )

print()
print("=== ex4 (yes_path) ===")
for filepath in stroll("basedir",yes_path=["1$"],):
    print( filepath )

print()
print("=== ex5 (no_root/len/reverse) ===")
for filepath in stroll("basedir",no_root=["2$"],sortby=len,reverse=True):
    print( filepath )

print()
print("=== ex6 (yes_file/yes_root) ===")
for filepath in stroll("basedir",yes_file=["2$"],yes_root=["2$"]):
    print( file path )

$ find basedir
basedir
basedir/dir1
basedir/dir1/file11
basedir/dir1/file111
basedir/dir1/file1111
basedir/dir1/file12
basedir/dir1/file13
basedir/dir2
basedir/dir2/file21
basedir/dir2/file22
basedir/dir2/file222
basedir/dir2/file2222
basedir/dir2/file23
basedir/dir3
basedir/dir3/file31
basedir/dir3/file32
basedir/dir3/file33
basedir/dir3/file333
basedir/dir3/file3333
basedir/file0

$ ls -Rlrt --time-style=+"%m/%d_%H:%M:%S" basedir/ | awk '{print $5 " " $6 " " $7}'
  
0 02/09_21:24:50 file0
238 02/09_21:25:29 dir1
238 02/09_21:25:33 dir2
238 02/09_21:25:38 dir3
  
0 02/09_21:24:57 file11
0 02/09_21:24:58 file12
0 02/09_21:24:59 file13
0 02/09_21:25:28 file111
0 02/09_21:25:29 file1111
  
0 02/09_21:25:06 file22
0 02/09_21:25:06 file21
0 02/09_21:25:07 file23
0 02/09_21:25:32 file222
0 02/09_21:25:33 file2222
  
0 02/09_21:25:10 file31
0 02/09_21:25:11 file32
0 02/09_21:25:12 file33
0 02/09_21:25:37 file333
0 02/09_21:25:38 file3333

$ python main.py 

=== ex1 (mtime) ===
basedir
('basedir', 'file0')
basedir/dir1
('basedir/dir1', 'file11')
('basedir/dir1', 'file12')
('basedir/dir1', 'file13')
('basedir/dir1', 'file111')
('basedir/dir1', 'file1111')
basedir/dir2
('basedir/dir2', 'file21')
('basedir/dir2', 'file22')
('basedir/dir2', 'file23')
('basedir/dir2', 'file222')
('basedir/dir2', 'file2222')
basedir/dir3
('basedir/dir3', 'file31')
('basedir/dir3', 'file32')
('basedir/dir3', 'file33')
('basedir/dir3', 'file333')
('basedir/dir3', 'file3333')

=== ex2 (mtime/reverse) ===
basedir
('basedir', 'file0')
basedir/dir3
('basedir/dir3', 'file3333')
('basedir/dir3', 'file333')
('basedir/dir3', 'file33')
('basedir/dir3', 'file32')
('basedir/dir3', 'file31')
basedir/dir2
('basedir/dir2', 'file2222')
('basedir/dir2', 'file222')
('basedir/dir2', 'file23')
('basedir/dir2', 'file21')
('basedir/dir2', 'file22')
basedir/dir1
('basedir/dir1', 'file1111')
('basedir/dir1', 'file111')
('basedir/dir1', 'file13')
('basedir/dir1', 'file12')
('basedir/dir1', 'file11')

=== ex3 (no_file) ===
basedir
('basedir', 'file0')
basedir/dir1
('basedir/dir1', 'file13')
basedir/dir2
('basedir/dir2', 'file23')
basedir/dir3
('basedir/dir3', 'file33')
('basedir/dir3', 'file333')
('basedir/dir3', 'file3333')

=== ex4 (yes_path) ===
basedir/dir1
('basedir/dir1', 'file11')
('basedir/dir1', 'file111')
('basedir/dir1', 'file1111')
('basedir/dir1', 'file12')
('basedir/dir1', 'file13')

=== ex5 (no_root/len/reverse) ===
basedir
('basedir', 'file0')
basedir/dir1
('basedir/dir1', 'file1111')
('basedir/dir1', 'file111')
('basedir/dir1', 'file11')
('basedir/dir1', 'file12')
('basedir/dir1', 'file13')
basedir/dir3
('basedir/dir3', 'file3333')
('basedir/dir3', 'file333')
('basedir/dir3', 'file31')
('basedir/dir3', 'file32')
('basedir/dir3', 'file33')

=== ex6 (yes_file/yes_root) ===
basedir/dir2
('basedir/dir2', 'file22')
('basedir/dir2', 'file222')
('basedir/dir2', 'file2222')

*1:to walk somewhere in a slow relaxed way

*2:rubyのDir.glob()は**/を指定することで再帰的にディレクトリ階層を辿ることができる

2016-02-02

OpenCV入門

Max OS X(10.10.5)でOpenCVに入門してみた。写真はtoeさんです。導入はbrewで簡単にインストールできるし、インクルード検索パスやライブラリ検索パスもpkg-configで簡単に指定できる。

ネガポジ反転は、~演算子（ビット反転・1の補数）を実行するだけで取り出せる、というところも面白い。Mat型において、~演算子がオーバーロードされているのだろう。最後に、演算子オーバーロードのコード例を追記してみた。

CascadeClassifierによる分類では、loadするデータを変えることによって、どんな特徴を矩形抽出するかを変えている。顔の特徴と下半身の特徴。顔の矩形座標が取れれば、あとはFacebookのようにフレンドリストの教師データから名前を割り当てることはできそう。これはやってみたいところ。

$ brew update
$ brew tap homebrew/science
$ brew install opencv
$ g++ opencv.cpp `pkg-config --cflags opencv` `pkg-config --libs opencv`

$ pkg-config --cflags opencv
-I/usr/local/Cellar/opencv/2.4.12_2/include/opencv -I/usr/local/Cellar/opencv/2.4.12_2/include 

$ pkg-config --libs opencv
-L/usr/local/Cellar/opencv/2.4.12_2/lib -lopencv_calib3d -lopencv_contrib -lopencv_core -lopencv_features2d -lopencv_flann -lopencv_gpu -lopencv_highgui -lopencv_imgproc -lopencv_legacy -lopencv_ml -lopencv_nonfree -lopencv_objdetect -lopencv_ocl -lopencv_photo -lopencv_stitching -lopencv_superres -lopencv_ts -lopencv_video -lopencv_videostab

gist.github.com

f:id:nishidy:20160202001020j:plain f:id:nishidy:20160202001027j:plain f:id:nishidy:20160202001033j:plain f:id:nishidy:20160202001617j:plain f:id:nishidy:20160202001048j:plain f:id:nishidy:20160202001053j:plain f:id:nishidy:20160202003702j:plain

#include <iostream>
using namespace std;

class Klass{
    public:
    int i;
    int j;
    Klass():i(0),j(1){
    }
    Klass* operator ~ () {
        Klass* obj = new Klass();
        obj->i = ~(this->i);
        obj->j = ~(this->j);
        return obj;
    }

};

int main(int argc, char* argv[]){
    Klass* obj = new Klass();
    Klass* rev = ~(*obj);
    printf("%d,%d\n",obj->i,obj->j);
    printf("%d,%d\n",rev->i,rev->j);
    return 0;
}

$ ./a.out 
0,1
-1,-2