python - Parallelism inside of a function?

I have a function that counts how many rows contain every item of a given list, as shown below:

def count(pair_list):
    # Number of rows that contain every item in pair_list
    return float(sum([1 for row in rows if all(item in row.split() for item in pair_list)]))

if __name__ == "__main__":
    pairs = [['apple', 'banana'], ['cookie', 'popsicle'], ['candy', 'cookie'], ...]
    # grocery transaction data
    rows = ['apple cookie banana popsicle wafer', 'almond milk eggs butter bread', 'bread almonds apple', 'cookie candy popsicle pop', ...]

    res = [count(pair) for pair in pairs]

In reality, len(rows) is 10000 and there are 18000 elements in pairs, so the computing cost of the list comprehension in count() and the one in the main function is expensive.

I tried parallel processing:

from multiprocessing.dummy import Pool as ThreadPool
import multiprocessing as mp

threadpool = ThreadPool(processes=mp.cpu_count())

res = threadpool.map(count, pairs)

This doesn't run quickly, either. In fact, after 15 minutes, I quit the job because it didn't look like it was ending. Two questions:

1) How can I speed up the actual searching that takes place in count()?
2) How can I check the status of the threadpool.map process (i.e. see how many pairs are left to iterate over)?

1) The overall complexity of your calculations is enormous, and it comes from different sources:

a) You split the row at the lowest level of the calculation, so Python has to create a new row split on every iteration. To avoid this, you can pre-calculate the split rows. Something like this will do the job (with minor changes in the "count" function):

rows2 = [row.split() for row in rows]
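The "minor changes" to count are not spelled out above. A minimal sketch of what they might look like, using only the sample rows and pairs from the question (the real data is much larger):

```python
# Sample data taken from the question (the elided entries are left out).
pairs = [['apple', 'banana'], ['cookie', 'popsicle'], ['candy', 'cookie']]
rows = ['apple cookie banana popsicle wafer',
        'almond milk eggs butter bread',
        'bread almonds apple',
        'cookie candy popsicle pop']

# Split every row exactly once, up front.
rows2 = [row.split() for row in rows]

def count(pair_list):
    # row is already a list of words, so no split() is needed here.
    return float(sum(1 for row in rows2 if all(item in row for item in pair_list)))

res = [count(pair) for pair in pairs]
print(res)
```

Membership tests on a list are still linear in the row length; this only removes the repeated splitting.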

b) You compare list items one by one, even though you only need to check whether a word exists in the list. Here we can tweak more (and use rows3 instead of rows2 in the "count" function):

rows3 = [set(row.split()) for row in rows]

def count(pair_list):
    # Set membership is a hash lookup rather than a scan of the row
    return float(sum([1 for row in rows3 if all(item in row for item in pair_list)]))
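An equivalent way to phrase the same check is a subset test on sets. The sketch below is only a variation on option b); the name count_subset is made up for illustration, and the data is the sample from the question:

```python
pairs = [['apple', 'banana'], ['cookie', 'popsicle'], ['candy', 'cookie']]
rows = ['apple cookie banana popsicle wafer',
        'almond milk eggs butter bread',
        'bread almonds apple',
        'cookie candy popsicle pop']

# One set of words per row, built once.
rows3 = [set(row.split()) for row in rows]

def count_subset(pair_list):
    # "wanted <= row" is True when every wanted word is in the row's set.
    wanted = set(pair_list)
    return float(sum(1 for row in rows3 if wanted <= row))

res = [count_subset(pair) for pair in pairs]
print(res)
```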

c) You check every word in pairs against every word in rows. The calculation takes 2*len(row)*len(rows) iterations per call of the "count" function in the original version, while it can take less. With option b) it can get down to 2*len(rows) in a good case, but it's possible to make one set lookup per pair, not two. The trick is to prepare all possible word*word combinations for every row and check whether the corresponding tuple exists in that set. So, in the main function you create a complex immutable search structure:

rows4 = [set((a, b) for a in row for b in row) for row in rows2]

and now "count" looks different; it takes a tuple instead of a list:

def count2(pair):
    return float(len([1 for row in rows4 if pair in row]))

So you call it a bit differently:

res = [count2(tuple(pair)) for pair in pairs]

Note that creating the search structure takes len(row.split())^2 per row in time and space, so if your rows can be long, it's not optimal. After all, option b) can be better.
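Put together, option c) might look like the sketch below. It runs on the sample data from the question only, and also prints how many tuples a single five-word row produces, to illustrate the quadratic growth:

```python
pairs = [['apple', 'banana'], ['cookie', 'popsicle'], ['candy', 'cookie']]
rows = ['apple cookie banana popsicle wafer',
        'almond milk eggs butter bread',
        'bread almonds apple',
        'cookie candy popsicle pop']

rows2 = [row.split() for row in rows]

# For each row, the set of all ordered (word, word) combinations.
rows4 = [set((a, b) for a in row for b in row) for row in rows2]

def count2(pair):
    # A single set lookup per row: is this exact tuple present?
    return float(sum(1 for row in rows4 if pair in row))

res = [count2(tuple(pair)) for pair in pairs]
print(res)

# The first row has 5 distinct words, so its set holds 5 * 5 tuples.
print(len(rows2[0]), len(rows4[0]))
```

Because both orders of every combination are stored, the order of the words inside a pair does not matter.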

2) You can predict the number of calls to "count": it's len(pairs). Count the calls of the "count" function and make a debug print in it for, say, every 1000 calls.
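One possible way to do this with the thread pool from the question is to count completed results in the calling thread instead of inside "count". This is a variation on the suggestion above, not the only way; map_with_progress and its parameters are hypothetical names, while Pool.imap is the standard lazy counterpart of Pool.map:

```python
from multiprocessing.dummy import Pool as ThreadPool

pairs = [['apple', 'banana'], ['cookie', 'popsicle'], ['candy', 'cookie']]
rows3 = [set(row.split()) for row in [
    'apple cookie banana popsicle wafer',
    'almond milk eggs butter bread',
    'bread almonds apple',
    'cookie candy popsicle pop']]

def count(pair_list):
    return float(sum(1 for row in rows3 if all(item in row for item in pair_list)))

def map_with_progress(func, items, report_every=1000, processes=4):
    # Hypothetical helper: like pool.map, but reports how far along it is.
    results = []
    total = len(items)
    with ThreadPool(processes=processes) as pool:
        # imap yields results in input order as they become available.
        for done, result in enumerate(pool.imap(func, items), start=1):
            results.append(result)
            if done % report_every == 0 or done == total:
                print(f"{done}/{total} pairs processed")
    return results

res = map_with_progress(count, pairs, report_every=2)
```

With 18000 pairs, report_every=1000 gives a status line roughly every 1000 finished pairs.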
