python - Parallelism inside of a function?
I have a function that counts how many rows a list of items appears in, shown below:
```python
def count(pair_list):
    return float(sum([1 for row in rows if all(item in row.split() for item in pair_list)]))

if __name__ == "__main__":
    pairs = [['apple', 'banana'], ['cookie', 'popsicle'], ['candy', 'cookie'], ...]  # grocery transaction data
    rows = ['apple cookie banana popsicle wafer', 'almond milk eggs butter bread',
            'bread almonds apple', 'cookie candy popsicle pop', ...]
    res = [count(pair) for pair in pairs]
```
In reality, len(rows) is about 10000 and there are 18000 elements in pairs, so the computing cost of the list comprehension in count(), and of the one in the main function, is expensive.
I tried parallel processing:
```python
from multiprocessing.dummy import Pool as ThreadPool
import multiprocessing as mp

threadpool = ThreadPool(processes=mp.cpu_count())
res = threadpool.map(count, pairs)
```
This doesn't run quickly either; in fact, after 15 minutes I quit the job because it didn't seem to be ending. Two questions: 1) How can I speed up the actual searching that takes place in count()? 2) How can I check the status of the threadpool.map process (i.e. see how many pairs are left to iterate over)?
1) The overall complexity of the calculation is enormous, and it comes from different sources:
a) You split each row at a low level of the calculation, so Python has to create a new row split on every iteration. To avoid this, you can pre-calculate the split rows. This does the job (with minor changes in the "count" function):
```python
rows2 = [row.split() for row in rows]
```
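For completeness, a minimal sketch of the adjusted "count" function working on the pre-split rows. Passing rows2 as an argument (rather than using a global) is my addition to keep the example self-contained; the toy transaction strings are made up:

```python
def count(pair_list, rows2):
    # rows2 holds pre-split rows, so no row.split() is done per iteration
    return float(sum(1 for words in rows2
                     if all(item in words for item in pair_list)))

rows = ['apple cookie banana popsicle wafer', 'cookie candy popsicle pop']
rows2 = [row.split() for row in rows]
print(count(['cookie', 'popsicle'], rows2))  # prints 2.0: both rows match
```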
b) You compare list items one by one, even though you only need to check the existence of a word in the row. Here you can tweak it further (and use rows3 instead of rows2 in the "count" function):
```python
rows3 = [set(row.split()) for row in rows]

def count(pair_list):
    return float(sum([1 for row in rows3 if all(item in row for item in pair_list)]))
```
c) You check every word in pairs against every word in rows. In the original version the calculation takes 2*len(row)*len(rows) iterations per call of the "count" function, while it can take less. Option b) gets it down to 2*len(rows) in our case, but it's possible to make just one set lookup per pair, not two. The trick is to prepare every possible word*word combination for each row and check whether the corresponding tuple exists in that set. So, in the main function, create a complex immutable search structure:
```python
rows4 = [set((a, b) for a in row for b in row) for row in rows2]
```
and "count" becomes different, taking a tuple instead of a list:
```python
def count2(pair):
    return float(len([1 for row in rows4 if pair in row]))
```
So the call is a bit different:

```python
res = [count2(tuple(pair)) for pair in pairs]
```
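Putting option c) together, here is a small end-to-end sketch with toy data (the transaction strings are invented for illustration, and I pass rows4 as a parameter rather than a global to keep it self-contained):

```python
def count2(pair, rows4):
    # one set lookup per row: is this exact (word, word) tuple present?
    return float(len([1 for row in rows4 if pair in row]))

rows = ['apple cookie banana popsicle wafer',
        'bread almonds apple',
        'cookie candy popsicle pop']
rows2 = [row.split() for row in rows]
# every ordered word pair within a row, stored as a set of tuples
rows4 = [set((a, b) for a in row for b in row) for row in rows2]

pairs = [['apple', 'banana'], ['cookie', 'popsicle']]
res = [count2(tuple(pair), rows4) for pair in pairs]
print(res)  # prints [1.0, 2.0]
```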
Note that the search structure creation takes len(row.split())^2 per row in both time and space, so if a row can be long, it's not optimal. After all, option b) may turn out better.
2) You can predict the number of calls to "count": it's len(pairs). So count the calls of the "count" function and make a debug print, say, every 1000 calls.
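One way to implement that debug print is to wrap count() in a helper that bumps a shared counter. The names done and counted are my own; since multiprocessing.dummy uses threads, a plain module-level counter is good enough for rough progress reporting (it is not strictly atomic, so treat the count as approximate). Toy data keeps the sketch runnable:

```python
from multiprocessing.dummy import Pool as ThreadPool
import multiprocessing as mp

rows = ['apple cookie banana', 'cookie candy popsicle']
pairs = [['apple', 'banana'], ['cookie', 'popsicle']] * 3  # toy data

def count(pair_list):
    return float(sum(1 for row in rows
                     if all(item in row.split() for item in pair_list)))

done = 0  # threads from multiprocessing.dummy share this counter

def counted(pair):
    global done
    result = count(pair)
    done += 1
    if done % 2 == 0:  # in real use, report every 1000 calls instead
        print('%d of %d pairs processed' % (done, len(pairs)))
    return result

threadpool = ThreadPool(processes=mp.cpu_count())
res = threadpool.map(counted, pairs)
```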