python - Grouping indices of unique elements in numpy


I have many large (>100,000,000-element) lists of integers that contain many duplicates. I want the indices at which each element occurs. Currently I am doing this:

import numpy as np
from collections import defaultdict

a = np.array([1, 2, 6, 4, 2, 3, 2])
d = defaultdict(list)
for i, e in enumerate(a):
    d[e].append(i)

>>> d
defaultdict(<type 'list'>, {1: [0], 2: [1, 4, 6], 3: [5], 4: [3], 6: [2]})

This method of iterating through each element is time consuming. Is there an efficient or vectorized way to do this?

Edit 1: I tried the methods of Acorbe and Jaime on the following:

a = np.random.randint(2000, size=10000000) 

the results are

original: 5.01767015457 secs
acorbe:   6.11163902283 secs
jaime:    3.79637312889 secs
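The comparison above can be reproduced with a small harness along these lines (a sketch: the array size is reduced here to keep it quick, and the function names are mine):

```python
import timeit
import numpy as np
from collections import defaultdict

def original(a):
    # Baseline: one Python-level append per element.
    d = defaultdict(list)
    for i, e in enumerate(a):
        d[e].append(i)
    return d

def sorted_split(a):
    # Sorting-based approach: one argsort, then split the index array
    # at the boundaries between runs of equal values.
    sort_idx = np.argsort(a)
    a_sorted = a[sort_idx]
    unq_first = np.concatenate(([True], a_sorted[1:] != a_sorted[:-1]))
    unq_items = a_sorted[unq_first]
    unq_count = np.diff(np.nonzero(unq_first)[0])
    unq_idx = np.split(sort_idx, np.cumsum(unq_count))
    return dict(zip(unq_items, unq_idx))

a = np.random.randint(2000, size=1000000)
print('original:', timeit.timeit(lambda: original(a), number=1), 'secs')
print('sorted:  ', timeit.timeit(lambda: sorted_split(a), number=1), 'secs')
```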

This is similar to what was asked here, so what follows is an adaptation of the answer there. The simplest way to vectorize this is to use sorting. The following code borrows a lot from the implementation of np.unique for the upcoming version 1.9, which includes unique item counting functionality, see here:

>>> a = np.array([1, 2, 6, 4, 2, 3, 2])
>>> sort_idx = np.argsort(a)
>>> a_sorted = a[sort_idx]
>>> unq_first = np.concatenate(([True], a_sorted[1:] != a_sorted[:-1]))
>>> unq_items = a_sorted[unq_first]
>>> unq_count = np.diff(np.nonzero(unq_first)[0])

and now:

>>> unq_items
array([1, 2, 3, 4, 6])
>>> unq_count
array([1, 3, 1, 1], dtype=int64)

To get the positional indices for each value, do:

>>> unq_idx = np.split(sort_idx, np.cumsum(unq_count))
>>> unq_idx
[array([0], dtype=int64), array([1, 4, 6], dtype=int64), array([5], dtype=int64),
 array([3], dtype=int64), array([2], dtype=int64)]

You can now construct your dictionary by zipping unq_items and unq_idx.
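Putting the pieces together, that final zip looks like this (a self-contained sketch of the steps above):

```python
import numpy as np

a = np.array([1, 2, 6, 4, 2, 3, 2])
sort_idx = np.argsort(a)
a_sorted = a[sort_idx]
# True at the first position of each run of equal values in the sorted array
unq_first = np.concatenate(([True], a_sorted[1:] != a_sorted[:-1]))
unq_items = a_sorted[unq_first]
unq_count = np.diff(np.nonzero(unq_first)[0])
unq_idx = np.split(sort_idx, np.cumsum(unq_count))

# Map each unique value to the array of positions where it occurs
d = dict(zip(unq_items, unq_idx))
print(d)
```

Note that with the default (non-stable) argsort, the indices inside each group are not guaranteed to come out in ascending order; pass kind='stable' to np.argsort if that matters.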

Note that unq_count doesn't count the occurrences of the last unique item, because that count is not needed to split the index array. If you wanted to have all the values, you could do:

>>> unq_count = np.diff(np.concatenate(np.nonzero(unq_first) + ([a.size],)))
>>> unq_idx = np.split(sort_idx, np.cumsum(unq_count[:-1]))
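For reuse, the full-count variant can be wrapped in a small helper (the name group_indices is mine; a stable sort kind is used here so each group's indices come out in ascending order, matching the defaultdict version):

```python
import numpy as np

def group_indices(a):
    """Map each unique value in `a` to the array of positions where it occurs,
    using one argsort plus a split of the index array."""
    sort_idx = np.argsort(a, kind='stable')
    a_sorted = a[sort_idx]
    # True at the first position of each run of equal values
    unq_first = np.concatenate(([True], a_sorted[1:] != a_sorted[:-1]))
    unq_items = a_sorted[unq_first]
    # Counts for every unique item, including the last one
    unq_count = np.diff(np.concatenate(np.nonzero(unq_first) + ([a.size],)))
    # The last count is not needed as a split point
    unq_idx = np.split(sort_idx, np.cumsum(unq_count[:-1]))
    return dict(zip(unq_items, unq_idx))

print(group_indices(np.array([1, 2, 6, 4, 2, 3, 2])))
```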
