python - Grouping indices of unique elements in numpy -
I have many large (>100,000,000 elements) lists of integers that contain many duplicates. I want to get the indices at which each element occurs. Currently I am doing something like this:
import numpy as np
from collections import defaultdict

a = np.array([1, 2, 6, 4, 2, 3, 2])
d = defaultdict(list)
for i, e in enumerate(a):
    d[e].append(i)

d  # defaultdict(<type 'list'>, {1: [0], 2: [1, 4, 6], 3: [5], 4: [3], 6: [2]})
This method of iterating through each element is time consuming. Is there an efficient or vectorized way to do this?
Edit 1: I tried the methods of acorbe and Jaime on the following:
a = np.random.randint(2000, size=10000000)
The results are:
original: 5.01767015457 secs
acorbe:   6.11163902283 secs
jaime:    3.79637312889 secs
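For context, the kind of comparison behind timings like these can be sketched as follows (a sketch, not the posters' benchmark code: the function names and the reduced array size are mine, and `np.unique`'s `return_counts` keyword requires NumPy >= 1.9):

```python
import timeit
from collections import defaultdict
import numpy as np

def group_loop(a):
    # the original pure-Python approach: one list append per element
    d = defaultdict(list)
    for i, e in enumerate(a):
        d[e].append(i)
    return d

def group_sort(a):
    # vectorized approach: sort once, then split the argsort permutation
    # at the group boundaries given by the per-item counts; mergesort is
    # stable, so each group's indices come out in ascending order
    items, counts = np.unique(a, return_counts=True)
    idx = np.split(np.argsort(a, kind='mergesort'), np.cumsum(counts[:-1]))
    return dict(zip(items, idx))

a = np.random.randint(2000, size=100000)  # smaller than the post's 10M for a quick run
loop_res = group_loop(a)
sort_res = group_sort(a)
assert all(v.tolist() == loop_res[k] for k, v in sort_res.items())
print('loop:', timeit.timeit(lambda: group_loop(a), number=3))
print('sort:', timeit.timeit(lambda: group_sort(a), number=3))
```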
This is similar to what was asked here, so what follows is an adaptation of my answer there. The simplest way to vectorize this is to use sorting. The following code borrows a lot from the implementation of np.unique for the upcoming version 1.9, which includes unique item counting functionality, see here:
>>> a = np.array([1, 2, 6, 4, 2, 3, 2])
>>> sort_idx = np.argsort(a)
>>> a_sorted = a[sort_idx]
>>> unq_first = np.concatenate(([True], a_sorted[1:] != a_sorted[:-1]))
>>> unq_items = a_sorted[unq_first]
>>> unq_count = np.diff(np.nonzero(unq_first)[0])
and now:
>>> unq_items
array([1, 2, 3, 4, 6])
>>> unq_count
array([1, 3, 1, 1, 1], dtype=int64)
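As an aside, with the counting functionality mentioned above (NumPy >= 1.9), the items and counts can be obtained in a single call, which can replace the unq_first bookkeeping when only counts are needed:

```python
import numpy as np

a = np.array([1, 2, 6, 4, 2, 3, 2])
# return_counts is available from NumPy 1.9 onwards
unq_items, unq_count = np.unique(a, return_counts=True)
print(unq_items)  # [1 2 3 4 6]
print(unq_count)  # [1 3 1 1 1]
```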
To get the positional indices for each value, you can do:
>>> unq_idx = np.split(sort_idx, np.cumsum(unq_count))
>>> unq_idx
[array([0], dtype=int64), array([1, 4, 6], dtype=int64), array([5], dtype=int64), array([3], dtype=int64), array([2], dtype=int64)]
You can now construct your dictionary by zipping unq_items and unq_idx.

Note that unq_count doesn't count the occurrences of the last unique item, because that is not needed to split the index array. If you wanted to have all the values, you could do:
>>> unq_count = np.diff(np.concatenate(np.nonzero(unq_first) + ([a.size],)))
>>> unq_idx = np.split(sort_idx, np.cumsum(unq_count[:-1]))
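Putting the whole recipe together, here is a small self-contained sketch (the function name group_indices is mine, not from the post; kind='mergesort' is an added choice so that each group's index array comes out in ascending order, matching the dictionary in the question — the default quicksort may reorder indices within a group):

```python
import numpy as np

def group_indices(a):
    """Return {value: array of indices where value occurs}, via one sort."""
    # stable sort so equal elements keep their original relative order
    sort_idx = np.argsort(a, kind='mergesort')
    a_sorted = a[sort_idx]
    # True at the first occurrence of each run of equal values
    unq_first = np.concatenate(([True], a_sorted[1:] != a_sorted[:-1]))
    unq_items = a_sorted[unq_first]
    # run lengths of all groups except the last (enough to split on)
    unq_count = np.diff(np.nonzero(unq_first)[0])
    unq_idx = np.split(sort_idx, np.cumsum(unq_count))
    return dict(zip(unq_items, unq_idx))

a = np.array([1, 2, 6, 4, 2, 3, 2])
groups = group_indices(a)
print({int(k): v.tolist() for k, v in groups.items()})
# {1: [0], 2: [1, 4, 6], 3: [5], 4: [3], 6: [2]}
```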