python - Grouping indices of unique elements in numpy -
I have many large (>100,000,000 elements) lists of integers that contain many duplicates. I want to get the indices at which each element occurs. Currently I am doing something like this:
import numpy as np
from collections import defaultdict

a = np.array([1, 2, 6, 4, 2, 3, 2])
d = defaultdict(list)
for i, e in enumerate(a):
    d[e].append(i)

d  # defaultdict(<type 'list'>, {1: [0], 2: [1, 4, 6], 3: [5], 4: [3], 6: [2]})
This method of iterating through each element is time consuming. Is there an efficient or vectorized way to do this?
Edit 1: I tried the methods of acorbe and Jaime on the following:
a = np.random.randint(2000, size=10000000)
The results are:
original: 5.01767015457 secs
acorbe:   6.11163902283 secs
jaime:    3.79637312889 secs
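For context, the kind of comparison behind timings like these can be sketched as follows (a sketch, not the posters' benchmark code: the function names and the reduced array size are mine, and `np.unique`'s `return_counts` keyword requires NumPy >= 1.9):

```python
import timeit
from collections import defaultdict
import numpy as np

def group_loop(a):
    # the original pure-Python approach: one list append per element
    d = defaultdict(list)
    for i, e in enumerate(a):
        d[e].append(i)
    return d

def group_sort(a):
    # vectorized approach: sort once, then split the argsort permutation
    # at the group boundaries given by the per-item counts; mergesort is
    # stable, so each group's indices come out in ascending order
    items, counts = np.unique(a, return_counts=True)
    idx = np.split(np.argsort(a, kind='mergesort'), np.cumsum(counts[:-1]))
    return dict(zip(items, idx))

a = np.random.randint(2000, size=100000)  # smaller than the post's 10M for a quick run
loop_res = group_loop(a)
sort_res = group_sort(a)
assert all(v.tolist() == loop_res[k] for k, v in sort_res.items())
print('loop:', timeit.timeit(lambda: group_loop(a), number=3))
print('sort:', timeit.timeit(lambda: group_sort(a), number=3))
```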
This is similar to what was asked here, so what follows is an adaptation of my answer there. The simplest way to vectorize this is to use sorting. The following code borrows a lot from the implementation of np.unique for the upcoming version 1.9, which includes unique item counting functionality, see here:
>>> a = np.array([1, 2, 6, 4, 2, 3, 2])
>>> sort_idx = np.argsort(a)
>>> a_sorted = a[sort_idx]
>>> unq_first = np.concatenate(([True], a_sorted[1:] != a_sorted[:-1]))
>>> unq_items = a_sorted[unq_first]
>>> unq_count = np.diff(np.nonzero(unq_first)[0])
and now:
>>> unq_items
array([1, 2, 3, 4, 6])
>>> unq_count
array([1, 3, 1, 1, 1], dtype=int64)
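As an aside, with the counting functionality mentioned above (NumPy >= 1.9), the items and counts can be obtained in a single call, which can replace the unq_first bookkeeping when only counts are needed:

```python
import numpy as np

a = np.array([1, 2, 6, 4, 2, 3, 2])
# return_counts is available from NumPy 1.9 onwards
unq_items, unq_count = np.unique(a, return_counts=True)
print(unq_items)  # [1 2 3 4 6]
print(unq_count)  # [1 3 1 1 1]
```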
To get the positional indices for each value, you can do:
>>> unq_idx = np.split(sort_idx, np.cumsum(unq_count))
>>> unq_idx
[array([0], dtype=int64), array([1, 4, 6], dtype=int64), array([5], dtype=int64), array([3], dtype=int64), array([2], dtype=int64)]
You can now construct your dictionary by zipping unq_items and unq_idx.

Note that unq_count doesn't count the occurrences of the last unique item, because that is not needed to split the index array. If you wanted to have all the values, you could do:
>>> unq_count = np.diff(np.concatenate(np.nonzero(unq_first) + ([a.size],)))
>>> unq_idx = np.split(sort_idx, np.cumsum(unq_count[:-1]))
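Putting the whole recipe together, here is a small self-contained sketch (the function name group_indices is mine, not from the post; kind='mergesort' is an added choice so that each group's index array comes out in ascending order, matching the dictionary in the question — the default quicksort may reorder indices within a group):

```python
import numpy as np

def group_indices(a):
    """Return {value: array of indices where value occurs}, via one sort."""
    # stable sort so equal elements keep their original relative order
    sort_idx = np.argsort(a, kind='mergesort')
    a_sorted = a[sort_idx]
    # True at the first occurrence of each run of equal values
    unq_first = np.concatenate(([True], a_sorted[1:] != a_sorted[:-1]))
    unq_items = a_sorted[unq_first]
    # run lengths of all groups except the last (enough to split on)
    unq_count = np.diff(np.nonzero(unq_first)[0])
    unq_idx = np.split(sort_idx, np.cumsum(unq_count))
    return dict(zip(unq_items, unq_idx))

a = np.array([1, 2, 6, 4, 2, 3, 2])
groups = group_indices(a)
print({int(k): v.tolist() for k, v in groups.items()})
# {1: [0], 2: [1, 4, 6], 3: [5], 4: [3], 6: [2]}
```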