Fast way to filter through data

I am looking for a good way to look through a data set and remove all values smaller then for example (50) and bigger then (100), all methods i have tried do not work well when dealing with very large arrays, any suggestions?

What methods have you tried? Simple loop (perhaps split over N threads) is about as good as it gets.

Oh boy, an excuse to use Java 8!


import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

public class Test {
	public static void main(String... args){

		List<Integer> numbers = new ArrayList<Integer>();

		for(int i = 0; i < 10000; i++){
			int n = (int) (Math.random()*150);
			numbers.add(n);
		}

		System.out.println("Unfiltered size:" + numbers.size());

		List<Integer> filteredNumbers = 
				numbers.parallelStream()
				.filter(t -> t >= 50 && t <= 100)
				.collect(Collectors.toList());

		System.out.println("Filtered size:" + filteredNumbers.size());
	}
}


Depending on your data, it might also make sense to store it as a sorted data structure, then just remove anything outside of your limits.

You might also consider using a database and letting it do the work for you.

Id love to use a DB but its for the stored chunks in the game, its just to free up some un-used data, but ill give that a shot, ty :slight_smile:

if you do not care about order :

T[] stuff = ...
int size = stuff.length;

for(int i = 0; i < size; i++)
{
  T element = stuff[i];

  if(removeThing(element))
  {
    stuff[i] = stuff[--size];
    stuff[size] = null;

    if(stuff[i] != null) i--;
  }
}

T newstuff[] = new T[size];

System.arraycopy(stuff, 0, newstuff, 0, size);

// newstuff contains what was not removed from stuff.


if order must be kept then things wont be fast. the way ArrayList works :

void removeKeepOrder(int index)
{
  int numMoved = size - index - 1;
  if(numMoved > 0) System.arraycopy(stuff,index + 1,stuff,index,numMoved);
  stuff[--size] = null;
}

but this is slow when you remove items in close to the start.

another somewhat fast way to remove things from a ordered list is to use a doubly-linked-list.

@Kevin

Unboxed:

import java.util.Arrays;

public class Test {
	public static void main(String... args) {
		
		int[] numbers = new int[1024];
		int size = 0;
		
		for (int i = 0; i < 10000; i++) {
			int n = (int) (Math.random() * 150);
			
			if (size == numbers.length)
				numbers = Arrays.copyOf(numbers, numbers.length * 2);
			numbers[size++] = n;
		}
		
		System.out.println("Unfiltered size:" + size);
		
		int[] filteredNumbers =
				Arrays.stream(numbers).parallel()
						.filter(t -> t >= 50 && t <= 100)
						.toArray();
		
		System.out.println("Filtered size:" + filteredNumbers.length);
	}
}

EDIT: actually doesn’t seem that much faster until ~1M elements.

If you can cheaply sort the list or the list is already sorted, then it’s simply a matter of cutting off all elements after a certain index, which can easily be found using a binary search in the sorted list.

I hope this thread isn’t considered too old to post on, but you might try this:

for (int i = 0; i < numbers.length; i++)
	if (((numbers[i] - lower) & 0xFFFF) <= (upper - lower)) 
		// numbers[i] is between lower and upper.

You can adjust that as necessary. This yielded a significantly better performance than a simple loop with a check between bounds. The way it works is that it treats numbers[i] - lower as an unsigned number. If the result was negative, it will be perform integer wrapping and go to a higher value than upper - lower will be.

Like BurntPizza has mentioned, splitting this across multiple threads will likely be your biggest performance improvement.

Sorting the array first may be quicker in some cases, depending on what algorithm you use and how sporadic the numbers are in sequence. A quick test using Arrays.sort() on an array with completely arbitrary values yielded significantly worse performance than a simple loop with a check between bounds.

Ideally, you don’t want to keep removing elements from the array you have, a huge amount of time will be spent doing that. Instead, if you can, create a second array, and push appropriate numbers in to that one. You can reserve the array size to be the probability of a number being pushed there multiplied by the amount of numbers in your data set.

:slight_smile:

Sort is O(n log n). Which then enables O(log n) binary search method.
Check bounds is O(n).

It might also be possible to not inserting those values on list all together.

The quickest way if its already in memory just iterate over the whole thing with an insertion point… this is O(n), you can’t do better and this is quite a bit faster than sorting.

For an array… something like this: (probably with some +1 -1 error… )


int insert=0;
for(int i=0;i<size;i++){
    if(list[i] passes keep test){
        list[insert]=list[i];
        insert++;
    }
}
//set everything from insert+1 to size to some "empty value" 

Awesome tip delt0r, I didn’t think about that, I’ll be sure to use that whenever appropriate.