OK, memory management problem.

So I have an ebook reader, and I naturally want to create a Project Gutenberg book downloader so first-time users can read something out of the box.

Fine. So I envisioned a library panel over GlazedLists, like the one I already have for local files, except that instead of showing all possibilities it only begins to show results after 3-4 characters are typed. That should be enough for filtering, right? And it works almost the same way; I like consistency.
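The trigger is just a listener on the search field; a minimal sketch in plain Swing (searchField and runRemoteSearch are hypothetical names, and nothing here is GlazedLists-specific):

    import javax.swing.event.DocumentEvent;
    import javax.swing.event.DocumentListener;

    searchField.getDocument().addDocumentListener(new DocumentListener() {
        private void maybeSearch() {
            String text = searchField.getText();
            if (text.length() >= 3) {      // only suggest once 3+ characters are typed
                runRemoteSearch(text);     // hypothetical: fills the panel's EventList
            }
        }
        @Override
        public void insertUpdate(DocumentEvent e) { maybeSearch(); }
        @Override
        public void removeUpdate(DocumentEvent e) { maybeSearch(); }
        @Override
        public void changedUpdate(DocumentEvent e) { maybeSearch(); }
    });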

So I need a way to search. Looking at the Gutenberg site, there is an RDF catalog file, 5 MB zipped. Wunderbar, I think. It is actually 100 MB unzipped, so don't unzip.
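Not unzipping just means streaming the entry; a sketch with java.util.zip, assuming the catalog download is a single-entry zip (the file name is illustrative):

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipInputStream;

    // stream the RDF straight out of the zip; the 100 MB never touch the disk
    ZipInputStream zin = new ZipInputStream(new FileInputStream("catalog.rdf.zip"));
    ZipEntry entry = zin.getNextEntry(); // positions the stream at the RDF entry
    InputStream rdf = zin;               // hand this to the parser/indexer, then close it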

So then I need an indexing method for rapid searching of the RDF. Lucene comes to mind, and Google finds LuceneSail easily.

First Problem: LuceneSail indexes and duplicates the text, so those 100 MB become 210 MB on disk somewhere. But at least searches are fast.

Second Problem: indexing takes forever (5-8 minutes) and eats too much memory for the main program. The memory pressure on the main program can be alleviated by forking the indexing into a separate JVM with its own heap (given a monster machine with lots of memory). Then I can do this:


    import java.io.BufferedReader;
    import java.io.File;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.logging.Level;
    import java.util.logging.Logger;

    /**
     * Creates a new process that runs a new JVM on the main method of the
     * given class, with the selected JVM arguments. It pipes the output and
     * error streams of the forked JVM into the current JVM. The forked JVM
     * uses the same java executable and classpath as the current one.
     * @param klass class with a main method
     * @param args JVM options (e.g. heap settings), placed before the class name.
     */
    public static void forkJavaAndWait(Class<?> klass, String... args) throws IOException, InterruptedException {
        // "java.exe" only exists on Windows; drop the extension elsewhere
        String javaExe = System.getProperty("java.home") + File.separator + "bin" + File.separator
                + (System.getProperty("os.name").toLowerCase().contains("win") ? "java.exe" : "java");
        String classpath = System.getProperty("java.class.path");
        List<String> l = new ArrayList<String>(4 + args.length);
        l.add(javaExe);
        l.add("-cp");
        l.add(classpath);
        l.addAll(Arrays.asList(args)); // JVM options must precede the class name
        l.add(klass.getCanonicalName());
        ProcessBuilder pb = new ProcessBuilder(l);
        pb.redirectErrorStream(true);
        final Process p = pb.start();
        // ProcessBuilder quirk: with redirectErrorStream(false) this would need
        // two threads, one per stream, or the child could block on a full pipe
        new Thread(new Runnable() {
            @Override
            public void run() {
                // with redirectErrorStream(true), getInputStream() carries stdout and stderr
                BufferedReader output = new BufferedReader(new InputStreamReader(p.getInputStream()));
                String line;
                try {
                    while ((line = output.readLine()) != null) {
                        System.out.println(line);
                    }
                } catch (IOException ex) {
                    Logger.getLogger(IoUtils.class.getName()).log(Level.SEVERE, null, ex);
                }
            }
        }, "ProcessBuilderInputStreamConsumer").start();
        int e = p.waitFor();
        if (e != 0) {
            p.destroy();
            throw new IllegalStateException("forked java process exited with error code " + e);
        }
    }
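Calling it is then just, e.g. (Indexer being a hypothetical class whose main builds the index, and -Xmx1g an example heap setting):

    forkJavaAndWait(Indexer.class, "-Xmx1g");

The big heap lives and dies with the forked process, so the reader itself stays lean.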

Third Problem: but the index files are not deleted, for some stupid reason, if the java process is killed (the deletion sits in a finally in the given class's main; do I have to use a shutdown hook or the SignalHandler?).
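For reference, a finally in main never runs when the JVM is killed from outside; a shutdown hook does run on normal exit and on SIGINT/SIGTERM, though nothing survives a kill -9. A minimal sketch, with indexDir and deleteRecursively as hypothetical stand-ins:

    // registered early in the forked indexer's main
    Runtime.getRuntime().addShutdownHook(new Thread(new Runnable() {
        @Override
        public void run() {
            deleteRecursively(indexDir); // hypothetical cleanup of the half-built index
        }
    }, "IndexCleanupHook"));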

What would you prefer:

  1. Stupid search that scrapes Project Gutenberg webpages and doesn't show possibilities as you type.

  2. Smart search that shows possibilities after some typing and eats 210 MB,
    2a) and that needs a beastly machine and takes 5 minutes to create (once) or update (more than once);
    2b) and that requires downloading a (24 MB) zipped index and unzipping it (once);
    2c) and that is created by the installer (which I don't have yet) and otherwise works like 2a.

I’m using Lucene for a project at work, and I managed to get an index at a third of the size of the original data. I also ran some indexing speed tests.

Test facts
Number of files: 95,000
Total amount of data: 1.13 GB

Test 1 (single thread indexer)
In this first test I just wanted to index all the data.

Time consumed: 12 minutes 12 seconds

Test 2 (multithreaded indexer)
In this second test I used multiple threads to index all the data to see if I could speed things up.

Time consumed: 6 minutes 17 seconds

Test 3 (searching)
My final test was a search test. For this test I changed one of the 95,000 files to contain my name and ran a search for it.

Time consumed: 93 milliseconds

Those are my very simple test results for Lucene. In the end I went with single-threaded indexing because I usually don't have that many things to index; I index things as they are added. My test files were basically text files of about 12 KB each, the typical thing I wanted to index.
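For anyone curious, the single-threaded case is only a few lines; a minimal sketch against a recent Lucene (5+) API, where the field name, path, and fileText are placeholders rather than the actual test code:

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    IndexWriter writer = new IndexWriter(
            FSDirectory.open(Paths.get("index")),
            new IndexWriterConfig(new StandardAnalyzer()));
    Document doc = new Document();
    // Store.NO indexes the text for search without keeping a copy of it,
    // which is how the index can end up smaller than the source data
    doc.add(new TextField("contents", fileText, Field.Store.NO));
    writer.addDocument(doc);
    writer.close();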

// Json

:smiley:

Managed to reduce the indexing time to 41 seconds and the space to 35 MB by filtering out just the parts of the RDF I care about and optimizing them (removing redundant tags, etc.).
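The filtering itself can be done with a streaming parser, so the 100 MB of RDF never have to sit in memory at once; a sketch with StAX, where the element names are illustrative rather than the exact Gutenberg RDF schema:

    import java.io.FileInputStream;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;

    // works just as well on the ZipInputStream from the earlier sketch
    XMLStreamReader r = XMLInputFactory.newInstance()
            .createXMLStreamReader(new FileInputStream("catalog.rdf"));
    while (r.hasNext()) {
        if (r.next() == XMLStreamConstants.START_ELEMENT
                && ("title".equals(r.getLocalName()) || "creator".equals(r.getLocalName()))) {
            String value = r.getElementText(); // index only this, drop the rest
        }
    }
    r.close();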

The LuceneSail guys should do this themselves (they are, after all, indexing for normally static queries).
Sent the appropriate bitching mail with suggestions.