random seed sequences as a form of data compression

If, for example I want to have 5 colour values.

0xFF0000, 0xFF00, 0xFF, 0x0,0xFFFFFF

but I’m not fussy on their exact value (+/-50 per colour channel), nor am I fussy about the order they appear.

Then I use a simple random number generator that exposes the seed.


         static long seed;
	 static int nextInt() {
	       seed = (seed * 0x5DEECE66DL + 0xBL) & ((1L << 48) - 1);
	       return (int)(seed >>> 16);
	 }

I find that the seed: 60792109829226
Gives the order, and approximations:

1st) ff0000~=f80407
2nd) ff00~=ed11f
3rd) 0~=81622
4th) ffffff~=e8f9fd
5th) ff~=1b21e3

Obviously finding the desired seed is a computationally expensive operation, and the likelyhood of finding a match rapidly degrades the longer the sequence of values it is that you’re looking for.
You’d also have to make quite good use of it, to recoup the cost of the extra method. (though that mostly pays for itself if you were using the Random class anyway).

It avoids array declarations (new int[]{0xFF0000, 0xFF00, 0xFF, 0x0,0xFFFFFF}), which are obviously costly in terms of bytecode.
Alternatively, it avoids the need for code to unpack said data from Strings/a class file attribute.

What do people think to this technique of lossy compression?
Would it interfere with the zip/pack200 compression enough to negate any gains?

You only have a three bit numbers here: 100, 010, 001, 000, 111 (4,2,1,0,7). So there’s very likely to be a shorter code sequence…esp if the order is changeable.

True, though that data was intended as a trivial example.

it’s a general purpose solution that could be used to store anything; rgb values, vertex coordinates, or any other data that takes your fancy - no new code, or data.

I don’t know what the longest viable chain is; it might be possible to ‘store’ 6 or more imprecise integers into the long seed.

Ah. At one point I actually did toy around with something like this for strings…used a simple LCG and mod, then brute force search for all things that fit an English word like pattern of lengths greater than 4 (32-bit seed)…it was surprisingly long the it spewed out list…humm…I wonder where that code is…

For a general solution, you’re basically fighting the odds.

The chance that any seed results in an error less than or equal to 50, you’re dealing with these odds: (50/256)^(3*5) = 1 / 43,556,142,965

My output after a few minutes was this:


seed: 60792109829226, max error: 46
seed: 60943934310711, max error: 46
seed: 60880856400310, max error: 34
seed: 61032294632588, max error: 49
seed: 61033662597542, max error: 39
seed: 61034205984358, max error: 47
seed: 61053432578156, max error: 49
seed: 61057972948251, max error: 49
seed: 61086760880060, max error: 49
seed: 61122022919419, max error: 50
seed: 61286123365503, max error: 48
seed: 61161596990334, max error: 49
seed: 61359747097172, max error: 49
seed: 61037664983589, max error: 50
seed: 61243433059414, max error: 49
seed: 61503002433787, max error: 49
seed: 61372816723026, max error: 37

Which means in this range it found 1 match, for every 34,159,229,047 seeds.

Once you enter the the territory of vertex coordinates, your odds are so small that it simply becomes impracticle to find even one good seed. You’d end up with mutiple seeds to define a shape, deminishing the compression significantly.

If you have a lot of CPU cycles at your disposal, with no electricity costs (or having money to burn)… it may be feasible.

In the past i did investigate doing exactly what you tried, i even used it as part of a sprite compression method where the deltas from between a random number and the pixel colour was recorded. It was only useful in the case of extremely noisy sprites that were already too expensive to compress any way. So over all i did not achieve a net loss of bytes :frowning:

Mind due, I have used it successfully when finding the optimal ordering of sprites… but that was part the “build” process and not part of the applet code.

I had a lovely dream where you gave a seed to a PRNG and that generated the next 2 seeds which made the next 4 seeds… and so on… until it made an image…
As Riven says, the tough bit is finding the first seed!
I bet a really good mathematician could devise a convergence system where you worked both ways, finding seeds for pixels and pixels for seeds, improving the approximation every time. Maybe.

Can anyone recall a (possibly fictional) Russian algorithm based on prime numbers where you could get perfect data recovery from something like 23442791-1?

I really don’t think it is a great method to take a set of data and then look for a seed that will generate that data.

A better idea is to look for seeds that will generate the appropriate data. For example, say you have a map generation algorithm. You can plugin various seeds for that map generation (along with maybe other parameters like ‘snowy’, ‘cold’ etc…) and it will generate a map. Look for seeds that will produce a desirable map, then save your ‘map’ as a seed.

Then add features (such as quests etc) to that map that will be generated with that seed.

TLDR: Going from seed to desirable data I think is practical. But trying to find a seed for a set of predefined data is almost impossible (practically speaking.)

Depending on how intelligent the generation algorithm is, it may be practical to search for desirable sets of data generated with seeds, rather than going in the reverse direction.

I agree with Jeremy and Riven, it’s only practical to go from a seed to data, not the other way.

PRNGs are a classic example of the usefulness of one-way functions, where it’s easy to generate a sequence of evenly distributed numbers, but it’s very hard to determine the seed that lead to that sequence. If you are able to take a set of data and reliably figure out a seed that generates that data, your PRNG is severly broken. For example, if JGO were using a broken PRNG I could:

  • take the login token JGO generated for me when I logged in here
  • determine the seed
  • use that seed to figure out the next 50 or 500 or 5000 tokens
  • keep swapping in tokens and impersonating everyone who logged in after me

There are ways to mitigate the risk of such an attack, and I expect banking sites do that, but you get my point. PRNGs use one-way functions that are designed to be hard to reverse, so you’re probably wasting time if you make that effort.

I remember that there was a tool out there to predict the cards in an online poker game. You would insert all the cards you see and over time the predictions would get better.

That’s why you have to shuffle the deck prior to every hand, or people/apps will start counting cards.

It’s more than card counting; it’s seed (and algorithm) prediction, so would work just fine through a shuffle.

Frequently regenerating the seed from the system clock seems to be the only reliable defence in that situation.

The ability to deduce the seed make no statement about a PRNG’s statistical quality. All the “excellent” PRNGs you can deduce the seed. Also none of them are suited for use as a cryptographic hashing function.

That might work in practice but I won’t want to have an investment an online gambling site that use this as a strategy. Correlated seeding yields correlated sequences…esp in early entries.

Short of a brute-force attack on a PRNG with a small output-space, I’m not sure how you would deduce the seed for a PRNG. Deducing the seed of a PRNG is equivalent to deducing the next bit in the sequence with a probability of success greater than 50%. The ability to do that within a reasonable amount of calculations mean your PRNG isn’t very good. To put it another way, good PRNGs are implemented using one or more one-way functions. If you are able to take the output of a PRNG and deduce the seed, you have successfully broken that combination of one-way functions.

Not all applications of PRNGs require CSPRNGs.

Using a crypto rng for non-crypto usage is IMHO rather silly. The cost vs. quality isn’t there to make it worthwhile. Folks do…there’s various paper using them on the GPU for instance as a PRNG. But they’d be much better of using a standard hashing or PRNG (stateless vs. with state) function instead.

A non-crypto PRNG is better described as a permutation sequence when the state is viewed as integer (not always true, but close enough…true for all reasonable choices).