Java4K Sourcecode Compressor Community Effort (JSCCE)

Here is a first lousy attempt, kicking off the effort to create a community source code compressor for Java :point:


import java.io.*;
import java.util.regex.*;

public class LousySourcecodeCompressor {
	public static void main(String[] args) throws Exception {
		File file = new File(args[0]);

		int origLength = 0;
		StringBuilder trimmed = new StringBuilder();
		try (BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(file)))) {
			while (true) {
				String line = br.readLine();
				if (line == null)
					break; // reached end of file

				origLength += line.length() + 1; // +1 for line-break
				line = line.trim(); // strip optional whitespace
				if (line.isEmpty())
					continue; // strip empty lines

				// keep the line-break: the MULTILINE patterns below rely on line boundaries
				trimmed.append(line).append('\n');
			}
		}

		String code = trimmed.toString();

		// make a lousy attempt at stripping comments
		// (space/tab instead of \s, which would also match line-breaks and swallow the next line)
		code = Pattern.compile("//[a-zA-Z \\t]+$", Pattern.MULTILINE).matcher(code).replaceAll("");

		// make a lousy attempt at stripping annotations
		code = Pattern.compile("^@[a-zA-Z]+$", Pattern.MULTILINE).matcher(code).replaceAll("");

		// make a lousy attempt at stripping optional whitespace
		code = code.replaceAll("\\s*([\\+\\-\\*/%,\\(\\)\\{\\}\\[\\]=;:<>!])\\s*", "$1");

		System.out.println(code.length() + "/" + origLength);
	}
}

Run on a small Java4K entry, it produces the following (N.B.: newlines inserted by me):


package net.indiespot.java4k.entries;import java.awt.*;import net.indiespot.java4k.Java4kRev1;public
class OddEntry extends Java4kRev1{public OddEntry(){name="Odd Entry";}public void render(Graphics2D
g){g.setColor(new Color(128,64,128));g.drawString("Drag the mouse a little...",8,20);g.setColor(new
Color(0,64,128));int b,r,a,c,q,t;for(r=150;r>=30;r-=15){t=((elapsed()+130_000)/((200-r)/25));b=r/3;
!=null){g.setColor(new Color(128,64,128));Rectangle w=mouse.dragArea;g.drawRect(w.x,w.y,w.width,w.height);}}}

Note that it does not yet respect string literals, which makes it borderline useless :slight_smile:

Another neato feature might be to automatically compress identifiers to their minimal representation as well, e.g. A, B, C, etc., but that’d require a somewhat more sophisticated parser…
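For what it's worth, handing out the minimal names is the easy half. Here is a minimal sketch, assuming a made-up `ShortNames` helper (nothing in it comes from an actual minifier in this thread) that produces a, b, ..., Z, aa, ab, ... and skips the short Java keywords. Deciding which identifiers can safely be renamed is the part that needs the parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical helper for illustration only. */
class ShortNames {
    static final String ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

    // Every Java keyword of up to three letters, plus "var" to stay on the safe side.
    // Assumption: fewer than 52 + 52^2 + 52^3 names are needed, so longer keywords never come up.
    static final Set<String> RESERVED = Set.of("do", "if", "for", "int", "new", "try", "var");

    /** The n-th candidate in shortest-first order (bijective base 52): a, b, ..., Z, aa, ab, ... */
    static String candidate(int n) {
        StringBuilder sb = new StringBuilder();
        n++;
        while (n > 0) {
            n--;
            sb.append(ALPHABET.charAt(n % ALPHABET.length()));
            n /= ALPHABET.length();
        }
        return sb.reverse().toString();
    }

    /** The first 'count' usable names, skipping keywords and names already taken. */
    static List<String> names(int count, Set<String> taken) {
        List<String> result = new ArrayList<>();
        for (int n = 0; result.size() < count; n++) {
            String name = candidate(n);
            if (!RESERVED.contains(name) && !taken.contains(name))
                result.add(name);
        }
        return result;
    }
}
```

For example, `ShortNames.names(3, Set.of("b"))` yields a, c and d, because b is already in use.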

Cas :slight_smile:

Cas: ‘auto refactoring’ is left as an exercise for the reader. :point:

src → [javac → proguard → decompiler] → src → strip whitespace → :persecutioncomplex: → win!

I’m working on improving the minifier; the most significant new feature so far is string literal preservation.
The only wrench in the gears right now is that anything that looks like a string literal inside a comment screws up the other literals in the file, but I know how to fix it.

It also still uses Riven’s crazy whitespace eliminator expression; I haven’t toyed with that at all yet.
Test results:

Thought about limited identifier compression, it’ll be tough I think, at least for anything other than primitive types.
Any other features it should have?

It’s not that hard actually. You only need a parser capable of finding two kinds of comments, string literals and char literals, using a simple state machine. You replace these ranges by placeholders, apply whatever transformation that would corrupt what you replaced, and then inject the literals back in. I just didn’t feel like actually doing it… :persecutioncomplex:
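The approach above could be sketched roughly like this. Everything here is an assumption made for illustration: the `LiteralMasker` name, the choice to drop comments outright (each becomes one space so that neighbouring tokens don't fuse), and the placeholder delimiters (control characters 1 and 2, assumed not to occur in the source). Java text blocks (`"""`) are not handled.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: hides literals behind numbered placeholders, drops comments. */
class LiteralMasker {
    // Assumed delimiters; a real tool would first verify the source does not contain them.
    static final char OPEN = (char) 1, CLOSE = (char) 2;

    /** saved.get(i) is the literal hidden behind placeholder number i. */
    final List<String> saved = new ArrayList<>();

    String mask(String src) {
        StringBuilder out = new StringBuilder();
        int i = 0, n = src.length();
        while (i < n) {
            char c = src.charAt(i);
            if (src.startsWith("//", i)) {            // line comment: skip up to the line-break
                int end = src.indexOf('\n', i);
                i = end < 0 ? n : end;
                out.append(' ');
            } else if (src.startsWith("/*", i)) {     // block comment: skip past the closing */
                int end = src.indexOf("*/", i + 2);
                i = end < 0 ? n : end + 2;
                out.append(' ');
            } else if (c == '"' || c == '\'') {       // string or char literal
                int end = i + 1;
                while (end < n && src.charAt(end) != c) {
                    if (src.charAt(end) == '\\')
                        end++;                        // skip the escaped character
                    end++;
                }
                end = Math.min(end + 1, n);           // include the closing quote
                out.append(OPEN).append(saved.size()).append(CLOSE);
                saved.add(src.substring(i, end));
                i = end;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    /** Puts the saved literals back in place of their placeholders. */
    String unmask(String text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c == OPEN) {
                int close = text.indexOf(CLOSE, i);
                out.append(saved.get(Integer.parseInt(text.substring(i + 1, close))));
                i = close + 1;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }
}
```

Usage would be `mask`, then whatever regex transformation you like, then `unmask`. Because comments are recognised before quotes are, a quote inside a comment never starts a literal.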

So much to do, so little time.

Placeholders are exactly what I did, but I got greedy and used the same placeholder everywhere (so that I’d only have to confirm the file doesn’t contain one sequence); they’ll have to be numbered.

Numbered or cryptographically hashed, whichever utility code is around :point:

After about an hour, I reached 890 too, without placeholders, just tokens :slight_smile:

Here is my horrific state-machine / tokenizer:
(ironic how my PHP parser b0rks)

Nice. Currently both minifiers are tied at 927 for my current test case:


I’d like to see if anyone can manage to break either of them!
I suspect mine would be flakier, maybe around some annotation edge cases…

It’s relatively easy to put them through a stress-test. You simply minify the minifier, and see whether it produces a working version of itself again… The version I posted can’t do it, fixing it now. (actually, going to bed…)
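Short of the full self-hosting test, a cheaper first check is whether the minified output still compiles at all. Below is a minimal sketch using the standard `javax.tools` API; it assumes a JDK at runtime (a bare JRE has no system compiler), and `CompileCheck` and `StringSource` are made-up names. It only shows that the output compiles, not that it still behaves the same.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.file.Files;
import java.util.List;
import javax.tools.DiagnosticCollector;
import javax.tools.JavaCompiler;
import javax.tools.JavaFileObject;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;

/** Hypothetical smoke test: does a piece of (minified) source still compile? */
class CompileCheck {
    /** An in-memory source file, so nothing has to be written to disk first. */
    static final class StringSource extends SimpleJavaFileObject {
        private final String code;

        StringSource(String className, String code) {
            super(URI.create("string:///" + className + ".java"), Kind.SOURCE);
            this.code = code;
        }

        @Override
        public CharSequence getCharContent(boolean ignoreEncodingErrors) {
            return code;
        }
    }

    static boolean compiles(String className, String source) {
        JavaCompiler javac = ToolProvider.getSystemJavaCompiler();
        if (javac == null)
            throw new IllegalStateException("no system compiler: run this on a JDK");
        try {
            // class files go to a throwaway directory
            String outDir = Files.createTempDirectory("minify-check").toString();
            DiagnosticCollector<JavaFileObject> diagnostics = new DiagnosticCollector<>();
            return javac.getTask(null, null, diagnostics, List.of("-d", outDir), null,
                    List.of(new StringSource(className, source))).call();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```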

(I don’t know if I’m/we’re hijacking the thread yet, but…)

There was one problem (maybe you already found it?):

-key="#&"+++seed // error: invalid operation ++/--
+key="#&"+ ++seed

The lexer is greedy: [icode]+++[/icode] is always read as [icode]++ +[/icode], so post-increment wins over pre-increment (at the tokenizing stage, even!). The whitespace eliminator should probably take that into account.
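If the minifier works on tokens rather than raw text, one conservative way around this is to keep a single space whenever two neighbouring tokens would otherwise put the same sign next to itself. A minimal sketch, assuming the source has already been split into tokens (the `TokenJoiner` name is made up). It sometimes keeps a space that isn't strictly needed, e.g. in [icode]a-- -b[/icode].

```java
import java.util.List;

/** Hypothetical sketch: glue tokens back together without changing how they lex. */
class TokenJoiner {
    /** True if removing the space between the two tokens could merge or re-split them. */
    static boolean needsSpace(String left, String right) {
        char a = left.charAt(left.length() - 1);
        char b = right.charAt(0);
        // "int x" must not become "intx"
        boolean bothWordChars = Character.isJavaIdentifierPart(a) && Character.isJavaIdentifierPart(b);
        // "+ ++x" must not become "+++x", "- -x" must not become "--x"
        boolean sameSign = (a == '+' || a == '-') && a == b;
        return bothWordChars || sameSign;
    }

    static String join(List<String> tokens) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.size(); i++) {
            if (i > 0 && needsSpace(tokens.get(i - 1), tokens.get(i)))
                sb.append(' ');
            sb.append(tokens.get(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints: key="#&"+ ++seed
        System.out.println(join(List.of("key", "=", "\"#&\"", "+", "++", "seed")));
    }
}
```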

No worries, I’ll just split it off, tomorrow :slight_smile:

s = s.replaceAll("(\\G|([\\+|\\-]))\\s+(\\1)", "[$2,$3]");
s = s.replaceAll("\\s*([\\+\\-\\*/%,\\.\\(\\)\\{\\}\\[\\]=;:<>!&\\|\\^])\\s*", "$1");
s = s.replaceAll("\\[\\,\\]", "").replaceAll("\\[([\\+|\\-]?)\\,(\\1?)\\]", "$1 $2");

Input:  [icode]abc ++ + 6 + ++ xyz -- - 7 - -- pqr -- + x ++ - 4[/icode]
Output: [icode]abc++ +6+ ++xyz-- -7- --pqr--+x++-4[/icode]

Minifier: :persecutioncomplex:
Minified minifier: (with free linefeeds @ column ~80)

Update 1
Input:    [icode]abc ++ + ++ xyz -- - 7 - -- pqr -- + x ++ - 4[/icode]
Output:   [icode]abc++ +++xyz-- -7- --pqr--+x++-4[/icode] :emo:
Required: [icode]abc+++ ++xyz-- -7- --pqr--+x++-4[/icode] :emo:

Fix 1

s = new StringBuilder(s).reverse().toString(); // new: process the string back to front
s = s.replaceAll("(\\G|([\\+|\\-]))\\s+(\\1)", "[$2,$3]");
s = s.replaceAll("\\s*([\\+\\-\\*/%,\\.\\(\\)\\{\\}\\[\\]=;:<>!&\\|\\^])\\s*", "$1");
s = s.replaceAll("\\[\\,\\]", "").replaceAll("\\[([\\+|\\-]?)\\,(\\1?)\\]", "$1 $2");
s = new StringBuilder(s).reverse().toString(); // new: restore the original order

Input:  [icode]abc ++ + ++ xyz -- - 7 - -- pqr -- + x ++ - 4[/icode]
Output: [icode]abc+++ ++xyz-- -7- --pqr--+x++-4[/icode] ;D (hey it’s 3:30AM, gimme a break!)

Update 2
Input:   [icode]def -- - 5 - -- xyz[/icode]
Output:  [icode]def-- -5- --xyz[/icode] :emo:
Optimal: [icode]def---5- --xyz[/icode] :emo:

Fix 2

s = new StringBuilder(s).reverse().toString();
s = s.replaceAll("(\\G|([\\+|\\-]))\\s+(\\1)", "[$2,$3]");
s = s.replaceAll("\\s*([\\+\\-\\*/%,\\.\\(\\)\\{\\}\\[\\]=;:<>!&\\|\\^])\\s*", "$1");
s = s.replaceAll("\\[\\,\\]", "").replaceAll("\\[([\\+|\\-]?)\\,(\\1?)\\]", "$1 $2");
s = new StringBuilder(s).reverse().toString();
s = s.replaceAll("(\\w([\\+\\-])\\2) (\\2\\w)", "$1$3"); // new: "f-- -5" becomes "f---5"

Somebody will one day be extremely happy with this optimization :expressionless:

Nerd-sniping yourself, huh? :smiley:

Also, I don’t see [icode]a ? b : c -> a?b:c[/icode] in there, I added the ? to my version.
Also &+ and |+, EDIT: although I guess that is handled by the double-sided replacement.

EDIT2: Eclipse says [icode]abc ++ + 6 + ++ xyz -- - 7 - -- pqr -- + x ++ - 4[/icode] is bad:
[icode]abc ++ + 6 + ++ [xyz --] - 7 - -- [pqr --] + x ++ - 4 // invalid in [][/icode]
So I don’t think that is valid input.

I’m done for the night, but at least I’m leaving off at a good place:

Current test case is both of our classes lumped in one file, mine runs both to ensure correctness of each.
The file will compress itself to 7879 chars (EDIT: not 7804, I forgot the extra stress tests) and run again with no differences, either still compressed or after Eclipse formatter expansion, so at least for everything tested here, it’s sound.

Unfortunately yours isn’t working? It compresses ~100 chars more, but they all look to be invalid deletions…

Just here to point out the obvious and say that this is finally a thread to which about 98% of the community have literally nothing to contribute, while 90% of the remaining 2% don’t seem to have the time needed to contribute to it.
Go on guys, continue writing your own legend.

Cleaned it up, added some things. Moving to Gist for easy revisions:

It’s got a basic CLI now:

$ java CoffeeGrinder
CoffeeGrinder: Java source code minifier
        by BurntPizza

Usage: [options] file

        -i      Print compression info
        -c      Only strip comments and annotations
        -w:n    Attempt line wrapping at n columns
                Use 0 for no wrapping. Default: 80
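
For anyone wanting to mirror that interface in their own tool, the option handling could look roughly like the sketch below. This is not CoffeeGrinder's actual code: `GrinderOptions` and its field names are invented, and only the three flags from the usage text are modelled.

```java
/** Hypothetical option holder for the three flags listed in the usage text. */
class GrinderOptions {
    boolean printInfo = false;    // -i
    boolean commentsOnly = false; // -c
    int wrapColumn = 80;          // -w:n, 0 disables wrapping
    String file;                  // the single non-option argument

    static GrinderOptions parse(String[] args) {
        GrinderOptions o = new GrinderOptions();
        for (String arg : args) {
            if (arg.equals("-i")) {
                o.printInfo = true;
            } else if (arg.equals("-c")) {
                o.commentsOnly = true;
            } else if (arg.startsWith("-w:")) {
                // parseInt throws NumberFormatException on junk such as "-w:abc"
                o.wrapColumn = Integer.parseInt(arg.substring(3));
                if (o.wrapColumn < 0)
                    throw new IllegalArgumentException("Negative wrap column: " + arg);
            } else if (arg.startsWith("-")) {
                throw new IllegalArgumentException("Unknown option: " + arg);
            } else {
                o.file = arg;
            }
        }
        if (o.file == null)
            throw new IllegalArgumentException("No input file given");
        return o;
    }
}
```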

Bug reports welcome, although I expect it would break on input other than valid java source, and I’m not sure I care.

I was swamped today, will be swamped the entire weekend and then will encounter some light swamping. I will join the party immediately afterwards.

I’m gonna try it: Identifier compression :persecutioncomplex:

Main effort at the moment is declaration extraction, and preliminary results are promising:

Current pipeline is minification -> large scary ‘broad phase’ regex -> split by semicolons/newlines -> series of several filters
Result is a dump of things which have declarations in them:

public class CoffeeGrinder
public static void main(String[]args)throws IOException
String path
int lineWrapping=80;
boolean aggressive=true,printInfo=false;
for(String s
catch(NumberFormatException e
StringBuilder preprocessed=new StringBuilder();
for(String line
String code
int originalLength
private static void printUsage
private static String minify(String src,int lineWrap,boolean aggressive
PreservationResult pr=preserveStringLiterals(src);
String code
private static String lineWrap(String text,int width
StringBuilder lineWrapped=new StringBuilder();
StringBuilder sb=new StringBuilder
for(int i
String line
private static PreservationResult preserveStringLiterals(String in
Deque<Interval>intervals=new ArrayDeque<>();
Map<String,String>mapping=new HashMap<>();
int seed=0;
String key;
String prefix
boolean strmode=false,charmode=false,linecomment=false,blockcomment=false,escaped=false;
for(int i
char c
boolean inComment
StringBuilder sb=new StringBuilder
Interval i
PreservationResult pr=new PreservationResult
private static void compressIdentifiers(String text
for(Interval i
private static Set<String>identifiers(String text,Interval scope
Set<String>idens=new HashSet<>();
Matcher m
StringBuilder sb=new StringBuilder
String s
List<String>decs=new ArrayList<>();
for(String s
for(String s
private static void filter(List<String>list,Pattern p,boolean allMatch
for(int i
Matcher m
private static List<Interval>matchNestedIntervals(String text,char begin,char end
List<Interval>topLevels=new ArrayList<>();
int idx
int start=idx;
int nestLevel
char c
private static class PreservationResult
String output,key;
String revert(String text
Matcher m
Deque<Interval>intervals=new ArrayDeque<>();
Deque<String>matches=new ArrayDeque
StringBuilder sb=new StringBuilder
Interval i
private static class Interval
int start,end;
Interval(int x,int y
String subString(String in
public String toString

It’s barely tested, but I do believe that is every declaration of an identifier in the file, and no false entries.
Of course I’ll need to process much more source to see how it holds up. (It won’t)
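
To give a flavour of the "broad phase regex plus filters" idea described above, here is a much smaller sketch. It is not the expression CoffeeGrinder uses; the pattern, the keyword filter and the `DeclarationScan` name are all made up. It assumes literals and comments were already masked out, only looks for variable and parameter declarations (no method or class names), and misses later declarators such as the `b` in [icode]int a=1,b=2;[/icode].

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: find "Type name" pairs in minified source. */
class DeclarationScan {
    // Broad phase: a type (optionally qualified, generic and/or an array), then a name,
    // then one of the characters that can follow a declared name.
    static final Pattern DECL = Pattern.compile(
            "\\b([A-Za-z_$][\\w$.]*(?:<[^;(){}=]*?>)?(?:\\[\\])*)" // group 1: type
            + "(?:(?<=[>\\]])\\s*|\\s+)"                           // a space is optional after > or ]
            + "([A-Za-z_$][\\w$]*)"                                // group 2: declared name
            + "\\s*(?=[=;,):])");

    // Filter: keywords that fit the "type" slot but never declare anything.
    static final Set<String> NOT_TYPES = Set.of("return", "else", "case", "new", "throw",
            "throws", "instanceof", "assert", "goto", "break", "continue", "package",
            "import", "extends", "implements", "yield");

    static List<String> declaredNames(String code) {
        List<String> names = new ArrayList<>();
        Matcher m = DECL.matcher(code);
        while (m.find()) {
            if (!NOT_TYPES.contains(m.group(1)))
                names.add(m.group(2));
        }
        return names;
    }
}
```

Fed [icode]int lineWrapping=80;for(String s:args){return s;}[/icode], it reports lineWrapping and s; the [icode]return s[/icode] hit is thrown out by the keyword filter.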