Judging panel results!

It was a lot closer than I recall in previous years, which have tended to have a clear winner. This year every single judge ranked a different game top, and the community vote picked yet another one. So congratulations especially to Marwane Kalam-Alami for The Little Scientist and zeroone for Laser Pinball, but also to 3darray for Nameless, Russell A Spitzer for 4Khords, dapy for Abstract Glowy Vector Wars, and Damocles for Cave4k.

Oh, Marwane, nice reference to Saint-Exupéry. I picked up on it after writing my review, and figured that someone else would be sure to mention it, but since no-one has I thought I’d let you know that it didn’t go entirely unnoticed.

You know how magazines tend to rate everything 7/10 or higher? I was deliberately trying to avoid that and use something nearer the full range of points available, and given the long run of games separated by only 1% in my ranking, I’m glad I did. Appel has previously suggested that rather than giving a percentage score the judges should just bin the games roughly and then order them within the bins, and now that I’ve seen it from the judging side I think that suggestion has a lot of merit.

Not to mention that it mucks up my spreadsheet to calculate correlations between everyone’s rankings. :wink:

Seriously, though, it does seem a bit low. Appel, didn’t you make the scoring system ignore 0s after the brouhaha a couple of years ago?

Your rankings have a Pearson rank correlation coefficient of 0.60, which is the highest of any pairing of judges but not fantastically high. In terms of individual judges vs community, Appel is at 0.69, followed by you at 0.67. Overall, though, the combined judges’ ranking vs the community ranking has a PRCC of 0.73.
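For the curious, figures like these presumably come from treating each judge’s ordering as a vector of ranks and putting pairs of those vectors through the ordinary Pearson formula (which is the same thing as Spearman’s rho). Here’s a rough sketch of that calculation; the ranks in main are made up for illustration, not the real data:

```java
// Sketch: Pearson correlation applied to two rank vectors.
// Not the actual spreadsheet behind the numbers above; just an illustration.
public class RankCorrelation {

    /** Pearson correlation coefficient of two equally long arrays. */
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += x[i];
            sy += y[i];
            sxx += x[i] * x[i];
            syy += y[i] * y[i];
            sxy += x[i] * y[i];
        }
        double cov = sxy - sx * sy / n;
        double varX = sxx - sx * sx / n;
        double varY = syy - sy * sy / n;
        return cov / Math.sqrt(varX * varY);
    }

    public static void main(String[] args) {
        // Hypothetical ranks (1 = best) given by two judges to five games.
        double[] judgeA = {1, 2, 3, 4, 5};
        double[] judgeB = {2, 1, 3, 5, 4};
        System.out.printf("rank correlation = %.2f%n", pearson(judgeA, judgeB));
    }
}
```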

Thank you. :smiley:

I now know why teachers’ reports are always full of clichés. It’s a real effort to avoid repeating yourself when writing 50 reports.

One final comment: I had Linux compatibility problems with more than 10% of the games. This may in part be Oracle’s fault, but note that most games didn’t have problems. I highly recommend reading the discussion threads from previous years on applet templates.

Drabiters’ Bubo4K game shouldn’t have been reviewed by the judges, and thus it shouldn’t have received any score.

[quote]Due to the fact there is community voting, judges are free to submit a game into the contest.

Note: These games will only be voted on in the community vote. They will not be considered by the judging panel, and therefore will receive no score from the judging panel. Community voting is separate and independent from the judging panel, and cannot affect or influence each other. This allows judges to submit games for community voting without worrying about any conflict of interest.

[/quote]
Rules: http://java4k.com/index.php?action=view&page=rulesjudg

My system doesn’t yet allow making exceptions like that; I haven’t added it because there’s never been a need for it. So his game slipped through, and I didn’t even think of it when reviewing the games :slight_smile: But no biggie, I see no reason to remove the game from the list, especially when it placed so low. :o

[quote]Appel has previously suggested that rather than giving a percentage score the judges should just bin the games roughly and then order them within the bins, and now that I’ve seen it from the judging side I think that suggestion has a lot of merit.
[/quote]
I had forgotten about that :slight_smile: Percentages seem to work all right, but they’re perhaps too “absolute”. I mean, what’s the difference between 58% and 63%? I’d prefer sorting the games into buckets like:

  1. Superb!
  2. Excellent
  3. Good
  4. Decent
  5. Mediocre
  6. SO HORRIBBLE MY EYEESS BLEEEEEDD!! (ok maybe not)

Once that process is done, judges can prioritize games within each bucket. Bucketing the games like this is super fast.

Problem is… assigning the score isn’t the most time-consuming task; the time goes into actually playing the game and then writing a review. Also, what if one judge puts a game into the Superb! bucket but another judge puts it into Mediocre? I don’t like a whole lot of obscurity in how the score is calculated; everyone understands percentages and normalized averages. So, my idea to solve that would be for the judges to decide together which games go into which bucket, but then individually order them within the bucket.

Debatable. But this is similar to what I do at work in Scrum: as a group we all try to reach an agreement on how much work a story is. That would introduce another issue, though… some don’t want the judges to coordinate the results. I admit, I like the “surprise” factor, the “oh, that’s how the others voted, and wow, that’s the winning game?” feeling.
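To make the bucket idea concrete, here’s a rough sketch of one way it could be turned into numbers. The band mapping and the averaging at the end are my own assumptions for illustration, not anything the contest actually does:

```java
// Sketch of "agree on buckets together, order within them individually":
// each bucket owns a band of the 0-100 range, a judge's position inside
// the bucket picks a point in that band, and the judges' values are then
// averaged. All numbers and the example assignment are invented.
public class BucketScoring {

    enum Bucket { SUPERB, EXCELLENT, GOOD, DECENT, MEDIOCRE, EYE_BLEED }

    /** Map a bucket plus a judge's position inside it to a 0-100 score. */
    static double score(Bucket bucket, int positionInBucket, int bucketSize) {
        double bandWidth = 100.0 / Bucket.values().length;
        double bandTop = 100.0 - bucket.ordinal() * bandWidth;
        // Spread the games in the bucket evenly across the band.
        return bandTop - bandWidth * positionInBucket / bucketSize;
    }

    public static void main(String[] args) {
        // Hypothetical: two judges agreed a game is EXCELLENT, but ranked it
        // 1st and 3rd of the four games in that bucket.
        double judgeA = score(Bucket.EXCELLENT, 0, 4);
        double judgeB = score(Bucket.EXCELLENT, 2, 4);
        System.out.printf("final score = %.1f%n", (judgeA + judgeB) / 2);
    }
}
```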

However, there’s another practical problem. I’d say around 50 games is the upper limit of what a judge can be expected to review; if we got 100 games I wouldn’t know what to do, it would probably take a full week to review them: 20 games a day, and with each game taking maybe 5-10 minutes that’s around 2-3 hours per day. With buckets the judges could write reviews only for the games in the top 3 buckets and skip the lower ones.

Buckets solve a few problems, but don’t necessarily satisfy everything or everyone.

What Saint-Exupéry reference? :clue:

I love how this makes absolutely no sense to me ;D

I know the feeling exactly :slight_smile:

Thanks for the reviews and a big hand to Marwane, awesome game!

I realize now that I focused too much on the music and forgot to test the game properly, which in turn led to buggy physics on some platforms, damn :slight_smile:

The novel The Little Prince tells the story of a boy who travels from planet to planet, meeting various silly adults (although no Java programmers, if I remember it well). He finally arrives on Earth, where he meets the story’s narrator.

I hadn’t noticed the link, but now I find it quite obvious.

I should also add that The Little Scientist’s graphics are reminiscent of the novel’s original illustrations, which were also made by its writer, Antoine de Saint-Exupéry.

Haha congrats, you’re the first one who mentioned the reference ;D

Then I guess I’m lucky; I didn’t bother at all with testing my applet in various environments… But luckily I had to work on the game alternately on Linux & Windows x)

Now, regarding the ranking system: even if I understand that the current way of doing it is time-consuming for the reviewers, I totally agree with Appel that using ‘bins’ would make the ranking a bit shady for others. I guess that what takes a lot of time is writing the reviews, so what about splitting the games between the judges? They would still have to give grades to every game (otherwise it wouldn’t be fair), but would only write reviews for a subset. For instance, 2 reviews per game would be fair enough.

Another (smaller) suggestion: an alternative to the idea of bins would be to give marks between 0 and 10; they can easily be converted to percentages once averaged, and are probably faster to assign.

[quote]
I had forgotten about that :slight_smile: Percentages seem to work all right, but they’re perhaps too “absolute”. I mean, what’s the difference between 58% and 63%? I’d prefer sorting the games into buckets like:

  1. Superb!
  2. Excellent
  3. Good
  4. Decent
  5. Mediocre
  6. SO HORRIBBLE MY EYEESS BLEEEEEDD!! (ok maybe not)

Once that process is done, judges can prioritize games within each bucket. Bucketing the games like this is super fast.

Problem is… assigning the score isn’t the most time-consuming task; the time goes into actually playing the game and then writing a review. Also, what if one judge puts a game into the Superb! bucket but another judge puts it into Mediocre? I don’t like a whole lot of obscurity in how the score is calculated; everyone understands percentages and normalized averages. So, my idea to solve that would be for the judges to decide together which games go into which bucket, but then individually order them within the bucket.
[/quote]
The scoring system could work on the basis of ranks: the game I consider the best gets n points, the next gets n-1, and the worst gets 1. Then average over the judges. It has normalisation built in automatically, and from a practical point of view it’s not so far off how I ended up scoring with all the 1% differences between games. The finesse you lose is the ability to have larger gaps between games at the two extremes.
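In code, that rank-based scheme might look something like the sketch below; the judge orderings are invented, and the points are just the n, n-1, …, 1 idea averaged over the judges:

```java
import java.util.*;

// Sketch of the rank-based scheme: each judge's best game gets n points,
// the next n-1, the worst 1, and the points are averaged over the judges.
// The orderings here are invented, not anyone's real rankings.
public class RankScoring {

    /** Turn one judge's ordering (best first) into points: n, n-1, ..., 1. */
    static Map<String, Integer> points(List<String> orderingBestFirst) {
        Map<String, Integer> pts = new HashMap<>();
        int n = orderingBestFirst.size();
        for (int i = 0; i < n; i++) {
            pts.put(orderingBestFirst.get(i), n - i);
        }
        return pts;
    }

    public static void main(String[] args) {
        List<List<String>> judges = List.of(
                List.of("Game A", "Game B", "Game C"),
                List.of("Game B", "Game A", "Game C"));

        // Average the per-judge points for each game.
        Map<String, Double> average = new HashMap<>();
        for (List<String> ordering : judges) {
            points(ordering).forEach((game, p) ->
                    average.merge(game, (double) p / judges.size(), Double::sum));
        }
        average.forEach((game, avg) -> System.out.printf("%s: %.1f%n", game, avg));
    }
}
```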

Collaborating would make things take much longer: the judges would all have to be online at the same time, or it would spread over several days. In cases where only one judge disagrees it might be easy to reach a quick compromise, but there were some games this year which split the judges 2-2 (Dawn, Drift Driver, Electron Golf, Fleazy, and Galactic Conquest being the clearest). Given the time pressure they’re already under, the most stubborn judge would have disproportionate influence.

That’s not a bad idea.

When you’re judging 50 games and want to express an opinion on their relative merit, you want at least 20 different scores to give them. Marking on a 0-10 scale would mean giving 0 to some games just to get that differentiation, which would be quite discouraging for the people who made those games, and it would mean a lot of post-review agonising over games on the borderline between 6 and 7.

Given that judges differ in how they score, this would distort the results. Every judge has his own personal “metric”.

If 2 judges are more critical (give lower scores on average) and the other 2 give higher scores, then the first batch of games would have a lower expected ranking than the second. Comparing the scores would thus be like comparing oranges with apples.

It’s just a different test base then. I don’t see a way to realistically compare scores without each judge judging all games.
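For what it’s worth, the “normalized averages” mentioned earlier usually mean something like rescaling each judge’s scores before combining them. A rough sketch is below, but as Damocles says it only evens out how harsh or generous a judge is on average; it can’t fix the problem that judges reviewing different subsets saw different games:

```java
import java.util.Arrays;

// Sketch of per-judge normalisation: rescale each judge's raw scores to
// zero mean and unit spread (a z-score) before combining them. This evens
// out harsh vs. generous judges, but not differing sets of reviewed games.
public class JudgeNormalisation {

    /** Z-score-normalise one judge's raw scores. */
    static double[] normalise(double[] scores) {
        double mean = 0;
        for (double s : scores) mean += s;
        mean /= scores.length;

        double variance = 0;
        for (double s : scores) variance += (s - mean) * (s - mean);
        double stdDev = Math.sqrt(variance / scores.length);

        double[] z = new double[scores.length];
        for (int i = 0; i < scores.length; i++) {
            z[i] = (scores[i] - mean) / stdDev;
        }
        return z;
    }

    public static void main(String[] args) {
        // Hypothetical: a harsh judge and a generous judge scoring the same
        // three games; after normalisation their opinions coincide.
        System.out.println(Arrays.toString(normalise(new double[]{40, 50, 60})));
        System.out.println(Arrays.toString(normalise(new double[]{70, 80, 90})));
    }
}
```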

I guess you’re right: if no one dares to give scores below 5, it’s no finer than the bin system…

I was only suggesting splitting the written reviews, not the scoring. Otherwise the rankings indeed wouldn’t make sense.

As a last line of defence, if you ask me how to simplify all of this: make a form with a percentage option for each of graphics, sound, gameplay and so on, and then calculate a final score from them. But this is a really bad idea.
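For completeness, the kind of form being described (and rejected) would boil down to a weighted average of per-category percentages; the categories and weights in this sketch are invented:

```java
// Sketch of the form-based idea above: one percentage per category,
// combined into a final score with fixed weights. Categories and weights
// are invented for illustration only.
public class CategoryForm {

    static double finalScore(double graphics, double sound, double gameplay) {
        // Hypothetical weighting: gameplay counts for half the score.
        return 0.25 * graphics + 0.25 * sound + 0.50 * gameplay;
    }

    public static void main(String[] args) {
        System.out.printf("final score = %.1f%%%n", finalScore(80, 60, 70));
    }
}
```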

I don’t understand a single thing he said ;D Maybe it’s a statistics term?

From Wikipedia. Essentially, he’s saying that there’s a close relationship between the scores different people gave to the same games.

Antoine de Saint-Exupéry who wrote Le Petit Prince!!! I can’t believe I didn’t notice that! :o

I read that book last semester for French class…such a great book! :slight_smile:

Am I the only one who got the Little Prince reference from the moment I read the title?

(btw, on Debian or Ubuntu, try “apt-get moo”, then “aptitude moo”, then “aptitude -v moo”, then “aptitude -vv moo” and so on with more v’s)

I also got The Little Prince reference. Read it in French class in high school. :slight_smile:

As for future options:

I would also do away with percentages, but not replacing them with bins. Instead, I’d rather just sort the games in the order I like them. That way there is a clear winner and loser - boom, done. It might feel a little dirtier (I didn’t want to make this game get third to last place!) but honestly since it’s a contest I think it makes more sense. I should be aware of exactly how I’m ranking the games, and the judge form already sorts them for you every time you update them just for this reason. So the percentages are already just an unneeded proxy for ordering by winners. Similarly as an entrant I’d rather see “1st with this judge, 6th with this, etc.” rather than having to figure it out myself.

The only potential advantage I see to percentages is that you theoretically compare the games against some gold standard you have in your head, rather than against each other – which is what I did. In other words, my lowest game was only 45% because I see a 0% as a total piece of dog shit that crashes repeatedly and could hardly be called a game. A 50%, on the other hand, is not really very fun and is pretty uninteresting, but it runs and it’s a game of some kind. A 100% is close to the best thing I’ve ever played and I couldn’t imagine how it could be improved (I gave no 100%'s either, and never have in the 3 Java4k’s I’ve judged).

But personally I would rather compare games against each other. It frees me from needing to have some sort of gold standard in my brain, and I almost always end up with like 6 88% games for some reason. :stuck_out_tongue: Of course, doing this would probably require some kind of drag-and-drop interface to make it all easier - why not code up some cool jquery javascriptyness?

That’s exactly what I use the Java4k for. Apologies to the judges for boring them with car games every year! :smiley: Their reviews along with the community results are invaluable information. So a big thanks to all! :slight_smile:

A combination of my first few games went on to become https://play.google.com/store/apps/details?id=com.craigsrace.headtoheadracing

I’m still trying to nail down what my second game will be (looks like I now have the physics, just need the game play) :slight_smile:

Congratulations on The Little Scientist. Such a fun game! I look forward to seeing it as a mobile app! 8)

Thanks everyone, especially appel, who rated my game so highly (not that it seems to have had much effect on my average rating). :slight_smile: 8)

(EDIT: Of course the less positive reviews contained great feedback too. BTW the ‘m’ for Engineer is the standard NATO symbol (but I also thought it looked enough like a bridge to give a hint what the unit is for) http://en.wikipedia.org/wiki/NATO_Military_Symbols_for_Land_Based_Systems#Unit_icons)

Although a NATO soldier may know what the ‘M’ stands for, that isn’t true for anyone else. Icon design is difficult because it’s hard to find a symbol that describes exactly what something does while still being easily understood by anyone unfamiliar with the context. Sometimes it’s easier just to spell things out. Some 4k games abbreviated options to a single character, which made them practically impossible to play because you had to read the instructions to understand what each character stood for. Programmers often forget that ordinary people, who have never seen the interface before, are supposed to use it.

At least he did some research to make the symbol fit properly and suit the theme.

Yes, the documentation for my game was too sparse. I was told so earlier, and then again by the judges. I’ll think more about that if I get something done for 2013.

It wasn’t a programmer thing in this case; it’s just that I’m too familiar with the genre (I also didn’t think it would be a problem to figure out which unit was the engineer, assuming the symbol would be read as a bridge anyway). There wasn’t really any research involved, so I can’t take credit for coming up with something “clever” either.

This was a problem I had when playing some of the other games too. It’s a difficult balance to write documentation that covers everything without being so overwhelming that it scares the player away.