A Proposal For A New Rating System

My proposal for a possible rating system is somewhat complicated. I spent a lot of time on it, so please don’t disregard it too lightly :slight_smile:

At the core of my proposal is a category-based rating system. Instead of having a single large point scale and leaving it up to the judges to divvy up the points, let’s separate the points into specific, meaningful categories and ask the judges to rate against those categories. The rating scale for each category should be minimal, to help keep the results meaningful. My belief is that a scale of 1-5 for each category should be more than enough.

Game designers want feedback from judges on how well they achieved each of the categories that make up a game. Feedback also helps “justify” the scores that judges give. By specifying these categories, the results become more self-documenting and require less writing by the judges. This also opens the door for “best in category” awards, which I believe are meaningful.

We need to try to reduce the impact on total score from judges that “just don’t like this genre” or “just don’t get it”. Granted, “just don’t get it” responses should be minimized by having appropriate documentation on your game’s download page and/or forum page, but there’s not much you can do if a judge dislikes the genre that your game is in. Even so, judges can form unbiased opinions on a game’s graphics, controls, polish, technical achievement, etc., thus providing valuable feedback and still giving games a chance to win “best in category” awards.

Judges should be discouraged from the act of giving a lower score because they “might see something better later on”. My proposal would encourage the judges to rate each game on its own merits.

After a judge has finished rating all of the games, the system will compute the judge’s top 4 games based on total weighted score. If there are any ties within those top 4 games, the judge will be given a chance to rank the tied games. This tie-break ranking will be completely subjective and the judge will not have to justify his decision. This may seem like a silly feature to include, but I believe that judges have implicitly stated their desire for such a feature by giving games a ‘99’ so that they have the option of giving a ‘100’ later on if they find what they believe to be the “best” game. This manual tie-breaking would happen on the single judge level only.

For games that are tied based on total combined score from all the judges, the system would try to determine if a game appears to be “more liked” by the judges. This would be based on both the complete point-sorted list for each judge and by any tie-break rankings given by each judge. If the system could not determine if one game was “more liked” than another, then those games would be considered tied for their position.
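To make the per-judge tie detection concrete, here is a minimal sketch in Java. The class and method names (`TopFour`, `tiedGroups`) are made up for illustration. It also assumes that a game tied with the 4th-place score still counts as being in the top 4, which the proposal does not specify.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: finds the groups of tied games inside one judge's top N.
final class TopFour {
    /**
     * Returns the groups of games that share a total score within the judge's
     * top `topN` games. Assumption: any game that ties with the topN-th score
     * is treated as part of the top group.
     */
    static List<List<String>> tiedGroups(Map<String, Integer> totals, int topN) {
        List<List<String>> ties = new ArrayList<>();
        if (totals.isEmpty() || topN < 1) {
            return ties;
        }
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(totals.entrySet());
        // Highest score first; names break ties so the output order is stable.
        sorted.sort((a, b) -> {
            int c = Integer.compare(b.getValue(), a.getValue());
            return c != 0 ? c : a.getKey().compareTo(b.getKey());
        });
        int cutoff = sorted.get(Math.min(topN, sorted.size()) - 1).getValue();

        // Group the games at or above the cutoff by their score.
        Map<Integer, List<String>> byScore = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : sorted) {
            if (e.getValue() < cutoff) {
                break;
            }
            byScore.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        // Only groups with more than one game need a manual ranking.
        for (List<String> group : byScore.values()) {
            if (group.size() > 1) {
                ties.add(group);
            }
        }
        return ties;
    }
}
```

Each returned group would be shown to the judge for a subjective ranking. The cross-judge “more liked” comparison is left out here because the proposal does not pin down its rules.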

I believe that we all agree that some categories are “more important” than others and should carry a larger weight in the totals. The thing that we’re not going to all agree on is which categories are the most important. My proposal is to let each judge decide which categories are most important to them. Before they can start judging the games, each judge decides which categories should carry higher weights. These weights will be made public during and after the judging period.

Here is my category list:

Originality
This is an indication of whether or not the game “brings anything new”. For games in well-established genres, this can be taken to mean “Does the game bring any new gameplay elements to the genre?” For games in smaller genres, just having an entry may be enough to earn a good score in this category. By definition, full clones are going to score lower in this category.

Gameplay
Also called “playability” or “replay value”. This is an indication of how fun/addictive the game is.

Technical Achievement
This is a perceived value indicating whether or not you feel the game achieved something that is difficult or even impossible, given the limitations of the contest. For instance, a game may feel like it has more graphics or levels than you should be able to fit into 4K. You can also give a game a low “Technical Achievement” rating, indicating that you feel that the game could (or should) contain more than it does. It should be clear that 4K Java games are no longer considered technical achievements in and of themselves, so please do not give middle ratings for technical achievement just because someone submitted a 4K Java game.

Graphics
This is an indication of the quality of graphics used in the game. This is not an indication of whether or not the game “packed a lot of graphics”. That belongs under “Technical Achievement”.

Controls
This is an indication of the intuitiveness and ease-of-use of the controls for the game.

Responsiveness
This is an indication of how well the game interacts with the player.

Polish
This is an indication of how “complete” the game feels.

Progression Of Difficulty
This is an indication of how well the game balances the increase in difficulty as the levels progress. This includes the difficulty of the initial (first) level. Does the game start out too hard or too easy? Do the levels progress in a consistent, intuitive manner that builds off of knowledge gained from the previous levels?

Incentive To Keep Playing
This is an indication of how well the game “rewards” you for your efforts and keeps you playing in an effort to improve your results.

Sound
This allows you to give bonus points to a game that uses sound. Please do not give middle scores just because the game has sound. A game with bad sound could almost be considered worse than a game with no sound at all.

0-5 Bonus Points
These are bonus points that can be given for any reason, but you must give a clear explanation of what they’re for. Typical reasons might be multi-player support, excellent physics or sweet A.I.

0-5 Penalty Points
These are penalty points that can be deducted for any reason, but you must give a clear explanation of what they’re for.

I compiled this category list by analyzing all of the judges’ feedback on every game for the 2006 results.

My proposal breaks down the weights for categories as follows:

  • 2 categories would have a weight of 3
  • 4 categories would have a weight of 2
  • 6 categories would have a weight of 1

This will give you a maximum combined score of 100 8)

NOTE: Both the “Bonus Points” category and the “Penalty Points” category can only have a weight of 1.
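As a rough sketch of the arithmetic: the weights sum to 2×3 + 4×2 + 6×1 = 20, and with a top rating of 5 per category that gives 20 × 5 = 100. The Java below is illustrative only; `WeightedScore` and its methods are invented names. The post does not spell out how the penalty category is folded into the total, so this sketch simply sums weight × rating over whatever categories the caller passes in.

```java
import java.util.Arrays;

// Hypothetical helper illustrating the proposed 3/2/1 weighting.
final class WeightedScore {
    static final int MAX_RATING = 5;

    /** True if the weights follow the proposed split: two 3s, four 2s, six 1s. */
    static boolean isValidSplit(int[] weights) {
        int threes = 0, twos = 0, ones = 0;
        for (int w : weights) {
            if (w == 3) threes++;
            else if (w == 2) twos++;
            else if (w == 1) ones++;
            else return false;
        }
        return threes == 2 && twos == 4 && ones == 6;
    }

    /** Sum of weight * rating over all categories, in matching order. */
    static int total(int[] weights, int[] ratings) {
        if (weights.length != ratings.length) {
            throw new IllegalArgumentException("weights and ratings must have the same length");
        }
        int sum = 0;
        for (int i = 0; i < weights.length; i++) {
            sum += weights[i] * ratings[i];
        }
        return sum;
    }

    /** Highest possible total for a set of weights when every rating is MAX_RATING. */
    static int maxTotal(int[] weights) {
        return Arrays.stream(weights).sum() * MAX_RATING;
    }
}
```

A judge’s chosen weights would be checked once with `isValidSplit` before judging starts, then reused with `total` for every game.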

I have thought about several more aspects to this, but I think this will be enough to get some conversation going.

All feedback welcome, thank you for reading my post.

-Dave

  1. Have you evaluated the results of the 2005 contest? If so, you should know that this has been tried before.

  2. How do you plan to avoid repeating the poor results from that contest?

  3. If your idea differs appreciably from the 2005 contest, please describe how.

Thank you for taking a moment to reply to my post.

Here’s a partial quote of yours from the “who needs judges?” topic:

This does not say (or even imply) that having categories is bad. What I read from it is that the (few) categories chosen were biased. My proposal doesn’t do that at all. In fact, my proposal is clearly an attempt to identify all of the (hopefully meaningful) categories that make a compelling game.

Another partial quote of yours from the “JUDGES!” topic:

My opinion on why the categories “meant different things to different judges” is because there were too few categories. Judges were left trying to cram all of their opinions on Responsiveness, Progression Of Difficulty, Controls, etc. into the “Gameplay” category.

And if too few categories is a problem, then you can’t get any fewer categories than NONE (i.e., this year’s rating system).

Having only a 0-100 scale is the ultimate expression of “meaning different things to different judges”.

Here’s a quote from Markus_Persson in favor of the 0-100 system, also from the “JUDGES!” topic:

Truth is, having sound is likely to give you extra points no matter what the rating system is. To me, this quote implies a wish that the games were rated just on “fun”. The problem is, “fun” is too vague a concept and will certainly “mean different things to different judges”. For instance, some judges are certain to put more emphasis on sound than you or I would care for.

Now, to try to answer your questions from this topic:

  1. I have evaluated the results and I do not believe that “having categories” is the reason that the system failed.

  2. My “plan” is to acknowledge that avoiding last year’s results is not up to me alone. That is why I am attempting to create an open discussion.

  3. I believe that my response here, plus my original post, contain concepts that go far beyond last year’s rating setup.

Thank you again for taking the time to respond,

-Dave

In my opinion, we should absolutely keep the simple judging process from this year’s contest.
The only changes I think are needed are (very basic… two sentences tops) guidelines for scoring so all judges have the same average score, and perhaps a truncated mean as someone suggested in That Other Thread.
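For anyone unfamiliar with the term, a truncated mean drops the most extreme scores before averaging, so a single outlier judge has less pull. A minimal Java sketch follows; the class name `TruncatedMean` and the choice to drop the same number of scores from each end are assumptions for illustration, not something settled in the thread.

```java
import java.util.Arrays;

// Hypothetical helper: average of the scores after trimming both extremes.
final class TruncatedMean {
    /** Mean of the scores after dropping the `trim` lowest and `trim` highest values. */
    static double of(int[] scores, int trim) {
        if (trim < 0 || scores.length <= 2 * trim) {
            throw new IllegalArgumentException("not enough scores to trim " + trim + " from each end");
        }
        int[] sorted = scores.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        long sum = 0;
        for (int i = trim; i < sorted.length - trim; i++) {
            sum += sorted[i];
        }
        return (double) sum / (sorted.length - 2 * trim);
    }
}
```

With five judges giving 10, 70, 75, 80 and 99, trimming one score from each end averages only 70, 75 and 80.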

Complicating the scoring zaps the fun out of a contest that’s supposed to be about just having a good time making and playing tiny and impressive games. And it also leads to people being able to win on a technicality.

I would like more judges, though. And it’s not a good thing for a judge to write “ok” as a comment on 1/3 of the games; being a judge, he has got to have something more constructive to say. And why limit them to a max of two sentences? 4-5 is a better max, and 2 might be a minimum.

DavidPFarrell,

I disagree with the category rating system. The quote you got from the “who needs judges?” topic…

[quote]…What I think the process showed last year was that we tried to get too clever with the scoring system. It was based on a system that one person had adjusted to meet how he wanted to score a given year’s games, and just didn’t extrude very well into a real world system…
[/quote]
…is an obvious point. Using categories only limits the ratings to being judged on one man’s opinion of how a game should be rated. However, I agree that the judges need to more fully describe why they liked or disliked a game. For example, shelton gave Miners4k a 95, and this was his only comment, one word:

[quote]addictive
[/quote]
And tim summed his opinion up in five words:

[quote]I just don’t get it!
[/quote]
Yet he gave the game a 76. Neither tim nor shelton explained why they liked or disliked the game, in any way at all. Now on the other hand, kingaschi and Malohkan both gave to-the-point opinions on exactly why they liked the game. Now with peggy,

[quote]Long game - impressive
[/quote]
Long game? That’s how she (I’m assuming she by the name, correct me if I’m wrong) describes the game, “Long game”. That’s not a rating. She needs to put more emphasis on what impressed her about the game.

I’m not trying to get across that I dislike Miners4k; in fact, I loved the game. I’m not going into detail about what I liked about the game since I don’t believe it’s relevant here. However, I do think that the “addictive game” opinions most of the judges had about this game, summed up in only a few words, are not a sufficient explanation.

To sum it all up, the rating system is fine as it is. I just think the judges need to put more emphasis on exactly what they liked or disliked.

As I’ve said before, I was honestly more-or-less happy with the judging this year. When you’ve got a setup like this, there are always going to be a couple of people who downvote your game because they “didn’t get it” or don’t like the genre or didn’t play long enough to get the hang of it, but that sort of thing will average out if you have enough judges. Really, what I’d like to see next year are a few tweaks rather than an overhaul.

First of all, I think there should be some mechanism to ensure that your game works on the judges’ machines, to avoid suffering the fate that befell Xero this year. I’d suggest two ways to do this:

  1. Post the specs for the computers that each judge will be running on (at least as far as operating system and version of Java go). This would at least give entrants a better idea of what platforms to test on; IIRC, in the 2005 contest there was a minor commotion because one of the calls that a lot of people used didn’t work under Mac OS X.
  2. Have each of the judges go through the list of entrants, like, a week or two before the deadline and then post a list of any games that wouldn’t boot up properly. This would take some effort on the judges’ parts, so I should make it clear that what I’m suggesting is not a “pre-judging” by any means; just for each person to go through and post the names of any games that crashed on startup.

I’d also like to see some sort of guideline for the judges on feedback- maybe a sentence on each game? Feedback is really useful for the entrants, and I’d like to emphasize that I don’t expect a full review, just a brief “the graphics were nifty but the controls were hard” or some such. Pretty much what Jamison said above.

The drawback to all of these requests, of course, is that they put a greater burden on the judges- who are, after all, volunteers. The first two would also require judge selection to happen earlier.

- HC

If a game doesn’t work for a judge on a certain platform or Java version, the judge should post something like “Game doesn’t work on Mac OS X, Java version X.”
That would eliminate the overhead of going through each entry once, a week before the actual judging occurs.

I still say public voting where people pick their top 5 favorites.

their #1 pick gets +5 points
#2 pick gets +4 points… etc.

then we just list the games by points in descending order and see who won. (all of this with an open comments system of course; which I’ve put off for over a year ::))
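The tally described above could look roughly like the Java sketch below. `PublicVote`, `tally` and `ranking` are invented names, and the sketch assumes each ballot is an ordered list of up to five distinct game names (duplicate picks on one ballot are not filtered out). Games tied on points are ordered by name here only to keep the output deterministic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: top-5 public vote where pick #1 earns 5 points, #2 earns 4, and so on.
final class PublicVote {
    static final int PICKS = 5;

    /** Adds up the points each game earns across all ballots. */
    static Map<String, Integer> tally(List<List<String>> ballots) {
        Map<String, Integer> points = new HashMap<>();
        for (List<String> ballot : ballots) {
            int counted = Math.min(PICKS, ballot.size()); // ignore picks beyond the fifth
            for (int rank = 0; rank < counted; rank++) {
                points.merge(ballot.get(rank), PICKS - rank, Integer::sum);
            }
        }
        return points;
    }

    /** Game names sorted by points, highest first. */
    static List<String> ranking(Map<String, Integer> points) {
        List<String> games = new ArrayList<>(points.keySet());
        games.sort((a, b) -> {
            int c = Integer.compare(points.get(b), points.get(a));
            return c != 0 ? c : a.compareTo(b);
        });
        return games;
    }
}
```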

I like that idea of the public votes. :slight_smile:

I’m a bit wary of public voting alone; I stand by what I said here.

That being said, public comments would be nifty.

- HC

the contest is 9 months away, so who knows. but since I’m rewriting Java Unlimited as we speak, I’ll probably just integrate a people’s choice along with the current to-be-tweaked judging system