Difference between revisions of "WN8"
Revision as of 10:16, 20 March 2014
- 1 Introduction
- 2 Why WN8?
- 3 How is WN8 different?
- 4 Limitations and Threats to Validity
- 5 Nuts & Bolts
- 6 The Steps of WN8 - The Formula
- 7 Summary, TL;DR
- 8 Color Scale
- 9 FAQ & responses to uninformed criticism
- 10 Sites using WN8
- 11 Other Languages
- 12 Credits
WN8 is the latest iteration of the WN* (WNx) project led by Praetor77.
WN8 seeks to measure the observable contribution to matches, across an account, and hopefully to infer some information about the latent variable "skill". WN8 should not be considered the be-all and end-all of skill evaluation. It is intended to be looked at next to win-rate, battle count and average tier, and will never replace inspecting the Service Record, platooning with someone or watching their replays. This is because WN8 cannot capture things like timing, target priority, teamwork or decision making abilities.
As with all ratings before it, WN7 had some weaknesses and limitations. Some of these were known at the time of WN7's release and others were discovered and publicized shortly thereafter. In order of seriousness, as judged by the WN* team:
Damage per tier scaling – namely that while tier is a nice linear 1-10, damage values scale up in a curvilinear fashion. Also, damage dealing capabilities are hardly balanced across all tanks of the same tier (think T40 vs A-20 for example or the ARL V39 and M18).
Kill per tier scaling – lower skilled players are found in lower tiers, and thus getting kills at tier 1 is easier than getting kills at tier 10, holding constant for player ability above a certain threshold.
Tier 1-10 problem – because of the two issues above, the tier 1-10 problem arose with WN7. Precambrian explained this very well in this post, which has some out-of-date info in it but still captures the problem very well on a small, understandable scale. It is excerpted below.
What complicates the metric is the fact that it is easier to rack up amazing stats at lower tiers than it is at higher tiers, since at lower tiers, the average skill level of new players is extremely low, and the damage potential of low tier cannons, relative to the small hitpoint pools, allows experienced players to destroy their noob counterparts without any difficulty whatsoever. This phenomena is perfectly illustrated by tier 1 autocannons, which can clip out opposing tier ones before the new players even can turn their turrets and react. This sort of destructive potential allows players with relatively low skill to win an extraordinary portion of low tier matches and inflate their stats. WN7 sought to address this by slapping on a low tier penalty; however this is easily avoided simply by alternating playing higher tiers and lower tiers. Take the following example (the categories are as follows:Tank, Result, Survived?, Damage Dealt, Damage Received, Kills, XP, Detected, Capture Points, Defense Points, WN7):
If you were to average the individual WN7 values of these games, you would arrive at 2578. However, if you average all the stats of the individual games together, and then calculate the WN7, you arrive at 3260 WN7! There are multiple reasons for this. One has to do with the nonlinear nature of hitpoint scaling; a tier 8 does not have 8 times the hitpoints of a tier one, and thus much more damage can be farmed in tier 8 matches. Also, it is very easy to rack up kills in a tier one tank, skewing KPG upward. Ultimately, average tier gets skewed far lower than average damage does, and KPG gets skewed upward, causing a humongous differential in WN7 when calculated this way!
Now consider what would occur if I averaged all the stats of the tier 8 tanks, without considering the Cunningham game. The result would be 2960 WN7. So although the WN7 of the Cunningham game, considered individually, was below 1600, it HUGELY inflates my stats when included in the calculation!
Obviously small sample size is small but you get the idea.
SPG and LT – Finally, these classes simply don’t output similar numbers to their per tier counter-parts. HT/TD/MT can be roughly comparable, but a tier 8 LT doesn’t put out the dmg or frags of the ISU/IS-3/T69/50 100. SPGs who scout are doing it wrong (and thus their spots should be a bit lower), and their damage values can be very high in some tiers, particularly before they were re-classified in patch 0.8.6 (although they are now lower due to the DPM changes)!
How is WN8 different?
Dmg/tier vs Per-Tank ratings
WN1-7 and Efficiency v1 and v2 were all formulas directly applied to WG’s web API released stats, they selected different weights for each value, and transformed them, to try to make a meaningful total rating value. However, the problems listed above persist for any rating that uses “dmg/tier” computations. There is no way to get around the problem that damage isn’t worth the same (or as much is available) per tier, nor that kills weigh more or less by tier.
So in WN8 we’ve adopted an entirely different method. We’re not the first to use this method: Mr. Noobmeter’s Performance Rating has used a per-tank rating since its inception, and while PR was initially skeptically received because the formula was closed source, Mr. Noobmeter released it months ago and explained its formulation. The per-tank ratings are significantly harder to develop (getting good data) and apply (they require a lot more computation power), but the results control for the tank composition of an account history in a way that dmg/tier method ratings can never duplicate. So WN8 becomes a “per-tank” rating, instead of a “dmg/tier” rating. This is why you won’t see the average tier term in the final equation, although the information about which tiers were played is in the methodology, in the earlier steps. So the information in WN8 now includes both the tanks a player has chosen and the number of games played on the account.
One of the first things a reader will notice is that the scale of WN8 is different from the scale introduced by Efficiency and adopted for WN1-7. The reasons for re-scaling depend on some decisions made during the development of WN8. You can see the technical side and logic for this in the Nuts & Bolts section. But the simplest explanation is that the Efficiency scale did not discriminate sufficiently between player abilities. Top and bottom ends of the distribution were being compressed. You can see the WN8 scale, overlaid with the WN7 scale and noobmeter’s PR, below.
The upshot of this scale change is that the value of WN7 isn’t really comparable to WN8, even though they are both generally 3-4 digit numbers. We realize this is an inconvenience, especially as all of the color cut-offs also moved (the ones on the bottom are for WN8). However sometimes scales need to change or be re-centered or re-zeroed. See the Fahrenheit/Celsius/Kelvin scale for a physical science’s reference, or the College Board’s SAT history for examples of scale evolution. As before, a very small number of players fall in the extremes of the scale, and the majority of the differentiation is being applied to the middle of the population.
Limitations and Threats to Validity
History – Myriad things have changed since WoT was released and accounts began accumulating stats. Tier changes (T30, T34, IS-4, Batchat 25t, AMX LTs, arty, MT-25, VK2801, etc), balance changes, mechanics changes (physics, premium rounds for credits, 2 sigma shell distribution, +2/-2 MM) are easily pointed to. Things are not the same now as they were, and they will also be different in the future. The WN* team fully recognizes that history is always a threat to validity of measurements, but because we can do nothing about it, we always remind people to check Service Records and ask questions if they need to know more about someone’s account history.
WG is terrible at book-keeping. They could have left everyone’s T-50-2 stats in place, and published a new tank ID for the MT-25. We wish they had done so (same for all the tanks ever moved or replaced). But in reality, 60-day or recent battle stats will have to do.
Maturation – players, both individually and as a group, can get relatively better or worse over time. There is more info out on the web now than there was in 2011, and so player progression can be faster or slower depending on how much they can and do research the mechanics and meta-game of WoT. Additionally, since dossiers are always over-time, major changes in performance can take a long time to influence over-all or cumulative ratings, especially at higher battle counts.
Damage Upon Detection – WG has indicated that this might be included in the API stats sometime in the future, but right now it is not available.
Although DUD is a big part of quality play, especially LT play, we did manage to get a much better rating of LT players, even without having DUD data. Hooray for per-tank methods!
Heavy play in a single tank – This makes WN8 work less well, because some folks will play so many games in a single tank that their account WN8 begins to approach the values for that single tank. But WN8 was formulated around whole accounts, not single tanks, and while there is an assumption for playing a variety of tanks, that assumption is much weaker than in WN7 and prior ratings. At some point, someone can play enough games in a single tank to “break” the normalization assumptions of WN8. How many games is enough to break it? We have not conducted a formal analysis, but as a rule of thumb, we would suggest that if someone has more than 50% of the games on their account in a single tank, it might be enough to make WN8 invalid. But as a reminder, it is OK to play a tank you like for thousands upon thousands of games. WN8 isn’t “penalizing” you if you do so, it just makes it hard to compare your account to everyone else’s account. We cannot account for all the outliers!
Per-tank variability – Related to the above, although WN8 uses per-tank values, the variation possible (and observable) among tanks is not the same. To use an in-game example, let’s consider the T49 vs the AT-2. The AT-2, being slow, blind but heavily armored, is going to see a smaller range of possible stats than the T49, also a tier 5 TD. The T49 is fast, fragile and a camo master, but those attributes generally result in either much greater or much worse stats compared to its median values. The AT-2 will have less variability than the T49. So even though we have well-sourced average values for the AT-2 and T49, in the population you will see much higher and lower performances in the T49 than in the AT-2.
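The rule of thumb above can be checked mechanically. The following is a minimal sketch, assuming per-tank battle counts are available as a plain dictionary; the function names, the example account and the 0.5 default are illustrative and only restate the informal 50% guideline, not a formally derived cut-off.

```python
def max_tank_share(battles_by_tank):
    """Return (tank, share) for the tank holding the largest share of battles."""
    total = sum(battles_by_tank.values())
    if total == 0:
        return None, 0.0
    tank = max(battles_by_tank, key=battles_by_tank.get)
    return tank, battles_by_tank[tank] / total


def wn8_comparable(battles_by_tank, threshold=0.5):
    """True if no single tank exceeds the rule-of-thumb share of all battles."""
    _, share = max_tank_share(battles_by_tank)
    return share <= threshold


# Hypothetical account: 60% of all battles in one tank.
account = {"T-34": 6000, "KV-1": 2500, "M18 Hellcat": 1500}
tank, share = max_tank_share(account)
print(tank, share, wn8_comparable(account))
```

An account flagged this way is not "wrong"; its WN8 is simply harder to compare with other accounts.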
WN8 does not include a per-tank variability factor in its weighting. We didn’t feel that the data available was sufficient to give good estimates, and the WN8 was already several orders of magnitude more complex to calculate compared to WN7. As the WN* team does not run the servers which make WN* calculations available to the public, we elected to leave exploration of per-tank variability factors to WN9 or later. For an example of per tank variability, and the math on why tanks with higher means or variability move your rating up faster (with good play) check WN8: Appendix A.
Nuts & Bolts
The Major Assumption of WN8
WN8 makes a major assumption that is not shared with any previous rating system. We set expectation for mean performance players, based on their ability to influence the games played. This is a tricky concept, and one certainly open to debate. But basically, we hazard that there is an amount of output (dmg/frag/spot/def) that influences the outcome of the game (in terms of win-rate), but that there exists a baseline below which the output does not influence the outcome of the game regularly enough to be determinable. The exact value of this threshold was computed and then subtracted (the rSTATSc step, below) to compare all players to the theoretical (and sadly real) player(s) who do not perform enough measurable output to influence their win-rates beyond simply loading in and having MM weigh. The advantage of this assumption is that it makes the rating more meaningful for both high and low values. There were “free” points in WN7, basically for showing up, because virtually no accounts manage to accumulate ZERO stats, even the worst programmed bots. But many accounts do manage to accumulate so little stats that they do not manage to positively affect their win-rates in any measurable way. By adjusting for these baseline values, we get better differentiation at the lower levels, and also at the upper levels, because we’ve removed “noise”. And of course there is also better differentiation in the middle. We are postulating a “zero” point, below which manifest stats (damage/spots/frags/defense) don’t influence win-rate meaningfully. As noted earlier, this assumption is open to debate. Please bring a solid knowledge of stats and measurement as well as tanks to this debate though! Conveniently, this also turns the interval scale used in efficiency, PR and WN7 into a pseudo-ratio scale, which brings about numerous advantages. 
Debatably the most important one, after improved accuracy in measurement, is that by applying a baseline we can now say that a 2400 WN8 player contributes twice as much to his team’s wins as a 1200 WN8 player. This was not true for any rating before WN8.
Because WN8 is a per-tank rating, we needed data per tank, which as always is not available via the WG web API. We turned to Phalynx of vBAddict.net, who kindly handed over his database of 17k dossiers. The database was filtered for players with fewer than 1000 games played, and tanks that were played for fewer than 50 games. From this database we determined, using linear regression, the stats to be expected on each tank for a median ability player. For each tank/player combination, we calculated playerWN8alpha and tankWN8alpha. WN8alpha was approximately WN7 in formulation, basically a means to measure per tank effectiveness. Afterwards, we filtered to the top 50% of players on that tank, meaning those who perform well ON THAT TANK, not overall. This incorporated a good mix of high win-rate and low win-rate players. We posit that using the top half of players in a given tank is a good way to compare tanks to each other, since they can squeeze out every last ounce of performance a tank has to offer. Otherwise, at the low end, you would be comparing tanks based on the performance of players who don’t know basic mechanics, or how to properly use a given tank. That being said, we used the top 50% of players to do the linear regression, because simply using the top player values would be biased and not generalizable to the entire population.
To check that expected stats for each tank were balanced, we looked at the tankWN8/accountWN8 ratio. We checked that the players with top 10% tankWN8/accountWN8 corresponded to about 1.15 for all the tanks in the game.
When a tank had a lower ratio, for example, we lowered the expected values used to regress with the top 50% of players, and then checked what the top 10% ratio was. This took several iterations of recalculating tankWN8 and playerWN8 until a balance was reached, and tankWN8/accountWN8 was about 1.15.
The purpose of this was to determine, controlling for player skill, expected values which would normalize the dmg/frag/spot/def outputs across tanks: in other words, to find out how much dmg the same player would do in the ARL V39 and the M18 Hellcat, all other things being equal.
A handful of tanks required a more in-depth analysis of the distribution of the tankWN8/playerWN8 ratio, due to an abnormally low number of high level players playing the tank (A-20), or due to gross nerfs/buffs (like M41). We tried to come to a middle ground for tanks that have been severely nerfed/buffed (like M48A1, AMX50B or T110E5), looking to get a wide representation of players that played the tank during different time periods, so that the value doesn’t simply represent the tank’s original or current most powerful state (so that players who play it while better balanced do not get unfairly treated) or completely ignore it (so that players who played it while very powerful, and then never again, do not get unfair bonuses).
Note: This manual process was the most scientifically weak portion of the WN8 creation. However, personal bias of the creators was not introduced during this section, and the team of individuals working on these adjustments comprised of dozens of contributors on WoTLabs combing the per-tank tables and collaboration between players from NA, EU, SEA and RU. When possible values from the “nearest possible match” were used for tanks with oddly distributed player histories, like the A-20, in which no one can be bothered to even try (the data shows this…). If you are upset over this manual process, please contribute to further refinement of the WN8 per-tank tables, by uploading your dossier at http://www.vbaddict.net/wot.php
Also, the per-tank expected values table was compared with the table used for the PR rating by Noobmeter, and with a table of top 1%/100 players of each tank for the RU server kindly provided by Seriych (similar to what was in the service record with XVM for 8.6 and older). Most expected damage values are pretty close to noobmeter’s (from tier 3-8), and if you multiply those damages by 1.5 (to see what a 2400 WN8 player would need to get on a tank), you get unicum values, which are quite close to Seriych’s values for top player numbers from RU. Also, using this approach resulted in numbers for low tier tanks that are obviously high for the new player, but that isn’t really an issue since average players only have 3% or less of their total games in tier 1. This conveniently also functions as a control against seal-clubbing your way into a high rating. It means you can still club tier 1 players, you just have to actually be good at it! No longer will averaging 1.7 kill/game (a good value in tier 10) at tier 1 make you appear good. This isn’t because the WN* team has any bias against folks who play low tiers, but simply that we wish to identify player skill irrespective of tier played (review the Why WN8? section for reasoning).
Reminder: What actually MATTERS from the table is the relationship of values between different tanks. We could divide all those values by 3, and it wouldn’t make a difference. It’s the relationships between the numbers that are important, not the actual values. The same goes for the 1.15 ratio used in balancing tanks; we could have used any number. We left them in “WoT dmg scale” for readability and ease of sourcing though!
A dataset of all players with more than 10000 games on several servers was kindly provided by Mr. Noobmeter (we needed games played on each tank), a 4GB database that can hardly be opened in Excel! Nevertheless, we filtered EU and NA only players from there, to end up with a 115000 player database, which is about as large as Praetor77’s limp PC can handle. With this database, we determined expected stats, rSTATS and then rSTATSc. Using all the rSTATSc values, we used Eureqa (a very nice and intelligent program which uses iterative genetic algorithms to search for mathematical relationships between a set of input data) to determine the optimum mathematical formula which, using the rSTATSc, could “explain” (fit) the rWINc of the players in the database.
By data analysis, we found that some players were clearly outliers on some rSTATSc (all of them except rDAMAGEc, actually), which led us to implement a series of “caps” or maximum values to improve the usefulness of WN8. These same stats also seemed to be more correlated to rWINc individually only up to a certain value, after which the correlation decreased substantially.
The caps implemented were:
rFRAGcMAX = rDAMAGEc + 0.2
rSPOTcMAX = rDAMAGEc + 0.1
rDEFcMAX = rDAMAGEc + 0.1
We re-entered the capped rSTATS into Eureqa which actually came up with a very similar solution as prior to the caps, but handled lots of outlying accounts. The final formula output was:
rWINc = 0.09 + 0.613*rDAMAGEc + 0.131*rFRAGc*rDAMAGEc + 0.097*rFRAGc*rSPOTc + 0.047*rFRAGc*rDEFc
Then we multiplied every term in the formula by 1600, which leads to a similar central value for the players in the database as for WN7, which should make the server-wide WN8 average quite similar to WN7, in the 900-1000 range.
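As a quick arithmetic check of the scaling step, the sketch below multiplies each regression coefficient quoted above by 1600. The dictionary layout and the function name are illustrative only. The scaled values come out close to the rounded coefficients that appear in the final WN8 formula in Step 3.

```python
# Coefficients of the fitted rWINc formula, as quoted in the text above.
REGRESSION_COEFFS = {
    "constant": 0.09,
    "rDAMAGEc": 0.613,
    "rFRAGc*rDAMAGEc": 0.131,
    "rFRAGc*rSPOTc": 0.097,
    "rFRAGc*rDEFc": 0.047,
}
SCALE = 1600


def scale_coefficients(coeffs, scale=SCALE):
    """Multiply every term by the scale factor, rounding away float noise."""
    return {term: round(value * scale, 1) for term, value in coeffs.items()}


scaled = scale_coefficients(REGRESSION_COEFFS)
for term, value in scaled.items():
    print(f"{term:>16}: {value}")
```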
Expected Stats Matrix
Without using Excel, you can view the expected tank values here: http://www.wnefficiency.net/wnexpected
Additionally, Mr. Noobmeter has kindly hosted the expected stats matrix on his website, along with his PR values, for your perusal: http://www.noobmeter.com/tankList
The Steps of WN8 - The Formula
rDAMAGE = avgDmg / expDmg
rSPOT = avgSpot / expSpot
rFRAG = avgFrag / expFrag
rDEF = avgDef / expDef
rWIN = avgWinRate / expWinRate
Step 1 takes the counts of tanks played on account, and multiplies them by the expected stats to get the account total expected values. Then the actual account totals (your total dmg, frags, spots, def, win-rate) are divided by the total expected values to give the ratios.
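Step 1 can be sketched as follows. This is a minimal illustration, not the reference implementation: the data layout, the tank names and all numbers are made up, and the expected win rate is assumed to be stored as a fraction of battles so that wins can be treated like the other totals.

```python
STATS = ("dmg", "spot", "frag", "def", "win")


def account_ratios(actual_totals, battles_by_tank, expected_per_battle):
    """Divide actual account totals by battle-weighted expected totals."""
    exp_totals = dict.fromkeys(STATS, 0.0)
    for tank, battles in battles_by_tank.items():
        for stat in STATS:
            # Expected total = battles in this tank * expected per-battle value.
            exp_totals[stat] += battles * expected_per_battle[tank][stat]
    return {stat: actual_totals[stat] / exp_totals[stat] for stat in STATS}


# Hypothetical two-tank account (all values are illustrative).
battles = {"tank_a": 100, "tank_b": 300}
expected = {
    "tank_a": {"dmg": 500, "spot": 1.0, "frag": 1.0, "def": 1.0, "win": 0.5},
    "tank_b": {"dmg": 1000, "spot": 2.0, "frag": 0.5, "def": 0.5, "win": 0.5},
}
totals = {"dmg": 420000, "spot": 700, "frag": 300, "def": 125, "win": 220}

ratios = account_ratios(totals, battles, expected)
print(ratios)  # rDAMAGE, rSPOT, rFRAG, rDEF, rWIN for the account
```

Weighting by battles played is what lets the rating control for the tank composition of the account.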
rWINc = max(0, (rWIN - 0.71) / (1 - 0.71))
rDAMAGEc = max(0, (rDAMAGE - 0.22) / (1 - 0.22))
rFRAGc = max(0, min(rDAMAGEc + 0.2, (rFRAG - 0.12) / (1 - 0.12)))
rSPOTc = max(0, min(rDAMAGEc + 0.1, (rSPOT - 0.38) / (1 - 0.38)))
rDEFc = max(0, min(rDAMAGEc + 0.1, (rDEF - 0.10) / (1 - 0.10)))
Step 2 sets the zero point for the ratios. See the assumptions section for more info on why this happens. min and max are functions that keep the ratios within bounds. The constants are applied in the form:
(rSTAT – constant) / (1 – constant)
With this normalization, a player with all rSTATSc = 1 would receive 1565 WN8, which is the sum of the coefficients in the final formula (980 + 210 + 155 + 75 + 145). A player with all rSTATS = 1 would also have all rSTATSc = 1, because (1-c)/(1-c) = 1.
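The Step 2 formulas translate almost directly into code. The sketch below uses the constants and caps given above; the function names and the dictionary keys are illustrative choices.

```python
def rebase(r_stat, baseline):
    """Apply (rSTAT - constant) / (1 - constant) without the floor at zero."""
    return (r_stat - baseline) / (1.0 - baseline)


def normalize_ratios(r):
    """Turn Step 1 ratios (rWIN, rDAMAGE, ...) into the capped rSTATSc values."""
    r_win_c = max(0.0, rebase(r["rWIN"], 0.71))
    r_dmg_c = max(0.0, rebase(r["rDAMAGE"], 0.22))
    # Frags, spots and defense are capped relative to normalized damage.
    r_frag_c = max(0.0, min(r_dmg_c + 0.2, rebase(r["rFRAG"], 0.12)))
    r_spot_c = max(0.0, min(r_dmg_c + 0.1, rebase(r["rSPOT"], 0.38)))
    r_def_c = max(0.0, min(r_dmg_c + 0.1, rebase(r["rDEF"], 0.10)))
    return {"rWINc": r_win_c, "rDAMAGEc": r_dmg_c, "rFRAGc": r_frag_c,
            "rSPOTc": r_spot_c, "rDEFc": r_def_c}


# A player exactly at expected values keeps every ratio at 1.
exactly_expected = normalize_ratios(
    {"rWIN": 1.0, "rDAMAGE": 1.0, "rFRAG": 1.0, "rSPOT": 1.0, "rDEF": 1.0})

# Very high frags with average damage are capped at rDAMAGEc + 0.2.
frag_heavy = normalize_ratios(
    {"rWIN": 1.0, "rDAMAGE": 1.0, "rFRAG": 3.0, "rSPOT": 1.0, "rDEF": 1.0})
print(exactly_expected, frag_heavy["rFRAGc"])
```

Because the caps depend on rDAMAGEc, a player below the damage baseline can earn at most 0.2 in rFRAGc and 0.1 in rSPOTc and rDEFc.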
WN8 = 980*rDAMAGEc + 210*rDAMAGEc*rFRAGc + 155*rFRAGc*rSPOTc + 75*rDEFc*rFRAGc + 145*MIN(1.8,rWINc)
Step 3 takes the weighted (in Step 1) and normalized (in step 2) performance ratios and processes them through the coefficients determined for the final formula, reported above. This puts the scale on the more meaningful 0-5000, gives the relative weights of damage and reflects the interactions between frags*spots, def*frags and dmg*frags.
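For completeness, here is a minimal sketch of Step 3, taking already-normalized rSTATSc values as input. The function name and the dictionary keys are illustrative; the coefficients and the 1.8 cap on rWINc are the ones in the formula above.

```python
def wn8(c):
    """Combine normalized ratios (rSTATSc) into a WN8 value."""
    return (980 * c["rDAMAGEc"]
            + 210 * c["rDAMAGEc"] * c["rFRAGc"]
            + 155 * c["rFRAGc"] * c["rSPOTc"]
            + 75 * c["rDEFc"] * c["rFRAGc"]
            + 145 * min(1.8, c["rWINc"]))


# A player exactly at expected values on every stat.
at_expected = {"rDAMAGEc": 1.0, "rFRAGc": 1.0, "rSPOTc": 1.0,
               "rDEFc": 1.0, "rWINc": 1.0}
print(wn8(at_expected))
```

Note that frags appear only inside products, so an account with rFRAGc = 0 gets nothing from the spot and defense terms regardless of how high those ratios are.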
A Note on Interactions
If you played 5000 games on T50, and the expected spots are 4 per game, and you average 4 spots per game, your rSPOTc is 1. If you played only the E100, and the expected spots are 0.88 per game, if you average 0.88 spots per game, your rSPOTc is also 1.
As such rSPOTc correlation with winrate is significantly higher than for average spots:
And also, in the WN8 formula rSPOTc and rDEFc are multiplied by rFRAGc, after which those terms are well correlated with winrate (as measured by rWINc)…
So, rSPOTc*rFRAGc appears to measure something important for winning. These interactions seem to be properly measuring players that can do multiple things and adapt to what needs to be done to win, as opposed to players who play absolutely safe and simply deal out damage (rDAMAGEc only).
Per player analysis indicated (and Eureqa agrees apparently) that rFRAGc*rSPOTc can tell you a lot about how much a player actually manages to win. The authors believe it has to do with aggressiveness and willingness to create opportunities for the team. If you are solidly getting high rSPOTc values, you are putting yourself in more risky positions on the map, and if you are doing so while maintaining high avg frags, damage, defense and wins, IMHO you are a better player than if you manage the same damage and frags sitting in the back and shooting targets your teammates light up. The most often repeated advice in this game is “get your gun in the game and stay alive to keep it there”, and rSTATSc values appear to support that advice.
A short summary of this document can be found here: WN8: Summary
A draft of the new Color Scale
FAQ & responses to uninformed criticism
If there are questions left, read the WN8: FAQ.
Sites using WN8
See a list of Sites using WN8
Deutsch/German: WN8 (deutsch)
Praetor77, bjshnog, Crabeatoff, Mr. Noobmeter of NoobMeter.com, seriych, Phalynx of vBAddict.net, Orrie, Twistoon, NextToYou, Folterknecht, Precambrian, HibachiSniper, HubertGruber, juicebar, perpixel, sr360, jacg123 and anyone else we might have forgotten (sorry!).
Neverwish, Allurai, Mr. Noobmeter and stumpjumper8 also get shout outs for their websites, which let us carry on our work and implement it for the public to use.