Grooming the Data Set

8/22/2017

Yes, "grooming" is the word of choice here, and for good reason.

The data set that we initially received was inherently "ungroomed." We used Basketball-Reference's handy search tool to grab every player drafted first overall from 1985 to 2015, then every player drafted second overall in that span, then third, and so on all the way through the 30th draft slot. The data came with more information on each player than we actually needed; all we wanted was "team drafted by" and "career WS/48."

Except there was a problem right off the bat. Thirty-one drafts times 30 draft slots is 930, but we only had 828 data points.

"Expansion teams," I pointed out to my partner when he raised the issue. "There haven't always been thirty draft slots in each draft, because there haven't always been thirty teams in the league."

"Oh. Right," he replied, easily as unconvinced as I was by my own words, but nothing more was said on the matter. We were too focused on several issues within the data set and its analysis to pay any mind to problems that might exist in its very foundation. We accepted my explanation because we wanted to believe it, and wanted to dismiss the issue. Our failure to think critically about one question ("Do expansion teams really account for all 102 missing data points?") ended up costing us more time and effort in the end. Be advised, readers: don't make the same mistake that we did. Ask yourself the hard questions and answer them (!), whether you want to or not, because if you don't, they'll just bite you in the ass right when you think you're through. And these questions got teeth.

I assume that I've made the answer to that question rather obvious by now, but it wasn't until we were entirely finished with the project -- having made all our team graphs and rankings and such super-neat and all -- that we actually addressed the issue. We call the culprits "ghost players" or "ghosts."

"Ghosts" are players who never showed up in the data set that Basketball-Reference provided us with, because their career WS/48 never rose above zero (it was either negative or exactly zero). The truth was, both of us had seen signs of their existence long before we actually corrected for the issue: I knew that the Bobcats/Hornets were the most recent expansion team, having joined the league in 2004, but never really pressed myself on why we only had eight 30th picks instead of eleven (2004-2015). It seems so obvious now looking back on it, but at the time, I have to admit that I failed to give myself a satisfactory explanation and simply moved on without addressing the issue... learn from my mistakes, readers!

So what do we make of these ghosts? They are aptly named for sure, given that we can't easily see them in the data set, but they do indeed exist (we just need to call Peter Venkman, of course)! In fact, these are some of the most important players to add to the entire data set, because these draft picks are the worst of the worst, and teams had been creeping up in the rankings simply because a couple of their disastrous picks had been disastrous enough to not be accounted for. So how do we bust these ghosts?

Well, the process isn't nearly entertaining enough to make a movie out of, so I'll try to give you the short version. We leafed through every single draft slot and searched for gaps in the years. For example, let's say we are searching through the 26th picks, sorted by year, and we see that the data skips from 1996 to 1998. Well, now we look up who the 26th pick was in 1997, force Basketball-Reference to spit out a WS/48, and voila! We insert him into our Excel document (by this point named Project5, given how many times we have had to change it because of errors) and re-run our R code. And repeat: find another ghost, and bust him. And repeat. And repeat, and repeat, and repeat. In fact, 67 repeats.
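The gap hunt above is easy to sketch in code. What follows is a minimal Python illustration, not the process we actually used (that was done by hand in Excel). The function name `find_ghosts`, the `slots_per_year` table, and the toy slot counts are all assumptions made up for the example.

```python
from typing import Dict, Iterable, List, Set, Tuple

Pick = Tuple[int, int]  # (draft year, draft slot)


def find_ghosts(records: Iterable[Pick], slots_per_year: Dict[int, int]) -> List[Pick]:
    """Return every (year, slot) that should exist but is absent from records.

    slots_per_year is a hypothetical lookup of how many slots each draft
    had in the range being studied; it has to be supplied by hand.
    """
    seen: Set[Pick] = set(records)
    ghosts: List[Pick] = []
    for year in sorted(slots_per_year):
        for slot in range(1, slots_per_year[year] + 1):
            if (year, slot) not in seen:
                ghosts.append((year, slot))
    return ghosts


# Toy data only: placeholder slot counts, with the 1997 26th pick left out
# to mimic the "skips from 1996 to 1998" example in the text.
slots_per_year = {1996: 29, 1997: 29, 1998: 29}
records = [
    (year, slot)
    for year in slots_per_year
    for slot in range(1, slots_per_year[year] + 1)
    if (year, slot) != (1997, 26)
]

print(find_ghosts(records, slots_per_year))  # [(1997, 26)]
```

Each pair the function returns is a player to look up by hand, which is the part no script saved us from.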

That's right. The "expansion teams" that we half-assedly (at the very least, it should be a word, okay?) thought accounted for all 102 missing data points? Turns out they only comprise about a third of that pie, while the ghosts make up the rest. Whoops.
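For the skeptics, here is that accounting as a quick sanity check. It's a minimal Python sketch with variable names invented for illustration; every number in it comes straight from the text above.

```python
# Expected versus received data points, using the figures quoted in the text.
drafts = 2015 - 1985 + 1          # 31 drafts, 1985 through 2015 inclusive
slots_per_draft = 30              # the naive assumption: 30 slots every year
expected = drafts * slots_per_draft
received = 828

missing = expected - received     # data points unaccounted for
ghosts = 67                       # players we had to add back by hand
never_existed = missing - ghosts  # slots that were never there (fewer teams)

print(expected, missing, never_existed)
print(round(never_existed / missing, 2))  # share of the gap explained by expansion
```

So the expansion-team explanation covers 35 of the 102 missing points, roughly a third, and the ghosts cover the other 67.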

56 of these 67 ghosts posted negative Win Shares per 48 Minutes, and pretty soon it wasn't difficult to see why Basketball-Reference didn't even spit out a number for them at all: most of these guys had played far fewer than 10 (yes, ten) career games, and the website most reasonably assumed that you couldn't reliably generate such an advanced statistic for such a small sample size. In fact, I actually agree with those standards.

See, in most projects, you can make cutoffs. "This many field goals attempted", or minutes played, or games played; switching sports: innings pitched, plate appearances, whatever. Setting thresholds to qualify for your main data set only makes sense, because small samples (only a few games played, say) are easily skewed by even just one or two really good or really bad games. See: Jimmy Butler and the 30th picks, pre-ghosts especially. It's all fairly intuitive.

But we don't have that luxury here. We can't just throw out these data points, because by doing so, we’d be rewarding teams for drafting so terribly that their pick hardly even played, over teams that merely drafted poorly (a guy with a low, yet positive, WS/48), if that all follows. Remember: we're assessing the impact that a pick had on his team. That's zero if he hardly even played any games for it.

Fun tangent here that I get to give because we’re speaking about guys with a career WS/48 of zero due to the fact that they played zero NBA games. Frederic Weis was selected by the Knicks with the 15th pick in the 1999 draft (in true Knicks fashion, they controversially passed on Ron Artest, who was taken with the next pick). Weis played for his native France in the 2000 Summer Olympics, and Vince Carter may have ended the poor seven-footer’s career with the legendary “Le dunk de la mort” (the dunk of death).

So yeah, anyways, Frederic Weis is a guy with a career WS/48 of 0.000. Moving on then.

So what to do with these 56 ghosts? Take 'em. Nothing else to do. I know that their career WS/48 isn't reliable, but if you think about it, it's not necessarily an inaccurate measure of how much they actually helped their team either way -- whether they slipped in the shower rookie year or enjoyed a successful overseas career or whatever. It's all the same. So now we've added our first negative WS/48's to the data set. Welcome, ghosts.

Now it's time to come clean: I lied before when I told you there were only 56 guys out of the 67 who posted negative win shares. There were actually fifty-seven.

Ladies and gentlemen, meet Troy Bell, the 16th pick in the 2003 draft. Mr. Bell, clearly intent on ruining our project, played in six career NBA games for a grand total of 34 minutes. He also posted a career WS/48 -- when we bend Basketball-Reference's arm -- of -0.326.

Think about that for a moment. -0.326. David Robinson and Chris Paul are the two best players in this entire data set by a decent amount, and their career WS/48 is +0.250. Is Troy Bell seriously even worse than David Robinson is good? Would I rather have two guys who never played in the NBA (thus career WS/48 = zero?) over Robinson and Bell?

Obviously, the answer to both those questions is a resounding no. Bell just got unlucky (or something) in an extremely small sample size (for reference, the next closest negative WS/48 is a ways off at -0.197 -- more reasonable, I suppose).

So what the hell do we do with him?

Well, we can't just give Memphis a pass here. They evidently burned a number sixteen pick -- Bell clearly didn't help them much (or rather, he hurt them). We have to keep him in the data set. But how to punish the Grizz justly?

We strongly and seriously considered resetting all of the “negative win shares folks” -- Bell included, of course -- back to zero WS/48, and then going from there. After all, they are all more or less the same: how much different can their overall impacts really be in only a few games?

It’s not an unfair argument, but after much deliberation, we opted to keep the negative win shares as-is. As I write this now, I’m still not 100% confident in the decision, but at the end of the day, editing people’s career stats -- and making them all the same when they’re not -- carries an immense burden of proof. For this project, it’s only right to measure just how much a guy hurt his team’s record (which is essentially what a negative win share indicates -- that the team would have been better off playing four guys than putting him on the court). It’s also worth adding that our data set is very resistant and sturdy: think Troy Bell dragged his team down like no other player? Guess again. Despite his z-score of -3.81 checking in at second-worst of all players in the entire data set, the 16th slot’s coefficient of under 1.05 keeps his final slot score just under -4, only 20th-worst of all players.
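To put rough numbers on that last point, here is a minimal Python sketch. It rests on one loud assumption: that a player's slot score is simply his z-score multiplied by his draft slot's coefficient, which is how the figures above appear to fit together. The function names are hypothetical, and `z_score` uses a population standard deviation purely for illustration.

```python
from statistics import mean, pstdev
from typing import Sequence


def z_score(value: float, population: Sequence[float]) -> float:
    """Standard score of value within population (population std dev)."""
    return (value - mean(population)) / pstdev(population)


def slot_score(z: float, coefficient: float) -> float:
    # ASSUMPTION: slot score = z-score * slot coefficient.
    return z * coefficient


# Figures quoted in the text: Bell's z-score, and the 16th slot's
# coefficient, which is "under 1.05" (so 1.05 is used as an upper bound).
bell_z = -3.81
coefficient_upper_bound = 1.05

print(round(slot_score(bell_z, coefficient_upper_bound), 4))
```

Because the real coefficient sits a little below 1.05, Bell's slot score lands at roughly 4 in magnitude, which is why one ugly 34-minute sample doesn't sink the whole slot.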

So that’s enough on the ghosts. The final thing we did to groom our data set was to adjust for the fact that the listed team was wrong for several of the players.

Yup.

Take the (should-be!) infamous draft-day deal in 1998 between the Bucks and the Mavericks. Dallas entered the draft with the sixth pick; Milwaukee the ninth and 19th. The two teams agreed to swap these selections on draft day -- a fully valid and enforceable trade according to the NBA, but the logos next to the teams on the official draft board do not change! Robert Traylor was drafted sixth (career averages of 4.3 points and 3.7 rebounds) while some European dude named “Dirk Nowitzki” and Pat Garrity were taken at ninth and nineteenth respectively (amazingly, Dallas immediately swung Garrity for some young, also-foreign backup point guard named Steve Nash!). But even though Milwaukee is clearly the one making the pick at number six now, and Dallas at nine and nineteen, the draft board says otherwise -- Dirk Nowitzki was, is, and forever will be officially listed as being drafted a Buck.

Except that’s BS. The Mavericks made that selection! Of course they should get credit for that pick (other notables that swing the scale like this include Kawhi Leonard [IND to SA] at 15th overall in 2011 and Rudy Gobert [DEN to UTA] at 27th in 2013), but according to our spreadsheet based on data from Basketball-Reference (no specific fault to them, of course), the Mavericks don’t receive said credit; Milwaukee cheats them out of a pick that they, Dallas, made. So what to do to correct this gross injustice?

I wish there was a better answer than “manually go through every draft-day trade from every year from 1985 to 2015, and change the team on the spreadsheet to the way it should be,” but unfortunately there isn’t. But we should be very clear about what we changed and why: we want to give credit to the team that made/controlled that selection. Contrary to popular belief, Kobe Bryant was not a draft-day (per se) trade from the Charlotte Hornets (now the New Orleans Pelicans) to the Lakers in exchange for Vlade Divac. Rather, that deal occurred well after the draft; therefore, even though it was before Kobe had played a single game, we leave the credit to the now-Pelicans because it was ultimately the Pelicans who made the pick.

On the other hand, there were some tough decisions that we had to make going the other way. Kawhi Leonard, as mentioned above, was dealt from Indiana to San Antonio the day after the 2011 draft (or his draft rights were, rather) in exchange for George Hill. Is that a draft-day trade by the literal definition? No, but it seems more than likely that the Spurs had been eyeing the rangy forward and quite possibly had a trade lined up: did these negotiations, and the Spurs’ desire to have Kawhi on their team, really just come out of nowhere less than 24 hours after the draft had ended? Unlikely, and that’s why we (arbitrarily, yet defensibly) give them credit over Indy for the two-time Defensive Player of the Year (sorry, Pacers fans).
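The corrections themselves boil down to a hand-built lookup table. Here is a minimal Python sketch of the idea, not our actual spreadsheet workflow; the `OVERRIDES` name, the row layout, and the team abbreviations are assumptions for illustration, while the picks listed are the ones discussed above.

```python
from typing import Dict, List, Tuple

# (draft year, pick number) -> team that actually controlled the selection.
# Entries are compiled by hand; only the examples from the text are shown.
OVERRIDES: Dict[Tuple[int, int], str] = {
    (1998, 6): "MIL",   # Robert Traylor, listed under Dallas
    (1998, 9): "DAL",   # Dirk Nowitzki, listed under Milwaukee
    (1998, 19): "DAL",  # Pat Garrity, listed under Milwaukee
    (2011, 15): "SA",   # Kawhi Leonard, listed under Indiana (judgment call)
    (2013, 27): "UTA",  # Rudy Gobert, listed under Denver
}


def apply_overrides(rows: List[dict], overrides: Dict[Tuple[int, int], str]) -> List[dict]:
    """Return copies of rows with the team replaced wherever an override exists."""
    corrected = []
    for row in rows:
        fixed = dict(row)  # copy, so the original data stays untouched
        fixed["team"] = overrides.get((row["year"], row["pick"]), row["team"])
        corrected.append(fixed)
    return corrected


rows = [
    {"year": 1998, "pick": 9, "player": "Dirk Nowitzki", "team": "MIL"},
    # No override here: the Kobe deal came after the draft, so the listed team stays.
    {"year": 1996, "pick": 13, "player": "Kobe Bryant", "team": "CHH"},
]

for row in apply_overrides(rows, OVERRIDES):
    print(row["player"], row["team"])
```

Keeping the overrides in one explicit table also makes the arbitrary calls (hello, Kawhi) easy to spot and easy to argue about.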

So that’s pretty much it. Finding ghosts and correcting for draft-day trades (and Kawhi Leonard), and the data set is all groomed. Surprised I managed to write this much about just that. Hope I didn’t bore you, but now you’ve got the really good stuff coming up: the final rankings! Enjoy.

Next: Final Rankings


Evaluating NBA Teams' Drafting: Grooming the Data Set
