Steam Spy scrapes Steam user accounts to estimate sales data

Steam Spy, a new web tool developed by blogger and podcaster Sergey Galyonkin, takes publicly available web data and uses it to extrapolate sales and ownership information for Steam games. Inspired by Ars Technica’s Steam Gauge experiment, it relies on similar information but cuts the data differently. The result is a free, up-to-date listing of sales through Valve’s online marketplace.

Galyonkin hopes the information will be useful for indie developers, journalists, students “and all parties interested in PC gaming and its current state of affairs.”

Like Steam Gauge, Steam Spy gathers data from a “limited number” of Steam’s 125 million user profiles every minute, then updates its visualizations nightly. Those visualizations allow anyone to dig into the sales history for a particular Steam game, including the price at which the game was purchased and the country from which the purchaser logged in.
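
As a rough illustration of how a sample can be extrapolated to the whole user base, here is a minimal Python sketch. It assumes a simple proportional scale-up; the function name and the sample counts are hypothetical, and Steam Spy's actual method may differ.

```python
TOTAL_STEAM_USERS = 125_000_000  # user count cited in the article


def estimate_owners(sampled_profiles: int, profiles_owning_game: int,
                    total_users: int = TOTAL_STEAM_USERS) -> int:
    """Scale the share of sampled profiles that own a game up to all users.

    Assumption: the sample is representative, so the ownership share in the
    sample equals the ownership share in the full population.
    """
    if sampled_profiles <= 0:
        raise ValueError("sampled_profiles must be positive")
    share = profiles_owning_game / sampled_profiles
    return round(share * total_users)


# Hypothetical sample: 300 of 300,000 valid profiles (0.1%) own the game.
print(estimate_owners(300_000, 300))
```

Under these assumptions, a game found in 0.1% of sampled profiles would be credited with roughly 0.1% of all Steam users as owners.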

“The data is based on several days samples,” Galyonkin states on Steam Spy’s about page. “From three days for individual apps to seven days for location-based info. It means that Steam Spy is completely unreliable for recently released games.”

“I’m sampling roughly 150 people per minute,” Galyonkin told Polygon. “But half of those profiles are empty. So, it’s around 100,000 valid profiles per day. I’ve optimized the algorithm and tomorrow should have data for 150,000 valid profiles, give or take. … Data points are games per user, that’s why there are millions of them. I’m using rolling samples for three days to increase accuracy.”
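
The quoted figures can be checked with quick arithmetic. The Python sketch below uses only the numbers Galyonkin gives (150 profiles per minute, about half of them empty, a three-day rolling window); the constant and function names are illustrative, not part of Steam Spy.

```python
PROFILES_PER_MINUTE = 150  # quoted sampling rate
VALID_FRACTION = 0.5       # "half of those profiles are empty"
MINUTES_PER_DAY = 24 * 60
ROLLING_WINDOW_DAYS = 3    # rolling sample used for individual apps


def valid_profiles_per_day(rate_per_minute: int = PROFILES_PER_MINUTE,
                           valid_fraction: float = VALID_FRACTION) -> int:
    """Number of non-empty profiles sampled in one day at a constant rate."""
    return int(rate_per_minute * MINUTES_PER_DAY * valid_fraction)


daily = valid_profiles_per_day()
print(daily)                        # 108000, i.e. "around 100,000"
print(daily * ROLLING_WINDOW_DAYS)  # 324000 profiles in a three-day window
```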

So how large is the margin of error in the results? Galyonkin takes a page from Ars Technica here as well by comparing Steam Spy to a political survey.

“Your usual political surveys are pretty correct mostly because you don’t have much choice,” Galyonkin writes, again on Steam Spy’s about page. “It’s going to be candidate A, B or maybe C in some countries, so margin of error less than 0.1% should be good enough.

“It doesn’t work this way with Steam. Imagine users as voters, but instead of voting for one of three candidates, they’re voting for several games from tens of thousands available in Steam catalog. Even the most popular paid games are reaching maybe 5% of this audience and most are in realms of 0.1% or even less.

“So 0.1% margin of error for a game with 0.1% of Steam audience would produce results that are mostly useless. That’s why Steam Spy has to gather millions of points of data daily to predict games sales and audience. And that’s why Steam Spy is often wrong. Not by much, but still wrong.”
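
To see why small audience shares are so sensitive to sampling error, consider the textbook margin of error for a sampled proportion. The Python sketch below uses the normal approximation at 95% confidence and assumes a sample of 300,000 valid profiles (roughly three days at the rate quoted above). Both the formula choice and the sample size are assumptions for illustration, not a description of Steam Spy's own calculation.

```python
import math

Z_95 = 1.96  # z-score for a 95% confidence level


def margin_of_error(share: float, sample_size: int, z: float = Z_95) -> float:
    """Absolute margin of error for a proportion (normal approximation)."""
    return z * math.sqrt(share * (1.0 - share) / sample_size)


def relative_error(share: float, sample_size: int) -> float:
    """Margin of error expressed as a fraction of the share itself."""
    return margin_of_error(share, sample_size) / share


SAMPLE_SIZE = 300_000  # assumed: about three days of valid profiles

for share in (0.05, 0.001):  # a very popular game vs. a typical one
    moe = margin_of_error(share, SAMPLE_SIZE)
    rel = relative_error(share, SAMPLE_SIZE)
    print(f"share={share:.1%}  margin=±{moe:.4%}  relative=±{rel:.1%}")
```

Under these assumptions the absolute margin shrinks for the smaller game, but the margin relative to the game's own audience grows several times over. That is the effect Galyonkin describes: an error that is negligible for a 5% share can dominate the estimate for a 0.1% share.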

Galyonkin has already been in touch with several developers, and they say he’s not far off the real numbers.

“So far my friends from several game companies have confirmed that data is accurate and within the specified margin of error,” Galyonkin told Polygon.

One thing that definitely throws the data off is a free weekend. For instance, Men of War jumped from 172,000 owners to 25 million owners, because Steam essentially gave the game to everyone for a short period of time. But once those blips fall off the end of the chart, the Steam Spy data becomes useful once again.