Posted by rjonesx.
It’s all wrong
It always was. Most of us knew it. But with limited resources, we just couldn’t really compare the quality, size, and speed of link indexes very well. Frankly, most backlink index comparisons would barely pass for a high school science fair project, much less a rigorous peer review.
My most earnest attempt at determining the quality of a link index was back in 2015, before I joined Moz as Principal Search Scientist. But I knew at the time that I was missing a huge key to any study of this sort that hopes to call itself scientific, authoritative or, frankly, true: a random, uniform sample of the web.
But let me start with a quick request. Please take the time to read this through. If you can’t today, schedule some time later. Your businesses depend on the data you bring in, and this article will allow you to stop taking data quality on faith alone. If you have questions about any of the technical aspects, I will respond in the comments, or you can reach me on Twitter at @rjonesx. I desperately want our industry to finally get this right and to hold ourselves as data providers to rigorous quality standards.
Quick links:
Home
Getting it right
What’s the big deal with random?
Why not Common Crawl?
How to get random
The starting point: Getting seed URLs
Selecting based on size of domain
Selecting pseudo-random starting points
Crawl, crawl, crawl
Now what? Defining metrics
Size metrics
Speed metrics
Quality metri…