How long does it take to test 25 Billion web pages

If you started it during the reign of Thutmose I of Egypt, you’d be done soon.
Or you could invest several million dollars.
Or maybe doing it is just a stupid idea in the first place

On July 6, 2016, Michelle Hay of the company Sitemorse published an “article” (a term I’m using loosely here) titled “WCAG 2.0 / Accessibility, is it an impossible standard that provides the basis for excuses?“. Overall, I found the article to be very poorly written, based on a false premise, and a demonstration of extreme ignorance at Sitemorse. Many others felt the same way, and Léonie Watson’s comments address many of the factual and logical shortcomings of the article. Something I personally found interesting in the Sitemorse article are the following two sentences:

What we are suggesting, is to create a list of priorities that can be done to improve accessibility. This will be based on the data we have collected from 25+ billion pages and feedback from industry experts, clients and users.

25 billion pages is a massive number of pages. It is also extremely unlikely to be true and definitely not at all useful . To prove my point, I’ve used Tenon to gather the data I need.

Historically, Tenon averages about 6 seconds per distinct URL to access each page, test it, and return results. There are a number of factors involved in the time it takes to process a page. We frequently return responses in around a second, but some pages take up to a minute to return a response. I’ll discuss the contributing factors to the response time in more detail further down below.

Tenon does its processing asynchronously, which means that it won’t get choked down by those pages that take a longer time to test. In other words, if you test 100 pages it won’t take 6*100 seconds to test each one. The average time needed across the entire set will be shorter than that because Tenon returns results as soon as they’re available in a non-blocking fashion. For example, if one page takes 30 seconds to test, Tenon could easily test and return results for a dozen or more other pages in the meantime. The goal of this experiment is to see how long it would take to test 25,000,000,000 pages using Tenon.

Sitemorse’s article does not disclose any details about their tool. Their website is chock full of vague platitudes and discloses no substantive details on the tool. They don’t even say what kind of testing it does. Regardless, given my personal history with automated tools, I’m fairly confident that across an identical sample size Tenon is at least as fast, if not faster.

Test Approach

The test approach I used is outlined as follows, in case anyone wants to replicate what I’ve done:

  1. I wanted to test at least 16,641 distinct URLs. Across a population size of 25,000,000,000 URLs, this gives us a 99% Confidence Level with a Confidence Interval of just 1.
  2. The list of URLs piped into Tenon all come from a randomized list of pages within the top million web domains listed by Alexa and Quantcast.
  3. The testing was performed on a completely fresh install of Tenon on my local machine. That means no other users on the system, no other processes running, and all available resources being dedicated to this process (subject to some caveats below)
  4. This testing used a Bulk Tester that populates a queue of URLs and submits URLs to the Tenon API at a rate of 1 URL per second via AJAX. It does this testing asynchronously. In other words, it just keeps sending the requests without ever waiting for a response. I could have reduced the time between requests, but it was a local install and I didn’t want to DOS my own machine that I’m also using for work while this is going on.
  5. While the bulk tester does other things like verifying that the API is up and verifying the HTTP status code of the tested page before sending it to Tenon’s API, the elapsed time is tracked solely from the time the API request is sent to the time the API responds. This avoids the count being skewed by the bulk tester’s other (possibly time-intensive) work.

Caveats and Concerns

This test approach carries a few caveats and concerns. In an ideal world, I’d would have deployed a standalone instance that fully replicates our Production infrastructure, including load balancing, database replication, and all of that. I don’t think that’s truly necessary given the stats on my local machine, which I discuss below.

Assessing System

Any accessibility testing software will be subject to the following constraints on the machine(s) hosting it. These factors will impact the ability to respond to the request, assess the document, and return a response:

  1. Available Memory: Memory allows a testing tool to store frequently accessed items in cache. For instance, Tenon makes extensive use of Memcached in order to store the results of repetitious queries on infrequently changing data. The Macbook Pro I used for this has 16GB of 1600 Mhz DDR3 RAM.
  2. Available CPU: The more CPU and the more cores means the server can do more work. The Macbook Pro I used for this has a 2.8GHz Intel Core i7 Processor. The processor has 4 cores.
  3. Network performance: Simply put, the faster the connection between the Assessing system and the Tested System, the less time necessary to wait for all of the assets to be transferred before testing can begin. I’m on a Verizon FiOS connection getting 53MBps both up and down.

Overall, Tenon performs well as a local install. That said, it would be more “scientific” if this was the only thing the machine was doing, but like I said before, it is my work machine. In a Production environment, Tenon is provisioned with far more resources than it needs so it retains its responsiveness when under high demand. Provisioned locally on a Virtual Machine, Tenon doesn’t require very much RAM, but it loves CPU. Although the amount of CPU I provide to the VM is sufficient, I’m also aware that I could easily throw more requests at it if I could fully dedicate all 4 cores to the VM. Also, there were times when the local Tenon install competed heavily for network traffic against Google Hangouts and GoToMeeting. All-in-all, I doubt that across the entire test set the local instance’s power will play too heavily into the results.

Tested System

All of the above concerns apply to the tested system. The following additional concerns on each tested URL may also impact the time needed to return results:

  1. Client-side rendering performance: One of Tenon’s most important advantages, in terms of accuracy, is that it tests the DOM of each page, as rendered in a browser. This gives Tenon significant power and broadens the range of things we can test for. One downside to this is that Tenon must wait for the page and all of its assets (images, CSS, external scripts, etc.) to load in order to effectively test. A poorly performing page that must also download massive JavaScript libraries, unminified CSS, and huge carousel images will take longer to test. For instance, if a page takes 10 seconds to render and 1 second to test, it will take a total of 11 seconds for Tenon to return the response. This is probably the most significant contributor to the time it takes to test a page in the real world.
  2. Size of the document/ Level of (in)accessibility: Among the many factors that contribute to the time it takes to assess a page and return results is how bad the page is. In Tenon’s case, our Test API doesn’t test what isn’t there. For instance, if there are no tables on a page then the page won’t be subjected to any table-related tests. In other words, even though Tenon can test nearly 2000 specific failure conditions, how many of those that it actually tests for is highly dependent on the nature of the tested document – smaller, more accessible documents are tested very quickly. The converse is also true: Larger, more complex, documents and documents with a lot of accessibility issues will take longer to test. The most issues Tenon has ever seen in one document is 6,539.

Results

  • The very first result was sent at 7/12/16 21:59 and the very last result was 7/13/16 18:38.
  • The total number of URLs successfully tested was 16,792.
  • That is 74,340 seconds total with an average time across the set of 4.43 seconds
  • There were several hundred URLs along the way that returned HTTP 400+ results. This played into the total time necessary, but I purged those from the result set to give Sitemorse’s claim the benefit of the doubt.

Total Issues

Minimum 0.00
Maximum 2015.00
Mean 66.89
Median 37.00
Mode 0.00
Standard Deviation 85.79
Kurtosis 49.02
Skewness 4.35
Coefficient of Variation 1.28

Errors

Minimum 0.00
Maximum 2011.00
Mean 47.51
Median 28.00
Mode 0.00
Standard Deviation 67.92
Kurtosis 126.13
Skewness 7.46
Coefficient of Variation 1.43

Warnings

Minimum 0.00
Maximum 464.00
Mean 19.38
Median 1.00
Mode 0.00
Standard Deviation 55.10
Kurtosis 12.39
Skewness 3.64
Coefficient of Variation 2.84

Elapsed Time, (measured on a per URL basis)

Minimum 0.37 seconds
Maximum 49.87 seconds
Mean 9.50
Median 7.50
Mode 6.43
Standard Deviation 7.26
Kurtosis 3.20
Skewness 1.66
Coefficient of Variation 0.77

Using this to assess Sitemorse’s claim

As a reminder, the sample size of 16,792 pages is more than enough to have a 99% Confidence Level with a Confidence Interval of just 1. One possible criticism of my methods might be to suggest that it would be more "real-world" if I tested pages by discovering and accessing them via spidering. That way true network and system variations could have had their impact as they normally would. Unfortunately that would also add another unnecessary factor to this: the time and resources necessary to run a spider. Having all of the URLs available to me up front allows me to focus only on the testing time.

Given this data, lets take a look at Sitemorse’s claim that they’ve tested 25,000,000,000 pages:

At 4.43 seconds per page, it would have taken Sitemorse’s tool 3,509.5 years to test 25,000,000,000 pages running around the clock – 24 hours a day, 7 days a week, 365 days a year with zero downtime. Could they have done it faster? Sure. They could have used more instances of their tool. All other things being equal, running 2 instances could cut the time in half. With an average assessment time of 4.43 seconds, they would need 3,510 instances running 24/7/365 to do this work in less than a year.

(4.43 seconds each * 25,000,000,000) / (60 seconds per minute * 60 minutes per hour * 24 hours per day * 365 days per year)

Using Tenon’s average monthly hosting costs, testing 25,000,000,000 pages would cost them nearly $10,530,000 in server costs alone to run the necessary number of instances to get this analysis done in less than a year. This monetary cost doesn’t include any developer or server admin time necessary to develop, maintain, and deploy the system. The Sitemorse article doesn’t disclose how long the data gathering process took or how many systems they used to do the testing. Regardless, it would take 351 instances to perform this task in less than a decade.

Why have I focused on a year here? Because that’s the maximum amount of time I’d want this task to take. They could have done it across the last decade for all we know. However, the longer it takes to do this testing the less reliable their results would be. Across a decade – nay across more than a year – the more likely that technology trends would necessitate changes to how & what is tested. A few years ago, for instance, it was prudent to ensure all form fields had explicitly associated LABEL elements. Now, with the proliferation of ARIA-supporting browsers and assistive technologies, your tests need to include ARIA in your testing. Data gathered using old tests would be less accurate and less relevant the longer this process took. I realize I’m assuming a lot here. They could have continually updated their software along the way, but I strongly doubt that to have been the case. Keep in mind that this 24/7/365 test approach is vital to getting this process done as fast as possible. Any downtime, any pause, and any change along the way would have only added to the time.

Giving them the benefit of the doubt for a moment, let’s assume they had the monetary and human resources for this task. Even if they did something like this, it also begs the question: Why?

The entire idea is ridiculous

I’m not saying that it isn’t possible to test 25,000,000,000 pages. In fact massive companies could perform such a task in no time at all. But I also think doing it is a ridiculous idea. And when I say "ridiculous" I mean it in the strictest sense of the word. No matter how they performed a project such as using 351 instances across a decade or 3,510 instances for less than a year, or something in between, doing so is an ignorant, uninformed, and useless pursuit. It indicates a woeful lack of knowledge and experience in development, accessibility, and statistics.

In their article they state:

With this information we will consider the checkpoints of WCAG 2.0 and come up with 10 things that should be dealt with to improve accessibility which will all be understandable, manageable, measurable and achievable.

The idea of making such decisions based on rigorous data gathering sounds impressive. I have a lot of respect for approaches draw their conclusions from data rather than opinion. The question that must be asked, however, is whether or not the type of information they seek might already exist or, barring that, could it be gathered using a different/ cheaper/ faster/ more accurate approach. If you were to ask accessibility experts what their “Top 10 Things” are, you’d get a pretty wide variety of answers. You’d probably get things that are vague, overly broad, or driven by personal bias. However, if you were to moderate such a process using the Delphi Method [PDF] you’d probably come to concensus rather quickly on what those “Top 10 Things” would be. In fact, I argue that given a hand-picked list of respected industry experts, this process could be completed in a weekend. This illuminates the first characteristic of Sitemorse’s claim that makes it worthy of ridicule.

The second characteristic that makes this claim worthy of ridicule is the fact that they used an automated tool for this task. That’s right, I’m the founder of a company that makes an automated tool and I’m telling you that using data from an automated tool to do research like this is stupid. This is because there’s only so much that an automated tool can detect. Automated testing tools are not judges. They cannot prove or disprove any claims of conformance and they cannot even definitively tell you what the most frequent or highest impact issues are on a specific page.

Automated testing tools are excellent at doing one thing and one thing only: finding issues that the tool has been programmed to find. Nothing more. Any time you use an automated testing tool, you’re subjecting the tested system against a pre-defined set of checks as determined by the product’s developer. The nature, number, accuracy, and relevance of those checks will vary from one tool to the other. There are a large number of things that cannot be tested for via automation and an equally large number of things that are too subjective to test for.

The application of automated testing results to a process like this is only relevant if it is being used to validate the “Top 10 Things” that were determined by the experts. I believe that taken on their own, the opinions of experts and the data gathered from a tool would differ significantly. For instance, one of the Top 10 issues – by volume – detected by Tenon is for images that have alt and title attributes that are different. The reason we raise this issue is because there’s a likelihood that only one of these values is the actual text alternative for the image. Supplying both attributes – especially when they’re different from each other – leaves you with at least a 50/50 chance that the supplied alt is not an accurate alternative. After all, what could be the possible purpose of providing the differing title? Even though that’s a Top Ten issue by volume, it certainly isn’t going to make any Top Ten list created by experts. In the vast majority of cases this issue could be best characterized as an annoyance, especially because the information is (ostensibly) there in the DOM and can be discovered programmatically.

Finally there’s the complete lack of understanding of statistics and sample sizes. If we assume that the purpose of Sitemorse’s testing of 25,000,000,000 pages is to gather statistically significant information on the accessibility of the web, they’ve overshot their sample size by a ridiculous amount. And again, by “ridiculous” I truly mean worthy-of-ridicule. The size of your sample should be large enough that you’ve observed enough of the total population that you can make reliable inferences from the data. A small sample means that you’ll be unable to make enough observations to compensate for the variations in the data. The sample size, when compared to the population size, allows you to calculate a Confidence Level and Confidence Interval. In layperson terms, the Confidence Level is how “certain” you can be that your results are accurate. The Confidence Interval is also what people refer to as the margin of error. For instance, if you have a margin of error of “2” then the variance in the actual result could be plus or minus “2”. If I said the average result of a survey is “10” with a Confidence Interval of “2” then the actual answer could be between “8” and “12”.

What kind of sample size do you need to make inferences on the accessibility of the entire web? You might think that number would be pretty massive. After all, the total number of web sites is over 1 Billion and growing literally by the second. How many distinct URLs are there on the web?

In August 2012, Amit Singhal, Senior Vice President at Google and responsible for the development of Google Search, disclosed that Google’s search engine found more than 30 trillion unique URLs on the Web… (Source)

Apart from the statement above, getting an authoritative and recent number on the total number of distinct URLs is really difficult. Fortunately it doesn’t really matter because 30 Trillion unique URLs is, for our purposes, the same as Infinity. Sample size calculation isn’t a set relative amount based on the population size. After a certain point, you don’t really add much in the way of reliability to your inferences just because you’ve gathered a huge sample. Once you’ve gathered a sufficiently large sample, you could double it, triple it, or even quadruple it and not get any more reliable data. In fact, doing so is a waste of time and money with zero useful return.

What’s the right size of the necessary sample pages? 16,641. In other words, it is the same for the entire web as it is for Sitemorse’s claim that they tested 25,000,000,000 pages. This is because, as I’ve said, there comes a point where continued testing is wholly unnecessary. Sitemorse claim to have tested 24,999,983,359 more pages than they needed to. A sample size of 16,461 has a 99% Confidence Level with a Confidence Interval of just 1. If you want 99.999% Confidence Level you could bump the sample size to around 50,000 but I’m willing to bet the results wouldn’t be any different than if you’d stuck with 16,461.

In other words, Sitemorse didn’t just overshoot the necessary sample size. Even if we say we want a 99.999% Confidence Level, they overshot the necessary sample size by over fifty thousand times. That’s not being extra diligent, that’s being colossally stupid. They could have gotten the same data by investing 0.0002% as much work into this effort.

What does this mean?

I can’t speak to Sitemorse’s intent in making this claim that they’ve tested 25,000,000,000 web pages. I can only comment on its level of usefulness and logistical likelihood. On these counts Sitemorse’s claim is preposterous, extraordinarily unlikely, and foolish. The level of effort, the necessary resources, overall calendar time, and money necessary for the task is absurdly high. Even if their claim of testing 25,000,000,000 web pages is true, the act of doing so illustrates that they’re woefully inept at doing research and all too eager to waste their own time on fruitless endeavors.

Why do I care about this? What started me down this path was simple curiosity. Tenon has some ideas of some research we’d like to take on as well. It was immediately obvious to me that Sitemorse’s claim of testing 25,000,000,000 pages was absurdly large, but I also immediately wondered how much time such an undertaking would require. I decided to write about it merely because of how absurd it is to test 25,000,000,000 pages.

The Sitemorse article is an obvious sales pitch. Any time someone says they have “special knowledge” but doesn’t tell you what that knowledge is, they’re using a well-known influencing technique. In this regard, Sitemorse isn’t any different than others in the market and certainly not more worthy of negative judgement than others who do the same thing. The only difference in this case is that they used a huge number to try to establish credibility for their “special knowledge” that actually harms their credibility rather than helps it. The reality is that there’s nothing special or secret out there.

An Actual Knowledge Share

Finally I would like to close off this post with a real knowledge share. While SiteMorse are attempting to hold special knowledge based on their research, I believe that the following information doesn’t actually hold any special surprises.

Top 10 issues, by Volume

  1. Element has insufficient contrast (Level AA)
  2. This table does not have any headers.
  3. This link has a `title` attribute that’s the same as the text inside the link.
  4. This image is missing an `alt` attribute.
  5. This `id` is being used more than once.
  6. Implicit table header
  7. This link has no text inside it.
  8. This link uses an invalid hypertext reference.
  9. This form element has no label.
  10. These tables are nested.

Issues by WCAG Level

WCAG Level Count Percent
Level A: 765278 52%
Level AA: 357141 24%
Level AAA: 339664 23%

Issues by WCAG Success Criteria

Success Criteria Num. Instances Percent
1.1.1 Non-text Content (Level A) 110582 7%
1.3.1 Info and Relationships (Level A) 195544 12%
1.3.2 Meaningful Sequence (Level A) 20613 1%
1.4.3 Contrast (Minimum) (Level AA) 352562 22%
1.4.5 Images of Text (Level AA) 255 0%
2.1.1 Keyboard (Level A) 204361 13%
2.1.2 No Keyboard Trap (Level A) 4901 0%
2.1.3 Keyboard (No Exception) (Level AAA) 185056 12%
2.3.1 Three Flashes or Below Threshold (Level A) 23 0%
2.3.2 Three Flashes (Level AAA) 23 0%
2.4.1 Bypass Blocks (Level A) 24286 2%
2.4.2 Page Titled 1033 0%
2.4.3 Focus Order (Level A) 18776 1%
2.4.4 Link Purpose (In Context) (Level A) 139296 9%
2.4.6 Headings and Labels (Level AA) 4324 0%
2.4.9 Link Purpose (Link Only) (Level AAA) 139296 9%
2.4.10 Section Headings (Level AAA) 15289 1%
3.1.1 Language of Page (Level A) 4497 0%
3.3.2 Labels or Instructions (Level A) 24248 2%
4.1.1 Parsing (Level A) 56843 4%
4.1.2 Name, Role, Value (Level A) 103883 6%

Issues By Certainty

Certainty Num. Instances Percent
40% 2371 0%
60% 323036 29%
80% 40453 4%
100% 756559 67%

Issues By Priority

Priority Num. Instances Percent
42% 2363 0%
47% 435 0%
51% 1033 0%
54% 352562 31%
57% 920 0%
65% 56843 5%
76% 5145 0%
81% 15289 1%
85% 161 0%
86% 255 0%
90% 34531 3%
96% 81881 7%
100% 571001 51%

Conclusion

  1. Nobody holds any special secrets when it comes to knowing how to make stuff accessible. If you’re interested in learning about accessibility there are already excellent resources out there from The Web Accessibility Initiative, WebAIM, and The Paciello Group. Each of those organizations freely and openly share their knowledge.
  2. (Not necessarily specific to Sitemorse) Anyone who claims to have special knowledge or expects you to sign up for their special downloadable whitepaper is full of shit and should be treated as such.
  3. Every. Single. Piece. Of. Data. Above. indicates one thing: The nature and volume of accessibility issues that are automatically detectable make it obvious that people constantly make high-impact-yet-easy-to-fix accessibility problems. Ignorance is the #1 roadblock to a more accessible web

Everything that this experiment showed me is that while we know for a fact that there’s only so much that automated testing can find there’s enough people making these common mistakes over and over. Further: you don’t need to run a tool to run a test against 25,000,000,000 pages to tell you this, all you have to do is listen to users with disabilities. Maybe Sitemorse should’ve started there.

Get the full data set used in this.

One Trackback

  • […] of priorities. There were many strong and negative reactions to the article and I took them to task on my own site for the alleged research approach used to create their claim. I also took them to task on the Top […]