In 2003, I began work on a web-based application called SiteQuality. It was intended to be a tool I could use to sell people website redesigns or usability work. At around the same time, I began work on a different project I called AccessTest. It was designed to do automatic accessibility testing using Tidy. It could crawl, perform the tests, and generate reports rather quickly. The problem was that neither project ever got finished, because I quickly realized both were severely flawed. AccessTest, like every other tool that used Tidy for accessibility testing, was prone to large volumes of false positives. So, I scrapped them.
From these two sprang AQUA (Accessibility, Quality, and Usability Analysis). This web-based system can run automated tests of any arbitrary type. Full details on AQUA aren’t really necessary for the purposes of this post, because I haven’t finished it and I’m not interested in selling it as a product.
What is important for this conversation is that having AQUA at my disposal allows me to gather lots of data on the State of Accessibility across the web. Like similar tools, AQUA has a spider which crawls web pages. It can be configured to restrict a crawl to a single hostname, or it can do what I call a “free crawl”, which follows any link it finds. In AQUA’s case, it takes a copy of the DOM of each page it finds, turns the DOM information into a multidimensional array, and stores it in the database for analysis. After it has performed the analysis on the page, it dumps the DOM information as a form of garbage collection. For now, I’ve turned that garbage collection feature off so I can do unit testing on the test engine, making sure the tool is getting the right results, etc.
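To make the crawl scoping and DOM storage ideas above a bit more concrete, here is a minimal Python sketch. It is not AQUA’s code: the class `DomToNested`, the functions `extract_links` and `in_scope`, the nested dict/list layout, and the example URLs are all hypothetical names and choices made up for illustration, and it parses a string rather than fetching real pages.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class DomToNested(HTMLParser):
    """Turn HTML into nested dicts and lists (a hypothetical storage shape)."""

    # Elements that never have a closing tag, so they must not stay open.
    VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
            "link", "meta", "source", "track", "wbr"}

    def __init__(self):
        super().__init__()
        self.root = {"tag": "#document", "attrs": {}, "children": []}
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "attrs": dict(attrs), "children": []}
        self._stack[-1]["children"].append(node)
        if tag not in self.VOID:
            self._stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i]["tag"] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self._stack[-1]["children"].append(text)


def extract_links(node, base_url):
    """Collect absolute URLs from every <a href> in the nested structure."""
    links = []
    if isinstance(node, dict):
        href = node["attrs"].get("href")
        if node["tag"] == "a" and href:
            links.append(urljoin(base_url, href))
        for child in node["children"]:
            links.extend(extract_links(child, base_url))
    return links


def in_scope(url, start_url, free_crawl=False):
    """A free crawl follows any link; otherwise stay on the starting hostname."""
    if free_crawl:
        return True
    return urlparse(url).hostname == urlparse(start_url).hostname


if __name__ == "__main__":
    page = ('<html><body><p>Hi <a href="/about">About</a> '
            '<a href="http://other.example/x">Elsewhere</a>'
            '<img src="a.png"></p></body></html>')
    start = "http://example.com/"
    parser = DomToNested()
    parser.feed(page)
    found = extract_links(parser.root, start)
    print([u for u in found if in_scope(u, start)])
    print([u for u in found if in_scope(u, start, free_crawl=True)])
```

The nested structure could be serialized (as JSON, for instance) and stored per page, then deleted once the tests have run, which is roughly the garbage collection step described above.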
What I recently realized is that storing this information affords me new abilities, namely the ability to mine the data for information of interest to the accessibility community. Over the next several weeks (and possibly months) I’ll be making a series of posts based on this data. In the interest of openness, I’ll share the raw data so others can see how I reached my conclusions and form their own opinions.
Finally, as a slight disclaimer: It’s not my intention to claim that any of the data I’ll present is definitive. The web is a huge place. It is estimated that Google has indexed 25,000,000,000 pages on the web. So, for hardcore statistics nerds, the information I’ll present is not statistically significant.