Though search engine indices may pull from as many as 5 billion documents, even more web content is “invisible” if web crawlers miss deep or dynamic content. Now Overture and Yahoo are out to reveal that invisible content.
Most search engine indices contain about 4 to 5 billion documents, but adding documents from the so-called “invisible web” would raise the number eligible for inclusion considerably higher, Chris Bolte, vice president of strategic alliances at Overture Services Inc., tells Internet Retailer. Overture and parent Yahoo Inc. this month launched a new service that aims to surface the web’s invisible content. Bolte says deep content on academic or government sites is one example of web content that is less easily discovered and indexed by web crawlers and is, therefore, “invisible,” but areas of commercial sites face the same challenge.
Catalogers, for example, might have dynamic content. “A cataloger would want all its products and the most current pricing for those products to go out to the broadest audience,” Bolte says. However, web crawlers face technical limitations in discovering and indexing dynamic content. If the algorithm changes or the product changes, a listing may simply be dropped. For a search engine to stay updated on dynamic content, “You have to build a feed between the content provider and the search engine to transmit that information into the search engine on a regular basis,” Bolte says.
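As a rough illustration of the kind of feed Bolte describes, the sketch below serializes a cataloger's product records, with current prices, into a small XML document that could be regenerated and sent to a search engine on a schedule. The element names (`feed`, `item`, `url`, `title`, `price`), the `build_feed` function and the sample catalog are hypothetical, chosen only for illustration; the article does not describe the format Overture or Yahoo actually accept.

```python
import xml.etree.ElementTree as ET


def build_feed(products):
    """Serialize catalog records into a simple XML feed (hypothetical schema)."""
    root = ET.Element("feed")
    for product in products:
        item = ET.SubElement(root, "item")
        ET.SubElement(item, "url").text = product["url"]
        ET.SubElement(item, "title").text = product["title"]
        # Format the price with two decimals so it reflects the current listing.
        ET.SubElement(item, "price").text = f"{product['price']:.2f}"
    return ET.tostring(root, encoding="unicode")


# Made-up sample catalog; a real one would come from the retailer's database.
catalog = [
    {"url": "https://example.com/p/1", "title": "Wool sweater", "price": 49.5},
    {"url": "https://example.com/p/2", "title": "Rain jacket", "price": 89.0},
]

feed_xml = build_feed(catalog)
print(feed_xml)
```

Regenerating the feed whenever a product or price changes, rather than waiting for a crawler to revisit the page, is the "regular basis" part of the arrangement Bolte outlines.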
Beyond the limitations dynamic content poses to web crawlers, web content also may be “invisible” if it comes from a complex site or is proprietary. Overture’s and Yahoo’s new Content Acquisition Program, which involves a cost-per-click component for commercial sites, seeks to make it easier for Yahoo crawlers to find such deep content with content submission guidelines and technology. “We’re essentially telling content providers exactly how to deliver information to us and establish a relationship so we can get access to proprietary, dynamic and complex content, and then ensure that it happens on a continuing basis,” Bolte says.