Search engines index Web addresses without a conscience or concern about how other technologies interact with the Web addresses they index. As a result, they tend to index both the HTTP and HTTPS versions of a Web page. Indexing both versions causes duplicate content and security concerns, yet it can be easily avoided.
Search engines dislike duplicate content, but they fail to do anything about it on their side of the equation even though they easily could. They expect all business owners, Web designers and anyone else to know how to keep the search engine spiders out of areas they shouldn’t be in. However, search engine spiders are like “Curious George” and always cause some kind of problem.
We can look at almost any forum focused upon Web design or search engine optimization (SEO) and we’re bound to find at least one post about duplicate content. These posts range from duplicate content caused by uneducated programmers to that caused by the search engines themselves.
One may ask, “What is duplicate content?” The answer varies depending upon whom you ask. However, the accurate answer is any content that is significantly duplicated on other pages within a single Web site. The question then evolves to, “When all the Web pages of a Web site have the same navigation, header and footer, how does that affect the duplicate content equation?” No one has definitively answered that question; however, some have claimed that duplicate content exists when more than 51% of the text content is the same on one or more other pages.
Fortunately for this thesis, I’m not here to answer that question. However, I am here to say that when search engines err in indexing both the HTTP and HTTPS versions of a Web page, they cause duplicate content. Regrettably, the search engines will turn around and state that if the Web site had blocked their access, they couldn’t have indexed those pages and therefore duplicate content wouldn’t exist.
In examining the security issues involved in search engines indexing the HTTPS version of a Web page, we easily find the search engines at fault. One might question how a security problem exists when this happens. Actually, it comes down to more than just a security problem; it also causes a merchant to lose a potential sale.
Current browser technology examines the Web address assigned to the secure site certificate and compares it to the requested Web address. If a mismatch exists, the browser will not show the requested page; rather, it shows a notice that recommends the person not proceed any further. This security problem then causes the merchant to lose a potential sale because the shopper’s concerns for security elevate to the level of “flight” versus conducting business with the merchant.
Far too often search engines ignore techniques available to them to eliminate duplicate content for the simple fact that search engines thrive on content. In other words, the more space they can fill up in their databases the better they think they are doing their jobs.
Search engines should take it upon themselves to not present or even index the HTTPS version of a Web site. Unfortunately their sense of social responsibility seems to not exist.
Over the years many people have questioned how they can fix the problem. The solutions provided, at least the ones I can find, focus upon using a .htaccess file to direct search engine spiders to a robots_ssl.txt file with instructions to not visit pages within the HTTPS environment. These solutions presume that Apache is the Web server software being used.
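For readers who have not seen it, the robots_ssl.txt approach boils down to answering requests for the robots file differently depending on the port. The following is only a rough sketch of that selection logic, written in Python rather than as Apache rewrite rules. The function name and the contents of both robots files are assumptions made for illustration; the published solutions may word their files differently.

```python
# Assumed contents: the normal robots file allows everything,
# while the SSL variant asks all spiders to stay out entirely.
ROBOTS_HTTP = "User-agent: *\nDisallow:\n"
ROBOTS_HTTPS = "User-agent: *\nDisallow: /\n"


def robots_txt_for_port(server_port):
    """Return the robots file body to serve for the given server port.

    Hypothetical helper: port 443 is treated as the HTTPS environment,
    every other port as the regular HTTP environment.
    """
    if int(server_port) == 443:
        return ROBOTS_HTTPS
    return ROBOTS_HTTP
```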
A better solution exists that will work with any programming language and Web server software. I’ll simply explain the logic instead of providing a sample code base.
If Server Port is 443
Then Add <meta name="robots" content="noindex,nofollow"> to the <head> section of the Web page.
This simple method will prevent search engines from indexing the Web page and following any link found on the Web page in the secure environment.
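The port check described above can be sketched in a few lines. This is a minimal illustration in Python, not the code Merchant Metrix uses. The function names are hypothetical, and the tag shown is assumed to be the standard robots meta tag with noindex and nofollow, which matches the stated goal of blocking both indexing and link following.

```python
import html


def robots_meta_tag(server_port):
    """Return a noindex/nofollow meta tag for port 443, else an empty string.

    Accepts an int or a string, since many server environments
    (CGI, WSGI) expose the port as a string such as "443".
    """
    if int(server_port) == 443:
        return '<meta name="robots" content="noindex,nofollow">'
    return ""


def render_head(title, server_port):
    """Build a tiny <head> section, adding the robots tag only under HTTPS."""
    parts = ["<head>", "<title>" + html.escape(title) + "</title>"]
    tag = robots_meta_tag(server_port)
    if tag:
        parts.append(tag)
    parts.append("</head>")
    return "\n".join(parts)
```

In a real application the port would come from the server environment (for example the SERVER_PORT variable under CGI or WSGI) rather than being passed in by hand. The same two-line check translates directly to any other language or template system.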
As Web site owners, Web designers and search engine optimization experts we must take it upon ourselves to clean up the mess “Curious George” causes. As a shopping cart developer, Merchant Metrix has incorporated the above method to prevent duplicate content and most importantly instruct the search engines to not index or follow links in the HTTPS environment.
If search engines accepted their social responsibility, they could strip the “s” from the HTTPS and eliminate the duplicate content and security problems they have caused. Whether they accept this social responsibility or not … only time will tell. Until then, we must do our part to aid the search engines and provide them with “search engine friendly” Web sites to navigate and index.
About the Author:
Lee Roberts, CEO/Founder of Merchant Metrix, Inc, has been working in the industry since 1996. Roberts pioneered the “search engine friendly shopping cart” in 2000. Merchant Metrix, Inc was awarded The Journal Record’s (Oklahoma’s major business newspaper) Innovator of the Year for 2009.


