
Faceted Search and SEO: Effectively Leveraging Faceted Search for SEO

20 Nov 2015 Mike Levin

in Retail, SEO

Faceted search used to be pretty rare. Now, it seems to be everywhere! Mishandling the search visibility of sites with this feature is one of the most common problems on e-commerce sites today. We encounter the same issues over and over here at Flying Point Digital, and from an SEO perspective, the fix is not simply "make better category pages". Although that is an important part of the fix, it's only half the story.

Thanks, Captain Obvious

There is enough oversight and misconception about what's going on with faceted search, and about how good this site navigation technique might be for SEO, that it's time we wrote an article. It's the same age-old, accidental spider-trap story, but with a twist. Or, should we say, with new dimensions. For those who have been in the SEO industry a while, that's probably enough information to both infer and fix the problem. Faceted search creates a spider-trap as large as every combination of possible facet selections, so long as your navigation is "search friendly".

Problem defined. Solutions implicit. You salty old dogs of the SEO industry can go away. For those just hearing about or dealing with this for the first time, read on. We will first plunge you into a bit of history, then describe the bad situation that often exists on such sites, and finally lay out a few broad strokes of one possible solution.

 

Faceted Search Jewel Analogy

 

Million-Product Catalogs

Wherever there's e-commerce with big catalogs of millions, or even just tens-of-thousands of products, there's structured data like price and color and size to describe it all. And the term chosen to describe the user interfaces built around searching and filtering using such product-describers is facets.

Faceted search is just all the filters you can click on to refine your search, beyond plugging in keywords or drilling down through navigation. There are formal definitions of the term, along with an implied order-insensitivity (which drill-down navigation lacks). Drilling down through order-sensitive menus (like Web hyperlinks) implies a certain finality to your exploration. Everything you "find" is analogous to files on a hard drive or nodes in a tree. While it's possible, it's simply harder to create spider-traps with drill-down navigation. It's how the Web mostly works, and it's what made Google search-and-index such a brilliant and effective system. It's also what has given Google an unfair reputation for "not liking" dynamic sites.

Spider-Traps and Mixed Messages

As soon as a question mark is introduced to the URL, the site is considered "dynamic", and the site could go on forever. Think of a calendar webpage where you can always click a "next day" link. It's really that simple to create a spider-trap. And it's not the existence of the question mark that makes the site dynamic or bad or unreadable to Google in any way. It's that the question mark is present on the types of sites Google has to put aside at some point, and get on with the business of crawling sites that don't make things miserable. Or else, all the seemingly infinite resources of Google would be spent crawling that one simple infinite calendar on one little site.

Dynamic sites (or URLs) aren't inherently bad, as some people feel. What’s bad is how easy it is to make accidental spider-traps and never realize that you even have the problem. From Google's side, they're just getting onto the next site in some realistic way, so they don't spend all their time spinning their wheels. Google has a lot more willingness these days to intrepidly dive into spider-traps, pull back a few million pages, and see if they can't make any sense of it.

In this article, we're focusing on one particular type of dynamic-URL spider-trap: the one generated by the navigational scheme often called faceted search. Fun word, facets. Makes you think of the cut faces of a jewel. I guess that serves the e-commerce biz just fine, and it's easier than saying arbitrarily parameterized or attributed or multidimensional or field-filtered search. Not all parameterized searches are faceted. Facets tend to allow themselves to go in different orders and in seemingly infinite permutations, which is both what makes them "facets" and what makes them such a particularly nasty spider-trap.
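To get a feel for how quickly those permutations pile up, here is a minimal sketch in Python. It assumes single-select facets (the shopper picks at most one value per facet) and a site where every click order produces its own URL. The function name and the facet sizes are made up for illustration and do not come from any particular platform.

```python
from itertools import combinations
from math import factorial, prod


def count_facet_urls(option_counts):
    """Return (distinct_pages, ordered_urls) for single-select facets.

    option_counts holds the number of values each facet offers,
    e.g. [10, 8, 5] for a hypothetical color, size and price-band filter.
    """
    distinct = 0
    ordered = 0
    for k in range(len(option_counts) + 1):
        # Every way of choosing which k facets are switched on.
        for chosen in combinations(option_counts, k):
            pages = prod(chosen)  # one value picked from each chosen facet
            distinct += pages
            # If click order leaks into the URL, each ordering is a new URL.
            ordered += pages * factorial(k)
    return distinct, ordered


distinct, ordered = count_facet_urls([10, 8, 5])
print(distinct, ordered)  # 594 2764
```

Three small facets already turn 594 genuinely different result pages into 2,764 crawlable URLs, and that is before multi-select, sorting or pagination parameters enter the picture.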

Endeca and Lucene 

We are noticing problems with faceted search sites more often, because it’s easier now to make sites that use it. This navigation technique used to be considerably rarer because of the cost and expertise required to set it up, and the beefy server requirements of delivering this feature (with accurate data) at scale. That's changing. No matter what your data is locked-up in, some product like Endeca (now, from Oracle) or Lucene (an Apache project) can sweep through it and build the database and indexes required to connect to the site-building components that layer faceted search into a site.

Endeca has long been the dominant enterprise-class commercial software to offer faceted search—which is why you hear their name invoked so much when this topic arises. You'll pay for that confidence, of course. But if you have your own confidence, and a strong developer team, there's the non-proprietary (free and open source) Lucene software stack alternative.

Lucene, as I'm told (I am not an experienced developer with this particular software stack), does almost everything Endeca does, even with enterprise-level performance, but for free. As with Endeca, there's really a whole grab-bag of individual products that work together in a sort of ecosystem. The top of that ecosystem is the Apache Software Foundation (equivalent of the company), then the Lucene project (equivalent of product) and after that, the part that makes the actual Web UI we're talking about: either Solr or Elasticsearch.

So all this Lucene and Endeca stuff is admittedly the kind of IT infrastructure that "The Cloud" is supposed to keep you from having to deal with, and it has a bit of an old-school DIY feel to it. If you're a smaller company, or simply don't want implementation pains, and want to be using the most agreed-upon best practices out-of-the-box and still be considered enterprise-class, there's always Demandware, or a host of other products that fill the niches between Endeca/Lucene at one extreme and a self-hosted instance of WooCommerce on WordPress at the other.

Plus, all the really big tech players, such as IBM, Microsoft and SAP, offer something to solve the Web faceted search problem too. Endeca and Lucene are the names that come up over and over when you're an SEO tackling these problems, so this is an easy way to frame this faceted search discussion, but keep in mind there really are others on each end of the spectrum, and countless more in-between. If for example you want that cloud-ease of Demandware, but with the option of taking it all in house someday to start layering in extreme customization for competitive advantage, there's Hybris at the high-end, and Magento at the low-end.

Two Extreme Scenarios 

But at the end of the day, all these infrastructures have some form of faceted search and have to deal with the same set of problems. Generally, faceted search falls into one of two categories. All the millions of potential pages being "made possible" are either:

  1. Completely invisible to search due to one reason or another
  2. Visible to search, but forming a site that Googlebot will never finish crawling and exploring

In the first scenario, faceted search sites are invisible to search for one of two reasons. Either the user interface is built with old-fashioned CGI-form elements and requires a form submit or JavaScript for the search to execute, or the site is actually crawlable, but the site owners have "turned off" Google's ability to crawl or index it through robots.txt or some other mechanism, usually because they have suffered the pains of situation number two.

In situation number two, the entire faceted search site and all the potential pages it can generate are perfectly crawlable by Google. However, the pages are never-ending, and 99% of that never-ending crawl is duplicate content. In other words, it's a spider-trap. Google sees your whole site, but because of the ridiculousness of the task you set before it, it will give up and move on to the next site.

Seldom thought about, but critically important, is that this spider-trap will hurt your search rankings by diluting or completely obscuring the "core set" of important pages your site should be generating. Those pages could be positioned in easy-to-discover click-paths (main and secondary navigation) and tweaked to align with keywords that are known to be searched on and known to convert.

Think in Terms of Actual Real-Life Trees 

So, the trick is to light up that core set of pages, like the main-trunk and branches of a tree. These perhaps represent the first two selected facets or some other mechanism for "defining the core set of pages" that is coordinated with what your keyword research is going for. Trunk and branches are core. They are your master set of canonical non-duplicate pages—whether or not they were actually produced by choosing faceted search parameters. (Your core pages might well be comprised of these).

Even if your site can generate millions more pages than this, this "core" of anywhere from 100 to 10,000 pages can be your master canonical set. All the other millions of mostly-duplicate variations could possess canonical tags back to the closest-matching URL from the core set. Yep, there might be some custom development work here if your e-commerce platform does not support such out-of-the-box tricks.
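As a rough sketch of that custom development work, the Python below maps any facet URL back to a core URL by throwing away every parameter that is not one of the core facets. Treating "closest-matching" as "keep only the core facets" is our simplifying assumption, and the facet names in CORE_FACETS, the shop.example domain and both function names are hypothetical.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical: the facets your keyword research says define the core pages.
CORE_FACETS = ("brand", "color")


def canonical_url(url):
    """Drop every non-core parameter and put the rest in a fixed order."""
    parts = urlsplit(url)
    core = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query)
        if key in CORE_FACETS
    )
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(core), "")
    )


def canonical_tag(url):
    """Build the <link> element to place in the page's <head>."""
    href = escape(canonical_url(url), quote=True)
    return '<link rel="canonical" href="%s">' % href


deep = "https://shop.example/shoes?size=9&color=red&brand=acme&sort=price"
print(canonical_url(deep))
print(canonical_tag(deep))
```

Every deep permutation of size, sort order and so on now points at the same brand-and-color page, which is the kind of signal the core set is meant to send.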

And that's just one of the approaches to getting these spider-traps under control: let everything index... let the spider-trap continue to exist... but be clear to Google about what's going on, and how any crawling past the eventually-obvious core/important set is over-the-top and perhaps unnecessary work. A Google search with a site-modifier should come back with approximately the number of canonical core pages you are now clearly advertising, and NOT the rest, which you are admitting are low-priority permutations.

The best solutions are always ones where only a finite number of pages can be generated by a site, and Google can spin through them all in a few days. Try running Screaming Frog against a site (with plenty of memory). If it never finishes, you might have a spider-trap.

It's like counting the leaves on any given tree. It might be difficult, but you will finish. So too will Screaming Frog finish crawling a properly finite site.
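The same "does it ever finish?" test can be sketched in a few lines of Python. This is only an illustration of the idea and not how Screaming Frog or Googlebot actually work: the crawl function, the page budget and the two toy sites (a small finite one and the endless "next day" calendar from earlier) are all invented, and links come from a plain function instead of real HTTP requests.

```python
from collections import deque


def crawl(start, get_links, max_pages=1000):
    """Breadth-first crawl that gives up once a page budget is exceeded.

    Returns (finished, pages_seen). finished is False when the budget ran
    out before the site did, which is the warning sign of a spider-trap.
    """
    seen = {start}
    queue = deque([start])
    while queue:
        if len(seen) > max_pages:
            return False, len(seen)
        url = queue.popleft()
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return True, len(seen)


# A finite toy site: three pages that only link to each other.
finite_site = {"/": ["/a", "/b"], "/a": ["/"], "/b": ["/a"]}
print(crawl("/", lambda url: finite_site.get(url, [])))


# The infinite calendar: every page links to the next day, forever.
def next_day(url):
    day = int(url.split("=")[1])
    return ["/calendar?day=%d" % (day + 1)]


print(crawl("/calendar?day=0", next_day))
```

The finite site reports that it finished after three pages, while the calendar burns through the whole budget and reports that it did not.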

Order Matters—Cutting Down the Permutations 

Certain hybridization of facets can help get the situation under control—such as making certain facets only able to activate in combination with certain other facets to reflect and enforce the data-relationship constraints. You might consider this a combination of the much more finite drill-down navigation scheme with search facets. (Facets are presented specifically at certain drill-down levels). Drill-down navigation tends to enforce a certain order to your query string parameters (obfuscated as folders or not).

You can also construct your URLs carefully, with a certain enforced order to the facets, so that you're only dealing with combinations instead of permutations. (Do a search for "combinations vs. permutations.") Specifically, if you select facet A and then facet B in one case, but then facet B and then facet A in another, the URLs are going to be different, but the resulting page the same. This can be fixed by just alphabetizing or using some pre-set order for how the parameters are to appear on the URL.
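A minimal sketch of that fix in Python, using only the standard library. It assumes the facets live in the query string; the function name and the shop.example URLs are invented for the example, and alphabetical order is just one possible pre-set order.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize_facet_url(url):
    """Rewrite a URL so its query parameters always appear in sorted order."""
    parts = urlsplit(url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    params.sort()  # alphabetical by parameter name, then by value
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(params), parts.fragment)
    )


a_then_b = "https://shop.example/shoes?color=red&size=9"
b_then_a = "https://shop.example/shoes?size=9&color=red"
print(normalize_facet_url(a_then_b))
print(normalize_facet_url(b_then_a))
```

Both click orders come out as the same URL, so the site only ever links to (or redirects to) one address per combination of facets.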

And finally, remembering that we're sticking with the tree metaphor for site hierarchy, the purpose of a tree is to spread out its branches, twigs and leaves to create the surface area that captures sunlight most efficiently. Evolution has shaped trees so that they do not continue growing out past the point where they capture light most efficiently.

Artistically Shaping a Site 

As stated at the opening of this article, most faceted search sites either make their site invisible to search or an impossible crawling chore. The real answer is somewhere in the middle: an artistic shaping. There are many ways to pull this off, from making adjustments to your robots.txt file to tweaking your Google Search Console (formerly Webmaster Tools) settings, to changing the meta tags in your page source.

The solutions are varied, and all should be directed by an overarching keyword targeting strategy, and based on what is supported by your technology platform and implementable by your team. Unlike natural trees whose maximum shape is defined by the constraints of nature, faceted websites can grow uninhibited, and you may never know it—except for never performing well in Google.
