Thin content that Google clearly does not value (indexed, but with few impressions and clicks) is an SEO problem, not only for that page but for other pages on the site. The idea is that thin content detracts from the higher-quality content that Google does prize.
The problem of thin content has been around since the Panda update (2011), yet we still constantly see this kind of thing on client websites.
- Expanding Definition of Thin Content
- The Problem with Page Size
- How a Human Judges Thin Content
- Machine-based Content Relevance
- Link effects on Thin Content
- What to Do with Thin Content
- Improving Thin Content
- Removing Thin Content
- Google Search Console Removals
- Thin Content Checklist
- How to Remove Thin Content
- Find Pages Indexed on Google
Expanding Definition of Thin Content
Initially, Google targeted thin content as:
- Duplicate content,
- Content with low unique ratio, and
- Content with high ad ratio
However, that list grew as various other ways of measuring thin content emerged. Thin content, like all other Google terms, is ultimately a human concept, but the algorithms approximate this concept in various ways.
Thin = Low Quality = Irrelevant
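One way to put a number on the "low unique ratio" signal listed above is to compare word shingles (overlapping word n-grams) between a page and other pages. This is a minimal sketch under stated assumptions: the function names, the shingle size of 3, and the definition of the ratio itself are illustrative choices, not Google's method.

```python
import re


def shingles(text, size=3):
    """Return the set of overlapping word n-grams (shingles) in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def unique_ratio(page_text, other_texts, size=3):
    """Share of a page's shingles that appear in none of the other texts.

    Assumed definition: 1.0 means nothing is shared, 0.0 means everything
    is duplicated elsewhere (or the page is too short to form a shingle).
    """
    own = shingles(page_text, size)
    if not own:
        return 0.0
    seen_elsewhere = set()
    for other in other_texts:
        seen_elsewhere |= shingles(other, size)
    return len(own - seen_elsewhere) / len(own)
```

A page whose ratio sits near zero against the rest of the site (or against a known source) is a candidate for the duplicate-content bucket.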
The Problem with Page Size
Page size also used to be considered a factor in thin content: the longer the page, the less thin it was. But we know that is not true, as more garbage doesn't make garbage less garbage. We've seen extremely thin pages do much better than much longer pages on a particular topic. It is painful to see WordPress tag and category pages rank much higher than the real content pages they link to.
Pages should be the right size for the content being expressed: nothing massive, and nothing too brief (unless, again, the content calls for that). This is a matter of judgment, and good writing and journalistic skills should come to the fore in these decisions.
Still, one can look at a given page size in word count and see some evidence of thinness of content.
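As a rough illustration of using word count as evidence, the sketch below strips markup from a page and flags pages under a threshold. The 300-word cutoff and the function names are assumptions for the example only; the right length depends on the content, as noted above.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping script and style blocks."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def word_count(html):
    """Count the words in the visible text of an HTML document."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return len(re.findall(r"\w+", " ".join(parser.parts)))


MIN_WORDS = 300  # assumed cutoff, not a Google rule


def flag_short_pages(pages, min_words=MIN_WORDS):
    """pages maps url -> html; return (url, count) pairs below the cutoff, shortest first."""
    counts = ((url, word_count(html)) for url, html in pages.items())
    return sorted((item for item in counts if item[1] < min_words), key=lambda item: item[1])
```

The output is only a shortlist for human review: a short page that fully answers its question should stay.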
How a Human Judges Thin Content
Human behavior can indicate low-quality, thin, irrelevant content in a number of ways. (Note that such indications will always be an approximation until the human is actually asked.)
- Click-through rates on search results: higher CTR = higher quality
- Time on page: more = higher quality
- Bounce rate (visitors to one page only on site): lower = higher quality
- Repeat visits to page: more = higher quality
- Social signals to a page: more linking and sharing = higher quality
- Blogging behavior: more linking from blog posts = higher quality
- Expert sharing: more experts (in a particular niche) who share the links = higher quality
- Having relevant links in the page that users click and also return from: more = higher quality
- This is off-site linking as well as on-site linking and adjacent content
- This directly affects the bounce rate as well (with on-site relevant links)
As we can see, simulating the behavior of many humans (especially human experts) would be the holy grail of getting Google to see content as relevant, and therefore not thin.
Machine-based Content Relevance
For the machine to find relevance, the current best approach to understanding the Google algorithm is a multiple linear regression model. That means there is no single factor, or handful of factors, to focus on; rather, the interactions between the factors are increasingly complex. That said, here are some of the factors involved:
- Exact match of content to search
- This can be an exact match in the title or URL
- This can be a snippet inside an article that exactly matches the search term
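The exact-match factor above can be checked mechanically. The following is a minimal sketch, assuming a simple normalization (lowercase, punctuation and hyphens treated as spaces) and whole-phrase matching; the function names are hypothetical, and real search matching is far more involved.

```python
import re
from urllib.parse import urlparse


def normalize(text):
    """Lowercase and collapse every run of non-alphanumerics to one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def exact_match_locations(query, title, url, body):
    """Return which of title, url, body contain the query as a whole phrase."""
    phrase = normalize(query)
    if not phrase:
        return []
    needle = f" {phrase} "
    fields = {
        "title": title,
        "url": urlparse(url).path,  # only the path, so the domain is ignored
        "body": body,
    }
    return [name for name, value in fields.items() if needle in f" {normalize(value)} "]
```

For example, the query "thin content" matches a URL path such as `/blog/thin-content-guide/` because hyphens and slashes are normalized to spaces.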
As with a deep understanding of human needs and intentions, a deep understanding of the content matching those needs and intentions is the basis for machine-based relevance.
If link text is close to the search term, that helps better evaluate the quality of the content, though of course this can be manipulated, and is a new kind of keyword stuffing.
What to Do with Thin Content
There are two ways of dealing with thin content:
- Remove it, or
- Improve it.
There is no middle ground.
Improving Thin Content
Improving thin content takes some time and expertise. Thankfully, one can learn from competitors: viewing how the top-ranking content is structured can go some way toward a plan for improving one's own content.
Sometimes it is just a matter of rewriting or expanding the subheads (H2s, H3s) and expanding the content in some ways. Sometimes it is a matter of adding more or better links on the page, and restructuring on-page navigation.
Including more or better images and marking them up properly can also improve thin content. Many people design infographics, which become anchor points for content.
Removing Thin Content
Content is removed from the Google Index when Google attempts to re-index a page and encounters a 410 code. This means that Google has to return to the content. In other words, forbidding a search engine from crawling a page or pages won't help here, so robots.txt is not the way to deal with the Google Index. If Google is told it cannot crawl a page, then it cannot see the 410 code or a noindex directive on that page.
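Because a page blocked in robots.txt is never re-crawled, it can help to check removal candidates against the robots.txt rules first. The sketch below uses Python's standard `urllib.robotparser`; the function name and example URLs are hypothetical, and Python's parser does not reproduce every detail of Googlebot's own rule matching, so treat the result as a first pass.

```python
from urllib.robotparser import RobotFileParser


def blocked_for_googlebot(robots_txt, urls, user_agent="Googlebot"):
    """Return the URLs that this robots.txt text forbids the crawler to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse text directly, no network call
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    robots = "User-agent: *\nDisallow: /tag/\n"
    candidates = [
        "https://example.com/tag/seo/",
        "https://example.com/post/thin-content/",
    ]
    # Any URL printed here must be unblocked before Google can see its 410.
    print(blocked_for_googlebot(robots, candidates))
```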
Google Search Console Removals
Google Search Console has a Removals section that includes temporary removals and outdated content. A Temporary Removals submission does not remove the content from the index, only from SERPs. An Outdated Content submission will get the search engine to re-crawl the page(s), which is what is desired. Google can then see the noindex (if the content is still there) or the 410 code (if the content is no longer present).
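Before submitting a URL, it is worth confirming that a re-crawl would actually find one of these signals. This is a minimal sketch that inspects an already-fetched response; the function name and return labels are hypothetical, and it checks only the status code, the `X-Robots-Tag` header, and robots meta tags.

```python
from html.parser import HTMLParser


class _RobotsMetaParser(HTMLParser):
    """Collect the content of <meta name="robots"> and <meta name="googlebot"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() in ("robots", "googlebot"):
            self.directives.append(attributes.get("content", "").lower())


def removal_signal(status, headers, html=""):
    """Return a label for the first removal signal found, or None if there is none."""
    if status in (404, 410):
        return f"status {status}"
    for name, value in headers.items():
        if name.lower() == "x-robots-tag" and "noindex" in value.lower():
            return "noindex header"
    parser = _RobotsMetaParser()
    parser.feed(html)
    if any("noindex" in directive for directive in parser.directives):
        return "noindex meta tag"
    return None
```

A result of `None` means the page would simply be re-indexed as is, so the removal request would not stick.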
Thin Content Checklist
How to identify thin content:
- Duplicate content (on own site or elsewhere)
- Not very original or unique
- Too short (word count)
- Not meaningful, interesting, or important (human judgement, use of personas)
- Doesn't have many or any outbound links (count links, verify links, quality/relevance/value assessment)
- Doesn't have many or any on-site links (count links, verify links, quality/relevance/value assessment, bounce rate)
- Has a title that is too long or too short (character length)
- Doesn't have enough subheads, or the subheads are not descriptive (count and assess H2s, H3s)
- Doesn't have experts sharing it (inbound link count and source quality)
- Doesn't have anyone sharing it (inbound link count)
- Doesn't answer any questions (snippet optimization)
- Doesn't keep people engaged and interested (time-on-page, bounce rate)
- Doesn't make people want to come back (repeat visitors)
- Google provides few impressions for the search terms that matter (GSC Impressions)
- Click-through rate on Google search is low for the search terms that matter (GSC CTR)
- Doesn't make people want to click on the search result (title and description)
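The measurable items in the checklist above can be turned into a simple set of flags per page. This is a sketch under loud assumptions: the `PageMetrics` fields, the check labels, and every threshold are illustrative and hypothetical, not values published by Google. The judgment-based items (originality, meaningfulness) still need a human.

```python
from dataclasses import dataclass


@dataclass
class PageMetrics:
    url: str
    word_count: int
    title_length: int      # characters
    subhead_count: int     # H2s and H3s
    outbound_links: int
    internal_links: int
    inbound_links: int
    impressions: int       # from GSC
    clicks: int            # from GSC


# Every threshold below is an illustrative assumption, not a Google rule.
CHECKS = [
    ("too short", lambda p: p.word_count < 300),
    ("title length out of range", lambda p: not 30 <= p.title_length <= 60),
    ("too few subheads", lambda p: p.subhead_count < 2),
    ("no outbound links", lambda p: p.outbound_links == 0),
    ("no on-site links", lambda p: p.internal_links == 0),
    ("no inbound links", lambda p: p.inbound_links == 0),
    ("few impressions", lambda p: p.impressions < 100),
    ("low CTR", lambda p: p.impressions > 0 and p.clicks / p.impressions < 0.01),
]


def thin_content_flags(page):
    """Return the labels of all checks that the page fails."""
    return [label for label, failed in CHECKS if failed(page)]
```

Pages with many flags go to the top of the remove-or-improve list; pages with one or two are usually candidates for improvement.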
How to Remove Thin Content
The easiest approach is to delete the content (or unpublish it) and let Google naturally remove it during its crawling process. When revisiting, Googlebot will see a 404 and after a while remove that page from its index. This is what Google recommends to permanently remove content.
However, if one wants to prod Googlebot to crawl (and remove) faster, then flagging pages as having outdated content or temporarily removing them (which blocks URLs from SERPs as well as clearing the snippet and cached version) seems to be a good approach. Submit to the Google Search Console Removals Outdated Content form. Make sure to have a clear and thorough understanding of the GSC Removals Tool.
Realize however that Google has made the tool hard to use for bulk removal and only allows a single page or pattern to be submitted at a time. The reasoning is that this is a tool for exceptional cases.
By using the removal tool, we can manually remove the cache and SERP snippets. The de-indexing happens automatically. If a page is removed from cache and SERPs, the re-indexing should happen faster, and therefore the de-indexing should happen faster too (regardless of what Google is saying).
If one wants to try and do bulk removal there is a Google Search Console Bulk URL Removal tool.
Find Pages Indexed on Google
The above works great as long as one knows which pages are indexed on Google. That is not so simple: one cannot rely on the Google Search Console Performance tab, as it only shows pages that received impressions and clicks. The Coverage section is more reliable, and can have eye-popping numbers. Even with a WordPress installation of 1,000 posts, inflation and error lead to nearly 5,000 valid and 1,200 excluded pages. These come from category, tag, archive, and various pagination issues, but the number is astonishing. Even looking at 600 pages returning impressions, there are nearly 9 times more indexed pages than desired. We need to get these pages listed and in a database to perform the de-indexing systematically.
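As a starting point for that database, the sketch below loads a list of indexed URLs into SQLite, labels each one by URL pattern, and marks which ones are actually wanted. The table layout, function names, and the regular expressions (which assume default WordPress permalink paths such as `/tag/`, `/category/`, and `/page/N/`) are all assumptions to adapt; the list of indexed URLs would come from whatever export is available.

```python
import re
import sqlite3
from urllib.parse import urlparse

# Assumed default WordPress URL patterns; adjust to the site's permalink settings.
PATTERNS = [
    ("pagination", re.compile(r"/page/\d+/?$")),
    ("tag", re.compile(r"^/tag/")),
    ("category", re.compile(r"^/category/")),
    ("archive", re.compile(r"^/\d{4}/(\d{2}/)?$")),
]


def classify(url):
    """Label a URL by the first matching pattern, or 'other'."""
    path = urlparse(url).path
    for label, pattern in PATTERNS:
        if pattern.search(path):
            return label
    return "other"


def build_inventory(indexed_urls, wanted_urls):
    """Store every indexed URL with its kind and whether it should stay indexed."""
    wanted = set(wanted_urls)
    conn = sqlite3.connect(":memory:")  # use a file path to keep the data
    conn.execute("CREATE TABLE indexed (url TEXT PRIMARY KEY, kind TEXT, wanted INTEGER)")
    conn.executemany(
        "INSERT OR IGNORE INTO indexed VALUES (?, ?, ?)",
        [(url, classify(url), int(url in wanted)) for url in indexed_urls],
    )
    conn.commit()
    return conn


def removal_candidates(conn):
    """Return (url, kind) for every indexed URL that is not wanted."""
    rows = conn.execute("SELECT url, kind FROM indexed WHERE wanted = 0 ORDER BY kind, url")
    return rows.fetchall()
```

Grouping candidates by kind makes it easier to fix a whole class at once (for example, setting all tag pages to noindex) instead of submitting URLs one by one.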