Thanks for the comment. No exact conclusions were reached on what exactly to do to get there, but I do think some important conclusions were drawn from the analysis:
Old Web Definitions are exploitable, and the algorithm used to extract them is very basic.
Google Answer Boxes are appearing more and more in the SERPs.
Google does A/B testing on Google Answer Boxes to understand whether an Answer Box is relevant or not.
New Web Extractions are pretty accurate and are smartly extracted from websites.
The number of Google Answer Boxes is increasing.
The types of widgets were shown - I am sure a lot of people did not know about Video Widgets, for example.
Google Answer Boxes are extracted only from content in the top 10 organic SERPs.
The higher the number of referring domains, the higher the chance to be in an Answer Box.
I think the conclusions are pretty important if you consider how the SERPs are changing and how this may affect users.
@victorpan almost everyone dumps everything they have on their site into the Google Index, while only a small percentage of the indexed pages drives traffic to the site. The 80/20 rule, mostly. Why pollute the Google Index in the end? It might slap you at some point.
low quality content
duplicate content
content that should not be found in the SERPs
content that is scraped or aggregated programmatically
I do not see value in having this kind of content in the Google Index, especially since content updates from Google might devalue the entire site if it has too much of this kind of content indexed.
Google does not read the text in the image for the search engine. As shown in the test with uploading the word GOOGLE, it matched something like "eagle eye solutions". They do a visual match between the uploaded image and the images in their index and suggest based on the most visually similar image and the concept already assigned to it.
What is interesting, though, is that in the test with a scanned PDF it did OCR on the scanned pages and the text there was indexed.
This, plus all the others (Google Keep etc.), shows that they have the ability to do it quite well. It is only a matter of time until they do it for all images. And when that moment comes, the ones who prepared will be "grateful" for doing it.
It is probably a matter of resources for now: scanning all the images in the index would take away some resources. The same way the visual crawler does not always visit your site, but only when it makes sense for them.
I think that in some situations, when they have a red flag regarding a site or the images on it, they will try to read the text in the image to clear up some of the issues.
@amabaie it is quite simple for Google to do it automatically. Practically, I think they have links classified as Unnatural / Suspect / OK. The Suspect ones are the ones they cannot automatically say are unnatural or OK; they have a problem deciding on those.
When you submit the disavow, you send your own opinion to Google. They match your opinion to theirs and look at the commonalities. If the two don't overlap in the area where they are sure is "Unnatural", they will tell you to do more work on it.
If they overlap, say 98%, I think they will let you go.
If you send too many, they might consider that those Suspect links could be unnatural since you disavowed them all, or they will ignore the disavow as they could consider it invalid.
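The overlap idea above can be sketched as a simple set comparison. Everything here is invented for illustration - Google's internal "Unnatural" bucket is not observable, and the example links and the pass threshold are made up:

```python
# Hypothetical illustration of matching a disavow file against a
# classifier's "unnatural" bucket. All data here is invented.
def disavow_overlap(google_unnatural, disavowed):
    """Return the fraction of links Google is sure are unnatural
    that the site owner also disavowed."""
    if not google_unnatural:
        return 1.0  # nothing flagged, nothing to clean up
    matched = google_unnatural & disavowed
    return len(matched) / len(google_unnatural)

google_unnatural = {"spamdir.example/a", "linkfarm.example/b", "blognet.example/c"}
disavowed = {"spamdir.example/a", "linkfarm.example/b", "harmless.example/x"}

overlap = disavow_overlap(google_unnatural, disavowed)
print(f"{overlap:.0%} of the 'sure unnatural' links were disavowed")
```

With an overlap near 98% the reconsideration would presumably pass; here only 2 of the 3 flagged links were covered, so more cleanup work would be requested.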
Things changed a bit in the Answer Boxes since then. Hummingbird is a really complex bird. The article referred to here talks about Answer Boxes only, not about the entire Hummingbird. First they need to identify the entities, their properties and their relationships. How it works exactly only Google knows, and from a technical point of view multiple things can be applied. The title of the post on Inbound may be misleading, as the title of the article refers to Answer Boxes.
It depends. If you mean a normal, random image across the internet, then the answer is NO. But if you mean the images in your PDFs, some problematic images, Google Keep etc., then the answer is YES.
I do too. But I really doubt that no one at Google noticed this until now. The solution would be to treat punctuation signs differently. Every brand (for example Nike+) has the same problem. The algorithm should be updated for these kinds of issues.
Hi Jake, they did not lose them directly, but indirectly they possibly have. Look at the trends: half of the searches for "google+" are actually "google +", meaning people want to find the original "google+" but they don't; they are then suggested to search for "google + login", which may or may not give the results (it depends on local settings - anyway, it is not in first place, but 4th or 7th). The idea is that this is creating frustration among users. Again, Google's trends speak for themselves.
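One way to sketch the "treat punctuation differently" fix: re-join a stray punctuation token with the word before it when the result is a known brand name. The brand list and queries below are made-up examples, not anything Google actually does:

```python
# Hypothetical query normalizer that keeps punctuation when it forms
# a known brand name ("google+", "nike+"), instead of dropping it.
KNOWN_BRANDS = {"google+", "nike+", "c++"}

def normalize_query(query):
    tokens = query.lower().split()
    merged = []
    i = 0
    while i < len(tokens):
        # Re-join a token with a following lone punctuation mark
        # ("google +" -> "google+") when that forms a known brand.
        if i + 1 < len(tokens) and tokens[i] + tokens[i + 1] in KNOWN_BRANDS:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return " ".join(merged)

print(normalize_query("Google + login"))  # -> "google+ login"
print(normalize_query("nike + shoes"))    # -> "nike+ shoes"
```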
It does it automatically. Everything inside is automatic, in fact. Very helpful charts, I would say, and data for easily spotting the link building strategies used.
"data-mining thousands of hours human classification data"
That would be algorithm training, like face detection. You validate 100 photos and the system will recognize you. Validate 1,000 photos and the system will improve its detection ratio. Past a certain number of validated photos, no improvement in automatic detection will be found... or it will be something like a 0.0001 improvement.
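The diminishing-returns point above can be sketched numerically. The saturating curve below is an invented stand-in for a real detector's learning curve (the `ceiling` and `rate` parameters are arbitrary), not measured data:

```python
import math

def detection_rate(n_validated, ceiling=0.99, rate=0.01):
    """Toy saturating learning curve: accuracy approaches `ceiling`
    exponentially as more photos are validated."""
    return ceiling * (1 - math.exp(-rate * n_validated))

# Gains shrink fast: big jump from 100 to 1,000 validations,
# essentially nothing from 10,000 to 100,000.
for n in (100, 1000, 10000, 100000):
    print(n, round(detection_rate(n), 4))
```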
I am sure that each link has to be validated by multiple people to achieve a certain rating and be used in the training phase. Imagine 10 people voting on a link: one says it is good, another bad, and so on. The simplest way would be to take the average of their opinions... but this would be a long talk.
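The averaging idea above, sketched with made-up ratings. A real system would weight raters, handle ties and so on; the votes and the 0.0 decision threshold here are invented:

```python
# Toy aggregation of human votes on a link: +1 = natural, -1 = unnatural.
def classify_link(votes, threshold=0.0):
    score = sum(votes) / len(votes)
    label = "natural" if score > threshold else "unnatural"
    return label, score

votes = [+1, -1, +1, +1, -1, +1, +1, -1, +1, +1]  # 10 raters
label, score = classify_link(votes)
print(label, score)  # -> natural 0.4
```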
Thanks for the mention guys ;) As Krystian said, it is cognitiveSEO not SEO COGNITIVE :) No worries though, Google matches it :)
I did not post this on Inbound so I cannot change that. The title of the post is about the Answer Boxes.
Today we posted new research on 10k scraped keywords (relevant question keywords extracted from autocomplete using a specific method, in order to maximize the number of possible answer boxes returned) and we did a more in-depth analysis on it. http://cognitiveseo.com/blog/6266/decoding-google-answer-box-algorithm-serp-research-10-353-keywords/
A lot of findings are there: multiple types of OneBoxes found and multiple extraction methods. I suspect "Web Definitions" use a very basic methodology and are totally unrelated to "Web Extraction", as we named them there. I even found domains that expired 2 months ago ranking in the Web Definitions, and bought one to see what kind of traffic it receives from the answer box. All is detailed in the experiment and analysis.
My POV is that they help the user. In a large majority of cases they are correct. Sometimes they "go wild", but that is normal. It is an ever-evolving process. They simplify the way you search and extract info. Better UX.
The patent you mention is an interesting one, looking at Time to Query, Second Queries, etc.
Regarding your 2007 article, their definition at the moment was:
"Google's search technology finds many sources of specialized information. Those that are most relevant to your search are included at the top of your search results. Typical onebox results include news, stock quotes, weather and local websites related to your search."
This referred only to what I call Google Widgets / Web Definitions. Web Extraction is newer; it came in the last few years. It is something generated on the fly, and it surely uses the Knowledge Graph to some extent, because it needs to understand the query and then return the best answer for it (or not). Answer Boxes will probably be seen more and more in the SERPs, as I have seen in this latest analysis on 10k keywords. We ran it twice at an interval of 1 week and saw a slight increase in new keywords with answer boxes. Still, G does A/B testing on them: sometimes they appear, sometimes not. Based on that, I suspect they try to understand if it is relevant to the user or not.
my long 2c :)
Jumping in :) The case is written mostly for SEO pros who wear both hats :) In the end, it is all about the business model.
cool. looking forward ;)
There are 2 types of extractions we found:
"web definitions" - search for "what is a natural link"
"web extraction" - search for "what is negative seo"
The first one is very old. The second one is rather new (from the last few years).
The first one is very basic. The second one is very advanced and uses entities.
Thanks Victor and Umar ;)
I think it is only because of the visually similar pixel distribution. I don't think they read the image yet (only in particular cases).
What is interesting, though, is that they do it on scanned PDFs.
They match photos based on the objects, pixels and colors in them.
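A crude sketch of the pixel-and-colour matching described above: bucket each pixel into a coarse colour histogram and compare two histograms by intersection. The tiny "images" are hand-made pixel lists purely for illustration; this is nothing like what Google actually runs:

```python
# Toy visual similarity: coarse colour histogram + histogram intersection.
# Images are just lists of (r, g, b) tuples here.
def colour_histogram(pixels, buckets=4):
    hist = {}
    step = 256 // buckets
    for r, g, b in pixels:
        key = (r // step, g // step, b // step)
        hist[key] = hist.get(key, 0) + 1
    total = len(pixels)
    return {k: v / total for k, v in hist.items()}

def similarity(hist_a, hist_b):
    # Histogram intersection: 1.0 means an identical colour distribution.
    return sum(min(hist_a.get(k, 0), hist_b.get(k, 0))
               for k in set(hist_a) | set(hist_b))

mostly_red  = [(250, 10, 10)] * 9 + [(10, 10, 250)]
also_red    = [(240, 20, 20)] * 8 + [(20, 250, 20)] * 2
mostly_blue = [(10, 10, 250)] * 10

# The two reddish "images" score much higher than red vs blue.
print(similarity(colour_histogram(mostly_red), colour_histogram(also_red)))
print(similarity(colour_histogram(mostly_red), colour_histogram(mostly_blue)))
```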
Good content always sticks. Unfortunately no human can write good/great content every time.
But as with everything in life... "too much" can't do any good; it always hurts, one way or the other.
Content, as it is today, is created to be quickly consumed, and it is highly unlikely to be consumed again by the same individual.