When I go to Google News it brings up headlines and article text from all manner of different sites around the web. I was wondering if anyone knew the technique behind this? How does Google know where the headline is and where the article is on so many different sites, each with its own formatting? Do all the sites follow some kind of standard for Google, or did they all rewrite their code so that Google can read them better? Any clues to this functionality would be much appreciated.
I don’t think it’s based on article-name or title.
I think it’s something like this pseudo-code:
for( 3x )   // take up to three excerpts per page
{
  // a window of six words on either side of the search word
  int begin = indexOf(word) - 6 words;
  int end = indexOf(word) + 6 words;
  add( fullWebsite.substring( begin , end ) );
}
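For anyone who wants to try the idea, here is a minimal runnable sketch of the same guess in Python. It is only an illustration of the "six words either side" window from the pseudo-code above, not how Google actually builds its excerpts. The function name `snippets` and its parameters are made up for this example.

```python
def snippets(text, word, window=6, limit=3):
    """Return up to `limit` excerpts with `window` words on each side of `word`.

    Hypothetical helper for illustration; matching is case-insensitive and
    ignores punctuation attached to a word.
    """
    words = text.split()
    target = word.lower()
    found = []
    for i, candidate in enumerate(words):
        if candidate.strip('.,;:!?"\'()').lower() == target:
            begin = max(0, i - window)               # do not run off the start
            end = min(len(words), i + window + 1)    # do not run off the end
            found.append(" ".join(words[begin:end]))
            if len(found) == limit:
                break
    return found


if __name__ == "__main__":
    page = "Today the weather in Berlin is fine. More weather news follows at ten."
    for excerpt in snippets(page, "weather", window=2):
        print(excerpt)
```

Working on a list of words rather than on character offsets keeps the window from cutting a word in half, which the substring version would do.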
The interesting part is the indexing, and how and where Google stores all that data. I can send you a paper that outlines their technique. It so happens that my colleagues in the lab where I work deal a lot with Information Retrieval.
Parsing all the HTML code is not very difficult (parsing PDF files is more challenging). There are robust probabilistic models that achieve high precision and recall at filtering the HTML markup out of a page and leaving the plain text. As a matter of fact, Google uses some of the HTML markup to rank keywords (e.g. headers, links, etc.). But probably the most important aspect of the ranking algorithm is the evaluation of the graph formed by pages linking to other pages. That’s why pages that have a lot of links pointing to them (and which are constantly updated) will appear at a higher position in the search results.
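To make the link-graph part concrete, here is a rough sketch in Python of a PageRank-style calculation by repeated iteration. It assumes the usual textbook formulation with a damping factor of 0.85. The function name, the parameters and the tiny three-page graph are invented for illustration, and it says nothing about how often pages are updated or about what Google really runs in production.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate a rank for every page in a link graph.

    `links` maps a page to the list of pages it links to (hypothetical input
    format for this sketch). Returns a dict of page -> score summing to 1.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # every page gets a small base score regardless of its links
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # a page passes its score on, split evenly over its links
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # a page with no outgoing links spreads its score over all pages
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    graph = {"a": ["c"], "b": ["c"], "c": ["a"]}
    for page, score in sorted(pagerank(graph).items(), key=lambda kv: -kv[1]):
        print(page, round(score, 3))
```

In this toy graph page c is linked from both a and b, so it ends up with the highest score, and b, which nobody links to, ends up with the lowest.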
Johannes
Is anyone using the personalised homepage?
I haven’t checked out the service yet, but it sounds like they are pulling RSS feeds. There are thousands of them out there, and news sites especially have them.
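If it really is RSS, getting at the headlines is simple, because the feed already marks up the title, link and summary of each item. A minimal sketch in Python using only the standard library is below. The sample feed and the function name `headlines` are made up for the example, and a real reader would fetch the XML from the site’s feed URL instead of using a string.

```python
import xml.etree.ElementTree as ET

# Invented RSS 2.0 sample so the example runs without a network connection.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>First headline</title>
      <link>http://example.com/1</link>
      <description>Summary of the first article.</description>
    </item>
    <item>
      <title>Second headline</title>
      <link>http://example.com/2</link>
      <description>Summary of the second article.</description>
    </item>
  </channel>
</rss>"""


def headlines(feed_xml):
    """Return a (title, link, description) tuple for every <item> in the feed."""
    root = ET.fromstring(feed_xml)
    return [
        (
            item.findtext("title", default=""),
            item.findtext("link", default=""),
            item.findtext("description", default=""),
        )
        for item in root.iter("item")
    ]


if __name__ == "__main__":
    for title, link, _ in headlines(SAMPLE_FEED):
        print(title, "->", link)
```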
I played around with it a bit one or two months ago… I added a weather forecast for Berlin and Shanghai to my homepage, but the information for both cities “is temporarily unavailable”. The Slashdot news ticker and the Quote-of-the-day are cool though… But I don’t actually use it.
Johannes
I use it to see a few headlines and the latest messages in my (GMail) inbox on the same page. If the inbox is empty or shows nothing interesting, I head off to the news. So it’s a sort of conditional block for me, so that I don’t waste my time heading off to GMail when I have an empty inbox.