Ripping Google Books


megalomania
January 13th, 2006, 01:56 PM
My admiration for Google has increased significantly with their latest scheme of digitizing recently published books and making them available online. It is unfortunate that the number of pages you can view at any one time is limited. I know it is a good idea because all the publishers hate it. In fact the publishers are screaming bloody murder, but Google is fast on its way to becoming one of the world's financial superpowers, so they have the resources to fight.

It is quite comforting to me to see the publishers squirming in their seats over this. It serves those greedy bastards right for charging so much for books and giving authors so little in return. Book publishers, like the music executives of the RIAA, contribute nothing, create nothing, but they get almost all of the money.

But I digress… I knew someone would try to rip Google's book pages; I just figured they would eagerly publish the results and there would be a script by now. Apparently some do-gooder has done just this, but he is keeping his technique a secret thus far.

I wonder if any Forum members have any ideas about systematically ripping the Google Books site? If anyone can view any page, then every page must be viewable somehow. Somewhere out there is a hacker or group trying to crack this. If anyone could provide some links, leads, or insight I would greatly appreciate it.

FUTI
January 16th, 2006, 01:03 PM
Mega, I also wondered about this and already posted about it...
http://www.roguesci.org/theforum/showthread.php?t=4985
My friend tried a "stupid" brute-force approach and searched for the word "polymer" inside a book that has plastic materials as its main subject... the hits produced about 80% of the book's pages!!! So I guess the approach I proposed earlier can work.

Kamisama
January 29th, 2006, 07:09 PM
I'm looking at the weblinks that are given.
Usually within the weblinks there is a pattern and using that pattern you can create a navigation bar.
There is usually something within the weblink that gives away how the search+navigation+book works.
Often this will need to be edited, reformed, and manipulated to get to the next page.

I did this with the pubmed books for biology

I made a two frame page.
Blank top frame.
Bottom frame with navigation jump menu which changed the top frame's page.
I often used search methods, such as topic searches, for pubmed.
Until this tipped me off: http://biology-online.org/biology-forum/about2438.html

However, most of creating navigation for pubmed books relied on weblink manipulation.

Simply said, it's all in the weblinks.
Similar to how hotmail was cracked back in the day.
___
edit
___
http://books.google.com/books?q=united+states&id=Uazpff00Y5EC&vid=ISBN0765607301&dq=mercantilism&prev=http%3A%2F%2Fbooks.google.com%2Fbooks%3Fq%3Dmercantilism&ie=UTF-8

I see within the weblinks....

http://books.google.com/books?ie=UTF-8&hl=en&vid=ISBN0765607301&id=Uazpff00Y5EC&pg=PA60&lpg=PA60&dq=mercantilism&vq=united+states&prev=http://books.google.com/books%3Fq%3Dmercantilism&sig=rXKn-HQx4L85y5ItRgLnpEX8cQ8

ISBN = 0765607301

PAGE60 = PA60 [pg=PA60&lpg=PA60]
(I noticed that this is often a tag telling the Google program how far you can travel between pages; think of distance on a number line: -3, 0, 3)

vq=united+states [the topic (helpful in grabbing tons of pages at once)]

ie=UTF-8 (the text encoding, I believe)

hl=en (the language)

prev=http://books.google.com/books%3Fq%3Dmercantilism&sig=rXKn-HQx4L85y5ItRgLnpEX8cQ8 (may be of importance???)
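For anyone who would rather not pick the fields out by eye, here is a minimal sketch that splits the second weblink above into its query fields using Python's standard urllib.parse module. The function name query_fields is made up for illustration, and what each field means is still only the guesswork listed above, not anything documented.

```python
from urllib.parse import urlsplit, parse_qs

# The second weblink quoted above, split across lines for readability.
URL = (
    "http://books.google.com/books?ie=UTF-8&hl=en&vid=ISBN0765607301"
    "&id=Uazpff00Y5EC&pg=PA60&lpg=PA60&dq=mercantilism&vq=united+states"
    "&prev=http://books.google.com/books%3Fq%3Dmercantilism"
    "&sig=rXKn-HQx4L85y5ItRgLnpEX8cQ8"
)


def query_fields(url):
    """Return the query string of a URL as a dict of name -> first value.

    Hypothetical helper, not part of any library.
    """
    pairs = parse_qs(urlsplit(url).query, keep_blank_values=True)
    return {name: values[0] for name, values in pairs.items()}


fields = query_fields(URL)
for name, value in fields.items():
    # Values come back decoded: '+' becomes a space, %3F becomes '?', etc.
    print(f"{name:5} = {value}")
```

Note that the values are percent-decoded, so vq shows up as "united states" and prev shows the plain search URL.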

Think of the search results page as the main directory
and then results as subdirectories.
____________
The search engine seems to be loosely based on Google's original style, omitting obvious words like (the, and, if, but), etc. Google's search engine isn't stupid; it's not going to fall victim. I used something simple: (united states)

I suppose certain words in this book, such as (gold, colonies, adam smith, england, spain, europe, century), would collect more pages. Perhaps it would take about 3 hours to collect all the material, open Dreamweaver, and compile it all.

I just tried right-clicking and bookmarking in Mozilla. It keeps the original name of the page (page 41). When I did this I noticed that the bookmark would probably save the yellow highlight too, because of the vq=united... part. So I just erased united+states and left it at "vq=", which takes away the yellow highlight.
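The "erase the vq part" edit can also be done without hand-editing the address bar. This is only a sketch: blank_param is a made-up helper name, the example link is a trimmed-down version of the one above, and that an empty vq removes the highlight is just what I saw in the browser.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def blank_param(url, name):
    """Return url with the value of query field `name` emptied (hypothetical helper)."""
    parts = urlsplit(url)
    pairs = [
        (key, "" if key == name else value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    # urlencode re-escapes the values, so the rebuilt link stays valid.
    return urlunsplit(parts._replace(query=urlencode(pairs)))


# Trimmed example link; only the fields needed to show the idea.
example = "http://books.google.com/books?id=Uazpff00Y5EC&pg=PA60&vq=united+states"
cleaned = blank_param(example, "vq")
print(cleaned)
```

All other fields are kept in their original order; only the highlight term is emptied.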
_____________________

If you had more web development skills, I assume you could make a bottom frame with a navigation option that searches through a database of codes and then selects that page.

search= weirdcodewithnumbers+page#

the weird code probably has something to do with this: pg=PA60&lpg=PA60


I'm oldskewl, so I'm limited to html and swf :roffle: The only thing I'm good at these days is bypassing stuff.
"Oh, so they have internet on computers now!"