Google

Google VP for search quality talks about searching

An interesting interview with Udi Manber, vice president in charge of search quality at Google:

Manber says that even as recently as the early 1990s searching "was done by professionals in various limited domains. There was legal search, there was medical search, there was chemical search, and some limited news search. And it was done by a searcher--professional people....The idea that people will do the search themselves--that it'll democratize the whole thing and you don't have to go to a professional--that's the revolution."

He also says that Google "tunes" search results based on where you are physically in the world:

The other difference is it depends on location. If you do the same search from a different country, you get different results, even if it's the same language. We will tune the results by the country in which you're searching. It's by language and location.

Pentagon Audit of Iraq Spending

I always find it odd when news reports cite government documents without giving a link or good reference to them. It seems to me that this is something news web sites should do regularly. These reports are not always that easy to track down. Case in point: today's New York Times has a story about a Pentagon report:

A Pentagon audit of $8.2 billion in American taxpayer money spent by the United States Army on contractors in Iraq has found that almost none of the payments followed federal rules and that in some cases, contracts worth millions of dollars were paid for despite little or no record of what, if anything, was received.

Using Google to search on the title of the report plus "site:.gov" yields nothing this morning, although the report is available from two different government web sites.

The report is available at the site of the House Committee on Oversight and Government Reform, Committee Holds Hearing on Accountability Lapses in Multiple Funds for Iraq, Wednesday, May 21, 2008, along with other statements and documents.

It is also available at www.dodig.mil/audit/reports at this URL: www.dodig.mil/audit/reports/fy08/08-098.pdf. The same Google search with "site:.mil" substituted for "site:.gov" finds the title in a May 22 "what's new" story on the home page (www.dodig.osd.mil) of the Office of the Inspector General.
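For anyone who wants to repeat this kind of check, here is a minimal Python sketch of the two searches. It assumes the title was searched as an exact phrase, and the title string below is a placeholder, since the report's title is not given in this post. The function names are made up for illustration.

```python
from urllib.parse import urlencode


def site_restricted_query(title: str, domain: str) -> str:
    """Build a query for an exact title limited to one domain suffix."""
    return f'"{title}" site:{domain}'


def google_search_url(query: str) -> str:
    """Turn a query string into a Google web search URL."""
    return "https://www.google.com/search?" + urlencode({"q": query})


# Placeholder only: substitute the real report title here.
title = "EXAMPLE REPORT TITLE"

# Run the same title search against .gov and then .mil sites.
for domain in (".gov", ".mil"):
    print(google_search_url(site_restricted_query(title, domain)))
```

Running the same title against both suffixes makes it easy to see when a document is indexed under one top-level domain but not the other.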

This is the second report I have looked for this week that is available as a PDF document on a government web site but that Google has (evidently) not indexed in full text. I do not know if this reflects a Google policy or just a delay in indexing.

Creating Gov Doc "Libraries" in Google Books

Digitized government documents in Google Books have been written about quite a lot here at FGI, and I'd like to revisit the topic with a different focus.

I was searching for Civil War era government documents for a History Professor, and I realized that we did not own one of the documents he sought. Before suggesting that he interlibrary loan a copy of this document, I decided to search online for a full-text digitized version. Alas, it did not exist in the digital realm, but I did find some other digitized gov docs pertaining to his research needs in Google Books. We were both elated, he because I had found what he needed, and I because so many documents I found digitized on Google Books were the same documents we had lost to mold and water damage from Hurricane Rita!

Out of curiosity, I did a Google Book search for other types of government publications and found these gems:

Trial of the Conspirators, for the Assassination of President Lincoln

Illustrations of the Gross Morbid Anatomy of the Brain in the Insane (isn't that a Cypress Hill song? Never mind...) by the Government Hospital for the Insane.

How it Feels to be the Husband of a Suffragette
(not published by the Government Printing Office, but it is a book housed in the National American Woman Suffrage Association Collection in the Library of Congress).

Official Records of the Union and Confederate Navies in the War of the Rebellion

Most of these documents were scanned at large research universities or depositories, but the quality is not always decent and can sometimes border on the illegible. I was quite amused when I discovered a staff person's hand digitized on this document's cover:

However, there are bigger snafus than a digitized librarian's hand. For example, despite government documents being in the public domain, Google Books treats most post-1922 (i.e. post-copyright law) government documents as copyrighted material by only allowing a limited view! For more details, please read James Jacobs' post on this issue.

Despite all these issues (which have yet to be resolved), I decided to take advantage of the access to full-text, pre-1922 government documents and create a McNeese Gov Docs "Library" account in Google Books for my depository. The account also allows you to subscribe to updates of its holdings via an RSS feed. I put a link to the library account and the RSS feed on my depository's homepage and our "Gov Guides" wiki. I'll add more of these interesting and old documents as I come across them, especially those pertaining to Louisiana or documents that were lost to Hurricane Rita.

Here are some tips for finding gov docs in Google Books: Use Advanced Search, and in the Publisher field, type in Govt OR GPO OR "Government Printing Office". You can also search by agency (e.g., "Department of the Interior") by typing the name of the agency in the Author field.
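If you would rather type the search directly, the tips above can be sketched as query strings. This is only a sketch: it assumes the Publisher and Author fields of Advanced Search correspond to the inpublisher: and inauthor: search operators, and the helper name field_query is made up for illustration.

```python
def field_query(operator: str, terms: list[str]) -> str:
    """Join terms with OR under one field operator, quoting multi-word terms."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return " OR ".join(f"{operator}:{q}" for q in quoted)


# Publisher variants suggested in the tips above.
publisher_q = field_query("inpublisher", ["Govt", "GPO", "Government Printing Office"])

# Agency name entered as the author.
agency_q = field_query("inauthor", ["Department of the Interior"])

print(publisher_q)
print(agency_q)
```

Multi-word names are wrapped in quotation marks so they are matched as a phrase rather than as separate words.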

Have fun exploring and building your own digital collections, but please let me know if you find some really cool gov docs, ok?

The internet, Google, libraries

Thanks to my fellow blogger for the link to Fister's article. I recently came across an article (linked through that splendid online publication brought to us by the Chronicle of Higher Education, Arts and Letters Daily) that offers more food for thought. Titled Better than Free, this thoughtful piece by Kevin Kelly of Wired discusses how we might be able to add value to the vast amount of free information now available - value that people will be willing to pay for. He says that "The internet is a copy machine....When copies are super abundant, they become worthless. When copies are super abundant, stuff which can't be copied becomes scarce and valuable. When copies are free, you need to sell things which can not be copied. Well, what can't be copied?" He discusses eight "generative values" that are better than free: immediacy, personalization, interpretation, authenticity, accessibility, embodiment, patronage and findability. All of these are relevant to librarians as we adjust our skill sets to provide information available on the internet.

Fister on Privacy, Facebook, Google, Libraries

This is a very useful and thoughtful piece that starts with musings on Facebook and privacy issues and goes on to address much larger issues that affect libraries, library users, and academic publishing. This is a must-read.

  • Face Value, By Barbara Fister, Inside Higher Ed (Feb. 18, 2008).

Sample:

Libraries have always taken privacy seriously - not because it's valuable in itself, but because it's a necessary condition for the freedom to read whatever you want without risk of penalty. When the PATRIOT Act was passed, librarians checked to make sure their databases erased the connection between a book and its borrower as soon as the book was returned. That erasure, however, makes it harder to offer the kind of personalization, such as recommendations based on previous book choices, that the public increasingly expects from online systems. After all, it's what they get from Amazon.

...[W]e've barely begun to examine the unintended consequences of the Faustian bargain we strike when we share content through privately-owned digital domains of the public sphere.

Joe Esposito pointed to this article in a posting to the liblicense-l mailing list and he says:

As I was reading this, I reflected on an ongoing conversation with a friend of mine, a former Congressional staffer, about the growing political need for Google to be declared a regulated public utility, like the AT&T of yesteryear. Too much power in the hands of too few: it's morally wrong, and socially dangerous.

I would just add to this that, when we rely on the government to be the only official repository of all government information, we are putting too much power in the hands of too few. We are allowing the government to be the only entity that controls access to that information and the privacy or lack of privacy of all readers of that information. The solution to that is to build collections of digital government information in libraries. We have barely begun to understand the Faustian bargain we strike when we share content through a single government-controlled digital repository.

Fister is a librarian at Gustavus Adolphus College. Her blog is barbara fister's place.

Most fed data is un-Googleable

As we've noted here before (Is your search engine finding the government information you need?), the problem of relying on commercial search engines to find government information is that a lot of government information on the web goes un-indexed by those search engines.

  • Most fed data is un-Googleable, By Jason Miller, FCW (December 17, 2007). "After five years, a major E-Gov Act provision goes unmet because of search problems."

Sen. Joseph Lieberman (I-Conn.), chairman of the Homeland Security and Governmental Affairs Committee, says "There are more than 2,000 federal government Web sites not included in commercial search engine results. Is it accidental, or is there a policy, or it is just laziness? I would like to know why" and "Agencies do not let commercial search engines index their sites."

I wonder if that is true? I wonder if there is any document librarian who can answer that question or point to which sites are not indexed?

It is probably closer to the truth to say, along with John Needham, Google's manager for public-sector content partnerships, that government "databases" are being missed by web crawlers and that "Agencies are concerned more about how information is presented than if users are finding it." In other words, agencies would probably like to have their information indexed, but haven't figured out how to do so, or don't have the budgets to do what is necessary. It probably isn't "laziness" but lack of funds and other resources; it probably is sometimes "accidental" in that some may not know what to do. It is probably sometimes even "policy" -- but probably less often.

But one big problem is that we don't really know the scope of the problem or its cause. FDLP librarians should be pushing GPO, researchers, and library schools to research these issues so we have answers.

Is your search engine finding the government information you need?

The Center for Democracy and Technology (CDT) and OMB Watch have released a report, Hiding in Plain Sight: Why Important Government Information Cannot Be Found Through Commercial Search Engines (Dec. 11, 2007), that highlights "a critical gap in online access to vital government information." The press release for the report says:

"It is unclear if agencies know there is a roadblock between the public and their information and have not taken the adequate steps to correct the problem, or if the agencies simply do not realize that their important information is not being found and indexed by search engines," said Sean Moulton, Director of Federal Information Policy for OMB Watch. "In today's Internet age, either answer is unacceptable." 

The report uses several search examples that Americans might expect to result in access to trustworthy government information. Instead, the results overlook a vast amount of useful government information.

The report is also included in the testimony of Ari Schwartz, Deputy Director, Center for Democracy and Technology at the December 11, 2007 Hearing of U.S. Senate Committee on Homeland Security and Governmental Affairs on E-Government 2.0: Improving Innovation, Collaboration, and Access. See: CRS Reports, E-government, Thomas, indexing of the government web, and more!

Research Libraries Question Google Book Scanning Restrictions

Google's book-scanning project and restrictions that Microsoft places on books it scans in a similar project continue to attract attention, praise... and controversy. This article in the International Herald Tribune outlines some of the key problems of commercializing information in libraries and of libraries outsourcing one of their key functions.

Hafner notes that "Several major research libraries have rebuffed offers from Google and Microsoft to scan their books into computer databases, saying they were put off by restrictions these companies wanted to place on the new digital collections."

One particular example demonstrates how Google's business plan simply does not allow for adequate scholarly access and use. Tom Garnett, director of the Biodiversity Heritage Library, a group of 10 prominent natural history and botanical libraries, tells the story.

Garnett said the most striking example of this came when he asked the Google representatives about a theoretical example.

"We asked, 'Suppose we allowed you to digitize all our literature, and there was an ant researcher who wanted to peel off 10,000 pages of ant literature and load it on his own server and perform advanced analysis to correlate it with climatological data over the last 100 years, using software he had developed to study trends in species research,'" Garnett recalled.

He said the Google executives told him this would not be possible. "They said, 'We'd be sympathetic but it doesn't fit in with our model.'" Smith [Adam Smith, project management director of Google Book Search] ... said this was not the case. "It's certainly something we would work with libraries to do," he said.

The Open Content Alliance (OCA) offers an alternative to the Google project, but Hafner says that Microsoft, after joining the Open Content Alliance in 2005, "added a restriction that prohibits a book it has digitized from being included in commercial search engines other than Microsoft's". This was news to me and I was not able to confirm that.

Paul Duguid, an adjunct professor at the School of Information at the University of California at Berkeley and author of The Social Life of Information, says, "There are two opposed pathways being mapped out. One is shaped by commercial concerns, the other by a commitment to openness, and which one will win is not clear." And Doron Weber, a program director at the Sloan Foundation, which has made several grants to libraries for digitization, says, "You don't want any for-profit company having control of the world's knowledge."

[The article was online on Saturday morning October 20, but I have been unable to find it on the IHT web site since then. A copy is available here. The article is in LexisNexis and can be found by doing an "easy search" on "Major U.S. and World Publications" on the phrase "research libraries have rebuffed offers from Google" (including the quotation marks).]

[UPDATE: the article is now available on the NYT website:
http://www.nytimes.com/2007/10/22/technology/22library.html ]

See also: On Google's Monetization of Libraries, By Rory Litwin, Library Juice 7:26 (December 17, 2004).

Questioning the power of Google

Google. Who's looking at you?, by John Arlidge, The Sunday Times, October 21, 2007. "It wants to know everything about you. It wants to be your best friend -- or your Big Brother. Are your secrets safe with Google?"

Google's overall goal is to have a record of every e-mail we have ever written, every contact whose details we have recorded, every file we have created, every picture we have taken and saved, every appointment we have made, every website we have visited, every search query we have typed into its home page, every ad we have clicked on, and everything we have bought online. It wants to know and record where we have been and, thanks to our search history of airlines, car-hire firms and MapQuest, where we are going in the future and when.

This would not just make Google the largest, most powerful super-computer ever; it would make it the most powerful institution in history. Small wonder that the London-based human-rights group Privacy International has condemned its plans as "hostile to privacy", and EU ministers called Google's vision "Orwellian". Even John Battelle, one of the net's leading evangelists, who co-founded the technology bible Wired magazine, and wrote The Search, the definitive study of Google's rise, now says: "I've found myself more and more wary of Google, out of some primal, lizard-brain fear of giving too much control of my data to one source."

(see also: Google: "We don't know enough about you"... yet.)

The Googlization of Everything, "Drop the fight"? or Start a Revolution?

Siva Vaidhyanathan is writing his next book, The Googlization of Everything: How One Company is Disrupting Culture, Commerce, and Community--and Why We Should Worry, and will be posting snippets of text on the blog, The Googlization of Everything and asking readers for comments. He says of the book:

The book will answer three key questions: What does the world look like through the lens of Google?; How is Google's ubiquity affecting the production and dissemination of knowledge?; and how has the corporation altered the rules and practices that govern other companies, institutions, and states?

Vaidhyanathan is an associate professor of media studies and law at the University of Virginia, a fellow at the Institute for the Future of the Book, and the author of Copyrights and Copywrongs and The Anarchist in the Library.

"Drop the fight"?
To those of us at FGI, the increasing reliance by libraries on Google is something that needs close scrutiny. I was dismayed to read recently that an associate dean for public services and collection development said that, because Google does the "search function" "better than so far any library can do," therefore "..what would be in our best interests is to drop the fight, to let Google take over that..." (Susan Gibbons quoted in Young Librarians, Talkin' 'Bout Their Generation by Scott Carlson, Chronicle of Higher Education, October 19, 2007). To say that finding information can be done only one way, and that the (apparently) winner-take-all popularity-contest and keyword-in-text approaches used by Google are so good that libraries should "drop the fight," is just plain short-sighted and an abdication of our responsibility (IMHO). For a different point of view, see On Google's Monetization of Libraries, By Rory Litwin, Library Juice 7:26 (December 17, 2004).

New Feature: Google Books/Fed Docs

FGI is pleased to announce a new occasional series that will examine how Google Books treats US Federal Documents. These posts will have titles that begin with "Google Books/Fed Docs".

We're very pleased to have a guest researcher putting up these postings. Please give a warm FGI welcome to Julia Tryon, Government Documents Librarian of the Phillips Memorial Library at Providence College. Julia has started to gather statistics and other information about the tens of thousands of government documents that have been scanned by the Google Books projects. We posted her first govdoc-l message on this subject.

Julia has agreed to start blogging on this subject for FGI. Unlike our "Blogger of the Month" series, Julia will post whenever she finds something interesting at the intersection of US Federal Documents and Google Books until she feels that she's exhausted the subject.

Take it away Julia!

Estimates and observations about govdocs in Google Books

Julia Tryon, Government Documents Librarian of the Phillips Memorial Library at Providence College, posted some interesting information about the intersection of Federal Publications and Google Books to govdoc-l this week. She is quoted here with permission:

My director has asked me to discover what I may about the amount of documents available in Google's digital projects.  I've been looking at Google, partners' websites, articles, blogs, etc.  I have found a lot of chit-chat but very little substantive information.  Maybe I am just not looking for it the right way or in the right places.

It seems that there is a blackout on reporting statistics for these projects. Google and most of the partners give no statistical data at all. Stanford did have a page with statistics that was buried on their project's website but the information had not been updated since 2004.

To figure out the statistics on my own, I have tried searching Google Books, Stanford, and University of Michigan; but there is no way to limit a search to government documents. On Google I was able to search by publisher and, using various abbreviations for GPO that are used in the publisher field, I came up with 187,522 (GPO-141,600; Gov't-2322; Government Printing Office-43,600). The university catalogs did not allow me to search by publisher.

When looking at the search results in Google for publisher field has GPO, I found 141,600 items, only 82,487 of which were available in the full view. And although it is nice to think that we have the full text for 82,487 documents, not all of them can be used. I randomly picked a title to see how it looked and chose the Statistical Abstract for 1954. The pages were clear enough to read easily but on every even numbered page part of the right hand column was chopped off.
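As a quick check on the figures quoted above, here is a small Python sketch that re-adds the three publisher counts and works out what share of the GPO results were in full view. It assumes the three publisher searches do not overlap, which the message does not confirm; the variable names are only illustrative.

```python
# Result counts reported for each publisher-field variant.
publisher_counts = {
    "GPO": 141_600,
    "Gov't": 2_322,
    "Government Printing Office": 43_600,
}

# Simple sum; any item matched by more than one variant would be double counted.
total = sum(publisher_counts.values())

# Full-view items reported among the GPO results.
gpo_full_view = 82_487
share = gpo_full_view / publisher_counts["GPO"]

print(total)
print(f"{share:.1%}")
```

The sum matches the 187,522 reported, and a little under three fifths of the GPO results were available in full view.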

 

 

Have you done your own study/poking around/etc. with Google Books and Federal Documents? Share your findings with us!

Google: "We don't know enough about you"... yet.

There are big privacy implications of relying on private sector companies like Google instead of libraries to index knowledge. One of the biggest problems is that, in the age of the web, search engines don't just index content and help you find it, they also track what you use and how you use it, thus learning more about you. They don't just index what you want to find, they index you too.

An interview with Google's chief executive shows that this is Google's explicit goal.

  • Google's goal: to organise your daily life, By Caroline Daniel and Maija Palmer, Financial Times, May 22 2007. "The goal is to enable Google users to be able to ask the question such as 'What shall I do tomorrow?' and 'What job shall I take?'" The race to accumulate the most comprehensive database of individual information has become the new battleground for search engines as it will allow the industry to offer far more personalised advertisements. These are the holy grail for the search industry, as such advertising would command higher rates. Mr Schmidt told journalists in London: "We cannot even answer the most basic questions because we don't know enough about you. That is the most important aspect of Google's expansion."

An OpEd in today's Los Angeles Times examines these comments...

  • Is Google's data grinder dangerous?, By Andrew Keen, Los Angeles Times, July 12, 2007. Still, if iGoogle turns out to be half as wise about each of us as Schmidt predicts, then this artificial intelligence will challenge traditional privacy rights as well as provide us with an excuse to deny responsibility for our own actions. What happens, for example, when the government demands access to our iGoogle records? And will we be able to sue iGoogle if it advises us to make an unwise career decision?

As Keen says, "Google is not our friend. Schmidt's iGoogle vision of the future is not altruistic, and his company is not a nonprofit group dedicated to the realization of human self-understanding." See also: Privacy: "I have nothing to hide"

Google mashup: books and maps

According to AllPoints Blog, there will soon be connections between Google Books, Google Earth and Google Maps.

Google has posted locations found in some texts (Michael Jones' demo at the New York State Geospatial Summit showed only publicly available ones) to Google Earth. When you click on a placemark you jump to the book's page and to the exact page on which the location information was found. And you get a Google Map of all the places mentioned in the book.

Here are 2 examples. Scroll to the bottom of the book's "about" page, and you'll see a Google map of every place mentioned!

This is definitely an indicator of where the net is going. As David Weinberger posits (and I'm seriously paraphrasing!), in his new book, Everything is Miscellaneous, the seemingly paradoxical idea of opening up or giving away your library's metadata to the world is what will drive users to your library.

Proceedings of the Second Life Education Workshop

The Education Dept's ERIC Clearinghouse gave notice of this recent report on educational potential in Second Life:

ERIC #: ED493670
Title: Proceedings of the Second Life Education Workshop, Part of the Second Life Community Convention (1st, San Francisco, California, August 18-20, 2006)

As a volume of proceedings, this is a compilation of papers. My favorite paper based solely on title from these proceedings is "Down the Rabbit Hole ... or How the NMC Took the Red Pill and Got a Second Life (Larry Johnson)."

From a government information perspective though, people might be more interested in "Designing an Educational Island inside Second Life for the National Oceanic and Atmospheric Administration (NOAA) Earth System Research Laboratory (ESRL) (Eric J. Hackathorn)"

Both papers can be accessed at the link above.

It also appears that at least some of the content in the ERIC database is visible to Google, since I received this link in a Google Alert. It would be interesting for someone to research whether ALL of the ERIC content is accessible through Google, since we've reported that significant amounts of government information are hidden from Google.
