Unofficial transcript of an unofficial audio session, provided by James Staub. Non-anonymous corrections are welcome.
200704171330 FDLP-DLC WEB HARVESTING
1335 KATHY BRAZEE[?]
The session in the fall focused on technologies; this session will focus on management.
GPO is developing a digital repository. One way we capture is web harvesting.
1337 slide harvesting pilot project – pilot project with two vendors independently crawled EPA web site
1338 slide ongoing harvesting activities – defines semi-manual
1339 slide harvested content management plan
1340 slide major issues
1340 slide Persistent Uniform Resource Locators
1341 slide Superintendent of Documents policies
SOD 304 harvesting policy
defining in-scope publications
1341 slide cooperative cataloging
potential volume of cataloging is a big issue – pilot harvesting discovered a boatload of publications
1343 slide Harvesting complete publications
1345 slide ongoing technology discovery
1346 CINDY ETKIN
slide SOD 304 policy statement – previously ID 74
1347 slide sod 304 policy statement review
we can guarantee that there will be changes to this policy as a result of the harvesting pilot. I have a packet of all the questions and issues of policy that need to be addressed
1348 slide brainstorming
list of policies that need re-evaluation
1349 slide assumptions
GPO will continue to develop more fully automated
GPO will continue to
1350 slide general assumptions
rewind to previous slide
WW council senses harvesting is a powerful technique for finding fugitive documents. It is also beautiful in that it makes content searchable. Downside is that this content does not come with a cataloging record, which turns into a huge burden on your staff. The EPA harvesting pilot retrieved so many docs that you would seriously hold up harvesting if you were to try to catalog everything. If GPO has an opportunity to make govt info searchable, they should do this now, and let cataloging catch up later
1353 WW you will harvest in accordance with policy… I think policy should be written to allow access by the American people. If it’s not, it’s the policy that should be changed, not the harvesting process.
1353 slide general assumptions
LSCM will continue to be responsible for scope determinations…
Material harvested but not within scope…
WW if you need to harvest, it is because an agency is not following Title 44. Why ask permission from an agency that is already breaking the law? If they don’t have the resources to comply, then they don’t have the resources to evaluate harvesting
AM there are agencies that partially comply. E.g., EPA. We lose materials from the regional offices. The Census Bureau has played well, but we still might want to harvest. I'm not sure that you can make your assumption that GPO will not be reorganized in the future. I don't think you can make this assumption
1357 PH private industry will scrape Web sites without permission. Sometimes agencies are grateful to have another outlet for their information
CE SOD will be revised
BS last phrase is interesting – there is a fear there as to who will be setting policies and in charge of how the harvesting is done. There needs to be more discussion as FDSys becomes a real system. Who is going to manage the harvested material?
1359 PH is this going to be demand driven – will there be a mechanism for libraries to go in and say I want this information?
CE where’s Robin?
PH if it’s demand driven, you will spend more time gathering what they want than what they don’t need.
RD going back to BS question – FDSys will enable the technologies, but the library unit makes those decisions
BS content and content management should be managed by content specialists
CE I will attempt to answer PH question – one of the questions is whether we should prioritize harvest targets. We also have mandate of being comprehensive. So it's both.
Amy West, U MN – if FDSys can ingest from external partners – you can use Archive-It – scope would be a problem, but it seems like gathering more is more important
WW scope issue is really the achilles heel of the pilot. The fix – have humans guide the crawler in advance. The pilot attempted to weed after the fact, but these judgments are difficult for computers. Better choice: trust an FDLP librarian to help guide the harvester on the front end.
Matt Langraff – we came to a similar conclusion at the end of our pilot, and one of the recommendations in the white paper is to spend more time studying a site before harvesting. Work with FDLP, GPO staff, and/or agency webmasters. We are in the discovery phase as far as technology goes.
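[The front-end guidance WW and ML describe amounts to handing the crawler an explicit scope before it runs, rather than weeding after the fact. A minimal sketch of such a pre-fetch scope check; the host name and path prefixes are hypothetical stand-ins for the judgments an FDLP librarian would supply, not rules from the pilot:]

```python
from urllib.parse import urlparse

# Hypothetical scope rules a librarian might supply before the crawl
IN_SCOPE_PREFIXES = ("/publications/", "/reports/")
OUT_OF_SCOPE_PREFIXES = ("/jobs/", "/press/")

def in_scope(url, host="www.example.gov"):
    """Decide before fetching whether a URL belongs in the harvest."""
    parts = urlparse(url)
    if parts.netloc != host:
        return False  # stay on the agency's site
    if any(parts.path.startswith(p) for p in OUT_OF_SCOPE_PREFIXES):
        return False  # explicitly excluded sections
    return any(parts.path.startswith(p) for p in IN_SCOPE_PREFIXES)
```

[The point of doing this before fetching, as WW argues, is that a human-authored allow/deny list is cheap to evaluate per URL, whereas judging scope from harvested content afterward is hard for computers.]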
RD – Lori talked yesterday about the DTIC contract for automated catalog record generation. We only have 20 staff on cataloging. We have to have automated ways to help with the cataloging piece.
WW to be clear, I support this effort, but the chance of success is below 100%. How much of a partial solution can it be?
RD there is still the human element that needs to be involved on both ends.
PH GPO does not do the work, [huh? I think his point – use FDLP staff for cataloging]
Lori Hall – we are looking at cooperative cataloging. Perhaps brief records.
1409 second assumption from same slide
1409 slide general assumptions
LCSM may use a combination of in-house…
GS this document put together two unlike things – cataloging and harvesting. Harvesting should produce searchable results, cataloging should occur according to priorities. They are two separate things.
ML – I completely agree with you – this has come to light since the white paper.
1411 #2 bibliographic records for Web harvested publications will be completed in accordance with overall cataloging procedures
WW harvest, make searchable, and available to the American people
CE this is a decision GPO made before, not necessarily for harvested material
1412 slide general assumptions
bibliographic records for Web harvested publications
LCSM will explore the use of automated…
1413 KATRINA S
slide Questions for Discussion
1. Are the assumptions stated above correct with respect to Web harvested publications?
WW some are fine, some are not
Kathy Hale, PA State Library – when harvesting, did you encounter firewalls? Did you overcome them?
ML yes, we hit impediments. Policy states we honor robots.txt, but our agreement with EPA allowed for permission to crawl.
KH are you going to have to get specific permission to overcome each impediment?
WW DOE does not protect or prevent people from seeing anything that is released. We only protect when there is a law that demands we restrict. GPO’s mandate is for published material. If there is an access restriction, then it is not really a publication, and falls out of scope.
RD intentionally or not, there is in scope material behind a robots.txt exclusion rule.
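[The robots.txt convention ML and RD are discussing is machine-readable, so a crawler can check it before each fetch. A minimal sketch using Python's standard urllib.robotparser; the exclusion rules and URLs below are hypothetical illustrations, not the EPA site's actual rules:]

```python
from urllib import robotparser

# Hypothetical agency robots.txt with one exclusion rule (order matters:
# the first matching rule decides)
robots_txt = """\
User-agent: *
Disallow: /internal/
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# A polite harvester checks each URL before fetching it
rp.can_fetch("GPO-harvester", "https://www.example.gov/reports/annual.pdf")   # True
rp.can_fetch("GPO-harvester", "https://www.example.gov/internal/draft.pdf")   # False
```

[As RD notes, in-scope material can sit behind exactly such an exclusion rule, which is why the pilot negotiated explicit permission with EPA rather than relying on robots.txt alone.]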
1417 Steve Wood, Penn State – assumptions seem to prevent agencies from assisting in harvesting or in cataloging
ML white paper for pilot – more work needs to be done with the agency to discover in-scope content
Gil Baldwin – a year ago, Richard Huffine and I served together on a committee on cataloging and developing a simple metadata scheme for agencies. Our recommended schema was not accepted by OMB
KB ditto plus…
1421 Steve X Notre Dame – I don't want to take forward the baggage of a paper system. Title 44 pretty much gives you your collection development policy. Libraries don't stop buying things waiting for cataloging to catch up.
1423 slide should web harvested publications be identified as such in CGP?
PH I believe people should know the source of the stuff they’re looking at
AM that's analogous to putting "this book is printed on acid-free paper" – it doesn't matter
PH it may matter if you’re only getting portions of the document. We want to get the information in there and then catalog it.
WW the question presumes that the information is in the catalog in the first place. I'm not sure what information the customer cares about.
BS this sounds like the electronic equivalent of the black dot
RA the only reason we would want to designate is if the content were changed through the harvesting process
DD the reason is for the agency, not for the public. The agency needs to know how GPO got it.
GB in FDSys we will capture the source of any content in the system, including MODS records
ML requirement of the harvester is to capture the date and time of the capture
CE policy… catalog
GS are we saying this should be in the descriptive metadata?
Jerry Breeze, Columbia – are web harvested publications intrinsically different from the other formats GPO receives? I do not think so.
1429 slide should LSCM point PURLs at the live copy of a publication on the agency Web site or the archived copy…
PH keeping them updated and cross-referenced is a full-time job. On the other hand, it needs to come with a big disclaimer
BS should this question be updated to PURLs/handles
RD yes. This to me is one of our most challenging questions
KS I could argue both sides of this one
MS no reason to believe that the agencies have the obligation to permanence that GPO does. It’s your obligation
GS … [drops mic] … in our institutional repository, departments won’t play unless we drive traffic to their sites
KS are you implying that GPO would not capture?
RD not at all.
WW OSTI always points to the live page, but we have a different mission than the GPO
PH is there a statutory requirement to preserve that information…
KS your users will get access to the live site and not the GPO site – your users are presumably looking for the subject and need live links.
PH and on the archived side, links might point to outdated information
AQ should point to both. Very simple. Our users want the material, regardless of where it’s stored
PH that would be pretty tough. You would need to make sure that you are in sync
Pat Ragens, Univ. of Nevada, Reno – implemented in the indefinite future, 20-30 years down the road. Recently, agencies were reorganized under DHS, and their web addresses changed. Might be best to point to GPO's copy but mention whence it was obtained.
David Sismowsky, CA State Library â€“ does this question refer to the CGP as we know it today or FDSys? As I understood FDSys, we should be able to access the current version and all previous versions. We will need a multiplicity of links in any kind of search and retrieval system
Ann Sanders, Library of MI â€“ we had to go to both. Our users wanted both. We started pointing to the archived version, and that didn’t work.
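[A PURL is at bottom a stable identifier that redirects to a stored target, so the live-vs-archived question comes down to what the resolver keeps per identifier. A minimal sketch; the identifiers and URLs are hypothetical, and the "both" option Ann Sanders describes is modeled by storing both targets and letting the caller choose:]

```python
# Hypothetical PURL table; identifiers and URLs are illustrative, not real GPO PURLs
PURL_TABLE = {
    "GPO/LPS12345": {
        "live": "https://www.example.gov/reports/example.pdf",
        "archived": "https://permanent.example.gov/lps12345/example.pdf",
    },
}

def resolve(purl, prefer="archived"):
    """Return the redirect target for a PURL, falling back to the other copy."""
    targets = PURL_TABLE.get(purl)
    if targets is None:
        return None  # unknown identifier: a real resolver would return HTTP 404
    # Prefer the requested copy, but fall back so the link never dangles
    return targets.get(prefer) or targets.get("archived") or targets.get("live")
```

[This also illustrates PH's sync worry: serving "both" means every agency reorganization or URL change has to be propagated into the table's live entries, while the archived entries stay stable.]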
1439 slide are cooperative cataloging partnerships…
WW this is the only way
1439 slide are the cataloging levels outlined above acceptable?
KB – this means abridged records
GS I assume partners can enhance these records
PH qualifying – certified partners – make sure they're authentic
?? UKY make sure GPO’s symbol is on the record
1442 are there groups of publications that should be among those manually or semi-manually harvested by LSCM?
KM clarifying – should GPO make sure we harvest in such a way that we capture the entire publication, e.g., with integrating resources
KS the community should notice important things getting missed, and submit reports/requests to GPO
KS you’re wondering if the library community has things they want harvested now?
KB we’re looking for experience, best practices
Kathy Hale â€“ you’re going to get a million different answers. I suggest you put this on the next Biennial survey
1444 slide what should GPO do with out-of-scope material accidentally harvested?
WW throw it in the trash as soon as you find it
AM I’m not sure how we can tell GPO what’s missing in a project that hasn’t started yet.
DD NCES in their benevolent wisdom decided to break a single report into as many parts and versions as they possibly could. It will take a human to determine what’s valuable.
Amy West[?] seems like there are three questions
1. frequency of harvest
2. investment on guiding the harvester on the front-end
3. my real worry is the quantity of information that I don't know about. I want more stuff captured.
1449 KS I have more questions
1. 25% of documents that were partial – why were they partial?
ML – that's a rough estimate. Has to do with the way the information was reported back from the contractor. It doesn't mean that the whole document was not harvested.
PH what you get with harvesting is a bunch of information, but you miss the links that tie it all together. Just pulling in information is not enough
ML we considered that more and more through the pilot and had the vendors begin to tackle this issue by maintaining the agency directory structures.
KB in one case I saw a PDF represent an entire publication, but the agency also posted chapter six as a separate file
2. So what comes next?
ML main focus of this session was on management of harvested content – that's our immediate concern. There are a lot of different methodologies to do harvesting, but we don't want to jump in without more consensus on which practices to follow. Details of the next pilot are yet to be determined.
WW your metric should be: we made a portion of information searchable that was not searchable before
ML I completely agree â€“ we should make that content searchable and accessible
TimB we want to encourage you to move forward as quickly as possible
1455 [that lady again whose name I can't catch to save my life] I hear about so many technologies; how are you bringing them together?
ML that’s what we’re trying to do. We have taken steps to align these projects together.
?? what about the LOCKSS piece? It seems to be a good way to distribute collections
RD GPO is very supportive of LOCKSS. GPO engaged in a pilot project. The feedback we received – scalable, sustainable, supportable. At the end of the day, LOCKSS may meet these
1458 Steve Woods, Penn State – I want to communicate the urgency of our need for you to harvest. Our catalogs are not clean any more. We are moving to a good-enough standard. If we get hung up on making this a perfect, clean, nice neat thing, it will never get done.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.