CIMA Fall Caucus Presentation:
Wednesday, October 13, 2004
1) Priorities for SOS
A) Making choices regarding materials in a collection
For the Save Our Sounds project, the priority for digitization was the sound recordings – to save content recorded onto media that had been damaged or begun to degrade. A co-equal priority for digitization was materials in other formats from the same collection that would supply context for these recordings. This would include correspondence, music and lyric transcriptions, fieldnotes, and other manuscript materials; photographs, and moving images. So the question became: what is “it” or “not it” – the “it” materials would be sound recordings and related materials (in terms of intellectual content). Materials that were “not it” might include materials with content that could be easily transcribed (such as information typewritten on labels affixed to tape boxes), obsolete inventories for collection materials, and research done by other people included with the collection (research papers written by Carpenter’s students, Angus Gillespie’s research on George Korson). Since each collection is unique, the decisions about inclusion and exclusion differ from collection to collection.
B) “It” and “Not It”
The materials not selected for digitization from the Linscott Collection include notes on scraps of paper that were unorganized, and impossible to organize based upon content. The pieces of paper – some no larger than 4 inches by 3 inches – frequently had notes written on both sides, sometimes for different dates or different purposes. The reverse of unsent form letters, parts of unused Christmas cards, sections of envelopes – Linscott made use of almost any writing surface. While the notes will prove useful to researchers, more time would have been spent deciding about arrangement than was deemed reasonable. Also, these pieces of paper are not fragile, another factor in the decision not to digitize them. Instead, preservation photocopies were made. The disc sleeves, by contrast, were digitized: they were in part transcribed, but because the collector used abbreviations that were sometimes hard to understand, it was considered better to digitize the sleeves as well as transcribe them.
C) Accessibility as a deciding factor
One reason to digitize any materials is to make them accessible – a wax cylinder is damaged every time it is played, and even then, a playback machine may not be available. A lantern slide – a glass-based positive photographic image – cannot be viewed without the proper equipment, and the medium is too fragile to serve. Negatives cannot be served; even slides are a challenge. (Would the user hold the slide up to the light? Is there a light box available?)
Providing access to obsolete or specialized formats is one reason to digitize materials. Another reason to digitize materials is to safeguard the originals. Repeated handling of manuscripts, or playback of analog audio recordings, will cause degradation, even with the most careful handling. If a digitized access (or reference) copy is damaged, another access copy can be made from the digitized preservation master. I’ll say more about this later.
2) Preparation for digitization I: Survey the materials
Any collection that is to be digitized should be fully processed, and if not fully processed, then carefully surveyed. At the Folklife Center, we’ve learned from experience that piecemeal digitization efforts – some items from this collection, a few from that collection, when these collections aren’t fully processed or surveyed – lead to problems when the collection is finally processed, especially when it comes to assigning numbers to physical items, like photos and sound recordings, within the range of materials in a collection. The bottom line: know what you have in a collection before you begin a digitization project.
3) Preparation for digitization II
A) Organizing the materials
When a collection has been processed or at least surveyed, it is easier to start thinking about how much will be digitized and the demands on available resources. Whether the work is digitized in-house or contracted out, you will make plans for digitization by format. So manuscript materials should be grouped together (if not physically, then at least intellectually, in a survey), photo prints together, photo negatives together, etc.; the same is true for audio discs, audiotapes (all formats), and videotapes. It is not just an archive’s workflow to consider, but the workflow of the contractor. Moving back and forth between bound and unbound manuscript items, or between discs and tapes, can slow down the transfer process and lead to numbering errors as the digital files are created.
B) Getting ready to write a Statement of Work
In the Folklife Center’s experience, cost is calculated based not only on format, but also on the size of the original items, the desired output for the digital files, and other measurements. Some of these considerations:
1) Manuscript materials:
– count bound pages
… note the dimension range of the items (measurements of the smallest item and measurements of the largest item)
… count oversized items
… count how many for each of the following: color, grayscale, or bitonal (black and white)
– count unbound pages
… note the dimension range of the items
… count oversized items
… count how many for each of the following: color, grayscale, or bitonal (black and white)
The considerations become more complex with special formats, like scrapbooks:
– will the item be disbound?
– will the entire page be digitized, and then each element; and in what order?
– will the covers be scanned?
2) Audio materials – count by format and size (e.g., how many 7-
inch reels, 10-inch reels, audiocassettes, DATs, wax cylinders, wire
3) Photos – count by format and size (e.g., prints and their
dimensions; film negatives, with dimensions – and within this
distinction, how many are one frame, how many strips, etc.;
transparencies [35 mm, etc.], lantern slides, etc.)
4) Other graphic materials [e.g., drawings] – blueprints, maps,
5) Moving image items – Super 8, 16 mm, VHS, Betamax, digital
The Statement of Work will reflect the degree to which you’ve organized or assessed a collection. The more you know about the physical materials, the more accurately you can convey this information to a contractor, and the more ably you can plan for staffing and resources to manage the files after they are delivered to your institution.
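The counts described above can be tallied from a survey list with a few lines of code. The following is a minimal sketch in Python, not a Folklife Center tool: the record layout (SurveyItem), its field names, and the use of area to pick the smallest and largest item are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class SurveyItem:
    """One surveyed physical item (hypothetical record layout)."""
    fmt: str              # e.g. "unbound page", "7-inch reel"
    width_in: float       # width in inches
    height_in: float      # height in inches
    color_mode: str = ""  # "color", "grayscale", or "bitonal"; blank if not applicable


def summarize(items):
    """Return counts by format, counts by (format, color mode), and size ranges.

    The size range for each format is a pair (smallest, largest), where each
    entry is a (width, height) tuple and "smallest"/"largest" is judged by area.
    """
    counts = Counter(item.fmt for item in items)
    modes = Counter((item.fmt, item.color_mode) for item in items if item.color_mode)
    ranges = {}
    for item in items:
        size = (item.width_in, item.height_in)
        smallest, largest = ranges.get(item.fmt, (size, size))
        if size[0] * size[1] < smallest[0] * smallest[1]:
            smallest = size
        if size[0] * size[1] > largest[0] * largest[1]:
            largest = size
        ranges[item.fmt] = (smallest, largest)
    return counts, modes, ranges
```

A summary of this kind could be pasted into a Statement of Work; defining "oversized" is left to the institution, since the threshold depends on the contractor's equipment.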
4) Preparation for digitization III: conservation work
During the survey of collection materials, you may encounter items that are damaged. This could include torn papers (where the tear passes through text); papers to which an old rubber band has stuck and will not come off; fingerprints on photos (or negatives); nitrate film stock (either still or moving images); splices on audiotapes that are falling off; tapes with “sticky-shed” syndrome; audiodiscs with a flaking surface, or a build-up of palmitic acid; vinegar syndrome (some film formats are subject to this); and other damage or degradation that must be addressed before digitization. You will want to capture materials in the best possible condition, depending upon your resources for conservation work, how much you are able to do in-house, and the extent of the work needed. This is one of the hidden costs in digitization, especially for repositories with older materials. A survey is the best way to see what treatments may be needed. You may even need to call in an expert for advice about the degree of conservation work needed. It may be more than you think is needed, or it may be less. Contacts at other institutions may prove a valuable resource, if you wish to get expert guidance that may not be available in-house.
Segue to Michael
Segue back to Marcia
5) Numbering systems for collections and collection items
A numbering system is important not only for a digitization project, but for any archive that assigns numbers to individual items, or groups of items (like a folder of correspondence). A decision that each archive or repository must make: do we use the numbering system we have, or create a new system for digitized materials? A numbering system needs to be flexible enough to allow for different original formats, and also for a range in terms of the quantity of materials within a given collection (a few or a lot). Qualities to consider in a numbering system:
ID numbers should…
…be unique; an alphanumeric can help
…be as simple as possible; the more complex the system, the more likely that errors will be introduced during manual keying of information (e.g., not a 16-character alphanumeric)
…be applied to materials by format: sound recordings with one prefix, followed by numbers; manuscript materials with a different prefix, followed by numbers, etc.
…be consistent; if possible, use a numbering system already in place for a collection
…be constructed with “leading zeroes” – in a collection with 1000 sound recordings, the first would be 0001, the second would be 0002, and the last would be 1000; every number then has the same length, so most systems will sort them in the order you intend
As you get ready to impose numbers…
…create a concordance between the existing numbering system and the one you will impose
…take into account the maximum number of items in a collection with like formats, not just those that will be digitized
…use lowercase letters – some operating systems, like UNIX, treat filenames as case-sensitive, so consistent lowercase avoids mismatches
The new numbering system uses format codes (SR, PH, MS, CF, AR, GR) to indicate format, which helps both in reference tools and in file management.
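The effect of leading zeroes on sorting, and the idea of a concordance, can be shown with a small sketch in Python. The helper names (make_id, width_for) and the old numbers ("tape 12", etc.) are invented for illustration; they are not AFC identifiers.

```python
def make_id(prefix, number, width=4):
    """Build an ID such as 'sr0001' from a format prefix and a running number."""
    return f"{prefix.lower()}{number:0{width}d}"


def width_for(max_items):
    """Digits needed so that every number up to max_items has the same length."""
    return len(str(max_items))


# Without padding, a plain text sort puts sr10 ahead of sr2.
unpadded = sorted(f"sr{n}" for n in (1, 2, 10))

# With padding sized for a collection of 1000 items, text order matches numeric order.
padded = sorted(make_id("SR", n, width_for(1000)) for n in (1, 2, 10))

# Concordance: old number -> new number (the old numbers here are made up).
old_numbers = ["tape 12", "tape 13", "tape 14"]
concordance = {old: make_id("SR", i) for i, old in enumerate(old_numbers, start=1)}

print(unpadded)
print(padded)
print(concordance)
```

Sizing the width from the maximum number of like-format items in the collection, rather than from the items selected for digitization, follows the advice in the list above.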
Filenames are a related issue. You have an item, but the item needs a filename, so it can be distinguished from other files on a computer.
– sr001 is the first sound recording
– sr001am.wav is the “A” side, master file
– sr001bm.wav is the “B” side, master file
– sr001ash.wav is the “service high” wav file
– sr001asl.mp3 is the “service low” file (an mp3)
These examples are not prescriptive. They work for the Folklife Center. The best way to approach filenaming for your institution is to do research before a project begins: consult with individuals at other institutions, and discover what standards they use. Also, find out what they would do differently, in hindsight.
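One benefit of a regular filenaming pattern is that a program can take a filename apart again. The sketch below, in Python, parses only the pattern shown in the examples above (two-letter prefix, digits, one side letter, then m, sh, or sl); the regular expression and the function name parse_filename are assumptions for illustration, not an AFC specification.

```python
import re

# Suffix codes taken from the examples above.
ROLES = {"m": "master", "sh": "service high", "sl": "service low"}

# Assumed pattern: item ID (two letters + digits), side letter, role code, extension.
FILENAME = re.compile(
    r"^(?P<item>[a-z]{2}\d+)(?P<side>[a-z])(?P<role>m|sh|sl)\.(?P<ext>[a-z0-9]+)$"
)


def parse_filename(name):
    """Split a filename like 'sr001ash.wav' into item, side, role, and extension."""
    match = FILENAME.match(name)
    if match is None:
        raise ValueError(f"filename does not follow the expected pattern: {name!r}")
    parts = match.groupdict()
    parts["role"] = ROLES[parts["role"]]
    return parts


for name in ("sr001am.wav", "sr001bm.wav", "sr001ash.wav", "sr001asl.mp3"):
    print(name, parse_filename(name))
```

A check like this can flag keying errors (an uppercase letter, a missing side code) as soon as files are delivered.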
A) What metadata to capture
As you make decisions regarding what to digitize, you will also make decisions on how to provide access to these digitized materials. You’ll decide what sort of database to use, whether records will go into an ILS, the selection of controlled vocabularies, the look and feel of the retrieved information, as well as other issues.
Scans, sound files, and related digitized materials cannot simply reside on a server or a local computer without a means for intellectual access. Metadata – “an encoded description of an information package” (The Organization of Information, by Arlene G. Taylor) – is needed for digital objects. Metadata is more than a bibliographic record. Administrative, structural, and descriptive metadata can provide information to both staff and researchers that includes and goes beyond basic bibliographic information.
Administrative metadata – includes technical data on the digitization
process (file creation and storage), quality review results, rights for both
use and reproduction of digital files, persistent identifiers, and technical
notes about the transfers (like engineer’s notes) — Dublin Core does not
include administrative metadata, whereas MODS does.
Descriptive metadata – bibliographic attributes, including physical
description of source (if digital item is reformatted, not born-digital);
title, statement of responsibility, keywords, an abstract – descriptive
metadata supports resource discovery.
Structural metadata – information that shows how parts of a complex
digital object (one with many parts) are related: which side of a disc
sleeve is related to which side of an audio disc, which pages in a
manuscript folder are part of the same unit (like a multi-page letter).
Like administrative metadata, this is usually “behind-the-scenes,”
whereas descriptive metadata is meant to be discovered by the
researcher.
B) Where to store the metadata
A database is the best way to capture detailed information about the materials that will be digitized. The Folklife Center followed the lead of M/B/RS, and used a prototype database created for the Library. Commonly called “Builder,” this database allows for detailed data entry, either manually or using information imported from MARC records, and this information can be linked to the digitized items (images, sound recordings, etc.). The database can capture administrative, structural, and technical metadata. Another part of the database, called Generator, is used to generate the XML used in the third part of the database, called Viewer. This is the search and retrieval interface.
In Builder, we have a structural map (a structmap) that shows how various parts of a digital object are organized. It looks like an outline, and allows the researcher to see the relationship between parts of a digital object. The structure wraps up the digital files and the metadata as a package.
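To make the idea of an outline-like structural map concrete, here is a minimal sketch in Python. It is not Builder's data model: the nested dictionary layout, the labels, and the function name outline are all hypothetical, chosen only to show how parts of one digital object can be related.

```python
def outline(node, depth=0):
    """Return the structure as a list of indented outline lines."""
    lines = ["  " * depth + node["label"]]
    for child in node.get("parts", []):
        lines.extend(outline(child, depth + 1))
    return lines


# Hypothetical digital object: one audio disc with two sides and its sleeve scans.
disc = {
    "label": "sr001 (audio disc)",
    "parts": [
        {"label": "Side A", "parts": [
            {"label": "sr001am.wav (master)"},
            {"label": "disc sleeve, side A (scan)"},
        ]},
        {"label": "Side B", "parts": [
            {"label": "sr001bm.wav (master)"},
            {"label": "disc sleeve, side B (scan)"},
        ]},
    ],
}

print("\n".join(outline(disc)))
```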
This database is in the process of being replaced by MAVIS (Merged Audio Visual Information System), created by Wizard, an Australian company. LC and other institutions use the database, which is upgraded about twice a year. User institutions have a say in the database’s elements and functionality, so the qualities of the database are often defined by user needs, and this makes for a positive difference.
No one metadata standard (or set of elements) will be a perfect fit for your institution. What you choose will depend upon your institution’s needs, and should allow for a certain measure of flexibility as those needs change.
C) Metadata storage
Where the metadata is stored is another decision to make. Choose software with a track record of continued support from one release to the next (Access, Oracle, etc.), and from which data can be readily migrated or exported. FileMaker Pro is reliable, but may not be robust enough to contain the metadata you need to capture. AFC generally uses Access, because of its flexibility. MAVIS uses Oracle. What your institution decides to use is another matter for both research and curatorial decision.
7) Standards for digitizing various formats (manuscripts, sound recordings, photographs, moving images)
Capture at the highest quality possible, whether it’s a photograph, a manuscript page, or a sound recording. While standards that exist now will be superseded by new technology, in terms of hardware, software, and capabilities, doing the best work possible right now will allow you to get a worthy master copy, and generate a better quality derivative from that master. This is especially important if the materials are delicate right now, or if they degrade past the point of no return by some future date. You’ll know that you did the best you could at the time, given the resources available at the time.
The master copy is sometimes called the production copy. Whereas NARA differentiates between the preservation copy and the production copy, AFC presently uses these terms interchangeably. The preservation copy is set aside as a replacement for the original materials, and is rarely accessed, whereas the production (or master) copy is used to generate access copies (or what we call derivatives). The access copy is usually in a compressed format, which makes for a smaller file (and thus takes up less computer storage space) that opens faster. In the most basic terms, regardless of the original format (sound recording, text, photo, moving image):
uncompressed digital copy = master copy (highest quality)
compressed digital copy, created from master copy = access copies
96/24 (explain) – audio
– 96 kHz or 96,000 samples per second / sampling turns a continuous wave into a series of dots
– 24 bit word depth [or width; it indicates verticality] / the number of bits sets the precision of each sample, and is also a measure of dynamic range (calculated exponentially; each additional bit doubles the number of possible values); a higher number means smaller rounding errors when each sample is measured
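The arithmetic behind 96/24 can be worked through in a few lines of Python. This is a sketch under stated assumptions: uncompressed PCM audio, file header overhead ignored, and the usual theoretical figure for dynamic range (about 6 dB per bit), which real equipment does not fully reach. The function names are invented for illustration.

```python
import math


def pcm_bytes(sample_rate_hz, bit_depth, channels, seconds):
    """Size in bytes of uncompressed PCM audio, ignoring file headers."""
    return sample_rate_hz * (bit_depth // 8) * channels * seconds


def quantization_levels(bit_depth):
    """Number of distinct values a sample can take; doubles with each added bit."""
    return 2 ** bit_depth


def dynamic_range_db(bit_depth):
    """Theoretical dynamic range in decibels (about 6.02 dB per bit)."""
    return 20 * math.log10(2 ** bit_depth)


print(pcm_bytes(96_000, 24, 1, 60))    # one minute, mono, at 96/24
print(pcm_bytes(96_000, 24, 2, 3600))  # one hour, stereo, at 96/24
print(quantization_levels(16), quantization_levels(24))
print(round(dynamic_range_db(16), 1), round(dynamic_range_db(24), 1))
```

One minute of mono audio at 96/24 comes to 17,280,000 bytes, which is why master files need far more storage than compressed access copies.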
300 ppi (explain) – manuscripts, photographic materials
– dpi = dots per inch
– ppi = pixels per inch
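What 300 ppi means for a given item can be calculated directly: pixels equal inches multiplied by ppi. The sketch below, in Python, assumes uncompressed storage with 24 bits per pixel for color, 8 for grayscale, and 1 for bitonal, and ignores file headers; the function names are invented for illustration.

```python
def pixel_dimensions(width_in, height_in, ppi=300):
    """Pixel width and height of a scan of an item measured in inches."""
    return round(width_in * ppi), round(height_in * ppi)


def uncompressed_bytes(width_in, height_in, ppi=300, bits_per_pixel=24):
    """Approximate size in bytes of an uncompressed scan, ignoring file headers."""
    width_px, height_px = pixel_dimensions(width_in, height_in, ppi)
    return width_px * height_px * bits_per_pixel // 8


# A 4 x 3 inch scrap of paper and an 8.5 x 11 inch page, both at 300 ppi.
print(pixel_dimensions(4, 3))
print(pixel_dimensions(8.5, 11))
print(uncompressed_bytes(8.5, 11, bits_per_pixel=24))  # color
print(uncompressed_bytes(8.5, 11, bits_per_pixel=8))   # grayscale
print(uncompressed_bytes(8.5, 11, bits_per_pixel=1))   # bitonal
```

This is one reason the count of color, grayscale, and bitonal items matters when estimating cost and storage.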
mpeg2 or Real Media (moving images)
Use file formats that are uncompressed for master/preservation copies, and compressed formats for access copies. Rationale: an uncompressed format can take up a lot of disk storage space, and might be large enough to take a few minutes to launch. A compressed format will take less time to launch, but the sound (or visual) quality will not be as good as an uncompressed file. It is a curatorial decision to make the master copy readily available, or only available upon demand. AFC does not make uncompressed files readily available, because they could be used in production-level work for commercial purposes, without prior permission.
Another consideration: whether to “improve” the highest-quality file. This means enhancements like: sharpening the contrast for a scanned image, cleaning up any flaws from the scan like a tear or a hole, or perhaps smeared ink. For audio files, this might mean depopping and declicking, diminishing the number of skips in a transfer (sometimes skips are unavoidable) by adjusting the wave signal using software (like SoundForge), and any other alterations that will change the intended one-to-one ratio between the original physical item and the digitized copy. AFC chooses to create “flat” transfers for sound recordings, and unenhanced scans for manuscript and photographic materials. Digitization captures what you have, not what you wish you had, or what the materials were originally like. If a researcher wants to make use of software to “clean up” an image or a sound file, in order to understand or apprehend information more clearly, it is up to that person. The process is so subjective that there is no fair way to make that sort of universal decision for researchers, and therefore AFC prefers unenhanced digitized copies.
8) Quality review of digitized materials
Not all materials can be reviewed (especially in large collections); select a subset. 10% of the digitized files for a project is reasonable. The NDL had a more demanding standard: 10% of the master files, and 100% of the derivatives. The argument being: if the derivative is fine, usually the master copy is fine. Again, the percentage of files you review upon delivery is a decision to make. QR allows you the opportunity to have rework done by the contractor – something that should be written into a statement of work. Also, if there is a problem with the files, you will want to know now, rather than a year from now, or when a researcher finds an error.
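Choosing the review subset at random, rather than by hand, avoids always checking the same part of a delivery. The following is a minimal sketch in Python; the function name pick_review_sample and the seed parameter are assumptions for illustration, and the 10% default simply mirrors the figure mentioned above.

```python
import random


def pick_review_sample(filenames, percent=10, seed=None):
    """Pick a random subset of files for quality review.

    The count is rounded up, so even a small delivery gets at least one file.
    Passing a seed makes the selection repeatable, which helps when documenting
    which files were reviewed.
    """
    if not filenames:
        return []
    count = (len(filenames) * percent + 99) // 100  # integer ceiling
    rng = random.Random(seed)
    return sorted(rng.sample(sorted(filenames), count))


delivered = [f"sr{n:04d}am.wav" for n in range(1, 251)]
sample = pick_review_sample(delivered, percent=10, seed=2004)
print(len(sample), "of", len(delivered), "files selected for review")
```

For a rule like 10% of masters and 100% of derivatives, the same function could be called once per group with a different percent.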
You can find more than a few sets of guidelines for the quality review of digital files reformatted from original (physical) sources. Here are but a few to consider:
Be certain that…
…files can be opened and viewed (or listened to)
…files are in the format requested
…all the files requested were delivered
When listening to (or viewing) a file, check:
…what’s there (that was requested)
…what’s there and shouldn’t be (that was introduced in the digitization process)
…what isn’t there (the start or the end was cut off, etc.)
…to confirm that specifications were followed (color images scanned in color, sampling rates are correct, etc.)
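Some of these checks can be automated before anyone listens to a file. The sketch below, in Python, covers three of them for audio masters: whether every requested file was delivered, whether each file opens, and whether the sampling rate and bit depth match the specification. It uses the standard library's wave module, which reads only uncompressed PCM WAV files, so a file it rejects may still be valid in another encoding. The function names and the 96/24 defaults are assumptions for illustration, and no script replaces actually listening.

```python
import wave
from pathlib import Path


def missing_files(requested, delivery_dir):
    """Return requested filenames that are not present in the delivery folder."""
    delivered = {p.name for p in Path(delivery_dir).iterdir() if p.is_file()}
    return sorted(set(requested) - delivered)


def check_wav(path, sample_rate_hz=96_000, bit_depth=24):
    """Return a list of problems found in one WAV file (empty if none found)."""
    problems = []
    try:
        with wave.open(str(path), "rb") as wav:
            if wav.getframerate() != sample_rate_hz:
                problems.append(
                    f"sampling rate is {wav.getframerate()} Hz, expected {sample_rate_hz}"
                )
            if wav.getsampwidth() * 8 != bit_depth:
                problems.append(
                    f"bit depth is {wav.getsampwidth() * 8}, expected {bit_depth}"
                )
            if wav.getnframes() == 0:
                problems.append("file contains no audio")
    except (wave.Error, EOFError, OSError) as exc:
        problems.append(f"cannot be opened as a PCM WAV file: {exc}")
    return problems
```

A typical use would be to call missing_files once per delivery and check_wav on each master, then pass any reported problems back to the contractor as rework.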
When reviewing the file, consider the source – and this is when familiarity with the source materials will pay off. You may receive a scan of a photo that has a white line on it; upon review of the original photo, it may have a scratched emulsion, and that was picked up in the digitized version. In listening to a sound recording, you may hear a low hum. This occurred in the QR process for two collections with which I’ve worked. By talking with a recording engineer, I learned that a humming sound can be introduced when the recording machine is hooked up to a problematic power supply – in the case of the Linscott and Carpenter collections, this may have been a house with an erratic power source, or it could be a car battery. Whatever the reason, the hum was introduced during the original capture process, not during digitization, and so the transfer could not be faulted.
Of equal importance: confirm that all materials sent out to be digitized are returned. This is true of projects both on-site and off-site. Take an inventory of materials once they’re returned by the contractor, and be sure you receive everything that was sent out.
9) Case study: The James Madison Carpenter Collection
Dealing with a collection that has already been processed; an international team working on the digitization process (a cooperative effort)
Conclusion: Issues to consider beyond those presented here
1) delivery system: researcher interface
2) infrastructure for long-term management of resources (both digital files and metadata); quality review over time; migration to new storage media
3) monitoring use of resources (statistics; rationale to request more funds)
4) feedback format
Conclusion: key points to remember
1) As you plan for a digitization project, keep in mind other potential projects; this may inform the decisions you make, and keep you from a drastic revision of practices when a new project is begun
2) Document the practices you employ, the standards used, the sources consulted, the contacts made, and create your own manual for standards and practices. Update it frequently, and review it before each new digitization project begins.