Distributed Proofreaders

Last Change: November 10, 2009

About Project Gutenberg and Distributed Proofreaders

Founded in 1971, Project Gutenberg, or PG, is the world's oldest electronic library. Its holdings number about 30,000 works, almost exclusively in the public domain (under US copyright law), and are available at no cost over the Internet.

About half of PG's catalog is produced by Distributed Proofreaders, or DP, an all-volunteer Internet community devoted to production of high-quality free ebooks. At DP, books are scanned and run through optical character recognition (OCR) software, then proofread for errors, formatted for semantic structure and visual appearance, and assembled into finished ebooks. Proofreading (or proofing) and formatting are done one page at a time, typically over a period of days or weeks, by dozens of volunteers. The site's name derives from the distributed nature of the workflow.

Other, larger collections of ebooks exist. However, PG is devoted to high-quality, text-based ebooks. The Internet Archive and Google Books provide books made from scanned images. These are enormous files, tens or even hundreds of megabytes, and are effectively unavailable to people without fast Internet access, including users on dial-up worldwide, and most users in Africa, the Middle East, South America, and Asia. PG's collection consists of text files, tens of kilobytes to a few megabytes. The entire collection currently fits on one single-sided DVD, making it possible to distribute the catalog by ordinary post.
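To put rough numbers on the size comparison, here is a back-of-the-envelope sketch in Python. Every constant is an assumption made for illustration and is not taken from PG or DP: a nominal 4.7 GB single-layer DVD, a nominal 56 kbit/s dial-up line, a 100 MB scanned book, and a 500 KB text ebook. The function names are likewise hypothetical.

```python
# Rough arithmetic only; all constants below are illustrative assumptions.
DVD_BYTES = 4.7e9            # nominal single-sided, single-layer DVD capacity
WORKS = 30_000               # approximate catalog size quoted in the text
DIALUP_BITS_PER_S = 56_000   # nominal dial-up rate; real throughput is lower


def average_budget_kb(capacity_bytes, works):
    """Average space available per work, in kilobytes (1 KB = 1000 bytes)."""
    return capacity_bytes / works / 1000


def download_hours(size_bytes, bits_per_second):
    """Idealized transfer time in hours, ignoring protocol overhead."""
    return size_bytes * 8 / bits_per_second / 3600


avg_kb = average_budget_kb(DVD_BYTES, WORKS)
scan_hours = download_hours(100e6, DIALUP_BITS_PER_S)           # assumed 100 MB scan
text_minutes = download_hours(500e3, DIALUP_BITS_PER_S) * 60    # assumed 500 KB text

print(f"Average space per work on one DVD: {avg_kb:.0f} KB")
print(f"100 MB scanned book over dial-up: {scan_hours:.1f} hours")
print(f"500 KB text ebook over dial-up: {text_minutes:.1f} minutes")
```

Under these assumptions, a single scanned book ties up a dial-up line for hours, while a text ebook arrives in about a minute. The per-work figure is only an average budget; it says nothing about whether the files on an actual DVD are compressed.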

Ebooks created from page images suffer from whatever limitations existed when the book was scanned. If the resolution is low, the contrast is poor, or if a page was stained or damaged, those defects carry into the final product. By comparison, a PG ebook is a text file, which suffers from none of these legibility issues.

Doesn't Google provide OCRed text of its books? Yes, but if you've tried to read raw OCR, you've discovered the technology has a long way to go. Typos are commonplace, and unusual elements, such as mathematics, often scan as garbage. By contrast, DP volunteers correct the raw OCR to match the scan, focusing as needed on pages that are difficult to read, and create a marked-up file containing the tags needed to display the book with good fidelity to the original.

My Activities at DP

At DP, I work almost exclusively on mathematics and physics books formatted with the typesetting language LaTeX. My long-term goal at DP is systematically to digitize selecta from the public domain mathematics literature. You can browse the list of books I've post-processed, namely, assembled into finished ebooks and uploaded to PG.

For reasons indicated above, mathematics books benefit particularly from DP's workflow. Nonetheless, digitizing mathematics is a slow process and demands extensive training. A typical mathematics book requires about an hour per page, and much of that work must be performed by volunteers familiar with or fluent in LaTeX. On the flip side, volunteering at DP can be an excellent path to learning LaTeX, provided you have the time and commitment.

Since joining DP, I've become involved with many aspects of LaTeX ebook production: formatting individual pages, helping to develop the distinctive workflow for mathematical projects, writing manuals for proofing, formatting, and post-processing LaTeX, creating interactive training materials, coordinating LaTeX-knowledgeable volunteers, and writing the software PG uses to package uploaded LaTeX projects. All of these tasks have been carried out in collaboration with volunteers from around the world, none of whom I've met in person.

Getting Involved

To work at DP, you must register as a volunteer. This takes only a minute or two, and is similar to registration at any free online site: you request a user name and provide an email address, to which your account information (a welcome message and your initial password) is sent. As a volunteer, you may visit the site, browse and work on available projects, and communicate with other volunteers in bulletin-board-style forums on a variety of topics.

Access to some activities is limited, to ensure you have been sufficiently trained and have accumulated enough knowledge of the site to carry out tasks productively. In some cases access is granted after a certain amount of time on site and a certain number of pages completed. In other cases, access is granted by evaluation of your work by a more experienced volunteer.

DP's success as a source of high-quality ebooks is a testament to the potential of the Internet, which has allowed a far-flung group of like-minded individuals to grow into a thriving community founded on the desire to preserve history one page at a time.

Books Posted to PG

  1. Alexander McAulay:
    The Utility of Quaternions in Physics
  2. Amos Emerson Dolbear:
    The Machinery of the Universe
  3. Karl Weierstrass:
    Theorie der Abel'schen Functionen
  4. Ernst Leonard Lindelöf:
    Le calcul des résidus et ses applications à la théorie des fonctions
  5. Arthur S. Eddington:
    Space, Time and Gravitation
  6. Vito Volterra:
    Leçons sur l'intégration des Équations Différentielles aux Dérivées Partielles
  7. Michel A. Melkanoff et al.:
    A Fortran program for elastic scattering analyses with the nuclear optical model
  8. Leonard E. Dickson:
    First course in the theory of equations
  9. Jacques Hadamard:
    Four Lectures on Mathematics
  10. Maurice Godefroy:
    La Fonction Gamma
  11. Albert Ribaucour:
    Étude des Élassoïdes ou Surfaces A Courbure Moyenne Nulle
  12. H. E. Slaught and N. J. Lennes:
    Solid Geometry with Problems and Applications (Revised edition)

Project Gutenberg Catalogues

Math

Physics