Text Processing and the Web

John F. Moore

Revision History
Revision 1.1 April 2012 JFM

Table of Contents

1. A Book
2. Man Pages
3. WordStar
4. TeX and LaTeX
5. Markup Language
5.1. HTML
5.2. XML
5.2.1. Docbook
6. Asciidoc
7. Presenting and Transforming Content
7.1. Example of AsciiDoc Processing
8. Pandoc
8.1. Pandoc Examples

1. A Book

We are all familiar with the written word, and I would imagine that all of us have written things on the computer. So the question is: how do we get from words on a page to some presentation format?

To answer this question, let’s look at some of the systems used. For this talk I am going to ignore word processing programs such as OpenOffice Writer, AbiWord, KWord, and others. For more information on this topic, have a look at Linux Word Processing.

2. Man Pages

One of the first programs for formatting text on Unix was roff.

 

RUNOFF is a direct predecessor of the runoff document formatting program of Multics, which in turn was the ancestor of the roff and nroff document formatting programs of Unix, and their descendants. It was also the ancestor of FORMAT for the IBM System/360, and of course indirectly for every computerized word processing system.

 
  --TYPSET and RUNOFF

This program is still in use on Linux systems to format man pages. Let’s have a look at a man page.

For more information on man page structure, here is a good page from the internet: Man Page Howto.

Tip View in a terminal "man groff" and in another terminal "zless /usr/share/man/man1/groff.1.gz"
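
To make the structure of a man page source a little more concrete, here is a minimal Python sketch. The sample source is hand-written for illustration and is not the real groff.1 file; the function names are my own. It only relies on the fact that roff requests and macros (such as .TH, .SH and .B) sit on lines that start with a dot, while everything else is ordinary text.

```python
# A hypothetical, hand-written man page source (NOT the real groff.1 file).
SAMPLE_MAN_SOURCE = """\
.TH HELLO 1 "April 2012" "Example" "User Commands"
.SH NAME
hello \\- print a friendly greeting
.SH SYNOPSIS
.B hello
[\\fIname\\fR]
.SH DESCRIPTION
Prints a greeting to standard output.
"""


def section_headings(source):
    """Return the text that follows each .SH (section heading) macro."""
    headings = []
    for line in source.splitlines():
        parts = line.split(None, 1)
        if parts and parts[0] == ".SH":
            headings.append(parts[1] if len(parts) > 1 else "")
    return headings


def split_requests(source):
    """Separate lines that start with a dot (requests/macros) from text lines."""
    requests, text = [], []
    for line in source.splitlines():
        if line.startswith("."):
            requests.append(line)
        else:
            text.append(line)
    return requests, text


if __name__ == "__main__":
    print(section_headings(SAMPLE_MAN_SOURCE))
```

To try the same functions on a real page, the compressed file from the tip above could be read with gzip.open("/usr/share/man/man1/groff.1.gz", "rt"), assuming that path exists on your system.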

3. WordStar

One of the early word processing programs for CP/M, and later DOS, was WordStar.

 

WordStar is a word processor application, published by MicroPro International, originally written for the CP/M operating system but later ported to DOS, that enjoyed a dominant market share during the early- to mid-1980s. Although Seymour I. Rubinstein was the principal owner of the company, Rob Barnaby was the sole author of the early versions of the program; starting with WordStar 4.0, the program was built on new code written principally by Peter Mierau.

WordStar was a text-based word processing program, meaning that it worked with files that were essentially text, with markup language-like formatting commands (such as the "dot commands"); this made the files relatively small. By contrast, most word processors today are code-based, and save their documents in much larger files.

 
  --WordStar

Even though this was never a Linux tool, its method of saving data has a lot in common with man pages. I include it here because it used text files with embedded formatting commands. The difference was that the interface was WYSIWYG, so there was never a need to process the files separately.

Here is the specification for the WordStar File Format.

4. TeX and LaTeX

One of the next systems that I experimented with was the TeX system.

 

Together with the Metafont language for font description and the Computer Modern family of typefaces, TeX was designed with two main goals in mind: to allow anybody to produce high-quality books using a reasonable amount of effort, and to provide a system that would give exactly the same results on all computers, now and in the future.

TeX is a popular means by which to typeset complex mathematical formulae; it has been noted as one of the most sophisticated digital typographical systems in the world. TeX is popular in academia, especially in mathematics, computer science, economics, engineering, physics, statistics, and quantitative psychology. It has largely displaced Unix troff, the other favored formatter, in many Unix installations, which use both for different purposes. It is now also being used for many other typesetting tasks, especially in the form of LaTeX and other template packages.

 
  --TeX

This system takes some time to learn, but the output is often worth the work, especially if you take the time to use its macro ability.

An example of this power comes from a project I did for my son’s grade school. The idea was to get the kids to write books. Third and fourth graders would write up a story, and a parent would then type the story into a template file I provided. I would take their file and merge it with a number of macros to produce a book printed on regular 8 1/2 by 11 paper in a two-up format, so that folding the pages in the middle gave an 8 1/2 by 5 1/2 inch book. Once we wrapped the pages in a cover, the students could take home a book they wrote themselves and I published for them. Needless to say, they were thrilled.

Tip View in a terminal the files amato.tex cas-320.ps cas-320.tex tolly.poem.ps tolly.poem.tex

5. Markup Language

Now that we have seen how a document can be tagged for how it will look, let’s take a different approach to documents. This time let’s see what happens if we mark up the text for what it is instead of what it will look like.

 

A markup language is a modern system for annotating a document in a way that is syntactically distinguishable from the text. The idea and terminology evolved from the "marking up" of manuscripts, i.e., the revision instructions by editors, traditionally written with a blue pencil on authors' manuscripts. Examples are typesetting instructions such as those found in troff and LaTeX, or structural markers such as XML tags. Markup is typically omitted from the version of the text that is displayed for end-user consumption. Some markup languages, such as HTML, have presentation semantics, meaning that their specification prescribes how the structured data are to be presented, but other markup languages, like XML, have no predefined semantics.

A well-known example of a markup language in widespread use today is HyperText Markup Language (HTML), one of the document formats of the World Wide Web. HTML, which is an instance of SGML (though, strictly, it does not comply with all the rules of SGML), follows many of the markup conventions used in the publishing industry in the communication of printed work between authors, editors, and printers.

 
  --Markup Language

5.1. HTML

For most people, the first common use of a markup language is web pages. Let’s take a look at some tutorials on creating web pages. There are hundreds of web sites that want to teach you about writing HTML. I am going to suggest three that I think are among the best, at least in my opinion.

  1. Web page creation for teachers. WRITING HTML is one of the oldest tutorials I know, but I think it is still one of the best.
  2. Tizag’s Tutorials is another old-timer web page tutorial. Beginner’s Web Site Creating Guide leads with little to distract you. Although this site has not been updated since 2008, it is still good.
  3. W3Schools is a more recent addition to tutorials. Learn to Create Websites has tutorials on many web and non-web technologies related to Markup languages.

Let’s discuss markup in more detail. HTML was the first time many people started using markup to indicate what something was as opposed to how it looks. In HTML we mark things with opening and closing tags. The closing tag repeats the name of the opening tag with a slash (/) in front of it, so <P> is closed by </P>. Here are some of the opening tags. All of these have closing tags except <IMG>.

  • <title> (Title)
  • <H1> (Header Level 1)
  • <H2> (Header Level 2)
  • <P> (Paragraph)
  • <UL> (Bullet List)
  • <OL> (Ordered List)
  • <DL> (Definition List)
  • <LI> (List Item)
  • <A HREF="URL"></A> (Link to a web page)
  • <IMG SRC="URL"> (Link to an image)
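
As a small illustration of opening and closing tags, the following Python sketch uses the standard library’s html.parser module to record which tags are opened and which are closed in a sample page. The page itself, the class name TagLister and the function unclosed_tags are made up for this example.

```python
from html.parser import HTMLParser

# A small, made-up web page that uses the tags listed above.
SAMPLE_PAGE = """\
<html>
<head><title>My First Page</title></head>
<body>
<h1>Welcome</h1>
<p>Some of my favorite tools:</p>
<ul>
<li><a href="http://example.com/">An example link</a></li>
<li>A plain item</li>
</ul>
<img src="photo.png">
</body>
</html>
"""


class TagLister(HTMLParser):
    """Collect the names of opening and closing tags as they are parsed."""

    def __init__(self):
        super().__init__()
        self.opened = []
        self.closed = []

    def handle_starttag(self, tag, attrs):
        self.opened.append(tag)

    def handle_endtag(self, tag):
        self.closed.append(tag)


def unclosed_tags(page):
    """Return the sorted names of tags that are opened but never closed."""
    lister = TagLister()
    lister.feed(page)
    lister.close()
    return sorted(set(lister.opened) - set(lister.closed))


if __name__ == "__main__":
    print(unclosed_tags(SAMPLE_PAGE))
```

Note that the parser reports tag names in lower case, so <H1> and <h1> are treated as the same tag. In the sample page the only tag without a closing partner is the image tag.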

5.2. XML

XML is a newer markup language. It was designed to allow documents to be shared and understood between computers. Here is how Wikipedia defines XML.

 

Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. It is defined in the XML 1.0 Specification produced by the W3C, and several other related specifications, all gratis open standards.

The design goals of XML emphasize simplicity, generality, and usability over the Internet.[7] It is a textual data format with strong support via Unicode for the languages of the world. Although the design of XML focuses on documents, it is widely used for the representation of arbitrary data structures, for example in web services.

 
  --XML

XML markup can be used to exchange records between different computers and even different operating systems. One of its strengths is that you can create a DTD (Document Type Definition) that defines a set of tags and the rules for using them. Once a DTD exists, it can be used to create documents which can be easily understood on both ends of a transfer. XML documents are often used in manufacturing automation, where a manufacturer and a parts supplier exchange information between computers. The XML documents serve as the transfer document, so two totally different systems can understand what each other is saying without a custom interface.
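
To give a feel for this kind of exchange, here is a minimal Python sketch that reads a small parts order. The order document, its tag names and the read_order function are all invented for illustration and are not taken from any real DTD. Also note that Python’s standard xml.etree.ElementTree module only checks that the document is well formed; it does not validate against a DTD.

```python
import xml.etree.ElementTree as ET

# A made-up parts order; the tag and attribute names are hypothetical.
ORDER_XML = """\
<order id="1001">
  <supplier>Acme Parts</supplier>
  <item partno="A-17" quantity="40">Hex bolt</item>
  <item partno="B-02" quantity="5">Drive belt</item>
</order>
"""


def read_order(xml_text):
    """Turn the order document into plain Python data."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.findall("item"):
        items.append(
            {
                "partno": item.get("partno"),
                "quantity": int(item.get("quantity")),
                "description": item.text,
            }
        )
    return {
        "id": root.get("id"),
        "supplier": root.findtext("supplier"),
        "items": items,
    }


if __name__ == "__main__":
    order = read_order(ORDER_XML)
    for entry in order["items"]:
        print(entry["partno"], entry["quantity"], entry["description"])
```

Because the tags say what each value is, the receiving system can pick out the part number and quantity without knowing anything about the program that produced the file.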

One group created an XML schema for document publishing called DocBook.

5.2.1. Docbook

 

DocBook is a schema (available in several languages including RELAX NG, SGML and XML DTDs, and W3C XML Schema) maintained by the DocBook Technical Committee of OASIS. It is particularly well suited to books and papers about computer hardware and software (though it is by no means limited to these applications).

 
  --What is Docbook

I found Docbook useful and have used it for several years to write these presentations. Let’s take a look at a presentation and the source for the presentation.

First let’s look at the source code: WhyNot Source. Once this is typed into the computer, it is converted into XHTML using a script and then looks like this: WhyNot Presentation.

6. Asciidoc

This last markup is the one I am using for this talk. It has many similarities to the markup used by Wikipedia.

 

AsciiDoc is a text document format for writing notes, documentation, articles, books, ebooks, slideshows, web pages, man pages and blogs. AsciiDoc files can be translated to many formats including HTML, PDF, EPUB, man page.

AsciiDoc is highly configurable: both the AsciiDoc source file syntax and the backend output markups (which can be almost any type of SGML/XML markup) can be customized and extended by the user.

 
  --Asciidoc Introduction

The interesting thing about Asciidoc is the number of formats it can be converted into. It makes the work of typing up a talk like this fairly straightforward. The downside of this format is that it moves back to presentation markup as opposed to content markup.

For an example of AsciiDoc, let’s use their website.

As a comparison, let’s look at the markup used in the Wikipedia Main Page.

7. Presenting and Transforming Content

We have discussed two types of markup in this talk: Descriptive markup and Presentational markup. The debate between them is framed by what the information is for and how it will be used. The argument goes back to the early computer documentation systems. Examples of Presentational markup come from roff typesetting and TeX markup. Examples of Descriptive markup come from IBM BookMaster and SGML.

Presentational markup aims to define how the output is going to be presented: which font, what line spacing, where to highlight, etc. This type of markup is commonly used in individual documents or web pages. Its use in books is well known and was in fact the basis for some of the original computer markup.

Descriptive markup aims to define what the content of the document is: author name, abstract, section, block quote, reference, etc. This type of markup is needed to make it obvious to the computer what the words represent. The computer can then use programs to convert this markup into presentation markup for output to different devices. Additionally, Descriptive markup allows computers to create indexes and search capabilities over a large number of documents.
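
The idea of converting descriptive markup into different presentations can be sketched in a few lines of Python. This is only a toy model under my own assumptions: the document is a list of (kind, text) pairs, and the kinds "title", "author", "section" and "para", along with the function names, are invented for the example.

```python
from html import escape

# A tiny descriptively marked-up document: each piece says what it IS.
DOCUMENT = [
    ("title", "Text Processing and the Web"),
    ("author", "John F. Moore"),
    ("section", "Man Pages"),
    ("para", "One of the first programs for formatting text on Unix was roff."),
]

# One possible presentation: how each kind of content should look in HTML.
HTML_RULES = {
    "title": "<h1>{}</h1>",
    "author": "<p><em>{}</em></p>",
    "section": "<h2>{}</h2>",
    "para": "<p>{}</p>",
}


def to_html(document):
    """Present the document as HTML."""
    return "\n".join(
        HTML_RULES[kind].format(escape(text)) for kind, text in document
    )


def to_plain_text(document):
    """Present the same document as plain text for a terminal."""
    lines = []
    for kind, text in document:
        if kind == "title":
            lines.append(text.upper())
        elif kind == "author":
            lines.append("by " + text)
        elif kind == "section":
            lines.append(text)
            lines.append("-" * len(text))
        else:
            lines.append(text)
    return "\n".join(lines)


def build_index(document):
    """Because sections are labeled, listing them needs no guessing."""
    return [text for kind, text in document if kind == "section"]


if __name__ == "__main__":
    print(to_html(DOCUMENT))
    print(to_plain_text(DOCUMENT))
    print(build_index(DOCUMENT))
```

The same source feeds both outputs, and the index falls out almost for free. With purely presentational markup, a program would have to guess which bold line was a section heading.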

Here is an interesting article I came across which discusses markup: Introducing Markup Languages.

This was implemented using SCRIPT for IBM.

Another discussion of using computers to document knowledge is presented in Wiki: A Systems Programming Productivity Tool.

An interesting crossover language that covers both types of markup might be Asciidoc, described above, since it seems to use Presentational markup but includes tools to convert to Descriptive markup.

7.1. Example of AsciiDoc Processing

So I am going to show you what I was able to produce from this same input file using different processing steps.

The original file for this talk looks like text.adoc

To create this web page I used the transform command 'asciidoc -a data-uri -a icons -a toc -a max-width=55em text.adoc', which produced this version of the HTML document.

To create a Docbook version of this file I used the command 'asciidoc -f /etc/asciidoc/docbook.conf --backend=docbook -o text.xml text.adoc' which produced the file text-docbook.xml after which I converted the output using a script I have named 'process.sh text-docbook'. This processing, when combined with the CSS style sheet, style-ob.css, produced this version text-docbook.html.
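
If you run these two asciidoc commands often, they could be collected in a small script. The following Python sketch simply wraps the two command lines quoted above; the variable and function names are my own, and it assumes asciidoc is installed and that text.adoc is in the current directory. It does not include the later 'process.sh' step, since that script is specific to my setup.

```python
import shutil
import subprocess

SOURCE_FILE = "text.adoc"

# The HTML command from the text, split into separate arguments.
HTML_COMMAND = [
    "asciidoc",
    "-a", "data-uri",
    "-a", "icons",
    "-a", "toc",
    "-a", "max-width=55em",
    SOURCE_FILE,
]

# The Docbook command from the text, split into separate arguments.
DOCBOOK_COMMAND = [
    "asciidoc",
    "-f", "/etc/asciidoc/docbook.conf",
    "--backend=docbook",
    "-o", "text.xml",
    SOURCE_FILE,
]


def run_all(commands):
    """Run each command in order; return False if asciidoc is not installed."""
    if shutil.which("asciidoc") is None:
        return False
    for command in commands:
        # check=True stops with an error if a conversion fails.
        subprocess.run(command, check=True)
    return True


if __name__ == "__main__":
    if not run_all([HTML_COMMAND, DOCBOOK_COMMAND]):
        print("asciidoc was not found on this system")
```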

8. Pandoc

Well, it seems that my exploration of markup languages has taken me to a new tool. Pandoc is not a markup language; instead, it is a tool to convert one markup into another.

 

If you need to convert files from one markup format into another, pandoc is your swiss-army knife. Pandoc can convert documents in markdown, reStructuredText, textile, HTML, or LaTeX to

HTML formats: XHTML, HTML5, and HTML slide shows using Slidy, S5, or DZSlides.

Word processor formats: Microsoft Word docx, OpenOffice/LibreOffice ODT, OpenDocument XML

Ebooks: EPUB

Documentation formats: DocBook, GNU TexInfo, Groff man pages

TeX formats: LaTeX, ConTeXt, LaTeX Beamer slides

PDF via LaTeX

Lightweight markup formats: Markdown, reStructuredText, AsciiDoc, MediaWiki markup, Emacs Org-Mode, Textile

Pandoc understands a number of useful markdown syntax extensions, including document metadata (title, author, date); footnotes; tables; definition lists; superscript and subscript; strikeout; enhanced ordered lists (start number and numbering style are significant); running example lists; delimited code blocks with syntax highlighting; smart quotes, dashes, and ellipses; markdown inside HTML blocks; and inline LaTeX. If strict markdown compatibility is desired, all of these extensions can be turned off.

 
  --About Pandoc

This is a tool, not a markup language. Its advantage is that it allows you to create multiple documents out of a single source file. From our perspective this is probably most useful for preparing web pages. Additionally, it allows us to customize a whole web site by recompiling the web pages.
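
As a rough sketch of the single source, many outputs idea, the Python function below builds one pandoc command line per output format. The file name talk.md and the function name are assumptions for illustration. The sketch relies on pandoc choosing the output format from the file extension given to -o, and on -s to request a complete standalone document.

```python
from pathlib import Path


def pandoc_commands(source, extensions=("html", "docx", "epub")):
    """Build one pandoc command per output format for a single source file."""
    commands = []
    for ext in extensions:
        # Same name as the source, with the extension swapped.
        output = str(Path(source).with_suffix("." + ext))
        commands.append(["pandoc", "-s", source, "-o", output])
    return commands


if __name__ == "__main__":
    # Print the commands instead of running them, so nothing is overwritten.
    for command in pandoc_commands("talk.md"):
        print(" ".join(command))
```

Each printed line can be pasted into a terminal, or passed to subprocess.run if you would rather have the script do the conversions itself.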

But before we get into how to structure the documents, let’s have a look at some examples.

8.1. Pandoc Examples

Now even though I could create some of these examples, it makes more sense to see what can be done by viewing some Pandoc Examples.