Rendering Everything as Text

by Philip Hollenback
Originally Published May 26, 2005 on LinuxDevCenter.

You can display any computer data as text. For many types of data, this is obvious: we've all seen HTML converted to text right in our web browsers. However, this idea can extend much further. Although the notion of converting all data to text may not sound immediately useful, it can be surprisingly powerful. Why wait for graphical web browsers or image editors to load before you find out what's in a file? With a few helper applications and scripts, you can quickly display textual information about any type of data.

Why Go to All This Work?

You might wonder why anyone would go through the trouble of figuring out how to display all files as text, especially in this day of 21-inch LCD screens. One big reason is a uniformity of experience: you never have to leave your mail reader to evaluate a file. Suppose that someone sends you a Word document. To learn anything at all about this file, you have to first save it and then run OpenOffice (or maybe even Microsoft Word). Then you have to sit around and wait for Word to load, wait for the file to load, and so forth. Finally, you find out that the attachment is just a copy of this month's TPS report with a note that you forgot the cover sheet.

If you have your text-based mail reader and the helper tools configured properly, you can bypass those extra steps. With the file immediately converted into text right inside your mail program, you can easily see that there's no need to run OpenOffice - you just need to send a reply email.

Another big advantage of the anything-to-text approach is that it helps you avoid viruses and other Internet shenanigans. Spammers often send bizarrely formatted HTML mail to disguise their actions. If you convert everything to text before viewing it, you can clearly see what is going on. You have also avoided your GUI and web browser, where viruses often attack.

The Tools

This technique revolves around the command line. In particular, converting files to text is most useful in command-line mail clients such as mutt. Of course, there's no reason you can't convert an attachment to text in a graphical mail program and then display it in an Xterm. Remember also that you have just the command line when you SSH into a remote system, unless you take additional steps such as forwarding X over SSH.

Additionally, while these ideas are oriented toward Unix and Linux in particular, with modifications they apply to other systems. In particular, all of this works on Mac OS X, which I use on a daily basis.

It's important to understand how mail clients process attachments. The determination of how to process an attachment is controlled by the mailcap file (first your private ~/.mailcap and then the shared /etc/mailcap). Every email attachment has a MIME type, which is assigned by the sending mail program. Whenever a MIME-aware program encounters a MIME type (such as text/html), it consults the mailcap to find a matching entry. Each line in the mailcap file constitutes one entry. If the MIME attachment is of type text/html, the matching mailcap entry might be:

text/html; view-html %s; copiousoutput; nametemplate=%s.html

which instructs the calling program (mutt) to use the view-html program on all text/html attachments.

That's just a quick overview of the mailcap mechanism. An excellent resource for more details is the mutt manual.
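To make the lookup concrete, here is a rough sketch of how a program might find the command for a MIME type. The mailcap_lookup function is a hypothetical helper written for illustration, not something mutt provides. It assumes one entry per line and ignores parts of the real format such as escaped semicolons and continuation lines.

```shell
# Hypothetical helper: print the command field of the first mailcap
# entry that matches a MIME type, either exactly or through a
# "major/*" wildcard.
mailcap_lookup() {
    mime_type=$1
    mailcap_file=${2:-$HOME/.mailcap}
    awk -F';' -v want="$mime_type" '
        /^[ \t]*(#|$)/ { next }              # skip comments and blank lines
        {
            type = $1
            cmd = $2
            gsub(/^[ \t]+|[ \t]+$/, "", type)   # trim the type field
            gsub(/^[ \t]+|[ \t]+$/, "", cmd)    # trim the command field
            major = want
            sub(/\/.*/, "", major)              # "image/png" becomes "image"
            if (type == want || type == major "/*") {
                print cmd
                exit
            }
        }
    ' "$mailcap_file"
}
```

Calling mailcap_lookup text/html against a file containing the entry above would print view-html %s. The caller would then substitute the attachment's file name for %s.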

A Basic Example

Let's start with the HTML attachment. This is easy to convert to text since it already is text with additional markup. You could send the raw HTML out to the console as a start. However, you can do better. Sending the HTML through a text-based web browser can preserve some of the original formatting such as paragraphs and tables. Either the standard Lynx text web browser or a more sophisticated text browser such as w3m will work fine. The view-html script from the previous section might look like:

if type w3m >/dev/null 2>&1; then
    # turn non-breaking spaces (octal 240) into ordinary spaces
    w3m -T text/html -cols 80 -dump "$1" | tr '\240' ' '
elif type lynx >/dev/null 2>&1; then
    lynx -dump -force_html /dev/fd/0 <"$1"
else
    echo "$0: can't find w3m or lynx" >&2
    exit 1
fi

The idea is to call either w3m or Lynx and tell that program to dump the rendered output to stdout as text. That odd /dev/fd/0 file in the Lynx command line is necessary to trick Lynx into accepting data on standard input - it says to open file descriptor 0, which is stdin.
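The /dev/fd/0 trick is not specific to Lynx. As a small illustration, assuming a system that provides /dev/fd (Linux and Mac OS X both do), any program that insists on a file name can be pointed at its own standard input. The function name here is made up for the example.

```shell
# Read standard input by naming it as a file, the same way the
# Lynx branch of the script does.
read_stdin_by_name() {
    cat /dev/fd/0
}

# The text arrives on a pipe, but cat opens it as if it were a file.
printf 'hello from stdin\n' | read_stdin_by_name
```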

Now using the mailcap entry from the last section and the script above, mutt can display any HTML as text. Were you wondering what that copiousoutput thing in the mailcap is about? That flag tells the calling program that the results from the mailcap entry will be text output, with no interaction necessary. Entries without that flag may require user interaction; for example, if you sent an image to a graphical image viewer. Mutt can use this information to display the text inline while you are viewing a message, instead of making you go to a separate screen. To enable this, add auto_view entries to your ~/.muttrc config file for each MIME type you wish to view as inline text, like this:

auto_view text/html application/msword

Keep in mind that many data formats are easy to convert into HTML, so this recipe is a useful building block for other conversions.

Extracting Text from Microsoft Files

The closed nature of the Microsoft programs and their associated data files makes it highly challenging to extract text from them. However, plenty of people have worked diligently on these data files to achieve a large measure of success. It is possible to extract the text from Word, Excel, and PowerPoint files, thanks to wvHtml, xlhtml, and ppthtml. The wvHtml program is part of the wvWare suite. The other two programs are part of the xlhtml package.

I said previously that the HTML-to-text conversion is a useful stepping-stone. Here is an example of that: the tools for the Microsoft files all convert to HTML. By piping the output of (for example) xlhtml through a text-mode HTML viewer, you can often obtain very readable text. Here's the sample script, similar to the one above for HTML to text:

if type w3m >/dev/null 2>&1 ; then
    xlhtml "$1" 2>/dev/null | w3m -T text/html -cols 80 -dump | tr '\240' ' '
elif type lynx >/dev/null 2>&1 ; then
    xlhtml "$1" 2>/dev/null | lynx -dump -force_html /dev/fd/0
else
    echo "$0: can't find w3m or lynx" >&2
    exit 1
fi

Again, it's good to use w3m if you have it. This is particularly true with Excel files, as the table rendering in w3m is so much better than the rendering in Lynx.

The process is much the same for Microsoft Word files, but you have to play some tricks with wvHtml to make it send the file to stdout:

if type w3m >/dev/null 2>&1
  then
    wvHtml "$1" /dev/fd/1 2>/dev/null | w3m -T text/html -cols 80 -dump |\
      tr '\240' ' '
elif type lynx >/dev/null 2>&1
  then
    wvHtml "$1" /dev/fd/1 2>/dev/null | lynx -dump -force_html /dev/fd/0
else
    echo "$0: can't find w3m or lynx" >&2
    exit 1
fi

The basic approach for data files that are mostly text based is pretty simple: find a utility to convert the file to HTML and convert that HTML to text. Again, the mailcap file determines how to process a file (or MIME attachment). Here are the entries for the Microsoft file formats:

application/msexcel; view-excel %s; copiousoutput; nametemplate=%s.xls
application/msword; view-msword %s; copiousoutput; nametemplate=%s.doc

The nametemplate= entry ensures that the file goes to the conversion program with a proper file extension. Some programs insist on the correct extensions for files.

One big annoyance you will see quite often with Microsoft file formats is a MIME attachment with type application/octet-stream. Basically, that is the default MIME type. If the sending program can't (or won't) figure out what kind of file it is sending, it can just throw up its hands and say, "Hey, here's a stream of bytes - you figure it out."

Using the power of Unix/Linux on the receiving end, you can fix that problem. The octet-filter script uses the file extension and calls the file utility to reconstruct the proper MIME type. Then it hands the file off to the right helper. The proper mailcap entry is:

application/octet-stream; octet-filter %s;copiousoutput
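The octet-filter script itself is not reproduced here. The following is only a hypothetical sketch of the idea. It assumes the helper names used in the mailcap entries earlier (view-msword, view-excel, view-html), and it falls back to printing the description from file(1) instead of mapping it back to a MIME type.

```shell
# Hypothetical: map a file name to a helper script by its extension.
# Prints nothing when the extension is not recognized.
guess_helper() {
    name=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case $name in
        *.doc)          echo view-msword ;;
        *.xls)          echo view-excel ;;
        *.htm|*.html)   echo view-html ;;
        *)              ;;
    esac
}

# Hand the attachment to the matching helper if it is installed.
# Otherwise show what file(1) thinks the attachment is.
octet_filter() {
    helper=$(guess_helper "$1")
    if [ -n "$helper" ] && type "$helper" >/dev/null 2>&1; then
        "$helper" "$1"
    else
        file -b "$1"
    fi
}
```

Checking the extension first is cheap and handles the common case of a correctly named attachment. The file fallback at least tells you what kind of data arrived.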

Several other Microsoft formats worth mentioning are RTF files and TNEF attachments. RTF (Rich Text Format) is a simple text-based markup language, and TNEF is a mechanism that Microsoft servers use to encapsulate MIME data for Microsoft clients. Again, there are utilities to handle both of these, such as TNEF, the TNEF decoder, and rtfreader, an RTF-to-text converter. The mailcap entries are:

application/ms-rtf; rtfreader %s; copiousoutput
application/ms-tnef; tnef2txt %s; copiousoutput

Falling Back to File Manifests

What is the textual representation of a ZIP file? The best answer I have come up with is a file manifest. This is the answer for any MIME attachment that is a collection of files. Examples include .zip, .tar, and .jar files. In each case, you can run the corresponding command on the attachment to list the files within. This is certainly better than doing nothing, because it gives the user a chance to see what he's downloaded before opening it up. Here's the mailcap entry to generate a manifest for a .tar file:

application/x-tar; tar -tf - ; copiousoutput;
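The same idea extends to the other archive types with a small wrapper. This is a sketch, not one of the original scripts. The list_archive name is invented, and it assumes that tar, gzip, and Info-ZIP's unzip are installed.

```shell
# Hypothetical manifest helper: choose a listing command by extension.
list_archive() {
    case $1 in
        *.tar)            tar -tf "$1" ;;
        *.tar.gz|*.tgz)   gzip -dc "$1" | tar -tf - ;;
        *.zip|*.jar)      unzip -l "$1" ;;      # a .jar file is a ZIP archive
        *)
            echo "$0: don't know how to list $1" >&2
            return 1
            ;;
    esac
}
```

Because the wrapper picks the tool by extension, a mailcap entry that calls it would want a nametemplate field so that the temporary file keeps its suffix.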

The manifest idea also applies to images (although there's a much more creative approach in the next section). At the very least, you can extract some basic data from the image and display it. Typically this includes the file size, number of colors, and embedded comments. The identify program (which comes with the ImageMagick collection of image tools) prints the following information about a JPEG file:

Format: JPEG (Joint Photographic Experts Group JFIF format)
Geometry: 195x195
Class: DirectClass
Type: true color
Depth: 8 bits-per-pixel component
Colors: 25594
Resolution: 28x28 pixels/centimeter
Filesize: 12.1k
Interlace: None
Background Color: grey100
Border Color: #DFDFDF
Matte Color: grey74
Dispose: Undefined
Iterations: 0
Compression: JPEG
comment: Test Image
signature: 7e546210e516fd2e870ee9df47f0bfc15a9ec0d431c5abeb5a92cf0e811f9f2a
Tainted: False
User Time: 0.0u
Elapsed Time: 0:01

Again, that's better than nothing, right? Here's the mailcap entry:

image/*;identify -verbose %s;copiousoutput

You can turn MP3 files into text in a similar way. The ID3 standard defines a set of text tags such as artist and title that can be embedded into an MP3 file. A utility such as id3v2 can extract this information and display it as text:

$ id3v2 -l Yeah_Yeah_Yeahs-Machine.mp3
TT2 (Title/songname/content description): Machine
TP1 (Lead performer(s)/Soloist(s)): Yeah Yeah Yeahs
TAL (Album/Movie/Show title): Machine
TYE (Year): 2002
TCO (Content type): Rock (17)

With a little formatting, that makes a great text representation of the file.
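One possible bit of formatting, offered as a sketch: strip the frame codes and the alternative descriptions so that only a short label and the value remain. The format_id3 name is made up, and the pattern assumes that tag lines look like the sample output above.

```shell
# Hypothetical formatter: turn lines such as
#   TT2 (Title/songname/content description): Machine
# into "Title: Machine". Lines that are not tag lines are dropped.
format_id3() {
    sed -nE 's#^[A-Z0-9]+ \(([^/]*)[^:]*\): #\1: #p'
}
```

You could then run id3v2 -l on a file and pipe the result through format_id3 before it reaches the mail reader.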

A More Complicated Technique: Images to Text

A summary of an image is pretty interesting. Wouldn't it be even better if you could see some sort of textual representation of the image itself? This final conversion does just that. It's more a cute hack than a real tool, but it does illustrate my mantra of anything-to-text.

The trick here is to use the aalib ASCII-art library. aalib is a graphics driver that displays images using only ASCII characters, in the style of the old line-printer art. The aalib algorithms are smart enough that the result is something that looks vaguely like the original image from a few feet away.

The viewer that uses aalib to convert images to text is asciiview. Unfortunately, it doesn't work exactly like a filter, so a bit of scripting is necessary. This Perl script:

#!/usr/bin/perl
$ARGV[0] || die "must supply image file";

open(ASCII,
  "echo q | asciiview -driver stdout -kbddriver stdin $ARGV[0] 2>/dev/null |")
  or die "failed to open $ARGV[0]";

while(<ASCII>) { last if /\x0C/; }

while(<ASCII>)
{
    last if /\x0C/;
    print;
}
close ASCII;

takes any image and displays it as text. Figure 1 shows the original image.

Here's the result from asciiview:

|=+=|==++)SZZZZZXXXd#Z211YSYSZ####qpoZX#mqmXA2+:.:-:S2XXX(
=%vxliliii|=*UZXmX##Zexl*???Tqgu*S1dX#ZmZ#XXZXma::..::::!S(
)xXa%|||++|=a3XXmZ#UXvissaauSixYWmApdXmX#Z#mXZXXXoi,..:==||;
=xuXXi====<xuSXYXmZXoIl*?SouyXmau3ZZm2XXXZdXZS2nnXn>-.=+|=+`
=vnoov+=;:<xX1x1wdZX2nxlss%xi?X##mXqon22SnZXXXoxnvnn=::|;==;
<lv1i|=;:=sxlx3Sx2XXXXXXnoX1XonaoXZ#ZZZoox12SonnxYnnn>;=+=:
)=;:-=++|iiiiIiixXXZ#Z#Z#ZXXXXXXXXSXS2oXnx%{inxS1x1nxli===;.
:==<=xi|i||||iivXXZUZ###UUZ#ZUZZZZZXXXZXooxxivIliiIIxIxi;::.
:||xnns<n|||||xnXXZZ#Z#Z#U##Z#########Z#ZZXXonlii|||iino>==.
:|+}IoodXo==+ioXXZZ#Z################Z##Z#ZZZoi||||>+<XSoc:.
:+auXXZZXX===|oSXZZZZUZUZZZUZ#ZZZZXXSSX#ZZ#ZUXi|+=+||a1XXoc:
=XSnXoXSXXcii|*l*!!++++IXSXXXX2I|=+++!!!YSXZZl=iii1o2onS2o;
)2noXXonxXXss:===...:....-+S1::;:..:.:==+|=suai|xXoZX2o2S2(
=nnInSXXoxxXo==|+=;;:.:===d#mmc;::-:::=||||xXZXnxoXZ22ooSnS(
)Ioox*noXXsIS%oaaa>||><saX##m##qa%|=|=|<aau#ZZZXd#ZXn2nzxxS(
)lv2on%InSoIl1XXXXZ#ZZXXXX##m#Z#ZXXZ#Z#UZZ#ZUZZUZXZSnnos++<;
)lvxISosiI1n||2XXZXX2noXYSSXXXSYSXXoX1ZXZZZZZZUZZZSzvxni:=<;
)xx}xI1ns<||{io22onvuXZXooxxxnuXZ#ZZXonnSXXXZZZX1nn1xvni:=+;
=nnncxlx%i|;:;3o2SoXX????Y!?Y??YY?YYY3XoXXXXZZZXoviun1li=;=:
)vosi1|x%l=::=)n2SXXXoc|YISXXZ#X2nlisuXZZXZXXXXonn21Iil>::;.
=vnonx%s<ii>||<i12SSSXXaaxx12121XXXZZZZXZXZXXXZZ2IllvIx(..:.
=vvnnvvxiii==+|||{1XSXXSSXXoXXXXXXZZZZZZZXS2SXZZm,:.-.:<;::.
-{nxvx1%i|i=;++==+|i*XXXZZZ##Z#U#Z#ZZZZS11ndXZ#ZZ#a;:.=x|+=,
.:+1nIx1ii|=.==|+|||i|*S2XXXXXXXXXXYY1IxxSSXZUZZ#U##a=unl%%;
.:;=*|l*i|==||=:;==+==|iI1IIIIIIIllIvvooSXXZZZUZ#ZZ(x1svno(

As you can see, the converted image won't win any awards for fidelity, but it is vaguely similar to the original. The key point here is that you could quickly evaluate the text image and decide whether it was worth it to open the original in a graphical viewer. For example, some people send email with tiled background images. When you are using a text-based mail reader, you can never tell what this image is until you open it in a viewer. When you see that it's a piece of notebook paper with pink flowers on it, you realize what a waste of effort it was to open it. Maybe if you could take a quick look at the ASCII version of the image you could save some time.

I should emphasize that it takes way too much processing power to create the ASCII image. This makes the conversion of images into ASCII more of a technology demonstration than a useful tool, unless you have a lot of cycles to burn.
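The filtering step in the Perl script earlier in this section is simple: it keeps only the lines between the first two lines that contain a form-feed character (\x0C). As a sketch, the same filter can be written as a shell function around awk. The between_formfeeds name is invented for the example.

```shell
# Print only the lines that fall between the first and second line
# containing a form feed, mirroring the two loops in the Perl script.
between_formfeeds() {
    awk '
        /\f/ { if (++seen == 2) exit; next }   # boundary lines are not printed
        seen == 1 { print }
    '
}
```

With it, the asciiview pipeline from the Perl script could end in | between_formfeeds instead of being wrapped in Perl.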

Conclusion

There's no reason to limit yourself to the GUI world! The console screen may at first seem like a step backward in technology. However, you may find (as I have) that you can perform many tasks more efficiently without the graphical distraction. If you're a mutt user, you may find my mutt scripts useful.

As I said in the introduction, you can convert any computer data to text in some way. With text-based data such as HTML or Word documents, the conversion is quite faithful to the original. However, even with completely nontext data, some text representation is always possible. You can check file manifests and text tags for text descriptions. Finally, images can become surprisingly realistic ASCII art (for some definitions of realistic).

The quest then becomes the search for the best text representation of each type of data. For example, is there a way to describe music as text? We already do that with scores. Perhaps it will be possible to use some pattern matching and web search to find the lyrics for a song.

With a little clever application of mailcap entries and helper scripts, you can convert any data into some form of textual representation. The end result is a more efficient work environment as you avoid the overhead of the GUI. As an added benefit, you might even be a little more secure as you reduce your exposure to mail viruses.

