!!! Rendering Everything as Text
by Philip Hollenback%%%
Originally Published May 26, 2005 on [LinuxDevCenter|http://linuxdevcenter.com/pub/a/linux/2005/05/26/textonly.html].

You can display any computer data as text. For many types of data, this is obvious: we've all seen HTML converted to text right in our web browsers. However, this idea can extend much further.

Although the notion of converting all data to text may not sound immediately useful, it can be surprisingly powerful. Why wait for graphical web browsers or image editors to load before you find out what's in a file? With a few helper applications and scripts, you can quickly display textual information about any type of data.

!! Why Go to All This Work?

You might wonder why anyone would go through the trouble of figuring out how to display all files as text, especially in this day of 21-inch LCD screens. One big reason is a uniformity of experience: you never have to leave your mail reader to evaluate a file.

Suppose that someone sends you a Word document. To learn anything at all about this file, you have to first save it and then run ~OpenOffice (or maybe even Microsoft Word). Then you have to sit around and wait for Word to load, wait for the file to load, and so forth. Finally, you find out that the attachment is just a copy of this month's TPS report with a note that you forgot the cover sheet.

If you have your text-based mail reader and the helper tools configured properly, you can bypass those extra steps. With the file immediately converted into text right inside your mail program, you can easily see that there's no need to run ~OpenOffice - you just need to send a reply email.

Another big advantage of the anything-to-text approach is that it helps you avoid viruses and other Internet shenanigans. Spammers often send bizarrely formatted HTML mail to disguise their actions. If you convert everything to text before viewing it, you can clearly see what is going on.
You have also avoided your GUI and web browser, where viruses often attack.

!! The Tools

This technique revolves around the command line. In particular, converting files to text is most useful in command-line mail clients such as [mutt|http://www.mutt.org/]. Of course, there's no reason you can't convert an attachment to text in a graphical mail program and then display it in an Xterm. Remember also that you have just the command line when you SSH into a remote system, unless you take additional steps such as forwarding X over SSH.

Additionally, while these ideas are oriented toward Unix and Linux in particular, with modifications they apply to other systems. In particular, all of this works on Mac OS X, which I use on a daily basis.

It's important to understand how mail clients process attachments. The mailcap file determines how to process an attachment (first your private _~~/.mailcap_ and then the shared _/etc/mailcap_). Every email attachment has a MIME type, which is assigned by the sending mail program. Whenever a MIME-aware program encounters a MIME type (such as <code>text/html</code>), it consults the mailcap to find a matching entry. Each line in the mailcap file constitutes one entry. If the MIME attachment is of type <code>text/html</code>, the matching mailcap entry might be:

<pre>
text/html; view-html %s; copiousoutput; nametemplate=%s.html
</pre>

which instructs the calling program (mutt) to use the <code>view-html</code> program on all <code>text/html</code> attachments.

That's just a quick overview of the mailcap mechanism. An excellent resource for more details is the [mutt manual|http://www.mutt.org/doc/manual/].

!! A Basic Example

Let's start with the HTML attachment. This is easy to convert to text since it already is text with additional markup. You could send the raw HTML out to the console as a start. However, you can do better.
Sending the HTML through a text-based web browser can preserve some of the original formatting such as paragraphs and tables. Either the standard Lynx text web browser or a more sophisticated text browser such as w3m will work fine. The <code>view-html</code> script from the previous section might look like:

<pre>
#!/bin/sh
if type w3m >/dev/null 2>&1; then
    # Octal 240 is the Latin-1 non-breaking space; map it to a plain space.
    w3m -T text/html -cols 80 -dump "$1" | tr '\240' ' '
elif type lynx >/dev/null 2>&1; then
    lynx -dump -force_html /dev/fd/0 <"$1"
else
    echo "$0: can't find w3m or lynx" >&2
    exit 1
fi
</pre>

The idea is to call either w3m or Lynx and tell that program to dump the rendered output to stdout as text. That odd <code>/dev/fd/0</code> file in the Lynx command line is necessary to trick Lynx into accepting data on standard input - it says to open file descriptor 0, which is stdin.

Now using the mailcap entry from the last section and the script above, mutt can display any HTML as text.

Were you wondering what that <code>copiousoutput</code> thing in the mailcap is about? That flag tells the calling program that the results from the mailcap entry will be text output, with no interaction necessary. Entries without that flag may require user interaction; for example, if you sent an image to a graphical image viewer. Mutt can use this information to display the text inline while you are viewing a message, instead of making you go to a separate screen. To enable this, add <code>auto_view</code> entries to your _~~/.muttrc_ config file for each MIME type you wish to view as inline text, like this:

<pre>
auto_view text/html application/msword
</pre>

Keep in mind that many data formats are easy to convert into HTML, so this recipe is a useful building block for other conversions.

!! Extracting Text from Microsoft Files

The closed nature of the Microsoft programs and their associated data files makes it highly challenging to extract text from them.
However, plenty of people have worked diligently on these data files to achieve a large measure of success. It is possible to extract the text from Word, Excel, and PowerPoint files, thanks to <code>wvHtml</code>, <code>xlhtml</code>, and <code>ppthtml</code>. The <code>wvHtml</code> program is part of the [wvWare|http://wvware.sourceforge.net/] suite. The other two programs are part of the <code>[xlhtml|http://chicago.sourceforge.net/xlhtml/]</code> utility.

I said previously that the HTML-to-text conversion is a useful stepping-stone. Here is an example of that; the tools for the Microsoft files all convert to HTML. By piping the output of (for example) <code>xlhtml</code> through a text-mode HTML viewer, you can often obtain very readable text. Here's the sample script, similar to the one above for HTML to text:

<verbatim>
#!/bin/sh
if type w3m >/dev/null 2>&1; then
    xlhtml "$1" 2>/dev/null | w3m -T text/html -cols 80 -dump | tr '\240' ' '
elif type lynx >/dev/null 2>&1; then
    xlhtml "$1" 2>/dev/null | lynx -dump -force_html /dev/fd/0
else
    echo "$0: can't find w3m or lynx" >&2
    exit 1
fi
</verbatim>

Again, it's good to use w3m if you have it. This is particularly true with Excel files, as the table rendering in w3m is so much better than the rendering in Lynx.

The process is much the same for Microsoft Word files, but you have to play some tricks with wvHtml to make it send the file to stdout:

<verbatim>
#!/bin/sh
if type w3m >/dev/null 2>&1
then
    # /dev/fd/1 as the output file makes wvHtml write to stdout.
    wvHtml "$1" /dev/fd/1 2>/dev/null | w3m -T text/html -cols 80 -dump |\
        tr '\240' ' '
elif type lynx >/dev/null 2>&1
then
    wvHtml "$1" /dev/fd/1 2>/dev/null | lynx -dump -force_html /dev/fd/0
else
    echo "$0: can't find w3m or lynx" >&2
    exit 1
fi
</verbatim>

The basic approach for data files that are mostly text based is pretty simple: find a utility to convert the file to HTML and convert that HTML to text.

Again, the mailcap file determines how to process a file (or MIME attachment).
Here are the entries for the Microsoft file formats:

<pre>
application/msexcel; view-excel %s; copiousoutput; nametemplate=%s.xls
application/msword; view-msword %s; copiousoutput; nametemplate=%s.doc
</pre>

The <code>nametemplate=</code> entry ensures that the file goes to the conversion program with a proper file extension. Some programs insist on the correct extensions for files.

One big annoyance you will see quite often with Microsoft file formats is a MIME attachment with type <code>application/octet-stream</code>. Basically, that is the default MIME type. If the sending program can't (or won't) figure out what kind of file it is sending, it can just throw up its hands and say, "Hey, here's a stream of bytes - you figure it out."

Using the power of Unix/Linux on the receiving end, you can fix that problem. The <code>[octet-filter|http://www.davep.org/mutt/]</code> script uses the file extension and calls the <code>file</code> utility to reconstruct the proper MIME type. Then it hands the file off to the right helper. The proper mailcap entry is:

<pre>
application/octet-stream; octet-filter %s; copiousoutput
</pre>

Two other Microsoft formats worth mentioning are RTF files and TNEF attachments. RTF (Rich Text Format) is a simple text-based markup language, and TNEF is a mechanism that Microsoft servers use to encapsulate MIME data for Microsoft clients. Again, there are utilities to handle both of these, such as [TNEF|http://tnef.sourceforge.net/], the TNEF decoder, and [rtfreader|http://www.fiction.net/blong/programs/#rtf], an RTF-to-text converter. The mailcap entries are:

<pre>
application/ms-rtf; rtfreader %s; copiousoutput
application/ms-tnef; tnef2txt %s; copiousoutput
</pre>

!! Falling Back to File Manifests

What is the textual representation of a ZIP file? The best answer I have come up with is a file manifest. This is the answer for any MIME attachment that is a collection of files. Examples include .zip, .tar, and .jar files.
In each case, you can run the corresponding command on the attachment to list the files within. This is certainly better than doing nothing, because it gives the user a chance to see what he's downloaded before opening it up. Here's the mailcap entry to generate a manifest for a .tar file:

<pre>
application/x-tar; tar -tf - ; copiousoutput;
</pre>

The manifest idea also applies for images (although there's a much more creative approach in the next section). At the very least, you can extract some basic data from the image and display it. Typically this includes the file size, number of colors, and embedded comments. The <code>identify</code> program (which comes with the ~ImageMagick collection of image tools) prints the following information about a JPEG file:

<verbatim>
Format: JPEG (Joint Photographic Experts Group JFIF format)
Geometry: 195x195
Class: DirectClass
Type: true color
Depth: 8 bits-per-pixel component
Colors: 25594
Resolution: 28x28 pixels/centimeter
Filesize: 12.1k
Interlace: None
Background Color: grey100
Border Color: #DFDFDF
Matte Color: grey74
Dispose: Undefined
Iterations: 0
Compression: JPEG
comment: Test Image
signature: 7e546210e516fd2e870ee9df47f0bfc15a9ec0d431c5abeb5a92cf0e811f9f2a
Tainted: False
User Time: 0.0u
Elapsed Time: 0:01
</verbatim>

Again, that's better than nothing, right? Here's the mailcap entry:

<pre>
image/*; identify -verbose %s; copiousoutput
</pre>

You can turn MP3 files into text in a similar way. The ID3 standard defines a set of text tags such as artist and title that can be embedded into an MP3 file. A utility such as [id3v2|http://id3v2.sf.net/] can extract this information and display it as text:

<verbatim>
$ id3v2 -l Yeah_Yeah_Yeahs-Machine.mp3
TT2 (Title/songname/content description): Machine
TP1 (Lead performer(s)/Soloist(s)): Yeah Yeah Yeahs
TAL (Album/Movie/Show title): Machine
TYE (Year): 2002
TCO (Content type): Rock (17)
</verbatim>

With a little formatting, that makes a great text representation of the file.
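
As a rough illustration of the "little formatting" step, here is a minimal sketch of a filter that tidies <code>id3v2 -l</code> output. The function name <code>format_id3</code> is hypothetical, and the sketch assumes every tag line has the shape shown above (a frame ID, a parenthesized description, a colon, then the value); any line that doesn't match, such as a header line, is simply dropped.

```shell
#!/bin/sh
# format_id3 (hypothetical helper): read `id3v2 -l` style output on stdin
# and print one short "Label: value" line per tag.
# Assumed input shape: "TT2 (Title/songname/content description): Machine"
format_id3() {
    awk '
    /^[A-Z0-9]+ \(.*\): / {
        sep   = index($0, "): ")                  # end of the description
        value = substr($0, sep + 3)               # everything after "): "
        open  = index($0, "(")                    # start of the description
        label = substr($0, open + 1, sep - open - 1)
        sub(/\/.*/, "", label)                    # keep the first alternative only
        print label ": " value
    }'
}
```

A helper script called from a mailcap entry could then run something like <code>id3v2 -l "$1" | format_id3</code>, which would reduce the sample above to lines such as <code>Title: Machine</code>.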
!! A More Complicated Technique: Images to Text

A summary of an image is pretty interesting. Wouldn't it be even better if you could see some sort of textual representation of the image itself? This final conversion does just that. It's more a cute hack than a real tool, but it does illustrate my mantra of anything-to-text.

The trick here is to use the [aalib|http://aa-project.sourceforge.net/aalib/] ASCII-art library. aalib is a graphics driver that displays images using only ASCII characters, in the style of the old line-printer art. The aalib algorithms are smart enough that the result is something that looks vaguely like the original image from a few feet away.

The viewer that uses aalib to convert images to text is <code>asciiview</code>. Unfortunately, it doesn't work exactly like a filter, so a bit of scripting is necessary. This Perl script:

<verbatim>
#!/usr/bin/perl
$ARGV[0] || die "must supply image file";
open(ASCII, "echo q | asciiview -driver stdout -kbddriver stdin $ARGV[0] 2>/dev/null |")
    or die "failed to open $ARGV[0]";
# Skip everything up to the first form feed, then print up to the next one.
while (<ASCII>) { last if /\x0C/; }
while (<ASCII>) { last if /\x0C/; print; }
close ASCII;
</verbatim>

takes any image and displays it as text. Figure 1 shows the original image.

[http://www.hollenback.net/writings/TestImage.jpg]

Here's the result from <code>asciiview</code>:

<verbatim>
|=+=|==++)SZZZZZXXXd#Z211YSYSZ####qpoZX#mqmXA2+:.:-:S2XXX(
=%vxliliii|=*UZXmX##Zexl*???Tqgu*S1dX#ZmZ#XXZXma::..::::!S(
)xXa%|||++|=a3XXmZ#UXvissaauSixYWmApdXmX#Z#mXZXXXoi,..:==||;
=xuXXi====<xuSXYXmZXoIl*?SouyXmau3ZZm2XXXZdXZS2nnXn>-.=+|=+`
=vnoov+=;:<xX1x1wdZX2nxlss%xi?X##mXqon22SnZXXXoxnvnn=::|;==;
<lv1i|=;:=sxlx3Sx2XXXXXXnoX1XonaoXZ#ZZZoox12SonnxYnnn>;=+=:
)=;:-=++|iiiiIiixXXZ#Z#Z#ZXXXXXXXXSXS2oXnx%{inxS1x1nxli===;.
:==<=xi|i||||iivXXZUZ###UUZ#ZUZZZZZXXXZXooxxivIliiIIxIxi;::.
:||xnns<n|||||xnXXZZ#Z#Z#U##Z#########Z#ZZXXonlii|||iino>==.
:|+}IoodXo==+ioXXZZ#Z################Z##Z#ZZZoi||||>+<XSoc:.
:+auXXZZXX===|oSXZZZZUZUZZZUZ#ZZZZXXSSX#ZZ#ZUXi|+=+||a1XXoc:
=XSnXoXSXXcii|*l*!!++++IXSXXXX2I|=+++!!!YSXZZl=iii1o2onS2o;
)2noXXonxXXss:===...:....-+S1::;:..:.:==+|=suai|xXoZX2o2S2(
=nnInSXXoxxXo==|+=;;:.:===d#mmc;::-:::=||||xXZXnxoXZ22ooSnS(
)Ioox*noXXsIS%oaaa>||><saX##m##qa%|=|=|<aau#ZZZXd#ZXn2nzxxS(
)lv2on%InSoIl1XXXXZ#ZZXXXX##m#Z#ZXXZ#Z#UZZ#ZUZZUZXZSnnos++<;
)lvxISosiI1n||2XXZXX2noXYSSXXXSYSXXoX1ZXZZZZZZUZZZSzvxni:=<;
)xx}xI1ns<||{io22onvuXZXooxxxnuXZ#ZZXonnSXXXZZZX1nn1xvni:=+;
=nnncxlx%i|;:;3o2SoXX????Y!?Y??YY?YYY3XoXXXXZZZXoviun1li=;=:
)vosi1|x%l=::=)n2SXXXoc|YISXXZ#X2nlisuXZZXZXXXXonn21Iil>::;.
=vnonx%s<ii>||<i12SSSXXaaxx12121XXXZZZZXZXZXXXZZ2IllvIx(..:.
=vvnnvvxiii==+|||{1XSXXSSXXoXXXXXXZZZZZZZXS2SXZZm,:.-.:<;::.
-{nxvx1%i|i=;++==+|i*XXXZZZ##Z#U#Z#ZZZZS11ndXZ#ZZ#a;:.=x|+=,
.:+1nIx1ii|=.==|+|||i|*S2XXXXXXXXXXYY1IxxSSXZUZZ#U##a=unl%%;
.:;=*|l*i|==||=:;==+==|iI1IIIIIIIllIvvooSXXZZZUZ#ZZ(x1svno(
</verbatim>

As you can see, the converted image won't win any awards for fidelity, but it is vaguely similar to the original. The key point here is that you could quickly evaluate the text image and decide whether it was worth it to open the original in a graphical viewer.

For example, some people send email with tiled background images. When you are using a text-based mail reader, you can never tell what this image is until you open it in a viewer. When you see that it's a piece of notebook paper with pink flowers on it, you realize what a waste of effort it was to open it. Maybe if you could take a quick look at the ASCII version of the image you could save some time.

I should emphasize that it takes way too much processing power to create the ASCII image. This makes the conversion of images into ASCII more of a technology demonstration than a useful tool, unless you have a lot of cycles to burn.

!! Conclusion

There's no reason to limit yourself to the GUI world! The console screen may at first seem like a step backward in technology.
However, you may find (as I have) that you can perform many tasks more efficiently without the graphical distraction. If you're a mutt user, you may find [my mutt scripts|http://www.hollenback.net/index.php/CollectionOfScripts] useful.

As I said in the introduction, you can convert any computer data to text in some way. With text-based data such as HTML or Word documents, the conversion is quite faithful to the original. However, even with completely nontext data, some text representation is always possible. You can check file manifests and text tags for text descriptions. Finally, images can become surprisingly realistic ASCII art (for some definitions of realistic).

The quest then becomes the search for the best text representation of each type of data. For example, is there a way to describe music as text? We already do that with scores. Perhaps it will be possible to use some pattern matching and web search to find the lyrics for a song.

With a little clever application of mailcap entries and helper scripts, you can convert any data into some form of textual representation. The end result is a more efficient work environment as you avoid the overhead of the GUI. As an added benefit, you might even be a little more secure as you reduce your exposure to mail viruses.

-----
CategoryGeekStuff