TextConvert
A Tool to ease the problems of converting flat file libraries to ACEDB Format

Version 1.3

Joachim Baumann
Otto Ritter (First Specification)

University of Stuttgart

Institute for Parallel and Distributed High Performance Systems (IPVR)

Breitwiesenstrasse 20-22, D-70565 Stuttgart

EMail: joachim.baumann@informatik.uni-stuttgart.de

Overview

TextConvert is a program that takes as input a description file in a format that resembles the format of AWK scripts, reads a data stream, and produces another data stream according to the patterns and attached code (the so-called actions) in the description file. The patterns are written as Perl regular expressions, and the actions are formulated as Perl source code enriched with additional commands to ease the conversion of text-based data. The additional commands support the ACEDB [ACEDB] format.

Acknowledgements

We are very grateful to the IGD group in Heidelberg for their constant help. Furthermore we express our thanks to Professor Rothermel and the department of Distributed Systems for their advice and encouragement.

This work has been funded by the DKFZ (The German Cancer Research Centre), and is part of the IGD development [IGD] under the CEU contract GENE-CT93-0003.

1 Tutorial

1.1 Bits and Pieces

TextConvert reads an input data stream line by line (each line is terminated with a newline character), tries to match the patterns that are given in the description file, and if one matches, executes the attached action (the code that follows the pattern). After the action has been executed, the next pattern is tried on the same line, until all patterns are tried. The next line is read, and the pattern matching begins again.
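The loop described above can be modelled in a few lines. The following is only a rough sketch in Python, not TextConvert's actual implementation: the names run and rules and the sample input lines are hypothetical, and Python's regular expressions stand in for Perl's.

```python
import re

def run(rules, lines):
    """Try every (pattern, action) pair on each input line, in file order."""
    output = []
    for line in lines:
        field = line.rstrip("\n")          # the trailing newline is removed
        for pattern, action in rules:      # every pattern is tried on the same line
            if re.search(pattern, field):
                action(field, output)      # execute the attached action
    return output

# Two hypothetical pattern/action pairs, in the order of the description file.
rules = [
    (r"RNA", lambda field, out: out.append("Found RNA")),
    (r"^ID", lambda field, out: out.append("> " + field)),
]

result = run(rules, ["ID   X56734; SV 1; linear; mRNA\n", "XX\n"])
print(result)
```

The first line matches both patterns, so both actions run on it, in the order in which the patterns are listed. The second line matches nothing and produces no output.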

This is it, basically. But you can change the control flow, if you want. You can end the evaluation of the patterns for the current line (with NEXTSCAN), you can force TextConvert to retry all patterns on the current line (AGAIN), or you can simply read the next line without leaving the action (retaining control).

To use TextConvert, you should have some knowledge in programming in the language Perl. This is because the actions attached to the patterns are normal Perl program fragments (very short program fragments, but nevertheless), and thus have to be programmed in a normal, programmer's way. This tutorial cannot be an introduction to programming in Perl, nor to programming in general. If you need further information, see [PERL].

As long as you are testing your description file, you should, after each change, use the -c option to test whether the description file gets parsed without errors. It can happen that TextConvert doesn't complain about an error in the normal mode of operation.

All additional commands are used like functions, i.e. you have to use parentheses to enclose the arguments.

1.2 Input and Output

To do something useful with a program, you have to be able to examine the input data stream and to write data to the output data stream. These are the most basic functions of each program, and thus we examine them first.

1.2.1 The Input Field

The current line of input data read by TextConvert is found in the variable FIELD. Here the whole line can be examined, or matched against patterns.

A normal line of text contains many words that are separated by spaces. You can find these words in an array that has exactly as many elements as the line has words. The array is named FIELD as well, and has to be accessed with an index number, like this: 'FIELD[3]'. The indices are numbered from 0 to n - 1, n being the number of words in the line. The last index (n - 1) can be found in the variable '#FIELD'.
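To make the indexing concrete, here is a small sketch in Python that imitates the splitting. It is an approximation under stated assumptions: the function name split_fields and the sample line are hypothetical, the pattern is the default split pattern given later in the section on splitting input lines, and trailing empty fields are dropped by hand because Perl's split does so while Python's re.split does not.

```python
import re

def split_fields(line, pattern=r"[ \t\n]+"):
    """Split a line into fields, imitating Perl's split on the default pattern."""
    fields = re.split(pattern, line.rstrip("\n"))
    # Perl's split discards trailing empty fields; re.split keeps them.
    while fields and fields[-1] == "":
        fields.pop()
    return fields

fields = split_fields("KW   beta-lactamase; plasmid.\n")
last_index = len(fields) - 1      # corresponds to '#FIELD'

print(fields)                     # the counterpart of the FIELD[] array
print(fields[last_index])         # the counterpart of FIELD[#FIELD]
```

Here the line has n = 3 words, so the valid indices are 0 to 2 and the counterpart of '#FIELD' is 2.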

Attention: You should never change the FIELD variable or a member of the FIELD[] array !

But if you are sure that this is the only way to solve your problem, here is some additional information (don't say I didn't warn you). At the beginning of each round of pattern matching and executing actions, the next input line is read into the FIELD variable (this is only another name for the $_ variable, for the real Perl hackers out there), the trailing newline is removed, and then the contents are split into the FIELD[] array (which is the default split array @_). Now TextConvert begins to match the patterns and to execute actions. If, at some time, it has to execute the special command AGAIN, it jumps right back to the beginning of the pattern matching. It doesn't read the input line again, and it doesn't split the contents of the FIELD variable again. This means that if you change the FIELD variable, or one of the members of the FIELD[] array, the contents of the two no longer correspond. Furthermore, because the patterns are also matched against the FIELD variable, a change might influence the matching of any pattern that comes after the one belonging to the current action. On the other hand, you can use this deliberately to prevent the execution of a pattern / action, or to implicitly trigger a pattern (by adding a match of this pattern to the FIELD variable). But then there are other, safer ways to do this. You could, for example, use a pattern that matches everything (e.g. '/./') and check for a variable you set in the action that should cause the execution of the other one. Or you could change the flow of control.

Again: don't do it !

Examples

If you want to see if the input line contains the string 'RNA' (for the PRINT command see the next section):

if(FIELD =~ /RNA/)
{
  PRINT("Found RNA", NEWLINE);
}

or, if you had previous experience with Perl and its conditional constructs

PRINT("Found RNA", NEWLINE) if FIELD =~ /RNA/;

If you want to print the input line beginning with the 5th character, if the second field is not a ".", the following can be used (this happens in the EMBL Database, if you want to extract the KW lines):

if(FIELD[1] ne ".")
{
    PRINT(substr(FIELD, 5), NEWLINE);
}

Or suppose you want to examine the last field of the input line:

$test_string = FIELD[#FIELD];

If you want to loop over all input fields, which are delimited by a semicolon and a space, as in the string "Eukaryota; Planta; Phycophyta; Euglenophyceae." (the OC lines in the EMBL Database):

for($i = 1; $i <= #FIELD; $i++)
{
    chop FIELD[$i];
    PRINT("FIELD[$i]", NEWLINE);
}

1.2.2 The Print Statements

TextConvert supports two printing commands, PRINT, and PRINTF. Both commands write to the object in focus (see next section), or if the Object Stack is empty, to the output data stream.

PRINT simply prints its arguments, without further processing. Consequently, the arguments are not separated by spaces. PRINTF takes a format string (like the printf() function in C), and formats additional arguments according to this format string. In most cases you won't need the additional functionality of PRINTF, because, as in Perl, you can use variable names (including FIELD and the FIELD[] array members) in a normal string that is to be printed. But if you want to determine exactly how the output looks, including precision for decimal values, left justification, or different output formats, you have to use this command. For additional information see [PERL] and your favourite C book.

Examples

To print the input line preceded by the string " > ", use the following code:

PRINT(" > ", FIELD, NEWLINE);

or

PRINT(" > FIELD", NEWLINE);

If you need to print a variable $hex as a hexadecimal number:

PRINTF("%x", $hex);

You can even use variables to influence the format string. The following example is borrowed from [PERL]:

$width = 20;
$value = sin(1.0);
foreach $precision (0 .. ($width - 2))
{
  PRINTF("%${width}.${precision}f\n", $value);
}

gives the following output

                   1
                 0.8
                0.84
               0.841
              0.8415
             0.84147
            0.841471
           0.8414710
          0.84147098
         0.841470985
        0.8414709848
       0.84147098481
      0.841470984808
     0.8414709848079
    0.84147098480790
   0.841470984807897
  0.8414709848078965
 0.84147098480789650
0.841470984807896505

Note that '${width}' is equivalent to '$width' (it only delimits the variable name from the following alphanumeric characters) and that '${precision}f' is not in the least equivalent to '$precisionf', which is why the variable names in the example above are enclosed by curly brackets.

1.2.3 Other Commands influencing the Input or Output Data Stream

The command CLASS writes some of the contents of the Object Stack to the output data stream. See the section about the Object Stack for more information.

NEXTLINE

If you want to read the next input line without leaving the action, you can use the command NEXTLINE. After issuing this command the FIELD variable contains the new current input line, and the FIELD[] array the different words of that line. Before you use NEXTLINE, you should check with the special command EOF() whether the end of the file has been reached.

Example

You want to read over the whole feature table of the EMBL DataBase. It begins with a line containing the identifier 'FH' and, following it, lines that begin with 'FT':

/FH/
{
  NEXTSCAN if EOF();
  NEXTLINE;
  while(FIELD =~ /^FT/)
  {
    NEXTSCAN if EOF();
    NEXTLINE;
  }
  AGAIN;
}

The command AGAIN restarts the pattern matching on the current line. See the section about control flow.

Or the KW-line of the EMBL DataBase again. It consists of keywords, separated by semicolons. The last entry ends with a period. And keywords may consist of more than one word, contain spaces or embedded periods.

/^KW/
{
  while(substr(FIELD, -1) ne ".")
  {
    # We have to split the input line ourselves 
    # using ';' as delimiter. We drop the first
    # 3 characters (they are "KW ")
    @myfield = split(/; */, substr(FIELD, 3));
    for($i = 1; $i <= $#myfield; $i++)
    {
      PRINT("Kw:", $myfield[$i], NEWLINE);
    }
    NEXTSCAN if EOF();
    NEXTLINE;
  }
  # now we examine the last line
  NEXTSCAN if FIELD[1] eq ".";   # line is "KW ."
  chop FIELD;     # remove the period
  @myfield = split(/; */, substr(FIELD, 3));
  for($i = 1; $i <= $#myfield; $i++)
  {
    PRINT("Keyword: ", $myfield[$i]);
  }
  NEXTSCAN;
}

The command NEXTSCAN forces TextConvert to read the next input line and begin again with matching the patterns. See the section about control flow.

DECODE

TextConvert supports the decoding of a data format, that encodes sequences of 4 different strings as 2-Bit sequences. The DNA, for example, can be encoded thus, which leads to a very space saving representation of the data. Also supported is the ability to read 8-Bit or ASCII sequences of arbitrary length that are embedded in the normal text.

The command DECODE reads a data sequence of given length and decodes it according to its first argument, which can be either "2 Bit" or "2-Bit" to decode the 2-Bit coded sequences and "8BIT" or "ASCII" to read in 8-Bit coded sequences (these are not really encoded). The decoded data will be found in the variable $DECODED.
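What the 2-Bit decoding amounts to can be sketched as follows. This is only an illustration in Python, not TextConvert's code, and it rests on assumptions this manual does not confirm: that each byte holds four symbols, that the most significant bit pair comes first, and that codes 0 to 3 select the entries of the alphabet in the order given. The names decode_2bit and bytes_needed are hypothetical.

```python
def bytes_needed(length):
    """Number of bytes that hold 'length' symbols at 4 symbols per byte."""
    return (length + 3) // 4

def decode_2bit(data, length, alphabet=("C", "T", "A", "G")):
    """Decode 'length' symbols from 2-bit packed bytes."""
    symbols = []
    for byte in data:
        # Assumption: the most significant bit pair is the first symbol.
        for shift in (6, 4, 2, 0):
            symbols.append(alphabet[(byte >> shift) & 0b11])
    # The last byte may contain padding, so cut at the requested length.
    return "".join(symbols[:length])

decoded = decode_2bit(bytes([0b00011011, 0b11100000]), 6)
print(decoded)
print(bytes_needed(705))
```

Under the four-symbols-per-byte assumption, a sequence of 705 symbols occupies 177 bytes, which agrees with the byte count shown for the first entry in the example that follows.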

Example

Suppose you want to read a data stream with the following structure:

>>>>A00144  2/93  2BIT  Len: 705
A00144 H.sapiens LAG-2 gene promoter region. 2/93
[Some binary data, 177 Bytes]>>>>A00149  3/93  ASCII  Len: 567
A00149 H.sapiens IFN-alpha-J1 mRNA. 3/93
ATGGCCCGGTCCTTTTCTTTACTGATGGCCGTGCTGGTACTCACCTACAAATCCANCTGCTCTCTGGGCTGTG
ATCTGCCTCAGACCCACAGCCTGCGTAATAGGAGGGCCTTGATACTCCTGGCACAAATGGGAAGAATCTCTCC
TTTCTCCTGCTTGAAGGACAGACATGAATTCAGATTCCCGGAGGAGGAGTTTGATGGCCACCAGTTCCAGAAG
ACTCAAGCCATCTCTGTCCTCCATGAGATGATCCAGCAGACCTTCAATCTCTTCAGCACAGAGGACTCATCTG
CTGCTTGGGAACAGAGCCTCCTAGAAAAATTTTCCACTGAACTTTACCAGCAACTGAATGACCTGGAAGCATG
TGTGATACAGGAGGTTGGGGTGGAAGAGACTCCCCTGATGAATGAGGACTTCATCCTGGCTGTGAGGAAATAC
TTCCAAAGAATCACTCTTTATCTAACAGAGAAGAAATACAGCCCTTGTGCCTGGGAGGTTGTCAGAGCAGAAA
TCATGAGATCCTTCTCTTTTTCAACAAACTTGAAAAAAGGATTAAGGAGGAAGGAT

This is a format that is used by the GCG (Genetics Computer Group) Sequence Analysis Software Package [GCG]. Each entry begins with the string '>>>>' directly followed by the name. In the second field it contains the date, in the third the type of encoding of the following sequence. Now follows the keyword 'Len:' and the last field contains the length of the data sequence. The second line contains the name of the sequence in its first field, a short description and the date. After a newline the sequence data follows (coded as given in the third field of the first line, and of the length found in the last field of the first line).

Here follows the code (for explanations of NEWCONTEXT(), CLASS() and NEXTSCAN see the appropriate sections):

/>>>>/
{
    NEWCONTEXT();
    $coding = FIELD[2];
    $length = FIELD[#FIELD];
    NEXTSCAN if EOF();
    NEXTLINE;
    CLASS("DNA", "FIELD[0]");
    DECODE($coding, $length, (C, T, A, G));
    # Format output data in lines of 75 characters each
    $position = 0;
    while($characters = 
            substr($decoded, $position, 75))
    {
        PRINT("$characters\n");
        $position += 75;
    }
    NEXTSCAN;
}

This would be the whole description file needed to convert data streams formatted according to that format into the ACEDB data format.

1.3 The Object Stack

Often the input data stream is structured in such a way that you can distinguish different objects, which are nested (one object is embedded in another). Sometimes it is best to retain that nested structure, but more often it is better to reduce the complexity of the structure by flattening it, and to replace the embedded object by a reference to a newly created object. This object contains the data of the old embedded object.

To represent the nested objects of the input data stream, there exists the notion of the Object Stack. Here you can create new objects on top of others (by pushing them on the Stack), return to existing ones that were created earlier, add data to them, and delete them after writing them to the output data stream (by popping them from the Stack). And you can empty the whole Object Stack (and write the contents of all objects to the output data stream) by issuing a single command.

You create a new object on the Stack by issuing the command CLASS with two parameters, the object's type (or class), and its name.

CLASS("Object_Type1", "Object_Name1");

All subsequent output will be directed to this object, until a new CLASS command changes the object on top of the Stack (that's not the entire truth, see the next section about the focus).

If you use the command CLASS with only the type of the object as the single parameter, then the object on the top of the Stack will be examined, and if it is not of the type given, it will be removed from the Stack and its contents written to the output data stream. This goes on until either an object of the given type is found or the Stack is empty.

Why this?

If you know, that a line containing a special pattern has to belong to a specific object, you can use this command to ensure, that the topmost object on the Stack is of the needed type. This also means, that objects, which are higher on the Stack, and not of the given type, have to have ended (should be removed from the Stack and written to the output data stream).

This makes it easier to have a well defined context for each pattern you give in the description file. And which object a pattern belongs to, doesn't depend on the order of patterns in the description file. The first implementation of TextConvert went without such a CLASS statement, and I got so many problems, that I decided to implement a means by which those problems could be circumvented.

Attention: As a rule of thumb, you should have at least one CLASS command per action.

Imagine, you detect the border (or end) of the outermost object (which means, it is the object on the bottom of the Stack) in the tree of nested objects you just work on. It can be a little bit tedious having to remember all the objects and to remove all of them one by one from the Stack. For this purpose the command NEWCONTEXT exists. It removes all objects from the Stack, thereby writing them to the output data stream. After issuing this command, you can be quite sure to have an empty Object Stack to work with.
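The three operations described so far (CLASS with two parameters, CLASS with one parameter, and NEWCONTEXT) can be modelled with an ordinary list used as a stack. The following Python sketch is a simplified model and not TextConvert's implementation: the class ObjectStack and its method names are hypothetical, and a written object is represented as a plain tuple rather than in real ACEDB syntax.

```python
class ObjectStack:
    """Toy model of the Object Stack; the last list element is the top."""

    def __init__(self):
        self.objects = []
        self.output = []      # stands in for the output data stream

    def _pop(self):
        # Removing an object writes its contents to the output stream.
        obj = self.objects.pop()
        self.output.append((obj["type"], obj["name"], obj["lines"]))

    def class_(self, type_, name=None):
        if name is not None:
            # Two parameters: push a new object.
            self.objects.append({"type": type_, "name": name, "lines": []})
            return
        # One parameter: pop until the top object has the given type.
        while self.objects and self.objects[-1]["type"] != type_:
            self._pop()

    def print_(self, text):
        # Output goes to the top object, or to the stream if the stack is empty.
        if self.objects:
            self.objects[-1]["lines"].append(text)
        else:
            self.output.append(text)

    def newcontext(self):
        while self.objects:
            self._pop()

stack = ObjectStack()
stack.class_("OUTERMOST", "outer")
stack.print_("entry 1")
stack.class_("INNERMOST", "inner")
stack.print_("inner data")
stack.class_("OUTERMOST")     # INNERMOST has ended and is written out
stack.print_("entry 2")
stack.newcontext()            # everything left is written out
print(stack.output)
```

In this model the inner object reaches the output before the outer one, because it is popped as soon as an entry of the outer object is seen again. The focus, described in the next section, is left out here.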

You can postpone the naming of an object you just create, if you need to, by giving the special argument POSTPONE instead of a name when you call the command CLASS.

CLASS("Object_without_name", POSTPONE);

In the next section (about the focus) you will learn how to set this name at a later time.

Examples

Let's assume an input data stream with the following structure (comments are in brackets):

Outermost T'is_the_name
Outermost_entry "data to be converted"
... [here comes some more data and some embedded objects]
Innermost T'is_the_embedded_object
... [data of innermost object]
Back_to_Outermost "well, here's the outermost object again"
Go_on_with_Outermost "Say: Hello, Outermost"
...

We have an object of type "Outermost", with some entries belonging to it, maybe some other embedded objects, and an object of type "Innermost", which is embedded at least into the outermost object. It could as well be embedded in some other objects, the only thing we know for sure is, that after we detect an entry belonging to the outermost object again, the border of the innermost object has been crossed, and can be written to the output data stream.

# We've found the outermost object
/Outermost/
{
  CLASS("OUTERMOST", FIELD[1]);
  ...
}
/Outermost_entry/
{
  CLASS("OUTERMOST");
  ...
}
...
# The innermost object is found
/Innermost/
{
  # Create a representation of the innermost object on
  # the Object Stack
  CLASS("INNERMOST", FIELD[1]);
  ...
}
...
/Back_to_Outermost/
{
  # Ensure we have the right object type on top of the Stack
  CLASS("OUTERMOST");
  ...
}

/Go_on_with_Outermost/
{
  PRINT("Hello, Outermost", NEWLINE);
  ...
}
...

1.3.1 The Focus

Sometimes you don't want to direct your output to the object on the top of the Stack, but to an object which lies underneath. This might, for example, be a parental object that needs a reference to the object on the top of the Stack. To solve this problem there exists the focus. All output is directed to the object in focus, which is by default the object on top of the Stack. And each time you use the special command CLASS, the focus will be set back to this default.

You can get the type (class) of the object in focus by using the command FOCUSCLASS without an argument. If you give a parameter, it will be interpreted as an object type (a class). The Stack will be searched from the current focus to the bottom of the Stack, until an object of this type is found. The focus will be set to this object. If no object of this type is found, the focus will be set back to the top of the Stack.

The name of the object in focus can also be accessed, by using the command FOCUSNAME. And if you have postponed the naming of an object at the time of its creation on the Object Stack, you can set this name at a later time by issuing the command FOCUSNAME with a parameter, the name. It will be checked, if the naming really was postponed, and the name set only if the name given at the call of the command CLASS(type, name) was the special parameter POSTPONE.
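The rules for the focus can be summarised in a short model. The Python sketch below is an illustration under assumptions, not the real implementation: the class FocusStack, its method names, and the use of sentinel objects for POSTPONE and TOPOFSTACK are hypothetical, and what the commands return when called with a parameter is not specified in this manual, so the sketch returns nothing in that case.

```python
POSTPONE = object()      # stands in for the special argument POSTPONE
TOPOFSTACK = object()    # stands in for the special argument TOPOFSTACK

class FocusStack:
    """Toy model of the focus; index 0 is the bottom, the last index the top."""

    def __init__(self):
        self.objects = []
        self.focus = -1

    def class_(self, type_, name):
        self.objects.append({"type": type_, "name": name})
        self.focus = len(self.objects) - 1       # CLASS resets the focus

    def focusclass(self, type_=None):
        if not self.objects:
            return None
        if type_ is None:
            return self.objects[self.focus]["type"]
        top = len(self.objects) - 1
        if type_ is not TOPOFSTACK:
            # Search from the current focus down to the bottom of the stack.
            for index in range(self.focus, -1, -1):
                if self.objects[index]["type"] == type_:
                    self.focus = index
                    return None
        self.focus = top                         # not found: back to the top
        return None

    def focusname(self, name=None):
        obj = self.objects[self.focus]
        if name is None:
            return obj["name"]
        if obj["name"] is POSTPONE:              # only postponed names are set
            obj["name"] = name
        return None

stack = FocusStack()
stack.class_("Object_1", "Harry")
stack.class_("Object_2", POSTPONE)
stack.class_("Object_3", "Sally")
stack.focusclass("Object_1")
stack.focusname("This_will_not_be_set")
print(stack.focusclass(), stack.focusname())
```

Because the search runs downwards from the current focus, an object above the focus is only reachable after the focus has been moved back to the top of the Stack.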

Examples

Let's create a few new objects on the Stack:

CLASS("Object_1", "Harry");
CLASS("Object_2", POSTPONE);
CLASS("Object_3", "Sally");

Now we have 3 objects on our Object Stack, the topmost is an object of type "Object_3" with the name "Sally", the second is of type "Object_2", without a name (because we don't know it by now), and the bottommost is an object of type "Object_1" named "Harry". The focus is set to the object on the top of the Stack.

$focus_class = FOCUSCLASS;
$focus_name = FOCUSNAME;

The variable $focus_class is set to the string "Object_3", the variable $focus_name to the string "Sally". Let's change the focus now

FOCUSCLASS("Object_1");
FOCUSNAME("This_will_not_be_set");
$focus_class = FOCUSCLASS;
$focus_name = FOCUSNAME;

The variable $focus_class is now set to the string "Object_1", the variable $focus_name to the string "Harry". The command FOCUSNAME didn't change a thing, because the naming of this object wasn't postponed.

FOCUSCLASS(TOPOFSTACK);
FOCUSCLASS("Object_2");
FOCUSNAME("and");
$focus_class = FOCUSCLASS;
$focus_name = FOCUSNAME;
FOCUSCLASS("Object_1");
PRINT("Reference_to: ", $focus_class, ", ", $focus_name, NEWLINE);
FOCUSCLASS(TOPOFSTACK);
PRINT("Parent: ", $focus_name);

Now the name of the object in focus is changed to the string "and", and the variable $focus_name reflects this by holding the string "and". The variable $focus_class is set to "Object_2". After that we change the focus to the bottommost object and add a reference to the just changed object. We switch back to the top of the Object Stack to add an entry that contains the name of the second object.

1.4 Control Flow

There are two additional commands that change the way in which TextConvert reads input lines and matches patterns (and executes the attached actions). These are AGAIN and NEXTSCAN.

AGAIN

This command causes TextConvert to restart matching the patterns to the current input line. This can be another line than the one matched against the patterns the previous time, if you used the command NEXTLINE (reads a new line and makes it current). This enables you to write actions that work over multiple lines, and give control back to TextConvert, if it is detected, that the line just read doesn't belong to this particular action. By restarting the matching process all patterns can be matched against this line.

Example

Suppose you want to examine a file with the following structure:

FIRST  <Text>
  <Text>
  ...
  <Text>
SECOND  <Other Text>
  <Other Text>
  ...
  <Other Text>
FIRST  <Text>

When you've read the label FIRST, you know that an arbitrary number of lines with indented text belonging to this label follows. To find the end of this structure, you have to examine the first line that doesn't belong to it anymore. If you are sure about the order, in which the different patterns you try to match come, you can put the patterns in the description file in the same order. But this holds true only, if they cannot follow each other in the file you want to work on (like in the example).

This wouldn't be an example if there were no solution. When we have detected that we have read too far, we simply scan all patterns again by issuing the special command AGAIN.

/FIRST/
{
  PRINT("First Pattern found", NEWLINE);
  # Discard the string "FIRST"
  PRINT("> ", substr(FIELD, 6), NEWLINE);
  NEXTSCAN if EOF();
  NEXTLINE;
  while(length(FIELD[0]) == 0 && FIELD =~ /\S/)
  {
    PRINT("> ", substr(FIELD, 1), NEWLINE);
    NEXTSCAN if EOF();
    NEXTLINE;
  }
  AGAIN;
}
/SECOND/
{
  PRINT("Second Pattern found", NEWLINE);
  # Discard the string "SECOND"
  PRINT("> ",substr(FIELD, 7), NEWLINE);
  NEXTSCAN if EOF();
  NEXTLINE;
  while(length(FIELD[0]) == 0 && FIELD =~ /\S/)
  {
    PRINT("> ", substr(FIELD, 1), NEWLINE);
    NEXTSCAN if EOF();
    NEXTLINE;
  }
  AGAIN;
}

NEXTSCAN

TextConvert stops the matching of patterns for this line, reads a new input line and starts the matching of the patterns again. You can use this command, if you are sure, that no other pattern matches the current input line (this speeds things up), or if you don't want further matching to happen for this line.
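One way to picture NEXTSCAN is as an early exit from the loop over the patterns. The Python sketch below models it with an exception. It is an assumption about the observable behaviour only, not a description of how TextConvert implements the command, and the names NextScan, run, comment_action and rna_action, as well as the sample lines, are hypothetical.

```python
import re

class NextScan(Exception):
    """Raised by an action to stop matching patterns on the current line."""

def run(rules, lines):
    output = []
    for line in lines:
        field = line.rstrip("\n")
        try:
            for pattern, action in rules:
                if re.search(pattern, field):
                    action(field, output)
        except NextScan:
            pass    # skip the remaining patterns and go on with the next line
    return output

def comment_action(field, output):
    # Drop the leading 'XX' and emit the rest as a '//' comment.
    output.append("//" + field[2:])
    raise NextScan

def rna_action(field, output):
    output.append("Found RNA")

rules = [(r"^XX", comment_action), (r"RNA", rna_action)]
result = run(rules, ["XX mRNA is mentioned in this comment\n", "DE human mRNA\n"])
print(result)
```

The first line contains 'RNA', but the second pattern is never tried on it because the comment action ends the scan. For this to work, the comment pattern has to come before the patterns it is meant to shield.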

Example

Say you have a file in which lines beginning with the letters 'XX' are to be interpreted as comment lines. And you have exactly one pattern that matches these comments, but there are other patterns that could match something inside a comment (only being paranoid...).

/XX/
{
  # Comments will begin with two slashes '//'
  PRINT("//", substr(FIELD, 2), NEWLINE);
  NEXTSCAN;
}

1.5 Communication between Actions

The actions (or program fragments) are put together into one address space. This means, you can exchange data between different fragments by the use of normal variables.

Example

We want to have each string 'Good Bye' in our input data stream be a match for a string 'Hello' found before. We remember finding the string 'Hello' by setting the variable $hello_variable to 1, and by resetting it to 0 if we found the string 'Good Bye'

/Hello/
{
  $hello_variable = 1;
}
/Good Bye/
{
  if($hello_variable == 1)
  {
    PRINT("Good Bye, friend", NEWLINE);
  }
  else
  {
    PRINT("You forgot to greet me", NEWLINE);
  }
  $hello_variable = 0;
}

1.6 Subroutines and the Include Statement

If you have code fragments which are used over and over again in your code (meaning you copied them lots of times between your actions), then it is possibly time for the use of subroutines. The syntax is exactly the same as in Perl (see [PERL]), and, to be honest, they are implemented as such. But wait, there is something special. You can use all the additional commands TextConvert offers.

And if you are still yawning, and asking yourself, can't this guy do something useful, here it comes: the Include Statement. With it, you can have libraries of common subroutines for all those different description files you are using to convert flat file libraries. And even common Pattern/Action-Pairs can be put into such files and simply be included by your description file.

Example

One simple example could be to have standardized BEGIN and END actions, and a Pattern/Action Pair that matches and discards comments beginning with "//", which can be written once and included in all description files.

[This is the File "Standard.des"]
BEGIN
{
  &print_begin();
}
END
{
  PRINT(NEWLINE, "END OF FILE", NEWLINE);
  PRINT(NEWLINE);
}
!//!
{
  FIELD =~ s!\s*//.*$!!;
  # It is better to call AGAIN, we don't know where this
  # file is included (i.e. which position the pattern has).
  AGAIN;
}
sub print_begin
{
  PRINT("// This file has been created by me", NEWLINE);
  PRINT(NEWLINE);
}

And here the file that includes the definitions above:

#include "Standard.des"
...

Attention: Do not include files recursively (i.e., do not create cyclic graphs).

1.7 Splitting of Input Lines

Normally, the split action defined in TextConvert is sufficient. But there are times when parts of the input data have to be split upon arbitrary boundaries. To provide you with the utmost flexibility, three commands and an additional Pattern/Action pair are defined. The commands are NEXTLINE_NOSPLIT, SPLIT and SPLITPATTERN; the additional Pattern/Action pair is named SPLIT.

NEXTLINE_NOSPLIT

NEXTLINE_NOSPLIT reads the next input line without splitting it into the FIELD array. Thus, if you only need the raw line without splitting it into separate fields, then this command can be used.

SPLIT

SPLIT splits the current input line into the FIELD array. This enables you to read a line without splitting it via the command NEXTLINE_NOSPLIT, make arbitrary changes to support the split command, and then split via this command. The sequence of the two commands NEXTLINE_NOSPLIT and SPLIT is equivalent to the command NEXTLINE.

SPLITPATTERN

This command enables you to set the split pattern to arbitrary values. Thus, you can split the input line on more than one pattern by repeatedly changing the pattern with this command followed by the split command described above. The default value for the split pattern is "[ \t\n]+", thus splitting on white space.

SPLITPATTERN("[ \t]+");

The SPLIT Action

If you still need more flexibility, this action is your choice. It can change the entire behaviour of TextConvert if you want. With this action, you have direct access to the code that splits the input field. You can have arbitrary Perl code here to work on the input line.

The default definition for the split action is as follows (the string "$CONV_Input_Field_Separator" denotes the definition of the split pattern via the command SPLITPATTERN or the default definition "[ \t\n]+"):

SPLIT
{
  s/\n$//;
  study;
  SPLIT;
}

Example

Now a somewhat elaborate example that illustrates one possible use for these commands. It has been used as a foundation for the emulation of ace2ace in the course of converting ACEDB data from one model into another one. The subroutine find_new_object() searches the input stream for the beginning of the next object of data, and sets the variables $current_class and $current_instance to the values found in the input stream. To do this, the splitpattern has to be set to the default. Afterwards, the splitpattern is changed in a way that strips the quotes from the input. The SPLIT action implements line merging via the \ -operator before it splits the line. The BEGIN action calls find_new_object() and leaves the rest to TextConvert. If an empty input line is found, the first normal pattern is used and its action called. This flushes the object stack via NEWCONTEXT, calls find_new_object() to start with the next object data and sets class name and instance name. Now all subsequent actions have access to the current class and instance name.

sub find_new_object
{
      SPLITPATTERN("[ \t\n]+");
      NEXTLINE() while /^\s*$/ && !EOF();
      $current_class = FIELD[0];
      $current_instance = "\"FIELD[NUMFIELD]\"";
      SPLITPATTERN("\"?[ \t]+\"");
}
 
SPLIT
{
      # merge lines ending with \
      FIELD .= NEXTLINE_NOSPLIT while (s/\\\n$//);
      SPLIT;
}
BEGIN
{
      $target_class = "Locus :";
      &find_new_object();
      NEXTSCAN if EOF();
      CLASS($target_class, $current_instance);
}
# The next object begins after an empty line
/^\s*$/
{
      NEWCONTEXT();
      &find_new_object();
      NEXTSCAN if EOF();
      CLASS($target_class, $current_instance);
      NEXTSCAN;
}

2 Usage

2.1 The Command Line Options

All arguments have to be preceded by an option letter. This ensures the correct interpretation of the information given on the command line. Some of the options may be omitted, in which case default values will be used instead. The options can be given in any order.

-c If this option is given, TextConvert checks the given description file for
syntactic correctness.

-d <file> The file given after this option is interpreted as the description file. This
option is mandatory (without description file, TextConvert doesn't know
what to do). The file has to have the format described in the next section.

-f Force the output of the files. Only if this option is given, existing files will
be overwritten. If it is omitted and one of the output files exists,
TextConvert terminates with an error.

-h If this option is given, a short help page is printed. Nothing else happens.

-i <file> From the file given after this option the input data is read. If this option is
omitted (it is optional), the standard input stream is used to read the input
data.

-l <file> With this option the name of a log file can be given. This log file contains
all lines that have not been matched by at least one of the patterns given in
the description file. If the file exists, and the option -f is not given,
TextConvert terminates with an error.

-o <file> This option names the output file. If this option is not used, the output will
be written to the standard output stream. If the file exists, and option -f is
not given, TextConvert terminates with an error.

Examples

TextConvert -h

prints the help page.

TextConvert -c -d embl.description

checks the description file 'embl.description' for syntactical correctness.

TextConvert -d embl.description

converts data read from standard input according to the description file embl.description and writes the converted data to standard output.

TextConvert -d embl.description -i input.data -o output.data

reads the file 'input.data', converts according to the description file 'embl.description', and writes the converted data to the file 'output.data', if it does not exist.

TextConvert -d embl.description -f -i input.data -o output.data

does the same as before, but the output file will be overwritten if it exists.

2.2 The Description File

2.2.1 The Format of the File

The format of the description file is:

/Pattern/
{
    [PERL CODE, ADDITIONAL COMMANDS and subroutine calls];
}
/Pattern/
{
    [PERL CODE, ADDITIONAL COMMANDS and subroutine calls];
}
sub test
{
   [PERL CODE, ADDITIONAL COMMANDS and subroutine calls];
}
#include "library"

Each input line is read in and then, in a loop, each pattern is tried; if it matches, the attached action is executed. The patterns are matched against each input line in the same order in which they are found in the description file.
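
The loop described above can be modelled as follows. This is a minimal Python sketch of the control flow only, not TextConvert's implementation; the function convert, the sample rules and the entry name 'X1' are made up for the illustration.

```python
import re


def convert(lines, rules):
    """Try every (pattern, action) pair, in file order, on every line."""
    output = []
    for line in lines:
        for pattern, action in rules:
            if re.search(pattern, line):
                output.append(action(line))
    return output


# Three rules; the first and the third both match the first input line.
rules = [
    (r"^ID", lambda line: "entry " + line.split()[1]),
    (r"^FT[ \t]+exon", lambda line: "exon " + line.split()[2]),
    (r"X1", lambda line: "seen X1"),
]
lines = ["ID  X1", "FT  exon  2439..2607", "XX"]
print(convert(lines, rules))
# prints ['entry X1', 'seen X1', 'exon 2439..2607']
```

Note that the first line triggers two actions, in the order of the rules, and that the last line triggers none.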

2.2.2 Pattern Format

The pattern has to be enclosed by delimiters. The delimiter is not part of the pattern. When the pattern is read, TextConvert checks if it begins and ends with the same character (the delimiter), and this character is stripped. The pattern itself has to be of the regular expression format utilised by Perl (see [PERL]).

Attention: Comments are not allowed on lines with patterns.

If you attach actions to the same pattern repetitively, the actions will be executed (if the pattern matches) in exactly the same order as they are found in the description file. Thus, if you have two actions that, besides the pattern, have nothing in common, you don't have to put them together. This makes it easier to have a clear structure in the description file.

Attention: Remember that the patterns are regular expressions. This means that if you want to match e.g. the literal '.', you have to escape it like this: '\.'.
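
The effect of the escape can be checked outside TextConvert. The sketch below uses Python's re module, which treats the dot and the backslash escape in the same way as Perl for this case; the two sample lines are made up.

```python
import re

loose = re.compile(r"^FT.*2439..2607")      # '.' matches any character
strict = re.compile(r"^FT.*2439\.\.2607")   # '\.' matches only a literal dot

real = "FT  exon  2439..2607"
fake = "FT  exon  2439xx2607"

print(bool(loose.search(real)), bool(loose.search(fake)))
# prints True True
print(bool(strict.search(real)), bool(strict.search(fake)))
# prints True False
```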

Examples

To match a pattern at the beginning of a line a caret sign "^" is used. To match the ID entry of the EMBL Database the following pattern can be used:

/^ID/
*^ID*
|^ID|
!^ID!
&^ID&

The delimiters differ in the examples, but they have to be the same at the beginning and the end of the pattern. You must not use matching braces! The pattern matches the string "ID" at the beginning of a line.
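
The delimiter rule can be modelled in a few lines. This is a Python sketch under the assumption that a pattern line holds nothing but the delimited pattern; the function name strip_delimiters is made up, and the real parser may differ in detail.

```python
def strip_delimiters(pattern_line):
    """Return the pattern without its delimiters.

    The first and the last character must be identical; both are
    stripped and everything in between is the regular expression.
    """
    text = pattern_line.strip()
    if len(text) < 2 or text[0] != text[-1]:
        raise ValueError("pattern must begin and end with the same character")
    return text[1:-1]


for line in ("/^ID/", "|^ID|", "&^ID&"):
    print(strip_delimiters(line))
# prints ^ID three times
```

A line such as {^ID} is rejected by this model, because the opening and the closing brace are different characters.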

To match an exon in the feature table of the EMBL Database, looking like this:

FT  exon  2439..2607

the following pattern could be used:

/^FT[ \t]+exon/

This searches for the string "FT" at the beginning of the line, followed by one or more spaces or tabs, followed by the string "exon". A nearly equivalent pattern is

/^FT\s+exon/

Here the tabs and spaces are matched by \s+: the escaped s (\s) matches a single whitespace character, and the "+" allows one or more of them.

The following two pattern / action blocks

# Match anything
/./
{
  #Here begins the first action
  PRINT("Hello action 1");
  PRINT(NEWLINE);
}
/./
{
  #Here begins the second action
  PRINT("Hello action 2");
  PRINT(NEWLINE);
}

will be joined to something similar to the following:

/./
{
  #Here begins the first action
  PRINT("Hello action 1");
  PRINT(NEWLINE);
  #Here begins the second action
  PRINT("Hello action 2");
  PRINT(NEWLINE);
}

2.2.3 Special Patterns

Three special patterns exist. Two of them are BEGIN and END, which denote actions that are to be executed at the beginning and the end of the conversion process. These can be used to initialise variables, to insert text at the beginning or the end of the output stream or to read additional data from another file. The third special pattern is SPLIT, denoting an action that is to be executed instead of the default split action.

Attention: These patterns are not delimited.

Examples

To insert a blank line at the beginning and the end of the output stream the following patterns and actions could be used:

BEGIN
{
  PRINT(NEWLINE);
}
END
{
  PRINT(NEWLINE);
}

To merge lines from the input data that end with a backslash "\", the following SPLIT action can be used:

SPLIT
{
      # merge lines ending with \
      FIELD .= NEXTLINE_NOSPLIT while (s/\\\n$//);
      SPLIT;
}
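
The intended effect of this SPLIT action, joining a line that ends with a backslash with the line that follows it, can be modelled outside TextConvert. The following Python sketch only illustrates the merging; the function name is made up and nothing here is TextConvert code.

```python
def merge_continuations(lines):
    """Join every line ending in backslash + newline with its successor."""
    merged = []
    pending = ""
    for line in lines:
        if line.endswith("\\\n"):
            pending += line[:-2]        # drop the backslash and the newline
        else:
            merged.append(pending + line)
            pending = ""
    if pending:                         # input ended inside a continuation
        merged.append(pending)
    return merged


print(merge_continuations(["a \\\n", "b \\\n", "c\n", "d\n"]))
# prints ['a b c\n', 'd\n']
```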

2.2.4 Subroutines

Subroutines can be declared as usual in Perl. They have to be called according to the calling conventions of Perl, but be advised to use the conventions of Perl 4 to make the description file usable with both Perl 4 and Perl 5 (this means, precede every subroutine call with the ampersand "&").

Examples

Let us define a subroutine, that examines if the parameter given is greater than 0:

sub test_gt_zero
{
    local($param) = @_;
    return ($param > 0);
}

And now we call it:

$t = &test_gt_zero(1);
$t2 =  &test_gt_zero(0);
print "1\n" if $t;
print "2\n" if $t2;

This prints "1\n".

2.2.5 Include Statements

In addition to the normal pattern / action pairs and to the subroutines, one special statement is implemented:

#include "library"

This works as in C or C++: the named file is included at this point. This makes it easy to create libraries of common functions and to use them by issuing this statement. Includes can be nested, which means that an included file can itself include other files.

Attention: TextConvert does not check whether files are included recursively, e.g. include themselves (i.e. build cyclic graphs). But it is pretty easy to find out: if your program doesn't terminate and doesn't reach the statements in the BEGIN action, you probably have exactly this problem.
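
Since TextConvert does not perform this check, it can be useful to test a set of description files beforehand. The following Python sketch shows one way to detect a cycle; it is not part of TextConvert, it assumes that an include statement stands alone at the start of a line, and it takes the file contents from a dictionary instead of the file system. All file names are made up.

```python
import re

INCLUDE = re.compile(r'^#include\s+"([^"]+)"')


def expand(name, files, active=()):
    """Return the lines of `name` with all nested includes expanded.

    `files` maps file names to their text. `active` is the chain of
    files currently being expanded; meeting one of them again means
    that the includes form a cycle.
    """
    if name in active:
        chain = " -> ".join(active + (name,))
        raise ValueError("cyclic include: " + chain)
    lines = []
    for line in files[name].splitlines():
        match = INCLUDE.match(line)
        if match:
            lines.extend(expand(match.group(1), files, active + (name,)))
        else:
            lines.append(line)
    return lines


good = {"main": '#include "lib"\nBEGIN', "lib": "sub helper"}
print(expand("main", good))
# prints ['sub helper', 'BEGIN']

bad = {"a": '#include "b"', "b": '#include "a"'}
try:
    expand("a", bad)
except ValueError as error:
    print(error)
# prints cyclic include: a -> b -> a
```

Including the same library from two different files is not a cycle and is accepted by this check.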

2.2.6 Comments

As in Perl, the '#' sign declares the rest of the line as a comment. The remaining part of the line is ignored.

One exception to this is the string '#FIELD'. This is an additional command (see next section). That means, that no comment can begin with this string of characters. I feel that this is no serious limitation, because, if there is at least a single other character in this string, it is treated as a comment (e. g. '#_FIELD' or '# FIELD'). But as a precaution, you should always leave an empty space between the comment delimiter '#' and the comment.

The other exception is that no comments are allowed on lines containing patterns. This again is no serious limitation; put your comment on the line above the pattern instead.

2.3 The additional Commands

The commands that are implemented by TextConvert are always in upper case, to distinguish them from normal Perl code. Many of them accept different numbers of arguments, which select different functionality. The entries are ordered alphabetically; if a command takes different numbers of arguments, the version with fewer arguments comes first.

They implement the Object Stack and its output to the output data stream, and ease the reading and decoding of the input data stream.

2.3.1 Control Flow

Following are the commands that influence the control flow of TextConvert.

AGAIN

TextConvert restarts matching the given patterns against the current input line. This can be a different line from the one matched against the patterns the previous time, if you used the command NEXTLINE (which reads a new line and makes it current). This enables you to write actions that work over multiple lines and give control back if they detect that the line just read doesn't belong to this particular action. By restarting the matching process, all patterns can be matched against this line.

NEXTSCAN

TextConvert stops the matching of patterns for this line, reads a new input line and starts the matching of the patterns again. You can use this command, if you are sure, that no other pattern matches the current input line (this speeds things up), or if you don't want further matching to happen for this line.
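
The interplay of AGAIN, NEXTSCAN and NEXTLINE can be modelled with two exceptions and a small scanner. This is a Python sketch of the behaviour described above, not TextConvert's implementation; the class names, the rules and the sample lines are made up.

```python
import re


class Again(Exception):
    """Models AGAIN: restart matching on the current line."""


class NextScan(Exception):
    """Models NEXTSCAN: skip the remaining patterns for this line."""


class Scanner:
    def __init__(self, lines, rules):
        self.lines = list(lines)
        self.rules = rules          # (pattern, action) pairs in file order
        self.pos = -1
        self.current = None
        self.out = []

    def eof(self):
        return self.pos + 1 >= len(self.lines)

    def nextline(self):             # models NEXTLINE
        self.pos += 1
        self.current = self.lines[self.pos]

    def run(self):
        while not self.eof():
            self.nextline()
            restart = True
            while restart:
                restart = False
                try:
                    for pattern, action in self.rules:
                        if re.search(pattern, self.current):
                            action(self)
                except Again:
                    restart = True
                except NextScan:
                    pass
        return self.out


def entry(s):
    s.out.append("entry " + s.current.split()[1])
    raise NextScan                  # no other rule is wanted for ID lines


def comment(s):
    words = [s.current[2:].strip()]
    while not s.eof():
        s.nextline()
        if not s.current.startswith("CC"):
            s.out.append("comment " + " ".join(words))
            raise Again             # the line just read belongs elsewhere
        words.append(s.current[2:].strip())
    s.out.append("comment " + " ".join(words))


def other(s):
    s.out.append("other " + s.current)


RULES = [(r"^ID", entry), (r"^CC", comment), (r".", other)]
LINES = ["ID X1", "CC first", "CC second", "ID X2", "XX"]
result = Scanner(LINES, RULES).run()
print(result)
# prints ['entry X1', 'comment first second', 'entry X2', 'other XX']
```

The comment action reads ahead until it meets a line that is not a comment line and then hands this line back to all patterns, so the second ID line is not lost.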

2.3.2 The Object Stack

If in the input data stream the data objects are nested or stacked, a construct is needed (or of great help at least) to represent this nesting internally. TextConvert does this by the notion of the Object Stack. It lets you, without too much hassle, change the focus of work from one of the nested objects to another, add data, and change the focus back to the topmost object. In the case of the ACEDB format (the supported output data format), an object is simply an instance of one of the defined classes; in the case of EMBL, each entry is an object. Nested inside these entries other objects can be found, e.g. references. It is your responsibility to decide which data should be treated as an object of its own and which as attributes.

There are a few operations that are needed to work with the Object Stack. You have to be able to add objects to it, to remove objects from the top of the Stack (thereby writing their contents to the output data stream), and to access objects that are not on the top of the Stack. You have to have access to the class and instance name of the object you work on. Furthermore, it is helpful to be able to remove all objects from the Stack to clear it (and again, write the data to the output data stream). Being able to access objects that are not on the top of the Stack makes it easier to switch back and forth between different nesting levels of such a hierarchy of objects. To support this, a functionality called focus is implemented. After setting the focus to a specific object on the Stack, all subsequent operations work on this object in focus.

TextConvert, with some limitations, supports this functionality. The limitations are that you are not free in choosing how many objects to remove (and write to the output) from the top of the Stack, and that you are not free in choosing how many objects down the Stack you move the focus. You cannot say, "Move down 3 objects from the current focus", or, "Remove (and yes, write out) the 5 topmost objects on the Stack". But you can say "Move down the Object Stack to the next object of the class DNA", or "Remove (and you know what? yup, write out) all objects until the topmost object is of the class Paper". This makes work with the Stack more understandable and safer. You can't accidentally remove an object you wanted to retain on the Stack, thereby destroying the Stack structure you depend on.

FOCUSCLASS

This command returns the class of the object currently in focus. If the Object Stack is empty, the command returns an empty string.

FOCUSCLASS(type)

This command works analogously to the command CLASS(type), except that it doesn't remove objects from the Object Stack, but only changes the focus to the highest object of the given type. Subsequent operations will work on this object. If you specify the special argument TOPOFSTACK instead of a normal argument, the focus will be set to the object on top of the Object Stack.

FOCUSNAME

This command returns the name of the object instance in focus. If the Object Stack is empty, the command returns an empty string. This is one possible way to determine if the Stack is empty, because in every other case the object instance has to have a name or the special argument POSTPONE.

FOCUSNAME(name)

It is possible that you don't know the name of an object the first time you see it in the input data stream. In this case you can give the CLASS command a special argument as name, POSTPONE, that postpones the naming. If you do this, you later need a command to name the object. FOCUSNAME(name) sets the name of the object in focus, if and only if its naming was postponed. See also CLASS(type, name).

CLASS(type)

This command examines the object on top of the Object Stack, and if it is not of the type given as the argument, the object is removed (after its contents are written to the output data stream). This goes on until the current topmost object is of the type given, or until the Stack is empty. The focus is then set back to the top of the Stack.

CLASS(type, name)

This command creates a new object on top of the Object Stack. The object is of the class type "type" and will be given the name "name". If the argument for name is the special argument POSTPONE instead of a normal string, the naming will be postponed. You can set the name later with the command FOCUSNAME. The focus will be set back to the top of the stack.

NEWCONTEXT

NEWCONTEXT removes all objects from the Object Stack, thereby printing each of the objects to the output data stream. This command is used to write all objects of the stack when you know that the end of the outermost object of the input data stream is reached.

Example

When you convert the EMBL Database, the beginning of an entry is marked by the identifier ID. If this is read, it is a good time to clean up the stack:

/^ID/
{
  NEWCONTEXT();
}
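
The commands of this section can be summarised in a small model. The following Python sketch mirrors the behaviour described above; it is not TextConvert's implementation. The method names, the sample class and object names and the printed text are made up, and the choice to leave the focus unchanged when FOCUSCLASS(type) finds no object of the given type is an assumption.

```python
POSTPONE = object()     # stands in for the special argument POSTPONE


class ObjectStack:
    def __init__(self):
        self.items = []     # bottom first, top last
        self.focus = None   # index of the object in focus
        self.written = []   # stands in for the output data stream

    def _write_top(self):
        obj = self.items.pop()
        self.written.append((obj["type"], obj["name"], obj["text"]))

    def _focus_top(self):
        self.focus = len(self.items) - 1 if self.items else None

    def push(self, type_, name):
        """CLASS(type, name): create a new object on top of the Stack."""
        self.items.append({"type": type_, "name": name, "text": []})
        self._focus_top()

    def unwind_to(self, type_):
        """CLASS(type): write out objects until the top has this type."""
        while self.items and self.items[-1]["type"] != type_:
            self._write_top()
        self._focus_top()

    def focus_class(self, type_):
        """FOCUSCLASS(type): focus the highest object of this type."""
        for index in range(len(self.items) - 1, -1, -1):
            if self.items[index]["type"] == type_:
                self.focus = index
                return

    def focus_name(self, name):
        """FOCUSNAME(name): name the object in focus if it was postponed."""
        obj = self.items[self.focus]
        if obj["name"] is POSTPONE:
            obj["name"] = name

    def add(self, text):
        """PRINT(text) while the Stack is not empty."""
        self.items[self.focus]["text"].append(text)

    def new_context(self):
        """NEWCONTEXT: write out all objects, topmost first."""
        while self.items:
            self._write_top()
        self.focus = None


stack = ObjectStack()
stack.push("Sequence", "X1")
stack.push("Paper", POSTPONE)
stack.add("text for the paper")
stack.focus_class("Sequence")
stack.add("text for the sequence")
stack.focus_name("ignored")     # no effect: the name X1 was not postponed
stack.focus_class("Paper")
stack.focus_name("P1")
stack.unwind_to("Sequence")     # writes the Paper object
stack.new_context()             # writes the Sequence object
print(stack.written)
# prints [('Paper', 'P1', ['text for the paper']),
#         ('Sequence', 'X1', ['text for the sequence'])]  (on one line)
```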

2.3.3 Input / Output

All input is read from the input data stream; all output is written to the Object Stack. If an object is removed from the Object Stack, its contents are written to the output data stream.

CLASS(type)

This command examines the object on top of the stack, and if it is not of the type given, deletes it after writing it to the output data stream. See also entry in the previous section.

DECODE(type, length, Code_List)

This command decodes 2 Bit or ASCII coded sequences of arbitrary data. The first argument is the type (2, "2", "2 Bit"... "8Bit", "ASCII"), the second the length of the sequence, the third an array of Strings containing the plain values of the coded sequence. This argument will be ignored if used in conjunction with coding type "ASCII" or "8 Bit". The number of data items will be read directly from the input data stream. The decoded data is to be found in the variable $DECODE.

Attention: The DECODE statement reads items, not bytes. Thus, if you have 2 Bit coded data, length / 4 bytes will be read, rounded up to the next whole byte (in the example below, 425 / 4 would be 106.25, so 107 bytes are read; the last byte contains the 2 Bit code for the last base).
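
The byte arithmetic and the decoding itself can be modelled as follows. This is a Python sketch, not TextConvert's implementation. It assumes that the first item of each byte occupies the two most significant bits and that the code list is indexed by the 2 Bit value; the manual does not state either detail, and the function names are made up.

```python
def bytes_needed(length):
    """Number of bytes that hold `length` 2 Bit items (four per byte)."""
    return (length + 3) // 4


def decode_2bit(data, length, codes):
    """Decode `length` items from `data` using the list `codes`."""
    if len(data) < bytes_needed(length):
        raise ValueError("not enough data for the requested length")
    items = []
    for i in range(length):
        byte = data[i // 4]
        shift = 6 - 2 * (i % 4)     # assumption: first item in the top bits
        items.append(codes[(byte >> shift) & 0b11])
    return "".join(items)


print(bytes_needed(425))
# prints 107
print(decode_2bit(bytes([0b00011011, 0b11000000]), 5, ["C", "T", "A", "G"]))
# prints CTAGG
```

In the second call, five items are requested, so two bytes are needed and only the first two bits of the second byte are used.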

Example

To decode a 2 Bit coded DNA sequence with a length of 425 bases, you can use the following DECODE statement (the PRINT command outputs the converted data):

DECODE("2 Bit", 425, (C, T, A, G));
PRINT($DECODED, NEWLINE);

EOF()

This command checks whether the end of the input data stream has been reached. It returns 1 upon reaching the end of the file, and 0 otherwise.

FIELD

The current input line is found in FIELD. If you want to access parts of it, say, the second entry, or the last entry (delimited by spaces), you should use FIELD[n].

FIELD[n]

This command returns the nth field of the input line. Counting begins with 0 and ends with #FIELD (the number of entries minus one). If you want to access the last entry, you can use the following construct:

$lastfield = FIELD[#FIELD];

#FIELD

This command returns the number of entries (delimited by spaces) in the current input line minus one, that is, the index of the last entry. As with array indexing in C or C++, counting starts at 0.
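
The relation between FIELD, FIELD[n] and #FIELD can be illustrated with a Python model. The sketch splits on the pattern "[ \t\n]+", which this manual names as the default split pattern, and drops trailing empty fields as Perl's split does; the function name is made up and the code is not TextConvert's implementation.

```python
import re


def split_fields(line, pattern=r"[ \t\n]+"):
    """Split a line into fields and drop trailing empty fields."""
    fields = re.split(pattern, line)
    while fields and fields[-1] == "":
        fields.pop()
    return fields


fields = split_fields("FT  exon  2439..2607\n")
last_index = len(fields) - 1        # the value #FIELD stands for
print(fields)
# prints ['FT', 'exon', '2439..2607']
print(last_index, fields[0], fields[last_index])
# prints 2 FT 2439..2607
```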

NEWCONTEXT

NEWCONTEXT writes all objects currently on the Object Stack to the output data stream. See also the entry in previous section.

NEXTLINE

This command reads the next input line. After issuing NEXTLINE the contents of the new, current line can be accessed through the commands FIELD and FIELD[n]. The command can be used to access data that is distributed over more than one line. If you discover that you have read one line too many, the command AGAIN (see the section about control flow) could come in handy. This command works exactly like a sequential execution of the commands NEXTLINE_NOSPLIT and SPLIT.

Attention: You should test if the end of the file has been reached before issuing this command. This can be done with the special command EOF().

NEXTLINE_NOSPLIT

NEXTLINE_NOSPLIT reads the next input line without splitting it into the field array. Thus, if you only need the raw line and not the separate fields, this command can be used.

Attention: You should test if the end of the file has been reached before issuing this command. This can be done with the special command EOF().

PRINT(text)

The PRINT command is similar to the print command of Perl, if there is no object on the stack. But if the Object Stack is not empty, then it prints its text to the current object on the stack (see FOCUSCLASS in the section about the Object Stack). The special argument NEWLINE in the place of text prints a newline character to the current object.

PRINTF(FORMAT, text, ...)

This command is similar to the PRINT command above, except that it takes as first argument a format string FORMAT to format the additional arguments and that it is not limited to one additional argument. The format string contains text with embedded field specifiers (following the normal C printf() conventions).

SPLIT

SPLIT splits the current input line into the FIELD array. This enables you to read a line without splitting it via the command NEXTLINE_NOSPLIT, make arbitrary changes to support the split command, and then split via this command. The sequence of the two commands NEXTLINE_NOSPLIT and SPLIT is equivalent to the command NEXTLINE.

SPLITPATTERN

This command enables you to set the split pattern to arbitrary values. Thus, you can split the input line on more than one pattern by repeatedly changing the pattern with this command followed by the split command described above. The default value for the split pattern is "[ \t\n]+", thus splitting on white space.

SPLITPATTERN("[ \t]+");
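
Splitting the same line twice with different patterns can be modelled as follows. This Python sketch only mirrors the idea of changing the split pattern between two splits; the helper split_on is made up and the code is not TextConvert's implementation.

```python
import re


def split_on(pattern, line):
    """Split `line` on `pattern` and drop trailing empty fields."""
    parts = re.split(pattern, line)
    while parts and parts[-1] == "":
        parts.pop()
    return parts


line = "FT  exon  2439..2607\n"

# First split: the default pattern, white space only.
print(split_on(r"[ \t\n]+", line))
# prints ['FT', 'exon', '2439..2607']

# Second split: the dots count as separators as well.
print(split_on(r"[ \t\n.]+", line))
# prints ['FT', 'exon', '2439', '2607']
```

After the second split the start and the end of the exon are available as separate fields.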

3 Installation

You need the language Perl to use TextConvert, because TextConvert is written in it. If you don't know what it is, ask your system administrator. You need at least version 4, but TextConvert also works with Perl 5. Under Unix you can find the path to your perl executable by issuing the command 'which perl'. If this path is not '/usr/local/bin/perl', you either have to call the program with the name of the perl interpreter in front of it (e.g. if your perl interpreter is /usr/bin/perl5, you issue the command '/usr/bin/perl5 TextConvert <normal options>'), or you have to edit the first line of TextConvert. Alternatively, you can use the included program 'fixin' (directly out of [PERL]).

4 Changes

5 References

[ACEDB] The ACEDB Documentation Server. Homepage.
http://probe.nalusda.gov:8000/acedocs
[AWK_1] Aho, Kernighan, Weinberger. The AWK Programming Language. Addison-Wesley, 1988
[AWK_2] Dale Dougherty. SED & AWK. O'Reilly & Associates, 1992
[EMBL] EMBL Nucleotide Sequence Database User Manual. European Bioinformatics Institute, Release 41, December 1994
[GCG] J. Devereux, P. Haeberli, O. Smithies. A Comprehensive Set of Sequence Analysis Programs for the VAX, Nucleic Acids Res. 12, (1984) 387-395.
[IGD] The Integrated Genomic Database. Homepage.
http://genome.dkfz-heidelberg.de/igd-docs/homepage.html
[PERL] Larry Wall and Randal Schwartz. Programming Perl. O'Reilly & Associates, 1991