Overview
TextConvert is a program that takes as input a description file in a format resembling that of AWK scripts, reads a data stream, and produces another data stream according to the patterns and attached code (the so-called actions) in the description file. The patterns can be formed according to Perl's regular expressions, and the actions are formulated as Perl source code enriched with additional commands to ease the conversion of text-based data. The additional commands support the ACEDB [ACEDB] format.
Acknowledgements
We are very grateful to the IGD group in Heidelberg for their constant help. Furthermore we express our thanks to Professor Rothermel and the department of Distributed Systems for their advice and encouragement.
This work has been funded by the DKFZ (The German Cancer Research Centre), and is part of the IGD development [IGD] under the CEU contract GENE-CT93-0003.
TextConvert reads an input data stream line by line (each line is terminated with a newline character), tries to match the patterns that are given in the description file, and if one matches, executes the attached action (the code that follows the pattern). After the action has been executed, the next pattern is tried on the same line, until all patterns are tried. The next line is read, and the pattern matching begins again.
That is basically it. You can, however, change the control flow. You can end the evaluation of the patterns for the current line (with NEXTSCAN), force TextConvert to retry all patterns on the current line (AGAIN), or read the next line without leaving the action, thereby retaining control (NEXTLINE).
To use TextConvert, you should have some knowledge of programming in Perl. The actions attached to the patterns are normal Perl program fragments (very short ones, but program fragments nevertheless), and thus have to be written the way a programmer would write them. This tutorial cannot be an introduction to programming in Perl, nor to programming in general. If you need further information, see [PERL].
While you are testing your description file, you should use the -c option after each change to check that the description file is parsed without errors. It can happen that TextConvert doesn't complain about an error in the normal mode of operation.
All additional commands are used like functions, i.e. you have to enclose the arguments in parentheses.
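The control flow described above can be modelled in a few lines of Python. This is only a sketch of the described behaviour, not TextConvert's implementation and not its description-file syntax: the class Converter, the exceptions NextScan and Again (standing in for NEXTSCAN and AGAIN), the method next_line (standing in for NEXTLINE), and the sample input are all hypothetical names chosen for illustration.

```python
import re

class NextScan(Exception):
    """Stands in for NEXTSCAN: stop matching, continue with the next line."""

class Again(Exception):
    """Stands in for AGAIN: retry all patterns on the current line."""

class Converter:
    def __init__(self, lines, rules):
        self.lines = iter(lines)
        self.rules = rules      # list of (regex, action) pairs, in file order
        self.line = None        # current input line
        self.out = []           # stands in for the output data stream

    def next_line(self):
        """Make the next input line current; return False at end of input."""
        raw = next(self.lines, None)
        if raw is None:
            return False
        self.line = raw.rstrip("\n")
        return True

    def run(self):
        while self.next_line():
            rescan = True
            while rescan:
                rescan = False
                for pattern, action in self.rules:
                    if re.search(pattern, self.line):
                        try:
                            action(self)
                        except NextScan:
                            break           # skip the remaining patterns
                        except Again:
                            rescan = True   # start over with the first pattern
                            break
        return self.out

def skip_comment(conv):
    raise NextScan

def label(conv):
    # Read the indented lines belonging to the label without leaving the action.
    count = 0
    while conv.next_line():
        if not conv.line.startswith(" "):
            conv.out.append("label with %d lines" % count)
            raise Again         # read one line too far: rescan it
        count += 1
    conv.out.append("label with %d lines" % count)

def emit(conv):
    conv.out.append(conv.line)

rules = [(r"^XX", skip_comment), (r"^FIRST", label), (r"^\S", emit)]
data = ["XX note\n", "FIRST\n", "  a\n", "  b\n", "SECOND\n"]
result = Converter(data, rules).run()
```

In this model the comment line is dropped by the first rule, the label action consumes its two indented lines, and the line read one step too far is handed back to the pattern matching, where the last rule prints it.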
To do something useful with a program, you have to be able to examine the input data stream and to write data to the output data stream. These are the most basic functions of each program, and thus we examine them first.
The current line of input data read by TextConvert is found in the variable FIELD. Here the whole line can be examined, or matched against patterns.
A normal line of text contains many words separated by spaces. You can find these words in an array that has exactly as many elements as the line has words. The array is also named FIELD and is accessed with an index number, like this: 'FIELD[3]'. The indices run from 0 to n - 1, n being the number of words in the line. The highest index (n - 1) can be found in the variable '#FIELD'.
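The relation between the line, its fields, and the highest index can be illustrated with a small Python sketch. The function name split_fields and the sample line are assumptions made for this illustration; the sketch only mirrors the numbering described above.

```python
def split_fields(line):
    """Return (fields, last_index) for one input line.

    fields plays the role of the FIELD[] array, last_index that of '#FIELD'.
    """
    fields = line.rstrip("\n").split()   # split on runs of whitespace
    return fields, len(fields) - 1

fields, last = split_fields("ID   HSAB12   standard; DNA\n")
first_word = fields[0]       # index 0 is the first word
last_word = fields[last]     # index n - 1 is the last word
```

With four words on the line, the indices run from 0 to 3, so the last field is reached through the highest index rather than through the word count.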
You should not change the FIELD variable or the members of the FIELD[] array yourself. But if you are sure that this is the only way to solve your problem, here is some additional information (don't say I didn't warn you). At the beginning of each round of pattern matching and executing actions, the next input line is read into the FIELD variable (this is only another name for the $_ variable, for the real Perl hackers out there), the trailing newline is removed, and then the contents are split into the FIELD[] array (which is the default split array @_). Now TextConvert begins to match the patterns and to execute actions. If, at some time, it has to execute the special command AGAIN, it jumps right back to the beginning of the pattern matching. It doesn't read the input line again, and it doesn't split the contents of the FIELD variable again. This means that if you change the FIELD variable, or one of the members of the FIELD[] array, the contents of the two no longer correspond. Furthermore, because the patterns are also matched against the FIELD variable, this might influence the matching of a pattern that comes after the pattern belonging to the current action, in which you just changed the contents of the FIELD variable. On the other hand, you can use this to deliberately prevent the execution of a pattern / action, or to implicitly call a pattern (by adding a match of this pattern to the FIELD variable). But there are other, safer ways to do this. You could, for example, use a pattern that matches everything (e.g. '/./') and check for a variable you set in the action that should cause the execution of the other one. Or you could change the flow of control.
Again: don't do it!
If you want to see if the input line contains the string 'RNA' (for the PRINT command see the next section):
or, if you had previous experience with Perl and its conditional constructs
If you want to print the input line beginning with the 5th character, if the second field is not a ".", the following can be used (this happens in the EMBL Database, if you want to extract the KW lines):
Or suppose you want to examine the last field of the input line:
If you want to loop over all input fields, which are delimited by semicolon and space, like in the string "Eukaryota; Planta; Phycophyta; Euglenophyceae." (the OC lines in the EMBL DataBase):
TextConvert supports two printing commands, PRINT and PRINTF. Both commands write to the object in focus (see next section), or if the Object Stack is empty, to the output data stream.
PRINT simply prints its arguments, without further processing. Consequently, the arguments are not separated by spaces. PRINTF takes a format string (like the printf() function in C), and formats additional arguments according to this format string. In most cases you won't need the additional functionality of PRINTF, because, like in Perl, you can use variable names (including FIELD and the FIELD[] array members) in a normal string that is to be printed. But if you want to determine exactly what the output looks like, including precision for decimal values, left justification, or different output formats, you have to use this command. For additional information see [PERL] and your favourite C book.
To print the input line preceded by the string " > ", use the following code:
or
If you need to print a variable $hex as a hexadecimal number:
You can even use variables to influence the format string. The following example is borrowed from [PERL]:
gives the following output
Note that '${width}' is equivalent to '$width' (it only delimits the variable name from the following alphanumeric characters) and that '${precision}f' is not in the least equivalent to '$precisionf', which is why the variable names in the example above are enclosed by curly brackets.
The command CLASS writes some of the contents of the Object Stack to the output data stream. See the section about the Object Stack for more information.
If you want to read the next input line without leaving the action, you can use the command NEXTLINE. After issuing this command, the FIELD variable contains the new current input line, and the FIELD[] array the different words of the input line. Before you use NEXTLINE, you should check with the special command EOF() whether the end of the file has been reached.
You want to read over the whole feature table of the EMBL DataBase. It begins with a line containing the identifier 'FH' and, following it, lines that begin with 'FT':
The command AGAIN restarts the pattern matching on the current line. See the section about control flow.
Or the KW-line of the EMBL DataBase again. It consists of keywords, separated by semicolons. The last entry ends with a period. And keywords may consist of more than one word, contain spaces or embedded periods.
The command NEXTSCAN forces TextConvert to read the next input line and begin again with matching the patterns. See the section about control flow.
TextConvert supports the decoding of a data format that encodes sequences of 4 different strings as 2-Bit sequences. DNA, for example, can be encoded in this way, which leads to a very space-saving representation of the data. Also supported is the ability to read 8-Bit or ASCII sequences of arbitrary length that are embedded in the normal text.
The command DECODE reads a data sequence of given length and decodes it according to its first argument, which can be either "2 Bit" or "2-Bit" to decode the 2-Bit coded sequences and "8BIT" or "ASCII" to read in 8-Bit coded sequences (these are not really encoded). The decoded data will be found in the variable $DECODED.
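The idea behind a 2-Bit encoding can be sketched in Python as follows. This is not the DECODE command itself: the function name decode_2bit, the packing of four codes per byte with the most significant bits first, and the order of the strings in the table are all assumptions made for the illustration, since the text above does not specify them.

```python
def decode_2bit(data, length, table):
    """Decode `length` two-bit codes from the bytes in `data`.

    Assumption: four codes per byte, most significant bits first.
    `table` holds the four plain strings for the codes 0 to 3.
    """
    if len(table) != 4:
        raise ValueError("table must hold exactly four strings")
    out = []
    for i in range(length):
        byte = data[i // 4]              # byte that holds the i-th code
        shift = 6 - 2 * (i % 4)          # position of the code inside the byte
        out.append(table[(byte >> shift) & 0b11])
    return "".join(out)

# Five codes stored in two bytes; the last three codes of byte two are unused.
packed = bytes([0b00011011, 0b11000000])
decoded = decode_2bit(packed, 5, ["C", "T", "A", "G"])
```

Five symbols fit into two bytes here, where a plain text representation needs five, which is the space saving mentioned above.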
Suppose you want to read a data stream with the following structure:
This is a format that is used by the GCG (Genetics Computer Group) Sequence Analysis Software Package [GCG]. Each entry begins with the string '>>>>' directly followed by the name. In the second field it contains the date, in the third the type of encoding of the following sequence. Now follows the keyword 'Len:' and the last field contains the length of the data sequence. The second line contains the name of the sequence in its first field, a short description and the date. After a newline the sequence data follows (coded as given in the third field of the first line, and of the length found in the last field of the first line).
Here follows the code (for explanations of NEWCONTEXT(), CLASS() and NEXTSCAN see the appropriate sections):
This would be the whole description file needed to convert data streams formatted according to that format into the ACEDB data format.
Often the input data stream is structured in such a way that you can distinguish different objects, which are nested (one object is embedded in another). Sometimes it is best to retain that nested structure, but more often it is better to reduce the complexity of the structure by flattening it, and to replace the embedded object by a reference to a newly created object. This object contains the data of the old embedded object.
To represent the nested objects of the input data stream, there exists the notion of the Object Stack. Here you can create new objects on top of others (by pushing them on the Stack), access existing ones which were created earlier, add data to them, and delete them after writing them to the output data stream (by popping them from the Stack). And you can empty the whole Object Stack (and write the contents of all objects to the output data stream) by issuing a single command.
You create a new object on the Stack by issuing the command CLASS with two parameters, the object's type (or class) and its name.
All subsequent output will be directed to this object, until a new CLASS command changes the object on top of the Stack (that's not the entire truth, see the next section about the focus).
If you use the command CLASS with only the type of the object as the single parameter, then the object on the top of the Stack will be examined, and if it is not of the type given, it will be removed from the Stack and its contents written to the output data stream. This goes on until either an object of the given type is found or the Stack is empty.
Why this?
If you know that a line containing a special pattern has to belong to a specific object, you can use this command to ensure that the topmost object on the Stack is of the needed type. This also means that objects which are higher on the Stack, and not of the given type, have to have ended (they should be removed from the Stack and written to the output data stream).
This makes it easier to have a well-defined context for each pattern you give in the description file, and the object a pattern belongs to does not depend on the order of the patterns in the description file. The first implementation of TextConvert went without such a CLASS statement, and I ran into so many problems that I decided to implement a means by which they could be circumvented.
Imagine, you detect the border (or end) of the outermost object (which means, it is the object on the bottom of the Stack) in the tree of nested objects you just work on. It can be a little bit tedious having to remember all the objects and to remove all of them one by one from the Stack. For this purpose the command NEWCONTEXT exists. It removes all objects from the Stack, thereby writing them to the output data stream. After issuing this command, you can be quite sure to have an empty Object Stack to work with.
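The three stack operations described so far can be modelled with a short Python class. This is a sketch of the described behaviour only: the class ObjectStack and its method names are hypothetical stand-ins for CLASS(type, name), CLASS(type), NEWCONTEXT and PRINT, the objects are written out as plain tuples rather than in ACEDB format, and writing from the top of the Stack downwards is an assumption.

```python
class ObjectStack:
    def __init__(self):
        self.stack = []     # bottom ... top; each entry is [type, name, lines]
        self.output = []    # stands in for the output data stream

    def _pop_and_write(self):
        obj_type, name, lines = self.stack.pop()
        self.output.append((obj_type, name, lines))

    def new_object(self, obj_type, name):
        """Like CLASS(type, name): push a new object on the Stack."""
        self.stack.append([obj_type, name, []])

    def ensure_top(self, obj_type):
        """Like CLASS(type): write out objects until the top has this type."""
        while self.stack and self.stack[-1][0] != obj_type:
            self._pop_and_write()

    def new_context(self):
        """Like NEWCONTEXT: write out and remove every object."""
        while self.stack:
            self._pop_and_write()

    def write(self, text):
        """Like PRINT: add text to the top object, or to the output if empty."""
        if self.stack:
            self.stack[-1][2].append(text)
        else:
            self.output.append(text)

s = ObjectStack()
s.new_object("Outermost", "outer1")
s.write("entry A")
s.new_object("Innermost", "inner1")
s.write("entry B")
s.ensure_top("Outermost")    # the Innermost object has ended and is written out
s.write("entry C")
s.new_context()              # end of the outermost object: flush everything
```

The call to ensure_top shows why the one-argument form is convenient: the action that handles an entry of the outer object does not need to know how many inner objects are still open.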
You can postpone the naming of an object you just create, if you need to, by giving the special argument POSTPONE instead of a name when you call the command CLASS.
In the next section (about the focus) you will learn how to set this name at a later time.
Let's assume an input data stream with the following structure (comments are in brackets):
We have an object of type "Outermost", with some entries belonging to it, maybe some other embedded objects, and an object of type "Innermost", which is embedded at least into the outermost object. It could as well be embedded in some other objects, the only thing we know for sure is, that after we detect an entry belonging to the outermost object again, the border of the innermost object has been crossed, and can be written to the output data stream.
Sometimes you don't want to direct your output to the object on the top of the Stack, but to an object which lies underneath. This might, for example, be a parental object that needs a reference to the object on the top of the Stack. To solve this problem there exists the focus. Each output is directed to the object in focus, which is by default the object on top of the Stack. And each time you use the special command CLASS, it will be set back to this default.
You can get the type (class) of the object in focus by using the command FOCUSCLASS without an argument. If you give a parameter, it will be interpreted as an object type (a class). The Stack will be searched from the current focus to the bottom of the Stack until an object of this type is found, and the focus will be set to this object. If no object of this type is found, the focus will be set back to the top of the Stack.
The name of the object in focus can also be accessed, by using the command FOCUSNAME. And if you have postponed the naming of an object at the time of its creation on the Object Stack, you can set this name at a later time by issuing the command FOCUSNAME with a parameter, the name. It will be checked, if the naming really was postponed, and the name set only if the name given at the call of the command CLASS(type, name) was the special parameter POSTPONE.
Let's create a few new objects on the Stack:
Now we have 3 objects on our Object Stack: the topmost is an object of type "Object_3" with the name "Sally", the second is of type "Object_2", without a name (because we don't know it yet), and the bottommost is an object of type "Object_1" named "Harry". The focus is set to the object on the top of the Stack.
The variable $focus_class is set to the string "Object_3", the variable $focus_name to the string "Sally". Let's change the focus now
The variable $focus_class is now set to the string "Object_1", the variable $focus_name to the string "Harry". The command FOCUSNAME didn't change a thing, because the naming of this object wasn't postponed.
Now the name of the object in focus is changed to the string "and", and the variable $focus_name reflects this by holding the string "and". The variable $focus_class is set to "Object_2". After that we change the focus to the bottommost object and add a reference to the just changed object. We switch back to the top of the Object Stack to add an entry that contains the name of the second object.
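The focus rules can be replayed with a small Python model. It is a sketch under stated assumptions, not TextConvert code: the class FocusStack, its method names, and the markers POSTPONE and TOPOFSTACK are stand-ins for the commands and special arguments of the same name, the search is assumed to start at the object currently in focus, and the name "Bill" is made up to show a rename that is refused.

```python
POSTPONE = object()      # marker for a postponed name
TOPOFSTACK = object()    # marker for "set the focus back to the top"

class FocusStack:
    def __init__(self):
        self.stack = []   # bottom ... top; each entry is [type, name]
        self.focus = -1   # index of the object in focus

    def new_object(self, obj_type, name):
        """Like CLASS(type, name): push and set the focus to the top."""
        self.stack.append([obj_type, name])
        self.focus = len(self.stack) - 1

    def focus_class(self, obj_type=None):
        """Like FOCUSCLASS: query the focus, or move it down to obj_type."""
        if not self.stack:
            return ""
        if obj_type is TOPOFSTACK:
            self.focus = len(self.stack) - 1
        elif obj_type is not None:
            for i in range(self.focus, -1, -1):
                if self.stack[i][0] == obj_type:
                    self.focus = i
                    break
            else:
                self.focus = len(self.stack) - 1   # not found: back to the top
        return self.stack[self.focus][0]

    def focus_name(self, name=None):
        """Like FOCUSNAME: query the name, or set it if it was postponed."""
        if not self.stack:
            return ""
        entry = self.stack[self.focus]
        if name is not None and entry[1] is POSTPONE:
            entry[1] = name
        return entry[1]

s = FocusStack()
s.new_object("Object_1", "Harry")
s.new_object("Object_2", POSTPONE)
s.new_object("Object_3", "Sally")
top = (s.focus_class(), s.focus_name())
s.focus_class("Object_1")
unchanged = s.focus_name("Bill")    # refused: the naming was not postponed
s.focus_class(TOPOFSTACK)
s.focus_class("Object_2")
renamed = s.focus_name("and")       # accepted: the naming was postponed
```

Because the search only ever moves towards the bottom of the Stack, the model resets the focus to the top before looking for "Object_2".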
There are two additional commands that change the way in which TextConvert reads input lines and matches patterns (and executes the attached actions). These are AGAIN and NEXTSCAN.
AGAIN causes TextConvert to restart matching the patterns against the current input line. This can be a different line from the one matched against the patterns the previous time, if you used the command NEXTLINE (which reads a new line and makes it current). This enables you to write actions that work over multiple lines and give control back to TextConvert when it is detected that the line just read doesn't belong to this particular action. By restarting the matching process, all patterns can be matched against this line.
Suppose you want to examine a file with the following structure:
When you've read the label FIRST, you know that an arbitrary number of lines with indented text belonging to this label follows. To find the end of this structure, you have to examine the first line that doesn't belong to it anymore. If you are sure about the order in which the different patterns you try to match occur, you can put the patterns in the description file in the same order. But this holds true only if they cannot follow each other in the file you want to work on (like in the example).
This wouldn't be an example if there were no solution. When we detect that we have read too far, we simply scan all patterns again by issuing the special command AGAIN.
NEXTSCAN makes TextConvert stop the matching of patterns for this line, read a new input line, and start the matching of the patterns again. You can use this command if you are sure that no other pattern matches the current input line (this speeds things up), or if you don't want further matching to happen for this line.
Say you have a file in which lines beginning with the letters 'XX' are to be interpreted as comment lines. And you have exactly one pattern that matches these comments, but there are other patterns that could match something inside a comment (only being paranoid...).
The actions (or program fragments) are put together into one address space. This means, you can exchange data between different fragments by the use of normal variables.
We want to have each string 'Good Bye' in our input data stream be a match for a string 'Hello' found before. We remember finding the string 'Hello' by setting the variable $hello_variable to 1, and by resetting it to 0 if we found the string 'Good Bye'
If you have code fragments which are used over and over again in your code (that is, you copied them many times between your actions), then it is possibly time to use subroutines. The syntax is exactly the same as in Perl (see [PERL]), and, to be honest, they are implemented as such. But wait, there is something special. You can use all the additional commands TextConvert offers.
And if you are still yawning, and asking yourself, can't this guy do something useful, here it comes: the Include Statement. With it, you can have libraries of common subroutines for all those different description files you are using to convert flat file libraries. And even common Pattern/Action-Pairs can be put into such files and simply be included by your description file.
One simple example could be to have standardized BEGIN and END actions, and a Pattern/Action Pair that matches and discards comments beginning with "//", which can be written once and included in all description files.
And here the file that includes the definitions above:
Normally, the split action defined in TextConvert is sufficient. But there are times when parts of the input data have to be split on arbitrary boundaries. To provide you with the utmost flexibility, three commands and an additional Pattern/Action pair are defined. The commands are NEXTLINE_NOSPLIT, SPLIT and SPLITPATTERN; the additional Pattern/Action pair is named SPLIT.
NEXTLINE_NOSPLIT reads the next input line without splitting it into the field array. Thus, if you only need the raw line without splitting it into separate fields, then this command can be used.
SPLIT splits the current input line into the FIELD array. This enables you to read a line without splitting it via the command NEXTLINE_NOSPLIT, make arbitrary changes to support the split command, and then split via this command. The sequence of the two commands NEXTLINE_NOSPLIT and SPLIT is equivalent to the command NEXTLINE.
SPLITPATTERN enables you to set the split pattern to arbitrary values. Thus, you can split the input line on more than one pattern by repeatedly changing the pattern with this command, followed by the SPLIT command described above. The default value for the split pattern is "[ \t\n]+", thus splitting on white space.
If you still need more flexibility, the SPLIT action is your choice. It can change the entire behaviour of TextConvert if you want. With this action, you have direct access to the code that splits the input field. You can have arbitrary Perl code here to work on the input line.
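The effect of different split patterns can be tried out with Python's re module. The sketch below is an illustration, not the SPLIT or SPLITPATTERN command: the function name split_line and the second pattern are assumptions, and dropping empty fields is a simplification. Only the default pattern "[ \t\n]+" is taken from the text above.

```python
import re

DEFAULT_SPLIT_PATTERN = r"[ \t\n]+"   # default named in the text

def split_line(line, pattern=DEFAULT_SPLIT_PATTERN):
    """Split a line on the given pattern, dropping empty fields."""
    return [f for f in re.split(pattern, line.strip("\n")) if f != ""]

line = "Eukaryota; Planta; Phycophyta; Euglenophyceae.\n"

# Default pattern: split on white space, punctuation stays attached.
words = split_line(line)

# A changed pattern: split on ';' or '.' followed by optional spaces.
names = split_line(line, r"[;.] *")
```

Splitting the same line twice with two patterns corresponds to changing the split pattern and splitting again, as described above.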
The default definition for the split action is as follows (the string "$CONV_Input_Field_Separator" denotes the definition of the split pattern via the command SPLITPATTERN or the default definition "[ \t\n]+"):
Now a somewhat elaborate example that illustrates one possible use for these commands. It has been used as a foundation for the emulation of ace2ace in the course of converting ACEDB data from one model into another one. The subroutine find_new_object() searches the input stream for the beginning of the next object of data, and sets the variables $current_class and $current_instance to the values found in the input stream. To do this, the splitpattern has to be set to the default. Afterwards, the splitpattern is changed in a way that strips the quotes from the input. The SPLIT action implements line merging via the \ -operator before it splits the line. The BEGIN action calls find_new_object() and leaves the rest to TextConvert. If an empty input line is found, the first normal pattern is used and its action called. This flushes the object stack via NEWCONTEXT, calls find_new_object() to start with the next object data and sets class name and instance name. Now all subsequent actions have access to the current class and instance name.
All arguments have to be preceded by an option letter. This ensures the correct interpretation of the information given on the command line. Some of the options are optional. If they are omitted, default values will be used instead. The options can be given in any order.
-c If this option is given, TextConvert checks the given description file for
-d <file> The file given after this option is interpreted as the description file. This
-f Force the output of the files. Only if this option is given, existing files will
-h If this option is given, a short help page is printed. Nothing else happens.
-i <file> From the file given after this option the input data is read. If this option is
-l <file> With this option the name of a log file can be given. This log file contains
-o <file> This option names the output file. If this option is not used, the output will
TextConvert -h
prints the help page.
TextConvert -c -d embl.description
checks the description file 'embl.description' for syntactical correctness.
TextConvert -d embl.description
converts data read from standard input according to the description file embl.description and writes the converted data to standard output.
TextConvert -d embl.description -i input.data -o output.data
reads the file 'input.data', converts according to the description file 'embl.description', and writes the converted data to the file 'output.data', if it does not exist.
TextConvert -d embl.description -f -i input.data -o output.data
does the same as before, but the output file will be overwritten if it exists.
The format of the description file is:
Each input line is read in and then, in a loop, each pattern is tried; if it matches, the attached action will be executed. The patterns are matched against each input line in the same order in which they are found in the description file.
The pattern has to be enclosed by delimiters. The delimiter is not part of the pattern. When the pattern is read, TextConvert checks if it begins and ends with the same character (the delimiter), and this character is stripped. The pattern itself has to be of the regular expression format utilised by Perl (see [PERL]).
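The delimiter check can be written down in a few lines of Python. This is a sketch of the rule as stated, with a hypothetical function name; raising an error for a malformed pattern is an assumption, since the text does not say how TextConvert reacts.

```python
def strip_delimiters(raw):
    """Return the pattern without its delimiters.

    The first and last character must be the same delimiter character.
    Raising ValueError otherwise is an assumption of this sketch.
    """
    if len(raw) < 2 or raw[0] != raw[-1]:
        raise ValueError("pattern must begin and end with the same character")
    return raw[1:-1]

slash_pattern = strip_delimiters("/^ID/")
bar_pattern = strip_delimiters("|^ID|")
```

Under this rule a pattern written as "{^ID}" is rejected, because the opening and closing brace are two different characters.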
If you attach actions to the same pattern repetitively, the actions will be executed (if the pattern matches) in exactly the same order as they are found in the description file. Thus, if you have two actions that, besides the pattern, have nothing in common, you don't have to put them together. This makes it easier to have a clear structure in the description file.
To match a pattern at the beginning of a line a caret sign "^" is used. To match the ID entry of the EMBL Database the following pattern can be used:
The delimiters differ in the examples, but they have to be the same at the beginning and the end of the pattern. You must not use matching braces! The pattern matches the string "ID" at the beginning of a line.
To match an exon in the feature table of the EMBL Database, looking like this:
the following pattern could be used:
This searches for the string "FT" at the beginning of the line, followed by a number of spaces or tabs, followed by the string "exon". An equivalent would be
Here the tabs and spaces will be matched by the escaped s (\s), which matches whitespace.
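The two ways of writing the white space can be compared with Python's re module, whose syntax for these constructs agrees with Perl's. The two regular expressions and the sample line below are reconstructions from the description in the text, not the original examples.

```python
import re

# Explicit character class: one or more spaces or tabs.
explicit = re.compile(r"^FT[ \t]+exon")

# Shorthand: \s matches any whitespace character.
shorthand = re.compile(r"^FT\s+exon")

line = "FT   exon            1..200"        # hypothetical feature table line
both_match = bool(explicit.search(line)) and bool(shorthand.search(line))

# The caret anchors the match at the beginning of the line.
not_at_start = bool(explicit.search("XX FT   exon"))
```

Strictly speaking, \s accepts more than spaces and tabs (for example a form feed), so the two patterns agree only for lines whose white space consists of spaces and tabs.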
The following two pattern / action blocks
will be joined to something similar to the following:
Three special patterns exist. Two of them are BEGIN and END, which denote actions that are to be executed at the beginning and the end of the conversion process. These can be used to initialise variables, to insert text at the beginning or the end of the output stream or to read additional data from another file. The third special pattern is SPLIT, denoting an action that is to be executed instead of the default split action.
To insert a blank line at the beginning and the end of the output stream the following patterns and actions could be used:
To merge lines from the input data that end with a "\", the following SPLIT action can be used:
Subroutines can be declared as usual in Perl. They have to be called according to the calling conventions of Perl, but be advised to use the conventions of Perl 4 to make the description file usable with both Perl 4 and Perl 5 (this means, precede every subroutine call with the ampersand "&").
Let us define a subroutine, that examines if the parameter given is greater than 0:
And now we call it:
This prints "1\n".
In addition to the normal pattern / action pairs and to the subroutines, one special statement is implemented:
This works exactly as in C or C++, by including the file mentioned. This makes it easy to create libraries of common functions, and to use them by issuing this statement. Including of files can be nested, which means, that inside this included file other files can be included.
As in Perl, the '#' sign is the character, that is used to declare the rest of the line as a comment. The remaining part of the line is ignored.
One exception to this is the string '#FIELD'. This is an additional command (see next section). That means, that no comment can begin with this string of characters. I feel that this is no serious limitation, because, if there is at least a single other character in this string, it is treated as a comment (e. g. '#_FIELD' or '# FIELD'). But as a precaution, you should always leave an empty space between the comment delimiter '#' and the comment.
The other exception is, that on lines containing patterns no comments are allowed. This again is no serious limitation, put your comment on the line above the pattern instead.
The commands that are implemented by TextConvert are always in upper case, to distinguish them from normal Perl code. Many of them can be used with a different number of arguments that implement different functionality. The entries are ordered alphabetically; if there are different numbers of arguments, the version with fewer arguments comes first.
They implement the Object Stack and its output to the output data stream, and ease the reading and decoding of the input data stream.
Following are the commands that influence the control flow of TextConvert.
TextConvert restarts to match the given patterns against the current input line. This can be another line than the one matched against the patterns the previous time, if you used the command NEXTLINE (reads a new line and makes it current). This enables you to write actions that work over multiple lines, and give control back if it is detected, that the line just read doesn't belong to this particular action. By restarting the matching process all patterns can be matched against this line.
TextConvert stops the matching of patterns for this line, reads a new input line and starts the matching of the patterns again. You can use this command, if you are sure, that no other pattern matches the current input line (this speeds things up), or if you don't want further matching to happen for this line.
If in the input data stream the data objects are nested or stacked, a construct is needed (or of great help at least) to represent this nesting internally. TextConvert does this by the notion of the Object Stack. It makes it possible, without too much hassle, to change the focus of work from one of the nested objects to another, add data, and change the focus back to the topmost object. In the case of the ACEDB format (the supported output data format), an object is simply an instance of one of the defined classes; in the case of EMBL, each entry is an object. And nested inside these entries other objects can be found, e.g. references. It is your responsibility to decide which data should be treated as an object of its own and which as attributes.
There are a few operations that are needed to work with the Object Stack. You have to be able to add objects to it, to remove objects from the top of the Stack, thereby writing their contents to the output data stream, and to access objects that are not on the top of the Stack. You have to have access to the class and instance name of the object you work on. Furthermore, it would be helpful to be able to remove all objects from the Stack, to clear it up (and again, write the data to the output data stream). Being able to access objects on the Stack which are not on the top makes it easier to switch back and forth between different nesting levels of such a hierarchy of objects. To support this, a functionality called focus is implemented. After setting the focus to a specific object on the Stack, all subsequent operations work on this object in focus.
TextConvert, with some limitations, supports this functionality. The limitations are that you are not free in choosing how many objects to remove (and write to the output) from the top of the Stack, and that you are not free in choosing how many objects down the Stack you move the focus. You cannot say, "Move down 3 objects from the current focus", or, "Remove (and yes, write out) the 5 topmost objects on the Stack". But you can say "Move down the Object Stack to the next object of the class DNA", or "Remove (and you know what? yup, write out) all objects until the topmost object is of the class Paper". This makes work with the Stack more understandable and safer. You can't accidentally remove an object you wanted to retain on the Stack, thereby destroying the Stack structure you depend on.
This command returns the class of the object currently in focus. If the Object Stack is empty, the command returns an empty string.
This command works analogously to the command CLASS(type), except that it doesn't remove any object from the Object Stack, but only changes the focus to the highest object of the given type. Subsequent operations will work on this object. If you specify the special argument TOPOFSTACK instead of a normal argument, the focus will be set to the object on top of the Object Stack.
This command returns the name of the object instance in focus. If the Object Stack is empty, the command returns an empty string. This is one possible way to determine if the Stack is empty, because in every other case the object instance has to have a name or the special argument POSTPONE.
It is possible that you don't know the name of an object at the first time you see it in the input data stream. In this case you can give the CLASS command a special argument as name, POSTPONE, that postpones the naming. If you do this, you need, later on, a command to name the class. FOCUSNAME(name) sets the name of the object in focus, if and only if it was postponed. See also CLASS(type, name).
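The interplay of focus and postponed naming can be modelled in plain Perl as an index into an array. The sketch below is an illustration under assumptions: focus_class and focus_name are hypothetical stand-ins for FOCUSCLASS(type) and FOCUSNAME(name), an undefined name stands for POSTPONE, and leaving the focus unchanged when no object of the class exists is a guess, not documented behaviour.

```perl
use strict;
use warnings;

# Hypothetical model: the stack is an array, the focus is an index into it.
my @stack = (
    { class => 'Object_1', name => 'Harry' },
    { class => 'Object_2', name => undef },      # naming postponed
    { class => 'Object_3', name => 'Sally' },
);
my $focus = $#stack;    # focus starts at the top of the stack

# Move the focus to the highest object of the given class.
# The focus stays unchanged if no such object exists (an assumption).
sub focus_class {
    my ($class) = @_;
    for (my $i = $#stack; $i >= 0; $i--) {
        if ($stack[$i]{class} eq $class) {
            $focus = $i;
            return 1;
        }
    }
    return 0;
}

# Name the object in focus, but only if its naming was postponed.
sub focus_name {
    my ($name) = @_;
    return 0 if defined $stack[$focus]{name};
    $stack[$focus]{name} = $name;
    return 1;
}

focus_class('Object_1');
focus_name('ignored');    # Harry already has a name, nothing changes
focus_class('Object_2');
focus_name('and');        # fills in the postponed name

print join(' ', map { $_->{name} } @stack), "\n";    # Harry and Sally
```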
CLASS(type) examines the object on top of the Object Stack, and if it is not of the type given as the argument, the object is removed (after its contents are written to the output data stream). This goes on until the current topmost object is of the given type, or until the Stack is empty. The focus is then set back to the top of the Stack.
CLASS(type, name) creates a new object on top of the Object Stack. The object is of the class "type" and is given the name "name". If the argument for name is the special argument POSTPONE instead of a normal string, the naming is postponed. You can set the name later with the command FOCUSNAME. The focus is set back to the top of the Stack.
NEWCONTEXT removes all objects from the Object Stack, thereby printing each of the objects to the output data stream. This command is used to write all objects of the stack when you know that the end of the outermost object of the input data stream is reached.
When you convert the EMBL Database, the beginning of an entry is marked by the identifier ID. If this is read, it is a good time to clean up the stack:
All input is read from the input data stream, and all output is written to the Object Stack. When an object is removed from the Object Stack, its contents are written to the output data stream.
CLASS(type) examines the object on top of the Stack and, if it is not of the given type, deletes it after writing it to the output data stream. See also the entry in the previous section.
DECODE decodes 2 Bit or ASCII coded sequences of arbitrary data. The first argument is the coding type (2, "2", "2 Bit"... "8Bit", "ASCII"), the second is the length of the sequence, and the third is an array of strings containing the plain values of the coded sequence. The third argument is ignored if the coding type is "ASCII" or "8 Bit". The specified number of data items is read directly from the input data stream. The decoded data can be found in the variable $DECODE.
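To make the role of the code list concrete, here is a small stand-alone Perl sketch of 2 Bit decoding. It is not the code TextConvert uses: the function name decode_2bit is hypothetical, and the packing order (four symbols per byte, most significant bit pair first) is an assumption that may differ from the layout of a real data file.

```perl
use strict;
use warnings;

# Decode $length symbols from 2-bit packed $bytes using @code.
# Assumption: four symbols per byte, most significant bit pair first.
sub decode_2bit {
    my ($bytes, $length, @code) = @_;
    my $decoded = '';
    for my $i (0 .. $length - 1) {
        my $byte  = ord(substr($bytes, int($i / 4), 1));
        my $shift = 6 - 2 * ($i % 4);
        $decoded .= $code[ ($byte >> $shift) & 3 ];
    }
    return $decoded;
}

# 0x1B is 00 01 10 11 in binary, that is C T A G with this code list.
my $packed = chr(0x1B) . chr(0xE4);
print decode_2bit($packed, 6, qw(C T A G)), "\n";    # CTAGGA
```

The length argument matters because the last byte may be only partly used: here six symbols are taken from two bytes and the remaining two bit pairs are ignored.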
To decode a 2 Bit coded DNA sequence with a length of 424 bases, you can use the following DECODE statement (the PRINT command outputs the converted data):
EOF() checks whether the end of the input data stream has been reached. It returns 1 at end of file and 0 otherwise.
The current input line is found in FIELD. If you want to access parts of it, say, the second entry, or the last entry (delimited by spaces), you should use FIELD[n].
FIELD[n] returns the nth field of the input line. Counting begins with 0 and ends with #FIELD (the number of entries minus one). If you want to access the last entry, you can use the following construct:
#FIELD returns the number of entries (delimited by spaces) in the current input line minus one, that is, the index of the last entry. The counting is zero-based, as with array indexing in C or C++.
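For readers more at home in plain Perl, FIELD[n] and #FIELD behave much like a Perl array and its last index $#array. The following sketch assumes the default split pattern "[ \t\n]+" and a hypothetical input line; @field and $line are illustrative names only and say nothing about how TextConvert stores the fields internally.

```perl
use strict;
use warnings;

# A hypothetical input line, split on the default white-space pattern.
my $line  = "FT   exon            2439..2607\n";
my @field = split /[ \t\n]+/, $line;

print "first: $field[0]\n";         # first: FT
print "last index: $#field\n";      # last index: 2
print "last: $field[$#field]\n";    # last: 2439..2607
```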
NEWCONTEXT writes all objects currently on the Object Stack to the output data stream. See also the entry in previous section.
NEXTLINE reads the next input line. After issuing NEXTLINE, the contents of the new current line can be accessed through the commands FIELD and FIELD[n]. The command can be used to access data that is distributed over more than one line. If you discover that you have read one line too many, the command AGAIN (see the section about control flow) can come in handy. NEXTLINE works exactly like a sequential execution of the commands NEXTLINE_NOSPLIT and SPLIT.
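The read-ahead pattern described here (consume lines inside an action, then rescan the line that was read one too far) can be illustrated in plain Perl. This is a sketch only: the input list is hypothetical, `shift @input` plays the role of NEXTLINE, and `next` plays the role of AGAIN by sending the current line through the pattern checks once more.

```perl
use strict;
use warnings;

# Hypothetical input: a header line followed by continuation lines.
my @input = ("FH header\n", "FT a\n", "FT b\n", "SQ seq\n");
my @seen;

my $line = shift @input;
while (defined $line) {
    if ($line =~ /^FH/) {
        # Consume the header and all following FT lines inside one action.
        $line = shift @input;
        $line = shift @input while defined $line && $line =~ /^FT/;
        next;    # like AGAIN: rescan the line we read one too far
    }
    chomp(my $copy = $line);
    push @seen, $copy;
    $line = shift @input;
}
print "@seen\n";    # SQ seq
```

Without the rescan, the "SQ" line would be lost, because it was read by the action that handled the header block.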
NEXTLINE_NOSPLIT reads the next input line without splitting it into the FIELD array. Use it if you only need the raw line and not the separate fields.
If there is no object on the Object Stack, the PRINT command behaves like the print command of Perl. If the Object Stack is not empty, it prints its text to the current object on the Stack, that is, the object in focus (see FOCUSCLASS in the section about the Object Stack). The special argument NEWLINE in place of text prints a newline character to the current object.
PRINTF is similar to the PRINT command above, except that it takes a format string FORMAT as its first argument to format the additional arguments, and that it is not limited to one additional argument. The format string contains text with embedded field specifiers (following the normal C printf() conventions).
SPLIT splits the current input line into the FIELD array. This enables you to read a line without splitting it via the command NEXTLINE_NOSPLIT, make arbitrary changes to support the split command, and then split via this command. The sequence of the two commands NEXTLINE_NOSPLIT and SPLIT is equivalent to the command NEXTLINE.
SPLITPATTERN sets the split pattern to an arbitrary value. Thus, you can split the input line on more than one pattern by repeatedly changing the pattern with this command, each time followed by the SPLIT command described above. The default value for the split pattern is "[ \t\n]+", which splits on white space.
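Splitting the same line with two different patterns can be tried out in plain Perl. The sketch below uses Perl's own split function rather than SPLITPATTERN and SPLIT; the keyword line and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# A hypothetical keyword line in the style of an EMBL entry.
my $line = "KW   cancer; gene promoter; LAG-2.\n";

# First pass: default white-space pattern.
my @by_space = split /[ \t\n]+/, $line;

# Second pass: strip the line code and the final period, then split
# on semicolons so that multi-word keywords stay together.
(my $rest = $line) =~ s/^KW\s+//;
$rest =~ s/\.?\n$//;
my @by_semicolon = split /;\s*/, $rest;

print scalar(@by_space), " fields, ", scalar(@by_semicolon), " keywords\n";
print "$_\n" for @by_semicolon;
```

The white-space split yields 5 fields and tears "gene promoter" apart, while the semicolon split yields the 3 keywords intact, which is the reason for changing the pattern between passes.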
You need the language Perl to use TextConvert, because TextConvert is written in it. If you don't know what it is, ask your system administrator. You need at least version 4, but TextConvert also works with Perl 5. If the path to your perl executable is not '/usr/local/bin/perl' (under Unix you can find the path by issuing the command 'which perl'), you either have to call the program with the name of the perl interpreter in front of it (e.g. if your perl interpreter is /usr/bin/perl5, you issue the command '/usr/bin/perl5 TextConvert <normal options>'), or you have to edit the first line of TextConvert. Alternatively, you can use the included program 'fixin' (taken directly from [PERL]).
Examples
if(FIELD =~ /RNA/)
{
PRINT("Found RNA", NEWLINE);
}
PRINT("Found RNA", NEWLINE) if FIELD =~ /RNA/;
if(FIELD[1] ne ".")
{
PRINT(substr(FIELD, 5), NEWLINE);
}
$test_string = FIELD[#FIELD];
for($i = 1; $i <= #FIELD; $i++)
{
chop FIELD[$i];
PRINT("FIELD[$i]", NEWLINE);
}
1.2.2 The Print Statements
Examples
PRINT(" > ", FIELD, NEWLINE);
PRINT(" > FIELD", NEWLINE);
PRINTF("%x", $hex);
$width = 20;
$value = sin(1.0);
foreach $precision (0 .. ($width - 2))
{
PRINTF("%${width}.${precision}f\n", $value);
}
                   1
                 0.8
                0.84
               0.841
              0.8415
             0.84147
            0.841471
           0.8414710
          0.84147098
         0.841470985
        0.8414709848
       0.84147098481
      0.841470984808
     0.8414709848079
    0.84147098480790
   0.841470984807897
  0.8414709848078965
 0.84147098480789650
0.841470984807896505
1.2.3 Other Commands influencing the Input or Output Data Stream
NEXTLINE
Example
/FH/
{
NEXTSCAN if EOF();
NEXTLINE;
while(FIELD =~ /^FT/)
{
NEXTSCAN if EOF();
NEXTLINE;
}
AGAIN;
}
/^KW/
{
while(substr(FIELD, -1) ne ".")
{
# We have to split the input line ourselves
# using ';' as delimiter. We drop the first
# 3 characters (they are "KW ")
@myfield = split(/; */, substr(FIELD, 3));
for($i = 1; $i <= $#myfield; $i++)
{
PRINT("Kw:", $myfield[$i], NEWLINE);
}
NEXTSCAN if EOF();
NEXTLINE;
}
# now we examine the last line
NEXTSCAN if FIELD[1] eq "."; # line is "KW ."
chop FIELD; # remove the period
@myfield = split(/; */, substr(FIELD, 3));
for($i = 1; $i <= $#myfield; $i++)
{
PRINT("Keyword: ", $myfield[$i]);
}
NEXTSCAN;
}
DECODE
Example
>>>>A00144 2/93 2BIT Len: 705
A00144 H.sapiens LAG-2 gene promoter region. 2/93
[Some binary data, 177 Bytes]>>>>A00149 3/93 ASCII Len: 567
A00149 H.sapiens IFN-alpha-J1 mRNA. 3/93
ATGGCCCGGTCCTTTTCTTTACTGATGGCCGTGCTGGTACTCACCTACAAATCCANCTGCTCTCTGGGCTGTG
ATCTGCCTCAGACCCACAGCCTGCGTAATAGGAGGGCCTTGATACTCCTGGCACAAATGGGAAGAATCTCTCC
TTTCTCCTGCTTGAAGGACAGACATGAATTCAGATTCCCGGAGGAGGAGTTTGATGGCCACCAGTTCCAGAAG
ACTCAAGCCATCTCTGTCCTCCATGAGATGATCCAGCAGACCTTCAATCTCTTCAGCACAGAGGACTCATCTG
CTGCTTGGGAACAGAGCCTCCTAGAAAAATTTTCCACTGAACTTTACCAGCAACTGAATGACCTGGAAGCATG
TGTGATACAGGAGGTTGGGGTGGAAGAGACTCCCCTGATGAATGAGGACTTCATCCTGGCTGTGAGGAAATAC
TTCCAAAGAATCACTCTTTATCTAACAGAGAAGAAATACAGCCCTTGTGCCTGGGAGGTTGTCAGAGCAGAAA
TCATGAGATCCTTCTCTTTTTCAACAAACTTGAAAAAAGGATTAAGGAGGAAGGAT
/>>>>/
{
NEWCONTEXT();
$coding = FIELD[2];
$length = FIELD[#FIELD];
NEXTSCAN if EOF();
NEXTLINE;
CLASS("DNA", "FIELD[0]");
DECODE($coding, $length, (C, T, A, G));
# Format output data in lines of 75 characters each
$position = 0;
while($characters =
substr($decoded, $position, 75))
{
PRINT("$characters\n");
$position += 75;
}
NEXTSCAN;
}
1.3 The Object Stack
CLASS("Object_Type1", "Object_Name1");
CLASS("Object_without_name", POSTPONE);
Examples
Outermost T'is_the_name
Outermost_entry "data to be converted"
... [here comes some more data and some embedded objects]
Innermost T'is_the_embedded_object
... [data of innermost object]
Back_to_Outermost "well, here's the outermost object again"
Go_on_with_Outermost "Say: Hello, Outermost"
...
# We've found the outermost object
/Outermost/
{
CLASS("OUTERMOST", FIELD[1]);
...
}
/Outermost_entry/
{
CLASS("OUTERMOST");
...
}
...
# The innermost object is found
/Innermost/
{
# Create a representation of the innermost object on
# the Object Stack
CLASS("INNERMOST", FIELD[1]);
...
}
...
/Back_to_Outermost/
{
# Ensure we have the right object type on top of the Stack
CLASS("OUTERMOST");
...
}
/Go_on_with_Outermost/
{
PRINT("Hello, Outermost", NEWLINE);
...
}
...
1.3.1 The Focus
Examples
CLASS("Object_1", "Harry");
CLASS("Object_2", POSTPONE);
CLASS("Object_3", "Sally");
$focus_class = FOCUSCLASS;
$focus_name = FOCUSNAME;
FOCUSCLASS("Object_1");
FOCUSNAME("This_will_not_be_set");
$focus_class = FOCUSCLASS;
$focus_name = FOCUSNAME;
FOCUSCLASS(TOPOFSTACK);
FOCUSCLASS("Object_2");
FOCUSNAME("and");
$focus_class = FOCUSCLASS;
$focus_name = FOCUSNAME;
FOCUSCLASS("Object_1");
PRINT("Reference_to: ", $focus_class, ", ", $focus_name, NEWLINE);
FOCUSCLASS(TOPOFSTACK);
PRINT("Parent: ", $focus_name);
1.4 Control Flow
AGAIN
Example
FIRST <Text>
<Text>
...
<Text>
SECOND <Other Text>
<Other Text>
...
<Other Text>
FIRST <Text>
/FIRST/
{
PRINT("First Pattern found", NEWLINE);
# Discard the string "FIRST"
PRINT("> ", substr(FIELD, 6), NEWLINE);
NEXTSCAN if EOF();
NEXTLINE;
while(length(FIELD[0]) == 0 && FIELD =~ /\S/)
{
PRINT("> ", substr(FIELD, 1), NEWLINE);
NEXTSCAN if EOF();
NEXTLINE;
}
AGAIN;
}
/SECOND/
{
PRINT("Second Pattern found", NEWLINE);
# Discard the string "SECOND"
PRINT("> ", substr(FIELD, 7), NEWLINE);
NEXTSCAN if EOF();
NEXTLINE;
while(length(FIELD[0]) == 0 && FIELD =~ /\S/)
{
PRINT("> ", substr(FIELD, 1), NEWLINE);
NEXTSCAN if EOF();
NEXTLINE;
}
AGAIN;
}
NEXTSCAN
Example
/XX/
{
# Comments will begin with two slashes '//'
PRINT("//", substr(FIELD, 2), NEWLINE);
NEXTSCAN;
}
1.5 Communication between Actions
Example
/Hello/
{
$hello_variable = 1;
}
/Good Bye/
{
if($hello_variable == 1)
{
PRINT("Good Bye, friend", NEWLINE);
}
else
{
PRINT("You forgot to greet me", NEWLINE);
}
$hello_variable = 0;
}
1.6 Subroutines and the Include Statement
Example
[This is the File "Standard.des"]
BEGIN
{
&print_begin();
}
END
{
PRINT(NEWLINE, "END OF FILE", NEWLINE);
PRINT(NEWLINE);
}
!//!
{
FIELD =~ s!\s*//.*$!!;
# It is better to call AGAIN, we don't know where this
# file is included (i.e. which position the pattern has).
AGAIN;
}
sub print_begin
{
PRINT("// This file has been created by me", NEWLINE);
PRINT(NEWLINE);
}
#include "Standard.des"
...
1.7 Splitting of Input Lines
NEXTLINE_NOSPLIT
SPLIT
SPLITPATTERN
SPLITPATTERN("[ \t]+");
The SPLIT Action
SPLIT
{
s/\n$//;
study;
SPLIT;
}
Example
sub find_new_object
{
SPLITPATTERN("[ \t\n]+");
NEXTLINE() while /^\s*$/ && !EOF();
$current_class = FIELD[0];
$current_instance = "\"FIELD[NUMFIELD]\"";
SPLITPATTERN("\"?[ \t]+\"");
}
SPLIT
{
# merge lines ending with \
FIELD .= NEXTLINE_NOSPLIT while (s/\\\n$//);
SPLIT;
}
BEGIN
{
$target_class = "Locus :";
&find_new_object();
NEXTSCAN if EOF();
CLASS($target_class, $current_instance);
}
# The next object begins after an empty line
/^\s*$/
{
NEWCONTEXT();
&find_new_object();
NEXTSCAN if EOF();
CLASS($target_class, $current_instance);
NEXTSCAN;
}
2 Usage
2.1 The Command Line Options
syntactic correctness.
option is mandatory (without description file, TextConvert doesn't know
what to do). The file has to have the format described in the next section.
be overwritten. If it is omitted and one of the output files exists,
TextConvert terminates with an error.
omitted (it is optional), the standard input stream is used to read the input
data.
all lines that have not been matched by at least one of the patterns given in
the description file. If the file exists, and the option -f is not given,
TextConvert terminates with an error.
be written to the standard output stream. If the file exists, and option -f is
not given, TextConvert terminates with an error.
Examples
2.2 The Description File
2.2.1 The Format of the File
/Pattern/
{
[PERL CODE, ADDITIONAL COMMANDS and subroutine calls];
}
/Pattern/
{
[PERL CODE, ADDITIONAL COMMANDS and subroutine calls];
}
sub test
{
[PERL CODE, ADDITIONAL COMMANDS and subroutine calls];
}
#include "library"
2.2.2 Pattern Format
Examples
/^ID/
*^ID*
|^ID|
!^ID!
&^ID&
FT exon 2439..2607
/^FT[ \t]+exon/
/^FT\sexon/
# Match anything
/./
{
#Here begins the first action
PRINT("Hello action 1");
PRINT(NEWLINE);
}
/./
{
#Here begins the second action
PRINT("Hello action 2");
PRINT(NEWLINE);
}
/./
{
#Here begins the first action
PRINT("Hello action 1");
PRINT(NEWLINE);
#Here begins the second action
PRINT("Hello action 2");
PRINT(NEWLINE);
}
2.2.3 Special Patterns
Examples
BEGIN
{
PRINT(NEWLINE);
}
END
{
PRINT(NEWLINE);
}
SPLIT
{
# merge lines ending with \
FIELD .= NEXTLINE_NOSPLIT while (s/\\\n$//);
SPLIT;
}
2.2.4 Subroutines
Examples
sub test_gt_zero
{
local($param) = @_;
return ($param > 0);
}
$t = &test_gt_zero(1);
$t2 = &test_gt_zero(0);
print "1\n" if $t;
print "2\n" if $t2;
2.2.5 Include Statements
#include "library"
2.2.6 Comments
2.3 The additional Commands
2.3.1 Control Flow
AGAIN
NEXTSCAN
2.3.2 The Object Stack
FOCUSCLASS
FOCUSCLASS(type)
FOCUSNAME
FOCUSNAME(name)
CLASS(type)
CLASS(type, name)
NEWCONTEXT
Example
/^ID/
{
NEWCONTEXT();
}
2.3.3 Input / Output
CLASS(type)
DECODE(type, length, Code_List)
Example
DECODE("2 Bit", 425, (C, T, A, G));
PRINT($DECODED, NEWLINE);
EOF()
FIELD
FIELD[n]
$lastfield = FIELD[#FIELD];
#FIELD
NEWCONTEXT
NEXTLINE
NEXTLINE_NOSPLIT
PRINT(text)
PRINTF(FORMAT, text, ...)
SPLIT
SPLITPATTERN
SPLITPATTERN("[ \t]+");
3 Installation
4 Changes
Cleaned up the code a little bit, renamed some functions for convenience.
Added the possibility to embed subroutines into the description file.
Added the include statement. This, combined with subroutines, makes libraries of subroutines possible.
Included the BEGIN and the END actions into the generated program instead of executing them beforehand and afterwards.
Blank lines are no longer removed.
Added the commands NEXTLINE_NOSPLIT and SPLITPATTERN, and the action SPLIT.
5 References