A parser for DEC’s ANSI-compatible video terminals

Design Aims

This document presents a state machine for a parser for escape and control sequences, suitable for use in a VT emulator. It is claimed to have two important properties:

During the discussion of this design, I will mention some real terminal emulators by name. This is for comparative purposes only and no criticism is intended of decisions made by the authors or maintainers of the applications referred to. After all, I am not presenting a product to the world, merely an ideal software model and I am free to ignore efficiency!

In this document, “VT500” is used as shorthand for the VT500 series of terminals, the VT510, VT520 and VT525.

Why DEC-compatible, not just ANSI-compatible?

All of DEC’s terminals from the VT100 onward are compatible with ANSI X3.64-1979, “Additional Controls for Use with American National Standard Code for Information Interchange”, hereafter referred to just as X3.64. However, X3.64 defines many implementation-dependent features and error conditions without defining recovery procedures. A sample of these is given below; a more detailed treatment appears later.

A terminal is a closed box that doesn’t normally report errors in its input stream to the host, so it must define a recovery procedure for all the circumstances left undefined by X3.64. DEC defined the recoveries for their terminals, so emulators should match these exactly¹.

The State Diagram

You can see the state diagram in PNG format by selecting the thumbnail below. You may find it useful to open this in a separate window while reading this document. Unfortunately the PNG is huge and doesn’t scale well, so I have also drawn a scalable version of the parser diagram in SVG format with Inkscape.

State Diagram for a VT500-Series Parser

The UML State Diagram should be readable to anyone who has seen a picture of a state machine before, but here are some random notes on reading it.

State Definitions

ground

This is the initial state of the parser, and the state used to consume all characters other than components of escape and control sequences.

GL characters (20 to 7F) are printed. I have included 20 (SP) and 7F (DEL) in this area, although both codes have special behaviour. If a 94-character set is mapped into GL, 20 will cause a space to be displayed, and 7F will be ignored. When a 96-character set is mapped into GL, both 20 and 7F may cause a character to be displayed. Later models of the VT220 included the DEC Multinational Character Set (MCS), which has 94 characters in its supplemental set (i.e. the characters supplied in addition to ASCII), so terminals only claiming VT220 compatibility can always ignore 7F. The VT320 introduced ISO Latin-1, which has 96 characters in its supplemental set, so emulators with a VT320 compatibility mode need to treat 7F as a printable character.
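The 94/96 distinction can be sketched as a small predicate. This is only an illustration: the function name displays_glyph and its charset_size argument are assumptions rather than anything defined by Digital, and a real emulator would derive the size from whichever set is currently mapped into GL.

```python
def displays_glyph(code: int, charset_size: int) -> bool:
    """Return True if a GL code (20 to 7F) should put something on the screen.

    charset_size is 94 or 96: the size of the set currently mapped into GL.
    Both names are hypothetical, chosen for this sketch only.
    """
    if not 0x20 <= code <= 0x7F:
        raise ValueError("not a GL code")
    if charset_size == 96:
        return True              # 20 and 7F may both be graphic characters
    if charset_size == 94:
        return code != 0x7F      # 20 shows a space, 7F is ignored
    raise ValueError("charset_size must be 94 or 96")
```

With a 94-character set only 7F is dropped; with a 96-character set every GL code reaches the display.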

escape

This state is entered whenever the C0 control ESC is received. This will immediately cancel any escape sequence, control sequence or control string in progress. If an escape sequence or control sequence was in progress, “cancel” means that the sequence will have no effect, because the final character that determines the control function (in conjunction with any intermediates) will not have been received. However, the ESC that cancels a control string may occur after the control function has been determined and the following string has had some effect on terminal state. For example, some soft characters may already have been defined. Cancelling a control string does not undo these effects.

A control string that started with DCS, OSC, PM or APC is usually terminated by the C1 control ST (String Terminator). In a 7-bit environment, ST will be represented by ESC \ (1B 5C). However, receiving the ESC character will “cancel” the control string, so the ST control function that is invoked by the arrival of the following “\” is essentially a “no-op” function. Does this point seem like pure trivia? Maybe, but I worried for ages about whether the control string recogniser needed a one character lookahead in order to know whether ESC \ was going to terminate it. The actual solution became clear when I was using ReGIS on a VT330: sending ESC immediately caused the graphics output cursor to disappear from the screen, so I knew that the control string had already finished before the “\” arrived. Many of the clues that enabled me to derive this state diagram have been as subtle as that.

escape intermediate

This state is entered when an intermediate character arrives in an escape sequence. Escape sequences have no parameters, so the control function to be invoked is determined by the intermediate and final characters. In this parser there is just one escape intermediate state, and the parser uses the collect action to remember intermediate characters as they arrive, for processing by the esc_dispatch action when the final character arrives. An alternate approach (and the one adopted by xterm) is to have multiple copies of this state and choose the next appropriate one as each intermediate character arrives. I think that this alternate approach is merely an optimisation; the approach presented here doesn’t require any more states if the repertoire of supported control functions increases.

This state is only split from the escape state because certain escape sequences are the 7-bit representations of C1 controls that change the state of the parser. Without these “compatibility sequences”, there could just be one escape state to collect intermediates and dispatch the sequence when a final character was received.

csi entry

This state is entered when the control function CSI is recognised, in 7-bit or 8-bit form. This state will only deal with the first character of a control sequence, because the characters 3C-3F can only appear as the first character of a control sequence, if they appear at all. Strictly speaking, X3.64 says that the entire string is “subject to private or experimental interpretation” if the first character is one of 3C-3F, which allows sequences like CSI ?::<? F, but Digital’s terminals only ever used one private-marker character at a time. As far as I am aware, only characters 3D (=), 3E (>) and 3F (?) were used by Digital.

C0 controls are executed immediately during the recognition of a control sequence. C1 controls will cancel the sequence and then be executed. I imagine this treatment of C1 controls is prompted by the consideration that the 7-bit (ESC Fe) and 8-bit representations of C1 controls should act in the same way. When the first character of the 7-bit representation, ESC, is received, it will cancel the control sequence, so the 8-bit representation should do so as well.

csi param

This state is entered when a parameter character is recognised in a control sequence. It then recognises other parameter characters until an intermediate or final character appears. Further occurrences of the private-marker characters 3C-3F or the character 3A, which has no standardised meaning, will cause transition to the csi ignore state.

csi intermediate

This state is entered when an intermediate character is recognised in a control sequence. It then recognises other intermediate characters until a final character appears. If any more parameter characters appear, this is an error condition which will cause a transition to the csi ignore state.

Neither X3.64 nor Digital defined any control sequences with more than one intermediate character, although X3.64 doesn’t place any limit on the possible number.

csi ignore

This state is used to consume remaining characters of a control sequence that is still being recognised, but has already been disregarded as malformed. This state will only exit when a final character is recognised, at which point it transitions to ground state without dispatching the control function. This state may be entered because:

  1. a private-marker character 3C-3F is recognised in any place other than the first character of the control sequence,
  2. the character 3A appears anywhere, or
  3. a parameter character 30-3F occurs after an intermediate character has been recognised.

C0 controls will still be executed while a control sequence is being ignored.
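The four control sequence states can be condensed into a short Python sketch. It is a simplification under stated assumptions, not the full parser: it accepts 7-bit input only, recognises nothing but ESC [ in the escape state, and the class name CsiSketch and the event tuples it records are invented for illustration.

```python
class CsiSketch:
    """Toy recogniser for 7-bit control sequences (hypothetical names throughout)."""

    def __init__(self):
        self.state = "ground"
        self.events = []
        self._clear()

    def _clear(self):
        self.private, self.intermediates, self.params = "", "", ""

    def feed(self, data: bytes):
        for b in data:
            self._step(b)
        return self.events

    def _step(self, b: int):
        if b > 0x7F:
            raise ValueError("8-bit codes are not modelled in this sketch")
        ch = chr(b)
        if b == 0x1B:                       # ESC cancels whatever is in progress
            self._clear()
            self.state = "escape"
        elif b in (0x18, 0x1A):             # CAN, SUB: execute and abandon the sequence
            self.events.append(("execute", b))
            self.state = "ground"
        elif b < 0x20:                      # other C0 controls execute mid-sequence
            self.events.append(("execute", b))
        elif self.state == "ground":
            self.events.append(("print", b))
        elif b == 0x7F:                     # DEL is ignored inside a sequence
            pass
        elif self.state == "escape":
            # Only ESC [ is modelled; any other escape sequence is simply dropped.
            self.state = "csi_entry" if ch == "[" else "ground"
        elif 0x40 <= b <= 0x7E:             # final character
            if self.state != "csi_ignore":
                self.events.append(("csi_dispatch", self.private,
                                    self.intermediates, self.params, ch))
            self.state = "ground"
        elif self.state == "csi_ignore":
            pass                            # swallow until a final character
        elif 0x20 <= b <= 0x2F:             # intermediate character
            self.intermediates += ch
            self.state = "csi_intermediate"
        elif self.state == "csi_intermediate":
            self.state = "csi_ignore"       # parameter character after an intermediate
        elif ch in "0123456789;":
            self.params += ch
            self.state = "csi_param"
        elif ch == ":":
            self.state = "csi_ignore"       # 3A anywhere
        elif self.state == "csi_entry":
            self.private = ch               # 3C-3F as the first character
            self.state = "csi_param"
        else:
            self.state = "csi_ignore"       # 3C-3F anywhere else
```

Feeding b"\x1b[2\nC" to a fresh CsiSketch records the execution of LF followed by a dispatch of final character C with parameter string "2", which is the behaviour discussed in footnote 1. Feeding b"\x1b[1;?2hA" records only the printing of "A", because the misplaced "?" sends the sequence to csi ignore.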

dcs entry

This state is entered when the control function DCS is recognised, in 7-bit or 8-bit form. X3.64 doesn’t define any structure for device control strings, but Digital made them appear like control sequences followed by a data string, with a form and length dependent on the control function. This state is only used to recognise the first character of the control string, mirroring the csi entry state.

C0 controls other than CAN, SUB and ESC are not executed while recognising the first part of a device control string.

dcs param

This state is entered when a parameter character is recognised in a device control string. It then recognises other parameter characters until an intermediate or final character appears. Occurrences of the private-marker characters 3C-3F or the undefined character 3A will cause a transition to the dcs ignore state.

dcs intermediate

This state is entered when an intermediate character is recognised in a device control string. It then recognises other intermediate characters until a final character appears. If any more parameter characters appear, this is an error condition which will cause a transition to the dcs ignore state.

dcs passthrough

This state is a shortcut for writing state machines for all possible device control strings into the main parser. When a final character has been recognised in a device control string, this state will establish a channel to a handler for the appropriate control function, and then pass all subsequent characters through to this alternate handler, until the data string is terminated (usually by recognising the ST control function).

This state has an exit action so that the control function handler can be informed when the data string has come to an end. For example, this allows the last soft character in a DECDLD string to be completed when there is no other means of knowing that its definition has ended.

dcs ignore

This state is used to consume remaining characters of a device control string that is still being recognised, but has already been disregarded as malformed. This state will only exit when the control function ST is recognised, at which point it transitions to ground state. This state may be entered because:

  1. a private-marker character 3C-3F is recognised in any place other than the first character of the control string,
  2. the character 3A appears anywhere, or
  3. a parameter character 30-3F occurs after an intermediate character has been recognised.

These conditions are only errors in the first part of the control string, until a final character has been recognised. The data string that follows is not checked by this parser.

osc string

This state is entered when the control function OSC (Operating System Command) is recognised. On entry it prepares an external parser for OSC strings and passes all printable characters to a handler function. C0 controls other than CAN, SUB and ESC are ignored during reception of the control string.

The only control functions invoked by OSC strings are DECSIN (Set Icon Name) and DECSWT (Set Window Title), present on the multisession VT520 and VT525 terminals. Earlier terminals treat OSC in the same way as PM and APC, ignoring the entire control string.

sos/pm/apc string

The VT500 doesn’t define any function for these control strings, so this state ignores all received characters until the control function ST is recognised.

anywhere

This isn’t a real state. It is used on the state diagram to show transitions that can occur from any state to some other state. These invariant transitions are:

On terminals earlier than the VT500, there would have been one other invariant action: the C0 control NUL was ignored on input to the terminal and would not take part in any processing. Its only purpose was as a time-fill character. However, the VT500 defines a control function DECNULM (Null Mode), which allows NUL to be passed to an attached printer. So in this parser, NUL is treated the same as other C0 controls.

Action Definitions

An event may cause one of these actions to occur with or without a change of state.

ignore

The character or control is not processed. No observable difference in the terminal’s state would occur if the character that caused this action was not present in the input stream. (Therefore, this action can only occur within a state.)

print

This action only occurs in ground state. The current code should be mapped to a glyph according to the character set mappings and shift states in effect, and that glyph should be displayed. 20 (SP) and 7F (DEL) have special behaviour in later VT series, as described in ground.

execute

The C0 or C1 control function should be executed, which may have any one of a variety of effects, including changing the cursor position, suspending or resuming communications or changing the shift states in effect. There are no parameters to this action.

clear

This action causes the current private flag, intermediate characters, final character and parameters to be forgotten. This occurs on entry to the escape, csi entry and dcs entry states, so that erroneous sequences like CSI 3 ; 1 CSI 2 J are handled correctly.

collect

The private marker or intermediate character should be stored for later use in selecting a control function to be executed when a final character arrives. X3.64 doesn’t place any limit on the number of intermediate characters allowed before a final character, although it doesn’t define any control sequences with more than one. Digital defined escape sequences with two intermediate characters, and control sequences and device control strings with one. If more than two intermediate characters arrive, the parser can just flag this so that the dispatch can be turned into a null operation.
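A minimal sketch of the collect action with the overflow flag described above. The class and method names are hypothetical, and for simplicity the sketch counts intermediate characters only; where the private marker is stored is left as an implementation choice.

```python
MAX_INTERMEDIATES = 2   # Digital defined nothing with more than two


class Collector:
    """Remembers intermediate characters until a final character arrives."""

    def __init__(self):
        self.clear()

    def clear(self):
        self.collected = ""
        self.overflow = False

    def collect(self, ch: str):
        if len(self.collected) < MAX_INTERMEDIATES:
            self.collected += ch
        else:
            self.overflow = True    # too many: the dispatch must become a no-op

    def dispatch_key(self, final: str):
        """Return the lookup key, or None if the sequence is to be ignored."""
        return None if self.overflow else (self.collected, final)
```

A dispatch action would look the key up in its table and do nothing when the key is None, so an over-long sequence is consumed without any effect on the terminal.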

param

This action collects the characters of a parameter string for a control sequence or device control string and builds a list of parameters. The characters processed by this action are the digits 0-9 (codes 30-39) and the semicolon (code 3B). The semicolon separates parameters. There is no limit to the number of characters in a parameter string, although at most 16 parameters need be stored. If more than 16 parameters arrive, all the extra parameters are silently ignored.

The VT500 Programmer Information is inconsistent regarding the maximum value that a parameter can take. In section 4.3.3.2 of EK-VT520-RM it says that “any parameter greater than 9999 (decimal) is set to 9999 (decimal)”. However, in the description of DECSR (Secure Reset), its parameter is allowed to range from 0 to 16383. Because individual control functions need to make sure that numeric parameters are within specific limits, the supported maximum is not critical, but it must be at least 16383.
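The param action might be sketched as follows. Several details are assumptions made for the example rather than documented behaviour: blank parameters are stored as None, values are clamped at 16383 (the smallest ceiling that satisfies both statements in the manual), and an empty parameter string is taken to hold a single blank parameter. The function name build_params is hypothetical.

```python
MAX_PARAMS = 16
MAX_VALUE = 16383   # assumed ceiling; the text only requires "at least 16383"


def build_params(chars: str):
    """Build a parameter list from the characters 0-9 and ';' as they arrive."""
    params = [None]          # None marks a parameter that was left blank
    extra = False            # True once a 17th parameter has been started
    for ch in chars:
        if ch == ";":
            if len(params) < MAX_PARAMS:
                params.append(None)
            else:
                extra = True                 # further parameters are dropped
        elif "0" <= ch <= "9":
            if not extra:
                value = (params[-1] or 0) * 10 + int(ch)
                params[-1] = min(value, MAX_VALUE)
        else:
            raise ValueError("not a parameter character: %r" % ch)
    return params
```

Clamping on every digit keeps the stored value bounded however long the digit string is, so no overflow check is needed later.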

Most control functions support default values for their parameters. The default value for a parameter is given by either leaving the parameter blank, or specifying a value of zero. Judging by previous threads on the newsgroup comp.terminals, this causes some confusion, with the occasional assertion that zero is the default parameter value for control functions. This is not the case: many control functions have a default value of 1, one (GSM) has a default value of 100, and some have no default. However, in all cases the default value is represented by either zero or a blank value.

In the standard ECMA-48, which can be considered X3.64’s successor², there is a distinction between a parameter with an empty value (representing the default value), and one that has the value zero. There used to be a mode, ZDM (Zero Default Mode), in which the two cases were treated identically, but that is now deprecated in the fifth edition (1991). Although a VT500 parser needs to treat both empty and zero parameters as representing the default, it is worth considering future extensions by distinguishing them internally.
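One way to keep the distinction internally while still giving VT500 behaviour is to resolve defaults only at the point of use. The sketch below assumes blank parameters are stored as None; the helper name resolve is hypothetical.

```python
def resolve(params, index, default):
    """Value of parameter `index`, with `default` standing in for blank or zero.

    A parameter that was never supplied is treated like a blank one.
    """
    value = params[index] if index < len(params) else None
    if value is None or value == 0:
        return default
    return value
```

A control function whose default is 1 would call resolve(params, 0, 1), while one with a default of 100 would pass 100. If empty and zero ever need to be told apart, only this helper has to change.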

esc_dispatch

The final character of an escape sequence has arrived, so determine the control function to be executed from the intermediate character(s) and final character, and execute it. The intermediate characters are available because collect stored them as they arrived.

csi_dispatch

A final character has arrived, so determine the control function to be executed from private marker, intermediate character(s) and final character, and execute it, passing in the parameter list. The private marker and intermediate characters are available because collect stored them as they arrived.

Digital mostly used private markers to extend the parameters of existing X3.64-defined control functions, while keeping a similar meaning. A few examples are shown in the table below.

No Private Marker                With Private Marker
SM, Set ANSI Modes               SM, Set Digital Private Modes
ED, Erase in Display             DECSED, Selective Erase in Display
CPR, Cursor Position Report      DECXCPR, Extended Cursor Position Report

In the cases above, csi_dispatch needn’t know about the private marker at all, as long as it is passed along to the control function when it is executed. However, the VT500 has a single case where the use of a private marker selects an entirely different control function (DECSTBM, Set Top and Bottom Margins and DECPCTERM, Enter/Exit PCTerm or Scancode Mode), so this action needs to use the private marker in its choice. xterm takes the same approach for efficiency, even though it doesn’t support DECPCTERM.

The selected control function will have access to the list of parameters, which it will use some or all of. If more parameters are supplied than the control function requires, only the earliest parameters will be used and the rest will be ignored. If too few parameters are supplied, default values will be used. If the control function has no default values, defaulted parameters will be ignored; this may result in the control function having no effect. For example, if the SM (Set Mode) control function is invoked with the sequence CSI 2;0;5 h, the second parameter will be ignored because SM has no default value.
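A table-driven csi_dispatch could look like the sketch below. The handler functions and the tuples they return are placeholders invented for the example; only the pairing of the final characters h and r with SM, DECSTBM and DECPCTERM comes from the terminal documentation. Blank parameters are assumed to be stored as None.

```python
def sm(private, params):
    # SM has no default value, so blank (None) and zero parameters are skipped.
    kind = "private" if private == "?" else "ansi"
    return [("set_mode", kind, p) for p in params if p]


def decstbm(private, params):
    return [("decstbm", params)]        # placeholder for Set Top and Bottom Margins


def decpcterm(private, params):
    return [("decpcterm", params)]      # placeholder for Enter/Exit PCTerm Mode


CSI_TABLE = {
    ("", "", "h"): sm,
    ("?", "", "h"): sm,           # same function; the marker is just passed along
    ("", "", "r"): decstbm,
    ("?", "", "r"): decpcterm,    # the marker selects a different function here
}


def csi_dispatch(private, intermediates, final, params):
    """Look up and run a control function; unknown sequences have no effect."""
    handler = CSI_TABLE.get((private, intermediates, final))
    return handler(private, params) if handler else []
```

For the sequence CSI 2;0;5 h, csi_dispatch("", "", "h", [2, 0, 5]) returns set_mode entries for 2 and 5 only, since the zero parameter is ignored.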

hook

This action is invoked when a final character arrives in the first part of a device control string. It determines the control function from the private marker, intermediate character(s) and final character, and executes it, passing in the parameter list. It also selects a handler function for the rest of the characters in the control string. This handler function will be called by the put action for every character in the control string as it arrives.

This way of handling device control strings has been selected because it allows the simple plugging-in of extra parsers as functionality is added. Support for a fairly simple control string like DECDLD (Downline Load) could be added into the main parser if soft characters were required, but the main parser is no place for complicated protocols like ReGIS.

put

This action passes characters from the data string part of a device control string to a handler that has previously been selected by the hook action. C0 controls are also passed to the handler.

unhook

When a device control string is terminated by ST, CAN, SUB or ESC, this action calls the previously selected handler function with an “end of data” parameter. This allows the handler to finish neatly.
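The hook, put and unhook actions amount to a small plug-in interface. The sketch below is one possible shape for it, not the only one: all class names are hypothetical, and the registry key ("", "", "{") is meant to stand for DECDLD purely as an example.

```python
class NullHandler:
    """Used when no handler is registered: the data string is discarded."""

    def __init__(self, params):
        pass

    def put(self, b):
        pass

    def unhook(self):
        pass


class CollectingHandler:
    """Hypothetical handler that buffers the data string until it ends."""

    def __init__(self, params):
        self.params = params
        self.data = bytearray()
        self.done = False

    def put(self, b):
        self.data.append(b)

    def unhook(self):
        self.done = True          # "end of data": finish any pending work here


class DcsPassthrough:
    def __init__(self, registry):
        self.registry = registry  # maps (private, intermediates, final) to a class
        self.handler = None

    def hook(self, private, intermediates, final, params):
        factory = self.registry.get((private, intermediates, final), NullHandler)
        self.handler = factory(params)

    def put(self, b):
        self.handler.put(b)

    def unhook(self):
        self.handler.unhook()
        self.handler = None
```

Adding support for another device control string then means registering one more class, with no change to the main parser.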

osc_start

When the control function OSC (Operating System Command) is recognised, this action initializes an external parser (the “OSC Handler”) to handle the characters from the control string. OSC control strings are not structured in the same way as device control strings, so there is no choice of parsers.

osc_put

This action passes characters from the control string to the OSC Handler as they arrive. There is therefore no need to buffer characters until the end of the control string is recognised.

osc_end

This action is called when the OSC string is terminated by ST, CAN, SUB or ESC, to allow the OSC handler to finish neatly.
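The three OSC actions can be seen together in a sketch of the osc string state. It assumes 7-bit input plus the 8-bit ST (9C), and the names OscHandler and run_osc_string are invented for the example.

```python
class OscHandler:
    """Hypothetical OSC handler that simply accumulates the string."""

    def start(self):
        self.buf = bytearray()
        self.finished = False

    def put(self, b):
        self.buf.append(b)

    def end(self):
        self.finished = True


def run_osc_string(data: bytes, handler) -> int:
    """Process the bytes that follow OSC.

    Returns the index of the byte that ended the string, or len(data) if the
    string is still in progress when the input runs out.
    """
    handler.start()                               # osc_start, on entry
    for i, b in enumerate(data):
        if b in (0x18, 0x1A, 0x1B, 0x9C):         # CAN, SUB, ESC or 8-bit ST
            handler.end()                         # osc_end, on exit
            return i
        if 0x20 <= b <= 0x7F:
            handler.put(b)                        # osc_put
        # any other C0 control is ignored
    return len(data)
```

Note that the string ends on the ESC of a 7-bit ST; the "\" that follows is left for the escape state to deal with, in line with the discussion of ESC \ in the escape state.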

What X3.64 Doesn’t Say

As I said above, X3.64 deliberately leaves some decisions to implementers. It doesn’t define recovery from error conditions, and some limits are implementation dependent. The following sections define DEC’s method of coping with each of these gaps in the standard.

An Implementation

As of 2005, Josh Haberman has implemented this parser in C and placed it in the public domain. Building it also requires Ruby, which is used to create the parser tables at compile time. Download vtparse.tar.gz (5 KiB).

Any Questions?

If you have any questions about this document, please send them to me, no matter how trivial you think they are. Even if the answer is already stated here, it may need clarification (or writing in bigger letters!). If you try to write the parser for a terminal emulator from this specification and you find you need a leap of logic, I’ve not done my job properly, and I’d like to hear about it.

Footnotes

  1. It is debatable how far it is necessary to go with making an emulator match the error-recovery behaviour of the terminal, for two reasons. Firstly, there is the practical problem that information on error recovery isn’t contained in DEC’s terminal manuals, and discovering it means taking detailed and seemingly endless notes about the terminal’s behaviour when certain bizarre sequences are sent to it. (OK, I’ve done that!)

    Secondly, how often would erroneous sequences be sent to the terminal anyway? I would answer this by saying that people who write applications for terminals don’t always read the manuals and may rely on some observed behaviour of the terminal without realising that they are seeing the effects of error recovery. It appears to be common knowledge among emulator writers (and their critics) that the sequence CSI 2 LF C moves the cursor two columns right and one row down. How many realise that this behaviour is not specified in X3.64, but just happens to have been the error recovery chosen by the designers of the VT100? The lesson I take from this is that if you’re going to emulate a real terminal, you should match all observable behaviour.

  2. With its first edition having been published in 1976, ECMA-48 “Control Functions for Coded Character Sets” predates ANSI X3.64 and has been updated for longer. As ECMA make their standards available free of charge, I find it surprising that anyone ever bothered claiming conformance with ANSI X3.64.