User Defined Collating
======================

User-defined collating within Empress is a mechanism which allows
Empress to be customized so that certain of its string operators 
will retrieve and sort strings from a database in a user-defined way.

This document covers the following topics:

	1. What is collating?
	2. What are the capabilities provided by the Empress
		user-defined collating mechanism?
	3. How do you create your own collating mechanism?
	4. How do you install your own collating mechanism?
	5. How do you use a user-defined collating mechanism?
	6. Limitations.

===========================================================================

1. What is collating?

We use the term "collating mechanism" (or "collating")
to mean a method  used to determine the sorting order of 
arbitrary text.  A collating mechanism can be used to determine
the sorted order of text strings, and can also be used to
determine which strings match a pattern.

Typically, a collating mechanism does this:

	* breaks up text into a sequence of tokens
	* for a given piece of text, the sort value of its ith. token 
	is compared to the sort value of the ith. token of the other
	string.
		* if the sort values are equal, the next tokens are
		compared
		* if the sort values are unequal, the string with the 
		token having the lower sort value is sorted first.
	
Typically, the sorting order of text tokens is defined by a fixed
"collating sequence".  For example, a simple collating sequence would
define the order in which individual characters would be sorted.
Thus, the normal latin1 character set has a collating sequence
which specifies that 'c' sorts before 'z'.

However, a collating sequence can contain tokens which 
are sequences of characters as well.  For example, suppose 
there is a language in which the sequence 'SS' is accorded a 
sorting order of its own, independant of the sorting order of the 
individual character 'S'.  Thus, one might define a collating 
sequence such as the following:

	... < 'S' < 'SS' < 'T' < ...

Using this collating sequence, the following sorting of words would
then be obtained:

	'BASE' < 'BASTE' < 'BASS'

whereas in the usual latin1 character set, these words would sort
as follows:

	'BASE' < 'BASS' < 'BASTE'

Note that for a word like 'BASSS',  it is up to the programmer to
determine whether it should be interpreted as "BA 'SS' S", "BAS 'SS'",
etc.

===========================================================================

2. What are the capabilities provided by the Empress user-defined
collating mechanism?

The Empress user-defined collating mechanism allows some of the
capabilities of a generic mechanism above, and can be quite powerful
and sophisticated.  It is essentially a function which takes two
strings 'a' and 'b' as arguments, and returns one of three results
defining whether a < b, a = b, or a > b.

Note that although a collating sequence is often employed within the
collating mechanism (as described above), it is not necessary.
For example, one could use a hashing function h() on the text strings
a and b :

	#include <collate.h>
	#include <usrfns.h>

	int	hash_collate (
		char	*a,
		dtpar	*a_par,
		char	*b,
		dtpar	*b_par )
	{
		unsigned int	h();
		long	a_size,
			b_size;

		a_size = a_par->dtpar1;		/* max. size of attribute a */
		b_size = b_par->dtpar1;		/* max. size of attribute b */

		if ( h(a, a_size) < h(b, b_size) )
			return LESS;		/* i.e. -1 */
		else if ( h(a, a_size) > h(b, b_size) )
			return GREATER;		/* i.e. +1 */
		else
			return EQUAL;		/* i.e.  0 */
	}

	# define MAXHASH	(MAXINT - MAXCHAR - 1)

	unsigned int	h(
		char	*s,
		long	size )
	{
		unsigned int	val;

		for ( val=0; size > 0 && *s != '\0'; s++, size-- )
			val = val - ( val + (unsigned int)(*s) ) / (MAXHASH);

		return (val);
	}

In this example, the input strings a and b will be sorted according
to the result of the hash function applied to each of them.  If 'a'
hashed to 847, and 'b' hashed to 491, for example, then b < a would result.

It is possible to write sophisticated collating mechanisms which
can be aware of contextual sorting issues.  For example, in some languages
a token which appears at the beginning of a word can have a different
sorting order than if it appears elsewhere in a word.  Similarly, a token
can have its normal sorting order changed depending on the preceding or
succeeding tokens in the string.  You are, in principle, unlimited in
the simplicity or sophistication of your chosen mechanism.

However, note that the Empress collating mechanism is limited to working
with nlschar attributes in the SORT BY clause, and in the WHERE clause
in conjunction with various operators (in particular, <, <=, =, >=, > and
RANGE).  Other operators (such as MATCH, SMATCH, and LIKE) do not
use the user-defined collating mechanism (see below for more details).

===========================================================================

3. How do you create your own collating mechanism?

Creating a collating mechanism is the first step towards being able to 
use it within Empress.  Creating a collating mechanism is simple: just
write a function which returns one of three values:

	Return Value	String Condition
	------------	----------------
	-1		a < b
	0		a = b
	1		a > b

Each collating function must be prototyped as follows:

	int my_collate ( 
		char *s1, 	/* 1st. string to collate */
		dtpar *p1, 	/* 1st. attribute parameters */
		char *s2, 	/* 2nd. string to collate */
		dtpar *p2 	/* 2nd. attribute parameters */
		)

The 'dtpar' datatype is defined in the file $EMPRESSPATH/rdbms/include/usrfns.h.
A 'dtpar' is a structure which contains the parameters defining  an 
attribute:

        typedef struct
        {
                long    dttypeid;       /* type identifier */
                long    dtpar1,         /* parameters */
                long    dtpar2,
                long    dtpar3,
                long    dtpar4,
                addr    dtpfptr;        /* pointer for file operations */
        }  dtpar;

For the nlschar data type (the only data type which can take advantage
of user defined collating) the structure elements contain:

        dttypeid = a constant corresponding to 'nlschar'
        dtpar1   = length of the attribute (number of bytes)
        dtpar2   = type of the attribute (1, 2 or 3)
        dtpar3   = collation function index
        dtpar4   = 0

In the function prototype above,
the string 's1' is a pointer to a string which contains no more than
p1->dtpar1 bytes; similarly, 's2' is a string which contains now more
than p2->dtpar1 bytes.

Any string will be padded on the right with '\0' if it is smaller 
than its dtpar1 parameter, but will _not_ be null terminated when the 
string is equal in size to dtpar1.  Empress will pass neither a null
string nor an empty string ("") to the collating function.

The following is a skeleton which could be used for this purpose:

	/* ------------------------------------------------------ */

	#include <usrfns.h>

	#define LESS	-1
	#define EQUAL	0
	#define GREATER	1

	int	skeleton_collate (
				char 	* string_1,
				dtpar	* par_string1,
				char 	* string_2,
				dtpar	* par_string2
				)
	{
		int	token1,
			token2;
		int	token1_size,
			token2_size;
		long	string1_size,
			string2_size;
		char	*s1,
			*s2;

		/* NOTES --------------------------------------------
		1. Empress will not pass in NULL strings, or strings of
		zero length.
		2. The strings will not be null terminated if they are 
		of maximum length, but are null terminated otherwise.
		----------------------------------------------------- */
 
		token1 = token2 = START_OF_STRING;
		token1_size = token2_size = 0;
		s1 = string_1;
		s2 = string_2;

		string1_size = par_string1->dtpar1;
		string2_size = par_string2->dtpar1;

		/* get one token at a time from each of the strings */

		while ( token1 != END_OF_STRING
			&& token2 != END_OF_STRING
			&& string1_size > 0
			&& string1_size > 0 )
		{
			/* the get_token() function is the core of
			the user-defined collating functionality, and
			is what distinguishes one collating mechanism 
			from another */

			token1_size = get_token (s1, string1_size, & token1);
			token2_size = get_token (s2, string2_size, & token2);

			/* if tokens are equal, go to the start of the next
			token */

			if ( token1 == token2 )
			{
				s1 += token1_size;
				s2 += token2_size;
				string1_size -= token1_size;
				string2_size -= token2_size;
				continue;
			}
			else if ( token1 < token2 )
			{
				return (LESS);
			}
			else if ( token1 > token2 )
			{
				return (GREATER);
			}
		}

		/* all tokens are equal so far, and at least one
		of the strings has no more tokens.  We need to take
		care of the possible end-of-string conditions */

		if ( token1 == END_OF_STRING )
		{
			if ( token2 == END_OF_STRING )
			{
				/* neither string has more tokens, and
				all tokens are equal ==> strings are equal */

				return (EQUAL);
			}
			else	/* token2 != END_OF_STRING */
			{
				/* second string is longer than first string,
				but are other wise identical ==> second
				string sorts _after_ first string */

				return (LESS);
			}
		}
		else	/* token1 != END_OF_STRING, token2 == END_OF_STRING */
		{
			/* first string is longer than second string,
			but are otherwise identical ==> first string
			sorts _after_ second string */

			return (GREATER);
		}
	}

	/* ------------------------------------------------------ */

The design of the user-defined collating mechanism must take account
of the following principles:

	1. Strings which are byte-for-byte equal are deemed 'equal'.
	Thus, you cannot define a 'random collating' mechanism.

	2. You must not re-define the meaning of 'equal' strings.  As
	currently implemented, if two strings are not byte-for-byte
	equal, then the user defined collating mechanism  is invoked
	on the strings, and its result is returned.  Although the mechanism
	is not prevented from returning 'equal', it should return only
	'less' (-1) or 'greater' (+1).

===========================================================================

4. How do you install your own collating mechanism?

To install a user defined collating routine in Empress, please follow
the following instructions.  

(a) The files which relate to user-defined collating are in several
sub-directories in the Empress main directory, $EMPRESSPATH.

	rdbms/custom/src/collate
			Makefile	--- make file for driver program
			drv_test_cmp.c	--- driver program for sample
						collating function test_cmp()
			test_cmp.c	--- sample collating function test_cmp()
			test_cmp.h	--- sample collating function 
						header file
			README		--- this file
			usrcmp.fns	--- some examples with explanation
	rdbms/custom/src/usrfns
			usrcmp.c	--- default (empty) collate 
						interface tables
	rdbms/include
		usrfns.h		--- header file containing various
						definitions needed with the
						collating function interace.

b) Write your own comparison function(s).  These functions should mimic
the $EMPRESSPATH/rdbms/custom/collate/test_cmp.c function.  You should
copy them into the $EMPRESSPATH/rdbms/custom directory. e.g.,

	% cp my_collate.c $EMPRESSPATH/rdbms/custom/src/collate/my_collate.c

Collating functions accept two non-null strings as input, and 
must return as follows:

	Condition		Return Value
	---------		------------
	str1 < str2		-1
	str1 == str2		0		/* this should never be needed */
	str1 > str2		+1

NOTES:
	i) the strings str1 and str2 are null terminated if they 
	are smaller than the maximum string length allowed by the 
	attribute which they are drawn from.  They are non-null 
	terminated otherwise.  
	ii) note that the maximum string length of an attribute is 
	given in the attribute parameter structure which is passed 
	in (see above for a definition of the dtpar structure).
	iii) the definition of '=' between two strings is _not_ under
	user control at any time.  Only if two strings are byte-for-byte
	identical are they equal.  The user cannot re-define the
	meaning of 'equal'.

c) You must now modify the file $EMPRESSPATH/rdbms/custom/src/usrfns/usrcmp.c
to reflect the collating mechanism(s) that you want your version of Empress
to support.   There are two steps involved here.

	(1) you must declare your collating function(s) at the top of
	the file $EMPRESSPATH/rdbms/custom/src/usrfns/usrcmp.c, similar to 4(b)
	above.

	(2) you must add the name of your collating function(s) to the
	array usr_nlschar_compare_tab[]; this is located  at the bottom of 
	the file
	$EMPRESSPATH/rdbms/custom/src/usrfns/usrcmp.c. Make sure that the
	usr_nlschar_compare_tab[] array is terminated with a (compare) 0 value.

Note that the order in which these functions are added to the array
defines how they are invoked.  A collating mechanism can be invoked for
certain attribute types; when attributes of these types are created
they can be defined with a 'collating parameter' whose value associates
a particular function from the usr_nlschar_compare_tab[] array with that attribute.
If an attribute is defined with a collating parameter value of 'i', then the
i_th collating function from the usr_nlschar_compare_tab[] table is used for collating
operations with that attribute.  A collating parameter value of 1 will
use the _first_ entry in the usr_nlschar_compare_tab[] table.  The value of (i-1)
is the C-language index into the usr_nlschar_compare_tab[] table.  See section 5 (below)
for an example of this.

Thus, if you have an existing database with attributes having pre-defined
collating parameters, you must install the collating functions in the
correct order.

d) Compile the pieces:

	% cd $EMPRESSPATH/rdbms/custom/src/usrfns
	% empcc usrcmp.c
	% cd $EMPRESSPATH/rdbms/custom/src/collate
	% empcc my_collate.c

e) Archive the objects:

	% ar rv $EMPRESSPATH/rdbms/lib/msconfig.a $EMPRESSPATH/rdbms/custom/src/usrfns/usrcmp.o
	% ar rv $EMPRESSPATH/rdbms/lib/dt.a $EMPRESSPATH/rdbms/custom/src/collate/my_collate.o

On some systems, you must also run ranlib at this point:

	% ranlib $EMPRESSPATH/rdbms/lib/msconfig.a
	% ranlib $EMPRESSPATH/rdbms/lib/dt.a

f) If you are using shared libraries, you must now re-make them:

	% cd $EMPRESSPATH/rdbms/conf_bin
	% ./mklibsh

g) You must now re-make any executables.  Included here are the
executables which you create, plus those of the Empress system.
To re-make the Empress executables, do this:

	% cd $EMPRESSPATH/rdbms/conf_bin
	% ./mkexec

You should now be ready to use the collating functions which you have
installed with Empress executables such as empsql.  However, don't
forget to re-compile any other applications that you have created
which need this feature.

===========================================================================

5. How do you use a user-defined collating mechanism?

We assume that two collating mechanisms have been installed as in 
Section 4 above.   Let us suppose that the collating mechanisms 
correspond to the "german" and "czech" languages; the collating
table from $EMPRESSPATH/rdbms/custom/src/usrfns/usrcmp.c is:

	GLOBAL_VAR      compare       usr_nlschar_compare_tab[]=
	{
        	german_collate, 	/* german collating function */
        	czech_collate, 		/* czech collating function */
        	(compare) 0
	};

where the functions german_collate(), czech_collate() are defined 
as above.

Once these mechanisms are installed, users can create tables such as the
following:

	% empsql DB
	*> create table t (
		attr_char	char(20,3),
		attr_nlschar	nlschar(20,3),
		attr_german1	nlschar(20,3,1),
		attr_german2	nlschar(20,3,1),
		attr_czech1	nlschar(20,3,2),
		attr_czech2	nlschar(20,3,2)
		);

Only nlschar datatypes can take advantage of user-defined collating; all
other datatypes will use the standard (i.e., ISO Latin 1) collating.
To define an attribute's collating mechanism,
nlschar attributes (only) now have an optional third parameter
in their definition.  If the parameter is omitted (as in attr_nlschar)
then the standard collating sequence is used and the parameter takes on the
value 0.  Otherwise, the parameter identifies an entry in the collating
function array usr_nlschar_compare_tab[].  It is an error to create an 
attribute for which there is no collating mechanism.

Once attributes of the appropriate types are created, their associated
user-defined collating mechanisms can be invoked within the SORT BY
clause, or within the WHERE clause when using certain operators.  The
standard Empress operators, and their effect on collating, is given in
the table below:

	Operator		User Defined Collating Mechanism Used?
	--------		--------------------------------------
	<, <=, =, >=, >			yes
	range				yes
	like, match, smatch		no

Thus, the sorts of operations which can be done with this table are:

	*> select from t sort by attr_german1;

		- sorts table according to user defined collating function
		'german_collate'.

	*> select from t where attr_german1 > "Wagner";
	*> select from t where attr_german1 range from "Bach" to "Wagner";

		- both strings are evaluated using the user defined collating 
		function 'german_collate' and then the comparison is made.

	*> select from t where attr_german1 = attr_german2;

		- both attributes use the same collating function so this 
		statement is ok.  As with the previous case, both strings
		are evaluated with 'german_collate' function, and then
		compared with the '=' operator.

	*> select from t where attr_german1 > attr_czech1;

		-this statement is syntactically correct, but logically
		incorrect since the operands of a comparison operator
		must use the same collating function.  Empress allows
		this operation to proceed, but the results may be
		unpredictable.  The user should take care to compare
		only attributes of the same collating type.

	*> select from t where attr_char < attr_nlschar;

		-this uses the standard  ISO latin1 character set for
		comparison of the two attributes.

One further area in which sorting is needed is in the creation of
indices.  Indices can be built on an attribute using that attribute's
collating function.   Therefore, in the example above, if an index
is placed on attr_german1, the german collating routine will govern
the order of entries within the index.

The construction of indices requires certain rules to be following 
in the collating sequence:

	* for nlschar attributes, the 'smallest' string (i.e., the string
	which sorts below all other strings) is the string which has
	'\0'  as its first character.  This is referred to as 'dtmin'.
	* for nlschar attributes, the 'largest' string (i.e., the string
	which sorts above all other strings) is the string which has
	'\128' (0x80) as its first character.  This is referred to as
	'dtmax'.
	* all other strings evaluate to larger than dtmin and to less
	than dtmax.
	* the collating sequence can contain no 'loops'.  For any characters
	x, y and z: if x < y and y < z, then x < z is necessary.

6. Limitations
==============

1.  Only nlschar datatypes can take advantage of user-defined collating.

2. The third parameter of nlschar cannot be exported with the 'empexpt'
utility at present.  This parameter is lost when creating the export
file.  Therefore, if the table is imported the default (ISO Latin1)
sorting sequence will be used.  If another collating function is desired,
the table must be ALTERed to reflect this.

3. Do not redefine the meaning of 'equal'.  If you do, inconsistent results
may be obtained.

4. For an nlschar attribute, user-defined collating is never used with
the operators MATCH, SMATCH or LIKE.  Furthermore, depending on whether
the attribute is 'normal' (i.e., third parameter is zero) or has a
user-defined collating mechanism (i.e., third parameter is non-zero),
the collating mechanism which is actually used depends on whether the
attribute is indexed.  The following table summarizes the operations:


		UDC nlschar Attribute		non-UDC nlschar Attribute
		(parameter_3 != 0)		(parameter_3 = 0)
		------------------------------------------------------------
Index		Numeric Character 		ISO Latin1 Sort
		Value Sort	

No Index	Numeric Character		Numeric Character
		Value Sort			Value Sort

The 'Numeric Character Value Sort' mechanism simply implies that individual
bytes are sorted according to their numeric value.  Note that this is _not_
the same as ISO Latin1 sorting.  In ISO Latin1 sorting, 8-bit characters
are sorted between the usual ASCII (i.e. 7-bit) characters.  For example,
the letters 'a' and 'b' are numerically adjacent (having decimal values
of 97 and 98, respectively), but the accented letters a-acute, a-grave,
a-diaeresis, etc. , are sorted between them.

4. Even if an index is available, a MATCH, SMATCH or LIKE operation with
a UDC nlschar attribute will not be optimized, and so may be slower than
an equivalent operation on a non-UDC nlschar attribute.

5. The special strings whose first characters are '\0' and '\128' cannot
be inserted into an nlschar attribute.
