glib.Unicode.Unicode Class Reference

Detailed Description

This section describes a number of functions for dealing with Unicode characters and strings.

There are analogues of the traditional ctype.h character classification and case conversion functions, UTF-8 analogues of some string utility functions, functions to perform normalization, case conversion and collation on UTF-8 strings and finally functions to convert between the UTF-8, UTF-16 and UCS-4 encodings of Unicode. The implementations of the Unicode functions in GLib are based on the Unicode Character Data tables, which are available from www.unicode.org. GLib 2.8 supports Unicode 4.0, GLib 2.10 supports Unicode 4.1, GLib 2.12 supports Unicode 5.0.


Static Public Member Functions

static int unicharValidate (gunichar ch)
 Checks whether ch is a valid Unicode character.
static int unicharIsalnum (gunichar c)
 Determines whether a character is alphanumeric.
static int unicharIsalpha (gunichar c)
 Determines whether a character is alphabetic (i.e. a letter).
static int unicharIscntrl (gunichar c)
 Determines whether a character is a control character.
static int unicharIsdigit (gunichar c)
 Determines whether a character is numeric (i.e. a digit).
static int unicharIsgraph (gunichar c)
 Determines whether a character is printable and not a space (returns FALSE for control characters, format characters, and spaces).
static int unicharIslower (gunichar c)
 Determines whether a character is a lowercase letter.
static int unicharIsprint (gunichar c)
 Determines whether a character is printable.
static int unicharIspunct (gunichar c)
 Determines whether a character is punctuation or a symbol.
static int unicharIsspace (gunichar c)
 Determines whether a character is a space, tab, or line separator (newline, carriage return, etc.).
static int unicharIsupper (gunichar c)
 Determines if a character is uppercase.
static int unicharIsxdigit (gunichar c)
 Determines if a character is a hexadecimal digit.
static int unicharIstitle (gunichar c)
 Determines if a character is titlecase.
static int unicharIsdefined (gunichar c)
 Determines if a given character is assigned in the Unicode standard.
static int unicharIswide (gunichar c)
 Determines if a character is typically rendered in a double-width cell.
static int unicharIswideCjk (gunichar c)
 Determines if a character is typically rendered in a double-width cell under legacy East Asian locales.
static gunichar unicharToupper (gunichar c)
 Converts a character to uppercase.
static gunichar unicharTolower (gunichar c)
 Converts a character to lower case.
static gunichar unicharTotitle (gunichar c)
 Converts a character to titlecase.
static int unicharDigitValue (gunichar c)
 Determines the numeric value of a character as a decimal digit.
static int unicharXdigitValue (gunichar c)
 Determines the numeric value of a character as a hexadecimal digit.
static GUnicodeType unicharType (gunichar c)
 Classifies a Unicode character by type.
static GUnicodeBreakType unicharBreakType (gunichar c)
 Determines the break type of c.
static void unicodeCanonicalOrdering (gunichar *string, uint len)
 Computes the canonical ordering of a string in-place.
static gunichar* unicodeCanonicalDecomposition (gunichar ch, uint *resultLen)
 Computes the canonical decomposition of a Unicode character.
static int unicharGetMirrorChar (gunichar ch, gunichar *mirroredCh)
 In Unicode, some characters are mirrored.
static GUnicodeScript unicharGetScript (gunichar ch)
 Looks up the GUnicodeScript for a particular character (as defined by Unicode Standard Annex 24).
static gunichar utf8_GetChar (char[] p)
 Converts a sequence of bytes encoded as UTF-8 to a Unicode character.
static gunichar utf8_GetCharValidated (char[] p, int maxLen)
 Convert a sequence of bytes encoded as UTF-8 to a Unicode character.
static char[] utf8_OffsetToPointer (char[] str, int offset)
 Converts from an integer character offset to a pointer to a position within the string.
static int utf8_PointerToOffset (char[] str, char[] pos)
 Converts from a pointer to a position within a string to an integer character offset.
static char[] utf8_PrevChar (char[] p)
 Finds the previous UTF-8 character in the string before p.
static char[] utf8_FindNextChar (char[] p, char[] end)
 Finds the start of the next UTF-8 character in the string after p.
static char[] utf8_FindPrevChar (char[] str, char[] p)
 Given a position p within a UTF-8 encoded string str, find the start of the previous UTF-8 character starting before p.
static int utf8_Strlen (char[] p, int max)
 Returns the length of the string in characters.
static char[] utf8_Strncpy (char[] dest, char[] src, uint n)
 Like the standard C strncpy() function, but copies a given number of characters instead of a given number of bytes.
static char[] utf8_Strchr (char[] p, int len, gunichar c)
 Finds the leftmost occurrence of the given Unicode character in a UTF-8 encoded string, while limiting the search to len bytes.
static char[] utf8_Strrchr (char[] p, int len, gunichar c)
 Find the rightmost occurrence of the given Unicode character in a UTF-8 encoded string, while limiting the search to len bytes.
static char[] utf8_Strreverse (char[] str, int len)
 Reverses a UTF-8 string.
static int utf8_Validate (char[] str, int maxLen, char **end)
 Validates UTF-8 encoded text.
static char[] utf8_Strup (char[] str, int len)
 Converts all Unicode characters in the string that have a case to uppercase.
static char[] utf8_Strdown (char[] str, int len)
 Converts all Unicode characters in the string that have a case to lowercase.
static char[] utf8_Casefold (char[] str, int len)
 Converts a string into a form that is independent of case.
static char[] utf8_Normalize (char[] str, int len, GNormalizeMode mode)
 Converts a string into canonical form, standardizing such issues as whether a character with an accent is represented as a base character and combining accent or as a single precomposed character.
static int utf8_Collate (char[] str1, char[] str2)
 Compares two strings for ordering using the linguistically correct rules for the current locale.
static char[] utf8_CollateKey (char[] str, int len)
 Converts a string into a collation key that can be compared with other collation keys produced by the same function using strcmp().
static char[] utf8_CollateKeyForFilename (char[] str, int len)
 Converts a string into a collation key that can be compared with other collation keys produced by the same function using strcmp().
static gunichar2* utf8_ToUtf16 (char[] str, int len, int *itemsRead, int *itemsWritten, GError **error)
 Convert a string from UTF-8 to UTF-16.
static gunichar* utf8_ToUcs4 (char[] str, int len, int *itemsRead, int *itemsWritten, GError **error)
 Convert a string from UTF-8 to a 32-bit fixed width representation as UCS-4.
static gunichar* utf8_ToUcs4_Fast (char[] str, int len, int *itemsWritten)
 Convert a string from UTF-8 to a 32-bit fixed width representation as UCS-4, assuming valid UTF-8 input.
static gunichar* utf16_ToUcs4 (gunichar2 *str, int len, int *itemsRead, int *itemsWritten, GError **error)
 Convert a string from UTF-16 to UCS-4.
static char[] utf16_ToUtf8 (gunichar2 *str, int len, int *itemsRead, int *itemsWritten, GError **error)
 Convert a string from UTF-16 to UTF-8.
static gunichar2* ucs4_ToUtf16 (gunichar *str, int len, int *itemsRead, int *itemsWritten, GError **error)
 Convert a string from UCS-4 to UTF-16.
static char[] ucs4_ToUtf8 (gunichar *str, int len, int *itemsRead, int *itemsWritten, GError **error)
 Convert a string from a 32-bit fixed width representation as UCS-4 to UTF-8.
static int unicharToUtf8 (gunichar c, char[] outbuf)
 Converts a single character to UTF-8.


Member Function Documentation

static gunichar2* glib.Unicode.Unicode.ucs4_ToUtf16 ( gunichar *  str,
int  len,
int *  itemsRead,
int *  itemsWritten,
GError **  error 
) [static]

Convert a string from UCS-4 to UTF-16.

A 0 character will be added to the result after the converted text. str: a UCS-4 encoded string len: the maximum length (number of characters) of str to use. If len < 0, then the string is terminated with a 0 character. items_read: location to store number of bytes read, or NULL. If an error occurs then the index of the invalid input is stored here. items_written: location to store number of gunichar2 written, or NULL. The value stored here does not include the trailing 0. error: location to store the error occurring, or NULL to ignore errors. Any of the errors in GConvertError other than G_CONVERT_ERROR_NO_CONVERSION may occur. Returns: a pointer to a newly allocated UTF-16 string. This value must be freed with g_free(). If an error occurs, NULL will be returned and error set.

static char [] glib.Unicode.Unicode.ucs4_ToUtf8 ( gunichar *  str,
int  len,
int *  itemsRead,
int *  itemsWritten,
GError **  error 
) [static]

Convert a string from a 32-bit fixed width representation as UCS-4 to UTF-8.

The result will be terminated with a 0 byte. str: a UCS-4 encoded string len: the maximum length (number of characters) of str to use. If len < 0, then the string is terminated with a 0 character. items_read: location to store number of characters read, or NULL. items_written: location to store number of bytes written, or NULL. The value stored here does not include the trailing 0 byte. error: location to store the error occurring, or NULL to ignore errors. Any of the errors in GConvertError other than G_CONVERT_ERROR_NO_CONVERSION may occur. Returns: a pointer to a newly allocated UTF-8 string. This value must be freed with g_free(). If an error occurs, NULL will be returned and error set. In that case, items_read will be set to the position of the first invalid input character.

static GUnicodeBreakType glib.Unicode.Unicode.unicharBreakType ( gunichar  c  )  [static]

Determines the break type of c.

c should be a Unicode character (to derive a character from UTF-8 encoded text, use g_utf8_get_char()). The break type is used to find word and line breaks ("text boundaries"). Pango implements the Unicode boundary resolution algorithms, and normally you would use a function such as pango_break() instead of caring about break types yourself. c: a Unicode character Returns: the break type of c

static int glib.Unicode.Unicode.unicharDigitValue ( gunichar  c  )  [static]

Determines the numeric value of a character as a decimal digit.

c: a Unicode character Returns: If c is a decimal digit (according to g_unichar_isdigit()), its numeric value. Otherwise, -1.

static int glib.Unicode.Unicode.unicharGetMirrorChar ( gunichar  ch,
gunichar *  mirroredCh 
) [static]

In Unicode, some characters are mirrored.

This means that their images are mirrored horizontally in text that is laid out from right to left. For instance, "(" would become its mirror image, ")", in right-to-left text. If ch has the Unicode mirrored property and there is another Unicode character that typically has a glyph that is the mirror image of ch's glyph and mirrored_ch is set, it puts that character in the address pointed to by mirrored_ch. Otherwise the original character is put. ch: a Unicode character mirrored_ch: location to store the mirrored character Returns: TRUE if ch has a mirrored character, FALSE otherwise Since 2.4
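The lookup described above can be illustrated with a small self-contained sketch in plain C. It is not GLib's implementation: the `mirror_pairs` table and the `get_mirror_char` name are hypothetical and cover only a few ASCII brackets, whereas the real function consults the full Unicode mirroring data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, deliberately tiny table of mirrored pairs (ASCII only). */
static const uint32_t mirror_pairs[][2] = {
    { '(', ')' }, { ')', '(' },
    { '[', ']' }, { ']', '[' },
    { '{', '}' }, { '}', '{' },
    { '<', '>' }, { '>', '<' },
};

/* Returns 1 and stores the mirror image of ch if one is known;
 * otherwise returns 0 and stores ch itself. mirrored may be NULL. */
static int get_mirror_char(uint32_t ch, uint32_t *mirrored)
{
    size_t n = sizeof mirror_pairs / sizeof mirror_pairs[0];
    for (size_t i = 0; i < n; i++) {
        if (mirror_pairs[i][0] == ch) {
            if (mirrored != NULL)
                *mirrored = mirror_pairs[i][1];
            return 1;
        }
    }
    if (mirrored != NULL)
        *mirrored = ch;
    return 0;
}
```

Calling `get_mirror_char('(', &m)` in this sketch stores `')'` in `m`, while a character without an entry in the table is stored back unchanged, matching the "otherwise the original character is put" behaviour described above.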

static GUnicodeScript glib.Unicode.Unicode.unicharGetScript ( gunichar  ch  )  [static]

Looks up the GUnicodeScript for a particular character (as defined by Unicode Standard Annex 24).

No check is made for ch being a valid Unicode character; if you pass in an invalid character, the result is undefined. ch: a Unicode character Returns: the GUnicodeScript for the character. Since 2.14

static int glib.Unicode.Unicode.unicharIsalnum ( gunichar  c  )  [static]

Determines whether a character is alphanumeric.

Given some UTF-8 text, obtain a character value with g_utf8_get_char(). c: a Unicode character Returns: TRUE if c is an alphanumeric character

static int glib.Unicode.Unicode.unicharIsalpha ( gunichar  c  )  [static]

Determines whether a character is alphabetic (i.e. a letter).

Given some UTF-8 text, obtain a character value with g_utf8_get_char(). c: a Unicode character Returns: TRUE if c is an alphabetic character

static int glib.Unicode.Unicode.unicharIscntrl ( gunichar  c  )  [static]

Determines whether a character is a control character.

Given some UTF-8 text, obtain a character value with g_utf8_get_char(). c: a Unicode character Returns: TRUE if c is a control character

static int glib.Unicode.Unicode.unicharIsdefined ( gunichar  c  )  [static]

Determines if a given character is assigned in the Unicode standard.

c: a Unicode character Returns: TRUE if the character has an assigned value

static int glib.Unicode.Unicode.unicharIsdigit ( gunichar  c  )  [static]

Determines whether a character is numeric (i.e. a digit).

This covers ASCII 0-9 and also digits in other languages/scripts. Given some UTF-8 text, obtain a character value with g_utf8_get_char(). c: a Unicode character Returns: TRUE if c is a digit

static int glib.Unicode.Unicode.unicharIsgraph ( gunichar  c  )  [static]

Determines whether a character is printable and not a space (returns FALSE for control characters, format characters, and spaces).

g_unichar_isprint() is similar, but returns TRUE for spaces. Given some UTF-8 text, obtain a character value with g_utf8_get_char(). c: a Unicode character Returns: TRUE if c is printable unless it's a space

static int glib.Unicode.Unicode.unicharIslower ( gunichar  c  )  [static]

Determines whether a character is a lowercase letter.

Given some UTF-8 text, obtain a character value with g_utf8_get_char(). c: a Unicode character Returns: TRUE if c is a lowercase letter

static int glib.Unicode.Unicode.unicharIsprint ( gunichar  c  )  [static]

Determines whether a character is printable.

Unlike g_unichar_isgraph(), returns TRUE for spaces. Given some UTF-8 text, obtain a character value with g_utf8_get_char(). c: a Unicode character Returns: TRUE if c is printable

static int glib.Unicode.Unicode.unicharIspunct ( gunichar  c  )  [static]

Determines whether a character is punctuation or a symbol.

Given some UTF-8 text, obtain a character value with g_utf8_get_char(). c: a Unicode character Returns: TRUE if c is a punctuation or symbol character

static int glib.Unicode.Unicode.unicharIsspace ( gunichar  c  )  [static]

Determines whether a character is a space, tab, or line separator (newline, carriage return, etc.).

Given some UTF-8 text, obtain a character value with g_utf8_get_char(). (Note: don't use this to do word breaking; you have to use Pango or equivalent to get word breaking right, as the algorithm is fairly complex.) c: a Unicode character Returns: TRUE if c is a space character

static int glib.Unicode.Unicode.unicharIstitle ( gunichar  c  )  [static]

Determines if a character is titlecase.

Some characters in Unicode which are composites, such as the DZ digraph, have three case variants instead of just two. The titlecase form is used at the beginning of a word where only the first letter is capitalized. The titlecase form of the DZ digraph is U+01F2 LATIN CAPITAL LETTER D WITH SMALL LETTER Z. c: a Unicode character Returns: TRUE if the character is titlecase

static int glib.Unicode.Unicode.unicharIsupper ( gunichar  c  )  [static]

Determines if a character is uppercase.

c: a Unicode character Returns: TRUE if c is an uppercase character

static int glib.Unicode.Unicode.unicharIswide ( gunichar  c  )  [static]

Determines if a character is typically rendered in a double-width cell.

c: a Unicode character Returns: TRUE if the character is wide

static int glib.Unicode.Unicode.unicharIswideCjk ( gunichar  c  )  [static]

Determines if a character is typically rendered in a double-width cell under legacy East Asian locales.

If a character is wide according to g_unichar_iswide(), then it is also reported wide with this function, but the converse is not necessarily true. See the Unicode Standard Annex 11 for details. c: a Unicode character Returns: TRUE if the character is wide in legacy East Asian locales Since 2.12

static int glib.Unicode.Unicode.unicharIsxdigit ( gunichar  c  )  [static]

Determines if a character is a hexadecimal digit.

c: a Unicode character. Returns: TRUE if the character is a hexadecimal digit

static gunichar glib.Unicode.Unicode.unicharTolower ( gunichar  c  )  [static]

Converts a character to lower case.

c: a Unicode character. Returns: the result of converting c to lower case. If c is not an uppercase or titlecase character, or has no lowercase equivalent, c is returned unchanged.

static gunichar glib.Unicode.Unicode.unicharTotitle ( gunichar  c  )  [static]

Converts a character to titlecase.

c: a Unicode character Returns: the result of converting c to titlecase. If c is not an uppercase or lowercase character, c is returned unchanged.

static gunichar glib.Unicode.Unicode.unicharToupper ( gunichar  c  )  [static]

Converts a character to uppercase.

c: a Unicode character Returns: the result of converting c to uppercase. If c is not a lowercase or titlecase character, or has no uppercase equivalent, c is returned unchanged.

static int glib.Unicode.Unicode.unicharToUtf8 ( gunichar  c,
char[]  outbuf 
) [static]

Converts a single character to UTF-8.

c: a Unicode character code outbuf: output buffer, must have at least 6 bytes of space. If NULL, the length will be computed and returned and nothing will be written to outbuf. Returns: number of bytes written. See also: g_locale_to_utf8() and g_locale_from_utf8(), convenience functions for converting between UTF-8 and the locale encoding.
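As an illustration of what converting a single character to UTF-8 involves, here is a minimal sketch in plain C. The name `encode_utf8` is hypothetical and this is not presented as GLib's code; it assumes the original 1 to 6 byte UTF-8 layout, which is consistent with the 6-byte buffer requirement above, and like the documented function it only computes the length when the buffer is NULL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encodes c into outbuf (if non-NULL) and returns the number of bytes
 * the encoding occupies. Uses the original 1-6 byte UTF-8 layout, which
 * is why a 6-byte buffer is always sufficient. */
static int encode_utf8(uint32_t c, unsigned char *outbuf)
{
    int len;
    unsigned char first;   /* marker bits of the leading byte */

    if      (c < 0x80)      { len = 1; first = 0x00; }
    else if (c < 0x800)     { len = 2; first = 0xC0; }
    else if (c < 0x10000)   { len = 3; first = 0xE0; }
    else if (c < 0x200000)  { len = 4; first = 0xF0; }
    else if (c < 0x4000000) { len = 5; first = 0xF8; }
    else                    { len = 6; first = 0xFC; }

    if (outbuf != NULL) {
        /* Fill continuation bytes from the end, 6 payload bits each. */
        for (int i = len - 1; i > 0; i--) {
            outbuf[i] = (unsigned char)((c & 0x3F) | 0x80);
            c >>= 6;
        }
        outbuf[0] = (unsigned char)(c | first);
    }
    return len;
}
```

With this sketch, U+20AC (the euro sign) encodes to the three bytes E2 82 AC, and passing NULL for the buffer returns the same length without writing anything.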

static GUnicodeType glib.Unicode.Unicode.unicharType ( gunichar  c  )  [static]

Classifies a Unicode character by type.

c: a Unicode character Returns: the type of the character.

static int glib.Unicode.Unicode.unicharValidate ( gunichar  ch  )  [static]

Checks whether ch is a valid Unicode character.

Some possible integer values of ch will not be valid. 0 is considered a valid character, though it's normally a string terminator. ch: a Unicode character Returns: TRUE if ch is a valid Unicode character

static int glib.Unicode.Unicode.unicharXdigitValue ( gunichar  c  )  [static]

Determines the numeric value of a character as a hexadecimal digit.

c: a Unicode character Returns: If c is a hex digit (according to g_unichar_isxdigit()), its numeric value. Otherwise, -1.

static gunichar* glib.Unicode.Unicode.unicodeCanonicalDecomposition ( gunichar  ch,
uint *  resultLen 
) [static]

Computes the canonical decomposition of a Unicode character.

ch: a Unicode character. result_len: location to store the length of the return value. Returns: a newly allocated string of Unicode characters. result_len is set to the resulting length of the string.

static void glib.Unicode.Unicode.unicodeCanonicalOrdering ( gunichar *  string,
uint  len 
) [static]

Computes the canonical ordering of a string in-place.

This rearranges decomposed characters in the string according to their combining classes. See the Unicode manual for more information. string: a UCS-4 encoded string. len: the maximum length of string to use.

static gunichar* glib.Unicode.Unicode.utf16_ToUcs4 ( gunichar2 *  str,
int  len,
int *  itemsRead,
int *  itemsWritten,
GError **  error 
) [static]

Convert a string from UTF-16 to UCS-4.

The result will be terminated with a 0 character. str: a UTF-16 encoded string len: the maximum length (number of gunichar2) of str to use. If len < 0, then the string is terminated with a 0 character. items_read: location to store number of words read, or NULL. If NULL, then G_CONVERT_ERROR_PARTIAL_INPUT will be returned in case str contains a trailing partial character. If an error occurs then the index of the invalid input is stored here. items_written: location to store number of characters written, or NULL. The value stored here does not include the trailing 0 character. error: location to store the error occurring, or NULL to ignore errors. Any of the errors in GConvertError other than G_CONVERT_ERROR_NO_CONVERSION may occur. Returns: a pointer to a newly allocated UCS-4 string. This value must be freed with g_free(). If an error occurs, NULL will be returned and error set.
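The "trailing partial character" mentioned for utf16_ToUcs4 is a high surrogate with no low surrogate after it. The following plain C sketch shows how one character can be decoded from UTF-16 and how that case differs from invalid input. The function name and its return convention are hypothetical, chosen for illustration, and do not reproduce GLib's internals or its GError reporting.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decodes one character from str[0..len). Returns the number of 16-bit
 * units consumed (1 or 2), 0 if the input is invalid, or -1 if it ends
 * in the middle of a surrogate pair (a "trailing partial character"). */
static int utf16_decode_one(const uint16_t *str, size_t len, uint32_t *out)
{
    assert(str != NULL && out != NULL && len > 0);
    uint16_t hi = str[0];

    if (hi < 0xD800 || hi > 0xDFFF) {   /* ordinary BMP character */
        *out = hi;
        return 1;
    }
    if (hi >= 0xDC00)                   /* low surrogate without a high one */
        return 0;
    if (len < 2)                        /* high surrogate at end of input */
        return -1;

    uint16_t lo = str[1];
    if (lo < 0xDC00 || lo > 0xDFFF)     /* high surrogate not followed by low */
        return 0;

    /* Combine the 10 payload bits of each half and add the 0x10000 offset. */
    *out = 0x10000 + (((uint32_t)hi - 0xD800) << 10) + ((uint32_t)lo - 0xDC00);
    return 2;
}
```

For example, the pair 0xD83D 0xDE00 decodes to U+1F600 in this sketch, while the same high surrogate supplied alone is reported as partial input rather than as an error.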

static char [] glib.Unicode.Unicode.utf16_ToUtf8 ( gunichar2 *  str,
int  len,
int *  itemsRead,
int *  itemsWritten,
GError **  error 
) [static]

Convert a string from UTF-16 to UTF-8.

The result will be terminated with a 0 byte. Note that the input is expected to be already in native endianness; an initial byte-order-mark character is not handled specially. g_convert() can be used to convert a byte buffer of UTF-16 data of ambiguous endianness. str: a UTF-16 encoded string len: the maximum length (number of gunichar2) of str to use. If len < 0, then the string is terminated with a 0 character. items_read: location to store number of words read, or NULL. If NULL, then G_CONVERT_ERROR_PARTIAL_INPUT will be returned in case str contains a trailing partial character. If an error occurs then the index of the invalid input is stored here. items_written: location to store number of bytes written, or NULL. The value stored here does not include the trailing 0 byte. error: location to store the error occurring, or NULL to ignore errors. Any of the errors in GConvertError other than G_CONVERT_ERROR_NO_CONVERSION may occur. Returns: a pointer to a newly allocated UTF-8 string. This value must be freed with g_free(). If an error occurs, NULL will be returned and error set.

static char [] glib.Unicode.Unicode.utf8_Casefold ( char[]  str,
int  len 
) [static]

Converts a string into a form that is independent of case.

The result will not correspond to any particular case, but can be compared for equality or ordered with the results of calling g_utf8_casefold() on other strings. Note that calling g_utf8_casefold() followed by g_utf8_collate() is only an approximation to the correct linguistic case insensitive ordering, though it is a fairly good one. Getting this exactly right would require a more sophisticated collation function that takes case sensitivity into account. GLib does not currently provide such a function. str: a UTF-8 encoded string len: length of str, in bytes, or -1 if str is nul-terminated. Returns: a newly allocated string, that is a case independent form of str.

static int glib.Unicode.Unicode.utf8_Collate ( char[]  str1,
char[]  str2 
) [static]

Compares two strings for ordering using the linguistically correct rules for the current locale.

When sorting a large number of strings, it will be significantly faster to obtain collation keys with g_utf8_collate_key() and compare the keys with strcmp() when sorting instead of sorting the original strings. str1: a UTF-8 encoded string str2: a UTF-8 encoded string Returns: < 0 if str1 compares before str2, 0 if they compare equal, > 0 if str1 compares after str2.

static char [] glib.Unicode.Unicode.utf8_CollateKey ( char[]  str,
int  len 
) [static]

Converts a string into a collation key that can be compared with other collation keys produced by the same function using strcmp().

The results of comparing the collation keys of two strings with strcmp() will always be the same as comparing the two original strings with g_utf8_collate(). Note that this function depends on the current locale. str: a UTF-8 encoded string. len: length of str, in bytes, or -1 if str is nul-terminated. Returns: a newly allocated string. This string should be freed with g_free() when you are done with it.

static char [] glib.Unicode.Unicode.utf8_CollateKeyForFilename ( char[]  str,
int  len 
) [static]

Converts a string into a collation key that can be compared with other collation keys produced by the same function using strcmp().

In order to sort filenames correctly, this function treats the dot '.' as a special case. Most dictionary orderings seem to consider it insignificant, thus producing the ordering "event.c" "eventgenerator.c" "event.h" instead of "event.c" "event.h" "eventgenerator.c". Also, we would like to treat numbers intelligently so that "file1" "file10" "file5" is sorted as "file1" "file5" "file10". Note that this function depends on the current locale. str: a UTF-8 encoded string. len: length of str, in bytes, or -1 if str is nul-terminated. Returns: a newly allocated string. This string should be freed with g_free() when you are done with it. Since 2.8
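The number handling described above ("file5" before "file10") can be illustrated with a direct comparison function. This plain C sketch is a swapped-in technique, not the documented one: it compares two names pairwise instead of producing collation keys, it is byte-based and locale-independent, and the name `natural_compare` is hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Compares a and b byte by byte, except that runs of ASCII digits are
 * compared by numeric value. Returns -1, 0 or 1. */
static int natural_compare(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    while (*a != '\0' && *b != '\0') {
        if (isdigit((unsigned char)*a) && isdigit((unsigned char)*b)) {
            /* Skip leading zeros so that "007" and "7" compare equal. */
            while (*a == '0') a++;
            while (*b == '0') b++;
            const char *ea = a, *eb = b;
            while (isdigit((unsigned char)*ea)) ea++;
            while (isdigit((unsigned char)*eb)) eb++;
            size_t la = (size_t)(ea - a), lb = (size_t)(eb - b);
            /* More significant digits means a larger number. */
            if (la != lb)
                return la < lb ? -1 : 1;
            /* Same digit count: lexicographic order equals numeric order. */
            int c = memcmp(a, b, la);
            if (c != 0)
                return c < 0 ? -1 : 1;
            a = ea;
            b = eb;
        } else {
            if (*a != *b)
                return (unsigned char)*a < (unsigned char)*b ? -1 : 1;
            a++;
            b++;
        }
    }
    if (*a != '\0') return 1;    /* a has characters left, so it sorts later */
    if (*b != '\0') return -1;
    return 0;
}
```

In this sketch the dot needs no special treatment because '.' has a lower byte value than any letter, so "event.h" already sorts before "eventgenerator.c"; a locale-aware dictionary ordering would not give that for free, which is why the documented function handles it explicitly.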

static char [] glib.Unicode.Unicode.utf8_FindNextChar ( char[]  p,
char[]  end 
) [static]

Finds the start of the next UTF-8 character in the string after p.

p does not have to be at the beginning of a UTF-8 character. No check is made to see if the character found is actually valid other than it starts with an appropriate byte. p: a pointer to a position within a UTF-8 encoded string end: a pointer to the end of the string, or NULL to indicate that the string is nul-terminated. Returns: a pointer to the found character or NULL

static char [] glib.Unicode.Unicode.utf8_FindPrevChar ( char[]  str,
char[]  p 
) [static]

Given a position p within a UTF-8 encoded string str, find the start of the previous UTF-8 character starting before p.

Returns NULL if no UTF-8 characters are present in str before p. p does not have to be at the beginning of a UTF-8 character. No check is made to see if the character found is actually valid other than it starts with an appropriate byte. str: pointer to the beginning of a UTF-8 encoded string p: pointer to some position within str Returns: a pointer to the found character or NULL.
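The backward scan described for utf8_FindPrevChar relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx. A minimal sketch in plain C follows; the function name is hypothetical and the code is an illustration of the idea rather than the library's source.

```c
#include <assert.h>
#include <stddef.h>

/* Steps back from p until a byte that is not a continuation byte
 * (10xxxxxx) is found. Returns NULL if the start of the string is
 * reached first. Only the byte pattern is checked, not validity. */
static const char *find_prev_char(const char *str, const char *p)
{
    assert(str != NULL && p != NULL);
    while (p > str) {
        --p;
        if (((unsigned char)*p & 0xC0) != 0x80)
            return p;
    }
    return NULL;
}
```

Because the loop never reads before `str`, this form is safe even when `p` points at the first character, which is the situation in which utf8_PrevChar must not be used.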

static gunichar glib.Unicode.Unicode.utf8_GetChar ( char[]  p  )  [static]

Converts a sequence of bytes encoded as UTF-8 to a Unicode character.

If p does not point to a valid UTF-8 encoded character, results are undefined. If you are not sure that the bytes are complete valid Unicode characters, you should use g_utf8_get_char_validated() instead. p: a pointer to Unicode character encoded as UTF-8 Returns: the resulting character

static gunichar glib.Unicode.Unicode.utf8_GetCharValidated ( char[]  p,
int  maxLen 
) [static]

Convert a sequence of bytes encoded as UTF-8 to a Unicode character.

This function checks for incomplete characters, for invalid characters such as characters that are out of the range of Unicode, and for overlong encodings of valid characters. p: a pointer to Unicode character encoded as UTF-8 max_len: the maximum number of bytes to read, or -1, for no maximum. Returns: the resulting character. If p points to a partial sequence at the end of a string that could begin a valid character, returns (gunichar)-2; otherwise, if p does not point to a valid UTF-8 encoded Unicode character, returns (gunichar)-1.
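The distinction between the two error values of utf8_GetCharValidated can be sketched as follows. This is an illustrative decoder in plain C under stated assumptions: it handles only the 1 to 4 byte forms, additionally rejects surrogate code points, and uses hypothetical names (`decode_validated`, `DECODE_INVALID`, `DECODE_PARTIAL`) for the function and for the -1 and -2 results.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DECODE_INVALID ((uint32_t)-1)   /* not a valid UTF-8 character */
#define DECODE_PARTIAL ((uint32_t)-2)   /* input ended mid-sequence    */

/* Decodes one character from at most max_len bytes at p.
 * max_len < 0 means no limit; the string is then NUL-terminated. */
static uint32_t decode_validated(const unsigned char *p, long max_len)
{
    assert(p != NULL);
    if (max_len == 0)
        return DECODE_PARTIAL;

    unsigned char lead = p[0];
    int len;
    uint32_t min, cp;   /* min: smallest value that needs len bytes */

    if (lead < 0x80)                { return lead; }
    else if ((lead & 0xE0) == 0xC0) { len = 2; min = 0x80;    cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { len = 3; min = 0x800;   cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { len = 4; min = 0x10000; cp = lead & 0x07; }
    else                            { return DECODE_INVALID; }

    for (int i = 1; i < len; i++) {
        if (max_len >= 0 && i >= max_len)
            return DECODE_PARTIAL;          /* ran out of input */
        unsigned char b = p[i];
        if ((b & 0xC0) != 0x80)             /* not a continuation byte */
            return (b == 0 && max_len < 0) ? DECODE_PARTIAL : DECODE_INVALID;
        cp = (cp << 6) | (b & 0x3F);
    }
    if (cp < min)                      return DECODE_INVALID;  /* overlong */
    if (cp > 0x10FFFF)                 return DECODE_INVALID;  /* out of range */
    if (cp >= 0xD800 && cp <= 0xDFFF)  return DECODE_INVALID;  /* surrogate */
    return cp;
}
```

A caller reading from a stream can treat the partial result as "wait for more bytes" and the invalid result as a hard error, which is the practical reason for keeping the two values distinct.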

static char [] glib.Unicode.Unicode.utf8_Normalize ( char[]  str,
int  len,
GNormalizeMode  mode 
) [static]

Converts a string into canonical form, standardizing such issues as whether a character with an accent is represented as a base character and combining accent or as a single precomposed character.

You should generally call g_utf8_normalize() before comparing two Unicode strings. The normalization mode G_NORMALIZE_DEFAULT only standardizes differences that do not affect the text content, such as the above-mentioned accent representation. G_NORMALIZE_ALL also standardizes the "compatibility" characters in Unicode, such as SUPERSCRIPT THREE to the standard forms (in this case DIGIT THREE). Formatting information may be lost but for most text operations such characters should be considered the same. For example, g_utf8_collate() normalizes with G_NORMALIZE_ALL as its first step. G_NORMALIZE_DEFAULT_COMPOSE and G_NORMALIZE_ALL_COMPOSE are like G_NORMALIZE_DEFAULT and G_NORMALIZE_ALL, but return a result with composed forms rather than a maximally decomposed form. This is often useful if you intend to convert the string to a legacy encoding or pass it to a system with less capable Unicode handling. str: a UTF-8 encoded string. len: length of str, in bytes, or -1 if str is nul-terminated. mode: the type of normalization to perform. Returns: a newly allocated string, that is the normalized form of str.

static char [] glib.Unicode.Unicode.utf8_OffsetToPointer ( char[]  str,
int  offset 
) [static]

Converts from an integer character offset to a pointer to a position within the string.

Since 2.10, this function allows passing a negative offset to step backwards. It is usually worth stepping backwards from the end instead of forwards if offset is in the last fourth of the string, since moving forward is about 3 times faster than moving backward. str: a UTF-8 encoded string offset: a character offset within str Returns: the resulting pointer

static int glib.Unicode.Unicode.utf8_PointerToOffset ( char[]  str,
char[]  pos 
) [static]

Converts from a pointer to a position within a string to an integer character offset.

Since 2.10, this function allows pos to be before str, and returns a negative offset in this case. str: a UTF-8 encoded string pos: a pointer to a position within str Returns: the resulting character offset

static char [] glib.Unicode.Unicode.utf8_PrevChar ( char[]  p  )  [static]

Finds the previous UTF-8 character in the string before p.

p does not have to be at the beginning of a UTF-8 character. No check is made to see if the character found is actually valid other than it starts with an appropriate byte. If p might be the first character of the string, you must use g_utf8_find_prev_char() instead. p: a pointer to a position within a UTF-8 encoded string Returns: a pointer to the found character.

static char [] glib.Unicode.Unicode.utf8_Strchr ( char[]  p,
int  len,
gunichar  c 
) [static]

Finds the leftmost occurrence of the given Unicode character in a UTF-8 encoded string, while limiting the search to len bytes.

If len is -1, allow unbounded search. p: a nul-terminated UTF-8 encoded string len: the maximum length of p c: a Unicode character Returns: NULL if the string does not contain the character, otherwise, a pointer to the start of the leftmost occurrence of the character in the string.

static char [] glib.Unicode.Unicode.utf8_Strdown ( char[]  str,
int  len 
) [static]

Converts all Unicode characters in the string that have a case to lowercase.

The exact manner that this is done depends on the current locale, and may result in the number of characters in the string changing. str: a UTF-8 encoded string len: length of str, in bytes, or -1 if str is nul-terminated. Returns: a newly allocated string, with all characters converted to lowercase.

static int glib.Unicode.Unicode.utf8_Strlen ( char[]  p,
int  max 
) [static]

Returns the length of the string in characters.

p: pointer to the start of a UTF-8 encoded string. max: the maximum number of bytes to examine. If max is less than 0, then the string is assumed to be nul-terminated. If max is 0, p will not be examined and may be NULL. Returns: the length of the string in characters
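Counting characters rather than bytes, as utf8_Strlen does, can be sketched by counting every byte that is not a continuation byte. The plain C function below is a hypothetical illustration that assumes valid UTF-8 input; unlike a production implementation it makes no attempt to exclude a character that is cut off by the byte limit.

```c
#include <assert.h>
#include <stddef.h>

/* Counts the characters in the first max bytes of p, or up to the
 * terminating NUL if max is negative. If max is 0, p is not examined
 * and may be NULL. Assumes p holds valid UTF-8. */
static long utf8_count_chars(const char *p, long max)
{
    long count = 0;
    if (max == 0)
        return 0;
    assert(p != NULL);
    for (long i = 0; (max < 0 || i < max) && p[i] != '\0'; i++) {
        /* Every character contributes exactly one non-continuation byte. */
        if (((unsigned char)p[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the 6-byte string "héllo" (with é encoded as C3 A9) this sketch reports 5 characters, which is the difference between this function and a byte-counting strlen().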

static char [] glib.Unicode.Unicode.utf8_Strncpy ( char[]  dest,
char[]  src,
uint  n 
) [static]

Like the standard C strncpy() function, but copies a given number of characters instead of a given number of bytes.

The src string must be valid UTF-8 encoded text. (Use g_utf8_validate() on all text before trying to use UTF-8 utility functions with it.) dest: buffer to fill with characters from src src: UTF-8 encoded string n: character count Returns: dest

static char [] glib.Unicode.Unicode.utf8_Strrchr ( char[]  p,
int  len,
gunichar  c 
) [static]

Find the rightmost occurrence of the given Unicode character in a UTF-8 encoded string, while limiting the search to len bytes.

If len is -1, allow unbounded search. p: a nul-terminated UTF-8 encoded string len: the maximum length of p c: a Unicode character Returns: NULL if the string does not contain the character, otherwise, a pointer to the start of the rightmost occurrence of the character in the string.

static char [] glib.Unicode.Unicode.utf8_Strreverse ( char[]  str,
int  len 
) [static]

Reverses a UTF-8 string.

str must be valid UTF-8 encoded text. (Use g_utf8_validate() on all text before trying to use UTF-8 utility functions with it.) Note that unlike g_strreverse(), this function returns newly-allocated memory, which should be freed with g_free() when no longer needed.
str: a UTF-8 encoded string
len: the maximum length of str to use. If len < 0, then the string is nul-terminated.
Returns: a newly-allocated string which is the reverse of str.
Since 2.2
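
Reversing a UTF-8 string means reversing its characters, not its bytes. The Python sketch below illustrates the difference; the helper name utf8_strreverse is hypothetical and the code reverses code points only, so it makes no attempt to keep combining marks attached to their base characters.

```python
def utf8_strreverse(data: bytes, length: int = -1) -> bytes:
    """Reverse valid UTF-8 bytes character by character."""
    window = data if length < 0 else data[:length]
    text = window.decode("utf-8")  # raises UnicodeDecodeError on invalid input
    return text[::-1].encode("utf-8")
```

A plain byte reversal of "añb" would swap the two bytes of "ñ" and produce invalid UTF-8, whereas the character-wise reversal yields "bña".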

static char [] glib.Unicode.Unicode.utf8_Strup ( char[]  str,
int  len 
) [static]

Converts all Unicode characters in the string that have a case to uppercase.

The exact manner in which this is done depends on the current locale, and may result in the number of characters in the string increasing. (For instance, the German ess-zet will be changed to SS.)
str: a UTF-8 encoded string
len: length of str, in bytes, or -1 if str is nul-terminated.
Returns: a newly allocated string, with all characters converted to uppercase.
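
The ess-zet case can be reproduced with Python's built-in, locale-independent str.upper(), used here only as a stand-in for GLib's locale-dependent conversion. The helper name upper_utf8 is hypothetical.

```python
def upper_utf8(data: bytes) -> bytes:
    """Uppercase UTF-8 bytes with Python's Unicode case mapping (not GLib's)."""
    return data.decode("utf-8").upper().encode("utf-8")


src = "straße".encode("utf-8")
out = upper_utf8(src)
# One character (U+00DF) becomes two ("SS"), so the character count grows.
chars_before = len(src.decode("utf-8"))
chars_after = len(out.decode("utf-8"))
```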

static gunichar* glib.Unicode.Unicode.utf8_ToUcs4 ( char[]  str,
int  len,
int itemsRead,
int itemsWritten,
GError **  error 
) [static]

Convert a string from UTF-8 to a 32-bit fixed width representation as UCS-4.

A trailing 0 will be added to the string after the converted text.
str: a UTF-8 encoded string
len: the maximum length of str to use. If len < 0, then the string is nul-terminated.
items_read: location to store number of bytes read, or NULL. If NULL, then G_CONVERT_ERROR_PARTIAL_INPUT will be returned in case str contains a trailing partial character. If an error occurs then the index of the invalid input is stored here.
items_written: location to store number of characters written, or NULL. The value stored here does not include the trailing 0 character.
error: location to store the error occurring, or NULL to ignore errors. Any of the errors in GConvertError other than G_CONVERT_ERROR_NO_CONVERSION may occur.
Returns: a pointer to a newly allocated UCS-4 string. This value must be freed with g_free(). If an error occurs, NULL will be returned and error set.
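
The relationship between items_read (bytes consumed) and items_written (characters produced) can be sketched in Python as below. This is an illustration under stated assumptions, not GLib code: the helper name utf8_to_ucs4 is hypothetical, the result is a list of code points without the trailing 0, and a trailing partial character is simply left unread, as in the case where items_read is supplied.

```python
import codecs
from typing import List, Tuple


def utf8_to_ucs4(data: bytes) -> Tuple[List[int], int, int]:
    """Return (code_points, items_read, items_written) for UTF-8 input."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    # final=False keeps an incomplete trailing sequence buffered instead of
    # raising; invalid bytes elsewhere still raise UnicodeDecodeError.
    text = decoder.decode(data, final=False)
    code_points = [ord(ch) for ch in text]
    items_read = len(text.encode("utf-8"))  # bytes actually converted
    return code_points, items_read, len(code_points)
```

For the input "a€", four bytes are read and two characters are written. If the last byte of "€" is missing, only the one byte of "a" is reported as read.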

static gunichar* glib.Unicode.Unicode.utf8_ToUcs4_Fast ( char[]  str,
int  len,
int itemsWritten 
) [static]

Convert a string from UTF-8 to a 32-bit fixed width representation as UCS-4, assuming valid UTF-8 input.

This function is roughly twice as fast as g_utf8_to_ucs4() but does no error checking on the input.
str: a UTF-8 encoded string
len: the maximum length of str to use. If len < 0, then the string is nul-terminated.
items_written: location to store the number of characters in the result, or NULL.
Returns: a pointer to a newly allocated UCS-4 string. This value must be freed with g_free().

static gunichar2* glib.Unicode.Unicode.utf8_ToUtf16 ( char[]  str,
int  len,
int itemsRead,
int itemsWritten,
GError **  error 
) [static]

Convert a string from UTF-8 to UTF-16.

A 0 character will be added to the result after the converted text.
str: a UTF-8 encoded string
len: the maximum length (number of bytes) of str to use. If len < 0, then the string is nul-terminated.
items_read: location to store number of bytes read, or NULL. If NULL, then G_CONVERT_ERROR_PARTIAL_INPUT will be returned in case str contains a trailing partial character. If an error occurs then the index of the invalid input is stored here.
items_written: location to store number of gunichar2 written, or NULL. The value stored here does not include the trailing 0.
error: location to store the error occurring, or NULL to ignore errors. Any of the errors in GConvertError other than G_CONVERT_ERROR_NO_CONVERSION may occur.
Returns: a pointer to a newly allocated UTF-16 string. This value must be freed with g_free(). If an error occurs, NULL will be returned and error set.
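
Because items_written counts gunichar2 units rather than characters, a character outside the Basic Multilingual Plane contributes two units (a surrogate pair). The Python sketch below illustrates this; the helper name utf8_to_utf16_units is hypothetical and the result omits the trailing 0.

```python
import struct
from typing import List


def utf8_to_utf16_units(data: bytes) -> List[int]:
    """Convert valid UTF-8 bytes to a list of 16-bit UTF-16 code units."""
    encoded = data.decode("utf-8").encode("utf-16-le")
    count = len(encoded) // 2  # each code unit is two bytes
    return list(struct.unpack("<%dH" % count, encoded))


units = utf8_to_utf16_units("a\U0001F600".encode("utf-8"))
```

Here the input holds two characters, but units holds three values: 0x61 followed by the surrogate pair 0xD83D, 0xDE00.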

static int glib.Unicode.Unicode.utf8_Validate ( char[]  str,
int  maxLen,
char **  end 
) [static]

Validates UTF-8 encoded text.

str is the text to validate; if str is nul-terminated, then max_len can be -1, otherwise max_len should be the number of bytes to validate. If end is non-NULL, then the end of the valid range will be stored there (i.e. the start of the first invalid character if some bytes were invalid, or the end of the text being validated otherwise).
Note that g_utf8_validate() returns FALSE if max_len is positive and NUL is met before max_len bytes have been read.
Returns TRUE if all of str was valid. Many GLib and GTK+ routines require valid UTF-8 as input; so data read from a file or the network should be checked with g_utf8_validate() before doing anything else with it.
str: a pointer to character data
max_len: max bytes to validate, or -1 to go until NUL
end: return location for end of valid data
Returns: TRUE if the text was valid UTF-8
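
The rules above, including the end position and the treatment of an embedded NUL, can be sketched in Python as follows. The helper name utf8_validate is hypothetical, the end position is returned as a byte offset instead of a pointer, and Python's strict UTF-8 decoder stands in for GLib's checks, so edge cases may differ.

```python
from typing import Tuple


def utf8_validate(data: bytes, max_len: int = -1) -> Tuple[bool, int]:
    """Return (is_valid, end), where end is the byte offset ending the valid range."""
    window = data if max_len < 0 else data[:max_len]
    nul = window.find(b"\x00")
    if nul >= 0:
        window = window[:nul]
    # With an explicit length, meeting NUL before max_len bytes counts as invalid.
    hit_nul_early = max_len >= 0 and nul >= 0
    try:
        window.decode("utf-8")
    except UnicodeDecodeError as err:
        return False, err.start  # start of the first invalid sequence
    return (not hit_nul_early), len(window)
```

A caller would typically check the boolean first and use the offset only to report or trim the invalid tail.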

