Built-in rules
Besides ANY, matching any single Unicode character, pest provides several
rules to make parsing text more convenient.
ASCII rules
Among the printable ASCII characters, it is often useful to match alphabetic
characters and numbers. For numbers, pest provides digits in common
radixes (bases):
| Built-in rule | Equivalent |
|---|---|
| `ASCII_DIGIT` | `'0'..'9'` |
| `ASCII_NONZERO_DIGIT` | `'1'..'9'` |
| `ASCII_BIN_DIGIT` | `'0'..'1'` |
| `ASCII_OCT_DIGIT` | `'0'..'7'` |
| `ASCII_HEX_DIGIT` | `'0'..'9' \| 'a'..'f' \| 'A'..'F'` |
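As a rough sketch of how these digit rules combine, the grammar fragment below recognizes integer literals in several radixes. The rule names (`decimal`, `hex`, `octal`, `binary`, `number`) and the `0x`, `0o`, and `0b` prefixes are assumptions made for illustration; only the `ASCII_*` rules are built in.

```pest
// Hypothetical rules; only the ASCII_* rules are provided by pest.

// A decimal integer with no leading zeros: either "0" alone,
// or a nonzero digit followed by any number of digits.
decimal = @{ "0" | (ASCII_NONZERO_DIGIT ~ ASCII_DIGIT*) }

// Prefixed literals in other radixes (the prefixes are an assumption).
hex    = @{ "0x" ~ ASCII_HEX_DIGIT+ }
octal  = @{ "0o" ~ ASCII_OCT_DIGIT+ }
binary = @{ "0b" ~ ASCII_BIN_DIGIT+ }

// Ordered choice: try the prefixed forms before plain decimal.
number = { hex | octal | binary | decimal }
```

Because choices are ordered, the prefixed forms are tried before `decimal`; otherwise `decimal` would match the leading `0` of `0x1F` and stop there.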
For alphabetic characters, distinguishing between uppercase and lowercase:
| Built-in rule | Equivalent |
|---|---|
| `ASCII_ALPHA_LOWER` | `'a'..'z'` |
| `ASCII_ALPHA_UPPER` | `'A'..'Z'` |
| `ASCII_ALPHA` | `'a'..'z' \| 'A'..'Z'` |
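One place where the case distinction is handy is a grammar that tells names apart by their casing. The fragment below is a minimal sketch under an assumed naming convention (uppercase constants, lowercase variables); the rule names `constant`, `variable`, and `name` are hypothetical.

```pest
// Hypothetical convention: constants look like MAX_SIZE, variables like max_size.
constant = @{ ASCII_ALPHA_UPPER ~ (ASCII_ALPHA_UPPER | "_")* }
variable = @{ ASCII_ALPHA_LOWER ~ (ASCII_ALPHA_LOWER | "_")* }

// Either kind of name; the first character decides which branch matches.
name = { constant | variable }
```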
And for miscellaneous use:
| Built-in rule | Meaning | Equivalent |
|---|---|---|
| `ASCII` | any ASCII character | `'\u{00}'..'\u{7F}'` |
| `ASCII_ALPHANUMERIC` | any digit or letter | `ASCII_DIGIT \| ASCII_ALPHA` |
| `NEWLINE` | any line ending format | `"\n" \| "\r\n" \| "\r"` |
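To illustrate `NEWLINE` and `ASCII_ALPHANUMERIC` together, here is a small sketch of a line-oriented `key=value` format. The format itself and the rule names (`key`, `value`, `line`, `file`) are assumptions for illustration. No `WHITESPACE` rule is defined, so nothing is skipped implicitly between tokens.

```pest
// Hypothetical line-based format: one "key=value" pair per line.

// A key starts with a letter, followed by letters, digits, or underscores.
key = @{ ASCII_ALPHA ~ (ASCII_ALPHANUMERIC | "_")* }

// A value is everything up to (but not including) the next line ending.
value = @{ (!NEWLINE ~ ANY)* }

line = { key ~ "=" ~ value }

// Lines are separated by NEWLINE, so "\n", "\r\n", and "\r" all work.
// Blank lines are allowed, and the final line may omit its line ending.
file = { SOI ~ (line? ~ NEWLINE)* ~ line? ~ EOI }
```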
Unicode rules
To make it easier to correctly parse arbitrary Unicode text, pest includes a
large number of rules corresponding to Unicode character properties. These
rules are divided into general category and binary property rules.
Unicode characters are partitioned into categories based on their general purpose. Every character belongs to a single category, in the same way that every ASCII character is a control character, a digit, a letter, a symbol, or a space.
In addition, every Unicode character has a list of binary properties (true or false) that it does or does not satisfy. Characters can belong to any number of these properties, depending on their meaning.
For example, the character "A", "Latin capital letter A", is in the general category "Uppercase Letter" because its general purpose is being a letter. It has the binary property "Uppercase" but not "Emoji". By contrast, the character "🅰", "negative squared Latin capital letter A", is in the general category "Other Symbol" because it does not generally occur as a letter in text. It has both the binary properties "Uppercase" and "Emoji".
For more details, consult Chapter 4 of The Unicode Standard.
General categories
Formally, categories are non-overlapping: each Unicode character belongs to
exactly one category, and no category contains another. However, since certain
groups of categories are often useful together, pest exposes the hierarchy of
categories below. For example, the rule CASED_LETTER is not technically a
Unicode general category; it instead matches characters that are
UPPERCASE_LETTER or LOWERCASE_LETTER, which are general categories.
- `LETTER`
  - `CASED_LETTER`
    - `UPPERCASE_LETTER`
    - `LOWERCASE_LETTER`
  - `TITLECASE_LETTER`
  - `MODIFIER_LETTER`
  - `OTHER_LETTER`
- `MARK`
  - `NONSPACING_MARK`
  - `SPACING_MARK`
  - `ENCLOSING_MARK`
- `NUMBER`
  - `DECIMAL_NUMBER`
  - `LETTER_NUMBER`
  - `OTHER_NUMBER`
- `PUNCTUATION`
  - `CONNECTOR_PUNCTUATION`
  - `DASH_PUNCTUATION`
  - `OPEN_PUNCTUATION`
  - `CLOSE_PUNCTUATION`
  - `INITIAL_PUNCTUATION`
  - `FINAL_PUNCTUATION`
  - `OTHER_PUNCTUATION`
- `SYMBOL`
  - `MATH_SYMBOL`
  - `CURRENCY_SYMBOL`
  - `MODIFIER_SYMBOL`
  - `OTHER_SYMBOL`
- `SEPARATOR`
  - `SPACE_SEPARATOR`
  - `LINE_SEPARATOR`
  - `PARAGRAPH_SEPARATOR`
- `OTHER`
  - `CONTROL`
  - `FORMAT`
  - `SURROGATE`
  - `PRIVATE_USE`
  - `UNASSIGNED`
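The grouped rules at the top of this hierarchy can stand in for long lists of alternatives. As a minimal sketch, the fragment below splits text into words and numbers in any script; the rule names `word`, `number`, `gap`, and `text` are hypothetical, while the uppercase rules are the built-in category rules listed above.

```pest
// Hypothetical rules built on the general category rules.

// A word is a letter followed by more letters or combining marks.
word = @{ LETTER ~ (LETTER | MARK)* }

// DECIMAL_NUMBER covers decimal digits from every script,
// unlike ASCII_DIGIT, which only covers '0'..'9'.
number = @{ DECIMAL_NUMBER+ }

// Spaces and punctuation between tokens, matched silently.
gap = _{ (SPACE_SEPARATOR | PUNCTUATION)+ }

text = { SOI ~ gap? ~ ((word | number) ~ gap?)* ~ EOI }
```

This sketch rejects input containing anything outside these categories, such as symbols or control characters, so a real grammar would likely need further alternatives.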
Binary properties
Many of these properties are used to define Unicode text algorithms, such as the bidirectional algorithm and the text segmentation algorithm. Such properties are not likely to be useful for most parsers.
However, the properties XID_START and XID_CONTINUE are particularly notable
because they are defined "to assist in the standard treatment of identifiers",
"such as programming language variables". See Technical Report 31 for more
details.
- `ALPHABETIC`
- `BIDI_CONTROL`
- `BIDI_MIRRORED`
- `CASE_IGNORABLE`
- `CASED`
- `CHANGES_WHEN_CASEFOLDED`
- `CHANGES_WHEN_CASEMAPPED`
- `CHANGES_WHEN_LOWERCASED`
- `CHANGES_WHEN_TITLECASED`
- `CHANGES_WHEN_UPPERCASED`
- `DASH`
- `DEFAULT_IGNORABLE_CODE_POINT`
- `DEPRECATED`
- `DIACRITIC`
- `EMOJI`
- `EMOJI_COMPONENT`
- `EMOJI_MODIFIER`
- `EMOJI_MODIFIER_BASE`
- `EMOJI_PRESENTATION`
- `EXTENDED_PICTOGRAPHIC`
- `EXTENDER`
- `GRAPHEME_BASE`
- `GRAPHEME_EXTEND`
- `GRAPHEME_LINK`
- `HEX_DIGIT`
- `HYPHEN`
- `IDS_BINARY_OPERATOR`
- `IDS_TRINARY_OPERATOR`
- `ID_CONTINUE`
- `ID_START`
- `IDEOGRAPHIC`
- `JOIN_CONTROL`
- `LOGICAL_ORDER_EXCEPTION`
- `LOWERCASE`
- `MATH`
- `NONCHARACTER_CODE_POINT`
- `OTHER_ALPHABETIC`
- `OTHER_DEFAULT_IGNORABLE_CODE_POINT`
- `OTHER_GRAPHEME_EXTEND`
- `OTHER_ID_CONTINUE`
- `OTHER_ID_START`
- `OTHER_LOWERCASE`
- `OTHER_MATH`
- `OTHER_UPPERCASE`
- `PATTERN_SYNTAX`
- `PATTERN_WHITE_SPACE`
- `PREPENDED_CONCATENATION_MARK`
- `QUOTATION_MARK`
- `RADICAL`
- `REGIONAL_INDICATOR`
- `SENTENCE_TERMINAL`
- `SOFT_DOTTED`
- `TERMINAL_PUNCTUATION`
- `UNIFIED_IDEOGRAPH`
- `UPPERCASE`
- `VARIATION_SELECTOR`
- `WHITE_SPACE`
- `XID_CONTINUE`
- `XID_START`
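The usual way to apply `XID_START` and `XID_CONTINUE` is one start character followed by any number of continue characters. The fragment below is a sketch of that pattern; the rule names `identifier` and `identifier_with_underscore` are hypothetical, and the second rule reflects a convention some languages follow rather than anything pest prescribes.

```pest
// Hypothetical rules for Unicode-aware identifiers.

// One XID_START character, then zero or more XID_CONTINUE characters.
identifier = @{ XID_START ~ XID_CONTINUE* }

// "_" is an XID_CONTINUE character but not an XID_START character,
// so a leading underscore has to be allowed explicitly.
identifier_with_underscore = @{ (XID_START | "_") ~ XID_CONTINUE* }
```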
Script properties
pest also includes built-in rules for the Unicode Script property. Each rule matches characters that belong to a particular writing system.
For example, suppose we want to match a string that contains any CJK characters (regexp: \p{CJK}), such as 你好世界, こんにちは世界, or 안녕하세요 세계. The relevant script rules are:
- `HAN`: Chinese characters, including Simplified Chinese, Traditional Chinese, Japanese kanji, and Korean hanja.
- `HIRAGANA`: the Japanese hiragana syllabary.
- `KATAKANA`: the Japanese katakana syllabary.
- `HANGUL`: Korean alphabetical characters.
- `BOPOMOFO`: Chinese phonetic symbols.
So we define a rule named CJK like this:
CJK = { HAN | HIRAGANA | KATAKANA | HANGUL | BOPOMOFO }
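As a sketch of how the `CJK` rule might be used, the fragment below repeats its definition and adds two hypothetical rules: `cjk_run` matches a contiguous run of CJK characters, and `contains_cjk` accepts any input with at least one CJK character somewhere in it.

```pest
// Same definition as above, repeated so the fragment stands alone.
CJK = { HAN | HIRAGANA | KATAKANA | HANGUL | BOPOMOFO }

// Hypothetical: one or more consecutive CJK characters.
cjk_run = @{ CJK+ }

// Hypothetical: skip non-CJK characters until the first CJK character,
// then accept whatever follows up to the end of input.
contains_cjk = { SOI ~ (!CJK ~ ANY)* ~ CJK ~ ANY* ~ EOI }
```

The ASCII space belongs to the `COMMON` script rather than any of the five scripts above, so `cjk_run` stops at the space in a string such as 안녕하세요 세계.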
All available rules:
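- `ADLAM`
- `AHOM`
- `ANATOLIAN_HIEROGLYPHS`
- `ARABIC`
- `ARMENIAN`
- `AVESTAN`
- `BALINESE`
- `BAMUM`
- `BASSA_VAH`
- `BATAK`
- `BENGALI`
- `BHAIKSUKI`
- `BOPOMOFO`
- `BRAHMI`
- `BRAILLE`
- `BUGINESE`
- `BUHID`
- `CANADIAN_ABORIGINAL`
- `CARIAN`
- `CAUCASIAN_ALBANIAN`
- `CHAKMA`
- `CHAM`
- `CHEROKEE`
- `CHORASMIAN`
- `COMMON`
- `COPTIC`
- `CUNEIFORM`
- `CYPRIOT`
- `CYPRO_MINOAN`
- `CYRILLIC`
- `DESERET`
- `DEVANAGARI`
- `DIVES_AKURU`
- `DOGRA`
- `DUPLOYAN`
- `EGYPTIAN_HIEROGLYPHS`
- `ELBASAN`
- `ELYMAIC`
- `ETHIOPIC`
- `GEORGIAN`
- `GLAGOLITIC`
- `GOTHIC`
- `GRANTHA`
- `GREEK`
- `GUJARATI`
- `GUNJALA_GONDI`
- `GURMUKHI`
- `HAN`
- `HANGUL`
- `HANIFI_ROHINGYA`
- `HANUNOO`
- `HATRAN`
- `HEBREW`
- `HIRAGANA`
- `IMPERIAL_ARAMAIC`
- `INHERITED`
- `INSCRIPTIONAL_PAHLAVI`
- `INSCRIPTIONAL_PARTHIAN`
- `JAVANESE`
- `KAITHI`
- `KANNADA`
- `KATAKANA`
- `KAWI`
- `KAYAH_LI`
- `KHAROSHTHI`
- `KHITAN_SMALL_SCRIPT`
- `KHMER`
- `KHOJKI`
- `KHUDAWADI`
- `LAO`
- `LATIN`
- `LEPCHA`
- `LIMBU`
- `LINEAR_A`
- `LINEAR_B`
- `LISU`
- `LYCIAN`
- `LYDIAN`
- `MAHAJANI`
- `MAKASAR`
- `MALAYALAM`
- `MANDAIC`
- `MANICHAEAN`
- `MARCHEN`
- `MASARAM_GONDI`
- `MEDEFAIDRIN`
- `MEETEI_MAYEK`
- `MENDE_KIKAKUI`
- `MEROITIC_CURSIVE`
- `MEROITIC_HIEROGLYPHS`
- `MIAO`
- `MODI`
- `MONGOLIAN`
- `MRO`
- `MULTANI`
- `MYANMAR`
- `NABATAEAN`
- `NAG_MUNDARI`
- `NANDINAGARI`
- `NEW_TAI_LUE`
- `NEWA`
- `NKO`
- `NUSHU`
- `NYIAKENG_PUACHUE_HMONG`
- `OGHAM`
- `OL_CHIKI`
- `OLD_HUNGARIAN`
- `OLD_ITALIC`
- `OLD_NORTH_ARABIAN`
- `OLD_PERMIC`
- `OLD_PERSIAN`
- `OLD_SOGDIAN`
- `OLD_SOUTH_ARABIAN`
- `OLD_TURKIC`
- `OLD_UYGHUR`
- `ORIYA`
- `OSAGE`
- `OSMANYA`
- `PAHAWH_HMONG`
- `PALMYRENE`
- `PAU_CIN_HAU`
- `PHAGS_PA`
- `PHOENICIAN`
- `PSALTER_PAHLAVI`
- `REJANG`
- `RUNIC`
- `SAMARITAN`
- `SAURASHTRA`
- `SHARADA`
- `SHAVIAN`
- `SIDDHAM`
- `SIGNWRITING`
- `SINHALA`
- `SOGDIAN`
- `SORA_SOMPENG`
- `SOYOMBO`
- `SUNDANESE`
- `SYLOTI_NAGRI`
- `SYRIAC`
- `TAGALOG`
- `TAGBANWA`
- `TAI_LE`
- `TAI_THAM`
- `TAI_VIET`
- `TAKRI`
- `TAMIL`
- `TANGSA`
- `TANGUT`
- `TELUGU`
- `THAANA`
- `THAI`
- `TIBETAN`
- `TIFINAGH`
- `TIRHUTA`
- `TOTO`
- `UGARITIC`
- `VAI`
- `VITHKUQI`
- `WANCHO`
- `WARANG_CITI`
- `YEZIDI`
- `YI`
- `ZANABAZAR_SQUARE`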