"))
- ``original_text_for`` - restores any internal whitespace or suppressed
text within the tokens for a matched parse
expression. This is especially useful when defining expressions
for ``scan_string`` or ``transform_string`` applications.
- ``with_attribute(*args, **kwargs)`` - helper to create a validating parse action to be used with start tags created
with ``make_xml_tags`` or ``make_html_tags``. Use ``with_attribute`` to qualify a starting tag
with a required attribute value, to avoid false matches on common tags such as
``<TD>`` or ``<DIV>``.
``with_attribute`` can be called with:
- keyword arguments, as in ``(class="Customer", align="right")``, or
- a list of name-value tuples, as in ``(("ns1:class", "Customer"), ("ns2:align", "right"))``
An attribute can be specified to have the special value
``with_attribute.ANY_VALUE``, which will match any value - use this to
ensure that an attribute is present but any attribute value is
acceptable.
- ``match_only_at_col(column_number)`` - a parse action that verifies that
an expression was matched at a particular column, raising a
``ParseException`` if matching at a different column number; useful when parsing
tabular data
- ``common.convert_to_integer()`` - converts all matched tokens to int
- ``common.convert_to_float()`` - converts all matched tokens to float
- ``common.convert_to_date()`` - converts matched token to a datetime.date
- ``common.convert_to_datetime()`` - converts matched token to a datetime.datetime
- ``common.strip_html_tags()`` - removes HTML tags from matched token
- ``common.downcase_tokens()`` - converts all matched tokens to lowercase
- ``common.upcase_tokens()`` - converts all matched tokens to uppercase
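
As a rough illustration of a few of the helpers above, the following sketch qualifies a ``<div>`` start tag with ``with_attribute`` and applies two of the ``common`` conversion actions. The sample HTML and all variable names are invented for this example. Note that ``convert_to_integer`` and ``upcase_tokens`` are passed here directly as parse actions, without calling them.

```python
import pyparsing as pp

ppc = pp.common

# Sample input invented for this sketch.
html = """
<div type="grid">1 4 0 1 0</div>
<div type="graph">1,3 2,3 1,1</div>
<div>this has no type</div>
"""

div, div_end = pp.make_html_tags("div")

# Only accept <div> tags whose "type" attribute is exactly "grid".
div_grid = div().set_parse_action(pp.with_attribute(type="grid"))
grid_expr = div_grid + pp.SkipTo(div | div_end)("body")
grid_bodies = [match.body for match in grid_expr.search_string(html)]

# Accept any <div> that has a "type" attribute, whatever its value.
div_any_type = div().set_parse_action(
    pp.with_attribute(type=pp.with_attribute.ANY_VALUE)
)
any_expr = div_any_type + pp.SkipTo(div | div_end)("body")
typed_bodies = [match.body for match in any_expr.search_string(html)]

# Conversion parse actions from pp.common.
int_expr = pp.Word(pp.nums).set_parse_action(ppc.convert_to_integer)
upper_word = pp.Word(pp.alphas).set_parse_action(ppc.upcase_tokens)
converted = (int_expr + upper_word).parse_string("42 apples").as_list()

print(grid_bodies)
print(typed_bodies)
print(converted)
```

The ``<div>`` without a ``type`` attribute is rejected in both cases, because the validating parse action raises a ``ParseException`` when the attribute is missing.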
Common string and token constants
---------------------------------
- ``alphas`` - same as ``string.ascii_letters``
- ``nums`` - same as ``string.digits``
- ``alphanums`` - a string containing ``alphas + nums``
- ``alphas8bit`` - a string containing alphabetic 8-bit characters::
ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþ
.. _identchars:
- ``identchars`` - a string containing characters that are valid as initial identifier characters::
ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyzª
µºÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ
- ``identbodychars`` - a string containing characters that are valid as identifier body characters (those following a
valid leading identifier character as given in identchars_)::
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyzª
µ·ºÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ
- ``printables`` - same as ``string.printable``, minus all whitespace characters (space, tab, newline, etc.)
- ``empty`` - a global ``Empty()``; will always match
- ``sgl_quoted_string`` - a string of characters enclosed in 's; may
include whitespace, but not newlines
- ``dbl_quoted_string`` - a string of characters enclosed in "s; may
include whitespace, but not newlines
- ``quoted_string`` - ``sgl_quoted_string | dbl_quoted_string``
- ``python_quoted_string`` - ``quoted_string | multiline quoted string``
- ``c_style_comment`` - a comment block delimited by ``'/*'`` and ``'*/'`` sequences; can span
multiple lines, but does not support nesting of comments
- ``html_comment`` - a comment block delimited by ``'<!--'`` and ``'-->'`` sequences; can span
multiple lines, but does not support nesting of comments
- ``comma_separated_list`` - similar to DelimitedList_, except that the
list expressions can be any text value, or a quoted string; quoted strings can
safely include commas without incorrectly breaking the string into two tokens
- ``rest_of_line`` - all remaining printable characters up to but not including the next
newline
- ``common.integer`` - an integer with no leading sign; parsed token is converted to int
- ``common.hex_integer`` - a hexadecimal integer; parsed token is converted to int
- ``common.signed_integer`` - an integer with optional leading sign; parsed token is converted to int
- ``common.fraction`` - signed_integer '/' signed_integer; parsed tokens are converted to float
- ``common.mixed_integer`` - signed_integer '-' fraction; parsed tokens are converted to float
- ``common.real`` - real number; parsed tokens are converted to float
- ``common.sci_real`` - real number with optional scientific notation; parsed tokens are converted to float
- ``common.number`` - any numeric expression; parsed tokens are returned as converted by the matched expression
- ``common.fnumber`` - any numeric expression; parsed tokens are converted to float
- ``common.ieee_float`` - any floating-point literal (int, real number, infinity, or NaN), returned as float
- ``common.identifier`` - a programming identifier (follows Python's syntax convention of leading alpha or "_",
followed by 0 or more alpha, num, or "_")
- ``common.ipv4_address`` - IPv4 address
- ``common.ipv6_address`` - IPv6 address
- ``common.mac_address`` - MAC address (with ":", "-", or "." delimiters)
- ``common.iso8601_date`` - date in ``YYYY-MM-DD`` format
- ``common.iso8601_datetime`` - datetime in ``YYYY-MM-DDThh:mm:ss.s(Z|+-00:00)`` format; trailing seconds,
milliseconds, and timezone optional; accepts separating ``'T'`` or ``' '``
- ``common.url`` - matches URL strings and returns a ParseResults with named fields like those returned
by ``urllib.parse.urlparse()``
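
The conversions described above can be seen in a short sketch like the following. The input strings and variable names are chosen only for illustration. ``iso8601_date`` is combined with ``convert_to_date()`` on a copy of the expression, so the shared expression in ``common`` is left unmodified.

```python
import datetime

import pyparsing as pp

ppc = pp.common

# Numeric expressions convert their tokens while parsing.
count = ppc.integer.parse_string("42")[0]        # int
mask = ppc.hex_integer.parse_string("ff")[0]     # int, parsed as base 16
ratio = ppc.fraction.parse_string("1/2")[0]      # float
as_number = ppc.number.parse_string("3.5")[0]    # float, because "3.5" is a real
as_float = ppc.fnumber.parse_string("7")[0]      # always float

# Quoted strings keep their enclosing quotes and may contain whitespace.
quoted = pp.quoted_string.parse_string('"hello, world" and more')[0]

# Attach convert_to_date() to a copy to get a datetime.date instead of a str.
date_expr = ppc.iso8601_date.copy().set_parse_action(ppc.convert_to_date())
day = date_expr.parse_string("1999-12-31")[0]

print(count, mask, ratio, as_number, as_float)
print(quoted)
print(day)
```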
Unicode character sets for international parsing
------------------------------------------------
Pyparsing includes the ``unicode`` namespace that contains definitions for ``alphas``, ``nums``, ``alphanums``,
``identchars``, ``identbodychars``, and ``printables`` for character ranges besides 7- or 8-bit ASCII. You can
access them using code like the following::

    import pyparsing as pp
    ppu = pp.unicode

    greek_word = pp.Word(ppu.Greek.alphas)
    greek_word[...].parse_string("Καλημέρα κόσμε")

The following language ranges are defined.

==========================   =================   ========================================================
Unicode set                  Alternate names     Description
--------------------------   -----------------   --------------------------------------------------------
``Arabic``                   العربية
``Chinese``                  中文
``CJK``                                          Union of Chinese, Japanese, and Korean sets
``Cyrillic``                 кириллица
``Devanagari``               देवनागरी
``Greek``                    Ελληνικά
``Hangul``                   Korean, 한국어
``Hebrew``                   עִברִית
``Japanese``                 日本語              Union of Kanji, Katakana, and Hiragana sets
``Japanese.Hiragana``        ひらがな
``Japanese.Kanji``           漢字
``Japanese.Katakana``        カタカナ
``Latin1``                                       All Unicode characters up to code point 0xff (255)
``LatinA``                                       Unicode characters for code points 0x100-0x17f (256-383)
``LatinB``                                       Unicode characters for code points 0x180-0x24f (384-591)
``Thai``                     ไทย
``BasicMultilingualPlane``   BMP                 All Unicode characters up to code point 0xffff (65535)
==========================   =================   ========================================================

The base ``unicode`` class also includes definitions based on all Unicode code points up to ``sys.maxunicode``. This
set will include emojis, wingdings, and many other specialized and typographical variant characters.
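
As a small sketch of how these sets can be used, the following example parses words from a single language set, from a nested set, and from a combination of sets. The sample strings and variable names are only illustrative.

```python
import pyparsing as pp

ppu = pp.unicode

# Words built from a single language's alphabetic characters.
greek_word = pp.Word(ppu.Greek.alphas)
greek_words = greek_word[...].parse_string("Καλημέρα κόσμε").as_list()

# Nested sets such as Japanese.Hiragana are addressed the same way.
hiragana_word = pp.Word(ppu.Japanese.Hiragana.alphas)
hiragana = hiragana_word.parse_string("ひらがな")[0]

# The character sets are strings, so they can be concatenated.
mixed_word = pp.Word(pp.alphas + ppu.Cyrillic.alphas)
mixed = mixed_word[...].parse_string("hello мир").as_list()

print(greek_words)
print(hiragana)
print(mixed)
```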
Generating Railroad Diagrams
============================
Grammars are conventionally represented in what are called "railroad diagrams", which allow you to visually follow
the sequence of tokens in a grammar along lines which are a bit like train tracks. You might want to generate a
railroad diagram for your grammar in order to better understand it yourself, or maybe to communicate it to others.
Usage
-----
To generate a railroad diagram in pyparsing, you first have to install pyparsing with the ``diagrams`` extra.
To do this, just run ``pip install pyparsing[diagrams]``, and make sure you add ``pyparsing[diagrams]`` to any
``setup.py`` or ``requirements.txt`` that specifies pyparsing as a dependency.
Create your parser as you normally would. Then call ``create_diagram()``, passing the name of an output HTML file::

    from pyparsing import Word, alphas, nums

    street_address = Word(nums).set_name("house_number") + Word(alphas)[1, ...].set_name("street_name")
    street_address.set_name("street_address")
    street_address.create_diagram("street_address_diagram.html")

This will result in the railroad diagram being written to ``street_address_diagram.html``.
``create_diagram`` takes the following arguments:
- ``output_html`` (str or file-like object) - output target for generated diagram HTML
- ``vertical`` (int) - threshold for formatting multiple alternatives vertically instead of horizontally (default=3)
- ``show_results_names`` - bool flag whether diagram should show annotations for defined results names
- ``show_groups`` - bool flag whether groups should be highlighted with an unlabeled surrounding box
- ``show_hidden`` - bool flag whether internal pyparsing elements that are normally omitted in diagrams should be shown (default=False)
- ``embed`` - bool flag whether generated HTML should omit ``<HEAD>``, ``<BODY>``, and ``<DOCTYPE>`` tags to embed
the resulting HTML in an enclosing HTML source (such as PyScript HTML)
- ``head`` - str containing additional HTML to insert into the ``<HEAD>`` section of the generated code;
can be used to insert custom CSS styling
- ``body`` - str containing additional HTML to insert at the beginning of the ``<BODY>`` section of the
generated code
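
The following is a minimal sketch of passing some of these arguments, writing the diagram to an in-memory file-like object instead of a named file. The grammar and variable names are made up for this example, and the ``find_spec`` guard is only an assumption about how you might skip diagram generation when the ``diagrams`` extra is not installed.

```python
import importlib.util
import io

import pyparsing as pp

house_number = pp.Word(pp.nums).set_name("house_number")
street_name = pp.Word(pp.alphas)[1, ...].set_name("street_name")
street_address = (house_number("number") + street_name("street")).set_name(
    "street_address"
)

# create_diagram() needs the optional "diagrams" extra
# (the railroad-diagrams and jinja2 packages).
diagrams_available = all(
    importlib.util.find_spec(module) is not None
    for module in ("railroad", "jinja2")
)

diagram_html = None
if diagrams_available:
    buffer = io.StringIO()  # file-like output target
    street_address.create_diagram(buffer, vertical=2, show_results_names=True)
    diagram_html = buffer.getvalue()
    print(f"generated {len(diagram_html)} characters of HTML")
else:
    print("install pyparsing[diagrams] to generate the diagram")
```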
Example
-------
You can view an example railroad diagram generated from `a pyparsing grammar for
SQL SELECT statements <_static/sql_railroad.html>`_ (generated from
``examples/select_parser.py``).
Naming tip
----------
Parser elements that are separately named will be broken out as their own sub-diagrams. As a short-cut alternative
to going through and adding ``.set_name()`` calls on all your sub-expressions, you can use ``autoname_elements()`` after
defining your complete grammar. For example::

    a = pp.Literal("a")
    b = pp.Literal("b").set_name("bbb")
    pp.autoname_elements()

``a`` will get named "a", while ``b`` will keep its name "bbb".
Customization
-------------
You can customize the resulting diagram in a few ways.
To do so, run ``pyparsing.diagram.to_railroad`` to convert your grammar into a form understood by the
``railroad-diagrams`` module, and then ``pyparsing.diagram.railroad_to_html`` to convert that into an
HTML document. For example::

    from pyparsing.diagram import to_railroad, railroad_to_html

    # my_grammar is a pyparsing expression you have already defined
    with open('output.html', 'w') as fp:
        railroad = to_railroad(my_grammar)
        fp.write(railroad_to_html(railroad))

This will result in the railroad diagram being written to ``output.html``.
You can then pass in additional keyword arguments to ``pyparsing.diagram.to_railroad``, which will be passed
into the ``Diagram()`` constructor of the underlying library, as explained in that library's documentation.
In addition, you can edit global options in the underlying library, by editing constants::

    from pyparsing.diagram import to_railroad, railroad_to_html
    import railroad

    railroad.DIAGRAM_CLASS = "my-custom-class"
    my_railroad = to_railroad(my_grammar)

These options are documented in the ``railroad-diagrams`` documentation.
Finally, you can edit the HTML produced by ``pyparsing.diagram.railroad_to_html`` by passing in certain keyword
arguments that will be used in the HTML template. Currently, these are:
- ``head``: A string containing HTML to use in the ``<head>`` tag. This might be a stylesheet or other metadata
- ``body``: A string containing HTML to use in the ``<body>`` tag, above the actual diagram. This might consist of a
heading, description, or JavaScript.
If you want to provide a custom stylesheet using the ``head`` keyword, you can make use of the following CSS classes:
- ``railroad-group``: A group containing everything relating to a given element group (i.e., something with a heading)
- ``railroad-heading``: The title for each group
- ``railroad-svg``: A div containing only the diagram SVG for each group
- ``railroad-description``: A div containing the group description (unused)
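
As a sketch of how the ``head`` and ``body`` keywords and the CSS classes above might be combined, the following example renders a small grammar with a custom stylesheet. The grammar, the CSS rules, and the heading text are hypothetical, and the ``find_spec`` guard simply skips rendering when the ``diagrams`` extra is not installed.

```python
import importlib.util

import pyparsing as pp

# Hypothetical stylesheet; the class names are the ones listed above.
custom_css = """
<style>
    .railroad-heading { font-family: sans-serif; color: #224466; }
    .railroad-svg { padding: 0.5em; }
</style>
"""

greeting = (
    pp.Word(pp.alphas)("salutation") + "," + pp.Word(pp.alphas)("name")
).set_name("greeting")

html = None
if all(importlib.util.find_spec(m) is not None for m in ("railroad", "jinja2")):
    from pyparsing.diagram import to_railroad, railroad_to_html

    # head is placed in the <head> section, body ahead of the diagrams
    html = railroad_to_html(
        to_railroad(greeting),
        head=custom_css,
        body="<h1>Greeting grammar</h1>",
    )
    print(f"generated {len(html)} characters of HTML")
else:
    print("install pyparsing[diagrams] to render the diagram")
```

The resulting string can be written to a file in the same way as in the ``output.html`` example above.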