Welcome to retrie’s documentation!
retrie
retrie offers fast methods to match and replace (sequences of) strings based on efficient Trie-based regex unions.
Trie
Instead of matching against a simple regex union, which becomes slow for large sets of words, a more efficient regex pattern can be compiled using a Trie structure:
from retrie.trie import Trie
trie = Trie()
trie.add("abc", "foo", "abs")
assert trie.pattern() == "(?:ab[cs]|foo)" # equivalent to but faster than "(?:abc|abs|foo)"
trie.add("absolute")
assert trie.pattern() == "(?:ab(?:c|s(?:olute)?)|foo)"
trie.add("abx")
assert trie.pattern() == "(?:ab(?:[cx]|s(?:olute)?)|foo)"
trie.add("abxy")
assert trie.pattern() == "(?:ab(?:c|s(?:olute)?|xy?)|foo)"
A Trie
may be populated with zero or more strings at instantiation or via Trie.add
, from which method chaining is possible. Two instances can be merged with the +
(new instance) and +=
(in-place update) operators. Instances will compare equal if their data dictionaries are equal.
trie = Trie()
trie += Trie("abc")
assert (
trie + Trie().add("foo")
== Trie("abc", "foo")
== Trie(*["abc", "foo"])
== Trie().add(*["abc", "foo"])
== Trie().add("abc", "foo")
== Trie().add("abc").add("foo")
)
Installation
This pure-Python, OS independent package is available on PyPI:
$ pip install retrie
Usage
For documentation, see retrie.readthedocs.io.
The following objects are all subclasses of retrie.retrie.Retrie
, which handles filling the Trie and compiling the corresponding regex pattern.
Blacklist
The Blacklist
object can be used to filter out bad occurences in a text or a sequence of strings:
from retrie.retrie import Blacklist
# check out docstrings and methods
help(Blacklist)
blacklist = Blacklist(["abc", "foo", "abs"], match_substrings=False)
blacklist.compiled
# re.compile(r'(?<=\b)(?:ab[cs]|foo)(?=\b)', re.IGNORECASE|re.UNICODE)
assert not blacklist.is_blacklisted("a foobar")
assert tuple(blacklist.filter(("good", "abc", "foobar"))) == ("good", "foobar")
assert blacklist.cleanse_text(("good abc foobar")) == "good foobar"
blacklist = Blacklist(["abc", "foo", "abs"], match_substrings=True)
blacklist.compiled
# re.compile(r'(?:ab[cs]|foo)', re.IGNORECASE|re.UNICODE)
assert blacklist.is_blacklisted("a foobar")
assert tuple(blacklist.filter(("good", "abc", "foobar"))) == ("good",)
assert blacklist.cleanse_text(("good abc foobar")) == "good bar"
Whitelist
Similar methods are available for the Whitelist
object:
from retrie.retrie import Whitelist
# check out docstrings and methods
help(Whitelist)
whitelist = Whitelist(["abc", "foo", "abs"], match_substrings=False)
whitelist.compiled
# re.compile(r'(?<=\b)(?:ab[cs]|foo)(?=\b)', re.IGNORECASE|re.UNICODE)
assert not whitelist.is_whitelisted("a foobar")
assert tuple(whitelist.filter(("bad", "abc", "foobar"))) == ("abc",)
assert whitelist.cleanse_text(("bad abc foobar")) == "abc"
whitelist = Whitelist(["abc", "foo", "abs"], match_substrings=True)
whitelist.compiled
# re.compile(r'(?:ab[cs]|foo)', re.IGNORECASE|re.UNICODE)
assert whitelist.is_whitelisted("a foobar")
assert tuple(whitelist.filter(("bad", "abc", "foobar"))) == ("abc", "foobar")
assert whitelist.cleanse_text(("bad abc foobar")) == "abcfoo"
Replacer
The Replacer
object does a fast single-pass search & replace for occurrences of replacement_mapping.keys()
with corresponding values.
from retrie.retrie import Replacer
# check out docstrings and methods
help(Replacer)
replacement_mapping = dict(zip(["abc", "foo", "abs"], ["new1", "new2", "new3"]))
replacer = Replacer(replacement_mapping, match_substrings=True)
replacer.compiled
# re.compile(r'(?:ab[cs]|foo)', re.IGNORECASE|re.UNICODE)
assert replacer.replace("ABS ...foo... foobar") == "new3 ...new2... new2bar"
replacer = Replacer(replacement_mapping, match_substrings=False)
replacer.compiled
# re.compile(r'\b(?:ab[cs]|foo)\b', re.IGNORECASE|re.UNICODE)
assert replacer.replace("ABS ...foo... foobar") == "new3 ...new2... foobar"
replacer = Replacer(replacement_mapping, match_substrings=False, re_flags=None)
replacer.compiled # on py3, re.UNICODE is always enabled
# re.compile(r'\b(?:ab[cs]|foo)\b')
assert replacer.replace("ABS ...foo... foobar") == "ABS ...new2... foobar"
replacer = Replacer(replacement_mapping, match_substrings=False, word_boundary=" ")
replacer.compiled
# re.compile(r'(?<= )(?:ab[cs]|foo)(?= )', re.IGNORECASE|re.UNICODE)
assert replacer.replace(". ABS ...foo... foobar") == ". new3 ...foo... foobar"
Development
Run make help
for options like installing for development, linting and testing.
Code Reference
retrie package
Submodules
retrie.retrie module
Submodule containing the Retrie
class, which handles filling the Trie and compiling the corresponding regex pattern, and its high-level wrappers.
The Blacklist
class can be used to filter out bad occurences in a text or a sequence of strings:
from retrie.retrie import Blacklist
# check out docstrings and methods
help(Blacklist)
blacklist = Blacklist(["abc", "foo", "abs"], match_substrings=False)
blacklist.compiled
# re.compile(r'(?<=)(?:ab[cs]|foo)(?=)', re.IGNORECASE|re.UNICODE)
assert not blacklist.is_blacklisted("a foobar")
assert tuple(blacklist.filter(("good", "abc", "foobar"))) == ("good", "foobar")
assert blacklist.cleanse_text(("good abc foobar")) == "good foobar"
blacklist = Blacklist(["abc", "foo", "abs"], match_substrings=True)
blacklist.compiled
# re.compile(r'(?:ab[cs]|foo)', re.IGNORECASE|re.UNICODE)
assert blacklist.is_blacklisted("a foobar")
assert tuple(blacklist.filter(("good", "abc", "foobar"))) == ("good",)
assert blacklist.cleanse_text(("good abc foobar")) == "good bar"
Similar methods are available for the Whitelist
class:
from retrie.retrie import Whitelist
# check out docstrings and methods
help(Whitelist)
whitelist = Whitelist(["abc", "foo", "abs"], match_substrings=False)
whitelist.compiled
# re.compile(r'(?<=)(?:ab[cs]|foo)(?=)', re.IGNORECASE|re.UNICODE)
assert not whitelist.is_whitelisted("a foobar")
assert tuple(whitelist.filter(("bad", "abc", "foobar"))) == ("abc",)
assert whitelist.cleanse_text(("bad abc foobar")) == "abc"
whitelist = Whitelist(["abc", "foo", "abs"], match_substrings=True)
whitelist.compiled
# re.compile(r'(?:ab[cs]|foo)', re.IGNORECASE|re.UNICODE)
assert whitelist.is_whitelisted("a foobar")
assert tuple(whitelist.filter(("bad", "abc", "foobar"))) == ("abc", "foobar")
assert whitelist.cleanse_text(("bad abc foobar")) == "abcfoo"
The Replacer
class does a fast single-pass search & replace for occurrences of replacement_mapping.keys()
with corresponding values.
from retrie.retrie import Replacer
# check out docstrings and methods
help(Replacer)
replacement_mapping = dict(zip(["abc", "foo", "abs"], ["new1", "new2", "new3"]))
replacer = Replacer(replacement_mapping, match_substrings=True)
replacer.compiled
# re.compile(r'(?:ab[cs]|foo)', re.IGNORECASE|re.UNICODE)
assert replacer.replace("ABS ...foo... foobar") == "new3 ...new2... new2bar"
replacer = Replacer(replacement_mapping, match_substrings=False)
replacer.compiled
# re.compile(r'(?:ab[cs]|foo)', re.IGNORECASE|re.UNICODE)
assert replacer.replace("ABS ...foo... foobar") == "new3 ...new2... foobar"
replacer = Replacer(replacement_mapping, match_substrings=False, re_flags=None)
replacer.compiled # on py3, re.UNICODE is always enabled
# re.compile(r'(?:ab[cs]|foo)')
assert replacer.replace("ABS ...foo... foobar") == "ABS ...new2... foobar"
replacer = Replacer(replacement_mapping, match_substrings=False, word_boundary=" ")
replacer.compiled
# re.compile(r'(?<= )(?:ab[cs]|foo)(?= )', re.IGNORECASE|re.UNICODE)
assert replacer.replace(". ABS ...foo... foobar") == ". new3 ...foo... foobar"
- class retrie.retrie.Blacklist(blacklisted, match_substrings=False, word_boundary='\\b', re_flags=RegexFlag.IGNORECASE | UNICODE)[source]
Bases:
Checklist
Mutate [sequences of] strings based on their match against blacklisted.
Note
Although the Trie is case-sensitive, by default
re.IGNORECASE
is used for better performance. Passre_flags=None
to perform case-sensitive replacements.- Parameters:
blacklisted (Sequence) – Strings to build the Retrie from.
match_substrings (bool) – Whether to override word_boundary with
""
.word_boundary (str) – Token to wrap the retrie to exclude certain matches.
re_flags (re.RegexFlag) – Flags passed to regex engine.
- re_flags
Regex flags passed to
re.compile()
.
- trie
The underlying
retrie.trie.Trie
.
- word_boundary
The boundary token to wrap the
retrie.trie.Trie
pattern in.
- class retrie.retrie.Checklist(keys, match_substrings=False, word_boundary='\\b', re_flags=RegexFlag.IGNORECASE | UNICODE)[source]
Bases:
Retrie
Check and mutate strings against a Retrie.
Note
Although the Trie is case-sensitive, by default
re.IGNORECASE
is used for better performance. Passre_flags=None
to perform case-sensitive replacements.- Parameters:
keys (Sequence) – Strings to build the Retrie from.
match_substrings (bool) – Whether to override word_boundary with
""
.word_boundary (str) – Token to wrap the retrie to exclude certain matches.
re_flags (re.RegexFlag) – Flags passed to regex engine.
- re_flags
Regex flags passed to
re.compile()
.
- trie
The underlying
retrie.trie.Trie
.
- word_boundary
The boundary token to wrap the
retrie.trie.Trie
pattern in.
- class retrie.retrie.Replacer(replacement_mapping, match_substrings=False, word_boundary='\\b', re_flags=RegexFlag.IGNORECASE | UNICODE)[source]
Bases:
Checklist
Replace occurrences of
replacement_mapping.keys()
with corresponding values.Note
Although the Trie is case-sensitive, by default
re.IGNORECASE
is used for better performance. Passre_flags=None
to perform case-sensitive replacements.- Parameters:
replacement_mapping (Mapping) – Mapping
{old: new}
to replace.match_substrings (bool) – Whether to override word_boundary with
""
.word_boundary (str) – Token to wrap the retrie to exclude certain matches.
re_flags (re.RegexFlag) – Flags passed to regex engine.
- replace(text, count=0)[source]
Replace occurrences of replacement_mapping.keys() with corresponding values.
- replacement_mapping
- class retrie.retrie.Retrie(word_boundary='\\b', re_flags=RegexFlag.IGNORECASE | UNICODE)[source]
Bases:
object
Wrap a
retrie.trie.Trie
to compile the corresponding regex pattern with word boundary and regex flags.Note
Although the Trie is case-sensitive, by default
re.IGNORECASE
is used for better performance. Passre_flags=None
to perform case-sensitive replacements.- Parameters:
word_boundary (str) – Token to wrap the retrie to exclude certain matches.
re_flags (re.RegexFlag) – Flags passed to regex engine.
- compile(word_boundary=None, re_flags=-1)[source]
Compile a
re.Pattern
for the current Trie.Optionally the following args can be passed to temporarily override class attrs.
- Parameters:
word_boundary (str) – Token to wrap the retrie to exclude certain matches.
re_flags (re.RegexFlag) – Flags passed to regex engine.
- Returns:
Pattern capturing the Trie items enclosed by word_boundary.
- Return type:
- classmethod parse_re_flags(re_flags)[source]
Convert re_flags to integer.
- Parameters:
re_flags (re.RegexFlag | int | None) – The flags to cast to integer.
- Return type:
- pattern()[source]
Build regex pattern for the current Trie.
- Returns:
Non-capturing regex representation.
- Return type:
- re_flags
Regex flags passed to
re.compile()
.
- trie
The underlying
retrie.trie.Trie
.
- word_boundary
The boundary token to wrap the
retrie.trie.Trie
pattern in.
- class retrie.retrie.Whitelist(whitelisted, match_substrings=False, word_boundary='\\b', re_flags=RegexFlag.IGNORECASE | UNICODE)[source]
Bases:
Checklist
Mutate [sequences of] strings based on their match against whitelisted.
Note
Although the Trie is case-sensitive, by default
re.IGNORECASE
is used for better performance. Passre_flags=None
to perform case-sensitive replacements.- Parameters:
whitelisted (Sequence) – Strings to build the Retrie from.
match_substrings (bool) – Whether to override word_boundary with
""
.word_boundary (str) – Token to wrap the retrie to exclude certain matches.
re_flags (re.RegexFlag) – Flags passed to regex engine.
- re_flags
Regex flags passed to
re.compile()
.
- trie
The underlying
retrie.trie.Trie
.
- word_boundary
The boundary token to wrap the
retrie.trie.Trie
pattern in.
retrie.trie module
Submodule containing code to build a regex pattern from a trie of strings.
Standalone usage:
from retrie.trie import Trie
trie = Trie()
trie.add("abc", "foo", "abs")
assert trie.pattern() == "(?:ab[cs]|foo)" # equivalent to but faster than "(?:abc|abs|foo)"
trie.add("absolute")
assert trie.pattern() == "(?:ab(?:c|s(?:olute)?)|foo)"
trie.add("abx")
assert trie.pattern() == "(?:ab(?:[cx]|s(?:olute)?)|foo)"
trie.add("abxy")
assert trie.pattern() == "(?:ab(?:c|s(?:olute)?|xy?)|foo)"
A Trie
may be populated with zero or more strings at instantiation or via Trie.add()
, from
which method chaining is possible. Two instances can be merged with the +
(new instance) and
+=
(in-place update) operators. Instances will compare equal if their data dictionaries are equal.
trie = Trie()
trie += Trie("abc")
assert (
trie + Trie().add("foo")
== Trie("abc", "foo")
== Trie(*["abc", "foo"])
== Trie().add(*["abc", "foo"])
== Trie().add("abc", "foo")
== Trie().add("abc").add("foo")
)
- class retrie.trie.Trie(*word)[source]
Bases:
object
Create a Trie with zero or more words at instantiation or (later via
Trie.add()
).The Trie can be exported to a Regex pattern via
Trie.pattern()
, which should match much faster than a simple Regex union. For best performance, pass the pattern tore.compile()
and cache it to avoid recompiling for every search. See alsoretrie.retrie.Checklist.compiled
.Two instances can be merged with the
+
(new instance) and+=
(in-place update) operators. Instances will compare equal if their data dictionaries are equal.- Parameters:
word (str) – A string to add to the Trie.