retrie package
Submodules
retrie.retrie module
Submodule containing the Retrie class, which handles filling the Trie and compiling the corresponding regex pattern, and its high-level wrappers.
The Blacklist class can be used to filter out bad occurrences in a text or a sequence of strings:
from retrie.retrie import Blacklist
# check out docstrings and methods
help(Blacklist)
blacklist = Blacklist(["abc", "foo", "abs"], match_substrings=False)
blacklist.compiled
# re.compile(r'(?<=\b)(?:ab[cs]|foo)(?=\b)', re.IGNORECASE|re.UNICODE)
assert not blacklist.is_blacklisted("a foobar")
assert tuple(blacklist.filter(("good", "abc", "foobar"))) == ("good", "foobar")
assert blacklist.cleanse_text(("good abc foobar")) == "good  foobar"
blacklist = Blacklist(["abc", "foo", "abs"], match_substrings=True)
blacklist.compiled
# re.compile(r'(?:ab[cs]|foo)', re.IGNORECASE|re.UNICODE)
assert blacklist.is_blacklisted("a foobar")
assert tuple(blacklist.filter(("good", "abc", "foobar"))) == ("good",)
assert blacklist.cleanse_text(("good abc foobar")) == "good  bar"
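The effect of match_substrings can be reproduced with the standard re module alone. The following is a minimal sketch, not retrie's implementation: the two patterns are written by hand (one wrapped in \b word boundaries, one without), and the helper names is_listed and drop_listed are hypothetical.

```python
import re

FLAGS = re.IGNORECASE | re.UNICODE

# Hand-written patterns; retrie would derive the alternation from a trie.
whole_words = re.compile(r"\b(?:ab[cs]|foo)\b", FLAGS)  # like match_substrings=False
substrings = re.compile(r"(?:ab[cs]|foo)", FLAGS)       # like match_substrings=True


def is_listed(pattern, text):
    """Return True if the pattern matches anywhere in text."""
    return pattern.search(text) is not None


def drop_listed(pattern, terms):
    """Keep only the terms in which the pattern does not match."""
    return tuple(term for term in terms if not is_listed(pattern, term))


terms = ("good", "abc", "foobar")
print(is_listed(whole_words, "a foobar"), drop_listed(whole_words, terms))
print(is_listed(substrings, "a foobar"), drop_listed(substrings, terms))
```

With word boundaries, "foobar" is left alone because "foo" is followed by a word character; without them, any occurrence counts.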
Similar methods are available for the Whitelist class:
from retrie.retrie import Whitelist
# check out docstrings and methods
help(Whitelist)
whitelist = Whitelist(["abc", "foo", "abs"], match_substrings=False)
whitelist.compiled
# re.compile(r'(?<=\b)(?:ab[cs]|foo)(?=\b)', re.IGNORECASE|re.UNICODE)
assert not whitelist.is_whitelisted("a foobar")
assert tuple(whitelist.filter(("bad", "abc", "foobar"))) == ("abc",)
assert whitelist.cleanse_text(("bad abc foobar")) == "abc"
whitelist = Whitelist(["abc", "foo", "abs"], match_substrings=True)
whitelist.compiled
# re.compile(r'(?:ab[cs]|foo)', re.IGNORECASE|re.UNICODE)
assert whitelist.is_whitelisted("a foobar")
assert tuple(whitelist.filter(("bad", "abc", "foobar"))) == ("abc", "foobar")
assert whitelist.cleanse_text(("bad abc foobar")) == "abcfoo"
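The whitelist results above are consistent with keeping only the matched fragments and concatenating them, which would explain why "abcfoo" contains no separator. The sketch below illustrates that reading with the standard re module; it is an assumption about the behaviour, not retrie's code, and keep_listed and keep_matches are hypothetical helper names.

```python
import re

FLAGS = re.IGNORECASE | re.UNICODE

# Hand-written patterns standing in for the compiled whitelist.
whole_words = re.compile(r"\b(?:ab[cs]|foo)\b", FLAGS)  # like match_substrings=False
substrings = re.compile(r"(?:ab[cs]|foo)", FLAGS)       # like match_substrings=True


def keep_listed(pattern, terms):
    """Keep only the terms in which the pattern matches."""
    return tuple(term for term in terms if pattern.search(term))


def keep_matches(pattern, text):
    """Concatenate all matched fragments, discarding everything else."""
    # The group is non-capturing, so findall returns the full matches.
    return "".join(pattern.findall(text))


terms = ("bad", "abc", "foobar")
print(keep_listed(whole_words, terms), keep_matches(whole_words, "bad abc foobar"))
print(keep_listed(substrings, terms), keep_matches(substrings, "bad abc foobar"))
```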
The Replacer class performs a fast single-pass search and replace, substituting each occurrence of a key in replacement_mapping with its corresponding value.
from retrie.retrie import Replacer
# check out docstrings and methods
help(Replacer)
replacement_mapping = dict(zip(["abc", "foo", "abs"], ["new1", "new2", "new3"]))
replacer = Replacer(replacement_mapping, match_substrings=True)
replacer.compiled
# re.compile(r'(?:ab[cs]|foo)', re.IGNORECASE|re.UNICODE)
assert replacer.replace("ABS ...foo... foobar") == "new3 ...new2... new2bar"
replacer = Replacer(replacement_mapping, match_substrings=False)
replacer.compiled
# re.compile(r'\b(?:ab[cs]|foo)\b', re.IGNORECASE|re.UNICODE)
assert replacer.replace("ABS ...foo... foobar") == "new3 ...new2... foobar"
replacer = Replacer(replacement_mapping, match_substrings=False, re_flags=None)
replacer.compiled # on py3, re.UNICODE is always enabled
# re.compile(r'\b(?:ab[cs]|foo)\b')
assert replacer.replace("ABS ...foo... foobar") == "ABS ...new2... foobar"
replacer = Replacer(replacement_mapping, match_substrings=False, word_boundary=" ")
replacer.compiled
# re.compile(r'(?<= )(?:ab[cs]|foo)(?= )', re.IGNORECASE|re.UNICODE)
assert replacer.replace(". ABS ...foo... foobar") == ". new3 ...foo... foobar"
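A single-pass search and replace of this kind can be approximated with re.sub and a callback that looks up each match in the mapping. This is a sketch under stated assumptions, not the library's implementation: the pattern is written by hand, and the lower-cased lookup table is an assumed way of resolving case-insensitive matches such as "ABS".

```python
import re

replacement_mapping = {"abc": "new1", "foo": "new2", "abs": "new3"}

# Hand-written alternation with word boundaries; retrie would build the
# equivalent pattern from a trie of the mapping's keys.
pattern = re.compile(r"\b(?:ab[cs]|foo)\b", re.IGNORECASE)

# Lower-cased lookup so that a case-insensitive match still finds its
# replacement (assumes keys stay unique after lower-casing).
lookup = {old.lower(): new for old, new in replacement_mapping.items()}


def replace(text, count=0):
    """Replace matches in one pass; count=0 means no limit, as in re.sub."""
    return pattern.sub(lambda match: lookup[match.group(0).lower()], text, count=count)


print(replace("ABS ...foo... foobar"))
print(replace("ABS ...foo... foobar", count=1))
```

Because every key is part of one alternation, the text is scanned once regardless of how many keys the mapping holds.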
- class retrie.retrie.Blacklist(blacklisted, match_substrings=False, word_boundary='\\b', re_flags=RegexFlag.IGNORECASE | UNICODE)
Bases: Checklist
Mutate [sequences of] strings based on their match against blacklisted.
Note
Although the Trie is case-sensitive, by default re.IGNORECASE is used for better performance. Pass re_flags=None to perform case-sensitive replacements.
- Parameters:
blacklisted (Sequence) – Strings to build the Retrie from.
match_substrings (bool) – Whether to override word_boundary with "".
word_boundary (str) – Token to wrap the retrie to exclude certain matches.
re_flags (re.RegexFlag) – Flags passed to regex engine.
- re_flags
Regex flags passed to re.compile().
- trie
The underlying retrie.trie.Trie.
- word_boundary
The boundary token to wrap the retrie.trie.Trie pattern in.
- class retrie.retrie.Checklist(keys, match_substrings=False, word_boundary='\\b', re_flags=RegexFlag.IGNORECASE | UNICODE)
Bases: Retrie
Check and mutate strings against a Retrie.
Note
Although the Trie is case-sensitive, by default re.IGNORECASE is used for better performance. Pass re_flags=None to perform case-sensitive replacements.
- Parameters:
keys (Sequence) – Strings to build the Retrie from.
match_substrings (bool) – Whether to override word_boundary with "".
word_boundary (str) – Token to wrap the retrie to exclude certain matches.
re_flags (re.RegexFlag) – Flags passed to regex engine.
- re_flags
Regex flags passed to re.compile().
- trie
The underlying retrie.trie.Trie.
- word_boundary
The boundary token to wrap the retrie.trie.Trie pattern in.
- class retrie.retrie.Replacer(replacement_mapping, match_substrings=False, word_boundary='\\b', re_flags=RegexFlag.IGNORECASE | UNICODE)
Bases: Checklist
Replace occurrences of replacement_mapping.keys() with corresponding values.
Note
Although the Trie is case-sensitive, by default re.IGNORECASE is used for better performance. Pass re_flags=None to perform case-sensitive replacements.
- Parameters:
replacement_mapping (Mapping) – Mapping {old: new} to replace.
match_substrings (bool) – Whether to override word_boundary with "".
word_boundary (str) – Token to wrap the retrie to exclude certain matches.
re_flags (re.RegexFlag) – Flags passed to regex engine.
- replace(text, count=0)
Replace occurrences of replacement_mapping.keys() with corresponding values.
- replacement_mapping
- class retrie.retrie.Retrie(word_boundary='\\b', re_flags=RegexFlag.IGNORECASE | UNICODE)
Bases: object
Wrap a retrie.trie.Trie to compile the corresponding regex pattern with word boundary and regex flags.
Note
Although the Trie is case-sensitive, by default re.IGNORECASE is used for better performance. Pass re_flags=None to perform case-sensitive replacements.
- Parameters:
word_boundary (str) – Token to wrap the retrie to exclude certain matches.
re_flags (re.RegexFlag) – Flags passed to regex engine.
- compile(word_boundary=None, re_flags=-1)
Compile a re.Pattern for the current Trie.
Optionally, the following arguments can be passed to temporarily override the class attributes.
- Parameters:
word_boundary (str) – Token to wrap the retrie to exclude certain matches.
re_flags (re.RegexFlag) – Flags passed to regex engine.
- Returns:
Pattern capturing the Trie items enclosed by word_boundary.
- Return type:
re.Pattern
- classmethod parse_re_flags(re_flags)
Convert re_flags to integer.
- Parameters:
re_flags (re.RegexFlag | int | None) – The flags to cast to integer.
- Return type:
int
- pattern()
Build regex pattern for the current Trie.
- Returns:
Non-capturing regex representation.
- Return type:
str
- re_flags
Regex flags passed to re.compile().
- trie
The underlying retrie.trie.Trie.
- word_boundary
The boundary token to wrap the retrie.trie.Trie pattern in.
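The word_boundary and re_flags handling described for Retrie can be illustrated with a short stdlib-only sketch. The function names parse_flags and wrap are hypothetical, and treating None as "no flags" is an assumption inferred from the re_flags=None example, where the compiled pattern carries no explicit flags.

```python
import re


def parse_flags(re_flags):
    """Cast flags to int; None is treated as 'no flags' (assumed behaviour)."""
    return 0 if re_flags is None else int(re_flags)


def wrap(pattern, word_boundary=r"\b"):
    """Enclose pattern in lookarounds so the boundary token is not consumed."""
    if not word_boundary:
        return pattern  # an empty boundary allows substring matches
    return f"(?<={word_boundary}){pattern}(?={word_boundary})"


# A space as boundary: only matches surrounded by spaces are replaced.
compiled = re.compile(wrap("(?:ab[cs]|foo)", " "), parse_flags(re.IGNORECASE))
print(compiled.pattern)
print(compiled.sub("new", ". ABS ...foo... foobar"))
```

Using lookarounds rather than matching the boundary token itself keeps the surrounding characters out of the match, so a replacement leaves them in place.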
- class retrie.retrie.Whitelist(whitelisted, match_substrings=False, word_boundary='\\b', re_flags=RegexFlag.IGNORECASE | UNICODE)
Bases: Checklist
Mutate [sequences of] strings based on their match against whitelisted.
Note
Although the Trie is case-sensitive, by default re.IGNORECASE is used for better performance. Pass re_flags=None to perform case-sensitive replacements.
- Parameters:
whitelisted (Sequence) – Strings to build the Retrie from.
match_substrings (bool) – Whether to override word_boundary with "".
word_boundary (str) – Token to wrap the retrie to exclude certain matches.
re_flags (re.RegexFlag) – Flags passed to regex engine.
- re_flags
Regex flags passed to re.compile().
- trie
The underlying retrie.trie.Trie.
- word_boundary
The boundary token to wrap the retrie.trie.Trie pattern in.
retrie.trie module
Submodule containing code to build a regex pattern from a trie of strings.
Standalone usage:
from retrie.trie import Trie
trie = Trie()
trie.add("abc", "foo", "abs")
assert trie.pattern() == "(?:ab[cs]|foo)" # equivalent to but faster than "(?:abc|abs|foo)"
trie.add("absolute")
assert trie.pattern() == "(?:ab(?:c|s(?:olute)?)|foo)"
trie.add("abx")
assert trie.pattern() == "(?:ab(?:[cx]|s(?:olute)?)|foo)"
trie.add("abxy")
assert trie.pattern() == "(?:ab(?:c|s(?:olute)?|xy?)|foo)"
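How a trie of strings can be turned into a compact regex is sketched below with nested dictionaries. This is a generic version of the well-known trie-to-regex technique, not retrie's source: build and to_pattern are hypothetical names, the empty-string key is an assumed end-of-word marker, and the order of alternatives may differ from what Trie.pattern() emits even though the patterns match the same strings.

```python
import re

END = ""  # assumed end-of-word marker key


def build(words):
    """Insert each word character by character into a nested dict."""
    root = {}
    for word in words:
        node = root
        for char in word:
            node = node.setdefault(char, {})
        node[END] = {}
    return root


def to_pattern(node):
    """Return a regex for the subtree, or None if it only ends a word."""
    if END in node and len(node) == 1:
        return None
    alternatives, chars, optional = [], [], False
    for char in sorted(node):
        if char == END:
            optional = True  # a word may stop here
            continue
        sub = to_pattern(node[char])
        if sub is None:
            chars.append(re.escape(char))  # single trailing character
        else:
            alternatives.append(re.escape(char) + sub)
    only_chars = not alternatives
    if len(chars) == 1:
        alternatives.append(chars[0])
    elif chars:
        alternatives.append("[" + "".join(chars) + "]")  # e.g. [cs]
    if len(alternatives) == 1:
        result = alternatives[0]
    else:
        result = "(?:" + "|".join(alternatives) + ")"
    if optional:
        result = result + "?" if only_chars else "(?:" + result + ")?"
    return result


print(to_pattern(build(["abc", "foo", "abs"])))
```

Shared prefixes are emitted once, so the regex engine does not have to retry "ab" for every alternative that starts with it.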
A Trie may be populated with zero or more strings at instantiation or via Trie.add(), which supports method chaining. Two instances can be merged with the + (new instance) and += (in-place update) operators. Instances compare equal if their data dictionaries are equal.
trie = Trie()
trie += Trie("abc")
assert (
trie + Trie().add("foo")
== Trie("abc", "foo")
== Trie(*["abc", "foo"])
== Trie().add(*["abc", "foo"])
== Trie().add("abc", "foo")
== Trie().add("abc").add("foo")
)
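The chaining, merging and equality semantics shown above could be implemented along the following lines. MiniTrie is a hypothetical stand-in written for illustration; its data attribute and empty-string end marker are assumptions and need not match the internals of retrie.trie.Trie.

```python
class MiniTrie:
    """Toy trie backed by nested dicts (illustrative only)."""

    def __init__(self, *words):
        self.data = {}
        self.add(*words)

    def add(self, *words):
        for word in words:
            node = self.data
            for char in word:
                node = node.setdefault(char, {})
            node[""] = {}  # assumed end-of-word marker
        return self  # returning self is what makes chaining possible

    def __eq__(self, other):
        return isinstance(other, MiniTrie) and self.data == other.data

    def __iadd__(self, other):
        _merge(self.data, other.data)  # in-place update
        return self

    def __add__(self, other):
        result = MiniTrie()  # new instance; operands stay untouched
        result += self
        result += other
        return result


def _merge(target, source):
    """Recursively copy the branches of source into target."""
    for key, child in source.items():
        _merge(target.setdefault(key, {}), child)


trie = MiniTrie()
trie += MiniTrie("abc")
print(trie + MiniTrie().add("foo") == MiniTrie("abc", "foo"))
```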
- class retrie.trie.Trie(*word)
Bases: object
Create a Trie with zero or more words at instantiation (or later via Trie.add()).
The Trie can be exported to a regex pattern via Trie.pattern(), which should match much faster than a simple regex union. For best performance, pass the pattern to re.compile() and cache it to avoid recompiling for every search. See also retrie.retrie.Checklist.compiled.
Two instances can be merged with the + (new instance) and += (in-place update) operators. Instances compare equal if their data dictionaries are equal.
- Parameters:
word (str) – A string to add to the Trie.
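The advice to compile the exported pattern once and cache it can be followed in several ways; one stdlib-only option is sketched here. The helper cached_compile is hypothetical and not part of retrie, and the pattern string is copied from the standalone usage example rather than computed.

```python
import re
from functools import lru_cache


@lru_cache(maxsize=None)
def cached_compile(pattern, flags=0):
    """Compile once per (pattern, flags) pair and reuse the result."""
    return re.compile(pattern, flags)


# Pattern string as a Trie of "abc", "foo" and "abs" would export it.
PATTERN = "(?:ab[cs]|foo)"

first = cached_compile(PATTERN, re.IGNORECASE)
second = cached_compile(PATTERN, re.IGNORECASE)
print(first is second)
```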