pypolibox Package¶
database
Module¶
The database
module is responsible for parsing the user’s requirements
(both from command line options, as well as interactively from the Python
interpreter), transforming these requirements into an SQL query, querying the
sqlite database and returning the results.
-
class
pypolibox.database.
Book
(db_item, db_columns, query_args)[source]¶ a
Book
instance represents one
book from a database query
-
class
pypolibox.database.
Books
(results)[source]¶ a
Books
instance stores all
books that were found by a database query as a list of Book
instances in self.books
-
get_book_ranks
(possible_matches)[source]¶ ranks ‘OR query’ results according to the number of query parameters they match.
Parameters: possible_matches (int) – the number of (meaningful) parameters of the query.
Returns: book_ranks – a list of tuples, where each tuple consists of the score of a book and its index in self.books
Return type: list of (float, int) tuples
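As a rough illustration of the ranking described here, the following sketch scores each book by the fraction of query parameters it matches and returns (score, index) tuples, best match first. The function name `rank_books` and the `matches_per_book` input are hypothetical; the actual `get_book_ranks` method may determine matches differently.

```python
def rank_books(matches_per_book, possible_matches):
    """Return (score, index) tuples, best-matching book first.

    matches_per_book -- hypothetical list holding the number of query
        parameters matched by each book (same order as self.books)
    possible_matches -- the number of (meaningful) query parameters
    """
    if possible_matches <= 0:
        # avoid a division by zero for queries without parameters
        return [(0.0, index) for index in range(len(matches_per_book))]
    ranks = [(matches / float(possible_matches), index)
             for index, matches in enumerate(matches_per_book)]
    # sort by descending score; books with equal scores keep their order
    return sorted(ranks, key=lambda rank: rank[0], reverse=True)


book_ranks = rank_books([1, 3, 2], possible_matches=3)
```

With three possible matches, the book at index 1 matches everything and therefore ends up first in `book_ranks`.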
-
-
class
pypolibox.database.
Query
(argv)[source]¶ a
Query
instance represents one user query to the database. Queries can be made from the command line, as well as from the Python interpreter. From the command line, queries can be made using either abbreviated or long parameters. The following examples both query the database for books that contain code examples and deal with both semantics and parsing:
python pypolibox.py -k semantics, parsing -c 1
python pypolibox.py --keywords semantics, parsing --codeexamples 1
When calling
pypolibox.py
from within the Python interpreter, the same query can be made using the following command:
Query(["-k", "semantics", "parsing", "-c", "1"])
If you print the
Query
instance (by using the print
command), it will return the SQL query that was constructed from the user input:
SELECT * FROM books WHERE keywords like '%semantics%' AND keywords like '%parsing%' AND examples = 1
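To make the mapping from parameters to SQL more concrete, here is a minimal sketch that reproduces the query string shown above. The helper `build_sql` and its arguments are assumptions made for illustration; they are not part of the documented `Query` class.

```python
def build_sql(keywords=(), codeexamples=None, table="books"):
    """Combine query parameters into an 'AND query' string (sketch)."""
    conditions = ["keywords like '%{0}%'".format(kw) for kw in keywords]
    if codeexamples is not None:
        # assumption: 'codeexamples' is stored in the 'examples' column,
        # as the SQL output in the documentation suggests
        conditions.append("examples = {0}".format(int(codeexamples)))
    sql = "SELECT * FROM {0}".format(table)
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql


sql = build_sql(keywords=["semantics", "parsing"], codeexamples=1)
```

String interpolation is used here only to mirror the printed query; code that actually talks to a database would usually prefer parameter binding to avoid SQL injection.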
TODO: This module talks directly to the database. To make it easier to adapt pypolibox to a different domain, an SQL abstraction layer (e.g. SQLAlchemy) should be used.
-
class
pypolibox.database.
Results
(query)[source]¶ A
Results
instance sends queries to the database, then retrieves and stores the results.
-
get_number_of_possible_matches
()[source]¶ Counts the number of query parameters that
could
be matched by books from the results set. The actual scoring of books takes place in Books.get_book_ranks().
For example, if a query contains the parameters:
keywords = pragmatics, keywords = semantics, language = German
it means that a book could possibly match 3 parameters (possible_matches = 3).
Returns: the number of possible matches
Return type: int
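The counting rule from the example can be sketched as follows, assuming the query parameters are available as a dictionary in which multi-valued parameters are lists. Both the function name and the dictionary layout are hypothetical.

```python
def count_possible_matches(query_params):
    """Count every requested value; unset parameters are ignored."""
    count = 0
    for value in query_params.values():
        if isinstance(value, (list, tuple, set)):
            # each keyword (or programming language) counts separately
            count += len(value)
        elif value is not None:
            count += 1
    return count


possible_matches = count_possible_matches({
    "keywords": ["pragmatics", "semantics"],
    "language": "German",
    "target": None,  # not specified by the user
})
```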
-
get_table_header
(table_name)[source]¶ get the column names (e.g. title, year, authors) and their index from the books table of the db and return them as a dictionary.
Parameters: table_name (str) – name of a database table, e.g. ‘books’
Returns: a dictionary, which contains the names of the table columns as keys and their index as values
Return type: dict, with str keys and int values
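One way to obtain such a column-name-to-index dictionary with Python's sqlite3 module is sketched below (Python 3). It uses SQLite's `PRAGMA table_info`; the function name `table_header` and the in-memory demo table are assumptions, and the real method may retrieve the header differently.

```python
import sqlite3


def table_header(connection, table_name):
    """Map each column name of a table to its index (sketch)."""
    if not table_name.isidentifier():
        # PRAGMA statements cannot use bound parameters, so the name is
        # validated before it is inserted into the statement
        raise ValueError("unexpected table name: %r" % table_name)
    cursor = connection.execute("PRAGMA table_info({0})".format(table_name))
    # each row has the form (cid, name, type, notnull, dflt_value, pk)
    return {row[1]: row[0] for row in cursor.fetchall()}


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE books (title TEXT, year INTEGER, authors TEXT)")
header = table_header(connection, "books")
```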
-
debug
Module¶
The debug
module contains a number of functions that can be used to test
the behaviour of pypolibox’s classes, test its error handling or simply
provide shortcuts to generate frequently needed data.
-
pypolibox.debug.
abbreviate_textplan
(textplan)[source]¶ recursive helper function that prints only the skeleton of a textplan (message types and RST relations but not the actual message content)
Parameters: textplan (TextPlan or ConstituentSet or Message) – a text plan, a constituent set or a message
Returns: a message (without the attribute value pairs stored in it)
Return type: Message
-
pypolibox.debug.
apply_rule
(messages, rule_name)[source]¶ debugging: take a rule and apply it to your list of messages.
the resulting
ConstituentSet
will be added to the list, while the messages involved in its construction will be removed. repeat this step until you’ve found an erroneous/missing rule.
-
pypolibox.debug.
compare_hlds_variants
()[source]¶ TODO: kill bugs
BUG1: sentence001-original-test contains 2(!) <item> sentences.
-
pypolibox.debug.
compare_textplans
()[source]¶ helps to find out how many different text plan structures there are.
-
pypolibox.debug.
enumprint
(obj)[source]¶ prints every item of an iterable on its own line, preceded by its index
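A function with this behaviour can be written in a few lines. The sketch below is only an approximation: the name `enumprint_sketch` and the exact output format (index, colon, item) are assumptions.

```python
def enumprint_sketch(iterable):
    """Print every item on its own line, preceded by its index."""
    for index, item in enumerate(iterable):
        print("{0}: {1}".format(index, item))


enumprint_sketch(["syntax", "semantics"])
```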
-
pypolibox.debug.
find_applicable_rules
(messages)[source]¶ debugging: find out which rules are directly (i.e. without forming ConstituentSets first) applicable to your messages
-
pypolibox.debug.
findrule
(ruletype='', attribute='', value='')[source]¶ debugging: find rules that have a certain ruleType and some attribute-value pair
Example: findrule(“Concession”, “nucleus”, “usermodel_match”) finds rules of type ‘Concession’ where rule.nucleus == ‘usermodel_match’.
-
pypolibox.debug.
gen_all_messages_of_type
(msg_type)[source]¶ generate all messages for all books from all testqueries, but return only those which match the given message type, e.g. ‘id’ or ‘extra’.
-
pypolibox.debug.
gen_all_textplans
()[source]¶ generates all text plans for each query in the predefined list of test queries.
Return type: list of TextPlan instances or strings
-
pypolibox.debug.
gen_textplans
(query)[source]¶ debug function: generates all text plans for a query.
Parameters: query (int or list of str) – can be the index of a test query (e.g. 4) OR a list of query parameters (e.g. [“-k”, “phonology”, “-l”, “German”])
Return type: TextPlans
Returns: a TextPlans instance, containing a number of text plans
-
pypolibox.debug.
genallmessages
(query)[source]¶ debug function: generates all messages for a query.
Parameters: query (int or list of str) – can be the index of a test query (e.g. 4) OR a list of query parameters (e.g. [“-k”, “phonology”, “-l”, “German”])
Return type: AllMessages
Returns: all messages that could be generated for the query
-
pypolibox.debug.
genmessages
(booknumber=0, querynumber=10)[source]¶ generates all messages for a book regarding a specific database query.
Parameters:
- booknumber (int) – the index of the book from the results list (“0” would be the first book with the highest score)
- querynumber (int) – the index of a query from the predefined list of test queries (named ‘testqueries’)
Return type: list of Message instances
-
pypolibox.debug.
genprops
(querynumber=10)[source]¶ generates all propositions for all books in the database concerning a specific query.
Parameters: querynumber (int) – the index of a query from the predefined list of test queries (named ‘testqueries’)
Return type: AllPropositions
-
pypolibox.debug.
msgtypes
(messages)[source]¶ print message types / rst relation types, no matter which data structure is used to represent them
-
pypolibox.debug.
test_cli
(query_arguments=[[], ['-k', 'pragmatics'], ['-k', 'pragmatics', '-r', '4'], ['-k', 'pragmatics', 'semantics'], ['-k', 'pragmatics', 'semantics', '-r', '7'], ['-l', 'German'], ['-l', 'German', '-p', 'Lisp'], ['-l', 'German', '-p', 'Lisp', '-k', 'parsing'], ['-l', 'English', '-s', '0', '-c', '1'], ['-l', 'English', '-s', '0', '-e', '1', '-k', 'discourse'], ['-k', 'syntax', 'parsing', '-l', 'German', '-p', 'Prolog', 'Lisp', '-s', '2', '-t', '0', '-e', '1', '-c', '1', '-r', '7']])[source]¶ run several complex queries and print their results to stdout
facts
Module¶
The facts
module takes the information stored in Book
instances and
converts them into attribute value matrices (Facts
). Furthermore, the
module compares each book with its predecessor (e.g. book A is newer than book
B and has code examples, while B is shorter and targets beginners …). The
insights gathered from these comparisons are also stored in Facts
instances.
-
class
pypolibox.facts.
AllFacts
(b)[source]¶ Simply speaking, an
AllFacts
instance contains all facts about all books that were returned by a database query. More formally, it contains a Facts instance for each Book in a Books instance.
In a Books instance, all books returned by a database query are sorted by the number of query parameters they match (‘user model match’) in descending order. This means that AllFacts will contain facts about the best-matching book, followed by facts about the second-best matching book (including a comparison to the best-matching one), followed by facts about the third-best matching book (including a comparison to the second one) etc.
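The ordering can be pictured with a small sketch that walks through the books by descending score and hands each one its predecessor. The generator `pair_with_predecessor` and the use of plain strings as stand-ins for Book instances are assumptions for illustration only.

```python
def pair_with_predecessor(book_ranks, books):
    """Yield (book, score, preceding_book) in descending score order."""
    preceding = None  # the best-matching book has no predecessor
    ordered = sorted(book_ranks, key=lambda rank: rank[0], reverse=True)
    for score, index in ordered:
        book = books[index]
        yield book, score, preceding
        preceding = book


books = ["A", "B", "C"]  # stand-ins for Book instances
book_ranks = [(0.5, 0), (1.0, 1), (0.75, 2)]
pairs = list(pair_with_predecessor(book_ranks, books))
```

Each tuple in `pairs` carries what a Facts instance would need: the book itself, its score and the book it should be compared with.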
-
class
pypolibox.facts.
Facts
(book, book_score, index=0, preceding_book=False)[source]¶ A
Facts
instance represents facts about a single book, but also contains a comparison of that particular book with its predecessor.
-
generate_extra_facts
(index, book)[source]¶ generates extra_facts, if the current book is very new/old or very short/long.
Parameters:
- index (int) – the index of the book in the Books list of books
- book (Book) – a Book instance
Returns: a dictionary that contains information about the recency and length of a book
Return type: dict
-
generate_id_facts
(index, book)[source]¶ generates a dictionary of id facts about the current book which will be stored in
self.facts["id_facts"]
. In contrast to other facts, id_facts are the kind of facts that can be directly retrieved from the database (i.e. there is no comparison between books or reasoning involved). The id_facts dictionary contains the following keys:
id_facts keys     database book table columns
'authors'
'codeexamples'    'examples'
'exercises'
'keywords'
'language'        'lang'
'pages'
'proglang'        'plang'
'target'
'title'
'year'
The key names should be self-explanatory. In those cases where they do not exactly match their counterparts in the database, the corresponding database table column name is given in the table above.
Parameters:
- index (int) – the index of the book in the Books list of books
- book (Book) – a Book instance
Returns: a dictionary with the keys described above
Return type: dict
-
generate_lastbook_facts
(index, book, preceding_book)[source]¶ generates facts that compare the current book with the preceding one. A typical example of a lastbook_facts dictionary would look like this:
lastbook_facts:
    lastbook_nomatch: {'language': 'German', 'keywords_preceding_book_only': set(['pragmatics', 'chart parsing']), 'keywords_current_book_only': set([' ', 'grammar', 'language hierarchy', 'corpora', 'syntax', 'morphology', 'left associative grammar']), 'codeexamples': 0, 'proglang': set(['Lisp']), 'newer': 11, 'keywords': set([' ', 'grammar', 'language hierarchy', 'corpora', 'syntax', 'left associative grammar', 'morphology', 'chart parsing', 'pragmatics']), 'proglang_preceding_book_only': set(['Lisp'])}
    lastbook_match: {'exercises': 1, 'keywords': set(['semantics', 'parsing']), 'target': 0, 'pagerange': 1}
This method will calculate whether a book is newer/older/shorter/longer than its predecessor (if so, it will store the difference as an integer). For keys that have sets as their values (keywords and proglang), the resulting dictionary will list which values differed and which were only present in either the preceding or the current book.
Parameters:
- index (int) – the index of the book in the Books list of books
- book (Book) – a Book instance
- preceding_book (bool) – if True, there is a book preceding this one and both books will be compared
Returns: a dictionary with two keys: lastbook_match and lastbook_nomatch, which in turn are dictionaries themselves and contain facts that are shared between the two books (lastbook_match) or that differ between the two (lastbook_nomatch).
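The handling of set-valued keys can be approximated with ordinary set operations. The sketch below assumes that shared values go into the match dictionary, that the symmetric difference is stored under the plain key, and that the `_current_book_only` / `_preceding_book_only` keys are only added when non-empty, which is consistent with the example output above. The function `compare_set_feature` itself is hypothetical.

```python
def compare_set_feature(name, current_values, preceding_values):
    """Compare one set-valued feature of two books (sketch)."""
    current, preceding = set(current_values), set(preceding_values)
    match, nomatch = {}, {}
    shared = current & preceding
    if shared:
        match[name] = shared
    if current != preceding:
        # all values that occur in only one of the two books
        nomatch[name] = current ^ preceding
        if current - preceding:
            nomatch[name + "_current_book_only"] = current - preceding
        if preceding - current:
            nomatch[name + "_preceding_book_only"] = preceding - current
    return match, nomatch


match, nomatch = compare_set_feature(
    "keywords",
    current_values=["semantics", "parsing", "syntax"],
    preceding_values=["semantics", "parsing", "pragmatics"],
)
```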
-
generate_query_facts
(index, book, book_score)[source]¶ generates facts that describe whether a book matches (parts of) the query (a.k.a. the user model). A typical query_facts dictionary will look like this:
query_facts:
    usermodel_nomatch: {'codeexamples': 0}
    usermodel_match: {'exercises': 1, 'keywords': set(['semantics', 'parsing']), 'language': 'German'}
    book_score: 0.8
The book described in this example matches 80 % of the user requirements (it contains exercises, deals with semantics and parsing and is written in German) but does not contain code examples (as was asked for by the user).
Parameters:
- index (int) – the index of the book in the Books list of books
- book (Book) – a Book instance
- book_score (float) – the score of the book that was calculated in Books.get_book_ranks()
Returns: a dictionary that contains three keys: the book_score, the usermodel_match as well as the usermodel_nomatch. ‘usermodel_match’ contains all the features that were requested by the user and are present in the book. ‘usermodel_nomatch’ contains all features that were requested but are missing from the book.
Return type: dict
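The 80 % figure in the example follows from four satisfied requirements out of five (exercises, two keywords, language, but no code examples). The sketch below reproduces that arithmetic with plain dictionaries. It recomputes the score itself, whereas the documented method receives `book_score` as an argument; the function name and the data layout are assumptions.

```python
def query_facts_sketch(book, query):
    """Split requested features into matches and non-matches (sketch)."""
    match, nomatch = {}, {}
    requested = satisfied = 0
    for key, wanted in query.items():
        if isinstance(wanted, (set, frozenset)):
            found = wanted & set(book.get(key, ()))
            requested += len(wanted)
            satisfied += len(found)
            if found:
                match[key] = found
            if wanted - found:
                # assumption: list the requested values the book lacks
                nomatch[key] = wanted - found
        else:
            requested += 1
            if book.get(key) == wanted:
                satisfied += 1
                match[key] = wanted
            else:
                # store the book's actual value, as in the example
                nomatch[key] = book.get(key)
    score = satisfied / float(requested) if requested else 0.0
    return {"usermodel_match": match,
            "usermodel_nomatch": nomatch,
            "book_score": score}


book = {"exercises": 1, "language": "German", "codeexamples": 0,
        "keywords": {"semantics", "parsing", "syntax"}}
query = {"exercises": 1, "language": "German", "codeexamples": 1,
         "keywords": {"semantics", "parsing"}}
query_facts = query_facts_sketch(book, query)
```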
-
hlds
Module¶
HLDS (Hybrid Logic Dependency Semantics) is the format internally used by the OpenCCG realizer. This module shall allow the conversion between HLDS-XML files and NLTK feature structures. In addition, it can also be used as a command-line tool to convert HLDS-XML files into printable versions of nltk.FeatStruct instances. The following command produces a LaTeX file that can be compiled into a PDF:
python hlds.py --format latex --outfile output.tex input1.xml input2.xml
Alternatively, you can also produce ‘ASCII art’ with this command:
python hlds.py --format nltk --outfile output.tex input1.xml input2.xml
This way, the phrase ‘das Buch’ can be transformed from this HLDS-XML representation:
<?xml version="1.0" encoding="UTF-8"?>
<xml>
<lf>
<satop nom="b1:artefaktum">
<prop name="Buch"/>
<diamond mode="NUM">
<prop name="sing"/>
</diamond>
<diamond mode="ART">
<nom name="d1:sem-obj"/>
<prop name="def"/>
</diamond>
</satop>
</lf>
<target>das Buch</target>
</xml>
To this attribute-value matrix (LaTeX):
\begin{avm}
\[ $*$nom$*$ & `b1:artefaktum' \\
$*$prop$*$ & `Buch' \\
$*$text$*$ & `das Buch' \\
NUM & \[ prop & `sing' \] \\
ART & \[ nom & `d1:sem-obj' \\
prop & `def' \] \\
\]
\end{avm}
or this one (plain text):
[ *root_nom* = 'b1:artefaktum' ]
[ *root_prop* = 'Buch' ]
[ *text* = 'das Buch' ]
[ ]
[ 00__NUM = [ *mode* = 'NUM' ] ]
[ [ prop = 'sing' ] ]
[ ]
[ [ *mode* = 'ART' ] ]
[ 01__ART = [ nom = 'd1:sem-obj' ] ]
[ [ prop = 'def' ] ]
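The conversion shown above can be approximated without NLTK. The following sketch uses the standard library's xml.etree.ElementTree and plain dictionaries instead of feature structures, so it is a simplified stand-in rather than the module's own implementation. The function names are hypothetical, and the key format (`00__NUM`, `*root_nom*` etc.) simply imitates the plain-text output.

```python
import xml.etree.ElementTree as ET

HLDS_XML = """
<xml>
  <lf>
    <satop nom="b1:artefaktum">
      <prop name="Buch"/>
      <diamond mode="NUM">
        <prop name="sing"/>
      </diamond>
      <diamond mode="ART">
        <nom name="d1:sem-obj"/>
        <prop name="def"/>
      </diamond>
    </satop>
  </lf>
  <target>das Buch</target>
</xml>
"""


def add_subdiamonds(result, parent):
    """Store all direct <diamond> children under indexed keys."""
    for index, sub in enumerate(parent.findall("diamond")):
        key = "{0:02d}__{1}".format(index, sub.get("mode"))
        result[key] = diamond_to_dict(sub)


def diamond_to_dict(diamond):
    """Convert a <diamond> element into a nested dictionary."""
    result = {"*mode*": diamond.get("mode")}
    for tag in ("nom", "prop"):
        child = diamond.find(tag)
        if child is not None:
            result[tag] = child.get("name")
    add_subdiamonds(result, diamond)
    return result


def sentence_to_dict(xml_string):
    """Convert a single-sentence HLDS XML string into a dictionary."""
    root = ET.fromstring(xml_string.strip())
    satop = root.find("lf/satop")
    result = {
        "*root_nom*": satop.get("nom"),
        "*root_prop*": satop.find("prop").get("name"),
        "*text*": root.findtext("target"),
    }
    add_subdiamonds(result, satop)
    return result


sentence = sentence_to_dict(HLDS_XML)
```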
-
class
pypolibox.hlds.
Diamond
(features=None, **morefeatures)[source]¶ Bases:
nltk.featstruct.FeatDict
A Diamond represents an HLDS diamond in the form of a (nested) feature structure containing the elements nom? prop? diamond*
<diamond mode="AGENS">
<nom name="s1:addition"/>
<prop name="sowohl"/>
<diamond mode="NP1">
<nom name="h1:nachname"/>
<prop name="Hausser"/>
</diamond>
...
</diamond>
-
append_subdiamond
(subdiamond, mode=None)[source]¶ appends a subdiamond structure to an existing diamond structure, while allowing to change the mode of the subdiamond
Parameters: mode (str or NoneType) – the mode that the subdiamond shall have. This will also be used to determine the subdiamond’s identifier. If the diamond already has two subdiamonds (e.g. “00__AGENS” and “01__PATIENS”) and we add a third subdiamond with mode “TEMP”, its identifier will be “02__TEMP”. If mode is None, the subdiamond’s mode will be left untouched.
-
change_mode
(mode)[source]¶ changes the mode of a
Diamond
, which is sometimes needed when embedding it into another Diamond or Sentence.
-
insert_subdiamond
(index, subdiamond_to_insert, mode=None)[source]¶ insert a
Diamond
into this one before the index, while allowing to change the mode of the subdiamond.
Parameters: mode (str or NoneType) – the mode that the subdiamond shall have. This will also be used to determine the subdiamond’s identifier. If the diamond already has two subdiamonds (e.g. “00__AGENS” and “01__PATIENS”) and we insert a third subdiamond at index ‘1’ with mode “TEMP”, its identifier will be “01__TEMP”, while the remaining two subdiamond identifiers will be changed accordingly, e.g. “00__AGENS” and “02__PATIENS”. If mode is None, the subdiamond’s mode will be left untouched.
-
prepend_subdiamond
(subdiamond_to_prepend, mode=None)[source]¶ prepends a subdiamond structure to an existing diamond structure, while allowing to change the mode of the subdiamond
Parameters: mode (str or NoneType) – the mode that the subdiamond shall have. This will also be used to determine the subdiamond’s identifier. If the diamond already has two subdiamonds (e.g. “00__AGENS” and “01__PATIENS”) and we prepend a third subdiamond with mode “TEMP”, its identifier will be “00__TEMP”, while the remaining two subdiamond identifiers will be incremented by 1, e.g. “01__AGENS” and “02__PATIENS”. If mode is None, the subdiamond’s mode will be left untouched.
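The identifier renumbering described for appending, inserting and prepending can be sketched with plain dictionaries. The function `insert_subdiamond_sketch` below is a hypothetical, simplified stand-in: it returns a new dictionary instead of modifying a Diamond in place, always requires a mode, and only uses the mode to build the key.

```python
import re

DIAMOND_KEY = re.compile(r"^\d{2}__(.+)$")


def insert_subdiamond_sketch(diamond, index, subdiamond, mode):
    """Return a copy of `diamond` with `subdiamond` inserted at `index`.

    All subdiamond identifiers are renumbered afterwards, so that they
    stay consecutive ("00__...", "01__...", ...).
    """
    result = {key: value for key, value in diamond.items()
              if not DIAMOND_KEY.match(key)}
    # collect existing subdiamonds in identifier order as (mode, value)
    subdiamonds = [(DIAMOND_KEY.match(key).group(1), diamond[key])
                   for key in sorted(diamond) if DIAMOND_KEY.match(key)]
    subdiamonds.insert(index, (mode, subdiamond))
    for position, (sub_mode, sub) in enumerate(subdiamonds):
        result["{0:02d}__{1}".format(position, sub_mode)] = sub
    return result


diamond = {"prop": "schreiben",
           "00__AGENS": {"prop": "Autor"},
           "01__PATIENS": {"prop": "Buch"}}
temp = {"prop": "2000"}

inserted = insert_subdiamond_sketch(diamond, 1, temp, "TEMP")
prepended = insert_subdiamond_sketch(diamond, 0, temp, "TEMP")
appended = insert_subdiamond_sketch(diamond, 2, temp, "TEMP")
```

In this simplification, prepending is an insert at index 0 and appending is an insert at the current number of subdiamonds.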
-
-
class
pypolibox.hlds.
HLDSReader
(hlds, input_format='file')[source]¶ represents a list of sentences (as NLTK feature structures) parsed from an HLDS XML testbed file.
-
parse_sentences
(tree)[source]¶ Parses all sentences (represented as HLDS XML structures) into feature structures. These structures are saved as a list of Sentence instances in self.sentences.
If there’s only one sentence in a file, its root element is <xml>. If there’s more than one, each <xml> sentence “root” is wrapped in an <item>…</item> element (and <regression> becomes the root tag of the document).
Parameters: tree (etree._ElementTree) – an etree tree element
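The two file layouts mentioned here can be handled with a small dispatch on the root tag. The sketch below uses the standard library's xml.etree.ElementTree; the helper name `find_sentence_roots` is an assumption, and the element nesting follows the description above rather than a verified testbed file.

```python
import xml.etree.ElementTree as ET


def find_sentence_roots(root):
    """Return all <xml> sentence roots of an HLDS document (sketch)."""
    if root.tag == "xml":
        # a file that contains exactly one sentence
        return [root]
    if root.tag == "regression":
        # a testbed file: every sentence is wrapped in an <item>
        return root.findall("item/xml")
    raise ValueError("unexpected root element: <%s>" % root.tag)


single = ET.fromstring("<xml><target>das Buch</target></xml>")
testbed = ET.fromstring(
    "<regression>"
    "<item><xml><target>das Buch</target></xml></item>"
    "<item><xml><target>die Autoren</target></xml></item>"
    "</regression>"
)
targets = [sentence.findtext("target")
           for sentence in find_sentence_roots(testbed)]
```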
-
-
class
pypolibox.hlds.
Sentence
(features=None, **morefeatures)[source]¶ Bases:
nltk.featstruct.FeatDict
represents an HLDS sentence as an NLTK feature structure.
-
create_sentence
(sent_str, expected_parses, root_nom, root_prop, diamonds)[source]¶ wraps all Diamond instances that were already constructed by HLDSReader.parse_sentences() plus some meta data (root verb etc.) into an NLTK feature structure that represents a complete sentence.
Parameters:
- sent_str (str) – the text that should be generated
- expected_parses (int) – the expected number of parses
- root_prop (str) – the root element of that text (in case we’re actually generating a sentence: the main verb)
- root_nom (str) – the root (element/verb) category, e.g. “b1:handlung”
- diamonds (list of Diamond instances) – a list of the diamonds that are contained in the sentence
-
-
pypolibox.hlds.
add_nom_prefixes
(diamond)[source]¶ Adds a prefix/index to the name attribute of every <nom> tag of a
Diamond
or Sentence
structure. Without this, ccg-realize
will only produce gibberish.
Every <nom> tag has a ‘name’ attribute, which contains a category/type-like description of the corresponding <prop> tag’s name attribute, e.g.:
<diamond mode="PRÄP">
<nom name="v1:zugehörigkeit"/>
<prop name="von"/>
</diamond>
Here ‘zugehörigkeit’ is the name of a category that the preposition ‘von’ belongs to. Usually, the nom prefix is the first character of the prop name attribute with an added index. Index iteration is done by a depth-first walk through all diamonds contained in the given feature structure. In this example ‘v1:zugehörigkeit’ means that “von” is the first diamond in the structure that starts with ‘v’ and belongs to the category ‘zugehörigkeit’.
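The prefixing scheme can be sketched as a depth-first walk over nested dictionaries with one counter per initial letter. This is an approximation under several assumptions: plain dictionaries replace the Diamond and Sentence feature structures, subdiamonds are visited in key order, and the function name is hypothetical.

```python
from collections import defaultdict


def add_nom_prefixes_sketch(diamond, counters=None):
    """Prefix every 'nom' value with <first letter of prop><index>:."""
    if counters is None:
        counters = defaultdict(int)
    if "nom" in diamond and "prop" in diamond:
        letter = diamond["prop"][0].lower()
        counters[letter] += 1
        # drop a prefix that may already be present
        category = diamond["nom"].split(":")[-1]
        diamond["nom"] = "{0}{1}:{2}".format(letter, counters[letter], category)
    # depth-first: descend into nested diamonds in key order
    for key in sorted(diamond):
        if isinstance(diamond[key], dict):
            add_nom_prefixes_sketch(diamond[key], counters)
    return diamond


structure = {
    "nom": "zugehörigkeit", "prop": "von",
    "00__NP1": {"nom": "nachname", "prop": "Hausser"},
    "01__NP2": {"nom": "nachname", "prop": "Hahn"},
}
add_nom_prefixes_sketch(structure)
```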
-
pypolibox.hlds.
convert_diamond_xml2fs
(etree)[source]¶ transforms a HLDS XML <diamond>…</diamond> structure (that was parsed into an etree element) into an NLTK feature structure.
Parameters: etree_or_tuple (etree._Element) – a diamond etree element
Return type: Diamond
-
pypolibox.hlds.
create_diamond
(mode, nom, prop, nested_diamonds_list)[source]¶ creates an HLDS feature structure from scratch (in contrast to convert_diamond_xml2fs, which converts an HLDS XML structure into its corresponding feature structure representation)
NOTE: I’d like to simply put this into __init__, but I don’t know how to subclass FeatDict properly. FeatDict.__new__ complains about Diamond.__init__(self, mode, nom, prop, nested_diamonds_list) having too many arguments.
-
pypolibox.hlds.
create_hlds_file
(sent_or_sent_list, mode='test', output='etree')[source]¶ this function transforms Sentence instances into a valid HLDS XML testbed file
Parameters: mode (str) – “test”, if the sentence will be part of a testbed file (ccg-test). “realize”, if the sentence will be put in a file on its own (ccg-realize).
Parameters: output (str) – “etree” (etree element) or “xml” (formatted, valid XML document as a string)
Return type: str
-
pypolibox.hlds.
diamond2sentence
(diamond)[source]¶ Converts a Diamond feature structure into a Sentence feature structure. This becomes necessary whenever we want to realize a short utterance, e.g. “die Autoren” or “die Themen Syntax und Pragmatik”.
Note: OpenCCG does not really distinguish between a sentence and smaller units of meaning. It simply assigns the <sentence> tag to every HLDS structure it realizes, whereas each substructure of this “sentence” (no matter how complex) is labelled as a <diamond>.
Return type: Sentence
-
pypolibox.hlds.
etreeprint
(element, debug=True, raw=False)[source]¶ pretty print function for etree trees or elements
Parameters: debug – if True: not only return the XML string, but also print it to stdout. if False: only return the XML string
Parameters: raw – if True: just transform the etree (element) into a string, don’t add or prettify anything. if False: add an XML declaration and use pretty print to make the output more readable for humans.
-
pypolibox.hlds.
featstruct2avm
(featstruct, mode='non-recursive')[source]¶ converts an NLTK feature structure into an attribute-value matrix that can be printed with LaTeX’s avm environment.
Return type: str
-
pypolibox.hlds.
hlds2xml
(featstruct)[source]¶ debug function that returns the string representation of a feature structure (Diamond or Sentence) and its HLDS XML equivalent.
Return type: str
-
pypolibox.hlds.
last_diamond_index
(featstruct)[source]¶ Returns the highest index currently used within a given Diamond or Sentence. E.g., if this structure contains three diamonds (“00__ART”, “01__NUM” and “02__TEMP”), the return value will be 2. The return value is -1, if the feature structure doesn’t contain any Diamond instances.
Return type: int
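A dictionary-based approximation of this lookup is shown below. It relies only on the identifier format (`<index>__<MODE>`); the function name is hypothetical and plain dictionaries stand in for the feature structures.

```python
import re

INDEXED_KEY = re.compile(r"^(\d+)__")


def last_diamond_index_sketch(featstruct):
    """Return the highest subdiamond index, or -1 if there is none."""
    indices = []
    for key in featstruct:
        found = INDEXED_KEY.match(str(key))
        if found:
            indices.append(int(found.group(1)))
    return max(indices) if indices else -1


highest = last_diamond_index_sketch(
    {"prop": "Buch", "00__ART": {}, "01__NUM": {}, "02__TEMP": {}})
```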
-
pypolibox.hlds.
test_conversion
()[source]¶ tests HLDS XML <-> NLTK feature structure conversions. Converts an HLDS XML testbed file into a list of sentences in NLTK feature structure notation, picks one of these sentences randomly and converts it back to HLDS XML. Prints both versions of this sentence. Returns an HLDSReader instance (containing a list of Sentence instances in NLTK feature structure notation) and an HLDS XML testbed file (as a string) created from those feature structures.
Return type: tuple of (HLDSReader, str)
Returns: a tuple containing an HLDSReader instance and a string representation of an HLDS XML testbed file
lexicalization
Module¶
lexicalize_messageblocks
Module¶
messages
Module¶
The messages
module contains the Message
class and related classes.
Messages contain propositions about books. The text planner applies
Rules to these Messages to form ConstituentSets. Rules will
also be applied to ConstituentSets, ultimately forming one TextPlan
that contains all the information to be realized.
-
class
pypolibox.messages.
AllMessages
(allpropositions)[source]¶ represents all Messages generated from AllPropositions about all Books() that were returned by a query
-
class
pypolibox.messages.
Message
(msgType=None)[source]¶ Bases:
nltk.featstruct.FeatDict
A
Message
combines and stores knowledge about an object (here: books) in a logical structure. Messages are constructed during content selection (taking the user’s requirements, querying a database and processing its results), which precedes text planning.
Each Message has a msgType which describes the kind of information it includes. For example, the msgType ‘id’ specifies information that is needed to distinguish a book from other books:
[ *msgType* = 'id' ]
[ authors = frozenset(['Roland Hausser']) ]
[ codeexamples = 0 ]
[ language = 'German' ]
[ pages = 572 ]
[ proglang = frozenset([]) ]
[ target = 0 ]
[ title = 'Grundlagen der Computerlinguistik' ]
[ year = 2000 ]
-
class
pypolibox.messages.
Messages
(propositions)[source]¶ represents all
Message
instances generated from Propositions
about a Book.
-
add_identification_to_message
(message)[source]¶ Adds special ‘reference_title’ and ‘reference_authors’ attributes to messages other than the
id_message
.
In contrast to the id_message, other messages will not be used to produce sentences that contain their content (i.e. no statement of the form ‘author X wrote book Y in 1979’ is generated from an ‘extra_message’ or a ‘lastbook_nomatch’ message). Nevertheless, they will need to make reference to the title and the authors of the book (e.g. ‘Y is a rather short book’). As an example, look at this ‘usermodel_match’ message:
[ *msgType* = 'usermodel_match' ]
[ *reference_authors* = frozenset(['Ulrich Schmitz']) ]
[ *reference_title* = 'Computerlinguistik. Eine Einführung' ]
[ language = 'German' ]
[ proglang = frozenset(['Lisp']) ]
The message contains two bits of information (the language and programming language used), which both have regular strings as keys. The ‘referential’ keys on the other hand are nltk.Feature instances and not strings. This distinction should be regarded as a syntactic trick used to emphasize a semantic difference (READ: if you have a better solution, please change it).
-
generate_extra_message
(proposition_dict)[source]¶ generates a
Message
from an ‘extra’ Proposition. Extra propositions only exist if a book is remarkably new / old or very short / long.
-
propositions
Module¶
The propositions
module evaluates the facts generated by the
pypolibox.facts
module and stores its results as nested dictionaries.
pypolibox
Module¶
The pypolibox module is the ‘main’ module of the pypolibox package. It’s the module you’d usually call from the command line or load into your Python interpreter. It just imports all the important modules and runs some demo code in case it is run from the command line without any arguments.
-
pypolibox.pypolibox.
check_and_realize_textplan
(openccg, textplan, lexicalize_message_block, phrase2sentence)[source]¶ realizes a text plan and warns about message blocks that cannot be realized due to current restrictions in the OpenCCG grammar.
-
pypolibox.pypolibox.
generate_textplans
(query)[source]¶ generates all text plans for a database query
-
pypolibox.pypolibox.
initialize_openccg
(lang='de')[source]¶ starts OpenCCG’s tccg realizer as a server in the background (ca. 20s).
realization
Module¶
The realization
module shall take HLDS XML structures, realize them with
the OpenCCG surface realizer and parse its output string.
-
class
pypolibox.realization.
OpenCCG
(grammar_dir='/home/docs/checkouts/readthedocs.org/user_builds/pypolibox/envs/latest/local/lib/python2.7/site-packages/pypolibox-1.0.2-py2.7.egg/pypolibox/grammar', lang='de')[source]¶ Bases:
object
command-line interaction with OpenCCG’s
tccg
parser/generator, which can either be run as a JSON-RPC server or simply imported as a Python module.
-
parse
(text, verbose=True, raw_output=True)[source]¶ This is the core interaction with the parser.
It returns a Python data-structure, while the parse() function returns a JSON object
Returns: if raw_output=True, the raw response string from the server will be returned. Otherwise, a list of dictionaries will be returned (one for each input sentence).
Return type: str OR list of dicts
-
rules
Module¶
The rules
module contains rules, which are used by the text planner to
combine messages into constituent sets and ultimately form one TextPlan
.
-
class
pypolibox.rules.
ConstituentSet
(relType=None, nucleus=None, satellite=None)[source]¶ Bases:
nltk.featstruct.FeatDict
ConstituentSet is the construction built up by applying Rules to a set of ConstituentSets and Messages. Each ConstituentSet is of a specific relType, and has two constituents, one which is designated the nucleus and one which is designated aux. These ConstituentSets can then be combined with other ConstituentSets or Messages.
ConstituentSet is based on nltk.featstruct.FeatDict.
-
class
pypolibox.rules.
Rule
(name, ruleType, nucleus, satellite, conditions, heuristic)[source]¶ Bases:
object
Rules are the elements which specify relationships which hold between elements of the document. These elements can be Messages or ConstituentSets.
Each Rule specifies a list of inputs, each of which is a minimal specification of a Message or ConstituentSet. To be a valid input to this Rule, a given Message or ConstituentSet must subsume one of the specified inputs.
Each Rule can also specify a set of conditions which must be met in order for the Rule to hold between the inputs.
Each Rule specifies a heuristic, which will be evaluated to provide a score by which to rank the order in which rules should be applied.
Each Rule specifies which of the inputs will be the nucleus and which will be the aux of the output ConstituentSet.
-
find_message_candidates
(messages, message_prototype)[source]¶ takes a list of messages and returns only those with the right message type (as specified in Rule.inputs)
Parameters:
- messages (list of Message instances) – a list of Message objects, each containing one message about a book
- message_prototype (tuple of (string, Message or ConstituentSet)) – a tuple consisting of a message name and a Message or ConstituentSet
Return type: list of tuples of (string, Message)
Returns: a list containing all (name, message) tuples which are subsumed by the input message type (self.nucleus or self.satellite). If a rule should only be applied to UserModelMatch and UserModelNoMatch messages, the return value contains a list of messages with these types.
-
get_conditions
(group)[source]¶ applies __name_eval to all conditions a Rule has, i.e. checks if a group meets all conditions
Parameters: group – a list of message tuples of the form (message name, message)
Return type: list of bool
Returns: a list of truth values, each of which tells if a group met all conditions specified in self.conditions
-
get_options
(messages)[source]¶ this is the main method used for document planning
From the list of Messages, get_options selects all possible ways the Rule could be applied.
The planner can then select with the textplan.__bottom_up_search function one of these possible applications of the Rule to use.
non_empty_message_combinations is a list of combinations, where each combination is a (nucleus, satellite)-tuple. Both the nucleus and the satellite each consist of a (name, message) tuple.
The method returns an empty list if get_options can’t find a way to apply the Rule.
Parameters: messages (list of Message objects) – a list of Message objects, each containing one message about a book
Return type: empty list or a list containing one tuple of (int, ConstituentSet, list), where list consists of Message or ConstituentSet objects
Returns: a list containing one 3-tuple (score, ConstituentSet, inputs) where:
- score is the evaluated heuristic score for this application of the Rule
- ConstituentSet is the new ConstituentSet instance returned by the application of the Rule
- inputs is the list of inputs (Messages or ConstituentSets) used in this application of the rule
-
get_satisfactory_groups
(groups)[source]¶
Parameters: groups – a list of group elements. Each group contains a list which contains one or more message tuples of the form (message name, message)
Return type: list of lists of tuples of (str, Message or ConstituentSet)
Returns: a list of group elements. Contains only those groups which meet all the conditions specified in self.conditions
-
-
class
pypolibox.rules.
Rules
[source]¶ creates Rule() instances
Each rule of the form Rule(ruleType, inputs, conditions, nucleus, aux, heuristic) is generated by its own method. Important note: these methods have to adhere to a naming convention, i.e. begin with ‘genrule_’; otherwise, self.__init__ will fail!
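A naming convention like this is typically needed when methods are discovered by name at construction time. The sketch below shows one plausible way to do that; it is not taken from the pypolibox source. The class `RulesSketch` and the tuples returned by its methods are hypothetical placeholders for real Rule instances.

```python
class RulesSketch(object):
    """Collects the results of all methods named 'genrule_*'."""

    def __init__(self):
        self.rules = []
        for name in sorted(dir(self)):
            # only methods that follow the naming convention are called
            if name.startswith("genrule_"):
                self.rules.append(getattr(self, name)())

    def genrule_id_extra_sequence(self):
        return ("id_extra_sequence", "Sequence")  # placeholder for a Rule

    def genrule_neg_eval(self):
        return ("neg_eval", "Concession")  # placeholder for a Rule

    def helper(self):
        return "not a rule"  # ignored: the name lacks the prefix


rule_names = [name for name, _ in RulesSketch().rules]
```

A method without the prefix is simply never called in this sketch; in the documented class, a misnamed rule method is said to make the constructor fail.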
-
genrule_book_differences
()[source]¶ Contrast({id, id_extra_sequence}, lastbook_nomatch)
Meaning: id/id_extra_sequence. In contrast to book X, this book is in German, targets advanced users and … Condition: There are differences between the two books
-
genrule_book_similarities
()[source]¶ Elaboration(id_usermodelmatch, lastbook_match)
Meaning: ‘id_usermodelmatch’ mentions that the book matches ALL requirements. In addition, the book shares many features with its predecessor. Condition: There are both differences and commonalities (>=50%) between the two books.
-
genrule_compare_eval
()[source]¶ Sequence(concession_books, {pos_eval, neg_eval, usermodel_match, usermodel_nomatch})
Meaning: ‘concession_books’ describes common and diverging features of the books. ‘pos_eval/neg_eval/usermodel_match/usermodel_nomatch’ explains how many user requirements they meet
-
genrule_concession_book_differences_usermodelmatch
()[source]¶ Concession(book_differences, usermodel_match)
Meaning: ‘book_differences’ explains the differences between both books. Nevertheless, this book meets ALL your requirements … Condition: All user requirements are met.
-
genrule_concession_books
()[source]¶ Concession(book_differences, lastbook_match)
Meaning: After ‘book_differences’ explains the differences between both books, their common features are explained.
-
genrule_contrast_books_posneg_eval
()[source]¶ Sequence(book_differences, {pos_eval, neg_eval})
Meaning: book_differences mentions the differences between the books, pos_eval/neg_eval explains how many user requirements they meet Conditions: matches some of the requirements
-
genrule_id_extra_sequence
()[source]¶ Sequence(id_complete, extra), if ‘extra’ exists:
adds an additional “sentence” about extra facts after the id messages
-
genrule_id_usermodelmatch
()[source]¶ Elaboration({id, id_extra_sequence}, usermodel_match), if there’s no usermodel_nomatch
Meaning: This book fulfills ALL your requirements. It was written in …, contains these features … and … etc
-
genrule_neg_eval
()[source]¶ Concession(usermodel_nomatch, usermodel_match)
Meaning: Although this book fulfills some of your requirements, it doesn’t match most of them. Therefore, this book might not be the best choice.
-
genrule_no_similarities_concession
()[source]¶ Concession({id, id_extra_sequence}, lastbook_nomatch)
Meaning: Book X has these features BUT shares none of them with its predecessor. Condition: There is a predecessor to this book, but they don’t share ANY features.
-
genrule_pos_eval
()[source]¶ Concession(usermodel_match, usermodel_nomatch)
Meaning: Book matches many (>= 50%) of the requirements, but not all of them
-
genrule_single_book_complete
()[source]¶ Sequence({id, id_extra_sequence}, {pos_eval, neg_eval})
Meaning: The nucleus mentions all the (remaining) facts (that aren’t mentioned in the evaluation), while the satellite evaluates the book (in terms of usermodel matches)
-
genrule_single_book_complete_usermodelmatch
()[source]¶ Sequence({id, id_extra_sequence}, usermodel_match)
Meaning: The satellite states that the book matches ALL the user’s requirements. The nucleus mentions the remaining facts about the book. Condition: there’s no preceding book and there are only usermodel matches.
-
genrule_single_book_complete_usermodelnomatch
()[source]¶ Sequence({id, id_extra_sequence}, usermodel_nomatch)
Meaning: The satellite states that the book matches NONE of the user’s requirements. The nucleus mentions the remaining facts about the book. Condition: there’s no preceding book and there are no usermodel matches.
-
textplan
Module¶
The textplan module is based on Nicholas FitzGerald’s py_docplanner [1], in particular on his idea to represent RST trees as attribute value matrices by using the nltk.featstruct data structure.
textplan converts Proposition instances into Message instances (using attribute value notation). Via a set of Rule instances, these messages are combined into ConstituentSet instances. Rules are applied bottom-up, via a recursive best-first search (cf. __bottom_up_search).
Not only messages, but also constituent sets can be combined via rules. If all messages present can be combined into one large ConstituentSet, this constituent set is called a TextPlan. A TextPlan represents a complete text plan in the form of an attribute value matrix.
[1] FitzGerald, Nicholas (2009). Open-Source Implementation of Document Structuring Algorithm for NLTK.
-
class
pypolibox.textplan.
TextPlan
(book_score=None, dtype='TextPlan', text=None, children=None)[source]¶ Bases:
nltk.featstruct.FeatDict
TextPlan
is the output of Document Planning. A TextPlan consists of an optional title and text, and a child ConstituentSet.
TODO: append __str__ method: should describe verbally whether a TP is describing one book or comparing two books
-
class
pypolibox.textplan.
TextPlans
(allmessages, debug=False)[source]¶ Bases:
object
generates all TextPlan instances for an AllMessages instance, i.e. one DocumentPlan for each book that is returned as a result of the user’s database query
-
pypolibox.textplan.
generate_textplan
(messages, rules=[<pypolibox.rules.Rule object>, <pypolibox.rules.Rule object>, <pypolibox.rules.Rule object>, <pypolibox.rules.Rule object>, <pypolibox.rules.Rule object>, <pypolibox.rules.Rule object>, <pypolibox.rules.Rule object>, <pypolibox.rules.Rule object>, <pypolibox.rules.Rule object>, <pypolibox.rules.Rule object>, <pypolibox.rules.Rule object>, <pypolibox.rules.Rule object>, <pypolibox.rules.Rule object>, <pypolibox.rules.Rule object>], book_score=None, dtype='TextPlan', text='')[source]¶ The main method implementing the Bottom-Up document structuring algorithm from “Building Natural Language Generation Systems” figure 4.17, p. 108.
The method takes a list of Message instances and a set of Rule instances and creates a document plan by repeatedly applying the highest-scoring Rule application (according to the Rule's heuristic score) until a full tree is created. This is returned as a TextPlan with the tree set as children. If no plan is reached using bottom-up, None is returned.
Parameters:
- messages (list of Message) – a list of Message instances which have been selected during content selection for inclusion in the TextPlan
- rules (list of Rule) – a list of Rule instances specifying relationships which can hold between the messages
- dtype (string) – an optional type for the document
- text (string) – an optional text string describing the document
Returns: a document plan, or None if no plan could be created
Return type: TextPlan or NoneType
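The control flow described above can be sketched in a few lines. This is a greedy simplification without backtracking, not the recursive best-first search that pypolibox uses. SimpleRule, greedy_bottom_up, the fixed integer scores and the string messages are all hypothetical names and data chosen for illustration.

```python
from typing import NamedTuple, Optional


class SimpleRule(NamedTuple):
    """Hypothetical rule: combine two named items into a new named item."""
    relation: str
    nucleus: str
    satellite: str
    result: str
    score: int


def greedy_bottom_up(messages: dict, rules: list) -> Optional[tuple]:
    """Greedy bottom-up structuring; returns a nested tuple tree or None."""
    pool = dict(messages)  # name -> message or already-built subtree
    while len(pool) > 1:
        # A rule applies if both of its inputs are currently in the pool.
        options = [r for r in rules if r.nucleus in pool and r.satellite in pool]
        if not options:
            return None  # the remaining items cannot be joined into one tree
        best = max(options, key=lambda r: r.score)
        nucleus = pool.pop(best.nucleus)
        satellite = pool.pop(best.satellite)
        pool[best.result] = (best.relation, nucleus, satellite)
    return next(iter(pool.values())) if pool else None


messages = {"id": "id-msg", "extra": "extra-msg", "usermodel_match": "match-msg"}
rules = [
    SimpleRule("Sequence", "id", "extra", "id_extra_sequence", 2),
    SimpleRule("Elaboration", "id_extra_sequence", "usermodel_match",
               "id_usermodelmatch", 1),
]
plan = greedy_bottom_up(messages, rules)
```

A greedy loop can dead-end where a search with backtracking would still find a plan, which is one reason to prefer a best-first search in practice.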
-
pypolibox.textplan.
linearize_textplan
(textplan)[source]¶ takes a text plan (an RST tree represented as an nltk.featstruct data structure) and returns an ordered list of Message instances for surface generation.
Return type: list of Message
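Linearization of a tree-shaped plan can be illustrated with a depth-first traversal. The sketch below uses plain nested dictionaries with hypothetical "nucleus" and "satellite" keys instead of nltk.featstruct objects, and the nucleus-before-satellite ordering is an assumption rather than documented behaviour of linearize_textplan.

```python
def linearize(node):
    """Return the leaves of a nested plan in depth-first order."""
    if not isinstance(node, dict):
        return [node]  # a leaf, i.e. a single message
    leaves = []
    for key in ("nucleus", "satellite"):  # assumed traversal order
        if key in node:
            leaves.extend(linearize(node[key]))
    return leaves


plan = {
    "relation": "Elaboration",
    "nucleus": {"relation": "Sequence", "nucleus": "id", "satellite": "extra"},
    "satellite": "usermodel_match",
}
ordered = linearize(plan)
```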
-
pypolibox.textplan.
test_textplan2xml_conversion
()[source]¶ test text plan to XML conversion with all the text plans that were generated for all test queries with debug.gen_all_textplans().
util
Module¶
The util
module contains a number of ‘bread and butter’ functions that are
needed to run pypolibox, but are not particularly interesting (e.g. format
converters, existence checks etc.).
There shouldn’t be any code in this module that requires loading other modules from pypolibox!
-
pypolibox.util.
ensure_unicode
(string_or_int)[source]¶ ensures that a string uses unicode instead of UTF-8. Converts integer input to a unicode string.
-
pypolibox.util.
ensure_utf8
(string_or_int)[source]¶ ensures that a string uses UTF-8 instead of unicode. Converts integer input to a string.
-
pypolibox.util.
exists
(thing, namespace)[source]¶ checks if a variable/object/instance exists in the given namespace
Return type: bool
-
pypolibox.util.
flatten
(nested_list)[source]¶ flattens a list, where each list element is itself a list
Parameters: nested_list (list) – the nested list
Returns: the flattened list
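Flattening a single level of nesting can be done with the standard library. The following is a minimal sketch of the idea; the function name flatten_one_level is hypothetical, and the sketch is not necessarily how pypolibox.util.flatten is written.

```python
from itertools import chain


def flatten_one_level(nested_list):
    """Concatenate the inner lists of a list of lists into one list."""
    return list(chain.from_iterable(nested_list))


flat = flatten_one_level([[1, 2], [3], [], [4, 5]])
```

Only one level is removed: an inner list that itself contains lists keeps those innermost lists intact.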
-
pypolibox.util.
freeze_all_messages
(message_list)[source]¶ makes all messages (FeatDict instances) immutable, which is necessary for turning them into sets
-
pypolibox.util.
msgs_instance_to_list_of_msgs
(messages_instance)[source]¶ converts a Messages instance into a list of Message instances
-
pypolibox.util.
sql_array_to_list
(sql_array)[source]¶ converts SQL string “arrays” into a list of strings
Our book database uses ‘[’ and ‘]’ to handle attributes with more than one value, e.g. authors = ‘[Noam Chomsky][Alan Turing]’. This function turns those multi-value strings into a list of separate values.
Parameters: sql_array (str) – a string from the database that represents one or more items delimited by ‘[’ and ‘]’, e.g. “[Noam Chomsky]” or “[Noam Chomsky][Alan Turing]”
Return type: list of str
Returns: a list of strings, where each string represents one item from the database, e.g. [“Noam Chomsky”, “Alan Turing”]
-
pypolibox.util.
sql_array_to_set
(sql_array)[source]¶ converts SQL string “arrays” into a set of strings
Our book database uses ‘[’ and ‘]’ to handle attributes with more than one value, e.g. authors = ‘[Noam Chomsky][Alan Turing]’. This function turns those multi-value strings into a set of separate values.
Parameters: sql_array (str) – a string from the database that represents one or more items delimited by ‘[’ and ‘]’, e.g. “[Noam Chomsky]” or “[Noam Chomsky][Alan Turing]”
Return type: set of str
Returns: a set of strings, where each string represents one item from the database, e.g. {“Noam Chomsky”, “Alan Turing”}
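The bracket-delimited format handled by sql_array_to_list and sql_array_to_set can be parsed with a single regular expression. The sketch below is one possible approach under the assumption that items never contain brackets themselves; parse_sql_array is a hypothetical name, and the actual pypolibox functions may work differently.

```python
import re

# One item is everything between '[' and the next ']' (no nested brackets).
ITEM = re.compile(r"\[([^\[\]]*)\]")


def parse_sql_array(sql_array):
    """Split a string such as '[a][b]' into the list ['a', 'b']."""
    return ITEM.findall(sql_array)


authors = parse_sql_array("[Noam Chomsky][Alan Turing]")
author_set = set(authors)  # the set variant drops order and duplicates
```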