XML
XML, the Extensible Markup Language, is a W3C-endorsed standard for document markup. It defines a generic syntax used to mark up data with simple,
human-readable tags. It provides a standard format for computer documents that is flexible enough to be customized for domains as diverse as web
sites, electronic data interchange, vector graphics, genealogy, real estate listings, object serialization, remote procedure calls, voice mail
systems, and more.
XML is a metamarkup language for text documents. Data are included in XML documents as strings of text, and the data are surrounded by text markup that describes them. The XML specification defines the exact syntax this markup must follow: how elements are delimited by tags, what a tag looks like, what names are acceptable for elements, where attributes are placed, and so forth. XML allows developers and writers to invent the elements they need as they need them. The X in XML stands for Extensible: the language can be extended and adapted to meet many different needs.
The markup in an XML document describes the structure of the document. It lets you see which elements are associated with which other elements. In a well-designed XML document, the markup also describes the document's semantics. The markup says nothing about how the document should be displayed. That is, it does not say that an element is bold or italicized or a list item. XML is a structural and semantic markup language, not a presentation language.
The markup permitted in a particular XML application can be documented in a schema. Particular document instances can be compared to the schema. Documents that match the schema are said to be valid. Documents that do not match are invalid. Validity depends on the schema. That is, whether a document is valid or invalid depends on which schema you compare it to.
Although XML is quite flexible in the elements it allows, it is quite strict in many other respects. The XML specification defines a grammar for XML documents that says where tags may be placed, what they must look like, which element names are legal, how attributes are attached to elements, and so forth. This grammar is specific enough to allow the development of XML parsers that can read any XML document. Documents that satisfy this grammar are said to be well-formed. XML processors reject documents that contain well-formedness errors.
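As a minimal sketch of what rejecting a document with a well-formedness error looks like in practice, the following Python snippet uses the standard library parser xml.etree.ElementTree. The helper name is_well_formed and the two sample documents are made up for illustration. Note that this checks well-formedness only, not validity against a schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as XML, False on a well-formedness error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        # The parser refuses the document instead of guessing at a repair.
        return False


# Hypothetical sample documents.
good = "<person><name>Alan Turing</name></person>"
bad = "<person><name>Alan Turing</person></name>"  # end tags overlap

print(is_well_formed(good))
print(is_well_formed(bad))
```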
Not all documents need to be valid. For many purposes it is enough that the document be well-formed. There are many different XML schema
languages with different levels of expressivity. The most broadly supported schema language is the document type definition (DTD). There are many
other schema languages, including the W3C XML Schema language, RELAX NG, Schematron, Hook, and Examplotron.
Some standards
A number of standards are related to XML. Some of the better known ones are the Document Object Model (DOM), Simple API for XML (SAX),
Extensible Stylesheet Language (XSL), XML Linking Language (XLink), XML Pointer Language (XPointer), XML Query Language (XQL), Extensible Stylesheet
Language Transformations (XSLT), and XML Path Language (XPath).
Next wave of internet technology?
XML allows customized tags to add more semantics to web content. Today's search engines are mostly based on word matching,
rather than on the semantics of both the query formulation and the document text. For example, a search on "Thinkpad" will
return every document that contains the word "Thinkpad", even when the user is only interested in product information pages on
"Thinkpad". To solve this problem, one can use customized XML tags to classify data with semantics such as "product",
"related-product", etc. Another important feature of XML is interoperability support: different applications can communicate and extract
information from the same XML document as long as they use the same DTD. XML also allows for the separation of data and presentation directives,
specified via XSL, in a document. As a result, the same data can be presented in different formats according to such criteria as output device type
or business purpose.
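The difference between word matching and tag-aware search can be sketched in a few lines of Python. Everything here is hypothetical: the two tiny documents, their file names, and the helper functions word_match and product_match are invented for illustration, and the tag names "product" and "related-product" are taken from the example above.

```python
import xml.etree.ElementTree as ET

# Two hypothetical pages that both mention the word "Thinkpad".
docs = {
    "review.xml": "<page><text>I took my Thinkpad on holiday.</text></page>",
    "catalog.xml": "<page><product>Thinkpad</product>"
                   "<related-product>Docking station</related-product></page>",
}


def word_match(query):
    """Plain word matching: any document whose raw text contains the query."""
    return sorted(name for name, xml in docs.items() if query in xml)


def product_match(query):
    """Tag-aware matching: only documents with the query inside <product>."""
    hits = []
    for name, xml in docs.items():
        root = ET.fromstring(xml)
        if any(query in (p.text or "") for p in root.iter("product")):
            hits.append(name)
    return sorted(hits)


print(word_match("Thinkpad"))     # both documents
print(product_match("Thinkpad"))  # only the product page
```

Both searches see the same documents. Only the second one uses the markup to decide what the word means in context.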
Using MySQL with XML
The following list indicates just some of the possibilities open to you for employing XML processing techniques to
make more productive use of your MySQL server:
XML as a data transfer medium.
Writing a query result as an XML document results in a platform-neutral text file that can be used by
other applications, even those that are not necessarily database-oriented. The recipient of such a document can employ standard XML tools to parse
it and recover the original data values. Used this way, XML serves as an interface between your MySQL database and other applications that can read
XML but may know nothing about MySQL. This works in the other direction, too. If an application can produce XML-formatted documents, you can read
them and store the information contained therein into MySQL by using simple XML parsing techniques.
XML as a web delivery format.
XML's simple, well-defined structure makes it useful for information delivery in a web environment. For example, you can set up an information feed that clients can use on an automated basis: define XML formatting conventions with which to express the information and provide access to it through a web script. Clients can send requests to the script, which connects to MySQL, retrieves the desired information, and formats it as an XML document that is returned to the client. The client then extracts information from the document using standard XML tools.
Using XML to write web pages.
As the limitations of HTML for writing web pages become more keenly felt, web developers turn increasingly to the more expressive
capabilities of XML. HTML serves primarily as a destination format, whereas XML is useful both as source and destination formats. For example, an
XML document can incorporate the results of database queries and then, with the help of a rendering engine such as AxKit, be transformed into a
format that matches the type of client you wish to serve. You can send HTML, WML, or plain text to web browsers, wireless devices, or printers. (Or,
as indicated in the previous item, you can serve the document directly to clients that understand XML.) Contrast this with HTML, which does not
render well into other formats.
Storing XML directly.
You can of course store XML itself in your database. For example, you might store templates for documents such as form letters that you combine with customer records to produce mailings.
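The data-transfer item in the list above (writing a query result as an XML document) can be sketched as follows. This is only an illustration under stated assumptions: Python's built-in sqlite3 module stands in for a MySQL connection so that the snippet runs anywhere, the animal table and its rows are invented, and the element names resultset and row are arbitrary choices. With a real MySQL server, a DB-API driver for MySQL would take the place of sqlite3.

```python
import sqlite3
import xml.etree.ElementTree as ET


def rows_to_xml(cursor, root_tag="resultset", row_tag="row"):
    """Serialize the rows of an executed DB-API cursor as an XML string.

    Assumes every column name is also a legal XML element name.
    """
    columns = [d[0] for d in cursor.description]
    root = ET.Element(root_tag)
    for record in cursor:
        row = ET.SubElement(root, row_tag)
        for column, value in zip(columns, record):
            field = ET.SubElement(row, column)
            field.text = "" if value is None else str(value)
    return ET.tostring(root, encoding="unicode")


# sqlite3 is a stand-in for MySQL here; table and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE animal (name TEXT, category TEXT)")
conn.executemany("INSERT INTO animal VALUES (?, ?)",
                 [("snake", "reptile"), ("frog", "amphibian")])

xml_text = rows_to_xml(
    conn.execute("SELECT name, category FROM animal ORDER BY name"))
print(xml_text)
```

The recipient needs nothing but an XML parser to recover the values, which is the point of using XML as the interface.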
Reading XML Documents into MySQL
This task generally requires that you know something about the structure of the document and the table, so that you can determine the correspondence between document elements and table columns.
XML is a very good choice for storing data in many cases. It is easy to parse and write, and it is open for users to edit the data. Parsers
have mechanisms to verify syntax and completeness, so you can protect your program from corrupted data.
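A minimal sketch of reading an XML document into a table, assuming the correspondence between elements and columns is known in advance. As a stand-in for MySQL, the snippet uses Python's built-in sqlite3 module so that it is runnable as is. The document layout, the animal table, and the column names are all hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical input document: one <row> element per table row.
xml_text = """
<resultset>
  <row><name>frog</name><category>amphibian</category></row>
  <row><name>snake</name><category>reptile</category></row>
</resultset>
"""

# The assumed correspondence: child element names equal the column names.
columns = ("name", "category")

# sqlite3 is a stand-in for MySQL here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE animal (name TEXT, category TEXT)")

# Parsing fails with ParseError if the document is not well-formed,
# so corrupted input never reaches the table.
root = ET.fromstring(xml_text)
records = [tuple(row.findtext(col) for col in columns)
           for row in root.iter("row")]
conn.executemany("INSERT INTO animal (name, category) VALUES (?, ?)", records)

loaded = conn.execute(
    "SELECT name, category FROM animal ORDER BY name").fetchall()
print(loaded)
```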
XML retrieval
Anywhere there is information, you will find XML, or at least hear it scratching at the door. XML has grown into a huge topic, inspiring many
technologies and branching into new areas.
XML retrieval is an important new area for the application of IR methods. Whereas little research has been performed on retrieval of structured
documents in the past, the increasing availability of XML collections offers the opportunity for developing appropriate retrieval methods.
PIRE for XML retrieval (PireFox)
Extending PIRE (an extensible IR engine based on probabilistic logics) for the retrieval of XML documents was a training step for exploring XML and query languages.
I chose the query language NEXI as a starting point. It was introduced at INEX 2004 as a query language for specifying structured and unstructured queries on XML documents.
Using Java, the XML documents are read, parsed, and stored in the Node and Word tables of a MySQL database.
To translate NEXI queries into queries on the MySQL relational database, an intermediate pDatalog++ representation (pDatalog++ is an extension of pDatalog) is used.
Examples:
[Figure: a simple XML document]
[Figure: the document tree, showing the pre/post order numbers of the nodes]
[Figure: Node table in MySQL for doc1.xml]
[Figure: Word table in MySQL for doc1.xml]
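To make the pre/post order numbering concrete, here is a small Python sketch that numbers the nodes of a document and produces rows for a node table and a word table. It is not the PireFox implementation (which is described above as written in Java), and the sample document is invented rather than the original doc1.xml. The row layouts (doc, pre, post, name, kind) and (doc, pre, word), counting from 1, and storing each text run as a "PCDATA" node of kind "text" are assumptions chosen to resemble the NT and WT predicates used on this page.

```python
import xml.etree.ElementTree as ET


def number_nodes(doc_id, xml_text):
    """Return (node_rows, word_rows) for one document.

    node_rows: (doc, pre, post, name, kind), sorted by pre order number
    word_rows: (doc, pre of the enclosing text node, lower-cased word)
    """
    node_rows, word_rows = [], []
    counters = {"pre": 0, "post": 0}

    def add_text(text):
        words = (text or "").split()
        if not words:
            return
        # A text node is a leaf: it is entered and left in one step.
        counters["pre"] += 1
        counters["post"] += 1
        pre = counters["pre"]
        node_rows.append((doc_id, pre, counters["post"], "PCDATA", "text"))
        word_rows.extend((doc_id, pre, w.lower()) for w in words)

    def visit(elem):
        counters["pre"] += 1           # pre number: assigned on entry
        pre = counters["pre"]
        add_text(elem.text)
        for child in elem:
            visit(child)
            add_text(child.tail)
        counters["post"] += 1          # post number: assigned on exit
        node_rows.append((doc_id, pre, counters["post"], elem.tag, "node"))

    visit(ET.fromstring(xml_text))
    return sorted(node_rows, key=lambda r: r[1]), word_rows


# Hypothetical sample document, not the original doc1.xml.
sample = ("<article><fm><au>Ann Lee</au></fm>"
          "<abs>description logics</abs></article>")
nodes, words = number_nodes("d1", sample)
for row in nodes:
    print(row)
for row in words:
    print(row)
```

With this numbering, a node is a descendant of another node exactly when its pre number is larger and its post number is smaller, so ancestor tests become two integer comparisons.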
This is a simple NEXI query example from the INEX topics:
//article[about(.//(abs|kwd), description logics)]//fm//au
After the XML document has been parsed and stored in the Node and Word tables in MySQL, the query is first translated into pDatalog++ and then from pDatalog++ into MySQL commands. The relevant information is then retrieved using PIRE, with a weighting scheme based on the divergence from randomness (DFR) approach.
Translated into pDatalog, the above query looks similar to the following:
abskwd(D,PRE1,POST1):-NT(D,PRE1,POST1,"abs","node",~)
abskwd(D,PRE1,POST1):-NT(D,PRE1,POST1,"kwd","node",~)
q(D,PRE2):-abskwd(D,PRE1,POST1)
& NT(D,PRE2,POST2,"au","node",~)
& NT(D,PRE3,POST3,"fm","node",~)
& NT(D,PRE4,POST4,"article","node",~)
& NT(D,PRE5,POST5,"PCDATA","text",~)
& WT(D,PRE5,"description")
& WT(D,PRE5,"logics")
& gt(PRE2,PRE3) & lt(POST2,POST3)
& gt(PRE3,PRE4) & lt(POST3,POST4)
& gt(PRE1,PRE4) & lt(POST1,POST4)
& add(PRE5,PRE1,1)
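The structural part of the rule above can be sketched as a relational query. The following Python snippet is an illustration only, under several assumptions: sqlite3 stands in for MySQL, the table layouts node(doc, pre, post, name, kind) and word(doc, pre, word) are guesses modelled on the NT and WT predicates, the sample rows are invented, and the probabilistic weights that pDatalog attaches to facts are left out, so the result is a plain Boolean match rather than a ranking.

```python
import sqlite3

# sqlite3 is a stand-in for MySQL; the schema and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE node (doc TEXT, pre INTEGER, post INTEGER, name TEXT, kind TEXT);
CREATE TABLE word (doc TEXT, pre INTEGER, word TEXT);
""")
conn.executemany("INSERT INTO node VALUES (?, ?, ?, ?, ?)", [
    ("d1", 1, 6, "article", "node"),
    ("d1", 2, 3, "fm", "node"),
    ("d1", 3, 2, "au", "node"),
    ("d1", 4, 1, "PCDATA", "text"),
    ("d1", 5, 5, "abs", "node"),
    ("d1", 6, 4, "PCDATA", "text"),
])
conn.executemany("INSERT INTO word VALUES (?, ?, ?)", [
    ("d1", 4, "ann"), ("d1", 4, "lee"),
    ("d1", 6, "description"), ("d1", 6, "logics"),
])

# gt(PREx,PREy) & lt(POSTx,POSTy) becomes "x.pre > y.pre AND x.post < y.post";
# add(PRE5,PRE1,1) becomes "text node pre = abs/kwd pre + 1".
query = """
SELECT DISTINCT au.doc, au.pre
FROM node AS au
JOIN node AS fm  ON fm.doc = au.doc AND fm.name = 'fm'
                AND au.pre > fm.pre AND au.post < fm.post
JOIN node AS art ON art.doc = au.doc AND art.name = 'article'
                AND fm.pre > art.pre AND fm.post < art.post
JOIN node AS ak  ON ak.doc = au.doc AND ak.name IN ('abs', 'kwd')
                AND ak.pre > art.pre AND ak.post < art.post
JOIN word AS w1  ON w1.doc = au.doc AND w1.pre = ak.pre + 1
                AND w1.word = 'description'
JOIN word AS w2  ON w2.doc = au.doc AND w2.pre = ak.pre + 1
                AND w2.word = 'logics'
WHERE au.name = 'au'
"""
result = conn.execute(query).fetchall()
print(result)
```

Each gt/lt pair in the rule turns into one join condition, which is why the pre/post encoding suits a relational back end.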
My project work now deals with developing appropriate language models for XML retrieval.
Applying the Divergence From Randomness Approach (DFR)
I am now developing a probabilistic retrieval model based on the technique of "logistic regression". This model ranks documents by probability of relevance, which distinguishes it from other approaches such as the "vector space" retrieval model, in which the retrieved items are ranked by a similarity measure (e.g., the cosine function).
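The contrast between the two ranking criteria can be sketched in Python. This is a generic illustration, not the model under development: the two features per document, the coefficients, and the intercept are invented numbers, whereas in a real system the coefficients would be fitted to relevance judgements.

```python
import math


def cosine(query_vec, doc_vec):
    """Vector space model: similarity between query and document vectors."""
    dot = sum(q * d for q, d in zip(query_vec, doc_vec))
    norm = (math.sqrt(sum(q * q for q in query_vec))
            * math.sqrt(sum(d * d for d in doc_vec)))
    return dot / norm if norm else 0.0


def relevance_probability(features, coefficients, intercept):
    """Logistic regression: map a weighted feature sum to a probability."""
    log_odds = intercept + sum(c * x for c, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical values; real coefficients would be learned from training data.
coefficients = [1.2, 0.8]
intercept = -2.0
docs = {"a": [2.0, 1.0], "b": [0.5, 0.0], "c": [1.0, 2.0]}

ranking = sorted(
    docs,
    key=lambda d: relevance_probability(docs[d], coefficients, intercept),
    reverse=True,
)
print(ranking)
print(cosine([1.0, 0.0], [1.0, 0.0]))
```

The logistic score can be read as an estimated probability of relevance, while the cosine value only orders documents by similarity to the query.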
IR group at the University of Duisburg-Essen
Please feel free to e-mail me.