XML


XML, the Extensible Markup Language, is a W3C-endorsed standard for document markup. It defines a generic syntax used to mark up data with simple, human-readable tags. It provides a standard format for computer documents that is flexible enough to be customized for domains as diverse as web sites, electronic data interchange, vector graphics, genealogy, real estate listings, object serialization, remote procedure calls, voice mail systems, and more.
XML is a metamarkup language for text documents. Data are included in XML documents as strings of text. The data are surrounded by text markup that describes the data. The XML specification defines the exact syntax this markup must follow: how elements are delimited by tags, what a tag looks like, what names are acceptable for elements, where attributes are placed, and so forth. XML allows developers and writers to invent the elements they need as they need them. The X in XML stands for Extensible, which means that the language can be extended and adapted to meet many different needs.
The markup in an XML document describes the structure of the document. It lets you see which elements are associated with which other elements. In a well-designed XML document, the markup also describes the document's semantics. The markup says nothing about how the document should be displayed. That is, it does not say that an element is bold or italicized or a list item. XML is a structural and semantic markup language, not a presentation language.
The markup permitted in a particular XML application can be documented in a schema. Particular document instances can be compared to the schema. Documents that match the schema are said to be valid. Documents that do not match are invalid. Validity depends on the schema. That is, whether a document is valid or invalid depends on which schema you compare it to.
Although XML is quite flexible in the elements it allows, it is quite strict in many other respects. The XML specification defines a grammar for XML documents that says where tags may be placed, what they must look like, which element names are legal, how attributes are attached to elements, and so forth. This grammar is specific enough to allow the development of XML parsers that can read any XML document. Documents that satisfy this grammar are said to be well-formed. XML processors reject documents that contain well-formedness errors.
Not all documents need to be valid. For many purposes it is enough that the document be well-formed. There are many different XML schema languages with different levels of expressivity. The most broadly supported schema language is the document type definition (DTD). There are many other schema languages, including the W3C XML schema language, RELAX NG, Schematron, Hook and Examplotron.
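
The difference between a well-formed and a malformed document can be seen with any XML parser. The following is a minimal sketch using Python's standard `xml.etree.ElementTree` module; the helper name `is_well_formed` and the sample documents are made up for illustration. Note that this parser checks well-formedness only. It does not validate a document against a DTD or any other schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if text parses as XML, False on a well-formedness error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Illustrative documents (not taken from the original page).
good = "<person><name>Alan Turing</name></person>"
bad = "<person><name>Alan Turing</person></name>"  # end tags in the wrong order

print(is_well_formed(good))
print(is_well_formed(bad))
```

The first call reports a well-formed document; the second is rejected because the end tags are not properly nested.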

Some standards

A set of standards is related to XML. Some of the better-known ones are the Document Object Model (DOM), Simple API for XML (SAX), Extensible Stylesheet Language (XSL), XML Linking Language (XLink), XML Pointer Language (XPointer), XML Query Language (XQL), Extensible Stylesheet Language Transformations (XSLT), and XML Path Language (XPath).

The next wave of Internet technology?

XML allows customized tags that add more semantics to web content. Today's search engines are mostly based on word matching rather than on the semantics of both the query formulation and the document text. For example, a search on "Thinkpad" will give you all documents that contain the word "Thinkpad", even when the user is only interested in product information pages on "Thinkpad". To solve this problem, one can use customized XML tags to classify data with semantics such as "product", "related-product", etc. Another important feature of XML is interoperability support: different applications can communicate and extract information from the same XML document as long as they use the same DTD. XML also allows the data in a document to be separated from the presentation directives, which are specified via XSL. As a result, the same data can be presented in different formats according to such criteria as output device type or business purpose.
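
The "Thinkpad" example can be sketched in a few lines. The snippet below is only an illustration: the `pages` collection, its element layout, and the two helper functions are assumptions, not part of any real search engine. It contrasts plain word matching with a lookup restricted to the `product` element.

```python
import xml.etree.ElementTree as ET

# Hypothetical page collection; the element names follow the tags
# suggested in the text ("product", "related-product").
pages = ET.fromstring(
    "<pages>"
    "<page><product>Thinkpad</product><text>Specifications and pricing.</text></page>"
    "<page><text>My Thinkpad broke on holiday.</text></page>"
    "<page><related-product>Thinkpad docking station</related-product></page>"
    "</pages>"
)


def word_match(root, word):
    """Pages whose text contains the word anywhere (plain word matching)."""
    return [p for p in root.findall("page") if word in "".join(p.itertext())]


def product_match(root, word):
    """Pages that tag the word explicitly as a product."""
    return [
        p for p in root.findall("page")
        if any(e.text == word for e in p.findall("product"))
    ]


print(len(word_match(pages, "Thinkpad")))     # every page mentions the word
print(len(product_match(pages, "Thinkpad")))  # only the product page qualifies
```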

Using MySQL with XML

The following list indicates just some of the possibilities open to you for employing XML processing techniques to make more productive use of your MySQL server:

XML as a data transfer medium.

Writing a query result as an XML document results in a platform-neutral text file that can be used by other applications, even those that are not necessarily database-oriented. The recipient of such a document can employ standard XML tools to parse it and recover the original data values. Used this way, XML serves as an interface between your MySQL database and other applications that can read XML but may know nothing about MySQL. This works in the other direction, too. If an application can produce XML-formatted documents, you can read them and store the information contained therein into MySQL by using simple XML parsing techniques.
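
As a rough sketch of the export direction, the following code serializes a query result as XML. To keep it runnable without a server, it uses Python's built-in `sqlite3` module as a stand-in for MySQL; with a MySQL driver that follows the Python DB-API, the `rows_to_xml` function should work on that driver's cursor in the same way. The table `animal`, the tag names `resultset` and `row`, and the function name are all assumptions made for the example. The sketch also assumes that column names are legal XML element names.

```python
import sqlite3
import xml.etree.ElementTree as ET


def rows_to_xml(cursor, root_tag="resultset", row_tag="row"):
    """Serialize the rows of an executed DB-API cursor as an XML string."""
    columns = [d[0] for d in cursor.description]
    root = ET.Element(root_tag)
    for record in cursor:
        row = ET.SubElement(root, row_tag)
        for name, value in zip(columns, record):
            field = ET.SubElement(row, name)
            field.text = "" if value is None else str(value)
    return ET.tostring(root, encoding="unicode")


# In-memory SQLite database standing in for a MySQL server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE animal (name TEXT, legs INTEGER)")
conn.executemany("INSERT INTO animal VALUES (?, ?)", [("cat", 4), ("bird", 2)])

cur = conn.execute("SELECT name, legs FROM animal ORDER BY name")
xml_text = rows_to_xml(cur)
print(xml_text)
```

Each row becomes a `row` element and each column becomes a child element named after the column, so the recipient can recover the values without knowing anything about the database.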

XML as a web delivery format.

XML's simple, well-defined structure makes it useful for information delivery in a web environment. For example, you can set up an information feed that clients can use on an automated basis: Define XML formatting conventions with which to express the information and provide access to it through a web script. Clients can send requests to the script, which connects to MySQL, retrieves the desired information, and formats it as an XML document that is returned to the client. The client then extracts information from the document using standard XML tools.

Using XML to write web pages.

As the limitations of HTML for writing web pages become more keenly felt, web developers turn increasingly to the more expressive capabilities of XML. HTML serves primarily as a destination format, whereas XML is useful both as a source and as a destination format. For example, an XML document can incorporate the results of database queries and then, with the help of a rendering engine such as AxKit, be transformed into a format that matches the type of client you wish to serve. You can send HTML, WML, or plain text to web browsers, wireless devices, or printers. (Or, as indicated in the previous item, you can serve the document directly to clients that understand XML.) Contrast this with HTML, which does not render well into other formats.

Storing XML directly.

You can of course store XML itself in your database. You might store templates for documents such as form letters that you combine with customer records to produce mailings, for example.

Reading XML Documents into MySQL

This task generally requires that you know something about the structure of the document and the table, so that you can determine the correspondence between document elements and table columns.
XML is a very good choice for storing data in many cases. It is easy to parse and write, and it is open for users to edit the data. Parsers have mechanisms to verify syntax and completeness, so you can protect your program from corrupted data.
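
The correspondence between document elements and table columns can be made explicit in code. The sketch below again uses `sqlite3` as a stand-in for MySQL so that it runs anywhere; the document layout, the table `animal`, and the function `load_rows` are hypothetical. Each `row` element maps to one table row, and the `name` and `legs` child elements map to the columns of the same names.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical input document: one <row> per record.
document = (
    "<resultset>"
    "<row><name>cat</name><legs>4</legs></row>"
    "<row><name>bird</name><legs>2</legs></row>"
    "</resultset>"
)


def load_rows(conn, xml_text):
    """Insert every <row> element into the animal table; return the row count."""
    root = ET.fromstring(xml_text)  # raises ParseError on malformed input
    records = [
        (row.findtext("name"), int(row.findtext("legs")))
        for row in root.iter("row")
    ]
    conn.executemany("INSERT INTO animal (name, legs) VALUES (?, ?)", records)
    return len(records)


conn = sqlite3.connect(":memory:")  # stand-in for a MySQL connection
conn.execute("CREATE TABLE animal (name TEXT, legs INTEGER)")
inserted = load_rows(conn, document)
print(inserted, conn.execute("SELECT name, legs FROM animal").fetchall())
```

Because the parser rejects malformed input before any row is inserted, a corrupted document never reaches the table.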

XML retrieval

Anywhere there is information, you will find XML, or at least hear it scratching at the door. XML has grown into a huge topic, inspiring many technologies and branching into new areas.
XML retrieval is an important new area for the application of IR methods. Whereas little research has been performed on retrieval of structured documents in the past, the increasing availability of XML collections offers the opportunity for developing appropriate retrieval methods.

PIRE for XML retrieval (PireFox)

Extending PIRE (an extensible IR engine based on probabilistic logics) to retrieve XML documents was a training step for exploring XML and query languages. I chose the query language NEXI as a starting point; it was introduced in INEX 2004 as a query language for specifying structured and unstructured queries on XML documents. Using Java, the XML documents are read, parsed, and stored in Node and Word tables in a MySQL database. To translate NEXI queries into MySQL relational database commands, an intermediate pDatalog++ representation (an extension of pDatalog) is used.

Examples:


doc1.xml

This is a simple XML document.
Here is the tree that shows the pre/post order for the nodes.


Node table

Node table in MySQL for doc1.xml

Word table

Word table in MySQL for doc1.xml
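
How the rows of the two tables can be derived from a document is sketched below. The original implementation is written in Java; this Python version is only an illustration under several assumptions. The sample document is made up (it is not doc1.xml), the tuple layout `(doc, pre, post, name, type)` for nodes and `(doc, pre, word)` for words is inferred from the query translation further down, words are lowercased, and text that follows a child element (mixed content) is ignored. Each node gets its pre-order rank when the traversal enters it and its post-order rank when the traversal leaves it.

```python
import itertools
import xml.etree.ElementTree as ET


def number_nodes(doc_id, xml_text):
    """Return (node_rows, word_rows) using pre/post-order numbering."""
    pre_counter = itertools.count(1)
    post_counter = itertools.count(1)
    node_rows, word_rows = [], []

    def visit(element):
        pre = next(pre_counter)  # rank on entering the element
        text = (element.text or "").strip()
        if text:
            # Leading text becomes a PCDATA child node of type "text".
            text_pre = next(pre_counter)
            node_rows.append((doc_id, text_pre, next(post_counter), "PCDATA", "text"))
            for word in text.lower().split():
                word_rows.append((doc_id, text_pre, word))
        for child in element:
            visit(child)
        # Rank on leaving the element, after all of its descendants.
        node_rows.append((doc_id, pre, next(post_counter), element.tag, "node"))

    visit(ET.fromstring(xml_text))
    return sorted(node_rows, key=lambda row: row[1]), word_rows


# Hypothetical sample document.
sample = "<article><fm><au>Ada</au></fm><abs>description logics</abs></article>"
nodes, words = number_nodes("doc1", sample)
for row in nodes:
    print(row)
for row in words:
    print(row)
```

With this numbering, a node is a descendant of another node exactly when its pre-order rank is larger and its post-order rank is smaller, which is the property the translated queries rely on.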

This is a simple NEXI query example from the INEX topics:


//article[about(.//(abs|kwd), description logics)]//fm//au


After the XML document has been parsed and stored in the Node and Word tables in MySQL, the query is, as an intermediate step, first translated into pDatalog++ and then from pDatalog++ into MySQL commands. The relevant information is then retrieved using PIRE, with a weighting scheme based on DFR.

When transformed into pDatalog++, the above query will look similar to the following:

abskwd(D,PRE1,POST1):-NT(D,PRE1,POST1,"abs","node",~)
abskwd(D,PRE1,POST1):-NT(D,PRE1,POST1,"kwd","node",~)

q(D,PRE2):-abskwd(D,PRE1,POST1) & NT(D,PRE2,POST2,"au","node",~) & NT(D,PRE3,POST3,"fm","node",~) & NT(D,PRE4,POST4,"article","node",~) & NT(D,PRE5,POST5,"PCDATA","text",~) & WT(D,PRE5,"description") & WT(D,PRE5,"logics") & gt(PRE2,PRE3) & lt(POST2,POST3) & gt(PRE3,PRE4) & lt(POST3,POST4) & gt(PRE1,PRE4) & lt(POST1,POST4) & add(PRE5,PRE1,1)
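
The structural part of such a rule maps naturally onto SQL self-joins over the Node and Word tables. The sketch below is an assumption-laden illustration, not the SQL that PIRE actually generates: it uses `sqlite3` in place of MySQL, the table and column names are hypothetical, the two tiny documents are made up, and it performs Boolean matching only, with none of the probabilistic weighting. The containment tests are the same as in the rule above: a descendant has a larger pre-order rank and a smaller post-order rank than its ancestor.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a MySQL database
conn.executescript("""
CREATE TABLE node (doc TEXT, pre INTEGER, post INTEGER, name TEXT, type TEXT);
CREATE TABLE word (doc TEXT, pre INTEGER, word TEXT);
""")

# Two hypothetical documents with the same shape:
# <article><fm><au>..</au></fm><abs>..</abs></article>
shape = [(1, 6, "article", "node"), (2, 3, "fm", "node"), (3, 2, "au", "node"),
         (4, 1, "PCDATA", "text"), (5, 5, "abs", "node"), (6, 4, "PCDATA", "text")]
conn.executemany(
    "INSERT INTO node VALUES (?, ?, ?, ?, ?)",
    [(doc, *row) for doc in ("doc1", "doc2") for row in shape],
)
conn.executemany("INSERT INTO word VALUES (?, ?, ?)", [
    ("doc1", 4, "ada"), ("doc1", 6, "description"), ("doc1", 6, "logics"),
    ("doc2", 4, "bob"), ("doc2", 6, "fuzzy"), ("doc2", 6, "logics"),
])

QUERY = """
SELECT DISTINCT au.doc, au.pre
FROM node AS au
JOIN node AS fm  ON fm.doc = au.doc AND fm.name = 'fm'
                AND au.pre > fm.pre AND au.post < fm.post
JOIN node AS art ON art.doc = au.doc AND art.name = 'article'
                AND fm.pre > art.pre AND fm.post < art.post
JOIN node AS sec ON sec.doc = au.doc AND sec.name IN ('abs', 'kwd')
                AND sec.pre > art.pre AND sec.post < art.post
JOIN node AS txt ON txt.doc = au.doc AND txt.name = 'PCDATA'
                AND txt.pre = sec.pre + 1
JOIN word AS w1  ON w1.doc = txt.doc AND w1.pre = txt.pre AND w1.word = 'description'
JOIN word AS w2  ON w2.doc = txt.doc AND w2.pre = txt.pre AND w2.word = 'logics'
WHERE au.name = 'au'
"""

matches = conn.execute(QUERY).fetchall()
print(matches)
```

Only the `au` node of the first document is returned, because the abstract of the second document does not contain both query words.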



My project work now deals with developing appropriate language models for XML retrieval.

Applying the Divergence From Randomness Approach (DFR)




I am now developing a probabilistic retrieval model based on the technique of "logistic regression". This model ranks documents by probability of relevance, which distinguishes it from other approaches such as the "vector space" retrieval model, in which the retrieved items are ranked by a similarity measure (e.g., the cosine function).
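
The difference between the two ranking criteria can be illustrated with a small sketch. Everything in it is hypothetical: the feature vectors, the coefficients, and the intercept are made-up numbers, whereas in a real logistic regression model the coefficients would be estimated from relevance judgments. The first function turns a weighted sum of features into a probability of relevance; the second computes the cosine similarity used in the vector space model.

```python
import math


def relevance_probability(features, coefficients, intercept):
    """Logistic regression: map a weighted feature sum to a probability."""
    log_odds = intercept + sum(c * f for c, f in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-log_odds))


def cosine_similarity(a, b):
    """Vector space model: cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical query-document features and model parameters.
doc_features = {"d1": [2.0, 1.0], "d2": [0.5, 0.0]}
coefficients = [1.2, 0.8]
intercept = -2.0

ranked = sorted(
    doc_features,
    key=lambda d: relevance_probability(doc_features[d], coefficients, intercept),
    reverse=True,
)
print(ranked)
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
```

The logistic score is a probability between 0 and 1, while the cosine value is only a measure of similarity between vectors.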


IR group at the University of Duisburg-Essen

please feel free to E-mail me