A Rule Based Question Answering System in Malayalam corpus Using Vibhakthi and POS Tag Analysis

INTRODUCTION
The main goal of a question answering (QA) system is to process requests expressed in natural language and to return short, accurate answers. Most web search engines we use today handle QA tasks as information retrieval, so instead of a precise answer we get all the documents similar to our query. An efficient QA system should process natural language expressions rather than keyword-based queries. There are two main types of QA system: closed-domain and open-domain. Questions can also take different forms: factoid, list, definition, and description. Here we focus on factoid question answering.
No efficient question answering system currently exists for Malayalam. A QA system for Malayalam needs natural language processing techniques beyond keyword processing, which makes this work relevant to Malayalam NLP.
Importance of Karaka Theory and Vibhakthis for Indian Language Analysis
According to Paninian grammar, karaka theory is useful for both the syntactic and the semantic analysis of sentences in the Dravidian languages. Karakas denote semantic roles. The study of the roles associated with specific verbs, and across classes of verbs, is called thematic role analysis or case role (karaka) analysis.
The morphemes added at the end of a word to make it more meaningful and to relate it to other words are called prathyayas. Prathyayas are added to nouns for three main purposes:
1) Linga Prathyayam (to change gender)
2) Vachana Prathyayam (to change number)
3) Vibhakthi Prathyayam (to relate nouns to other words)
Here the vibhakthi of words is analyzed to build a QA system for Malayalam sentences.
Eg. “To mother”: "അമ്മയോട്" (ammayodu) = അമ്മ (amma, mother) + ഓട് (oodu, to)
i.e. the noun for “mother” in Malayalam is inflected when it is used in a sentence. These inflections are added through vibhakthi suffixes such as “oodu”.


System Architecture
Tokenization
A tokenizer segments the character stream into a sequence of tokens. Here we apply sentence tokenization followed by word tokenization, so the Malayalam document is divided into meaningful units: sentences and then words. Our work tokenizes words on spaces and commas.
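The two-stage tokenization described above can be sketched roughly as follows. This is only an illustration, not the code used in this work: the function names are hypothetical, and splitting sentences on ".", "?" and "!" is an assumption, since only the space and comma delimiters for words are stated here.

```python
import re
from typing import List


def sentence_tokenize(text: str) -> List[str]:
    """Split text into sentences after '.', '?' or '!' (assumed delimiters)."""
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    return [p for p in parts if p]


def word_tokenize(sentence: str) -> List[str]:
    """Split a sentence into words on spaces and commas."""
    tokens = re.split(r"[ ,]+", sentence.strip())
    # Drop sentence-final punctuation that is still attached to a word.
    cleaned = [t.strip(".?!") for t in tokens]
    return [t for t in cleaned if t]


def tokenize_document(text: str) -> List[List[str]]:
    """Sentence tokenization followed by word tokenization."""
    return [word_tokenize(s) for s in sentence_tokenize(text)]


# Illustrative input: "Raman told mother. Sita went home."
text = "രാമൻ അമ്മയോട് പറഞ്ഞു. സീത വീട്ടിൽ പോയി."
for words in tokenize_document(text):
    print(words)
```

Keeping the two stages as separate functions means the word delimiters can be changed without touching the sentence splitter.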

Sandhi splitter
In Dravidian languages, words can join together and may undergo morphophonemic changes at the point of joining. This phenomenon is called Sandhi, and the word formed this way is called a compound word. Sandhi splitting breaks such words into their component words. It is considered a primary task in the computational processing of text in Dravidian languages. In Malayalam the presence of Sandhi is high compared to other Dravidian languages. To find the compound words we use the TnT tagger.
POS Tagging
TnT (Trigrams'n'Tags) is an efficient statistical part-of-speech tagger, and it is the tagger used here. The system is trained on corpora tagged with the IIITH tag set.
Vibhakthi Analysis
There are 7 different vibhakthis in Malayalam. The vibhakthi of each word is identified using rule-based matching.
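A minimal sketch of what rule-based vibhakthi matching could look like is given below. The actual rules used in this work are not listed here, so the suffix table is an assumption: it maps a few common written endings to the conventional case labels, and treats a word with no matching ending as nominative. The names `SUFFIX_RULES` and `identify_vibhakthi` are hypothetical.

```python
# Hypothetical suffix table: written word endings mapped to a vibhakthi label.
# The nominative (nirdeshika) carries no suffix, so it is the fallback.
SUFFIX_RULES = [
    ("ന്റെ", "genitive"),      # sambandhika
    ("ുടെ", "genitive"),       # sambandhika
    ("ോട്", "sociative"),      # samyojika
    ("ക്ക്", "dative"),         # uddeshika
    ("ന്", "dative"),          # uddeshika
    ("ാൽ", "instrumental"),   # prayojika
    ("ിൽ", "locative"),        # adharika
    ("െ", "accusative"),       # prathigrahika
]

# Try longer endings first so that the genitive "ുടെ" wins over the bare "െ".
_ORDERED_RULES = sorted(SUFFIX_RULES, key=lambda rule: len(rule[0]), reverse=True)


def identify_vibhakthi(word: str) -> str:
    """Return the label of the first (longest) suffix rule the word ends with."""
    for suffix, label in _ORDERED_RULES:
        if word.endswith(suffix):
            return label
    return "nominative"


# Forms of "amma" (mother): bare, accusative, sociative, genitive.
for w in ["അമ്മ", "അമ്മയെ", "അമ്മയോട്", "അമ്മയുടെ"]:
    print(w, identify_vibhakthi(w))
```

Pure suffix matching will mislabel uninflected words that happen to end in one of these character sequences, which is one reason to combine it with the POS tag as described in the procedure.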
Procedure:
The question sentence is split into words, and keyword matching is used to find the best-matching sentence in the given answer corpus. That sentence is then analyzed to find which word gives the exact answer to the question. For this, the vibhakthi and POS tag of the question word are compared with the vibhakthi and POS tag of each word in the best-matching sentence. The word that has the same vibhakthi and POS tag as the question word is taken as the answer.
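The procedure can be sketched as below, under several assumptions. Tokens are written in romanized form and carry hand-supplied POS tags and vibhakthi labels, standing in for the output of the tagger and the vibhakthi rules. Keyword matching is approximated by counting shared words. How the question word is mapped to the POS tag expected of the answer is not spelled out here, so that value is simply passed in. All names are hypothetical.

```python
from typing import List, NamedTuple, Optional


class Token(NamedTuple):
    word: str        # romanized word form (illustrative)
    pos: str         # POS tag, assumed to come from the tagger
    vibhakthi: str   # vibhakthi label, assumed to come from the rule matcher


def overlap(question_words: List[str], sentence: List[Token]) -> int:
    """Keyword-matching score: number of question words found in the sentence."""
    sentence_words = {tok.word for tok in sentence}
    return sum(1 for w in set(question_words) if w in sentence_words)


def best_matching_sentence(question_words: List[str],
                           corpus: List[List[Token]]) -> List[Token]:
    """Pick the corpus sentence that shares the most words with the question."""
    return max(corpus, key=lambda sent: overlap(question_words, sent))


def select_answer(expected_pos: str, expected_vibhakthi: str,
                  question_words: List[str],
                  sentence: List[Token]) -> Optional[str]:
    """Return the first word whose POS tag and vibhakthi match the question word."""
    for tok in sentence:
        if tok.word in question_words:
            continue  # words repeated from the question are not the answer
        if tok.pos == expected_pos and tok.vibhakthi == expected_vibhakthi:
            return tok.word
    return None


corpus = [
    # "Raman told mother."
    [Token("raman", "NNP", "nominative"),
     Token("ammayodu", "NN", "sociative"),
     Token("paranju", "VM", "none")],
    # "Sita went home."
    [Token("sita", "NNP", "nominative"),
     Token("veettil", "NN", "locative"),
     Token("poyi", "VM", "none")],
]

# Question: "raman aarodu paranju?" (Whom did Raman tell?)
# The question word "aarodu" carries the sociative suffix; the expected
# answer POS "NN" is an assumption supplied by hand.
question = ["raman", "aarodu", "paranju"]
best = best_matching_sentence(question, corpus)
print(select_answer("NN", "sociative", question, best))
```

Returning `None` when no word matches keeps the "no answer found" case explicit instead of falling back to a guess.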

Conclusion
In this work a system for factoid question answering in Malayalam is proposed. We first analyzed the question words and identified the vibhakthi associated with the corresponding answer. Then, after finding the sentences that contain words related to the answer, we implemented a rule-based system that checks the vibhakthi of each word. An answer is retrieved if the vibhakthi in the question module and the answer module is the same. POS tagging with the TnT tagger and compound word splitting are also used in our question answering system.
Our system uses vibhakthi features for factoid question answering. This work could be extended to analyze the corresponding karaka roles in Malayalam, so that a more efficient system for analyzing questions could be developed.
