
A Rule Based Question Answering System in Malayalam corpus Using Vibhakthi and POS Tag Analysis

INTRODUCTION
The main goal of a question answering (QA) system is to process requests in natural language and return accurate, short answers. Most web search engines in use today handle QA tasks as information retrieval, so instead of a precise answer we get all documents similar to our query. An efficient QA system would process natural language expressions rather than keyword-based queries. There are two main types of QA systems: closed-domain and open-domain. Questions also take different forms: factoid, list, definition, and description. Here we focus on factoid question answering.
No efficient question answering system currently exists for Malayalam. A Malayalam QA system needs natural language processing techniques beyond keyword processing, which makes this work relevant to Malayalam NLP.
Importance of Karaka Theory and Vibhakthis for Indian Language Analysis
According to Paninian grammar, karaka theory is useful for both the syntactic and the semantic analysis of sentences in Dravidian languages. Karakas denote semantic roles. The study of the roles associated with specific verbs and across classes of verbs is called thematic role analysis or case role (karaka) analysis.
The morphemes added at the end of a word to make it more meaningful and to relate it to other words are called prathyayas. Prathyayas are added to nouns for three main purposes:
1) Linga Prathyayam (to change gender)
2) Vachana Prathyayam (to change number)
3) Vibhakthi Prathyayam (to relate nouns to other words)
Here the vibhakthi of each word is analyzed to build a QA system for Malayalam sentences.
E.g. "to mother": "അമ്മയോട്" (ammayodu) = അമ്മ (amma, "mother") + ഓട് (oodu, "to")
That is, the noun for "mother" in Malayalam is inflected when it is used in a sentence. Such inflections are added through vibhakthi suffixes such as "oodu".


System Architecture
Tokenization
A tokenizer segments the character stream into a sequence of tokens. Here we apply sentence tokenization followed by word tokenization, so the Malayalam document is divided into meaningful units such as sentences and words. Our tokenizer splits on spaces and commas.
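The two-stage tokenization described above could be sketched roughly as follows. This is a minimal illustration, not the code used in this work: the function names are hypothetical, and the sentence delimiters (".", "?", "!") are an assumption, since only the space and comma rule for words is stated here.

```python
import re

def split_sentences(text):
    """Split text into sentences on '.', '?' or '!' (assumed delimiters)."""
    parts = re.split(r"[.?!]+", text)
    return [p.strip() for p in parts if p.strip()]

def split_words(sentence):
    """Split a sentence into words on spaces and commas."""
    return [w for w in re.split(r"[,\s]+", sentence) if w]

def tokenize(text):
    """Sentence tokenization followed by word tokenization."""
    return [split_words(s) for s in split_sentences(text)]

if __name__ == "__main__":
    # Romanized sample text; Malayalam script is split the same way.
    print(tokenize("raaman, sita veettil poyi. amma vannu."))
```

Because the split is purely on delimiters, it does not separate Sandhi compounds; that is left to the next stage.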

Sandhi splitter
In Dravidian languages, words can join together and may undergo morphophonemic changes at the point of joining. This phenomenon is called Sandhi, and the word formed this way is called a compound word. Sandhi splitting breaks such words into their component words. It is considered a primary task in the computational processing of text in Dravidian languages. The incidence of Sandhi in Malayalam is high compared to other Dravidian languages. To identify compound words, we use the TnT tagger.
POS Tagging
TnT (Trigrams'n'Tags) is an efficient statistical part-of-speech tagger, and it is the tagger used here. The system is trained on a corpus annotated with the IIITH tag set.
Vibhakthi Analysis
There are 7 different vibhakthis in Malayalam. The vibhakthi of each word is identified using rule-based matching.
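As a rough sketch of what rule-based vibhakthi matching can look like, the snippet below checks word endings against a small suffix table. The suffix table is an assumption made for illustration: it uses simplified romanized endings (for example "odu" as in "ammayodu") rather than Malayalam script, it is not the rule set used in this work, and a real system would need many more variants. The vibhakthi names follow traditional Malayalam grammar.

```python
# Illustrative romanized suffixes for the vibhakthis (assumed, simplified).
# Nirdeshika (nominative) carries no suffix and is used as the default.
VIBHAKTHI_SUFFIXES = {
    "prathigrahika": ["e"],          # accusative
    "samyojika": ["odu"],            # sociative, as in "ammayodu"
    "uddeshika": ["kku", "nu"],      # dative
    "prayojika": ["aal"],            # instrumental
    "sambandhika": ["ude", "nte"],   # genitive
    "adharika": ["il"],              # locative
}

# Flatten to (suffix, name) pairs, longest suffix first, so that "ude"
# is tried before the shorter "e".
_RULES = sorted(
    ((suffix, name)
     for name, suffixes in VIBHAKTHI_SUFFIXES.items()
     for suffix in suffixes),
    key=lambda rule: len(rule[0]),
    reverse=True,
)

def vibhakthi_of(word):
    """Return the vibhakthi whose suffix matches the end of the word."""
    for suffix, name in _RULES:
        if len(word) > len(suffix) and word.endswith(suffix):
            return name
    return "nirdeshika"

if __name__ == "__main__":
    for w in ["amma", "ammayodu", "ammayude", "veettil"]:
        print(w, vibhakthi_of(w))
```

Trying longer suffixes first is a design choice to avoid a short ending such as "e" shadowing a longer one such as "ude".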
Procedure:
The question sentence is split into words, and the best-matching sentence in the given answer corpus is found by keyword matching. That sentence is then analyzed to find which word gives the exact answer to the question: the vibhakthi and POS tag of the question word are compared with the vibhakthi and POS tag of each word in the best-matching sentence. The word with the same vibhakthi and POS tag as the question word is taken as the answer.
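The procedure can be sketched as below. This is only an illustration under several assumptions: the Token type and function names are hypothetical, the sentences are romanized and hand-annotated rather than produced by the TnT tagger, keyword matching is approximated by counting shared words, and the question word is assumed to carry the POS tag and vibhakthi expected of the answer.

```python
from typing import List, NamedTuple, Optional

class Token(NamedTuple):
    word: str
    pos: str        # POS tag, e.g. "NN" (noun) or "VM" (main verb)
    vibhakthi: str  # vibhakthi name, or "none" for non-nouns

def keyword_overlap(question_words: List[str], sentence: List[Token]) -> int:
    """Count the words shared by the question and a corpus sentence."""
    return len(set(question_words) & {tok.word for tok in sentence})

def best_matching_sentence(question_words, corpus):
    """Pick the corpus sentence with the largest keyword overlap."""
    return max(corpus, key=lambda s: keyword_overlap(question_words, s))

def find_answer(question_token: Token, sentence: List[Token]) -> Optional[str]:
    """Return the first word whose POS tag and vibhakthi match the question word."""
    for tok in sentence:
        if tok.pos == question_token.pos and tok.vibhakthi == question_token.vibhakthi:
            return tok.word
    return None

# Hand-annotated toy corpus (hypothetical data).
corpus = [
    [Token("raaman", "NN", "nirdeshika"),
     Token("ammayodu", "NN", "samyojika"),
     Token("samsaarichu", "VM", "none")],
    [Token("sita", "NN", "nirdeshika"),
     Token("veettil", "NN", "adharika"),
     Token("poyi", "VM", "none")],
]

# Question: "raaman aarodu samsaarichu?" ("Whom did Raman talk to?")
question_words = ["raaman", "aarodu", "samsaarichu"]
question_token = Token("aarodu", "NN", "samyojika")

sentence = best_matching_sentence(question_words, corpus)
answer = find_answer(question_token, sentence)
print(answer)
```

In this toy case the first sentence shares two words with the question, and "ammayodu" is the only word in it with the same POS tag and vibhakthi as the question word.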

Conclusion
In this work we proposed a factoid question answering system for Malayalam. We first analyzed the question words and identified the vibhakthi associated with the corresponding answer. After finding the sentences that contain the words related to the answer, we applied a rule-based system to check the vibhakthi of each word. An answer is retrieved when the vibhakthi in the question and in the answer sentence are the same. The system also uses POS tagging with the TnT tagger and compound word splitting.
Our system uses vibhakthi features for factoid question answering. The work could be extended to analyze the corresponding karaka roles in Malayalam, which could lead to a more efficient system for analyzing questions.
