nQuery - A Natural Language Statement to SQL Query Generator

In this research, an intelligent system is designed between the user and the database system which accepts natural language input and then converts it into an SQL query. The research focuses on incorporating complex queries along with simple queries irrespective of the database. The system accommodates aggregate functions, multiple conditions in WHERE clause, advanced clauses like ORDER BY, GROUP BY and HAVING. The system handles single sentence natural language inputs, which are with respect to selected database. The research currently concentrates on MySQL database system. The natural language statement goes through various stages of Natural Language Processing like morphological, lexical, syntactic and semantic analysis resulting in SQL query formation.


Introduction
Today, virtually every relational database management system (RDBMS) uses Structured Query Language (SQL) for querying and maintaining the database. Users accessing relational databases need to learn SQL and build queries in the right syntax for retrieving the data. It becomes a big hurdle for all those who are not technically knowledgeable in this domain to write the queries in SQL. It would be very convenient if the relational database system can be queried using natural language like English.
Natural language processing (NLP) is the ability of a computer program to understand human speech as it is spoken. While natural language may be easy for people to learn and use, it has been proved to be hard for a computer to master. Despite such challenges, natural language processing is regarded as a promising and important endeavor in the field of computer research. nQuery will translate natural language queries into SQL before retrieving data from database. It will deal with single sentence inputs given by the user using a particular database. The system mainly focuses on data retrieval but also provides the facility to convert DML natural language statements to SQL. However, the system will output queries which can be used for querying the MySQL database system only. The aim of the system is to reduce the complexity of database querying. The approach our system uses, extracts certain keywords from the natural language statement and goes through various steps of Natural Language Processing. This system focuses on table mapping, attribute mapping and clause tagging to generate the resultant query.

Related Work
Over the years, certain systems which focus only on a particular database have been built to serve a particular purpose. (Woods, 1973) developed a system called LUNAR, that answered questions about rock samples brought back from the moon. LIFER/LADDER designed by (Hendrix, 1978) was designed as a natural language interface to a database of information about US Navy ships. The system could only support simple one-table queries on a specific database.
Some of the recent developments try to build a complete system which can generate various types of queries. An expert system was proposed by (Siasar et al., May 2008) using the concepts of syntactic and semantic knowledge. They have also suggested a selection algorithm to select most appropriate query from the suggested possi-  (Rao et al., 2010) have put forth a system in which simple queries along with basic implicit queries are covered. (Chaudhari, November 2013) implemented a system which handled simple queries and aggregate functions using a prototype approach. Both the above methods have not handled multiple tables and advanced clauses. . (Ghosh et al., 2014) built on the development done by (Chaudhari, November 2013) and developed an automatic query generator which took natural language input in the form of text or speech. It provided support for simple nested queries and aggregate functions. The system handled sentences which explicitly mention the attribute names as they are in the Database. Our system handles the problem by a specific substring algorithm we have developed. We have looked to do the tasks which the above systems do in a more efficient way by building a different type of algorithm relying on conditional substring matching to map the words to attributes and tables. We also go beyond by including various different types of queries.
A different type of approach was used by (Reinaldha and Widagdo, 2014) in which the different kinds of questions which can be asked when a database is to be queried were analyzed. They have made use of semantic rules to find out dependencies among the words present in the question asked. (Palakurthi et al., 2015) provides information about the types of attributes and classification features. They describe how different kinds of attributes are handled differently when they occur in sentences. We handle explicit attributes and certain types of implicit attributes in sentences. (Ghosal et al., March 2016) proposed a system which worked well on simple queries involving multiple tables. But the data dictionaries used for the system are limited and the grammar is hard coded. (Kaur and J, Jan 2016) emphasized on simple queries and basic JOIN operations. However, the system does not accommodate advance clauses like aggregate functions, GROUP BY and HAVING Clauses. Our system incorporates advanced clauses along with all the simple queries and generalizes well on different databases. (Singh and Solanki, 2016) proposed an algorithm to convert natural language sentence to SQL queries. They used verb lists, noun lists and rules to map attributes and tables to the words in the sentence. The system also handled ambiguity among the inputs. We have tried to use concepts discussed in this algorithm like noun and verb lists in order to develop our algorithm.
From the above literature survey, we were able to get a fair idea of the work carried out in this field of research. The shortcomings of the referred papers and applications along with the future work mentioned motivated us to take up this research. The increasing importance of Natural Language Processing lured us towards learning these concepts. The system we propose looks to go beyond the accomplished work.

System Design
As we have seen from the literature survey, every system had limitations. We propose a system which looks to overcome the shortcomings of the existing systems. Our system gets a natural language sentence as an input, which is then passed through various phases of NLP to form the final SQL query. Refer Fig 1 for

Tokenize and Tag
The input natural language query gets split into different tokens with the help of the tokenizer from 'NLTK' package. The tokenized array of words is tagged according to the part-of-speech tagger. All processes following this step use these tagged tokens for processing.

Analyze tagged tokens
Based on the tagged tokens of earlier step, the noun map and verb list is prepared through one iteration over the tokens. The tokens corresponding to aggregate functions are also mapped with their respective nouns. The decision whether the natural language statement represents a data retrieval query (SELECT) or a DML query (INSERT, UP-DATE, DELETE) is taken at this stage with the help of certain 'data arrays' for denoting type of query. For example, when words like 'insert' and its certain synonyms appear in the input, the type of query is 'INSERT' and so on. In any type of query, the tentative tags 'S' (SELECT), 'W' (WHERE), 'O' (ORDER BY) are mapped to the nouns indicating the clauses to which they belong. For this, we have designed 'data dictionaries' for different clauses. These data dictionaries consist of the token-clause term pair, for e.g. aggregate clause data dictionary is "number": "COUNT", "count": "COUNT", "total": "SUM", "sum": "SUM", "average": "AVG", "mean": "AVG". Thus, if any of these tokens is encountered, it is likely to have aggregate clause and accordingly the nouns are tagged with the clause tag.

Map to table names and attributes
Using the noun map and verb list, the table set is prepared, which will hold the tables that are needed in the query to be formed. This is based on the fact that the table names are either nouns or verbs. The noun map is used to find the attributes which are needed in the final query. The    Figure 2: Algorithm with example attributes, the table associated with the attribute and the clause tag are stored in an attribute-table map which is used in the final stage of query formation. This is done using the string matching algorithm that we have implemented in our system. The words in the input sentence need not exactly be as they are in the database. The stemmer and lemmatiser are applied on the words before they are matched using our string matching algorithm. The data obtained during this step i.e. table set and attribute-table map, is most likely to be in the final query, however, it might be refined later.

Filter redundancy and finalize clauses of the query
Using the various data dictionaries defined, the system has already decided which clauses are likely to exist in the final query and has mapped the data to the clauses. But, some of the data has to be finalized at this stage. The data related to GROUP BY and HAVING clause is collected using the previous data and the basic rules of SQL. For example, if aggregate function is compared to a constant, i.e. 'MAX(salary) > 40000', then 'HAVING' clause has to be used instead of 'WHERE' clause. As mentioned in the earlier step, the refinement of data must be done. Here, the redundant tables and attributes are removed using some filter algorithms. For example, one of the algorithm filters the

Form the final query and execute
Currently, as our system handles only MySQL queries, the templates used for the query formation will be according to the MySQL syntax. According to the type of query selected in the second stage of the process (Analyze tagged tokens), the appropriate template is chosen. The template is selected from the following: 1. For data retrieval queries (SELECT): Based on the data about various clauses collected from earlier steps and the information about attributes and tables stored in the attribute-table map, the final query is formed by filling in the information into the appropriate template. Depending on the clause data collected from earlier steps, corresponding <> are filled. Depending on the relation between multiple tables, the decision of INNER JOIN or NATURAL JOIN is taken. For example, if there are two tables. If these two tables have one common attribute and is named the same in both, then there is NATURAL JOIN between the tables. But if the common attribute is named differently in the two tables, then there is INNER JOIN between the tables. The final query is as shown in Fig 2.

Results and Analysis
The corpus that can be used to test our system is not readily available and is dependent on a database. Hence, we have tested our system on a synthesized corpus of natural language statements related to a bank and a university database. The university and bank database consists of 11 and 6 tables respectively. However, system can work on any complex database. The natural language statement has to be a single sentence. The system has been evaluated on a corpus of around 75 natural language statements of university database and around 50 related to bank database. The accuracy of the system is found out to be around 86%. The system gives the same SQL query as the output when the same natural language statement is represented in different ways. If the system fails to generate SQL query corresponding to any natural language statement, an error message is displayed. These are a few results given by the system on the university corpus: 2. Find all student name whose credits are between 90 and 100 and department name is 'Finance' or 'Biology'.
• SELECT DISTINCT student.stud name FROM student WHERE ( student.tot cred BETWEEN '90' AND '100' ) AND ( student.dep name = 'Finance' OR student.dep name = 'Biology' ) The above query showcases multiple conditions within the WHERE clause. This query also involves use of BETWEEN clause and logical clauses like AND, OR.
3. List all student names whose credits are 50 in decreasing order of credits.
• SELECT DISTINCT student.stud name FROM student WHERE student.tot cred = '50' ORDER BY student.tot cred DESC Another type of query is the one involving sorting its result based on some attribute. For this purpose, the query uses the ORDER BY clause to sort the results in decreasing order.
4. Give the department name where maximum salary of instructor is greater than 50000.
• SELECT DISTINCT instructor.dep name FROM instructor GROUP BY instructor.dep name HAVING MAX(instructor.salary) >'50000' In SQL, when an aggregate function is compared to constant, like in this case maximum of salary is compared to 50000, then the query involves use of HAVING clause instead of a WHERE clause. Also, whenever HAVING is used, the results are supposed to be grouped by the attributes in the SELECT clause.
5. Give the department name where salary of instructor is greater than average of salary.
• Till now, we have seen cases in which an attribute associated to the value is mentioned in the natural language statement. In this case, we handle cases where attribute is not mentioned. We find out the most appropriate attribute for the given value. • SELECT DISTINCT instructor.name FROM instructor ORDER BY instructor.name ASC As seen in this example, there can be multiple ways of representing the same natural language statement.The system gives the same SQL query as the output when the same natural language statement is represented in different ways.
8. Insert a student whose id is 5, name is Josh, department name is Physics and credits are 150.
• INSERT INTO student ( student.ID, student.stud name, student.dep name, student.tot cred ) VALUES ( '5' ,'Josh' ,'Physics' ,'150' ) In addition to the data retrieval queries, our system also provides a natural language interface to insert data into the database. Other DML queries such as UPDATE and DELETE are also provided by the system.

Limitations
The following are some of the types of inputs that are not presently handled by our system.

Find the capacity of the classroom number 3128 in building Taylor
• SELECT * FROM classroom WHERE classroom.capacity = '3128' AND classroom.building = 'Taylor' In this particular example, the system fails to decide whether to take 'capacity of classroom' or 'classroom number' as an n-gram. Hence, the mapping fails.

Who teaches Physics?
• SELECT * FROM department WHERE department.dep name = 'Physics' In this example, the implicit query module of our system is able to map Physics to 'department name' attribute from table 'department'. But it fails to identify that 'who' refers to a person (an instructor).

Comparison and Conclusion
Similar existing systems: