Mining HIV-1 information from literature
Lim, Clarence Jia Xian
Date of Issue: 2014
School of Computer Engineering
The HIV-1 virus frequently mutates to increase its resistance against certain drugs. These mutations are partly due to histone modifications in the patient's genome. Information on histone modifications is not easily accessible. Online databases contain a large number of documents about histone modifications, but retrieving them manually is very time-consuming for biologists. This project therefore attempts to automate the retrieval of this information from the databases and integrate it into a single source for ease of access. The program consists of several components that together build the information source. The Document Collection System is the first component; it collects documents and abstracts from the online databases and cleans them for the next stage. TEES is the next component; it takes in the cleaned documents and extracts the proteins and histone modification events from them. The TEEStoCSV Convertor program takes the output of TEES and converts the data in each file into CSV format. The Histone Events Compilation program combines the individual CSV files into one overall CSV file and filters out the invalid histones. The Sampling Program takes the overall CSV file and randomly selects 100 samples for the verification process. The Normalization Program takes the overall CSV file and normalizes the terms for the visualization program, Graphviz. The GeneToUniprot program takes the overall CSV file and converts the gene names to Swiss-Prot IDs. Lastly, the XML Constructor program combines the output of the GeneToUniprot program with an extracted histone file to construct the XML file. The overall design architecture uses a pipe-and-filter style to allow extensibility and ease of modification of individual components. The verification results were satisfactory overall, as more than half of the samples were correct. Some of the error types found could also be resolved.
The final result of the program is an XML file, which allows the information to be easily distributed and accessed. The project also recommends improving the TEES system's event detection to increase the quality of the results.
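As an illustration only, the compilation and sampling stages described in the abstract could be sketched as two small filters in a pipe-and-filter chain. This is not the project's actual code: the column name `histone`, the validity pattern, and the function names `compile_events` and `sample_events` are assumptions made for this example, and all input files are assumed to share the same header row.

```python
import csv
import random
import re

# Hypothetical validity rule (an assumption, not the project's actual filter):
# accept only the histone families H1, H2A, H2B, H3 and H4.
VALID_HISTONE = re.compile(r"^H(1|2A|2B|3|4)$", re.IGNORECASE)


def compile_events(csv_paths, out_path, histone_column="histone"):
    """Merge per-document CSV files into one file, dropping invalid histones.

    Assumes every input file has the same header row.
    """
    rows = []
    fieldnames = None
    for path in csv_paths:
        with open(path, newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            if fieldnames is None:
                fieldnames = reader.fieldnames
            for row in reader:
                if VALID_HISTONE.match(row[histone_column].strip()):
                    rows.append(row)

    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames or [histone_column])
        writer.writeheader()
        writer.writerows(rows)
    return rows


def sample_events(rows, k=100, seed=0):
    """Pick up to k rows at random for manual verification.

    A fixed seed keeps the sample reproducible between runs.
    """
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))
```

Because each function only reads rows in and hands rows on, either stage can be replaced or reordered without touching the other, which is the property the pipe-and-filter design is meant to provide.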
DRNTU::Engineering::Computer science and engineering::Information systems::Information storage and retrieval
Final Year Project (FYP)
Nanyang Technological University