A semantic big biodiversity data integration tool
Our planet is facing severe effects of global climate change that threaten the survival of biodiversity. Biodiversity data exhibit the complex characteristics of Big data: high volume, variety, veracity, velocity, and value. The variety, or heterogeneity, of biodiversity data poses a challenging research problem, since the data are unstructured, semi-structured, or quasi-structured and are produced as XML, EML, Excel sheets, videos, images, or ontologies. In addition, the available biodiversity data include trait measurements, species distributions, species morphology, genetic sequences, phylogenetic trees, spatial data, and ecological niches; these data are collected and uploaded to Bio Portals by citizen scientists, museum collections, ecological surveys, and environmental studies. These data collections generate big data, which is an important topic of current research. The first phase of the Big data analytics life cycle establishes whether the data are sufficient for the analytics process, which takes more time than any other phase. In addition, the Big biodiversity data management life cycle includes data integration as a main phase, which affects storage, indexing, and querying. In the data integration phase, we apply semantic data integration, which relies on semantic technologies to combine data from different sources and consolidate them into valuable information. Several research efforts have addressed semantic big data integration. For example, Ontology-Based Data Access (OBDA) has been proposed for relational and NoSQL databases [1,2], since it provides a semantic conceptual schema over a data repository. Another example is the Semantic Extract-Transform-Load (ETL) framework, which uses semantic technologies to integrate data from multiple sources and publish them as linked open data. Moreover, a semantic MongoDB-based approach has been developed that relies on an OWL ontology representation.
However, the growth of biodiversity big data makes semantic big data integration tools increasingly necessary. In the current work, a semantic big data integration system is developed that addresses the following features: 1) data heterogeneity, 2) NoSQL databases, 3) ontology-based integration, and 4) user interaction, where the data integration components can be chosen. A proof of concept will be developed on biodiversity data in various formats. In addition, related ontologies from BioPortal will be used.
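
To make the idea of ontology-based integration over heterogeneous sources concrete, the following is a minimal sketch and not the system described above. It assumes two hypothetical sources (a CSV survey file and an XML museum export) and hypothetical field-to-term mappings; the Darwin Core term IRIs are used only as an example vocabulary, and the function names `records_from_csv`, `records_from_xml`, and `integrate` are illustrative.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Example vocabulary (assumption): Darwin Core terms namespace.
DWC = "http://rs.tdwg.org/dwc/terms/"

# Hypothetical per-source mappings from local field names to ontology terms.
CSV_MAPPING = {"species": DWC + "scientificName", "lat": DWC + "decimalLatitude"}
XML_MAPPING = {"taxon": DWC + "scientificName", "latitude": DWC + "decimalLatitude"}


def records_from_csv(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def records_from_xml(text):
    """Parse XML text whose root holds one child element per record."""
    root = ET.fromstring(text)
    return [{field.tag: (field.text or "").strip() for field in rec} for rec in root]


def integrate(records, mapping, source):
    """Rewrite source records as (subject, ontology term, value) triples."""
    triples = []
    for i, record in enumerate(records):
        subject = f"{source}/record/{i}"
        for field, value in record.items():
            term = mapping.get(field)
            if term is not None:  # unmapped fields are skipped
                triples.append((subject, term, value))
    return triples


# Illustrative inputs (made up for this sketch).
csv_text = "species,lat\nQuercus robur,51.5\n"
xml_text = (
    "<records><record><taxon>Fagus sylvatica</taxon>"
    "<latitude>48.1</latitude></record></records>"
)

triples = (
    integrate(records_from_csv(csv_text), CSV_MAPPING, "survey")
    + integrate(records_from_xml(xml_text), XML_MAPPING, "museum")
)
for triple in triples:
    print(triple)
```

Once both sources are expressed with the same ontology terms, a single query over `triples` (for example, all values of `scientificName`) covers records that originally used different field names and formats.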