Faculty of Science
B.Sc. Honours Thesis - Biology
2017

KROAK: A METADATA COLLECTION SYSTEM FOR LONG TERM MICROBIAL COMMUNITY MONITORING

by

LEE HAMMOND BERGSTRAND

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF BACHELOR OF SCIENCE (HONS.) in the Department of Biological Sciences (Biology)

This thesis has been accepted as conforming to the required standards by:

Jonathan D. Van Hamme (Ph.D.), Thesis Supervisor, Dept. Biological Sciences
Mila Kwiatkowska (Ph.D.), Examining Committee Member, Dept. Computing Science
Don Nelson (Ph.D.), Examining Committee Member, Dept. Biological Sciences

Dated this 31st day of May 2017, in Kamloops, British Columbia, Canada

© Lee Hammond Bergstrand, 2017

ABSTRACT

Amplytica is a start-up company whose software, the Amplytica Cloud Platform, helps organizations determine how microbes influence their bioprocesses. Examples of such bioprocesses include anaerobic digestion, wastewater treatment and mine site reclamation. The Amplytica Cloud Platform does this by integrating and analyzing metagenomically derived microbial community data (species composition, diversity, and abundance) with industrial bioprocess data (e.g. temperature, pH, nutrients). To achieve this integration, the industrial bioprocess data is treated as metadata to the microbial community information, describing the environmental conditions in which the microbial community is found. Capturing this industrial metadata requires a robust metadata capture system. Kroak is a metadata capture system for the Amplytica Cloud Platform that facilitates tagging per-sample microbial community information with industrial environmental metadata. It uses a modern web interface for easy deployment, Office Open XML Workbook (XLSX) template files for easy metadata capture, and metadata classes to ensure data consistency and type identification for follow-on automated statistics and machine learning. Kroak is a functional metadata capture system that will be iteratively improved upon by Amplytica. Potential improvements include changes to Kroak's data model, increased reliability of its metadata parsing, and expansion of its existing web application programming interface.

Thesis Supervisor: Associate Professor Jonathan Van Hamme

ACKNOWLEDGMENTS

I would like to thank Dr. Van Hamme for his continued support of my academic and entrepreneurial endeavors. I am also grateful to the members of my committee for their patience and support in overcoming the numerous obstacles that I have faced throughout my research. I would also like to thank Kamloops Innovation and Amplytica Inc. for their support of this research project.

TABLE OF CONTENTS

Abstract
Acknowledgments
Table of Contents
List of Figures
1. Background
2. Introduction
3. Design
   3.1. Data Model
      3.1.1. Project
      3.1.2. Sample
      3.1.3. Metadata and Datapoint
   3.2. Information Architecture
      3.2.1. Metadata Classes
      3.2.2. Data Flow
      3.2.3. Views
4. Implementation
   4.1. User Interface
   4.2. Web Server
   4.3. Metadata Template Generation
   4.4. Metadata Template Parsing
   4.5. Data Representation and Storage
5. Results
6. Discussion
   6.1. Static Pages and Asynchronous Client Data Loading
   6.2. Metadata Classes
   6.3. Per-sample Metadata Storage
   6.4. XLSX Files
   6.5. PostgreSQL
7. Conclusions and Future Directions
   7.1. Further Platform Integration
   7.2. Improved Data Model
   7.3. Improved Metadata Parsing
   7.4. Expanded Web Application Programming Interface (API)
   7.5. Creation of Automated Tests
Literature Cited
Appendix A

LIST OF FIGURES

Figure 1: An example of a Project object which contains multiple Samples.
Figure 2: An example of a Sample object which contains a Metadata object with multiple Datapoints.
Figure 3: An overview of the classes used in Kroak. Samples contain Metadata. Metadata is composed of Datapoints. Datapoints are members of metadata classes. A metadata class is a subset of data points across samples in a project which contain the same type of data.
Figure 4: The standard Kroak workflow consists of the following steps: 1) the user connects to Kroak, 2) existing metadata and metadata classes are pulled from the database, 3) they are then sent to the user, 4) the user then sends updated metadata classes to Kroak, 5) these, along with existing metadata, are used to create an XLSX template file, 6) this template is then sent to the user, 7) the user then adds metadata to the template, 8) the completed template is uploaded to Kroak, 9) finally, the metadata is parsed from the XLSX template and placed in the database.
Figure 5: Editing a template XLSX file in Microsoft Excel.
Figure 6: Kroak's user interface for view one, which allows end users to declare and modify metadata classes and view existing metadata.
Figure 7: Kroak's user interface for view two, which allows users to upload metadata templates.
Figure 8: Kroak uses metadata templates and web technologies to input per-sample metadata into a PostgreSQL database.
Figure 9: The representation of data points within a Sample object.
Figure 10: A) The current Kroak database schema. B) An improved Kroak database schema with new tables for metadata classes and data points.

1. BACKGROUND

In various industries, mixed microbial culture bioreactors are used to perform operations such as wastewater treatment (Henze et al. 2001), mineral extraction (Okibe et al. 2003) and biogas generation from biomass sources such as farm manure or sewage sludge (Mata-Alvarez, Mace & Llabres 2000). In these bioprocesses, various microbial populations either compete for resources or cooperate to achieve some metabolic outcome (Mata-Alvarez, Mace & Llabres 2000, Henze et al. 2001, Okibe et al. 2003).
Fostering desirable microbial communities is of the utmost importance, as the ratio of "good" to "bad" microbes in a reactor is what dictates bioprocess efficiency and, ultimately, cost (Chen, Cheng & Creamer 2008). The composition of these communities is driven primarily by the environmental conditions in the reactor vessel, such as temperature, pH, organics and available nutrients (Chen, Cheng & Creamer 2008).

Amplytica Inc. is a bioinformatics company that specializes in the development of software for exploring microbial community dynamics via the analysis of high-throughput sequencing data. The current rendition of this software is called the Amplytica Cloud Platform (ACP). A future commercial application of the ACP is the automated optimization of mixed culture bioreactors, such as those discussed above, through the integrated analysis of microbial community dynamics and bioprocess environmental conditions. Specifically, the goal is to allow an operator to routinely take samples for amplicon-based metagenomic sequencing while collecting process data such as bioreactor inputs (e.g. nutrients) and outputs (e.g. methane and H2S in the case of biogas generators). Such a dataset would then be loaded into the ACP. In the context of the ACP, this process data is considered metadata to the genomics sample and describes the environmental conditions when the sample was taken from the reactor. Inside the ACP, the amplicon-based metagenomic sequencing data is converted into microbial community composition, abundance and diversity using the QIIME microbial community analysis pipeline (Caporaso et al. 2010). When combined with process metadata, this microbial community information creates a multidimensional time-series dataset detailing shifts in reactor conditions and the resulting microbial community response. To predict optimal bioreactor conditions, the dataset would then be further processed by statistical and/or machine learning algorithms. For example, in anaerobic digestion this would be an optimization of reactor conditions for increased biogas production while maintaining low levels of contaminating gases such as H2S.

2. INTRODUCTION

The type of environmental metadata available for the above optimization analysis varies by bioprocess. For example, an operator of a tank-based gold tailings bioleaching reactor would collect different process data than an operator of a Biological Nutrient Removal (BNR) Wastewater Treatment Plant (WWTP). The former is concerned with how much gold is being leached, whereas the latter is concerned with how much excess nitrogen and phosphorus is being discharged into the environment. Additionally, different operators of the same bioprocess may not have the same sensors for their bioreactors and thus will be collecting different types of process metadata. Therefore, the ACP must be configurable to capture different types of metadata on a project-by-project basis.

A future goal of the ACP is for it to be able to automatically identify the type of statistical or machine learning technique to use for an optimization analysis depending on the data available. The types and requirements of these techniques have yet to be determined; however, early candidates include the Temporal Gaussian Process Model for Compositional Data Analysis (TGP-CODA) (Aijo, Mueller & Bonneau 2016) and Bayesian algorithms such as Bayesian Adaptive Penalized Counts Splines (BAPCS), Bayesian adaptive lasso (BAL) and Bayesian variable selection (BVS) as implemented in MDSINE (Bucci et al. 2016).
Nonetheless, retaining information about the type of environmental data captured, such as whether it is an integer number, a floating-point number or categorical, will be useful for the future development of the ACP.

Previous metadata capture systems for microbial community information have primarily focused on storing specific types of metadata. For example, Metazen (Bischof et al. 2014) only allows certain types of pre-approved metadata to be captured. Each category of metadata must be found in a controlled vocabulary from BioOntology (Whetzel et al. 2011). Integrating ontologically organized metadata into the ACP would be overly time-consuming and may limit the type of metadata which could be captured.

The subject of this thesis is Kroak, a metadata capture system for the ACP which addresses the concerns raised above. It is a lightweight system that facilitates tagging per-sample microbial community information with industrial environmental metadata to assist in the analysis of how microbial communities influence industrial processes. It uses a web interface for easy deployment, Office Open XML Workbook (XLSX) template files (ISO/IEC 29500-1:2016) for easy metadata capture, and metadata classes to ensure data consistency and type identification for follow-on automated statistics and machine learning.

3. DESIGN

3.1. DATA MODEL

Kroak's current data model consists of several object-oriented programming (OOP) classes (Cox 1986) which lay out a blueprint for the various objects used in the system, such as objects representing individual projects, samples, metadata, and data points. Henceforth, objects and classes representing projects, samples, metadata, and data points will be capitalized. For example, the word "Sample" refers to an object which represents a single physical sample which is being analyzed.

3.1.1. PROJECT

A Project represents a collection of samples which were put through a single bioinformatics analysis (Fig. 1). Each Project has a unique identifier and an optional human-readable name and description for easy manual identification. A Project contains one or more Samples (Fig. 1).

Figure 1: An example of a Project object which contains multiple Samples.

3.1.2. SAMPLE

A Sample represents a unique piece of genomic data along with its associated metadata taken during collection (Fig. 2). Like a Project, each Sample contains a unique identifier, name and description (Fig. 2). A Sample possesses both bioinformatics analysis results and metadata in the form of a single Metadata object (Fig. 2).

Figure 2: An example of a Sample object which contains a Metadata object with multiple Datapoints.

3.1.3. METADATA AND DATAPOINT

A Metadata object represents the collection of individual data points which belong to a single sample (Fig. 2). The Metadata object contains one or more Datapoint objects (Fig. 2). Each Datapoint represents a single piece of metadata for a given sample. It contains not only the value of the metadata but also its name, type and units. These three properties are referred to as the data point's metadata class. For example, the piece of metadata 13 °C would have a value of 13 and a metadata class consisting of the name "Temperature", type "Float", and unit "°C". Current metadata types include Floating point number ("Float"), Integer number ("Integer"), String and Date.
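To make the data model above concrete, the following is a minimal sketch using plain Python dataclasses. The class names mirror the Project, Sample, Metadata and Datapoint objects described in this section, but the exact attribute names and defaults are illustrative assumptions rather than Kroak's actual source code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Datapoint:
    """A single piece of metadata plus its metadata class (name, type, units)."""
    value: object  # e.g. 13.0
    name: str      # e.g. "Temperature"
    type: str      # e.g. "Float", "Integer", "String" or "Date"
    units: str     # e.g. "°C"


@dataclass
class Metadata:
    """All data points collected for one sample."""
    datapoints: List[Datapoint] = field(default_factory=list)


@dataclass
class Sample:
    """One physical sample: genomic results plus its Metadata object."""
    id: int
    name: str = ""
    description: str = ""
    metadata: Metadata = field(default_factory=Metadata)


@dataclass
class Project:
    """A collection of Samples put through a single bioinformatics analysis."""
    id: int
    name: str = ""
    description: str = ""
    samples: List[Sample] = field(default_factory=list)


# Example: the 13 °C temperature data point described above.
temperature = Datapoint(value=13.0, name="Temperature", type="Float", units="°C")
```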
3.2. INFORMATION ARCHITECTURE

3.2.1. METADATA CLASSES

Metadata classes are used in Kroak's user interface. Each data point logically belongs to a metadata class, which is shared between similar data points found across samples in a project (Fig. 3). For example, all temperature data points which have the same type and units would be in the same metadata class (Fig. 3). The aggregation of all unique metadata classes in a project is called a metadata class set. Changing a property of a metadata class iteratively changes that same property in each Datapoint belonging to that class. Properties which can be changed include the data point name, type and units.

Figure 3: An overview of the classes used in Kroak. Samples contain Metadata. Metadata is composed of Datapoints. Datapoints are members of metadata classes. A metadata class is a subset of data points across samples in a project which contain the same type of data.

3.2.2. DATA FLOW

Template files are used to add metadata to Samples. Adding metadata to a project takes multiple steps. First, a user creates or edits a set of metadata classes for a project (Fig. 4, Fig. 5). These classes define the properties of a project's metadata. The user then downloads an XLSX metadata template file which contains all pre-existing project metadata and metadata classes (Fig. 4). The user then adds metadata to the template file using spreadsheet software such as Microsoft Excel. The template is then saved and uploaded to Kroak, where it is parsed and the metadata inside is placed into a database for safekeeping (Fig. 4). This round trip is sketched from a client's perspective below.

Figure 4: The standard Kroak workflow consists of the following steps: 1) the user connects to Kroak, 2) existing metadata and metadata classes are pulled from the database, 3) they are then sent to the user, 4) the user then sends updated metadata classes to Kroak, 5) these, along with existing metadata, are used to create an XLSX template file, 6) this template is then sent to the user, 7) the user then adds metadata to the template, 8) the completed template is uploaded to Kroak, 9) finally, the metadata is parsed from the XLSX template and placed in the database.

Figure 5: Editing a template XLSX file in Microsoft Excel.
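The following is a minimal sketch of the workflow in Figure 4 from a client's point of view, written with the Python requests library. The endpoint names correspond to those described in Section 4.2, but the host name, URL paths and payload shapes shown here are assumptions for illustration, not Kroak's actual routes.

```python
import requests

BASE = "http://localhost:5000"  # assumed development host, not Kroak's deployment URL
PROJECT_ID = 1                  # hypothetical project identifier

# Steps 2-3: pull the project's existing metadata and metadata classes.
metadata = requests.get(f"{BASE}/get_metadata/{PROJECT_ID}").json()
classes = requests.get(f"{BASE}/get_metadata_classes/{PROJECT_ID}").json()

# Step 4: send back an updated set of metadata classes (payload shape is illustrative).
classes.append({"name": "Temperature", "type": "Float", "units": "°C"})
requests.post(f"{BASE}/set_metadata_classes/{PROJECT_ID}", json=classes)

# Steps 5-6: download the generated XLSX template.
template = requests.get(f"{BASE}/generate_template/{PROJECT_ID}")
with open("template.xlsx", "wb") as handle:
    handle.write(template.content)

# Step 7 happens outside this script: the user fills in template.xlsx in Excel.

# Steps 8-9: upload the completed template so it can be parsed and stored.
with open("template.xlsx", "rb") as handle:
    requests.post(f"{BASE}/upload_template/{PROJECT_ID}",
                  files={"file": ("template.xlsx", handle)})
```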
3.2.3. VIEWS

Kroak's web client consists of two views. One view consists of a page used for assigning and editing metadata classes and viewing changes to the project's metadata structure (Fig. 6). A second view is used for uploading metadata templates for specific projects via a drag-and-drop user interface (Fig. 7). These two views will henceforth be known as the metadata view and the upload view, respectively.

Figure 6: Kroak's user interface for view one, which allows end users to declare and modify metadata classes and view existing metadata.

Figure 7: Kroak's user interface for view two, which allows users to upload metadata templates.

4. IMPLEMENTATION

Kroak was developed using web technologies. A summary of the technologies used can be found in Supplementary Table ST1 in Appendix A.

Figure 8: Kroak uses metadata templates and web technologies to input per-sample metadata into a PostgreSQL database.

4.1. USER INTERFACE

The user interacts with Kroak via a web interface consisting of two HTML5 pages representing the views above (Fig. 6, Fig. 7) (Hickson & Hyatt 2011). In the metadata view, the project's current metadata, metadata classes and metadata types are loaded asynchronously from the web server after the initial web page load via Asynchronous JavaScript and XML (AJAX) (Mesbah & Van Deursen 2007). In the upload view, drag-and-drop file uploads are facilitated by Dropzone.js (Meno). Both views use jQuery (Bibeault & Kats 2008) and Bootstrap (Lerner 2012). The jQuery JavaScript library was used to make AJAX requests and perform other client-side scripting such as updating tables (Bibeault & Kats 2008). The Bootstrap 3.0 CSS framework was used to perform page styling and allows the user interface to be usable on both mobile and desktop browsers (Fig. 6, Fig. 7) (Lerner 2012).

4.2. WEB SERVER

The Kroak web server supplies both web pages and data to the client. It was written using the Flask web framework (Grinberg 2014) for the Python programming language (Van Rossum & Drake 2003). The web server consists of a series of endpoints that serve specific content to the client web browser when it sends a request for a specific URL (Richardson & Ruby 2008). In addition to an endpoint which provides each of the client pages mentioned above and an endpoint for selecting a project, the web server also provides six data endpoints. Three endpoints (get_metadata, get_metadata_classes, get_metadata_types) return JavaScript Object Notation (JSON) (Crockford 2006) formatted data, while one receives JSON (set_metadata_classes). Additionally, another endpoint sends the user an XLSX file (generate_template) while a final one accepts an XLSX file upload (upload_template). The get_metadata endpoint returns the entirety of all metadata for a given project. The get_metadata_classes endpoint returns a list of all metadata classes for a project. Additionally, the get_metadata_types endpoint returns a list of available metadata types. The set_metadata_classes endpoint accepts a list of updated metadata classes sent from the client. The generate_template endpoint generates a metadata template XLSX file from existing project metadata and metadata classes and sends it to the client (Fig. 4, Fig. 8). Finally, the upload_template endpoint receives this template XLSX file from the client (Fig. 4, Fig. 8).

4.3. METADATA TEMPLATE GENERATION

Existing metadata and metadata classes are integrated into an XLSX template file which is sent to the user as a file download (Fig. 4, Fig. 8). To generate this template, the system iterates through each sample's metadata and generates a combined XLSX file. This XLSX file is generated using the XLSXWriter Python library (McNamara 2017). In the resulting file, fields which should not be edited are shaded grey, whereas fields which can be edited are left white (Fig. 5).

4.4. METADATA TEMPLATE PARSING

After the user fills in the template file, it can be uploaded to the server via a drag-and-drop interface. Once the template file is uploaded, it is immediately parsed via the Pandas data analysis library (McKinney 2010) and turned into a data frame. Pandas interprets the appropriate data type for each data point in the XLSX file in a heuristic fashion using the xlrd library (Machin 2013, IO Tools). Each row in the data frame (all data points for a given sample) is converted to JSON and stored in each Sample.

4.5. DATA REPRESENTATION AND STORAGE

The SQLAlchemy Object Relational Mapper (ORM) (Bayer 2012) was used to map in-memory objects from the data model to individual relational database tables for long-term storage. Project and Sample objects are mapped to the database, whereas Metadata and Datapoint objects are instantiated dynamically from data contained in a parent Sample object. Each set of data points for a given sample is stored as semi-structured JSON in the parent Sample object (Fig. 9). A sketch of how such a mapping might look is given below.
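The following is a minimal sketch of this kind of mapping declared with SQLAlchemy and its PostgreSQL JSONB column type. The table names, column names and connection string are illustrative assumptions rather than Kroak's actual schema; the point is the pattern of a Sample row carrying all of its data points as a single JSONB document.

```python
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.dialects.postgresql import JSONB
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()


class Project(Base):
    __tablename__ = "projects"  # illustrative table name
    id = Column(Integer, primary_key=True)
    name = Column(String)
    description = Column(String)
    samples = relationship("Sample", back_populates="project")


class Sample(Base):
    __tablename__ = "samples"  # illustrative table name
    id = Column(Integer, primary_key=True)
    name = Column(String)
    description = Column(String)
    project_id = Column(Integer, ForeignKey("projects.id"))
    project = relationship("Project", back_populates="samples")
    # All of the sample's data points, stored as one semi-structured JSONB document.
    metadata_json = Column(JSONB)


# Example: store a temperature data point, including its metadata class, inside a Sample.
engine = create_engine("postgresql://user:password@localhost/kroak")  # assumed connection string
Base.metadata.create_all(engine)
with Session(engine) as session:
    sample = Sample(
        name="Reactor 1, day 14",
        metadata_json=[{"name": "Temperature", "type": "Float", "units": "°C", "value": 13.0}],
    )
    session.add(sample)
    session.commit()
```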
When a data point needs to be accessed, for example to change its value, it is converted to a Datapoint object dynamically from this JSON (Fig. 9). After modification, the Datapoint is converted back to JSON and stored once again in its parent Sample. This metadata JSON is mapped to a JSONB (Lerner 2014) column in a PostgreSQL database (Momjian 2001), along with the parent sample's other data, via SQLAlchemy.

Figure 9: The representation of data points within a Sample object.

5. RESULTS

Kroak was tested with simulated metadata which included Floating point numbers, Integer numbers, Strings and Dates. It was found that Kroak stored this dataset properly and generated metadata templates which were defect-free.

6. DISCUSSION

Several decisions were made during the development of Kroak. These include the use of asynchronous web technologies, metadata classes, per-sample metadata storage, XLSX template files, and the PostgreSQL database. The reasons for these design choices are found below.

6.1. STATIC PAGES AND ASYNCHRONOUS CLIENT DATA LOADING

The web client is written as a static web page which dynamically requests data from the server after initial page loading (Garrett 2005, Mesbah & Van Deursen 2007). This contrasts with traditional methods where the server injects content into the page before it is sent to the user. Having all rendering performed in the client's browser has several advantages (Mesbah & Van Deursen 2007). Firstly, it allows the initial static client to be deployed via a Content Delivery Network (CDN). CDNs are web services which cache web content across a global network of servers on different continents (Leighton & Lewin 2003). When a user requests a page from the website, it is loaded from the geographically nearest server, which drastically decreases page load times (Leighton & Lewin 2003). Examples of CDNs are Cloudflare (The Web Performance & Security Company) and Amazon CloudFront (Amazon CloudFront – Content Delivery Network (CDN)). It also pushes the cost of rendering the page to the client, allowing more cost-effective server hardware to be used (Mesbah & Van Deursen 2007). From a development perspective, having a client page which is static allows the backend server to be completely rewritten and replaced without an effect on the client (Mesbah & Van Deursen 2007). This makes it easier for the client web application to be tested independently of the server web application. For example, during automated testing, a mock web server which provides the same content every time and does not require a database can be used as a surrogate backend (Humble & Farley 2010, Taneja, Zhang & Xie 2010). This allows testing to be faster, as less setup time is required (Humble & Farley 2010, Taneja, Zhang & Xie 2010). Additionally, since the server application itself is just a series of endpoints, it can also be tested independently during automated testing (Mesbah & Van Deursen 2009).

6.2. METADATA CLASSES

The use of metadata classes is especially important for follow-on operations where knowing the type of each data point is essential, for example in situations where statistical operations can only accept certain types of data, such as categorical or numerical values. Knowing the type of the available data would allow automated systems to determine what type of statistical test to use rather than relying on the end user to select the appropriate test, as sketched below.
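As a simple illustration of how metadata class types could drive automated analysis selection, the sketch below chooses between broad categories of analysis based on a data point's declared type. The function name, type strings and returned analysis labels are assumptions for illustration; Kroak does not currently implement this dispatch.

```python
# Hypothetical dispatch of an analysis based on a metadata class's declared type.
NUMERICAL_TYPES = {"Float", "Integer"}
CATEGORICAL_TYPES = {"String"}


def choose_analysis(metadata_class: dict) -> str:
    """Return the kind of analysis suited to a metadata class (illustrative only)."""
    if metadata_class["type"] in NUMERICAL_TYPES:
        return "numerical analysis (e.g. correlation with community composition)"
    if metadata_class["type"] in CATEGORICAL_TYPES:
        return "categorical analysis (e.g. group comparison between categories)"
    if metadata_class["type"] == "Date":
        return "time-series analysis"
    return "manual review"


# Example usage with the temperature metadata class from Section 3.1.3.
print(choose_analysis({"name": "Temperature", "type": "Float", "units": "°C"}))
```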
Such type-aware analysis selection will be useful once Kroak is fully integrated into the rest of the Amplytica Cloud Platform.

6.3. PER-SAMPLE METADATA STORAGE

Storing metadata at the sample rather than the project level was done to make it easier for only subsets of the data to be used at one time. Potentially, this will allow projects to be merged and analyses to be performed on data points, with common metadata classes, originating from different projects.

6.4. XLSX FILES

The Office Open XML Workbook format is a common spreadsheet format created by Microsoft and was chosen for a variety of reasons. Internally, it stores data in Extensible Markup Language (XML) format and is thus parsable by third-party spreadsheet programs other than Microsoft Excel (Kosek 2008). It has been standardized by the International Organization for Standardization / International Electrotechnical Commission (ISO/IEC 29500-1:2016) and the European Computer Manufacturers Association (ECMA-376). A benefit of using this format is that Excel is commonly used in both academia and industry. For example, the author performed a small survey in which two microbial ecology researchers were asked what their preferred method would be for inputting metadata into the ACP. Both said that they would prefer to upload the metadata in spreadsheet form, preferably as an Excel file. Additionally, existing data sources are often already found in XLSX format and can be copy-pasted into a Kroak metadata template. XLSX has several advantages over more open spreadsheet file formats such as Comma-Separated Values (CSV) (Shafranovich 2005). For example, XLSX provides facilities for colour coding cells and has data protection features which can be used to protect tables and lock cells to prevent unnecessary changes to the sheet's structure (Harvey 2016). XLSX can also be used to enforce data stringency via the ability for cells to perform data validation (Harvey 2016). Other than colour coding, these features are not currently implemented but may be useful in the future.
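As a sketch of how such a template could be produced with XLSXWriter, the example below writes a header row of metadata class labels, shades and locks the non-editable cells, and attaches a simple data-validation rule. This mirrors the template generation described in Section 4.3 and the protection and validation features discussed above, but the column layout, sample names and validation range are assumptions rather than Kroak's actual template code.

```python
import xlsxwriter

# Hypothetical layout: column A holds sample names (not editable),
# column B holds values for a "Temperature (°C)" metadata class (editable).
workbook = xlsxwriter.Workbook("metadata_template.xlsx")
worksheet = workbook.add_worksheet("Metadata")

locked_grey = workbook.add_format({"bg_color": "#D9D9D9", "locked": True})
editable = workbook.add_format({"locked": False})

# Header row and pre-existing sample names are shaded grey and locked.
worksheet.write(0, 0, "Sample", locked_grey)
worksheet.write(0, 1, "Temperature (°C) [Float]", locked_grey)
for row, sample_name in enumerate(["Reactor 1, day 1", "Reactor 1, day 14"], start=1):
    worksheet.write(row, 0, sample_name, locked_grey)

# The value column is left white and unlocked so the user can fill it in.
worksheet.set_column("B:B", 24, editable)

# Optional data validation: only accept plausible temperatures (illustrative range).
worksheet.data_validation("B2:B100", {
    "validate": "decimal",
    "criteria": "between",
    "minimum": -50,
    "maximum": 150,
    "error_message": "Temperature must be a number between -50 and 150.",
})

# Protect the sheet so that only unlocked cells can be edited.
worksheet.protect()
workbook.close()
```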
6.5. POSTGRESQL

PostgreSQL was chosen as the backing store for Kroak primarily because it is a free and open source database management system (DBMS) that can be hosted on one's own servers with no licensing fees (Momjian 2001). Alternatively, various providers, such as Heroku (Middleton & Schneeman 2013), provide PostgreSQL as a remotely hosted Database as a Service (DaaS) (Hacigumus, Iyer, & Mehrotra 2002) for a low or even no fee. Additionally, PostgreSQL can act as a document database with the ability to store and query semi-structured data stored in JSON columns (Lerner 2014). In Kroak, metadata is currently stored in a JSONB column, a special column type where JSON strings are stored in a binary format (Fig. 8, Fig. 9) (Lerner 2014). The data within the JSON is parsed by data type. For example, floating point numbers in the JSON will be parsed into a binary float value and stored in the database. This saves significant database space versus storing the JSON in a purely textual column (Lerner 2014).

7. CONCLUSIONS AND FUTURE DIRECTIONS

Although Kroak is a functional metadata capture system, several opportunities for improvement and future work items were identified during its development and are detailed below.

7.1. FURTHER PLATFORM INTEGRATION

Kroak is intended to be the metadata capture component of the Amplytica Cloud Platform and will be integrated soon. The ACP currently contains endpoints for handling the creation, retrieval, update and deletion (CRUD) of User, Project and Sample objects. The Project and Sample classes from Kroak overlap with those of the ACP's current data model. The two models could be merged into a unified data model, allowing other ACP components to access per-sample metadata.

7.2. IMPROVED DATA MODEL

In the current implementation, each sample is stored in its own database row and all data points of a sample are stored in a single JSON cell (Fig. 8, Fig. 9). Each data point in the JSON contains not only its value but also its metadata class information, such as its name, type and units (Fig. 9). If one makes a mistake in the creation of a metadata class, for example using Fahrenheit rather than Celsius as the class units, and goes to correct this mistake, then Kroak currently must load each sample's metadata from the database. The JSON inside each Sample needs to be parsed and instantiated into Datapoint objects. These objects need to be searched to determine which ones belong to the metadata classes that need to be changed. Then only the objects that belong to that class need to be updated. Afterwards, the whole process is reversed by converting these Datapoint objects back to JSON and saving them back to their respective Sample's metadata column in the database. This is quite inefficient. This problem results from storing each data point's metadata class information alongside its actual value. It could be solved by normalizing the database schema further (Fig. 10) (Beeri, Bernstein, & Goodman 1978). A more efficient solution would be to add a separate Datapoint table and Metadata_Class table (Fig. 10). This would allow for the storage of metadata class information independent of each data point. Each Datapoint object would have a link to the Metadata_Class and Sample with which it is associated (Fig. 10). A Metadata_Class is linked to all data points across all samples which possess its class (Fig. 10). Since metadata classes are linked rather than being stored inside each data point, a single Metadata_Class object can be changed to change the properties of multiple data points across a project.

Figure 10: A) The current Kroak database schema. B) An improved Kroak database schema with new tables for metadata classes and data points.

7.3. IMPROVED METADATA PARSING

Pandas is used to parse metadata; however, some data types, such as dates, are simply parsed to a number rather than to a native Python object. The type information found in a data point's metadata class could be used to assist in the parsing of an uploaded template file's metadata. For example, if a data point has a type of "Date", then this information could be used to parse the date in the template file into an object of Python's native date class (Hellmann 2011). Improved parsing and the use of native Python objects could be helpful during the future development of statistical analysis components.

7.4. EXPANDED WEB APPLICATION PROGRAMMING INTERFACE (API)

Currently, Kroak's web Application Programming Interface (API) has facilities for setting and getting metadata classes, sending metadata types, retrieving project metadata, template creation and template upload. Additional endpoints could be created for CRUD operations on individual data points. Other components of the ACP have endpoints for project and sample CRUD. Together, this expanded set of endpoints would create an API for manipulating all data components; a sketch of one such additional endpoint is given below.
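The following is a minimal Flask sketch of what one such additional endpoint might look like: a route that updates the value of a single data point within a sample's metadata. The route path, payload shape and in-memory stand-in for the database are hypothetical and are not part of Kroak's current API.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Stand-in for the database: sample id -> list of data points (hypothetical shape).
SAMPLE_METADATA = {
    1: [{"name": "Temperature", "type": "Float", "units": "°C", "value": 13.0}],
}


@app.route("/samples/<int:sample_id>/datapoints/<name>", methods=["PUT"])
def update_datapoint(sample_id, name):
    """Update the value of a single named data point for one sample (illustrative only)."""
    datapoints = SAMPLE_METADATA.get(sample_id)
    if datapoints is None:
        return jsonify({"error": "unknown sample"}), 404
    for datapoint in datapoints:
        if datapoint["name"] == name:
            datapoint["value"] = request.get_json()["value"]
            return jsonify(datapoint), 200
    return jsonify({"error": "unknown data point"}), 404


if __name__ == "__main__":
    app.run(debug=True)
```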
Not only could this expanded API be used in a web development context, but it could also be used to collect sensor data. Many bioreactors have built-in sensors for measuring values such as O2 concentration (Bambot et al. 1994), pH (Jeevarajan et al. 2002) and temperature. If these sensors are connected to the internet, they could potentially be used to send data in real time to Kroak via the above expanded API over HTTP (Fielding et al. 1999).

7.5. CREATION OF AUTOMATED TESTS

The creation and collection of metadata is complex and has many opportunities for errors. A suite of automated tests should be created to test both template creation and parsing. These tests would be performed using various types of metadata to ensure that they are all compatible with the various operations performed by the codebase and that no errors or mistakes occur.

LITERATURE CITED

Aijo, T., Mueller, C. L., & Bonneau, R. (2016). Temporal probabilistic modeling of bacterial compositions derived from 16S rRNA sequencing. bioRxiv, 076836.

Amazon CloudFront – Content Delivery Network (CDN). (n.d.). Retrieved April 28, 2017, from https://aws.amazon.com/cloudfront/

Bambot, S. B., Holavanahali, R., Lakowicz, J. R., Carter, G. M., & Rao, G. (1994). Phase fluorometric sterilizable optical oxygen sensor. Biotechnology and Bioengineering, 43(11), 1139-1145.

Bayer, M. (2012). SQLAlchemy: the database toolkit for Python. Retrieved from http://www.sqlalchemy.org/ (accessed 13 November).

Beeri, C., Bernstein, P. A., & Goodman, N. (1978). A sophisticate's introduction to database normalization theory. In Proceedings of the Fourth International Conference on Very Large Data Bases (pp. 113-124). VLDB Endowment.

Bibeault, B., & Kats, Y. (2008). jQuery in Action. Dreamtech Press.

Bischof, J., Harrison, T., Paczian, T., Glass, E., Wilke, A., & Meyer, F. (2014). Metazen: metadata capture for metagenomes. Standards in Genomic Sciences, 9(1), 18.

Bucci, V., Tzen, B., Li, N., Simmons, M., Tanoue, T., Bogart, E., ... & Olle, B. (2016). MDSINE: Microbial Dynamical Systems INference Engine for microbiome time-series analyses. Genome Biology, 17(1), 121.

Caporaso, J. G., Kuczynski, J., Stombaugh, J., Bittinger, K., Bushman, F. D., Costello, E. K., ... & Huttley, G. A. (2010). QIIME allows analysis of high-throughput community sequencing data. Nature Methods, 7(5), 335-336.

Chen, Y., Cheng, J. J., & Creamer, K. S. (2008). Inhibition of anaerobic digestion process: a review. Bioresource Technology, 99(10), 4044-4064.

Cox, B. (1986). Object-oriented programming: an evolutionary approach.

Crockford, D. (2006). The application/json media type for JavaScript Object Notation (JSON). Internet Engineering Task Force Request for Comments 4627.

Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., Leach, P., & Berners-Lee, T. (1999). Hypertext Transfer Protocol -- HTTP/1.1 (RFC 2616).

Garrett, J. J. (2005). Ajax: A new approach to web applications. Adaptive Path.

Grinberg, M. (2014). Flask web development: developing web applications with Python. O'Reilly Media.

Hacigumus, H., Iyer, B., & Mehrotra, S. (2002). Providing database as a service. In Proceedings of the 18th International Conference on Data Engineering (pp. 29-38). IEEE.

Harvey, G. (2016). Excel 2016 All-in-One For Dummies. Hoboken, NJ: For Dummies, a Wiley brand.

Hellmann, D. (2011). The Python Standard Library by Example. Addison-Wesley Professional.
Henze, M., Harremoes, P., la Cour Jansen, J., & Arvin, E. (2001). Wastewater treatment: biological and chemical processes. Springer Science & Business Media.

Hickson, I., & Hyatt, D. (2011). HTML5. W3C Working Draft WD-html5-20110525.

Humble, J., & Farley, D. (2010). Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation. Pearson Education.

Ecma International. (2011). Standard ECMA-376: Office Open XML File Formats.

IO Tools (Text, CSV, HDF5, ...). (n.d.). Retrieved April 28, 2017, from http://pandas.pydata.org/pandas-docs/stable/io.html#excel-files

ISO/IEC. (2016). ISO/IEC 29500-1:2016 Information technology -- Document description and processing languages -- Office Open XML File Formats -- Part 1: Fundamentals and Markup Language Reference. Geneva, Switzerland: ISO/IEC.

Jeevarajan, A. S., Vani, S., Taylor, T. D., & Anderson, M. M. (2002). Continuous pH monitoring in a perfused bioreactor system using an optical pH sensor. Biotechnology and Bioengineering, 78(4), 467-472.

Kosek, J. (2008). From the office document format battlefield. IT Professional, 10(3).

Leighton, F. T., & Lewin, D. M. (2003). U.S. Patent No. 6,553,413. Washington, DC: U.S. Patent and Trademark Office.

Lerner, R. M. (2012). At the forge: Twitter Bootstrap. Linux Journal, 2012(218), 6.

Lerner, R. M. (2014). At the forge: PostgreSQL, the NoSQL database. Linux Journal, 2014(247), 5.

Machin, J. (2013, April). xlrd 0.9.0. Python Package Index [Online]. Available: https://pypi.python.org/pypi/xlrd [Accessed 5 March 2013].

Mata-Alvarez, J., Mace, S., & Llabres, P. (2000). Anaerobic digestion of organic solid wastes. An overview of research achievements and perspectives. Bioresource Technology, 74(1), 3-16.

McKinney, W. (2010, June). Data structures for statistical computing in Python. In S. van der Walt & J. Millman (Eds.), Proceedings of the 9th Python in Science Conference (Vol. 445, pp. 51-56).

McNamara, J. (n.d.). Creating Excel files with Python and XlsxWriter. Retrieved May 01, 2017, from http://xlsxwriter.readthedocs.io/

Meno, M. (n.d.). Dropzone.js. Retrieved May 01, 2017, from http://www.dropzonejs.com/

Mesbah, A., & Van Deursen, A. (2007, March). Migrating multi-page web applications to single-page Ajax interfaces. In Software Maintenance and Reengineering, 2007. CSMR'07. 11th European Conference on (pp. 181-190). IEEE.

Mesbah, A., & Van Deursen, A. (2009, May). Invariant-based automatic testing of AJAX user interfaces. In Software Engineering, 2009. ICSE 2009. IEEE 31st International Conference on (pp. 210-220). IEEE.

Middleton, N., & Schneeman, R. (2013). Heroku: Up and Running. O'Reilly Media.

Momjian, B. (2001). PostgreSQL: introduction and concepts (Vol. 192). New York: Addison-Wesley.

Okibe, N., Gericke, M., Hallberg, K. B., & Johnson, D. B. (2003). Enumeration and characterization of acidophilic microorganisms isolated from a pilot plant stirred-tank bioleaching operation. Applied and Environmental Microbiology, 69(4), 1936-1943.

Richardson, L., & Ruby, S. (2008). RESTful web services. O'Reilly Media.

Shafranovich, Y. (2005). Common format and MIME type for comma-separated values (CSV) files. Internet Engineering Task Force Request for Comments 4180.

Taneja, K., Zhang, Y., & Xie, T. (2010, September). MODA: Automated test generation for database applications via mock objects. In Proceedings of the IEEE/ACM International Conference on Automated Software Engineering (pp. 289-292). ACM.
The Web Performance & Security Company. (n.d.). Retrieved April 28, 2017, from https://www.cloudflare.com/

Van Rossum, G., & Drake, F. L. (2003). Python language reference manual (p. 144). Network Theory.

Whetzel, P. L., Noy, N. F., Shah, N. H., Alexander, P. R., Nyulas, C., Tudorache, T., & Musen, M. A. (2011). BioPortal: enhanced functionality via new Web services from the National Center for Biomedical Ontology to access and use ontologies in software applications. Nucleic Acids Research, 39(suppl 2), W541-W545.

APPENDIX A

Supplementary Table ST1: A summary table of all software packages and technologies used to create Kroak, their roles, and citations.

HTML5: HTML (HyperText Markup Language) describes and defines the content of a webpage. Citation: Hickson & Hyatt 2011. Website: https://www.w3.org/TR/html5/

AJAX: Ajax (Asynchronous JavaScript and XML) is a set of Web development techniques using many Web technologies on the client side to create asynchronous Web applications. Citation: Mesbah & Van Deursen 2007. Website: N/A

Dropzone.js: DropzoneJS is an open source library that provides drag'n'drop file uploads with image previews. Citation: Meno. Website: http://www.dropzonejs.com

jQuery: jQuery is a fast, small, and feature-rich JavaScript library. It makes things like HTML document traversal and manipulation, event handling, animation, and Ajax much simpler with an easy-to-use API that works across a multitude of browsers. Citation: Bibeault & Kats 2008. Website: https://jquery.com

Bootstrap 3.0: Bootstrap is the most popular HTML, CSS, and JS framework for developing responsive, mobile-first projects on the web. Citation: Lerner 2012. Website: http://getbootstrap.com

Python 3: Python is a widely used high-level programming language for general-purpose programming, created by Guido van Rossum and first released in 1991. Citation: Van Rossum & Drake 2003. Website: https://www.python.org

Flask: Flask is a micro web framework written in Python. Citation: Grinberg 2014. Website: http://flask.pocoo.org

JSON: JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write, and easy for machines to parse and generate. It is based on a subset of the JavaScript Programming Language. Citation: Crockford 2006. Website: http://www.json.org

XLSXWriter: XlsxWriter is a Python module for creating Excel XLSX files. Citation: McNamara. Website: http://xlsxwriter.readthedocs.io

Pandas: pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. Citation: McKinney 2010. Website: http://pandas.pydata.org

xlrd: A Python module for extracting data from Excel spreadsheet files. Citation: Machin 2013. Website: https://pypi.python.org/pypi/xlrd

SQLAlchemy: SQLAlchemy is the Python SQL toolkit and Object Relational Mapper that gives application developers the full power and flexibility of SQL. Citation: Bayer 2012. Website: https://www.sqlalchemy.org

PostgreSQL: PostgreSQL, often simply Postgres, is a free and open source object-relational database management system (ORDBMS). Citation: Momjian 2001. Website: https://www.postgresql.org

JSONB: JSONB ("jsonb") is a column type of PostgreSQL. Rather than being stored as text, the JSON in a jsonb column is stored in a decomposed binary format that makes it slightly slower to input due to added conversion overhead, but significantly faster to process, since no reparsing is needed. jsonb also supports indexing, which can be a significant advantage. Citation: Lerner 2014. Website: https://www.postgresql.org/docs/9.4/static/datatype-json.html