Faculty of Science

CONNECTING THE DOTS: WORKING TOWARDS THE METABOLOMIC PROFILE OF GRINDELIA SQUARROSA

2017 | JASON MATHIUS MCFARLANE
B.Sc. Honours thesis – Chemical Biology

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF BACHELOR OF SCIENCE (HONS.) in the DEPARTMENTS OF BIOLOGICAL AND PHYSICAL SCIENCES (Chemical Biology)

This thesis has been accepted as conforming to the required standards by:
Bruno Cinel (Ph.D.), Thesis Supervisor, Dept. Physical Sciences
Donald Nelson (Ph.D.), Co-supervisor, Dept. Biological Sciences
Jonathan Van Hamme (Ph.D.), Examining Committee member, Dept. Biological Sciences

Dated this 28th day of April, 2016, in Kamloops, British Columbia, Canada

© Jason Mathius McFarlane, 2017

ABSTRACT

Novel metabolomics methods using the NMR (nuclear magnetic resonance) spectrometer at Thompson Rivers University were developed. This will lead to better utilization of the NMR spectrometer by opening up new applications of this powerful instrument. The method was applied to three different samples: a simple four-compound mixture, a previously analyzed Escherichia coli lysate, and an extract of Grindelia squarrosa, which has not previously been analyzed using a metabolomics approach. Querying the HSQC peak chemical shifts of the fractionated crude extracts against the Biological Magnetic Resonance Data Bank (BMRB) returned 597 compounds. Results for the four-compound mixture were visualized as raw two-dimensional spectra, showing resolution of the peaks. The E. coli lysate provided invaluable insight into the necessity of high sensitivity as well as resolution in metabolomics. The future directions of this project are discussed, outlining the power of this technique for different mixtures and refining the experimental setup to reduce the reliance on the currently available databases.

Thesis supervisors: Dr. Bruno Cinel and Dr.
Donald Nelson

ACKNOWLEDGEMENTS

Thank you to my Honours supervisors, Dr. Bruno Cinel and Dr. Don Nelson, for their guidance and support. I would also like to acknowledge Dr. Jon Van Hamme for taking time out of his busy schedule to evaluate my thesis. Funding was provided by the Undergraduate Research Enhancement Award Program (UREAP).

TABLE OF CONTENTS

Abstract
Acknowledgements
Table of Contents
List of Figures
List of Tables
List of Acronyms
1 Introduction
  1.1 Metabolomics
    1.1.1 NMR-based metabolomics
  1.2 Drug discovery
  1.3 Grindelia squarrosa as a source of natural products
2 Materials and Methods
  2.1 Composite mixture
    2.1.1 Sample preparation
    2.1.2 NMR acquisition parameters
  2.2 E. coli lysate preparation
    2.2.1 Growth curve
    2.2.2 Sample preparation
    2.2.3 NMR acquisition parameters
  2.3 Grindelia squarrosa
    2.3.1 Sample collection
    2.3.2 Sample preparation
    2.3.3 Sample fractionation
    2.3.4 NMR acquisition parameters
  2.4 Computational processing
    2.4.1 Peak deconvolution
    2.4.2 Database querying
    2.4.3 Constructing compound maps
3 Results and Discussion
  3.1 Pulse sequences
  3.2 Composite mixture
  3.3 E. coli lysate
    3.3.1 Growth curve
    3.3.2 NMR spectra
  3.4 Grindelia squarrosa
    3.4.1 Crude samples
    3.4.2 Fractionated samples
4 Conclusions and Future Work
  4.1 Complex mixture
  4.2 E. coli lysate
  4.3 G. squarrosa extracts
  4.4 The direction of bioinformatics
  4.5 Concluding remarks
Literature Cited
Appendix A
Appendix B

LIST OF FIGURES

Figure 1. The structures of the four compounds used in the composite mixture. The compounds are (A) coumarin, (B) limonene, (C) vanillin, and (D) menthol. The colours of the structures correspond to the colours of the spectra in Figure 2, shown in the discussion.

Figure 2.
(A), (B), (C), and (D) are the individual COSY spectra of 0.334 M coumarin, 0.289 M limonene, 0.285 M vanillin, and 0.289 M menthol in chloroform-d, respectively. The individual spectra were overlaid to give (E), which can be compared with (F), the actual COSY spectrum produced by a mixture of all four compounds.

Figure 3. The growth curve of DH5-α bacteria grown in M9 minimal media at 37°C and 240 rpm. The curve appears to level off at an optical density of 0.2.

Figure 4. The spectra used to analyze the E. coli lysate. (A) shows a TOCSY spectrum, (B) a COSY spectrum, (C) an HSQC-TOCSY spectrum, and (D) an HSQC spectrum.

Figure 5. The connectivity network produced from six of the fractions, showing the compounds present in each fraction. The large blue nodes represent the fractions, while the 594 unique compounds returned by the BMRB database are the small white nodes. Blue edges represent connections that result from a single fraction/compound pair, showing compounds that only appear in one fraction. The size of the nodes is proportional to the number of compounds returned by the database for each fraction, while the length of the edges is proportional to the inverse of the scoring value returned by the BMRB database, referred to as ‘Peak Match’. (A) shows the complete dataset from the BMRB database with no editing. (B) shows a simplified graph with only fractions 1, 2, and 3 shown. This allows for the easy visualization of compounds belonging to only one fraction (blue edges, on the periphery), two fractions (grey edges, on the periphery), and all three fractions (center of the graph).

Figure 6. The same six fractions shown in Figure 5, with all compounds that had a Peak Match in the BMRB database equal to or lower than 0.10 removed. This filter removed 60 compounds from the dataset, leaving 534 compounds.

LIST OF TABLES

Table 1. The eluent composition and fractions collected from each polarity solvent gradient.
The eluent ranges from most polar (35% methanol:65% water) to least polar (100% methanol). The subfractions are volumes collected directly from the column.

Table 2. The range of subfractions pooled to give each fraction, which were analyzed by NMR spectroscopy.

Table 3. A selection of results returned from sample 15 from the BMRB database. The displayed compounds were chosen based on structural complexity and the presence of functional groups.

Table 4. A sample of the first 20 peaks with the highest intensity from the combination of the four trial samples (limonene, vanillin, coumarin, menthol).

Table 5. The first 28 rows of the output of NMR_reader.py using the four-compound mixture as an input.

LIST OF ACRONYMS

BMRB        Biological Magnetic Resonance Data Bank
COSY        Correlation Spectroscopy
csv         comma separated values
DMSO        Dimethyl sulfoxide
GC-MS       Gas Chromatography-Mass Spectrometry
HMBC        Heteronuclear Multiple Bond Correlation
HMDB        Human Metabolome Database
HSQC        Heteronuclear Single Quantum Correlation
LC-MS       Liquid Chromatography-Mass Spectrometry
MMCD        Madison Metabolomics Consortium Database
MS          Mass Spectrometry
NMR         Nuclear Magnetic Resonance
NS          Number of scans
OD          Optical Density
siRNA       small interfering Ribonucleic Acid
SW          Spectral width
TD          Time domain
TOCCATA     TOCSY Customized Carbon Trace Archive
TOCSY       Total Correlation Spectroscopy
TOCSY-HSQC  Total Correlation Spectroscopy-Heteronuclear Single Quantum Correlation
tsv         tab separated values

1 INTRODUCTION

1.1 METABOLOMICS

Metabolomics is the evaluation of all metabolites present in an organism. It is sometimes used interchangeably with metabonomics, which is the evaluation of metabolites as they change in concentration and composition due to a change in a variable affecting an organism [1]. Due to the low concentrations of metabolites present in organisms, the full elucidation of the metabolites present is unfeasible with current instrumentation.
Even so, a metabolomics approach allows for the identification of many compounds at once, and has the potential to reduce the number of rediscovered compounds, an issue in natural product chemistry. Additionally, it supports an untargeted approach, wherein the compounds present can be identified independently of their biological activity. This is in contrast to traditional isolation, where the active fraction must be identified again after each fractionation step in order to isolate a single active component.

Two main metabolomics methods are used to identify the metabolites present. Gas or Liquid Chromatography-Mass Spectrometry (GC/LC-MS) is the most common, due to its high sensitivity and ease of automation. However, Nuclear Magnetic Resonance (NMR) spectroscopy is also used; the advantages of NMR are the wealth of structural information gleaned as well as the ability to detect any organic compound [2]. Massive databases of retention times and masses for GC/LC-MS, and of chemical shifts for NMR spectroscopy, must be compiled in order to get accurate results. Metabolomics also allows a library of compounds to be compiled for testing against multiple targets.

1.1.1 NMR-BASED METABOLOMICS

As NMR spectroscopy provides a relatively unique chemical shift value for each hydrogen, these values can be queried against an appropriate database of known compounds. However, due to the complexity of most natural product libraries, a 1-dimensional spectrum quickly becomes too crowded to parse out individual chemical shift values. Some of the techniques that researchers have developed to get around this problem are bucketing [3,4], multivariate analyses [5,6], and the use of 2-dimensional NMR spectroscopy [7]. One of the main advantages of NMR-based metabolomics is the inherent ability to glean structural information about the metabolites present.
NMR-based metabolomics has not gained the same following that chromatographic methods have, especially as there is no industry standard for NMR spectroscopy methods. This is in part due to the many approaches that can be used. Beyond the different types of pulse programs, which probe different interactions between nuclides, there are many different methods of sample preparation and data processing.

Some of the different pulse programs commonly used for metabolomics are: 1D proton, J-resolved proton, 1D {1H}13C, 2D 1H-1H COSY, 2D 1H-13C HSQC, 2D 1H-13C HMBC, and 2D 1H-13C TOCSY-HSQC. 1D proton views all of the individual chemical and magnetic environments of a proton (1H) in a molecule. Its downfall is the limited range over which the proton signals are spread (typically 0 to 12 ppm). This leads to many overlapping peaks when it is used for analyzing complex mixtures. One way of dealing with this, without adding too much complexity, is the use of a J-resolved proton pulse sequence. This sequence separates a 1D proton spectrum based on the size of the coupling constant, J, which describes how the peaks produced by a nuclide are split according to the spin states of neighboring hydrogens. This allows protons that appear at identical chemical shifts, but have different coupling constants (which are typically in the range of 1-6 Hz), to be differentiated from each other. 1D {1H}13C is a carbon-13 spectrum that is decoupled from proton couplings. This decoupling simplifies the carbon-13 spectrum, giving a single peak for each environment. This fact, along with the greater range that a carbon-13 spectrum is spread over (typically 0 to 200 ppm), increases the likelihood that each peak is at a unique chemical shift.

While one-dimensional spectra have decent sensitivity within a 20-minute time period, 2-dimensional pulse sequences trade some of that sensitivity for high resolution of peaks.
In essence, each peak that appears in the 2-dimensional plot represents the interaction of nuclides that neighbor each other on a molecule. As a consequence, it is possible to determine two chemical shifts that are theoretically on the same molecule. This increases the certainty of chemical identification. Two-dimensional NMR spectroscopy can be divided into two categories: homonuclear and heteronuclear.

Homonuclear experiments look at coupling between the same types of nuclides. For example, 1H-1H COrrelated SpectroscopY (COSY) looks at hydrogens that are within three bonds of each other on a molecule. This gives symmetrical spectra with peaks along the diagonal (the same x and y coordinates) representing protons coupling with themselves, and off-diagonal peaks representing the interaction of two different protons. This type of spectrum can be “walked” along, going from off-diagonal peak to off-diagonal peak along the same x or y coordinate, to determine substructures of entire spin systems. Unfortunately, wherever a quaternary carbon (one bearing no hydrogens) appears, there is a break in the chain of coupled hydrogens, making it hard to determine the entire structure of a molecule. At the cost of some sensitivity, the distance of coupling detected can be increased with a TOtal Correlation SpectroscopY (TOCSY) experiment. This pulse sequence increases the time that the protons can influence each other through coupling, and suppresses the signal of 3-bond coupling. An additional way to get around quaternary carbons is to use a 13C-13C COSY experiment. This detects one-bond 13C-13C couplings, similar to how the proton variation detects three-bond correlations. Unfortunately, carbon-13 has such a low natural abundance that the chance of two neighboring carbon atoms both being carbon-13 is roughly 1 in 10,000. For the experiment to be practical, the metabolites must be isotopically labeled with carbon-13 by introducing labeled feedstock to the growing organism.
This is particularly useful for determining the metabolic pathways that metabolites were produced from, as there is high sensitivity for molecules that incorporate the feedstock while other molecules are fainter [8,9].

The other type of 2-dimensional NMR spectroscopy experiment is heteronuclear. This is the detection of coupling between two different nuclides, usually 1H and 13C, though 15N can be used instead of carbon-13 if the sample has been isotopically labeled. Heteronuclear single quantum coherence spectroscopy (HSQC) probes one-bond 1H-13C interactions. One useful feature of heteronuclear experiments is that, because the coupling is between different nuclides, there is no peak from self-coupling. That is to say, there is no large diagonal mass of peaks obscuring the peaks produced by the coupling of protons of similar chemical shift. As well, an HSQC spectrum is asymmetrical, removing the redundancy present in COSY-type spectra. HSQC has additional simplicity, because each proton and carbon theoretically produces only a single peak in the spectrum. This is complicated somewhat by noise due to very large peaks, such as solvent peaks. Heteronuclear Multiple Bond Correlation spectroscopy (HMBC) looks at the coupling between carbon-13 and protons within three bonds by suppressing the signal from single-bond coupling. A similar pulse program, TOCSY-HSQC, also includes a TOCSY mixing time to allow spin coupling to spread further throughout the molecule.

1.2 DRUG DISCOVERY

As the field of medicine develops and more of a focus is placed on personalized medicine, there is a large push in the field of chemical biology to develop new therapeutic agents to treat a variety of ailments. In order to do this, large collections of chemical libraries must be screened against biological targets. These chemicals can then be synthesized in the lab using various diversity- and target-oriented synthesis approaches.
These approaches have the potential to produce wide varieties of chemical scaffolds; however, they are limited by the synthetic methodologies available, are time consuming, and (particularly in diversity-oriented synthesis) produce many compounds without biological activity. In contrast, natural product libraries are synthesized by organisms, in which many complicated organic reactions are catalyzed by enzymes in high yield, saving time and money [10]. Additionally, biological molecules are considered to be optimized through evolution to act within biological systems and on biological targets. However, these advantages are offset by the need to identify and isolate the natural products from a complex mixture. Unfortunately, the traditional method of activity-directed isolation, wherein the extract is screened against a biological target and any active components are isolated, suffers from a high incidence of rediscovery. This is because the compounds are only characterized after isolation.

As mentioned in section 1.1, one method that has the potential to deal with this issue is untargeted metabolomics. If constituents in an extract can be identified in a metabolomics experiment before or concurrently with assaying for biological activity, then previously discovered active compounds can be identified and priority placed on novel compounds. This opens up new possibilities for sources that have previously been investigated using traditional methods. Using metabolomics allows new compounds to be detected, and the activity of compounds that are present in low concentration to be elucidated.

1.3 GRINDELIA SQUARROSA AS A SOURCE OF NATURAL PRODUCTS

Exploring plants used in traditional medicine as a source of natural products could lead to the development of new therapeutics. However, many plants have already been thoroughly characterized.
Either new plants must be investigated, or new approaches, such as metabolomics, must be used to characterize more of the compounds present in the natural product “library”.

G. squarrosa is a common plant in the southwest interior of British Columbia. It has a yellow, bulbous flower and is covered in a resinous exudate. G. squarrosa is known in indigenous medicine as an expectorant; that is, it induces mucus to be expelled from the lungs. Recent research reports the crude extract to have modest antibiotic activity. This makes the plant a promising source of bioactive molecules. Previous research has identified the components of the essential oil of G. squarrosa by GC-MS [11], though little work has been done on the more polar constituents, or using a metabolomics approach.

2 MATERIALS AND METHODS

2.1 COMPOSITE MIXTURE

2.1.1 SAMPLE PREPARATION

Figure 1. The structures of the four compounds used in the composite mixture. The compounds are (A) coumarin, (B) limonene, (C) vanillin, and (D) menthol. The colours of the structures correspond to the colours of the spectra in Figure 2, shown in the discussion.

Solutions of 0.0197 g limonene, 0.0226 g menthol, 0.0217 g vanillin, and 0.0244 g coumarin, each in 0.5 mL chloroform-d, were prepared to give concentrations of 0.289 M, 0.289 M, 0.285 M, and 0.334 M, respectively. NMR analyses were done with the parameters described in section 2.1.2, then the solutions were mixed pair-wise (limonene with menthol, vanillin with coumarin) in a 1:1 ratio. These two paired samples were analyzed and then mixed in equal volumes to give final concentrations of 0.0723 M limonene, 0.0723 M menthol, 0.0713 M vanillin, and 0.0835 M coumarin. This final solution was also analyzed by NMR spectroscopy.

2.1.2 NMR ACQUISITION PARAMETERS

1D proton (number of scans (NS)=16) and 2D COSY (NS=8) spectra for each sample in 0.5 mL chloroform-d were obtained on a Bruker Ultrashield™ Plus Avance III 500 MHz NMR spectrometer with a 5 mm PATXI 1H/D-13C/15N Z-gradient probe.
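As a check on the concentrations quoted in section 2.1.1, the arithmetic can be reproduced in a few lines of Python. This is only an illustrative sketch and not part of the original workflow: the molar masses are standard literature values rather than numbers taken from this thesis, and the names `MOLAR_MASS`, `MASS_G`, and `molarity` are hypothetical.

```python
# Molar masses in g/mol are assumed literature values, not from the thesis.
MOLAR_MASS = {"limonene": 136.23, "menthol": 156.27,
              "vanillin": 152.15, "coumarin": 146.14}
# Masses weighed out, from section 2.1.1 (g).
MASS_G = {"limonene": 0.0197, "menthol": 0.0226,
          "vanillin": 0.0217, "coumarin": 0.0244}
VOLUME_L = 0.5e-3  # each compound dissolved in 0.5 mL chloroform-d


def molarity(mass_g, molar_mass, volume_l):
    """Return the concentration in mol/L of a solute in a given volume."""
    return mass_g / molar_mass / volume_l


stock = {name: molarity(MASS_G[name], MOLAR_MASS[name], VOLUME_L)
         for name in MASS_G}
# Two successive 1:1 mixes dilute every compound fourfold.
final = {name: conc / 4 for name, conc in stock.items()}

for name in stock:
    print(f"{name:9s} stock {stock[name]:.3f} M   final {final[name]:.4f} M")
```

Under these assumed molar masses the sketch reproduces the stock and final concentrations reported in section 2.1.1 to the stated precision.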
Each spectrum was coloured and overlaid to show the unique cross peaks between samples.

2.2 E. COLI LYSATE PREPARATION

2.2.1 GROWTH CURVE

DH5-α E. coli were grown in 20 mL of M9 minimal media in a 100-mL side-armed flask at 37°C and 240 rpm, after inoculation with 2 mL of an overnight LB broth culture. Optical density at 600 nm (OD) measurements were taken every 30 minutes to determine the time required for culture saturation. As the OD failed to reach 3.0, the value previously reported in the literature [12], the growth curve of E. coli grown in LB broth was determined to establish a theoretical maximum.

2.2.2 SAMPLE PREPARATION

DH5-α E. coli were grown in 1 L of M9 minimal media at 37°C and 240 rpm. The culture was split into four 250 mL portions, each in a 1-L Erlenmeyer flask, and each portion was inoculated with 10 mL of an overnight LB broth culture. The cultures were grown for 24 hours, then the cells were collected by centrifugation at 8000 × g and 4°C. The pellets were combined and washed with three 15 mL portions of 50 mM phosphate buffer (pH 7.0). The cells were pelleted and resuspended in 10 mL of deionized water. This suspension was frozen for 2 hours at -20°C, then thawed. This was repeated two more times, before the cell fragments were collected by centrifuging at 16,000 × g and 4°C. The supernatant was retained and 10 mL of methanol was added, then 10 mL of chloroform. The container was agitated, then left for 12 hours for the organic and aqueous layers to separate. The chloroform was removed by transfer pipette and the methanol was removed by 1 hour of nitrogen blowdown. The water was removed by vacuum centrifugation for 3 hours at 30°C. The residue was resuspended in 1 mL of D2O and centrifuged for 5 minutes at 14,000 × g to pellet undissolved compounds.
2.2.3 NMR ACQUISITION PARAMETERS

1D proton (NS=16); 2D COSY (NS=16, time domain (TD)=4096 × 256, offset 4.7 ppm, spectral width (SW)=10.9920 ppm × 10.9920 ppm); 2D TOCSY (NS=8, TD=4096 × 2048, offset 4.7 ppm, SW=10.9920 ppm × 10.9920 ppm, mixing time=0.090 s); 2D HSQC-TOCSY (NS=128, TD=2048 × 512, proton offset 4.7 ppm, carbon-13 offset 85.0 ppm, SW=10.9920 ppm × 170.9149 ppm, mixing time=0.080 s); and 2D HSQC (NS=64, TD=2048 × 512, proton offset 4.7 ppm, carbon-13 offset 85.0 ppm, SW=10.9920 ppm × 170.9149 ppm) spectra were used to analyze the aqueous portion of the extract.

2.3 GRINDELIA SQUARROSA

2.3.1 SAMPLE COLLECTION

Samples were collected from the side of the Red-Tailed Hawk trail in Kenna Cartwright Park in Kamloops, BC. Initial samples (numbered 1-18) were collected by cutting the stem three inches below the sepals, labeling with masking tape, and placing in a brown paper bag to dry for three weeks. For subsequent samples, the entire plant, including roots, was removed from the soil and placed in a dark box to dry, also for three weeks.

2.3.2 SAMPLE PREPARATION

The dried G. squarrosa flowers were weighed and ground with a mortar and pestle. 2 mL of hexanes (ACS grade, BDH) were added, and the grinding continued until the plant matter appeared homogeneous. The sample was transferred to a 50-mL Erlenmeyer flask, 8 mL more hexanes were added for a total volume of 10 mL, and the flask was sealed with parafilm and left to extract for 24 hrs. The hexanes fraction was filtered through a coarse filter paper, and the solid returned to the Erlenmeyer flask. 10 mL of acetone (ACS grade, BDH) was added, the flask was sealed with parafilm, and the flask was left for 24 hrs to extract. The acetone fraction was filtered through a coarse filter paper, and the solid returned to the Erlenmeyer flask. 10 mL of methanol (HPLC grade, BDH) was added, the flask was sealed with parafilm, and the flask was left for 24 hrs to extract.
The methanol fraction was filtered through a coarse filter paper, and the solid returned to the Erlenmeyer flask. The acetone and methanol fractions were evaporated to dryness by nitrogen blow down and combined.

For the sample fractionation, 3.72 grams of plant material (2.97 g flower heads and 0.75 g leaves) were prepared as described above, except that the sample was dried by rotary evaporation instead of nitrogen blow down. This gave a total fractionated mass of 0.4465 g, or 12% yield. The acetone and methanol fractions were resuspended in 20 mL of methanol and stored at 4°C for 1 week. The methanol was dried by nitrogen blow down to produce a dark green oil. The oil was suspended in 50:50 methanol:water in an ultra-sonication bath for one minute and partially dried by nitrogen blow down. 10 mL of methanol was added and the oil was resuspended and transferred to a 50-mL round-bottomed flask.

2.3.3 SAMPLE FRACTIONATION

Preparative C-18 125 Å 55-105 µm Waters silica solid phase was added to the resuspended crude extract to form a loose slurry. The mixture was dried to a thick paste, and then transferred to the top of a 3 × 10 cm C-18 column. As shown in Table 1, the sample was eluted with gradients from 35:65 methanol:water to 100% methanol, and subfractions were collected in approximately 12 mL portions. Methanol was applied until the visible dark band was eluted, with these later fractions collected in 100 mL portions.

Table 1. The eluent composition and fractions collected from each polarity solvent gradient. The eluent ranges from most polar (35% methanol:65% water) to least polar (100% methanol). The subfractions are volumes collected directly from the column.
% Methanol   % Water   Volume (mL)   Subfraction range
35           65        100           1-7
45           55        100           8-13
55           45        100           14-19
65           35        100           20-26
70           30        100           27-32
75           25        100           33-38
80           20        100           39-44
90           10        100           45-49
100          0         800           50-71

Fractions were combined based on a visual comparison of colour and fluorescence, based loosely around the polarity of the eluents. Details as to which fractions were combined are shown in Table 2. Each combined fraction was evaporated to dryness by rotary evaporation. The fractions were transferred to test tubes with five 2 mL portions of methanol, apart from fraction 1, which was relatively insoluble in methanol. As such, fraction 1 was transferred with three 2 mL portions of methanol and three 2 mL portions of deionized water. The combined fractions were then evaporated to dryness by nitrogen blow down, weighed, and stored at 4°C until analysis.

Table 2. The range of subfractions pooled to give each fraction, which were analyzed by NMR spectroscopy.

Fraction number   Subfraction range   Mass (g)
1                 1-8                 0.2227
2                 9-14                0.0138
3                 15-20               0.0144
4                 21-28               0.0552
5                 29-34               0.0233
6                 35-45               0.0414
7                 46-50               0.0281
8                 51-55               0.0222
9                 56-60               0.0106
10                61-65               0.0049
11                66-71               0.0099

With the exception of fraction 1, each fraction was dissolved in 1 mL of methanol-d4 and a drop of tetramethylsilane (TMS) was added as a standard. The test tube was ultra-sonicated to suspend metabolites adhered to the glass, and then centrifuged for five minutes to remove undissolved particles. The supernatant was then transferred to an NMR tube and analyzed by NMR spectroscopy. Fraction 1 was treated like the rest of the fractions, except that it was dissolved in 1 mL of DMSO-d6 instead of methanol-d4.
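The fraction masses in Table 2 can be tied back to the total fractionated mass and yield quoted in section 2.3.2 with a short calculation. The sketch below is illustrative only; the variable names are hypothetical, and the numbers are copied from Table 2 and section 2.3.2.

```python
# Fraction masses in grams, copied from Table 2.
FRACTION_MASS_G = {
    1: 0.2227, 2: 0.0138, 3: 0.0144, 4: 0.0552, 5: 0.0233, 6: 0.0414,
    7: 0.0281, 8: 0.0222, 9: 0.0106, 10: 0.0049, 11: 0.0099,
}
PLANT_MATERIAL_G = 3.72  # dried plant material used for fractionation

total_g = sum(FRACTION_MASS_G.values())
yield_percent = 100 * total_g / PLANT_MATERIAL_G
# Share of the recovered mass found in each fraction, in percent.
share_percent = {n: 100 * m / total_g for n, m in FRACTION_MASS_G.items()}

print(f"total recovered: {total_g:.4f} g ({yield_percent:.1f}% yield)")
for n, share in share_percent.items():
    print(f"fraction {n:2d}: {share:5.1f}% of recovered mass")
```

The eleven masses sum to the 0.4465 g reported earlier, which corresponds to the stated 12% yield; just under half of the recovered mass sits in fraction 1, the most polar fraction.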
2.3.4 NMR ACQUISITION PARAMETERS

1D proton (NS=16); 2D COSY (NS=128, TD=8192 × 256, offset 4.7 ppm, SW=10.9920 ppm × 10.9920 ppm); 2D HMBC (NS=128, TD=4096 × 256, proton offset 4.7 ppm, carbon-13 offset 85.0 ppm, SW=10.9920 ppm × 172.9149 ppm); and 2D HSQC (NS=64, TD=2048 × 512, proton offset 4.7 ppm, carbon-13 offset 85.0 ppm, SW=10.9920 ppm × 172.9149 ppm) spectra were used to analyze all fractions and crude extracts.

2.4 COMPUTATIONAL PROCESSING

2.4.1 PEAK DECONVOLUTION

A list of peaks was compiled using the peak picking command in Bruker Topspin v2.1, with the lowest contour level visually set to just above the baseline. These peaks were exported to a comma separated values (csv) file with four columns: Peak, containing an arbitrary unique identifier; ν(F2) [ppm], containing a list of the F2 chemical shifts; ν(F1) [ppm], containing a list of the F1 chemical shifts; and Intensity [abs], containing the peak integration values. A sample of the output is shown in Appendix A, Table 4.

In order to compile sets of multiple peaks belonging to the same substructure, a Python 3.7 program was developed in Spyder v3.13 using the Pandas library. The script is included in Appendix B as NMR_reader.py. In essence, the NMR_reader.py program took a table of two-dimensional chemical shifts and condensed the redundant chemical shifts in one column, appending the corresponding chemical shifts in the other column to each unique value. This simplified the dataset by taking a list of approximately 2000 chemical shift pairs and reducing it to a list of approximately 500 unique entries without loss of information.

2.4.2 DATABASE QUERYING

The NMR_reader.py program was executed on the sets of COSY data from the composite mixtures.
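The condensing step described in section 2.4.1 can be illustrated with a minimal sketch. This is not the NMR_reader.py script from Appendix B: it is a dependency-free approximation (the original used Pandas), the function name `condense_shifts` and the rounding to two decimal places are assumptions, and the peak values are made up purely for illustration.

```python
from collections import defaultdict


def condense_shifts(peaks, decimals=2):
    """Group (F2, F1) shift pairs by F2, keeping each F1 partner once.

    `peaks` is an iterable of (F2 ppm, F1 ppm) tuples. Shifts are rounded
    so that nearly identical values collapse onto one key (an assumption;
    the original script may match shifts differently).
    """
    partners = defaultdict(set)
    for f2, f1 in peaks:
        partners[round(f2, decimals)].add(round(f1, decimals))
    # Sort keys and partners so the output is reproducible.
    return {f2: sorted(f1s) for f2, f1s in sorted(partners.items())}


# Hypothetical COSY-style peak list: diagonal and off-diagonal pairs.
example_peaks = [
    (7.71, 7.71), (7.71, 6.42), (6.42, 7.71), (6.42, 6.42),
    (3.40, 3.40), (3.40, 1.95), (1.95, 3.40), (1.95, 1.95), (1.95, 0.91),
]

condensed = condense_shifts(example_peaks)
for f2, f1s in condensed.items():
    print(f2, f1s)
```

Here nine shift pairs reduce to four entries, one per unique F2 shift, each listing every F1 shift it correlates with. That is the same kind of reduction, on a much smaller scale, as the roughly 2000 pairs to 500 entries reported for the real data.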
The sets of substructures from the NMR_reader.py program were converted to csv format, and then the individual substructures were entered into the TOCCATA database [12].

The csv peak list of each HSQC experiment was manually converted to a tab separated values (tsv) file with two columns (ν(F2) [ppm] and ν(F1) [ppm]) by uploading the csv file to Google Sheets and exporting the two desired columns as a tsv file. This tsv file was uploaded to the BMRB database website [13] in the “Search 2D HSQC lists” submenu. A list of potential compounds was returned for each peak list uploaded, which was then downloaded as a tsv file and converted to a csv file.

2.4.3 CONSTRUCTING COMPOUND MAPS

The csv files from the BMRB database for fractions 1-4, 6, and 7 were concatenated. The entries in the first column were changed to the name of the fraction that the compound came from, without spaces (e.g. “Fraction1”). The name of the first column was changed to “Fractions”. The column that contained the compound names was renamed to “compounds”. The scoring value supplied by the BMRB database, “Peak Match”, was renamed to “weight” in order to align better with the graphing program. A new spreadsheet was made that contained one column, “Nodes”, holding all entries from the compound column and a single entry for each fraction. These csv files were used as inputs for a network parsing Python program, NMR_network.py. Refer to Appendix B for details; in brief, this program dealt with all duplicate values and parsed the data into a format that could be read by Cytoscape, a network visualization platform [14].

NMR_network.py wrote an xml file, which was imported into Cytoscape as a network file. In Cytoscape, the layout was set to Edge-weighted Spring Embedded Layout, based on the weight parameter. Then, from the Tools drop-down menu, Workflow was opened and “Analyze selected networks and create custom styles” was selected.
This option returned metadata about the networks and adapted the network to highlight key features. In order to clean up the number of edges displayed, Layout | Bundle Edge | All Nodes and Edges was selected. An additional map was constructed from fractions 1, 2, and 3 for clearer understanding. As well, a simplified complete map was made by deleting all edges that had a weight value equal to or less than 0.10.

3 RESULTS AND DISCUSSION

3.1 PULSE SEQUENCES

In order to optimize peak resolution and sensitivity, many pulse sequences were investigated over the course of this project. The TOCSY and TOCSY-HSQC were initially chosen due to their integration with the TOCCATA database [12]. These pulse sequences are advantageous because they have very narrow peaks in the indirect dimension of the spectra, meaning that each horizontal one-dimensional “slice” of a spectrum should contain only the peaks from one spin system. Unfortunately, the spectra produced from the E. coli lysate did not appear to be well enough resolved or sufficiently sensitive to isolate individual spin systems from the raw spectra, as shown in Figure 4.

The COSY pulse program also provides connectivity information, and is more sensitive than TOCSY. Unfortunately, COSY shows less information than TOCSY, so additional computational steps must be done to isolate complete spin systems. COSY spectra are more powerful than 1D proton NMR spectra, clustering more consistently in statistical treatments [15]. The NMR_reader.py program was an attempt to combine the high sensitivity of COSY spectra with the structural information provided by TOCSY slices.

HSQC has lower sensitivity than homonuclear experiments by virtue of the low sensitivity of the carbon-13 nuclide.
This is offset by the high resolution afforded by the expanded carbon-13 dimension compared to the proton dimension [16]. HSQC is also advantageous as every nuclide appears only once in the two-dimensional spectrum. One interesting advantage of HSQC is that it is possible to determine quantitative amounts of metabolites by varying delay times and extrapolating backwards to determine the peak intensity at t = 0 [17].

3.2 COMPOSITE MIXTURE

Compounds were selected because recent research identified some of them in the essential oil of G. squarrosa (limonene, menthol) [11] and others were similar in structure to acetylsalicylic acid (vanillin). The analysis of the composite mixture allowed any artifact peaks arising from the pulse sequence to be identified and taken into account in more complex mixtures. In Figure 2, the off-diagonal peaks are clearly resolved between compounds, while the diagonal “self-coupling” peaks show a large degree of overlap.

Querying the COSY peak list of the four compounds against the BMRB database returned a list of 815 compounds. Unfortunately, the BMRB database was not set up to allow the querying of multiple related COSY peaks, so each returned compound was based on a single peak assignment. This probably contributed to the extremely high incidence of false positives. The expected compounds were present in the data output; they appeared among the first 100 compounds, with high Peak Match scores, with the exception of limonene, which only appeared as limonene oxide. This is probably due to a limitation of the database, because limonene is not present in the dataset. However, limonene is also prone to autooxidation in the presence of oxygen, so this could be another contributing factor. Further work needs to be done in order to increase the fidelity of the results.

Querying the output of NMR_reader.py against the TOCCATA database in order to reference more peaks was unsuccessful due to the limitations of the TOCCATA database.
Though the database allows for substructures gleaned from multiple bond coupling relationships in TOCSY-type experiments, the compounds contained in it are derived from the human metabolome database (HMDB). As such, exogenous compounds are not present in the reference dataset. This limits the applicability of the TOCCATA database to natural product extracts.

Other databases were considered. The NAPROC-13 database18 has an excellent selection of natural products.19 Unfortunately, the database relies exclusively on carbon-13 assignments, and only a single substructure at a time can be queried. This makes it ideal for two-dimensional 13C-13C COSY experiments with carbon-13 labelled feedstock, but requires elaborate processing to make it usable with carbon-13 in natural abundance. The Madison Metabolomics Consortium Database allows many parameters to be inputted at once, including two-dimensional COSY.20 When the composite mixture was queried, only six compounds were returned, a large improvement over the BMRB database; however, only one of the returned compounds (coumarin) was actually in the composite mixture.

Figure 2. (A), (B), (C), and (D) are the individual COSY spectra of 0.334 M coumarin, 0.289 M limonene, 0.285 M vanillin, and 0.289 M menthol in chloroform-d, respectively. The individual spectra were overlaid to give (E), which can be compared with (F), the actual COSY spectrum produced by a mixture of all four compounds.

The spectra produced from the four compounds are shown in Figure 2. Figure 2 A-D shows the individual spectra of each compound. In retrospect, it may have been advantageous to select some compounds with more complex chemical scaffolds, as these compounds would produce more off-diagonal peaks in the COSY spectra. The compounds that were chosen had few neighboring hydrogens, which are what a COSY pulse sequence detects. This is particularly noticeable in Figure 2 C, vanillin, which only shows one off-diagonal peak produced by neighboring hydrogens.

3.3 E. COLI LYSATE

3.3.1 GROWTH CURVE

Figure 3. The growth curve of DH5-α bacteria grown in M9 minimal media at 37°C and 240 rpm. The data appear to level off at an optical density of 0.2.

As shown in Figure 3, the DH5-α E. coli grew to ~0.2 optical density (OD) after 7 hours of growth. This culture saturation was considerably lower than the absorbance of 3.0 reported in the method followed. This contributed to low sensitivity in the NMR spectra. It is possible that the cultures needed to be incubated for longer, in order to maximize cellular concentrations and, more importantly, maximize the production of secondary metabolites. The low OD may also be due to low oxygen levels, which could be alleviated by increasing the shaker speed and further splitting the culture into more Erlenmeyer flasks. Changing the concentration of glucose did not seem to affect the final OD, meaning that glucose was not the limiting component of the media. Changing the media to LB broth doubled the final OD, but it did not approach the order of magnitude increase that was reported in the literature, indicating potential discrepancies in the methods of Bingol et al.12 The two main possibilities are that the culture was measured with a path length of 10 cm instead of 1 cm, or that culture saturation was simply assumed to have an OD of 3 without any spectrophotometric measurement to confirm it. As attempts to contact the researchers received no response, the analysis of the 0.3 OD 1-L culture was done.

3.3.2 NMR SPECTRA

Figure 4 shows the four different spectra produced by NMR analysis of the E. coli lysate. Note the poor sensitivity displayed in Figure 4 C, which causes it to have the same or fewer peaks than the corresponding near coupling spectra, while the opposite should be the case. The TOCSY spectrum (Figure 4 A) has the opposite issue, where there is so much signal that almost no resolution can be observed.
Sensitivity was low, in part due to the low cell concentrations already mentioned, but also because the parameters were poorly optimized for the E. coli lysate. The number of scans should be increased, and the resolution in the time domain decreased. There were also issues with the HSQC-TOCSY pulse sequence, and to a lesser extent the TOCSY pulse sequence, where the sample became hot over the course of the experiment. Part of this can be attributed to the sample heating during the mixing time inherent in TOCSY pulse programs; however, the HSQC-TOCSY was an adiabatic variant, due to some corruption in the parameter files of the basic pulse program. What effect, if any, this had on the increase in temperature is unclear. Heat is undesirable because it averages signals that are separate at room temperature (at which most of the database spectra have been collected). Additionally, the change in temperature can decompose analytes and cause solvent to rapidly evaporate. A CryoProbe can be used to keep the temperature of an NMR experiment constant, but the instrument used is not equipped with that probe. This led to a preference for lower energy pulse sequences, such as HSQC, HMBC, and COSY, in the rest of the analyses done by NMR spectroscopy.

Figure 4. The spectra used to analyze the E. coli lysate. (A) shows a TOCSY spectrum, (B) a COSY spectrum, (C) an HSQC-TOCSY spectrum, and (D) an HSQC spectrum.

3.4 GRINDELIA SQUARROSA

3.4.1 CRUDE SAMPLES

The extraction with hexanes gave a viscous, yellow oil that had a strong, resinous scent and was completely miscible in hexanes. The acetone extract was a vivid yellow, with a faint green hue that varied in intensity based on the amount of photosynthetic material included in the plant sample. The methanol fractions were an off-yellow colour that was quite weak in intensity. Solvent-wise, the DMSO-d6 dissolved the most extract.
However, this advantage was offset somewhat by the large, broad solvent peak produced by DMSO, which overwhelmed the weaker peaks next to it. The 1:1 D2O:acetone-d6 dissolved the least amount of the extract and had the added issue of two solvent peaks, one broad peak from HOD, and one intense peak from the six equivalent protons in acetone. These peaks produced unwanted noise and overlapped with low intensity analyte peaks. Methanol-d4 showed a compromise between these two extremes. Most compounds were soluble in the methanol, as that was one of the two solvents that comprised. The methanol has two solvent peaks: one at 3.31 ppm and the OH proton at 4.78 ppm. The hydrocarbon portion allows methanol to dissolve moderately nonpolar compounds. The OH group allows for hydrogen bonding with compounds that have hydrogen bond donors and acceptors. The OH group also allows the solvent to be buffered to increase the reproducibility of the chemical shifts. Unfortunately, buffering was neglected in this experiment, which may have led to inconsistencies in the chemical shifts. Querying the BMRB database returned between 217 and 477 compounds. These compounds were compared against the results of the networks produced from the fractionated compounds (Figure 5). The sample dissolved in methanol displayed 54 new compounds that were not present in any of the fractions analyzed; this is probably due to the incomplete analysis of the fractions, but may indicate that the BMRB database requires very few compounds in a sample in order to reduce the number of false positives.

Table 3. A selection of results returned from sample 15 from the BMRB database. The displayed compounds were chosen based on structural complexity and the presence of functional groups.
Peak match    Name
1             scyllo-Inositol
0.8           D-(+)-Threo-isocitric acid
0.78          O-Phospho-L-serine
0.75          D-Saccharate
0.73          Pimelic acid
0.67          O,O-diethyl thiophosphate
0.57          2-Aminoethyl dihydrogen phosphate
0.53          Nepsilon-Acetyl-L-lysine
0.5           4-Chlorophenol
0.5           Guaiacol
0.5           4-Guanidinobutyric acid
0.5           L-(-) Arabitol

3.4.2 FRACTIONATED SAMPLES

The fractions were unevenly split in the mass of the isolated compounds, varying from 0.0049 g to 0.2227 g, as shown in Table 2. This indicates, at a very superficial level, that the way that the subfractions were pooled needs to be reevaluated. Ideally each fraction would contain an equal number of compounds, and those compounds would be in relatively equal concentrations.

Figure 5. The connectivity network produced from six of the fractions showing the compounds present in each fraction. The large blue nodes represent the fractions, while the 594 unique compounds returned by the BMRB database are the small white nodes. Blue edges represent connections that result from a single fraction/compound pair, showing compounds that only appear in one fraction. The size of the nodes is proportional to the number of compounds returned by the database for each fraction, while the length of the edges is proportional to the inverse of the scoring value returned by the BMRB database, referred to as ‘Peak Match’. (A) shows the complete dataset from the BMRB database with no editing. (B) shows a simplified graph with only fractions 1, 2, and 3 shown. This allows for the easy visualization of compounds belonging to only one fraction (blue edges and on the periphery), two fractions (grey edges and on the periphery) and all three fractions (center of the graph).

Ideally, each fraction should contain at most 20 compounds. One way to do this would be to keep the 71 subfractions separate and characterize each one individually. However, analyzing 71 subfractions at 20 hours a fraction would take over two months.
One way to minimize the analysis time would be to reduce the acquisition time to only 20 minutes, which would require a more manageable 30-hour timeframe. This may be feasible for the smaller number of compounds present in each fraction, as they may require less sensitivity to detect each compound. The large number of common compounds clearly visualized in Figure 5 B, and suggested in Figure 5 A, indicates poor separation of the crude extract. This could be due to overloading the C-18 chromatography column. Either the amount of extract that is separated should be decreased, or the amount of column packing material should be increased to improve the resolution of compounds. Unfortunately, decreasing the amount of extract loaded on the column would greatly impact sensitivity. Therefore, the more prudent choice would be to increase the amount of column material, even though there will be more solvent to remove before analysis. As shown in Figure 5, the networks allow one to decide upon a different order in which to pool fractions. Fractions 1 and 3 are quite large nodes, indicating a large number of compounds being returned from the database. If these fractions were separated into more, smaller fractions, the compounds would be easier to identify. Additionally, fractions 6 and 7 are quite small and close together on the network. This seems to indicate that they contain similar compounds, because similar forces from the compounds affect their positions in the network layout. They should be pooled together to increase sensitivity and reduce network complexity. Ideally, the network would be constructed from the peaks, rather than compound assignments from the database. This would remove some of the uncertainties and limitations of the databases, such as the very limited number of compounds in the databases and the uncertain assignment of compounds that are in the database. Additionally, incorporating only the peaks reduces analyst bias, where compounds that are not expected to be bioactive are dismissed out of hand.
Unfortunately, due to slight changes in temperature and solvent composition, there is minor variation in the chemical shift of the same peak between runs. This means that a single peak coordinate cannot be used as a common compound identifier between fractions. Instead, a bucketing approach3,4 must be used, dividing the two-dimensional spectra into a grid. The increments in NMR bucketing are typically 0.04 ppm for a one-dimensional 1H spectrum, though due to the approximately 10-fold increase in ppm range for 13C, it seems reasonable to believe that increments of 0.4 ppm would provide sufficient resolution for a 13C spectrum. Thus, a 2D HSQC spectrum, such as the one used for the compilation of the networks, spanning 10.992 ppm in the 1H dimension and 173 ppm in the 13C dimension would consist of 275 1H buckets × 433 13C buckets for a total of 119075 data points. If the apex of a peak falls in a bucket, the corresponding intensity is considered the value for the entire bucket, and the peak can be related to peaks in other fractions that fall in the same range. One of the downsides of this method arises when multiple peaks appear in the same bucket. In this case, some sensitivity is lost as the peak intensities are added together; however, because a database treats each peak as a range and could not detect such a deviation either, the point is academic at best. The bioinformatics programs to do this sort of bucketing already exist for LC-MS,21 where the two dimensions are the retention times and the masses of the compounds being detected instead of 1H and 13C chemical shifts. Therefore, it would be relatively simple to adapt these programs for two-dimensional NMR spectra.

4 CONCLUSIONS AND FUTURE WORK

4.1 COMPLEX MIXTURE

There is still more interpretively useful data to be gleaned from simple, few-compound mixtures, both of the compounds discussed here and of other mixtures.
A dilution series of the mixture should be done to determine the limits of detection and optimize the instrumental parameters to give a balance between sensitivity and experiment time. An experiment that was not done was an HSQC analysis of the composite mixture. This would allow the effects of querying multiple peaks on the certainty of assignment to be empirically determined. Additionally, compounds that vary only slightly (alcohol group versus methoxy group) should be examined to explore whether this technique can differentiate between analogues of the same compounds, or if the technique is limited to only determining classes of compounds. However, even if NMR metabolomics can only identify classes of compounds, the technique is still a powerful tool for characterizing natural product extracts. The complex mixture also provides a useful platform for examining the effects of buffering the NMR solvent on chemical shifts. This is important because chemical shifts of nuclides can vary up to 1 ppm for 1H and up to 10 ppm for 13C depending on whether the molecule is in the protonated, acidic form, or the deprotonated, basic form.22 Different solutions of methanol-d4 can be made with varying concentrations and pH values. Chemical shifts of known compounds could be compared under these different conditions. Ideally, the buffering of methanol would prove unnecessary, as buffering either requires expensive deuterated buffers, or introduces a large water peak into the NMR spectra, which obscures many finer peaks. Typically, the suggested pH is around 7.3, probably because that is close to physiological pH.

4.2 E. COLI LYSATE

The E. coli lysate provided valuable insight into the importance of pulse program selection, as well as clarifying the effects of the TD and NS parameters on spectral sensitivity and resolution. For the purpose of this experiment, it was only included in order to produce a defined, complex mixture against which to confirm the database results.
Due to the poor sensitivity displayed, the spectra gathered did not contain enough information to query the peaks correctly. However, beyond this experiment, examining the metabolic profile of bacteria allows perturbations, whether genetic or chemical, to be examined on a metabolomics level. This could be applied to examining the biosynthesis of natural products from bacterial transformants in order to optimize feedstock and minimize side reactions. Introducing carbon-13 labeled feedstock could increase sensitivity 100-fold in experiments involving carbon, as well as distinguish the metabolites of the feedstock from the rest of the metabolome. Changing the focus of the lysate to examine more nonpolar constituents is also a possible direction. To do so, size exclusion chromatography packing is introduced to the bacterial culture. Exuded small molecules are adsorbed by the packing material, and after the bacterial culture is grown to saturation, the small organic molecules are extracted with methanol.23 The gathered compounds are more nonpolar than those gathered through methanol/chloroform extraction. These compounds can then be analyzed through metabolomics experiments to compile a natural product library.

4.3 G. SQUARROSA EXTRACTS

One of the first things that still needs to be done with the G. squarrosa extract is to analyze the remaining fractions by NMR spectroscopy, as only fractions 1-7 and fraction 11 were analyzed. The database results of all the fractions should be visualized on a network. This network can be used to determine which fractions have more compounds (and thus should be further split into more fractions) and which fractions display very similar compounds (and thus should be pooled). This information should guide the subfraction pooling of a second reverse phase column chromatography with identical parameters to the first, described in section 2.3.3.
The new fractions should be analyzed by NMR spectroscopy and queried against the BMRB database. The results would then be visualized in a network; this network would be used to determine the uniqueness and number of constituents of each fraction. The lists of unique compounds from each column chromatographic separation should be compared to indicate the reproducibility of both the database results and the plant composition. Once that is done, the fractions should be tested for activity. The purpose of this step is to gain a biological profile, rather than to look for specific bioactive constituents. Several different approaches exist for biological profiling. One such method is to compare the expression levels of several reported genes upon treatment with the fractions.24 The expression levels can be compared with the expression levels upon treatment with compounds of known modes of action, or upon treatment with siRNAs (genetic knockdown). In either case, similar expression levels between a fraction and a known perturbagen are interpreted as indicating a compound that acts on the same pathway as the known perturbagen. In a similar way, cell morphology can be visualized using fluorescent dyes.21,23,25 Changes in morphology can be statistically correlated between fractions, as well as with known perturbagens. One large advantage of this platform is that it is already designed to integrate with a network compiled from MS-based metabolomics. The network changes from visualizing the relationship between compounds and fraction to visualizing the relationship between compounds and fraction activity. This process is improved by increasing the number of fractions analyzed, as the fewer compounds present in a fraction to be tested for activity, the easier it is to identify the single compound causing that activity. Even without added biological activity, the fraction networks can be further analyzed.
If the weight of the lines were changed to peak intensity, normalized to the TMS added in a constant concentration, the edge lengths would be related to concentration, rather than the peak match value returned by the database. Unfortunately, peak intensities in two-dimensional spectra are not directly related to concentration, as they are in one-dimensional proton NMR. Other factors, such as the magnitude of the coupling constant, J, and NMR parameters such as the use of a decoupling irradiation pulse and the mixing time in TOCSY-type experiments, also influence the intensity of the peaks. However, as long as the instrument parameters are constant, the influences other than compound concentration should also remain constant. When linking metabolomic profile with biological profile, having some indication of molecular concentration allows a dose response relationship to be observed, helping to narrow in on the biologically active compound.

4.4 THE DIRECTION OF BIOINFORMATICS

The programs developed over the course of this project were limited in function. More functionality should be included in each program, particularly NMR_reader.py. Noise can be filtered from HSQC-type spectra by incorporating the intensity of each peak into the output of NMR_reader.py. Then the highest intensity peak would be retained, discarding the smaller noise peaks. Noise can also be visualized in COSY spectra. If the spectra are not symmetrized, noisy peaks show up as intense vertical lines. Any entry in the output of NMR_reader.py that contains more than four peaks should be discarded, as it is most likely the result of noise. Another feature that would be useful is a way to “walk” along related peaks by referencing entries in the list of F2 chemical shifts to the key of the reversed library, eventually compiling entire substructures in one table entry.
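The “walk” along related peaks could be sketched roughly as follows. This is only an illustration under stated assumptions, not part of the programs written for this project: the function names group_shifts and walk_spin_systems are hypothetical, the cross peaks are assumed to be available as (F2, F1) pairs in ppm, and the 0.01 ppm tolerance used to decide that two shifts belong to the same nucleus is an arbitrary placeholder that would need to be tuned.

```python
def group_shifts(shifts, tol):
    """Give nearby shifts a shared label (assumption: consecutive sorted
    values that differ by no more than tol come from the same nucleus)."""
    ordered = sorted(set(shifts))
    labels = {}
    current = 0
    for i, shift in enumerate(ordered):
        if i > 0 and shift - ordered[i - 1] > tol:
            current += 1
        labels[shift] = current
    return labels


def walk_spin_systems(cross_peaks, tol=0.01):
    """Follow cross peaks from shift to shift and return one sorted list
    of chemical shifts per connected group (a candidate spin system)."""
    labels = group_shifts([s for pair in cross_peaks for s in pair], tol)

    # Record which shift groups are linked by at least one cross peak.
    neighbours = {}
    for f2, f1 in cross_peaks:
        a, b = labels[f2], labels[f1]
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    # Remember which raw shifts belong to each group label.
    members = {}
    for shift, label in labels.items():
        members.setdefault(label, []).append(shift)

    # Depth-first walk over the links, one pass per unvisited group.
    seen = set()
    systems = []
    for start in sorted(neighbours):
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        component = []
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        systems.append(sorted(s for label in component for s in members[label]))
    return systems
```

For example, walk_spin_systems([(1.00, 2.00), (2.001, 3.00), (7.00, 8.00)]) groups the first two cross peaks into one entry because 2.00 and 2.001 ppm fall within the tolerance, and leaves the third cross peak as its own entry. Because shifts are chained together whenever neighbours are within the tolerance, a crowded spectral region could merge unrelated nuclei, so the tolerance should be kept small.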
Ideally, the NMR data should be manipulated so as to return excellent results in the MMCD.20 This database contains the largest number of compounds and allows multiple inputs to be queried at once, increasing the validity of the results.

4.5 CONCLUDING REMARKS

Metabolomics methods using the NMR spectrometer were developed. After considering several pulse programs, HSQC showed the greatest promise for producing well resolved and simple spectra. Querying the list of peaks produced from the HSQC NMR spectra of the fractionated G. squarrosa extract against the BMRB database returned 597 unique compounds. Querying the crude extract in the same manner returned between 217 and 477 compounds. The four-compound mixture returned 815 compounds from the querying of one dimension of the COSY spectrum. This large number of compounds returned from a relatively simple mixture raised questions about the quality of the database assignments. To improve the assignment, more peaks known to be on the same molecule need to be queried against the database. As the databases are not configured to accept multiple related sets of peaks at a time, relating peaks with biological activity, and only analyzing those subsystems that correlate with biological activity, is proposed as a potential workaround. This reduces the number of peaks that need to be analyzed. This integration would result in a powerful technique that is complementary to the currently used MS-based metabolomics, further exploring chemical space.

5 LITERATURE CITED

(1) Dona, A. C.; Kyriakides, M.; Scott, F.; Shephard, E. A.; Varshavi, D.; Veselkov, K.; Everett, J. R. Comput. Struct. Biotechnol. J. 2016, 14, 135–153.
(2) Kurita, K. L.; Linington, R. G. J. Nat. Prod. 2015, 78 (3), 587–596.
(3) Smolinska, A.; Blanchet, L.; Buydens, L. M. C.; Wijmenga, S. S. Anal. Chim. Acta 2012, 750, 82–97.
(4) Leenders, J.; Frédérich, M.; de Tullio, P. Drug Discov. Today Technol. 2015, 13, 39–46.
(5) Larive, C. K.; Barding, G. A.; Dinges, M. M. Anal. Chem. 2015, 87 (1), 133–146.
(6) Öman, T.; Tessem, M.-B.; Bathen, T. F.; Bertilsson, H.; Angelsen, A.; Hedenström, M.; Andreassen, T. BMC Bioinformatics 2014, 15 (1), 413.
(7) Mahrous, E. A.; Farag, M. A. Drug Discov. 2015, 6 (1), 3–15.
(8) Kruger, N. J.; Ratcliffe, R. G.; Roscher, A. Phytochem. Rev. 2003, 2 (1–2), 17–30.
(9) Krishnan, P.; Kruger, N. J.; Ratcliffe, R. G. J. Exp. Bot. 2005, 56 (410), 255–265.
(10) Hu, Y.; Potts, M. B.; Colosimo, D.; Herrera-Herrera, M. L.; Legako, A. G.; Yousufuddin, M.; White, M. A.; MacMillan, J. B. J. Am. Chem. Soc. 2013, 135 (36), 13387–13392.
(11) Veres, K.; Roza, O.; Laczkó-Zöld, E.; Hohmann, J. Nat. Prod. Commun. 2014, 9 (4), 573–574.
(12) Bingol, K.; Bruschweiler-Li, L.; Li, D.-W.; Brüschweiler, R. Anal. Chem. 2014, 86 (11), 5494–5501.
(13) Ulrich, E. L.; Akutsu, H.; Doreleijers, J. F.; Harano, Y.; Ioannidis, Y. E.; Lin, J.; Livny, M.; Mading, S.; Maziuk, D.; Miller, Z.; Nakatani, E.; Schulte, C. F.; Tolmie, D. E.; Kent Wenger, R.; Yao, H.; Markley, J. L. Nucleic Acids Res. 2008, 36 (Database issue), D402–D408.
(14) Shannon, P.; Markiel, A.; Ozier, O.; Baliga, N. S.; Wang, J. T.; Ramage, D.; Amin, N.; Schwikowski, B.; Ideker, T. Genome Res. 2003, 13 (11), 2498–2504.
(15) Féraud, B.; Govaerts, B.; Verleysen, M.; de Tullio, P. Metabolomics 2015, 11 (6), 1756–1768.
(16) Xi, Y.; de Ropp, J. S.; Viant, M. R.; Woodruff, D. L.; Yu, P. Anal. Chim. Acta 2008, 614 (2), 127–133.
(17) Rai, R. K.; Sinha, N. Anal. Chem. 2012, 84, 10005–10011.
(18) López-Pérez, J. L.; Therón, R.; del Olmo, E.; Díaz, D. Bioinformatics 2007, 23 (23), 3256–3257.
(19) Johnson, S. R.; Lange, B. M. Front. Bioeng. Biotechnol. 2015, 3, 22.
(20) Cui, Q.; Lewis, I. A.; Hegeman, A. D.; Anderson, M. E.; Li, J.; Schulte, C. F.; Westler, W. M.; Eghbalnia, H. R.; Sussman, M. R.; Markley, J. L. Nat. Biotechnol. 2008, 26 (2), 162–164.
(21) Kurita, K. L.; Glassey, E.; Linington, R. G. Proc. Natl. Acad. Sci. 2015, 112 (39), 11999–12004.
(22) Platzer, G.; Okon, M.; McIntosh, L. P. J. Biomol. NMR 2014, 60 (2), 109–129.
(23) Schulze, C. J.; Bray, W. M.; Woerhmann, M. H.; Stuart, J.; Lokey, R. S.; Linington, R. G. Chem. Biol. 2013, 20 (2), 285–295.
(24) Potts, M. B.; Kim, H. S.; Fisher, K. W.; Hu, Y.; Carrasco, Y. P.; Bulut, G. B.; Ou, Y.-H.; Herrera-Herrera, M. L.; Cubillos, F.; Mendiratta, S.; Xiao, G.; Hofree, M.; Ideker, T.; Xie, Y.; Huang, L. J.; Lewis, R. E.; MacMillan, J. B.; White, M. A. Sci. Signal. 2013, 6 (297), ra90.
(25) Woehrmann, M. H.; Bray, W. M.; Durbin, J. K.; Nisam, S. C.; Michael, A. K.; Glassey, E.; Stuart, J. M.; Lokey, R. S. Mol. Biosyst. 2013, 9 (11), 2604–2617.

APPENDIX A

Table 4. A sample of the first 20 peaks with the highest intensity from the combination of the four trial samples (limonene, vanillin, coumarin, menthol).

Peak    (F2) [ppm]    (F1) [ppm]    Intensity [abs]
1       1.6561        7.0413        31298
2       7.0456        1.6568        31298
3       5.7015        5.7114        31338
4       5.7146        5.6984        31338
5       1.2646        1.5134        31366
6       1.5126        1.2657        31366
7       1.1211        1.396         31396
8       1.3951        1.1223        31396
9       1.6953        7.4063        31554
10      7.411         1.6959        31554
11      0.7557        1.5264        31558
12      1.5256        0.7572        31558
13      2.0085        2.2956        31632
14      2.2955        2.0088        31632
15      7.4241        7.8235        31756
16      7.8286        7.4194        31756
17      0.8731        3.9514        31970
18      3.9529        0.8745        31970
19      1.7475        7.6931        32008
20      7.6981        1.7481        32008

Table 5. The first 28 rows of the output of NMR_reader.py using the four compound mixture as an input.
F1 (ppm) F2 (ppm)
0.9006 0.9006 0.9137 0.9137 -0.0118 -0.0118 7.7062 7.1065 7.2629 7.2629 7.3411 6.4285 -0.1422 -0.1422 1.6438 1.6438 0.8224 0.8224 7.2108 7.2629 7.4194 7.5237 7.4976 7.289 3.9775 3.9775 1.6698 1.6698 4.6945 4.6945 6.598 4.0035 4.0035 2.1783 7.4194 7.4194 9.6227 0.7963 0.7963 7.6149 7.4324 9.8182 9.8182 1.722 1.722 0.9267 0.9267 7.0934 7.4976 0.1444 0.1444 0.1053 0.1053 0.0793 0.0793 0.001 0.001 1.722 0.8745 0.8745 3.9775 0.8094 0.8094 4.7076 9.6227 0.8615 0.8615 7.5497 7.4715 2.1392 2.1392 2.1913 2.1913 2.1653 2.1653 1.6307 0.8094 7.4194 1.5264 1.5264 1.5916 1.5916 4.4207 4.4729 1.6568 1.722 9.6227

APPENDIX B

# NMR_reader.py
'''To compile peaks with the same chemical shift as one entry in a dictionary'''
import pandas as pd
import numpy as np


def F1(table, number):
    '''defines the F1 direction in an easy to reference way'''
    return table.loc[number, '?(F1) [ppm]']


def F2(table, number):
    '''defines the F2 direction in an easy to reference way'''
    return table.loc[number, '?(F2) [ppm]']


def dict_maker(file):
    '''converts the table of peaks into a dictionary, where the F2 peak is
    the key and all F1 chemical shifts attached to the key are compiled
    into a list. This removes redundant F2 peaks'''
    spectra = pd.read_csv(file)
    peaks = {}
    for number in range(spectra.shape[0]):
        peak = [F2(spectra, number)]
        connected = []
        # boolean table marking every cell equal to the current F2 shift
        true_spectra = spectra.isin(peak)
        if F2(spectra, number) not in peaks:
            for x in range(number, spectra.shape[0]):
                if F2(true_spectra, x) == True:
                    connected.append(F1(spectra, x))
            # the key is the F2 shift, matching the docstring and the
            # membership test above (the original listing used F1 here)
            peaks[F2(spectra, number)] = connected
    return peaks


def dict_maker_rev(file):
    '''converts the table of peaks into a dictionary, where the F1 peak is
    the key and all F2 chemical shifts attached to the key are compiled
    into a list. This removes redundant F1 peaks'''
    spectra_rev = pd.read_csv(file)
    peaks_rev = {}
    for n in range(spectra_rev.shape[0]):
        # returns the reverse of the dict_maker loop
        peak_rev = [F1(spectra_rev, n)]
        connected_rev = []
        true_spectra = spectra_rev.isin(peak_rev)
        if F1(spectra_rev, n) not in peaks_rev:
            for x in range(n, spectra_rev.shape[0]):
                if F1(true_spectra, x) == True:
                    connected_rev.append(F2(spectra_rev, x))
            peaks_rev[F1(spectra_rev, n)] = connected_rev
    return peaks_rev


'''runs both the dict_maker and dict_maker_rev functions on one data set'''
file = input("What is your filepath? ")
product = input("Where should this be written? ")
product_rev = input("Where should the reverse be written? ")
lysate = dict_maker(file)
lysate_rev = dict_maker_rev(file)

# write one "key [list of connected shifts]" line per entry
with open(product, "w") as output:
    for key in lysate:
        print(key, lysate[key], file=output)

with open(product_rev, "w") as output_rev:
    for key in lysate_rev:
        print(key, lysate_rev[key], file=output_rev)


# NMR_network.py
import networkx as nx
import pandas as pd

# makes empty graph and empty nodes list
G = nx.Graph()
listything = []

# open csv file that contains the list of nodes as a dataframe
# make sure to label the column containing the nodes as "Nodes"
fraction_1 = pd.read_csv("/Users/jasonmcfarlane/Downloads/nodes.csv")

'''iterates through the "Nodes" column and inputs each entry into the
empty nodes list. G.add_nodes_from() creates a nodes table from this
list, ignoring any redundant entries'''
for n in range(len(fraction_1.index)):
    s = fraction_1["Nodes"]
    listything.append(s[n])
G.add_nodes_from(listything)

'''Makes a dataframe from a table of edges. The first column contains a
list of fractions and is labelled "Fractions". The second column
contains the compounds returned in each fraction. The weight column is
a value returned from the database, and represents the certainty of the
compound assignment. A "1" is a high likelihood, while "0" is low
likelihood.'''
edges_list = pd.read_csv("/Users/jasonmcfarlane/Downloads/yuple_edit10%.csv")
for n in range(len(edges_list.index)):
    first = edges_list["Fractions"]
    second = edges_list["compound"]
    weight = edges_list["weight"]
    G.add_edge(first[n], second[n], weight=float(weight[n]))

'''Writes the graph to an xml file for easy visualization using Cytoscape'''
nx.write_graphml(G, "/Users/jasonmcfarlane/Desktop/weightedgraph_edit10%.xml")

Figure 6. This shows the same six fractions shown in Figure 5 after removing all compounds that had a Peak Match in the BMRB database equal to or lower than 0.10. This filter removed 60 compounds from the dataset, leaving 534 compounds.
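The Peak Match filter behind Figure 6 could be expressed in a few lines before the edges are added to the graph. The sketch below is a hedged illustration rather than the procedure actually used (the thesis describes deleting the edges from the map): the function name filter_edges is hypothetical, and the edges are assumed to be available as (fraction, compound, weight) tuples, for example read from the columns of the edge table used by NMR_network.py.

```python
def filter_edges(edges, threshold=0.10):
    """Drop every edge whose weight (Peak Match) is at or below threshold.

    edges is assumed to be a list of (fraction, compound, weight) tuples.
    Returns the kept edges and the sorted names of compounds that no
    longer appear in any fraction after filtering.
    """
    kept = [(fraction, compound, weight)
            for fraction, compound, weight in edges
            if weight > threshold]
    compounds_before = {compound for _, compound, _ in edges}
    compounds_after = {compound for _, compound, _ in kept}
    return kept, sorted(compounds_before - compounds_after)


# Small made-up example: compound "B" only has a weak match, so it is removed.
example = [
    ("Fraction 1", "A", 0.50),
    ("Fraction 2", "A", 0.05),
    ("Fraction 1", "B", 0.10),
    ("Fraction 3", "C", 0.11),
]
kept_edges, removed_compounds = filter_edges(example)
```

The kept edges could then be passed to G.add_edge in place of the full edge table, and the length of the second return value gives the number of compounds removed by the filter.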