# 1 Abstract

The ARPA-E funded TERRA-REF project is generating open-access reference datasets for the study of plant sensing, genomics, and phenomics. Sensor data were generated by a field scanner sensing platform that captures color, thermal, hyperspectral, and active fluorescence imagery as well as three-dimensional structure and associated environmental measurements. This dataset is provided alongside data collected using traditional field methods in order to support calibration and validation of algorithms used to extract plot-level phenotypes from these datasets.

Data were collected at the University of Arizona Maricopa Agricultural Center in Maricopa, Arizona. This site hosts a large field scanner with fifteen sensors, many of which are capable of capturing mm-scale images and point clouds at daily to weekly intervals.

These data are intended to be re-used, and are accessible as a combination of files and databases linked by spatial, temporal, and genomic information. In addition to providing open access data, the entire computational pipeline is open source, and we enable users to access high-performance computing environments.

The study has evaluated a sorghum diversity panel, biparental cross populations, and elite lines and hybrids from structured sorghum breeding populations. In addition, a durum wheat diversity panel was grown and evaluated over three winter seasons. The initial release includes derived data from two seasons in which the sorghum diversity panel was evaluated. Future releases will include data from additional seasons and locations.

The TERRA-REF reference dataset can be used to characterize phenotype-to-genotype associations, on a genomic scale, that will enable knowledge-driven breeding and the development of higher-yielding cultivars of sorghum and wheat. The data are also being used to develop new algorithms for machine learning, image analysis, genomics, and optical sensor engineering.

# 2 Introduction

## 2.1 General information

• Title: TERRA-REF, An Open Reference Data Set From High Resolution Genomics, Phenomics, and Imaging Sensors
• Dates of Data Collection: 2017 and 2018
• Geographic Location: Maricopa, Arizona
• Center of the field: 33.07549$$^{\circ}$$ N 111.9749$$^{\circ}$$ W
• Field is approximately 0.4 ha (200 m x 20 m). The scannable area of the field is 22.1 m along the east-west axis and 205.5 m along the north-south axis.
• Keywords: Sensor, Phenomics, Sorghum, TERRA-REF
• Funding: The work presented herein was funded in part by the Advanced Research Projects Agency-Energy (ARPA-E), U.S. Department of Energy, under Award Number DE-AR0000598. Computational support was provided by the National Center for Supercomputing Applications and XSEDE.

The data pipeline used to process sensor data and generate derived data was described in the Burnette et al. (2018) ACM PEARC 2018 proceedings paper “TERRA-REF data processing infrastructure”.

## 2.2 Data Use, Sharing, and Access

To cite this dataset:

LeBauer, D.S., Burnette, M.A., Demieville, J., Fahlgren, N., French, A.N., Garnett, R., Hu, Z., Huynh, K., Kooper, R., Li, Z., Maimaitijiang, M., Mao, J., Mockler, T.C., Morris, G.P., Newcomb, M., Ottman, M., Ozersky, P., Paheding, S., Pauli, D., Pless, R., Qin, W., Riemer, K., Rohde, G.S., Rooney, W.L., Sagan, V., Shakoor, N., Stylianou, A., Thorp, K., Ward, R., White, J.W., Willis, C., and Zender, C.S. (2020). TERRA-REF, An Open Reference Data Set From High Resolution Genomics, Phenomics, and Imaging Sensors. Dryad Digital Repository. http://doi.org/10.5061/dryad.4b8gtht99

This data publication consists of data, metadata, and a catalog of over 400 TB of files that are stored on a server at the National Center for Supercomputing Applications and made available through Globus on the ncsa#terra-public endpoint.

Except where clearly indicated for sensor data, the structure of directories that contain data and metadata refers to the contents of the ZIP files in the Dryad archive. This document describes the contents of the Dryad archive, which includes directories organized in the following ZIP files:

• metadata.zip contains the metadata/ directory. This includes information about experiments, sensors, and germplasm used in the study as well as comprehensive documentation.
• code.zip contains the code/ directory. This includes code used to generate figures and tables in this README as well as instructions for running a copy of the trait database.
• trait_data.zip contains the data/traits/ directory with one CSV file for each plot-level phenotype, described in the Phenotype Data section.
• sensor_data_catalogs.zip contains the data/sensors/ directory. These catalogs provide an inventory of the files that are available using the Globus file transfer service. See the Sensor Data section for more information.

In addition to the archival copy of the data on Dryad, it is possible to browse and access these data through a variety of online portals. These portals provide web user interfaces as well as databases, APIs, and R and Python clients. In many cases it will be easier to access data through these portals using web interfaces and software libraries.

The TERRA-REF documentation includes instructions for using these portals, and is hosted online at docs.terraref.org. The section “How to Access Data” provides an overview of methods that can be used to access data beyond what is provided in this repository. There is also a PDF copy of the documentation in the file metadata/docs.terraref.org_2020_04_06.pdf.

Tutorials for getting started with TERRA-REF data are available at terraref.org/tutorials and on GitHub at github.com/terraref/tutorials.

The TERRA-REF YouTube channel hosts 1) video walkthroughs of the tutorials https://www.youtube.com/channel/UComeQAqYR5aZrXN_3K5iFGw and 2) a playlist of videos related to the project https://www.youtube.com/playlist?list=PLNgRX4VLed8213stlJp60MvVx2p6VTv6N.

All data are released to the public domain under the CC-0 license. All original software is licensed with the BSD 3-clause or MIT/BSD compatible license. All software used for data processing has been archived on Zenodo and is available on GitHub in the terraref organization: github.com/terraref.

The software was created specifically for the field scanner data processing pipeline that is described by Burnette et al. (2018). The file code/source_code_dois.txt provides the DOIs for code archives that contain the state of the software at the time the data were processed.

| Component | GitHub Organization / Repository | Archive Citation |
|---|---|---|
| TERRA-REF Documentation | terraref/documentation | LeBauer, Willis, et al. (2020) |
| Reference Data | terraref/reference-data | LeBauer, Heyek, et al. (2020) |
| Computing Pipeline | terraref/computing-pipeline | Burnette et al. (2020) |
| terrautils Python Library | terraref/terrautils | Burnette, Willis, et al. (2019) |
| Laser 3D Scanner | terraref/extractors-3dscanner | Burnette, ZongyangLi, et al. (2019) |
| Environmental Logger | terraref/extractors-environmental | Burnette, Mao, et al. (2019) |
| Hyperspectral | terraref/extractors-hyperspectral | Mao et al. (2019) |
| Multispectral, Thermal, PSII | terraref/extractors-multispectral | Burnette, LeBauer, Hajmohammadi, et al. (2019) |
| Stereo RGB | terraref/extractors-stereo-rgb | Burnette, LeBauer, Li, et al. (2019) |

Other Software used in this project:

| Software | GitHub Organization / Repository | Software Archive |
|---|---|---|
| Clowder | clowder-framework/clowder | Marini et al. (2019) |
| BETYdb Trait Database | pecanproject/bety | Rohde et al. (2016) |

# 3 Datasets

In this README, we define the following types of data:

1. Sensor Data from five thermal, light, and shape imaging sensors scanning a 4000 m$$^2$$ field at hourly to weekly intervals at 1 mm$$^2$$ scale resolution. Additional details are provided in the section on sensors and in the sensor technical descriptions found in the metadata/sensors/ directory of this repository.
2. Phenotypes include both sensor-derived and standard field measurements required to validate and calibrate algorithms that compute plant phenotypes from remote sensing data.
3. Environmental data include time series of meteorological variables including temperature, relative humidity, precipitation, wind direction and speed, photosynthetically active radiation, and downwelling spectral radiance.
4. Genomics data include whole-genome resequencing data for 384 varieties from the sorghum Bioenergy Association Panel (BAP) and genotyping-by-sequencing data for 768 sorghum Recombinant Inbred Lines (RILs). Both raw and derived sorghum genome sequencing data are included. Raw data include DNA sequence files in compressed FASTQ format. Derived data are available in Variant Call Format (VCF) and Hapmap files.

## 3.1 Sensor Data

### 3.1.1 Field Scanner Sensors

This publication includes data generated by the sensors listed below. Detailed sensor and system information can be found in the file metadata/sensors_information.zip and can be browsed online through the Clowder interface at terraref.org/clowder, in a space named “Maricopa Agricultural Center Device and Sensor Information”.

The sensor information folder contains extensive documentation for each of the sensors, the field scanner, calibration targets, and the results of sensor validation tests.

Additional sensors not represented in this version of the data are listed in the section on additional sensors.

| Sensor Name | Model | Technical Specifications |
|---|---|---|
| Imaging Sensors | | |
| Stereo RGB Camera | Allied Vision Prosilica GT3300C | |
| Laser Scanner | Custom Fraunhofer 3D | Spatial Resolution: 0.3 to 0.9 mm |
| Thermal Infrared | FLIR A615 | Thermal Sensitivity $$<$$ 50 mK @ 30$$^\circ$$C |
| PS II Camera | LemnaTec PS II Fluorescence Prototype | Illumination 635 nm x 4000 $$\mu$$mol/m2/s, Camera 50 fps |
| Environmental Sensors | | |
| Environmental Sensors | Thies Clima 4.9200.00.000 | |
| VNIR Spectrometer | Spectral Evolution PSR+3500 | Range 350 to 800 nm |
| PAR Sensor | Quantum SQ–300 | Spectral Range 410 to 655 nm |

### 3.1.2 Sensor Data Products

The total size of raw (Level 0) data generated by these sensors is 60 TB. Combined, the Level 1 and Level 2 sensor data products are 490 TB. This size could be substantially reduced through compression and removal of duplicate data. For example, the same images at the same resolution appear in the georeferenced Level 1 files, the full-field mosaics, and the plot-level clips.

Sensor data are stored on the Storage Condo at the National Center for Supercomputing Applications in Urbana, Illinois. We make them available for download with the Globus file transfer system. The following steps are required to access them: 1) get an account at globus.org; 2) search for the terra-public endpoint; 3) install the Globus Connect Personal application and transfer data. Further information is provided in the data access chapter of the TERRA-REF documentation. As an alternative, the data can be provided on hard drives for the cost of supplies, labor, and shipping.

#### 3.1.2.1 Sensor Data Catalog

Globus provides the easiest way to navigate the data. This archive also contains a catalog listing all of the files in the dataset. The catalog is one compressed ZIP file named sensor_data_catalogs.zip. This file includes one sub-directory for each season named sensors/season_[n]_catalog/. The compressed catalogs are 428 MB total, and expand to 5.4 GB when uncompressed. For each season’s catalog there is one directory per data product and one file per day named [data product]/file_catalog_season[n]_[data product]_[filetype]_[YYYY-MM-DD].json.
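The naming pattern above can be turned into a small helper for locating one day's catalog inside the unzipped archive. This is a minimal sketch: the function name `catalog_path` is hypothetical, and the example values for the data product (`rgb_geotiff`) and file type (`tif`) are assumptions used only to illustrate the pattern, so check them against the actual catalog contents.

```python
from datetime import date


def catalog_path(season: int, data_product: str, filetype: str, day: date) -> str:
    """Build the relative path of one daily catalog file.

    Follows the pattern described in this README:
    sensors/season_[n]_catalog/[data product]/
        file_catalog_season[n]_[data product]_[filetype]_[YYYY-MM-DD].json
    """
    return (
        f"sensors/season_{season}_catalog/{data_product}/"
        f"file_catalog_season{season}_{data_product}_{filetype}_{day.isoformat()}.json"
    )


# Illustrative values only; the data product and file type are assumptions.
print(catalog_path(4, "rgb_geotiff", "tif", date(2017, 4, 26)))
```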

These catalog files contain the following information in JSON format:

collections: 'collection name [Data Product Name] - [YYYY-MM]'
datasets: 'dataset name [Data Product Name] - [YYYY-MM-DD]'
files:
    path: 'season-[n]/Level_[m]/[data_product]/[filename]'
    checksum: '[checksum_string]'
    name: '[data product]_L[m]_[YYYY-MM-DD]_[Scan Name]'
    size: 'bytes'

There is one collection per data product per month, and one dataset per data product per day. This structure of collections and datasets refers to the organization of files in the Clowder database and web interface; see the data access documentation.
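As an illustration of how a catalog file might be used, the sketch below totals the number and size of the files listed in one parsed catalog. It assumes, without confirmation from the archive, that `files` is a JSON list of objects with `path` and `size` keys as outlined above; the function name `summarize_catalog` and the example entries are hypothetical.

```python
import json
from collections import Counter
from pathlib import PurePosixPath


def summarize_catalog(catalog: dict) -> dict:
    """Count files and total bytes in one parsed catalog.

    Assumes catalog["files"] is a list of dicts with "path" and "size"
    keys; adjust if the real layout differs.
    """
    files = catalog.get("files", [])
    # int() accepts sizes stored either as numbers or as numeric strings.
    total_bytes = sum(int(entry["size"]) for entry in files)
    by_extension = Counter(PurePosixPath(entry["path"]).suffix for entry in files)
    return {
        "n_files": len(files),
        "total_bytes": total_bytes,
        "by_extension": dict(by_extension),
    }


# Made-up catalog content, for illustration only.
example = json.loads("""
{
  "files": [
    {"path": "season-4/Level_1/rgb_geotiff/example_left.tif", "size": "1024"},
    {"path": "season-4/Level_1/rgb_geotiff/example_right.tif", "size": 2048}
  ]
}
""")
print(summarize_catalog(example))
```

For a real catalog, replace `example` with `json.load(open(path))` on one of the daily JSON files.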

Below is a summary of the sensor data products included in the first release of TERRA-REF data. Sensor-derived phenotypes described in the Phenotype Data section were generated from the 3D laser scanner and RGB camera sensors as described in metadata/methods.csv.

Data Product Sensor Algorithm File Format Plot Clip Full Field
Environment Thies Clima envlog2netcdf netcdf NA NA
Thermal Image FLIR ir_geotiff geotiff +
Point Cloud Fraunhofer Laser 3D laser3d_las las +
Point Cloud Fraunhofer Laser 3D scanner3DTop ply
Images Time-Series PSII Camera ps2png png
Color Images RGB Stereo bin2tiff geotiff + +

Figure 3.1 shows the number of files for each data type across seasons 4 and 6.

#### 3.1.2.2 Sensor Data Directory Contents

The following list describes the organization and contents of the Storage Condo server. These can be accessed at the ncsa#terra-public endpoint on Globus. Directory names have a leading / while file names do not.

• Environment Logger
  • /envlog_netcdf
    • Daily aggregated files named envlog_netcdf_L1_ua-mac_[YYYY-MM-DD].nc.
    • There are also 24 hourly files for each day named [YYYY-MM-DD_HH-MM-SS]_environmentlogger.nc.
• Laser3D
  • /laser3d_las
    • One merged file per scan across the short (E-W) axis with names ending in _merged.las. There are typically 50-100 of these each day.
  • /laser3d_las_plot
    • Each directory has the name of one plot, and there is one LAS file clipped to the plot boundaries for each scan (there may be more than one scan per day).
• RGB Stereo
  • /rgb_geotiff
    • File names ending in _left.tif and _right.tif represent simultaneous images from left and right stereo pair cameras.
  • /rgb_mask
    • These images have the soil represented as black pixels. For each file ending in *_left_mask.tif in the RGB Geotiff dataset, there is an image with black pixels representing areas that contain soil and not plants.
  • /rgb_geotiff_plots
    • For each RGB Geotiff image, there is a Geotiff file with the same dimensions as the plot. It contains the image clipped to the plot boundaries as well as fill values for parts of the plot not in the image.
  • /rgb_fullfield
    • The key data product is one full-resolution full-field image per scan.
    • Other files include: lower resolution versions of the full field (files with names ending in _10pct.tif, _thumb.tif and .png); CSV files containing canopy cover values for each plot; a JSON file listing images contained in the full-field mosaic; a VRT file that is a “virtual geotiff” that was used to generate the full-field mosaic.
    • These full-field Geotiff images are RGB images and image masks tiled together to make up a full-field view. They are not orthomosaics: the images are tiled rather than stitched together, because stitching causes geometric aberrations.
• PSII Camera
  • /ps2_png
    • 101 .png files per folder. The order of the images is indicated by the last four digits of the file name, i.e. _0000.png to _0100.png.
    • 101 georeferenced Geotiff files otherwise identical to the PNG counterparts.
    • These files represent a time series of images captured at a rate of 50 frames per second.
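The PSII naming scheme described above (101 frames numbered _0000 to _0100, captured at 50 frames per second) can be used to order frames and estimate when each was taken relative to the first. The sketch below assumes that frame _0000 marks time zero and that frames are evenly spaced; the README states only the frame rate, so treat the offsets as approximate. The function names are hypothetical.

```python
FRAMES_PER_SERIES = 101  # _0000 through _0100, as described above
FRAME_RATE_HZ = 50       # capture rate stated in this README


def frame_suffixes(ext: str = "png") -> list[str]:
    """Return the expected file-name endings for one PSII series, in order."""
    return [f"_{index:04d}.{ext}" for index in range(FRAMES_PER_SERIES)]


def frame_offset_seconds(file_name: str) -> float:
    """Estimate seconds elapsed since frame _0000.

    Reads the last four digits before the extension as the frame index
    and assumes uniform spacing at FRAME_RATE_HZ.
    """
    stem = file_name.rsplit(".", 1)[0]
    index = int(stem[-4:])
    return index / FRAME_RATE_HZ


suffixes = frame_suffixes()
print(suffixes[0], suffixes[-1])                 # first and last expected endings
print(frame_offset_seconds("example_0100.png"))  # offset of the final frame
```

Under these assumptions one full series spans two seconds.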

#### 3.1.2.3 Sensor Data Directory Structure and File Naming

File names follow consistent patterns based on data product and date of collection. These are intended to be easily understood. The hierarchy is season, data product level, date, date-time, and files.

Specifically, under the directory /season-[4,6] we have:

|-Level
|  |- Data product name
|  |  |-Date (YYYY-MM-DD)
|  |  |  |-Date-Time (YYYY-MM-DD__HH-MM-SS-SSS)
|  |  |  |  |- File name(s)

e.g.

|-Level_1
|  |- envlog_netcdf
|  |  |-2017-08-26
|  |  |  |- 2017-08-26_13-34-54-321_environmentlogger.nc
|  |- laser3d_las
|  |  |-2017-08-26
|  |  |  |-2017-08-26__12-34-54-321
|  |  |  |  |- scanner3DTop - 2017-08-26__12-34-54-321 MergedPointCloud.las

For convenience, we have pre-processed some images and point clouds to plot boundaries and have organized them by Date and then Plot name, e.g.:

|-Level_1_plots
|  |- rgb_geotiff
|  |  |- 2017-04-26
|  |  |  |- MAC Field Scanner Season 4 Range 21 Column 16
|  |  |  |  |- rgb_geotiff_L1_ua-mac_2017-04-26__12-56-14-907_right.tif
|  |  |  |  |- rgb_geotiff_L1_ua-mac_2017-04-26__12-56-14-907_left.tif
|  |  |  |  |- rgb_geotiff_L1_ua-mac_2017-04-26__12-53-34-106_right.tif

## 3.2 Phenotype Data

### 3.2.1 Raw Phenotype Data

Tables of phenotypes can be found in the traits/season_[n]_traits/ folders inside the trait_data.zip file. There is one subdirectory for each of seasons 4 and 6. Once uncompressed, each directory will contain one CSV file for each combination of trait and measurement method. The names of these CSV files help identify the contents because they follow the pattern season_[n]_[trait]_[measurement_type].csv. For example, the file season_6_aboveground_biomass_manual.csv contains manual measurements of above-ground biomass taken during season 6.
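The file naming pattern can be parsed to recover the season, trait, and measurement type. The sketch below assumes that the measurement type is the final underscore-separated token and itself contains no underscore, which holds for the example above but is not guaranteed here for every file. The function name is hypothetical.

```python
import re

# season_[n]_[trait]_[measurement_type].csv, where the trait may contain
# underscores and the measurement type is assumed not to.
TRAIT_FILE_RE = re.compile(
    r"^season_(?P<season>\d+)_(?P<trait>.+)_(?P<measurement_type>[^_]+)\.csv$"
)


def parse_trait_file_name(file_name: str) -> tuple[int, str, str]:
    """Return (season, trait, measurement_type) from a trait CSV file name."""
    match = TRAIT_FILE_RE.match(file_name)
    if match is None:
        raise ValueError(f"unrecognized trait file name: {file_name!r}")
    return int(match["season"]), match["trait"], match["measurement_type"]


print(parse_trait_file_name("season_6_aboveground_biomass_manual.csv"))
```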

These CSV files have one measurement per row for a specific date, location, genotype, and measurement. The first line is a header that contains the names of the fields:

• plot: (text) Plot name, using the format <field site> Season <n> Range <m> Column <k>.
• scientificname: (text) Latin name for the crop species. This will always be Sorghum bicolor until future versions with data from additional crops are published.
• genotype: (text) Genotype or accession identifier.
• treatment: (text) Name of experimental treatment.
• date: (YYYY-MM-DD) Date of measurement.
• trait: (text) Name of the trait measured. Defined in the file metadata/variables.csv.
• method: (text) The method used to measure the trait. Defined in the file metadata/methods.csv.
• mean: (numeric) Value of the phenotype data.
• checked: (boolean) 0 = unchecked and 1 = checked: has the data been independently reviewed?
• author: (text) name of scientist who collected the data or who wrote the algorithm used to derive phenotypes from sensor data.
• season: (text) Name of season: one of ‘Season 4’ or ‘Season 6’.
• method_type: (text) Type of measurement: one of ‘manual’ or ‘sensor’.
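The fields listed above can be read with the standard `csv` module and converted to native Python types. This is a sketch under stated assumptions: the `checked` column is taken to be stored as the text 0 or 1, and the sample row is entirely made up (its genotype, trait, method, author, and value are placeholders, not real data).

```python
import csv
import io
from datetime import date


def read_traits(handle) -> list[dict]:
    """Read a trait CSV and convert date, mean, and checked to Python types."""
    rows = []
    for row in csv.DictReader(handle):
        row["date"] = date.fromisoformat(row["date"])  # YYYY-MM-DD
        row["mean"] = float(row["mean"])
        row["checked"] = row["checked"] == "1"         # assumes 0/1 encoding
        rows.append(row)
    return rows


# Made-up sample with the header fields listed above; values are placeholders.
SAMPLE = (
    "plot,scientificname,genotype,treatment,date,trait,method,mean,"
    "checked,author,season,method_type\n"
    "MAC Field Scanner Season 6 Range 1 Column 1,Sorghum bicolor,"
    "EXAMPLE_GENOTYPE,example treatment,2018-06-01,example_trait,"
    "example method,1.5,1,Example Author,Season 6,manual\n"
)

rows = read_traits(io.StringIO(SAMPLE))
reviewed = [row for row in rows if row["checked"]]
print(len(rows), len(reviewed))
```

With a real file, pass `open(path, newline="")` instead of the `io.StringIO` sample.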