About big data file shares
A big data file share is an item created in your portal that references a location available to your ArcGIS GeoAnalytics Server. The big data file share location can be used as input and output for feature data (points, polylines, polygons, and tabular data) in GeoAnalytics tools. When you create a big data file share, an item is created in your portal. The item points to a big data catalog service, which outlines the datasets in the big data file share and their schema, including geometry and time information, as well as the output formats, called templates, that you have registered. When using a big data file share as the input to an ArcGIS GeoAnalytics Server tool, you can browse to the item to run analysis on a dataset.
There are several benefits to using a big data file share. You can keep your data in an accessible location until you are ready to perform analysis. A big data file share accesses the data when the analysis is run, so you can continue to add data to an existing dataset in your big data file share without having to reregister or publish your data. You can also modify the manifest to remove, add, or update datasets in the big data file share. Big data file shares are extremely flexible in how time and geometry can be defined, and allow for multiple time formats on a single dataset. Big data file shares also allow you to partition your datasets while still treating multiple partitions as a single dataset. Using big data file shares for output data allows you to store your results in formats that you may use for other workflows, such as a parquet file for further analysis or storage.
Note:
Big data file shares are only accessed when you run GeoAnalytics Tools. This means that you can only browse and add big data files to your analysis; you cannot visualize the data on a map.
Big data file shares can reference the following input data sources:
- File share—A directory of datasets on a local disk or network share.
- Apache Hadoop Distributed File System (HDFS)—An HDFS directory of datasets.
- Apache Hive—Hive metastore databases.
- Cloud store—An Amazon Simple Storage Service (S3) bucket, Microsoft Azure Blob container, or Microsoft Azure Data Lake store containing a directory of datasets.
When writing results from GeoAnalytics Tools to a big data file share, you can use the following location types:
- File share
- HDFS
- Cloud store
The following file types are supported as datasets for input and output in big data file shares:
- Delimited files (such as .csv, .tsv, and .txt)
- Shapefiles (.shp)
- Parquet files (.gz.parquet)
- ORC files (.orc)
Note:
A big data file share is only available for use if the portal administrator has enabled GeoAnalytics Server. To learn more about enabling GeoAnalytics Server, see Set up ArcGIS GeoAnalytics Server.
Big data file shares are one of several ways GeoAnalytics Tools can access your data and are not a requirement for GeoAnalytics Tools. See Use the GeoAnalytics Tools in Map Viewer for a list of possible GeoAnalytics Tools data inputs and outputs.
You can register as many big data file shares as you need. Each big data file share can have as many datasets as you want.
The table below outlines some important terms when talking about big data file shares.
Term | Description |
---|---|
Big data file share | A location registered with your GeoAnalytics Server to be used as dataset input, output, or both input and output to GeoAnalytics tools. |
Big data catalog service | A service that outlines the input datasets and schemas and output template names of your big data file share. This is created when your big data file share is registered, and your manifest is created. To learn more about big data catalog services, see the Big Data Catalog Service documentation in the ArcGIS Services REST API help. |
Big data file share item | An item in your portal that references the big data catalog service. You can control who can use your big data file share as input to GeoAnalytics by sharing this item in portal. |
Manifest | A JSON file that outlines the datasets available and the schema for inputs in your big data file share. The manifest is automatically generated when you register a big data file share and can be modified by editing or using a hints file. A single big data file share has one manifest. |
Output templates | One or more templates that outline the file type and optional formatting used when writing results to a big data file share. For example, a template could specify that results are written to a shapefile. A big data file share can have zero, one, or more output templates. |
Big data file share type | The type of location you are registering. For example, you could have a big data file share of type HDFS. |
Big data file share dataset format | The format of the data you are reading or writing. For example, the file type may be shapefile. |
Hints file | An optional file that can be used to assist in generating a manifest for delimited files used as an input. |
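To make the Manifest term above concrete, the following is a minimal sketch of what a generated manifest for a single delimited dataset can look like. The dataset name, field names, and formats are illustrative only; see Understanding the big data file share manifest for the authoritative layout.

```json
{
  "datasets": [
    {
      "name": "Earthquakes",
      "format": {
        "type": "delimited",
        "extension": "csv",
        "fieldDelimiter": ",",
        "hasHeaderRow": true,
        "encoding": "UTF-8"
      },
      "schema": {
        "fields": [
          {"name": "magnitude", "type": "esriFieldTypeDouble"},
          {"name": "longitude", "type": "esriFieldTypeDouble"},
          {"name": "latitude", "type": "esriFieldTypeDouble"},
          {"name": "event_date", "type": "esriFieldTypeString"}
        ]
      },
      "geometry": {
        "geometryType": "esriGeometryPoint",
        "spatialReference": {"wkid": 4326},
        "fields": [
          {"name": "longitude", "formats": ["x"]},
          {"name": "latitude", "formats": ["y"]}
        ]
      },
      "time": {
        "timeType": "instant",
        "timeReference": {"timeZone": "UTC"},
        "fields": [
          {"name": "event_date", "formats": ["MM/dd/yyyy HH:mm"]}
        ]
      }
    }
  ]
}
```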
Prepare your data to be registered as a big data file share
To use your datasets as inputs in a big data file share, you need to make sure your data is correctly formatted. See below for the formatting based on the big data file share type.
File shares and HDFS
To prepare your data for a big data file share, format your datasets as subfolders under a single parent folder that will be registered. Within the parent folder you register, the names of the subfolders represent the dataset names. If a subfolder contains multiple folders or files, all of the contents of that top-level subfolder are read as a single dataset and must share the same schema. When you register a parent folder, all subdirectories under the folder you specify are also registered with the GeoAnalytics Server. Always register the parent folder (for example, \\machinename\FileShareFolder) that contains one or more individual dataset folders. The following is an example of how to register the folder FileShareFolder, which contains three datasets named Earthquakes, Hurricanes, and GlobalOceans.
Example of a big data file share that contains three datasets: Earthquakes, Hurricanes, and GlobalOceans.

|---FileShareFolder             < -- The top-level folder is what is registered as a big data file share
   |---Earthquakes              < -- A dataset "Earthquakes", composed of 4 csvs with the same schema
      |---1960
         |---01_1960.csv
         |---02_1960.csv
      |---1961
         |---01_1961.csv
         |---02_1961.csv
   |---Hurricanes               < -- The dataset "Hurricanes", composed of 3 shapefiles with the same schema
      |---atlantic_hur.shp
      |---pacific_hur.shp
      |---otherhurricanes.shp
   |---GlobalOceans             < -- The dataset "GlobalOceans", composed of a single shapefile
      |---oceans.shp
This same structure is applied to file shares and HDFS, although the terminology differs. In a file share, there is a top-level folder or directory, and datasets are represented by the subdirectories. In HDFS, the file share location is registered and contains datasets. The following table outlines the differences:
 | File share | HDFS |
---|---|---|
Big data file share location | A folder or directory | An HDFS path |
Datasets | Top-level subfolders | Datasets within the HDFS path |
Once your data is organized as a folder with dataset subfolders, make your data accessible to your GeoAnalytics Server by following the steps in Make your data accessible to ArcGIS Server and registering the dataset folder.
Access HDFS using Kerberos
GeoAnalytics Server can access HDFS using Kerberos authentication.
Note:
GeoAnalytics Server supports RPC protection set to authentication (hadoop.rpc.protection = authentication). GeoAnalytics Server does not currently support the integrity or privacy modes.
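In a typical Hadoop deployment, this setting lives in the cluster's core-site.xml. A minimal fragment showing the supported value:

```xml
<!-- core-site.xml: RPC protection level for Hadoop services.
     GeoAnalytics Server requires the "authentication" mode. -->
<property>
  <name>hadoop.rpc.protection</name>
  <value>authentication</value>
</property>
```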
Follow these steps to register the HDFS file share using Kerberos authentication:
- On Windows, copy the krb5.ini file to C:\windows\krb5.ini on all machines in your GeoAnalytics Server site. On Linux, copy the krb5.conf file to /etc/krb5.conf on all machines in your GeoAnalytics Server site.
- Sign in to your GeoAnalytics Server site from ArcGIS Server Administrator Directory.
ArcGIS Server Administrator Directory requires you to sign in as an administrator. To connect to your federated GeoAnalytics Server site, you must sign in using a portal token, which requires the portal administrator's credentials, or as the GeoAnalytics Server site's primary site administrator. If you are not a portal administrator or do not have access to the primary site administrator account information, contact your portal administrator to complete these steps for you.
- Go to data > registerItem.
- Copy the following text and paste it into the Item text box. Update the following values:
- <bigDataFileShareName>: Replace with the name you want for the big data file share.
- <hdfs path>: Replace with the fully qualified file system path to the big data file share, for example, hdfs://domainname:port/folder.
- <user@realm>: Replace with the user and realm of the principal.
- <keytab location>: Replace with the location of the keytab file. The keytab file must be accessible to all machines in the GeoAnalytics Server site, for example, //shared/keytab/hadoop.keytab.
{
  "path": "/bigDataFileShares/<bigDataFileShareName>",
  "type": "bigDataFileShare",
  "info": {
    "connectionString": "{\"path\":\"<hdfs path>\",\"accessMode\":\"Kerberos\",\"principal\":\"<user@realm>\",\"keytab\":\"<keytab location>\"}",
    "connectionType": "hdfs"
  }
}
- Click Register Item.
Once the item is registered, the big data file share appears as a data store in ArcGIS Server Manager with a populated manifest. If the manifest is not populated, continue with the remaining steps.
- Sign in to your GeoAnalytics Server site from ArcGIS Server Manager.
You can sign in as a publisher or administrator.
- Go to Site > Data Stores and click the Regenerate Manifest button next to your new big data file share.
You now have a big data file share and manifest for your HDFS, which you will access through Kerberos authentication. The big data file share item in your portal points to a big data catalog service in the GeoAnalytics Server.
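Because the connectionString property in the registration item is itself an escaped JSON string, it can be error prone to write by hand. The following sketch builds the payload programmatically; all values (share name, HDFS path, principal, keytab location) are examples to replace with your own.

```python
import json

# Example values -- replace with your own site's details.
big_data_file_share_name = "MyHdfsShare"
hdfs_path = "hdfs://namenode.example.com:8020/data"
principal = "hdfsuser@EXAMPLE.COM"
keytab = "//shared/keytab/hadoop.keytab"

# connectionString is a JSON string embedded inside the outer item,
# so it is serialized first and then placed in the "info" object.
connection_string = json.dumps({
    "path": hdfs_path,
    "accessMode": "Kerberos",
    "principal": principal,
    "keytab": keytab,
})

item = {
    "path": "/bigDataFileShares/" + big_data_file_share_name,
    "type": "bigDataFileShare",
    "info": {
        "connectionString": connection_string,
        "connectionType": "hdfs",
    },
}

# Paste this output into the Item text box of data > registerItem
# in ArcGIS Server Administrator Directory.
print(json.dumps(item, indent=2))
```

Serializing the inner object with json.dumps guarantees the quotes are escaped correctly, which is the most common mistake when composing the item by hand.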
Hive
Note:
GeoAnalytics Server uses Spark 3.0.1. Hive must be version 2.3.7 or 3.0.0–3.1.2.
If you try to register a big data file share with a Hive version that is not supported, the big data file share registration will fail. If this happens, restart the GeoAnalyticsManagement toolbox: in ArcGIS Server Administrator Directory, go to services > System > GeoAnalyticsManagement, click stop, and then repeat the steps to start it.
In Hive, all tables in a database are recognized as datasets in a big data file share. In the following example, there is a metastore with two databases, default and CityData. When registering a Hive big data file share through ArcGIS Server with your GeoAnalytics Server, only one database can be selected. In this example, if the CityData database was selected, there would be two datasets in the big data file share, FireData and LandParcels.

|---HiveMetastore               < -- The top-level folder is what is registered as a big data file share
   |---default                  < -- A database
      |---Earthquakes
      |---Hurricanes
      |---GlobalOceans
   |---CityData                 < -- A database that is registered (specified in Server Manager)
      |---FireData
      |---LandParcels
Cloud stores
There are three steps to registering a big data file share of type cloud store.
Prepare your data
To prepare your data for a big data file share in a cloud store, format your datasets as subfolders under a single parent folder.
The following is an example of how to structure your data. This example registers the parent folder, FileShareFolder, which contains three datasets: Earthquakes, Hurricanes, and GlobalOceans. When you register a parent folder, all subdirectories under the folder you specify are also registered with GeoAnalytics Server.

Example of how to structure data in a cloud store that will be used as a big data file share. This big data file share contains three datasets: Earthquakes, Hurricanes, and GlobalOceans.

|---Cloud Store                       < -- The cloud store being registered
   |---Container or S3 Bucket Name    < -- The container (Azure) or bucket (Amazon) being registered as part of the cloud store
      |---FileShareFolder             < -- The parent folder that is registered as the 'folder' during cloud store registration
         |---Earthquakes              < -- The dataset "Earthquakes", composed of 4 csvs with the same schema
            |---1960
               |---01_1960.csv
               |---02_1960.csv
            |---1961
               |---01_1961.csv
               |---02_1961.csv
         |---Hurricanes               < -- The dataset "Hurricanes", composed of 3 shapefiles with the same schema
            |---atlantic_hur.shp
            |---pacific_hur.shp
            |---otherhurricanes.shp
         |---GlobalOceans             < -- The dataset "GlobalOceans", composed of 1 shapefile
            |---oceans.shp
Register the cloud store with your GeoAnalytics Server
Connect to your GeoAnalytics Server site from ArcGIS Server Manager to register a cloud store. When you register a cloud store, you must include an Azure container name, an Amazon S3 bucket name, or an Azure Data Lake Store account name. It is recommended to additionally specify a folder within the container or bucket. The specified folder is composed of subfolders, and each represents an individual dataset. Each dataset is composed of all the contents of the subfolder.
Register the cloud store as a big data file share
Follow these steps to register the cloud store you created in the previous section as a big data file share:
- Sign in to your GeoAnalytics Server site from ArcGIS Server Manager.
You can sign in as a publisher or administrator.
- Go to Site > Data Stores and choose Big Data File Share from the Register drop-down list.
- Provide the following information in the Register Big Data File Share dialog box:
- Type a name for the big data file share.
- Choose Cloud Store from the Type drop-down list.
- Choose the name of your cloud store from the Cloud Store drop-down list.
- Click Create to register your cloud store as a big data file share.
You now have a big data file share and manifest for your cloud store. The big data file share item in your portal points to a big data catalog service in the GeoAnalytics Server.
Register your big data file share
To register a file share, HDFS, or Hive as a big data file share, connect to your GeoAnalytics Server site through ArcGIS Server Manager. See Register your data with ArcGIS Server using Manager in the ArcGIS Server help for details on the necessary steps.
Tip:
Steps for registering a cloud store as a big data file share were covered in the previous section.
When a big data file share is registered, a manifest is generated that outlines the format of the datasets within your share location, including the fields representing the geometry and time. If you optionally chose to register your big data file share as an output location, an output template manifest is also generated. A big data file share item is created in your portal that points to a big data catalog service in the GeoAnalytics Server where you registered the data. To learn more about big data catalog services, see the Big Data Catalog Service documentation in the ArcGIS Services REST API help.
Modify a big data file share
When a big data catalog service is created, a manifest for the input data is automatically generated and uploaded to the GeoAnalytics Server site where you registered the data. The process of generating a manifest may not always correctly estimate the fields representing geometry and time, and you may need to apply edits. To edit a manifest, follow the steps in Edit big data file share manifests in Manager. To learn more about the big data file share manifest, see Understanding the big data file share manifest in the ArcGIS Server help.
Modify the output templates for a big data file share
When you choose to use the big data file share as an output location, output templates are automatically generated. These templates outline the formatting of output analysis results, such as the file type, and how time and geometry will be registered. If you want to modify the geometry or time formatting, or add or delete templates, you can modify the templates. To edit the output templates, follow the steps in Edit big data file share manifests in Manager. To learn more about output templates, see Output templates in a big data file share.
Run analysis on a big data file share
You can run analysis on a dataset in a big data file share through any clients that support GeoAnalytics Server, which include the following:
- ArcGIS Pro
- Map Viewer
- ArcGIS REST API
- ArcGIS API for Python
To run your analysis on a big data file share through ArcGIS Pro or Map Viewer, select the GeoAnalytics Tools you want to use. For the input to the tool, browse to where your data is located under Portal in ArcGIS Pro or on the Browse Layers dialog box in Map Viewer. Data will be in My Content if you registered the data yourself. Otherwise, look in your Groups or All Portal. Note that a big data file share layer selected for analysis will not be displayed in the map.
Note:
Make sure you are signed in with a portal account that has access to the registered big data file share. You can search your portal with the term bigDataFileShare* to quickly find all the big data file shares you can access.
To run analysis on a big data file share through the ArcGIS REST API, use the big data catalog service URL as the input. This will be in the format {"url":"https://webadaptorhost.domain.com/webadaptorname/rest/DataStoreCatalogs/bigDataFileShares_filesharename/BigDataCatalogServer/dataset"}. For example, with a machine named example, a domain named esri, a Web Adaptor named server, a big data file share named MyData, and a dataset named Earthquakes, the URL would be: {"url":"https://example.esri.com/server/rest/DataStoreCatalogs/bigDataFileShares_MyData/BigDataCatalogServer/Earthquakes"}. To learn more about input to big data analysis through REST, see the Feature Input topic in the ArcGIS Services REST API documentation.
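The URL pattern above can be assembled from its parts as a quick sanity check before submitting a REST request. The host, Web Adaptor, file share, and dataset names below are the example values from this section.

```python
import json

# Example values matching the URL pattern described above.
host = "example.esri.com"
web_adaptor = "server"
big_data_file_share = "MyData"
dataset = "Earthquakes"

# Big data catalog service URL for one dataset in the file share.
url = (
    f"https://{host}/{web_adaptor}/rest/DataStoreCatalogs/"
    f"bigDataFileShares_{big_data_file_share}/BigDataCatalogServer/{dataset}"
)

# The REST input parameter is a JSON object with a single "url" key.
input_parameter = json.dumps({"url": url})
print(input_parameter)
```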
Save results to a big data file share
You can run analysis on a dataset (big data file share or other input) and save the results to a big data file share. When you save results to a big data file share, you are not able to visualize them. You can do this through the following clients:
- Map Viewer
- ArcGIS REST API
- ArcGIS API for Python
When you write results to a big data file share, the input manifest is updated to include the dataset you just saved. The results you have written to the big data file share are now available as an input for another tool run.