DKAN Harvest

DKAN Harvest is a module that provides a common harvesting framework for DKAN. It supports custom extensions and adds drush commands and a web UI to manage harvesting sources and jobs. To “harvest” data is to use the public feed or API of another data portal to import items from that portal’s catalog into your own. For example, Data.gov harvests all of its datasets from the data.json files of hundreds of U.S. federal, state and local data portals.

DKAN Harvest is built on top of the widely-used Migrate framework for Drupal. It follows a two-step process to import datasets:

  1. Process a source URI and save resulting data locally to disk as JSON
  2. Perform migrations into DKAN with the locally cached JSON files, using mappings provided by the DKAN Migrate Base module .

Harvest Sources

Harvest Sources are nodes that store the source’s URI and some additional configuration. Users with the administrator or site manager role will be able to create and manage harvest sources.

Create a new harvest source

Go to node/add/harvest-source or Content > Add content > Harvest Source, and fill out the form.

Title:Administrative title for the source.
Description:Administrative description or notes about the source.
Source URI:The full Uniform Resource Identifier, such as http://demo.getdkan.com/data.json.
Type:data.json or data.xml - If the harvest source type you are looking for is not available, please refer to the Define a new Harvest Source Type section below.

The form includes four multi-value fields to help you fine tune the results of your harvest.

Filters:Filters restrict the datasets imported by a particular field. For instance, if you are harvesting a data.json source and want only to harvest health-related datasets, you might add a filter with “keyword” in the first text box, and “heatlh” in the second.
Excludes:Excludes are the inverse of filters. For example, if you know there is one publisher listed on the source whose datasets you do not want to bring into your data portal, you might add “publisher” with value “Governor’s Office of Terrible Data”
Overrides:Overrides will replace values from the source when you harvest. For instance, if you want to take responsibility for the datasets once harvested and add your agency’s name as the publisher, you might add “publisher” with your agency’s name as the value.
Defaults:Defaults work the same as overrides, but will only be used if the relevant field is empty in the source

Project Open Data (as well as most metadata APIs) includes many fields that are not simple key-value pairs. If you need to access or modify nested array values you can use this dot syntax to specify the path: key.nested_key.0.other_nested_key. For example, the Publisher field in Project Open Data is expressed like this:

"publisher": {
  "@type": "org:Organization",
  "name": "demo.getdkan.com"
},

To access the name property for filtering or overriding, you can set publisher.name in the first text box and the value you want to use in the second one.

https://dkan-documentation-files.s3.us-east-2.amazonaws.com/dkan1/harvest-filters.png

Harvest sources also include a Topics field. Chose one or more topics to apply them to every dataset harvested from that source.

Now click Save. The datasets from the source will be cached, and you will see a preview of what will be imported. This preview page shows a list of dataset titles and identifiers now in the harvest cache, allowing you to perform a basic check on your source configuration. If it does not look right, click the Edit tab to make adjustments.

Click Harvest Now. The datasets that were cached will now be imported into your site.

https://dkan-documentation-files.s3.us-east-2.amazonaws.com/dkan1/harvest-search.png

Harvest Source nodes are viewable by the public and provide some basic metadata to the user.

Warning

Some behaviors of the Topics field on harvest sources to be aware of:

  • Changing the Topic on the source and re-harvesting will not update the Topic on harvested datasets if nothing else has changed. The Harvester will only re-import a dataset if it detects changes from the source.
  • If you manually add additional topics to a harvested dataset, and there is a change at the source, the next time the dataset is harvested your topics will be overwritten.

Warning

Some behaviors of harvesting resources to be aware of:

  • If only an accessURL value is given, the url will be saved to the API or Website URL field.
  • If a downloadURL value is given, the url will be saved to the Remote file field.
  • The maximum size of a managed file field is 255 characters. It is not possible to increase the size of the field as this would force MySQL to auto-convert the VARCHAR(255) to a SMALLTEXT datatype, which subsequently fails with error 1170 on key length if the column is used as primary key or unique or non-unique index. Therefore the Harvester will check the length of the url and if it exceeds 255 characters, will fall back to using the effective url which is the last url in a redirect chain.

Managing Harvest Sources

From the admin menu, navigate to DKAN > DKAN Harvest Dashboard to view harvest sources. The DKAN Harvest Dashboard provides site managers a quick overview of harvest sources, when they were last updated, number of datasets, and status of the source. The dashboard also allows site managers to perform manual harvest operations:

Harvest (cache and migrate):
 This will cache the source data locally and migrate that source data into your site content.
Cache source(s):
 This will fetch the source data, apply the source configuration (filters, excludes, etc.) and cache the data locally without migrating. You may wish to do this to check for errors, or to refresh the preview.
Migrate source(s):
 This will migrate the current cache for the selected sources, no matter how old it is.
https://dkan-documentation-files.s3.us-east-2.amazonaws.com/dkan1/harvest-dashboard.png

Click on the title of a harvest source from the dashboard to see the details of that source. Administrative tasks can be accomplished from the tabs across the top.

https://dkan-documentation-files.s3.us-east-2.amazonaws.com/dkan1/harvest-source-node.png
View:View the harvest source node.
Edit:Click to make changes to the configuration of the harvest source.
Preview:Click to pull the latest data from the source endpoint into the cache.
Manage Datasets:
 An administrative view that lets you sort and filter the datasets from this source. The most powerful function on this page is to filter by orphan status. When a dataset that was harvested into your system previously is no longer provided in the source, it is considered “orphaned” on your site and unpublished. From the Manage Datasets screen, you can either permanently delete or re-publish orphan datasets.
Events:Event Log that provides historical data on all harvests run on this source. The information is managed by the core dkan_harvest via a per-harvest source migrate_log table that tracks the number of datasets created, updated, failed, orphaned, and unchanged and status. If the value for the field Status is Error then you can click on the text to see the log error and identify the problem.
Errors:Error log that shows a list of all errors recorded during harvesting on the source.

Harvested Resources

When datasets are harvested, the resources are added as remote files, which means they are links to the original files on the remote server. If you modify the resource node in your DKAN site, your changes will be overwritten the next time a harvest is performed. If you add a harvested resource to the datastore be sure to set up periodic importing so that the resource stays in sync with the source. For these reasons, we do not recommend that you create visualizations based on harvested resources as the visualizations could break when changes are made to the files upstream.

Harvest Drush Commands

DKAN Harvest provides multiple drush commands to manage harvest sources and control harvest jobs. In fact, once your sources are properly configured, running harvests from Drush on a cron job or other scheduling system like Jenkins is highly reccomended.

It is recommanded to pass the --user=1 option to harvest drush operations (especially harvest migration jobs) to make sure that the entities created have a proper user as author.

List Harvest sources available

# List all available Harvest Sources
$ drush --user=1 dkan-harvest-status
# Alias
$ drush --user=1 dkan-hs

Run a full harvest (Cache & Migration)

# Harvest data and run migration on all the harvest sources available.
$ drush --user=1 dkan-harvest
# Alias
$ drush --user=1 dkan-h

# Harvest specific  harvest source.
$ drush --user=1 dkan-harvest test_harvest_source
# Alias
$ drush --user=1 dkan-h test_harvest_source

Run a harvest cache

# Run a harvest cache operation on all the harvest sources available.
$ drush --user=1 dkan-harvest-cache
# Alias
$ drush --user=1 dkan-hc

# Harvest cache specific harvest source.
$ drush --user=1 dkan-harvest-cache test_harvest_source
# Alias
$ drush --user=1 dkan-hc test_harvest_source

Run a harvest migration job

# Run a harvest migrate operation on all the harvest sources available.
$ drush --user=1 dkan-harvest-migrate
# Alias
$ drush --user=1 dkan-hm

# Harvest migrate specific harvest source.
$ drush --user=1 dkan-harvest-migrate test_harvest_source
# Alias
$ drush --user=1 dkan-hm test_harvest_source

Extending DKAN Harvest

DKAN developers can use the api provided by DKAN Harvest to add support for additioanl harvest source types. The dkan_harvest_datajson module encapsulate the reference implementation providing support for POD type sources.

If you need to harvest from an end point type other then POD. You can extend the DKAN Harvest APIs to implement said support by following a simple checklist:

  • Define a new Harvest Source Type via hook_harvest_source_types.
  • Implement the Harvest Source Type cache callback.
  • Implement the Harvest Source Type Migration Class.
  • (Optional) Write tests for your source type implementation.

Define a new Harvest Source Type

DKAN Harvest leverages Drupal’s hook system to provide a way to extend the source types that DKAN Harvest supports. To add a new harvest source type the we return their definitions as array items via the hook_harvest_source_types() hook.

/**
 * Implements hook_harvest_source_types().
 */
function dkan_harvest_test_harvest_source_types() {
  return array(
    'harvest_test_type' => array(
      'machine_name' => 'harvest_test_type',
      'label' => 'Dkan Harvest Test Type',
      'cache callback' => 'dkan_harvest_cache_default',
      'migration class' => 'HarvestMigration',
    ),

    // Define another harvest source type.
    'harvest_another_test_type' => array(
      'machine_name' => 'harvest_another_test_type',
      'label' => 'Dkan Harvest Another Test Type',
      'cache callback' => 'dkan_harvest_cache_default',
      'migration class' => 'HarvestMigration',
    ),
  );
}

Each array item defines a single harvest source type. Each harvest source item consists of an array with 4 keyed values:

machine_name:Unique string identifying the harvest source type.
label:This label wil be used on the harvest add node form.
cache callback:Cache function to perform; takes HarvestSource object and timestamp as arguments) and returns a HarvestCache object
migration class:
 A registered Migrate class to use for this source type

Cache callbacks

/**
 * @param HarvestSource $source
 * @param $harvest_updatetime
 *
 * @return HarvestCache
 */
function dkan_harvest_datajson_cache(HarvestSource $source, $harvest_updatetime)

This callback takes care of downloading/filtering/altering the data from the source end-point to the local file directory provided by the HarvestSource::getCacheDir() method. The recommended folder structure for cached data is to have one dataset per uniqely named file. The actual migration is then performed on the cached data, not on the remote source itself.

sh
$ tree
.
├── 5251bc60-02e2-4023-a3fb-03760551ab4a
├── 80756f84-894f-4796-bb52-33dd0a54164e
├── 846158bd-1821-48d8-80c8-bb23a98294a9
└── 84cada83-2382-4ba2-b9be-97634b422a07

0 directories, 4 files

$ cat 84cada83-2382-4ba2-b9be-97634b422a07
/* JSON content of the cached dataset data */

The harvest cache function needs to support the modifications to the source available from the harvest source via the Filter, Excludes, Overrides and Default fields. Each of these configurations is available from the HarvestSource object via the HarvestSource::filters, HarvestSource::excludes, HarvestSource::overrides, HarvestSource::defaults methods.

Migration Classes

The common harvest migration logic is encapsulated in the HarvestMigration class, (which extends the MigrateDKAN class provided via the DKAN Migrate Base module. DKAN Harvest will support only migration classes extended from HarvestMigration. This class is responsible for consuming the downloaded data during the harvest cache step to create the DKAN dataset and associated nodes.

Implementing a Harvest Source Type migration class is the matter of checking couple of boxes:

  • Wire the cached files on the HarvestMigration::__construct() method.
  • Override the fields mapping on the HarvestMigration::setFieldMappings() method.
  • Add alternate logic for existing default DKAN fields or extra logic for custom fields on the HarvestMigration::prepareRow() and the HarvestMigration::prepare().

Working on the migration class for Harvest Source Type should be straightforward, but a good knowladge of how Migrate module works is a big help.

HarvestMigration::__construct()

Setting the `HarvestMigrateSourceList is the only logic required during the construction of the extended HarvestMigration. During the harvest migration we can’t reliably determin and parse the type of cache file (JSON, XML, etc..) so we still need to provide this information to the Migration class via the MigrateItem variable. the Migrate module provide different helpful class for different input file parsing (MigrateItemFile, MigrateItemJSON, MigrateItemXML). For the the POD dkan_harvest_datajson reference implementation we use the MigrateItemJSON class to read the JSON files downloaded from data.json end-points.

public function __construct($arguments) {
  parent::__construct($arguments);
  $this->itemUrl = drupal_realpath($this->dkanHarvestSource->getCacheDir()) .
    '/:id';

  $this->source = new HarvestMigrateSourceList(
    new HarvestList($this->dkanHarvestSource->getCacheDir()),
    new MigrateItemJSON($this->itemUrl),
    array(),
    $this->sourceListOptions
  );
}

HarvestMigration::setFieldMappings()

The default Mapping for all the default DKAN fields and properties is done on the HarvestMigration::setFieldMapping() method. Overriding one or many field mapping is done by overrrding the setFieldMapping() in the child class and add/update the new/changed fields.

For example to override the mapping for the og_group_ref field.

public function setFieldMappings() {
  parent::setFieldMappings();
  $this->addFieldMapping('og_group_ref', 'group_id');

Resources import

The base HarvestMigration class will (by default) look for a $row->resources objects array that should contain all the data needed for constructing the resource node(s) associated with the dataset. the helper method HarvestMigration::prepareResourceHelper() should make creating the resources array items more streamlined.

Example code snippet:

/**
 * Implements prepareRow.
 */
public function prepareRow($row) {
  // Redacted code

  $row->resources = $this->prepareRowResources($row->xml);

  // Redacted code
}

Harvest and DKAN Workflow support

By default, DKAN Harvest will make sure that the harvested dataset node will be set to the published moderation state if the DKAN Workflow module is enabled on the DKAN site. This can be changed at the fields mapping level by overriding the workbench_moderation_state_new field.