How to create a harvest

Use drush commands to Harvest data into your catalog.

Create a harvest JSON file

Normally you would use the data.json provided by another data catalog and “harvest” the datasets into your catalog. But harvests can also be used for bulk management of datasets from a manually generated data.json file.

Example data.json with a single dataset:

{
  "@context": "https://project-open-data.cio.gov/v1.1/schema/catalog.jsonld",
  "@id": "https://site-domain.com/data.json",
  "@type": "dcat:Catalog",
  "conformsTo": "https://project-open-data.cio.gov/v1.1/schema",
  "describedBy": "https://project-open-data.cio.gov/v1.1/schema/catalog.json",
  "dataset": [
    {
      "@type": "dcat:Dataset",
      "identifier": "my-universally-unique-identifier",
      "modified": "2023-10-01",
      "accessLevel": "public",
      "title": "Example Dataset Title",
      "description": "<p>Example dataset description text.</p>",
      "keyword": [
         "Example keyword"
      ]
      "distribution": [
        {
          "@type": "dcat:Distribution",
          "downloadURL": "https://site-domain.com/sites/default/files/Bike_Lane.csv",
          "mediaType": "text/csv"
        }
      ]
    }
  ]
}

The above example contains all the required properties for a dataset if using the default schema provided with DKAN. The default schema requires a distribution and at least one keyword. The identifier must be a unique string within the given catalog. You can find descriptions of the above properties and additional optional properties by viewing the example dataset schema that ships with DKAN at schema/collections/dataset.json.

If you plan to maintain the datasets with the harvest process then do not edit the datasets via the UI or API. Edit the metadata in the data.json file and re-run the harvest to update the datasets.

If you are using the harvest to simply bulk generate datasets, and want to allow data publishers to update the datasets as needed via the UI or API, delete the data.json file to prevent overwritting any changes. After the 2.16.13 release, you could also deregister the harvest.

Register a harvest

Register a new Harvest Plan.

  • Create a unique name as the identifier of your harvest. Do not use a hyphen in the identifier.

  • Provide the full URI for the data source

Example

drush dkan:harvest:register --identifier=myHarvestId --extract-uri=http://example.com/data.json

You can view a list of all registered harvest plans with dkan:harvest:list

Run the harvest

Once you have registered a harvest source, run the import, passing in the identifier as an arguement

drush dkan:harvest:run myHarvestId

View the status of the harvest

drush dkan:harvest:status myHarvestId

You can also navigate to admin/dkan/harvest to view the status of the extraction, the date the harvest was run, and the number of datasets that were added by the harvest. By clicking on the harvest ID, you will also see specific information about each dataset, and the status of the datastore import.

Revert a harvest

drush dkan:harvest:revert myHarvestId

This will delete all of the datasets listed in the specified harvest ID. Any referenced terms will be set to the orphaned state. Any distributions from the harvest will be unpublished. Alternatively you could run dkan:harvest:archive to unpublish the datasets without deleting them from your catalog.

Deregister a harvest

drush dkan:harvest:deregister myHarvestId

This will remove the harvest plan.

Transforms

If you would also like to make changes to the data you are harvesting, you can create custom transforms that will modify the data before saving it to your catalog. Add multiple transforms as an array.

How to create transforms

Transforms allow you to modify what you are harvesting. Click here to see an example of how you can create a custom module to add a transform class.

Example with a transform item

drush dkan:harvest:register --identifier=myHarvestId --extract-uri=http://example.com/data.json  --transform="\\Drupal\\custom_module\\Transform\\CustomTransform"