Site administrators
===================

Use the Django admin to:

-  Add and edit publications

   .. note::

      Once a new publication is added, the :ref:`cli-manageprocess` command will collect its data, unless *Frozen* is checked.

-  Check the status of a job and its tasks

   .. note::

      The :ref:`cli-flattener` task was added in September 2022, so earlier jobs have a *Last completed task* of "exporter (4/5)".

-  Review log entries for other administrators' actions

Publications and jobs can be searched by publication country and publication title.

.. note::

   The search performs only case normalization. For example, "Montreal" will not match "Montréal" (with an accent).

Add a publication
-----------------

Refer to our internal documentation (contains links to internal resources).

Review publications
-------------------

From time to time, use the filters in the right-hand sidebar to:

-  Review publications for out-of-date or missing information:

   -  Non-frozen publications that weren't recently reviewed (*By last reviewed*: More than a year ago)
   -  Publications without all English and Spanish translations (*By untranslated*: Yes)
   -  Publications without licenses (*By data license*: Empty)
   -  Non-frozen publications without quality summaries (*By quality summary [en]*: Empty)
   -  Non-frozen publications without other information (*By incomplete*: Yes), one or more of:

      -  Country flag
      -  Country (en)
      -  Retrieval frequency
      -  Source URL
      -  Language (en)
      -  Description (en)
      -  Data availability (en)

-  Review publications for visibility and processing:

   -  Unpublished publications (*By public*: No)
   -  Frozen publications (*By frozen*: Yes)
   -  Historical publications (*By retrieval frequency*: This dataset is no longer updated by the publisher)

Review jobs
-----------

From time to time, use the filters in the right-hand sidebar to:

-  Check for failed jobs, and :ref:`restart tasks<admin-restart>` as appropriate (*By failed*: Yes)
-  Check for completed jobs whose temporary data has not been deleted (*By temporary data deleted*: No, *By status*: COMPLETED)
-  Check for running jobs that are old (*By status*: RUNNING)

.. _admin-troubleshoot:

Troubleshoot a job
------------------

A job's detail page:

-  Displays the status, result and note (e.g. error messages) for each task, in the *Job tasks* section.

   If a task's result is ``FAILED``, but :func:`~data_registry.process_manager.process` considers the failure to be :class:`temporary`, then the :ref:`cli-manageprocess` command retries the task until it succeeds or fails permanently.

   Read the *Note*, and judge whether the failure is permanent. If so, you can set the job's *Status* to *COMPLETED* to stop the retries. The :ref:`cli-manageprocess` command will then delete the job's temporary data. The next job will be scheduled according to the publication's retrieval status.

   .. attention::

      If you want it scheduled sooner, prioritize #350.

-  Defines and displays metadata (*Context*) from its tasks, in the *Management* section

   Use the metadata to troubleshoot other applications. For example, to check the Scrapy log, replace the hostname and port in the ``scrapy_log`` value with ``collect.data.open-contracting.org``.

.. seealso::

   How to check on progress in:

   -  Kingfisher Process

This project's RabbitMQ management interface is at rabbitmq.data.open-contracting.org.

.. _admin-cancel:

Cancel a job
~~~~~~~~~~~~

A job can stall (always "running"). The only option is to cancel the Scrapyd job and set the job's *Status* to *COMPLETED* using the Django admin.

.. attention::

   To properly implement this feature, see #352.

.. _admin-restart:

Restart a task
~~~~~~~~~~~~~~

You can restart the :ref:`Exporter` and :ref:`Flattener` tasks.

Do this only if the ``data_registry_production_exporter_init`` and ``data_registry_production_flattener_init`` queues are empty in the RabbitMQ management interface.

.. note::

   The Flattener task publishes one message per file. You might receive a Sentry notification about a failed conversion, while other conversions are still enqueued or in-progress.

   The Exporter task publishes one message per job. This task *can* be restarted while the queue is non-empty, as long as another administrator has not restarted it independently.

#. Access the job
#. Set only the *Exporter* and/or *Flattener* task's *Status* to *PLANNED*
#. Click *SAVE*

Any lockfiles are deleted to allow the task to run.

.. attention::

   See #350.

Unblock the Process task
~~~~~~~~~~~~~~~~~~~~~~~~

Bugs can cause a job to get stuck on the Process task. To diagnose and fix a bug, run Kingfisher Process' ``collectionstatus`` command and select the collection's notes, for example:

.. code-block:: sql

   SELECT * FROM collection_note WHERE collection_id = 100;

If the collection is large, you can manually unblock the Process task.

No data collected
^^^^^^^^^^^^^^^^^

.. note::

   This bug is fixed.

The Process task fails with "Collection is empty".

If the ``collectionstatus`` command shows that no collection files were created and that the compiled collection has started but not ended:

.. code-block:: none
   :emphasize-lines: 5-6,10-13

   steps: compile
   data_type: to be determined
   store_end_at: 2001-02-03 04:05:06.979418
   completed_at: 2001-02-03 04:05:07.074971
   expected_files_count: 0
   collection_files: 0
   processing_steps: 0

   Compiled collection
   compilation_started: True
   store_end_at: None
   completed_at: None
   collection_files: 0
   processing_steps: 0
   completable: yes

Then, confirm that the Collect task didn't write files, by checking the crawl's log file in Scrapyd for a message like:

.. code-block:: none

   2001-02-03 04:05:06 [my_spider] INFO: +---------------- DATA DIRECTORY ----------------+
   2001-02-03 04:05:06 [my_spider] INFO: |                                                |
   2001-02-03 04:05:06 [my_spider] INFO: | Something went wrong. No data was downloaded.  |
   2001-02-03 04:05:06 [my_spider] INFO: |                                                |
   2001-02-03 04:05:06 [my_spider] INFO: +------------------------------------------------+

If so, run Kingfisher Process' ``closecollection`` command using the ID of the **original** collection, to allow the task to finish.

Processing step remaining
^^^^^^^^^^^^^^^^^^^^^^^^^

.. note::

   This bug is fixed. It was diagnosed by observing one remaining load step and a note like:

   .. code-block:: none

      Empty format 'empty package' for file /data/my_spider/20010203_040506/E76/my_file.json (id: 55555).

   The fix was to delete load steps for empty packages.

If the output looks like:

.. code-block:: none
   :emphasize-lines: 4,7,9,11,15-17,20

   steps: compile
   data_type: release package
   store_end_at: 2001-02-03 04:05:06.979418
   completed_at: None
   expected_files_count: 654321
   collection_files: 654321
   processing_steps: 1
   2001-02-03 04:05:07,074 DEBUG [process.management.commands.compiler:120] Collection my_spider:2001-02-03 04:05:06 (id: 100) not compilable (load steps remaining)
   compilable: no (or not yet)
   2001-02-03 04:05:07,074 DEBUG [process.management.commands.finisher:130] Collection my_spider:2001-02-03 04:05:06 (id: 100) not completable (steps remaining)
   completable: no (or not yet)

   Compiled collection
   compilation_started: False
   store_end_at: None
   completed_at: None
   collection_files: 0
   processing_steps: 0
   2024-07-04 14:45:01,718 DEBUG [process.management.commands.finisher:114] Collection my_spider:2001-02-03 04:05:06 (id: 101) not completable (compile steps not created)
   completable: no (or not yet)

Then, confirm that the messages corresponding to the remaining processing steps have already been consumed by the ``file_worker`` worker, by checking RabbitMQ's management interface.

If so, select the remaining load steps for the original collection, for example:

.. code-block:: sql

   SELECT collection_file_id FROM processing_step WHERE name = 'LOAD' AND collection_id = 100;

.. code-block:: none

    collection_file_id
   --------------------
                 55555
   (1 row)

And, re-publish the messages, using the Django ``shell`` command, for example:

.. code-block:: python

   from process.util import get_publisher

   with get_publisher() as client:
       message = {"collection_id": 100, "collection_file_id": 55555}
       client.publish(message, routing_key="api_loader")

Freeze or unpublish a publication
---------------------------------

A publication is frozen if the source is temporarily broken or otherwise unavailable. Unfreeze the publication when the source is fixed.

A publication is unpublished if there are security concerns (like Afghanistan), if it duplicates another publication, or if it was added in error.

Only *delete* a publication if it is a duplicate or if it was otherwise created in error.

.. note::

   If the publication is no longer updated, or the spider is removed from Kingfisher Collect, set the retrieval frequency to ``NEVER``, instead of freezing the publication.

.. tip::

   To audit whether publications ought to be frozen, run ``scrapy checkall`` from Kingfisher Collect.

#. Find the publication
#. If freezing: Check *Frozen*, to stop jobs from being scheduled
#. If unpublishing: Uncheck *Public*, to hide the publication
#. Click *Save* at the bottom of the page

Add an administrator
--------------------

#. Click *Add* next to *Users* in the left-hand menu
#. Fill in *Username* and *Password*, using a strong password
#. Click *Save and continue editing*

On the next form:

#. Fill in *First name*, *Last name* and *Email address*
#. Check *Staff status* (only James and Yohanna should have *Superuser status*)
#. Assign *Groups* (multiple can be selected, as they have non-overlapping permissions)

   Viewer
     Can view publications, licenses, jobs and job tasks
   Contributor
     Can add/change publications and licenses

#. Click *SAVE*
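
When adding an administrator, the *Password* field calls for a strong password. One possible way to generate one is Python's standard ``secrets`` module. The sketch below is only an illustration: the ``generate_password`` helper, the 20-character default length and the character set are assumptions, not a project policy.

```python
import secrets
import string

# Assumption: letters, digits and punctuation are all acceptable in the password.
ALPHABET = string.ascii_letters + string.digits + string.punctuation


def generate_password(length: int = 20) -> str:
    """Return a random password containing a lowercase letter, an uppercase letter and a digit."""
    if length < 3:
        raise ValueError("length must be at least 3")
    while True:
        # secrets.choice draws from a cryptographically secure random source.
        password = "".join(secrets.choice(ALPHABET) for _ in range(length))
        if (
            any(c.islower() for c in password)
            and any(c.isupper() for c in password)
            and any(c.isdigit() for c in password)
        ):
            return password


if __name__ == "__main__":
    print(generate_password())
```

Paste the printed value into the *Password* field, and share it with the new administrator over a secure channel.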