
Aleph

Aleph is a tool for indexing large amounts of both documents (PDF, Word, HTML) and structured (CSV, XLS, SQL) data for easy browsing and search. It is built with investigative reporting as a primary use case. Aleph allows cross-referencing mentions of well-known entities (such as people and companies) against watchlists, e.g. from prior research or public datasets.

Usage

API Usage

The Aleph web interface is powered by a Flask HTTP API. Aleph supports an extensive API for searching documents and entities. It can also be used to retrieve raw metadata, source documents and other useful details. Aleph's API tries to follow a pragmatic approach based on the following principles:

  • All API calls are prefixed with an API version; this version is /api/2/.
  • Responses and requests are both encoded as JSON. Requests should have the Content-Type and Accept headers set to application/json.
  • The application uses Representational State Transfer (REST) principles where convenient, but also has some procedural API calls.
  • The API allows authorization via an API key or JSON Web Tokens.
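
As a minimal sketch of these conventions, the snippet below queries the /api/2/entities search endpoint (the same endpoint used by the script further down) without authentication, which returns only public results; the base URL and query are placeholders.

import requests

# Placeholder instance URL; replace with your Aleph deployment.
BASE_URL = "https://your.aleph.org"

# All endpoints live under the /api/2/ prefix and exchange JSON.
response = requests.get(
    f"{BASE_URL}/api/2/entities",
    headers={"Accept": "application/json", "Content-Type": "application/json"},
    params={"q": "john doe"},
)
response.raise_for_status()
print(response.json()["results"])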

Authentication and authorization

By default, any Aleph search will return only public documents in responses to API requests.

If you want to access documents which are not marked public, you will need to sign into the tool. This can be done through the use of an API key. The API key for any account can be found by clicking on the "Settings" menu item in the navigation menu.

The API key must be sent on all queries using the Authorization HTTP header:

Authorization: ApiKey 363af1e2b03b41c6b3adc604956e2f66

Alternatively, the API key can also be sent as a query parameter under the api_key key.

Similarly, a JWT can be sent in the Authorization header, after it has been returned by the login and/or OAuth processes. Aleph does not use session cookies or any other type of stateful API.
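
As a sketch, both authentication variants look like this with the requests library; the key is the placeholder value from the example above.

import requests

API_KEY = "363af1e2b03b41c6b3adc604956e2f66"  # placeholder key
BASE_URL = "https://your.aleph.org"  # placeholder instance URL

# Preferred: send the key in the Authorization header.
headers = {"Authorization": f"ApiKey {API_KEY}", "Accept": "application/json"}
response = requests.get(f"{BASE_URL}/api/2/entities", headers=headers, params={"q": "john doe"})
response.raise_for_status()

# Alternative: send the key as the api_key query parameter.
response = requests.get(f"{BASE_URL}/api/2/entities", params={"q": "john doe", "api_key": API_KEY})
response.raise_for_status()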

Cross-referencing mentions with entities

Mentions are names of people or companies that Aleph automatically extracts from files you upload. Aleph includes mentions when cross-referencing a collection, but only in one direction.

Consider the following example:

  • "Collection A" contains a file. The file mentions "John Doe".
  • "Collection B" contains a Person entity named "John Doe".

If you cross-reference “Collection A”, Aleph includes the mention of “John Doe” in the cross-referencing and will find a match for it in “Collection B”.

However, if you cross-reference “Collection B”, Aleph doesn't consider mentions when trying to find a match for the Person entity.

As long as you only want to compare the mentions in one specific collection against entities (but not mentions) in another collection, Aleph’s cross-ref should be able to do that. If you want to compare entities in a specific collection against other entities and mentions in other collections, you will have to do that yourself.

If you have a limited number of collections, one option might be to fetch all mentions and automatically create entities for each mention using the API.

To fetch a list of mentions for a collection you can use the /api/2/entities?filter:collection_id=137&filter:schemata=Mention API request.
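
A minimal sketch of that request with the requests library, assuming collection ID 137 and using the same next/offset pagination as the script below:

import requests

BASE_URL = "https://your.aleph.org"  # placeholder instance URL
API_KEY = "your-api-key"  # placeholder key

headers = {"Authorization": f"ApiKey {API_KEY}", "Accept": "application/json"}
params = {"filter:collection_id": 137, "filter:schemata": "Mention", "limit": 300}

mentions = []
while True:
    response = requests.get(f"{BASE_URL}/api/2/entities", headers=headers, params=params)
    response.raise_for_status()
    data = response.json()
    mentions.extend(data["results"])
    if not data["next"]:
        break
    params["offset"] = params.get("offset", 0) + data["limit"]

print(f"Fetched {len(mentions)} mentions")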

Get all documents of a collection

list_aleph_collection_documents.py is a Python script that uses the Aleph API to retrieve and analyze documents from a specified collection. It offers a command-line interface (CLI) to list and check the documents within that collection.

Features

  • Retrieve documents from a specified collection.
  • Analyze document processing statuses and warn if any are not marked as successful.
  • Return a list of filenames from the retrieved documents.
  • Support verbose output for detailed logging.
  • Read the API key from an environment variable.

Installation

To install the required dependencies, use pip:

pip install typer requests

Ensure you have Python 3.6 or higher installed.

Create the file list_aleph_collection_documents.py with the following contents:

import logging
from typing import Any, Dict, List, Optional

import requests
import typer


log = logging.getLogger(__name__)
app = typer.Typer()


@app.command()
def get_documents(
    collection_name: str = typer.Argument(...),
    api_key: Optional[str] = typer.Option(None, envvar="API_KEY"),
    base_url: str = typer.Option("https://your.aleph.org"),
    verbose: bool = typer.Option(
        False, "--verbose", "-v", help="Enable verbose output"
    ),
):
    """CLI command to retrieve documents from a specified collection."""
    if verbose:
        logging.basicConfig(level=logging.DEBUG)
        log.debug("Verbose mode enabled.")
    else:
        logging.basicConfig(level=logging.INFO)
    if api_key is None:
        log.error(
            "Please specify your api key either through the --api-key argument "
            "or through the API_KEY environment variable"
        )
        raise typer.Exit(code=1)
    try:
        documents = list_collection_documents(api_key, base_url, collection_name)
        filenames = check_documents(documents)
        if filenames:
            print("\n".join(filenames))
        else:
            log.warning("No documents found.")
    except Exception as e:
        log.error(f"Failed to retrieve documents: {e}")
        raise typer.Exit(code=1)


def list_collection_documents(
    api_key: str, base_url: str, collection_name: str
) -> List[Dict[str, Any]]:
    """
    Retrieve documents from a specified collection using pagination.

    Args:
        api_key (str): The API key for authentication.
        base_url (str): The base URL of the API.
        collection_name (str): The name of the collection to retrieve documents from.

    Returns:
        List[Dict[str, Any]]: A list of documents from the specified collection.

    Example:
        >>> docs = list_collection_documents("your_api_key", "https://api.example.com", "my_collection")
        >>> print(len(docs))
        1000
    """
    headers = {
        "Authorization": f"ApiKey {api_key}",
        "Accept": "application/json",
        "Content-Type": "application/json",
    }

    collections_url = f"{base_url}/api/2/collections"
    documents_url = f"{base_url}/api/2/entities"
    log.debug(f"Requesting collections list from {collections_url}")
    collections = []
    params = {"limit": 300}

    while True:
        response = requests.get(collections_url, headers=headers, params=params)
        response.raise_for_status()
        data = response.json()
        collections.extend(data["results"])
        log.debug(
            f"Fetched {len(data['results'])} collections, "
            f"page {data['page']} of {data['pages']}"
        )
        if not data["next"]:
            break
        params["offset"] = params.get("offset", 0) + data["limit"]

    collection_id = next(
        (c["id"] for c in collections if c["label"] == collection_name), None
    )
    if not collection_id:
        log.error(f"Collection {collection_name} not found.")
        return []

    log.info(f"Found collection '{collection_name}' with ID {collection_id}")

    documents = []
    params = {
        "q": "",
        "filter:collection_id": collection_id,
        "filter:schemata": "Document",
        "limit": 300,
    }

    while True:
        log.debug(f"Requesting documents from collection {collection_id}")
        response = requests.get(documents_url, headers=headers, params=params)
        response.raise_for_status()
        data = response.json()
        documents.extend(data["results"])
        log.info(
            f"Fetched {len(data['results'])} documents, "
            f"page {data['page']} of {data['pages']}"
        )
        if not data["next"]:
            break
        params["offset"] = params.get("offset", 0) + data["limit"]

    log.info(f"Retrieved {len(documents)} documents from collection {collection_name}")

    return documents


def check_documents(documents: List[Dict[str, Any]]) -> List[str]:
    """Analyze the processing status of documents and return a list of filenames.

    Args:
        documents (List[Dict[str, Any]]): A list of documents in JSON format.

    Returns:
        List[str]: A list of filenames from documents with a successful processing status.

    Raises:
        None, but logs warnings if a document's processing status is not 'success'.

    Example:
        >>> docs = [{"properties": {"processingStatus": ["success"], "fileName": ["file1.txt"]}},
        >>>         {"properties": {"processingStatus": ["failed"], "fileName": ["file2.txt"]}}]
        >>> filenames = check_documents(docs)
        >>> print(filenames)
        ['file1.txt']
    """
    filenames = []

    for doc in documents:
        # Default to [None] so documents missing these properties don't raise a TypeError.
        status = doc.get("properties", {}).get("processingStatus", [None])[0]
        filename = doc.get("properties", {}).get("fileName", [None])[0]

        if status != "success":
            log.warning(
                f"Document with filename {filename} has processing status: {status}"
            )

        if filename:
            filenames.append(filename)

    log.debug(f"Collected filenames: {filenames}")
    return filenames


if __name__ == "__main__":
    app()

Get your API key

By default, any Aleph search will return only public documents in responses to API requests.

If you want to access documents which are not marked public, you will need to sign into the tool. This can be done through the use of an API key. The API key for any account can be found by clicking on the "Settings" menu item in the navigation menu.

Usage

Running the Script

You can run the script directly from the command line. Below are examples of usage:

Basic Usage

Retrieve and list documents from a collection:

python list_aleph_collection_documents.py --api-key "your-api-key" 'Name of your collection'

Using an Environment Variable for the API Key

This is better from a security perspective.

export API_KEY=your_api_key
python list_aleph_collection_documents.py 'Name of your collection'

Enabling Verbose Logging

To enable detailed debug logs, use the --verbose or -v flag:

python list_aleph_collection_documents.py -v 'Name of your collection'

Getting help
python list_aleph_collection_documents.py --help

Install the development environment

As a first step, check out the source code of Aleph from GitHub:

git clone https://github.com/alephdata/aleph.git
cd aleph/

Also, please execute the following command to allow ElasticSearch to map its memory:

sysctl -w vm.max_map_count=262144

Then enable the use of pdb by adding the following lines to the docker-compose.dev.yml file, under the api service configuration.

stdin_open: true
tty: true

With the settings in place, you can use make all to set everything up and launch the web service. This is equivalent to the following steps:

  • make build to build the docker images for the application and relevant services.
  • make upgrade to run the latest database migrations and create/update the search index.
  • make web to run the web-based API server and the user interface.
  • In a separate shell, run make worker to start a worker. If you do not start a worker, background jobs (for example ingesting new documents) won’t be processed.

Open http://localhost:8080/ in your browser to visit the web frontend.

  • Create a shell to do the operations with make shell.
  • Create the main user within that shell by running
    aleph createuser --name="demo" \
        --admin \
        --password=demo \
        demo@demo.com
    
  • Load some sample data by running aleph crawldir /aleph/contrib/testdata

Debugging the code

To debug the code, you can create pdb breakpoints in the code you cloned and run the actions that trigger the breakpoint. To be able to act on it, you need to attach to the api container by running:

docker attach aleph_api_1

You don't need to reload the page for the changes to be picked up; they are loaded dynamically.
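
As a sketch, a breakpoint is just a pdb.set_trace() call dropped into whichever module of your checkout you want to inspect; the function below is a hypothetical placeholder.

import pdb


def some_handler(payload):
    # Hypothetical placeholder: put the breakpoint in the real Aleph code path
    # you want to inspect; execution pauses here in the terminal attached with
    # `docker attach aleph_api_1`.
    pdb.set_trace()
    return payload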

Operation

Upgrade Aleph

Aleph does not perform updates and database migrations automatically. Once you have the latest version, you can run the command below to upgrade the existing installation (i.e. apply changes to the database model or the search index format).

The first step is to add a notice in Aleph's banner section informing the users that there's going to be a downtime.

Before you upgrade, check the release notes to make sure you understand the latest release and know about new options and features that have been added.

In production mode, make sure you perform a backup of the main database and the ElasticSearch index before running an upgrade.

Then, make sure you are using the latest docker-compose.yml file. You can do this by checking out the source repo, but really you just need that one file (and your config in aleph.env). There are many docker-compose.yml files; we still need to decide which one to take as the source of truth.

# Pull changes
cd /data/config/aleph
docker-compose pull --parallel

# Stop services
service aleph stop

# Do database migrations
docker-compose up -d redis postgres elasticsearch
# Wait a minute or so while services boot up...
# Run upgrade:
docker-compose run --rm shell aleph upgrade

# Start the services
service aleph start

Create Aleph admins

How you create admins depends on how you create users. In our case, since we're using OAuth, we need to update the database directly (ugly!). Go into the instance you want to change and run:

# Create a terminal inside the aleph environment
docker-compose run --rm shell bash

# Connect to the postgres database
# It will ask you for the password, search it in the docker-compose.yaml file
psql postgresql://aleph@postgres/aleph

# Once there you can see which users do have the Admin rights with:
select * from role where is_admin = 't';

# If you want to make another user admin run the next command:
UPDATE role SET is_admin = true WHERE email = 'admin@site.org';

You may also need to run aleph update afterwards to refresh some cached information.

Remove a group

There is currently no web interface that allows this operation; you need to interact with the database directly.

# Create a terminal inside the aleph environment
docker-compose run --rm shell bash

# Connect to the postgres database
# It will ask you for the password, search it in the docker-compose.yaml file
psql postgresql://aleph@postgres/aleph

# List the available groups
select * from role where type = 'group';

# Delete a group.
# Imagine that the id of the group we want to delete is 18
delete from role where id = 18;

Role permission error

You may encounter the following error:

ERROR:  update or delete on table "role" violates foreign key constraint "permission_role_id_fkey" on table "permission"
DETAIL:  Key (id)=(18) is still referenced from table "permission".

That means the group is still in use. To find who is using it, run:

select * from permission where role_id = 18;

You can check which elements have the permission by looking at the collection_id number; if it's 3, for example, you can check your.url.com/investigations/3.

Once you're sure you can remove that permission, run:

delete from permission where role_id = 18;
delete from role where id = 18;

Role membership error

You may encounter the following error:

ERROR:  update or delete on table "role" violates foreign key constraint "role_membership_group_id_fkey" on table "role_membership"
DETAIL:  Key (id)=(8) is still referenced from table "role_membership".

That means the group is still in use. To find who is using it, run:

select * from role_membership where group_id = 8;

If you agree to remove that user from the group, run:

delete from role_membership where group_id = 8;
delete from role where id = 8;

Troubleshooting

Ingest gets stuck

It looks like Aleph doesn't yet provide an easy way to debug this. The problem is discussed in the following pages:

Some interesting ideas I've extracted while diving into these issues:

  • You can also upload files using the alephclient python command line tool
  • Some of the files might fail to be processed without leaving any hint to the uploader or the viewer.
  • This results in an incomplete dataset and the users don't get to know that the dataset is incomplete. This is problematic if the completeness of the dataset is crucial for an investigation.
  • There is no way to upload only the files that failed to be processed without re-uploading the entire set of documents or manually making a list of the failed documents and re-uploading them
  • There is no way for uploaders or Aleph admins to see an overview of processing errors to figure out why some files are failing to be processed without going through docker logs (which is not very user-friendly)
  • There was an attempt to improve the way ingest-files manages the pending tasks; it's merged into the release/4.0.0 branch, but it has not yet landed in main.

There are some tickets that attempt to address these issues on the command line:

I think it's interesting either to contribute to alephclient to solve those issues or, if that's too complicated, to create a small Python script to detect which files were not uploaded and try to reindex them, and/or to open issues that would prevent future ingests from failing. A sketch of such a script follows.
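
A minimal sketch, assuming the documents were uploaded from a local directory and reusing the /api/2/entities document listing shown above; the instance URL, API key, collection ID and directory are placeholders.

import os

import requests

BASE_URL = "https://your.aleph.org"  # placeholder instance URL
API_KEY = "your-api-key"  # placeholder key
COLLECTION_ID = 137  # placeholder collection id
LOCAL_DIR = "/path/to/uploaded/files"  # placeholder source directory

headers = {"Authorization": f"ApiKey {API_KEY}", "Accept": "application/json"}
params = {
    "filter:collection_id": COLLECTION_ID,
    "filter:schemata": "Document",
    "limit": 300,
}

# Collect the file names Aleph knows about for this collection.
uploaded = set()
while True:
    response = requests.get(f"{BASE_URL}/api/2/entities", headers=headers, params=params)
    response.raise_for_status()
    data = response.json()
    for doc in data["results"]:
        uploaded.update(doc.get("properties", {}).get("fileName", []))
    if not data["next"]:
        break
    params["offset"] = params.get("offset", 0) + data["limit"]

# Compare against the local files and print the ones that never made it in.
for missing in sorted(set(os.listdir(LOCAL_DIR)) - uploaded):
    print(missing)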

Problems accessing redis locally

If you're connected to the VPN, turn it off.

PDB behaves weird

Sometimes you have two traces active at the same time, so each time you run a pdb command it jumps from one pdb trace to the other, which is quite confusing. Try to c (continue) the one you don't want so that you're left with the one you want, or put the pdb trace in a conditional that only matches one of the two threads.

Monitoring

Prometheus metrics

Aleph now exposes Prometheus metrics on port 9100.
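
A quick way to verify the endpoint is reachable, assuming the conventional /metrics path and that port 9100 is exposed where this runs:

import requests

# Assumption: metrics are served on the standard /metrics path.
response = requests.get("http://localhost:9100/metrics", timeout=5)
response.raise_for_status()
print("\n".join(response.text.splitlines()[:10]))  # show the first few metric lines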

Troubleshooting

Debug ingestion errors

Assuming that you've set up Loki to ingest your logs, these are the ingest issues I've encountered so far:

  • Cannot open image data using Pillow: broken data stream when reading image files: The log trace that has this message also contains a field trace_id which identifies the ingestion process. With that trace_id you can get the first log trace with the field logger = "ingestors.manager" which will contain the file path in the message field. Something similar to Ingestor [<E('9972oiwobhwefoiwefjsldkfwefa45cf5cb585dc4f1471','path_to_the_file_to_ingest.pdf')>]
  • A traceback with the following string: Failed to process: Could not extract PDF file: FileDataError('cannot open broken document'). This log trace has the file path in the message field, something similar to [<E('9972oiwobhwefoiwefjsldkfwefa45cf5cb585dc4f1471','path_to_the_file_to_ingest.pdf')>] Failed to process: Could not extract PDF file: FileDataError('cannot open broken document')

I thought of making a Python script to automate extracting the files that triggered an error, but in the end I extracted the file names manually as there weren't many.
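
If it ever becomes worth automating, a minimal sketch could pull the paths out of exported log lines with a regular expression matching the Ingestor message format above; the log file name is a placeholder.

import re

LOG_FILE = "ingest-errors.log"  # placeholder: raw log lines exported from Loki or docker logs

# Matches the second element of the Ingestor tuple, e.g.
# [<E('9972...','path_to_the_file_to_ingest.pdf')>]
pattern = re.compile(r"\[<E\('[^']*','([^']+)'\)>\]")

paths = set()
with open(LOG_FILE) as handle:
    for line in handle:
        match = pattern.search(line)
        if match:
            paths.add(match.group(1))

for path in sorted(paths):
    print(path)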

Once you have the files that triggered the errors, the best way to handle them is to delete them from your investigation and ingest them again.

References