34th Week of 2024

Life Management

Life Chores Management

Grocy

  • New: Add API and python library docs.

    There is no actively maintained Python library, although pygrocy used to exist.

Knowledge Management

  • New: Use ebops to create anki cards.

    • Ask the AI to generate Anki cards based on the content.
    • Save those anki cards in an orgmode (anki.org) document
    • Use ebops add-anki-notes to automatically add them to Anki

Coding

Languages

questionary

  • New: Unit testing questionary code.

    Testing questionary code can be challenging because it involves interactive prompts that expect user input. However, there are ways to automate the testing process. You can use libraries like pexpect, pytest, and pytest-mock to simulate user input and test the behavior of your code.

    One approach is to use pytest-mock to patch questionary functions such as questionary.select().ask(), which simulates the user's choices without any real interaction.

    Testing a single questionary.text prompt

    Let's assume you have a function that asks the user for their name:

    import questionary
    
    def ask_name() -> str:
        name = questionary.text("What's your name?").ask()
        return name
    

    You can test this function by mocking the questionary.text prompt to simulate the user's input.

    import pytest
    from your_module import ask_name
    
    def test_ask_name(mocker):
        # Mock the text function to simulate user input
        mock_text = mocker.patch('questionary.text')
    
        # Define the response for the prompt
        mock_text.return_value.ask.return_value = "Alice"
    
        result = ask_name()
    
        assert result == "Alice"
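
    The same pattern covers questionary.select().ask(), mentioned above. The following is a minimal sketch that uses only unittest.mock from the standard library. ask_color is a hypothetical function, and it receives the prompt library as a parameter (an assumption made so the example runs without questionary installed; with pytest-mock you would patch questionary.select instead).

```python
from unittest.mock import MagicMock


def ask_color(prompts) -> str:
    """Hypothetical function: `prompts` is the questionary module or a stand-in."""
    return prompts.select("Pick a color", choices=["red", "green"]).ask()


def test_ask_color() -> None:
    # MagicMock stands in for the questionary module
    fake_questionary = MagicMock()
    # select(...) returns a question object whose ask() yields the answer
    fake_questionary.select.return_value.ask.return_value = "green"

    result = ask_color(fake_questionary)

    assert result == "green"
    # The prompt was built once, with the expected message and choices
    fake_questionary.select.assert_called_once_with(
        "Pick a color", choices=["red", "green"]
    )
```

    Checking the call arguments as well as the result catches a prompt whose message or choices changed by accident.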
    

    Test a function that has many questions

    Here’s an example of how to test a function that contains two questionary.text prompts using pytest-mock.

    Let's assume you have a function that asks for the first and last names of a user:

    import questionary
    
    def ask_full_name() -> dict:
        first_name = questionary.text("What's your first name?").ask()
        last_name = questionary.text("What's your last name?").ask()
        return {"first_name": first_name, "last_name": last_name}
    

    You can mock both questionary.text calls to simulate user input for both the first and last names:

    import pytest
    from your_module import ask_full_name
    
    def test_ask_full_name(mocker):
        # Mock the text function used by both prompts
        mock_text = mocker.patch('questionary.text')
        # Each call to .ask() returns the next answer in the list
        mock_text.return_value.ask.side_effect = ["Alice", "Smith"]
    
        result = ask_full_name()
    
        assert result == {"first_name": "Alice", "last_name": "Smith"}
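
    When several prompts share one patched function, the answers have to be queued on ask(), not on the patched function itself: every questionary.text(...) call returns the same return_value mock, and it is that mock's ask() that produces the answers. A minimal sketch, with a plain MagicMock standing in for the patched questionary.text (an assumption, so that it runs without pytest-mock or questionary):

```python
from unittest.mock import MagicMock

# Stand-in for what mocker.patch('questionary.text') would return
mock_text = MagicMock()
# Queue one answer per ask() call, in prompt order
mock_text.return_value.ask.side_effect = ["Alice", "Smith"]

first_name = mock_text("What's your first name?").ask()
last_name = mock_text("What's your last name?").ask()

assert (first_name, last_name) == ("Alice", "Smith")
# Both prompts went through the same patched function
assert mock_text.call_count == 2
```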
    

Coding tools

Coding with AI

  • Correction: Update the ai prompts.

    ```yaml
    matches:
      - trigger: :function
        form: |
          Create a function with:
          - type hints
          - docstrings for all classes, functions and methods
          - docstring using google style with line length less than 89 characters
          - adding logging traces using the log variable log = logging.getLogger(name)
          - Use fstrings instead of %s
          - If you need to open or write a file always set the encoding to utf8
          - If possible add an example in the docstring
          - Just give the code, don't explain anything

          Called [[name]] that:
          [[text]]
        form_fields:
          text:
            multiline: true

      - trigger: :class
        form: |
          Create a class with:
          - type hints
          - docstring using google style with line length less than 89 characters
          - use docstrings on the class and each methods
          - adding logging traces using the log variable log = logging.getLogger(name)
          - Use fstrings instead of %s
          - If you need to open or write a file always set the encoding to utf8
          - If possible add an example in the docstring
          - Just give the code, don't explain anything

          Called [[name]] that:
          [[text]]
        form_fields:
          text:
      - trigger: :class
        form: |
          ...
          - Use paragraphs to separate the AAA blocks and don't add comments like # Arrange or # Act or # Act/Assert or # Assert. So the test will only have black lines between sections
          - In the Act section if the function to test returns a value always name that variable result. If the function to test doesn't return any value append an # act comment at the end of the line.
          - If the test uses a pytest.raises there is no need to add the # act comment
          - Don't use mocks
          - Use fstrings instead of %s
          - Gather all tests over the same function on a common class
          - If you need to open or write a file always set the encoding to utf8
          - Just give the code, don't explain anything

        form_fields:
          text:
      - trigger: :polish
        form: |
          ...
          - Add or update the docstring using google style on all classes, functions and methods
          - Wrap the docstring lines so they are smaller than 89 characters
          - All docstrings must start in the same line as the """
          - Add logging traces using the log variable log = logging.getLogger(name)
          - Use f-strings instead of %s
          - Just give the code, don't explain anything
        form_fields:
          code:
            multiline: true
      - trigger: :text
        form: |
          Polish the next text by:
          - Summarising each section without losing relevant data
          - Tweak the markdown format
          - Improve the wording

          [[text]]
        form_fields:
          text:
            multiline: true

      - trigger: :readme
        form: |
          Create the README.md taking into account:
          - Use GPLv3 for the license
          - Add Lyz as the author
          - Add an installation section
          - Add an usage section

          of:
          [[text]]

        form_fields:
          text:
            multiline: true
    ```

Aleph

  • New: Get all documents of a collection.

    list_aleph_collection_documents.py is a Python script designed to interact with an API to retrieve and analyze documents from specified collections. It offers a command-line interface (CLI) to list and check documents within a specified collection.

    Features

    • Retrieve documents from a specified collection.
    • Analyze document processing statuses and warn if any are not marked as successful.
    • Return a list of filenames from the retrieved documents.
    • Supports verbose output for detailed logging.
    • Environment variable support for API key management.

    Installation

    To install the required dependencies, use pip:

    pip install typer requests
    

    Ensure you have Python 3.6 or higher installed.

    Create the file list_aleph_collection_documents.py with the following contents:

    import logging
    from typing import Any, Dict, List, Optional
    
    import requests
    import typer
    
    log = logging.getLogger(__name__)
    app = typer.Typer()
    
    @app.command()
    def get_documents(
        collection_name: str = typer.Argument(...),
        api_key: Optional[str] = typer.Option(None, envvar="API_KEY"),
        base_url: str = typer.Option("https://your.aleph.org"),
        verbose: bool = typer.Option(
            False, "--verbose", "-v", help="Enable verbose output"
        ),
    ):
        """CLI command to retrieve documents from a specified collection."""
        if verbose:
            logging.basicConfig(level=logging.DEBUG)
            log.debug("Verbose mode enabled.")
        else:
            logging.basicConfig(level=logging.INFO)
        if api_key is None:
            log.error(
                "Please specify your api key either through the --api-key argument "
                "or through the API_KEY environment variable"
            )
            raise typer.Exit(code=1)
        try:
            documents = list_collection_documents(api_key, base_url, collection_name)
            filenames = check_documents(documents)
            if filenames:
                print("\n".join(filenames))
            else:
                log.warning("No documents found.")
        except Exception as e:
            log.error(f"Failed to retrieve documents: {e}")
            raise typer.Exit(code=1)
    
    def list_collection_documents(
        api_key: str, base_url: str, collection_name: str
    ) -> List[Dict[str, Any]]:
        """
        Retrieve documents from a specified collection using pagination.
    
        Args:
            api_key (str): The API key for authentication.
            base_url (str): The base URL of the API.
            collection_name (str): The name of the collection to retrieve documents from.
    
        Returns:
            List[Dict[str, Any]]: A list of documents from the specified collection.
    
        Example:
            >>> docs = list_collection_documents("your_api_key", "https://api.example.com", "my_collection")
            >>> print(len(docs))
            1000
        """
        headers = {
            "Authorization": f"ApiKey {api_key}",
            "Accept": "application/json",
            "Content-Type": "application/json",
        }
    
        collections_url = f"{base_url}/api/2/collections"
        documents_url = f"{base_url}/api/2/entities"
        log.debug(f"Requesting collections list from {collections_url}")
        collections = []
        params = {"limit": 300}
    
        while True:
            response = requests.get(collections_url, headers=headers, params=params)
            response.raise_for_status()
            data = response.json()
            collections.extend(data["results"])
            log.debug(
                f"Fetched {len(data['results'])} collections, "
                f"page {data['page']} of {data['pages']}"
            )
            if not data["next"]:
                break
            params["offset"] = params.get("offset", 0) + data["limit"]
    
        collection_id = next(
            (c["id"] for c in collections if c["label"] == collection_name), None
        )
        if not collection_id:
            log.error(f"Collection {collection_name} not found.")
            return []
    
        log.info(f"Found collection '{collection_name}' with ID {collection_id}")
    
        documents = []
        params = {
            "q": "",
            "filter:collection_id": collection_id,
            "filter:schemata": "Document",
            "limit": 300,
        }
    
        while True:
            log.debug(f"Requesting documents from collection {collection_id}")
            response = requests.get(documents_url, headers=headers, params=params)
            response.raise_for_status()
            data = response.json()
            documents.extend(data["results"])
            log.info(
                f"Fetched {len(data['results'])} documents, "
                f"page {data['page']} of {data['pages']}"
            )
            if not data["next"]:
                break
            params["offset"] = params.get("offset", 0) + data["limit"]
    
        log.info(f"Retrieved {len(documents)} documents from collection {collection_name}")
    
        return documents
    
    def check_documents(documents: List[Dict[str, Any]]) -> List[str]:
        """Analyze the processing status of documents and return a list of filenames.
    
        Args:
            documents (List[Dict[str, Any]]): A list of documents in JSON format.
    
        Returns:
            List[str]: The filenames of all the documents that have one, whatever
                their processing status.
    
        Note:
            Logs a warning for each document whose processing status is not 'success'.
    
        Example:
            >>> docs = [{"properties": {"processingStatus": ["success"], "fileName": ["file1.txt"]}},
            ...         {"properties": {"processingStatus": ["failed"], "fileName": ["file2.txt"]}}]
            >>> filenames = check_documents(docs)
            >>> print(filenames)
            ['file1.txt', 'file2.txt']
        """
        filenames = []
    
        for doc in documents:
            properties = doc.get("properties", {})
            # Property values are lists; fall back to [None] when one is missing or empty
            status = (properties.get("processingStatus") or [None])[0]
            filename = (properties.get("fileName") or [None])[0]
    
            if status != "success":
                log.warning(
                    f"Document with filename {filename} has processing status: {status}"
                )
    
            if filename:
                filenames.append(filename)
    
        log.debug(f"Collected filenames: {filenames}")
        return filenames
    
    if __name__ == "__main__":
        app()
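
    Both while loops in list_collection_documents follow the same offset-based pagination. A minimal sketch of that pattern as a reusable generator, assuming each page is a dict with the results, next and limit keys that the script reads. paginate and fake_fetch are hypothetical names, and fake_fetch replaces the HTTP call so the example runs offline:

```python
from typing import Any, Callable, Dict, Iterator

Page = Dict[str, Any]


def paginate(
    fetch_page: Callable[[Dict[str, Any]], Page], params: Dict[str, Any]
) -> Iterator[Dict[str, Any]]:
    """Yield every result, advancing the offset until there is no next page."""
    params = dict(params)  # do not mutate the caller's dict
    while True:
        data = fetch_page(params)
        yield from data["results"]
        if not data["next"]:
            break
        params["offset"] = params.get("offset", 0) + data["limit"]


def fake_fetch(params: Dict[str, Any]) -> Page:
    """Hypothetical stand-in for requests.get(...).json() over five items."""
    items = [{"id": i} for i in range(5)]
    offset = params.get("offset", 0)
    limit = params["limit"]
    has_next = offset + limit < len(items)
    return {
        "results": items[offset : offset + limit],
        "limit": limit,
        "next": "more" if has_next else None,
    }


documents = list(paginate(fake_fetch, {"limit": 2}))
assert [doc["id"] for doc in documents] == [0, 1, 2, 3, 4]
```

    In the script, fetch_page could be a small function that wraps requests.get with the headers and URL, which would leave a single copy of the loop to maintain.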
    

    Get your API key

    By default, any Aleph search will return only public documents in responses to API requests.

    If you want to access documents which are not marked public, you will need to sign into the tool. This can be done with an API key. The API key for any account can be found by clicking on the "Settings" menu item in the navigation menu.

    Usage

    You can run the script directly from the command line. Below are examples of usage:

    Retrieve and list documents from a collection:

    python list_aleph_collection_documents.py --api-key "your-api-key" 'Name of your collection'
    

    Using an Environment Variable for the API Key

    This is better from a security perspective.

    export API_KEY=your_api_key
    python list_aleph_collection_documents.py 'Name of your collection'
    

    Enabling Verbose Logging

    To enable detailed debug logs, use the --verbose or -v flag:

    python list_aleph_collection_documents.py -v 'Name of your collection'
    

    Getting help

    python list_aleph_collection_documents.py --help
    

OCR

Camelot

  • New: Introduce Camelot.

    Camelot is a Python library that can help you extract tables from PDFs.

    import camelot
    
    tables = camelot.read_pdf('foo.pdf')
    
    tables
    # <TableList n=1>
    
    tables.export('foo.csv', f='csv', compress=True) # json, excel, html, markdown, sqlite
    
    tables[0]
    # <Table shape=(7, 7)>
    
    tables[0].parsing_report
    # {
    #     'accuracy': 99.02,
    #     'whitespace': 12.24,
    #     'order': 1,
    #     'page': 1
    # }
    
    tables[0].to_csv('foo.csv') # to_json, to_excel, to_html, to_markdown, to_sqlite
    
    tables[0].df # get a pandas DataFrame!
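
    The parsing_report can be used to spot tables that were probably extracted badly. A minimal sketch that works on plain dicts shaped like the report above, so it runs without Camelot or a PDF. low_accuracy_reports is a hypothetical helper, the 90.0 threshold is an arbitrary assumption, and the second report holds made-up values:

```python
from typing import Any, Dict, List


def low_accuracy_reports(
    reports: List[Dict[str, Any]], threshold: float = 90.0
) -> List[Dict[str, Any]]:
    """Return the reports whose accuracy falls below the threshold."""
    return [report for report in reports if report["accuracy"] < threshold]


reports = [
    {"accuracy": 99.02, "whitespace": 12.24, "order": 1, "page": 1},
    {"accuracy": 71.5, "whitespace": 40.0, "order": 1, "page": 2},  # made-up values
]

flagged = low_accuracy_reports(reports)
assert [report["page"] for report in flagged] == [2]
```

    With real data, the list could be built from the parsing_report of each table in tables before deciding which tables to review by hand.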
    

    Installation

    To install Camelot from PyPI using pip, include the base extra requirement as shown:

    $ pip install "camelot-py[base]"

    Lattice mode requires Ghostscript. That is still preferable to tabula-py, which needs Java to be installed.

    Usage

    Process background lines

    To detect line segments, Lattice needs the lines that make the table to be in the foreground. To process background lines, you can pass process_background=True.

    tables = camelot.read_pdf('background_lines.pdf', process_background=True)

    tables[1].df

    References - Docs

DevOps

Continuous Integration

Bandit

  • New: Solving warning B603: subprocess_without_shell_equals_true.

    Bandit raises B603: subprocess_without_shell_equals_true when the subprocess module is called without shell=True. Such a call is not vulnerable to shell injection, but Bandit still flags it so that you check whether the command or its arguments include untrusted input. Calls that do set shell=True are riskier, because user-supplied input opens the door to shell injection attacks.

    To fix it:

    1. Avoid shell=True if possible: Instead, pass the command and its arguments as a list to subprocess.Popen (or subprocess.run, subprocess.call, etc.). This way, the command is executed directly without invoking the shell, reducing the risk of injection attacks.

    Here's an example:

    import subprocess
    
    # Instead of this:
    # subprocess.Popen("ls -l", shell=True)
    
    # Do this:
    subprocess.Popen(["ls", "-l"])
    
    2. When you must use shell=True: If you absolutely need shell=True (for example, because you are running a complex shell command or using shell features like wildcards), make sure the command is either hardcoded or sanitized to avoid security risks.

    Example with shell=True:

    import subprocess
    
    # Command is hardcoded and safe
    command = "ls -l | grep py"
    subprocess.Popen(command, shell=True)
    

    If the command includes user input, quote it with shlex.quote before interpolating it:

    import shlex
    import subprocess
    
    user_input = "some_directory"
    command = f"ls -l {shlex.quote(user_input)}"
    subprocess.Popen(command, shell=True)
    

    Note: Even with precautions, using shell=True is risky with user input, so avoid it if possible.

    3. Explicitly tell Bandit you have considered the risk: If you have reviewed the code and are confident that it is safe in your particular case, you can mark the line with a # nosec comment to tell Bandit to ignore the issue:

    import subprocess
    
    command = "ls -l | grep py"
    subprocess.Popen(command, shell=True)  # nosec
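
    The first two options can be sketched as two small builders that only construct the command, so they can be tested without running anything. build_ls_args and build_ls_shell_command are hypothetical helpers, and the -- separator is an extra precaution (an assumption, not something Bandit asks for) that stops a value starting with - from being read as an option:

```python
import shlex
from typing import List


def build_ls_args(directory: str) -> List[str]:
    """Argument list for subprocess.run without a shell."""
    # "--" marks the end of options, so "-rf" is treated as a path
    return ["ls", "-l", "--", directory]


def build_ls_shell_command(directory: str) -> str:
    """Command string for shell=True, with the user input quoted."""
    # shlex.quote wraps the value so the shell sees a single literal word
    return f"ls -l -- {shlex.quote(directory)} | grep py"


assert build_ls_args("-rf") == ["ls", "-l", "--", "-rf"]
assert (
    build_ls_shell_command("docs; rm -rf /")
    == "ls -l -- 'docs; rm -rf /' | grep py"
)
```

    The first builder's result would go to subprocess.run(args), and the second to subprocess.run(command, shell=True). shlex.quote targets POSIX shells, so this sketch assumes a Unix-like system.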
    

Operating Systems

Linux

Linux Snippets

  • Correction: Docker prune without removing the manual networks.

    If you run docker system prune alongside watchtower and manually defined networks, the prune may run just when the containers are stopped and therefore remove the networks, which prevents the containers from starting. In those cases you can either make sure that docker system prune never runs while watchtower is doing the updates, or split the command into the following script:

    date
    echo "Pruning the containers"
    docker container prune -f --filter "label!=prune=false"
    echo "Pruning the images"
    docker image prune -f --filter "label!=prune=false"
    echo "Pruning the volumes"
    docker volume prune -f