transcripts_count #6

@henryzhang87

There is a discrepancy between the transcripts_count reported for a resource and the number of transcript files the API actually returns.

The collection id used for this test is 1797, and the resource_id is 62203.

Running the command below:

 python get_collection_resources.py 1797

The output for this resource, shown below, reports transcripts_count = 2:

    {
        "resource_id": 62203,
        "title": "title1$ test - DO NOT DELETE",
        "media_file_id": [
            143091,
            143092
        ],
        "media_files_count": 2,
        "transcripts_count": 2,
        "indexes_count": 5,
        "persistent_url": "https://ualberta.aviaryplatform.com/r/h41jh3dw0c",
        "direct_url": "https://ualberta.aviaryplatform.com/collections/1797/collection_resources/62203",
        "created_at": "2022-01-12 03:03:06 UTC",
        "updated_at": "2025-04-02 21:51:30 UTC"
    },

However, querying the API for this resource's transcripts returns only one:

 python get_transcript_files.py 62203

{
    "data": {
        "id": 62203,
        "resource_file_id": 55344,
        "is_caption": false,
        "is_public": false,
        "title": "trint_mssa_hvt_1851_p1of2_transcript.vtt",
        "language": "en",
        "description": null,
        "is_downloadable": "No",
        "export": {
            "webvtt": {
                "file": "https://ualberta.aviaryplatform.com/api/v1/transcripts/62203/export/webvtt",
                "file_name": "trint_mssa_hvt_1851_p1of2_transcript.vtt",
                "file_content_type": "text/vtt"
            },
            "txt": {
                "file": "https://ualberta.aviaryplatform.com/api/v1/transcripts/62203/export/txt",
                "file_name": "trint_mssa_hvt_1851_p1of2_transcript.txt",
                "file_content_type": "text/plain"
            },
            "json": {
                "file": "https://ualberta.aviaryplatform.com/api/v1/transcripts/62203/export/json",
                "file_name": "trint_mssa_hvt_1851_p1of2_transcript.json",
                "file_content_type": "text/json"
            }
        }
    },
    "success": true
}
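
The comparison described above can be sketched as a small check over the two outputs once they are parsed. This is only an illustration: `find_count_mismatches` and `transcripts_by_resource` are hypothetical names, not part of the existing scripts, and the sketch works on already-loaded JSON rather than calling the API.

```python
def find_count_mismatches(resources, transcripts_by_resource):
    """Return (resource_id, reported, found) for each resource whose counts differ.

    resources: list of records shaped like the get_collection_resources.py output.
    transcripts_by_resource: hypothetical mapping of resource_id to the list of
    transcript records actually retrieved for that resource.
    """
    mismatches = []
    for resource in resources:
        resource_id = resource["resource_id"]
        reported = resource.get("transcripts_count", 0)
        found = len(transcripts_by_resource.get(resource_id, []))
        if reported != found:
            mismatches.append((resource_id, reported, found))
    return mismatches


# Values taken from the outputs in this issue (other fields omitted).
resources = [{"resource_id": 62203, "transcripts_count": 2}]
transcripts_by_resource = {
    62203: [{"id": 62203, "title": "trint_mssa_hvt_1851_p1of2_transcript.vtt"}]
}

print(find_count_mismatches(resources, transcripts_by_resource))
# prints [(62203, 2, 1)]
```

Running a check like this across a whole collection would show whether the mismatch is specific to resource 62203 or more widespread.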

Clarification is needed on two points:
1) Why transcripts_count (2) differs from the number of transcripts the API returns (1).
2) For preservation, the transcript is available in several formats (json, txt, and webvtt). Should we preserve all of them or choose just one?
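
Whichever answer is chosen for the second question, the `export` block in the response above already lists every format with its URL and file name, so the choice could be a simple filter. A minimal sketch, assuming the response shape shown in this issue; `list_export_files` and its `wanted` parameter are hypothetical, and nothing is downloaded here.

```python
def list_export_files(transcript_response, wanted=None):
    """Return (format, url, file_name) tuples from a transcript response.

    wanted: optional collection of format keys to keep (e.g. {"webvtt"});
    None keeps every format present in the response.
    """
    exports = transcript_response.get("data", {}).get("export", {})
    return [
        (fmt, info["file"], info["file_name"])
        for fmt, info in exports.items()
        if wanted is None or fmt in wanted
    ]


# Trimmed copy of the response shown in this issue.
base = "https://ualberta.aviaryplatform.com/api/v1/transcripts/62203/export"
response = {
    "data": {
        "id": 62203,
        "export": {
            "webvtt": {"file": base + "/webvtt",
                       "file_name": "trint_mssa_hvt_1851_p1of2_transcript.vtt"},
            "txt": {"file": base + "/txt",
                    "file_name": "trint_mssa_hvt_1851_p1of2_transcript.txt"},
            "json": {"file": base + "/json",
                     "file_name": "trint_mssa_hvt_1851_p1of2_transcript.json"},
        },
    },
    "success": True,
}

for fmt, url, name in list_export_files(response, wanted={"webvtt"}):
    print(fmt, url, name)
```

Passing `wanted=None` would cover the "preserve all formats" option without changing anything else.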
