Package: Tokenization (src.tokenization)

Contained Subpackages

The Tokenization package also contains subpackages that implement specialized functionality in support of its main operations.

Modules: Tokenization (src.tokenization)

Module: tokenization.main.py

This module provides functions for tokenizing MIDI files in a selected directory. These files are converted into token sequences using one or more tokenizers and saved as CSV files in a subdirectory of the original directory.

Functions

  • tokenization_tokenize_folder_midi_files: Tokenizes all MIDI files in a user-selected directory using selected tokenizers.

  • select_miditok_tokenizer: Presents a user interface for tokenizer selection.

  • select_folder_midi_files: Presents a user interface for folder selection.

  • tokenize_midi_file: Tokenizes a MIDI file with a specified tokenizer class and saves the tokenized sequence to a CSV file.

  • extract_tokens_from_token_string_within_list: Extracts tokens from a string representation of a list using regex.

  • extract_tokens_from_token_string_within_nested_list: Extracts tokens from a string representation of a nested list using regex and abstract syntax trees (AST).

Notes

Functions in this module can be used directly or through the CLI provided by the cli.cli_menu_structure module. Refer to the individual function docstrings for more detailed descriptions and usage examples.

iMaT.src.tokenization.main.extract_tokens_from_token_string_within_list(tokens_string)[source]

Extracts tokens from a string representation of a list using regular expressions.

This function is designed to process a string that represents a list of tokens. It utilizes regular expressions to accurately identify and extract all tokens within the string. The extracted tokens are returned as a list of strings.

Parameters:

tokens_string (str) – String representation of a list of tokens. Expected to follow the pattern ‘= [<tokens>]’.

Returns:

List of extracted tokens as strings.

Return type:

list

Raises:

Exception – If the provided tokens string is not formatted correctly or an error occurs during extraction.

See also

tokenize_midi_file

Function to tokenize MIDI files.
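The regex-based extraction described for extract_tokens_from_token_string_within_list could look roughly like the sketch below. It is not the module's implementation: the helper name extract_flat_tokens, the assumption that tokens are quoted strings, and the sample input are all made up for illustration.

```python
import re


def extract_flat_tokens(tokens_string: str) -> list[str]:
    """Hypothetical helper: pull quoted tokens out of a '= [<tokens>]' span."""
    # Locate the first bracketed list that directly follows an equals sign.
    match = re.search(r"=\s*\[(.*?)\]", tokens_string, flags=re.DOTALL)
    if match is None:
        raise ValueError("tokens string does not match the pattern '= [<tokens>]'")
    # Assumption: every token is wrapped in single or double quotes.
    return re.findall(r"""['"]([^'"]+)['"]""", match.group(1))


# Made-up sample input that follows the documented pattern.
example = "Sequence(tokens=['Bar_None', 'Position_0', 'Pitch_60'], ids=[4, 5, 6])"
flat_tokens = extract_flat_tokens(example)
print(flat_tokens)
```

The non-greedy match stops at the first closing bracket, so this only suits flat lists; nested lists need the approach described for the next function.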

iMaT.src.tokenization.main.extract_tokens_from_token_string_within_nested_list(tokens_string)[source]

Extracts tokens from a string representation of a nested list using regular expressions and abstract syntax trees (AST).

This function is designed to handle a complex token string that represents a nested list of tokens. It uses a combination of regular expressions and abstract syntax trees to extract the tokens accurately from the string. The result is returned as a list of lists, where each sublist contains the tokens extracted from one of the nested lists in the string.

Parameters:

tokens_string (str) – String representation of a nested list of tokens. Expected to follow the pattern ‘= [<nested lists>], ids=’.

Returns:

List of lists, with each sublist containing extracted tokens as strings.

Return type:

list

Raises:

Exception – If the provided tokens string is not formatted correctly or an error occurs during extraction.

See also

tokenize_midi_file

Function to tokenize MIDI files.
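A minimal sketch of the regex-plus-AST approach described for extract_tokens_from_token_string_within_nested_list, under stated assumptions: the helper name extract_nested_tokens and the sample string are hypothetical, and the actual function may isolate and parse the list differently. The regex isolates the text between ‘= ’ and ‘, ids=’, and ast.literal_eval turns that text into Python lists without executing arbitrary code.

```python
import ast
import re


def extract_nested_tokens(tokens_string: str) -> list[list[str]]:
    """Hypothetical helper: parse a '= [<nested lists>], ids=' span."""
    # Greedy match keeps the inner brackets intact up to the ', ids=' marker.
    match = re.search(r"=\s*(\[.*\]),\s*ids=", tokens_string, flags=re.DOTALL)
    if match is None:
        raise ValueError(
            "tokens string does not match the pattern '= [<nested lists>], ids='"
        )
    # literal_eval only accepts Python literals, so it is safe for untrusted text.
    parsed = ast.literal_eval(match.group(1))
    return [[str(token) for token in sublist] for sublist in parsed]


# Made-up sample input that follows the documented pattern.
example = (
    "Sequence(tokens=[['Pitch_60', 'Velocity_90'], ['Pitch_64', 'Velocity_80']], "
    "ids=[[1, 2], [3, 4]])"
)
nested_tokens = extract_nested_tokens(example)
print(nested_tokens)
```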

iMaT.src.tokenization.main.select_folder_midi_files()[source]

Interacts with the user via a user interface for folder selection.

This function prompts the user to input the path to the desired folder. It validates the input and ensures the folder contains parsable music files. If not, it prompts the user to input another path.

Parameters:

None

Returns:

A string containing the path to the selected folder, or None if the user cancelled the dialog or didn’t select a suitable folder.

Return type:

str or None

Raises:

Exception – If an error occurs while processing the user’s selection.

See also

get_tokenizable_files_in_folder

Function to get tokenizable files in the selected folder.

display_menu_print_textblock

Function to display menu.

iMaT.src.tokenization.main.select_miditok_tokenizer()[source]

Interacts with the user via a user interface for tokenizer selection.

This function prompts the user to select one or more tokenizers from a predefined list. Depending on the user’s choice, it returns the selected tokenizers.

Parameters:

None

Returns:

A list containing the selected tokenizer class(es).

Return type:

list

Raises:

Exception – If an error occurs while processing the user’s selection.

See also

display_menu_request_selection

Function to display menu and request user selection.

iMaT.src.tokenization.main.tokenization_tokenize_folder_midi_files()[source]

Tokenizes all MIDI files in a user-selected directory using the selected tokenizers.

This function prompts the user to select a directory and one or more tokenizers. After selection, the function tokenizes all MIDI files in the directory using the chosen tokenizers. The tokenized sequences are saved as CSV files in a subdirectory of the original directory named “tokenized_YYYYMMDD_HHMMSS”. Additionally, any tokenization results and errors are logged in an Excel file named “tokenization_log.xlsx” in the same subdirectory.

Parameters:

None

Return type:

None

See also

select_folder_midi_files

Function to interact with user for selecting a folder.

display_tokenizable_files_in_folder

Function to display tokenizable files.

select_miditok_tokenizer

Function to interact with user for selecting tokenizer(s).
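The output locations described for tokenization_tokenize_folder_midi_files can be derived as in the sketch below. The helper name build_output_locations is hypothetical; only the folder pattern “tokenized_YYYYMMDD_HHMMSS” and the log file name “tokenization_log.xlsx” come from the description above. The sketch computes paths only and does not create anything on disk.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional, Tuple


def build_output_locations(
    source_folder: str, now: Optional[datetime] = None
) -> Tuple[Path, Path]:
    """Hypothetical helper: return (output folder, log file) for one run."""
    now = now or datetime.now()
    # Folder name follows the documented pattern tokenized_YYYYMMDD_HHMMSS.
    output_folder = Path(source_folder) / f"tokenized_{now:%Y%m%d_%H%M%S}"
    # The log file lives in the same subdirectory as the CSV files.
    log_file = output_folder / "tokenization_log.xlsx"
    return output_folder, log_file


folder, log_file = build_output_locations("midi_files", datetime(2024, 1, 2, 3, 4, 5))
print(folder.name, log_file.name)
```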

iMaT.src.tokenization.main.tokenize_midi_file(midi_file_path, tokenizer_class, tokenized_file_folder)[source]

Tokenizes a MIDI file using a specified tokenizer class and saves the tokenized sequence to a CSV file. It handles both single-layer and nested token lists.

Parameters:
  • midi_file_path (str) – Path to the MIDI file to be tokenized.

  • tokenizer_class (class) – Class of the tokenizer to be used for tokenizing the MIDI file.

  • tokenized_file_folder (str) – Path to the folder where the CSV file containing the tokenized sequence will be saved.

Return type:

None

Raises:

Exception – If an error occurs during tokenization or saving the tokenized sequence.

See also

extract_tokens_from_token_string_within_list

Function to extract tokens from a list.

extract_tokens_from_token_string_within_nested_list

Function to extract tokens from a nested list.

Module: tokenization.utils.py

This module provides various utility functions that assist in the tokenization of MIDI files and handling of data.

Functions

  • save_data_to_new_csv_file: Saves a DataFrame to a new CSV file in a timestamped directory.

  • select_csv_file_2d_token_representation: Opens a dialog for the user to select a CSV file.

  • display_tokenizable_files_in_folder: Prints all the tokenizable files in a specific folder.

  • get_tokenizable_files_in_folder: Retrieves all tokenizable MIDI files in a specific folder.

  • display_success_rate: Shows the success rate of the tokenization process and any errors.

  • combine_csv_files_in_directory: Merges all CSV files in a specific directory into one CSV file.

  • create_log_entry: Adds a new entry to an existing Excel file, or creates a new one if it doesn’t exist.

Notes

These functions are used throughout the package to facilitate the tokenization process for MIDI files and ensure correct data handling, saving, and logging. Please refer to the individual function docstrings for more detailed descriptions and examples of usage.

iMaT.src.tokenization.utils.combine_csv_files_in_directory(directory_path, output_file_name)[source]

Combines all CSV files in a given directory into a single CSV file.

This function reads all CSV files in the provided directory, extracts the original filenames from these files, and adds these as a new column in the DataFrames. The DataFrames are then concatenated and saved as a new CSV file in the same directory.

Parameters:
  • directory_path (str) – The path to the directory containing the CSV files.

  • output_file_name (str) – The name of the combined output CSV file.

Return type:

None

Raises:

Exception – If an error occurs during the process.
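The combining step of combine_csv_files_in_directory can be illustrated with the sketch below. Note the substitution: the module is described as working with DataFrames, while this sketch uses only the standard-library csv module to stay dependency-free, and it returns rows instead of writing an output file. The helper name combine_csv_rows and the column name source_file are assumptions.

```python
import csv
import tempfile
from pathlib import Path


def combine_csv_rows(directory_path: str) -> list[dict]:
    """Hypothetical helper: read every CSV and tag each row with its file name."""
    combined = []
    for csv_path in sorted(Path(directory_path).glob("*.csv")):
        with csv_path.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                # Record which file the row came from (column name is assumed).
                row["source_file"] = csv_path.name
                combined.append(row)
    return combined


# Small demonstration with two throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.csv").write_text("token\nPitch_60\n", encoding="utf-8")
    (Path(tmp) / "b.csv").write_text("token\nPitch_64\n", encoding="utf-8")
    combined_rows = combine_csv_rows(tmp)

print(combined_rows)
```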

iMaT.src.tokenization.utils.create_log_entry(log_list, log_list_path)[source]

Appends a new entry to an existing .xlsx file or creates the file if it doesn’t exist.

The function uses the openpyxl library to open or create a .xlsx file at the provided path. It then appends the provided log_list as a new row to the first sheet of the workbook.

Parameters:
  • log_list (list) – A list of strings to be appended as a new row.

  • log_list_path (str) – The path to the .xlsx file.

Return type:

None

Raises:

Exception – If an error occurs during the process.

iMaT.src.tokenization.utils.display_success_rate(files_status, num_files_tokenized, num_tokenized)[source]

Displays the tokenization success rate and details of failed tokenizations.

This function calculates and displays the success rate of file tokenization. It also identifies and displays any failed tokenizations, including details of the associated tokenizer and file.

Parameters:
  • files_status (list) – A list of status messages from the file tokenization process.

  • num_files_tokenized (int) – The total number of files for which tokenization was attempted.

  • num_tokenized (int) – The total number of files that were successfully tokenized.

Return type:

None

Raises:

Exception – If an error occurs during the process.
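The success rate reported by display_success_rate is presumably the share of attempted files that were tokenized successfully. A minimal sketch of that calculation follows; the helper name success_rate_percent and the handling of an empty run are assumptions.

```python
def success_rate_percent(num_attempted: int, num_succeeded: int) -> float:
    """Hypothetical helper: percentage of attempted files that succeeded."""
    # Assumption: an empty run is reported as 0 % rather than raising an error.
    if num_attempted == 0:
        return 0.0
    return 100.0 * num_succeeded / num_attempted


print(f"{success_rate_percent(8, 6):.1f} % of files tokenized successfully")
```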

iMaT.src.tokenization.utils.display_tokenizable_files_in_folder(folder_path)[source]

Displays the tokenizable files in the provided directory.

The function uses the ‘get_tokenizable_files_in_folder’ function to fetch all tokenizable files from the directory. Then, it displays the files to the user. If more than 30 files exist in the directory, only the first 30 are displayed, along with a note indicating more files are not shown.

Parameters:

folder_path (str) – The path of the directory to search for tokenizable files.

Returns:

A list of tokenizable file names.

Return type:

list

Raises:

Exception – If an error occurs during the process.
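The truncation behaviour described for display_tokenizable_files_in_folder (show at most 30 files, then a note) can be sketched as follows. The helper name format_file_listing and the wording of the note are assumptions.

```python
def format_file_listing(file_names: list[str], limit: int = 30) -> list[str]:
    """Hypothetical helper: build the lines shown to the user."""
    lines = list(file_names[:limit])
    hidden = len(file_names) - limit
    if hidden > 0:
        # Tell the user that the listing was cut off.
        lines.append(f"... and {hidden} more file(s) not shown")
    return lines


listing = format_file_listing([f"song_{i}.mid" for i in range(35)])
print("\n".join(listing))
```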

iMaT.src.tokenization.utils.get_tokenizable_files_in_folder(folder_path)[source]

Retrieves all tokenizable MIDI files in the given folder.

This function inspects the given directory and identifies all files with the extensions ‘.midi’ or ‘.mid’, which are considered tokenizable.

Parameters:

folder_path (str) – The path of the folder to be searched.

Returns:

A list of filenames that can be tokenized.

Return type:

list

Raises:

Exception – If an error occurs during the process.
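The extension check described for get_tokenizable_files_in_folder can be sketched as below. The helper name list_midi_files is hypothetical, and matching the extension case-insensitively and sorting the result are assumptions that the description above does not confirm.

```python
import tempfile
from pathlib import Path

MIDI_EXTENSIONS = {".mid", ".midi"}


def list_midi_files(folder_path: str) -> list[str]:
    """Hypothetical helper: names of files ending in .mid or .midi."""
    return sorted(
        entry.name
        for entry in Path(folder_path).iterdir()
        # Assumption: extensions are compared case-insensitively.
        if entry.is_file() and entry.suffix.lower() in MIDI_EXTENSIONS
    )


# Small demonstration with throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.mid", "b.MIDI", "notes.txt"):
        (Path(tmp) / name).touch()
    found = list_midi_files(tmp)

print(found)
```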

iMaT.src.tokenization.utils.save_data_to_new_csv_file(df, file_name, identifier)[source]

Saves the refined DataFrame to a new CSV file in a directory named ‘enhanced_csv_[current date and time]’.

This function extracts the directory path of the provided file, then creates a new directory in that path with the provided identifier and current timestamp. The DataFrame is then saved to this new directory with the identifier prepended to the original filename.

Parameters:
  • df (pandas.DataFrame) – The DataFrame to be saved.

  • file_name (str) – The original file name used to generate the new file name.

  • identifier (str) – A string to be used as an identifier for the new directory and file name.

Returns:

The path to the new CSV file.

Return type:

str

Raises:

Exception – If an error occurs during directory creation or file saving.
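The path handling described for save_data_to_new_csv_file can be sketched as follows. The helper name build_output_path, the underscore separators, and the timestamp format are assumptions; the description above only states that the directory name combines the identifier with the current timestamp and that the identifier is prepended to the original file name. The sketch computes the path only and does not write a file.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def build_output_path(
    file_name: str, identifier: str, now: Optional[datetime] = None
) -> Path:
    """Hypothetical helper: derive the target CSV path for a refined file."""
    now = now or datetime.now()
    original = Path(file_name)
    # New directory next to the original file: <identifier>_<timestamp>.
    new_dir = original.parent / f"{identifier}_{now:%Y%m%d_%H%M%S}"
    # New file name: identifier prepended to the original file name.
    return new_dir / f"{identifier}_{original.name}"


target = build_output_path(
    "data/song.csv", "enhanced_csv", datetime(2024, 1, 2, 3, 4, 5)
)
print(target)
```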

iMaT.src.tokenization.utils.select_csv_file_2d_token_representation()[source]

Opens a file dialog allowing the user to select a CSV file.

This function guides the user to select a suitable CSV file for further processing. It makes use of a graphical file dialog and informs the user about the requirements for the file selection via terminal outputs.

Returns:

  • str – The path to the selected file if a file was selected.

  • None – If the user cancelled the dialog.

Raises:

Exception – If an error occurs during file selection.