Python - Combining Excel Files
This Google Colab script automates merging Excel files from ZIP archives. It installs essential libraries, uploads and extracts ZIP files, compiles a unified column schema, standardizes data, and exports the combined DataFrame into a new Excel file.
Installing Dependencies
!pip install pandas openpyxl zipfile36
These libraries enable data manipulation, Excel file processing, and ZIP file handling.
Uploading Files via Google Colab
from google.colab import files uploaded = files.upload()
This allows users to upload ZIP archives, storing files in a dictionary with filenames as keys.
Extracting Uploaded ZIP Files
import zipfile, io, os for filename in uploaded.keys(): with zipfile.ZipFile(io.BytesIO(uploaded[filename]), 'r') as zip_ref: zip_ref.extractall('/content/extracted_files')
Extracts ZIP file contents into /content/extracted_files
for further processing.
Preparing for Excel File Processing
import pandas as pd, warnings warnings.filterwarnings("ignore", category=UserWarning, module='openpyxl')
Suppresses non-critical warnings and prepares data processing libraries.
Collecting Unique Columns
combined_df = pd.DataFrame() directory = '/content/extracted_files' all_columns = set() for filename in os.listdir(directory): if filename.endswith(".xlsx"): file_path = os.path.join(directory, filename) try: df = pd.read_excel(file_path, engine='openpyxl') all_columns.update(df.columns) except Exception as e: print(f"Error processing file {filename}: {e}")
Gathers all unique column headers to create a standardized format for merging.
Complete Script
!pip install pandas openpyxl zipfile36 from google.colab import files uploaded = files.upload() import zipfile, io, os for filename in uploaded.keys(): with zipfile.ZipFile(io.BytesIO(uploaded[filename]), 'r') as zip_ref: zip_ref.extractall('/content/extracted_files') import pandas as pd, warnings warnings.filterwarnings("ignore", category=UserWarning, module='openpyxl') combined_df = pd.DataFrame() directory = '/content/extracted_files' all_columns = set() for filename in os.listdir(directory): if filename.endswith(".xlsx"): file_path = os.path.join(directory, filename) try: df = pd.read_excel(file_path, engine='openpyxl') all_columns.update(df.columns) except Exception as e: print(f"Error processing file {filename}: {e}") for filename in os.listdir(directory): if filename.endswith(".xlsx"): file_path = os.path.join(directory, filename) try: df = pd.read_excel(file_path, engine='openpyxl') df = df.reindex(columns=all_columns) df['filename'] = filename combined_df = pd.concat([combined_df, df], ignore_index=True) except Exception as e: print(f"Error processing file {filename}: {e}") combined_df.to_excel('combined_file.xlsx', index=False) from google.colab import files files.download('combined_file.xlsx')
This is the complete version of the script, combining all steps into a single execution.