TL;DR

Researchers have developed an end-to-end pipeline to extract and analyze institutional affiliations from 5,356 ICLR 2026 accepted papers. The resulting dataset includes normalized affiliations and visualizations, offering insights into the landscape of AI research. The project aims to be more accurate than analyses based on author profiles and to make institutional impact analysis easier.

The pipeline produces a PDF-derived institutional affiliation dataset for ICLR 2026 accepted papers, which is now available for research and visualization purposes. Because it reads affiliations from the papers themselves rather than from author profiles, which can be out of date, it offers a more accurate picture of the institutions shaping AI research.

The pipeline processes 5,356 accepted papers from ICLR 2026, extracting author affiliations directly from PDF title blocks. This avoids profile drift, where an author's current employer is incorrectly attributed to a paper written at a previous institution. It normalizes institution names using approximately 250 rules, so that variant spellings of the same institution are counted together. The resulting data includes institution counts, country and region classifications, and detailed author affiliations, stored in multiple CSV formats for different analysis approaches.
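Rule-based normalization of this kind can be illustrated with a minimal sketch. The three rules, the `RULES` list, and the `normalize_institution` function below are hypothetical examples, not the project's actual rules or code; they only show how a pattern-to-canonical-name table might collapse variant spellings.

```python
import re

# Hypothetical rules (pattern, canonical name). The real pipeline reportedly
# uses about 250 rules; these three are illustrative assumptions only.
RULES = [
    (re.compile(r"\b(mit|massachusetts institute of technology)\b", re.I),
     "Massachusetts Institute of Technology"),
    (re.compile(r"\bcarnegie mellon\b|\bcmu\b", re.I),
     "Carnegie Mellon University"),
    (re.compile(r"\beth\s*z(ü|u|ue)rich\b", re.I),
     "ETH Zurich"),
]


def normalize_institution(raw: str) -> str:
    """Return a canonical institution name for a raw title-block string."""
    # Collapse runs of whitespace left over from PDF text extraction.
    cleaned = re.sub(r"\s+", " ", raw).strip()
    # First matching rule wins, so rule order matters.
    for pattern, canonical in RULES:
        if pattern.search(cleaned):
            return canonical
    # No rule matched: keep the cleaned string as its own institution.
    return cleaned


if __name__ == "__main__":
    for raw in ["MIT CSAIL", "Carnegie  Mellon Univ.", "ETH Zürich", "Unknown  Lab"]:
        print(repr(raw), "->", normalize_institution(raw))
```

Short acronym patterns such as `mit` can produce false positives on unrelated text, which is one reason a real rule set would need review and ongoing maintenance.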

Key outputs include a publication-level dataset, a ranked list of institutions by unique affiliation counts, and visualizations such as treemaps illustrating the research landscape. The project also compares three counting methods (per paper, first-author only, and fractional) to check that the rankings are robust and to identify potential artifacts. PDF parsing succeeds for approximately 96% of papers; for the remaining 4%, where parsing fails, the pipeline falls back to author profile data.
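The difference between the three counting methods can be made concrete with a small sketch. It assumes each paper is represented as a list of institution names, one per author in author order, and it assumes fractional counting gives each author an equal 1/n share of a paper; the article does not state the project's exact definitions, and `count_institutions` is a hypothetical name.

```python
from collections import Counter


def count_institutions(papers, method="per_paper"):
    """Count institution credit under one of three assumed schemes.

    papers: list of papers, each a list of institution names,
            one entry per author in author order.
    """
    counts = Counter()
    for authors in papers:
        if not authors:
            continue  # skip papers with no parsed affiliations
        if method == "per_paper":
            # Each institution gets 1 per paper, however many authors it has.
            for inst in set(authors):
                counts[inst] += 1
        elif method == "first_author":
            # Only the first author's institution is credited.
            counts[authors[0]] += 1
        elif method == "fractional":
            # Each author carries an equal share; credit per paper sums to 1.
            share = 1 / len(authors)
            for inst in authors:
                counts[inst] += share
        else:
            raise ValueError(f"unknown method: {method}")
    return counts


if __name__ == "__main__":
    toy = [["A", "A", "B"], ["B"], ["C", "A"]]
    for m in ("per_paper", "first_author", "fractional"):
        print(m, dict(count_institutions(toy, m)))
```

If an institution ranks highly under per-paper counting but much lower under fractional counting, its count is driven by minor shares of many large collaborations, which is the kind of artifact such a comparison can surface.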

Why It Matters

The dataset provides a more precise and reliable view of institutional contributions to AI research at ICLR 2026. It helps researchers, institutions, and policymakers understand research trends, collaboration patterns, and the relative influence of industry and academia. It also improves transparency and supports more accurate bibliometric analyses, which can inform funding, partnerships, and strategic decisions.

Background

Previous analyses relied heavily on author profiles from platforms like OpenReview, which are prone to inaccuracies due to profile drift. The new pipeline addresses this by extracting affiliations directly from PDF documents, which are the official source. This approach aligns with ongoing efforts in the research community to improve data quality and transparency in bibliometric studies. The pipeline builds on prior work in PDF parsing and normalization, applying these techniques specifically to ICLR 2026, one of the major AI conferences.
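A PDF-first approach with a profile fallback can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the `Affiliation` type, the `resolve_affiliation` function, and the `"pdf"`/`"profile"` source labels are hypothetical names. Recording the source alongside each value lets later analysis separate PDF-derived entries from profile-derived ones, which may be affected by drift.

```python
from typing import NamedTuple, Optional


class Affiliation(NamedTuple):
    institution: str
    source: str  # "pdf" or "profile" (hypothetical labels)


def resolve_affiliation(pdf_value: Optional[str],
                        profile_value: Optional[str]) -> Optional[Affiliation]:
    """Prefer the affiliation parsed from the PDF; fall back to the profile."""
    if pdf_value and pdf_value.strip():
        return Affiliation(pdf_value.strip(), "pdf")
    if profile_value and profile_value.strip():
        # PDF parsing failed or returned nothing usable.
        return Affiliation(profile_value.strip(), "profile")
    return None  # no affiliation available from either source


if __name__ == "__main__":
    print(resolve_affiliation("University X", "Company Y"))
    print(resolve_affiliation(None, "Company Y"))
    print(resolve_affiliation("   ", None))
```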

“This pipeline provides a more accurate, PDF-derived institutional affiliation dataset, avoiding the common pitfalls of author profile drift and enabling detailed analysis of research trends.”

— Dmytro Lopushanskyy, project lead

“The new affiliation dataset offers valuable insights into the global distribution of AI research and the evolving landscape of institutional contributions.”

— ICLR 2026 organizing committee

What Remains Unclear

It remains unclear how the dataset will be adopted by the broader research community or integrated into existing bibliometric tools. Additionally, the accuracy of PDF parsing for non-standard or poorly formatted papers may vary, and the long-term stability of the normalization rules has yet to be tested across future conferences.

What’s Next

Next steps include expanding the dataset to cover subsequent conferences, refining parsing algorithms for even higher accuracy, and encouraging community use to validate and improve the dataset. Researchers may also explore applying similar pipelines to other conferences or journals to build a comprehensive view of AI research trends.

Key Questions

How does this dataset improve over previous author profile-based analyses?

The dataset extracts affiliations directly from PDF title blocks, reducing errors caused by profile drift and providing a more accurate representation of institutional contributions.

Can this pipeline be used for other conferences or journals?

Yes, the pipeline is designed to be adaptable, and with some modifications, it can process PDFs from other conferences or journals to generate similar affiliation datasets.

What are the main limitations of this approach?

The primary limitations include potential parsing errors with non-standard papers and the need for ongoing normalization rule updates to maintain accuracy across diverse formats.

How can researchers access and use this dataset?

The dataset is publicly available via the project’s GitHub repository and can be integrated into bibliometric analyses or visualizations to better understand research trends.
