General details
EDIHs involved
Customer

Customer size: Micro (1-9)
Customer turnover: €200,000
Challenges
La Foneria is a digital humanities company specialised in processing and publishing digitised collections from museums, archives, and documentation centers online. They use the open-source software Dédalo and collaborate closely with its developers.
La Foneria processes data from digitised documents. For example, they extract locations from a series of 10,000 images, link these locations to a controlled vocabulary within Dédalo, and index the data for online publication, enabling users to find specific documentation. Currently, this process is manual, but La Foneria aims to automate it using computer vision and artificial intelligence, particularly for normalised datasets.
La Foneria's clients include public and private companies and documentation centers focused on archives, historical memory, and human rights. These clients have vast amounts of documents containing data on victims, including their names, origins, repressive events, and locations of death or imprisonment. By using AI trained on relevant datasets, La Foneria can automate the extraction of this data from digitised documents.
To explore the feasibility of this automation, La Foneria engaged with DIH4CAT and the Computer Vision Center (CVC), a renowned research center in computer vision. The CVC's Document Analysis group has expertise in various fields such as symbol recognition, graphical content indexing, document image analysis, and camera-based OCR.
After an initial evaluation, La Foneria provided thousands of digitised documents on topics such as war councils. The CVC's technical team extracted the relevant information, made it searchable, proposed the anonymisation of certain names, and linked the data with Dédalo datasets.
The project, financed by DIH4CAT, includes a study of state-of-the-art text recognition (OCR) engines for the automatic transcription of digitised documents.
Solutions
To finance the challenge, the company applied for a coupon from DIH4CAT's Advanced Digital Technologies Testing Coupons line, which allowed it to carry out this initial testing and study as a first step in its digital transformation.
The challenge is to automatically extract the names, locations, detention centers, as well as types of convictions that appear in the supplied documents.
To automate this task, a combination of computer vision and AI techniques was proposed. Specifically, the task can be formulated as Named-Entity Recognition (NER), an information extraction task that seeks to locate named entities in text and classify them into pre-defined categories. The literature distinguishes two main strategies. The first is a two-phase pipeline: a text recognition (OCR) engine produces an automatic transcription (text) of the digitised document (image), and a pre-trained Natural Language Processing (NLP) model then finds the entities of interest in that text. The second is to train a Word Spotting model that searches the digitised document directly, so that only the areas detected in the image need to be transcribed.
The OCR performance on handwritten text in the tests carried out was better than expected, given that Word Spotting or image clustering is generally considered the best option for handwritten text. Work carried out at the CVC on Topic Spotting shows promising results in locating texts that are always the same in documents where the quality of the digitisation is very low. The testing project indicates that the challenge is manageable with these technologies. The next step would be a larger project that includes a proof of concept with labelled images, which would allow an objective evaluation of the tests.
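The second phase of the first strategy can be illustrated with a minimal sketch. It assumes the OCR phase has already produced plain text, and it replaces the pre-trained NLP model with a simple dictionary (gazetteer) lookup so that the example stays self-contained. The category names, the gazetteer entries, and the sample sentence are hypothetical and are not taken from the project data.

```python
import re
from dataclasses import dataclass


@dataclass
class Entity:
    text: str   # the matched string as it appears in the OCR text
    label: str  # the pre-defined category
    start: int  # character offset where the match begins
    end: int    # character offset just past the match


# Hypothetical gazetteer: illustrative entries only.
GAZETTEER = {
    "LOCATION": ["Barcelona", "Girona"],
    "DETENTION_CENTER": ["Presó Model"],
    "CONVICTION": ["reclusión perpetua"],
}


def find_entities(text, gazetteer=GAZETTEER):
    """Locate gazetteer terms in OCR text and classify them by category."""
    entities = []
    for label, terms in gazetteer.items():
        for term in terms:
            # Whole-word, case-insensitive match of the literal term.
            pattern = r"\b" + re.escape(term) + r"\b"
            for m in re.finditer(pattern, text, flags=re.IGNORECASE):
                entities.append(Entity(m.group(0), label, m.start(), m.end()))
    # Return entities in reading order.
    return sorted(entities, key=lambda e: e.start)


# Stand-in for the output of an OCR engine.
ocr_text = "Condemned in Barcelona to reclusión perpetua, held at the Presó Model."
for entity in find_entities(ocr_text):
    print(entity.label, repr(entity.text), entity.start, entity.end)
```

A trained NER model would generalise beyond a fixed word list, but the input and output have the same shape: text in, labelled spans out.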
Results and Benefits
The service approached the problem as a Named Entity Recognition (NER) task, aimed at automatically extracting the names, locations, detention centers, and types of convictions from the documents to allow large-scale indexing. It analysed the performance of two possible approaches.
- The first approach processes the output of an OCR engine with Natural Language Processing (NLP), which identifies categories in that output. In this approach, various OCR engines were analysed, and it was assessed whether refining the dictionary used by the NLP could improve results. The results of the first approach are promising.
- The second approach is based on spotting specific words within the documents (Word Spotting). Tests were conducted, but the results are not as promising as those of the first approach.
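For the first approach, one way a dictionary lookup could tolerate OCR misreadings is approximate string matching against a controlled vocabulary. The sketch below uses Python's standard `difflib` module. The vocabulary, the function name `normalise_token`, and the 0.8 similarity cutoff are assumptions made for illustration, not the method or settings used in the project.

```python
from difflib import get_close_matches

# Hypothetical controlled vocabulary of place names.
VOCABULARY = ["Barcelona", "Girona", "Lleida", "Tarragona"]


def normalise_token(token, vocabulary=VOCABULARY, cutoff=0.8):
    """Map a possibly misread OCR token to its closest vocabulary term.

    Returns None when no term is similar enough, so unknown words
    are not forced onto the vocabulary.
    """
    # Compare in lower case, but return the canonical spelling.
    canonical = {term.lower(): term for term in vocabulary}
    matches = get_close_matches(token.lower(), list(canonical), n=1, cutoff=cutoff)
    return canonical[matches[0]] if matches else None


for raw in ["Barce1ona", "Gir0na", "Madrid"]:
    print(raw, "->", normalise_token(raw))
```

The cutoff trades recall against false matches: a lower value recovers more damaged tokens but also risks linking unrelated words to a vocabulary term.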
Beyond the scope of the proposal, solutions to collateral problems identified during document analysis were also examined, as solving these could improve the original solution's performance. These problems include: improving text segmentation in noisy or interfered documents, detecting handwritten text so that it can be processed separately from the solution created for typed text, and OCR reading of handwritten text.
The results demonstrate the viability of these technologies for addressing this problem. Current efforts to digitise cultural heritage have produced large collections, such as the 175 million digital objects in the Europeana Archives Portal, that are merely images and not machine-readable; computer vision could make these documents readable by applying various techniques. Once this information becomes machine-readable, archives will transform from document repositories into data providers: users will be able to query archives directly, unlocking millions of data points for public use and revolutionising archival science and related sectors. Achieving this long-term vision would enable, for instance, a historian to perform natural language searches such as: "how many people were exiled in 1939 due to the Spanish Civil War, who were they, and where did they go?" Advanced AI and computer vision technologies for document reading, such as NER and entity linking, could provide written responses and even map-based presentations from machine-readable primary sources (archival documents).
This long-term vision is attainable, as La Foneria and the Computer Vision Center began a basic conceptualisation in 2023, funded by the “Digital Technologies Coupons”, and are developing the concept with NER in 2024 with support from the DIH4CAT’s PADIH Program. With support from Regional Funding “Nuclis”, an experimental proof of concept with additional AI techniques will be conducted and validated at the laboratory level, aiming to progress from a current TRL 2-3 to a TRL 4 by December 2026.
Perceived social/economic impact
The first state-of-the-art study and exploration of AI technologies for extracting names from historical documents has been very positive for La Foneria and allowed the company to scale the solution further with the help of DIH4CAT's PADIH Program. It has also opened new possibilities for continuing research in this field through a public-private consortium that could potentially be funded by the “NUCLIS R+D” call, in order to carry out an experimental proof of concept with the rest of the AI techniques and validate them at laboratory level. The main idea is to create a theoretical model for archives, so that when this sector integrates and applies AI techniques it can start managing data rather than documents. The model would be developed until a valid prototype is obtained for both data providers and stakeholders. This transformation will allow growth in the current income lines of data analysis, treatment, creation and publication projects (the main current business model of La Foneria), and in new income lines based on the future product.
Regarding the current income lines, turnover is estimated to double in the first year after the launch of the new artificial intelligence functionalities inside or outside the Dédalo SaaS platform by files, increasing from €250K (invoiced in 16 projects) to approximately €500K (invoiced in approximately 35 projects), and to reach €750K (distributed across 75 projects) in the following 3 years.
At a social and environmental level, the automation of archival processes through AI can reduce the consumption of energy and resources, making the work more efficient and sustainable. Facilitating access to information related to historical memory strengthens transparency and citizen participation, improving democratic quality. Innovation in work processes allows for more agile and accurate management and helps in transitional justice processes as the use of vision and AI in
Lessons learned
The services through which DIH4CAT helps companies fund the first stages of development have been established as “Digital Technologies Coupons”. These coupons are intended to help companies with an initial diagnosis of technology development and with proofs of concept. This is crucial for putting companies in contact with technology providers from the region that have the capacity to make a first attempt at solving a challenge. DIH4CAT made every effort to keep the application process accessible and simple, so that all eligible businesses could participate without excessive administrative burdens. Flexibility, together with support and guidance, was also crucial to cater to the diverse needs of SMEs at different stages of development. One barrier for companies was covering the initial payment to the technology provider, since the funds are released only once the service is finished, which might take some months. It was also very important to limit and explain the scope of this first-stage funding, which does not cover the development of functional prototypes.
The technology services provided by the CVC in the first stage of funding were devoted to analysing 5 different methodologies for processing the documents and detecting problems within them. Some work done at the CVC related to Topic Spotting shows promising results for locating texts that are always the same (such as types of sentences or detention centres). The results indicate that the challenge can be tackled with these technologies. It would be necessary to continue testing the technologies in a larger-scale project that allows for a proof of concept with labelled images, enabling an objective (numerical) evaluation of the tests. The priorities that we consider should be addressed in a larger-scale project are the construction of a ground truth and the development of software.
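To illustrate what an objective (numerical) evaluation against a ground truth could look like, the sketch below computes precision, recall, and F1 over extracted entities. These are standard metrics for information extraction, but their use here is an assumption, as are the tuple layout `(document, category, value)` and the sample data.

```python
def score(predicted, ground_truth):
    """Return (precision, recall, f1) for two collections of entities."""
    predicted, ground_truth = set(predicted), set(ground_truth)
    true_positives = len(predicted & ground_truth)
    # Precision: share of extracted entities that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of labelled entities that were extracted.
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical labelled entities (ground truth) for two documents.
ground_truth = {
    ("doc1", "LOCATION", "Barcelona"),
    ("doc1", "CONVICTION", "reclusión perpetua"),
    ("doc2", "LOCATION", "Girona"),
}
# Hypothetical system output: one correct entity and one wrong one.
predicted = {
    ("doc1", "LOCATION", "Barcelona"),
    ("doc2", "LOCATION", "Lleida"),
}

precision, recall, f1 = score(predicted, ground_truth)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

With a ground truth of this kind, different OCR engines or extraction approaches could be compared on the same labelled images using the same figures.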