We are very proud to announce the next big thing for all MateCat users: the release of the MateCat Filters. They support 69 formats, run 20x faster, and are released as open source software.
MateCat uses filters to import files and extract only the translatable text. It then uses the same filters to generate the target files with your translations, keeping the original layout and formatting.
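The extract-then-merge round trip described above can be illustrated with a minimal sketch. It is not the MateCat Filters code: it assumes a toy markup format in which anything inside angle brackets is layout and everything else is translatable, and the names `extract` and `merge` are hypothetical.

```python
import re

# Assumption: in this toy format, tags are layout and the text between them
# is translatable. Real filters handle far richer formats.
TAG = re.compile(r"(<[^>]+>)")

def extract(source: str):
    """Split a document into a skeleton and a list of translatable segments."""
    skeleton, segments = [], []
    for part in TAG.split(source):
        if TAG.fullmatch(part) or not part.strip():
            skeleton.append(part)            # markup or whitespace: kept verbatim
        else:
            skeleton.append(len(segments))   # placeholder: index into segments
            segments.append(part)
    return skeleton, segments

def merge(skeleton, translations):
    """Rebuild the target document by filling placeholders with translations."""
    return "".join(
        translations[p] if isinstance(p, int) else p for p in skeleton
    )

skeleton, segments = extract("<p>Hello <b>world</b></p>")
target = merge(skeleton, ["Ciao ", "mondo"])
```

Because the skeleton keeps every non-translatable part verbatim, merging the untouched segments back reproduces the source file exactly, which is the property that lets the target keep the original layout.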
Previously, we had been using third-party commercial filters on matecat.com and could only support XLIFF files in the open source version.
We believe open source is an enabler for innovation, and that’s why we are thrilled to be releasing the new MateCat Filters as open source software.
Quality
We decided to design filters using a pure data-driven approach. We analyzed thousands of files from past translation projects and reviewed them with professional translators and engineers until all errors were removed and re-use was maximized.
We started out by integrating the Okapi Filters, the best open source filters on the market. At the beginning, only 10% of the projects had zero errors. We worked hard with the Okapi team to improve the Okapi Filters and reached a success rate of 68%.
We continued iterating and adding new pre- and post-processing features on the MateCat side until we achieved a 100% success rate.
This is how things progressed over the last few months:
Month | Release | # files tested | Success rate
May 2015 | Matecat Filter 0.5, Okapi v27 | 400 | 10%
June 2015 | Matecat Filter 0.6, Okapi v28 | 400 | 35%
July 2015 | Matecat Filter 0.7, Okapi v29 | 400 | 68%
August 2015 | Matecat Filter 0.8 + legacy Office support | 1000 | 96%
September 2015 | Matecat Filter 0.9 + PDF and scanned files via OCR | 1000 | 100%
October 2015 | Matecat Filter 1.0 + adding formats not supported by commercial libraries | >2000 | 106%*
* Achieved by supporting formats that were not supported by commercial libraries.
20x faster
The brand new architecture is extremely fast compared to the leading commercial technology. Upload dozens of files and have them ready for translation in a few seconds.
The efficiency of the new system provides not only speed, but also greater stability and scalability, to deliver a fast experience to more and more users.
The filters are now multi-threaded and stateless, and they can be clustered across multiple machines. This means that, with enough machines, you can potentially convert millions of files in seconds rather than days.
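To show why statelessness matters for scaling, here is a minimal sketch. It assumes a hypothetical `convert` function whose output depends only on its input (a word count stands in for the real extraction work); it is not the MateCat Filters implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def convert(job):
    """Hypothetical stateless conversion: the result depends only on the input."""
    name, payload = job
    text = payload.decode("utf-8")
    return name, len(text.split())   # placeholder for real extraction work

def convert_all(jobs, workers=4):
    # No state is shared between calls, so jobs can run in any order
    # and on any worker without coordination.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(convert, jobs))

results = convert_all([("a.txt", b"hello world"), ("b.txt", b"one")])
```

Since `convert` keeps nothing between calls, the same function could be handed to a process pool or to workers on separate machines without any changes, which is what makes a design like this easy to cluster.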
Support for new file formats
The number of supported formats has now grown to 69. Discover them all on MateCat.com.
Among the new formats, you will find .po files, scanned PDF files and images (JPEG and PNG). Import a PDF, JPEG or PNG file, or any image obtained by scanning or photographing a document, and MateCat will extract the translatable text for your translation project.
Better segmentation
We now use the ICU (International Components for Unicode) library with the text segmentation rules defined by the Unicode Consortium. On top of this, we apply another layer of rules specifically designed for the world of CAT tools. As a result, MateCat provides extremely accurate segmentation, even for rare languages.
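The idea of layering extra rules on top of a base segmenter can be sketched as follows. This is only an illustration: a naive regular expression stands in for the ICU break rules, and the exception list `NO_BREAK_AFTER` is a hypothetical example of a second-layer rule, not one of MateCat's actual rules.

```python
import re

# Base layer: a naive stand-in for ICU's sentence break rules (NOT ICU itself).
BASE_BREAK = re.compile(r"(?<=[.!?])\s+")

# Second layer: hypothetical exceptions that suppress a base-layer break.
NO_BREAK_AFTER = ("e.g.", "i.e.", "Mr.", "Dr.", "No.")

def segment(text):
    """Split text into segments, then undo breaks that follow an exception."""
    segments = []
    for piece in BASE_BREAK.split(text.strip()):
        if segments and segments[-1].endswith(NO_BREAK_AFTER):
            segments[-1] += " " + piece   # rejoin: the base break was wrong
        else:
            segments.append(piece)
    return segments

parts = segment("Dr. Smith arrived. He was late.")
```

The base layer alone would break after "Dr."; the second layer rejoins that piece with the next one. Note that this sketch normalizes the whitespace between rejoined pieces to a single space, which a production segmenter would need to preserve.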
Open Source
Like the rest of the MateCat project, the new filters are based on open source code and are released as open source software. Now that we can maintain the filters' code directly, and with the help of the open source community, we expect improvements to arrive much faster. You can test the new filters now at matecat.com.
Acknowledgments
It’s amazing how much talent and expertise we discovered in the open source community. We would especially like to thank the team of developers who created the terrific Okapi Framework, and special thanks also to Spartan Software for teaming up with us on the common challenge for better open source filters. Thank you guys, it’s an honour and a pleasure working with you!