Message Boards

Encoding characters

Updated 6 years ago by Daniel G.

Encoding characters

Regular Member Posts: 141 Join Date: 17/03/14
Hi!

I'm trying to read values from a Language.properties file and then export them to a CSV file. I'm doing this with HtmlUtil.extractText, but the result is wrong because the HTML entities for special characters, such as &aacute; and others, are printed as-is.

Does anyone know what I should use to print them correctly?

Thanks for the help!
Updated 6 years ago by Olaf Kock.

RE: Encoding characters

Liferay Legend Posts: 6403 Join Date: 08/09/23
Daniel G:
I'm trying to read values from a Language.properties file and then export them to a CSV file. I'm doing this with HtmlUtil.extractText, but the result is wrong because the HTML entities for special characters, such as &aacute; and others, are printed as-is.

Does anyone know what I should use to print them correctly?


"Printing correctly" is subject to interpretation. Assuming you'd like to extract UTF-8 text to your CSV file, you'll still need to make sure to escape some characters, like quotes, line breaks, commas, or semicolons. You'll also need to decide if (or how) you would like to see tags like "<b>" in your result. Thus, a single call won't be sufficient.
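As a rough illustration of the CSV side only, here is a minimal sketch that assumes the common RFC 4180 convention: a field is wrapped in double quotes when it contains the delimiter, a double quote, or a line break, and embedded double quotes are doubled. The class and method names (CsvFieldSketch, escapeField, toRow) are made up for this example and are not part of Liferay.

```java
final class CsvFieldSketch {

    // Quote a field only when it contains the delimiter, a double quote,
    // or a line break; embedded double quotes are doubled.
    static String escapeField(String value, char delimiter) {
        boolean needsQuoting = value.indexOf(delimiter) >= 0
                || value.indexOf('"') >= 0
                || value.indexOf('\n') >= 0
                || value.indexOf('\r') >= 0;
        if (!needsQuoting) {
            return value;
        }
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    // Join already-decoded plain-text fields into one CSV row.
    static String toRow(char delimiter, String... fields) {
        StringBuilder row = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                row.append(delimiter);
            }
            row.append(escapeField(fields[i], delimiter));
        }
        return row.toString();
    }

    public static void main(String[] args) {
        System.out.println(toRow(',', "greeting", "Hello, world"));
        System.out.println(toRow(';', "quote", "She said \"hi\""));
    }
}
```

The delimiter is a parameter because some spreadsheet locales expect a semicolon instead of a comma. In a real project, an established CSV library is probably a safer choice than hand-rolled escaping.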

You've decided to use a method from HtmlUtil, a class that is intimately tied to the HTML format. You might want to look at the other methods that are also there and see if stripHtml, unescape or render fit your needs. Or you might want to check other methods of extraction.
Updated 6 years ago by Daniel G.

RE: Encoding characters

Regular Member Posts: 141 Join Date: 17/03/14
First of all, thanks for the help.

What other extraction methods can I use? I read the values from a Language.properties file with LanguageUtil.get, but I can't get this to work with any of the HtmlUtil methods I've tried, because the encoded characters are still printed.

Thanks again!
Updated 6 years ago by Christoph Rabel.

RE: Encoding characters

Liferay Legend Posts: 1554 Join Date: 09/09/24
We did something like that once, but used really ugly translation tables, something like this:
http://www.thesauruslex.com/typo/eng/enghtml.htm
Since we needed only a subset, it worked pretty well in the end.
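A translation table along those lines could look roughly like the sketch below. It is an illustration, not the code we actually used: the class name EntityDecoderSketch and the handful of table entries are assumptions, and the table is deliberately partial. Numeric references such as &#47; are decoded generically, and unknown names are left untouched. Non-ASCII characters are written as \u escapes so the source compiles regardless of file encoding.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EntityDecoderSketch {

    // Hypothetical, deliberately partial table; extend it with the entities your texts use.
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
        NAMED.put("aacute", "\u00e1"); // a with acute
        NAMED.put("eacute", "\u00e9"); // e with acute
        NAMED.put("iacute", "\u00ed"); // i with acute
        NAMED.put("oacute", "\u00f3"); // o with acute
        NAMED.put("uacute", "\u00fa"); // u with acute
        NAMED.put("ntilde", "\u00f1"); // n with tilde
        NAMED.put("auml", "\u00e4");   // a with diaeresis
    }

    // Matches &name;, &#123; and &#x1F; style references (the semicolon is required).
    private static final Pattern ENTITY =
            Pattern.compile("&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);");

    static String decode(String text) {
        Matcher m = ENTITY.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String body = m.group(1);
            String replacement;
            if (body.startsWith("#x") || body.startsWith("#X")) {
                replacement = fromCodePoint(body.substring(2), 16, m.group());
            } else if (body.startsWith("#")) {
                replacement = fromCodePoint(body.substring(1), 10, m.group());
            } else {
                // Unknown names are kept as they are.
                replacement = NAMED.getOrDefault(body, m.group());
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Convert a numeric reference; fall back to the original text if it is out of range.
    private static String fromCodePoint(String digits, int radix, String original) {
        try {
            int codePoint = Integer.parseInt(digits, radix);
            return Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : original;
        } catch (NumberFormatException e) {
            return original;
        }
    }
}
```

Because the text is scanned once, something like &amp;lt; ends up as the literal text &lt; and is not decoded a second time.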

Another wild idea we had was to use the browser. The browser is able to translate all those special characters. So, the idea was to write the texts into divs, get the UTF-8 text using innerHTML and send it back. We never implemented or even tried it, but it might work. Write a page, print all the texts in a list, and add some JavaScript to send all the text to the backend.
Updated 6 years ago by Olaf Kock.

RE: Encoding characters

Liferay Legend Posts: 6403 Join Date: 08/09/23
Daniel G:
What other extraction methods can I use? I read the values from a Language.properties file with LanguageUtil.get, but I can't get this to work with any of the HtmlUtil methods I've tried, because the encoded characters are still printed.


First of all: Please list some of your input and the desired output. Let's say you'd like to have the following values in csv - what do you expect?
  • press any key to continue, any other to quit
  • A semicolon (";") is a valid character
  • Single quotes look like this: '
  • This is <b>valid</b> HTML<br/>with two lines.
  • The german alphabet knows of a character &auml; - really!
  • The german alphabet knows of a character ä. Really!

(The question is about comma, semicolon, quotes, HTML-Tags, special characters (Umlaut) in whatever form you find them.)
Updated 6 years ago by Daniel G.

RE: Encoding characters

Regular Member Posts: 141 Join Date: 17/03/14
Thanks to all!! And sorry for the delay, but I was busy these days so I couldn't post.

I obtain this:

-Holan&#47;Adi&oacute;s

and I should obtain this:
- Hola/Adiós

Thanks!
Updated 6 years ago by Olaf Kock.

RE: Encoding characters

Liferay Legend Posts: 6403 Join Date: 08/09/23
Daniel G:
I obtain this:

-Holan&#47;Adi&oacute;s

and I should obtain this:

- Hola/Adiós


What about the other inputs that I've asked about? The problem is that converting encoding from one to another is not really a trivial task that can be answered with a single example. My samples above are not complete. Assume they all go into a CSV file with the line number. What would be the correct output?

1,press any key to continue, any other to quit
2,A semicolon (";") is a valid character
3,Single quotes look like this: '
4,This is <b>valid</b> HTML<br>with two lines.
5,The german alphabet knows of a character ä - really!
6,The german alphabet knows of a character ä. Really!


Obviously, some lines have 2 entries, some have 3.

You have two conversions here: from HTML to plain UTF-8 text, then from plain UTF-8 text to CSV. However, with tags in HTML, even that is not a well-defined requirement - will you keep the tags? Escape them? Simplify them?
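To make the two steps concrete, here is a minimal sketch under one possible set of answers: "<br>" becomes a line break, all other tags are dropped, entities are decoded afterwards, and the result is quoted for a comma-separated file. All names (HtmlToCsvSketch, stripTags, decodeEntities, escapeCsv, toCsvField) are hypothetical, the regex-based tag removal is naive, and the entity list is only a placeholder for a fuller decoder.

```java
import java.util.regex.Pattern;

final class HtmlToCsvSketch {

    private static final Pattern LINE_BREAK_TAG =
            Pattern.compile("<br\\s*/?>", Pattern.CASE_INSENSITIVE);
    // Naive tag matcher: "<" or "</" followed by a letter, up to the next ">".
    private static final Pattern ANY_TAG = Pattern.compile("</?[A-Za-z][^>]*>");

    // Step 1a: policy decision, here "<br> becomes a line break, other tags are dropped".
    static String stripTags(String html) {
        String withBreaks = LINE_BREAK_TAG.matcher(html).replaceAll("\n");
        return ANY_TAG.matcher(withBreaks).replaceAll("");
    }

    // Step 1b: placeholder decoder for a few entities. "&amp;" is handled last so
    // that "&amp;lt;" ends up as the text "&lt;" rather than "<".
    static String decodeEntities(String text) {
        return text.replace("&auml;", "\u00e4")
                .replace("&oacute;", "\u00f3")
                .replace("&#47;", "/")
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&amp;", "&");
    }

    // Step 2: quote the plain text for a comma-separated file.
    static String escapeCsv(String value) {
        boolean needsQuoting = value.indexOf(',') >= 0
                || value.indexOf('"') >= 0
                || value.indexOf('\n') >= 0
                || value.indexOf('\r') >= 0;
        return needsQuoting ? "\"" + value.replace("\"", "\"\"") + "\"" : value;
    }

    // Tags are removed before entities are decoded, so an escaped "&lt;b&gt;"
    // survives as visible text instead of being treated as markup.
    static String toCsvField(String html) {
        return escapeCsv(decodeEntities(stripTags(html)));
    }

    public static void main(String[] args) {
        System.out.println("4," + toCsvField("This is <b>valid</b> HTML<br/>with two lines."));
        System.out.println("5," + toCsvField("The german alphabet knows of a character &auml; - really!"));
    }
}
```

With these choices, the sample containing a comma is wrapped in double quotes, the one with "<br/>" becomes a quoted field with an embedded line break, and the semicolon sample only needs its double quotes doubled. Different answers to the questions above would change stripTags, not the overall order of the steps.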