Regular expression help

  Asked By: Lourdes    Date: Apr 11    Category: Java    Views: 1223
  

I am fairly new to Java, but have a fair amount of experience with
PHP and Perl regular expressions.

I am writing a program to parse an RTF file, and have developed the
following regex (with minor modifications):
insrsid\d*([^{}]*)?\\\\cell \\}
It works fine with preg_match (PHP, using Perl-compatible rules). Knowing
that the Java regex syntax is also modeled on Perl's, I just plugged the
expression in, but found that it doesn't work. I did find that if I
change it to
^.*?insrsid\d*([^{}]*)?\\\\cell \\}.*$
(adding the line beginning and end anchors), it returns true, but the
returned text (VAR.substring(VAR_match.start(), VAR_match.end())) is
almost the entire file, not the small selection it is supposed to be.

For example, assume I have the file below:
--------------------------------------
{\b\f2\fs20\insrsid4223016 TEXT 1\cell }\pard \ql \li0\ri0
\widctlpar\intbl\aspalpha\aspnum\faauto\adjustright\rin0\lin0 {\f2
\fs20\insrsid4223016 \trowd \irow0\irowband0\ts11\trleft0\trftsWidth1
\clvertalt\clbrdrt\brdrnone \clbrdrl\brdrnone \clbrdrb\brdrs\brdrw15
\clbrdrr\brdrnone \cltxlrtb\clftsWidth3\clwWidth18610\clshdrawnil
\cellx18610\row }\trowd \irow1\irowband1\ts11\trleft0\trftsWidth1
\clvertalt\clbrdrt\brdrs\brdrw15 \clbrdrl\brdrnone \clbrdrb\brdrnone
\clbrdrr\brdrnone \cltxlrtb\clftsWidth3\clwWidth9370\clshdrawnil
\cellx9370\clvertalt\clbrdrt\brdrs\brdrw15 \clbrdrl\brdrnone
\clbrdrb\brdrnone \clbrdrr\brdrnone \cltxlrtb\clftsWidth3\clwWidth9240
\clshdrawnil \cellx18610\pard \ql \li0\ri0\sb100
\widctlpar\intbl\faauto\adjustright\rin0\lin0 {\b\f2\fs20
\insrsid4223016 TEXT 2\cell }
--------------------------------------
If run with my PHP scripts, it returns 2 entries:
-----
TEXT 1
TEXT 2
-----
But when I run it with my Java program, I get this:
-----
TEXT 1\cell }\pard \ql \li0\ri0
\widctlpar\intbl\aspalpha\aspnum\faauto\adjustright\rin0\lin0 {\f2
\fs20\insrsid4223016 \trowd \irow0\irowband0\ts11\trleft0\trftsWidth1
\clvertalt\clbrdrt\brdrnone \clbrdrl\brdrnone \clbrdrb\brdrs\brdrw15
\clbrdrr\brdrnone \cltxlrtb\clftsWidth3\clwWidth18610\clshdrawnil
\cellx18610\row }\trowd \irow1\irowband1\ts11\trleft0\trftsWidth1
\clvertalt\clbrdrt\brdrs\brdrw15 \clbrdrl\brdrnone \clbrdrb\brdrnone
\clbrdrr\brdrnone \cltxlrtb\clftsWidth3\clwWidth9370\clshdrawnil
\cellx9370\clvertalt\clbrdrt\brdrs\brdrw15 \clbrdrl\brdrnone
\clbrdrb\brdrnone \clbrdrr\brdrnone \cltxlrtb\clftsWidth3\clwWidth9240
\clshdrawnil \cellx18610\pard \ql \li0\ri0\sb100
\widctlpar\intbl\faauto\adjustright\rin0\lin0 {\b\f2\fs20
\insrsid4223016 TEXT 2
-----

Can someone tell me what I am doing wrong, and why the Java regex
parser works in such a peculiar way?


2 Answers Found

 
Answer #1    Answered By: Lughaidh Fischer     Answered On: Apr 11


I got your regular expression to work, but I had to make some minor
modifications myself. The program I wrote is included below, along with my
output. I just output the whole match. I'm also assuming that you're using
Apache's Regexp package. The text string might come out garbled by posting it
this way (it looks like a hyperlink in Yahoo's mail window), but you get the idea.

import org.apache.regexp.*;

public class TestRegexp {
    public static void main(String[] args) {
        RE re = new RE("insrsid\\d*([^\\{\\}]*)\\\\cell \\}");
        // Sample RTF; string literals rejoined where the mail client wrapped them
        String text = "{\\b\\f2\\fs20\\insrsid4223016 TEXT 1\\cell }\\pard \\ql \\li0\\ri0" +
            "\\widctlpar\\intbl\\aspalpha\\aspnum\\faauto\\adjustright\\rin0\\lin0 {\\f2" +
            "\\fs20\\insrsid4223016 \\trowd \\irow0\\irowband0\\ts11\\trleft0\\trftsWidth1 " +
            "\\clvertalt\\clbrdrt\\brdrnone \\clbrdrl\\brdrnone \\clbrdrb\\brdrs\\brdrw15 " +
            "\\clbrdrr\\brdrnone \\cltxlrtb\\clftsWidth3\\clwWidth18610\\clshdrawnil " +
            "\\cellx18610\\row }\\trowd \\irow1\\irowband1\\ts11\\trleft0\\trftsWidth1 " +
            "\\clvertalt\\clbrdrt\\brdrs\\brdrw15 \\clbrdrl\\brdrnone \\clbrdrb\\brdrnone " +
            "\\clbrdrr\\brdrnone \\cltxlrtb\\clftsWidth3\\clwWidth9370\\clshdrawnil" +
            "\\cellx9370\\clvertalt\\clbrdrt\\brdrs\\brdrw15 \\clbrdrl\\brdrnone " +
            "\\clbrdrb\\brdrnone \\clbrdrr\\brdrnone \\cltxlrtb\\clftsWidth3\\clwWidth9240" +
            "\\clshdrawnil \\cellx18610\\pard \\ql \\li0\\ri0\\sb100" +
            "\\widctlpar\\intbl\\faauto\\adjustright\\rin0\\lin0 {\\b\\f2\\fs20" +
            "\\insrsid4223016 TEXT 2\\cell }";
        boolean foundMatch;
        int startPt = 0;
        do {
            // match() looks for the pattern anywhere at or after startPt
            foundMatch = re.match(text, startPt);
            System.out.println(foundMatch);
            if (foundMatch) {
                System.out.println(re.getParen(0)); // group 0 is the whole match
                startPt = re.getParenEnd(0);        // resume after this match
            }
        } while (foundMatch);
    }
}


The output I get is:

true
insrsid4223016 TEXT 1\cell }
true
insrsid4223016 TEXT 2\cell }
false

 
Answer #2    Answered By: Aalia Arain     Answered On: Apr 11

Actually, I am using java.util.regex.*, but you did point out a couple of
mistakes in my expression. Below is the function that I am using.

//*****************************************
// needs: import java.util.regex.Matcher; import java.util.regex.Pattern;
void reduce_rtf(String rtf)
{
    Pattern rtf_pattern = Pattern.compile(RTF);
    Matcher rtf_match = rtf_pattern.matcher(rtf);
    // matches() is true only if the whole input matches the pattern
    if (rtf_match.matches())
    {
        System.out.println(rtf_match.group(1));
    }
    else
    {
        System.out.println("No Match");
    }
}
// Named RTF so it does not clash with the rtf parameter above
private final String RTF = "insrsid\\d*(.*?)\\\\cell \\}";
//*****************************************
and the rtf argument passed in is

----
{\\b\\f2\\fs20\\insrsid4223016 TEXT 1\\cell }\\pard \\ql \\li0\\ri0
\\widctlpar\\intbl\\aspalpha\\aspnum\\faauto\\adjustright\\rin0\\lin0
{\\f2 \\fs20\\insrsid4223016 \\trowd \\irow0\\irowband0\\ts11\\trleft0
\\trftsWidth1 \\clvertalt\\clbrdrt\\brdrnone \\clbrdrl\\brdrnone
\\clbrdrb\\brdrs\\brdrw15 \\clbrdrr\\brdrnone \\cltxlrtb\\clftsWidth3
\\clwWidth18610\\clshdrawnil \\cellx18610\\row }\\trowd \\irow1
\\irowband1\\ts11\\trleft0\\trftsWidth1
\\clvertalt\\clbrdrt\\brdrs\\brdrw15 \\clbrdrl\\brdrnone
\\clbrdrb\\brdrnone \\clbrdrr\\brdrnone \\cltxlrtb\\clftsWidth3
\\clwWidth9370\\clshdrawnil \\cellx9370
\\clvertalt\\clbrdrt\\brdrs\\brdrw15 \\clbrdrl\\brdrnone
\\clbrdrb\\brdrnone \\clbrdrr\\brdrnone \\cltxlrtb\\clftsWidth3
\\clwWidth9240 \\clshdrawnil \\cellx18610\\pard \\ql \\li0\\ri0
\\sb100 \\widctlpar\\intbl\\faauto\\adjustright\\rin0\\lin0 {\\b\\f2
\\fs20 \\insrsid4223016 TEXT 2\\cell }
----

 