Regular expression exercises from regexone

Some regular expression exercises taken from regexone

Problem 1: Matching decimal numbers

At first glance, writing a regular expression to match a number should be easy, right?

We have the \d special character to match any digit, and all we need to do is match the decimal point, right? For simple numbers, that may be enough, but when working with scientific or financial numbers, you often have to deal with positive and negative numbers, significant digits, exponents, and even different representations (like the comma used to separate thousands and millions).

Below are a few different formats of numbers that you might encounter. Notice how you will have to match the decimal point itself and not an arbitrary character using the dot metacharacter. If you are having trouble skipping the last number, notice how that number ends the line compared to the rest.

Exercise 1: Matching Numbers

Task   Text
Match  3.14529
Match  -255.34
Match  128
Match  1.9e10
Match  123,340.00
Skip   720p

solution

Extract numbers, including positive and negative numbers, numbers that use commas or dots as separators, and scientific notation

-?[\d.,e]*\d\b
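To see how the pattern above behaves on the exercise inputs, here is a minimal Python sketch. The helper name find_number is hypothetical, and using re.search (rather than anchoring the pattern) is an assumption about how the exercise applies it.

```python
import re

# The solution pattern from above, unchanged.
NUMBER = re.compile(r"-?[\d.,e]*\d\b")

def find_number(text):
    """Return the matched number, or None when the text should be skipped."""
    match = NUMBER.search(text)
    return match.group(0) if match else None

for sample in ["3.14529", "-255.34", "128", "1.9e10", "123,340.00", "720p"]:
    print(sample, "->", find_number(sample))
```

The trailing \b is what rejects 720p: every digit in it is followed by another word character, so no word boundary is ever found after a digit. Note that the character class is permissive, so a malformed string such as 1..2 would also match.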

Problem 2: Matching phone numbers

Validating phone numbers is another tricky task depending on the type of input that you get. Phone numbers from out of state, which require an area code, or international numbers, which require a prefix, add complexity to the regular expression, as do the individual preferences that people have for entering phone numbers (some put dashes or whitespace while others do not, for example).

Below are a few phone numbers that you might encounter when using real data. Write a single regular expression that matches each number and captures the proper area code.

Exercise 2: Matching Phone Numbers

Task     Text            Capture Groups
Capture  415-555-1234    415
Capture  650-555-2345    650
Capture  (416)555-3456   416
Capture  202 555 4567    202
Capture  4035555678      403
Capture  1 416 555 9292  416

solution

Extract the area code prefix

(\d{3})
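A short Python sketch of how this pattern picks out the area code. The helper name area_code is hypothetical, and the sketch assumes the pattern is applied with re.search, which returns the first run of three consecutive digits.

```python
import re

# The solution pattern from above: the first run of three digits.
AREA_CODE = re.compile(r"(\d{3})")

def area_code(phone):
    """Return the first three consecutive digits, or None if there are none."""
    match = AREA_CODE.search(phone)
    return match.group(1) if match else None

phones = [
    "415-555-1234",
    "650-555-2345",
    "(416)555-3456",
    "202 555 4567",
    "4035555678",
    "1 416 555 9292",
]
for phone in phones:
    print(phone, "->", area_code(phone))
```

This works for the last input only because the country prefix 1 is a single digit followed by a space. The pattern captures the area code but does not check that the rest of the text is a well-formed phone number.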

Problem 3: Matching emails

When you are dealing with HTML forms, it’s often useful to validate the form input against regular expressions. In particular, emails are difficult to match correctly due to the complexity of the specification and I would recommend using a built-in language or framework function instead of rolling your own. However, you can build a pretty robust regular expression that matches a great deal of common emails pretty easily using what we’ve learned so far.

One thing to watch out for is that many people use plus addressing for one-time use, such as “name+filter@gmail.com”, which is delivered directly to “name@gmail.com” but can be filtered with the extra information. In addition, some domains have more than one component. For example, you can register a domain at “hellokitty.hk.com” and have an email of the form “ilove@hellokitty.hk.com”, so you will have to be careful when matching the domain portion of the email.

Below are a few common emails. In this example, try to capture the name of the email, excluding the filter (the + character and everything after it) and the domain (the @ character and everything after it).

Exercise 3: Matching Emails

Task     Text                              Capture Groups
Capture  tom@hogwarts.com                  tom
Capture  tom.riddle@hogwarts.com           tom.riddle
Capture  tom.riddle+regexone@hogwarts.com  tom.riddle
Capture  tom@hogwarts.eu.com               tom
Capture  potter@hogwarts.com               potter
Capture  harry@hogwarts.com                harry
Capture  hermione+regexone@hogwarts.com    hermione

solution

Extract the name part of the email address

([\w.]+)
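The following Python sketch applies the pattern above to the exercise inputs. The helper name email_name is hypothetical, and re.match is used on the assumption that the name always starts at the beginning of the text.

```python
import re

# The solution pattern from above: word characters and dots only,
# so the capture stops at the first "+" or "@".
NAME = re.compile(r"([\w.]+)")

def email_name(address):
    """Return the leading name part of an address, or None if it is missing."""
    match = NAME.match(address)
    return match.group(1) if match else None

emails = [
    "tom@hogwarts.com",
    "tom.riddle+regexone@hogwarts.com",
    "tom@hogwarts.eu.com",
    "hermione+regexone@hogwarts.com",
]
for email in emails:
    print(email, "->", email_name(email))
```

Because neither + nor @ is in the character class, the greedy match ends right before the filter or the domain. The pattern does not validate that the remainder is a real email address.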

Problem 4: Matching HTML

If you are looking for a robust way to parse HTML, regular expressions are usually not the answer due to the fragility of HTML pages on the internet today: common mistakes like missing end tags, mismatched tags, or forgetting to close an attribute quote would all derail a perfectly good regular expression. Instead, you can use libraries like Beautiful Soup or html5lib (both Python) or phpQuery (PHP), which not only parse the HTML but also allow you to walk the DOM quickly and easily.

That said, there are often times when you want to quickly match tags and tag content in an editor, and if you can vouch for the input, regular expressions are a good tool to do this. As you can see in the examples below, some things that you might want to be careful about are odd attributes that have extra escaped quotes, and nested tags.

Go ahead and write regular expressions for the following examples.

Exercise 4: Capturing HTML Tags

Task     Text                                     Capture Groups
Capture  <a>This is a link</a>                    a
Capture  <a href='https://regexone.com'>Link</a>  a
Capture  <div class='test_style'>Test</div>       div
Capture  <div>Hello <span>world</span></div>      div

solution

Extract the HTML tag name

<(\w+)\s?
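As a quick check, this Python sketch runs the pattern above over the exercise inputs. The helper name first_tag is hypothetical, and the sketch assumes only the first opening tag on each line is wanted, which is what re.search returns.

```python
import re

# The solution pattern from above: "<", the tag name, then an optional space.
TAG = re.compile(r"<(\w+)\s?")

def first_tag(html):
    """Return the name of the first opening tag, or None if there is none."""
    match = TAG.search(html)
    return match.group(1) if match else None

snippets = [
    "<a>This is a link</a>",
    "<a href='https://regexone.com'>Link</a>",
    "<div class='test_style'>Test</div>",
    "<div>Hello <span>world</span></div>",
]
for snippet in snippets:
    print(snippet, "->", first_tag(snippet))
```

Closing tags such as </a> are not matched, because the / after < is not a word character. To list nested tags as well, TAG.findall(html) would return every opening tag name in order.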

Problem 5: Matching specific filenames

If you use Linux or the command line frequently, you are often dealing with lists of files. Most files have a filename component as well as an extension, but in Linux, it is also common to have hidden files that have no filename.

In this simple example, extract the filenames and extension types of only image files (not including temporary files for images currently being edited). Image files are defined as .jpg, .png, and .gif.

Exercise 5: Capturing Filename Data

Task     Text                 Capture Groups
Skip     .bash_profile
Skip     workspace.doc
Capture  img0912.jpg          img0912 jpg
Capture  updated_img0912.png  updated_img0912 png
Skip     documentation.html
Capture  favicon.gif          favicon gif
Skip     img0912.jpg.tmp
Skip     access.lock

solution

Extract the filename and extension of files in the given image formats

(\w+)\.(jpg|png|gif)$
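Here is a small Python sketch of the filename pattern, written with the dot escaped so that it only matches a literal period. The helper name image_parts is hypothetical.

```python
import re

# Filename, a literal dot, then one of the image extensions at end of line.
IMAGE = re.compile(r"(\w+)\.(jpg|png|gif)$")

def image_parts(filename):
    """Return (name, extension) for image files, or None for anything else."""
    match = IMAGE.search(filename)
    return match.groups() if match else None

files = [
    ".bash_profile",
    "workspace.doc",
    "img0912.jpg",
    "updated_img0912.png",
    "documentation.html",
    "favicon.gif",
    "img0912.jpg.tmp",
    "access.lock",
]
for name in files:
    print(name, "->", image_parts(name))
```

The $ anchor is what skips img0912.jpg.tmp: the extension must be the last thing on the line. An unescaped dot would match any character, so a name like imgXjpg would slip through.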

Problem 6: Trimming whitespace from start and end of line

Occasionally, you’ll find yourself with a log file that has ill-formatted whitespace where lines are indented too much or not enough. One way to fix this is to use an editor’s search and replace with a regular expression to extract the content of the lines without the extra whitespace.

We have previously seen how to match a full line of text using the hat ^ and the dollar sign $ respectively. When used in conjunction with the whitespace \s, you can easily skip all preceding and trailing spaces.

Write a simple regular expression to capture the content of each line, without the extra whitespace.

Exercise 6: Matching Lines

Task     Text                      Capture Groups
Capture  The quick brown fox…      The quick brown fox…
Capture  jumps over the lazy dog.  jumps over the lazy dog.

solution

Extract the content of lines that are badly formatted because of too much or too little whitespace

\s+([\s\w.]+)
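The pattern above requires at least one leading whitespace character and keeps any trailing whitespace inside the capture group. As a possible alternative, the Python sketch below uses the ^ and $ anchors with a lazy group so that both ends are trimmed. The pattern, the helper name trimmed, and the sample lines with their indentation are all assumptions made for illustration, not the exercise's official answer.

```python
import re

# Anchored variant: optional whitespace, lazily captured content,
# optional whitespace, end of line.
TRIM = re.compile(r"^\s*(.*?)\s*$")

def trimmed(line):
    """Return the line content without leading or trailing whitespace."""
    match = TRIM.match(line)
    return match.group(1) if match else None

# Hypothetical sample lines with uneven indentation.
lines = [
    "        The quick brown fox…",
    "   jumps over the lazy dog.   ",
    "no indentation at all",
]
for line in lines:
    print(repr(line), "->", repr(trimmed(line)))
```

The lazy quantifier .*? lets the trailing \s* absorb the final spaces, so the capture group stops at the last non-whitespace character.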

Problem 7: Extracting information from a log file

In this example, we are going to use actual output from an Android adb debugging session. Your goal is to use any regular expression techniques that we’ve learned so far to extract the method name, filename, and line number from each line of the stack trace (they follow the form “at package.class.methodname(filename:linenumber)”).

Good luck!

Exercise 7: Extracting Data From Log Entries

Task     Text                                                    Capture Groups
Skip     W/dalvikvm( 1553): threadid=1: uncaught exception
Skip     E/( 1553): FATAL EXCEPTION: main
Skip     E/( 1553): java.lang.StringIndexOutOfBoundsException
Capture  E/( 1553): at widget.List.makeView(ListView.java:1727)  makeView ListView.java 1727
Capture  E/( 1553): at widget.List.fillDown(ListView.java:652)   fillDown ListView.java 652
Capture  E/( 1553): at widget.List.fillFrom(ListView.java:709)   fillFrom ListView.java 709

solution

Extract the method name, filename, and line number of the error from the debug log

\.(\w+)\(([\w.]+):(\d+)\)$
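A Python sketch of the stack-trace pattern, written with the literal dot and the literal parentheses escaped so that they are not read as metacharacters. The helper name parse_frame is hypothetical.

```python
import re

# ".method(File.java:123)" at the end of the line.
FRAME = re.compile(r"\.(\w+)\(([\w.]+):(\d+)\)$")

def parse_frame(line):
    """Return (method, filename, line_number) or None for non-frame lines."""
    match = FRAME.search(line)
    if match is None:
        return None
    method, filename, number = match.groups()
    return method, filename, int(number)

log = [
    "W/dalvikvm( 1553): threadid=1: uncaught exception",
    "E/( 1553): java.lang.StringIndexOutOfBoundsException",
    "E/( 1553): at widget.List.makeView(ListView.java:1727)",
    "E/( 1553): at widget.List.fillDown(ListView.java:652)",
]
for line in log:
    print(parse_frame(line))
```

Without the backslashes, the inner parentheses would be treated as extra capture groups and the pattern would expect the line to end in a digit, so none of the stack-trace lines would match.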

Problem 8: Parsing and extracting data from a URL

When working with files and resources over a network, you will often come across URIs and URLs which can be parsed and worked with directly. Most standard libraries will have classes to parse and construct these kinds of identifiers, but if you need to match them in logs or a larger corpus of text, you can use regular expressions to pull out information from their structured format quite easily.

URIs, or Uniform Resource Identifiers, are a representation of a resource that is generally composed of a scheme, host, port (optional), and resource path, as in the following example:

http://regexone.com:80/page

Here the scheme is http, the host is regexone.com, the port is 80, and the resource path is /page.

The scheme describes the protocol to communicate with, the host and port describe the source of the resource, and the full path describes the location at the source for the resource.
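For comparison with the regular expression approach, here is a minimal sketch of the standard-library route in Python, using urllib.parse.urlsplit on the example URI. The choice of Python is an assumption; other languages offer similar classes.

```python
from urllib.parse import urlsplit

# Split the example URI into its components.
parts = urlsplit("http://regexone.com:80/page")

print("scheme:", parts.scheme)
print("host:  ", parts.hostname)
print("port:  ", parts.port)
print("path:  ", parts.path)
```

The port attribute is returned as an integer and is None when the URL does not specify one.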

In the exercise below, try to extract the protocol, host, and port of all the resources listed.

Exercise 8: Extracting Data From URLs

Task     Text                                                         Capture Groups
Capture  ftp://file_server.com:21/top_secret/life_changing_plans.pdf  ftp file_server.com 21
Capture  https://regexone.com/lesson/introduction#section             https regexone.com
Capture  file://localhost:4040/zip_file                               file localhost 4040
Capture  https://s3cur3-server.com:9999/                              https s3cur3-server.com 9999
Capture  market://search/angry%20birds                                market search

solution

Extract the protocol, host, and port from a URL

(\w+)://([\w.-]+)(?::(\d+))?
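The Python sketch below extracts the protocol, host, and port with a pattern that ties the optional port to a preceding colon, so that a path beginning with digits is not mistaken for a port. The helper name url_parts is hypothetical.

```python
import re

# scheme "://" host, then an optional ":port" in a non-capturing group.
URL = re.compile(r"(\w+)://([\w.-]+)(?::(\d+))?")

def url_parts(url):
    """Return (scheme, host, port); port is None when it is absent."""
    match = URL.search(url)
    return match.groups() if match else None

urls = [
    "ftp://file_server.com:21/top_secret/life_changing_plans.pdf",
    "https://regexone.com/lesson/introduction#section",
    "file://localhost:4040/zip_file",
    "https://s3cur3-server.com:9999/",
    "market://search/angry%20birds",
]
for url in urls:
    print(url_parts(url))
```

The port comes back as a string here; convert it with int() if a number is needed.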
