
Efficiently Removing HTML Tags from Text in Python

Published on 2025-06-15

How Can I Efficiently Strip HTML Tags from Text in Python?

Stripping HTML Tags in Python for a Pristine Textual Representation

Working with HTML responses often means extracting the text content and discarding the markup around it. Stripping the tags leaves only the plain text.

Achieving Text-Only Extraction with an HTMLParser Subclass (MLStripper)

The Python standard library does not ship a ready-made tag stripper, but its HTMLParser class makes one easy to write. MLStripper, defined in the snippets that follow, is a small HTMLParser subclass rather than a built-in: it parses the HTML input and keeps only the text passed to its handle_data method, discarding the markup.
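To see why overriding handle_data alone is enough, it may help to look at how HTMLParser reports a document as a stream of callbacks. The following is a minimal sketch for Python 3; EventLogger and its events list are illustrative names introduced here, not part of the standard library.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records each parser callback as a tuple."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Fired for every opening tag; attributes are ignored here.
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        # Fired for every closing tag.
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Fired for the text between tags; this is all a stripper needs.
        self.events.append(("data", data))


logger = EventLogger()
logger.feed("<p>Hello <b>world</b></p>")
logger.close()
for event in logger.events:
    print(event)
```

A tag stripper simply ignores the "start" and "end" events and concatenates the "data" ones.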

Implementation for Python 3 and 2

Depending on your Python version, you can utilize the following code snippets:

Python 3:

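from io import StringIO
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    """Collects the text content of an HTML document, ignoring the tags."""
    def __init__(self):
        # HTMLParser.__init__ already calls reset(), and the old "strict"
        # option was removed in Python 3.5, so neither is set here.
        # convert_charrefs=True decodes references such as &amp; in the text.
        super().__init__(convert_charrefs=True)
        self.text = StringIO()
    def handle_data(self, d):
        # Called for each run of text between tags.
        self.text.write(d)
    def get_data(self):
        return self.text.getvalue()

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    s.close()  # flush text the parser may still be buffering, e.g. input ending in "AT&T"
    return s.get_data()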

Python 2:

from HTMLParser import HTMLParser
from StringIO import StringIO

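class MLStripper(HTMLParser):
    def __init__(self):
        # Python 2's HTMLParser is an old-style class, so super() is not
        # available; reset() does the same work as HTMLParser.__init__.
        self.reset()
        self.text = StringIO()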
    def handle_data(self, d):
        self.text.write(d)
    def get_data(self):
        return self.text.getvalue()

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

Usage:

Call strip_tags with the HTML input as a string. It returns the text content with all HTML tags removed.
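As a quick illustration, the sketch below restates the Python 3 version compactly and runs it on a made-up sample string. The call to close() is an addition made here so that any text the parser is still holding back gets flushed; the sample input is purely hypothetical.

```python
from io import StringIO
from html.parser import HTMLParser


class MLStripper(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text = StringIO()

    def handle_data(self, d):
        self.text.write(d)

    def get_data(self):
        return self.text.getvalue()


def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    s.close()  # assumption: flush buffered text before reading the result
    return s.get_data()


sample = '<p>Fish &amp; <a href="/chips">chips</a></p>'  # made-up input
print(strip_tags(sample))  # prints: Fish & chips
```

Note that the function only removes markup. It does not insert whitespace between adjacent elements, so the text of neighbouring blocks is concatenated exactly as it appears in the source.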

This technique is useful whenever you need plain text from an HTML source for further processing. Keep in mind that HTMLParser also passes the contents of script and style elements to handle_data, so that text appears in the output as well.
