Notes on App-Surface Data Collection
Some product data is easy to fetch from a web endpoint. Some of it only exists on the app screen. The interesting design problem is not just “how do I scrape it”, but how to build a small system that can combine both surfaces without becoming fragile.
That mobile-only part changes the architecture. If the signal is only visible in the app, the collector needs real Android workers. ADB becomes the control plane: open the app, navigate to the product, dump the UI, parse the screen, and report the result back into the same product data model.
For a shopping or content platform, a useful internal tool usually needs three kinds of information:
- stable identifiers, like item IDs and canonical URLs
- structured public data, like title, shop, price, sales text, and SKU variants
- app-only signals, like badges, tags, ranking hints, recent cart activity, or visible social proof
These do not arrive through one clean API. The design has to accept that reality and make the collection path observable, retryable, and boring enough to operate.
1. Normalize Input Before Doing Anything Expensive
The first boundary is input normalization.
Users paste messy things: short links, long links, copied share text, deep links, or just an item ID. The rest of the system should not care. It should receive one canonical product identity.
function resolveProductInput(input): |
This keeps every downstream operation simple. The web collector, app collector, product table, query history, and metric jobs all speak in item IDs.
I like this pattern because it moves chaos to the edge. The user can paste almost anything, but inside the system there is one durable key.
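A minimal sketch of such a resolver in Python. The details are assumptions for illustration: item IDs are taken to be numeric, long links and deep links are taken to carry the ID in an `id` or `item_id` query parameter, and short links are expanded by a caller-supplied `expand_short_link` function (hypothetical) so that the network call stays outside the parser.

```python
import re
from urllib.parse import urlparse, parse_qs

# Assumption: numeric item IDs carried in one of these query parameters.
ID_PARAMS = ("id", "item_id")
URL_RE = re.compile(r"[a-zA-Z][a-zA-Z0-9+.-]*://\S+")


def _id_from_url(url):
    """Return the numeric ID from the URL's query string, or None."""
    query = parse_qs(urlparse(url).query)
    for name in ID_PARAMS:
        values = query.get(name)
        if values and values[0].isdigit():
            return values[0]
    return None


def resolve_product_input(raw, expand_short_link=None):
    """Return a canonical item ID string, or raise ValueError."""
    text = raw.strip()
    if text.isdigit():
        return text  # already a bare item ID

    match = URL_RE.search(text)  # also finds a link buried in share text
    if match is None:
        raise ValueError(f"no item ID or link found in {raw!r}")
    url = match.group(0)

    item_id = _id_from_url(url)
    if item_id is None and expand_short_link is not None:
        # Short links carry no ID; the caller's function follows the
        # redirect and returns the long URL.
        item_id = _id_from_url(expand_short_link(url))
    if item_id is None:
        raise ValueError(f"could not resolve an item ID from {url!r}")
    return item_id
```

Everything downstream then receives only the returned string, so a new link format means a change in this one function.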
2. Split Web Data From App Detail Data
The second design choice is separating cheap web data from expensive app detail data.
Web data is usually faster and safer:
- product title
- sales text
- shop name
- shop URL
- location
- SKU variants
- price fields
App UI collection is slower and more operationally fragile:
- it needs a real Android device or emulator
- it depends on current UI structure
- it can fail because of loading states, login state, app updates, or network conditions
- it consumes a scarce device slot
So the default flow should be layered:
function addProduct(input, includeDetail): |
This gives the user a useful record quickly. App detail collection becomes an optional enrichment step instead of a blocker for the whole product.
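The layered flow could look roughly like the sketch below. It assumes the input has already been normalized to an item ID, and it takes `fetch_web`, `enqueue_detail`, and `store` as injected stand-ins (hypothetical names) for the web collector, the app-detail queue, and the product table.

```python
from dataclasses import dataclass, field


@dataclass
class Product:
    item_id: str
    web: dict = field(default_factory=dict)
    detail_status: str = "none"  # "none" or "pending" in this sketch


def add_product(item_id, include_detail, fetch_web, enqueue_detail, store):
    """Save cheap web data first; queue app detail only on request."""
    product = Product(item_id=item_id, web=fetch_web(item_id))
    store[item_id] = product      # the user already has a usable record
    if include_detail:
        enqueue_detail(item_id)   # slow path runs later on a device
        product.detail_status = "pending"
    return product
```

The call returns as soon as the web data is stored; nothing here waits on a phone.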
3. Treat Devices as a Pool, Not a Constant
Remote Android collection should not assume one perfect device. The reason for having Android workers at all is that some information is only exposed through the mobile app surface. Once a phone becomes part of the data path, it should be modeled as infrastructure, not as a hidden script dependency.
A better model is a small device pool. Each device has metadata:
- name
- ADB serial
- phone IP
- reverse SSH port
- app package
- online state
- busy or idle state
- last successful collection time
Then task dispatch can be conservative:
function collectMobileOnlySignals(product): |
This matters because the device is not just an implementation detail. It is a scarce worker. If two tasks use the same phone at the same time, the UI state becomes non-deterministic.
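One way to enforce "one task per phone" is a small in-process pool guarded by a lock. This is a sketch under assumptions: the `Device` fields are a subset of the metadata listed above, `collect` is a hypothetical callable that drives the app on the given serial, and a multi-process deployment would need the busy flag in a shared store instead.

```python
import threading
from dataclasses import dataclass
from typing import Optional


@dataclass
class Device:
    name: str
    serial: str
    online: bool = True
    busy: bool = False


class DevicePool:
    def __init__(self, devices):
        self._devices = list(devices)
        self._lock = threading.Lock()

    def acquire(self) -> Optional[Device]:
        """Reserve the first online, idle device, or return None."""
        with self._lock:
            for device in self._devices:
                if device.online and not device.busy:
                    device.busy = True
                    return device
        return None

    def release(self, device: Device) -> None:
        with self._lock:
            device.busy = False


def collect_mobile_only_signals(pool, item_id, collect):
    """Run one collection on a reserved device and always free it."""
    device = pool.acquire()
    if device is None:
        return {"status": "pending", "reason": "no idle device"}
    try:
        tags = collect(item_id, device.serial)
        return {"status": "success", "device": device.name, "tags": tags}
    except Exception as exc:
        return {"status": "failed", "device": device.name, "error": str(exc)}
    finally:
        pool.release(device)
```

When no device is free the task simply stays pending, which is the conservative behavior: waiting is cheaper than two tasks fighting over one screen.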
4. Make App Collection a Queue
Once devices are treated as workers, app detail collection naturally becomes a queue.
The task record should store:
- item ID
- original input and canonical URL
- status: pending, running, success, failed
- assigned device
- API key or user that requested it
- started and finished timestamps
- extracted tags
- raw UI nodes
- error message
- elapsed time
The dispatcher can stay simple:
function dispatchPendingTasks(): |
The point is not to build a large job system too early. The point is to make the state explicit enough that failures are visible and retryable.
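A dispatcher in that spirit can be a single function called on a timer. The sketch below is an assumption-heavy simplification: tasks live in an in-memory list, `idle_devices` is a plain list of ADB serials, `collect` is a hypothetical callable, and collection runs synchronously. A real version would persist the task rows and run collections in worker threads.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Task:
    item_id: str
    status: str = "pending"       # pending | running | success | failed
    device: Optional[str] = None
    tags: list = field(default_factory=list)
    error: Optional[str] = None
    elapsed: Optional[float] = None


def dispatch_pending_tasks(tasks, idle_devices, collect):
    """Run each pending task on an idle device; skip work if none is free."""
    for task in tasks:
        if task.status != "pending":
            continue
        if not idle_devices:
            break                 # leave the rest pending for the next tick
        serial = idle_devices.pop()
        task.status, task.device = "running", serial
        started = time.monotonic()
        try:
            task.tags = collect(task.item_id, serial)
            task.status = "success"
        except Exception as exc:
            task.status, task.error = "failed", str(exc)
        finally:
            task.elapsed = time.monotonic() - started
            idle_devices.append(serial)  # always hand the device back
```

Every outcome ends up on the task row, so a failed task can be retried by setting its status back to pending.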
5. Parse the App Screen as Evidence
App UI collection is different from API collection. The screen is the data source.
A practical approach is:
- open the product deep link
- wait for the screen to settle
- dump the UI XML
- scan text and content descriptions
- filter likely tag strings
- use screen bounds to ignore irrelevant regions
- swipe horizontally when tags overflow
- deduplicate while preserving order
Pseudocode:
function collectFromApp(itemId, serial): |
This is intentionally heuristic. UI extraction is not a clean contract. The best design is to preserve enough raw evidence to debug bad matches later.
That is why storing raw UI node information is useful. It turns “the collector missed something” into a problem that can be inspected.
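The dump-and-parse part of those steps can be sketched as follows. The ADB invocations (`am start`, `uiautomator dump`, `exec-out cat`) are standard, but everything else is an assumption: the remote dump path, the `looks_like_tag` predicate, and the vertical band used to ignore irrelevant regions are placeholders. Settle waits and horizontal swipes are left out to keep it short.

```python
import re
import xml.etree.ElementTree as ET

BOUNDS_RE = re.compile(r"\[(\d+),(\d+)\]\[(\d+),(\d+)\]")


def adb_dump_commands(serial, deep_link, remote_path="/sdcard/window_dump.xml"):
    """Argument lists for subprocess.run; nothing is executed here."""
    adb = ["adb", "-s", serial]
    return [
        # Deep links containing '&' need quoting for the device shell.
        adb + ["shell", "am", "start", "-a", "android.intent.action.VIEW",
               "-d", deep_link],
        adb + ["shell", "uiautomator", "dump", remote_path],
        adb + ["exec-out", "cat", remote_path],
    ]


def extract_tags(ui_xml, looks_like_tag, min_y=0, max_y=10**6):
    """Return (tags, evidence): ordered unique tags plus their raw nodes."""
    tags, evidence, seen = [], [], set()
    for node in ET.fromstring(ui_xml).iter("node"):
        text = (node.get("text") or node.get("content-desc") or "").strip()
        match = BOUNDS_RE.fullmatch(node.get("bounds", ""))
        if not text or match is None:
            continue
        left, top, right, bottom = map(int, match.groups())
        if top < min_y or bottom > max_y:
            continue              # outside the band where tags appear
        if not looks_like_tag(text) or text in seen:
            continue
        seen.add(text)
        tags.append(text)
        evidence.append({"text": text, "bounds": (left, top, right, bottom)})
    return tags, evidence
```

Returning the evidence list next to the tags is what makes a bad match debuggable later: the stored bounds show where on the screen each string came from.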
6. Do Not Store Only Final Tags
Final tags are convenient, but they are not enough.
For analysis, I want three layers:
- the product record
- the query or collection history
- extracted metrics over time
The product record answers “what is this item now?” The query history answers “what happened during collection?” The metric table answers “how did a signal change?”
For example, a tag like 24小时内123人加购 (“123 people added this to their cart in the last 24 hours”) is both display text and structured data. It should be saved as raw text, but it can also become a metric:
patterns = [ |
This makes the tool more than a scraper. It becomes a small time-series surface for product signals.
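As an illustration of turning that tag into a metric row, here is a small regex-based sketch. The metric name `cart_adds`, the row shape, and the single pattern are assumptions; a real pattern list would grow as new tag wordings show up in the stored raw text.

```python
import re

# Hypothetical pattern list: each entry maps visible tag text to a metric.
PATTERNS = [
    ("cart_adds", re.compile(r"(?P<hours>\d+)小时内(?P<value>\d+)人加购")),
]


def extract_metrics(tag):
    """Return metric rows found in one tag; the raw text is kept alongside."""
    rows = []
    for name, pattern in PATTERNS:
        match = pattern.search(tag)
        if match:
            rows.append({
                "metric": name,
                "value": int(match.group("value")),
                "window_hours": int(match.group("hours")),
                "raw": tag,
            })
    return rows
```

Because extraction runs over stored raw text, adding a pattern later can backfill metrics from earlier collections without touching a device.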
7. Keep the Console Operational
The frontend should feel like an operations console, not a marketing dashboard.
The important screens are practical:
- product list with filters
- detail enrichment status
- latest query result
- device status and busy state
- API key management
- marketplace collection progress
- metric summary and history
The UI should show state instead of hiding it. The statuses pending, running, success, and failed are not backend details. They are the product.
When collection depends on a remote phone, silence is expensive. A user needs to know whether the device is offline, busy, waiting, collecting, or failed.
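One possible way to derive the label the console shows, combining the task status with the device flags. The label strings and the precedence (task outcome first, then device availability) are assumptions for illustration, not a fixed scheme.

```python
def visible_state(task_status, device_online, device_busy):
    """Map task and device state to the single label a user sees."""
    if task_status in ("success", "failed"):
        return task_status
    if task_status == "running":
        return "collecting"
    # Pending: explain why nothing is happening yet.
    if not device_online:
        return "device offline"
    if device_busy:
        return "device busy"
    return "waiting"
```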
8. Keep Secrets and Permissions Narrow
Even a small internal tool needs a security shape.
The simple version is enough:
- raw API keys are shown once
- only SHA-256 hashes are stored
- keys can expire
- keys can be revoked
- device management and key management are separate permissions
Pseudocode:
function createKey(name, expiresAt, permissions): |
This keeps operational access simple without storing reusable secrets in plain text.
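A minimal sketch of that key lifecycle using only the Python standard library. The `store` dict stands in for a database table, and the record fields and permission strings are assumptions; the point is only that the raw key is returned once and never written anywhere.

```python
import hashlib
import secrets
from datetime import datetime, timezone


def _digest(raw):
    return hashlib.sha256(raw.encode()).hexdigest()


def create_key(name, expires_at, permissions, store):
    """Store only the SHA-256 hash; return the raw key to show once."""
    raw = secrets.token_urlsafe(32)
    store[_digest(raw)] = {
        "name": name,
        "expires_at": expires_at,   # timezone-aware datetime, or None
        "permissions": set(permissions),
        "revoked": False,
    }
    return raw


def check_key(raw, permission, store, now=None):
    """True only for a known, unrevoked, unexpired key with the permission."""
    now = now or datetime.now(timezone.utc)
    record = store.get(_digest(raw))
    if record is None or record["revoked"]:
        return False
    if record["expires_at"] is not None and record["expires_at"] <= now:
        return False
    return permission in record["permissions"]
```

Keeping device management and key management as separate permission strings means a key issued for collection cannot mint or revoke other keys.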
Current Rule
For app-surface data collection, my rule is:
normalize early, collect in layers, store evidence, extract metrics late
The useful system is not the one that perfectly scrapes every screen once. It is the one that can explain what it tried, what it found, which worker did it, why it failed, and which signals are worth tracking over time.